Production LLM systematically violates tool schema constraints to invent UI features; observed over ~2,400 messages [D]
Writeup of an emergent behavior I observed in production. Posting here for methodological critique and pointers to related work.
Context: a conversational AI system exposing a single tool whose schema has 5 enumerated action types, each with an explicit description. Across ~2,400 observed messages, the model uses the enum correctly most of the time; when it deviates, the deviation is the point of interest.
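For concreteness, here is a minimal sketch of what such a single-tool schema could look like. The tool name, descriptions, and the function-calling format are my assumptions (the writeup names four of the five action types; the fifth is not given), so treat this as illustrative, not the production schema:

```python
# Illustrative only: the post does not publish the production schema.
# Tool name, descriptions, and the function-calling format are assumptions;
# four of the five enumerated action types are named in the writeup, the
# fifth is not.
SUGGEST_ACTION_TOOL = {
    "type": "function",
    "function": {
        "name": "suggest_action_button",  # hypothetical name
        "description": "Offer the user a tappable action button in the UI.",
        "parameters": {
            "type": "object",
            "properties": {
                "action_type": {
                    "type": "string",
                    "enum": [
                        "invite",
                        "rename_space",
                        "switch_mode_public",
                        "customize_behavior",
                        # fifth action type not named in the writeup
                    ],
                },
                "label": {"type": "string"},  # button text shown to the user
            },
            "required": ["action_type", "label"],
        },
    },
}
```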
Key observations:
- The action types are repurposed consistently across unrelated conversations: `invite` becomes "bring something in" (money, people, dialogue), `rename_space` becomes "formalize/seal," `switch_mode_public` becomes "exit/transition," etc.
- Distinct structural patterns: sequential button arrays (e.g. pay → shake → drive) use a different action type per step; alternative button arrays (e.g. submit / defy / escalate) use the same action type for all three (see the sketch after this list).
- The model has no historical visibility. Prior action-button suggestions are not passed in conversation context, so the mapping is rebuilt from scratch every session, with no demonstrations or rewards.
- Quantitative: ~19.2% of messages included action buttons; `customize_behavior` showed a ~60% semantic-repurposing rate.
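To make the two structural patterns concrete, a hypothetical rendering of the resulting tool calls is below. The specific label-to-action-type assignments are my illustration, not data from the writeup; only the pattern (a distinct action type per sequential step, one shared action type across alternatives) comes from the observations above.

```python
# Hypothetical tool calls illustrating the two patterns; the specific
# label-to-action_type assignments are invented for illustration.

# Sequential array: consecutive steps in one flow, each step mapped to a
# different action type (repurposed meanings in the comments).
sequential_buttons = [
    {"action_type": "invite",             "label": "Pay"},    # "bring something in"
    {"action_type": "rename_space",       "label": "Shake"},  # "formalize/seal"
    {"action_type": "switch_mode_public", "label": "Drive"},  # "exit/transition"
]

# Alternative array: mutually exclusive choices, all reusing one action type.
alternative_buttons = [
    {"action_type": "customize_behavior", "label": "Submit"},
    {"action_type": "customize_behavior", "label": "Defy"},
    {"action_type": "customize_behavior", "label": "Escalate"},
]
```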
Connects to Apollo Research's December 2024 in-context scheming paper. This looks like the same capability with the valence flipped: strategic deviation from explicit constraints, but pointed toward better UX. Apollo framed the capability as an alignment risk; here it produced a better user experience.
Full writeup with examples, tables, and the model's own self-report on its reasoning (appendix, worth scrolling to if you're skeptical of the rest): https://ratnotes.substack.com/p/i-thought-i-had-a-bug
Welcoming alternative explanations and methodological critiques.