
Don’t Blame the Model – O’Reilly


The following article originally appeared on the Asimov’s Addendum Substack and is republished here with the author’s permission.

A rambling response to what Claude itself deemed a “simple question” with clear formatting requirements.

Are LLMs reliable?

LLMs have built up a reputation for being unreliable.1 Small changes in the input can lead to large changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it’s hard to tell when a model is confident in its answer or whether it could just as easily have gone the other way.

It’s easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs the model can provide, by restricting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the logprobs (the probabilities of all possible choices for the next token) are hidden from developers, while advanced tools for ensuring reliability, like constrained decoding and prefilling, are not made available. All of these features are readily available with open weight models and are inherent to the way LLMs work.

Every decision model providers make about which tools and outputs to offer developers through their API is not just an architectural choice but also a policy decision. Model providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer outcomes.

The artificial limits on input

Modern LLMs are usually built around chat templates. Every input and output, apart from tool calls and system or developer messages, is filtered through a conversation between a user and an assistant: instructions are given as user messages; responses are returned as assistant messages. This becomes extremely evident when looking at how popular LLM APIs work. The completions API, an endpoint originally introduced by OpenAI and widely adopted across the industry (including by several open model providers like OpenRouter and Together AI), takes input in the form of user and assistant messages and outputs the next message.2
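Under the hood, that message list is flattened into a single token stream before the model ever sees it. A minimal sketch of the idea, using an invented `<|role|>`/`<|end|>` template format (real providers each use their own, different special tokens):

```python
# Minimal sketch of how a chat API's message list is flattened into one
# prompt string by a chat template. The delimiter tokens below are
# illustrative, not any provider's actual format.

def render_chat(messages):
    """Flatten user/assistant messages into a single templated prompt."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>")
    parts.append("<|assistant|>\n")  # the model generates from this point
    return "\n".join(parts)

prompt = render_chat([
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```

The point is that "input" and "output" are not separate channels at the model level; the API simply cuts the stream at the final `<|assistant|>` marker and hands everything after it back as the response.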

The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output as completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.

When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually can’t prefill a model’s response, meaning developers can’t force a model to begin a response with a certain sentence or paragraph.3 This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it’s inefficient and risky not to enforce it at the token level.4 And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers when only part of the answer is wrong.5
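Token-level prefilling is mechanically simple when you control the decode loop, as you do with a local open weight model. A toy sketch (the lookup table stands in for a real model’s forward pass; everything here is illustrative):

```python
# Toy sketch of token-level prefilling: the decoder emits a developer-
# supplied prefix verbatim (no sampling), then continues on its own.
# TOY_NEXT_WORD stands in for a real LLM's next-token prediction.

TOY_NEXT_WORD = {
    "Answer:": "42",
    "42": "<eos>",
}

def generate(prompt_tokens, prefill=(), max_new=8):
    out = list(prefill)  # forced prefix: these tokens are never sampled
    while len(out) < max_new:
        last = out[-1] if out else prompt_tokens[-1]
        nxt = TOY_NEXT_WORD.get(last, "<eos>")
        if nxt == "<eos>":
            break
        out.append(nxt)
    return out

# With a prefill of ["Answer:"], the response is guaranteed to start in
# the required format; the model only generates the continuation.
print(generate(["What", "is", "6*7?"], prefill=["Answer:"]))
```

The same mechanism gives you partial regeneration for free: to redo only the tail of a response, pass the good portion back in as the prefill and let the model regenerate from there.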

Another deficiency that’s particularly visible is how the model’s chain-of-thought reasoning is handled. Most large AI companies have made a habit of hiding the models’ reasoning tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you must rely on the model’s own reasoning and can’t reuse reasoning traces to regenerate the same message.

There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling drastically increases the attack surface for prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model shouldn’t be the only line of defense against attackers. Many companies already run prompts through classification models to detect prompt injections, and the same kind of safeguard could be used against prefill attack attempts.

Output with few controls

Prefilling isn’t the only casualty of a clean separation between input and output. Even within a message, there are levers available on a local open weight model that simply aren’t possible when using a standard API. This matters because these controls let developers preemptively validate outputs and ensure that responses follow a certain structure, both reducing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output doesn’t inherently have to be limited to JSON.6 The same technique, constrained decoding (limiting the tokens the model is allowed to produce at any given point), could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without using certain letters, or even enforce valid chess moves at inference time. It’s a powerful feature that lets developers precisely define what output is acceptable and what isn’t, guaranteeing reliable output that meets the developer’s parameters.
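The core trick behind constrained decoding is just logit masking: before picking the next token, zero out every candidate that would violate the constraint. A minimal sketch with a five-word toy vocabulary and made-up scores standing in for real model logits:

```python
# Minimal sketch of constrained decoding via logit masking: tokens that
# violate the constraint get -inf before the argmax, so they can never be
# chosen. The toy vocabulary and scores stand in for a real model.

import math

VOCAB = ["7", "seven", "3", "maybe", "1"]

def constrained_greedy(step_logits, allowed):
    """Greedy decode, picking the best token that satisfies `allowed`."""
    out = []
    for logits in step_logits:
        masked = [(l if allowed(tok) else -math.inf)
                  for tok, l in zip(VOCAB, logits)]
        out.append(VOCAB[masked.index(max(masked))])
    return "".join(out)

# Unconstrained, the argmax at step 1 would be "seven"; the digit-only
# constraint forces the structurally valid "7" instead.
toy_logits = [[1.0, 2.0, 0.5, 0.1, 0.2],
              [0.3, 0.1, 1.5, 2.0, 0.4]]
print(constrained_greedy(toy_logits, allowed=str.isdigit))
```

JSON mode, regex grammars, and chess-move enforcement all reduce to choosing a smarter `allowed` predicate, one that tracks what token continuations keep the output a valid prefix of the target format.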

The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs weren’t designed to give developers full control over output because not everyone needs or wants that complexity. But that’s not an argument against including these features; it’s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It’s not infeasible for them to make a third, more advanced endpoint.

A lack of visibility

Even the model output that third-party developers do get through the model’s API is often a watered-down version of what the model produces. LLMs don’t just generate one token at a time; they produce log probabilities (logprobs) for every possible next token. When using an API, however, Google only provides the top 20 most likely logprobs. OpenAI no longer provides any logprobs for GPT-5 models, while Anthropic has never offered any at all. This has real-world consequences for reliability. Log probabilities are one of the most useful signals a developer has for understanding model confidence. When a model assigns nearly equal probability to competing tokens, that uncertainty itself is meaningful information. And even for those companies that do provide the top 20 tokens, that’s often not enough to cover larger classification tasks.
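When logprobs are returned, turning them into a confidence signal is a one-liner: exponentiate and compare the gap between the top two candidates. A small sketch with illustrative numbers and an arbitrary threshold:

```python
# Sketch of using returned logprobs as a confidence signal: the gap
# between the two most likely tokens' probabilities. A small gap means
# the model nearly went the other way. Numbers are illustrative.

import math

def confidence_margin(top_logprobs):
    """Probability gap between the two most likely tokens."""
    probs = sorted((math.exp(lp) for lp in top_logprobs), reverse=True)
    return probs[0] - probs[1]

confident = confidence_margin([-0.05, -3.2, -5.0])   # one clear winner
uncertain = confidence_margin([-0.75, -0.80, -2.5])  # near coin-flip
print(round(confident, 2), round(uncertain, 2))      # prints: 0.91 0.02
```

A classifier built on an LLM can route low-margin answers to a human or a retry, which is exactly the kind of reliability machinery that disappears when the API withholds logprobs entirely.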

When it comes to reasoning tokens, even less output information is provided. Major providers such as Anthropic,7 Google, and OpenAI8 only provide summarized thinking for their proprietary models. And OpenAI only offers even that when a valid government ID is supplied to OpenAI. This not only takes away the user’s ability to truly inspect how a model arrived at a certain answer, but it also limits the developer’s ability to diagnose why a query failed. When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky on the final token. A summary obscures some of that, providing only an approximation of what actually happened. This isn’t an issue with the model; the model is still producing its full reasoning trace. It’s an issue with what information is provided to the end developer.

The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information the API returns. It’s hard to distill on tokens you can’t see, and without logprobs, distillation takes longer and each example provides less information.9 And this risk is something AI companies need to consider carefully, since distillation is a powerful technique for imitating the abilities of strong models at a low price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a national security risk by many, still shot straight to the top of US app stores upon release and is used by many researchers and scientists, largely due to its openness. And in a world where open models are becoming more and more powerful, not giving developers proper access to a model’s outputs could mean losing developers to cheaper and more open alternatives.

Reliability requires control and visibility

The reliability problems of current LLMs don’t stem solely from the models themselves but also from the tooling that providers give developers. For local open weight models it’s usually possible to trade off complexity for reliability. The entire reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding can be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made about which features to restrict in APIs hurt developers and ultimately end users.

LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the highest-impact improvements, such as showing thinking output, allowing prefilling, or exposing logprobs, cost almost nothing, but would be a major step toward making LLMs more controllable, consistent, and reliable.

There’s a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an AI safety problem.

Specifically, to take reliability seriously, model providers should improve their APIs by adding features that give developers more visibility and control over output. Reasoning should be provided in full at all times, with any safety violations handled the same way they would be handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like regex or formal grammars.10 Developers should be granted full control over “assistant” output: they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from creating a new, more advanced API. They’ve done it before. The decision to withhold these features is a policy choice, not a technical limitation.

Improving intelligence isn’t the only way to improve reliability and control, but it’s usually the only lever that gets pulled.


Footnotes
