6.5 C
Canberra
Monday, June 22, 2026

Responses Bug in LM Studio


It began, as this stuff do, with a shortcut I used to be sure would work.

I’ve been constructing SwiftAgents, my Swift framework for speaking to language fashions, and one of many native suppliers it helps is LM Studio — the app quite a lot of us attain for to run fashions on our personal Macs. LM Studio just lately grew assist for the newer “Responses” API, the OpenAI-style endpoint that may keep in mind a dialog for you. As a substitute of re-sending the entire chat historical past on each flip, you ship solely the brand new message plus slightly breadcrumb — previous_response_id — that tells the server “you already keep in mind the remainder.” Much less knowledge over the wire, much less bookkeeping on the shopper. An apparent win, and I needed it in SwiftAgents.

Earlier than wiring it in for good, I requested Claude Code to benchmark it. Ten turns of the identical little dialog, run two methods: as soon as with the brand new chaining trick, and as soon as the old style means the place you resend your entire historical past each single time. I simply needed to verify the intelligent path was quicker earlier than committing to it.

The numbers got here again backwards.

When the shortcut is the great distance

Here’s what the benchmark discovered, working a small Qwen3 mannequin inside LM Studio. The left column is the “optimization” — chaining with previous_response_id, sending solely the brand new message every flip. The appropriate column is the brute-force strategy — resending your entire dialog, each time, like a caveman.

The quantity proven is what number of enter tokens the server really needed to course of on that flip:

Flip Chaining (solely the brand new message despatched) Full resend (entire historical past each time)
1 26 26
2 48 48
3 98 69
4 206 95
5 415 120
6 829 141
7 1,669 169
8 3,338 191
9 6,677 211
10 13,364 238

Learn it twice, as a result of I needed to. The wasteful strategy — resending every little thing — retains the workload flat, round 240 tokens by flip ten. The intelligent strategy, the place I ship virtually nothing, by some means makes the server grind by way of 13 thousand.

And have a look at the form of that left column: 26, 48, 98, 206, 415, 829… it doubles each flip. A textbook geometric balloon. Regardless of the server does internally when it “remembers” the dialog for you, it rebuilds the entire thing roughly twice as giant every time. For the reason that mannequin has to learn all of these tokens earlier than it may possibly say a phrase, the wait balloons proper together with the token depend. By flip ten a single reply took 28 seconds with chaining, in opposition to 3 seconds with out.

The optimization was, comfortably, the slowest doable option to maintain the dialog.

Ensuring it wasn’t simply me

A outcome that foolish deserves suspicion, so the subsequent step was to test whether or not I’d misconfigured one thing or stumbled onto one dangerous mannequin. The primary thought was to run the benchmark in opposition to official GPT 5.5 – and there the caching behaved precisely as you’d count on. Then I requested Claude Code to run the identical probe throughout a lot of LLMs I had beforehand downloaded.

The balloon confirmed up each single time — small fashions and enormous, previous architectures and brand-new ones, the plain ones and the flamboyant “reasoning” ones, and even a mixture-of-experts mannequin. Similar fingerprint every time: the chained path doubles each flip, the full-resend path stays flat.

Just a few of the extra memorable knowledge factors:

  • gpt-oss (a 20-billion-parameter mixture-of-experts mannequin): ballooned to 16,833 tokens by flip ten — for a dialog that was genuinely 283 tokens lengthy. That’s a 59× tax. The stunning irony right here is that this mannequin barely “thinks” out loud in any respect, but it scored the worst blowup of the lot, which informed us the bug has nothing to do with how a lot the mannequin generates and every little thing to do with how the server rebuilds the historical past.
  • A 12-billion Gemma mannequin: by flip ten, a single reply took 37.6 seconds as an alternative of the ~2.6 seconds the identical dialog wanted over the plain chat endpoint.

Importantly, this isn’t the Responses API being a nasty thought, and it isn’t LM Studio being dangerous software program — its peculiar chat endpoint is fast and caches superbly. It’s one particular characteristic, the server-side dialog reconstruction behind previous_response_id, that misbehaves. I do know it’s particular to LM Studio as a result of the plain factors of comparability don’t do it: OpenAI’s personal servers preserve the token depend equal to the actual dialog, and Ollama — which merely declines to be stateful — retains it flat too. Solely LM Studio’s reconstruction inflates.

So fairly than ship a characteristic that makes issues slower, I did the boring, right factor in SwiftAgents: on LM Studio it resends the total historical past and skips the chaining completely. And I wrote the entire thing up, with a runnable copy script, as a bug report on LM Studio’s tracker. Typically the deliverable is a paper path.

A aspect quest: the app I liked versus the one I didn’t

Someplace in the midst of all this benchmarking, a special query crept in.

I’ve at all times most popular LM Studio. It’s the better-looking app, it feels extra fashionable, and — the explanation that truly mattered to me — it supported MLX, Apple’s on-device machine-learning framework, lengthy earlier than Ollama did. On Apple Silicon, MLX is the quick path, so for a very good whereas LM Studio was merely the faster option to run a mannequin on a Mac. Ollama was the command-line workhorse I revered however didn’t attain for.

Whereas poking at Gemma 4, I seen Ollama had quietly closed that hole — it now runs the identical fashionable, accelerated mannequin codecs I’d switched to LM Studio for within the first place. Which meant, for the primary time, I may put the 2 of them on a very stage enjoying discipline: the similar mannequin, within the similar quantization, and simply race them.

So I did. Right here’s Gemma-4-E4B, an identical nvfp4 construct on each:

Ollama LM Studio
Studying your immediate (immediate processing) 910 tok/s 445 tok/s
Writing the reply (era) 62.7 tok/s 51.7 tok/s
Time till the primary phrase seems 72 ms 121 ms
Re-reading a 1,780-token immediate it simply noticed (heat cache) 65 ms 657 ms

Ollama wins each row. It reads prompts twice as quick, generates noticeably faster, begins answering sooner, and — the one which stunned me most — reuses its cache about ten instances extra cheaply. Ask it to re-read a immediate it simply processed and it’s finished in 65 milliseconds; LM Studio takes the higher a part of a second to do the identical factor.

I need to be truthful, as a result of there’s an trustworthy caveat buried in right here. The primary time I raced them I had LM Studio on MLX and Ollama on the older format, and in that mismatched setup LM Studio’s era seemed quicker. It was a entice — I used to be evaluating the quick format in opposition to the gradual one. The second I matched them quant-for-quant, the obvious win evaporated and Ollama pulled forward on every little thing. So I gained’t declare Ollama is universally quicker at every little thing for everybody; I’ll declare the factor my knowledge really helps, which is that on the identical mannequin in the identical format, Ollama got here out forward in all places I seemed.

That’s a barely uncomfortable conclusion for me, given how a lot I preferred the opposite app. However the stopwatch doesn’t care what’s prettier.

The half I preserve excited about

Right here’s the bit that genuinely tickles me, and it’s not likely about tokens in any respect.

I didn’t write any of those benchmarks. I described what I needed to know — “load a mannequin, run ten turns every means, observe the response time” — and Claude Code wrote the Python, ran it and computed all of the statistics. When it wanted a mannequin that wasn’t loaded, it drove LM Studio’s command-line device to load it, checked the API to verify it was actually resident, and benchmarked it.

At one level it quoted a era velocity that seemed too good, paused, determined the measurement window had been too brief to belief, rewrote the benchmark to generate an extended pattern, and re-ran it to get an trustworthy quantity. It even filed the bug report on my behalf. You possibly can see how more information was added as feedback as I used to be discovering extra knowledge.

On the similar time my agentic CI loop was ticking as properly on the SwiftAgents PR. When the pull request’s continuous-integration construct went crimson on Linux — as a result of a kind I’d used lives in a special module off the Mac — it recognized the failure, reached for my very own SwiftCross shim to repair it, pushed, watched the construct, discovered a second spot with the identical downside, mounted that too, and waited with me till all six platforms went inexperienced. I principally watched.

Just a few months in the past, writing a benchmark harness by hand would have been an excessive amount of work for me. So I wouldn’t have finished this analysis, however I’d have simply complained on Twitter about one other downside in someone else’s code. And I’d have been annoyed that I couldn’t do something about it. On this new actuality brokers do the analysis, the write-up and the submitting of the problem. The ball is now in LM Studio’s court docket. This new actuality nonetheless feels faintly like dishonest.

I put the benchmarking scripts in gist for reference.

What I modified

Two issues got here out of a day that was solely ever meant to verify a one-line optimization.

SwiftAgents now does the smart factor on LM Studio: it resends the total dialog and leaves previous_response_id chaining properly alone till the underlying balloon is mounted. The “optimization” stays on the shelf.

And alone machine, my default has quietly shifted from the app I preferred to the one which’s quicker. I nonetheless assume LM Studio is the nicer factor to take a look at. However I’ve been doing this lengthy sufficient to know that when the numbers are that constant, you go the place the numbers level — even once they level someplace you didn’t count on, and even when an AI is the one holding the stopwatch.

Do you employ any native inferencing? In that case, which do you favor?


Classes: Bug Studies

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

[td_block_social_counter facebook="tagdiv" twitter="tagdivofficial" youtube="tagdiv" style="style8 td-social-boxed td-social-font-icons" tdc_css="eyJhbGwiOnsibWFyZ2luLWJvdHRvbSI6IjM4IiwiZGlzcGxheSI6IiJ9LCJwb3J0cmFpdCI6eyJtYXJnaW4tYm90dG9tIjoiMzAiLCJkaXNwbGF5IjoiIn0sInBvcnRyYWl0X21heF93aWR0aCI6MTAxOCwicG9ydHJhaXRfbWluX3dpZHRoIjo3Njh9" custom_title="Stay Connected" block_template_id="td_block_template_8" f_header_font_family="712" f_header_font_transform="uppercase" f_header_font_weight="500" f_header_font_size="17" border_color="#dd3333"]
- Advertisement -spot_img

Latest Articles