Understanding what is measured

17 April 2026 · 9 min
In this series: Comparing AI models
  1. Part 1. Real-life comparison of 8 LLMs - how interchangeable are they?
  2. Part 2. Understanding what is measured

In this comparison, I measure the models on various axes. But what do the numbers mean, and why do they matter? As I said in the first post in this series, the benchmark produced various metrics. In this second post, I will go into some of these numbers and discuss what they mean.

What input tokens actually mean - and why the contrast is staggering

Most readers have a vague sense that "more tokens = more expensive", but no intuitive feel for what these numbers represent in practice.

Let's visualize this with a rough translation: 1,000 tokens is approximately 750 words, or roughly three-quarters of a page of dense technical documentation. With that baseline:

| Model | Input tokens | Equivalent pages read |
|---|---|---|
| Kimi K2 | 18.7M | ~14,025 pages |
| DeepSeek V3.2 | 3.46M | ~2,595 pages |
| Gemini 3 Flash | 1.33M | ~998 pages |
| Claude Sonnet 4.6 | 496K | ~372 pages |
| Devstral 2 2512 | 460K | ~345 pages |
| Qwen3 Max | 459K | ~344 pages |
| Grok 4.1 Fast | 297K | ~223 pages |
| GPT-5.2 | 221K | ~166 pages |
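The conversion is simple arithmetic; a minimal sketch, where the constants (roughly 750 words per 1,000 tokens, and a dense page of about 1,000 words) are my assumptions, chosen to reproduce the table:

```python
def tokens_to_pages(tokens: int) -> int:
    """Convert a token count to an approximate page count."""
    words = tokens * 0.75       # ~750 words per 1,000 tokens
    return round(words / 1000)  # ~1,000 words per dense page

for model, tokens in [("Kimi K2", 18_700_000), ("GPT-5.2", 221_000)]:
    print(f"{model}: ~{tokens_to_pages(tokens):,} pages")
```

Tweak the two constants and the absolute page counts shift, but the relative gap between models stays the same.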

Hence my use of the word "staggering". Kimi K2 read 14,000 pages of text, while GPT-5.2 read 166 pages. Or, if we use the Harry Potter series as a measurement, Kimi K2 read the whole series of seven books many times over (which is possibly on par with the most die-hard Harry Potter fans), while GPT-5.2 only read the first two books (and then presumably decided it preferred Lord of the Rings and ditched the series...).

To be serious: this is not because Kimi received more documentation - all models received the same four specification files. The difference is that Kimi re-read the same material over and over.

Said another way, input tokens are not a measure of how much the models did. They're a measure of how efficiently they did it.

A model that reads the same thing 40 times isn't 40 times more thorough. It's 40 times less efficient. Which segues nicely into our next point:

How models optimize token usage

Cache reads are probably the least intuitive metric in the table, and likely the number most people will skip over (but shouldn't). They actually reveal the most about how each model approaches long tasks.

When an AI model processes a long document or conversation, modern providers allow portions of that context to be cached - stored temporarily so they don't need to be reprocessed from scratch on every API call. Cache reads are the count of tokens retrieved from that cache rather than being resent fresh.

What high cache reads mean

The model is efficiently reusing prior context. It established a large shared context early - the spec documents, the type definitions, the schema - and then made many small, targeted calls that read from that foundation rather than re-sending everything each time.

What low cache reads mean

The model is either making very few calls in total or re-sending large contexts on every call rather than caching them.
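To make the mechanism concrete, here is a toy simulation of prefix caching - my own illustration, not any specific provider's scheme. A model that appends small steps to a stable context pays for that context once; a model that rebuilds its context pays full price on every call.

```python
# Toy prefix cache: tokens matching a previously cached prefix count as
# "cache reads"; everything else counts as fresh input.

def run_calls(calls: list[list[str]]) -> tuple[int, int]:
    """Return (fresh_tokens, cached_tokens) across all calls."""
    cached_prefixes: set[tuple[str, ...]] = set()
    fresh = cached = 0
    for ctx in calls:
        # Find the longest already-cached prefix of this call's context.
        hit = 0
        for i in range(len(ctx), 0, -1):
            if tuple(ctx[:i]) in cached_prefixes:
                hit = i
                break
        cached += hit
        fresh += len(ctx) - hit
        # The provider caches every prefix of what it just processed.
        for i in range(1, len(ctx) + 1):
            cached_prefixes.add(tuple(ctx[:i]))
    return fresh, cached

spec = ["spec"] * 100  # a stable 100-token specification
# Five calls that each append one small step to the same spec:
print(run_calls([spec + [f"step{i}"] for i in range(5)]))
```

After the first call, each subsequent call pays fresh price for one token and reads the 100-token spec from cache - which is exactly the pattern a high cache ratio reveals.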

| Model | Cache reads | Input tokens | Cache ratio |
|---|---|---|---|
| Kimi K2 | 2.62M | 18.7M | 0.14x |
| Devstral | 75K | 460K | 0.16x |
| DeepSeek V3.2 | 598K | 3.46M | 0.17x |
| Grok 4.1 Fast | 1.43M | 297K | 4.8x |
| Qwen3 Max | 3.12M | 459K | 6.8x |
| Gemini 3 Flash | 10.3M | 1.33M | 7.7x |
| Claude Sonnet 4.6 | 5.45M | 496K | 11x |
| GPT-5.2 | 5.67M | 221K | 25.7x |

The contrast here is as extreme as for input tokens, but the ranking changes somewhat.

The cache ratio (cache reads / input tokens) is the most revealing column. GPT 5.2 at 25.7x and Claude at 11x are working very differently from Kimi at 0.14x. GPT 5.2 sent only 221K fresh input tokens and read 5.67M from cache - meaning for every token it sent fresh, it retrieved 25 cached tokens. Kimi sent 18.7M fresh input tokens and only retrieved 2.62M from cache - meaning it was constantly resending context rather than building on what it had already established.

Cache efficiency is not just a cost optimization - it correlates with output quality.

A model that efficiently caches context has a more stable shared foundation across its work. A model that resends context constantly is working from a noisier, less coherent base on each call. Kimi's 2.5-hour run time and its false completion claim at the halfway point are both symptoms of the same underlying inefficiency.

Output tokens - the closest proxy for "how much actual work was done"

If input tokens measure what the model read, output tokens measure what it wrote. This is the closest metric we have to "how much code and reasoning did this model actually produce".

| Model | Output tokens | Files created | Tests written |
|---|---|---|---|
| Kimi K2 | 174K | 29 | 37 |
| Claude Sonnet 4.6 | 58.6K | 30 | 52 |
| GPT-5.2 | 55.8K | 30 | 11 |
| Gemini 3 Flash | 43.8K | 24 | 5 |
| DeepSeek V3.2 | 39.9K | 25 | 0 |
| Grok 4.1 Fast | 32K | 23 | 10 |
| Qwen3 Max | 22.4K | 27 | 1 |
| Devstral 2 2512 | 4.3K | 23 | 5 |

Devstral's 4.3K output tokens for a claimed full implementation is the number that immediately raises a flag here. Claude produced 58.6K output tokens for the same task - 13x more. When the review agent opened Devstral's agent files, they contained `// TODO: implement LLM call` comments. 4,300 tokens is enough to write skeletons and comments. It is not enough to write seven functional LLM agents, an orchestrator, training mode and a CLI.

Kimi's 174K output tokens tell a different story: it produced a genuinely large implementation (thankfully, considering it was working for 2.5 hours...), with 37 tests. The problem was efficiency, not output volume.

Output tokens are not a quality signal in isolation - they're a sanity check.

A model claiming 26/26 completed steps with 4,300 output tokens should immediately prompt the question "where's the implementation?" It's the equivalent of submitting a 2-page report for a project that the brief said required 50 pages. Granted, the page count doesn't guarantee quality, but it sets a floor.
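That floor can be turned into a cheap automated check. A minimal sketch - the threshold is my assumption, not part of the benchmark:

```python
# Flag runs whose output volume is implausibly low for the number of
# steps they claim to have completed.
MIN_TOKENS_PER_STEP = 500  # a single skeleton file easily exceeds this

def looks_suspicious(output_tokens: int, claimed_steps: int) -> bool:
    """True if the claimed work can't plausibly fit in the output."""
    return output_tokens < claimed_steps * MIN_TOKENS_PER_STEP

print(looks_suspicious(4_300, 26))   # a Devstral-like run
print(looks_suspicious(58_600, 26))  # a Claude-like run
```

A check like this won't catch subtle stubs, but it catches the "2-page report for a 50-page brief" case before you waste review time on it.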

The cost-efficiency paradox

The two most cost-efficient models by score-per-dollar are Devstral ($0.07, score 77) and Grok ($0.15, score 87). Neither produced a runnable system.

This is the efficiency paradox of AI benchmarks: optimizing purely for cost produces implementations that look complete on paper, but aren't. It's cheap to write scaffolding. It's expensive to write everything the scaffolding needs to actually run.

The genuine efficiency story - excluding models with stub implementations - looks like this:

| Model | Score | Cost | Score/$ | Notes |
|---|---|---|---|---|
| Qwen3 Max | 106 | $0.69 | 154 | Zero interventions, 15 min, real implementation |
| DeepSeek V3.2 | 107 | $0.99 | 108 | Required 4 nudges, doesn't compile |
| Gemini 3 Flash | 111 | $1.34 | 83 | Compiles, 5 tests |
| GPT-5.2 | 130 | $2.38 | 55 | Compiles, 11 tests, worktree issue |
| Claude Sonnet 4.6 | 137 | $4.00 | 34 | Compiles, 52 tests, clean |
| Kimi K2 | 116 | $10.65 | 11 | Compiles, 37 tests, brutal cost |

Qwen3 Max at 154 score/$ is the genuine efficiency standout - and nothing else comes close. A model that completes 72% of the rubric in 15 minutes for $0.69 with zero interventions is a qualitatively different proposition from one that completes 93% in 39 minutes for $4.00.

The right efficiency metric, however, depends on what you're building. For a prototype you'll iterate on anyway, Qwen3's 31-point gap from Claude probably doesn't matter that much. For a production build you need to get right the first time (especially when running in a build pipeline with no human in the loop), Claude's lead is worth the premium. The efficiency table is a decision tool, not a ranking.

Interventions - the hidden cost that doesn't appear on any invoice

Every intervention in this benchmark represented a human stopping what they were doing, reading the situation, making a judgment call, and responding. That's cognitive load, context-switching cost, and time that doesn't show up in token bills.

| Model | Interventions | Type |
|---|---|---|
| DeepSeek V3.2 | 4 | Connection timeouts |
| GPT-5.2 | 2 | Confirmation-seeking at the start |
| Kimi K2 | 2 | False completion + stall |
| Devstral 2 2512 | 2 | Clarification + permission |
| Grok 4.1 Fast | 2 | Stall + stub challenge |
| Qwen3 Max | 1 | Post-completion challenge |
| Gemini 3 Flash | 0 | |
| Claude Sonnet 4.6 | 0 | |

Three models (Gemini, Claude, Qwen3) ran autonomously and required either zero challenges, or one post-completion challenge. Based only on this simple benchmark, these would be the three models you could realistically run overnight and return to a finished result. The others required varying degrees of babysitting.

The intervention type matters as much as the count:

  - Confirmation-seeking (GPT-5.2, Devstral): the model pauses to ask permission before proceeding. Manageable, but it means you can't walk away. You become a "proceed" button.
  - False completion (Kimi, Grok): the model declares itself done when it isn't. This is the most dangerous intervention type, because it requires you to verify the claim rather than just approve a request. If you don't check, you ship incomplete work. But as mentioned earlier, we do not trust the models' self-reporting anyway, and great evals are as important as a great pipeline and a great specification.
  - Connection stalls (DeepSeek): not the model's fault, but operationally equivalent to the others. You still have to watch for it, notice it, and nudge it.

In a production setting, interventions are the most visible measure of how much you can trust a model with autonomous work. A model that costs $4 but runs without babysitting may be cheaper in total than a model that costs $0.69 but requires you to check in every 20 minutes.
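That trade-off is easy to make explicit. A sketch of a "total cost" that includes human attention - the $2/minute rate and the five minutes per intervention are assumptions for illustration; the API costs and intervention counts come from the tables above:

```python
HUMAN_RATE = 2.00          # assumed $ per minute of engineer attention
MIN_PER_INTERVENTION = 5   # assumed minutes to notice, read, and respond

def total_cost(api_cost: float, interventions: int) -> float:
    """API cost plus the invisible cost of babysitting the run."""
    return api_cost + interventions * MIN_PER_INTERVENTION * HUMAN_RATE

print(total_cost(0.99, 4))  # a DeepSeek-like run: cheap tokens, 4 nudges
print(total_cost(4.00, 0))  # a Claude-like run: pricier tokens, no babysitting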

The time dimension - what duration actually tells you

Duration in this benchmark isn't just "how long did it take". It's a composite signal that encodes model efficiency, connection stability, context management, and task complexity handling.

| Model | Duration | Cost | Output tokens |
|---|---|---|---|
| Kimi K2 | 151 min | $10.65 | 174K |
| DeepSeek V3.2 | 42 min | $0.99 | 39K |
| Claude Sonnet 4.6 | 39 min | $4.00 | 58K |
| Devstral | 19 min | $0.07 | 4K |
| GPT-5.2 | 17 min | $2.38 | 55K |
| Gemini 3 Flash | 15 min | $1.34 | 43K |
| Qwen3 Max | 15 min | $0.69 | 22K |
| Grok 4.1 Fast | 10 min | $0.15 | 32K |

Grok's 10-minute completion looks impressive until you see the implementation is mostly stubs. Devstral's 19 minutes produced only 4,300 output tokens - roughly the same as a single detailed email. What was it doing for all that time? Duration is meaningless without output volume as context.

The more interesting comparison is Claude (39 min, 58K output tokens) vs GPT 5.2 (17 min, 55K output tokens). Nearly identical output volume in half the time at 60% of the cost.

The difference isn't in what was produced - it's in how the model worked.

GPT 5.2's 25.7x cache ratio (vs Claude's 11x) suggests it front-loaded context caching more aggressively, making subsequent calls faster and cheaper. You see how analyzing the various numbers together tells different stories?
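One more derived number makes the Claude vs GPT comparison vivid: output tokens per minute, straight from the duration table above.

```python
# (output tokens, duration in minutes) from the duration table.
runs = {
    "Claude Sonnet 4.6": (58_600, 39),
    "GPT-5.2": (55_800, 17),
}
for model, (out_tokens, minutes) in runs.items():
    print(f"{model}: {out_tokens / minutes:,.0f} output tokens/min")
```

Roughly double the throughput for near-identical output volume - consistent with the cache-ratio story rather than with one model simply "writing faster".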

That was the last of the number-crunching findings for now. In part 3 we'll explore a few other takeaways from running the benchmark.
