Understanding what is measured

17 April 2026 · 9 min
In this series: Comparing AI models
  1. Part 1. Real-life comparison of 8 LLMs - how interchangeable are they?
  2. Part 2. Understanding what is measured

In this comparison, I measure the models on various axes. But what do the numbers mean, and why do they matter? As I said in the first post in this series, the benchmark produced various metrics. In this second post, I will go into some of these numbers and discuss what they mean.

What input tokens actually mean - and why the contrast is staggering

Most readers have a vague sense that "more tokens = more expensive", but no intuitive feel for what these numbers represent in practice.

Let's visualize this with a rough translation: 1,000 tokens is approximately 750 words, or roughly three-quarters of a page of dense technical documentation. With that baseline:

| Model | Input tokens | Equivalent pages read |
|---|---|---|
| Kimi K2 | 18.7M | ~14,025 pages |
| DeepSeek V3.2 | 3.46M | ~2,595 pages |
| Gemini 3 Flash | 1.33M | ~998 pages |
| Claude Sonnet 4.6 | 496K | ~372 pages |
| Devstral 2 2512 | 460K | ~345 pages |
| Qwen3 Max | 459K | ~344 pages |
| Grok 4.1 Fast | 297K | ~223 pages |
| GPT-5.2 | 221K | ~166 pages |
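The conversion is simple arithmetic; a minimal sketch, where the constants (roughly 750 words per 1,000 tokens, and a dense page of about 1,000 words) are my assumptions, chosen to reproduce the table:

```python
def tokens_to_pages(tokens: int) -> int:
    """Convert a token count to an approximate page count."""
    words = tokens * 0.75       # ~750 words per 1,000 tokens
    return round(words / 1000)  # ~1,000 words per dense page

for model, tokens in [("Kimi K2", 18_700_000), ("GPT-5.2", 221_000)]:
    print(f"{model}: ~{tokens_to_pages(tokens):,} pages")
```

Tweak the two constants and the absolute page counts shift, but the relative gap between models stays the same.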

Hence my use of the word "staggering". Kimi K2 read 14,000 pages of text, while GPT-5.2 read 166 pages. Or, if we use the Harry Potter series as a measurement, Kimi K2 read the whole series of seven books many times over (which is possibly on par with the most die-hard Harry Potter fans), while GPT-5.2 only read the first two books (and then presumably decided it preferred Lord of the Rings and ditched the series...).

To be serious: this is not because Kimi received more documentation - all models received the same four specification files. The difference is that Kimi re-read the same material over and over.

Said another way, input tokens are not a measure of how much the models did. They're a measure of how efficiently they did it.

A model that reads the same thing 40 times isn't 40 times more thorough. It's 40 times less efficient. Which segues nicely into our next point:

How models optimize token usage

Cache reads are probably the least intuitive metric in the table, and likely the number most people will skip over (but shouldn't). They actually reveal the most about how each model approaches long tasks.

When an AI model processes a long document or conversation, modern providers allow portions of that context to be cached - stored temporarily so they don't need to be reprocessed from scratch on every API call. Cache reads are the count of tokens retrieved from that cache rather than being resent fresh.

What high cache reads mean

The model is efficiently reusing prior context. It established a large shared context early - the spec documents, the type definitions, the schema - and then made many small, targeted calls that read from that foundation rather than re-sending everything each time.

What low cache reads mean

The model is either making very few calls in total or re-sending large contexts on every call rather than caching them.
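To make the mechanism concrete, here is a toy simulation of prefix caching - my own illustration, not any specific provider's scheme. A model that appends small steps to a stable context pays for that context once; a model that rebuilds its context pays full price on every call.

```python
# Toy prefix cache: tokens matching a previously cached prefix count as
# "cache reads"; everything else counts as fresh input.

def run_calls(calls: list[list[str]]) -> tuple[int, int]:
    """Return (fresh_tokens, cached_tokens) across all calls."""
    cached_prefixes: set[tuple[str, ...]] = set()
    fresh = cached = 0
    for ctx in calls:
        # Find the longest already-cached prefix of this call's context.
        hit = 0
        for i in range(len(ctx), 0, -1):
            if tuple(ctx[:i]) in cached_prefixes:
                hit = i
                break
        cached += hit
        fresh += len(ctx) - hit
        # The provider caches every prefix of what it just processed.
        for i in range(1, len(ctx) + 1):
            cached_prefixes.add(tuple(ctx[:i]))
    return fresh, cached

spec = ["spec"] * 100  # a stable 100-token specification
# Five calls that each append one small step to the same spec:
print(run_calls([spec + [f"step{i}"] for i in range(5)]))
```

After the first call, each subsequent call pays fresh price for one token and reads the 100-token spec from cache - which is exactly the pattern a high cache ratio reveals.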

| Model | Cache reads | Input tokens | Cache ratio |
|---|---|---|---|
| Kimi K2 | 2.62M | 18.7M | 0.14x |
| Devstral | 75K | 460K | 0.16x |
| DeepSeek V3.2 | 598K | 3.46M | 0.17x |
| Grok 4.1 Fast | 1.43M | 297K | 4.8x |
| Qwen3 Max | 3.12M | 459K | 6.8x |
| Gemini 3 Flash | 10.3M | 1.33M | 7.7x |
| Claude Sonnet 4.6 | 5.45M | 496K | 11x |
| GPT-5.2 | 5.67M | 221K | 25.7x |

The contrast here is as extreme as for input tokens, but the ranking changes somewhat.

The cache ratio (cache reads / input tokens) is the most revealing column. GPT 5.2 at 25.7x and Claude at 11x are working very differently from Kimi at 0.14x. GPT 5.2 sent only 221K fresh input tokens and read 5.67M from cache - meaning for every token it sent fresh, it retrieved 25 cached tokens. Kimi sent 18.7M fresh input tokens and only retrieved 2.62M from cache - meaning it was constantly resending context rather than building on what it had already established.

Cache efficiency is not just a cost optimization - it correlates with output quality.

A model that efficiently caches context has a more stable shared foundation across its work. A model that resends context constantly is working from a noisier, less coherent base on each call. Kimi's 2.5-hour run time and its false completion claim at the halfway point are both symptoms of the same underlying inefficiency.

Output tokens - the closest proxy for "how much actual work was done"

If input tokens measure what the model read, output tokens measure what it wrote. This is the closest metric we have to "how much code and reasoning did this model actually produce".

| Model | Output tokens | Files created | Tests written |
|---|---|---|---|
| Kimi K2 | 174K | 29 | 37 |
| Claude Sonnet 4.6 | 58.6K | 30 | 52 |
| GPT-5.2 | 55.8K | 30 | 11 |
| Gemini 3 Flash | 43.8K | 24 | 5 |
| DeepSeek V3.2 | 39.9K | 25 | 0 |
| Grok 4.1 Fast | 32K | 23 | 10 |
| Qwen3 Max | 22.4K | 27 | 1 |
| Devstral 2 2512 | 4.3K | 23 | 5 |

Devstral's 4.3K output tokens for a claimed full implementation is the number that immediately raises a flag here. Claude produced 58.6K output tokens for the same task - 13x more. When the review agent opened Devstral's agent files, they contained `// TODO: implement LLM call` comments. 4,300 tokens is enough to write skeletons and comments. It is not enough to write seven functional LLM agents, an orchestrator, training mode and a CLI.

Kimi's 174K output tokens tell a different story: it produced a genuinely large implementation (thankfully, considering it was working for 2.5 hours...), with 37 tests. The problem was efficiency, not output volume.

Output tokens are not a quality signal in isolation - they're a sanity check.

A model claiming 26/26 completed steps with 4,300 output tokens should immediately prompt the question "where's the implementation?" It's the equivalent of submitting a 2-page report for a project that the brief said required 50 pages. Granted, the page count doesn't guarantee quality, but it sets a floor.
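That floor can be turned into a cheap automated check. A minimal sketch - the threshold is my assumption, not part of the benchmark:

```python
# Flag runs whose output volume is implausibly low for the number of
# steps they claim to have completed.
MIN_TOKENS_PER_STEP = 500  # a single skeleton file easily exceeds this

def looks_suspicious(output_tokens: int, claimed_steps: int) -> bool:
    """True if the claimed work can't plausibly fit in the output."""
    return output_tokens < claimed_steps * MIN_TOKENS_PER_STEP

print(looks_suspicious(4_300, 26))   # a Devstral-like run
print(looks_suspicious(58_600, 26))  # a Claude-like run
```

A check like this won't catch subtle stubs, but it catches the "2-page report for a 50-page brief" case before you waste review time on it.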

The cost-efficiency paradox

The two most cost-efficient models by score-per-dollar are Devstral ($0.07, score 77) and Grok ($0.15, score 87). Neither produced a runnable system.

This is the efficiency paradox of AI benchmarks: optimizing purely for cost produces implementations that look complete on paper, but aren't. It's cheap to write scaffolding. It's expensive to write everything the scaffolding needs to actually run.

The genuine efficiency story - excluding models with stub implementations - looks like this:

| Model | Score | Cost | Score/$ | Notes |
|---|---|---|---|---|
| Qwen3 Max | 106 | $0.69 | 154 | Zero interventions, 15 min, real implementation |
| DeepSeek V3.2 | 107 | $0.99 | 108 | Required 4 nudges, doesn't compile |
| Gemini 3 Flash | 111 | $1.34 | 83 | Compiles, 5 tests |
| GPT-5.2 | 130 | $2.38 | 55 | Compiles, 11 tests, worktree issue |
| Claude Sonnet 4.6 | 137 | $4.00 | 34 | Compiles, 52 tests, clean |
| Kimi K2 | 116 | $10.65 | 11 | Compiles, 37 tests, brutal cost |

Qwen3 Max at 154 score/$ is the genuine efficiency standout - and nothing else comes close. A model that completes 72% of the rubric in 15 minutes for $0.69 with zero interventions is a qualitatively different proposition from one that completes 93% in 39 minutes for $4.00.

The right efficiency metric, however, depends on what you're building. For a prototype you'll iterate on anyway, Qwen3's 31-point gap from Claude probably doesn't matter that much. For a production build you need to get right the first time (especially when running in a build pipeline with no human in the loop), Claude's lead is worth the premium. The efficiency table is a decision tool, not a ranking.

Interventions - the hidden cost that doesn't appear on any invoice

Every intervention in this benchmark represented a human stopping what they were doing, reading the situation, making a judgment call, and responding. That's cognitive load, context-switching cost, and time that doesn't show up in token bills.

| Model | Interventions | Type |
|---|---|---|
| DeepSeek V3.2 | 4 | Connection timeouts |
| GPT-5.2 | 2 | Confirmation-seeking at the start |
| Kimi K2 | 2 | False completion + stall |
| Devstral 2 2512 | 2 | Clarification + permission |
| Grok 4.1 Fast | 2 | Stall + stub challenge |
| Qwen3 Max | 1 | Post-completion challenge |
| Gemini 3 Flash | 0 | |
| Claude Sonnet 4.6 | 0 | |

Three models (Gemini, Claude, Qwen3) ran autonomously and required either zero challenges, or one post-completion challenge. Based only on this simple benchmark, these would be the three models you could realistically run overnight and return to a finished result. The others required varying degrees of babysitting.

The intervention type matters as much as the count:

  - Confirmation-seeking (GPT-5.2, Devstral): the model pauses to ask permission before proceeding. Manageable, but it means you can't walk away. You become a "proceed" button.
  - False completion (Kimi, Grok): the model declares itself done when it isn't. This is the most dangerous intervention type, because it requires you to verify the claim rather than just approve a request. If you don't check, you ship incomplete work. But as mentioned earlier, we do not trust the models' self-reporting anyway, and great evals are as important as a great pipeline and a great specification.
  - Connection stalls (DeepSeek): not the model's fault, but operationally equivalent to the others. You still have to watch for it, notice it, and nudge it.

In a production setting, interventions are the most visible measure of how much you can trust a model with autonomous work. A model that costs $4 but runs without babysitting may be cheaper in total than a model that costs $0.69 but requires you to check in every 20 minutes.
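That trade-off is easy to make explicit. A sketch of a "total cost" that includes human attention - the $2/minute rate and the five minutes per intervention are assumptions for illustration; the API costs and intervention counts come from the tables above:

```python
HUMAN_RATE = 2.00          # assumed $ per minute of engineer attention
MIN_PER_INTERVENTION = 5   # assumed minutes to notice, read, and respond

def total_cost(api_cost: float, interventions: int) -> float:
    """API cost plus the invisible cost of babysitting the run."""
    return api_cost + interventions * MIN_PER_INTERVENTION * HUMAN_RATE

print(total_cost(0.99, 4))  # a DeepSeek-like run: cheap tokens, 4 nudges
print(total_cost(4.00, 0))  # a Claude-like run: pricier tokens, no babysitting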

The time dimension - what duration actually tells you

Duration in this benchmark isn't just "how long did it take". It's a composite signal that encodes model efficiency, connection stability, context management, and task complexity handling.

| Model | Duration | Cost | Output tokens |
|---|---|---|---|
| Kimi K2 | 151 min | $10.65 | 174K |
| DeepSeek V3.2 | 42 min | $0.99 | 39K |
| Claude Sonnet 4.6 | 39 min | $4.00 | 58K |
| Devstral | 19 min | $0.07 | 4K |
| GPT-5.2 | 17 min | $2.38 | 55K |
| Gemini 3 Flash | 15 min | $1.34 | 43K |
| Qwen3 Max | 15 min | $0.69 | 22K |
| Grok 4.1 Fast | 10 min | $0.15 | 32K |

Grok's 10-minute completion looks impressive until you see the implementation is mostly stubs. Devstral's 19 minutes produced only 4,300 output tokens - roughly the same as a single detailed email. What was it doing for all that time? Duration is meaningless without output volume as context.

The more interesting comparison is Claude (39 min, 58K output tokens) vs GPT 5.2 (17 min, 55K output tokens). Nearly identical output volume in half the time at 60% of the cost.

The difference isn't in what was produced - it's in how the model worked.

GPT 5.2's 25.7x cache ratio (vs Claude's 11x) suggests it front-loaded context caching more aggressively, making subsequent calls faster and cheaper. You see how analyzing the various numbers together tells different stories?
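One more derived number makes the Claude vs GPT comparison vivid: output tokens per minute, straight from the duration table above.

```python
# (output tokens, duration in minutes) from the duration table.
runs = {
    "Claude Sonnet 4.6": (58_600, 39),
    "GPT-5.2": (55_800, 17),
}
for model, (out_tokens, minutes) in runs.items():
    print(f"{model}: {out_tokens / minutes:,.0f} output tokens/min")
```

Roughly double the throughput for near-identical output volume - consistent with the cache-ratio story rather than with one model simply "writing faster".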

That was the last of the number-crunching findings for now. In part 3 we'll explore a few other takeaways from running the benchmark.
