Real-life comparison of 8 LLMs - how interchangeable are they?
In this series: Comparing AI models
- Part 1. Real-life comparison of 8 LLMs - how interchangeable are they?
Eight models. One unforgiving spec. No hand-holding. I needed to find out which models could realistically replace Claude Sonnet and Opus in my day-to-day work. So I gave them all the same complex agent pipeline to build from scratch, scored the results against a 148-point rubric, and let the outputs speak for themselves. What I found was less about which model "won", and more about what separates models that can actually work autonomously from those that just appear to.
Why this comparison?
In a previous post, I wrote that I downgraded my Claude subscription after Anthropic stopped allowing me to use my harness of choice with their models on a subscription plan. That, of course, means I have to figure out which models can actually replace Sonnet and Opus for my work. As if there aren't a million rabbit holes to navigate around in my work already, this for sure is a big one.
Disclaimer: This comparison is by no means scientifically conducted. Nor is it exhaustive. But it is the type of comparison that you as a normal user of LLMs might find easier to relate to than trying to make sense of all the LLM benchmarks out there, and what they mean for your specific workflow.
If you're on any of the Anthropic-related subreddits, you've seen a lot of posts over the last few weeks from people who are outraged about the crackdown, and more recently also from people suddenly finding that their subscription runs out of tokens a lot faster than before - something Anthropic has at least partly acknowledged. And in the responses to these posts, you'll have seen people throwing out various other models as their recommended Anthropic-killers.
But, rather than expecting this to be "fixed" and that things will go back to how they were, this is actually a good time for most people to start researching alternatives to their beloved <insert favourite models here> models, because the big companies are increasingly trying to wall in their gardens to keep you in their ecosystem. Which big tech has always done, of course.
Since I have been working for quite some time on an agent pipeline project, I decided to use this as a test subject for a number of models to see what models could realistically replace Sonnet and Opus for my work.
Now, one project is not enough to determine the quality of a model, of course. And different models also excel in different areas, so for me this is just the first of several upcoming test cases. But this first one was an important one, because the project I chose as the test subject is quite representative of the type of work I normally do. And it will set the stage for future comparisons as I get more experience with comparing models.
I choose to call it "comparison" rather than "benchmarking" for several reasons. There are already a lot of benchmarks for AI models, and they measure different things. And when benchmarks get popular, models (or providers) may start optimizing for them. And even if a model scores very well on top benchmarks, that doesn't always translate into the model being great for your specific usecase (or even the way you use it).
What to expect from this series
There's a lot of subject matter here, and so I won't put it all in a gigantic post that will make your head spin from reading it (or mine, from writing it). I begin by laying out what I did and some of what I found.
But I want to be clear: there is a lot to unpack in the numbers, and glossing over them would be a disservice to you. With this series, my goal is to help you understand that the numbers people hear most often are not necessarily the most interesting or useful. And, depending on your priorities, there are vastly different ways to read the numbers in this comparison.
That's why this will be a series of posts in which I give you various numbers, explain what they are, and then discuss what they entail. Hopefully, this will help you make better choices for your future projects.
The reason I spent all those hours on this project was, after all, not for writing blog posts. It was because I needed the result for my own business. Blogging is secondary. I have a business to run, but blogging is a nice way to distill what I learn. And to share my insights.
Choosing the models
There is an incredible number of models out there, and trying to make sense of it all is quite overwhelming. So, of course, I threw away those websites (after having to pick myself up by the collar and drag myself up from yet another rabbit hole) and discussed my situation with Claude instead. Since Sonnet is the model I use most frequently for much of my work, I chose it as the "baseline" model. I had Claude research which models would most closely compare to Sonnet for the type of work lying ahead of us. I wanted a mix of frontier proprietary models and open source "free" models. And I was hoping for at least one European-based model. We ended up with a list of 8 models (sorted alphabetically):
| Model | Provider | Country |
|---|---|---|
| claude-sonnet-4.6 | Anthropic | USA |
| deepseek-v3.2 | DeepSeek | China |
| devstral-2-2512 | Mistral AI | France |
| gemini-3-flash-preview | Google | USA |
| gpt-5.2 | OpenAI | USA |
| grok-4.1-fast | xAI | USA |
| kimi-k2 | Moonshot AI | China |
| qwen3-max | Alibaba | China |
(Actually, this was not the initial list. The first list had Mistral instead of Devstral, DeepSeek V3-0324 instead of v3.2, and GPT-5.1 instead of 5.2. But these three models failed repeatedly in their work, so I had to replace each of them with a version that most closely matched both the failing version and Sonnet 4.6.)
Devstral/Mistral is the only European model on the list, and so I worked hard to keep it in the comparison because I really wanted it to compete. I also have personal qualms with some of the companies on the list, but I wanted the comparison to be useful not only to myself but also to other people. So I tried to keep my own feelings out of it.
I know many people reading this will quickly think: "Hey! What about model XYZ? That should definitely be on this list! Why would you leave that out??"
I understand. There are many more models, and several you can argue "should" be on this list. But this was a lot of work to both perform and review, so I had to draw the line somewhere. Truth be told, though, the project was quite interesting and educational, so I'll likely add more models to the comparison in the weeks ahead. A natural extension is, of course, to run newer and better models through the same pipeline, new projects, or both. If you have suggestions for models I should consider, drop me a line and explain why. And I just might include it!
How the comparison was run
The task
The eight models received the same task: You will build Smidja - an autonomous build pipeline - from scratch. A TypeScript CLI that takes human-written specification documents and produces working code by orchestrating a sequence of LLM agents through a Supabase-backed dispatch queue. The system required a deterministic router, seven distinct agents, training mode with human observation checkpoints, and OpenTelemetry tracing. All wired together. All compilable.
That is not a toy benchmark (that word again). Nine database tables. Atomic RPC functions. An orchestrator that holds global pipeline state. A CLI entry point. Not a prototype, not a scaffold. The full thing.
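To make the shape of the task a bit more concrete, here is a minimal sketch of what a deterministic router over seven agents could look like. All agent names and task kinds here are my own illustration, not taken from the actual Smidja spec:

```typescript
// Illustrative only: these agent names and task kinds are hypothetical,
// not the ones in the Smidja specification documents.
type AgentName =
  | "planner" | "architect" | "coder" | "reviewer"
  | "tester" | "documenter" | "integrator";

interface Task {
  kind: "plan" | "design" | "implement" | "review"
      | "test" | "document" | "integrate";
}

// Deterministic: the same task kind always maps to the same agent.
// No LLM is involved in the routing decision itself.
const ROUTES: Record<Task["kind"], AgentName> = {
  plan: "planner",
  design: "architect",
  implement: "coder",
  review: "reviewer",
  test: "tester",
  document: "documenter",
  integrate: "integrator",
};

function route(task: Task): AgentName {
  return ROUTES[task.kind];
}
```

The point of keeping the router a plain lookup is that routing failures can never be blamed on model nondeterminism - only the agents themselves make LLM calls.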
The full spec they received was spread across four markdown files:
- smidja-build-brief.md
- smidja-vision.md
- smidja-v1.md
- smidja-persistence.md
And then a benchmark-prompt.md file that provided specific instructions about the reference documents, the stack, and what to produce. And how to self-report when they considered the job done. And finally, some rules:
- Do not ask about things that are specified in the documents. Read them thoroughly first.
- Do not, under any circumstance, ask for confirmation on choices, plans etc. Only ask questions when you are genuinely uncertain what is meant by a task or description.
- If something is genuinely ambiguous or contradictory, ask. Do not guess.
- Agents must make real LLM calls via OpenRouter, not return stubs or mock data.
- The project must compile with `npx tsc --noEmit` and zero errors.
- Do not connect to Supabase or create any database tables. Write all SQL (tables and RPC functions) to `sql/schema.sql`. Your application code should use the Supabase client SDK and assume the schema exists at runtime.
- Create a `smidja/` directory and use it as the project root. Initialize a git repo (`git init`) inside it with a `.gitignore` that excludes `node_modules/` and `.env`. Commit your work as you go with descriptive commit messages.
Why so many rules? I actually had to restart the builds several times because some of the models constantly stopped to ask for confirmation on all kinds of decisions. I don't want models that need constant confirmation and reassurance. I want models that can work autonomously over time.
Conditions
Every model received the same four specification documents and the same starting prompt. All runs went through the same coding agent (OpenCode), the same API router, the same machine, and the same network. Questions from models were answered consistently across all runs, if warranted. If not warranted, the questions were answered simply with "Do not ask questions". No guidance was given beyond what any operator would provide.
Timing was recorded from session start to declared completion. Cost and token data were extracted from the coding agent's local SQLite database after each run - not estimated or retrieved from the API dashboard. Intervention counts were logged as they happened, with category and context noted for each.
Review
On completion, each model's output was scored against a 148-point rubric by a separate review agent that had no involvement in the build. The review verified actual compilation by running tsc --noEmit, checked for real LLM calls versus stubs, confirmed git branch state, and assessed spec fidelity against the original documents - independently of what each model claimed about its own work.
The self-reports and the review scores frequently diverged. That turned out to be a finding in itself.
Some takeaways from the benchmark
Preparation is paramount
I had to restart the process several times, at first with the same agent. A bit further in, I noticed inconsistencies that necessitated giving the models some more information. This required me to re-run the previously completed models so as not to skew the results, since they would otherwise not have had the same input and constraints. This extra work could have been avoided with more thorough preparation. But for this specific process, it was not a wasted effort.
This enabled me to see how much the end product changed as a result of a fairly small variance in input.
Perception vs persistence
Every model in the benchmark demonstrated understanding of the architecture. The plans they created made it clear that they all understood what they were building. All 8 produced a recognizable implementation with correct types, correct SQL schema, and real OpenRouter integration.
However, there was quite a gap in execution discipline: did the model follow the build brief exactly and to the end? Or did it substitute its own judgment or slip up when the task was demanding?
Context is king
All models got the foundational pieces right: types, schema, RPC functions, config, agents.
The orchestrator - the piece that ties everything together - is where the benchmark separates.
The orchestrator requires holding the full system in mind simultaneously. Planning pipeline, execution loop, wave advancement logic, training mode pauses (human in the loop), the SPEC_INSUFFICIENT retry loop, and state persistence across all of them. It is, in the spec's architecture, the piece that requires the most global knowledge.
This is a useful signal for anyone building multi-agent systems: the orchestrator is structurally harder than the agents, even when the agents are more individually complex.
Complexity accumulates at coordination points.
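As a rough sketch of why the orchestrator is the hard part, here is a minimal wave-advancement loop that has to juggle several concerns at once: grouped waves, a retry budget for SPEC_INSUFFICIENT, and state that survives across waves. All names and the retry limit are assumptions of mine, not the actual Smidja design:

```typescript
// Hypothetical sketch: names, shapes, and the retry limit are illustrative,
// not the real Smidja orchestrator. Training-mode pauses and persistence
// are omitted for brevity, but would also live in this loop.
type AgentResult = "ok" | "spec_insufficient";

interface PipelineState {
  wave: number;                // index of the current wave
  pending: string[][];         // agents grouped into waves
  retries: Map<string, number>;// per-agent SPEC_INSUFFICIENT retries used
  trainingMode: boolean;       // would trigger human checkpoints per wave
}

const MAX_RETRIES = 2; // assumed budget, not from the spec

// Run every agent in the current wave, retrying on SPEC_INSUFFICIENT,
// and only advance when the whole wave has succeeded.
function advanceWave(
  state: PipelineState,
  runAgent: (name: string) => AgentResult,
): PipelineState {
  const wave = state.pending[state.wave] ?? [];
  for (const agent of wave) {
    let result = runAgent(agent);
    while (result === "spec_insufficient") {
      const used = state.retries.get(agent) ?? 0;
      if (used >= MAX_RETRIES) throw new Error(`${agent}: spec insufficient`);
      state.retries.set(agent, used + 1);
      result = runAgent(agent);
    }
  }
  return { ...state, wave: state.wave + 1 };
}
```

Even this toy version already mixes three kinds of state (position, retry budgets, mode flags); the real thing adds persistence and human checkpoints on top, which is exactly where the models started to diverge.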
Pressure produces plausible fiction
(Yeah, working hard on the alliteration here, but that last P was just too... perplexing? 🤷🏽♂️)
Seven of eight models reported at least one significant inaccuracy in their run-reports. The direction of inaccuracy was almost universally optimistic (claiming completion of work that was absent or broken). The one exception was kimi-k2, which under-reported by omitting two agents that existed and actually worked.
In other words: always have other agents perform the review, as the agent writing the code produces unreliable self-assessment. Which is why I used a separate review agent for the benchmark.
Never let the agent be the arbiter of its own work!
Code quality and score correlation are cloudy
deepseek-v3.2 (107/148) produced cleaner code than gemini-3-flash-preview (111/148) by most measures: better architecture, accurate config, honest self-reporting, and fixable compile errors. The score difference reflects the missing CLI and planning sequence, not quality.
This suggests the rubric (correctly) measures deliverable completeness, not code elegance.
A model that produces running code with inelegant patterns outscores one that produces elegant stubs. For benchmark purposes, this is the right tradeoff. But when evaluating models for real engineering work, the depth of what was built must be evaluated alongside whether it compiled. Besides, the amount of work left to make your product actually work is not necessarily reflected by the score.
Wandering worktrees
gpt-5.2 used a git worktree and built the complete implementation in a feature branch. The worktree was never merged to main. The result: 130/148 scored against the v1 branch, but if you check out main, you get a skeleton. This is a real operational risk in AI-assisted development:
Agents that use git branching workflows without completing the merge are harder to detect than agents that simply don't create files.
The files exist - they're just not where you expect them. External review needs to check not just file existence but branch state.
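This branch-state check is easy to automate. The sketch below (TypeScript, to match the project's stack) builds a throwaway repo with work committed to an unmerged `v1` branch, then detects it with `git log main..v1` - listing commits reachable from `v1` but not from `main`. The repo setup is purely illustrative; only the final check is the point:

```typescript
import { execSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative setup: a throwaway repo whose feature branch was never merged,
// mimicking the gpt-5.2 situation described above.
const repo = mkdtempSync(join(tmpdir(), "worktree-check-"));
const run = (cmd: string) => execSync(cmd, { cwd: repo }).toString().trim();

run("git init -q -b main");
run("git -c user.email=ci@example -c user.name=ci commit --allow-empty -q -m skeleton");
run("git checkout -q -b v1");
run("git -c user.email=ci@example -c user.name=ci commit --allow-empty -q -m implementation");
run("git checkout -q main");

// The actual check: commits reachable from v1 but not from main.
const unmerged = run("git log main..v1 --oneline");
const hasUnmergedWork = unmerged.length > 0;
console.log(hasUnmergedWork ? "UNMERGED WORK ON v1" : "main is up to date");
```

A review harness could run this check against every candidate branch before trusting what it finds on `main`.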
Results, much?
You're right; that's a lot of words about a benchmark without talking about actual benchmark results. But, as with most benchmarks in my opinion, the results themselves aren't really what's interesting. Not in isolation. And, by themselves, they definitely do not answer the question most are probably wondering:
Which model should I choose?
So I will round off this first part of the series with the full list again - this time sorted by rubric score. The benchmark produced a slew of different metrics, but rather than flooding the table with numbers that may not mean much to you, I'll keep it simple in this post. Stay tuned for part 2, in which I'll go into some other (much more interesting) facets of the benchmark - other angles from which to approach that question, with other metrics.
So I will round off this first part of the series with the full list again - this time sorted by rubric score results. The benchmark resulted in a slew of different metrics, but rather than flooding the table with a lot of numbers that may not mean much to you, I'll keep the table simple in this post. Stay tuned for part 2, in which I'll go into some other (much more interesting) façets of the benchmark, which introduce other angles from which to approach that question, with other metrics.
All models, ranked
| Rank | Model | Score /148 | % | Country |
|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | 137 | 93% | USA |
| 2 | gpt-5.2 | 130 | 88% | USA |
| 3 | kimi-k2 | 116 | 78% | China |
| 4 | gemini-3-flash-preview | 111 | 75% | USA |
| 5 | deepseek-v3.2 | 107 | 72% | China |
| 6 | qwen3-max | 106 | 72% | China |
| 7 | grok-4.1-fast | 87 | 59% | USA |
| 8 | devstral-2-2512 | 77 | 52% | France |