Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

blog.can.ac
kachapopopow
3 hours ago
239 points
107 comments

Comments

benreesman just now
The logical end state of this line of reasoning is a collective action problem that dooms the frontier lab establishment. You can't devote model capacity to having an attention transformer match nested delimiters or cope with bash and be maximally capable; you can't mix authentication, authorization, control plane, and data plane into an ill-specified soup and be secure enough for anything that isn't a pilot or a toy, ever.

If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.

logicprog 1 hour ago
I really enjoyed this article. I think the author is precisely right, and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit hiding in how we design our agent harnesses that can vastly improve the effectiveness of even currently existing models; enough to — at least until we hit diminishing returns — make as much of a difference as training new models, or more!

I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. If the harness, when improved, can make as much of a difference as improvements to the model itself, if not more, then the two really have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.

I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; and once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.

mycall 25 minutes ago
If I remember correctly, both the Claude Code and OpenAI Codex harnesses are now improving themselves.

OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.

The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.

logicprog 1 hour ago
Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human-written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
rubenflamshep 1 hour ago
It didn’t read as AI to me :)
kachapopopow 40 minutes ago
why the long -'s
logicprog 32 minutes ago
Because I like them?
kachapopopow 17 minutes ago
Reminds me of that one guy complaining that everyone kept calling them an AI when the AI was trained on their grammar style.
barrenko 23 minutes ago
2026 is the year of the harness.
fazgha 43 minutes ago
Your comment is so deep. Asking for a friend: how did you manage to get the em dash — on your keyboard?
ahofmann 36 minutes ago
Em dashes are used often by LLMs because humans use them often. On Mac keyboards they're easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it's getting somewhat hard to identify machine-generated audio and visuals.
ink 40 minutes ago
On a Mac, it's alt-dash in case you weren't being facetious
snazz 39 minutes ago
Extra pedantic: that’s the en dash, the em dash is option-shift-hyphen
macintux 37 minutes ago
Technically option-shift-dash; option-dash is an en dash.
throwup238 20 minutes ago
Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an em dash for at least seven years now.
bitwize 22 minutes ago
I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.
aeon_ai 44 minutes ago
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.

That’s when the future really starts hitting you.

logicprog 31 minutes ago
Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)
chrisweekly 2 hours ago
Great post. A few choice quotes:

> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.

> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.

> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.

brendanmc6 1 hour ago
You’re absolutely right! This isn’t your average engineering advice — it’s like painting the reader a vivid tapestry of the author’s mind.
esafak 1 hour ago
Please stop; I just can't any more! Yes, I'm absolutely right.
nekitamo 7 minutes ago
Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D Imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own goal.
clx75 32 minutes ago
During my first LLM experiments in Emacs using gptel, I also found that the LLM has considerable difficulty changing source code files with the Unix patch tool.

As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).

Worked like a charm.
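
Roughly the shape of the idea, sketched in plain Python with the stdlib ast module rather than the actual Emacs Lisp gptel tools; everything below is illustrative:

    import ast
    import hashlib

    NODE_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    def list_nodes(source):
        # The "list" tool: (hash, first line number, first line content) per node.
        lines = source.splitlines()
        entries = []
        for node in ast.parse(source).body:
            if isinstance(node, NODE_TYPES):
                text = "\n".join(lines[node.lineno - 1:node.end_lineno])
                digest = hashlib.sha1(text.encode()).hexdigest()[:8]
                entries.append((digest, node.lineno, lines[node.lineno - 1]))
        return entries

    def update_node(source, node_hash, new_text):
        # The "update" tool: replace the node whose content hashes to node_hash.
        lines = source.splitlines()
        for node in ast.parse(source).body:
            if not isinstance(node, NODE_TYPES):
                continue
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            if hashlib.sha1(text.encode()).hexdigest()[:8] == node_hash:
                lines[node.lineno - 1:node.end_lineno] = new_text.splitlines()
                return "\n".join(lines)
        raise KeyError("stale or unknown node hash; run list_nodes again")

The hash doubles as an optimistic concurrency check: if the file changed since the model last listed the nodes, the update misses and the model has to re-read instead of silently patching the wrong region.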

badhorseman 11 minutes ago
Sounds interesting, do you have the code to share?
tosh 2 hours ago
Shows how much room for improvement there is on the harness level.

Agents waste a lot of tokens on editing, sandboxes, and passing info back and forth between tool calls and subagents.

Love the pragmatic mix of content based addressing + line numbers. Beautiful.
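
The way I read the article, the mix works by addressing a line with both its number and a short hash of its content, so stale line numbers fail loudly instead of editing the wrong place. A minimal sketch of that idea (hypothetical, not the article's actual code):

    import hashlib

    def line_hash(line):
        # Short content hash that travels with the line number.
        return hashlib.sha1(line.encode()).hexdigest()[:8]

    def apply_edit(lines, lineno, expected_hash, replacement):
        # Refuse the edit if the content at that line no longer matches:
        # the numbers the model saw have drifted, so it must re-read the file.
        actual = line_hash(lines[lineno - 1])
        if actual != expected_hash:
            raise ValueError(f"line {lineno} changed ({actual} != {expected_hash})")
        lines[lineno - 1] = replacement

A failed hash check is cheap to recover from (re-read, retry), while a silently misplaced edit costs a whole debugging loop.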

robbomacrae 1 hour ago
Indeed. The biggest waste might be the overuse of MCP for everything. Sure, it makes the initial development easier, but then for every connection you're using a hundred-billion-parameter model to decide how to make the call when it's usually completely unnecessary, and then it's prone to random errors. MCP is the hammer that can make literally everything look like a nail...
chasd00 1 hour ago
I haven't dug into the article, but your comment reminded me of the Claude Code Superpowers plugin. I find the plugin great, but it's quite "expensive": I use the pay-as-you-go account with CC because I've just been trying it out personally, and the Superpowers plugin spends a lot of money relative to regular CC, with all the back and forth.

With CC you can do a /cost to see how much your session cost in dollar terms; that's a good benchmark, IMO, for plugins, .md files for agents, and so on. Minimize the LLM cost the way you'd minimize typical resource usage on a computer: CPU, RAM, storage, etc.

kachapopopow 1 hour ago
You can actually go the other way and spend more tokens to solve more complex problems (multi-agent) by letting agents work on smaller subproblems.
fcanesin 26 minutes ago
The harness is the model's "body"; its weights are the cognition. As in nature, they develop together, and the iteration of natural selection works on both.

If smaller labs (Zai, Moonshot, DeepSeek, Mistral...) got together and embraced a harness as a consortium, opencode for example, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.

TZubiri 22 minutes ago
But they rely on distilling the output of the American leader models, which will probably train against their own harnesses.

Someone has to do the baseline training, development, and innovation. It can't be clones all the way down.

robotresearcher 7 minutes ago
Why not? Humans are (very nearly) clones all the way down.
lillecarl 1 minute ago
Citation needed. SOTA labs surely have technical protections and legalese against using their outputs for training. It's been done in the past, but what indicates this is still the case?
woeirua 3 hours ago
The harness matters far more than most people think. See this post about the CORE benchmark, where Opus's score almost doubled when they switched from their own harness to Claude Code: https://x.com/sayashk/status/1996334941832089732
theturtletalks 2 hours ago
Mario, the creator of the Pi terminal agent, has a great blog post on this[0]. He talks about how TerminalBench's highest scores come from using the Terminus 2 harness, which uses tmux under the hood.

When I was reading the Opus 4.6 launch post, they mentioned the same thing: their TerminalBench score was based on using Terminus 2, not CC.

0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

withinboredom 3 hours ago
Which, IMHO, is why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.
CuriouslyC 2 hours ago
The reason Anthropic is pushing the closed harness is that they're not confident in their ability to win on model quality long term, so they're trying to build lock-in. They can capture some additional telemetry by owning the harness as well, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open-source it).

Ultimately the market is going to force them to open up and let people flex their subs.

Aurornis 1 hour ago
> Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.

I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?

At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.

horsawlarway 2 hours ago
It's also another place where having the harness change out from underneath you can drastically alter the quality of your work in unexpected ways.

Like most things, assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.

Tool recommendations are my immediate and near-term fear: paid placement for dev tools, at both the model level and the harness level, seems inevitable.

---

The right route is open models and open harnesses, ideally on local hardware.

deaux 2 hours ago
At this point, subsidizing the Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too will go closed-weights when they become SotA, but they're now pretty close and haven't done it.
DeathArrow 2 hours ago
I am wondering what kind of harness is best for GLM, DeepSeek, Qwen, and Kimi.
deaux 2 hours ago
OpenCode is great in general. At least one of them is specifically trained on CC (I think it was Qwen), so for those it should give the best results.
azuanrb 35 minutes ago
Claude Code works better than opencode for the GLM models, for me.
Aurornis 2 hours ago
> Like most things, assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.

Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

disgruntledphd2 1 hour ago
> Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

This also coincides with the frontier labs raising ever larger rounds. If Anthropic IPOs (which I honestly doubt), then we may get a better sense of actual market prices, as it's unlikely the markets will keep letting them spend more and more money each year without a return.

eshaham78 2 hours ago
The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.
0xdeafbeef 9 minutes ago