Claude 3.5 Sonnet Is the Coding Model I Wanted GPT-4 to Be
Anthropic shipped Claude 3.5 Sonnet in June 2024. I have used it daily across three client projects for two months, and it has replaced GPT-4 as my default model for everything that touches code. The gap is not subtle.
This is not a benchmark post. Benchmarks are noise. This is what changed in my actual workflow.
What got better
Three things. None of them are "it scored higher on HumanEval".
Long-context reasoning over real codebases. Sonnet 3.5 holds 100k+ tokens of mixed code, configs, and prose, and reasons over them without losing the thread. I routinely paste an entire microservice plus its Helm chart plus the Terraform module that provisions it, and ask "where would a sensible person put the rate limit". It answers like someone who read the whole thing, not like someone who skimmed the first chunk.
Refactor proposals that actually compile. GPT-4 had a habit of confidently inventing function signatures. Sonnet does this less. Not zero, but the rate is roughly halved by my unscientific count, and the failures are less embarrassing.
Following instructions about output shape. "Reply with only the diff. No prose. No explanation. No markdown fences." GPT-4 ignored that instruction one time in five. Sonnet ignores it about one in fifty. That sounds petty until you put it inside a script.
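To show what "inside a script" means in practice, this is roughly the guard I wrap around the model call; the function name and the exact diff-line pattern are mine, a minimal sketch rather than anything from either vendor's SDK:

```python
import re

def is_bare_diff(reply: str) -> bool:
    """Return True only if the model reply is a unified diff and nothing
    else: no prose, no explanation, no markdown fences."""
    lines = reply.strip().splitlines()
    if not lines:
        return False
    # Reject markdown fences outright -- the most common violation.
    if any(line.startswith("```") for line in lines):
        return False
    # Every remaining line must look like unified-diff content:
    # headers, hunk markers, or +/-/context lines.
    diff_line = re.compile(r"^(--- |\+\+\+ |@@ |[ +\-\\]|diff |index )")
    return all(diff_line.match(line) for line in lines)
```

If the check fails, the script retries with the same prompt rather than pushing prose into a patch file. With GPT-4 that retry loop fired often enough to matter; with Sonnet it almost never does.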
What is still bad
- It still hallucinates package names. I have wasted twenty minutes on pip install for a library that does not exist.
- It is overly cautious on security topics. Asking it to write an example exploit for a CVE I am patching gets refused more often than it should.
- It cannot count tokens. Ask it to "summarise this in 200 words" and you get 280, every time.
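Since it cannot count, I count for it. This is the kind of client-side workaround I use; the function and its cut-at-sentence heuristic are mine, a sketch rather than a fix for the model:

```python
def enforce_word_budget(text: str, budget: int) -> str:
    """Trim a model summary to at most `budget` words, preferring to cut
    at the last full sentence that fits so the result reads cleanly."""
    words = text.split()
    if len(words) <= budget:
        return text
    truncated = " ".join(words[:budget])
    # Prefer ending on a sentence boundary inside the budget.
    cut = max(truncated.rfind(". "), truncated.rfind("! "), truncated.rfind("? "))
    if cut > 0:
        return truncated[: cut + 1]
    return truncated
```

Ask for 180 words, budget 200, trim the overshoot. Not elegant, but deterministic, which the model is not.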
The Artifacts feature is the dark horse
The headline of the launch was the Artifacts UI in Claude.ai. Nice toy, I thought. I was wrong. For one client I have replaced an entire internal "prompt to React snippet" tool with a Claude project that uses Artifacts. The team writes a spec, Claude produces a working Artifact, the team iterates in the same chat, and the final code goes into a PR. It is genuinely faster than writing from scratch for the kind of internal tooling where the bar is "works on my screen".
What it changes for my work
Practical changes I have made:
- All client engagements that include "AI integration" as a deliverable now default to Claude unless the client has a specific reason to prefer OpenAI (usually existing contracts).
- For internal automation, anything that touches code or configuration, I have moved off GPT-4. The cost is roughly the same. The accuracy is better.
- For pure chat, summarisation, and structured extraction, I keep both around and route by job. GPT-4o is faster on short turns. Sonnet is better on long ones.
The strategic point
The interesting thing about the 3.5 Sonnet release is not the capability. It is the cadence. Anthropic shipped a model that beats their own Opus tier at a Sonnet price point, with the implication that 3.5 Opus is coming and will be another step.
The frontier labs are now releasing roughly every six months. Each release roughly halves the cost-per-quality of the previous one. If you architected your product around GPT-4 in late 2023 and have not revisited the model layer since, you are running on a fifteen-month-old model in a market where eighteen months is forever.
Build a routing layer. Pin the model in config, not in code. Run an eval suite you trust. When the next release lands, switching should be a Friday afternoon, not a quarter.
My current default stack
For anyone asking, this is what I run today across client projects:
- Code, refactoring, long-context reasoning: Claude 3.5 Sonnet.
- Voice and low-latency UX: GPT-4o realtime.
- Cheap bulk classification: Llama 3 70B self-hosted, or Haiku 3 if I do not want to operate it.
- Embeddings: OpenAI text-embedding-3-large still wins on price-quality, but the gap is closing.
That mix will be wrong by Christmas. It is wrong already. Plan for it.