Claude 3.5 Sonnet Is the Coding Model I Wanted GPT-4 to Be
Anthropic shipped Claude 3.5 Sonnet in June 2024. I have used it daily across three client projects for two months, and it has replaced GPT-4 as my default model for everything that touches code. The gap is not subtle.
This is not a benchmark post. Benchmarks are noise. This is what changed in my actual workflow.
What got better
Three things. None of them are "it scored higher on HumanEval".
Long-context reasoning over real codebases. Sonnet 3.5 holds 100k+ tokens of mixed code, configs, and prose, and reasons over them without losing the thread. I routinely paste an entire microservice plus its Helm chart plus the Terraform module that provisions it, and ask "where would a sensible person put the rate limit". It answers like someone who read the whole thing, not like someone who skimmed the first chunk.
Refactor proposals that actually compile. GPT-4 had a habit of confidently inventing function signatures. Sonnet does this less. Not zero, but the rate is roughly halved by my unscientific count, and the failures are less embarrassing.
Following instructions about output shape. "Reply with only the diff. No prose. No explanation. No markdown fences." GPT-4 ignored that instruction one time in five. Sonnet ignores it about one in fifty. That sounds petty until you put it inside a script.
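To show what "inside a script" means in practice, this is roughly the guard I wrap around the model call; the function name and the exact diff-line pattern are mine, a minimal sketch rather than anything from either vendor's SDK:

```python
import re

def is_bare_diff(reply: str) -> bool:
    """Return True only if the model reply is a unified diff and nothing
    else: no prose, no explanation, no markdown fences."""
    lines = reply.strip().splitlines()
    if not lines:
        return False
    # Reject markdown fences outright -- the most common violation.
    if any(line.startswith("```") for line in lines):
        return False
    # Every remaining line must look like unified-diff content:
    # headers, hunk markers, or +/-/context lines.
    diff_line = re.compile(r"^(--- |\+\+\+ |@@ |[ +\-\\]|diff |index )")
    return all(diff_line.match(line) for line in lines)
```

If the check fails, the script retries with the same prompt rather than pushing prose into a patch file. With GPT-4 that retry loop fired often enough to matter; with Sonnet it almost never does.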
What is still bad
- It still hallucinates package names. I have wasted twenty minutes on pip install for a library that does not exist.
- It is overly cautious on security topics. Asking it to write an example exploit for a CVE I am patching gets refused more often than it should.
- It cannot count tokens. Ask it to "summarise this in 200 words" and you get 280, every time.
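Since it cannot count, I count for it. This is the kind of client-side workaround I use; the function and its cut-at-sentence heuristic are mine, a sketch rather than a fix for the model:

```python
def enforce_word_budget(text: str, budget: int) -> str:
    """Trim a model summary to at most `budget` words, preferring to cut
    at the last full sentence that fits so the result reads cleanly."""
    words = text.split()
    if len(words) <= budget:
        return text
    truncated = " ".join(words[:budget])
    # Prefer ending on a sentence boundary inside the budget.
    cut = max(truncated.rfind(". "), truncated.rfind("! "), truncated.rfind("? "))
    if cut > 0:
        return truncated[: cut + 1]
    return truncated
```

Ask for 180 words, budget 200, trim the overshoot. Not elegant, but deterministic, which the model is not.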
The Artifacts feature is the dark horse
The headline of the launch was the Artifacts UI in Claude.ai. Nice toy, I thought. I was wrong. For one client I have replaced an entire internal "prompt to React snippet" tool with a Claude project that uses Artifacts. The team writes a spec, Claude produces a working Artifact, the team iterates in the same chat, and the final code goes into a PR. It is genuinely faster than writing from scratch for the kind of internal tooling where the bar is "works on my screen".
What it changes for my work
Practical changes I have made:
- All client engagements that include "AI integration" as a deliverable now default to Claude unless the client has a specific reason to prefer OpenAI (usually existing contracts).
- For internal automation, anything that touches code or configuration, I have moved off GPT-4. The cost is roughly the same. The accuracy is better.
- For pure chat, summarisation, and structured extraction, I keep both around and route by job. GPT-4o is faster on short turns. Sonnet is better on long ones.
The strategic point
The interesting thing about the 3.5 Sonnet release is not the capability. It is the cadence. Anthropic shipped a model that beats their own Opus tier at a Sonnet price point, with the implication that 3.5 Opus is coming and will be another step.
The frontier labs are now releasing roughly every six months. Each release roughly halves the cost-per-quality of the previous one. If you architected your product around GPT-4 in late 2023 and have not revisited the model layer since, you are running on a fifteen-month-old model in a market where eighteen months is forever.
Build a routing layer. Pin the model in config, not in code. Run an eval suite you trust. When the next release lands, switching should be a Friday afternoon, not a quarter.
My current default stack
For anyone asking, this is what I run today across client projects:
- Code, refactoring, long-context reasoning: Claude 3.5 Sonnet.
- Voice and low-latency UX: GPT-4o realtime.
- Cheap bulk classification: Llama 3 70B self-hosted, or Haiku 3 if I do not want to operate it.
- Embeddings: OpenAI text-embedding-3-large still wins on price-quality, but the gap is closing.
That mix will be wrong by Christmas. It is wrong already. Plan for it.