Devin Was a Demo, Not a Product
Cognition Labs launched Devin in March 2024 with a polished video, a SWE-bench score, and the phrase "first AI software engineer". The internet lost its mind. Four months later the dust has settled and the picture is clearer.
Devin was a demo. A good demo. Not a product.
What was real
Cognition built a competent agent harness. The headline number, 13.86% resolution on SWE-bench, was a real improvement over the prior published unassisted best of 1.96%. The product surface, an agent that plans, writes code, runs tests, and reports back, was credible. They raised a lot of money quickly, and that was rational given the information available in March.
What was theatre
The launch video was edited. That is fine; every product launch video is edited. But the gap between the curated demo and the live behaviour was wider than usual. Independent reviewers pulled apart specific Upwork and bug-fix demonstrations and found that tasks Devin "completed" had quietly drifted from the original brief, that the agent fabricated work product (in one case fixing errors in a file it had generated itself), and that the time-on-task figures hid hours of wall-clock time and a lot of human steering.
This was not a Cognition-specific problem. This was the entire 2024 agent market.
Why agent demos lie
LLM agents have a structural reason to look better in demos than in production:
- Curated tasks. A task where the test suite tells you what good looks like is a special case, not the median.
- Cherry-picked seeds. Agents are non-deterministic. Run a task ten times, ship the one that worked (the sketch below shows how far this inflates a reel).
- Hidden context. The repo was small, the dependencies were modern, the bug was already half-localised.
- Loose definitions of done. "Tests pass" is not the same as "PR merges and the feature works in production for six months".
Production engineering has none of these properties. The repos are old. The dependencies are weird. The tests are flaky. The definition of done is "the on-call doesn't get paged at 3am". Agents fall over hard the moment they leave the curated track.
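The cherry-picking point is worth one line of arithmetic. If an agent succeeds on a given task 25% of the time per run, keeping the best of ten runs produces a reel that succeeds about 94% of the time, since 1 - 0.75^10 ≈ 0.944. A minimal sketch; the 25% figure is a pure assumption for illustration, not a measurement of Devin or anyone else:

```python
import random

def run_agent(p_success: float) -> bool:
    """Stand-in for one non-deterministic agent run."""
    return random.random() < p_success

def success_rate(p: float, attempts: int, trials: int = 100_000) -> float:
    """Fraction of tasks where at least one of `attempts` runs succeeds."""
    wins = sum(
        any(run_agent(p) for _ in range(attempts))
        for _ in range(trials)
    )
    return wins / trials

if __name__ == "__main__":
    p = 0.25  # assumed per-run success rate (illustrative only)
    print(f"what production sees (1 attempt):    {success_rate(p, 1):.0%}")   # ~25%
    print(f"what a demo reel shows (best of 10): {success_rate(p, 10):.0%}")  # ~94%
```

Same model, same tasks, a 69-point gap. That gap was the entire business model of the 2024 agent demo.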
Where agents do work in 2024
I am not anti-agent. I run agentic workflows for two clients today. Both are scoped tightly:
- A code review bot that reads PRs, runs static analysis, and posts structured comments. It is wrong sometimes. It is wrong less often than the average junior reviewer, and a human merges.
- A migration tool that proposes changes across a monorepo for a specific framework upgrade. It fails on 30% of files. The 70% it gets right paid for the engineering inside a sprint.
Both are narrow. Both keep a human in the loop. Both are evaluated against deterministic test fixtures, not vibes.
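For concreteness, here is a minimal sketch of what fixture-based evaluation means here. `propose_patch` is a hypothetical stand-in for the agent call, and exact-match scoring is a deliberate simplification; in practice the proposed change is run through a test suite instead:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fixture:
    task: str      # the prompt handed to the agent
    expected: str  # a human-approved golden output

def propose_patch(task: str) -> str:
    """Hypothetical agent call; imagine an LLM behind a harness."""
    raise NotImplementedError

def pass_rate(fixtures: list[Fixture]) -> float:
    """Score the agent pass/fail against golden fixtures. No vibes."""
    passed = 0
    for f in fixtures:
        try:
            passed += propose_patch(f.task) == f.expected
        except Exception:
            pass  # a crash scores the same as a wrong answer
    return passed / len(fixtures)
```

Run that number on every model or prompt change and regressions show up as arithmetic, not argument.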
The lesson for the rest of 2024
The Devin cycle taught the industry three things, slowly:
- Benchmark scores do not generalise. SWE-bench is a useful signal. It is not a product.
- Autonomy is a slider, not a switch. Useful agents in 2024 are 20% autonomous, not 100%. The teams shipping value are the ones who set the slider deliberately and built the harness to match (a sketch follows this list).
- The serious players moved quietly. Anthropic shipped tool use that worked. GitHub iterated on Copilot Workspace without claiming to replace anyone. Cursor built a pleasant editor with good agentic affordances. None of them released a viral video.
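What a deliberately set slider can look like inside a harness, sketched minimally; the action names and risk weights are invented for illustration:

```python
AUTONOMY = 0.2  # 0.0 = a human reviews everything, 1.0 = fully autonomous

ACTION_RISK = {  # higher = more dangerous to execute unreviewed
    "read_file":   0.05,
    "run_tests":   0.15,
    "write_patch": 0.6,
    "open_pr":     0.8,
    "merge_pr":    0.95,
}

def needs_human_approval(action: str) -> bool:
    """Auto-execute only actions whose risk sits under the slider.
    Unknown actions default to requiring a human."""
    return ACTION_RISK.get(action, 1.0) >= AUTONOMY
```

At 0.2 the agent reads and runs tests freely; anything that mutates the repo waits for a person. Where you put the slider is a product decision, not a model capability.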
If you are evaluating an agent product in the back half of 2024, ignore the demo. Ask for read access to a real customer's repo and run your own task list. Watch the live failure modes. The vendors who survive that test are the ones worth your budget.
A note on the founders
I have nothing against Cognition. They will iterate. The 2024 launch is not the end of their story, and "first version was oversold" is the modal Silicon Valley path. The lesson is for buyers, not builders. When the first AI engineer ships, you will not need a launch video to recognise it. Your incident channel will quietly stop firing.