llama · open-source · llm · meta

Llama 3 Is the Moment Open Weights Stopped Being a Toy

Meta dropped Llama 3 in April 2024. The 70B model is the first open-weights release I would actually deploy for a paying client.

22 April 2024·4 min read

Meta shipped Llama 3 last week and the 70B variant is the first open-weights model I would put in front of a paying client without an apology.

I have spent the last twelve months telling people that "open" models were either toys or trapdoors. Llama 2 70B was fine for chat demos and miserable for anything that touched code, structured output, or long context. Mistral was sharp but small. Everything else was a benchmark cosplay.

Llama 3 is a step change. Not because the architecture is novel (it isn't), but because Meta finally threw enough tokens at it: 15T tokens of pretraining, careful instruction tuning, and an 8K context they have already promised to extend. The 8B model punches at GPT-3.5's weight. The 70B is genuinely competitive with GPT-4 on a lot of practical tasks.

Why this matters for platform teams

For the last year I have had the same conversation with every CTO I work with:

  • "Can we self-host?"
  • "Yes, but the quality gap is brutal."
  • "Then we'll keep paying OpenAI."

Llama 3 collapses that gap for a chunk of workloads. Not all of them. But enough that the conversation shifts from "is open viable" to "which workloads stay on a frontier API and which come home".

The economics are obvious once you run the numbers. A 70B model on two A100s with vLLM hits a few hundred tokens a second, and on AWS that is roughly two pounds an hour at on-demand rates. If you are doing high-volume classification, summarisation, or internal RAG against your own corpus, the API bill stops making sense well below the seven-figure mark.
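The arithmetic is simple enough to sketch. All figures below are illustrative assumptions (instance price, throughput, API price), not quotes from any vendor:

```python
# Back-of-envelope: cost per million output tokens, self-hosted vs API.
# Every number here is an assumption for illustration, not a quoted price.

def selfhost_cost_per_mtok(gpu_hourly_gbp: float, tokens_per_sec: float) -> float:
    """GBP per million tokens on a dedicated box you pay for by the hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_gbp / tokens_per_hour * 1_000_000

# Assumed: ~£2/hr for the instance, ~300 tok/s aggregate throughput with vLLM.
selfhost = selfhost_cost_per_mtok(gpu_hourly_gbp=2.0, tokens_per_sec=300)

# Assumed frontier-API price of £8 per million tokens, for comparison only.
api = 8.0

print(f"self-host: £{selfhost:.2f}/Mtok vs API: £{api:.2f}/Mtok")
```

At those assumed numbers the self-hosted box comes out around £1.85 per million tokens, and the gap only widens as utilisation goes up, since the hourly cost is fixed whether the GPUs are busy or idle.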

Where it still falls down

I do not want to oversell this. Llama 3 has problems:

  • The 8K context is short. Half my use cases need 32K minimum.
  • Function calling is bolted on, not native. You will write parsing logic.
  • The licence is "open weights with strings". If you cross 700M monthly active users you owe Meta a phone call. Most of you won't, but read it.
  • Safety tuning is aggressive. You will need to do your own DPO pass for any agent that needs to handle adversarial input gracefully.

What I am doing about it

For two existing clients this quarter I am moving internal tooling, the stuff nobody outside the company sees, off GPT-4 onto Llama 3 70B running on g5.12xlarge boxes. The migration is mostly prompt rewriting. The savings pay for the engineering inside a quarter.

I am leaving the customer-facing surface on a frontier API for now. The quality ceiling there still matters. Latency p99 still matters. And if a frontier vendor ships a major upgrade next month I want my product to ride the curve, not eat a six-week fine-tune.

The strategic read

The interesting question is not whether Llama 3 is good. It is what Meta does next. They have signalled a 400B parameter version is training. If that lands at frontier quality with permissive weights, the entire pricing layer of the LLM market gets compressed in a year.

If you are building a startup whose moat is "we wrap GPT-4", you have maybe twelve months to find a different moat. If you are running platform for an enterprise, start now on the muscles you will need: GPU capacity planning, evaluation harnesses, vector store hygiene, and a serving stack that does not assume the model is somebody else's problem.
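The evaluation-harness muscle is the cheapest one to start building. The core loop is tiny: a fixed prompt set, any callable model, a score. A minimal sketch, where `toy_model` is a stand-in for whatever serving client you actually use (vLLM, an API wrapper, etc.):

```python
from typing import Callable

def run_eval(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's output exactly matches expected."""
    hits = 0
    for prompt, expected in cases:
        if model(prompt).strip() == expected:
            hits += 1
    return hits / len(cases)

# Toy stand-in model so the harness runs end to end without a GPU.
def toy_model(prompt: str) -> str:
    return "positive" if "love" in prompt else "negative"

cases = [
    ("Classify sentiment: I love this.", "positive"),
    ("Classify sentiment: This is awful.", "negative"),
]
print(run_eval(toy_model, cases))  # → 1.0
```

The point is not the scoring function, which you will replace, but the habit: every workload you consider bringing home gets a frozen case set first, so the open-vs-frontier decision is a number rather than a vibe.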

The toy era is over. Open weights are now infrastructure. Treat them like it.