GPT-4o: The Multimodal Bet and What It Breaks in Your Stack
OpenAI shipped GPT-4o in May 2024. Native audio in, audio out. Half the price of GPT-4 Turbo. Here is what actually changes in production systems.
OpenAI's spring update did the obvious thing and the non-obvious thing in the same breath.
The obvious thing: a faster, cheaper GPT-4. Half the price of Turbo, five times the rate limits, and 4o now sits at the top of the leaderboard for the workloads most teams care about.
The non-obvious thing: a single model that takes audio, image, and text natively, and emits the same. No separate Whisper hop. No separate TTS hop. One forward pass, end-to-end. The demo that everyone shared was the singing and the flirty voice. The thing engineers should care about is the latency floor.
Why "one model" matters
Today's voice assistant is a pipeline:
- VAD detects speech.
- Whisper transcribes.
- GPT-4 reasons.
- A TTS engine speaks.
Every hop adds latency, kills prosody, and discards information. By the time the LLM sees the text, it has lost the user's tone, the pause before they said "yes", the sigh of frustration. Output goes through the same lossy reverse path.
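To make the hop count concrete, here is a minimal sketch of that pipeline, assuming the official openai Python SDK and leaving VAD out of frame; each numbered step is a separate network round trip with its own tail latency.

```python
# Legacy voice pipeline: three model calls, three round trips.
# Sketch only, assuming the openai Python SDK (v1.x); error handling,
# streaming, and VAD are omitted.
from openai import OpenAI

client = OpenAI()

def handle_turn(audio_path: str) -> str:
    # 1. Transcribe. Tone, pauses, and hesitation are discarded here.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Reason over plain text only.
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. Synthesize speech from text. Prosody is reinvented from scratch.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    speech.stream_to_file("reply.mp3")
    return "reply.mp3"
```

Three calls whose p95s add, and two lossy conversions in between.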
GPT-4o collapses that to one model. End-to-end audio. The reported average voice latency is around 320ms. That is the same ballpark as a human turn-taking gap. Below that threshold, the experience stops feeling like talking to a machine and starts feeling like a phone call.
If you have shipped a voice product on the old pipeline, you already know it sounds robotic, especially around interruptions. 4o changes the floor.
What this breaks
A few things in your stack that will look stupid in six months:
- Your transcription pipeline as a separate service. If you run Whisper as a sidecar to feed an LLM, you are paying twice and losing fidelity. Plan to consolidate.
- Your prompt eval harness. Most evaluation tooling I have seen assumes text-in, text-out. Audio-in evaluation is a different problem. You will need new metrics, new fixtures, and new ways of seeding regression tests (a rough sketch follows this list).
- Your latency SLOs. If your voice product targets 800ms p95 today because that was achievable, your competitors will move to 350ms. Adjust.
- Your data pipeline. Audio is heavier than text. Storage, transit, retention, redaction all get harder. Talk to legal early.
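One hedged shape for those audio-in regression tests: run_voicebot is a hypothetical wrapper around whatever serves the model, the fixture clips and expected phrases are invented, and scoring by transcribing the spoken reply is only one possible metric.

```python
# Hypothetical audio-in regression fixture. run_voicebot() stands in for
# whatever wraps the model call; fixture files and expected phrases are
# illustrative. Whisper stays useful as a judge even if it leaves the
# serving path.
from openai import OpenAI

client = OpenAI()

FIXTURES = [
    # (input clip, phrases the spoken reply must contain)
    ("fixtures/angry_refund_request.wav", ["refund", "business days"]),
    ("fixtures/interrupted_mid_sentence.wav", ["no problem"]),
]

def transcribe(path: str) -> str:
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def test_voice_regressions(run_voicebot):
    for clip, required in FIXTURES:
        reply_path = run_voicebot(clip)  # hypothetical: audio file in, audio file out
        reply_text = transcribe(reply_path).lower()
        for phrase in required:
            assert phrase in reply_text, f"{clip}: reply missing {phrase!r}"
```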
What it does not change
The model is still a stochastic text generator wearing a microphone. It will hallucinate. It will agree with confidently wrong users. It will refuse things it should not and answer things it should refuse. None of the multimodal magic fixes the alignment problem.
It is also still a hosted API with rate limits and a vendor relationship. If you are building anything that needs to run offline, on-device, or in an air-gapped environment, 4o is irrelevant to you and Llama 3 is not.
What I am doing
For one client, I am rebuilding their support voicebot from a Whisper-then-GPT pipeline to 4o realtime. I expect the engineering to take a month and the product team to spend two months redesigning the conversation, because suddenly the model can handle interruptions gracefully and the old script breaks.
For everyone else I am pushing one habit: stop thinking of "the LLM" as a box that takes a string and emits a string. Start thinking of it as a multimodal endpoint with a context window and a price per second. The shape of your application changes once the model can hear.
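The shift is already visible in the text API: the chat endpoint takes a list of typed content parts, not a prompt string. Below is a documented image-input call against gpt-4o; audio through the same request shape is the obvious next slot, though it was not generally available in the API when this was written. The file name and question are illustrative.

```python
# The request body is already multimodal: typed content parts, not one string.
import base64
from openai import OpenAI

client = OpenAI()

with open("dashboard.png", "rb") as f:  # illustrative file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which service on this dashboard is alerting?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```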
The pricing detail nobody is reading
GPT-4o is half the price of Turbo on both input and output: $5 per million input tokens and $15 per million output, down from $10 and $30. That sounds incremental until you remember that frontier models have halved in price roughly every nine months for two years. If that curve continues, by mid-2025 you are running near-frontier inference for the price of GPT-3.5 today.
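The arithmetic behind that sentence, treating the nine-month halving as an assumption rather than a law:

```python
# Back-of-the-envelope projection under an assumed nine-month halving.
# Starting point is GPT-4o's launch input price; this is arithmetic, not a forecast.
def projected_price(price_now: float, months_out: float, halving_months: float = 9.0) -> float:
    return price_now * 0.5 ** (months_out / halving_months)

gpt4o_input = 5.00  # USD per million input tokens, May 2024
for months in (9, 18, 27):
    price = projected_price(gpt4o_input, months)
    print(f"+{months:>2} months: ${price:.2f} per million input tokens")
```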
Build with that in your roadmap. The thing you cannot afford this year you can afford next year. Architect for the migration, not the snapshot.