← All services
Prompts that hold up in production

AI Prompt Engineering

Production-grade prompt engineering: structured prompting, schema-constrained outputs, retrieval-grounded answers, eval harnesses, and the regression discipline that distinguishes a working AI feature from a hopeful one.

OpenAIAnthropicGeminiVercel AI SDKInstructorDSPypromptfoo

The thing nobody tells you about prompt engineering is that the prompt isn't the artefact. The eval is. A prompt that scores 92% on a 200-example eval set is a thing you can defend, ship, and improve. A prompt that "feels good" is a feature about to regress next time the upstream model updates.

What I do

Prompt design — for the actual problem you're solving, not a generic 'helpful assistant'. Structured prompting with system / role / context layering. Schema-constrained outputs where the answer feeds another system. Retrieval-grounded answers when freshness matters.

Eval harness — golden sets that represent the real distribution of your inputs. Model-graded evals for soft criteria (helpfulness, tone). Regression suite that runs on every prompt change. Score deltas surfaced in PR review.

Multi-model portability — design prompts that survive a swap from GPT-4 to Claude to Gemini. The eval tells you whether the new model is actually better for your use case.

Prompt-injection audit — adversarial test set, system-prompt leak detection, refusal patterns. The basics most teams skip.

Documentation — a prompt library your team can read, extend, and improve. Versioned. Tested. Owned.

When this is the right engagement

You have an AI feature in production (or about to be) and the quality is inconsistent. Or you're picking between models and don't have data to decide. Or you've been burned by a regression after a model update and want never to be again.