The worst thing that happened to prompt engineering was being discovered by marketing departments. The second-worst thing was being taught as if it were a mystical practice. For a backend engineer, prompting is neither a spell nor a shortcut. It is API design against an unreliable subroutine. You already know how to do most of it; you just haven't named it.
Reframe: an LLM call is an unreliable subroutine
Think of each prompt as a function. It takes structured input, returns structured output, has variance, occasionally throws, and costs a measurable amount of money per call. Every instinct you've built around writing robust code against a flaky downstream service applies: retries, timeouts, idempotency, input validation, output validation, fallbacks.
Only two things are unusual about this particular subroutine:
- The interface is specified in English rather than in a schema.
- The failure mode is usually a plausible-looking wrong answer, rather than an error.
Both of those are manageable, and neither requires a new branch of computer science.
Design the interface first
Before you write any prompt text, write the input and output of the call as you would for a normal API:
// Input
{
  "doc": "...",
  "question": "..."
}

// Output
{
  "answer": "...",
  "confidence": 0.0,  // 0-1
  "citations": [string]
}
With the interface fixed, the prompt becomes: "given the input schema, produce the output schema." Most of the mess in badly-engineered LLM systems is that there is no schema — the "output" is whatever the model felt like returning. Fix that first and the rest is cleanup.
Every frontier model now supports structured output. Use it. If you're parsing free text out of an LLM response with regular expressions, stop and reach for the structured-output API instead.
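Even with structured output enabled, it's worth enforcing the output schema in your own code as a second line of defense. A stdlib-only sketch against the schema above — a real system might use JSON Schema or a Pydantic model instead:

```python
import json

# The output schema from above, as (field, expected type) pairs.
SCHEMA = {"answer": str, "confidence": float, "citations": list}

def parse_response(raw: str) -> dict:
    """Parse and validate a model response against the output schema.
    Raises instead of silently accepting whatever came back."""
    out = json.loads(raw)
    for field, typ in SCHEMA.items():
        if field not in out:
            raise ValueError(f"missing field: {field}")
        if not isinstance(out[field], typ):
            raise ValueError(f"wrong type for {field}: {type(out[field]).__name__}")
    if not 0.0 <= out["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return out
```

The point is that a bad response becomes an exception at a known boundary, not a plausible-looking wrong answer propagating downstream.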
Keep the prompt small and in one place
The single most common mess I see is a 6,000-token system prompt accreted over eighteen months, half of which is contradictory. "Be concise. Provide detail. Be friendly. Be professional. Don't use emojis. Use emojis when appropriate." The model handles this less well than you'd hope.
A clean prompt for a production task fits on two screens. It contains: the role, the task, the input schema, the output schema, and three or four worked examples. That's it. If your prompt is longer, odds are it's compensating for an underspecified task.
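One way to keep those five parts honest is to assemble the prompt from them explicitly. A sketch — the role, task, and schema wordings here are illustrative, not a recommended phrasing:

```python
import json

def build_prompt(doc: str, question: str, examples: list[dict]) -> str:
    """Assemble the prompt from its five parts: role, task,
    input schema, output schema, worked examples."""
    shots = "\n".join(json.dumps(e) for e in examples)
    return (
        "Role: document QA assistant.\n"
        "Task: answer the question using only the document.\n"
        'Input schema: {"doc": string, "question": string}\n'
        'Output schema: {"answer": string, "confidence": number, "citations": [string]}\n'
        f"Examples:\n{shots}\n"
        f"Input:\n{json.dumps({'doc': doc, 'question': question})}\n"
    )
```

When the prompt is built this way, "add one more instruction" becomes a visible diff against a small function rather than another sentence buried in a 6,000-token blob.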
Evals are tests. Write them first.
The discipline that separates serious LLM-using systems from toys is evaluation. An eval is a curated set of (input, expected-output) pairs and a programmatic way to score whether the actual output is close enough.
Start with twenty examples. Hand-curate them. Cover the common case, three edge cases, and at least one adversarial case. When you change the prompt, rerun the evals and see whether the score improved. When a user reports a bug, add the case to the eval set. Over time, the eval set becomes the spec for what the system is supposed to do, and prompt changes become principled rather than superstitious.
This is exactly how you write unit tests against a flaky dependency. You already know how to do this.
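A minimal eval harness along these lines fits in a page. In the sketch below, `run_prompt` stands in for whatever issues the real LLM call, and the exact-match scorer is a deliberately crude placeholder you'd later swap for fuzzy or model-graded scoring:

```python
# Hand-curated (input, expected-output) pairs, grown over time from bug reports.
EVAL_SET = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def score(expected: str, actual: str) -> float:
    """Crude scorer: normalized exact match. Swap in something
    smarter (fuzzy match, model-graded) as the task demands."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_evals(run_prompt) -> float:
    """Run the whole set and return the mean score.
    Rerun after every prompt change; the trend is the point."""
    scores = [score(exp, run_prompt(q)) for q, exp in EVAL_SET]
    return sum(scores) / len(scores)
```

Because `run_prompt` is injected, the same harness runs against the live model, a cached snapshot, or a stub in CI.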
Budget your variance
Every LLM call has variance. The same input will not always produce the same output. You need to decide, for each call site, how much variance you can live with and build accordingly.
- High-variance, creative task: temperature 0.7+, one call, accept the output.
- Low-variance, deterministic-ish task: temperature 0, validate the output against a schema, retry once on validation failure.
- High-stakes task: temperature 0, two or three calls with different framings, reconcile (majority vote, or flag disagreement for human review).
Variance is a design variable. Don't accept the default.
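The high-stakes reconcile step can be a few lines. A sketch of simple majority voting that flags disagreement for human review rather than guessing:

```python
from collections import Counter

def reconcile(answers: list[str]) -> tuple:
    """Majority vote over several calls with different framings.
    Returns (winner, agreed). No strict majority -> (None, False),
    i.e. escalate to a human instead of picking arbitrarily."""
    counts = Counter(a.strip().lower() for a in answers)
    top, n = counts.most_common(1)[0]
    if n > len(answers) / 2:
        return top, True
    return None, False
```

Normalizing before counting matters: "Paris" and "paris" are the same answer, and an unnormalized vote would split them.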
Idempotency and cost control
LLM calls are expensive. A single page load that issues a hundred LLM calls will cost you real money at scale. Two cheap wins:
- Cache by input hash. Many calls are repeated with identical input and can be served from a cache.
- Use prompt caching where the provider supports it (Anthropic does). A large fixed context paired with small variable inputs gets much cheaper with prompt caching enabled. The API is usually one header.
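The first win, caching by input hash, is a few lines. The sketch below canonicalizes the input (sorted keys) before hashing, so two logically identical requests share one cache entry even if their dicts were built in different orders:

```python
import hashlib
import json

_cache: dict = {}

def cache_key(payload: dict) -> str:
    """Hash the canonicalized input so identical calls share one key."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_call(payload: dict, call) -> str:
    """Serve repeated identical inputs from the cache;
    only pay the provider for the first occurrence."""
    key = cache_key(payload)
    if key not in _cache:
        _cache[key] = call(payload)
    return _cache[key]
```

In production you'd replace the module-level dict with Redis or similar and add a TTL, but the hash-of-canonical-input idea is the whole trick.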
Guardrails at the boundaries, not inside the prompt
If your task is security-sensitive, do not rely on "please don't reveal the system prompt" inside the prompt. The model may comply or may not; you cannot prove it. Instead, put the guardrails in code around the call:
- Validate the input against a whitelist of expected shapes.
- Validate the output against the expected schema.
- Scan the output for forbidden patterns (tokens, PII, etc.) after it's returned.
- Log the full prompt and response for audit.
The prompt is a request; the wrapping code is the contract.
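The output-scanning guardrail in particular is cheap to implement as a post-call filter. The patterns below are illustrative placeholders — substitute your real secret and PII detectors:

```python
import re

# Illustrative forbidden patterns; replace with your actual detectors.
FORBIDDEN = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # API-key-shaped tokens
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-shaped numbers
]

def guard_output(text: str) -> str:
    """Post-call guardrail: reject responses that contain forbidden
    patterns, regardless of what the prompt did or didn't ask for."""
    for pattern in FORBIDDEN:
        if pattern.search(text):
            raise ValueError(f"forbidden pattern in model output: {pattern.pattern}")
    return text
```

Crucially, this check runs in code you control, after the model has answered — it holds whether or not the model honored anything you wrote in the prompt.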
If you already know how to write robust backend code against a flaky API, you already know 80% of prompt engineering. The remaining 20% is just that the API happens to speak English.
— Nivaan