
LLM Fundamentals: Part 5 -- Prompt Engineering vs Harness Engineering


This is Part 5 of the LLM Fundamentals series.

You now know the API structure: three roles, stateless calls, alternating turns. But knowing the mechanics of sending a message does not tell you what to put in it, or whether the prompt is even the right place to solve your problem.

Prompt engineering has a branding problem. Search for it and you find “magic phrases” and “secret hacks” that treat models like vending machines. In practice, the techniques that consistently work all do one thing: reduce ambiguity for a system that predicts one token at a time. But as models improve, a different question emerges. How much of your system’s reliability should live in the prompt, and how much should live in the code around it?

Prompt Engineering: What Still Works

Anthropic publishes a golden rule: show your prompt to a colleague with no context. If they would be confused, Claude will be too. Every technique below works because it makes the task less ambiguous, which from Post 2 means the probability distribution concentrates on tokens you actually want.

XML tags create labeled boundaries between content types. When a prompt mixes documents with instructions, XML tags prevent the model from confusing one for the other. I switched to tagged prompts for all document processing and saw fewer hallucinated instructions immediately.

<document>
{{CONTRACT_TEXT}}
</document>
<instructions>
Extract all payment terms as a JSON array.
</instructions>

Few-shot examples steer format and tone more reliably than descriptions. Three to five input-output pairs shape the conditional probability distribution by loading the context with consistent patterns. I use three as a minimum for format adherence.
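A minimal sketch of the pattern, with hypothetical support-ticket labels:

<examples>
Input: "I was charged twice for my subscription this month."
Output: {"category": "billing"}

Input: "The app crashes every time I upload a photo."
Output: {"category": "bug"}

Input: "Any chance you could add a dark mode?"
Output: {"category": "feature_request"}
</examples>
<instructions>
Classify the ticket below using the same JSON format.
</instructions>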

Role assignment in the system prompt primes every subsequent token. “You are a senior security engineer” produces different output than “Answer questions about this code” because the role conditions the entire generation.
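Where the role lives in practice, as a minimal sketch assuming the Anthropic Python SDK (the model name is illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    # The role belongs in the system prompt, where it conditions every token:
    system="You are a senior security engineer reviewing code for vulnerabilities.",
    messages=[{"role": "user", "content": "Review this function for injection risks: ..."}],
)
print(response.content[0].text)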

Document placement affects quality measurably. Placing long documents above the query, with instructions at the end, can improve performance by up to 30%.
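As a skeleton, the ordering looks like this: bulk content first, query after, instructions last.

<document>
{{LONG_DOCUMENT}}
</document>
<query>
{{USER_QUERY}}
</query>
<instructions>
Answer the query using only the document above.
</instructions>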

These techniques work. I still use all of them daily. But they have a ceiling.

Where Prompts Hit a Wall

I spent weeks refining a classification prompt that needed to handle 15 categories with edge cases and overlapping definitions. Every revision improved accuracy on one category while degrading another. More few-shot examples pushed against context limits. Longer system prompts with detailed rules created contradictions.

Prompt engineering becomes a losing game when you try to encode complex logic into natural language instructions. Natural language is ambiguous by design. You can reduce that ambiguity with structure and examples, but you cannot eliminate it. At some point, the prompt is no longer the right tool.

Harness Engineering: Building Around the Model

Vivek Trivedy’s framing is the cleanest one-liner I have read on this: Agent = Model + Harness. If you are not the model, you are the harness.

Harness engineering is everything outside the prompt that shapes model behavior. Tool definitions, output schemas, routing logic, context management, retry strategies, validation layers, hooks, sandboxes, sub-agents. Code, not prose.

Consider the classification problem above. Instead of a single prompt with 15 categories and 30 examples, I restructured it:

  • A cheap model (Haiku) does a first pass, classifying into four broad groups
  • A second call handles fine-grained classification within each group, with a small focused prompt and relevant examples only
  • Output schema validation rejects malformed responses and retries
  • Confidence thresholds route ambiguous cases to human review

No single prompt in this pipeline is longer than 20 lines. Each one does one focused job. Reliability comes from the orchestration, not from any individual prompt being clever.

Vertical four-stage cascade: step 1 triage uses Haiku to cut into four broad groups; step 2 classify runs a focused per-group prompt for the fine label; step 3 validate runs a JSON schema check that retries once on malformed output; step 4 gate applies a confidence threshold, shipping high-confidence labels and queuing ambiguous ones for human review.
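A minimal sketch of that cascade, assuming the Anthropic Python SDK; the taxonomy, model names, and threshold are illustrative, and the per-group few-shot examples are omitted:

import json
import anthropic

client = anthropic.Anthropic()

# Hypothetical taxonomy: four broad groups covering the original 15 categories.
GROUPS = {
    "billing": ["refund", "double_charge", "plan_change"],
    "technical": ["crash", "performance", "data_loss", "integration"],
    "account": ["login", "permissions", "deletion", "security"],
    "other": ["feedback", "spam", "sales", "unknown"],
}
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against a labeled holdout set

def ask(model: str, system: str, user: str) -> str:
    response = client.messages.create(
        model=model,
        max_tokens=256,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text

def triage(ticket: str) -> str:
    # Step 1: a cheap model cuts 15 fine labels down to 4 broad groups.
    group = ask(
        "claude-3-5-haiku-latest",  # illustrative model name
        f"Classify the ticket into exactly one group: {', '.join(GROUPS)}. "
        "Reply with the group name only.",
        ticket,
    ).strip().lower()
    return group if group in GROUPS else "other"

def validate(raw: str, labels: list[str]) -> dict | None:
    # Step 3: schema check. Malformed output is rejected, never patched up.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if data.get("label") in labels and isinstance(data.get("confidence"), (int, float)):
        return data
    return None

def classify(ticket: str) -> dict:
    group = triage(ticket)
    labels = GROUPS[group]
    # Step 2: a small, focused prompt per group (few-shot examples omitted here).
    system = (
        f"Classify the ticket into exactly one of: {', '.join(labels)}. "
        'Reply with JSON only: {"label": "<label>", "confidence": <0.0-1.0>}'
    )
    for _ in range(2):  # one retry on malformed output
        result = validate(ask("claude-sonnet-4-5", system, ticket), labels)
        if result is not None:
            break
    else:
        return {"group": group, "route": "human_review", "reason": "malformed_output"}
    # Step 4: the confidence gate ships confident labels, queues the rest.
    route = "auto" if result["confidence"] >= CONFIDENCE_THRESHOLD else "human_review"
    return {"group": group, "route": route, **result}

A side benefit of this shape: each step is testable in isolation. Feed triage a ticket and assert the group; feed validate a malformed string and assert rejection.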

Where Each Approach Fits

Prompt engineering solves communication problems. If the model does not understand what you want, a better prompt fixes that. XML tags, examples, roles, chain of thought, document placement: these all make your intent clearer to a next-token predictor.

Harness engineering solves systems problems. If you need guaranteed output formats, use structured output schemas instead of asking nicely for JSON. If you need reliable multi-step workflows, use prompt chaining with validation between steps instead of a mega-prompt. If you need cost control, route by model tier instead of asking the expensive model to be brief.
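The first of those swaps, sketched with the Anthropic SDK's tool-use path, where forcing a single tool makes its input schema the output contract (tool and model names are illustrative):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=512,
    tools=[{
        "name": "record_payment_terms",  # hypothetical tool name
        "description": "Record payment terms extracted from a contract.",
        "input_schema": {
            "type": "object",
            "properties": {
                "terms": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["terms"],
        },
    }],
    # Forcing the tool guarantees a schema-shaped response, no asking nicely:
    tool_choice={"type": "tool", "name": "record_payment_terms"},
    messages=[{"role": "user", "content": "Extract all payment terms from: ..."}],
)
terms = response.content[0].input["terms"]  # parsed JSON, not free text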

A pattern I have settled into for production work:

  • Prompt: role, context, task, format guidance. Keep it short and focused.
  • Harness: tool definitions, output validation, routing, retries, context management. Make it robust.

Two-column diagram: prompt holds role, context, task, format guidance, and few-shot examples and works in natural language with an ambiguity floor that stays above zero; harness holds tool definitions, output schemas, routing by model tier, prompt chaining with validation, and retry strategies and works in code with deterministic format guarantees.

Anthropic’s own prompt engineering guidance reflects this shift. Earlier models needed aggressive prompting with caps and emphasis to follow instructions. Current Claude models are responsive enough that dialing back to neutral phrasing works better. As models improve, the prompt simplifies while the harness sophisticates.

The Ratchet: Every Mistake Becomes a Rule

Addy Osmani’s harness-engineering synthesis names the most useful habit in the discipline: treat agent mistakes as permanent signals, not flukes to retry and forget. If an agent ships a PR with a commented-out test that gets merged by accident, the next iteration of CLAUDE.md states “never comment out tests; delete or fix them.” A pre-commit hook flags .skip( in the diff. The reviewer sub-agent blocks commented-out tests at review time.
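A sketch of what that hook might look like, as a Python pre-commit script over the staged diff:

import subprocess
import sys

# Ratchet rule, added after a commented-out test slipped through review:
# block any staged change that introduces a new .skip( call.
diff = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

added = [line for line in diff.splitlines()
         if line.startswith("+") and not line.startswith("+++")]
offending = [line for line in added if ".skip(" in line]

if offending:
    print("pre-commit: new .skip( calls are blocked (delete or fix the test):")
    for line in offending:
        print(" ", line)
    sys.exit(1)  # loud failure: the agent sees exactly what to fix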

Constraints get added when you observe a real failure. They get removed when a capable model renders them redundant. Every line in a good system prompt traces back to a specific historical failure. The harness becomes a discipline shaped by the codebase’s unique failure history, not a framework you bolt on.

The cheap operational consequence: success is silent, failures are loud. A typecheck that passes makes no noise. A typecheck that fails injects the error directly back into the loop for self-correction. The harness speaks only when there is something to fix.
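A sketch of that loop, assuming tsc as the typechecker; run_agent_step is a hypothetical placeholder for the model call:

import subprocess

def typecheck() -> str | None:
    # Success is silent: a clean run returns None and adds nothing to context.
    result = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True, text=True)
    if result.returncode == 0:
        return None
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "Add pagination to the users endpoint."}]
for _ in range(3):  # bounded self-correction loop
    reply = run_agent_step(messages)  # hypothetical call into the agent
    messages.append({"role": "assistant", "content": reply})
    errors = typecheck()
    if errors is None:
        break  # nothing to say; the harness stays quiet
    # Failure is loud: inject the compiler output straight back into the loop.
    messages.append({"role": "user", "content": f"Typecheck failed:\n{errors}\nFix it."})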

Prompt Chaining as the Bridge

Prompt chaining sits at the boundary between the two disciplines. You are still writing prompts, but the architecture around them (sequencing, data passing, error handling) is pure harness work.

I build almost every production workflow as a chain. Identify, then classify, then act. Each step is debuggable in isolation. If one step produces bad output, I fix that step without rewriting the pipeline. Two reasons this beats mega-prompts, both grounded in earlier posts: smaller prompts mean less context rot, and focused context produces more concentrated probability distributions.

Posts 6 Through 10: Almost Entirely Harness

Extended thinking is an API parameter, not a prompt. Structured output is a schema constraint. Tool use is a function calling protocol. Agent loops are orchestration code. From here on, the work shifts from writing better text to designing better systems around the text the model produces.

Next up: extended thinking. Post 6 covers chain of thought as a first-class API feature, where Claude allocates dedicated reasoning tokens before responding.