Hero graphic titled "from a loop to an agent." Left panel, "THE LOOP": four primitives (while loop, API call per turn, tool dispatcher, message accumulator) that work on a toy problem but break under real money and latency. Right panel, "PRODUCTION AGENT": six concerns spanning cost, context, and safety: prompt caching (90% off the prefix), model routing (haiku/sonnet/opus), compaction (150k token threshold), subagents (isolated context), MCP (uniform tool protocol), and permission boundaries (safe execution).

LLM Fundamentals: Part 10 -- From Loop to Agent

ai llm-fundamentals

This is Part 10 of the LLM Fundamentals series.

In Post 9, I assembled the pieces from the entire series into an agentic loop: a while loop that calls the model, executes tools, feeds results back, and repeats until the task is done. It works. But running that loop on a toy problem and running it in production are different things. Production means real money, real latency, and real context pressure. This post is a roadmap for bridging that gap, not a tutorial.

Two-card comparison. Left card titled "deterministic loop" shows a four-step sequence (prompt, model.create, parse + tool_call, append + i++) inside a for-i-in-range-N counter, with a dashed loop-back arrow. Footer reads "developer counts the iterations." Right card titled "agent" shows the same flow ending in a stop_reason branch diamond. The amber tool_use branch leads to execute + loop with a dashed loop-back. The emerald end_turn branch leads to a final answer. Footer reads "stop_reason is the brake; model decides when to stop."

Prompt Caching

Every iteration of an agentic loop sends the full conversation history back to the API. By turn 15, you might be sending 50,000 tokens of context, and 90% of it is identical to what you sent on turn 14. Without caching, you pay full price for those repeated tokens every single time.

Prompt caching solves this by letting the API recognize content it has already processed. You mark stable content (tool definitions, system prompts, earlier conversation turns) with cache breakpoints, and on subsequent requests the API reads those tokens from cache instead of reprocessing them. Cached reads cost 10% of the base input token price, which means a 90% cost reduction on everything before the breakpoint.
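
To put rough numbers on it (illustrative Sonnet-class pricing of $3 per million input tokens): a 50,000-token turn costs about $0.15 uncached. With 45,000 of those tokens read from cache at a tenth of the price, the same turn costs 5,000 × $3/M + 45,000 × $0.30/M ≈ $0.029, roughly a 5x reduction. The first request pays slightly more than base price to write the cache, but every read after that is cheap, and the gap compounds across every iteration of the loop.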

I cache two things in every agentic workflow: tool definitions and the system prompt. Both are identical across every loop iteration, and together they often account for thousands of tokens. For multi-turn conversations, automatic caching mode handles breakpoint placement for you. One cache_control field at the request level, and the API figures out the rest.
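
Here is what those two breakpoints look like with the Python SDK. This is a minimal sketch: the tool definition, system prompt, conversation, and model ID are stand-ins for whatever your loop actually uses.

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your stable system prompt; thousands of tokens in practice
conversation = [{"role": "user", "content": "Refactor utils.py to remove dead code."}]

tools = [
    {
        "name": "read_file",  # illustrative tool; the schema shape is what matters
        "description": "Read a file from the working directory.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        # Breakpoint 1: a cache_control marker on the last tool caches
        # all tool definitions up to this point.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID; use whatever your loop runs on
    max_tokens=1024,
    tools=tools,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint 2: the system prompt is stable across iterations too.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=conversation,  # accumulated history from the loop
)
```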

Model Routing

Not every step in an agentic workflow needs the same model. Classification and straightforward tool selection are fast tasks that a smaller model handles well, while complex reasoning and nuanced code generation benefit from a larger one. Using Opus for everything works but costs more than it should.

Anthropic’s model selection guidance breaks it down: Haiku for real-time applications and high-volume processing, Sonnet as the general-purpose workhorse for code generation and agentic tool use, Opus for complex reasoning and multi-hour research tasks. I mix all three within a single workflow. A Haiku call classifies incoming requests and routes them. Sonnet handles the main loop iterations. Opus steps in when the task requires deep analysis.

Routing logic can be simple. If the accumulated context exceeds a certain size or the task description signals complexity, route to Opus. Otherwise, Sonnet. For sub-tasks like generating summaries or extracting metadata, Haiku. No ML-based router needed for most workflows.
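
A sketch of that heuristic in plain Python. The token threshold, keyword markers, and model IDs are illustrative, not prescriptions:

```python
def pick_model(task: str, context_tokens: int) -> str:
    """Route a step to a model tier with plain heuristics.

    Model IDs are assumptions; substitute whatever your account exposes.
    """
    COMPLEX_MARKERS = ("architecture", "debug", "prove", "refactor across")

    # Deep analysis or heavy accumulated context -> largest model.
    if context_tokens > 100_000 or any(m in task.lower() for m in COMPLEX_MARKERS):
        return "claude-opus-4-1"

    # Cheap sub-tasks: summaries, metadata extraction, classification.
    if task.lower().startswith(("summarize", "classify", "extract")):
        return "claude-haiku-4-5"

    # Default workhorse for main loop iterations.
    return "claude-sonnet-4-5"
```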

Context Management

From Post 3: as token count grows, accuracy degrades. Context rot is a real phenomenon even with 1M-token windows. An agentic loop that runs for 40 iterations accumulates tool calls, results, reasoning, and responses that the model must wade through on every subsequent turn. Most of that context is stale.

Server-side compaction addresses this directly. When input tokens exceed a configurable threshold (default 150,000), the API summarizes earlier conversation turns and replaces them with a compact summary block. Subsequent requests start from the summary instead of replaying the full history. You can customize the summarization instructions to preserve whatever you need most for your workflow, like variable names, file paths, or decision rationale.
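
The server-side feature does this for you. To make the shape of the idea concrete, here is a client-side stand-in, with the summarizer model, slicing policy, and prompt all as assumptions; `client` is an `anthropic.Anthropic()` instance:

```python
COMPACTION_THRESHOLD = 150_000  # mirrors the server-side default

def maybe_compact(client, messages, token_count):
    """Client-side stand-in for the server-side compaction idea.

    Assumes message content is plain text; a real loop must also keep
    tool_use/tool_result pairs together when it slices the history.
    """
    if token_count <= COMPACTION_THRESHOLD:
        return messages

    head, tail = messages[:-4], messages[-4:]  # keep recent turns verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in head)

    summary = client.messages.create(
        model="claude-haiku-4-5",  # cheap summarizer; model ID is illustrative
        max_tokens=2048,
        system=("Summarize this agent transcript. Preserve variable names, "
                "file paths, and decision rationale."),  # customize per workflow
        messages=[{"role": "user", "content": transcript}],
    )
    compacted = {"role": "user",
                 "content": "[Compacted history]\n" + summary.content[0].text}
    return [compacted] + tail
```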

Context awareness in current Claude models (Sonnet 4.5, Sonnet 4.6, and Haiku 4.5) adds another layer. Claude tracks its remaining token budget throughout a conversation and receives updates after each tool call. Instead of guessing how much space remains, the model knows precisely and can adjust its behavior: producing more concise responses as the window fills, or flagging when it is approaching capacity.

Agent SDKs

An agentic loop is fundamentally simple: a while loop, an API call, a tool dispatcher, and a message accumulator. Production agents need more than that. They need permission boundaries that prevent dangerous operations, session persistence so a long task survives disconnections, hooks that run custom logic before or after tool execution, plus retry logic and structured logging for debugging multi-step runs.

You can build all of this yourself, and doing so teaches you exactly how agents work. The Claude Agent SDK packages those patterns into reusable infrastructure: built-in tools for file operations and shell commands, session management across interactions, permission systems that scope what the agent can do, and hook points for injecting custom behavior at every stage of the loop.
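
To make permission boundaries and hooks concrete, here is one way a hand-rolled loop might gate tool dispatch. The hook names and the policy are invented for illustration; the SDK's real mechanisms are richer:

```python
DANGEROUS_TOOLS = {"run_shell", "delete_file"}  # policy is workflow-specific

def execute_with_guardrails(tool_name, tool_input, handlers, hooks):
    """Dispatch a tool call through a permission check and hook points.

    `handlers` maps tool names to callables; `hooks` is an invented dict
    with optional "pre_tool" / "post_tool" callables.
    """
    # Permission boundary: refuse, or escalate to a human, before running.
    if tool_name in DANGEROUS_TOOLS:
        if input(f"Allow {tool_name}({tool_input})? [y/N] ").lower() != "y":
            return {"error": f"{tool_name} denied by permission policy"}

    if pre := hooks.get("pre_tool"):
        pre(tool_name, tool_input)  # e.g. structured logging

    result = handlers[tool_name](**tool_input)

    if post := hooks.get("post_tool"):
        post(tool_name, tool_input, result)  # e.g. redact secrets from output

    return result
```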

Subagents

A single agent with a broad tool set and a massive system prompt will eventually hit the limits of context and focus. Subagents split the work. Instead of one agent that knows everything, you create specialized agents with their own context windows, tool sets, and system prompts. A coding agent that only has access to file tools and code execution. A research agent with web search and document retrieval. An orchestrator that delegates to whichever subagent fits the current step.

Each subagent operates in its own context, which means it is not polluted by irrelevant information from other parts of the workflow. A research subagent does not need to see 30 rounds of code editing context, and a coding subagent does not need to see 15 search results from a research phase. Isolation keeps each agent’s context clean and focused, directly countering the context rot problem from Post 3.
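
A sketch of that isolation. The tool lists and `run_agent_loop` are placeholders for the loop built in Post 9:

```python
# Tool definitions assumed from earlier posts; shown here as placeholders.
FILE_TOOLS = [...]    # read_file, write_file, run_tests schemas
SEARCH_TOOLS = [...]  # web_search, fetch_document schemas

SUBAGENTS = {
    "coder": {"system": "You are a coding agent. Edit files and run tests.",
              "tools": FILE_TOOLS},
    "researcher": {"system": "You are a research agent. Search and summarize sources.",
                   "tools": SEARCH_TOOLS},
}

def delegate(task: str, agent_name: str) -> str:
    """Run a sub-task in a fresh, isolated context window."""
    spec = SUBAGENTS[agent_name]
    # Fresh message list per delegation: the subagent never sees the
    # orchestrator's history, only the task it was handed.
    messages = [{"role": "user", "content": task}]
    return run_agent_loop(  # the Post 9 loop, reused as-is (assumed helper)
        system=spec["system"], tools=spec["tools"], messages=messages,
    )
```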

MCP

Every tool integration I have described in this series follows the same pattern: define a schema, handle the call, return a result. Model Context Protocol (MCP) standardizes that pattern into a protocol. Instead of writing custom tool handlers for every data source and service, you connect to MCP servers that expose tools, resources, and prompts through a uniform interface.

An MCP server for a database exposes query tools, one for a calendar exposes scheduling tools, one for a CRM exposes customer lookup. Your agent connects to whichever servers it needs, discovers available tools automatically, and calls them through the same protocol regardless of what sits behind the server.
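
With the official Python MCP SDK, the client side is short. The server command and tool name below are assumptions (the reference SQLite server); any stdio MCP server works the same way:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch an MCP server as a subprocess; this command assumes the
    # reference SQLite server is installed and runnable via uvx.
    params = StdioServerParameters(
        command="uvx", args=["mcp-server-sqlite", "--db-path", "app.db"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Tool discovery is part of the protocol, not your code.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Tool names come from discovery; "read_query" is what the
            # SQLite reference server happens to expose.
            result = await session.call_tool(
                "read_query", {"sql": "SELECT name FROM sqlite_master"}
            )
            print(result.content)

asyncio.run(main())
```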

Series Recap

I started this series with tokens: the atomic unit that LLMs process. From there, text generation explained how models pick the next token from a probability distribution. Context windows defined working memory and its limits. Messages API showed the stateless protocol underneath every call. Prompt engineering and extended thinking covered how to communicate intent and enable reasoning. Structured output guaranteed machine-readable responses. Tool use gave models the ability to act. Post 9 combined everything into the agentic loop.

And this post, Post 10, is the roadmap for scaling that loop into something you would actually deploy. Caching to manage cost. Model routing to balance capability against speed. Compaction to fight context rot. Subagents to divide complexity. MCP to standardize integrations.

None of these concepts are locked to a single provider or framework. Prompt caching, model routing, context management, and tool protocols exist across the ecosystem. If you understand why each pattern exists and what problem it solves, the mental model transfers. Swap Anthropic for another provider, swap the SDK for a different framework, and the architecture stays the same. Tokens are still tokens, context still rots, tools still need schemas, and agents are still loops underneath.