LLM Fundamentals: Part 4 -- Messages API
This is Part 4 of the LLM Fundamentals series.
Context windows define how much a model can hold. But what does the content inside that window actually look like when you send it to the API?
Roles, statelessness, and conversation structure exist across every LLM provider, even though the implementation details differ. From this part forward, the examples use Claude’s Messages API directly.
Three Roles, One Structure
Every message you send to Claude carries a role. There are three roles in the Messages API: system, user, and assistant. Each serves a distinct purpose, and understanding what they do changes how you design prompts.
System sets behavior. It defines who Claude is for this conversation, what rules to follow, what tone to use, what constraints apply. System prompts go in a separate top-level parameter, not inside the messages array. I think of the system prompt as stage directions that the audience never sees but the actor always follows.
User carries human input. Your questions, instructions, documents, images, tool results: anything you send to Claude lives in a user message.
Assistant holds Claude’s prior responses. When you include assistant messages in your request, you are telling Claude “this is what you said before,” giving it conversation history to build on.
A minimal request with all three:
```python
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a concise Python tutor.",
    messages=[{"role": "user", "content": "What does enumerate do?"}],
)
print(response.content[0].text)
```
Nine lines to send a prompt and get a response. Most of the complexity in production systems comes from managing what goes into that messages array over time.
Stateless by Design
Statelessness is the concept most often missed when people start with LLM APIs. Claude does not remember your last request. With no session, no memory, and no persistent thread, every API call is a self-contained transaction.
When you want a multi-turn conversation, you build it yourself. After each response, you append Claude’s reply to your messages array, add the user’s next input, and send the entire history back. Turn 1 sends one user message. Turn 5 sends four previous exchanges plus the new message. Turn 20 sends everything from turns 1 through 19 plus the current input.
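To make the growth concrete, here is a sketch of what the messages array contains by turn 3 (the contents are invented for illustration):

```python
# What the API receives on turn 3: the full history, re-sent every time.
messages = [
    {"role": "user", "content": "What does enumerate do?"},              # turn 1
    {"role": "assistant", "content": "It pairs items with indexes."},    # turn 1 reply
    {"role": "user", "content": "Show me an example."},                  # turn 2
    {"role": "assistant", "content": "for i, x in enumerate(xs): ..."},  # turn 2 reply
    {"role": "user", "content": "Can it start at 1 instead of 0?"},      # turn 3, new input
]
```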
If you read Part 3 on context windows, this should sound familiar. Statelessness is why context grows with every turn: you are literally re-sending the entire conversation with each request.
Statelessness is elegant once you stop fighting it, because it gives you complete control over what the model sees. You can summarize older turns, drop irrelevant tool results, inject synthetic context, or rewrite the conversation history entirely between requests. No hidden state means no surprises.
Alternating Turns
Messages are expected to alternate between user and assistant roles. If you send two user messages back to back, or two assistant messages in sequence, the API merges the consecutive same-role messages into a single turn.
In practice, this constraint rarely causes problems because natural conversations already alternate. Where it matters is in programmatic construction: if your code appends a user message, processes a tool result, and wants to add another user message, you need to restructure. Tool results go inside user messages as content blocks, keeping the alternation intact.
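As a sketch of that shape (the id is invented; the tool_result block follows the Messages API’s tool-use format):

```python
# The tool result rides inside a user message as a content block,
# so the user/assistant alternation stays intact.
tool_result_message = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_01Example",  # invented id for illustration
            "content": "72°F and sunny",
        }
    ],
}
```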
Synthetic Assistant Messages
One powerful technique I use regularly: placing an assistant message that Claude never actually generated. You construct a fake prior response and include it in the messages array to steer behavior.
Say you want Claude to respond only in JSON. Instead of writing elaborate instructions, you can start the assistant’s response for it. Including {"role": "assistant", "content": "{"} in your messages tells Claude “you already started outputting JSON, keep going.” It is a form of behavioral priming that works because the model generates token by token, conditioned on everything before it, including fabricated history.
Note that prefilled assistant responses on the final turn are deprecated in Claude 4.6 models. For format control, structured outputs and clear system instructions have replaced the need for prefills. But placing synthetic assistant messages earlier in the conversation, to simulate prior exchanges or demonstrate expected behavior, remains fully supported and useful.
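Here is a minimal sketch of that supported pattern: a fabricated exchange earlier in the history demonstrates the expected output format (both of the first two messages are written by you, not generated by Claude):

```python
messages = [
    # Synthetic history: Claude never generated this assistant message.
    # It demonstrates the exact output shape you want.
    {"role": "user", "content": "Summarize: The cat sat on the mat."},
    {"role": "assistant", "content": '{"summary": "A cat sat on a mat."}'},
    # The real request. Claude is now primed to answer in the same JSON shape.
    {"role": "user", "content": "Summarize: Rain fell all day in Oslo."},
]
```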
Stop Reasons Tell You What Happened
When Claude responds, the stop_reason field tells you why generation ended. Ignoring it is a common source of bugs in production code.
Four stop reasons cover most scenarios:
- end_turn means Claude finished naturally. It said everything it wanted to say. This is what you see on a normal, complete response.
- max_tokens means Claude hit the token ceiling you set in max_tokens. Its response was truncated, possibly mid-sentence. If you see this, either raise the limit or implement continuation logic.
- stop_sequence means Claude encountered one of your custom stop strings. Useful for structured parsing where you want generation to halt at a delimiter; see the sketch after this list.
- tool_use means Claude wants to call a tool and is waiting for you to execute it and return the result. Tool calls are covered in Part 8; the loop they fit inside is Part 9.
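For the stop_sequence case, a minimal sketch of how the pieces fit together (the prompt content is invented; stop_sequences is the request parameter that defines your custom stop strings):

```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    stop_sequences=["###"],  # halt generation at this delimiter
    messages=[{"role": "user", "content": "List three fruits, then write ###"}],
)
# The delimiter itself is not included in the output; stop_reason is
# "stop_sequence" and response.stop_sequence holds the matched string.
print(response.stop_reason, response.stop_sequence)
```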
Checking stop_reason on every API call is the difference between robust and fragile loops. A response that looks complete but has stop_reason: "max_tokens" is silently truncated, and that difference matters when you are parsing structured output or feeding the response into a downstream system.
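A minimal guard, as a sketch; the handling policy is yours to choose:

```python
# Fail loudly on truncation instead of silently parsing a partial response.
if response.stop_reason == "max_tokens":
    raise RuntimeError("Truncated: raise max_tokens or implement continuation.")
elif response.stop_reason == "tool_use":
    ...  # execute the requested tool and return the result (Part 8)
# stop_reason == "end_turn" falls through: the response is complete.
```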
Building a Conversation Loop
A multi-turn conversation looks like this:
```python
messages = []

while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages,
    )
    # Production code branches on response.stop_reason here:
    # max_tokens → continue, tool_use → execute tool, end_turn → done.
    # Part 8 covers the tool_use branch in detail.

    assistant_text = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_text})
    print(f"Claude: {assistant_text}")
```
Each iteration appends the user’s input and Claude’s response to the same list. By the tenth exchange, that list contains twenty messages, and all twenty get sent with every request. Statelessness means the model sees the full conversation each time, which is both the strength of this approach and the constraint you need to manage as conversations grow.
Session State Lives in Your Code
Once you internalize that every API call is independent, architectural decisions follow naturally. Session management is your responsibility, not the API’s. Conversation storage is your database, not Claude’s memory. Context pruning, summarization, and history truncation are problems you solve in your application layer.
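As one sketch of that application-layer work, here is a naive, hypothetical pruning helper; real systems often summarize older turns instead of dropping them, but the responsibility sits in the same place:

```python
def truncate_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Naive pruning sketch: keep only the most recent messages.

    A hypothetical helper, not part of any SDK. It keeps the
    user/assistant alternation valid by making sure the kept
    slice still opens with a user message.
    """
    if len(messages) <= max_messages:
        return messages
    kept = messages[-max_messages:]
    # If truncation left an assistant message first, drop it so the
    # history still starts with a user turn.
    while kept and kept[0]["role"] == "assistant":
        kept = kept[1:]
    return kept
```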
Knowing the roles, understanding statelessness, and checking stop reasons give you the foundation to build anything from a simple chatbot to a multi-step agent. Everything from this point forward in the series builds on this structure.
Next up: prompt engineering. You know the API mechanics now. Part 5 covers why certain prompting patterns work, grounded in how models actually generate text, and the techniques that consistently produce better results.