We have spent the last year building agents — for customers, for ourselves, for research. What nobody tells you until you have done the work is that the model is the easy part. The workflow around it is where the agent lives or dies.
Anthropic's "Building effective agents" article from late 2024 said the same thing in more diplomatic language. We agree with most of it; this is the operator version.
What an agent actually is
Strip the hype away and an agent is three things stacked together:
- A workflow — a series of steps with branches, retries, and clearly defined inputs and outputs at each node.
- A model — invoked at some of those nodes to make decisions a deterministic rule could not.
- Tools — functions the model can call to read state or change it.
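The three layers above can be sketched in a few lines. This is a hedged illustration, not a real system: the ticket-routing domain, the node names, and the stubbed model function are all hypothetical, chosen only to show where the model sits relative to the deterministic workflow.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ticket:
    text: str
    route: str = ""

def classify(ticket: Ticket, model: Callable[[str], str]) -> Ticket:
    # Model node: the only non-deterministic step in the workflow.
    ticket.route = model(ticket.text)
    return ticket

def dispatch(ticket: Ticket) -> str:
    # Deterministic branch on the model's decision, with a defined fallback.
    return {"refund": "billing-queue", "bug": "eng-queue"}.get(
        ticket.route, "human-review"
    )

def run(ticket: Ticket, model: Callable[[str], str]) -> str:
    # The workflow: fixed steps, model invoked at exactly one of them.
    return dispatch(classify(ticket, model))

# A plain function stands in for the real model call.
fake_model = lambda text: "refund" if "charge" in text else "bug"
result = run(Ticket("double charge on invoice"), fake_model)  # "billing-queue"
```

Swap `fake_model` for a real LLM call and nothing else in the workflow changes — which is the point: the model is one interchangeable node.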
Most teams start by picking the model. That is the wrong end. The model is interchangeable. The workflow is the product.
The taxonomy Anthropic publishes — workflows (predetermined paths) vs agents (LLM-directed paths) — is useful but slightly misleading: it implies "agentic" is more advanced. In production our experience is the opposite. The systems that hold up are workflows with a model at the branch points, not the LLM running the show.
The workflow is the contract
A working agent has a contract with the world. Given these inputs in this format, after at most N steps, you will get one of these outputs. The contract is what you sell. The contract is what gets audited. The contract is what you debug when something goes wrong.
If you cannot describe the contract on a whiteboard without saying "the model decides" three times, you do not have an agent yet. You have a chat with a model.
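One way to make the contract concrete: write the "at most N steps, one of these outputs" guarantee as a typed signature with an enumerated result. The names here (`run_agent`, `MAX_STEPS`, the `step` callable) are illustrative assumptions, not an API from the article.

```python
from enum import Enum
from typing import Callable, Tuple

class Outcome(Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"

MAX_STEPS = 5  # the "at most N steps" part of the contract

def run_agent(request: str, step: Callable[[str], Tuple[str, bool]]) -> Outcome:
    """Given a request, after at most MAX_STEPS calls to step(),
    return exactly one member of Outcome. Nothing else can escape."""
    state = request
    for _ in range(MAX_STEPS):
        state, done = step(state)
        if done:
            return Outcome.RESOLVED
    return Outcome.ESCALATED
```

The whiteboard version of this fits in one sentence, which is the test the paragraph above proposes.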
Simon Willison has written about this from the user-facing angle; we run into it from the operator angle weekly.
Where models go in the stack
The model lives at the decision points of the workflow. The deterministic engineering lives everywhere else.
A good rule of thumb: every model call should be inside a function whose input is constrained and output is parsed. Constraint on the input gives you bounded behaviour to test; parsing on the output gives you a hard failure when the model drifts. Structured-output APIs — Anthropic's tool use, OpenAI's function calling, Pydantic-AI, BAML — exist for this exact reason. Use them.
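A minimal sketch of that rule, using only the standard library rather than any of the frameworks named above. The `judge` function, the `Verdict` shape, and the `model_call` parameter are hypothetical; the pattern — constrain the input, parse the output, fail loudly on drift — is the point.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    approve: bool
    reason: str

class ModelDriftError(ValueError):
    """Raised when the model's output no longer parses into the contract."""

def judge(claim: str, model_call: Callable[[str], str]) -> Verdict:
    if len(claim) > 2000:  # constrained input: bounded behaviour to test
        raise ValueError("claim too long")
    raw = model_call(f'Return JSON {{"approve": bool, "reason": str}} for: {claim}')
    try:
        data = json.loads(raw)  # parsed output: hard failure on drift
        return Verdict(bool(data["approve"]), str(data["reason"]))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ModelDriftError(f"unparseable model output: {raw!r}") from exc

# A stub stands in for the real model call.
fake = lambda prompt: '{"approve": true, "reason": "within policy"}'
verdict = judge("refund $12", fake)
```

A structured-output API does the parsing half for you; the input constraint is still yours to enforce.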
The deterministic parts of an agent are the load-bearing parts. The model is the wallpaper.
When something breaks in production, it is almost never the model that broke. It is the prompt that drifted, the parser that did not handle the new edge case, the retry policy that masked a real outage. Engineering problems.
The observability you need on day one
The single biggest investment that pays back across every agent we have shipped is structured event logging at each workflow node. Not the prompt, not the response — the node. Inputs, outputs, latency, cost, the model and version, the tool calls, the retries.
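The per-node event can be sketched as a small wrapper. This is an assumption-laden illustration: the sink is a plain list, the field names are invented, and in production you would emit to your structured logger of choice — but the shape (node, input, output, latency, cost, model, retries) is the list from the paragraph above.

```python
import json
import time

EVENTS = []  # stand-in for a real structured-log sink

def log_node(node, fn, payload, *, model="model-id", cost_usd=0.0, retries=0):
    """Run one workflow node and emit a structured event for it."""
    start = time.monotonic()
    output = fn(payload)
    EVENTS.append(json.dumps({
        "node": node,
        "input": payload,
        "output": output,
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
        "model": model,
        "cost_usd": cost_usd,
        "retries": retries,
    }))
    return output

result = log_node("classify", lambda t: t.upper(), "refund request")
```

One JSON line per node execution is enough to answer the three questions below.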
With that, when an agent misbehaves you can answer in five minutes:
- Which node degraded?
- When did it start?
- What changed?
Without that, you ship vibes and pray.
Eugene Yan's writing on evals and observability is the best public reference for this. Hamel Husain's "Your AI Product Needs Evals" is the second best. Read both before writing your first agent.
Cost is a feature, not a footnote
Agents make many model calls. Many of those calls can be cached. Many of those can be skipped if the previous output was good enough. None of this matters until your monthly bill arrives.
We treat cost as a first-class metric, alongside latency and accuracy. Every agent has a target cost per execution. If we exceed it, we triage the same way we triage a performance regression. Anthropic's prompt caching and OpenAI's prompt caching made this affordable; both are mandatory in production agents now.
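A target cost per execution can be enforced with something as small as the guard below. The class name, the target figure, and the breach behaviour are all hypothetical; the idea is only that cost accumulates per call and a breach is a tripwire, handled like any other regression.

```python
class CostBudget:
    """Accumulates per-call spend against a target cost per execution."""

    def __init__(self, target_usd: float):
        self.target_usd = target_usd
        self.spent = 0.0
        self.breached = False

    def charge(self, usd: float) -> float:
        self.spent += usd
        if self.spent > self.target_usd:
            self.breached = True  # in production: alert and sample the trace
        return self.spent

budget = CostBudget(target_usd=0.05)
budget.charge(0.02)
budget.charge(0.02)  # still under target
budget.charge(0.02)  # over target: budget.breached is now True
```

With prompt caching in the loop, the `charge` amounts drop for cached prefixes, which is exactly why the paragraph above calls caching mandatory.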
Build the boring parts well
The unsexy reality is that the agents that make it to production look mostly like the systems we have always built — well-defined state machines, careful contracts, structured logs, retries, idempotency keys. The model gives them judgement at specific points. That is all.
If you treat the agent as a workflow with smart nodes, you can ship it on a deadline and operate it on a budget. If you treat the model as the whole thing, you are going to have an interesting demo and a hard time in week three.
This is also why Walter OS — our internal agentic OS — looks more like a build system than a chatbot. Skills, hooks, dispatchers, audit logs, trust tiers. The model is the smallest moving part.
Further reading
- Building effective agents — Anthropic, December 2024. Mandatory.
- Patterns for building LLM-based systems & products — Eugene Yan, 2023. The reference for production patterns.
- Your AI product needs evals — Hamel Husain. The discipline post.
- Simon Willison's blog. Worth following continuously; he tracks the ecosystem better than most teams' internal Slack channels do.
- BAML — Boundary's structured-output framework. Underrated. Solves the parsing problem cleanly.
> end of article · v0.1 · 2026
Juan Cruz Fernandez
co-founder · product & systems
Designs the product loop and the operating system around it. Believes most product problems are operational problems in disguise.