Operating SRE for Agentic Systems: Reliability Patterns for LLM Tools and Autonomous Workflows
- Harshil Shah
- Mar 30
- 8 min read

// For engineering leaders shipping production AI systems that don't just predict, they act
Traditional SRE assumes your system's blast radius is bounded. An LLM agent with tool access doesn't have that property. The failure modes aren't slower, they're different in kind.
Most reliability engineering practice evolved around systems that are deterministic enough to reason about formally. You write an SLO. You define your error budget. You set up alerts on latency and error rate. That still applies to the infrastructure layer under an agentic system. But above that layer, something structurally different is happening. The instinct to bolt conventional SRE tooling onto LLM-driven workflows without rethinking the assumptions underneath is how you end up paging at 3am because an agent made twelve API calls it shouldn't have, and nobody caught it until a downstream system started failing silently.
The honest answer is that agentic reliability engineering is still a young discipline. The patterns that work are emerging from teams who've shipped these systems into production and learned from what broke. This is a synthesis of those patterns.
// 01 What Makes Agentic Failure Different
Conventional software fails in ways you can enumerate in advance. A service returns a 500. A queue backs up. A database connection exhausts its pool. You write runbooks for these because the failure space is finite and repeatable.
LLM-driven systems fail laterally. The model decides to call a tool in a sequence nobody anticipated. An intermediate result gets misinterpreted and the agent pursues a coherent-looking but entirely wrong subgoal for several steps before anything surfaces an error. The tool itself succeeds and the HTTP response is 200; the action taken was just not the right one. Standard error rate monitoring doesn't catch that at all.
There's also the compounding problem. A single agent step that goes slightly sideways gets passed as context into the next step. The next step, working from bad context, produces output that's more confidently wrong. By step four or five, you have an agent that's substantially off course while your infrastructure metrics look completely healthy. This is the failure mode most teams underestimate before they've seen it in production.
!! Critical pattern
Most agentic failures don't surface as infrastructure errors. The underlying services are running fine. The failure is semantic, and catching it requires observability that understands intent, not just execution.
// 02 Guardrails Are Not Optional Middleware
Every production agentic system needs explicit guardrails at two layers: input validation before the model sees a request, and output validation before any tool call executes. Skipping either one is how you ship a system where a crafted user input convinces your agent to take actions in the wrong scope entirely.
Input guardrails handle the obvious cases like prompt injection and scope violations. They also handle subtler ones: requests that are syntactically valid but semantically outside the agent's intended operating domain. A customer support agent that starts reasoning about internal system configuration because a user phrased something in a particular way is a real failure class, and it doesn't require adversarial intent to trigger.
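To make the shape of that check concrete: an input guardrail sits in front of the agent and rejects requests outside the operating domain. In production this is usually a classifier or a second model call; the deny-list markers below are illustrative stand-ins that show where the check lives, not a real detection strategy.

```python
# Minimal input-scope guardrail sketch. The marker strings are
# hypothetical; a real system would use a trained classifier or an
# LLM-based judge instead of substring matching.
OUT_OF_SCOPE_MARKERS = ("system prompt", "internal config", "api key")

def in_scope(user_input: str) -> bool:
    """Runs before the agent ever sees the request."""
    lowered = user_input.lower()
    return not any(marker in lowered for marker in OUT_OF_SCOPE_MARKERS)
```

The important property is placement: the check runs before the model is invoked, so an out-of-scope request never enters the agent's context at all.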
Output guardrails are harder to think about because the model's output isn't final output in an agentic system, it's a plan. The validation question isn't "is this response coherent" but "is this sequence of tool calls within the bounds of what this agent is authorized to do." That means you need a formal definition of those bounds, which most teams don't have until after the first incident that would have been prevented by one.
Concretely: define tool call whitelists per agent role. Define parameter bounds for each tool. Run those checks synchronously before any tool executes. The overhead is trivial relative to the cost of an unbounded agent action reaching a production system.
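A minimal sketch of that synchronous check, with hypothetical role and tool names: each agent role maps to a tool allowlist, and each tool declares parameter bounds that must hold before anything executes.

```python
# Output guardrail sketch. Role names, tool names, and bounds are
# illustrative assumptions, not a prescribed schema.
ROLE_ALLOWLIST = {
    "support_agent": {"lookup_order", "issue_refund"},
}

PARAM_BOUNDS = {
    "issue_refund": lambda p: 0 < p.get("amount_usd", 0) <= 500,
    "lookup_order": lambda p: isinstance(p.get("order_id"), str),
}

def validate_tool_call(role: str, tool: str, params: dict) -> tuple[bool, str]:
    """Return (allowed, reason). Runs synchronously before any tool executes."""
    if tool not in ROLE_ALLOWLIST.get(role, set()):
        return False, f"tool '{tool}' not allowed for role '{role}'"
    check = PARAM_BOUNDS.get(tool)
    if check and not check(params):
        return False, f"parameters for '{tool}' outside defined bounds"
    return True, "ok"
```

Everything here is a plain lookup and a predicate, which is the point: the check adds microseconds, and a rejected call never reaches a production system.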
// 03 Evals Are Your Integration Tests for Nondeterministic Systems
The thing most people get wrong about evals is treating them as a benchmarking exercise. Leaderboard scores on standard tasks don't tell you whether your agent handles the specific edge cases your production traffic generates. That's what evals are for, and the eval suite you need to write is not the one the model provider published.
Build your eval set from production traces. Every time an agent workflow produces an unexpected outcome, that's a regression test waiting to be written. Every edge case your QA team surfaces becomes a fixture. The eval suite should grow directly from operational experience, not from abstract coverage goals.
Running evals in CI is table stakes. The pipeline catches prompt regressions before deployment the same way unit tests catch code regressions. What's harder to operationalize but equally important is continuous eval in production, where a sample of live traffic gets evaluated against expected behavior asynchronously and deviations alert. This is computationally expensive, which means you need to be deliberate about what you sample and what the alert thresholds are. But running evals only pre-deploy means you're blind to model drift, which happens without any code change at all when a provider updates a model version underneath you.
!! Pattern note
Model providers update base models without always announcing it in a way that triggers your deployment pipeline. Version-pin where possible. Eval continuously where you can't.
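The trace-to-fixture pattern can be sketched like this. The fixture shape, checker signature, and agent interface are assumptions for illustration, not any specific framework's API.

```python
# Eval fixtures derived from production incidents. Each fixture pairs an
# input with an explicit success criterion; a CI job runs every fixture
# and fails the build on regression.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalFixture:
    name: str
    input_text: str
    check: Callable[[str], bool]   # success criterion over agent output

def run_evals(fixtures: list, agent_fn: Callable[[str], str]) -> list:
    """Return names of failing fixtures; an empty list means green."""
    failures = []
    for f in fixtures:
        output = agent_fn(f.input_text)
        if not f.check(output):
            failures.append(f.name)
    return failures

# Hypothetical fixture written after an incident where the agent invented
# an order status instead of admitting the lookup had failed.
fixtures = [
    EvalFixture(
        name="missing-order-escalates",
        input_text="Where is order #000?",
        check=lambda out: "escalate" in out.lower(),
    ),
]
```

The same `run_evals` loop works pre-deploy on the full fixture set and asynchronously in production on sampled traffic; only the fixture source and alerting differ.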
// 04 Tool Failure Modes Deserve Their Own Runbooks
An agent's tool layer is a distributed system with all the failure modes that implies: timeouts, partial failures, rate limits, inconsistent responses under load. The difference from a standard microservice mesh is that the agent's behavior when a tool fails is itself nondeterministic. The model decides how to handle it. Sometimes it retries appropriately. Sometimes it hallucinates a successful result and keeps going. That second case is catastrophic and common enough to engineer against explicitly.
Every tool in your agent's toolkit needs a defined failure contract. What does a timeout look like to the calling agent? What's the retry policy, and crucially, is that policy enforced in the tool layer or left to the model? Leaving retry logic to the model's judgment is a bad bet. Enforce it in the tool wrapper, where it's deterministic and observable.
Idempotency matters here more than in most systems because agents will retry. A tool that creates a record, fails to return confirmation, and then creates a duplicate record on retry is producing real-world consequences. Write your tool wrappers to be idempotent and test that they actually are. This is basic API hygiene but teams skip it constantly in the rush to ship agentic features.
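Both properties can live in one wrapper. The sketch below assumes a hypothetical `create_record` backend: retries are enforced deterministically in the wrapper rather than left to the model, and a client-supplied idempotency key means a retry after a lost confirmation returns the original result instead of creating a duplicate.

```python
# Tool wrapper enforcing retry policy and idempotency outside the model.
# The backend callable and key scheme are illustrative assumptions.
import time

_results: dict[str, dict] = {}   # idempotency key -> stored result

def create_record_tool(payload: dict, idempotency_key: str,
                       backend, max_retries: int = 3) -> dict:
    if idempotency_key in _results:
        # Retry after a lost confirmation: return the original result,
        # never hit the backend twice for the same logical action.
        return _results[idempotency_key]
    last_err = None
    for attempt in range(max_retries):
        try:
            result = backend(payload)            # the real, possibly flaky call
            _results[idempotency_key] = result
            return result
        except TimeoutError as e:                # deterministic retry policy
            last_err = e
            time.sleep(0.01 * (2 ** attempt))    # exponential backoff
    raise RuntimeError(f"create_record failed after {max_retries} tries") from last_err
```

In a real deployment the `_results` cache would be a durable store keyed per tenant; an in-process dict is only enough to show the contract.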
Circuit breakers belong in the tool layer too. When a downstream dependency starts degrading, you want the agent's tool calls to fail fast and loudly, not slowly and silently. A slow tool that partially succeeds is worse than a fast tool that fails cleanly, because the agent's uncertainty about whether the action completed creates its own set of downstream problems.
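A minimal circuit breaker for that tool layer might look like the following (thresholds and reset timing are illustrative). After a run of consecutive failures the breaker opens, and subsequent calls fail fast instead of piling onto a degrading dependency.

```python
# Circuit breaker sketch for agent tool wrappers.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after          # seconds before a probe is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast and loudly: the agent gets an immediate,
                # unambiguous error rather than a slow partial success.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None               # half-open: allow one probe
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The open-circuit error is deliberately a distinct exception type from the tool's own failures, so the agent's error handling (and your observability) can tell "dependency degraded" apart from "this call failed".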
// 05 Observability for LLM-Driven Flows Requires a Different Stack
Your existing APM tooling captures what you need for the infrastructure layer. It doesn't capture what you need for the reasoning layer. Those are different problems and usually require different instrumentation.
For agentic systems, a trace is not just a span tree of HTTP calls. It's a record of the agent's reasoning steps, tool invocations, intermediate context state, and decision points. You need to be able to replay a production incident and understand not just which services were called in which order, but what the model was working from when it made each decision and what it produced. Without that, post-incident analysis is guesswork.
Structured logging of every prompt, every completion, every tool call input and output gives you the raw material. The tooling to make sense of it is still maturing. Platforms like LangSmith, Arize, and Honeycomb with appropriate instrumentation can get you to workable observability. OpenTelemetry-compatible tracing with LLM-specific attributes is emerging as the instrumentation standard worth building toward. The key is that every agent step gets a trace ID, that trace IDs propagate through tool calls, and that the full reasoning chain is recoverable for any production workflow.
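The core mechanics are small enough to sketch. Here a JSON-lines list stands in for whatever sink you actually use; the record fields are assumptions, with the trace ID carried on every step so the full reasoning chain is recoverable afterwards.

```python
# Structured step logging with trace ID propagation. The sink and record
# schema are illustrative; the invariant that matters is that every
# prompt, completion, and tool call carries the workflow's trace_id.
import json
import time
import uuid

def new_trace_id() -> str:
    return uuid.uuid4().hex

def log_step(sink: list, trace_id: str, step_type: str, payload: dict) -> None:
    """Append one structured record; step_type is e.g. 'prompt',
    'completion', 'tool_call', or 'tool_result'."""
    sink.append(json.dumps({
        "trace_id": trace_id,
        "ts": time.time(),
        "type": step_type,
        **payload,
    }))

def replay(sink: list, trace_id: str) -> list:
    """Recover the full reasoning chain for one workflow, in order."""
    records = (json.loads(line) for line in sink)
    return [r for r in records if r["trace_id"] == trace_id]
```

`replay` is the function you reach for in post-incident analysis: given a trace ID from an alert, it reconstructs what the model saw and produced at every step.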
Latency percentiles still matter at the infrastructure layer, but the metric that matters most at the agent layer is task completion rate: the percentage of initiated workflows that complete the intended goal without human intervention or error recovery. Measuring that requires knowing what the intended goal was, which means your agent workflows need explicit success criteria that the eval layer can check against.
// 06 Rollback for Agentic Systems Is Mostly Mitigation, Not Reversal
Here's the part most teams avoid thinking about until they have to: an agent that's sent an email, called an external API, modified a database record, or triggered a real-world workflow cannot be rolled back in the traditional sense. The action happened. Your rollback strategy for agentic systems is actually a mitigation and containment strategy, and it needs to be designed before deployment, not after the first incident.
The practical patterns are: design tools with compensating actions where possible, maintain an audit log of every agent action that's detailed enough to support manual remediation, and build kill switches that can halt an agent mid-workflow when an operator identifies a runaway execution. That last one sounds obvious but gets deprioritized constantly. A hard stop capability that an on-call engineer can trigger without a code deploy is non-negotiable for production agentic systems.
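One workable shape for that hard stop is a shared flag the on-call can flip without a deploy. The sketch below backs the flag with a file; a real system might use a feature-flag service or a key in Redis (both are assumptions, not a prescription). What matters is that the flag is checked between every agent step, not just at workflow start.

```python
# Kill-switch sketch: a flag an operator can flip without a code deploy.
# The file path is a hypothetical stand-in for a flag store.
import os
import tempfile

KILL_FILE = os.path.join(tempfile.gettempdir(), "agent_killswitch")

def halt_requested() -> bool:
    return os.path.exists(KILL_FILE)

def run_workflow(steps: list) -> list:
    """Execute steps, checking the switch before each one so a runaway
    agent can be halted mid-workflow, not only at the next workflow."""
    completed = []
    for step in steps:
        if halt_requested():
            raise RuntimeError("operator kill switch engaged; halting mid-workflow")
        completed.append(step())
    return completed
```

Engaging the switch is then an operational action (touch a file, flip a flag), which is exactly what makes it usable at 3am without waiting on a pipeline.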
Staged rollout applies here the same way it applies to any risky deployment. Shadow mode, where the agent reasons and produces a plan but doesn't execute tool calls, is underused as a pre-production validation step. It gives you the model's behavior on real traffic without the real-world consequences, which is the closest thing to a true staging environment an agentic system has.
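Shadow mode is mostly a dispatch-layer concern: the agent produces its plan against real traffic, but execution is swapped for a recorder. A sketch, with illustrative names:

```python
# Shadow-mode executor sketch. In shadow mode, tool calls are recorded
# instead of executed, and an inert result is fed back to the agent.
def real_dispatch(tool: str, params: dict):
    # Live execution path; out of scope for this sketch.
    raise NotImplementedError("live dispatch not wired up here")

def make_executor(shadow: bool, record: list):
    def execute(tool: str, params: dict):
        if shadow:
            record.append({"tool": tool, "params": params})  # log the plan only
            return {"shadow": True}   # inert result returned to the agent
        return real_dispatch(tool, params)
    return execute
```

The recorded plans are then reviewable offline: you see exactly which tool calls the agent would have made on real traffic, without any of them reaching a production system.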
// 07 SLOs for Systems That Don't Fail Cleanly
Defining SLOs for agentic systems forces a productive conversation most engineering orgs haven't had: what does "working correctly" even mean for an autonomous workflow? Infrastructure uptime is necessary but not sufficient. An agent system where the infrastructure is healthy but the agent is consistently completing the wrong task is not a working system, and your SLO framework should reflect that.
Useful SLO dimensions for agentic systems include task completion rate (workflows that reach a valid terminal state), tool call precision (the ratio of intended to total tool invocations per workflow), escalation rate (how often the agent correctly identifies it needs human intervention), and human override frequency (how often operators are correcting agent actions in production). That last metric is particularly honest. High override frequency is a direct signal that your eval coverage or your guardrail coverage is inadequate.
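The four dimensions reduce to simple aggregates over workflow records. The field names below are assumptions about what your trace store exposes per completed workflow.

```python
# SLO metric sketch over per-workflow records (field names illustrative).
def agent_slo_metrics(workflows: list) -> dict:
    n = len(workflows)
    completed = sum(w["reached_goal"] for w in workflows)
    intended = sum(w["intended_tool_calls"] for w in workflows)
    total = sum(w["total_tool_calls"] for w in workflows)
    escalated = sum(w["escalated_correctly"] for w in workflows)
    overridden = sum(w["human_override"] for w in workflows)
    return {
        "task_completion_rate": completed / n,
        "tool_call_precision": intended / total if total else 1.0,
        "escalation_rate": escalated / n,
        "human_override_frequency": overridden / n,
    }
```

The hard part is not this arithmetic but populating `reached_goal` and `intended_tool_calls` honestly, which is where the explicit success criteria from the eval layer come back in.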
Error budgets still apply. Spend yours on experiments that improve agent capability. When the budget runs out, freeze changes and focus on stabilization. The framework translates cleanly; the metrics feeding it just need to account for the semantic layer, not only the infrastructure layer.
// FAQ Common Questions
How do you set an error budget for a nondeterministic system?
Same principle as any SLO: define a reliability target, measure against it, and treat deviation as budget spend. The difference is that your reliability metric has to include semantic correctness, not just availability. A task completion rate target of 95% with a precision floor of 90% on tool calls is a concrete, measurable error budget. What you can't do is use infrastructure uptime as a proxy for agent reliability and call it an error budget.
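As a worked example of that arithmetic (the traffic volume is an illustrative assumption): a 95% completion target over a month's workflows yields a concrete count of failures you are allowed to spend.

```python
# Error budget implied by a 95% task-completion target.
target = 0.95
workflows_this_month = 40_000

budget = round((1 - target) * workflows_this_month)   # failed workflows allowed
failed_so_far = 1_650                                  # measured, not estimated
remaining = budget - failed_so_far
```

When `remaining` hits zero, the standard error-budget policy applies: freeze capability experiments and spend engineering time on stabilization instead.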
What's the right granularity for tracing an agent workflow?
Trace every reasoning step as a span. Tool call inputs and outputs get logged in full. Intermediate context state gets captured at each decision point. The cost is storage and some query complexity. The alternative is a production incident you can't diagnose because you don't know what the agent was working from when it went wrong. Pay the storage cost.
Should evals run in CI, production, or both?
Both, with different purposes. CI evals catch regressions before deploy using your curated test fixtures. Production evals catch drift and novel failure modes on live traffic that your test set doesn't cover. They're complementary, not redundant. The teams who only run CI evals get surprised by model provider updates. The ones who only run production evals are flying blind before deployment.
How do you handle a model provider changing the underlying model without notice?
Version-pin your model endpoints wherever the provider allows it. Run continuous evals on a sample of production traffic so drift surfaces quickly regardless of cause. And treat model version as a deployment artifact in your change management process, which means the eval suite runs against any proposed version change before it reaches production. This is a governance question as much as a technical one: who has authority to approve a model version change, and what evidence do they need before approving it?
What's the minimum viable observability setup for a team just shipping their first agentic system?
Structured logging of every prompt, completion, and tool call with a consistent trace ID. An async eval job running on a sample of production traces, checking against defined success criteria. A kill switch callable without a code deploy. That's the floor. Everything else (dedicated LLM observability platforms, sophisticated eval pipelines, semantic anomaly detection) is worth building toward but isn't the thing that saves you in the first incident.