The production failure that changed how we instrument agents
I still remember the first time one of our production agent systems failed without actually crashing. An invoice-processing agent entered a recursive reasoning loop and burned through hundreds of dollars in API credits over a weekend because it kept insisting a validation error existed when it did not.
We had logs. We had metrics. We had distributed traces. None of them told the story we actually needed.
My team spent two days reconstructing the session by hand, trying to understand what the agent saw, which tool output pushed it in the wrong direction, and why the retry logic kept reinforcing the same bad conclusion. That was the moment it became obvious that classic observability patterns do not map cleanly to agent systems.
Traditional APM assumes stateless services, short execution windows, and clear request-response boundaries. Agents violate all three assumptions. They carry memory, branch into sub-tasks, wait on asynchronous tools, pause for humans, and make decisions that only make sense when you inspect the whole session. That is why the observability model has to shift from isolated logs to replayable event history.
This is also why we built our broader orchestration layer around Kafka rather than point-to-point coordination. I covered that architectural shift earlier in Why We Built Our Multi-Agent System on Kafka, but observability turned out to be one of the strongest arguments for that decision.
Why traditional observability breaks in agentic workflows
Most teams begin with the obvious stack. They push agent logs into CloudWatch, OpenSearch, ELK, Datadog, or another familiar toolchain. For simple assistants, that can be enough. For multi-agent production systems, it starts to collapse quickly.
The problem is not that these tools are useless. The problem is that they were designed for service health, not behavioral reconstruction.
In a real multi-agent workflow, a single user task might span hours, involve multiple tool invocations, branch into sub-agents, accumulate state over dozens of steps, and wait for external systems that return late, malformed, or contradictory results. By the time something fails, the evidence is spread across logs, traces, queues, and tool backends.
What engineers actually need in that moment is not another dashboard. They need answers to operational questions such as:
- What context did the agent hold when it made the bad decision?
- Which tool response caused the workflow to drift?
- Did the planner hand the wrong task to the wrong sub-agent?
- Was the failure caused by the model, the state store, the tool contract, or the retry policy?
- Can we replay the session exactly as it unfolded?
The Observability Paradigm Shift

| Dimension | Traditional Microservices | Agentic Workflows |
| --- | --- | --- |
| Execution | ⏱️ Milliseconds to seconds | ⏳ Minutes to hours |
| State | 📦 Stateless (context clears) | 🧠 Accumulates memory & context |
| Flow | 📏 Linear, request-response | 🔀 Branching, recursive, async |
| Output | 📊 Disconnected logs & traces | 📼 Replayable event streams |
That is why I keep coming back to the same conclusion: agent observability is not primarily a logging problem. It is a session reconstruction problem.
This production reality also connects directly to another pattern we have seen repeatedly: many agents do not fail because the model is weak. They fail because the data, context, or execution substrate is fragile. I made that case more directly in AI Agents Fail in Production for a Boring Reason: Their Data Is Not Ready, and observability is one of the fastest ways to expose those hidden failures.
The missing requirement is replayability
The breakthrough for us came when we stopped thinking about an agent run as a stream of log messages and started thinking about it as a sequence of state transitions.
A tool response changes the state. A planner decision changes the state. A human approval changes the state. A retry changes the state. A delegated task changes the state. Once you model the workflow that way, the requirement becomes obvious: if you cannot replay those transitions in order, you cannot explain the system with confidence.
Replayability changes the debugging model completely. Instead of guessing what happened from disconnected evidence, you can reconstruct what actually happened step by step.
That is why Kafka fits this problem so well. When you partition by session_id, you preserve the ordered event history of a session. When each important state transition is emitted as a structured event, you can rebuild the timeline, inspect branch points, compare tool inputs and outputs, and determine exactly where the workflow diverged.
This is not theoretical polish. It is the difference between spending two days on manual incident archaeology and isolating the bad tool response in minutes.
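The partitioning guarantee above can be illustrated without a running broker. Kafka's default partitioner hashes the record key with murmur2; the sketch below substitutes CRC32 (names like `NUM_PARTITIONS` and `make_event` are illustrative, not a specific client library's API), but it shows the property that matters: every event keyed by the same `session_id` lands on the same partition, so the session's state transitions stay ordered.

```python
import json
import zlib

NUM_PARTITIONS = 12  # illustrative topic size


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2 on the key bytes;
    # CRC32 stands in here to demonstrate same-key -> same-partition.
    return zlib.crc32(key) % num_partitions


def make_event(session_id: str, event_type: str, payload: dict) -> tuple[bytes, bytes]:
    # The (key, value) pair a producer would send; key = session_id,
    # so all of a session's state transitions stay on one partition.
    key = session_id.encode("utf-8")
    value = json.dumps({"event_type": event_type,
                        "session_id": session_id,
                        "payload": payload}).encode("utf-8")
    return key, value


events = [make_event("sess_8921", t, {})
          for t in ("session_start", "planner_decision", "tool_invocation")]
partitions = {partition_for(key) for key, _ in events}
# All three events hash to a single partition, preserving session order.
```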
Why Kafka works better than logs alone
Kafka is useful here for a simple reason: it treats the execution record as durable, ordered, and replayable by design.
That makes it a much better foundation for agent observability than a pile of uncorrelated log lines. Logs tell you what components wrote down. A replayable event stream tells you what the workflow did.
In practice, that means an engineer can inspect a session like this:
- User request enters the system.
- Planner agent decomposes the task.
- A validation sub-agent requests external payment status.
- The tool returns an ambiguous field.
- The agent interprets the field as a blocking error.
- The retry policy replays the same path without correcting state.
- The loop continues until a guardrail or budget limit intervenes.
(Diagram: Session Replay on Kafka, reconstructing state from the ordered event stream.)
That is the level of visibility teams actually need in production.
It also aligns with the larger architectural point we have made on Novatechflow before: agents on Kafka are only as strong as the underlying event contracts, state boundaries, and operational discipline around them. If the streaming layer is messy, the agents become confidently wrong faster. That is the heart of AI Agents On Kafka Are Only As Smart As Your Architecture.
What we capture as events
The easiest mistake here is under-instrumentation. Teams often log the final action but not the state transitions that explain why it happened.
For production multi-agent systems, the event model should capture more than request timing and tool latency. It should include the operational breadcrumbs needed to rebuild the session.
Core events we want in the stream
- session start and session end
- planner decisions and routing boundaries
- tool invocation requests
- tool responses and validation outcomes
- sub-agent handoffs
- policy interventions and safety gates
- retry attempts and termination conditions
- references to state snapshots or memory versions
- cost, token, and latency accumulation by step
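One way to make that list concrete is a small structured event type. This is a sketch under assumptions, not our production schema: the `EventType` values mirror the list above, and the `AgentEvent` fields follow the JSON example shown later in this section where they overlap.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum


class EventType(str, Enum):
    # Core transitions from the list above; extend as the workflow grows.
    SESSION_START = "session_start"
    SESSION_END = "session_end"
    PLANNER_DECISION = "planner_decision"
    TOOL_INVOCATION = "tool_invocation"
    TOOL_RESPONSE = "tool_response"
    SUBAGENT_HANDOFF = "subagent_handoff"
    POLICY_INTERVENTION = "policy_intervention"
    RETRY_ATTEMPT = "retry_attempt"


@dataclass
class AgentEvent:
    event_type: EventType
    session_id: str
    agent_id: str
    payload: dict
    timestamp: int = field(default_factory=lambda: int(time.time()))

    def to_json(self) -> str:
        # Serialize for the event stream; enum becomes its string value.
        d = asdict(self)
        d["event_type"] = self.event_type.value
        return json.dumps(d)


evt = AgentEvent(EventType.TOOL_INVOCATION, "sess_8921", "invoice_validator",
                 {"tool_name": "stripe_api.get_status", "retry_count": 1})
```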
We do not need to dump raw internal reasoning into storage to make the system debuggable. In many cases, that would be the wrong choice for privacy, safety, or compliance reasons. What we do need is enough structured evidence to reconstruct the operational path and inspect the decision boundaries.
{
  "event_type": "tool_invocation",
  "session_id": "sess_8921",
  "agent_id": "invoice_validator",
  "timestamp": 1709382002,
  "payload": {
    "tool_name": "stripe_api.get_status",
    "caller_step": "payment_check_04",
    "request_ref": "obj_1288",
    "state_snapshot_hash": "a1b2c3d4",
    "retry_count": 1,
    "budget_remaining": 14
  }
}
That kind of event is far more useful than a vague line such as “calling validation tool.” It tells you where in the workflow the call happened, what state version was active, and whether the system was already inside a retry loop.
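As a hypothetical illustration of why `state_snapshot_hash` earns its place in the payload: if consecutive retries carry the same snapshot hash, the agent is replaying an identical state rather than making progress, which is exactly the invoice loop from the opening story. A minimal detector might look like this (the event list and the threshold of three are made up for the example):

```python
from collections import Counter


def detect_stuck_retries(events: list[dict], threshold: int = 3) -> list[str]:
    # Flag (tool_name, state_snapshot_hash) pairs seen too often:
    # the same tool invoked against the same state version is a loop.
    seen = Counter(
        (e["payload"]["tool_name"], e["payload"]["state_snapshot_hash"])
        for e in events if e["event_type"] == "tool_invocation"
    )
    return [f"{tool} stuck at state {h}"
            for (tool, h), n in seen.items() if n >= threshold]


events = [
    {"event_type": "tool_invocation",
     "payload": {"tool_name": "stripe_api.get_status",
                 "state_snapshot_hash": "a1b2c3d4"}}
    for _ in range(4)
]
alerts = detect_stuck_retries(events)
# One alert: the same tool was called four times against identical state.
```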
Where standard tracing still helps, and where it falls short
This is not an argument against tracing. Distributed tracing is still valuable for latency, dependency visibility, and service-level health. The problem is that trace continuity in event-driven systems is often harder than teams expect, especially once Kafka sits between services and the workflow spans multiple asynchronous hops.
You can see that pain reflected in recurring Stack Overflow threads from engineers trying to propagate trace context across Kafka producers and consumers, or trying to restore trace IDs correctly after migrating to newer Spring and Micrometer tracing setups. Examples include Propagating & Managing TraceID across multiple events in Kafka, Configuration to enable tracing in Spring Kafka 3, and How to propagate distributed tracing context across Kafka messages.
Those discussions matter because they show the real implementation friction. Teams are not just asking how to visualize spans. They are struggling to keep context intact once workflows become asynchronous and event-driven.
We see a similar pattern on the agent side. Modern agent observability tools are improving quickly, and platforms such as LangSmith are putting more emphasis on tracing deep agent behavior and debugging multi-step workflows. That is useful progress, and it validates the operational need for better visibility into agent execution. But even there, the hardest production problems still come back to durable event history, state versioning, and replay across the broader system boundary.
Session replay is where debugging gets practical
Once the event stream is in place, the next step is building a replay workflow engineers can actually use during incidents.
A useful replay interface should show:
- the complete ordered event timeline for a session
- planner decisions and branch points
- tool inputs and outputs by step
- state changes or memory references between transitions
- retry loops and stop conditions
- cost accumulation over time
- handoffs across sub-agents
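Under the hood, that replay view is just a fold over the session's ordered events. A toy reconstruction, assuming event shapes like the JSON example earlier in this article (field names such as `cost_usd` are illustrative):

```python
def replay_session(events: list[dict]) -> dict:
    # Rebuild a session timeline from its ordered event stream,
    # accumulating cost and counting steps taken inside retries.
    timeline, total_cost, retry_steps = [], 0.0, 0
    for step, e in enumerate(events):
        p = e.get("payload", {})
        total_cost += p.get("cost_usd", 0.0)
        if p.get("retry_count", 0) > 0:
            retry_steps += 1
        timeline.append((step, e["event_type"], p.get("tool_name")))
    return {"timeline": timeline,
            "total_cost_usd": round(total_cost, 4),
            "retry_steps": retry_steps}


events = [
    {"event_type": "session_start", "payload": {}},
    {"event_type": "tool_invocation",
     "payload": {"tool_name": "stripe_api.get_status",
                 "cost_usd": 0.002, "retry_count": 1}},
    {"event_type": "session_end", "payload": {}},
]
summary = replay_session(events)
```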
That is the difference between observability as retention and observability as operations. Once you can replay the session, postmortems improve, incident response gets faster, and failure modes that once looked random start to look systematic.
This is also where commercial and open-source Kafka-native observability tooling starts to become useful. On the Scalytics side, that is exactly the space we have been pushing into with pieces such as Kafka Observability for AI Agents: Slash LLM Ops Costs and How KafScale Transforms Raw Kafka Data into Agent-Ready Context. The point is not just storing more telemetry. It is turning raw event history into operationally useful replay, debugging, and downstream evaluation.
Cost control matters as much as trace fidelity
The first objection to rich session-level observability is almost always cost. Teams assume that if they store enough detail to replay agent sessions, the infrastructure bill will explode.
That only happens if you treat every event as permanently hot data.
A better pattern is tiered storage. Recent sessions remain fast to query for active incident response. Older segments move into cheaper object storage while staying replayable when needed for audit, regression analysis, or compliance review. That model fits agent systems well because the most expensive failures are often not immediate crashes. They are slow-burn issues such as loops, silent drift, malformed contracts, or brittle handoff logic that only become obvious across many sessions.
Keeping replayable history while managing storage cost is one of the core reasons streaming-native architectures are a better fit than trying to bolt agent debugging onto a conventional logging pipeline after the fact.
The bigger lesson: observability has to follow the session, not the service
The real shift here is conceptual. With microservices, the unit of observability is often the service call. With agent systems, the more useful unit is the session.
That session might contain multiple agents, multiple tools, delayed responses, planner revisions, and external corrections. If your instrumentation does not follow that full path, you end up with impressive-looking dashboards that still cannot explain the behavior that mattered.
This is also why the orchestration design matters so much. In our own work with OpenClaw-based systems, context hygiene and state control repeatedly mattered more than abstract “agent intelligence.” I made that case in How I Built a Secure, High-Performance AI Agent Squad with OpenClaw and Building a Model-Agnostic Multi-Agent System with OpenClaw. Observability is inseparable from that discipline. If the workflow is opaque, the architecture is not production-ready.
Conclusion
Reliable multi-agent systems need more than logs, metrics, and a few spans around tool calls. They need replayable, ordered, session-level visibility.
That is why I believe agent observability should be designed as an event-streaming problem. Once every important state transition is captured durably and in order, debugging stops feeling like archaeology. You can inspect where the workflow drifted, which tool output caused the branch, why the retry loop persisted, and what to fix in the next release.
If you are serious about running agents in production, stop asking whether you have enough logs. Start asking whether you can replay the session.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.