AI Agents Fail in Production for a Boring Reason: Their Data Is Not Immutable, Queryable, or Close Enough
The money is there. The production gap is still massive.
Enterprise generative AI spend more than tripled, from $11.5B in 2024 to $37B in 2025, with roughly half of it landing in infrastructure and model access, depending on how you segment the stack. The point is simple: budgets are moving fast.
Sources: Menlo Ventures: 2025 State of Generative AI in the Enterprise. Report PDF.
At the same time, enterprise IT leaders are telling KPMG they are implementing or planning to implement AI agents within two years. That is not a hobby signal. That is an operations signal.
Source: KPMG: Is your infrastructure ready for your AI aspirations?
Yet even teams that ship agents report that observability is not optional. LangChain’s agent surveys frame this as a practical reality: production agents need tracing, controls, and feedback loops, because failures are often integration and state failures, not “the model is dumb.”
Source: LangChain: State of AI Agents
The hard truth: agents do not need more prompts. They need an immutable record.
Most “agent memory” implementations are a thin layer of retrieval glued onto a stateless loop. That can demo well and still fail in production for three reasons:
1) No audit trail
If you cannot reconstruct exactly what the agent saw, you cannot explain decisions, debug incidents, or satisfy governance requirements. “We think it retrieved the right docs” does not survive an outage review.
2) No reproducibility
If yesterday’s data is overwritten, compacted incorrectly, or missing late-arriving facts, you cannot rerun the same decision path. Evaluation becomes noisy. Rollbacks become guesswork.
3) Context is separated from data engineering reality
Context-window tokens are scarce and expensive. Serious memory systems move facts out of the prompt and into durable stores, then fetch selectively. Even vendor tooling converges on an “in-context working set” plus “archival memory” outside the context window.
Sources: Letta: Agent Memory. Letta Docs: Memory overview.
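A minimal sketch of that split, with illustrative names rather than any vendor’s actual API: every fact is written to a durable archive, only a small working set stays in the prompt, and the agent fetches archived facts selectively per step.

```python
# Sketch: working-set vs. archival memory (illustrative names, not a vendor API).
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working_set: list[str] = field(default_factory=list)   # lives in the prompt
    archive: dict[str, str] = field(default_factory=dict)  # lives in durable storage
    working_set_limit: int = 8

    def remember(self, key: str, fact: str) -> None:
        # Every fact lands in the archive; only a few stay in context.
        self.archive[key] = fact
        self.working_set.append(fact)
        if len(self.working_set) > self.working_set_limit:
            self.working_set.pop(0)  # evict from the prompt, never from storage

    def recall(self, keys: list[str]) -> list[str]:
        # Selective fetch: pull only the archived facts this step needs.
        return [self.archive[k] for k in keys if k in self.archive]

memory = AgentMemory()
memory.remember("ticket-4711", "Customer reported checkout latency on 2025-03-02.")
prompt_context = memory.working_set + memory.recall(["ticket-4711"])
```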
The streaming layer most teams get wrong
Before facts land in object storage, they usually flow through a streaming layer. This is where agent infrastructure silently breaks.
The common mistake: treating the streaming platform as a processing engine. Teams embed transformation logic, state management, and retention policies inside the broker. The result is a system that cannot separate "durable log" from "compute job," which means backfills, replays, and agent context assembly all compete for the same resources.
The storage-native alternative is simpler (a minimal sketch follows the list):
- Streaming as a durable commit log only. Events land, get a sequence number, and persist. No embedded processing.
- Object storage as the primary sink. The streaming layer writes directly to S3-compatible storage. Retention becomes a storage policy, not a broker tuning problem.
- Compute runs separately. Flink, Spark, or custom jobs read from storage, not from broker state. Replay is free. Backfill does not degrade real-time paths.
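Here is a sketch of the first two bullets, assuming an S3-compatible endpoint and a hypothetical agent-events bucket: each event lands as an immutable, sequence-keyed object, and retention becomes whatever lifecycle policy the bucket carries, not a broker setting.

```python
# Sketch: events persist as immutable objects; no broker-side state or compaction.
# Bucket name and key layout are illustrative assumptions.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "agent-events"

def append_event(stream: str, sequence: int, event: dict) -> str:
    """Write one event as an immutable object, keyed by stream and sequence."""
    key = f"{stream}/seq={sequence:012d}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key  # downstream jobs read these keys; nothing is ever updated in place

append_event("checkout", 42, {"type": "agent_decision", "run_id": "r-123",
                              "ts": "2025-03-02T10:15:00Z"})
```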
This pattern is not hypothetical. It is how teams avoid the "Kafka as database" trap that has burned multiple large-scale deployments. If your streaming layer tries to be storage and compute, it will eventually fail at both.
For agent infrastructure specifically: the event log is your audit trail. If it lives inside a stateful broker that compacts or expires data based on broker config, you cannot guarantee reproducibility. If it lands in object storage with Iceberg versioning on top, you can query any point in time indefinitely.
The storage-native pattern for agents
For agents, the winning architecture looks boring on paper. That is exactly why it works.
Layer 1: Immutable facts in object storage
Object storage wins on durability and cost, and it is increasingly treated as the default persistence layer for large scale AI workloads. Amazon S3, for example, is designed for 99.999999999% durability (11 nines) and stores data redundantly across multiple Availability Zones.
Sources: AWS Docs: Data protection in Amazon S3. AWS: S3 storage classes and durability
Why this matters for agents: the agent’s “ground truth” cannot be a mutable database row that changes under its feet. It needs event histories, snapshots, and the ability to replay.
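A sketch of what “the ability to replay” means in practice, reusing the illustrative agent-events bucket from above: list the immutable event objects for a stream, keep those at or before time T, and fold them into the state the agent actually saw.

```python
# Sketch: rebuild agent-visible state at time T from immutable event history.
# Assumes the illustrative "agent-events" bucket with JSON event objects.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "agent-events"

def replay_state(stream: str, as_of: str) -> dict:
    """Fold all events with ts <= as_of into the state the agent saw then."""
    state: dict = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{stream}/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            event = json.loads(body)
            if event["ts"] <= as_of:  # ISO-8601 strings compare lexicographically
                state.update(event.get("facts", {}))
    return state

state_at_decision_time = replay_state("checkout", as_of="2025-03-02T10:15:00Z")
```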
Layer 2: Versioned tables for time travel and reproducible decisions
Immutable logs are necessary but not sufficient. Agents also need queryable structure. Table formats like Apache Iceberg provide snapshot based versioning and time travel semantics, which makes “what did the agent know at time T” a solvable question, not a forensic project.
Source: lakeFS: Iceberg tables management and snapshots
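A hedged sketch of “what did the agent know at time T” using PyIceberg; the catalog and table names are assumptions, and the same question can be asked in Spark SQL with TIMESTAMP AS OF.

```python
# Sketch: read an Iceberg table as of a past snapshot (names are illustrative).
from datetime import datetime, timezone
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                  # catalog config from .pyiceberg.yaml / env
table = catalog.load_table("agent.context_facts")  # assumed namespace.table

# Find the snapshot that was current at decision time, then scan it.
decision_ms = int(datetime(2025, 3, 2, 10, 15, tzinfo=timezone.utc).timestamp() * 1000)
snapshot_id = None
for entry in table.history():                      # snapshot log, oldest first
    if entry.timestamp_ms <= decision_ms:
        snapshot_id = entry.snapshot_id

# If no snapshot predates the decision time, this falls back to the current one.
rows_at_decision_time = table.scan(snapshot_id=snapshot_id).to_arrow()
```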
Layer 3: Ephemeral compute that reads directly from storage
If compute is tied to ingestion services, every backfill, evaluation run, or agent replay competes with real time workloads. The more agents you add, the more fragile the system becomes.
The storage-native alternative decouples them: spin up compute when needed, read directly from object storage, write results back as new immutable artifacts, then shut down. This is the same decoupling that made modern data lakes work.
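One way to make that concrete is an in-process engine that reads Parquet straight from object storage and writes its result back as a new artifact. This sketch uses DuckDB; the bucket paths are assumptions, and S3 credentials are expected to come from your environment or DuckDB configuration.

```python
# Sketch: short-lived compute that reads from object storage, writes a new
# immutable artifact back, and disappears. Paths are illustrative.
import duckdb

con = duckdb.connect()            # in-memory database, lives only for this job
con.execute("INSTALL httpfs")     # S3 support
con.execute("LOAD httpfs")        # credentials come from env / DuckDB config

con.execute("""
    COPY (
        SELECT run_id, count(*) AS retrieved_docs, max(ts) AS last_event
        FROM read_parquet('s3://agent-events/checkout/*.parquet')
        GROUP BY run_id
    )
    TO 's3://agent-artifacts/evals/checkout_summary.parquet' (FORMAT parquet);
""")
con.close()                       # no long-lived cluster, no broker coupling
```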
If you want a concrete reference architecture in the streaming-to-storage direction, see “Streaming Data Becomes Storage-Native.” The key idea translates cleanly to agent context: durable records live in object storage, while compute scales independently.
Reference: Scalytics: Streaming Data Becomes Storage-Native
Where block storage fits: fast forks for agent sandboxes
Object storage is excellent for durable history. Some agent workflows also need fast, isolated, writable sandboxes. That is where modern block storage patterns matter: copy-on-write volumes, instant snapshots, and forkable environments for evaluation, migrations, and “try then revert” behaviors.
This is not a new concept. It is the same reason serious CI systems use ephemeral environments. Agents just amplify the need because they create more parallel experiments and more rollback scenarios.
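As one concrete (and assumed) setup: if the agent sandbox lives on a ZFS dataset, a fork is a copy-on-write snapshot plus a clone, and “revert” is just destroying the clone. The dataset names below are illustrative.

```python
# Sketch: fork and discard an agent sandbox backed by a ZFS dataset.
# Dataset names are illustrative; requires ZFS and the right privileges.
import subprocess

def zfs(*args: str) -> None:
    subprocess.run(["zfs", *args], check=True)

def fork_sandbox(base: str = "tank/agent-base", fork: str = "tank/agent-run-123") -> None:
    zfs("snapshot", f"{base}@fork")     # instant, copy-on-write snapshot
    zfs("clone", f"{base}@fork", fork)  # writable clone for the agent to mutate

def discard_sandbox(base: str = "tank/agent-base", fork: str = "tank/agent-run-123") -> None:
    zfs("destroy", fork)                # throw the experiment away
    zfs("destroy", f"{base}@fork")      # the base dataset is untouched throughout
```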
Federated execution is not optional. It is the only scalable way to assemble context.
Agent context rarely lives in one system. Tickets in one place. Customer events in another. Feature tables elsewhere. Documents scattered across drives and buckets.
This is where Apache Wayang becomes directly relevant. Wayang is a cross-platform data processing framework that can mix engines and data systems in one pipeline and optimize platform selection based on cost models.
Sources: Apache Wayang project site. Apache Wayang: About. SIGMOD Record: Apache Wayang overview
Practical translation for agent infrastructure: instead of building brittle one-off connectors, you define a context assembly job. Wayang can push parts of that job to the right engines, then return a compact context artifact the agent can rely on. This also reduces data gravity pain because you do not need to centralize everything before you can act.
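The sketch below is not Wayang’s API; it is a plain-Python illustration of what a context assembly job produces. Each fetcher stands in for a query pushed down to the system that owns the data, and the output is one compact, durable artifact the agent consumes. All names are assumptions.

```python
# Sketch of the pattern (not Wayang's API): query each system where it lives,
# keep only what the agent needs, emit one compact context artifact.
import json
from datetime import datetime, timezone

def fetch_open_tickets(customer_id: str) -> list[dict]:
    return [{"id": "T-4711", "summary": "checkout latency"}]            # stand-in: ticket system

def fetch_recent_events(customer_id: str) -> list[dict]:
    return [{"type": "checkout_failed", "ts": "2025-03-02T10:14:58Z"}]  # stand-in: event log

def fetch_feature_row(customer_id: str) -> dict:
    return {"ltv_bucket": "high", "churn_risk": 0.12}                   # stand-in: feature store

def assemble_context(customer_id: str) -> dict:
    artifact = {
        "customer_id": customer_id,
        "assembled_at": datetime.now(timezone.utc).isoformat(),
        "tickets": fetch_open_tickets(customer_id),
        "recent_events": fetch_recent_events(customer_id),
        "features": fetch_feature_row(customer_id),
    }
    with open(f"context-{customer_id}.json", "w", encoding="utf-8") as f:
        json.dump(artifact, f)  # durable artifact for the agent and for later audits
    return artifact

assemble_context("c-42")
```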
Note on provenance: the Scalytics team originally created Apache Wayang before it became a community-driven Apache project. Treat it as an Apache ecosystem building block, not a proprietary dependency.
If your event stream writes directly to object storage with no intermediate state, you get: infinite retention at storage cost (not broker cost), replay without impacting live traffic, and a clean separation between "what happened" and "what to do about it." The agent reads history from storage. It does not compete with your ingestion pipeline.
A production checklist you can actually use
1) Can you replay any agent decision from last month?
If not, you do not have agent infrastructure. You have a demo. Store immutable inputs and outputs, and version the structured layer.
2) Is your ingestion path isolated from evaluation and backfills?
If not, agents will eventually degrade your real-time services. Separate the write path from the read-heavy analytics and replay path.
3) Do you have a “context artifact” per run?
One run should produce a durable artifact: a snapshot pointer, a query plan, and the exact retrieved records (see the sketch after this list). This is what makes tracing meaningful, not just pretty dashboards.
4) Can you fork environments safely for tool execution?
Agents that touch systems need sandboxes. Block storage snapshots and copy-on-write patterns make “try then revert” operationally real.
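As a minimal sketch of item 3’s artifact, with illustrative field names: one small record per run, durable enough to rerun or audit the decision later.

```python
# Sketch: the per-run context artifact (field names are illustrative).
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunContextArtifact:
    run_id: str
    table_snapshot_id: int           # which table snapshot the run read
    query: str                       # the exact retrieval query / plan
    retrieved_record_ids: list[str]  # exactly what came back

artifact = RunContextArtifact(
    run_id="r-123",
    table_snapshot_id=8188117019530044361,
    query="SELECT * FROM agent.context_facts WHERE customer_id = 'c-42'",
    retrieved_record_ids=["fact-001", "fact-007"],
)
with open(f"run-{artifact.run_id}.json", "w", encoding="utf-8") as f:
    json.dump(asdict(artifact), f)
```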
The takeaway
Agent success is an infrastructure problem disguised as a prompt problem. Treat context as data engineering: immutable history in object storage, versioned tables for time travel, ephemeral compute for replay and evaluation, and federated execution to assemble context without centralizing everything.
When you do this, agents stop being magic. They become operable software.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.