Why We Built Our Multi-Agent System on Kafka (And What We Learned)

The 3:47am Incident That Changed Our Architecture

At 3:47am on a Tuesday, our monitoring dashboard lit up. Three different teams had just published the same article about agent observability. Marketing wrote it for the corporate blog. Sales adapted it for a prospect deck. Content produced it for our technical newsletter.

All three versions were good. All three were complete. All three appeared within 20 minutes of each other.

The problem? Nobody knew the other teams were working on it.

This wasn’t a coordination failure. It was an architecture failure. Our multi-agent system had no shared truth. Each department’s agent operated independently, pulling from the same source material, generating similar content, with zero awareness of parallel work.

That morning, we started rebuilding our agent communication architecture on Kafka. Here’s why, what we learned, and the patterns that emerged from six months in production.


The Architecture That Failed

Our original multi-agent system looked clean on paper:

PaxMachina (Coordinator)
├── Marketing Agent (Blog content)
├── Sales Agent (Prospect materials)
├── Content Agent (Newsletter)
├── Research Agent (Competitive intel)
└── SEO Agent (Optimization)

Each agent monitored its own queue. When PaxMachina had a task, it would:

  1. Check which agent should handle it
  2. Write the task to that agent’s queue (a directory of JSON files)
  3. Poll for results (another directory)

Simple. Decentralized. Completely broken.

The fundamental problem: File-based queues provided no coordination. When Research finished a competitive analysis at 3:30am, three different content agents each grabbed it independently. They all thought “great, new research, let me write about this.” Zero communication. No shared state. Three duplicate articles.


Why We Chose Kafka

We evaluated three approaches to fix this:

Option 1: Centralized orchestrator

  • PaxMachina tracks all work in a database
  • Agents poll for tasks and check for duplicates
  • Creates a single point of failure
  • Database becomes bottleneck at scale

Option 2: Shared state database

  • Agents write “I’m working on X” to shared DB
  • Check database before starting work
  • Race conditions on who claims tasks
  • Coordination logic in every agent

Option 3: Event streaming (Kafka)

  • All agent communication flows through topics
  • Every event is visible to all agents
  • Immutable log provides shared truth
  • Requires Kafka infrastructure

We chose Kafka because the duplicate article problem was fundamentally about visibility. Agents needed to see what others were doing. File queues hid this. Kafka made it explicit.


Architecture: Kafka as Shared Truth

Topic Structure

We redesigned around event streams:

research-events topic:

  • Research agent publishes when new analysis completes
  • All content agents consume
  • Partitioned by research topic for ordering

content-claims topic:

  • Agents publish “I’m writing about X” before starting
  • Others consume and skip duplicate work
  • Short retention (24 hours)

content-results topic:

  • Completed articles, decks, newsletters
  • Partitioned by content type
  • Long retention for audit trail
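
For concreteness, here is roughly how those topics could be created with kafka-python’s admin client. Partition counts, the replication factor, and the broker address are illustrative, not our production values:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='kafka-1:9092')

admin.create_topics([
    # Keyed by research topic so events about the same topic stay ordered
    NewTopic('research-events', num_partitions=6, replication_factor=3),
    # Claims only need to outlive in-flight work: 24-hour retention
    NewTopic('content-claims', num_partitions=6, replication_factor=3,
             topic_configs={'retention.ms': str(24 * 60 * 60 * 1000)}),
    # Results are the audit trail: keep them around much longer (illustrative value)
    NewTopic('content-results', num_partitions=6, replication_factor=3,
             topic_configs={'retention.ms': str(90 * 24 * 60 * 60 * 1000)}),
])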

The Coordination Pattern

When Research completes competitive analysis:

# Research agent publishes completion
producer.send('research-events', {
    'event_type': 'analysis_complete',
    'topic': 'agent_observability_platforms',
    'output_path': 's3://research/obs-platforms-analysis.md',
    'timestamp': '2026-02-14T03:30:15Z'
})
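
One assumption baked into these snippets: the producer and consumers speak JSON, so dicts go in and dicts come out. A minimal setup along those lines (the group name and broker address are illustrative):

import json
from kafka import KafkaConsumer, KafkaProducer

# Assumed wiring for the `producer` and `consumer` used in the snippets:
# dict values are serialized to JSON on the way in and parsed on the way out.
producer = KafkaProducer(
    bootstrap_servers='kafka-1:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

consumer = KafkaConsumer(
    'research-events',
    bootstrap_servers='kafka-1:9092',
    group_id='marketing-agent',  # illustrative group name
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)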

Content agents consume and coordinate:

# Marketing agent consumes research events
for message in consumer:
    research = message.value
    
    # Check if anyone claimed this already
    existing_claim = check_content_claims(research['topic'])
    
    if existing_claim:
        log(f"Skipping {research['topic']} - already claimed by {existing_claim.agent}")
        continue
    
    # Claim the work
    producer.send('content-claims', {
        'agent': 'marketing',
        'topic': research['topic'],
        'format': 'blog_post',
        'claimed_at': now()
    })
    
    # Now safe to proceed
    write_article(research)

This pattern prevents the 3am duplicate article scenario. When Marketing claims “writing blog about agent observability,” Sales and Content see that claim and skip to the next topic.
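
check_content_claims does the real coordination work there, and we haven’t shown it above. A minimal sketch (rebuilding a local cache from the claims topic is one way to do it, not the only one) keyed by research topic:

import json
from collections import namedtuple
from kafka import KafkaConsumer

Claim = namedtuple('Claim', ['agent', 'format', 'claimed_at'])

# Local view of active claims, rebuilt from the content-claims topic
_claims = {}

def refresh_claims():
    claims_reader = KafkaConsumer(
        'content-claims',
        bootstrap_servers='kafka-1:9092',
        group_id=None,                  # throwaway reader, no offset commits
        enable_auto_commit=False,
        auto_offset_reset='earliest',   # claims topic only retains 24 hours
        consumer_timeout_ms=1000,       # stop iterating once caught up
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    )
    for msg in claims_reader:
        c = msg.value
        _claims[c['topic']] = Claim(c['agent'], c['format'], c['claimed_at'])
    claims_reader.close()

def check_content_claims(topic):
    """Return the existing claim for this research topic, or None."""
    refresh_claims()
    return _claims.get(topic)

Re-reading the claims topic right before claiming keeps the duplicate window small, though two agents claiming at almost the same instant can still race.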


The “Time Travel” Debugging Incident

A few days into production, Oracle (our audit agent) started producing nonsensical keyword recommendations. It suggested optimizing for “kubernetes observability” when we were writing about agent tracing.

The debugging process demonstrated Kafka’s value:

# Replay events for the workflow
kafka-console-consumer --bootstrap-server kafka-1:9092 \
  --topic research-events \
  --from-beginning \
  --property print.timestamp=true \
  | grep "workflow_7k2p"

We discovered Oracle was processing a research event from 12 hours earlier. The workflow looked like this:

  1. 3:00am: Research publishes analysis about Kubernetes
  2. 3:30am: Team writes article about Kubernetes (correctly)
  3. 2:00pm: Research publishes analysis about agent tracing
  4. 2:15pm: Oracle processes the 3am Kubernetes event (consumer lag)
  5. 2:20pm: Oracle recommends Kubernetes keywords for the agent tracing article (wrong)

The bug? Oracle’s consumer group fell behind due to external API rate limits. By the time it caught up, it was processing stale events in the wrong context.

The fix wouldn’t have been possible without Kafka’s immutable log. We replayed the exact sequence of events, saw the lag, and fixed Oracle’s rate limit handling. File-based queues would have shown us “Oracle got the wrong input” without the timeline to understand why.
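
The console consumer was all we needed that day, but the same replay can be done programmatically by rewinding a throwaway consumer to a timestamp. A sketch (the topic and the 3am timestamp come from the incident above; everything else is illustrative):

from datetime import datetime, timezone
from kafka import KafkaConsumer, TopicPartition

# Throwaway consumer: no group commits, just a rewindable view of the log
replayer = KafkaConsumer(
    bootstrap_servers='kafka-1:9092',
    group_id=None,
    enable_auto_commit=False,
    consumer_timeout_ms=10000,   # stop once we have read to the end
)

partitions = [TopicPartition('research-events', p)
              for p in replayer.partitions_for_topic('research-events')]
replayer.assign(partitions)

# Rewind to just before the 3am event
start = datetime(2026, 2, 14, 3, 0, tzinfo=timezone.utc)
offsets = replayer.offsets_for_times({tp: int(start.timestamp() * 1000)
                                      for tp in partitions})
for tp, found in offsets.items():
    if found is not None:
        replayer.seek(tp, found.offset)

for message in replayer:
    print(message.timestamp, message.value)   # walk the timeline forward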


Pattern: Cross-Department Event Visibility

The duplicate article problem revealed a deeper need: cross-department coordination.

We added a coordination layer:

# Department coordinators subscribe to all content claims
coordinators = {
    'marketing': MarketingCoordinator(),
    'sales': SalesCoordinator(),
    'content': ContentCoordinator()
}

consumer = KafkaConsumer(
    'content-claims',
    group_id='department-coordinators'
)

for message in consumer:
    claim = message.value
    
    # Notify all departments
    for dept, coordinator in coordinators.items():
        coordinator.notify_active_work(claim)
    
    # Update shared dashboard
    update_dashboard(claim)

Now when Marketing claims an article, Sales sees it on their dashboard. They can:

  • Skip duplicate work
  • Coordinate complementary content (blog + deck)
  • Time releases (don’t publish same day)

This coordination happens through Kafka events. No synchronous calls. No shared database locks. Just event streams that everyone can see.


Pattern: Exactly-Once Processing

Kafka’s exactly-once semantics prevent another class of bugs we hit early: duplicate processing.

Initial implementation without transactions:

# Buggy: can process twice if crash at wrong moment
for message in consumer:
    task = message.value
    
    result = process_task(task)  # Expensive LLM call
    
    producer.send('results', result)  # Publish result
    
    consumer.commit()  # Commit offset

If the agent crashes after publishing the result but before committing the offset, it reprocesses the task on restart. With LLM calls costing money, duplicate processing adds up.

With transactions:

# Fixed: atomic commit via the Kafka transactions API
# (requires a producer configured with a transactional.id)
producer.init_transactions()

for message in consumer:
    producer.begin_transaction()
    
    try:
        result = process_task(message.value)
        producer.send('results', result)
        
        # Atomic: result publish + offset commit
        producer.send_offsets_to_transaction(
            {TopicPartition('tasks', message.partition): message.offset + 1},
            consumer.config['group_id']
        )
        
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
        raise

If the agent crashes, the transaction aborts. Result isn’t published, offset isn’t committed. On restart, it reprocesses and completes atomically. Zero duplicate work.
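
One companion setting worth calling out (not shown above, and easy to miss): consumers downstream of the results topic should read only committed data, or they can still see output from aborted attempts. With the confluent-kafka Python client, for example, that is a one-line config (group name is illustrative):

from confluent_kafka import Consumer

# Downstream readers skip messages from aborted transactions
results_consumer = Consumer({
    'bootstrap.servers': 'kafka-1:9092',
    'group.id': 'results-readers',
    'isolation.level': 'read_committed',   # only committed results are visible
})
results_consumer.subscribe(['results'])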


Pattern: Progress Visibility

The 3am incident taught us another lesson: we couldn’t see what agents were doing in real-time.

We added progress events:

def write_article(research):
    total_steps = 5
    
    # Publish progress as we go
    publish_progress(step=1, total=total_steps, status="Analyzing research")
    analysis = analyze_research(research)
    
    publish_progress(step=2, total=total_steps, status="Generating outline")
    outline = generate_outline(analysis)
    
    publish_progress(step=3, total=total_steps, status="Writing draft")
    draft = write_draft(outline)
    
    publish_progress(step=4, total=total_steps, status="SEO optimization")
    optimized = optimize_seo(draft)
    
    publish_progress(step=5, total=total_steps, status="Publishing")
    publish(optimized)
    
    return optimized
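
publish_progress isn’t shown above; a minimal version publishes to a dedicated progress topic, keyed by agent so each agent’s updates stay in order (the topic and field names here are our illustration, not a fixed schema):

import json
from datetime import datetime, timezone
from kafka import KafkaProducer

progress_producer = KafkaProducer(
    bootstrap_servers='kafka-1:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

def publish_progress(step, total, status, agent='marketing'):
    # Keyed by agent name so the dashboard sees each agent's steps in order
    progress_producer.send('agent-progress', key=agent.encode('utf-8'), value={
        'agent': agent,
        'step': step,
        'total': total,
        'status': status,
        'timestamp': datetime.now(timezone.utc).isoformat(),
    })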

Progress events flow to a dashboard:

Marketing Agent: Writing draft (step 3/5)
Sales Agent: Idle
Content Agent: SEO optimization (step 4/5)
Research Agent: Processing competitive analysis

This visibility helps us:

  • Avoid starting duplicate work (see that Marketing is already writing)
  • Avoid timeout confusion (see that the agent is still making progress)
  • Plan capacity (see which agents are idle)

What Broke (And How We Fixed It)

Problem 1: Topic Retention Caught Us

We used Kafka’s default retention: 7 days. Two weeks later, we needed to debug an issue from 10 days ago. The events were gone.

Fix: Tiered retention

  • 7 days hot (Kafka brokers)
  • 30 days warm (Kafka tiered storage)
  • 90 days cold (S3 archive)

Now we can replay recent workflows instantly and access older events for deep analysis.
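
On the Kafka side, tiering is topic configuration (the brokers need tiered storage enabled; the 90-day S3 archive is a separate pipeline we won’t show here). A sketch with kafka-python’s admin client, values illustrative:

from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers='kafka-1:9092')

admin.alter_configs([
    ConfigResource(ConfigResourceType.TOPIC, 'research-events', configs={
        'remote.storage.enable': 'true',                       # tiered storage on
        'local.retention.ms': str(7 * 24 * 60 * 60 * 1000),    # 7 days hot on brokers
        'retention.ms': str(30 * 24 * 60 * 60 * 1000),         # 30 days total (warm tier)
    }),
])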

Problem 2: Consumer Lag Alerts Missing

Oracle’s “time travel” bug sat undetected for hours. We had no alerts on consumer lag.

Fix: Prometheus metrics + alerts

from prometheus_client import Gauge

# Expose lag metric
kafka_consumer_lag = Gauge(
    'kafka_consumer_lag_seconds',
    'Seconds behind latest message',
    ['consumer_group', 'topic', 'partition']
)

# Alert when lag > 5 minutes
for partition in consumer.assignment():
    lag = calculate_lag(partition)
    kafka_consumer_lag.labels(
        consumer_group=consumer.config['group_id'],
        topic=partition.topic,
        partition=partition.partition
    ).set(lag)
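
calculate_lag is left undefined there; one simple sketch approximates “seconds behind” as the age of the last record processed on each partition (offset-based lag via end_offsets()/position() works too, if message-count lag is enough):

import time
from kafka import TopicPartition

# Timestamp (ms) of the most recent record processed, per partition.
# record_processed(message) is called from the consume loop.
last_processed_ts = {}

def record_processed(message):
    tp = TopicPartition(message.topic, message.partition)
    last_processed_ts[tp] = message.timestamp

def calculate_lag(partition):
    """Approximate seconds behind: age of the last record we processed."""
    ts_ms = last_processed_ts.get(partition)
    if ts_ms is None:
        return 0.0
    return max(0.0, time.time() - ts_ms / 1000.0)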

Alert rule:

- alert: ConsumerLagging
  expr: kafka_consumer_lag_seconds > 300
  annotations:
    summary: "{{ $labels.consumer_group }} lagging on {{ $labels.topic }}"

Now we catch consumer lag before it causes data inconsistency.

Problem 3: Event Schema Evolution

Some days in, we needed to add a field to research events. Changing the schema broke running consumers.

Fix: Backward-compatible schema evolution

# Old schema
{
  'topic': 'agent_observability',
  'output_path': 's3://...'
}

# New schema (backward compatible)
{
  'topic': 'agent_observability',
  'output_path': 's3://...',
  'summary': 'Optional brief summary',  # New field
  'schema_version': 2  # Version marker
}

Consumers check schema version and handle accordingly:

research = message.value

if research.get('schema_version', 1) == 1:
    # Old schema: no summary field
    summary = "No summary available"
else:
    # New schema: summary present
    summary = research.get('summary', '')

This allows gradual migration without breaking existing consumers.


Operational Benefits We Didn’t Expect

Multi-Consumer Flexibility

We added a monitoring dashboard that consumes the same events agents process. Different consumer group, same topics. Zero agent code changes.

Later added:

  • Audit logging consumer for compliance
  • Analytics consumer for metrics
  • Testing consumer for integration tests

Each new consumer is independent. Agents don’t know or care how many systems read their events.
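
The mechanics are as simple as it sounds: a new consumer group gets its own copy of every event and its own offsets, with no changes to the agents. A sketch of the dashboard consumer (names are illustrative):

import json
from kafka import KafkaConsumer

dashboard_consumer = KafkaConsumer(
    'research-events', 'content-claims', 'content-results',
    bootstrap_servers='kafka-1:9092',
    group_id='monitoring-dashboard',   # its own group, its own offsets
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in dashboard_consumer:
    # Hand each event to whatever renders the dashboard
    print(message.topic, message.value)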

Natural Backpressure

When agents fall behind, tasks queue in Kafka. The queue depth (consumer lag) is visible and measurable.

Compare to our old file-based approach: if an agent was overloaded, new files just piled up in the directory. No visibility until something broke.

With Kafka, we see lag growing and can:

  • Scale up consumers (more agent instances)
  • Throttle producers (slow down task creation)
  • Alert operators (manual intervention needed)

Debugging Time Improvement

Before Kafka: Debugging the duplicate article incident would have required:

  1. Check each agent’s log files
  2. Correlate timestamps across systems
  3. Reconstruct timeline manually
  4. Hope logs weren’t rotated

With Kafka: Replay the workflow events in chronological order. See exactly what happened, when, and why. Total time: 10-15 minutes vs hours.


When Kafka Makes Sense

Our architecture works for our scale (7 agents, hundreds of daily workflows). Kafka makes sense when you have:

Event-driven coordination where visibility into other agents’ work prevents duplicate effort or coordination failures.

Need for replay and debugging where reconstructing what happened matters for incident response or optimization.

Multiple consumers of the same events where different systems (dashboards, analytics, audit) need visibility into agent activity.

Tolerance for operational complexity because running Kafka requires infrastructure knowledge, monitoring, and maintenance.

It doesn’t fit if:

  • You only need simple request-response patterns (HTTP is simpler)
  • There’s no Kafka expertise on the team (steep learning curve)
  • Event volume is low (the overhead isn’t justified)
  • You need turnkey simplicity over flexibility

Lessons for Your Architecture

If you’re evaluating Kafka for multi-agent systems:

Start with visibility problems: If agents can’t see what others are doing, Kafka helps. If they just need to queue tasks, simpler options exist.

Plan partition count upfront: You can’t easily increase partitions later without breaking ordering. Estimate scale and 5x it.

Instrument consumer lag immediately: Lag is your primary health metric. Alert when it exceeds reasonable bounds.

Schema evolution strategy from day one: Backward compatibility prevents breaking running systems when you need to add fields.

Retention tiers matter: Debugging requires history. Compliance may require years. Design archive strategy before going live.

Keep events small: Reference large payloads (research documents, generated content) externally. Kafka throughput degrades with large messages.


Two Weeks Later

The 3am duplicate article incident hasn’t repeated. Our Kafka-based architecture provides:

  • Zero duplicate work through content claim coordination
  • Complete audit trail of who did what when
  • 15-minute debugging instead of hours of log archaeology
  • Visible coordination across departments and agents

We still hit issues. Consumer lag. Schema evolution challenges. Partition hotspots. But these are operational problems with operational solutions, not architectural flaws that wake you up at 3am.

In retrospect, the simultaneous article publication was both obvious and a little terrifying. Of course independent agents would duplicate work without coordination. But it took a production incident to make the architecture failure visceral.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
