The 3:47am Incident That Changed Our Architecture
At 3:47am on a Tuesday, our monitoring dashboard lit up. Three different teams had just published the same article about agent observability. Marketing wrote it for the corporate blog. Sales adapted it for a prospect deck. Content produced it for our technical newsletter.
All three versions were good. All three were complete. All three appeared within 20 minutes of each other.
The problem? Nobody knew the other teams were working on it.
This wasn’t a coordination failure. It was an architecture failure. Our multi-agent system had no shared truth. Each department’s agent operated independently, pulling from the same source material, generating similar content, with zero awareness of parallel work.
That morning, we rebuilt our agent communication architecture on Kafka. Here’s why, what we learned, and the patterns that emerged from six months in production.
The Architecture That Failed
Our original multi-agent system looked clean on paper:
PaxMachina (Coordinator)
├── Marketing Agent (Blog content)
├── Sales Agent (Prospect materials)
├── Content Agent (Newsletter)
├── Research Agent (Competitive intel)
└── SEO Agent (Optimization)
Each agent monitored its own queue. When PaxMachina had a task, it would:
- Check which agent should handle it
- Write the task to that agent’s queue (a directory of JSON files)
- Poll for results (another directory)
Simple. Decentralized. Completely broken.
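For context, the old dispatch path looked roughly like this. A reconstruction, not the original code: the queues/ directory layout and the dispatch/poll_results helper names are assumptions.

import json
import time
from pathlib import Path

def dispatch(task, agent):
    # Write the task as a JSON file into the agent's queue directory
    queue_dir = Path(f"queues/{agent}/inbox")
    queue_dir.mkdir(parents=True, exist_ok=True)
    (queue_dir / f"{task['id']}.json").write_text(json.dumps(task))

def poll_results(agent, timeout_seconds=600):
    # Poll the agent's results directory until a file shows up
    results_dir = Path(f"queues/{agent}/results")
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        files = sorted(results_dir.glob("*.json"))
        if files:
            return json.loads(files[0].read_text())
        time.sleep(5)
    return None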
The fundamental problem: File-based queues provided no coordination. When Research finished a competitive analysis at 3:30am, three different content agents each grabbed it independently. They all thought “great, new research, let me write about this.” Zero communication. No shared state. Three duplicate articles.
Why We Chose Kafka
We evaluated three approaches to fix this:
Option 1: Centralized orchestrator
- PaxMachina tracks all work in a database
- Agents poll for tasks and check for duplicates
- Creates a single point of failure
- Database becomes bottleneck at scale
Option 2: Shared state database
- Agents write “I’m working on X” to shared DB
- Check database before starting work
- Race conditions on who claims tasks
- Coordination logic in every agent
Option 3: Event streaming (Kafka)
- All agent communication flows through topics
- Every event is visible to all agents
- Immutable log provides shared truth
- Requires Kafka infrastructure
We chose Kafka because the duplicate article problem was fundamentally about visibility. Agents needed to see what others were doing. File queues hid this. Kafka made it explicit.
Architecture: Kafka as Shared Truth
Topic Structure
We redesigned around event streams:
research-events topic:
- Research agent publishes when new analysis completes
- All content agents consume
- Partitioned by research topic for ordering
content-claims topic:
- Agents publish “I’m writing about X” before starting
- Others consume and skip duplicate work
- Short retention (24 hours)
content-results topic:
- Completed articles, decks, newsletters
- Partitioned by content type
- Long retention for audit trail
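Here's one way these topics could be created with kafka-python's admin client. The partition counts and broker address are assumptions; the 24-hour retention on content-claims mirrors the bullet above, and content-results retention is discussed further down.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='kafka-1:9092')

admin.create_topics([
    # Keyed by research topic so events for one topic stay ordered
    NewTopic('research-events', num_partitions=12, replication_factor=3),
    # Claims only matter for a day; short retention keeps the topic small
    NewTopic('content-claims', num_partitions=6, replication_factor=3,
             topic_configs={'retention.ms': str(24 * 60 * 60 * 1000)}),
    # Results hold the audit trail; retention is tiered (see Problem 1 below)
    NewTopic('content-results', num_partitions=6, replication_factor=3),
])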
The Coordination Pattern
When Research completes competitive analysis:
# Research agent publishes completion
producer.send('research-events', {
    'event_type': 'analysis_complete',
    'topic': 'agent_observability_platforms',
    'output_path': 's3://research/obs-platforms-analysis.md',
    'timestamp': '2026-02-14T03:30:15Z'
}, key=b'agent_observability_platforms')  # key = research topic, so per-topic ordering holds
Content agents consume and coordinate:
# Marketing agent consumes research events
for message in consumer:
    research = message.value

    # Check if anyone claimed this already
    existing_claim = check_content_claims(research['topic'])
    if existing_claim:
        log(f"Skipping {research['topic']} - already claimed by {existing_claim['agent']}")
        continue

    # Claim the work
    producer.send('content-claims', {
        'agent': 'marketing',
        'topic': research['topic'],
        'format': 'blog_post',
        'claimed_at': now()
    })

    # Now safe to proceed
    write_article(research)
This pattern prevents the 3am duplicate article scenario. When Marketing claims “writing blog about agent observability,” Sales and Content see that claim and skip to the next topic.
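The check_content_claims helper is elided above. A minimal sketch, assuming each agent feeds an in-memory cache from its own consumer on content-claims; the active_claims name, the record_claim hook, and epoch-seconds claimed_at values are assumptions, not our exact implementation.

import time

active_claims = {}  # topic -> most recent claim event for that topic

def record_claim(claim):
    # Called for every message consumed from 'content-claims'
    active_claims[claim['topic']] = claim

def check_content_claims(topic, max_age_seconds=24 * 3600):
    # Return an existing recent claim for this topic, or None if it's free
    claim = active_claims.get(topic)
    if claim is None:
        return None
    if time.time() - claim['claimed_at'] > max_age_seconds:
        return None  # stale claim; the topic's 24-hour retention has the same effect
    return claim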
The “Time Travel” Debugging Incident
A few days into production, Oracle (our audit agent) started producing nonsensical keyword recommendations. It suggested optimizing for “kubernetes observability” when we were writing about agent tracing.
The debugging process demonstrated Kafka’s value:
# Replay events for the workflow
kafka-console-consumer --bootstrap-server kafka-1:9092 \
  --topic research-events \
  --from-beginning \
  --property print.timestamp=true \
  | grep "workflow_7k2p"
We discovered Oracle was processing a research event from 12 hours earlier. The workflow looked like this:
- 3:00am: Research publishes analysis about Kubernetes
- 3:30am: Team writes article about Kubernetes (correctly)
- 2:00pm: Research publishes analysis about agent tracing
- 2:15pm: Oracle processes the 3am Kubernetes event (consumer lag)
- 2:20pm: Oracle recommends Kubernetes keywords for the agent tracing article (wrong)
The bug? Oracle’s consumer group fell behind due to external API rate limits. By the time it caught up, it was processing stale events in the wrong context.
That diagnosis wouldn’t have been possible without Kafka’s immutable log. We replayed the exact sequence of events, saw the lag, and fixed Oracle’s rate limit handling. File-based queues would have told us that Oracle got the wrong input, without the timeline to understand why.
Pattern: Cross-Department Event Visibility
The duplicate article problem revealed a deeper need: cross-department coordination.
We added a coordination layer:
# Department coordinators subscribe to all content claims
import json
from kafka import KafkaConsumer

coordinators = {
    'marketing': MarketingCoordinator(),
    'sales': SalesCoordinator(),
    'content': ContentCoordinator()
}

consumer = KafkaConsumer(
    'content-claims',
    group_id='department-coordinators',
    value_deserializer=lambda v: json.loads(v)  # claims are JSON payloads
)

for message in consumer:
    claim = message.value

    # Notify all departments
    for dept, coordinator in coordinators.items():
        coordinator.notify_active_work(claim)

    # Update shared dashboard
    update_dashboard(claim)
Now when Marketing claims an article, Sales sees it on their dashboard. They can:
- Skip duplicate work
- Coordinate complementary content (blog + deck)
- Time releases (don’t publish same day)
This coordination happens through Kafka events. No synchronous calls. No shared database locks. Just event streams that everyone can see.
Pattern: Exactly-Once Processing
Kafka’s exactly-once semantics prevent another class of bugs we hit early: duplicate processing.
Initial implementation without transactions:
# Buggy: can process twice if crash at wrong moment
for message in consumer:
    task = message.value
    result = process_task(task)       # Expensive LLM call
    producer.send('results', result)  # Publish result
    consumer.commit()                 # Commit offset
If the agent crashes after publishing the result but before committing the offset, it reprocesses the task on restart. With LLM calls costing money, duplicate processing adds up.
With transactions:
# Fixed: atomic commit
producer.init_transactions()

for message in consumer:
    producer.begin_transaction()
    try:
        result = process_task(message.value)
        producer.send('results', result)

        # Atomic: result publish + offset commit
        producer.send_offsets_to_transaction(
            {TopicPartition('tasks', message.partition): message.offset + 1},
            consumer.config['group_id']
        )
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
        raise
If the agent crashes, the transaction aborts. Result isn’t published, offset isn’t committed. On restart, it reprocesses and completes atomically. Zero duplicate work.
Pattern: Progress Visibility
The 3am incident taught us another lesson: we couldn’t see what agents were doing in real-time.
We added progress events:
def write_article(research):
    total_steps = 5

    # Publish progress as we go
    publish_progress(step=1, total=total_steps, status="Analyzing research")
    analysis = analyze_research(research)

    publish_progress(step=2, total=total_steps, status="Generating outline")
    outline = generate_outline(analysis)

    publish_progress(step=3, total=total_steps, status="Writing draft")
    draft = write_draft(outline)

    publish_progress(step=4, total=total_steps, status="SEO optimization")
    optimized = optimize_seo(draft)

    publish_progress(step=5, total=total_steps, status="Publishing")
    publish(optimized)

    return optimized
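The publish_progress helper isn’t shown above. A minimal sketch, assuming an agent-progress topic and an AGENT_NAME constant; both names are assumptions, and it reuses the module-level producer from the earlier examples.

import time

AGENT_NAME = 'marketing'  # hypothetical constant identifying this agent instance

def publish_progress(step, total, status):
    # Emit a small progress event; the dashboard consumer renders these
    producer.send('agent-progress', {
        'agent': AGENT_NAME,
        'step': step,
        'total': total,
        'status': status,
        'timestamp': time.time()
    })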
Progress events flow to a dashboard:
Marketing Agent: Writing draft (step 3/5)
Sales Agent: Idle
Content Agent: SEO optimization (step 4/5)
Research Agent: Processing competitive analysis
This visibility lets us:
- Avoid starting duplicate work (we can see Marketing is already writing)
- Avoid timeout confusion (we can see an agent is still making progress)
- Plan capacity (we can see which agents are idle)
What Broke (And How We Fixed It)
Problem 1: Topic Retention Caught Us
We used Kafka’s default 7-day retention. Two weeks later, we needed to debug an issue from 10 days ago. The events were gone.
Fix: Tiered retention
- 7 days hot (Kafka brokers)
- 30 days warm (Kafka tiered storage)
- 90 days cold (S3 archive)
Now we can replay recent workflows instantly and access older events for deep analysis.
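The cold tier is just another consumer group. A sketch of an archiver that copies events to S3, assuming a kafka-events-archive bucket and one object per event keyed by topic and day; a production version would batch writes.

import json
import time

import boto3
from kafka import KafkaConsumer

s3 = boto3.client('s3')
consumer = KafkaConsumer(
    'research-events', 'content-claims', 'content-results',
    group_id='s3-archiver',
    value_deserializer=lambda v: json.loads(v)
)

for message in consumer:
    # Key by topic and day so older events are easy to find and replay
    day = time.strftime('%Y-%m-%d', time.gmtime(message.timestamp / 1000))
    key = f"{message.topic}/{day}/{message.partition}-{message.offset}.json"
    s3.put_object(
        Bucket='kafka-events-archive',
        Key=key,
        Body=json.dumps(message.value).encode('utf-8')
    )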
Problem 2: Missing Consumer Lag Alerts
Oracle’s “time travel” bug sat undetected for hours. We had no alerts on consumer lag.
Fix: Prometheus metrics + alerts
# Expose lag metric
from prometheus_client import Gauge

kafka_consumer_lag = Gauge(
    'kafka_consumer_lag_seconds',
    'Seconds behind latest message',
    ['consumer_group', 'topic', 'partition']
)

# Update the gauge for each assigned partition; the alert below fires at > 5 minutes
for partition in consumer.assignment():
    lag = calculate_lag(partition)
    kafka_consumer_lag.labels(
        consumer_group=consumer.config['group_id'],
        topic=partition.topic,
        partition=partition.partition
    ).set(lag)
Alert rule:
- alert: ConsumerLagging
  expr: kafka_consumer_lag_seconds > 300
  annotations:
    summary: "{{ $labels.consumer_group }} lagging on {{ $labels.topic }}"
Now we catch consumer lag before it causes data inconsistency.
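One detail elided above is calculate_lag. A sketch of how it could work with kafka-python, assuming the consumer tracks the timestamp of the last message it processed; the extra parameters are an assumption (the loop above passes only the partition).

import time

def calculate_lag(consumer, partition, last_processed_ts_ms):
    # Offset lag tells us whether there's any backlog at all
    end_offset = consumer.end_offsets([partition])[partition]
    position = consumer.position(partition)
    if end_offset - position <= 0:
        return 0.0  # caught up

    # Behind: approximate seconds of lag from the timestamp (epoch ms)
    # of the last message this consumer actually processed
    return max(0.0, time.time() - last_processed_ts_ms / 1000.0)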
Problem 3: Event Schema Evolution
Not long after launch, we needed to add a field to research events. Changing the schema broke running consumers.
Fix: Backward-compatible schema evolution
# Old schema
{
    'topic': 'agent_observability',
    'output_path': 's3://...'
}

# New schema (backward compatible)
{
    'topic': 'agent_observability',
    'output_path': 's3://...',
    'summary': 'Optional brief summary',  # New field
    'schema_version': 2                   # Version marker
}
Consumers check schema version and handle accordingly:
research = message.value

if research.get('schema_version', 1) == 1:
    # Old schema: no summary field
    summary = "No summary available"
else:
    # New schema: summary present
    summary = research.get('summary', '')
This allows gradual migration without breaking existing consumers.
Operational Benefits We Didn’t Expect
Multi-Consumer Flexibility
We added a monitoring dashboard that consumes the same events agents process. Different consumer group, same topics. Zero agent code changes.
Later we added:
- Audit logging consumer for compliance
- Analytics consumer for metrics
- Testing consumer for integration tests
Each new consumer is independent. Agents don’t know or care how many systems read their events.
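The audit logging consumer, for example, is only a few lines. A sketch, assuming an audit-log group id and append-only JSONL output; both are assumptions.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'content-claims', 'content-results',
    group_id='audit-log',  # independent group: no effect on the agents' consumers
    value_deserializer=lambda v: json.loads(v)
)

with open('audit.jsonl', 'a') as audit_file:
    for message in consumer:
        record = {
            'topic': message.topic,
            'offset': message.offset,
            'timestamp': message.timestamp,
            'event': message.value
        }
        audit_file.write(json.dumps(record) + '\n')
        audit_file.flush()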
Natural Backpressure
When agents fall behind, tasks queue in Kafka. The queue depth (consumer lag) is visible and measurable.
Compare to our old file-based approach: if an agent was overloaded, new files just piled up in the directory. No visibility until something broke.
With Kafka, we see lag growing and can:
- Scale up consumers (more agent instances)
- Throttle producers (slow down task creation)
- Alert operators (manual intervention needed)
Debugging Time Improvement
Before Kafka: Debugging the duplicate article incident would have required:
- Checking each agent’s log files
- Correlating timestamps across systems
- Reconstructing the timeline manually
- Hoping the logs hadn’t rotated
With Kafka: Replay the workflow events in chronological order. See exactly what happened, when, and why. Total time: 10-15 minutes vs hours.
When Kafka Makes Sense
Our architecture works for our scale (7 agents, hundreds of daily workflows). Kafka makes sense when you have:
Event-driven coordination where visibility into other agents’ work prevents duplicate effort or coordination failures.
Need for replay and debugging where reconstructing what happened matters for incident response or optimization.
Multiple consumers of the same events where different systems (dashboards, analytics, audit) need visibility into agent activity.
Tolerance for operational complexity because running Kafka requires infrastructure knowledge, monitoring, and maintenance.
It doesn’t fit if:
- Your workload is simple request-response (HTTP is simpler)
- Nobody on the team has Kafka expertise (steep learning curve)
- Event volume is low (the overhead isn’t justified)
- You need turnkey simplicity over flexibility
Lessons for Your Architecture
If you’re evaluating Kafka for multi-agent systems:
Start with visibility problems: If agents can’t see what others are doing, Kafka helps. If they just need to queue tasks, simpler options exist.
Plan partition count upfront: You can’t easily increase partitions later without breaking ordering. Estimate scale and 5x it.
Instrument consumer lag immediately: Lag is your primary health metric. Alert when it exceeds reasonable bounds.
Schema evolution strategy from day one: Backward compatibility prevents breaking running systems when you need to add fields.
Retention tiers matter: Debugging requires history. Compliance may require years. Design archive strategy before going live.
Keep events small: Reference large payloads (research documents, generated content) externally. Kafka throughput degrades with large messages.
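This is the pattern the research-events example already follows (it publishes an S3 path, not the document itself). A sketch of a generic guard, assuming a hypothetical 512 KB threshold, an agent-payloads bucket, and the module-level producer from earlier.

import json
import uuid

import boto3

s3 = boto3.client('s3')
MAX_INLINE_BYTES = 512 * 1024  # hypothetical threshold

def publish_event(topic, event):
    # Keep Kafka messages small: large events go to S3, Kafka carries a pointer
    payload = json.dumps(event).encode('utf-8')
    if len(payload) <= MAX_INLINE_BYTES:
        producer.send(topic, event)
        return
    key = f"{topic}/{uuid.uuid4()}.json"
    s3.put_object(Bucket='agent-payloads', Key=key, Body=payload)
    producer.send(topic, {'payload_path': f"s3://agent-payloads/{key}"})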
Six Months Later
The 3am duplicate article incident hasn’t repeated. Our Kafka-based architecture provides:
- Zero duplicate work through content claim coordination
- Complete audit trail of who did what when
- 15-minute debugging instead of hours of log archaeology
- Visible coordination across departments and agents
We still hit issues. Consumer lag. Schema evolution challenges. Partition hotspots. But these are operational problems with operational solutions, not architectural flaws that wake you up at 3am.
In retrospect, the simultaneous article publication was both obvious and a little terrifying. Of course independent agents would duplicate work without coordination. But it took a production incident to make the architecture failure visceral.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.