Production CDC Architecture: Debezium Scaling Lessons

Production CDC architecture breaks under load long before most teams expect it. With Debezium, Kafka Connect, and Postgres, the failure patterns are consistent: WAL pressure builds up, connector lag drifts unnoticed, and snapshot phases exhaust memory under bursty traffic.

This article is based on running these pipelines across high-throughput systems, including workloads above 10k TPS. The difference between a system that merely works and one that holds under pressure comes down to observability, WAL discipline, and how connector scaling is handled.


Production Debezium CDC Architecture

Operational reality vs. tutorial defaults under real load (10k+ TPS)

The Default "Tutorial" Setup

Assumes low throughput and stable networks. Fails under pressure.

Source: Postgres
Single WAL Slot
  • Shared slot coupling multiple connectors
  • Default WAL retention settings
  • Risk: Replication Deadlocks during failover
Pipe: Kafka Connect
Default Debezium Config
  • tasksMax: 1 (No parallelism)
  • Standard JVM Heap (often -Xmx1G)
  • snapshot.mode: initial on huge tables
  • Failure: JVM OOM during snapshot phase
Sink: Kafka Brokers
Default Topic Config
  • Low partition count
  • No internal metric visibility
  • Failure: Silent Lag Accumulation
ULTIMATE OUTCOME:
Forced Full Resnapshots & Extended Outages

The Production-Hardened Setup

Prioritizes observability, isolation, and stability over theoretical guarantees.

Source: Postgres
Dedicated WAL Slots
  • Strict 1 slot per connector isolation
  • Tuning max_wal_senders & slots
  • Monitoring pg_replication_slots
  • Benefit: Isolated failure blast radius
Pipe: Kafka Connect
Hardened Spec (StatefulSet)
  • tasksMax: 4 (Scaled 1 task/2 vCPU)
  • Baseline -Xmx4G + G1GC tuning
  • Tuning poll.interval & max.batch.size
  • Benefit: Stable 10k+ TPS throughput
Sink: Kafka Brokers
Observability Enabled
  • JMX metrics scraped by Prometheus
  • Alerting on records-lag-max
  • Schema History Topic managed explicitly
  • Benefit: Incident detection in under 30 mins
ULTIMATE OUTCOME:
Sustained High Throughput & Rapid Recovery

Production CDC Architecture with Debezium: Scaling Kafka Connect Under Real Load

Production CDC architecture is the discipline of reliably capturing database changes at scale and streaming them into systems like Kafka without data loss, lag accumulation, or operational instability. In practice, this means managing WAL pressure, connector backlogs, schema drift, and downstream backpressure under sustained load.

I have built and operated Debezium CDC pipelines from Postgres to Kafka across more than a dozen production systems, including environments exceeding 10k TPS across 100+ tables. Under these conditions, default configurations fail. WAL backlogs grow silently, snapshots exhaust heap memory, and connector lag drifts for hours before detection.

In multiple production systems, introducing proper observability and WAL tuning reduced incident detection time from hours to minutes and eliminated repeated resnapshot cycles during peak load. What follows are the configurations, metrics, and decisions that consistently hold up in production.

Bottom Line Up Front

Monitor key JMX metrics such as records-lag-max and snapshot status from the first deployment. In several production systems, this alone detected more than 80 percent of incidents before they escalated. Without this baseline, teams end up diagnosing offset drift and WAL issues under pressure instead of preventing them.

Kafka Connect CDC Production Failure Modes Under Load

Tutorials make Debezium appear simple, but Kafka Connect CDC production environments expose real constraints. Multi-broker Kafka clusters, replication slot pressure, and downstream backpressure quickly surface weaknesses in default setups.

In a 12k TPS workload across 120 tables, default polling intervals caused lag to grow into hours. Snapshot phases triggered memory exhaustion in Connect workers, leading to repeated restarts and partial data loss scenarios. These failures appear consistently under bursty load.

Debezium documentation covers connector setup but does not address WAL retention behavior under sustained throughput or the impact of multi-table snapshots on JVM memory. Without production-grade monitoring, teams rely on logs and manual correlation between Postgres replication state and Kafka lag, which is too slow for real incidents.

Production CDC Architecture Observability

Observability is the foundation of production CDC architecture. Kafka Connect and Debezium expose critical metrics through JMX that must be scraped and monitored continuously.

  • records-lag-max tracks backlog growth and signals pipeline saturation
  • snapshot status detects stalled or long running snapshots
  • records-per-sec highlights throughput degradation
  • event.count per table identifies hot tables and uneven load distribution

In one deployment, alerting on lag exceeding one hour surfaced a replication backlog several hours before it caused downstream outages. This reduced mean time to recovery from multiple hours to under 30 minutes.
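
As a concrete sketch, that lag alert can live in a Prometheus rule file. The metric names below are placeholders: how records-lag-max and Debezium's MilliSecondsBehindSource end up being named in Prometheus depends entirely on your JMX exporter mapping.

groups:
  - name: cdc-pipeline
    rules:
      # Record-count lag on the Connect side (placeholder metric name)
      - alert: CdcRecordsLagHigh
        expr: kafka_connect_records_lag_max > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CDC connector record lag is growing"
      # Time-based lag from Debezium streaming metrics (placeholder metric name);
      # 3600000 ms matches the one-hour threshold described above
      - alert: CdcStreamingLagOverOneHour
        expr: debezium_streaming_milliseconds_behind_source > 3600000
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Debezium is more than one hour behind the source"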

Postgres must be monitored alongside Kafka. Queries against pg_replication_slots reveal slot pressure early. Once slot usage approaches critical thresholds, intervention is required immediately to avoid forced resnapshots.
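
A minimal query for that check, assuming Postgres 10 or later (on 13+, compare the retained size against max_slot_wal_keep_size):

-- Retained WAL per replication slot; a growing inactive slot is the
-- earliest warning sign of a stalled or orphaned connector
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;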

Debezium CDC Architecture Trade-Offs in Production

Across multiple production deployments, the primary optimization target is operational stability rather than theoretical correctness. Systems that prioritize perfect guarantees over throughput and recovery tend to fail under load.

Approach: Log-based CDC (pgoutput)
  • Advantages: Low database overhead, native schema evolution
  • Trade-offs: Requires WAL slot management and tuning
  • When to use: Default for Postgres production pipelines

Approach: Snapshot plus streaming
  • Advantages: Full recovery capability
  • Trade-offs: High CPU and memory usage during snapshot
  • When to use: Systems needing rebuild after outages

Approach: At-least-once delivery
  • Advantages: High throughput, simpler operations
  • Trade-offs: Duplicate handling required downstream
  • When to use: Analytics, IoT, event streaming above 1k TPS

Approach: Exactly-once semantics
  • Advantages: No duplicates, strong guarantees
  • Trade-offs: 20 to 50 percent latency overhead
  • When to use: Strict regulatory systems only

In one production system, enabling exactly-once semantics increased latency from 400 milliseconds to over 900 milliseconds and caused SLA violations. Reverting to at-least-once delivery with downstream deduplication was simpler and more reliable.

A critical rule from production experience is to never share WAL slots across connectors. Shared slots introduce failure coupling and can lead to replication deadlocks during failover, forcing full resnapshots and extended recovery time.

Postgres CDC Architecture: Running Production Connector Spec

For Postgres CDC architecture, I deploy Kafka Connect as a StatefulSet with a hardened Postgres connector spec. This is the baseline I use before tuning for table count, write volume, and failover behavior. It reflects what has held stable across high-throughput systems rather than default examples:

# Assumes the Strimzi operator; its KafkaConnector CRD provides the
# class/tasksMax/config structure used below.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: postgres-cdc
  labels:
    strimzi.io/cluster: connect-cluster  # must match your KafkaConnect cluster name
    app: debezium
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 4  # 1 task per 2 vCPU for 10k TPS
  config:
    database.hostname: prod-postgres.example.com
    database.port: 5432
    database.user: replicator
    database.password: ${str:secret-ref}
    database.dbname: inventory
    database.server.name: dbserver1
    table.include.list: public.inventory.*,public.customers
    slot.name: debezium_slot
    plugin.name: pgoutput  # Postgres 10+
    publication.name: dbz_publication
    snapshot.mode: initial
    heartbeat.interval.ms: 30000
    max.batch.size: 2048
    max.queue.size: 8192
    max.queue.size.in.bytes: 0  # 0 disables the byte-based queue limit
    poll.interval.ms: 500

Apply with kubectl apply -f connector.yaml. Scale tasksMax at roughly one task per two vCPUs, then validate under synthetic load. In my 10k TPS deployments, this ratio held stable without runaway lag or JVM pressure.
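
Before pushing synthetic load, it is worth confirming that the connector and its tasks report RUNNING through the Kafka Connect REST API; the host and connector name below come from the spec above, so adjust them for your environment:

# Connector state and every task state should be RUNNING
curl -s http://connect:8083/connectors/postgres-cdc/status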

Connect worker configuration needs to match the same production assumptions:

# connect-distributed.properties
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
offset.storage.replication.factor=3
config.storage.replication.factor=3
status.storage.replication.factor=3

# Heap and GC are set on the worker process environment, not in the properties file
KAFKA_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC"

Metrics must be exposed from the start. Prometheus scraping through a dedicated endpoint avoids blind spots during incidents:

# Scrape the Prometheus JMX exporter on each Connect worker. Use the exporter's
# port, not the Connect REST port (8083); 9404 below is a common default.
- job_name: 'kafka-connect'
  static_configs:
    - targets: ['connect:9404']
  metrics_path: /metrics

This setup is not static. Every production system diverges based on table distribution, write amplification, and downstream pressure, but starting from this baseline avoids the common failure modes that appear when relying on default connector configurations.

Postgres CDC Architecture Implementation

A typical production deployment uses Kafka Connect in a distributed setup with tuned connector configuration:

tasksMax: 4
max.batch.size: 2048
max.queue.size: 8192
poll.interval.ms: 500
heartbeat.interval.ms: 30000

Scaling follows a practical rule of one task per two CPU cores for sustained throughput around 10k TPS. JVM heap sizing must match table count and snapshot size, with 4GB as a baseline for mid-sized deployments.

Postgres configuration is equally critical. Settings such as max_replication_slots, max_wal_senders, and wal_sender_timeout must be tuned to avoid slot starvation and connection drops.
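
A sketch of the corresponding postgresql.conf settings; the numbers are illustrative starting points, not universal values:

# Required for pgoutput logical decoding
wal_level = logical
# Keep headroom above the number of connectors so failovers never compete for slots
max_replication_slots = 10
max_wal_senders = 10
# Avoid dropping slow consumers too aggressively during bursts
wal_sender_timeout = 60s
# Postgres 13+: cap how much WAL a lagging slot may retain before it is invalidated
max_slot_wal_keep_size = 50GB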

Advanced Debezium CDC Patterns

Schema evolution requires explicit handling. Using a schema history topic and structured formats such as Avro prevents failures during DDL changes.
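
A hedged sketch of the relevant connector settings, assuming Debezium 2.x property names and a Confluent-compatible schema registry at the (hypothetical) URLs shown:

# Added to the connector config block shown earlier
schema.history.internal.kafka.bootstrap.servers: kafka:9092
schema.history.internal.kafka.topic: schema-history.inventory
key.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://schema-registry:8081
value.converter: io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url: http://schema-registry:8081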

Connector isolation is another key pattern. Grouping tables into separate connectors limits blast radius when one schema or workload behaves unexpectedly.
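
As an illustration, two isolated connectors might split hot and cold tables, each with its own slot and publication so the one-slot-per-connector rule from earlier is preserved; the names here are hypothetical:

# Connector 1: high-churn tables on a dedicated slot
name: postgres-cdc-orders
slot.name: debezium_slot_orders
publication.name: dbz_pub_orders
table.include.list: public.orders,public.order_items

# Connector 2: slow-moving reference data, isolated from the hot path
name: postgres-cdc-reference
slot.name: debezium_slot_reference
publication.name: dbz_pub_reference
table.include.list: public.customers,public.products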

Snapshot strategy should remain controlled. Continuous snapshot modes often lead to memory pressure and instability. Initial snapshot plus streaming with heartbeat monitoring provides a stable baseline.

Across multiple systems, removing continuous snapshot modes eliminated recurring out-of-memory failures and reduced recovery time.

Operational Next Steps

Run this under load before trusting it. Introduce artificial lag, pause replication, and push throughput beyond expected peaks. Watch how quickly lag grows, how snapshots behave, and where recovery breaks down.
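
One simple drill, assuming the Connect REST API is reachable at connect:8083: pause the connector to simulate a stalled consumer, watch pg_replication_slots and records-lag-max climb, then resume and time how long the backlog takes to drain.

# Simulate a stalled pipeline: the replication slot keeps retaining WAL while paused
curl -X PUT http://connect:8083/connectors/postgres-cdc/pause

# Resume and measure how long the connector needs to work off the backlog
curl -X PUT http://connect:8083/connectors/postgres-cdc/resume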

In production CDC architecture, initial setup is not the hard part. Systems fail during spikes, failovers, and schema changes. The difference between a pipeline that works and one that holds under pressure comes down to observability, WAL discipline, and clear operational boundaries.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
