Production CDC Architecture: Debezium Scaling Lessons

Production CDC architecture breaks under load long before most teams expect it. With Debezium, Kafka Connect, and Postgres, the failure patterns are consistent: WAL pressure builds up, connector lag drifts unnoticed, and snapshot phases exhaust memory under bursty traffic.

This article is based on running these pipelines across high-throughput systems, including workloads above 10k TPS. The difference between a system that merely works and one that holds under pressure comes down to observability, WAL discipline, and how connector scaling is handled.


Production Debezium CDC Architecture

Operational reality vs. tutorial defaults under real load (10k+ TPS)

The Default "Tutorial" Setup

Assumes low throughput and stable networks. Fails under pressure.

Source: Postgres
Single WAL Slot
  • Shared slot coupling multiple connectors
  • Default WAL retention settings
  • Risk: Replication Deadlocks during failover
Pipe: Kafka Connect
Default Debezium Config
  • tasksMax: 1 (No parallelism)
  • Standard JVM Heap (often -Xmx1G)
  • snapshot.mode: initial on huge tables
  • Failure: JVM OOM during snapshot phase
Sink: Kafka Brokers
Default Topic Config
  • Low partition count
  • No internal metric visibility
  • Failure: Silent Lag Accumulation
ULTIMATE OUTCOME:
Forced Full Resnapshots & Extended Outages

The Production-Hardened Setup

Prioritizes observability, isolation, and stability over theoretical guarantees.

Source: Postgres
Dedicated WAL Slots
  • Strict 1 slot per connector isolation
  • Tuning max_wal_senders & slots
  • Monitoring pg_replication_slots
  • Benefit: Isolated failure blast radius
Pipe: Kafka Connect
Hardened Spec (StatefulSet)
  • tasksMax: 4 (Scaled 1 task/2 vCPU)
  • Baseline -Xmx4G + G1GC tuning
  • Tuning poll.interval & max.batch.size
  • Benefit: Stable 10k+ TPS throughput
Sink: Kafka Brokers
Observability Enabled
  • JMX metrics scraped by Prometheus
  • Alerting on records-lag-max
  • Schema History Topic managed explicitly
  • Benefit: Incident detection in under 30 mins
ULTIMATE OUTCOME:
Sustained High Throughput & Rapid Recovery

Production CDC Architecture with Debezium: Scaling Kafka Connect Under Real Load

Production CDC architecture is the discipline of reliably capturing database changes at scale and streaming them into systems like Kafka without data loss, lag accumulation, or operational instability. In practice, this means managing WAL pressure, connector backlogs, schema drift, and downstream backpressure under sustained load.

I have built and operated Debezium CDC pipelines from Postgres to Kafka across more than a dozen production systems, including environments exceeding 10k TPS across 100+ tables. Under these conditions, default configurations fail. WAL backlogs grow silently, snapshots exhaust heap memory, and connector lag drifts for hours before detection.

In multiple production systems, introducing proper observability and WAL tuning reduced incident detection time from hours to minutes and eliminated repeated resnapshot cycles during peak load. What follows are the configurations, metrics, and decisions that consistently hold up in production.

Bottom Line Up Front

Monitor key JMX metrics such as records-lag-max and snapshot status from the first deployment. In several production systems, this alone detected more than 80 percent of incidents before they escalated. Without this baseline, teams end up diagnosing offset drift and WAL issues under pressure instead of preventing them.

Kafka Connect CDC Production Failure Modes Under Load

Tutorials make Debezium appear simple, but Kafka Connect CDC production environments expose real constraints. Multi-broker Kafka clusters, replication slot pressure, and downstream backpressure quickly surface weaknesses in default setups.

In a 12k TPS workload across 120 tables, default polling intervals caused lag to grow into hours. Snapshot phases triggered memory exhaustion in Connect workers, leading to repeated restarts and partial data loss scenarios. These failures appear consistently under bursty load.

Debezium documentation covers connector setup but does not address WAL retention behavior under sustained throughput or the impact of multi-table snapshots on JVM memory. Without production-grade monitoring, teams rely on logs and manual correlation between Postgres replication state and Kafka lag, which is too slow for real incidents.

Production CDC Architecture Observability

Observability is the foundation of production CDC architecture. Kafka Connect and Debezium expose critical metrics through JMX that must be scraped and monitored continuously.

  • records-lag-max tracks backlog growth and signals pipeline saturation
  • snapshot status detects stalled or long running snapshots
  • records-per-sec highlights throughput degradation
  • event.count per table identifies hot tables and uneven load distribution

In one deployment, alerting on lag exceeding one hour surfaced a replication backlog several hours before it caused downstream outages. This reduced mean time to recovery from multiple hours to under 30 minutes.
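
As a concrete sketch, that lag alert can live in a Prometheus rule file. The metric names below are placeholders: how records-lag-max and Debezium's MilliSecondsBehindSource end up being named in Prometheus depends entirely on your JMX exporter mapping.

groups:
  - name: cdc-pipeline
    rules:
      # Record-count lag on the Connect side (placeholder metric name)
      - alert: CdcRecordsLagHigh
        expr: kafka_connect_records_lag_max > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CDC connector record lag is growing"
      # Time-based lag from Debezium streaming metrics (placeholder metric name);
      # 3600000 ms matches the one-hour threshold described above
      - alert: CdcStreamingLagOverOneHour
        expr: debezium_streaming_milliseconds_behind_source > 3600000
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Debezium is more than one hour behind the source"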

Postgres must be monitored alongside Kafka. Queries against pg_replication_slots reveal slot pressure early. Once slot usage approaches critical thresholds, intervention is required immediately to avoid forced resnapshots.
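
A minimal query for that check, assuming Postgres 10 or later (on 13+, compare the retained size against max_slot_wal_keep_size):

-- Retained WAL per replication slot; a growing inactive slot is the
-- earliest warning sign of a stalled or orphaned connector
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;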

Debezium CDC Architecture Trade-Offs in Production

Across multiple production deployments, the primary optimization target is operational stability rather than theoretical correctness. Systems that prioritize perfect guarantees over throughput and recovery tend to fail under load.

Approach: Log-based CDC (pgoutput)
  • Advantages: Low database overhead, native schema evolution
  • Trade-offs: Requires WAL slot management and tuning
  • When to use: Default for Postgres production pipelines

Approach: Snapshot plus streaming
  • Advantages: Full recovery capability
  • Trade-offs: High CPU and memory usage during snapshot
  • When to use: Systems needing rebuild after outages

Approach: At-least-once delivery
  • Advantages: High throughput, simpler operations
  • Trade-offs: Duplicate handling required downstream
  • When to use: Analytics, IoT, event streaming above 1k TPS

Approach: Exactly-once semantics
  • Advantages: No duplicates, strong guarantees
  • Trade-offs: 20 to 50 percent latency overhead
  • When to use: Strict regulatory systems only

In one production system, enabling exactly-once semantics increased latency from 400 milliseconds to over 900 milliseconds and caused SLA violations. Reverting to at-least-once delivery with downstream deduplication was simpler and more reliable.

A critical rule from production experience is to never share WAL slots across connectors. Shared slots introduce failure coupling and can lead to replication deadlocks during failover, forcing full resnapshots and extended recovery time.

Postgres CDC Architecture: Running Production Connector Spec

For Postgres CDC architecture, I deploy Kafka Connect as a StatefulSet with a hardened Postgres connector spec. This is the baseline I use before tuning for table count, write volume, and failover behavior. It reflects what has held stable across high-throughput systems rather than default examples:

# Assumes the Strimzi operator; its KafkaConnector CRD provides the
# class/tasksMax/config structure used below.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: postgres-cdc
  labels:
    strimzi.io/cluster: connect-cluster  # must match your KafkaConnect cluster name
    app: debezium
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 4  # 1 task per 2 vCPU for 10k TPS
  config:
    database.hostname: prod-postgres.example.com
    database.port: 5432
    database.user: replicator
    database.password: ${str:secret-ref}
    database.dbname: inventory
    database.server.name: dbserver1
    table.include.list: public.inventory.*,public.customers
    slot.name: debezium_slot
    plugin.name: pgoutput  # Postgres 10+
    publication.name: dbz_publication
    snapshot.mode: initial
    heartbeat.interval.ms: 30000
    max.batch.size: 2048
    max.queue.size: 8192
    max.queue.size.in.bytes: 0  # 0 disables the byte-based queue limit
    poll.interval.ms: 500

Apply with kubectl apply -f connector.yaml. Scale tasksMax at roughly one task per two vCPUs, then validate under synthetic load. In my 10k TPS deployments, this ratio held stable without runaway lag or JVM pressure.
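
Before pushing synthetic load, it is worth confirming that the connector and its tasks report RUNNING through the Kafka Connect REST API; the host and connector name below come from the spec above, so adjust them for your environment:

# Connector state and every task state should be RUNNING
curl -s http://connect:8083/connectors/postgres-cdc/status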

Connect worker configuration needs to match the same production assumptions:

# connect-distributed.properties
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
offset.storage.replication.factor=3
config.storage.replication.factor=3
status.storage.replication.factor=3

# Heap and GC are set on the worker process environment, not in the properties file
KAFKA_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC"

Metrics must be exposed from the start. Prometheus scraping through a dedicated endpoint avoids blind spots during incidents:

# Scrape the Prometheus JMX exporter on each Connect worker. Use the exporter's
# port, not the Connect REST port (8083); 9404 below is a common default.
- job_name: 'kafka-connect'
  static_configs:
    - targets: ['connect:9404']
  metrics_path: /metrics

This setup is not static. Every production system diverges based on table distribution, write amplification, and downstream pressure, but starting from this baseline avoids the common failure modes that appear when relying on default connector configurations.

Postgres CDC Architecture Implementation

A typical production deployment uses Kafka Connect in a distributed setup with tuned connector configuration:

tasksMax: 4
max.batch.size: 2048
max.queue.size: 8192
poll.interval.ms: 500
heartbeat.interval.ms: 30000

Scaling follows a practical rule of one task per two CPU cores for sustained throughput around 10k TPS. JVM heap sizing must match table count and snapshot size, with 4GB as a baseline for mid-sized deployments.

Postgres configuration is equally critical. Settings such as max_replication_slots, max_wal_senders, and wal_sender_timeout must be tuned to avoid slot starvation and connection drops.
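
A sketch of the corresponding postgresql.conf settings; the numbers are illustrative starting points, not universal values:

# Required for pgoutput logical decoding
wal_level = logical
# Keep headroom above the number of connectors so failovers never compete for slots
max_replication_slots = 10
max_wal_senders = 10
# Avoid dropping slow consumers too aggressively during bursts
wal_sender_timeout = 60s
# Postgres 13+: cap how much WAL a lagging slot may retain before it is invalidated
max_slot_wal_keep_size = 50GB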

Advanced Debezium CDC Patterns

Schema evolution requires explicit handling. Using a schema history topic and structured formats such as Avro prevents failures during DDL changes.
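
A hedged sketch of the relevant connector settings, assuming Debezium 2.x property names and a Confluent-compatible schema registry at the (hypothetical) URLs shown:

# Added to the connector config block shown earlier
schema.history.internal.kafka.bootstrap.servers: kafka:9092
schema.history.internal.kafka.topic: schema-history.inventory
key.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://schema-registry:8081
value.converter: io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url: http://schema-registry:8081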

Connector isolation is another key pattern. Grouping tables into separate connectors limits blast radius when one schema or workload behaves unexpectedly.
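
As an illustration, two isolated connectors might split hot and cold tables, each with its own slot and publication so the one-slot-per-connector rule from earlier is preserved; the names here are hypothetical:

# Connector 1: high-churn tables on a dedicated slot
name: postgres-cdc-orders
slot.name: debezium_slot_orders
publication.name: dbz_pub_orders
table.include.list: public.orders,public.order_items

# Connector 2: slow-moving reference data, isolated from the hot path
name: postgres-cdc-reference
slot.name: debezium_slot_reference
publication.name: dbz_pub_reference
table.include.list: public.customers,public.products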

Snapshot strategy should remain controlled. Continuous snapshot modes often lead to memory pressure and instability. Initial snapshot plus streaming with heartbeat monitoring provides a stable baseline.

Across multiple systems, removing continuous snapshot modes eliminated recurring out-of-memory failures and reduced recovery time.

Operational Next Steps

Run this under load before trusting it. Introduce artificial lag, pause replication, and push throughput beyond expected peaks. Watch how quickly lag grows, how snapshots behave, and where recovery breaks down.
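
One simple drill, assuming the Connect REST API is reachable at connect:8083: pause the connector to simulate a stalled consumer, watch pg_replication_slots and records-lag-max climb, then resume and time how long the backlog takes to drain.

# Simulate a stalled pipeline: the replication slot keeps retaining WAL while paused
curl -X PUT http://connect:8083/connectors/postgres-cdc/pause

# Resume and measure how long the connector needs to work off the backlog
curl -X PUT http://connect:8083/connectors/postgres-cdc/resume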

In production CDC architecture, initial setup is not the hard part. Systems fail during spikes, failovers, and schema changes. The difference between a pipeline that works and one that holds under pressure comes down to observability, WAL discipline, and clear operational boundaries.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
