Production CDC architecture breaks under load long before most teams expect it. With Debezium, Kafka Connect, and Postgres, the failure patterns are consistent: WAL pressure builds up, connector lag drifts unnoticed, and snapshot phases exhaust memory under bursty traffic.
This is based on running these pipelines across high-throughput systems, including workloads above 10k TPS. The difference between a system that works and one that holds under pressure comes down to observability, WAL discipline, and how connector scaling is handled.
Production Debezium CDC Architecture
Operational reality vs. tutorial defaults under real load (10k+ TPS)

The tutorial-default setup assumes low throughput and stable networks, and fails under pressure:

- A shared replication slot coupling multiple connectors and default WAL retention settings (risk: replication deadlocks during failover)
- tasksMax: 1 with no parallelism, a standard JVM heap (often -Xmx1G), and snapshot.mode: initial on huge tables (failure: JVM OOM during the snapshot phase)
- Low partition counts and no internal metric visibility (failure: silent lag accumulation)

The result: forced full resnapshots and extended outages.

The production setup prioritizes observability, isolation, and stability over guarantees:

- Strict one-slot-per-connector isolation, tuned max_wal_senders and replication slots, and monitoring of pg_replication_slots (benefit: isolated failure blast radius)
- tasksMax: 4 (scaled at one task per 2 vCPU), a -Xmx4G baseline with G1GC tuning, and tuned poll.interval and max.batch.size (benefit: stable 10k+ TPS throughput)
- JMX metrics scraped by Prometheus, alerting on records-lag-max, and an explicitly managed schema history topic (benefit: incident detection in under 30 minutes)

The result: sustained high throughput and rapid recovery.
Production CDC Architecture with Debezium: Scaling Kafka Connect Under Real Load
Production CDC architecture is the discipline of reliably capturing database changes at scale and streaming them into systems like Kafka without data loss, lag accumulation, or operational instability. In practice, this means managing WAL pressure, connector backlogs, schema drift, and downstream backpressure under sustained load.
I have built and operated Debezium CDC pipelines from Postgres to Kafka across more than a dozen production systems, including environments exceeding 10k TPS across 100+ tables. Under these conditions, default configurations fail. WAL backlogs grow silently, snapshots exhaust heap memory, and connector lag drifts for hours before detection.
In multiple production systems, introducing proper observability and WAL tuning reduced incident detection time from hours to minutes and eliminated repeated resnapshot cycles during peak load. What follows are the configurations, metrics, and decisions that consistently hold up in production.
Bottom Line Up Front
Monitor key JMX metrics such as records-lag-max and snapshot status from the first deployment. In several production systems, this alone detected more than 80 percent of incidents before they escalated. Without this baseline, teams end up diagnosing offset drift and WAL issues under pressure instead of preventing them.
Kafka Connect CDC Production Failure Modes Under Load
Tutorials make Debezium appear simple, but Kafka Connect CDC production environments expose real constraints. Multi-broker Kafka clusters, replication slot pressure, and downstream backpressure quickly surface weaknesses in default setups.
In a 12k TPS workload across 120 tables, default polling intervals caused lag to grow into hours. Snapshot phases triggered memory exhaustion in Connect workers, leading to repeated restarts and partial data loss scenarios. These failures appear consistently under bursty load.
Debezium documentation covers connector setup but does not address WAL retention behavior under sustained throughput or the impact of multi-table snapshots on JVM memory. Without production-grade monitoring, teams rely on logs and manual correlation between Postgres replication state and Kafka lag, which is too slow for real incidents.
Production CDC Architecture Observability
Observability is the foundation of production CDC architecture. Kafka Connect and Debezium expose critical metrics through JMX that must be scraped and monitored continuously.
- records-lag-max: tracks backlog growth and signals pipeline saturation
- snapshot status: detects stalled or long-running snapshots
- records-per-sec: highlights throughput degradation
- event.count per table: identifies hot tables and uneven load distribution
In one deployment, alerting on lag exceeding one hour surfaced a replication backlog several hours before it caused downstream outages. This reduced mean time to recovery from multiple hours to under 30 minutes.
Postgres must be monitored alongside Kafka. Queries against pg_replication_slots reveal slot pressure early. Once slot usage approaches critical thresholds, intervention is required immediately to avoid forced resnapshots.
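As one illustration, the following query is a minimal sketch of that slot check for Postgres 10+; the alert threshold is not shown because it should be derived from your WAL volume and disk headroom:

-- How much WAL each replication slot is retaining (largest first)
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

An inactive slot with rapidly growing retained WAL is the typical precursor to the forced resnapshot scenario described above.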
Debezium CDC Architecture Trade-Offs in Production
Across multiple production deployments, the primary optimization target is operational stability rather than theoretical correctness. Systems that prioritize perfect guarantees over throughput and recovery tend to fail under load.
| Approach | Advantages | Trade-offs | When to Use |
|---|---|---|---|
| Log-based CDC (pgoutput) | Low database overhead, native schema evolution | Requires WAL slot management and tuning | Default for Postgres production pipelines |
| Snapshot plus streaming | Full recovery capability | High CPU and memory usage during snapshot | Systems needing rebuild after outages |
| At-least-once delivery | High throughput, simpler operations | Duplicate handling required downstream | Analytics, IoT, event streaming above 1k TPS |
| Exactly-once semantics | No duplicates, strong guarantees | 20 to 50 percent latency overhead | Strict regulatory systems only |
In one production system, enabling exactly-once semantics increased latency from 400 milliseconds to over 900 milliseconds and caused SLA violations. In contrast, downstream deduplication was simpler and more reliable.
A critical rule from production experience is to never share WAL slots across connectors. Shared slots introduce failure coupling and can lead to replication deadlocks during failover, forcing full resnapshots and extended recovery time.
Postgres CDC Architecture: Running Production Connector Spec
For Postgres CDC architecture, I deploy Kafka Connect as a StatefulSet with a hardened Postgres connector spec. This is the baseline I use before tuning for table count, write volume, and failover behavior. It reflects what has held stable across high-throughput systems rather than default examples:
apiVersion: kafka.connect/v1
kind: Connector
metadata:
  name: postgres-cdc
  labels:
    app: debezium
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 4  # 1 task per 2 vCPU for 10k TPS
  config:
    database.hostname: prod-postgres.example.com
    database.port: 5432
    database.user: replicator
    database.password: ${str:secret-ref}
    database.dbname: inventory
    database.server.name: dbserver1
    table.include.list: public.inventory.*,public.customers
    slot.name: debezium_slot
    plugin.name: pgoutput  # Postgres 10+
    publication.name: dbz_publication
    snapshot.mode: initial
    heartbeat.interval.ms: 30000
    max.batch.size: 2048
    max.queue.size: 8192
    max.queue.size.in.bytes: 0
    poll.interval.ms: 500
Apply with kubectl apply -f connector.yaml. Scale tasksMax at roughly one task per two vCPUs, then validate under synthetic load. In my 10k TPS deployments, this ratio held stable without runaway lag or JVM pressure.
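For the synthetic load itself, pgbench is usually sufficient to approximate sustained write pressure. This is a minimal sketch, not a benchmark recipe: the host, user, scale factor, client count, and target rate are placeholders to adjust toward your expected peak, and pgbench writes to its own pgbench_* tables, so either add those to table.include.list for the test or point pgbench at your captured tables with a custom script via -f.

# Initialize a benchmark schema (scale factor is illustrative)
pgbench -h prod-postgres.example.com -U replicator -i -s 100 inventory
# Drive roughly 10k transactions per second for 10 minutes
pgbench -h prod-postgres.example.com -U replicator -c 16 -j 4 -R 10000 -T 600 inventory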
Connect worker configuration needs to match the same production assumptions:
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
offset.storage.replication.factor=3
config.storage.replication.factor=3
# Heap and GC settings are passed through the environment, not this properties file
KAFKA_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC"
Metrics must be exposed from the start. In practice that usually means attaching the Prometheus JMX exporter agent to the Connect workers so a dedicated metrics endpoint exists; scraping it avoids blind spots during incidents:
- job_name: 'kafka-connect'
  static_configs:
    - targets: ['connect:8083']
  metrics_path: /metrics
  params:
    group: ['kafka.connect']
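On top of the scrape config, an alerting rule keeps lag from drifting unnoticed. The sketch below assumes a time-behind-source metric: Debezium exposes MilliSecondsBehindSource over JMX, but the Prometheus metric name shown is a placeholder that depends entirely on your JMX exporter relabeling rules, and the one-hour threshold mirrors the alerting level described earlier rather than a universal default.

groups:
  - name: cdc-lag
    rules:
      - alert: CdcLagHigh
        # Placeholder metric name; substitute whatever your JMX exporter emits
        expr: debezium_metrics_MilliSecondsBehindSource > 3600000
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "CDC pipeline is more than one hour behind the source database"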
This setup is not static. Every production system diverges based on table distribution, write amplification, and downstream pressure, but starting from this baseline avoids the common failure modes that appear when relying on default connector configurations.
Postgres CDC Architecture Implementation
A typical production deployment uses Kafka Connect in a distributed setup with tuned connector configuration:
tasksMax: 4
max.batch.size: 2048
max.queue.size: 8192
poll.interval.ms: 500
heartbeat.interval.ms: 30000
Scaling follows a practical rule of one task per two CPU cores for sustained throughput around 10k TPS. JVM heap sizing must match table count and snapshot size, with 4GB as a baseline for mid-sized deployments.
Postgres configuration is equally critical. Settings such as max_replication_slots, max_wal_senders, and wal_sender_timeout must be tuned to avoid slot starvation and connection drops.
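As a sketch rather than a recommendation, a starting point in postgresql.conf looks like this; the numbers are illustrative and should be sized to connector count and failover topology:

# postgresql.conf (illustrative values)
wal_level = logical
max_replication_slots = 10        # headroom above the number of connectors
max_wal_senders = 10              # at least one per slot, plus physical replicas
wal_sender_timeout = 60s          # the default; lowering it drops slow connectors under load
max_slot_wal_keep_size = 50GB     # Postgres 13+: caps WAL retained by a stuck slot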
Advanced Debezium CDC Patterns
Schema evolution requires explicit handling. Using a schema history topic and structured formats such as Avro prevents failures during DDL changes.
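In connector configuration terms this looks roughly like the sketch below; the property names are the Debezium 2.x forms (older 1.x releases use the database.history.* prefix) and the registry and broker addresses are placeholders:

# Avro serialization through a schema registry (URLs are placeholders)
key.converter: io.confluent.connect.avro.AvroConverter
value.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://schema-registry:8081
value.converter.schema.registry.url: http://schema-registry:8081
# Explicitly managed schema history topic (Debezium 2.x property names)
schema.history.internal.kafka.bootstrap.servers: kafka:9092
schema.history.internal.kafka.topic: schema-changes.inventory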
Connector isolation is another key pattern. Grouping tables into separate connectors limits blast radius when one schema or workload behaves unexpectedly.
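Concretely, isolation means separate connectors with their own slots and publications instead of one connector spanning everything. A minimal sketch with hypothetical table groups; the names are placeholders:

# Connector 1: order tables, with an isolated slot and publication
name: postgres-cdc-orders
slot.name: debezium_slot_orders
publication.name: dbz_pub_orders
table.include.list: public.orders,public.order_items
# Connector 2: customer tables, an independent failure domain
name: postgres-cdc-customers
slot.name: debezium_slot_customers
publication.name: dbz_pub_customers
table.include.list: public.customers,public.addresses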
Snapshot strategy should remain controlled. Continuous snapshot modes often lead to memory pressure and instability. Initial snapshot plus streaming with heartbeat monitoring provides a stable baseline.
Across multiple systems, removing continuous snapshot modes eliminated recurring out-of-memory failures and reduced recovery time.
Operational Next Steps
Run this under load before trusting it. Introduce artificial lag, pause replication, and push throughput beyond expected peaks. Watch how quickly lag grows, how snapshots behave, and where recovery breaks down.
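The Kafka Connect REST API covers the basic drills. A sketch of the kind of exercise this means, assuming the postgres-cdc connector name from the spec above:

# Pause the connector to simulate a stalled pipeline, then watch WAL retention grow
curl -X PUT http://connect:8083/connectors/postgres-cdc/pause
# Resume and observe how long the backlog takes to drain
curl -X PUT http://connect:8083/connectors/postgres-cdc/resume
# Check connector and task state during the drill
curl http://connect:8083/connectors/postgres-cdc/status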
In production CDC architecture, initial setup is not the hard part. Systems fail during spikes, failovers, and schema changes. The difference between a pipeline that works and one that holds under pressure comes down to observability, WAL discipline, and clear operational boundaries.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.