Engineer (solo build) · 2026

Real-Time CDC Pipeline

A production-shaped change-data-capture pipeline with a live storefront: place an order and watch it land in the analytics warehouse seconds later, with an ops dashboard reporting end-to-end latency, lag and a dead-letter queue.

The problem

Analytics teams want fresh data, but running reports against the operational database couples them to it and adds load, while nightly batch jobs leave the warehouse hours stale. Change data capture is the modern answer: stream every row-level change out of the source database as it happens. I wanted to build that properly, end to end, rather than read about it: a real pipeline where you place an order on a storefront and watch it appear in the analytics warehouse seconds later. This is that project, built solo.

Approach

The capture is log-based, not polling-based: Debezium reads the Postgres write-ahead log, so the source database is never queried for changes and never modified with triggers. From there the changes flow through Kafka into a Spark stream processor and land in a column-store warehouse built for analytics. The whole thing is deliberately production-shaped: every service has a health check, the warehouse is the only thing the dashboard reads from, and bad events are quarantined rather than allowed to break the stream. A storefront and an ops dashboard sit on top so the pipeline is something you can actually watch work, not just a diagram.
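To make the quarantine idea concrete, here is a minimal sketch in plain Python. It is not the project's actual transform. It assumes JSON-encoded Debezium change events with the standard envelope fields (`op`, `before`, `after`, `ts_ms`), optionally wrapped in a `payload` key. The function name `route_event`, the `"warehouse"` and `"dlq"` labels, and the `_op` and `_event_ts_ms` column names are all hypothetical, chosen only for illustration.

```python
import json
from typing import Any

# Debezium op codes: create, update, delete, snapshot read.
VALID_OPS = {"c", "u", "d", "r"}


def route_event(raw: bytes) -> tuple[str, dict[str, Any]]:
    """Return ("warehouse", row) for a usable event, else ("dlq", record).

    Hypothetical helper: a bad event is quarantined with its error
    instead of raising and stopping the stream.
    """
    try:
        event = json.loads(raw)
        # With schemas enabled, the envelope sits under a "payload" key.
        payload = event.get("payload", event)
        op = payload["op"]
        if op not in VALID_OPS:
            raise ValueError(f"unknown op {op!r}")
        # Deletes carry the row image in "before"; other ops use "after".
        row = payload["before"] if op == "d" else payload["after"]
        if not isinstance(row, dict):
            raise ValueError("missing row image")
        return "warehouse", {**row, "_op": op, "_event_ts_ms": payload["ts_ms"]}
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        # Keep the raw bytes and the reason so the event can be inspected later.
        return "dlq", {"raw": raw.decode("utf-8", "replace"), "error": repr(exc)}
```

Returning a destination label rather than raising keeps one malformed message from halting everything behind it, which is the property the dead-letter queue exists to provide.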

Architecture

What I'd do differently

The Spark job's checkpoint is its single recovery point, and an abrupt shutdown can leave the offset log half-written; production would put the checkpoint on durable storage and alert on consumer lag. Schema evolution is the next hardening step: Debezium handles a changing source schema, and the stream transform would need to follow it. As a learning build it is intentionally single-node, not a multi-broker, multi-partition deployment.