Note: Moonlink is in preview. Expect changes. Join our Community to stay updated on a stable release!
Moonlink
Moonlink is a Rust library that enables sub-second mirroring (CDC) of Postgres tables into Iceberg. It serves as a drop-in replacement for the Debezium + Kafka + Flink + Spark stack.
Under the hood, it extends Iceberg with a real-time storage engine optimized for low-latency, high-throughput ingestion from update-heavy sources like Postgres logical replication.
The Problem​
Traditional CDC pipelines struggle with Iceberg as a destination. Iceberg wasn't designed for high-frequency updates and deletes:
- Updates require expensive full-table scans
- Even with Iceberg v2's equality deletes, the number of delete files grows rapidly
- Small file problem degrades query performance
Moonlink takes a fundamentally different approach by building the first streaming storage engine for Iceberg. Which provides:
- Sub-second Ingestion: Handles inserts, updates and deletes with minimal latency
- Up-to-date Reads: Provides unified view of streaming buffer and Iceberg state
Architecture​
Moonlink converts Iceberg into a real-time columnstore by adding a streaming storage engine on top of it. This consists of:
- In-Memory Arrow Buffer: Which accumulates recent inserts/updates in columnar format
- Hybrid Index: Maps primary key or row hash → (batch/file, row)
- Deletion Log + RoaringBitmap Deletion Vectors: Efficiently tracks deletions in Iceberg PuffinFile
Hot data lands in Moonlink's streaming storage engine, and is periodically flushed into Iceberg. This architecture is inspired by real-time streaming storage systems like Fluss.
Write Path​
Raw Inserts
- Insert records are accumulated in in-memory Arrow buffers
- When size/time thresholds are crossed, batches are serialized to Parquet files
Raw Deletions
- Deletion records are converted to positional deletion logs using the Moonlink index
- Deletion logs are periodically flushed to Deletion Vectors for efficiency
Read Path​
Moonlink provides a unified read view that combines in-memory state with Iceberg files for real-time data access.
If you don't require an 'up-to-date' view of the table, you can query the Iceberg table directly.
Note: Moonlink writes Iceberg tables with deletion vectors (Iceberg v3). Not all engines support this just yet.
How does Moonlink work within pg_mooncake?​
pg_mooncake runs moonlink as a Background Worker Process that manages tables, handles CDC ingestion, and processes read requests.
Postgres backends communicate with moonlink via unix streams:
- DDL commands: Create/drop table and snapshot creation
- Commands are forwarded to moonlink and execution waits for completion
- Read queries:
- Request current Read State (Parquet files, arrow batches, deletion vectors)