Skip to main content
Version: 0.2

Note: Moonlink is in preview. Expect changes. Join our Community to stay updated on a stable release!

Moonlink

Moonlink is a Rust library that enables sub-second mirroring (CDC) of Postgres tables into Iceberg. It serves as a drop-in replacement for the Debezium + Kafka + Flink + Spark stack.

Under the hood, it extends Iceberg with a real-time storage engine optimized for low-latency, high-throughput ingestion from update-heavy sources like Postgres logical replication.

The Problem​

Traditional CDC pipelines struggle with Iceberg as a destination. Iceberg wasn't designed for high-frequency updates and deletes:

  • Updates require expensive full-table scans
  • Even with Iceberg v2's equality deletes, the number of delete files grows rapidly
  • Small file problem degrades query performance

Moonlink takes a fundamentally different approach by building the first streaming storage engine for Iceberg. Which provides:

  • Sub-second Ingestion: Handles inserts, updates and deletes with minimal latency
  • Up-to-date Reads: Provides unified view of streaming buffer and Iceberg state

Architecture​

Moonlink converts Iceberg into a real-time columnstore by adding a streaming storage engine on top of it. This consists of:

  • In-Memory Arrow Buffer: Which accumulates recent inserts/updates in columnar format
  • Hybrid Index: Maps primary key or row hash → (batch/file, row)
  • Deletion Log + RoaringBitmap Deletion Vectors: Efficiently tracks deletions in Iceberg PuffinFile

Moonlink Architecture

Hot data lands in Moonlink's streaming storage engine, and is periodically flushed into Iceberg. This architecture is inspired by real-time streaming storage systems like Fluss.

Write Path​

Raw Inserts

  • Insert records are accumulated in in-memory Arrow buffers
  • When size/time thresholds are crossed, batches are serialized to Parquet files

Raw Deletions

  • Deletion records are converted to positional deletion logs using the Moonlink index
  • Deletion logs are periodically flushed to Deletion Vectors for efficiency

Read Path​

Moonlink provides a unified read view that combines in-memory state with Iceberg files for real-time data access.

If you don't require an 'up-to-date' view of the table, you can query the Iceberg table directly.

Note: Moonlink writes Iceberg tables with deletion vectors (Iceberg v3). Not all engines support this just yet.

pg_mooncake runs moonlink as a Background Worker Process that manages tables, handles CDC ingestion, and processes read requests.

Postgres backends communicate with moonlink via unix streams:

  1. DDL commands: Create/drop table and snapshot creation
    • Commands are forwarded to moonlink and execution waits for completion
  2. Read queries:
    • Request current Read State (Parquet files, arrow batches, deletion vectors)