Load Data
This guide covers different methods for loading data into pg_mooncake columnstore tables. It is assumed that you have a columnstore table users_columnstore
with the same schema as your source data.
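If you have not created that table yet, a minimal sketch of one way to do so is shown below. The column list is an assumption for illustration; adjust it to match your source data. In pg_mooncake, columnstore tables are created with the columnstore access method.
-- Hypothetical schema for illustration; match it to your source data
CREATE TABLE users_columnstore (
id INT,
name TEXT,
created_at TIMESTAMP
) USING columnstore;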
From PostgreSQL Tables
The simplest way to load data is directly from an existing PostgreSQL table:
INSERT INTO users_columnstore
SELECT * FROM users_rowstore;
Note: Columnstore tables are not optimized for small, frequent writes and updates. For best performance, batch your writes using a cron job or triggers from a staging table.
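As one possible sketch of that pattern (the users_staging table and the flush schedule are assumptions for illustration, not part of pg_mooncake):
-- Hypothetical staging table that absorbs small, frequent writes
CREATE TABLE users_staging (LIKE users_rowstore);
-- Periodically (e.g., from a cron job), flush the accumulated rows in one batch
BEGIN;
INSERT INTO users_columnstore SELECT * FROM users_staging;
TRUNCATE users_staging;
COMMIT;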
From Files (Parquet, CSV, JSON)
To load data from files into columnstore tables, pg_mooncake provides two approaches:
Method 1: Using COPY Command
The most direct way to load files into a columnstore table:
-- Load a single file
COPY users_columnstore FROM 's3://bucket/file.parquet';
-- Load multiple files using a wildcard pattern
COPY users_columnstore FROM 's3://bucket/folder/*.parquet';
-- Load from CSV with options
COPY users_columnstore FROM 's3://bucket/data.csv' WITH (FORMAT 'csv', HEADER true);
Method 2: Using INSERT with Read Functions
For more control over the data being loaded:
-- Load data using INSERT and read functions
INSERT INTO users_columnstore
SELECT * FROM mooncake.read_parquet('s3://bucket/file.parquet') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
You must define the schema of the file in the AS clause.
Note: To load files from S3 or other cloud storage, you need to set up cloud storage credentials first. Ensure your secret has the appropriate scope to access the cloud storage bucket.
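As a sketch of that setup, S3 credentials can be registered with mooncake.create_secret; the values below are placeholders, and the exact options for your provider are covered in the cloud storage setup guide:
-- Placeholder credentials; substitute your own key, secret, and region
SELECT mooncake.create_secret('my_s3_secret', 'S3', 'access_key_id', 'secret_access_key', '{"REGION": "us-east-1"}');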
From Iceberg/Delta Tables
pg_mooncake can load data from existing Iceberg or Delta Lake tables directly into your columnstore tables:
Loading from Iceberg Tables
-- Load an Iceberg table into a columnstore table
COPY users_columnstore FROM 's3://bucket/table';
-- Or use INSERT with iceberg_scan
INSERT INTO users_columnstore
SELECT * FROM iceberg_scan('s3://bucket/table') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
Loading from Delta Tables
-- Load a Delta table into a columnstore table
COPY users_columnstore FROM 's3://bucket/table';
-- Or use INSERT with delta_scan
INSERT INTO users_columnstore
SELECT * FROM delta_scan('s3://bucket/table') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
From HuggingFace Datasets
pg_mooncake provides seamless integration for loading data from HuggingFace datasets directly into your columnstore tables:
- Find a HuggingFace dataset (e.g., lmsys_chat_1m_clean).
- Navigate to the files and locate the path to the data file:
https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean/blob/main/data/train-00000-of-00004.parquet
- Add the hf:// prefix and remove /blob/main to load directly into your columnstore table:
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet';
- For more complex loading scenarios, you can transform the data while loading:
-- Load with transformation
INSERT INTO users_columnstore (user_id, message, created_at)
SELECT
id,
message,
timestamp
FROM mooncake.read_parquet('hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet')
AS p(id INT, message TEXT, timestamp TIMESTAMP)
WHERE message IS NOT NULL;
- You can also load multiple HuggingFace dataset files at once:
-- Load multiple files
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-*.parquet';
Performance Considerations
- Batch your writes into larger transactions for better performance.
- Each write operation creates a new Parquet file. The ideal batch size is between 100,000 and 1,000,000 rows.
- It is recommended to periodically run a no-op query like UPDATE table_name SET column = column to ensure that your Parquet files and metadata are in an optimal state (see the example after this list).
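For instance, on the users_columnstore table used throughout this guide, that no-op pass could look like the following (the choice of column is arbitrary):
-- No-op rewrite; per the note above, this keeps Parquet files and metadata in an optimal state
UPDATE users_columnstore SET id = id;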
For any questions, or to report a bug, reach out!