Load Data
This guide covers different methods for loading data into pg_mooncake columnstore tables. It is assumed that you have a columnstore table users_columnstore
with the same schema as your source data.
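If you have not created that table yet, a minimal sketch of one way to do so is shown below. The column list is an assumption for illustration; adjust it to match your source data. In pg_mooncake, columnstore tables are created with the columnstore access method.
-- Hypothetical schema for illustration; match it to your source data
CREATE TABLE users_columnstore (
id INT,
name TEXT,
created_at TIMESTAMP
) USING columnstore;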
From PostgreSQL Tables
The simplest way to load data is directly from an existing PostgreSQL table:
INSERT INTO users_columnstore
SELECT * FROM users_rowstore;
Note: Columnstore tables are not optimized for small, frequent writes and updates. For best performance, batch your writes using a cron job or triggers from a staging table.
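As one possible sketch of that pattern (the users_staging table and the flush schedule are assumptions for illustration, not part of pg_mooncake):
-- Hypothetical staging table that absorbs small, frequent writes
CREATE TABLE users_staging (LIKE users_rowstore);
-- Periodically (e.g., from a cron job), flush the accumulated rows in one batch
BEGIN;
INSERT INTO users_columnstore SELECT * FROM users_staging;
TRUNCATE users_staging;
COMMIT;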
From Files (Parquet, CSV, JSON)
To load data from files into columnstore tables, pg_mooncake provides two approaches:
Method 1: Using COPY Command
The most direct way to load files into a columnstore table:
-- Load a single file
COPY users_columnstore FROM 's3://bucket/file.parquet';
-- Load multiple files using a wildcard pattern
COPY users_columnstore FROM 's3://bucket/folder/*.parquet';
-- Load from CSV with options
COPY users_columnstore FROM 's3://bucket/data.csv' WITH (FORMAT 'csv', HEADER true);
Method 2: Using INSERT with Read Functions
For more control over the data being loaded:
-- Load data using INSERT and read functions
INSERT INTO users_columnstore
SELECT * FROM mooncake.read_parquet('s3://bucket/file.parquet') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
You must define the schema of the file in the AS clause.
Note: To load files from S3 or other cloud storage, you need to set up cloud storage credentials first. Ensure your secret has the appropriate scope to access the cloud storage bucket.
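As a sketch of that setup, S3 credentials can be registered with mooncake.create_secret; the values below are placeholders, and the exact options for your provider are covered in the cloud storage setup guide:
-- Placeholder credentials; substitute your own key, secret, and region
SELECT mooncake.create_secret('my_s3_secret', 'S3', 'access_key_id', 'secret_access_key', '{"REGION": "us-east-1"}');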
From Iceberg/Delta Tables
pg_mooncake can load data from existing Iceberg or Delta Lake tables directly into your columnstore tables:
Loading from Iceberg Tables
-- Load an Iceberg table into a columnstore table
COPY users_columnstore FROM 's3://bucket/table';
-- Or use INSERT with iceberg_scan
INSERT INTO users_columnstore
SELECT * FROM iceberg_scan('s3://bucket/table') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
Loading from Delta Tables
-- Load a Delta table into a columnstore table
COPY users_columnstore FROM 's3://bucket/table';
-- Or use INSERT with delta_scan
INSERT INTO users_columnstore
SELECT * FROM delta_scan('s3://bucket/table') AS (
id INT,
name TEXT,
created_at TIMESTAMP
);
From HuggingFace Datasets
pg_mooncake provides seamless integration for loading data from HuggingFace datasets directly into your columnstore tables:
- Find a HuggingFace dataset (e.g., lmsys_chat_1m_clean).
- Navigate to the files and locate the path to the data file:
https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean/blob/main/data/train-00000-of-00004.parquet
- Add the hf:// prefix and remove /blob/main to load directly into your columnstore table:
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet';
- For more complex loading scenarios, you can transform the data while loading:
-- Load with transformation
INSERT INTO users_columnstore (user_id, message, created_at)
SELECT
id,
message,
timestamp
FROM mooncake.read_parquet('hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet')
AS p(id INT, message TEXT, timestamp TIMESTAMP)
WHERE message IS NOT NULL;
- You can also load multiple HuggingFace dataset files at once:
-- Load multiple files
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-*.parquet';
Performance Considerations
- Batch your writes into larger transactions for better performance.
- Each write operation creates a new Parquet file. The ideal batch size is between 100,000 and 1,000,000 rows.
- It is recommended to periodically run a no-op query like UPDATE table_name SET column = column to ensure that your Parquet files and metadata are in an optimal state (see the example after this list).
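For instance, on the users_columnstore table used throughout this guide, that no-op pass could look like the following (the choice of column is arbitrary):
-- No-op rewrite; per the note above, this keeps Parquet files and metadata in an optimal state
UPDATE users_columnstore SET id = id;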
For any questions, or to report a bug, reach out!