Skip to main content
Version: 0.1

Query External Files

pg_mooncake allows you to directly query files in your data lake through PostgreSQL without loading the data first. This effectively extends your database to include your data lake storage.

Supported File Formats​

pg_mooncake supports the following file formats:

  • Parquet
  • CSV
  • JSON
  • Iceberg
  • Delta Lake

Querying Parquet Files​

To query a Parquet file:

-- Query a Parquet file directly
SELECT
user_id,
event_type,
created_at
FROM mooncake.read_parquet('s3://bucket/events.parquet') AS (
user_id INT,
event_type TEXT,
created_at TIMESTAMP
)
WHERE created_at > '2023-01-01';

You must define the schema of the file in the AS clause.

Querying CSV Files​

To query a CSV file:

-- Query a CSV file directly
SELECT
r['user_id']::INT,
r['event_type'],
r['created_at']::TIMESTAMP
FROM mooncake.read_csv('s3://bucket/events.csv', AUTO_DETECT=TRUE) r
WHERE r['created_at']::TIMESTAMP > '2023-01-01';

You can also specify CSV options:

SELECT * FROM mooncake.read_csv('s3://bucket/data.csv', 
HEADER=TRUE,
DELIMITER=',',
AUTO_DETECT=TRUE
);

Querying JSON Files​

To query a JSON file:

SELECT 
json_extract_path_text(data, 'user_id')::INT AS user_id,
json_extract_path_text(data, 'event_type') AS event_type,
json_extract_path_text(data, 'created_at')::TIMESTAMP AS created_at
FROM mooncake.read_json('s3://bucket/events.json') AS (data JSONB)
WHERE json_extract_path_text(data, 'created_at')::TIMESTAMP > '2023-01-01';

Querying Iceberg Tables​

To query an Iceberg table:

SELECT 
user_id,
event_type,
created_at
FROM iceberg_scan('s3://bucket/iceberg_table') AS (
user_id INT,
event_type TEXT,
created_at TIMESTAMP
)
WHERE created_at > '2023-01-01';

Querying Delta Lake Tables​

To query a Delta Lake table:

SELECT 
user_id,
event_type,
created_at
FROM delta_scan('s3://bucket/delta_table') AS (
user_id INT,
event_type TEXT,
created_at TIMESTAMP
)
WHERE created_at > '2023-01-01';

Using Wildcards and Partitions​

You can use wildcards to query multiple files at once:

-- Query all parquet files in a directory
SELECT * FROM mooncake.read_parquet('s3://bucket/events/*.parquet') AS (
user_id INT,
event_type TEXT,
created_at TIMESTAMP
);

-- Query files in a partitioned directory structure
SELECT * FROM mooncake.read_parquet('s3://bucket/events/year=2023/month=*/day=*/*.parquet') AS (
user_id INT,
event_type TEXT,
created_at TIMESTAMP
);

Joining with PostgreSQL Tables​

Yes, you can join external files with PostgreSQL tables without loading the data first. This is one of the most powerful features of pg_mooncake:

-- Join a Parquet file with a PostgreSQL table
SELECT
u.name,
u.email,
e.event_type,
e.created_at
FROM users u
JOIN mooncake.read_parquet('s3://bucket/events.parquet') AS e(
user_id INT,
event_type TEXT,
created_at TIMESTAMP
) ON u.id = e.user_id
WHERE e.created_at > '2023-01-01';

Joining Multiple External Files​

You can also join multiple external files together:

-- Join two Parquet files
SELECT
u.name,
u.email,
e.event_type,
e.created_at
FROM mooncake.read_parquet('s3://bucket/users.parquet') AS u(
id INT,
name TEXT,
email TEXT
)
JOIN mooncake.read_parquet('s3://bucket/events.parquet') AS e(
user_id INT,
event_type TEXT,
created_at TIMESTAMP
) ON u.id = e.user_id
WHERE e.created_at > '2023-01-01';

Querying HuggingFace Datasets​

pg_mooncake offers direct integration with HuggingFace datasets:

-- Query a HuggingFace dataset directly
SELECT *
FROM mooncake.read_parquet('hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet') AS (
id INT,
message TEXT,
timestamp TIMESTAMP
)
WHERE timestamp > '2023-01-01';