Version: 0.1

Querying columnstore outside of Postgres

A major benefit of pg_mooncake is its use of open formats (Parquet, Iceberg, Delta) that can be accessed by tools beyond PostgreSQL. This enables hybrid workflows combining PostgreSQL with other data processing systems.

Inspecting Table Metadata

Finding Table Locations

To find where your columnstore tables are physically stored, query the mooncake.columnstore_tables catalog table:

SELECT * FROM mooncake.columnstore_tables;

Example output:

    table_name     |                                          path
-------------------+----------------------------------------------------------------------------------------
 users_columnstore | /usr/local/pgsql/data/mooncake_local_tables/mooncake_postgres_users_columnstore_16446/
 users_log         | /usr/local/pgsql/data/mooncake_local_tables/mooncake_postgres_users_log_16455/
 events            | s3://my-bucket/mooncake_postgres_events_16459/

The path column shows either a local filesystem path or a cloud storage path (S3, GCS, etc.) where the table data is stored. That directory holds a complete Delta Lake table: transaction logs under _delta_log plus the Parquet data files themselves.
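For a local table, you can sanity-check the directory layout from Python with the standard library alone. A minimal sketch; the helper name and the mock layout below are illustrative, not part of pg_mooncake:

```python
from pathlib import Path
import tempfile

def list_delta_contents(table_path):
    """Split a Delta table directory into transaction-log entries and Parquet data files."""
    root = Path(table_path)
    logs = sorted(p.name for p in (root / "_delta_log").glob("*.json"))
    data = sorted(p.name for p in root.glob("*.parquet"))
    return logs, data

# Demo against a mock layout; in practice, pass a path taken from
# mooncake.columnstore_tables instead.
tmp = Path(tempfile.mkdtemp())
(tmp / "_delta_log").mkdir()
(tmp / "_delta_log" / "00000000000000000000.json").write_text("{}")
(tmp / "data-0001.parquet").write_bytes(b"")
print(list_delta_contents(tmp))
```

Any tool that understands the Delta protocol can read the same directory; nothing in it is specific to PostgreSQL.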

Viewing Data Files

To inspect the individual Parquet files for a specific table, use the mooncake.data_files view:

SELECT * FROM mooncake.data_files;

Example output:

  oid  |                  file_name                   |          file_metadata
-------+----------------------------------------------+----------------------------------
 32822 | 0ffb9a71-89a5-43d7-a55b-e04db56d680b.parquet | \x1502192c3500180d6475636b64625f
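The file_metadata column is a Postgres bytea, which psql renders in hex format (a leading \x followed by hex digits). If you pull that value out as text, a small sketch for turning it back into raw bytes; decode_bytea_hex is a hypothetical helper, not part of pg_mooncake:

```python
def decode_bytea_hex(value: str) -> bytes:
    """Decode Postgres hex-format bytea output: '\\x' followed by hex digits."""
    if not value.startswith("\\x"):
        raise ValueError("expected hex-format bytea starting with \\x")
    return bytes.fromhex(value[2:])

# The (truncated) example value from the table above; note the embedded
# 'duckdb' marker in the decoded bytes.
meta = decode_bytea_hex("\\x1502192c3500180d6475636b64625f")
print(meta)
```

Client libraries such as psycopg return bytea columns as bytes directly, so this decoding step is only needed when working from textual output.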

Accessing Data with Python

Using Polars

Polars is a fast DataFrame library for Python with first-class Delta Lake support. Because pg_mooncake stores columnstore tables in Delta Lake format, you can point scan_delta() directly at a table's path. It returns a LazyFrame, so the query only executes when you call collect():

  1. Install Polars with the Delta Lake extra, which pulls in the deltalake package that scan_delta() relies on:
pip install 'polars[deltalake]'
  2. Create a LazyFrame from your table location:
import polars as pl

# For local tables
local_path = '/usr/local/pgsql/data/mooncake_local_tables/mooncake_postgres_users_columnstore_16446/'
df = pl.scan_delta(local_path)

# For S3 tables
s3_path = 's3://my-bucket/mooncake_postgres_events_16459/'
df = pl.scan_delta(s3_path)
  3. Run Python transformations, then call collect() to materialize the result:
df.filter(
    (pl.col("pickup_time") >= '2021-01-01T00:00:00+00:00')
    & (pl.col("pickup_time") < '2021-02-01T00:00:00+00:00')
).collect()