How-To Lakehouse Read With/Without A Catalog v1.3

Overview

This guide demonstrates how to configure PGAA Lakehouse clusters for accessing data lake formats, both with and without catalog integration. PGAA supports direct access to Delta Lake and Apache Iceberg tables, enabling analytical workloads on object storage without data movement.

Prerequisites

System Requirements

  • PGAA Lakehouse cluster provisioned and accessible
  • PostgreSQL File System (PGFS) version 2.0.1 or later recommended
  • Administrative privileges for catalog and storage configuration

Access Requirements

  • For catalog integration: Valid credentials for target catalog service
  • For direct access: Read permissions on target S3 buckets or storage locations
  • Network connectivity to storage endpoints

Configuration Approaches

PGAA supports three primary approaches for lakehouse data access:

  1. Catalog-managed access: Full integration with external catalog services
  2. Direct Delta Lake access: Query Delta tables without catalog dependency
  3. Direct Iceberg access: Query Iceberg tables using metadata files

AWS S3 Tables Integration

Python Client Configuration

Python clients reach AWS S3 Tables through its Iceberg REST endpoint. Configure PyIceberg as follows:

from pyiceberg.catalog import load_catalog

# Define S3 Tables parameters
REGION = "eu-north-1"  # Your S3 Tables bucket region
ARN = "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket"

s3tables_catalog = load_catalog(
    "s3tables_catalog",
    **{
        "type": "rest",
        "warehouse": ARN,
        "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": REGION,
    },
)

Note: PyIceberg versions ≤0.9.1 perform credential resolution at each API call and cannot pass S3 options directly during catalog definition.

PGAA Native Configuration

PGAA implements a native Rust S3 Tables client, eliminating REST proxy overhead:

SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.delete_catalog('s3tables_catalog');
    SELECT pgaa.add_catalog(
        's3tables_catalog',
        'iceberg-s3tables',
        '{"arn": "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket", "region": "eu-north-1"}'
    );
    SELECT pgaa.import_catalog('s3tables_catalog');
$$);

Key Differences:

  • Catalog type: iceberg-s3tables (not iceberg-rest)
  • Required parameters: ARN and region only
  • Authentication: AWS credential chain (profiles, environment variables, IMDS)
  • Explicit credentials not supported; relies on SDK credential resolution
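
After the import, tables from the S3 Tables bucket should be queryable by schema-qualified name, assuming the catalog's namespaces are exposed as schemas in the same way as the catalog attachment example later in this guide. A minimal sketch with hypothetical names (substitute a namespace and table that actually exist in your bucket):

-- 'sales' and 'events' are placeholder names for illustration only
SELECT count(*) FROM sales.events;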

Direct Delta Lake Access

Storage Location Setup

Configure storage locations for Delta Lake tables:

-- PGFS 2.0.1 and later
SELECT pgfs.create_storage_location(
    'biganimal-sample-data',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"aws_skip_signature": "true"}'
);

-- Legacy syntax (before PGFS 2.0.1)
SELECT pgfs.create_storage_location(
    'biganimal-sample-data',
    's3://beacon-analytics-demo-data-eu-west-1-prod',
    NULL,                              -- managed storage location ID (unused)
    '{"aws_skip_signature": "true"}',  -- options
    '{}'                               -- credentials
);

Creating Delta Tables

Map Delta Lake tables using the PGAA table access method:

-- Create customer table
CREATE TABLE public.customer () USING PGAA WITH (
    pgaa.storage_location = 'biganimal-sample-data',
    pgaa.path = 'tpch_sf_1/customer'
);

-- Verify table access
SELECT count(*) FROM public.customer;

-- Create additional TPC-H tables
CREATE TABLE public.lineitem () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/lineitem');
CREATE TABLE public.orders () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/orders');
CREATE TABLE public.nation () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');

Working with Large Datasets

Configure a separate schema for scale testing:

CREATE SCHEMA tpch;

-- Create 1TB scale factor tables
CREATE TABLE tpch.lineitem () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/lineitem');
CREATE TABLE tpch.customer () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/customer');
-- Additional tables follow same pattern...

Direct Iceberg Access

Configuring Iceberg Storage

-- Create storage location for Iceberg data
SELECT pgfs.create_storage_location(
    'biganimal-sample-data-dev',
    's3://beacon-analytics-demo-data-us-east-1-dev',
    '{"aws_skip_signature": "true"}'
);

-- Create Iceberg table reference
CREATE TABLE iceberg_table () USING PGAA WITH (
    pgaa.storage_location = 'biganimal-sample-data-dev',
    pgaa.path = 'iceberg-example/default.db/iceberg_table',
    pgaa.format = 'iceberg'  -- Explicitly specify format
);

-- Query Iceberg data
SELECT * FROM iceberg_table ORDER BY key ASC;

Catalog Attachment

Connecting to Existing Catalogs

Attach and query data from configured catalogs:

-- Add REST catalog
SELECT pgaa.add_catalog(
    'lakekeeper-test',
    'iceberg-rest',
    '{
        "url": "https://catalog-endpoint.example.com",
        "token": "your-api-token",
        "warehouse": "warehouse-id",
        "danger_accept_invalid_certs": "true"
    }'
);

-- Attach catalog to current session
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Query catalog tables
SELECT COUNT(*) FROM tpch_sf_1.lineitem;

Verifying Catalog Access

-- Check available schemas
SELECT * FROM information_schema.schemata WHERE schema_name LIKE 'tpch%';

-- Verify offloaded data views
SELECT * FROM partitioned_table_offloaded LIMIT 10;

Query Validation

Sample Analytical Query

Validate the configuration with a multi-table analytical query:

SELECT
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= CAST('1993-10-01' AS date)
    AND o_orderdate < CAST('1994-01-01' AS date)
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
ORDER BY
    revenue DESC
LIMIT 20;

Expected Results:

  • Customer#000057040: revenue 734235.2455
  • Customer#000143347: revenue 721002.6948
  • Customer#000060838: revenue 679127.3077

Best Practices

Storage Configuration

  1. Use PGFS 2.0.1+ simplified syntax when available
  2. Configure region-appropriate endpoints for optimal performance
  3. Implement proper credential management through AWS credential chain

Catalog Management

  1. Regularly refresh OAuth tokens for REST catalogs
  2. Use native clients when available (S3 Tables)
  3. Import catalog metadata after configuration changes (see the sketch after this list)
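
For example, a REST catalog token can be rotated by re-registering the catalog, and imported metadata can be refreshed by re-running the import. A minimal sketch reusing only the pgaa functions already shown in this guide (catalog names, endpoint, and token values are placeholders carried over from the earlier examples):

-- Rotate the token for a REST catalog by re-registering it
SELECT pgaa.delete_catalog('lakekeeper-test');
SELECT pgaa.add_catalog(
    'lakekeeper-test',
    'iceberg-rest',
    '{"url": "https://catalog-endpoint.example.com", "token": "new-api-token", "warehouse": "warehouse-id"}'
);
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Refresh imported metadata after a configuration change
SELECT pgaa.import_catalog('s3tables_catalog');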

Performance Optimization

  1. Partition large datasets appropriately
  2. Create materialized views for frequently accessed data (see the sketch after this list)
  3. Monitor query execution plans for optimization opportunities
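
For item 2, a frequently used aggregate over a lakehouse table can be cached locally with a standard PostgreSQL materialized view. A minimal sketch against the public.orders table mapped earlier (the view name and refresh cadence are illustrative):

-- Cache an aggregate of a PGAA-backed table as a local materialized view
CREATE MATERIALIZED VIEW public.orders_by_month AS
SELECT
    date_trunc('month', o_orderdate) AS order_month,
    count(*)                         AS order_count,
    sum(o_totalprice)                AS total_price
FROM public.orders
GROUP BY 1;

-- Re-run on your refresh schedule to pick up new data
REFRESH MATERIALIZED VIEW public.orders_by_month;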

Troubleshooting

Common Issues

Storage Access Errors

  • Verify S3 bucket permissions and network connectivity
  • Check credential chain configuration for AWS authentication (see the isolation test below)
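
To separate credential problems from general connectivity problems, it can help to first query a public bucket that needs no credentials. A minimal sketch reusing the public sample bucket and the aws_skip_signature option from earlier in this guide:

-- A public bucket read that bypasses AWS credentials entirely;
-- if this works but your own bucket does not, suspect the credential chain
SELECT pgfs.create_storage_location(
    'connectivity-check',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"aws_skip_signature": "true"}'
);

CREATE TABLE public.connectivity_check () USING PGAA WITH (
    pgaa.storage_location = 'connectivity-check',
    pgaa.path = 'tpch_sf_1/nation'
);

SELECT count(*) FROM public.connectivity_check;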

Catalog Connection Failures

  • Validate catalog endpoint URLs and credentials
  • Ensure proper SSL certificate validation settings

Query Performance Degradation

  • Review table statistics and partition pruning (see the EXPLAIN sketch below)
  • Consider data locality and caching strategies
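
To inspect what a lakehouse query is actually doing, PostgreSQL's EXPLAIN can be run on queries over PGAA tables. A minimal sketch using the public.customer table mapped earlier:

-- Show the execution plan and runtime statistics for a lakehouse scan
EXPLAIN (ANALYZE, VERBOSE)
SELECT c_nationkey, count(*)
FROM public.customer
GROUP BY c_nationkey;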

Next Steps

  1. Configure monitoring for lakehouse queries
  2. Establish data refresh schedules for materialized views
  3. Implement cost optimization through intelligent caching
  4. Develop governance policies for catalog access control