How-To Lakehouse Read With/Without A Catalog v1.3

Overview

This guide demonstrates how to configure PGAA Lakehouse clusters for accessing data lake formats, both with and without catalog integration. PGAA supports direct access to Delta Lake and Apache Iceberg tables, enabling analytical workloads on object storage without data movement.

Prerequisites

System Requirements

  • PGAA Lakehouse cluster provisioned and accessible
  • PostgreSQL File System (PGFS) version 2.0.1 or later recommended
  • Administrative privileges for catalog and storage configuration

Access Requirements

  • For catalog integration: Valid credentials for target catalog service
  • For direct access: Read permissions on target S3 buckets or storage locations
  • Network connectivity to storage endpoints

Configuration Approaches

PGAA supports three primary approaches for lakehouse data access:

  1. Catalog-managed access: Full integration with external catalog services
  2. Direct Delta Lake access: Query Delta tables without catalog dependency
  3. Direct Iceberg access: Query Iceberg tables using metadata files

AWS S3 Tables Integration

Python Client Configuration

Python clients reach AWS S3 Tables through its Iceberg REST endpoint. Configure PyIceberg as follows:

from pyiceberg.catalog import load_catalog

# Define S3 Tables parameters
REGION = "eu-north-1"  # Your S3 Tables bucket region
ARN = "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket"

s3tables_catalog = load_catalog(
    "s3tables_catalog",
    **{
        "type": "rest",
        "warehouse": ARN,
        "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": REGION,
    },
)

Note: PyIceberg versions ≤0.9.1 perform credential resolution at each API call and cannot pass S3 options directly during catalog definition.

PGAA Native Configuration

PGAA implements a native Rust S3 Tables client, eliminating REST proxy overhead:

SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.delete_catalog('s3tables_catalog');
    SELECT pgaa.add_catalog(
        's3tables_catalog',
        'iceberg-s3tables',
        '{"arn": "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket", "region": "eu-north-1"}'
    );
    SELECT pgaa.import_catalog('s3tables_catalog');
$$);

Key Differences:

  • Catalog type: iceberg-s3tables (not iceberg-rest)
  • Required parameters: ARN and region only
  • Authentication: AWS credential chain (profiles, environment variables, IMDS)
  • Explicit credentials not supported; relies on SDK credential resolution
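
After the import, tables from the S3 Tables bucket should be queryable by schema-qualified name, assuming the catalog's namespaces are exposed as schemas in the same way as the catalog attachment example later in this guide. A minimal sketch with hypothetical names (substitute a namespace and table that actually exist in your bucket):

-- 'sales' and 'events' are placeholder names for illustration only
SELECT count(*) FROM sales.events;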

Direct Delta Lake Access

Storage Location Setup

Configure storage locations for Delta Lake tables:

-- PGFS 2.0.1 and later
SELECT pgfs.create_storage_location(
    'biganimal-sample-data',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"aws_skip_signature": "true"}'
);

-- Legacy syntax (before PGFS 2.0.1)
SELECT pgfs.create_storage_location(
    'biganimal-sample-data',
    's3://beacon-analytics-demo-data-eu-west-1-prod',
    NULL,                              -- managed storage location ID (unused)
    '{"aws_skip_signature": "true"}',  -- options
    '{}'                               -- credentials
);

Creating Delta Tables

Map Delta Lake tables using the PGAA table access method:

-- Create customer table
CREATE TABLE public.customer () USING PGAA WITH (
    pgaa.storage_location = 'biganimal-sample-data',
    pgaa.path = 'tpch_sf_1/customer'
);

-- Verify table access
SELECT count(*) FROM public.customer;

-- Create additional TPC-H tables
CREATE TABLE public.lineitem () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/lineitem');
CREATE TABLE public.orders () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/orders');
CREATE TABLE public.nation () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');

Working with Large Datasets

Configure a separate schema for scale testing:

CREATE SCHEMA tpch;

-- Create 1TB scale factor tables
CREATE TABLE tpch.lineitem () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/lineitem');
CREATE TABLE tpch.customer () USING PGAA WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/customer');
-- Additional tables follow same pattern...

Direct Iceberg Access

Configuring Iceberg Storage

-- Create storage location for Iceberg data
SELECT pgfs.create_storage_location(
    'biganimal-sample-data-dev',
    's3://beacon-analytics-demo-data-us-east-1-dev',
    '{"aws_skip_signature": "true"}'
);

-- Create Iceberg table reference
CREATE TABLE iceberg_table () USING PGAA WITH (
    pgaa.storage_location = 'biganimal-sample-data-dev',
    pgaa.path = 'iceberg-example/default.db/iceberg_table',
    pgaa.format = 'iceberg'  -- Explicitly specify format
);

-- Query Iceberg data
SELECT * FROM iceberg_table ORDER BY key ASC;

Catalog Attachment

Connecting to Existing Catalogs

Attach and query data from configured catalogs:

-- Add REST catalog
SELECT pgaa.add_catalog(
    'lakekeeper-test',
    'iceberg-rest',
    '{
        "url": "https://catalog-endpoint.example.com",
        "token": "your-api-token",
        "warehouse": "warehouse-id",
        "danger_accept_invalid_certs": "true"
    }'
);

-- Attach catalog to current session
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Query catalog tables
SELECT COUNT(*) FROM tpch_sf_1.lineitem;

Verifying Catalog Access

-- Check available schemas
SELECT * FROM information_schema.schemata WHERE schema_name LIKE 'tpch%';

-- Verify offloaded data views
SELECT * FROM partitioned_table_offloaded LIMIT 10;

Query Validation

Sample Analytical Query

Validate the configuration with a multi-table analytical query:

SELECT
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= CAST('1993-10-01' AS date)
    AND o_orderdate < CAST('1994-01-01' AS date)
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
ORDER BY
    revenue DESC
LIMIT 20;

Expected Results:

  • Customer#000057040: revenue 734235.2455
  • Customer#000143347: revenue 721002.6948
  • Customer#000060838: revenue 679127.3077

Best Practices

Storage Configuration

  1. Use PGFS 2.0.1+ simplified syntax when available
  2. Configure region-appropriate endpoints for optimal performance
  3. Implement proper credential management through AWS credential chain

Catalog Management

  1. Regularly refresh OAuth tokens for REST catalogs
  2. Use native clients when available (S3 Tables)
  3. Import catalog metadata after configuration changes (see the sketch after this list)
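
For example, a REST catalog token can be rotated by re-registering the catalog, and imported metadata can be refreshed by re-running the import. A minimal sketch reusing only the pgaa functions already shown in this guide (catalog names, endpoint, and token values are placeholders carried over from the earlier examples):

-- Rotate the token for a REST catalog by re-registering it
SELECT pgaa.delete_catalog('lakekeeper-test');
SELECT pgaa.add_catalog(
    'lakekeeper-test',
    'iceberg-rest',
    '{"url": "https://catalog-endpoint.example.com", "token": "new-api-token", "warehouse": "warehouse-id"}'
);
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Refresh imported metadata after a configuration change
SELECT pgaa.import_catalog('s3tables_catalog');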

Performance Optimization

  1. Partition large datasets appropriately
  2. Create materialized views for frequently accessed data (see the sketch after this list)
  3. Monitor query execution plans for optimization opportunities
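
For item 2, a frequently used aggregate over a lakehouse table can be cached locally with a standard PostgreSQL materialized view. A minimal sketch against the public.orders table mapped earlier (the view name and refresh cadence are illustrative):

-- Cache an aggregate of a PGAA-backed table as a local materialized view
CREATE MATERIALIZED VIEW public.orders_by_month AS
SELECT
    date_trunc('month', o_orderdate) AS order_month,
    count(*)                         AS order_count,
    sum(o_totalprice)                AS total_price
FROM public.orders
GROUP BY 1;

-- Re-run on your refresh schedule to pick up new data
REFRESH MATERIALIZED VIEW public.orders_by_month;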

Troubleshooting

Common Issues

Storage Access Errors

  • Verify S3 bucket permissions and network connectivity
  • Check credential chain configuration for AWS authentication (see the isolation test below)
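
To separate credential problems from general connectivity problems, it can help to first query a public bucket that needs no credentials. A minimal sketch reusing the public sample bucket and the aws_skip_signature option from earlier in this guide:

-- A public bucket read that bypasses AWS credentials entirely;
-- if this works but your own bucket does not, suspect the credential chain
SELECT pgfs.create_storage_location(
    'connectivity-check',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"aws_skip_signature": "true"}'
);

CREATE TABLE public.connectivity_check () USING PGAA WITH (
    pgaa.storage_location = 'connectivity-check',
    pgaa.path = 'tpch_sf_1/nation'
);

SELECT count(*) FROM public.connectivity_check;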

Catalog Connection Failures

  • Validate catalog endpoint URLs and credentials
  • Ensure proper SSL certificate validation settings

Query Performance Degradation

  • Review table statistics and partition pruning (see the EXPLAIN sketch below)
  • Consider data locality and caching strategies
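
To inspect what a lakehouse query is actually doing, PostgreSQL's EXPLAIN can be run on queries over PGAA tables. A minimal sketch using the public.customer table mapped earlier:

-- Show the execution plan and runtime statistics for a lakehouse scan
EXPLAIN (ANALYZE, VERBOSE)
SELECT c_nationkey, count(*)
FROM public.customer
GROUP BY c_nationkey;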

Next Steps

  1. Configure monitoring for lakehouse queries
  2. Establish data refresh schedules for materialized views
  3. Implement cost optimization through intelligent caching
  4. Develop governance policies for catalog access control