pgraft System Architecture

Overview

pgraft integrates the Raft consensus algorithm with PostgreSQL to provide distributed consensus across a cluster of database nodes. This document describes the overall system architecture, component interactions, and operational flows.

System Components

1. PostgreSQL Cluster Nodes

Each PostgreSQL instance in the cluster runs the pgraft extension and participates in the Raft consensus protocol.
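
Before a node can participate, the extension has to be installed and loaded on that instance. The snippet below is a minimal sketch, assuming pgraft must be preloaded (typical for extensions that register a background worker and reserve shared memory, both of which pgraft does) and that the extension is created with the name pgraft; neither step is spelled out in this document.

  -- Assumed setup on each node: preload the library, restart, then create the extension
  ALTER SYSTEM SET shared_preload_libraries = 'pgraft';  -- append to any existing value
  -- ... restart PostgreSQL ...
  CREATE EXTENSION pgraft;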

2. Raft Consensus Layer

The core consensus engine, implemented in Go, provides:

  • Leader election
  • Log replication
  • Cluster membership management
  • Failure detection and recovery

3. Network Communication

TCP-based peer-to-peer communication between cluster nodes is used for:

  • Raft protocol messages
  • Heartbeat signals
  • Log replication
  • Configuration changes

4. Shared Memory Interface

PostgreSQL shared memory is used for:

  • Command queue between SQL and the background worker
  • Cluster state persistence
  • Worker status tracking
  • Command status monitoring

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       PostgreSQL Cluster                         │
├─────────────────────┬─────────────────────┬─────────────────────┤
│       Node 1        │       Node 2        │       Node 3        │
│                     │                     │                     │
│  ┌───────────────┐  │  ┌───────────────┐  │  ┌───────────────┐  │
│  │  PostgreSQL   │  │  │  PostgreSQL   │  │  │  PostgreSQL   │  │
│  │    Server     │  │  │    Server     │  │  │    Server     │  │
│  └───────┬───────┘  │  └───────┬───────┘  │  └───────┬───────┘  │
│  ┌───────▼───────┐  │  ┌───────▼───────┐  │  ┌───────▼───────┐  │
│  │    pgraft     │  │  │    pgraft     │  │  │    pgraft     │  │
│  │   Extension   │  │  │   Extension   │  │  │   Extension   │  │
│  └───────┬───────┘  │  └───────┬───────┘  │  └───────┬───────┘  │
│  ┌───────▼───────┐  │  ┌───────▼───────┐  │  ┌───────▼───────┐  │
│  │  Background   │  │  │  Background   │  │  │  Background   │  │
│  │    Worker     │  │  │    Worker     │  │  │    Worker     │  │
│  └───────┬───────┘  │  └───────┬───────┘  │  └───────┬───────┘  │
│  ┌───────▼───────┐  │  ┌───────▼───────┐  │  ┌───────▼───────┐  │
│  │    Go Raft    │  │  │    Go Raft    │  │  │    Go Raft    │  │
│  │    Library    │  │  │    Library    │  │  │    Library    │  │
│  └───────┬───────┘  │  └───────┬───────┘  │  └───────┬───────┘  │
└──────────┼──────────┴──────────┼──────────┴──────────┼──────────┘
           │                     │                     │
           └─────────────────────┼─────────────────────┘
                                 │
                ┌────────────────▼────────────────┐
                │          Network Layer          │
                │    (TCP Peer Communication)     │
                └─────────────────────────────────┘

Component Interaction Flow

1. Cluster Initialization

sequenceDiagram
    participant U as User
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    U->>N1: SELECT pgraft_init()
    N1->>N1: Start background worker
    N1->>N1: Initialize Raft node
    N1->>N1: Start network server
    U->>N2: SELECT pgraft_add_node('node2:5433')
    N1->>N2: Connect to node 2
    N2->>N2: Start background worker
    N2->>N2: Join cluster
    U->>N3: SELECT pgraft_add_node('node3:5433')
    N1->>N3: Connect to node 3
    N3->>N3: Start background worker
    N3->>N3: Join cluster
    Note over N1,N3: Cluster formed with 3 nodes
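
Expressed as SQL, the sequence above might look like the following sketch. The function names and 'host:port' arguments are taken directly from the diagram; which node each pgraft_add_node() call is issued against also follows the diagram rather than a confirmed requirement.

  -- On node 1: start the background worker, initialize the Raft node, start the network server
  SELECT pgraft_init();

  -- Add the remaining nodes; node 1 then connects to each peer and the peer joins the cluster
  SELECT pgraft_add_node('node2:5433');
  SELECT pgraft_add_node('node3:5433');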

2. Leader Election Process

sequenceDiagram
    participant N1 as Node 1 (Leader)
    participant N2 as Node 2 (Follower)
    participant N3 as Node 3 (Follower)
    loop Heartbeat
        N1->>N2: AppendEntries (heartbeat)
        N1->>N3: AppendEntries (heartbeat)
        N2->>N1: AppendEntries Response
        N3->>N1: AppendEntries Response
    end
    Note over N1: Leader fails
    N2->>N2: Election timeout
    N2->>N3: RequestVote
    N3->>N2: Vote granted
    N2->>N2: Become leader
    N2->>N3: AppendEntries (heartbeat)

3. Log Replication

sequenceDiagram
    participant U as User
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    U->>L: INSERT/UPDATE/DELETE
    L->>L: Append to log
    L->>F1: AppendEntries (log entry)
    L->>F2: AppendEntries (log entry)
    F1->>L: AppendEntries Response
    F2->>L: AppendEntries Response
    L->>L: Commit entry
    L->>F1: AppendEntries (commit)
    L->>F2: AppendEntries (commit)
    L->>U: Transaction committed

Data Flow Architecture

1. Command Processing Flow

SQL Function → Command Queue → Background Worker → Go Raft Library → Network
     ↑                                                                  ↓
     └─────────── Command Status ← Shared Memory ← Raft State ←─────────┘
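
In practice this means a management call such as pgraft_add_node() (shown in the initialization sequence earlier) does not touch the network itself; presumably it only enqueues a command, and the background worker reports the outcome through the command status area in shared memory.

  -- Sketch: the SQL call enqueues an ADD_NODE command; the background worker
  -- dequeues it, drives the Go Raft library, and records the result in the
  -- command status FIFO held in shared memory.
  SELECT pgraft_add_node('node2:5433');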

2. Shared Memory Layout

┌─────────────────────────────────────────────────────────────┐
│                        Shared Memory                         │
├─────────────────────────────────────────────────────────────┤
│ Worker State                                                 │
│ ├─ Status (IDLE/INIT/RUNNING/STOPPING/STOPPED)               │
│ ├─ Node ID, Address, Port                                    │
│ ├─ Cluster Name                                              │
│ └─ Initialization Flags                                      │
├─────────────────────────────────────────────────────────────┤
│ Command Queue (Circular Buffer)                              │
│ ├─ Command Type (INIT/ADD_NODE/REMOVE_NODE/LOG_APPEND)       │
│ ├─ Command Data                                              │
│ ├─ Timestamp                                                 │
│ └─ Queue Head/Tail Pointers                                  │
├─────────────────────────────────────────────────────────────┤
│ Command Status FIFO                                          │
│ ├─ Command ID                                                │
│ ├─ Status (PENDING/PROCESSING/COMPLETED/FAILED)              │
│ ├─ Error Message                                             │
│ └─ Completion Time                                           │
├─────────────────────────────────────────────────────────────┤
│ Cluster State                                                │
│ ├─ Current Leader ID                                         │
│ ├─ Current Term                                              │
│ ├─ Node Membership                                           │
│ └─ Log Statistics                                            │
└─────────────────────────────────────────────────────────────┘

Network Architecture

1. Peer-to-Peer Communication

┌─────────────┐    TCP    ┌─────────────┐    TCP    ┌─────────────┐
│   Node 1    │◄─────────►│   Node 2    │◄─────────►│   Node 3    │
│             │           │             │           │             │
│ Port: 5433  │           │ Port: 5434  │           │ Port: 5435  │
│ Raft Port:  │           │ Raft Port:  │           │ Raft Port:  │
│    8001     │           │    8002     │           │    8003     │
└─────────────┘           └─────────────┘           └─────────────┘

2. Message Types

  • RequestVote: Candidate requesting votes during elections
  • AppendEntries: Leader sending log entries and heartbeats
  • InstallSnapshot: Leader sending snapshot to catch up slow followers
  • Heartbeat: Regular leader-to-follower communication

Failure Scenarios and Recovery

1. Leader Failure

Normal Operation → Leader Fails → Election Timeout → New Election → New Leader 

2. Network Partition

Full Connectivity → Network Split
    ├─► Partition A (majority) → Continues Operation
    └─► Partition B (minority) → Stops Accepting Writes

3. Node Recovery

Node Down → Node Restarts → Joins Cluster → Catches Up Log → Active Participant 

Security Considerations

1. Network Security

  • TCP connections between peers
  • Configurable IP addresses and ports
  • No built-in encryption (relies on network-level security)

2. Access Control

  • PostgreSQL's native authentication
  • Extension functions require appropriate privileges (see the sketch below)
  • Shared memory access controlled by PostgreSQL
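
A possible privilege setup using PostgreSQL's standard GRANT/REVOKE machinery is sketched below. The role name is illustrative, and the single-text-argument signature of pgraft_add_node() is inferred from the initialization example above rather than stated in this document.

  -- Hypothetical: restrict cluster management functions to a dedicated role
  REVOKE EXECUTE ON FUNCTION pgraft_init() FROM PUBLIC;
  REVOKE EXECUTE ON FUNCTION pgraft_add_node(text) FROM PUBLIC;
  GRANT EXECUTE ON FUNCTION pgraft_init() TO cluster_admin;
  GRANT EXECUTE ON FUNCTION pgraft_add_node(text) TO cluster_admin;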

Performance Characteristics

1. Latency

  • Leader election: ~1-5 seconds (configurable)
  • Log replication: Network RTT + disk I/O
  • Heartbeat interval: 1 second (configurable)

2. Throughput

  • Single leader handles all writes
  • Followers can serve read-only queries
  • Log replication limited by network bandwidth

3. Scalability

  • Optimal with 3-5 nodes
  • More nodes increase election time
  • Network partitions affect availability

Configuration Parameters

1. Network Settings

  • pgraft.listen_address: IP address to bind
  • pgraft.listen_port: Port for Raft communication
  • pgraft.peer_timeout: Network timeout for peer connections

2. Raft Parameters

  • pgraft.heartbeat_interval: Heartbeat frequency (ms)
  • pgraft.election_timeout: Election timeout range (ms)
  • pgraft.max_log_entries: Maximum log entries per batch

3. Operational Settings

  • pgraft.cluster_name: Unique cluster identifier
  • pgraft.debug_enabled: Enable debug logging
  • pgraft.health_period_ms: Health check frequency
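
As a sketch, these parameters can be set like any other PostgreSQL configuration variable; the values below are arbitrary examples (the port and timings echo the figures used elsewhere in this document), and whether a given parameter takes effect on reload or only after a restart is not specified here.

  -- Example values only; parameter names are the GUCs listed above
  ALTER SYSTEM SET pgraft.cluster_name = 'demo_cluster';
  ALTER SYSTEM SET pgraft.listen_address = '10.0.0.1';
  ALTER SYSTEM SET pgraft.listen_port = 8001;
  ALTER SYSTEM SET pgraft.heartbeat_interval = 1000;  -- ms
  ALTER SYSTEM SET pgraft.election_timeout = 5000;    -- ms
  ALTER SYSTEM SET pgraft.debug_enabled = off;
  SELECT pg_reload_conf();  -- or restart, depending on the parameter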

Monitoring and Observability

1. Cluster Health

  • Leader election status
  • Node membership status
  • Network connectivity
  • Log replication lag
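
This document does not name the SQL interface for reading cluster health, so the query below is purely illustrative: pgraft_get_leader() and pgraft_get_term() are hypothetical function names standing in for whatever pgraft exposes over the cluster state kept in shared memory (current leader ID, current term, node membership, log statistics).

  -- Hypothetical monitoring calls; the actual function names are not given here
  SELECT pgraft_get_leader() AS current_leader,
         pgraft_get_term()   AS current_term;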

2. Performance Metrics

  • Command processing latency
  • Network message rates
  • Memory usage
  • Background worker status

3. Logging

  • Raft protocol events
  • Network communication
  • Error conditions
  • Performance statistics

Deployment Considerations

1. Hardware Requirements

  • Sufficient RAM for shared memory
  • Network bandwidth for replication
  • Disk I/O for log persistence
  • CPU for consensus processing

2. Network Requirements

  • Low-latency network between nodes
  • Reliable network connectivity
  • Sufficient bandwidth for replication
  • Firewall configuration for peer ports

3. PostgreSQL Configuration

  • Shared memory allocation
  • Background worker limits
  • Connection limits
  • Logging configuration
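
The PostgreSQL-side settings above map to standard server parameters. A rough sketch with placeholder values follows; the numbers are not recommendations from this document.

  -- Example values only
  ALTER SYSTEM SET max_worker_processes = 16;     -- headroom for the pgraft background worker
  ALTER SYSTEM SET max_connections = 200;         -- connection limits
  ALTER SYSTEM SET log_min_messages = 'warning';  -- logging configuration
  -- If pgraft reserves its shared memory at server start (as assumed earlier),
  -- shared-memory-related changes take effect only after a restart.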

This architecture provides a robust foundation for distributed PostgreSQL clusters with automatic failover, consistent replication, and high availability.