Enterprise-grade multi-region active/active architecture with near-zero recovery time, comprehensive DNS failover, and AWS Resilience Hub policy compliance for mission-critical applications.
- Project Overview
- Why High Availability Matters
- Architecture Design
- Security & Network Controls
- Resilience Framework
- Chaos Engineering
- CI/CD Automation
- Infrastructure as Code
- Documentation
- License
This project implements a highly resilient serverless architecture with AWS Lambda functions deployed in private VPCs across multiple AWS regions (Ireland and Frankfurt). It features comprehensive security controls, automated failover mechanisms, and stringent disaster recovery capabilities through AWS Resilience Hub policy enforcement.
Diagram (mindmap, Lambda in Private VPC): Infrastructure (Multi-Region VPCs, Private Subnets, VPC Endpoints, DNS Firewall, Flow Logs); Security (Private DNS, WAF Protection, Network ACLs, IAM Least Privilege, KMS Encryption); Resilience (Mission-Critical Policy, RTO/RPO Enforcement, Multi-Region Active/Active, Automatic Failover, Chaos Engineering Tests); Data Layer (DynamoDB Global Tables, Cross-Region Replication, Point-in-Time Recovery, Backup/Restore Automation, Dead Letter Queues); Compute & API (Lambda Functions, API Gateway, Custom Domain, Route 53 Failover, Health Checks); CI/CD & Observability (Security Scanning, Automated Deployment, CloudWatch Monitoring, X-Ray Tracing, Alarm Notifications).

- 99.99% Uptime through multi-region active/active architecture
- Near-zero RPO with DynamoDB global tables and cross-region replication
- Region-level RTO of 1 hour enforced by AWS Resilience Hub policy
- Comprehensive security controls with private VPCs and WAF protection
- Automated failover through Route 53 health checks and weighted routing
- Mission-critical compliance with industry best practices and standards
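
The uptime figure above can be turned into a concrete downtime budget. The following is a minimal sketch of that arithmetic, not part of the project itself; the function names are hypothetical, and the combined-availability formula assumes the two regions fail independently, which is optimistic for shared dependencies such as DNS or a bad deployment.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a target availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def active_active_availability(region_availability: float, regions: int = 2) -> float:
    """Availability when any one healthy region can serve all traffic.

    Assumption: regions fail independently, so the service is down only
    when every region is down at the same time.
    """
    return 1.0 - (1.0 - region_availability) ** regions


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.9999)
    combined = active_active_availability(0.999)
    print(f"99.99% allows about {budget:.2f} minutes of downtime per year")
    print(f"Two independent regions at 99.9% each give {combined:.6f}")
```

A 99.99% target leaves roughly 52.6 minutes of downtime per year, which is why a single-region outage has to be absorbed by the other region rather than waited out.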
High availability isn't just a technical preference; it's a business imperative with far-reaching implications for modern organizations. Our multi-region active/active architecture directly addresses the following critical concerns:
Diagram (mindmap, High Availability Impact Areas): Financial Impact (Direct Revenue Loss, Recovery Costs, Regulatory Penalties, Operational Inefficiencies); Operational Impact (Process Disruption, Decision Delays, Workflow Interruption, Productivity Loss); Reputation & Trust (Customer Confidence, Brand Perception, Market Position, Partner Relations); Regulatory & Compliance (Evidence Collection, Audit Requirements, Control Efficacy, Legal Consequences).

- Direct Revenue Impact: For mission-critical systems, downtime typically costs $1,000-5,000 per minute
- Recovery Expenses: Emergency response activities and overtime costs add 30-50% to normal operational costs
- SLA Violations: Financial penalties for failing to meet contractual uptime commitments
- Operational Inefficiency: Teams resort to slower manual processes during outages, reducing productivity by 40-60%
- Critical Process Disruption: Security assessment and compliance processes stall during outages
- Decision Quality Degradation: Lack of real-time data forces decisions based on incomplete information
- Cross-system Impacts: Dependent systems and integration partners experience cascading failures
- Recovery Time Drain: IT teams diverted from strategic initiatives to handle recovery operations
Diagram (pie chart, Reputational Impact By Hours of Downtime), relative impact scores: 1 hour (Low Impact): 1; 2-4 hours (Moderate): 3; 8-12 hours (High): 7; 24+ hours (Severe): 9; 48+ hours (Critical): 8.

- Trust Erosion: Customer confidence drops significantly after prolonged or repeated outages
- Brand Damage: Social media amplifies service disruptions, creating lasting negative impressions
- Competitive Disadvantage: Competitors with better uptime gain market advantage during outages
- Partner Relations: Service disruptions strain relationships with business partners and integrators
Diagram (Regulatory & Compliance Impact): Application Downtime leads to Compliance Evidence Gaps, Audit Trail Disruption, and Assessment Continuity Loss, which in turn lead to Regulatory Requirements Violations, Audit Support Challenges, and Compliance Posture Degradation, respectively.

- NIST 800-53: Controls CP-2 (Contingency Plan), CP-7 (Alternate Processing Site), and CP-10 (System Recovery)
- ISO/IEC 27001: Annex A controls A.17.1.1 through A.17.2.1 in the 2013 edition (5.29, 5.30, and 8.14 in the 2022 edition) for business continuity and availability management
- PCI DSS: Requirement 12.10.1 for incident response capabilities and maintaining service availability
- GDPR: Article 32 obligations for ensuring "availability and resilience of processing systems and services"
- Industry SLAs: Contractual uptime requirements that carry financial and legal penalties when breached
Our multi-region active/active architecture and its resilience framework address these concerns by targeting near-zero RPO and a region-level RTO of at most one hour, failing over automatically, and producing compliance documentation that supports regulatory requirements across multiple frameworks.
A true active/active multi-region architecture with isolated private subnets, global data replication, and automated failover systems.
Diagram (Multi-Region Active/Active Architecture): Ireland (eu-west-1, VPC 10.1.0.0/16) and Frankfurt (eu-central-1, VPC 10.5.0.0/16) each contain private subnets across 3 AZs, Lambda functions, a DynamoDB global table replica, API Gateway with a custom domain, DNS Firewall, and VPC endpoints. The two DynamoDB replicas replicate to each other. WAF v2 and health checks sit in front of both API Gateways, both custom domains and the health checks feed Route 53 weighted/failover routing, and the AWS Resilience Hub Mission Critical policy covers the Lambda functions and DynamoDB tables in both regions.

| Component | Implementation | Purpose |
|---|---|---|
| Private VPC Infrastructure | Dedicated VPCs in each region (10.1.0.0/16 & 10.5.0.0/16) | Network isolation and security |
| Multi-AZ Deployment | 3 subnets across availability zones per region | High availability within each region |
| VPC Endpoints | Interface & Gateway endpoints for S3, EC2, DynamoDB | Secure AWS service access without internet exposure |
| DNS Firewall | Allow *.amazonaws.com, block all others | Control outbound DNS traffic from VPC |
| API Gateway | Regional endpoints with custom domain names | Exposing Lambda functions securely |
| Lambda Functions | Node.js 20.x with VPC configuration | Serverless compute in private subnets |
| Global Tables | DynamoDB with multi-region replication | Replicated data across regions (eventually consistent by default) with near-zero RPO |
| Route 53 Routing | Weighted records with health check failover | Intelligent traffic distribution across regions |
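
The interaction between weighted records and health checks in the last row can be modelled in a few lines. This is a simplified sketch, not the project's Route 53 configuration: the 50/50 weights and the `WeightedRecord` type are assumptions made for illustration. It reflects the documented Route 53 behaviour that unhealthy weighted records are skipped and that, when no record is healthy, all records are treated as healthy.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class WeightedRecord:
    region: str
    weight: int
    healthy: bool


def eligible_records(records: Sequence[WeightedRecord]) -> List[WeightedRecord]:
    """Healthy records only; fall back to all records if none is healthy."""
    healthy = [record for record in records if record.healthy]
    return healthy or list(records)


def traffic_share(records: Sequence[WeightedRecord]) -> Dict[str, float]:
    """Expected fraction of DNS answers per region."""
    pool = eligible_records(records)
    total = sum(record.weight for record in pool)
    if total <= 0:
        raise ValueError("eligible weights must sum to a positive value")
    return {record.region: record.weight / total for record in pool}


def resolve(records: Sequence[WeightedRecord], rng: random.Random) -> str:
    """Pick one region for a single DNS query, proportionally to weight."""
    pool = eligible_records(records)
    weights = [record.weight for record in pool]
    return rng.choices(pool, weights=weights, k=1)[0].region


if __name__ == "__main__":
    records = [
        WeightedRecord("eu-west-1", 50, healthy=False),  # Ireland failing
        WeightedRecord("eu-central-1", 50, healthy=True),
    ]
    print(traffic_share(records))
```

With Ireland marked unhealthy, all answers point at Frankfurt; once its health check recovers, the split returns to the configured weights without operator action.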
Diagram (Comprehensive Security Framework): VPC Security (DNS Firewall allowing AWS domains only, Flow Logs for network traffic auditing, Private DNS for secure name resolution); Network Controls (Network ACLs for stateless filtering, Security Groups for stateful filtering, explicit denials blocking RDP on 3389); Identity & Access (fine-grained least-privilege roles, resource-based policies, temporary credentials); Data Protection (KMS encryption with custom keys, encrypted SNS topics, encrypted log groups); Application Security (WAF IP reputation list, Anonymous IP protection, Common Rule Set, Known Bad Inputs, OS protection rules).

| Security Control | Implementation | Details |
|---|---|---|
| Private VPC Design | No internet gateways or NAT gateways | Complete isolation from public internet |
| DNS Firewall Rules | Two rules (Allow AWS, Block All) | Only permits *.amazonaws.com domains |
| Custom Network ACLs | Inbound/outbound rule sets | Blocks RDP (3389), limits outbound to HTTPS (443) |
| Security Group Rules | Precise traffic control | Lambda-to-endpoints only, no other traffic |
| VPC Flow Logs | Integration with CloudWatch | Network traffic visibility with encrypted storage |
| WAF Protection | Six managed rule groups | IP reputation, anonymous IP, common attacks, Linux/Unix protection |
| KMS Encryption | Custom key with automatic rotation | Encrypts SNS topics, CloudWatch logs |
| IAM Least Privilege | Scoped down permissions | Specific roles and permissions for each component |
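
The two-rule DNS Firewall setup in the table (allow `*.amazonaws.com`, block everything else) works by priority-ordered matching. The sketch below models that evaluation in plain Python under stated assumptions: the `FirewallRule` type, the `*` catch-all pattern, and the pass-through default for unmatched queries are illustrative choices, not the Route 53 Resolver API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FirewallRule:
    priority: int  # lower number is evaluated first
    pattern: str   # exact domain, "*.suffix", or "*" for every domain
    action: str    # "ALLOW" or "BLOCK"


def matches(pattern: str, domain: str) -> bool:
    """Case-insensitive match that ignores a trailing root dot."""
    domain = domain.rstrip(".").lower()
    pattern = pattern.rstrip(".").lower()
    if pattern == "*":
        return True
    if pattern.startswith("*."):
        # "*.amazonaws.com" matches subdomains only; keeping the leading dot
        # prevents "evilamazonaws.com" from matching.
        return domain.endswith(pattern[1:])
    return domain == pattern


def evaluate(rules: Sequence[FirewallRule], domain: str) -> str:
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule.pattern, domain):
            return rule.action
    return "ALLOW"  # assumption of this sketch: unmatched queries pass through


RULES = [
    FirewallRule(priority=1, pattern="*.amazonaws.com", action="ALLOW"),
    FirewallRule(priority=2, pattern="*", action="BLOCK"),
]

if __name__ == "__main__":
    for name in ("dynamodb.eu-west-1.amazonaws.com", "example.com"):
        print(name, evaluate(RULES, name))
```

Because the block-all rule has the lowest precedence, swapping the two priorities would block AWS service endpoints as well, so rule order matters as much as rule content.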
The AWS Resilience Hub integration enforces strict recovery time objectives (RTO) and recovery point objectives (RPO) through policy compliance and automated assessment.
Diagram (Mission Critical Resilience Framework): the Mission Critical Policy sets targets per failure domain: Regional (RTO 3600s, RPO 5s), AZ (RTO 1s, RPO 1s), Hardware (RTO 1s, RPO 1s), and Software (RTO 5400s, RPO 300s). Implementation components: multi-region active/active (regional RTO), DynamoDB global tables (regional RPO), multi-AZ deployment (AZ), AWS infrastructure redundancy (hardware), automated recovery procedures (software RTO), and a comprehensive backup strategy (software RPO).

| Failure Domain | RTO | RPO | Implementation Strategy |
|---|---|---|---|
| Regional | 3600s (1 hour) | 5s | Multi-region active/active with Route 53 failover, Global Tables |
| Availability Zone | 1s | 1s | Multi-AZ deployment with automatic failover |
| Hardware | 1s | 1s | AWS managed infrastructure redundancy |
| Software | 5400s (90 min) | 300s (5 min) | Automated recovery procedures, backup/restore, chaos testing |
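
A policy assessment boils down to comparing estimated recovery figures with the targets in the table. The following is a minimal sketch of that comparison, using the table's values; the `Target` type, the `breaches` function, and the sample estimates are hypothetical and do not represent the Resilience Hub API or real assessment output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    rto_seconds: int
    rpo_seconds: int


# Targets copied from the table above.
MISSION_CRITICAL_POLICY: Dict[str, Target] = {
    "Regional": Target(rto_seconds=3600, rpo_seconds=5),
    "Availability Zone": Target(rto_seconds=1, rpo_seconds=1),
    "Hardware": Target(rto_seconds=1, rpo_seconds=1),
    "Software": Target(rto_seconds=5400, rpo_seconds=300),
}


def breaches(policy: Dict[str, Target], estimates: Dict[str, Target]) -> List[str]:
    """List every failure domain whose estimate misses its target.

    An empty list means the estimates comply with the policy.
    """
    found: List[str] = []
    for domain, target in policy.items():
        estimate = estimates.get(domain)
        if estimate is None:
            found.append(f"{domain}: no estimate available")
            continue
        if estimate.rto_seconds > target.rto_seconds:
            found.append(
                f"{domain}: RTO {estimate.rto_seconds}s exceeds target {target.rto_seconds}s"
            )
        if estimate.rpo_seconds > target.rpo_seconds:
            found.append(
                f"{domain}: RPO {estimate.rpo_seconds}s exceeds target {target.rpo_seconds}s"
            )
    return found


if __name__ == "__main__":
    # Illustrative estimates only: regional recovery is slower than allowed.
    estimates = dict(MISSION_CRITICAL_POLICY)
    estimates["Regional"] = Target(rto_seconds=4000, rpo_seconds=5)
    for line in breaches(MISSION_CRITICAL_POLICY, estimates):
        print(line)
```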
The architecture includes comprehensive disaster recovery testing using AWS Fault Injection Service (FIS) to validate resilience capabilities.
Diagram (Chaos Engineering Framework): Fault Injection Service experiments drive four tests: Lambda Access Denial (IAM policy injection that denies Lambda access), DynamoDB Table Deletion (table delete automation), Point-In-Time Recovery Test (PITR restore automation), and Backup Restoration Test (backup restore automation). All four automations feed health check monitoring, which triggers Route 53 failover and recovery procedures.

| Test Scenario | Implementation | Success Metrics | Recovery Method |
|---|---|---|---|
| API Gateway Lambda Access Denial | IAM deny policy injection via SSM | Health check recovery time < RTO | Automatic failover to other region |
| DynamoDB Table Deletion | Scheduled table deletion via SSM | Table recreation time < RTO | Automated restore from backup or PITR |
| Point-In-Time Recovery | SSM automation document execution | Data recovery with RPO validation | Restoration to specified timestamp |
| Backup Restoration | SSM automation with backup ARN | Backup validation and integrity check | Full table recovery from backup |
| Route 53 Health Check Validation | Health check failure trigger | Weighted routing adjustment < RTO | Automatic traffic redistribution |
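
Several success metrics above compare a measured recovery time with the RTO. One way to derive that measurement, sketched here under assumptions, is from a timeline of health check samples recorded during an experiment. The sample format and function names are hypothetical, and the result is only as precise as the sampling interval.

```python
from typing import Optional, Sequence, Tuple

# (timestamp in seconds, endpoint healthy?) pairs, sorted by timestamp.
Sample = Tuple[float, bool]


def longest_outage_seconds(samples: Sequence[Sample]) -> Optional[float]:
    """Longest span from a first failing sample to the next healthy one.

    Returns 0.0 when no sample failed, and None when the timeline ends
    while the endpoint is still unhealthy (recovery was never observed).
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp  # outage begins
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # outage ends
    return None if outage_start is not None else longest


def within_rto(samples: Sequence[Sample], rto_seconds: float) -> bool:
    """True only if every observed outage recovered within the RTO."""
    outage = longest_outage_seconds(samples)
    return outage is not None and outage <= rto_seconds


if __name__ == "__main__":
    timeline = [(0, True), (30, False), (60, False), (90, True), (120, True)]
    print(longest_outage_seconds(timeline), within_rto(timeline, 3600))
```

Treating an unrecovered timeline as a failure keeps an experiment that was stopped too early from being reported as a pass.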
Diagram (CI/CD pipeline): a GitHub push or workflow dispatch triggers security scanning (cfn-lint, cfn-nag, Checkov, Scorecard, ZAP API scan), then AWS is configured for eu-west-1 and the core stack is deployed to Ireland, its outputs are collected, AWS is configured for eu-central-1 and the core stack is deployed to Frankfurt, then the auxiliary stacks are deployed (Route 53 configuration, WAF configuration, Resilience Hub app, disaster recovery tests), and finally the release is tagged.

- Pre-Deployment Security Validation: Multiple scanning tools analyze infrastructure templates
- Sequential Multi-Region Deployment: Ireland (primary) followed by Frankfurt (secondary)
- Cross-Region Resource Integration: Output collection and sharing between deployments
- Auxiliary Resource Configuration: Route 53, WAF, Resilience Hub, and Disaster Recovery
- Automated Version Management: Git tagging and release notes generation
- Rollback Capability: Automatic reversal on deployment failures
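
The sequential deployment and rollback behaviour listed above can be illustrated with a small model. This is a sketch under assumptions, not the project's workflow: CloudFormation by default rolls back the stack whose deployment failed, while undoing earlier stages in reverse order, as shown here, is an assumed interpretation of "automatic reversal". All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# (stage name, deploy action, rollback action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_pipeline(steps: Sequence[Step]) -> List[str]:
    """Run deploy actions in order and return a log of what happened.

    On the first failure, completed stages are rolled back in reverse
    order and no later stage is attempted.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, deploy, rollback in steps:
        try:
            deploy()
        except Exception as error:
            log.append(f"failed {name}: {error}")
            for done_name, done_rollback in reversed(completed):
                done_rollback()
                log.append(f"rolled back {done_name}")
            return log
        completed.append((name, rollback))
        log.append(f"deployed {name}")
    return log


def noop() -> None:
    """Placeholder action used in the demo below."""


def fail() -> None:
    """Placeholder action that simulates a failed stack deployment."""
    raise RuntimeError("simulated stack failure")


if __name__ == "__main__":
    steps = [
        ("core eu-west-1", noop, noop),
        ("core eu-central-1", fail, noop),
        ("auxiliary stacks", noop, noop),
    ]
    for line in run_pipeline(steps):
        print(line)
```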
This project is entirely defined using CloudFormation templates with comprehensive resource definitions for each component.
| Template | Description | Key Resources |
|---|---|---|
| template.yml | Core Infrastructure | VPCs, Subnets, Lambda Functions, API Gateway, DynamoDB, DNS Firewall, Security Groups, Network ACLs, Flow Logs, KMS Keys |
| route53.yml | DNS Configuration | Weighted A/AAAA Records, Health Check Integration, Failover Configuration, Domain Name Integration |
| app.yml | Resilience Hub | Mission Critical Policy Definition, RTO/RPO Targets, Multi-Resource Mapping, Assessment Schedule |
| disaster-recovery.yml | DR Testing | FIS Experiments, SSM Automation Documents, IAM Roles & Policies, Recovery Procedures, Health Checks |
| waf.yml | Security Rules | WAF WebACL, AWS Managed Rule Groups, API Gateway Association |
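
The templates have an implied deployment order: the core template goes to each region first, and the auxiliary templates follow. A dependency graph makes that order explicit. In this sketch the dependency edges are assumptions inferred from the pipeline description, and `graphlib` from the Python standard library (3.9+) performs the topological sort.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

CORE_IRELAND = "template.yml (eu-west-1)"
CORE_FRANKFURT = "template.yml (eu-central-1)"
CORE = {CORE_IRELAND, CORE_FRANKFURT}

# Each stack maps to the stacks that must exist before it (assumed edges).
DEPENDENCIES: Dict[str, Set[str]] = {
    CORE_IRELAND: set(),
    CORE_FRANKFURT: {CORE_IRELAND},  # consumes outputs collected from Ireland
    "route53.yml": set(CORE),
    "waf.yml": set(CORE),
    "app.yml": set(CORE),
    "disaster-recovery.yml": set(CORE),
}


def deployment_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid deployment order.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, stack in enumerate(deployment_order(DEPENDENCIES), start=1):
        print(position, stack)
```

The four auxiliary stacks have no edges between them in this model, so their relative order is not fixed and they could be deployed in parallel.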
- DNS Firewall Integration: Fully configured Route 53 DNS Firewall allowing only AWS domains
- Private DNS Configuration: Secure VPC DNS settings with customized resolution
- Comprehensive Network Controls: Custom ACLs and security groups with explicit deny rules
- Health Check System: Multiple Route 53 health checks for various service components
- Advanced WAF Protection: Six AWS managed rule groups including IP reputation and known attacks
- Global DynamoDB Tables: Cross-region replication with point-in-time recovery
- Principle of Least Privilege: Narrowly scoped IAM roles and permissions for all resources
- DynamoDB Recovery Runbook: Automated Systems Manager procedures for:
  - Point-in-Time Recovery
  - Backup Restoration
  - Table Recreation
  - Cross-Region Synchronization
- Lambda Function Recovery Runbook: Procedures covering:
  - Version Management
  - Provisioned Concurrency Adjustment
  - Memory/Execution Time Optimization
  - Error Handling and Retry Logic
- API Gateway Recovery Runbook: Workflow documentation for:
  - Endpoint Restoration
  - Custom Domain Reconfiguration
  - WAF Integration Recovery
  - Route 53 Health Check Adjustments
- IAM Automation Runbook: Procedures for:
  - Role and Policy Recovery
  - Permission Boundary Enforcement
  - Trust Relationship Verification
  - Cross-Account Access Management
- AWS Resilience Hub Documentation
- Disaster Recovery on AWS - Multi-site Active/Active
- AWS Well-Architected Framework - Reliability Pillar
- AWS Best Practices for DDoS Resiliency
- Route 53 Application Recovery Controller
| Impact Category | Financial | Operational | Reputational | Regulatory |
|---|---|---|---|---|
| Confidentiality | | | | |
| Integrity | | | | |
| Availability | | | | |
This project is licensed under the Apache License 2.0 - see LICENSE.md for details.
Last updated: 2025-04-16