DEV Community

Cover image for Zero-Downtime RDS to Aurora Serverless v2 Migration: A Step-by-Step Guide
Garrett Yan
Garrett Yan

Posted on

Zero-Downtime RDS to Aurora Serverless v2 Migration: A Step-by-Step Guide

Migrating from RDS to Aurora Serverless v2 can reduce database costs by up to 40% while improving performance and scalability. In this guide, I'll walk you through a production-tested migration strategy that ensures zero downtime for your applications.

Table of Contents

Why Aurora Serverless v2?

Aurora Serverless v2 offers several advantages over traditional RDS:

  • Auto-scaling: Scales compute capacity from 0.5 to 128 ACUs in seconds
  • Cost Efficiency: Pay only for the capacity you use
  • High Availability: Built-in fault tolerance across multiple AZs
  • Performance: Up to 5x faster than standard MySQL

Prerequisites

Before starting the migration, ensure you have:

  • RDS instance running MySQL 5.7+ or PostgreSQL 10+
  • AWS CLI configured with appropriate permissions
  • Terraform installed (for infrastructure as code)
  • Application connection strings that can be updated
  • Backup of your current database

Migration Strategy Overview

Our zero-downtime approach involves:

  1. Creating an Aurora read replica from RDS
  2. Promoting the replica to a standalone cluster
  3. Enabling Serverless v2 on the cluster
  4. Switching application traffic with minimal disruption

Step 1: Assess Your Current RDS Setup

First, gather metrics to properly size your Aurora Serverless v2 cluster:

# Get current RDS metrics aws cloudwatch get-metric-statistics \ --namespace AWS/RDS \ --metric-name CPUUtilization \ --dimensions Name=DBInstanceIdentifier,Value=your-rds-instance \ --start-time 2024-01-01T00:00:00Z \ --end-time 2024-01-07T00:00:00Z \ --period 3600 \ --statistics Maximum,Average 
Enter fullscreen mode Exit fullscreen mode

Key metrics to analyze:

  • CPU utilization patterns
  • Connection count
  • IOPS requirements
  • Storage size

Step 2: Plan Aurora Serverless v2 Capacity

Based on your RDS metrics, calculate the required ACU range:

# terraform/aurora-serverless-v2.tf locals { # ACU calculation based on RDS instance type # db.r5.large = 2 vCPUs, 16 GB RAM ≈ 4-16 ACUs min_acu = 2 max_acu = 16 } resource "aws_rds_cluster" "aurora_serverless_v2" { cluster_identifier = "my-app-aurora-cluster" engine = "aurora-mysql" engine_mode = "provisioned" engine_version = "8.0.mysql_aurora.3.02.0" database_name = "myapp" master_username = "admin" master_password = random_password.db_password.result serverlessv2_scaling_configuration { max_capacity = local.max_acu min_capacity = local.min_acu } backup_retention_period = 7 preferred_backup_window = "03:00-04:00" enabled_cloudwatch_logs_exports = ["error", "general", "slowquery"] tags = { Environment = "production" ManagedBy = "terraform" } } 
Enter fullscreen mode Exit fullscreen mode

Step 3: Create Aurora Read Replica

Create an Aurora read replica from your RDS instance:

# First, create a snapshot of RDS resource "aws_db_snapshot" "rds_snapshot" { db_instance_identifier = "existing-rds-instance" db_snapshot_identifier = "pre-migration-snapshot" } # Create Aurora cluster from snapshot resource "aws_rds_cluster" "aurora_from_snapshot" { cluster_identifier = "aurora-migration-cluster" engine = "aurora-mysql" engine_version = "8.0.mysql_aurora.3.02.0" snapshot_identifier = aws_db_snapshot.rds_snapshot.id # Enable binary logging for replication enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"] lifecycle { ignore_changes = [snapshot_identifier] } } 
Enter fullscreen mode Exit fullscreen mode

Step 4: Implement the Migration

4.1 Set Up Continuous Replication

Configure DMS for continuous replication:

# scripts/setup_dms_replication.py import boto3 import time dms = boto3.client('dms') def create_replication_instance(): response = dms.create_replication_instance( ReplicationInstanceIdentifier='rds-to-aurora-migration', ReplicationInstanceClass='dms.r5.large', AllocatedStorage=100, MultiAZ=True, Tags=[ {'Key': 'Purpose', 'Value': 'RDS-Aurora-Migration'}, ] ) return response['ReplicationInstance']['ReplicationInstanceArn'] def create_migration_task(source_endpoint, target_endpoint, rep_instance_arn): response = dms.create_replication_task( ReplicationTaskIdentifier='rds-aurora-continuous-sync', SourceEndpointArn=source_endpoint, TargetEndpointArn=target_endpoint, ReplicationInstanceArn=rep_instance_arn, MigrationType='full-load-and-cdc', TableMappings='''{ "rules": [{ "rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": { "schema-name": "%", "table-name": "%" }, "rule-action": "include" }] }''' ) return response # Monitor replication lag def check_replication_lag(): response = dms.describe_replication_tasks( Filters=[ { 'Name': 'replication-task-id', 'Values': ['rds-aurora-continuous-sync'] } ] ) task = response['ReplicationTasks'][0] stats = task['ReplicationTaskStats'] print(f"Tables loaded: {stats['TablesLoaded']}") print(f"Tables loading: {stats['TablesLoading']}") print(f"Full load progress: {stats['FullLoadProgressPercent']}%") return stats 
Enter fullscreen mode Exit fullscreen mode

4.2 Application Cutover Strategy

Implement a connection manager for seamless cutover:

# app/db_connection_manager.py import os import pymysql from datetime import datetime class DatabaseConnectionManager: def __init__(self): self.use_aurora = os.environ.get('USE_AURORA', 'false').lower() == 'true' self.rds_endpoint = os.environ.get('RDS_ENDPOINT') self.aurora_endpoint = os.environ.get('AURORA_ENDPOINT') def get_connection(self): endpoint = self.aurora_endpoint if self.use_aurora else self.rds_endpoint connection = pymysql.connect( host=endpoint, user=os.environ.get('DB_USER'), password=os.environ.get('DB_PASSWORD'), database=os.environ.get('DB_NAME'), connect_timeout=5, read_timeout=10, write_timeout=10, max_allowed_packet=64 * 1024 * 1024 ) # Log connection for monitoring  print(f"Connected to: {endpoint} at {datetime.now()}") return connection def health_check(self): try: conn = self.get_connection() with conn.cursor() as cursor: cursor.execute("SELECT 1") result = cursor.fetchone() conn.close() return True except Exception as e: print(f"Health check failed: {str(e)}") return False 
Enter fullscreen mode Exit fullscreen mode

4.3 Gradual Traffic Migration

Use Route 53 weighted routing for gradual migration:

# terraform/route53_weighted.tf resource "aws_route53_record" "database_weighted_rds" { zone_id = data.aws_route53_zone.main.zone_id name = "db.internal.myapp.com" type = "CNAME" ttl = "60" weighted_routing_policy { weight = var.rds_traffic_weight # Start at 100, gradually reduce } set_identifier = "rds" records = [aws_db_instance.rds.endpoint] } resource "aws_route53_record" "database_weighted_aurora" { zone_id = data.aws_route53_zone.main.zone_id name = "db.internal.myapp.com" type = "CNAME" ttl = "60" weighted_routing_policy { weight = var.aurora_traffic_weight # Start at 0, gradually increase } set_identifier = "aurora" records = [aws_rds_cluster.aurora_serverless_v2.endpoint] } 
Enter fullscreen mode Exit fullscreen mode

Step 5: Post-Migration Optimization

5.1 Auto-scaling Configuration

Fine-tune Aurora Serverless v2 scaling:

# scripts/optimize_aurora_scaling.py import boto3 import json from datetime import datetime, timedelta cloudwatch = boto3.client('cloudwatch') rds = boto3.client('rds') def analyze_acu_usage(cluster_id, days=7): """Analyze ACU usage patterns to optimize scaling""" end_time = datetime.utcnow() start_time = end_time - timedelta(days=days) response = cloudwatch.get_metric_statistics( Namespace='AWS/RDS', MetricName='ServerlessDatabaseCapacity', Dimensions=[ {'Name': 'DBClusterIdentifier', 'Value': cluster_id} ], StartTime=start_time, EndTime=end_time, Period=3600, # 1 hour  Statistics=['Average', 'Maximum', 'Minimum'] ) # Calculate optimal ACU range  data_points = response['Datapoints'] if data_points: avg_acu = sum(dp['Average'] for dp in data_points) / len(data_points) max_acu = max(dp['Maximum'] for dp in data_points) # Recommend settings with 20% headroom  recommended_min = max(0.5, avg_acu * 0.5) recommended_max = min(128, max_acu * 1.2) print(f"Current usage analysis:") print(f"Average ACU: {avg_acu:.2f}") print(f"Peak ACU: {max_acu:.2f}") print(f"Recommended min ACU: {recommended_min:.2f}") print(f"Recommended max ACU: {recommended_max:.2f}") return { 'min_acu': recommended_min, 'max_acu': recommended_max } def update_scaling_configuration(cluster_id, min_acu, max_acu): """Update Aurora Serverless v2 scaling configuration""" response = rds.modify_db_cluster( DBClusterIdentifier=cluster_id, ServerlessV2ScalingConfiguration={ 'MinCapacity': min_acu, 'MaxCapacity': max_acu }, ApplyImmediately=True ) print(f"Updated scaling configuration for {cluster_id}") print(f"New range: {min_acu} - {max_acu} ACUs") return response 
Enter fullscreen mode Exit fullscreen mode

5.2 Performance Monitoring

Set up comprehensive monitoring:

# terraform/aurora_monitoring.tf resource "aws_cloudwatch_dashboard" "aurora_serverless_v2" { dashboard_name = "aurora-serverless-v2-monitoring" dashboard_body = jsonencode({ widgets = [ { type = "metric" width = 12 height = 6 properties = { metrics = [ ["AWS/RDS", "ServerlessDatabaseCapacity", "DBClusterIdentifier", aws_rds_cluster.aurora_serverless_v2.id], [".", "CPUUtilization", ".", "."], [".", "DatabaseConnections", ".", "."] ] period = 300 stat = "Average" region = var.aws_region title = "Aurora Serverless v2 Metrics" } }, { type = "metric" width = 12 height = 6 properties = { metrics = [ ["AWS/RDS", "ReadLatency", "DBClusterIdentifier", aws_rds_cluster.aurora_serverless_v2.id], [".", "WriteLatency", ".", "."], [".", "ReadThroughput", ".", "."], [".", "WriteThroughput", ".", "."] ] period = 300 stat = "Average" region = var.aws_region title = "Database Performance" } } ] }) } # Alarms for critical metrics resource "aws_cloudwatch_metric_alarm" "high_cpu" { alarm_name = "aurora-serverless-v2-high-cpu" comparison_operator = "GreaterThanThreshold" evaluation_periods = "2" metric_name = "CPUUtilization" namespace = "AWS/RDS" period = "300" statistic = "Average" threshold = "80" alarm_description = "This metric monitors Aurora CPU utilization" dimensions = { DBClusterIdentifier = aws_rds_cluster.aurora_serverless_v2.id } alarm_actions = [aws_sns_topic.alerts.arn] } 
Enter fullscreen mode Exit fullscreen mode

Results and Lessons Learned

After completing the migration for our production workload:

Performance Improvements

  • Query performance: 35% faster on average
  • Connection time: Reduced from 250ms to 45ms
  • Failover time: Improved from 60s to <30s

Cost Savings

  • Compute costs: Reduced by 42% due to auto-scaling
  • Storage costs: 15% reduction with Aurora storage optimization
  • Overall savings: $3,200/month for our workload

Key Lessons

  1. Test scaling patterns thoroughly before production cutover
  2. Monitor replication lag closely during migration
  3. Use connection pooling to maximize efficiency
  4. Start conservative with ACU settings, then optimize

Common Pitfalls to Avoid

  • Don't skip the replication lag monitoring
  • Ensure all application connection strings are updated
  • Test failover scenarios before going live
  • Keep the old RDS instance for at least 7 days post-migration

Conclusion

Migrating from RDS to Aurora Serverless v2 requires careful planning but delivers significant benefits. The zero-downtime approach ensures business continuity while the auto-scaling capabilities of Serverless v2 provide both cost savings and performance improvements.

Have you migrated to Aurora Serverless v2? What challenges did you face? Share your experiences in the comments!

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.