CloudPath Academy

Your guide to AWS certification success


AWS Certified SysOps Administrator - Associate (SOA-C03) Domain 2

Reliability and Business Continuity

Official Exam Guide: Domain 2: Reliability and Business Continuity
Skill Builder: AWS Certified SysOps Administrator - Associate (SOA-C03) Exam Prep

Note: Some Skill Builder labs require a subscription.


How to Study This Domain Effectively

Study Tips

  1. Understand the difference between scalability and elasticity - Scalability is the ability to handle increased load by adding resources, while elasticity is automatically adding or removing resources based on demand. The exam tests your knowledge of when to use Auto Scaling (elasticity) versus manual scaling (scalability), and how to configure scaling policies (target tracking, step scaling, scheduled scaling) for different scenarios. Practice setting up Auto Scaling groups with different scaling policies to understand their behaviors.

  2. Master Multi-AZ versus Multi-Region architectures - Know when each provides the appropriate level of availability and disaster recovery. Multi-AZ protects against Availability Zone (AZ) failures within a Region (high availability), while Multi-Region protects against Region-wide outages (disaster recovery). Exam questions test your ability to design architectures that meet specific Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements, so understand how the four strategies (backup/restore, pilot light, warm standby, multi-site active-active) map to RTO/RPO targets.

  3. Practice hands-on with AWS Backup - Set up backup plans, vaults, and test restores for different resource types (EC2, EBS, RDS, DynamoDB, EFS, S3). The exam tests your knowledge of backup frequency, retention policies, lifecycle transitions, and cross-region copy capabilities. Understanding how to configure backup plans using tags versus resource assignments is critical, as is knowing the restore process for each service type.

  4. Learn Elastic Load Balancing (ELB) health check configurations thoroughly - Understand how health check parameters (interval, timeout, healthy threshold, unhealthy threshold) affect target availability and how misconfigurations cause service disruptions. The exam presents troubleshooting scenarios where instances are being marked unhealthy incorrectly, or unhealthy instances aren’t being removed quickly enough. Practice configuring health checks for Application Load Balancer (ALB), Network Load Balancer (NLB), and Gateway Load Balancer (GWLB).

  5. Focus on cost-optimized backup and recovery strategies - Understand S3 storage classes for backups (S3 Standard-IA, S3 Glacier, S3 Glacier Deep Archive), EBS snapshot lifecycle policies, and RDS automated backup retention. The exam tests scenarios where you must balance cost with recovery requirements. Know how to implement lifecycle policies that automatically transition backups to cheaper storage classes while meeting compliance retention requirements.
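
The lifecycle transitions described in tip 5 can be pictured as an age-based lookup. The Python sketch below is illustrative only: the day thresholds, the `TRANSITIONS` table, and the function name are assumptions made for this example, not AWS defaults or recommendations, and nothing here calls an AWS API.

```python
# Hypothetical age thresholds (in days) for an illustrative backup lifecycle.
# These numbers are assumptions for the example, not AWS defaults.
TRANSITIONS = [
    (0, "S3 Standard"),
    (30, "S3 Standard-IA"),
    (90, "S3 Glacier"),
    (180, "S3 Glacier Deep Archive"),
]
EXPIRE_AFTER_DAYS = 2555  # roughly seven years of retention (assumed)


def storage_class_for_age(age_days):
    """Return the storage class a backup of this age sits in, or None once expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= EXPIRE_AFTER_DAYS:
        return None
    current = TRANSITIONS[0][1]
    # Walk the table in order and keep the last threshold the backup has passed.
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current
```

Changing the thresholds in the table is a quick way to reason about how long a backup stays in each tier before a compliance retention period ends.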

  1. Start with Auto Scaling fundamentals - Study the EC2 Auto Scaling User Guide, focusing on launch templates, Auto Scaling groups, scaling policies, and lifecycle hooks. Understand how Auto Scaling works with load balancers and how health checks determine instance health. Create hands-on labs where you configure target tracking scaling (maintain CPU at 50%), step scaling (add instances based on alarm thresholds), and scheduled scaling (scale for predictable traffic patterns). This foundation is essential for understanding elasticity across all AWS services.

  2. Deep dive into high availability patterns - Study Multi-AZ deployments for RDS, ElastiCache, EFS, and ELB. Understand how each service implements high availability differently. For example, RDS Multi-AZ uses synchronous replication with automatic failover, while ElastiCache for Redis uses replication groups. Practice configuring these Multi-AZ deployments and understand the failure scenarios they protect against. Learn when Multi-AZ is sufficient versus when you need Multi-Region.

  3. Master load balancer configuration and troubleshooting - Study the three current-generation load balancer types (ALB, NLB, GWLB) and their use cases. Focus heavily on health check configuration, target group settings, and cross-zone load balancing. Practice troubleshooting common issues: targets marked unhealthy, connection timeouts, SSL certificate problems, and sticky session issues. Understand how to interpret load balancer CloudWatch metrics and access logs for troubleshooting.

  4. Comprehensive backup and restore practice - Use AWS Backup to create backup plans for EC2, EBS, RDS, DynamoDB, EFS, and S3. Practice restoring each resource type and understand the differences in restore procedures. Study point-in-time recovery for RDS and DynamoDB, EBS snapshot restoration, and S3 versioning recovery. Understand cross-region backup copies and how to implement them for disaster recovery. Document RTO and RPO for each backup strategy you implement.

  5. Study disaster recovery strategies systematically - Learn the four DR strategies (backup/restore, pilot light, warm standby, multi-site active-active) in order of increasing cost and decreasing RTO/RPO. For each strategy, understand the AWS services involved, implementation steps, costs, and typical RTO/RPO values. The exam tests your ability to recommend the appropriate DR strategy given specific business requirements (cost constraints, acceptable downtime, data loss tolerance). Practice architecting each DR pattern for a sample application.
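
Step scaling, mentioned in item 1 of the list above, picks an adjustment based on how far the metric has moved past the alarm threshold. The sketch below models that lookup in plain Python. The step bounds, the adjustments, and the function name are hypothetical values chosen for illustration, and the code does not call EC2 Auto Scaling.

```python
import math

# Each step is (lower_bound, upper_bound, instances_to_add), with bounds
# expressed as an offset from the alarm threshold. Values are assumed.
STEPS = [
    (0, 10, 1),
    (10, 20, 2),
    (20, math.inf, 3),
]


def step_adjustment(metric_value, alarm_threshold, steps=STEPS):
    """Return how many instances to add for the given metric reading."""
    breach = metric_value - alarm_threshold
    for lower, upper, adjustment in steps:
        if lower <= breach < upper:
            return adjustment
    return 0  # metric is below the threshold, so no step applies
```

With an assumed threshold of 60% CPU, a reading of 75% is 15 points over the threshold and falls in the second step.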


Task 2.1: Implement scalability and elasticity

Skills & Corresponding Documentation

Skill 2.1.1: Configure and manage scaling mechanisms in compute environments

Why: Auto Scaling is fundamental to achieving elasticity in AWS and is one of the most tested topics in the SysOps exam. You must understand how to configure EC2 Auto Scaling groups, launch templates, scaling policies (target tracking, step scaling, simple scaling, scheduled scaling), and lifecycle hooks. Exam scenarios test your ability to troubleshoot scaling issues (instances not launching, scaling policies not triggering, instances not being replaced), optimize scaling performance (scale-out fast, scale-in slow), and integrate Auto Scaling with load balancers and CloudWatch alarms. Real-world SysOps administrators must design Auto Scaling configurations that handle traffic fluctuations while minimizing costs and maintaining application availability.
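
To build intuition for target tracking, it can help to work the arithmetic by hand. The sketch below uses a simplified proportional model, which is an assumption of this example rather than the service's actual algorithm. Real target tracking also applies warm-up periods and tends to scale in more conservatively than it scales out. The function name and parameters are hypothetical.

```python
import math


def target_tracking_capacity(current_capacity, current_metric, target_metric,
                             min_size, max_size):
    """Estimate the capacity that brings the per-instance metric back to target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Total load is roughly capacity * metric; divide by the target to size the group.
    desired = math.ceil(current_capacity * current_metric / target_metric)
    # The group never goes outside its configured minimum and maximum.
    return max(min_size, min(max_size, desired))
```

For example, four instances averaging 75% CPU against a 50% target work out to six instances under this model, as long as the group maximum allows it.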

AWS Documentation:

Skill 2.1.2: Implement caching by using AWS services to enhance dynamic scalability (for example, Amazon CloudFront, Amazon ElastiCache)

Why: Caching is a critical performance optimization and scalability strategy that reduces backend load and improves response times. The exam tests your knowledge of when to use CloudFront (content delivery network for static and dynamic content) versus ElastiCache (in-memory data store for database query results, session data). You must understand CloudFront cache behaviors, Time To Live (TTL) settings, invalidation, and origin failover. For ElastiCache, know when to use Redis (persistence, complex data structures, pub/sub) versus Memcached (simple caching, multi-threading) and how to configure clusters, replication groups, and parameter groups. Understanding caching strategies helps reduce costs by decreasing compute and database load while improving application performance.
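
The interaction between TTL, cache hits, and invalidation is easier to see in a small model. The following is a minimal in-process sketch of a cache-aside (lazy loading) pattern with a TTL. The `TTLCache` class is a hypothetical stand-in written for this example. It is not an ElastiCache or CloudFront client, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expiry_time)

    def get_or_load(self, key, loader):
        """Return a cached value, calling loader(key) on a miss or after expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the backend is not touched
        value = loader(key)  # cache miss: fall through to the backend
        self._store[key] = (value, now + self.ttl_seconds)
        return value

    def invalidate(self, key):
        """Drop a key before its TTL expires, similar in spirit to an invalidation."""
        self._store.pop(key, None)
```

A longer TTL means fewer backend calls but staler data, which is the same tradeoff the exam scenarios ask you to weigh.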

AWS Documentation:

Skill 2.1.3: Configure and manage scaling in AWS managed databases (for example, Amazon RDS, Amazon DynamoDB)

Why: Database scaling is critical for application performance and cost optimization, and the exam extensively tests both vertical (instance size) and horizontal (read replicas, sharding) scaling strategies. For RDS, you must understand how to scale instance types, add read replicas, enable storage autoscaling, and implement Aurora Auto Scaling. For DynamoDB, know the difference between provisioned capacity (with auto scaling) and on-demand capacity, how to use Global Secondary Indexes (GSIs) for query flexibility, and when to implement DynamoDB Accelerator (DAX) for caching. Understanding the scaling characteristics, costs, and limitations of each database service helps you design cost-effective, performant database architectures that meet application requirements.
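
When comparing provisioned and on-demand capacity for DynamoDB, it helps to be able to estimate capacity units yourself. The sketch below uses the standard sizing rules: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB, eventually consistent reads cost half as much, and one write capacity unit covers one write per second of an item up to 1 KB. The function names are hypothetical, and transactional requests are deliberately left out.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate RCUs for a steady read rate of same-sized items."""
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate WCUs for a steady write rate of same-sized items."""
    return max(1, math.ceil(item_size_bytes / 1024)) * writes_per_second
```

For instance, 100 strongly consistent reads per second of 6 KB items round up to two units each, for 200 RCUs in total.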

AWS Documentation:


Task 2.2: Implement highly available and resilient environments

Skills & Corresponding Documentation

Skill 2.2.1: Configure and troubleshoot Elastic Load Balancing (ELB) and Amazon Route 53 health checks

Why: Load balancer and DNS health checks are fundamental to high availability and are heavily tested through troubleshooting scenarios. You must understand how ELB health check parameters (interval, timeout, healthy/unhealthy thresholds, path, port) determine target health and how misconfigured health checks cause service disruptions. For Route 53, understand health check types (endpoint, calculated, CloudWatch alarm), failover routing policies, and how health checks integrate with DNS records. Exam questions present scenarios where health checks are marking targets as unhealthy incorrectly, failing to detect actual failures, or causing routing problems. Understanding health check mechanics is critical for maintaining application availability and implementing automated failover.
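
The way the healthy and unhealthy thresholds interact is easiest to see as a small state machine driven by consecutive check results. The sketch below is a simplified model, not ELB's implementation. The function names are hypothetical, and the detection estimate ignores the timeout and any delay before the first check.

```python
def simulate_health(results, healthy_threshold, unhealthy_threshold, start_healthy=True):
    """Return the target state after each check; results holds booleans (True = passed)."""
    healthy = start_healthy
    successes = failures = 0
    states = []
    for passed in results:
        if passed:
            successes, failures = successes + 1, 0
            if not healthy and successes >= healthy_threshold:
                healthy = True
        else:
            failures, successes = failures + 1, 0
            if healthy and failures >= unhealthy_threshold:
                healthy = False
        states.append("healthy" if healthy else "unhealthy")
    return states


def approx_detection_seconds(interval_seconds, unhealthy_threshold):
    """Rough time to mark a failed target unhealthy: threshold consecutive checks."""
    return interval_seconds * unhealthy_threshold
```

A single passing check resets the failure count, which is why an intermittently failing target can stay in service far longer than the interval and threshold alone suggest.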

AWS Documentation:

Skill 2.2.2: Configure fault-tolerant systems (for example, Multi-AZ deployments)

Why: Multi-AZ deployments are AWS’s primary mechanism for achieving high availability within a region and are fundamental to fault-tolerant architecture design. The exam tests your understanding of how different services implement Multi-AZ (RDS synchronous replication with automatic failover, ElastiCache replication groups, EFS automatic replication, ELB distribution across AZs) and when Multi-AZ provides sufficient resilience versus when Multi-Region is required. You must know how to configure Multi-AZ for RDS, Aurora, ElastiCache, and understand the failover process, detection times, and data consistency guarantees. Understanding Multi-AZ architectures is critical for designing systems that tolerate AZ failures without service disruption or data loss.
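
One way to reason about why redundancy across AZs helps is the textbook formula for redundant components: the system is down only when every copy is down at the same time. The sketch below assumes failures are fully independent, which is an idealization, and the 0.99 figure in the usage note is an arbitrary illustration, not an AWS SLA.

```python
def combined_availability(availabilities):
    """Availability of redundant copies, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        # Multiply the probability that each individual copy is unavailable.
        all_down *= (1.0 - availability)
    return 1.0 - all_down
```

Under these assumptions, two copies that are each available 99% of the time give roughly 99.99% combined availability. Correlated failures, such as a Region-wide event, break the independence assumption, which is where Multi-Region designs come in.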

AWS Documentation:


Task 2.3: Implement backup and restore strategies

Skills & Corresponding Documentation

Skill 2.3.1: Automate snapshots and backups for AWS resources (for example, Amazon EC2 instances, RDS DB instances, Amazon Elastic Block Store [Amazon EBS] volumes, Amazon S3 buckets, DynamoDB tables) by using AWS services (for example, AWS Backup)

Why: Automated backups are essential for data protection and disaster recovery, and the exam extensively tests AWS Backup configuration and management. You must understand how to create backup plans with rules (frequency, retention, lifecycle policies), assign resources using tags or resource IDs, configure backup vaults with encryption and access policies, and implement cross-region backup copies. Know the native backup capabilities of each service (RDS automated backups, EBS snapshots, DynamoDB point-in-time recovery) and when AWS Backup provides additional value through centralized management and compliance reporting. Understanding backup automation reduces operational overhead and ensures consistent data protection across your AWS environment.
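
Two ideas from the paragraph above lend themselves to a small model: lifecycle rules that move backups to cold storage, and assigning resources by tag. The sketch below is a plain-Python illustration. `BackupRule`, `select_by_tag`, and the resource dictionaries are hypothetical names for this example and do not mirror the AWS Backup API. The 90-day check reflects the documented AWS Backup requirement that backups moved to cold storage stay there for at least 90 days.

```python
from dataclasses import dataclass

MIN_DAYS_IN_COLD_STORAGE = 90  # minimum cold-storage duration for AWS Backup


@dataclass
class BackupRule:
    """Hypothetical stand-in for one rule in a backup plan."""
    name: str
    schedule: str            # for example, a cron expression
    cold_after_days: int     # move to cold storage after this many days
    delete_after_days: int   # total retention in days

    def validate(self):
        """Raise ValueError if retention leaves fewer than 90 days in cold storage."""
        if self.delete_after_days < self.cold_after_days + MIN_DAYS_IN_COLD_STORAGE:
            raise ValueError("retention must exceed the cold transition by at least 90 days")


def select_by_tag(resources, key, value):
    """Return identifiers of resources whose tags contain key=value."""
    return [r["arn"] for r in resources if r.get("tags", {}).get(key) == value]
```

Tag-based selection means a newly created resource is picked up as soon as it carries the matching tag, whereas an explicit resource list has to be updated by hand.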

AWS Documentation:

Skill 2.3.2: Use various methods to restore databases (for example, point-in-time restore) to meet recovery time objective (RTO), recovery point objective (RPO), and cost requirements

Why: Database restore procedures are critical for disaster recovery and are tested through scenarios requiring you to meet specific RTO and RPO requirements. You must understand the differences between restoring from automated backups, restoring from manual snapshots, and point-in-time recovery (restore to a chosen time within the backup retention period, up to the latest restorable time). For RDS, each of these creates a new DB instance rather than overwriting the existing one. Know the restore times for each method, data loss characteristics (RPO), and costs. For DynamoDB, understand the difference between on-demand backups and point-in-time recovery. The exam tests your ability to select the appropriate restore method based on business requirements (how much downtime is acceptable, how much data loss is tolerable, budget constraints).
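
For snapshot-based restores, the data loss you would incur is the gap between the failure and the most recent snapshot taken before it. The sketch below makes that arithmetic explicit. The function names and the timestamps in the usage note are assumptions for illustration, and nothing here queries a real database service.

```python
from datetime import datetime, timedelta


def data_loss_window(snapshot_times, failure_time):
    """Return the time between the failure and the latest snapshot taken before it."""
    usable = [t for t in snapshot_times if t <= failure_time]
    if not usable:
        raise ValueError("no snapshot exists before the failure")
    return failure_time - max(usable)


def meets_rpo(snapshot_times, failure_time, rpo):
    """True if restoring the latest usable snapshot loses no more data than the RPO allows."""
    return data_loss_window(snapshot_times, failure_time) <= rpo
```

With snapshots every six hours, a failure 5.5 hours after the last snapshot satisfies a six-hour RPO but not a one-hour RPO, which is the kind of case where point-in-time recovery becomes the better fit.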

AWS Documentation:

Skill 2.3.3: Implement versioning for storage services (for example, Amazon S3, Amazon FSx)

Why: Versioning provides protection against accidental deletion and overwrites, and the exam tests your understanding of versioning configuration and lifecycle management. For S3, you must know how versioning works (every overwrite creates a new version, and a delete adds a delete marker rather than removing data), how to enable and suspend versioning (once enabled, versioning can be suspended but not turned off), how versioning affects storage costs (all versions are stored and billed), and how to use lifecycle policies to expire or transition old versions. For FSx, understand automatic daily backups and user-initiated backups. The exam tests scenarios involving recovering deleted objects using versioning, implementing versioning with lifecycle policies to manage costs, and using Multi-Factor Authentication (MFA) Delete for additional protection against accidental deletion.
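
The recovery scenarios above are easier to follow with a toy model of a versioning-enabled bucket. The class below is an in-memory illustration written for this guide, not an S3 client. The class and method names are hypothetical, and it ignores version IDs, suspension, and MFA Delete.

```python
DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket."""

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, data):
        """An overwrite appends a new version; older versions are retained."""
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        """A simple delete adds a delete marker instead of removing data."""
        self._versions.setdefault(key, []).append(DELETE_MARKER)

    def get(self, key):
        """Return the current version, or None if the key is missing or deleted."""
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is DELETE_MARKER:
            return None
        return versions[-1]

    def undelete(self, key):
        """Removing the delete marker makes the previous version current again."""
        versions = self._versions.get(key, [])
        if versions and versions[-1] is DELETE_MARKER:
            versions.pop()

    def version_count(self, key):
        """Count stored data versions, each of which would add to storage cost."""
        return sum(1 for v in self._versions.get(key, []) if v is not DELETE_MARKER)
```

Note that the version count does not drop after a delete, which is why lifecycle rules for noncurrent versions matter for cost.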

AWS Documentation:

Skill 2.3.4: Follow disaster recovery procedures

Why: Disaster recovery planning and execution is critical for business continuity and is tested through scenarios requiring you to design and implement DR strategies. You must understand the four DR strategies (backup and restore, pilot light, warm standby, multi-site active-active) and their characteristics (RTO, RPO, cost, complexity). Know how to implement each strategy using AWS services: backup/restore using AWS Backup and S3, pilot light with pre-provisioned core infrastructure, warm standby with scaled-down production environment, and multi-site using Route 53 for traffic distribution. The exam tests your ability to select the appropriate DR strategy based on business requirements and understand the tradeoffs between cost and recovery time. Understanding DR procedures is essential for minimizing business impact during disasters.
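
Selecting a DR strategy usually comes down to finding the cheapest option that still meets the required recovery time. The sketch below encodes that reasoning. The RTO figures in `STRATEGIES` are placeholder assumptions chosen for illustration, not AWS guarantees, and a real decision would also weigh RPO and budget.

```python
# (strategy, assumed achievable RTO in minutes), ordered from cheapest to most expensive.
# The RTO figures are placeholders for illustration only.
STRATEGIES = [
    ("backup and restore", 24 * 60),
    ("pilot light", 60),
    ("warm standby", 10),
    ("multi-site active-active", 1),
]


def cheapest_strategy(required_rto_minutes):
    """Return the first (cheapest) strategy whose assumed RTO meets the requirement."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= required_rto_minutes:
            return name
    return None  # nothing in the table is fast enough
```

Because the table is ordered by cost, the first match is also the least expensive option that satisfies the requirement under these assumptions.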

AWS Documentation:


AWS Service FAQs


AWS Whitepapers


Final Thoughts

Domain 2 focuses on designing and implementing resilient, highly available systems with comprehensive backup and disaster recovery strategies. Success requires deep understanding of Auto Scaling, Multi-AZ architectures, load balancer configurations, and backup/restore procedures across multiple AWS services. Practice implementing each disaster recovery strategy (backup/restore, pilot light, warm standby, multi-site) to understand the tradeoffs between RTO, RPO, and cost. The ability to troubleshoot health checks, configure automated backups, and select appropriate scaling strategies is essential for real-world SysOps operations. Combine documentation study with extensive hands-on practice in configuring resilient architectures that survive failures while meeting business continuity requirements.