CloudPath Academy

Your guide to AWS certification success


AWS Certified Developer Associate (DVA-C02) Domain 4

Troubleshooting and Optimization

Official Exam Guide: Domain 4: Troubleshooting and Optimization
Skill Builder: AWS Developer Associate (DVA-C02) Exam Prep

Note: Some Skill Builder labs require a subscription.


How to Study This Domain Effectively

Study Tips

  1. Create intentional failures to practice troubleshooting - The best way to master troubleshooting is to break things deliberately. Deploy Lambda functions with errors, misconfigure IAM permissions, create resource bottlenecks, trigger throttling, and then use CloudWatch Logs, X-Ray traces, and metrics to diagnose the issues. Troubleshooting skills developed through hands-on debugging translate directly to exam scenarios where you must identify root causes from log excerpts and metric patterns.

  2. Build comprehensive observability into a sample application - Create an application that demonstrates all observability patterns: structured logging with JSON, custom CloudWatch metrics with embedded metric format, X-Ray tracing with annotations and metadata, CloudWatch Alarms for anomalies, and dashboards for visualization. Actually implementing these patterns helps you understand not just what they do, but when to use each one—knowledge that’s tested extensively throughout Domain 4.

  3. Master CloudWatch Logs Insights query language - Practice writing CloudWatch Logs Insights queries to filter, aggregate, and analyze log data. Learn to find errors, calculate latency percentiles, identify top API callers, and trace request flows through distributed systems. The exam tests your ability to write queries that extract meaningful information from logs, so hands-on query practice is essential.

  4. Experiment with performance optimization techniques - Profile Lambda functions with different memory settings using AWS Lambda Power Tuning, implement caching at multiple levels (API Gateway, CloudFront, ElastiCache, DAX), tune DynamoDB read/write capacity, and measure the impact. Understanding optimization through experimentation builds intuition for identifying performance bottlenecks and selecting appropriate optimization strategies during the exam.

  5. Create a troubleshooting decision tree - Build a systematic reference for diagnosing common AWS issues: Lambda timeouts (check memory, VPC NAT gateway, downstream dependencies), throttling (check service quotas, reserved concurrency, burst capacity), cold starts (use provisioned concurrency, optimize package size), permission errors (check IAM policies, resource policies, trust relationships). This structured approach helps you quickly navigate troubleshooting scenarios on the exam.

  6. Start with CloudWatch fundamentals - Begin by deeply understanding CloudWatch Logs (log groups, streams, retention, filtering), CloudWatch Metrics (namespaces, dimensions, statistics, periods), and CloudWatch Alarms (threshold types, composite alarms, anomaly detection). CloudWatch is the foundation for all AWS observability, and mastering it is essential for troubleshooting and optimization questions throughout the exam.

  7. Master X-Ray for distributed tracing - Study X-Ray concepts (traces, segments, subsegments, annotations, metadata), learn to instrument applications with X-Ray SDK, understand service maps for visualizing dependencies, and practice analyzing traces to identify bottlenecks. X-Ray appears frequently in exam questions about debugging distributed applications and microservices architectures.

  8. Learn systematic debugging approaches - Study how to troubleshoot Lambda errors (CloudWatch Logs for exceptions, X-Ray for cold starts, metrics for throttling), debug API Gateway issues (stage logs, execution logs, access logs), diagnose DynamoDB performance problems (CloudWatch metrics for throttling, capacity), and resolve IAM permission errors (CloudTrail for denied actions). Understanding service-specific debugging patterns is critical.

  9. Deep dive into performance optimization - Study Lambda optimization (memory vs CPU relationship, cold start reduction, VPC optimization), caching strategies (where to cache, TTL selection, cache invalidation), DynamoDB optimization (partition key design, GSI vs query patterns, DAX), and API optimization (response compression, request validation). Learn to identify bottlenecks and select appropriate optimization techniques.

  10. Practice log analysis and metric interpretation - Work with real CloudWatch Logs and metrics to diagnose issues. Learn to recognize patterns in logs that indicate specific problems (timeouts, memory errors, permission denials), interpret metric graphs to identify anomalies, and correlate logs with metrics for comprehensive analysis. Round this out with practice exams focused on Domain 4 to identify troubleshooting knowledge gaps.
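The decision-tree idea in tip 5 can be kept as a small lookup table that maps a symptom to its first checks. A minimal Python sketch; the symptom names and checklists here are illustrative, not an official AWS taxonomy:

```python
# Map a symptom to the ordered list of things to check first.
# Entries mirror the decision tree described in the study tip above.
FIRST_CHECKS = {
    "lambda_timeout": ["memory setting", "VPC NAT gateway path", "downstream dependency latency"],
    "throttling": ["service quotas", "reserved concurrency", "burst capacity"],
    "cold_starts": ["provisioned concurrency", "deployment package size"],
    "access_denied": ["IAM identity policy", "resource policy", "trust relationship"],
}

def first_checks(symptom: str) -> list:
    """Return the ordered checklist for a symptom, or an empty list if unknown."""
    return FIRST_CHECKS.get(symptom, [])

print(first_checks("throttling"))
```

Keeping the reference in a structured form like this makes it easy to extend as you break things in the lab and discover new failure modes.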


Task 1: Assist in a root cause analysis

Skills & Corresponding Documentation

Skill 4.1.1: Debug code to identify defects

Why: Code debugging is fundamental for developers and is tested through scenarios presenting code with bugs or unexpected behavior. You must understand how to use CloudWatch Logs to find exceptions, interpret stack traces, use X-Ray to identify slow operations, set breakpoints in local development with SAM CLI, and systematically isolate defects. Exam questions present symptoms and expect you to identify the debugging approach that would reveal the root cause.

AWS Documentation:

Skill 4.1.2: Interpret application metrics, logs, and traces

Why: Metric and log interpretation is heavily tested because understanding observability data is essential for diagnosing issues. You must be able to read CloudWatch metric graphs to identify spikes or drops, analyze log patterns to find errors, interpret X-Ray traces to identify latency sources, and correlate metrics with logs to understand system behavior. Exam questions present observability data (graphs, log excerpts, traces) and expect you to draw correct conclusions about application health and issues.

AWS Documentation:

Skill 4.1.3: Query logs to find relevant data

Why: Log querying is extensively tested because applications generate massive amounts of log data that must be filtered to find relevant information. You must understand CloudWatch Logs Insights query language, how to filter by fields, aggregate data, calculate statistics, parse JSON logs, and extract patterns. Exam questions present troubleshooting scenarios and expect you to write or identify queries that would extract the needed diagnostic information.
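For hands-on practice, a representative CloudWatch Logs Insights query that surfaces the most recent errors in a log group looks like this (the /ERROR/ pattern is an assumption about the application's log format):

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
```

The same building blocks — fields, filter, stats, parse, sort, limit — combine to answer most exam-style questions about extracting diagnostic data from logs.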

AWS Documentation:

Skill 4.1.4: Implement custom metrics (for example, Amazon CloudWatch embedded metric format [EMF])

Why: Custom metrics are tested because standard metrics don’t capture application-specific measurements. You must understand how to publish custom metrics using PutMetricData API, implement embedded metric format in Lambda for high-throughput metrics, define dimensions and namespaces, aggregate metrics, and choose appropriate metric units. Exam questions present scenarios requiring business or application metrics and expect you to implement appropriate custom metric solutions.
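As a concrete illustration of EMF, the sketch below builds one metric record by hand; printing it to stdout from a Lambda function is enough for CloudWatch to extract the metric asynchronously, with no PutMetricData call. The MyApp namespace and OrderValue metric name are assumptions for the example:

```python
import json
import time

def emit_order_metric(order_value: float, service: str = "checkout") -> str:
    """Build a CloudWatch embedded metric format (EMF) record as a JSON line."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",              # assumed namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "OrderValue", "Unit": "None"}],
            }],
        },
        "Service": service,       # dimension value, referenced above
        "OrderValue": order_value,  # metric value, referenced above
    }
    return json.dumps(record)

# In Lambda, this print is the entire publish path.
print(emit_order_metric(42.5))
```

Because extraction happens from the log stream, EMF avoids the latency and API quota cost of synchronous metric publishing, which is why it suits high-throughput Lambda functions.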

AWS Documentation:

Skill 4.1.5: Review application health by using dashboards and insights

Why: Dashboards and insights provide unified views of application health and are tested through scenarios about monitoring and alerting. You must understand CloudWatch Dashboards for visualizing metrics, CloudWatch Contributor Insights for analyzing high-cardinality data, CloudWatch Application Insights for automated problem detection, and how to create meaningful visualizations. Exam questions present monitoring requirements and expect you to configure appropriate dashboard and insight solutions.

AWS Documentation:

Skill 4.1.6: Troubleshoot deployment failures by using service output logs

Why: Deployment troubleshooting is critical because failed deployments block releases and must be diagnosed quickly. You must understand how to read CloudFormation stack events and failure reasons, analyze CodeBuild build logs, interpret CodeDeploy deployment logs, debug Lambda deployment errors, and use service-specific logs to identify root causes. Exam questions present deployment failure scenarios and expect you to identify the diagnostic approach that would reveal the problem.

AWS Documentation:

Skill 4.1.7: Debug service integration issues in applications

Why: Service integration debugging is tested because distributed applications involve multiple AWS services that must interact correctly. You must understand how to use CloudTrail to trace API calls across services, debug IAM permission issues affecting integrations, troubleshoot VPC connectivity problems, analyze service integration errors in X-Ray, and verify service configurations. Exam questions present integration failures and expect you to identify debugging approaches that reveal the root cause.

AWS Documentation:

AWS Service FAQs:

AWS Whitepapers:


Task 2: Instrument code for observability

Skills & Corresponding Documentation

Skill 4.2.1: Describe differences between logging, monitoring, and observability

Why: Understanding observability fundamentals is tested because these concepts guide instrumentation decisions. You must know that logging records events, monitoring tracks metrics over time, and observability enables understanding system state from external outputs. You should understand when each is appropriate, how they complement each other, and that observability requires instrumentation for logs, metrics, and traces. Exam questions present instrumentation scenarios and expect you to identify the appropriate observability approach.

AWS Documentation:

Skill 4.2.2: Implement an effective logging strategy to record application behavior and state

Why: Logging strategy is fundamental for troubleshooting and is tested through scenarios about what and how to log. You must understand log levels (DEBUG, INFO, WARN, ERROR), what information to log (request IDs, user context, errors, state changes), log retention policies, performance impact of logging, and avoiding sensitive data in logs. Exam questions present logging requirements and expect you to implement appropriate logging practices.

AWS Documentation:

Skill 4.2.3: Implement code that emits custom metrics

Why: Custom metric implementation is tested because applications need to expose business and operational metrics. You must understand how to use CloudWatch PutMetricData API in code, implement embedded metric format for Lambda, choose appropriate metric dimensions and namespaces, batch metric publishing for efficiency, and emit metrics asynchronously. Exam questions present code scenarios and expect you to implement or identify correct metric emission code.

AWS Documentation:

Skill 4.2.4: Add annotations for tracing services

Why: X-Ray annotations are tested because they enable filtering and indexing of traces for analysis. You must understand the difference between annotations (indexed, searchable) and metadata (non-indexed, contextual), how to add annotations in code using X-Ray SDK, annotating Lambda segments, and using annotations to filter traces in X-Ray console. Exam questions present tracing requirements and expect you to implement appropriate annotation strategies.

AWS Documentation:

Skill 4.2.5: Implement notification alerts for specific actions (for example, notifications about quota limits or deployment completions)

Why: Alerting is tested because proactive notifications enable rapid response to issues. You must understand CloudWatch Alarms for metric thresholds, alarm states (OK, ALARM, INSUFFICIENT_DATA), SNS for alarm notifications, EventBridge for event-based notifications, and how to configure appropriate alarm thresholds and evaluation periods. Exam questions present alerting requirements and expect you to configure appropriate notification mechanisms.
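As a hedged sketch of a threshold alarm wired to SNS, here is an illustrative CloudFormation fragment; the function name and the AlertTopic resource are placeholders assumed to exist in the same template:

```yaml
ThrottleAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Lambda function is being throttled
    Namespace: AWS/Lambda
    MetricName: Throttles
    Dimensions:
      - Name: FunctionName
        Value: my-function            # placeholder function name
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching    # no throttle data means healthy here
    AlarmActions:
      - !Ref AlertTopic               # assumed SNS topic in this template
```

Note the evaluation choices: a 60-second period with a single evaluation period alerts quickly, and TreatMissingData keeps idle functions from flapping into INSUFFICIENT_DATA-driven noise.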

AWS Documentation:

Skill 4.2.6: Implement tracing by using AWS services and tools

Why: Distributed tracing is extensively tested because microservices architectures require understanding request flows across services. You must understand how to enable X-Ray tracing on Lambda functions, instrument applications with X-Ray SDK, enable API Gateway X-Ray tracing, trace requests through service integrations, and analyze trace maps. Exam questions present distributed system scenarios and expect you to implement comprehensive tracing solutions.

AWS Documentation:

Skill 4.2.7: Implement structured logging for application events and user actions

Why: Structured logging is tested because parseable log formats enable automated analysis and querying. You must understand JSON log format for machine-readable logs, how to include correlation IDs for request tracing, log standard fields (timestamp, level, message, context), and how structured logs integrate with CloudWatch Logs Insights. Exam questions present logging requirements and expect you to implement structured logging patterns.
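A minimal structured-logging sketch with Python's standard logging module is below; real projects often reach for a library such as AWS Lambda Powertools instead of hand-rolling a formatter, and the correlation_id field is an assumed convention:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with standard fields."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": int(time.time() * 1000),
            "level": record.levelname,
            "message": record.getMessage(),
            # correlation_id ties all lines for one request together (assumed field)
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"correlation_id": "req-123"})
```

Because every line is valid JSON, CloudWatch Logs Insights can filter and aggregate on these fields directly, e.g. `filter level = "ERROR"`.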

AWS Documentation:

Skill 4.2.8: Configure application health checks and readiness probes

Why: Health checks are tested because they enable load balancers and orchestration systems to route traffic only to healthy instances. You must understand ALB/NLB target health checks, ECS task health checks, Lambda function health monitoring, how health check failures trigger replacement, and configuring appropriate health check intervals and thresholds. Exam questions present high-availability requirements and expect you to configure appropriate health check mechanisms.

AWS Documentation:

AWS Service FAQs:

AWS Whitepapers:


Task 3: Optimize applications by using AWS services and features

Skills & Corresponding Documentation

Skill 4.3.1: Define concurrency

Why: Concurrency understanding is fundamental for Lambda and application scaling. You must know that concurrency represents simultaneous executions, understand Lambda’s account-level concurrency limit (1000 by default), reserved concurrency for allocation, provisioned concurrency for warm starts, and how concurrency affects throttling. Exam questions present scaling scenarios and expect you to understand concurrency limits and configuration options.
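The relationship between traffic and concurrency follows Little's Law (concurrency ≈ arrival rate × average duration), which this small sketch applies to the default account limit:

```python
def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Estimate steady-state concurrent executions (Little's Law: L = lambda * W)."""
    return requests_per_second * avg_duration_seconds

# 200 req/s at a 2-second average duration needs ~400 concurrent executions:
# under the default 1,000 account-level limit, but close enough to alarm on.
demand = required_concurrency(200, 2.0)
print(demand)  # 400.0
```

This back-of-the-envelope estimate is also how you size reserved concurrency for a single function or decide whether a quota increase is needed before throttling starts.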

AWS Documentation:

Skill 4.3.2: Profile application performance

Why: Performance profiling is tested because optimization requires measuring where time is spent. You must understand how to use X-Ray to identify slow segments, CloudWatch Logs to measure execution time, Lambda Insights for performance metrics, and how to profile code locally. Exam questions present performance issues and expect you to identify profiling approaches that would reveal bottlenecks.

AWS Documentation:

Skill 4.3.3: Determine minimum memory and compute power for an application

Why: Right-sizing is tested because over-provisioning wastes money while under-provisioning causes performance issues. You must understand Lambda’s memory-to-CPU relationship, how to use Lambda Power Tuning to find optimal memory settings, ECS task sizing considerations, and cost-performance tradeoffs. Exam questions present resource allocation scenarios and expect you to determine appropriate resource configurations.

AWS Documentation:

Skill 4.3.4: Use subscription filter policies to optimize messaging

Why: Filter policies are tested because they reduce unnecessary message processing and costs. You must understand SNS subscription filter policies to deliver only relevant messages, EventBridge event patterns for filtering events, SQS message filtering limitations, and how filtering reduces Lambda invocations. Exam questions present messaging scenarios and expect you to implement appropriate filtering to optimize processing.
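SNS evaluates a subscription's filter policy against the message attributes before delivery. The sketch below models only the exact-string-match subset of that behavior to show the semantics; real policies also support numeric ranges, prefix matching, and anything-but operators:

```python
def matches(filter_policy: dict, attributes: dict) -> bool:
    """Simplified SNS filter semantics: every policy key must be present in the
    message attributes, with a value from the policy's allowed list."""
    return all(attributes.get(key) in allowed
               for key, allowed in filter_policy.items())

policy = {"event_type": ["order_placed", "order_cancelled"]}  # illustrative policy

assert matches(policy, {"event_type": "order_placed"})       # delivered
assert not matches(policy, {"event_type": "order_shipped"})  # filtered out
```

Filtering at the topic like this means a subscribed Lambda function is never invoked for irrelevant messages, which is the cost optimization the exam scenarios are after.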

AWS Documentation:

Skill 4.3.5: Cache content based on request headers

Why: Header-based caching is tested because different users or devices may require different cached responses. You must understand CloudFront cache policies that include headers in the cache key, API Gateway stage caching with cache key parameters such as headers and query strings, how to configure Cache-Control headers, and cache TTL settings. Exam questions present caching requirements with user-specific or device-specific content and expect you to configure appropriate header-based caching.
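One way to make device-specific responses cacheable is shown in this sketch of a Lambda proxy-style handler; CloudFront-Is-Mobile-Viewer is a real CloudFront device-detection header, but the handler itself is illustrative:

```python
def handler(event, context=None):
    """Return a Lambda proxy-format response whose cache key varies by device.

    Cache-Control sets the TTL; Vary tells downstream caches that the
    device header is part of the cache key, so mobile and desktop users
    never receive each other's cached page.
    """
    headers = event.get("headers", {})
    is_mobile = headers.get("CloudFront-Is-Mobile-Viewer") == "true"
    body = "mobile page" if is_mobile else "desktop page"
    return {
        "statusCode": 200,
        "headers": {
            "Cache-Control": "public, max-age=300",   # 5-minute TTL
            "Vary": "CloudFront-Is-Mobile-Viewer",
        },
        "body": body,
    }
```

On the CloudFront side, the same header must also be added to the distribution's cache policy for the split to take effect end to end.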

AWS Documentation:

Skill 4.3.6: Implement application-level caching to improve performance

Why: Application caching is heavily tested because it dramatically improves performance and reduces costs. You must understand ElastiCache for Redis and Memcached, DynamoDB Accelerator (DAX) for DynamoDB caching, caching patterns (cache-aside, write-through, lazy loading), cache invalidation strategies, and TTL selection. Exam questions present performance requirements and expect you to implement appropriate caching solutions.

AWS Documentation:

Skill 4.3.7: Optimize application resource usage

Why: Resource optimization is tested because efficient resource use reduces costs and improves performance. You must understand Lambda package size reduction, connection pooling and reuse, minimizing cold starts, optimizing database queries, reducing data transfer, and choosing appropriate instance types. Exam questions present resource utilization issues and expect you to identify optimization strategies.

AWS Documentation:

Skill 4.3.8: Analyze application performance issues

Why: Performance analysis is tested because identifying bottlenecks requires systematic investigation. You must understand how to use X-Ray to find slow operations, analyze CloudWatch metrics to identify resource constraints, use profiling tools to find code hotspots, correlate logs and metrics to understand issues, and interpret performance indicators. Exam questions present performance problems and expect you to identify analysis approaches that would reveal root causes.

AWS Documentation:

Skill 4.3.9: Use application logs to identify performance bottlenecks

Why: Log-based performance analysis is tested because logs contain timing information that reveals bottlenecks. You must understand how to extract duration metrics from logs, identify slow operations by analyzing log timestamps, use CloudWatch Logs Insights to calculate percentiles, find outliers in execution time, and correlate slow operations with system state. Exam questions present performance issues and expect you to use logs to identify bottlenecks.
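Locally, the same percentile analysis that CloudWatch Logs Insights performs with `stats pct(@duration, 95)` can be sketched by parsing Lambda REPORT lines:

```python
import re
import statistics

# Lambda writes one REPORT line per invocation, e.g.
#   "REPORT RequestId: abc Duration: 123.45 ms ..."
REPORT = re.compile(r"Duration: ([\d.]+) ms")

def p95_duration(log_lines: list) -> float:
    """Extract Duration values from REPORT lines and return the 95th percentile."""
    durations = [float(m.group(1))
                 for line in log_lines
                 if (m := REPORT.search(line))]
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(durations, n=20)[18]

lines = ["REPORT RequestId: abc Duration: %d ms" % d for d in range(1, 101)]
print(p95_duration(lines))
```

Percentiles matter here because averages hide tail latency: a healthy mean with a bad p95 or p99 is exactly the pattern exam scenarios expect you to spot.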

AWS Documentation:

AWS Service FAQs:

AWS Whitepapers:


Final Thoughts

Domain 4: Troubleshooting and Optimization is where developers prove their operational expertise and problem-solving abilities. This domain requires both theoretical knowledge and practical diagnostic skills developed through hands-on experience. Master CloudWatch (Logs, Metrics, Alarms, Insights) and X-Ray thoroughly—they’re your primary tools for observability and appear throughout the exam. Practice writing CloudWatch Logs Insights queries, interpreting X-Ray traces, and analyzing metric patterns to diagnose issues. Understanding performance optimization requires experimentation: test different Lambda memory settings, implement caching at various levels, and measure the impact. The troubleshooting methodology you develop—systematic analysis using logs, metrics, and traces to identify root causes—will serve you throughout your AWS career. Don’t just read about observability and optimization; implement comprehensive instrumentation in real applications, create intentional failures to practice debugging, and optimize based on measured performance data. This hands-on approach builds the diagnostic intuition needed for both exam success and production troubleshooting.