AWS Certified Developer Associate (DVA-C02) Domain 4
Troubleshooting and Optimization
Official Exam Guide: Domain 4: Troubleshooting and Optimization
Skill Builder: AWS Developer Associate (DVA-C02) Exam Prep
Note: Some Skill Builder labs require a subscription.
How to Study This Domain Effectively
Study Tips
- Create intentional failures to practice troubleshooting - The best way to master troubleshooting is to break things deliberately. Deploy Lambda functions with errors, misconfigure IAM permissions, create resource bottlenecks, trigger throttling, and then use CloudWatch Logs, X-Ray traces, and metrics to diagnose the issues. Troubleshooting skills developed through hands-on debugging translate directly to exam scenarios where you must identify root causes from log excerpts and metric patterns.
- Build comprehensive observability into a sample application - Create an application that demonstrates all observability patterns: structured logging with JSON, custom CloudWatch metrics with embedded metric format, X-Ray tracing with annotations and metadata, CloudWatch Alarms for anomalies, and dashboards for visualization. Actually implementing these patterns helps you understand not just what they do, but when to use each one—knowledge that’s tested extensively throughout Domain 4.
- Master CloudWatch Logs Insights query language - Practice writing CloudWatch Logs Insights queries to filter, aggregate, and analyze log data. Learn to find errors, calculate latency percentiles, identify top API callers, and trace request flows through distributed systems. The exam tests your ability to write queries that extract meaningful information from logs, so hands-on query practice is essential.
- Experiment with performance optimization techniques - Profile Lambda functions with different memory settings using AWS Lambda Power Tuning, implement caching at multiple levels (API Gateway, CloudFront, ElastiCache, DAX), tune DynamoDB read/write capacity, and measure the impact. Understanding optimization through experimentation builds intuition for identifying performance bottlenecks and selecting appropriate optimization strategies during the exam.
- Create a troubleshooting decision tree - Build a systematic reference for diagnosing common AWS issues: Lambda timeouts (check memory, VPC NAT gateway, downstream dependencies), throttling (check service quotas, reserved concurrency, burst capacity), cold starts (use provisioned concurrency, optimize package size), permission errors (check IAM policies, resource policies, trust relationships). This structured approach helps you quickly navigate troubleshooting scenarios on the exam.
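The decision tree in the last tip can be captured as data, which makes it easy to grow into a personal runbook. A minimal sketch; the symptom names and check lists simply mirror the tip above and are illustrative, not exhaustive:

```python
"""A troubleshooting decision tree as a lookup table: symptom -> ordered
list of things to check. Mirrors the study tip; illustrative only."""

DECISION_TREE = {
    "lambda_timeout": ["memory setting", "VPC NAT gateway", "downstream dependencies"],
    "throttling": ["service quotas", "reserved concurrency", "burst capacity"],
    "cold_starts": ["provisioned concurrency", "package size"],
    "permission_error": ["IAM policies", "resource policies", "trust relationships"],
}

def checks_for(symptom):
    """Return the ordered checklist for a symptom, with a safe default."""
    return DECISION_TREE.get(symptom, ["collect logs, metrics, and traces first"])
```

Extending the table as you debug new failure modes is itself good exam preparation: each entry forces you to name the root causes behind a symptom.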
Recommended Approach
- Start with CloudWatch fundamentals - Begin by deeply understanding CloudWatch Logs (log groups, streams, retention, filtering), CloudWatch Metrics (namespaces, dimensions, statistics, periods), and CloudWatch Alarms (threshold types, composite alarms, anomaly detection). CloudWatch is the foundation for all AWS observability, and mastering it is essential for troubleshooting and optimization questions throughout the exam.
- Master X-Ray for distributed tracing - Study X-Ray concepts (traces, segments, subsegments, annotations, metadata), learn to instrument applications with X-Ray SDK, understand service maps for visualizing dependencies, and practice analyzing traces to identify bottlenecks. X-Ray appears frequently in exam questions about debugging distributed applications and microservices architectures.
- Learn systematic debugging approaches - Study how to troubleshoot Lambda errors (CloudWatch Logs for exceptions, X-Ray for cold starts, metrics for throttling), debug API Gateway issues (stage logs, execution logs, access logs), diagnose DynamoDB performance problems (CloudWatch metrics for throttling, capacity), and resolve IAM permission errors (CloudTrail for denied actions). Understanding service-specific debugging patterns is critical.
- Deep dive into performance optimization - Study Lambda optimization (memory vs CPU relationship, cold start reduction, VPC optimization), caching strategies (where to cache, TTL selection, cache invalidation), DynamoDB optimization (partition key design, GSI vs query patterns, DAX), and API optimization (response compression, request validation). Learn to identify bottlenecks and select appropriate optimization techniques.
- Practice log analysis and metric interpretation - Work with real CloudWatch Logs and metrics to diagnose issues. Learn to recognize patterns in logs that indicate specific problems (timeouts, memory errors, permission denials), interpret metric graphs to identify anomalies, and correlate logs with metrics for comprehensive analysis. Finish with practice exams focused on Domain 4 to identify troubleshooting knowledge gaps.
Task 1: Assist in a root cause analysis
Skills & Corresponding Documentation
Skill 4.1.1: Debug code to identify defects
Why: Code debugging is fundamental for developers and is tested through scenarios presenting code with bugs or unexpected behavior. You must understand how to use CloudWatch Logs to find exceptions, interpret stack traces, use X-Ray to identify slow operations, set breakpoints in local development with SAM CLI, and systematically isolate defects. Exam questions present symptoms and expect you to identify the debugging approach that would reveal the root cause.
AWS Documentation:
- Debugging Lambda Functions
- Lambda Function Errors
- Using CloudWatch Logs for Debugging
- SAM CLI Local Debugging
- Debugging with AWS X-Ray
- Error Handling Best Practices
- Lambda Troubleshooting
Skill 4.1.2: Interpret application metrics, logs, and traces
Why: Metric and log interpretation is heavily tested because understanding observability data is essential for diagnosing issues. You must be able to read CloudWatch metric graphs to identify spikes or drops, analyze log patterns to find errors, interpret X-Ray traces to identify latency sources, and correlate metrics with logs to understand system behavior. Exam questions present observability data (graphs, log excerpts, traces) and expect you to draw correct conclusions about application health and issues.
AWS Documentation:
- CloudWatch Metrics Concepts
- Using Amazon CloudWatch Metrics
- CloudWatch Logs Concepts
- Analyzing Log Data with CloudWatch Logs Insights
- Understanding X-Ray Traces
- X-Ray Service Map
- Reading X-Ray Traces
- Lambda Metrics in CloudWatch
- Interpreting Lambda Metrics
Skill 4.1.3: Query logs to find relevant data
Why: Log querying is extensively tested because applications generate massive amounts of log data that must be filtered to find relevant information. You must understand CloudWatch Logs Insights query language, how to filter by fields, aggregate data, calculate statistics, parse JSON logs, and extract patterns. Exam questions present troubleshooting scenarios and expect you to write or identify queries that would extract the needed diagnostic information.
AWS Documentation:
- CloudWatch Logs Insights Query Syntax
- CloudWatch Logs Insights Sample Queries
- Querying Lambda Logs
- Filtering and Pattern Matching
- CloudWatch Logs Insights Functions
- Working with Log Groups and Streams
- Searching Log Data
Skill 4.1.4: Implement custom metrics (for example, Amazon CloudWatch embedded metric format [EMF])
Why: Custom metrics are tested because standard metrics don’t capture application-specific measurements. You must understand how to publish custom metrics using PutMetricData API, implement embedded metric format in Lambda for high-throughput metrics, define dimensions and namespaces, aggregate metrics, and choose appropriate metric units. Exam questions present scenarios requiring business or application metrics and expect you to implement appropriate custom metric solutions.
AWS Documentation:
- Publishing Custom Metrics
- PutMetricData API
- CloudWatch Embedded Metric Format
- EMF Specification
- Using EMF with Lambda
- Custom Metrics Dimensions
- High-Resolution Metrics
- Metric Math
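The embedded metric format is simpler than it sounds: a Lambda function emits a metric by printing one JSON document to stdout, and CloudWatch extracts it asynchronously, with no PutMetricData call on the request path. A minimal sketch; the namespace and dimension names are illustrative:

```python
"""Sketch of CloudWatch embedded metric format (EMF): one JSON log line
carries both the metric definition and its value."""
import json
import time

def emf_record(namespace, metric_name, value, unit="Milliseconds", **dimensions):
    """Build an EMF document for a single metric value (pure helper)."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],   # one dimension set
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,   # the value lives at the document root
        **dimensions,         # so do the dimension values
    }

# In a Lambda handler, printing the document is the whole integration:
record = emf_record("OrderService", "CheckoutLatency", 87.5, Service="checkout")
print(json.dumps(record))
```

This is why EMF is the recommended approach for high-throughput metrics: it adds no API latency and no throttling risk to the function itself.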
Skill 4.1.5: Review application health by using dashboards and insights
Why: Dashboards and insights provide unified views of application health and are tested through scenarios about monitoring and alerting. You must understand CloudWatch Dashboards for visualizing metrics, CloudWatch Contributor Insights for analyzing high-cardinality data, CloudWatch Application Insights for automated problem detection, and how to create meaningful visualizations. Exam questions present monitoring requirements and expect you to configure appropriate dashboard and insight solutions.
AWS Documentation:
- CloudWatch Dashboards
- Creating CloudWatch Dashboards
- CloudWatch Contributor Insights
- Analyzing VPC Flow Logs with Contributor Insights
- CloudWatch Application Insights
- CloudWatch Automatic Dashboards
- Dashboard Widget Types
- Cross-Account Cross-Region Dashboards
Skill 4.1.6: Troubleshoot deployment failures by using service output logs
Why: Deployment troubleshooting is critical because failed deployments block releases and must be diagnosed quickly. You must understand how to read CloudFormation stack events and failure reasons, analyze CodeBuild build logs, interpret CodeDeploy deployment logs, debug Lambda deployment errors, and use service-specific logs to identify root causes. Exam questions present deployment failure scenarios and expect you to identify the diagnostic approach that would reveal the problem.
AWS Documentation:
- Troubleshooting CloudFormation
- CloudFormation Stack Events
- CodeBuild Build Logs
- Troubleshooting CodeBuild
- CodeDeploy Deployment Logs
- Troubleshooting CodeDeploy
- Lambda Deployment Errors
- SAM Deployment Troubleshooting
- Elastic Beanstalk Deployment Logs
Skill 4.1.7: Debug service integration issues in applications
Why: Service integration debugging is tested because distributed applications involve multiple AWS services that must interact correctly. You must understand how to use CloudTrail to trace API calls across services, debug IAM permission issues affecting integrations, troubleshoot VPC connectivity problems, analyze service integration errors in X-Ray, and verify service configurations. Exam questions present integration failures and expect you to identify debugging approaches that reveal the root cause.
AWS Documentation:
- AWS CloudTrail for API Debugging
- Viewing CloudTrail Events
- Troubleshooting IAM
- Testing IAM Policies with the Policy Simulator
- VPC Reachability Analyzer
- Troubleshooting VPC Connectivity
- X-Ray Service Integration Errors
- Lambda Integration Errors
- API Gateway Integration Debugging
AWS Service FAQs:
- Amazon CloudWatch FAQ
- AWS X-Ray FAQ
- AWS CloudTrail FAQ
- AWS CloudFormation FAQ
- AWS CodeBuild FAQ
- AWS CodeDeploy FAQ
AWS Whitepapers:
- Observability Best Practices
- Debugging Distributed Applications
- Operational Excellence Pillar - AWS Well-Architected Framework
Task 2: Instrument code for observability
Skills & Corresponding Documentation
Skill 4.2.1: Describe differences between logging, monitoring, and observability
Why: Understanding observability fundamentals is tested because these concepts guide instrumentation decisions. You must know that logging records events, monitoring tracks metrics over time, and observability enables understanding system state from external outputs. You should understand when each is appropriate, how they complement each other, and that observability requires instrumentation for logs, metrics, and traces. Exam questions present instrumentation scenarios and expect you to identify the appropriate observability approach.
AWS Documentation:
- Observability on AWS
- What is Observability?
- CloudWatch Overview
- Logging vs Monitoring vs Observability
- Application Observability Best Practices
- Telemetry in Distributed Systems
Skill 4.2.2: Implement an effective logging strategy to record application behavior and state
Why: Logging strategy is fundamental for troubleshooting and is tested through scenarios about what and how to log. You must understand log levels (DEBUG, INFO, WARN, ERROR), what information to log (request IDs, user context, errors, state changes), log retention policies, performance impact of logging, and avoiding sensitive data in logs. Exam questions present logging requirements and expect you to implement appropriate logging practices.
AWS Documentation:
- Lambda Logging Best Practices
- CloudWatch Logs
- Log Retention Settings
- Structured Logging Best Practices
- Lambda Function Logging
- Logging Sensitive Data
- Log Aggregation Patterns
Skill 4.2.3: Implement code that emits custom metrics
Why: Custom metric implementation is tested because applications need to expose business and operational metrics. You must understand how to use CloudWatch PutMetricData API in code, implement embedded metric format for Lambda, choose appropriate metric dimensions and namespaces, batch metric publishing for efficiency, and emit metrics asynchronously. Exam questions present code scenarios and expect you to implement or identify correct metric emission code.
AWS Documentation:
- Publishing Custom Metrics
- Using the AWS SDK to Publish Metrics
- CloudWatch Embedded Metric Format
- EMF Client Libraries
- Metric Units
- Batching Metric Requests
- Custom Metric Best Practices
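For the PutMetricData path, batching several data points into one call is the efficiency pattern the skill statement refers to. A sketch under assumed names (the namespace, metric names, and dimension are illustrative):

```python
"""Sketch: batch custom business metrics for one PutMetricData call."""

def metric_datum(name, value, unit="Count", **dimensions):
    """Build one entry for the MetricData list (pure helper)."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
    }

batch = [
    metric_datum("OrdersPlaced", 3, Environment="prod"),
    metric_datum("CartAbandoned", 1, Environment="prod"),
]

# One API call covers the whole batch (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="OrderService", MetricData=batch)
```

Batching matters because PutMetricData is billed and throttled per request, not per data point.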
Skill 4.2.4: Add annotations for tracing services
Why: X-Ray annotations are tested because they enable filtering and indexing of traces for analysis. You must understand the difference between annotations (indexed, searchable) and metadata (non-indexed, contextual), how to add annotations in code using X-Ray SDK, annotating Lambda segments, and using annotations to filter traces in X-Ray console. Exam questions present tracing requirements and expect you to implement appropriate annotation strategies.
AWS Documentation:
- X-Ray Concepts - Annotations and Metadata
- Instrumenting Code with X-Ray SDK
- Adding Annotations to Segments
- Lambda and X-Ray
- X-Ray SDK for Node.js
- X-Ray SDK for Java
- Filtering Traces with Annotations
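The annotation-versus-metadata distinction can be shown in a few lines. This sketch assumes the aws-xray-sdk package inside an instrumented environment (the SDK calls are shown as comments so the snippet stays self-contained); the small validator reflects the rule that annotations accept only strings, numbers, and booleans:

```python
"""Sketch: X-Ray annotations (indexed, filterable) vs. metadata
(unindexed context). Assumes the aws-xray-sdk package at runtime."""

def is_valid_annotation(value):
    """Annotations may only be strings, numbers, or booleans; anything
    richer (dicts, lists, objects) belongs in metadata instead."""
    return isinstance(value, (str, int, float, bool))

# Inside an instrumented handler (illustrative; names are placeholders):
# from aws_xray_sdk.core import xray_recorder
# subsegment = xray_recorder.begin_subsegment("charge-card")
# subsegment.put_annotation("customer_tier", "premium")  # indexed, searchable
# subsegment.put_metadata("request_payload", {"sku": "A1"})  # context only
# xray_recorder.end_subsegment()
```

In the X-Ray console, a filter expression such as `annotation.customer_tier = "premium"` then selects exactly the traces tagged this way, which is the filtering pattern the exam tests.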
Skill 4.2.5: Implement notification alerts for specific actions (for example, notifications about quota limits or deployment completions)
Why: Alerting is tested because proactive notifications enable rapid response to issues. You must understand CloudWatch Alarms for metric thresholds, alarm states (OK, ALARM, INSUFFICIENT_DATA), SNS for alarm notifications, EventBridge for event-based notifications, and how to configure appropriate alarm thresholds and evaluation periods. Exam questions present alerting requirements and expect you to configure appropriate notification mechanisms.
AWS Documentation:
- Using Amazon CloudWatch Alarms
- Creating CloudWatch Alarms
- CloudWatch Alarm States
- Amazon SNS for Alarm Notifications
- EventBridge for AWS Service Events
- Composite Alarms
- Anomaly Detection Alarms
- Service Quotas Notifications
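As a concrete alarm configuration, here is a sketch that notifies an SNS topic whenever a Lambda function records any throttles in a one-minute period. The function name and topic ARN are placeholders; `TreatMissingData` is set so idle periods do not look unhealthy:

```python
"""Sketch: a CloudWatch alarm on Lambda Throttles wired to SNS."""

def throttle_alarm(function_name, topic_arn):
    """Build the arguments for cloudwatch.put_metric_alarm (pure helper)."""
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                        # evaluate each minute
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # alarm when Sum > 0
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [topic_arn],
    }

# Applying it requires AWS credentials (ARN below is a placeholder):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm(
#     "orders-fn", "arn:aws:sns:us-east-1:123456789012:oncall"))
```

The `TreatMissingData` choice is a common exam detail: for sparse metrics like Throttles, `notBreaching` prevents INSUFFICIENT_DATA flapping.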
Skill 4.2.6: Implement tracing by using AWS services and tools
Why: Distributed tracing is extensively tested because microservices architectures require understanding request flows across services. You must understand how to enable X-Ray tracing on Lambda functions, instrument applications with X-Ray SDK, enable API Gateway X-Ray tracing, trace requests through service integrations, and analyze trace maps. Exam questions present distributed system scenarios and expect you to implement comprehensive tracing solutions.
AWS Documentation:
- AWS X-Ray Developer Guide
- Instrumenting Your Application
- Using X-Ray with Lambda
- Enabling X-Ray Tracing on Lambda
- X-Ray and API Gateway
- X-Ray SDK Configuration
- X-Ray Service Map
- Tracing Header
- X-Ray Sampling Rules
Skill 4.2.7: Implement structured logging for application events and user actions
Why: Structured logging is tested because parseable log formats enable automated analysis and querying. You must understand JSON log format for machine-readable logs, how to include correlation IDs for request tracing, log standard fields (timestamp, level, message, context), and how structured logs integrate with CloudWatch Logs Insights. Exam questions present logging requirements and expect you to implement structured logging patterns.
AWS Documentation:
- Structured Logging
- JSON Logging in Lambda
- Lambda Powertools for Python - Logging
- Querying JSON Logs with CloudWatch Logs Insights
- Correlation IDs for Distributed Tracing
- Log Event Format
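A minimal structured-logging setup can be written with the standard library alone. This sketch emits one JSON object per log line with the standard fields plus an optional correlation id, the shape that CloudWatch Logs Insights can query by field; the logger name and field names are illustrative:

```python
"""Sketch: a JSON log formatter with correlation-id support."""
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Callers attach a correlation id via logger.info(..., extra={...})
        if hasattr(record, "correlation_id"):
            entry["correlation_id"] = record.correlation_id
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"correlation_id": "req-123"})
```

Because every line is valid JSON, a Logs Insights query like `filter correlation_id = "req-123"` can reconstruct a single request's path through the system. In Lambda specifically, Powertools for AWS Lambda provides this pattern out of the box.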
Skill 4.2.8: Configure application health checks and readiness probes
Why: Health checks are tested because they enable load balancers and orchestration systems to route traffic only to healthy instances. You must understand ALB/NLB target health checks, ECS task health checks, Lambda function health monitoring, how health check failures trigger replacement, and configuring appropriate health check intervals and thresholds. Exam questions present high-availability requirements and expect you to configure appropriate health check mechanisms.
AWS Documentation:
- Application Load Balancer Health Checks
- Network Load Balancer Health Checks
- ECS Task Health Checks
- Elastic Beanstalk Health Monitoring
- Lambda Function Health
- API Gateway Health Checks
- Health Check Best Practices
AWS Whitepapers:
- Implementing Logging and Monitoring with CloudWatch
- Observability Best Practices
- Logging and Monitoring for Application Owners
Task 3: Optimize applications by using AWS services and features
Skills & Corresponding Documentation
Skill 4.3.1: Define concurrency
Why: Concurrency understanding is fundamental for Lambda and application scaling. You must know that concurrency represents simultaneous executions, understand Lambda’s account-level concurrency limit (1000 by default), reserved concurrency for allocation, provisioned concurrency for warm starts, and how concurrency affects throttling. Exam questions present scaling scenarios and expect you to understand concurrency limits and configuration options.
AWS Documentation:
- Lambda Concurrency
- Managing Lambda Concurrency
- Reserved Concurrency
- Provisioned Concurrency
- Lambda Scaling
- Concurrency Metrics
- Burst Concurrency
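A useful back-of-the-envelope for this skill is Little's law: required concurrency equals arrival rate times average execution time. The sketch below applies it and shows how reserved concurrency would be set; the function name is a placeholder:

```python
"""Sketch: estimate required Lambda concurrency and reserve it."""

def required_concurrency(requests_per_second, avg_duration_seconds):
    """Little's law: concurrency = arrival rate * average execution time."""
    return requests_per_second * avg_duration_seconds

# 100 req/s at 0.5 s average duration needs ~50 concurrent executions.
needed = required_concurrency(100, 0.5)

# Reserving it caps this function and carves the capacity out of the
# shared account pool (requires AWS credentials):
# import boto3
# boto3.client("lambda").put_function_concurrency(
#     FunctionName="orders-fn",
#     ReservedConcurrentExecutions=50,
# )
```

The same arithmetic explains throttling questions: if traffic pushes required concurrency past the account limit or a function's reserved allocation, the excess invocations are throttled.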
Skill 4.3.2: Profile application performance
Why: Performance profiling is tested because optimization requires measuring where time is spent. You must understand how to use X-Ray to identify slow segments, CloudWatch Logs to measure execution time, Lambda Insights for performance metrics, and how to profile code locally. Exam questions present performance issues and expect you to identify profiling approaches that would reveal bottlenecks.
AWS Documentation:
- Profiling Lambda Functions
- Lambda Performance Optimization
- X-Ray Performance Insights
- Lambda Insights
- CloudWatch Embedded Metrics for Performance
- Performance Monitoring Best Practices
Skill 4.3.3: Determine minimum memory and compute power for an application
Why: Right-sizing is tested because over-provisioning wastes money while under-provisioning causes performance issues. You must understand Lambda’s memory-to-CPU relationship, how to use Lambda Power Tuning to find optimal memory settings, ECS task sizing considerations, and cost-performance tradeoffs. Exam questions present resource allocation scenarios and expect you to determine appropriate resource configurations.
AWS Documentation:
- Lambda Function Memory Configuration
- Lambda Power Tuning Tool
- Optimizing Lambda Performance
- Lambda Memory and CPU
- ECS Task CPU and Memory
- Rightsizing Recommendations
- Compute Optimizer
Skill 4.3.4: Use subscription filter policies to optimize messaging
Why: Filter policies are tested because they reduce unnecessary message processing and costs. You must understand SNS subscription filter policies to deliver only relevant messages, EventBridge event patterns for filtering events, SQS message filtering limitations, and how filtering reduces Lambda invocations. Exam questions present messaging scenarios and expect you to implement appropriate filtering to optimize processing.
AWS Documentation:
- SNS Message Filtering
- SNS Subscription Filter Policies
- EventBridge Event Patterns
- Content Filtering in EventBridge
- Lambda Event Filtering
- SQS Message Attributes
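A subscription filter policy is just a JSON document attached to the subscription. This sketch delivers only high-value order events; the attribute names, threshold, and subscription ARN are illustrative:

```python
"""Sketch: an SNS subscription filter policy matched against message
attributes. Only messages matching ALL keys are delivered."""
import json

FILTER_POLICY = {
    "event_type": ["order_placed"],              # exact string match
    "price_usd": [{"numeric": [">=", 100]}],     # numeric range match
}

# Attaching it requires AWS credentials (ARN below is a placeholder):
# import boto3
# boto3.client("sns").set_subscription_attributes(
#     SubscriptionArn="arn:aws:sns:us-east-1:123456789012:orders:deadbeef",
#     AttributeName="FilterPolicy",
#     AttributeValue=json.dumps(FILTER_POLICY),
# )
```

The optimization point the skill statement makes: filtering happens inside SNS, so non-matching messages never invoke the subscriber and never incur Lambda or SQS cost.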
Skill 4.3.5: Cache content based on request headers
Why: Header-based caching is tested because different users or devices may require different cached responses. You must understand CloudFront cache key policies that include headers, API Gateway caching with stage variables and query strings, how to configure cache-control headers, and cache TTL settings. Exam questions present caching requirements with user-specific or device-specific content and expect you to configure appropriate header-based caching.
AWS Documentation:
- CloudFront Cache Key and Origin Requests
- CloudFront Cache Policies
- Header-Based Caching
- API Gateway Response Caching
- Cache Key and Query String Parameters
- Cache-Control Headers
- Vary Header for Content Negotiation
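Header-based caching in CloudFront comes down to putting the right header into the cache key. This sketch builds a cache policy that varies on CloudFront's device-detection header so mobile and desktop users get separate cached objects; the policy name and TTLs are illustrative:

```python
"""Sketch: a CloudFront cache policy keyed on a device-detection header."""

def mobile_aware_cache_policy(name="mobile-aware-policy"):
    """Build the arguments for cloudfront.create_cache_policy (pure helper)."""
    return {
        "CachePolicyConfig": {
            "Name": name,
            "MinTTL": 0,
            "DefaultTTL": 3600,    # one hour unless origin says otherwise
            "MaxTTL": 86400,
            "ParametersInCacheKeyAndForwardedToOrigin": {
                "EnableAcceptEncodingGzip": True,
                "HeadersConfig": {
                    "HeaderBehavior": "whitelist",
                    "Headers": {
                        "Quantity": 1,
                        "Items": ["CloudFront-Is-Mobile-Viewer"],
                    },
                },
                "CookiesConfig": {"CookieBehavior": "none"},
                "QueryStringsConfig": {"QueryStringBehavior": "none"},
            },
        }
    }

# boto3.client("cloudfront").create_cache_policy(**mobile_aware_cache_policy())
```

The tradeoff to remember: every header added to the cache key multiplies the number of cache variants, which lowers the hit ratio, so whitelist only the headers the response actually varies on.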
Skill 4.3.6: Implement application-level caching to improve performance
Why: Application caching is heavily tested because it dramatically improves performance and reduces costs. You must understand ElastiCache for Redis and Memcached, DynamoDB Accelerator (DAX) for DynamoDB caching, caching patterns (cache-aside, write-through, lazy loading), cache invalidation strategies, and TTL selection. Exam questions present performance requirements and expect you to implement appropriate caching solutions.
AWS Documentation:
- Amazon ElastiCache
- ElastiCache for Redis
- ElastiCache for Memcached
- DynamoDB Accelerator (DAX)
- Caching Strategies
- Cache-Aside Pattern
- Write-Through Caching
- Cache Invalidation
- ElastiCache Best Practices
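The cache-aside (lazy loading) pattern mentioned above fits in a dozen lines. In this sketch a dict with expiry timestamps stands in for Redis/ElastiCache so the code runs anywhere, and `get_user_from_db` is a hypothetical loader standing in for a real database query:

```python
"""Sketch of cache-aside: check the cache, fall back to the source on a
miss, then populate the cache with a TTL. The dict stands in for Redis."""
import time

cache = {}          # key -> (value, expires_at); a real app would use Redis
TTL_SECONDS = 300   # how long entries stay fresh

def get_user_from_db(user_id):
    """Hypothetical database loader (stand-in for a real query)."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit and hit[1] > time.time():                 # hit and not expired
        return hit[0]
    value = get_user_from_db(user_id)                # miss: load from source
    cache[key] = (value, time.time() + TTL_SECONDS)  # lazy population
    return value

first = get_user(42)    # miss: loads from the "database"
second = get_user(42)   # hit: served from the cache
```

With Redis the same logic uses `GET`, then `SETEX` with the TTL; the pattern's weakness, serving stale data until expiry, is why the exam pairs it with cache invalidation questions.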
Skill 4.3.7: Optimize application resource usage
Why: Resource optimization is tested because efficient resource use reduces costs and improves performance. You must understand Lambda package size reduction, connection pooling and reuse, minimizing cold starts, optimizing database queries, reducing data transfer, and choosing appropriate instance types. Exam questions present resource utilization issues and expect you to identify optimization strategies.
AWS Documentation:
- Lambda Best Practices
- Optimizing Lambda Package Size
- Connection Management in Lambda
- Reducing Lambda Cold Starts
- DynamoDB Best Practices
- Optimizing DynamoDB Performance
- RDS Performance Best Practices
- Cost Optimization Pillar
Skill 4.3.8: Analyze application performance issues
Why: Performance analysis is tested because identifying bottlenecks requires systematic investigation. You must understand how to use X-Ray to find slow operations, analyze CloudWatch metrics to identify resource constraints, use profiling tools to find code hotspots, correlate logs and metrics to understand issues, and interpret performance indicators. Exam questions present performance problems and expect you to identify analysis approaches that would reveal root causes.
AWS Documentation:
- Lambda Performance Tuning
- Using X-Ray for Performance Analysis
- CloudWatch Metrics Analysis
- Lambda Insights for Performance Monitoring
- DynamoDB Performance Metrics
- API Gateway Metrics
- Performance Efficiency Pillar
Skill 4.3.9: Use application logs to identify performance bottlenecks
Why: Log-based performance analysis is tested because logs contain timing information that reveals bottlenecks. You must understand how to extract duration metrics from logs, identify slow operations by analyzing log timestamps, use CloudWatch Logs Insights to calculate percentiles, find outliers in execution time, and correlate slow operations with system state. Exam questions present performance issues and expect you to use logs to identify bottlenecks.
AWS Documentation:
- Analyzing Lambda Performance with Logs
- CloudWatch Logs Insights for Performance Analysis
- Querying Lambda Execution Duration
- Performance Metrics from Logs
- Log-Based Metrics
- Percentile Statistics
- Lambda Duration Reporting
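Tying this skill together, here is the kind of Logs Insights query it describes: latency percentiles computed per 5-minute bucket from Lambda REPORT lines, a direct way to spot bottleneck windows from logs alone. The query is standard Logs Insights syntax; wrapping it in Python just keeps it alongside the earlier examples:

```python
"""Sketch: a Logs Insights query for latency percentiles over time."""

DURATION_PERCENTILES_QUERY = """
filter @type = "REPORT"
| stats avg(@duration), pct(@duration, 95), pct(@duration, 99), max(@duration)
  by bin(5m)
"""

# Paste into the Logs Insights console against the function's log group,
# or run it programmatically with logs.start_query as with any query.
```

Comparing p95 or p99 against the average is the key habit: a healthy average with a climbing p99 is the classic signature of a tail-latency bottleneck.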
AWS Service FAQs:
- Amazon CloudWatch FAQ
- AWS Lambda FAQ
- Amazon ElastiCache FAQ
- DynamoDB Accelerator (DAX) FAQ
- Amazon CloudFront FAQ
- AWS X-Ray FAQ
AWS Whitepapers:
- Performance Efficiency Pillar - AWS Well-Architected Framework
- Database Caching Strategies Using Redis
- Cost Optimization Pillar - AWS Well-Architected Framework
- Lambda Operator Guide
Final Thoughts
Domain 4: Troubleshooting and Optimization is where developers prove their operational expertise and problem-solving abilities. This domain requires both theoretical knowledge and practical diagnostic skills developed through hands-on experience. Master CloudWatch (Logs, Metrics, Alarms, Insights) and X-Ray thoroughly—they’re your primary tools for observability and appear throughout the exam. Practice writing CloudWatch Logs Insights queries, interpreting X-Ray traces, and analyzing metric patterns to diagnose issues. Understanding performance optimization requires experimentation: test different Lambda memory settings, implement caching at various levels, and measure the impact. The troubleshooting methodology you develop—systematic analysis using logs, metrics, and traces to identify root causes—will serve you throughout your AWS career. Don’t just read about observability and optimization; implement comprehensive instrumentation in real applications, create intentional failures to practice debugging, and optimize based on measured performance data. This hands-on approach builds the diagnostic intuition needed for both exam success and production troubleshooting.