AWS Certified Machine Learning Engineer - Associate (MLA-C01) Domain 4
ML Solution Monitoring, Maintenance, and Security
Official Exam Guide: Domain 4: ML Solution Monitoring, Maintenance, and Security
Skill Builder: AWS Certified Machine Learning Engineer - Associate (MLA-C01) Exam Prep
Note: Some Skill Builder labs require a subscription.
How to Study This Domain Effectively
Study Tips
- Master drift detection and understand all drift types - Model drift degrades performance over time and is heavily tested. Data drift (input distribution changes - feature means, ranges, correlations change over time) requires comparing production data against training baseline using statistical tests (Kolmogorov-Smirnov, Chi-squared). Concept drift (relationship between features and target changes - customer behavior shifts, seasonal patterns change) requires monitoring prediction accuracy and comparing to baseline. Prediction drift (output distribution changes) indicates model behavior changes. SageMaker Model Monitor detects all drift types automatically by comparing production data against baseline constraints. The exam tests identifying drift types from symptoms, configuring monitoring schedules and thresholds, interpreting drift reports, and implementing automated responses (alerting, triggering retraining).
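The statistical comparison described above can be sketched with a two-sample Kolmogorov-Smirnov test on a single numerical feature. This is a minimal illustration (the feature values here are synthetic, and the 0.01 significance threshold is an assumption you would tune per feature):

```python
# Sketch: detecting data drift on one numerical feature with a two-sample
# Kolmogorov-Smirnov test - the kind of check Model Monitor automates.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)    # training-time feature values
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean-shifted production values

statistic, p_value = ks_2samp(baseline, production)

# A small p-value rejects "same distribution" -> flag drift on this feature.
drift_detected = p_value < 0.01
print(f"KS statistic={statistic:.3f}, p={p_value:.2e}, drift={drift_detected}")
```

In production you would run this per feature against the stored training baseline, which is exactly what a Model Monitor data-quality job does on a schedule.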
- Understand comprehensive cost optimization strategies for ML - ML infrastructure costs can be significant and optimization is increasingly tested. Compute optimization: use Spot instances for training (up to 90% savings), SageMaker Savings Plans for committed usage (up to 64% savings), inference-optimized instances (ml.inf1) for neural networks. Storage optimization: S3 Intelligent-Tiering for training data, lifecycle policies to Glacier for old data. Endpoint optimization: serverless for intermittent traffic, multi-model endpoints to consolidate infrastructure, auto-scaling to match capacity with demand. The exam tests calculating costs for different scenarios, recommending optimizations based on usage patterns, using Cost Explorer to identify spending, and implementing tagging strategies for cost allocation.
- Learn IAM for ML security comprehensively - IAM controls access to ML resources and is fundamental to security questions. Understand roles (SageMaker execution role for training jobs/endpoints, cross-account roles for shared resources), policies (identity-based for users, resource-based for S3 buckets/endpoints), and least privilege (granting minimum permissions needed). The exam tests scenarios requiring you to troubleshoot access denied errors (missing IAM permissions, overly restrictive policies), configure secure multi-account access (using roles, not long-term credentials), implement least privilege (specific actions and resources, not wildcards), and understand SageMaker Role Manager for pre-configured roles.
- Practice with CloudWatch monitoring and troubleshooting - CloudWatch provides comprehensive monitoring for ML systems. Metrics track endpoint performance (Invocations, ModelLatency, Invocation4XXErrors), training job progress (train:loss, validation:accuracy), and infrastructure utilization (CPUUtilization, MemoryUtilization). Logs capture detailed execution information (training logs, endpoint logs, Lambda logs). Alarms trigger notifications when metrics breach thresholds. The exam tests configuring meaningful alarms (ModelLatency alarm triggers auto-scaling, 4XX error spike alerts on-call engineer), creating dashboards visualizing key metrics, querying logs with CloudWatch Logs Insights to troubleshoot issues, and using metrics to diagnose performance problems.
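The ModelLatency alarm described above can be sketched as a `put_metric_alarm` request. The endpoint name, threshold, and SNS topic ARN below are illustrative placeholders; note that SageMaker reports ModelLatency in microseconds:

```python
# Sketch: a CloudWatch alarm that fires when average endpoint latency stays
# above 500 ms for two consecutive 5-minute periods. Names/ARNs are placeholders.
alarm_params = {
    "AlarmName": "my-endpoint-high-latency",
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelLatency",            # reported in microseconds
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "Statistic": "Average",
    "Period": 300,                           # 5-minute evaluation window
    "EvaluationPeriods": 2,
    "Threshold": 500_000,                    # 500 ms expressed in microseconds
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder ARN
}

# With credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

The same alarm shape, pointed at Invocation4XXErrors with a Sum statistic, covers the "4XX error spike" scenario.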
- Understand A/B testing for model validation thoroughly - A/B testing validates new models with production traffic before full rollout. Configure production variants on endpoint with traffic distribution (90% to production model variant A, 10% to new model variant B). Monitor variant-specific metrics (accuracy, latency, error rate). Statistical analysis determines if variant B significantly outperforms A. Gradually shift traffic if B is better. The exam tests designing A/B tests (selecting metrics, determining sample size, setting statistical significance thresholds), implementing multi-variant endpoints in SageMaker, interpreting A/B test results, and understanding when to use A/B testing versus shadow mode (A/B affects production, shadow doesn’t).
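The 90/10 split described above maps to an endpoint config with two production variants. This is a sketch; model names, the config name, and instance types are placeholders, and variant weights are relative (SageMaker routes traffic proportionally to weight):

```python
# Sketch: endpoint configuration for a 90/10 A/B test across two model variants.
endpoint_config = {
    "EndpointConfigName": "churn-ab-test",            # hypothetical name
    "ProductionVariants": [
        {
            "VariantName": "variant-a-current",
            "ModelName": "churn-model-v1",            # placeholder model
            "InitialInstanceCount": 2,
            "InstanceType": "ml.m5.large",
            "InitialVariantWeight": 9.0,              # weights are relative: 9/(9+1) = 90%
        },
        {
            "VariantName": "variant-b-candidate",
            "ModelName": "churn-model-v2",            # placeholder model
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.large",
            "InitialVariantWeight": 1.0,              # 10% of traffic
        },
    ],
}

# With credentials configured:
# import boto3
# boto3.client("sagemaker").create_endpoint_config(**endpoint_config)

total = sum(v["InitialVariantWeight"] for v in endpoint_config["ProductionVariants"])
share_b = endpoint_config["ProductionVariants"][1]["InitialVariantWeight"] / total
print(f"variant B receives {share_b:.0%} of traffic")
```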
Recommended Approach
- Start with SageMaker Model Monitor fundamentals - Model Monitor continuously monitors production models for data quality, model quality, bias drift, and feature attribution drift. Learn the workflow: create baseline from training data (statistics and constraints), schedule monitoring jobs (hourly, daily), monitor produces violations report, configure alerts for violations, investigate and remediate issues. Practice enabling monitoring on endpoints, interpreting violation reports (which features drifted, by how much), and understanding when to retrain (significant drift detected, accuracy degradation). Model Monitor is central to production ML operations and is heavily tested.
- Deep dive into cost analysis and optimization - Master AWS cost tools: Cost Explorer for analyzing historical spending (filter by service, tag, time period, visualize trends), AWS Budgets for proactive cost control (alerts when spending exceeds thresholds), Trusted Advisor for optimization recommendations (idle resources, right-sizing). Practice analyzing ML workload costs (training represents 40% of spend, inference 50%, storage 10%), identifying optimization opportunities (move to Reserved Instances, use Spot for training, enable auto-scaling), and implementing cost allocation tags (project, environment, team) enabling departmental charge-back.
- Master security best practices systematically - Study IAM comprehensively: roles for services (SageMaker execution role with specific permissions), policies for users (data scientists need SageMaker permissions, not EC2), resource policies (S3 bucket policy allowing SageMaker access). Learn network security: VPC isolates resources, security groups control traffic, VPC endpoints enable private connectivity to AWS services. Understand encryption: data at rest (AWS Key Management Service (KMS) for S3, EBS), data in transit (Transport Layer Security (TLS) for API calls), encrypting training data and model artifacts. Practice implementing defense-in-depth (IAM + VPC + encryption + monitoring).
- Implement comprehensive monitoring dashboards - Create CloudWatch dashboards visualizing key operational metrics: endpoint metrics (invocations per minute, average latency, error rates), training metrics (loss curves, validation accuracy), infrastructure metrics (CPU/memory utilization, auto-scaling activity), and cost metrics (daily spending by service). Practice using CloudWatch Logs Insights to query logs (find errors, analyze patterns), creating alarms for critical issues (endpoint down, 4XX errors spike, cost exceeds budget), and integrating with notification systems (SNS, PagerDuty). Effective dashboards enable rapid incident detection and troubleshooting.
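A minimal CloudWatch dashboard body for the endpoint metrics above can be sketched as JSON (the endpoint name, region, and dashboard name are placeholders):

```python
# Sketch: a two-widget dashboard body - invocation volume next to average latency.
import json

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Invocations",
                "metrics": [["AWS/SageMaker", "Invocations",
                             "EndpointName", "my-endpoint",
                             "VariantName", "AllTraffic"]],
                "stat": "Sum", "period": 60, "region": "us-east-1",
            },
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "ModelLatency (avg, microseconds)",
                "metrics": [["AWS/SageMaker", "ModelLatency",
                             "EndpointName", "my-endpoint",
                             "VariantName", "AllTraffic"]],
                "stat": "Average", "period": 60, "region": "us-east-1",
            },
        },
    ],
}

# With credentials configured:
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ml-operations", DashboardBody=json.dumps(dashboard_body))
```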
- Study complete incident response workflows - Understand how monitoring, alerting, and remediation integrate: CloudWatch alarm fires → SNS notification → Lambda automatically attempts remediation or alerts engineer → engineer investigates using logs and metrics → implements fix → verifies resolution with monitoring. Practice designing resilient systems (health checks, automatic retries, graceful degradation), implementing automated responses to common issues (scale up on high latency, restart on repeated errors), and conducting post-incident reviews (root cause analysis, prevention strategies). Production ML requires systematic operational practices.
Task 4.1: Monitor model inference
Knowledge Areas & AWS Documentation Reading List
1. Drift in ML models
Why: Drift causes models to degrade over time and detecting it proactively prevents poor predictions affecting business. Data drift (input distribution changes) occurs when production data characteristics differ from training data - feature means shift, new categorical values appear, correlations change. Common causes: seasonal changes (retail sales patterns differ by season), behavior changes (user preferences evolve), external factors (economic conditions change). Concept drift (relationship between features and target changes) occurs when the underlying patterns change - what predicts churn this quarter may not predict churn next quarter. Prediction drift (output distribution changes) indicates model behavior changes without necessarily affecting accuracy. The exam tests identifying drift types from symptoms, understanding that drift requires retraining, and implementing monitoring strategies detecting drift before significant accuracy degradation.
AWS Documentation:
- SageMaker Model Monitor
- Monitor Data Quality
- Monitor Model Quality
- Detecting Data Drift
- Understanding Drift
2. Techniques to monitor data quality and model performance
Why: Systematic monitoring enables proactive issue detection before business impact. Data quality monitoring checks: completeness (missing values), validity (values in expected ranges), consistency (referential integrity), timeliness (data freshness). Model performance monitoring tracks: prediction accuracy (comparing predictions to ground truth when available), prediction distribution (detecting unexpected output patterns), feature attribution (SHAP values for explainability), prediction confidence. The exam tests selecting appropriate monitoring techniques for specific scenarios (classification model → track precision/recall, regression model → track RMSE, time series → track forecast accuracy), configuring monitoring frequency (real-time critical systems require continuous monitoring, batch systems can use daily monitoring), and interpreting monitoring reports to identify issues.
AWS Documentation:
3. Design principles for ML lenses relevant to monitoring
Why: Well-Architected Framework ML Lens provides design principles ensuring production readiness. Monitoring principles include: establish baselines (compare current performance against baseline), automate detection (use Model Monitor, don’t rely on manual checks), define clear metrics (accuracy, latency, throughput), implement graduated alarms (warning thresholds before critical thresholds), enable rapid troubleshooting (comprehensive logging, distributed tracing), and automate remediation where possible (auto-scaling, automatic failover). The exam tests understanding of monitoring design patterns (proactive versus reactive), implementing comprehensive observability (metrics, logs, traces), and balancing monitoring coverage with cost (monitoring everything is expensive, focus on critical metrics).
AWS Documentation:
Skills & Corresponding Documentation
Monitoring models in production (for example, by using Amazon SageMaker Model Monitor)
Why: SageMaker Model Monitor provides automated monitoring for production models. Monitoring types: data quality (baseline statistics and constraints on features), model quality (prediction accuracy compared to ground truth), bias drift (fairness metrics over time), feature attribution drift (SHAP values changes). Implementation workflow: enable data capture on endpoint (logs inputs/predictions to S3), create baseline from training data (statistics like mean, standard deviation, constraints like min/max), schedule monitoring job (hourly, daily, custom), violations trigger CloudWatch alarms, investigate violations in reports. The exam tests implementing Model Monitor end-to-end, configuring appropriate baselines, interpreting violation reports, and integrating monitoring with alerting and retraining workflows.
AWS Documentation:
- Amazon SageMaker Model Monitor
- Enable Data Capture
- Create Baseline
- Schedule Monitoring Jobs
- Interpret Results
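The baseline-then-schedule workflow above can be sketched with the SageMaker Python SDK. The S3 URIs, endpoint name, and schedule name below are placeholders, and the SDK calls are shown as comments since they require an AWS session and execution role:

```python
# Sketch of the Model Monitor workflow (method names from the SageMaker Python
# SDK; URIs and names are placeholders).
baseline_cfg = {
    "baseline_dataset": "s3://my-bucket/train/train.csv",  # training data as baseline
    "output_s3_uri": "s3://my-bucket/monitor/baseline",
    "schedule": "cron(0 * ? * * *)",                       # hourly monitoring
}

# from sagemaker.model_monitor import DefaultModelMonitor
# from sagemaker.model_monitor.dataset_format import DatasetFormat
#
# monitor = DefaultModelMonitor(role=execution_role, instance_count=1,
#                               instance_type="ml.m5.xlarge")
# # 1) Baseline: computes statistics.json and constraints.json from training data
# monitor.suggest_baseline(baseline_dataset=baseline_cfg["baseline_dataset"],
#                          dataset_format=DatasetFormat.csv(header=True),
#                          output_s3_uri=baseline_cfg["output_s3_uri"])
# # 2) Schedule: compare captured endpoint traffic against the baseline hourly
# monitor.create_monitoring_schedule(
#     monitor_schedule_name="churn-endpoint-monitor",
#     endpoint_input="churn-endpoint",
#     output_s3_uri="s3://my-bucket/monitor/reports",
#     statistics=monitor.baseline_statistics(),
#     constraints=monitor.suggested_constraints(),
#     schedule_cron_expression=baseline_cfg["schedule"])
```

Violations appear in the reports output location and as CloudWatch metrics you can alarm on.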
Monitoring workflows to detect anomalies or errors in data processing or model inference
Why: Workflow monitoring detects issues in ML pipelines preventing downstream failures. Monitoring points include: data ingestion (completeness, timeliness, format), feature engineering (transformation errors, null values), model training (convergence issues, accuracy thresholds), deployment (endpoint health, API errors), inference (latency, throughput, error rates). The exam tests designing comprehensive monitoring (covering entire pipeline, not just model endpoint), implementing anomaly detection (statistical methods, thresholds, machine learning-based), configuring alerts for different severity levels (critical issues page on-call, warnings create tickets), and troubleshooting pipeline failures using logs and metrics.
AWS Documentation:
- Monitor SageMaker Pipelines
- CloudWatch Metrics for SageMaker
- CloudWatch Logs for SageMaker
- EventBridge for ML Workflows
Detecting changes in the distribution of data that can affect model performance (for example, by using SageMaker Clarify)
Why: Distribution changes require model retraining before accuracy degrades significantly. SageMaker Clarify detects distribution drift by comparing production data against baseline using statistical tests. For numerical features: Kolmogorov-Smirnov test, Jensen-Shannon divergence. For categorical features: Chi-squared test, L-infinity distance. Drift reports show which features drifted, by how much, and statistical significance. The exam tests configuring Clarify for drift detection, interpreting drift reports (understanding test statistics and p-values), setting appropriate thresholds (too sensitive triggers false alarms, too lenient misses real drift), and implementing automated responses (alerting data science team, triggering retraining pipeline).
AWS Documentation:
- Monitor Bias Drift with SageMaker Clarify
- Monitor Feature Attribution Drift
- Detect Data Drift
- Statistical Tests for Drift
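For categorical features, the Chi-squared comparison above can be sketched as a goodness-of-fit test of production category counts against baseline proportions (the counts and proportions here are synthetic, and the 0.01 threshold is an assumption):

```python
# Sketch: chi-squared goodness-of-fit check for drift in one categorical feature.
from scipy.stats import chisquare

baseline_props = [0.5, 0.3, 0.2]        # category shares observed at training time
production_counts = [420, 350, 230]     # counts observed in a production window
n = sum(production_counts)
expected_counts = [p * n for p in baseline_props]

stat, p_value = chisquare(f_obs=production_counts, f_exp=expected_counts)
drift_detected = p_value < 0.01         # small p-value -> distribution shifted
print(f"chi2={stat:.2f}, p={p_value:.2e}, drift={drift_detected}")
```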
Monitoring model performance in production by using A/B testing
Why: A/B testing validates model improvements with real production traffic before full rollout. Implementation: create endpoint with multiple production variants (variant A = current model, variant B = new model), configure traffic distribution (90% A, 10% B), monitor variant-specific metrics (accuracy, latency, error rate), statistical analysis determines if B significantly outperforms A, gradually shift traffic to B if successful. The exam tests designing A/B tests (selecting evaluation metrics, determining sample size for statistical power, setting significance thresholds), implementing multi-variant endpoints, analyzing results (understanding statistical significance, confidence intervals), and rollout strategies (immediate switch if clearly better, gradual shift for risk mitigation).
AWS Documentation:
- A/B Testing with Production Variants
- Create Multi-Variant Endpoint
- Monitor Production Variants
- Traffic Distribution for Variants
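The gradual traffic shift above can be done in place, without redeploying, by updating variant weights. Endpoint and variant names below are placeholders:

```python
# Sketch: moving an A/B test from 90/10 to 50/50 once variant B looks promising.
shift_params = {
    "EndpointName": "churn-endpoint",                                 # placeholder
    "DesiredWeightsAndCapacities": [
        {"VariantName": "variant-a-current", "DesiredWeight": 5.0},   # was 9
        {"VariantName": "variant-b-candidate", "DesiredWeight": 5.0}, # was 1
    ],
}

# With credentials configured:
# import boto3
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**shift_params)
```

Repeating this with increasing weight for the winner implements the gradual rollout; setting the loser's weight to 0 completes the switch.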
Task 4.2: Monitor and optimize infrastructure and costs
Knowledge Areas & AWS Documentation Reading List
1. Key performance metrics for ML infrastructure (for example, utilization, throughput, availability, scalability, fault tolerance)
Why: Infrastructure metrics indicate health and efficiency of ML systems. Utilization (CPU, GPU, memory usage) indicates resource efficiency - high utilization maximizes cost efficiency, but too high risks performance degradation. Throughput (requests per second, predictions per minute) measures capacity. Availability (uptime percentage) measures reliability - production systems require 99.9%+ availability. Scalability (ability to handle increased load) enables handling traffic growth. Fault tolerance (continued operation despite failures) requires health checks, automatic failover, redundancy. The exam tests selecting appropriate metrics for monitoring objectives, setting thresholds balancing efficiency and reliability, troubleshooting issues from metric patterns, and understanding tradeoffs (high utilization reduces cost but increases latency).
AWS Documentation:
- SageMaker CloudWatch Metrics
- Endpoint Invocation Metrics
- Training Job Metrics
- Infrastructure Monitoring Best Practices
2. Monitoring and observability tools to troubleshoot latency and performance issues (for example, AWS X-Ray, Amazon CloudWatch Lambda Insights, Amazon CloudWatch Logs Insights)
Why: Troubleshooting requires visibility into system behavior. AWS X-Ray provides distributed tracing showing request flow across services (Lambda → API Gateway → SageMaker endpoint), identifying bottlenecks. CloudWatch Lambda Insights provides enhanced metrics for Lambda functions (cold start frequency, memory usage patterns). CloudWatch Logs Insights enables querying logs with SQL-like syntax (finding errors, analyzing patterns, extracting statistics). The exam tests selecting appropriate tools for troubleshooting scenarios (slow endpoint → use X-Ray to trace requests, Lambda timeout → use Lambda Insights for memory analysis, application errors → use Logs Insights to query error logs), implementing comprehensive observability, and analyzing traces/logs to diagnose root causes.
AWS Documentation:
- AWS X-Ray
- Tracing ML Workloads with X-Ray
- CloudWatch Lambda Insights
- CloudWatch Logs Insights
- Query Syntax for Logs Insights
3. How to use AWS CloudTrail to log, monitor, and invoke re-training activities
Why: CloudTrail provides audit trail of AWS API calls enabling compliance, security analysis, and automation. Every API call is logged (who made call, when, from where, what parameters, what response). For ML: track who created/updated/deleted endpoints, who initiated training jobs, who accessed model artifacts, who modified IAM policies. CloudTrail integrates with EventBridge to trigger automation (new training job started → notify team, model deployed → log for audit). The exam tests understanding CloudTrail use cases (compliance auditing, security investigation, operational troubleshooting), configuring trails (organization trail logs all accounts, management events versus data events), querying CloudTrail logs (using Athena for analysis), and triggering automation from CloudTrail events.
AWS Documentation:
- AWS CloudTrail User Guide
- Logging SageMaker API Calls with CloudTrail
- CloudTrail Event History
- Creating a Trail
- Querying CloudTrail Logs
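Creating and starting a trail can be sketched as two API calls (the trail and bucket names are placeholders; the bucket needs a policy allowing CloudTrail to write to it):

```python
# Sketch: a multi-Region trail with log file integrity validation enabled.
trail_params = {
    "Name": "ml-audit-trail",               # placeholder
    "S3BucketName": "my-cloudtrail-logs",   # placeholder; must grant CloudTrail write access
    "IsMultiRegionTrail": True,
    "EnableLogFileValidation": True,        # lets you detect tampering with log files
}

# With credentials configured:
# import boto3
# ct = boto3.client("cloudtrail")
# ct.create_trail(**trail_params)
# ct.start_logging(Name=trail_params["Name"])  # a trail records nothing until started
```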
4. Differences between instance types and how they affect performance (for example, memory optimized, compute optimized, general purpose, inference optimized)
Why: Instance type selection significantly impacts performance and cost. General purpose (ml.m5) provides balanced compute/memory for most workloads. Compute optimized (ml.c5) has higher CPU ratio for compute-intensive inference. Memory optimized (ml.r5) has higher memory for large models or batch sizes. Inference optimized (ml.inf1 with AWS Inferentia, ml.g5 with NVIDIA GPUs) provides best cost-performance for neural networks. The exam tests selecting instance types based on model characteristics (large neural network → memory optimized or inference optimized, tree-based model → compute optimized, simple model → general purpose), understanding cost-performance tradeoffs (inference-optimized costs less per prediction but requires model compilation), and right-sizing (starting with smaller instance, scaling up if needed).
AWS Documentation:
- SageMaker Instance Types
- Choosing Instance Types for Inference
- Inference Recommendations
- Instance Type Performance Comparison
- AWS Inferentia
5. Capabilities of cost analysis tools (for example, AWS Cost Explorer, AWS Billing and Cost Management, AWS Trusted Advisor)
Why: Cost analysis tools enable understanding and optimizing ML spending. Cost Explorer visualizes spending by service, region, tag, time period, identifies trends, forecasts future costs. Billing and Cost Management provides detailed cost reports, budget creation, anomaly detection. Trusted Advisor recommends optimizations (idle resources, right-sizing opportunities, Reserved Instance purchases). The exam tests using Cost Explorer for analysis (identifying that training represents 60% of ML costs, suggesting Spot instances), creating budgets with alerts (notify when monthly spend exceeds $10K), interpreting Trusted Advisor recommendations (suggesting right-sizing over-provisioned endpoints), and implementing cost allocation tags enabling departmental charge-back.
AWS Documentation:
- AWS Cost Explorer
- Analyzing Costs with Cost Explorer
- AWS Budgets
- AWS Trusted Advisor
- Cost Optimization Recommendations
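A Cost Explorer query like the analysis above can be sketched as a `get_cost_and_usage` request grouped by a cost allocation tag (dates and the `Project` tag key are placeholders; the tag must be activated in the Billing console before it appears in results):

```python
# Sketch: last month's SageMaker spend, broken down by Project tag.
cost_query = {
    "TimePeriod": {"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    "Granularity": "MONTHLY",
    "Metrics": ["UnblendedCost"],
    "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon SageMaker"]}},
    "GroupBy": [{"Type": "TAG", "Key": "Project"}],              # assumed tag key
}

# With credentials configured:
# import boto3
# response = boto3.client("ce").get_cost_and_usage(**cost_query)
```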
6. Cost tracking and allocation techniques (for example, resource tagging)
Why: Cost allocation enables understanding spending by project, team, environment, and implementing charge-back. Tagging strategy: define consistent tags (Project, Environment, Owner, CostCenter), apply tags to all resources (endpoints, training jobs, S3 buckets), activate tags in Cost Explorer for filtering. Cost allocation reports break down spending by tags. The exam tests designing tagging strategies (required tags, tag naming conventions, automated enforcement), implementing tags on ML resources (applying tags during resource creation, bulk tagging existing resources), using tags for cost analysis (identifying that Project-X represents 40% of ML spending), and enforcing tagging (using AWS Config rules to detect untagged resources, Service Control Policies requiring tags).
AWS Documentation:
- Tagging AWS Resources
- Cost Allocation Tags
- Tagging SageMaker Resources
- Tag Policies
- Cost Tracking Best Practices
Skills & Corresponding Documentation
Configuring and using tools to troubleshoot and analyze resources (for example, CloudWatch Logs, CloudWatch alarms)
Why: Effective troubleshooting requires systematic approach using appropriate tools. CloudWatch Logs capture detailed execution information (training logs, endpoint logs, Lambda logs). Logs Insights queries logs to find patterns (error frequency, slow requests, specific user behavior). CloudWatch alarms notify when metrics breach thresholds (endpoint latency >500ms, 4XX errors >1%). The exam tests implementing logging comprehensively (structured logs with consistent formats, appropriate log levels, avoiding sensitive data in logs), creating queries for common troubleshooting scenarios (finding errors in last hour, calculating p99 latency), configuring actionable alarms (thresholds based on business requirements, alarm actions appropriate to severity), and using logs to diagnose root causes.
AWS Documentation:
- CloudWatch Logs
- Analyzing Log Data with CloudWatch Logs Insights
- CloudWatch Alarms
- Using CloudWatch Alarms
- Troubleshooting with CloudWatch
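The "find errors in the last hour" query above can be sketched with Logs Insights query syntax and a `start_query` request (the log group name is a placeholder):

```python
# Sketch: querying an endpoint's log group for recent ERROR lines.
import time

query = r"""
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
"""

query_params = {
    "logGroupName": "/aws/sagemaker/Endpoints/my-endpoint",  # placeholder endpoint
    "startTime": int(time.time()) - 3600,                    # last hour
    "endTime": int(time.time()),
    "queryString": query,
}

# With credentials configured:
# import boto3
# logs = boto3.client("logs")
# query_id = logs.start_query(**query_params)["queryId"]
# results = logs.get_query_results(queryId=query_id)  # poll until status is "Complete"
```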
Creating CloudTrail trails
Why: CloudTrail trails enable comprehensive audit logging for compliance and security. Trail configuration includes: events to log (management events for control plane, data events for data plane like S3 object operations), log file storage location (S3 bucket), log file encryption (using KMS), organization trail (logs all accounts in organization). The exam tests when to enable CloudTrail (always for production, security/compliance requirements), configuring trails appropriately (management events minimum, data events for sensitive resources), securing trails (S3 bucket with restricted access, log file integrity validation), and using trails for auditing (who deleted model, who accessed training data).
AWS Documentation:
Setting up dashboards to monitor performance metrics (for example, by using Amazon QuickSight, CloudWatch dashboards)
Why: Dashboards provide at-a-glance operational visibility. CloudWatch dashboards visualize metrics (time series graphs, gauges, number widgets), aggregate views across resources, custom time ranges. QuickSight provides BI dashboards for business metrics (predictions per day, model accuracy trends, cost trends). The exam tests designing effective dashboards (showing key metrics prominently, appropriate time ranges, actionable visualizations), implementing dashboards (creating widgets, adding annotations, sharing dashboards with team), using dashboards for troubleshooting (correlating metrics to identify root causes), and dashboard best practices (automated refresh, mobile-friendly, role-based access).
AWS Documentation:
Monitoring infrastructure (for example, by using Amazon EventBridge events)
Why: EventBridge enables event-driven monitoring and automation. Events represent state changes (training job completed, endpoint deployed, CloudWatch alarm fired). Rules match events and route to targets (Lambda, SNS, SQS, Step Functions). The exam tests designing event-driven architectures (training job fails → Lambda investigates and retries, endpoint metrics degrade → alert and scale up), implementing EventBridge rules (event patterns matching specific conditions, targeting appropriate services), troubleshooting event flows (event not triggering rule may indicate incorrect pattern, target not invoked may indicate permissions issues), and integrating monitoring with automation (automatic remediation reduces operational burden).
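The "training job fails → notify" pattern above can be sketched as an EventBridge rule. The rule name and SNS topic are placeholders; the event pattern matches SageMaker's training job state-change events:

```python
# Sketch: route failed SageMaker training jobs to an alerting target.
import json

event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
    "detail": {"TrainingJobStatus": ["Failed"]},
}

# With credentials configured:
# import boto3
# events = boto3.client("events")
# events.put_rule(Name="training-job-failed",            # placeholder rule name
#                 EventPattern=json.dumps(event_pattern))
# events.put_targets(Rule="training-job-failed",
#                    Targets=[{"Id": "notify",
#                              "Arn": "arn:aws:sns:us-east-1:123456789012:ml-alerts"}])
```

If a rule never fires, compare the pattern against a sample event first; if the target is never invoked, check the target's resource-based permissions.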
AWS Documentation:
Rightsizing instance families and sizes (for example, by using SageMaker AI Inference Recommender and AWS Compute Optimizer)
Why: Right-sizing balances performance and cost by selecting optimal instance types. SageMaker Inference Recommender provides recommendations based on load testing (tests model on various instance types, measures latency/throughput/cost, recommends best option). Compute Optimizer analyzes historical utilization and recommends right-sizing (over-provisioned instances can downsize, under-provisioned should upsize). The exam tests using Inference Recommender for new deployments (providing model and representative payload, interpreting recommendations considering latency and cost requirements), using Compute Optimizer for existing deployments (identifying over-provisioned endpoints costing more than necessary), implementing recommendations safely (testing performance before production changes), and continuous right-sizing (reviewing quarterly as traffic patterns change).
AWS Documentation:
- SageMaker Inference Recommender
- Using Inference Recommender
- AWS Compute Optimizer
- Rightsizing Recommendations
Monitoring and resolving latency and scaling issues
Why: Latency and scaling issues directly impact user experience. Latency troubleshooting: identify bottleneck (model inference, preprocessing, network), optimize inference (model optimization, batching, caching), right-size instances (more powerful instances reduce compute latency). Scaling troubleshooting: validate auto-scaling configuration (correct metrics, thresholds, cooldowns), ensure sufficient capacity limits (max instances not too low), check for throttling (service quotas). The exam tests diagnosing latency issues from metrics (high ModelLatency → optimize model, high OverheadLatency → optimize preprocessing), resolving scaling issues (instances not scaling → check policy configuration, scaling but still slow → increase max instances), and implementing solutions (model optimization with SageMaker Neo, adding read replicas, implementing caching).
AWS Documentation:
Preparing infrastructure for cost monitoring (for example, by applying a tagging strategy)
Why: Effective cost monitoring requires preparation enabling analysis by relevant dimensions. Tagging preparation: define tag schema (Project, Environment, Owner, CostCenter), document tagging policy, implement automated enforcement (AWS Config rules, Lambda validation, Service Control Policies requiring tags), apply tags during resource creation (CloudFormation, CDK, Terraform templates with tags), activate cost allocation tags in Billing console. The exam tests designing comprehensive tagging strategies, implementing automated tag enforcement (preventing untagged resource creation), applying tags consistently (all ML resources tagged, no exceptions), and using tags for analysis (creating cost reports by project, implementing departmental charge-back).
AWS Documentation:
Troubleshooting capacity concerns that involve cost and performance (for example, provisioned concurrency, service quotas, auto scaling)
Why: Capacity issues affect both performance and cost. Provisioned concurrency (Lambda) keeps functions warm preventing cold starts (improves latency but costs more). Service quotas limit resource usage (endpoints per region, training jobs concurrent, API rate limits). Auto-scaling adjusts capacity but configuration issues cause problems (scaling too slowly, not scaling enough, scaling thrashing). The exam tests diagnosing capacity issues (service quota exceeded → request increase, cold starts causing latency → enable provisioned concurrency, insufficient capacity → increase max instances), balancing performance and cost (provisioned concurrency only for critical functions, auto-scaling prevents over-provisioning), and implementing solutions (requesting quota increases, optimizing auto-scaling configuration).
AWS Documentation:
- Lambda Provisioned Concurrency
- Service Quotas
- Requesting Quota Increases
- Troubleshooting Auto Scaling
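Endpoint auto-scaling is configured through Application Auto Scaling; the sketch below registers a variant as a scalable target and attaches a target-tracking policy. Endpoint name, capacity limits, target value, and cooldowns are illustrative placeholders:

```python
# Sketch: target-tracking auto-scaling for an endpoint variant, keeping
# invocations-per-instance near a target value.
resource_id = "endpoint/churn-endpoint/variant/AllTraffic"   # placeholder endpoint

target_params = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,            # raise this if scaling maxes out but latency stays high
}

policy_params = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 100.0,    # invocations per instance per minute (tune per model)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # scale in slowly to avoid thrashing
        "ScaleOutCooldown": 60,  # scale out quickly under load
    },
}

# With credentials configured:
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**target_params)
# aas.put_scaling_policy(**policy_params)
```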
Optimizing costs and setting cost quotas by using appropriate cost management tools (for example, AWS Cost Explorer, AWS Trusted Advisor, AWS Budgets)
Why: Proactive cost management prevents budget overruns and identifies waste. Cost Explorer analyzes spending (by service, region, tag, time), forecasts future costs. Trusted Advisor recommends optimizations (idle resources, Reserved Instance opportunities). Budgets alert when spending exceeds thresholds. The exam tests implementing cost controls (budgets with alerts, anomaly detection for unexpected spending), identifying optimization opportunities (Cost Explorer shows training on-demand costs 3× higher than necessary → use Spot instances, Trusted Advisor shows endpoints over-provisioned → right-size), implementing optimizations systematically (prioritize highest-cost resources, measure savings, iterate), and establishing cost governance (approval required for expensive resources, regular cost reviews).
AWS Documentation:
- Managing Costs with Cost Explorer
- Creating Budgets
- Cost Anomaly Detection
- Cost Optimization with Trusted Advisor
Optimizing infrastructure costs by selecting purchasing options (for example, Spot Instances, On-Demand Instances, Reserved Instances, SageMaker AI Savings Plans)
Why: Purchasing options significantly impact costs. Spot Instances (up to 90% savings) are suitable for interruptible workloads (training jobs with checkpointing, batch inference). On-Demand provides flexibility without commitment. Reserved Instances (up to 75% savings) require 1-3 year commitment, suitable for predictable workloads. SageMaker Savings Plans (up to 64% savings) provide flexibility across instance types while requiring commitment. The exam tests selecting purchasing options based on workload characteristics (training → Spot, production inference → Savings Plans or Reserved, development → On-Demand), calculating cost savings for different options, implementing mixed strategies (Reserved for baseline capacity, Spot for burst, On-Demand for flexibility), and understanding tradeoffs (Spot requires handling interruptions, Reserved reduces flexibility).
AWS Documentation:
- SageMaker Pricing
- SageMaker Savings Plans
- Managed Spot Training
- EC2 Purchasing Options
- Cost Optimization
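Managed Spot training with checkpointing can be sketched as a handful of SageMaker SDK estimator parameters (the S3 URIs and the estimator setup are placeholders; the SDK call is shown as a comment since it needs an AWS session):

```python
# Sketch: Spot training parameters - checkpointing lets the job resume after
# a Spot interruption instead of restarting from scratch.
spot_cfg = {
    "use_spot_instances": True,
    "max_run": 3600,        # max training seconds
    "max_wait": 7200,       # must be >= max_run; includes time waiting for Spot capacity
    "checkpoint_s3_uri": "s3://my-bucket/checkpoints/",   # placeholder
}

# from sagemaker.estimator import Estimator
# estimator = Estimator(image_uri=image_uri, role=role,
#                       instance_count=1, instance_type="ml.p3.2xlarge",
#                       **spot_cfg)
# estimator.fit({"train": "s3://my-bucket/train/"})

assert spot_cfg["max_wait"] >= spot_cfg["max_run"]
```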
Task 4.3: Secure AWS resources
Knowledge Areas & AWS Documentation Reading List
1. IAM roles, policies, and groups that control access to AWS services (for example, AWS Identity and Access Management [IAM], bucket policies, SageMaker Role Manager)
Why: IAM is foundational to AWS security controlling who can do what with which resources. Roles provide temporary credentials (SageMaker execution role for training/endpoints, cross-account roles for shared resources). Policies define permissions (identity-based attached to users/roles, resource-based attached to S3 buckets/KMS keys). Groups organize users for easier permission management. SageMaker Role Manager provides pre-configured roles for common scenarios (data scientist, MLOps engineer). The exam tests implementing least privilege (granting minimum permissions needed, not AdministratorAccess), troubleshooting access denied (missing IAM permissions, incorrect role trust policy), understanding policy evaluation logic (explicit deny overrides allow), and using SageMaker Role Manager to simplify permission management.
AWS Documentation:
- AWS IAM User Guide
- IAM Roles
- IAM Policies
- SageMaker Execution Roles
- SageMaker Role Manager
- S3 Bucket Policies
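A least-privilege identity-based policy for the data scientist scenario above might look like the following sketch. The bucket name, region, and account ID are placeholders (assumptions); the point is specific actions on specific resources, not wildcards.

```python
import json

# Sketch of a least-privilege policy: a data scientist may start and
# inspect training jobs and read one training-data bucket -- nothing else.
# ARNs below are illustrative placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowTrainingJobs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:123456789012:training-job/*",
        },
        {
            "Sid": "AllowTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-training-data",      # ListBucket targets the bucket
                "arn:aws:s3:::example-training-data/*",    # GetObject targets the objects
            ],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note the S3 statement names both the bucket ARN and the object ARN pattern — `s3:ListBucket` applies to the bucket itself, while `s3:GetObject` applies to objects, a distinction that causes many access-denied surprises.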
2. SageMaker AI security and compliance features
Why: SageMaker provides security features enabling compliant ML workflows. Network isolation (VPC deployment) restricts network access. Encryption at rest (KMS for model artifacts, training data) protects stored data. Encryption in transit (TLS for API calls) protects data in motion. Private workforce for Ground Truth ensures labelers are trusted. Resource access control (IAM policies, VPC security groups) implements defense-in-depth. The exam tests when to use security features (VPC for compliance requirements, encryption for sensitive data, private workforce for proprietary data), configuring features correctly (specifying KMS key, VPC configuration with appropriate subnets), and understanding compliance programs (SageMaker is HIPAA eligible, PCI DSS compliant).
AWS Documentation:
- SageMaker Security
- Data Protection in SageMaker
- Encryption at Rest
- Encryption in Transit
- SageMaker Compliance Programs
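The security features above map to concrete parameters on a CreateTrainingJob request. The sketch below shows them as a parameter dict (key names match the SageMaker API; the KMS key, subnet, and security group IDs are placeholders) — in practice you would pass these to boto3's `create_training_job`.

```python
# Security-related parameters of a SageMaker CreateTrainingJob call.
# All resource IDs and ARNs are illustrative placeholders.
secure_training_params = {
    "TrainingJobName": "example-secure-training",
    "OutputDataConfig": {
        "S3OutputPath": "s3://example-model-artifacts/",
        # KMS encryption at rest for model artifacts:
        "KmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 2,
        "VolumeSizeInGB": 50,
        # KMS encryption for the attached training volumes:
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
    },
    # Run training inside private subnets with a restrictive security group:
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
    # Container gets no outbound network access at all:
    "EnableNetworkIsolation": True,
    # TLS between nodes in distributed training:
    "EnableInterContainerTrafficEncryption": True,
}
```

Endpoints take the analogous settings (`KmsKeyId` on the endpoint config, `VpcConfig` on the model), so the same defense-in-depth pattern carries through deployment.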
3. Controls for network access to ML resources
Why: Network controls prevent unauthorized access to ML resources. VPC isolates resources in private network. Security groups act as firewall controlling inbound/outbound traffic. VPC endpoints enable private connectivity to AWS services without internet gateway. Private subnets (no internet gateway) prevent internet access. Network ACLs provide subnet-level traffic filtering. The exam tests designing secure network architectures (endpoints in private subnets, security groups allowing only required traffic, VPC endpoints for S3/SageMaker API), troubleshooting network access issues (security group blocking traffic, missing route in route table), implementing defense-in-depth (security groups + NACLs + private subnets), and understanding when VPC deployment is required (compliance, accessing private data sources).
AWS Documentation:
- Connect SageMaker to Resources in a VPC
- Protect Training Jobs by Using a VPC
- Protect Endpoints by Using a VPC
- VPC Endpoints for SageMaker
- Security Groups for VPC
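The "security groups allowing only required traffic" pattern is usually expressed by referencing another security group rather than a CIDR range. The sketch below shows the request parameters (group IDs are placeholders); in practice you would pass them to EC2's AuthorizeSecurityGroupIngress.

```python
# Allow only the application tier's security group to reach the ML
# endpoint tier on HTTPS. Group IDs are illustrative placeholders.
ingress_params = {
    "GroupId": "sg-endpoint0000001",  # SG attached to the endpoint's VPC ENIs
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            # Reference the app tier's SG instead of a CIDR block, so the
            # rule follows instances as they scale, not fixed IP addresses.
            "UserIdGroupPairs": [{"GroupId": "sg-apptier00000001"}],
        }
    ],
}
```

Because security groups are stateful, return traffic is allowed automatically; NACLs, by contrast, are stateless and need explicit rules in both directions — a common troubleshooting gotcha.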
4. Security best practices for CI/CD pipelines
Why: CI/CD pipelines access sensitive resources and must be secured. Best practices include: least privilege for pipeline roles (only permissions needed), secrets management (Secrets Manager/Parameter Store, not hardcoded), code review before production deployment (pull request approvals), automated security testing (vulnerability scanning, policy validation), immutable infrastructure (no manual changes to production), audit logging (CloudTrail for API calls, CodePipeline execution history). The exam tests implementing secure pipelines (IAM roles with specific permissions, secrets retrieved from Secrets Manager, approval gates before production), preventing common vulnerabilities (exposed credentials, insufficient access controls), and understanding security-development tradeoffs (more controls increase safety but slow velocity).
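Two of the practices above — least privilege for the pipeline role and secrets from Secrets Manager rather than hardcoded values — can be sketched together. The secret ARN is a placeholder, and the retrieval function assumes boto3 and valid AWS credentials at build time.

```python
import json

# The pipeline role gets GetSecretValue on exactly one secret (not "*"),
# and the build step fetches it at runtime instead of embedding it in code.
# The ARN is an illustrative placeholder.
pipeline_secret_arn = (
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:ci/deploy-token-AbCdEf"
)

pipeline_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": pipeline_secret_arn,  # one named secret, nothing broader
        }
    ],
}

def fetch_deploy_token(secret_arn: str) -> str:
    """Retrieve the secret at build time (requires boto3 + AWS credentials)."""
    import boto3  # local import keeps this sketch importable without AWS
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_arn)["SecretString"]

print(json.dumps(pipeline_role_policy, indent=2))
```

The same pattern works with SSM Parameter Store (`ssm:GetParameter` on a specific parameter ARN) for non-rotating configuration values.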
Skills & Corresponding Documentation
Configuring least privilege access to ML artifacts
Why: Least privilege minimizes damage from compromised credentials or insider threats. Implementation: identify the required permissions (for example, a data scientist needs sagemaker:CreateTrainingJob and read access to model artifacts in S3), create an IAM policy with specific actions and resources (not wildcards), test that the policy actually works, and periodically audit and remove unused permissions. The exam tests implementing least privilege for common scenarios (a data scientist role with permissions for training but not production deployment, an MLOps engineer with deployment permissions, a service role for a training job with access to specific S3 buckets), troubleshooting overly restrictive policies (access denied despite seemingly appropriate permissions), and understanding policy evaluation (explicit deny, implicit deny, resource policies).
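The evaluation order mentioned above (explicit deny beats allow, and no matching allow means an implicit deny) can be captured in a toy evaluator. This is a deliberate simplification — real IAM also handles wildcards, conditions, permission boundaries, and resource policies — so actions and resources here are matched exactly.

```python
# Toy model of IAM policy evaluation logic:
#   1. An explicit Deny always wins.
#   2. With no matching Allow, the default is an implicit deny.
# Real IAM supports wildcards and conditions; this sketch matches exactly.

def evaluate(statements, action, resource):
    decision = "implicit-deny"
    for stmt in statements:
        if action in stmt["Action"] and resource in stmt["Resource"]:
            if stmt["Effect"] == "Deny":
                return "explicit-deny"  # deny overrides any allow
            decision = "allow"
    return decision

statements = [
    {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [
            "arn:aws:s3:::example-data/train.csv",
            "arn:aws:s3:::example-data/secrets.csv",
        ],
    },
    {
        "Effect": "Deny",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-data/secrets.csv"],
    },
]

print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::example-data/train.csv"))
```

The exam-relevant takeaway: when access is denied despite an Allow statement, look for an explicit Deny in any applicable policy (including SCPs and resource policies) before assuming the Allow is malformed.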
Configuring IAM policies and roles for users and applications that interact with ML systems
Why: Proper IAM configuration enables secure access while preventing unauthorized actions. Users need policies for console/CLI access (data scientists: SageMaker permissions, S3 access to specific buckets). Applications need roles (training job: access to training data and model output location, endpoint: write CloudWatch metrics, Lambda: invoke endpoint). The exam tests creating appropriate policies (allowing required actions on specific resources), configuring roles for services (SageMaker execution role with trust policy allowing SageMaker service, permissions for required actions), troubleshooting policy issues (access denied despite policy that should allow, role trust policy not allowing service), and understanding cross-account access (using roles, not sharing credentials).
AWS Documentation:
- Identity-Based Policies for SageMaker
- Resource-Based Policies
- Using Service-Linked Roles
- Cross-Account Access
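A SageMaker execution role has two halves: a trust policy saying who may assume it, and permissions policies attached separately. The trust policy is the piece that trips people up — the sketch below shows its shape for the SageMaker service principal.

```python
import json

# Trust policy for a SageMaker execution role: only the SageMaker service
# may assume this role via sts:AssumeRole. Permissions (S3 access, logging,
# etc.) are granted by separate policies attached to the same role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

A role whose trust policy names the wrong service principal produces errors at job-creation time even when its permissions policies are perfect — a classic exam troubleshooting scenario.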
Monitoring, auditing, and logging ML systems to ensure continued security and compliance
Why: Continuous monitoring detects security incidents and demonstrates compliance. Monitoring includes: CloudTrail logging all API calls (who accessed what, when), CloudWatch monitoring resource usage (detecting unusual activity), GuardDuty detecting threats (compromised credentials, unusual API calls), Config tracking resource configuration changes (endpoint deployed without encryption). The exam tests implementing comprehensive logging (CloudTrail enabled, logs centralized, long-term retention), configuring alerts for security events (GuardDuty findings, Config non-compliant resources), using logs for incident investigation (determining who deleted model, when training data was accessed), and demonstrating compliance (producing audit reports showing controls implemented).
AWS Documentation:
- Logging and Monitoring in SageMaker
- Logging SageMaker API Calls with CloudTrail
- Compliance Validation for SageMaker
- AWS Config for SageMaker
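Answering "who deleted the model?" from the text above comes down to a CloudTrail event lookup filtered by event name. The sketch below builds the request parameters (in practice you would pass them to boto3's CloudTrail `lookup_events` call).

```python
from datetime import datetime, timedelta

# Parameters for a CloudTrail LookupEvents request: find every DeleteModel
# API call in the last 7 days. Each returned event records the caller
# identity, timestamp, and source IP.
lookup_params = {
    "LookupAttributes": [
        {"AttributeKey": "EventName", "AttributeValue": "DeleteModel"},
    ],
    "StartTime": datetime.utcnow() - timedelta(days=7),
    "EndTime": datetime.utcnow(),
}

print(lookup_params["LookupAttributes"])
```

For retention beyond the 90-day event history, configure a trail delivering logs to S3 (with a lifecycle policy for long-term, low-cost storage) — the pattern the exam expects for "centralized, long-term audit logging."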
Troubleshooting and debugging security issues
Why: Security troubleshooting requires a systematic approach to identifying root causes. Common issues: access denied (missing IAM permissions, incorrect policy, expired credentials), network connectivity failures (security group blocking traffic, missing VPC endpoint), encryption errors (KMS key not accessible, incorrect key policy). Troubleshooting process: reproduce the issue, examine error messages, check CloudTrail for the failing API calls, verify IAM permissions, validate network configuration, test with a more permissive configuration to isolate the cause, then implement the minimal fix. The exam tests diagnosing security issues from symptoms (specific error messages pointing to specific causes), using appropriate tools (CloudTrail for API calls, VPC Flow Logs for network traffic), and implementing correct fixes (adding a specific permission, not making everything public).
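Access-denied messages usually name the missing action and target resource directly, so parsing the message is often the fastest diagnosis. The error string below is fabricated for illustration (real wording varies slightly by service), but the "not authorized to perform: ACTION on resource: ARN" pattern is the part to look for.

```python
import re

# A fabricated AccessDenied message in the common IAM format. The embedded
# action/resource pair tells you exactly which Allow statement is missing.
error_message = (
    "User: arn:aws:sts::123456789012:assumed-role/DataScientist/app "
    "is not authorized to perform: sagemaker:CreateEndpoint on resource: "
    "arn:aws:sagemaker:us-east-1:123456789012:endpoint/prod-endpoint"
)

match = re.search(r"not authorized to perform: (\S+) on resource: (\S+)", error_message)
missing_action, target_resource = match.groups()

print(f"Missing permission: {missing_action} on {target_resource}")
```

If the fix still fails after adding that Allow, check for an explicit Deny elsewhere (SCP, permission boundary, resource policy) before broadening the permission further.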
Building VPCs, subnets, and security groups to securely isolate ML systems
Why: VPC provides network isolation for sensitive ML workloads. Architecture: create a VPC with an RFC 1918 CIDR block, create public subnets (route to an internet gateway) and private subnets (route to a NAT gateway), place ML resources in the private subnets, create security groups allowing only required traffic (the SageMaker endpoint security group allows traffic only from the application security group), configure VPC endpoints for AWS services (SageMaker API, S3), and implement network ACLs for additional protection. The exam tests designing secure VPC architectures (private subnets for ML resources, VPC endpoints to avoid NAT costs), implementing defense-in-depth (security groups + NACLs + private subnets), troubleshooting network issues (connectivity failures due to security group misconfiguration), and understanding cost implications (NAT Gateway costs, VPC endpoint savings).
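The subnet plan described above can be sketched with the standard-library `ipaddress` module: carve an RFC 1918 /16 into /24 subnets and assign them to public and private tiers across two Availability Zones (the tier names and AZ split are illustrative choices, not AWS requirements).

```python
import ipaddress

# Carve an RFC 1918 VPC block into /24 subnets: two public (internet
# gateway route) and two private (NAT gateway route), across two AZs.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))[:4]

plan = {
    "public-az-a": str(subnets[0]),   # NAT gateway, load balancers
    "public-az-b": str(subnets[1]),
    "private-az-a": str(subnets[2]),  # SageMaker endpoints, training jobs
    "private-az-b": str(subnets[3]),
}

for name, cidr in plan.items():
    print(f"{name}: {cidr}")
```

With VPC endpoints for S3 and the SageMaker API in the private tier, training jobs reach their data without traversing the NAT gateway at all — the "VPC endpoint savings" the text refers to.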
AWS Service FAQs
- Amazon SageMaker FAQs
- Amazon CloudWatch FAQs
- AWS X-Ray FAQs
- AWS CloudTrail FAQs
- AWS Cost Explorer FAQs
- AWS Trusted Advisor FAQs
- AWS Budgets FAQs
- AWS IAM FAQs
- Amazon VPC FAQs
- AWS KMS FAQs
AWS Whitepapers
- Machine Learning Lens - AWS Well-Architected Framework
- Model Monitoring
- Cost Optimization for ML
- Security Best Practices for ML
- Operational Excellence for ML
- AWS Security Best Practices
Final Thoughts
Domain 4 completes the ML engineering lifecycle by focusing on production operations - monitoring, cost optimization, and security. SageMaker Model Monitor is essential for detecting drift and maintaining model quality, so invest significant time understanding all monitoring types (data quality, model quality, bias drift, feature attribution). Cost optimization requires understanding multiple strategies: right-sizing with Inference Recommender, purchasing options (Spot, Savings Plans, Reserved), and using Cost Explorer for analysis. Security skills are foundational: IAM for access control (implement least privilege rigorously), VPC for network isolation (place resources in private subnets, use VPC endpoints), encryption for data protection (KMS for at rest, TLS for in transit), and CloudTrail for audit logging. Effective monitoring combines CloudWatch metrics and alarms, comprehensive logging, and dashboards providing operational visibility. Success in this domain requires both deep AWS service knowledge and operational experience managing production ML systems. Complement documentation study with hands-on practice implementing complete monitoring, implementing cost controls, and securing ML workloads following security best practices. This operational expertise distinguishes production-ready ML engineers from those focused solely on model development.