AWS Certified Machine Learning - Specialty (MLS-C01) Domain 1
Data Engineering
Official Exam Guide: Domain 1: Data Engineering
Skill Builder: AWS Certified Machine Learning - Specialty Exam Prep
Domain Overview
Domain 1 (20% of the exam) covers three tasks: creating data repositories for ML, implementing data ingestion solutions, and implementing data transformation solutions.
Task 1.1: Create data repositories for ML
Key Concepts:
- Identify data sources (content, location, primary sources like user data)
- Determine storage mediums (databases, S3, EFS, EBS)
Essential Documentation:
- Amazon S3 User Guide
- Amazon EFS User Guide
- Amazon EBS User Guide
- Amazon DynamoDB Developer Guide
- Amazon RDS User Guide
Task 1.2: Identify and implement a data ingestion solution
Key Concepts:
- Data job styles (batch load, streaming)
- Orchestrate data ingestion pipelines (batch-based and streaming-based ML workloads)
- Schedule jobs
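The difference between the two job styles can be sketched in plain Python. This is a minimal illustration, not AWS code: `batch_load`, `stream_in_micro_batches`, and the size-only `buffer_size` trigger are hypothetical names chosen for this example. Managed delivery services typically also flush on a time interval, which is omitted here to keep the sketch deterministic.

```python
from typing import Iterable, Iterator, List


def batch_load(records: List[dict]) -> List[dict]:
    """Batch style: the whole dataset is available up front and is processed in one pass."""
    return [{**r, "processed": True} for r in records]


def stream_in_micro_batches(records: Iterable[dict], buffer_size: int) -> Iterator[List[dict]]:
    """Streaming style: records arrive one at a time and are flushed downstream
    whenever the buffer fills (hypothetical size-only trigger)."""
    buffer: List[dict] = []
    for record in records:
        buffer.append({**record, "processed": True})
        if len(buffer) >= buffer_size:
            yield buffer
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield buffer


events = [{"id": i} for i in range(5)]

batch_result = batch_load(events)
stream_result = list(stream_in_micro_batches(iter(events), buffer_size=2))

print(len(batch_result))                # 5
print([len(b) for b in stream_result])  # [2, 2, 1]
```

The batch function returns only after every record has been processed, while the streaming generator hands results downstream as soon as each small buffer fills. That latency versus throughput trade-off is what scenario questions about ingestion usually turn on.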
Essential Documentation:
- Amazon Kinesis Data Streams Developer Guide
- Amazon Data Firehose Developer Guide
- Amazon EMR Management Guide
- AWS Glue Developer Guide
- Amazon Managed Service for Apache Flink
- AWS Lambda Developer Guide
Task 1.3: Identify and implement a data transformation solution
Key Concepts:
- Transform data in transit (ETL with AWS Glue, EMR, AWS Batch)
- Handle ML-specific data using MapReduce (Hadoop, Spark, Hive)
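As a reminder of what the MapReduce model does, here is a single-process word count in plain Python. It is only a sketch of the map, shuffle, and reduce phases. The function names are illustrative and are not Hadoop or Spark APIs, and a real cluster would distribute each phase across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["spark hive hadoop", "spark spark hive"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'spark': 3, 'hive': 2, 'hadoop': 1}
```

The same three-phase shape underlies common ML data preparation steps such as counting label frequencies or aggregating features per user.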
Essential Documentation:
- AWS Glue How It Works
- AWS Glue Jobs
- Apache Spark on Amazon EMR
- Apache Hadoop on Amazon EMR
- AWS Batch User Guide
AWS Service FAQs
Study Tips
- Master data storage options: S3 for scalable object storage (training data, model artifacts), EFS for shared file systems, EBS for block storage attached to EC2 instances, databases for structured data.
- Learn streaming vs batch: Kinesis Data Streams for real-time ingestion, Firehose for delivery to S3/Redshift, Glue for batch ETL, EMR for large-scale processing.
- Understand ETL pipelines: AWS Glue for serverless ETL, EMR with Spark for complex transformations, Glue Data Catalog for metadata management.
- Practice data lake architecture: S3 as data lake storage, Glue crawlers for schema discovery, Athena for SQL queries, Lake Formation for governance.
- Study Apache Spark: DataFrame API, transformations vs actions, lazy evaluation, RDD operations, Spark SQL for ML data preparation.
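The Spark tip above mentions transformations vs actions and lazy evaluation. The idea can be illustrated with Python generators. This is an analogy only and does not use the PySpark API: `lazy_map`, `lazy_filter`, and `collect` are hypothetical stand-ins, and the `log` list exists purely to show when work actually happens.

```python
from typing import Callable, Iterable, Iterator, List

log: List[str] = []  # records each unit of work as it is actually performed


def lazy_map(fn: Callable[[int], int], items: Iterable[int]) -> Iterator[int]:
    """Transformation analogue: describes work but does nothing until consumed."""
    for item in items:
        log.append(f"map({item})")
        yield fn(item)


def lazy_filter(pred: Callable[[int], bool], items: Iterable[int]) -> Iterator[int]:
    """Transformation analogue: also deferred until something consumes it."""
    for item in items:
        log.append(f"filter({item})")
        if pred(item):
            yield item


def collect(items: Iterable[int]) -> List[int]:
    """Action analogue: forces the whole pipeline to run and returns the results."""
    return list(items)


# Building the pipeline only chains generators; no element is touched yet.
pipeline = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x + 1, [1, 2, 3]))
work_before_action = len(log)

result = collect(pipeline)

print(work_before_action)  # 0
print(result)              # [2, 4]
print(len(log))            # 6
```

Nothing is logged until `collect` runs, mirroring how Spark records transformations in a plan and executes them only when an action requests a result.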
Note: This is Domain 1 of 4, representing 20% of exam content.