AWS Certified Machine Learning - Specialty (MLS-C01) Domain 2
Exploratory Data Analysis
Official Exam Guide: Domain 2: Exploratory Data Analysis
Skill Builder: AWS Certified Machine Learning - Specialty Exam Prep
Domain Overview
Domain 2 accounts for 24% of the exam and covers three tasks: sanitizing and preparing data, performing feature engineering, and analyzing and visualizing data for ML.
Task 2.1: Sanitize and prepare data for modeling
Key Concepts:
- Identify and handle missing data, corrupt data, stop words
- Format, normalize, augment, and scale data
- Determine if there’s sufficient labeled data
- Use data labeling tools (Amazon Mechanical Turk)
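As a minimal sketch of the missing-data and scaling concepts above, the following uses only the Python standard library. The helper names (`impute_median`, `min_max_scale`, `standardize`) and the sample `ages` list are made up for illustration; in practice a library such as pandas or scikit-learn, or a SageMaker Data Wrangler transform, would usually do this work.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Normalize values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column with two missing entries.
ages = [22, None, 35, 58, None, 41]

filled = impute_median(ages)    # missing entries become the median of 22, 35, 41, 58
scaled = min_max_scale(filled)  # smallest value maps to 0.0, largest to 1.0
z_scores = standardize(filled)  # mean 0, standard deviation 1

print(filled)
print(scaled)
print(z_scores)
```

Median imputation is chosen here because it is less sensitive to extreme values than the mean; deleting rows with missing values is the simpler alternative when few rows are affected.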
Task 2.2: Perform feature engineering
Key Concepts:
- Extract features from text, speech, images, public datasets
- Feature engineering concepts: binning, tokenization, outliers, synthetic features, one-hot encoding, dimensionality reduction
Essential Documentation:
- SageMaker Data Wrangler Transforms
- SageMaker Processing Jobs
- Amazon Comprehend Developer Guide
- Amazon Rekognition Developer Guide
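Three of the feature engineering concepts listed for Task 2.2 (binning, one-hot encoding, and tokenization) can be sketched in a few lines of standard-library Python. The function names, bin edges, and sample inputs below are illustrative assumptions, not part of any AWS API.

```python
import re
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Map a number to a bin label; each edge is the inclusive lower bound of the next bin."""
    return labels[bisect_right(edges, value)]

def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Binning: two edges define three bins (hypothetical age groups).
age_edges = [18, 65]
age_labels = ["minor", "adult", "senior"]
age_bins = [bin_value(a, age_edges, age_labels) for a in (17, 18, 64, 65)]

# One-hot encoding: each category becomes its own indicator column.
categories, encoded = one_hot(["red", "green", "red", "blue"])

# Tokenization with a tiny, made-up stop-word list.
tokens = tokenize("The model overfits the training data.", stop_words={"the"})

print(age_bins)
print(categories, encoded)
print(tokens)
```

One-hot encoding avoids implying an order between categories, which label encoding (mapping categories to integers) would introduce.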
Task 2.3: Analyze and visualize data for ML
Key Concepts:
- Create graphs (scatter plots, time series, histograms, box plots)
- Interpret descriptive statistics (correlation, summary statistics, p-value)
- Perform cluster analysis (hierarchical, elbow plot, cluster size)
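The elbow plot mentioned above charts the within-cluster sum of squares (WCSS) against the number of clusters k; the "elbow" is the k after which adding clusters stops reducing WCSS substantially. Below is a rough sketch using a simplified one-dimensional k-means with a deterministic initialization. The function names, the initialization scheme, and the sample points are assumptions made for illustration; real workloads would typically use a library implementation or the SageMaker built-in K-Means algorithm.

```python
def kmeans_1d(points, k, iterations=100):
    """Simplified 1-D k-means (Lloyd's algorithm) with evenly spaced starting centroids."""
    data = sorted(points)
    n = len(data)
    # Deterministic start: take k evenly spaced values from the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its members.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to each centroid."""
    return sum((x - c) ** 2
               for c, members in zip(centroids, clusters)
               for x in members)

# Made-up data with three visibly separated groups.
points = [1, 2, 3, 10, 11, 12, 20, 21, 22]

inertia = {k: wcss(*kmeans_1d(points, k)) for k in range(1, 5)}
for k, value in inertia.items():
    print(k, round(value, 2))
```

For this data the WCSS falls steeply up to k = 3 and barely changes at k = 4, so the elbow suggests three clusters. Checking the size of each cluster at the chosen k is a useful sanity check against tiny or empty clusters.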
Study Tips
- Master data cleaning - Handle missing values (imputation, deletion), remove outliers, handle duplicates, address class imbalance (SMOTE, undersampling).
- Learn feature engineering - Numeric: normalization, standardization, binning. Text: tokenization, TF-IDF, word embeddings. Categorical: one-hot encoding, label encoding.
- Understand dimensionality reduction - PCA for linear reduction, t-SNE for visualization, feature selection methods (filter, wrapper, embedded).
- Practice data visualization - Histograms for distributions, scatter plots for correlations, box plots for outliers, heatmaps for correlation matrices.
- Study descriptive statistics - Mean, median, mode, standard deviation, correlation coefficients, p-values for hypothesis testing.
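To make the descriptive statistics and outlier tips above concrete, here is a small standard-library sketch that computes summary statistics, a Pearson correlation coefficient, and the 1.5 x IQR rule that box plots use to flag outliers. The helper names and the sample data are invented for illustration, and the quartile method (`"inclusive"`) is one of several conventions, so other tools may report slightly different fences.

```python
from math import sqrt
from statistics import mean, median, mode, quantiles, stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whiskers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical samples.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 64, 70]
latencies_ms = [12, 13, 12, 14, 13, 15, 12, 95]

summary = {
    "mean": mean(latencies_ms),
    "median": median(latencies_ms),
    "mode": mode(latencies_ms),
    "stdev": stdev(latencies_ms),  # sample standard deviation
}
r = pearson_r(hours_studied, exam_scores)
outliers = iqr_outliers(latencies_ms)

print(summary)
print(round(r, 3))
print(outliers)
```

In this sample the single extreme latency pulls the mean well above the median, which is the typical signal that a distribution is skewed and that the median is the more robust summary.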
Note: This is Domain 2 of 4, representing 24% of exam content.