Generalization
Definition:
- Generalization refers to a machine learning model's ability to perform well on new, unseen data drawn from the same distribution as the training data.
- The core goal of supervised learning is to learn a model that generalizes from the training set to accurately predict outcomes for new data points.
Importance:
- A model that generalizes well captures the underlying patterns in the data instead of memorizing training examples.
- Without good generalization, a model may perform well on the training data but poorly on new data, which is undesirable in real-world applications; in practice, generalization is estimated by scoring the model on a held-out test set, as in the sketch below.
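A minimal sketch of that evaluation workflow, assuming scikit-learn; the dataset, model, and parameters are illustrative assumptions, not taken from the source:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Training accuracy measures fit; test accuracy estimates generalization.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))

A large gap between the two scores is usually the first sign that a model is not generalizing well.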
Overfitting
Definition:
- Overfitting occurs when a model learns the noise and random fluctuations in the training data instead of the true underlying distribution.
- The model fits the training data too closely, capturing minor details that do not generalize.
Characteristics:
- Very low error on the training set.
- Poor performance on new or test data.
- Decision boundaries or predictions are overly complex and finely tuned to training points, including outliers.
Causes of Overfitting:
- Model complexity is too high relative to the amount and noisiness of the data.
- Insufficient training data to support a complex model.
- Lack of proper regularization or early stopping strategies.
Illustrative Example:
- Decision trees grown until all leaves are pure classify every training example correctly, which amounts to overfitting by fitting noise and outliers (Figure 2-26 on page 88).
- k-Nearest Neighbors with k=1 achieves perfect training accuracy but often generalizes poorly to new data; both cases are shown in the sketch below.
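A minimal sketch of both examples, assuming scikit-learn and a noisy synthetic dataset; all names and settings here are illustrative assumptions:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (an assumption), so a perfect training fit
# requires memorizing noise.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [("unpruned tree", DecisionTreeClassifier(random_state=0)),
          ("1-NN", KNeighborsClassifier(n_neighbors=1))]
for name, clf in models:
    clf.fit(X_train, y_train)
    # Both reach (near-)perfect training accuracy; test accuracy is lower.
    print(f"{name:13}  train={clf.score(X_train, y_train):.2f}"
          f"  test={clf.score(X_test, y_test):.2f}")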
Underfitting
Definition:
- Underfitting occurs when a model is too simple to capture the underlying structure and patterns in the data.
- The model performs poorly on both the training data and new data.
Characteristics:
- High error on training data.
- High error on test data.
- Model predictions are overly simplified, missing important relationships.
Causes of Underfitting:
- Model complexity is too low (for example, a linear model fit to a nonlinear target, as in the sketch below).
- Insufficient features or lack of expressive power.
- Regularization that is too strong, preventing the model from learning meaningful patterns.
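A minimal sketch of underfitting, assuming a synthetic quadratic target; the dataset and settings are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear (quadratic) target -- an assumption for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_train, y_train)
# A straight line cannot capture the U-shaped relationship, so the R^2
# score is low on the training data and the test data alike.
print("train R^2:", lin.score(X_train, y_train))
print("test R^2: ", lin.score(X_test, y_test))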
The Trade-Off Between Overfitting and Underfitting
Model Complexity vs. Dataset Size:
- There is a balance or "sweet spot" to be found where the model is complex enough to explain the data but simple enough to avoid fitting noise.
- The relationship between model complexity and test error typically forms a U-shaped curve: error first falls as the model gains capacity, then rises again once the model starts fitting noise.
Model Selection:
- Effective supervised learning requires choosing a model with the right level of complexity.
- Techniques include hyperparameter tuning (e.g., k in k-nearest neighbors), pruning in decision trees, regularization, and early stopping; a tuning sketch follows below.
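A minimal sketch of one common model-selection workflow, assuming scikit-learn: a cross-validated grid search over k for a k-nearest-neighbors classifier (the dataset and candidate grid are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and the candidate grid are assumptions for illustration.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small k means a complex model (overfitting risk); large k means a simple
# one (underfitting risk). Cross-validation picks a value in between.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 9, 15, 25]},
                      cv=5)
search.fit(X_train, y_train)
print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.score(X_test, y_test))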
Impact of Scale and Feature Engineering:
- Proper scaling and representation of input features significantly affect the model's ability to generalize and can reduce both overfitting and underfitting; a scaling sketch follows below.
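A minimal sketch of the effect of scaling, assuming an SVM (a model sensitive to feature scales) on scikit-learn's built-in breast cancer dataset; the specific dataset and model are illustrative choices, not from the source:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same SVM, with and without standardizing the features first.
raw = SVC().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
print("unscaled test accuracy:", raw.score(X_test, y_test))
print("scaled test accuracy:  ", scaled.score(X_test, y_test))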
Strategies to Mitigate Overfitting and Underfitting
Mitigating Overfitting:
- Use simpler models.
- Apply regularization (L1/L2); see the sketch after this list.
- Use early stopping in iterative algorithms.
- Prune decision trees (pre-pruning or post-pruning).
- Increase the training data size.
Mitigating Underfitting:
- Use more complex models.
- Add more features or use feature engineering.
- Reduce regularization.
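A minimal sketch of how regularization strength moves a model between the two failure modes, assuming ridge regression on synthetic high-dimensional data; all settings are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# High-dimensional synthetic data (an assumption) where regularization matters.
X, y = make_regression(n_samples=100, n_features=80, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tiny alpha leaves the model free to overfit; a huge alpha shrinks it
# toward underfitting; a moderate value usually generalizes best.
for alpha in (0.001, 1.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:<7}  train R^2={ridge.score(X_train, y_train):.2f}"
          f"  test R^2={ridge.score(X_test, y_test):.2f}")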
Summary
Aspect | Overfitting | Underfitting
Model Complexity | Too high | Too low
Training Performance | Very good | Poor
Test Performance | Poor | Poor
Cause | Learning noise; focusing on outliers | Oversimplification; insufficient features or expressive power
Example | Deep decision trees, k-NN with k=1 | Linear model on a nonlinear problem
The ultimate goal is to find a model that generalizes well by balancing these extremes.