Generalization
Definition:
- Generalization
     refers to a machine learning model's ability to perform well on new,
     unseen data that is drawn from the same distribution as the training data.
- The core goal of supervised learning is to learn a model that generalizes
     from the training set to accurately predict outcomes for new data points.
Importance:
- A model that generalizes well captures the underlying
     patterns in the data instead of memorizing training examples.
- Without good generalization, a model may perform well on
     the training data but poorly on any new data, which is undesirable in
     real-world applications.
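A minimal sketch of how generalization is usually measured in practice (the dataset and classifier here are illustrative choices, not taken from the notes above): hold out a test set the model never sees during training and compare the two scores.

```python
# Estimate generalization with a held-out test set (illustrative dataset/model).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:    ", model.score(X_test, y_test))  # estimate of generalization
```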
Overfitting
Definition:
- Overfitting occurs when a model learns the noise and random fluctuations
     in the training data instead of the true underlying distribution.
- The model fits the training data too closely, capturing
     minor details that do not generalize.
Characteristics:
- Very low error on the training set.
- Poor performance on new or test data.
- Decision boundaries or predictions are overly complex and
     finely tuned to training points, including outliers.
Causes of Overfitting:
- Model complexity is too high relative to the amount and
     noisiness of data.
- Insufficient training data to support a complex model.
- Lack of proper regularization or early stopping
     strategies.
Illustrative Example:
- Decision trees grown until every leaf is pure classify every training
     example correctly; this is a classic case of overfitting, because the tree
     also fits noise and outliers (Figure 2-26 on page 88).
- k-Nearest Neighbors with k=1 achieves perfect training accuracy but often
     generalizes poorly to new data (see the sketch below).
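A hedged illustration of the k=1 example above; the dataset and split are arbitrary illustrative choices. A 1-nearest-neighbor classifier memorizes the training set, so its training accuracy is perfect while its test accuracy is usually noticeably lower.

```python
# k=1 nearest neighbors: perfect on training data, weaker on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("k=1 training accuracy:", knn1.score(X_train, y_train))  # 1.0 by construction
print("k=1 test accuracy:    ", knn1.score(X_test, y_test))    # typically lower
```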
Underfitting
Definition:
- Underfitting occurs when a model is too simple to capture the
     underlying structure and patterns in the data.
- The model performs poorly on both the training data and new data.
Characteristics:
- High error on training data.
- High error on test data.
- Model predictions are overly simplified, missing
     important relationships.
Causes of Underfitting:
- Model complexity is too low.
- Insufficient features or lack of expressive power.
- Overly strong regularization that prevents the model from learning
     meaningful patterns.
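A small sketch of underfitting on assumed synthetic data: a straight line fitted to a clearly nonlinear target scores poorly on both the training and the test set, matching the characteristics listed above.

```python
# Underfitting: a linear model on a nonlinear (sinusoidal) target.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.1, size=200)  # nonlinear relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linear = LinearRegression().fit(X_train, y_train)

print("Train R^2:", linear.score(X_train, y_train))  # low: model too simple
print("Test R^2: ", linear.score(X_test, y_test))    # also low: underfitting
```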
The Trade-Off Between Overfitting and Underfitting
Model Complexity vs. Dataset Size:
- There is a balance
     or "sweet spot" to be found where the model is complex enough to
     explain the data but simple enough to avoid fitting noise.
- The relationship between model complexity and generalization (test) error
     typically forms a U-shaped curve: error first falls as complexity grows,
     then rises again once the model starts fitting noise.
Model Selection:
- Effective supervised learning requires choosing a model
     with the right level of complexity.
- Techniques include hyperparameter tuning (e.g., k in
     k-nearest neighbors), pruning in decision trees, regularization, and early
     stopping.
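A rough sketch of hyperparameter tuning as model selection: sweep k for k-nearest neighbors and compare training vs. test accuracy (the dataset and the list of k values are illustrative choices). Small k tends to overfit, very large k tends to underfit, and the best test score usually sits in between.

```python
# Sweep the k hyperparameter and watch training vs. test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 3, 5, 10, 30, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  train={knn.score(X_train, y_train):.3f}  "
          f"test={knn.score(X_test, y_test):.3f}")
```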
Impact of Scaling and Feature Engineering:
- Proper scaling and representation of the input features significantly
     affect how well a model generalizes, and can help reduce both overfitting
     and underfitting (see the sketch below).
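An illustrative sketch of the scaling point, assuming a scale-sensitive model (an SVM here) and a standard dataset; the Pipeline fits the scaler on the training data only, so the test set stays truly unseen.

```python
# Standardizing features often improves generalization for scale-sensitive models.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = SVC().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

print("Unscaled test accuracy:", raw.score(X_test, y_test))
print("Scaled test accuracy:  ", scaled.score(X_test, y_test))
```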
Strategies to Mitigate Overfitting and Underfitting
Mitigating Overfitting:
- Use simpler models.
- Apply regularization (L1/L2); a sketch follows this list.
- Use early stopping in iterative algorithms.
- Prune decision trees (pre-pruning or post-pruning).
- Increase the training data size.
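A hedged sketch of the regularization bullet above: Ridge (L2) and Lasso (L1) shrink the coefficients of an over-parameterized linear model. The synthetic data and alpha values are arbitrary illustrations, not recommendations.

```python
# Regularization reins in a linear model that has more features than samples.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Many features, few samples: a setting where plain least squares overfits.
X, y = make_regression(n_samples=60, n_features=100, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=1.0))]:
    model.fit(X_train, y_train)
    print(f"{name:10s} train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")
```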
Mitigating Underfitting:
- Use more complex models.
- Add more features or use feature engineering; a sketch follows this list.
- Reduce regularization.
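A sketch of the "add more features" bullet: the same linear model that underfits the raw input can fit well once polynomial features are added. The data here is synthetic and purely illustrative.

```python
# Feature engineering fixes underfitting: add polynomial features to a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)  # quadratic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

print("Plain linear test R^2:       ", plain.score(X_test, y_test))  # underfits
print("With polynomial features R^2:", poly.score(X_test, y_test))   # much better
```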
Summary
| Aspect | Overfitting | Underfitting |
| --- | --- | --- |
| Model Complexity | Too high | Too low |
| Training Performance | Very good | Poor |
| Test Performance | Poor | Poor |
| Cause | Learning noise; fitting outliers | Oversimplification; lack of expressive features |
| Example | Deep decision trees, k-NN with k=1 | Linear model on a nonlinear problem |
The ultimate goal is to find a model that generalizes well by balancing these extremes.
 
