1. Introduction to SVMs
- Support Vector Machines (SVMs) are supervised learning
algorithms primarily used for classification (and regression with SVR).
- They aim to find the optimal separating hyperplane that maximizes
the margin between classes for linearly separable data.
- Basic (linear) SVMs operate in the original feature space, producing linear decision boundaries (see the minimal sketch below).
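A minimal sketch of a linear SVM, assuming scikit-learn; the toy dataset and parameters are chosen purely for illustration:

```python
# Minimal sketch: a linear SVM on a linearly separable toy dataset.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a linear decision boundary is enough.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear")   # separating hyperplane in the original feature space
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```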
2. Limitations of Linear SVMs
- Linear SVMs have limited flexibility as their decision boundaries
are hyperplanes.
- Many real-world problems require more complex, non-linear decision boundaries that a linear SVM cannot provide, as the sketch below illustrates.
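To make the limitation concrete, here is a rough sketch (toy data chosen for illustration) in which a linear SVM performs near chance level:

```python
# Illustrative sketch: a linear SVM on data that is not linearly separable
# (two concentric circles). The low score shows the limitation.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
print("linear kernel accuracy:", linear_svm.score(X, y))  # typically close to chance level
```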
3. Kernel Trick: Overcoming Non-linearity
- To allow non-linear decision boundaries, SVMs exploit the
kernel trick.
- The kernel trick implicitly maps input data into a higher-dimensional feature space
where linear separation might be possible, without explicitly performing the costly mapping.
How the Kernel Trick Works:
- Instead of computing the coordinates of data points in the high-dimensional space (which could be infinite-dimensional), the SVM calculates inner products (similarity measures) directly using kernel functions.
- These inner products correspond to an implicit mapping
into the higher-dimensional space.
- This avoids explicitly computing the high-dimensional representation, which keeps the computational cost manageable even when that space is infinite-dimensional (a small numeric sketch follows this list).
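A small numeric sketch of this idea, using a degree-2 polynomial kernel and an explicit feature map chosen for illustration:

```python
# Sketch of the kernel trick: a degree-2 polynomial kernel on 2-D inputs equals
# an ordinary dot product after the explicit mapping
#   phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def poly2_kernel(x, z):
    # k(x, z) = (x . z)^2, computed directly in the original 2-D space
    return np.dot(x, z) ** 2

def phi(x):
    # explicit map into the 3-D feature space (never needed by the SVM itself)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(poly2_kernel(x, z))          # 16.0
print(np.dot(phi(x), phi(z)))      # 16.0 -- same value, no explicit mapping required
```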
4. Types of Kernels
The most common kernels:
1. Polynomial Kernel
- Computes all polynomial combinations of features up to a
specified degree.
- Enables capturing interactions and higher-order feature
terms.
- Example: the implicit feature space contains terms such as feature1², feature1 × feature2⁵, and so on.
2. Radial Basis Function (RBF) Kernel (Gaussian Kernel)
- Corresponds to an infinite-dimensional feature space.
- Measures similarity based on the distance between points
in original space, decreasing exponentially with distance.
- Suitable when relationships are highly non-linear and not well captured by polynomial terms (a comparison sketch of both kernels follows this list).
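A comparison sketch of the two kernels on non-linearly separable toy data; the dataset and parameter values are illustrative, not tuned recommendations:

```python
# Polynomial and RBF kernels on the concentric-circles data from above.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

for kernel, params in [("poly", {"degree": 2}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# Both non-linear kernels should reach high accuracy here, unlike the linear kernel.
```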
5. Important Parameters in Kernelized SVMs
1. Regularization parameter (C)
- Controls the trade-off between maximizing the margin and
minimizing classification error.
- A small C encourages a wider margin but allows some
misclassifications (more regularization).
- A large C tries to classify all training points correctly
but might overfit.
2. Kernel choice
- Selecting the appropriate kernel function is critical
(polynomial, RBF, linear, etc.).
- The choice depends on the data and problem structure.
3. Kernel-specific parameters
- Each kernel function has parameters:
- Polynomial kernel: degree of polynomial.
- RBF kernel: gamma (width of the Gaussian; a higher gamma means a narrower Gaussian, so each training point's influence is more local and the decision boundary can become more complex).
- These parameters govern the flexibility and complexity of the decision boundary (a tuning sketch follows this list).
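A tuning sketch, assuming scikit-learn's GridSearchCV; the dataset and grid values below are illustrative, not recommendations:

```python
# Grid search over C and gamma for an RBF SVM.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test set accuracy:", search.score(X_test, y_test))
```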
6. Strengths and Weaknesses
Strengths
- Flexibility:
- SVMs can create complex, non-linear boundaries suitable for both low- and high-dimensional data.
- Effective in high dimensions:
- Works well even if the number of features exceeds the
number of samples.
- Kernel trick:
- Avoids explicit computations in very high-dimensional
spaces, saving computational resources.
Weaknesses
- Scalability:
- SVMs scale poorly with the number of samples.
- Practical for datasets up to ~10,000 samples; larger
datasets increase runtime and memory significantly.
- Parameter tuning and preprocessing:
- Requires careful preprocessing (feature scaling is important) and tuning of C, the kernel, and kernel-specific parameters for good performance (see the scaling sketch after this list).
- Interpretability:
- Model is difficult to interpret; explaining why a
prediction was made is challenging.
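A sketch of the preprocessing point, assuming scikit-learn pipelines; the scaler and dataset are chosen for illustration:

```python
# Feature scaling inside a pipeline, so the scaler is fit only on the training split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw_svm = SVC(kernel="rbf").fit(X_train, y_train)
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

# Scaling typically improves (and rarely hurts) kernelized SVM accuracy,
# because the RBF kernel is sensitive to the relative scale of features.
print("without scaling:", raw_svm.score(X_test, y_test))
print("with scaling:   ", scaled_svm.score(X_test, y_test))
```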
7. When to Use Kernelized SVMs?
- Consider kernelized SVMs if:
- Your features have similar scales or represent
homogeneous measurements (e.g., pixel intensities).
- The dataset is not too large (under ~10,000 samples).
- You require powerful non-linear classification with
well-separated classes.
8. Mathematical Background (Overview)
- The underlying math is involved and detailed in advanced
texts such as The Elements of
Statistical Learning by Hastie, Tibshirani, and Friedman.
- Conceptually:
- The primal optimization problem tries to maximize the
margin while penalizing misclassifications.
- The dual problem allows the introduction of kernels, enabling use of the kernel trick (a sketch of both formulations follows).
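As a rough sketch in standard textbook notation (the symbols below are common conventions, not necessarily those of the cited text):

```latex
% Primal (soft margin): maximize the margin (minimize ||w||) while
% penalizing margin violations xi_i, weighted by C.
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.

% Dual: the data enter only through inner products, which can be replaced
% by a kernel k(x_i, x_j) -- this is where the kernel trick enters.
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0.
```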
Summary
| Aspect | Description |
|---|---|
| Purpose | Classification with linear or non-linear decision boundaries |
| Key idea | Map data to a higher-dimensional space via kernels (the kernel trick) |
| Common kernels | Polynomial, RBF (Gaussian) |
| Parameters | Regularization C, kernel type, kernel-specific params (degree, gamma) |
| Strengths | Flexible decision boundaries, works well in high dimensions |
| Weaknesses | Poor scaling to large datasets, requires tuning, less interpretable |
| Use cases | Data with uniform feature scaling, moderate-size datasets |