1. Introduction to SVMs
- Support Vector Machines (SVMs) are supervised learning
     algorithms primarily used for classification (and regression with SVR).
- They aim to find the optimal separating hyperplane that maximizes
     the margin between classes for linearly separable data.
- Basic (linear) SVMs operate in the original feature
     space, producing linear decision boundaries.
2. Limitations of Linear SVMs
- Linear SVMs have limited flexibility as their decision boundaries
     are hyperplanes.
- Many real-world problems require more complex, non-linear decision boundaries that linear SVMs cannot provide (the sketch below illustrates this).
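To make this concrete, here is a minimal sketch (assuming scikit-learn and its synthetic make_circles dataset, chosen purely for illustration): a linear SVM fit to two concentric rings stays near chance accuracy, because no single hyperplane can separate the classes.

```python
# Sketch: a linear SVM struggles on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))  # close to chance level
```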
3. Kernel Trick: Overcoming Non-linearity
- To allow non-linear decision boundaries, SVMs exploit the
     kernel trick.
- The kernel trick implicitly maps input data into a higher-dimensional feature space
     where linear separation might be possible, without explicitly performing the costly mapping.
How the Kernel Trick Works:
- Instead of computing the coordinates of data points in
     high-dimensional space (which could be infinite-dimensional), SVM
     calculates inner products
     (similarity measures) directly using kernel functions.
- These inner products correspond to an implicit mapping
     into the higher-dimensional space.
- This keeps the computational cost manageable even when the implicit feature space is huge or infinite-dimensional (see the sketch below).
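A small numeric sketch of this idea, using an arbitrary degree-2 polynomial kernel: the kernel value equals the inner product of an explicit quadratic feature map, yet the kernel itself never constructs that map.

```python
# Sketch of the kernel trick: a kernel value equals an inner product in an
# implicit feature space, computed without ever building that space explicitly.
import numpy as np

def poly2_kernel(x, z):
    # Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

def explicit_phi(x):
    # The feature map that poly2_kernel corresponds to for 2-D inputs:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(poly2_kernel(x, z))                        # 121.0
print(np.dot(explicit_phi(x), explicit_phi(z)))  # 121.0 -- same value
```

For the RBF kernel discussed below, the corresponding feature space is infinite-dimensional, so only the kernel route is feasible at all.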
4. Types of Kernels
The most common kernels:
1. Polynomial Kernel
- Computes all polynomial combinations of features up to a
     specified degree.
- Enables capturing interactions and higher-order feature
     terms.
- Example: the corresponding feature space contains terms such as feature1² or feature1 × feature2⁵ (see the sketch below).
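A minimal sketch of fitting a polynomial-kernel SVM (assuming scikit-learn; the dataset, degree=3, and coef0=1 are illustrative choices, not recommendations):

```python
# Sketch: fitting an SVM with a polynomial kernel.
# degree controls the highest-order feature interactions that are captured.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
poly_svm = SVC(kernel="poly", degree=3, coef0=1, C=1.0).fit(X, y)
print("Training accuracy:", poly_svm.score(X, y))
```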
2. Radial Basis Function (RBF) Kernel (Gaussian Kernel)
- Corresponds to an infinite-dimensional feature space.
- Measures similarity based on the distance between points in the original space, decaying exponentially with the squared distance.
- Suitable when relationships are highly non-linear and not well captured by polynomial terms (see the sketch below).
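A small sketch of this similarity behaviour (the gamma value and the points are arbitrary): the RBF kernel value drops off exponentially as points move apart.

```python
# Sketch of the RBF (Gaussian) kernel: similarity decays exponentially with
# squared distance, so far-apart points contribute almost nothing.
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
for d in [0.5, 1.0, 2.0, 4.0]:
    z = np.array([d, 0.0])
    print(f"distance {d}: similarity {rbf_kernel(x, z):.4f}")
```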
5. Important Parameters in Kernelized SVMs
1. Regularization parameter (C)
- Controls the trade-off between maximizing the margin and
     minimizing classification error.
- A small C encourages a wider margin but allows some
     misclassifications (more regularization).
- A large C tries to classify all training points correctly but might overfit (the sketch below compares both settings).
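A rough sketch of this effect (assuming scikit-learn; the dataset and C values are arbitrary): a small C keeps many support vectors and a softer fit, while a large C pushes training accuracy up at the risk of overfitting.

```python
# Sketch: how C shifts the margin/accuracy trade-off.
# Small C -> wider margin, more support vectors, more tolerance for errors;
# large C -> tries to fit every training point, risking overfitting.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
for C in [0.01, 1.0, 100.0]:
    model = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C}: support vectors={model.n_support_.sum()}, "
          f"train accuracy={model.score(X, y):.2f}")
```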
2. Kernel choice
- Selecting the appropriate kernel function is critical
     (polynomial, RBF, linear, etc.).
- The choice depends on the data and problem structure.
3. Kernel-specific parameters
- Each kernel function has parameters:
- Polynomial kernel: the degree of the polynomial.
- RBF kernel: gamma (the width of the Gaussian; a higher gamma makes the kernel more local, so only very close points are treated as similar).
- These parameters govern the flexibility and complexity of the decision boundary and are usually tuned together with C (see the sketch below).
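In practice the kernel parameters and C are searched jointly; here is a minimal sketch with scikit-learn's GridSearchCV (the grid values and dataset are illustrative only):

```python
# Sketch: tuning C and gamma together with cross-validated grid search.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```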
6. Strengths and Weaknesses
Strengths
- Flexibility:
- SVMs can create complex, non-linear boundaries suitable for both low- and high-dimensional data.
- Effective in high dimensions:
- Works well even if the number of features exceeds the
     number of samples.
- Kernel trick:
- Avoids explicit computations in very high-dimensional
     spaces, saving computational resources.
Weaknesses
- Scalability:
- SVMs scale poorly with the number of samples.
- Practical for datasets up to ~10,000 samples; larger
     datasets increase runtime and memory significantly.
- Parameter tuning and preprocessing:
- Requires careful preprocessing (feature scaling is important) and tuning of C, the kernel, and kernel-specific parameters for good performance.
- Interpretability:
- The model is difficult to interpret; explaining why a particular prediction was made is challenging.
7. When to Use Kernelized SVMs?
- Consider kernelized SVMs if:
- Your features have similar scales or represent
     homogeneous measurements (e.g., pixel intensities).
- The dataset is not too large (under ~10,000 samples).
- You require powerful non-linear classification with well-separated classes (a typical workflow is sketched below).
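A sketch of a typical end-to-end workflow under those conditions (assuming scikit-learn; the breast-cancer dataset is just a stand-in): scale the features, then fit an RBF-kernel SVM, with both steps wrapped in a pipeline so the scaling statistics come only from the training split.

```python
# Sketch of a typical kernelized-SVM workflow: scale features, then fit an
# RBF-kernel SVM inside a pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```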
8. Mathematical Background (Overview)
- The underlying math is involved and detailed in advanced
     texts such as The Elements of
     Statistical Learning by Hastie, Tibshirani, and Friedman.
- Conceptually:
- The primal optimization problem tries to maximize the
     margin while penalizing misclassifications.
- The dual problem allows the introduction of kernels, enabling use of the kernel trick (both formulations are sketched below).
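For reference, a sketch of the two formulations in standard notation (soft-margin primal and its dual; n is the number of training points, ξᵢ the slack variables, αᵢ the dual coefficients):

```latex
% Soft-margin primal: maximize the margin (minimize ||w||) while penalizing
% slack variables xi_i that let some points violate the margin.
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.

% Dual: data enter only through inner products, which can be replaced by a
% kernel K(x_i, x_j) -- this is where the kernel trick enters.
\max_{\alpha}\;\; \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n}\alpha_i y_i = 0.
```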
Summary
| Aspect | Description |
| --- | --- |
| Purpose | Classification with linear or non-linear decision boundaries |
| Key idea | Map data to a higher-dimensional space via kernels (kernel trick) |
| Common kernels | Polynomial, RBF (Gaussian) |
| Parameters | Regularization C, kernel type, kernel-specific params (degree, gamma) |
| Strengths | Flexible decision boundaries, works well in high dimensions |
| Weaknesses | Poor scaling to large datasets, requires tuning, less interpretable |
| Use cases | Data with uniform feature scaling, moderate-size datasets |
 
