1. Problem Setup
- Definition: In multi-class classification, the goal is to assign an input x ∈ R^d to one of k classes or categories. The label y takes values in the set y ∈ {1, 2, ..., k}.
- Examples:
  - Email classification into three classes: spam, personal, and work-related.
  - Handwritten digit recognition, where k = 10.
2. Modeling Multi-class Classification
- Output representation: Unlike binary classification, where the output is a single scalar probability, in multi-class classification we model a probability distribution over k discrete classes: p(y = j ∣ x; θ) for j = 1, ..., k, where θ denotes the model parameters.
- Multinomial distribution: For a given x, the output distribution is modeled as a multinomial distribution over the k classes: p(y ∣ x; θ) = Multinomial(ϕ_1, ϕ_2, ..., ϕ_k), with parameters (probabilities) ϕ_j = p(y = j ∣ x; θ) satisfying ϕ_j ≥ 0 and ∑_{j=1}^{k} ϕ_j = 1.
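- Example: with k = 3, the model might output ϕ = (0.7, 0.2, 0.1); every entry is non-negative, the entries sum to 1, and class 1 is judged most likely. (The specific numbers here are purely illustrative.)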
3. Parameterization of the Model
- Parameter vectors: We have k parameter vectors θ_1, θ_2, ..., θ_k, with each θ_j ∈ R^d.
- Scores for each class: For an input x, compute the score for class j as s_j = θ_jᵀ x. These scores measure the model's confidence that x belongs to class j.
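The scores for all classes can be computed in a single matrix-vector product. Here is a minimal NumPy sketch (not from the notes) that assumes the k parameter vectors are stacked as the rows of a hypothetical k×d matrix `Theta`:

```python
import numpy as np

# Hypothetical setup: Theta stacks the k parameter vectors as rows (shape k x d),
# and x is a single input of dimension d.
k, d = 3, 4
rng = np.random.default_rng(0)
Theta = rng.normal(size=(k, d))   # rows are theta_1, ..., theta_k
x = rng.normal(size=d)

# Scores s_j = theta_j^T x, computed for all classes at once.
scores = Theta @ x                # shape (k,)
print(scores)
```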
4. The Softmax Function
- To convert the scores s_j into probabilities ϕ_j, we use the softmax function (a code sketch follows this list): ϕ_j = e^{s_j} / ∑_{l=1}^{k} e^{s_l}
- Properties of softmax:
  - It outputs a valid probability distribution (non-negative entries that sum to 1).
  - It emphasizes the highest-scoring classes exponentially, making them more likely.
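As referenced above, here is a minimal softmax sketch in NumPy. Subtracting the maximum score before exponentiating is an implementation detail I am adding for numerical stability; it leaves the result unchanged.

```python
import numpy as np

def softmax(scores):
    """Map class scores s_j to probabilities phi_j = e^{s_j} / sum_l e^{s_l}."""
    shifted = scores - np.max(scores)     # subtract the max score to avoid overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# The highest score gets the largest probability, and the outputs sum to 1.
print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.66, 0.24, 0.10]
```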
5. Loss Function: Cross-Entropy Loss
- Given training examples {(x^(i), y^(i))}_{i=1}^{n}, the loss function is the negative log-likelihood: L(θ) = −∑_{i=1}^{n} log p(y^(i) ∣ x^(i); θ)
- Plugging in the softmax probabilities (see the sketch after this list): L(θ) = −∑_{i=1}^{n} log [ e^{θ_{y^(i)}ᵀ x^(i)} / ∑_{j=1}^{k} e^{θ_jᵀ x^(i)} ]
- Goal: Minimize this negative log-likelihood (or, equivalently, maximize the likelihood) over θ_1, ..., θ_k.
6. Training via Gradient Descent
- Gradient computation: The gradient of the loss with respect to each parameter vector θ_j is: ∇_{θ_j} L = −∑_{i=1}^{n} x^(i) (1{y^(i) = j} − p(y = j ∣ x^(i); θ)), where 1{·} is the indicator function.
- Update rule: Parameters are moved in the direction opposite to the gradient, by an amount proportional to the learning rate η: θ_j ← θ_j − η ∇_{θ_j} L
7. Making Predictions
- Given a new input x, predict the class ŷ as: ŷ = argmax_{j ∈ {1, ..., k}} θ_jᵀ x
- This corresponds to selecting the class with the highest linear score, as in the sketch below.
8. Relationship to Binary Classification
- Softmax regression (the multi-class generalization) reduces to logistic regression for k = 2, where the softmax collapses to the sigmoid function. Dividing the numerator and denominator by e^{θ_1ᵀ x}: p(y = 1 ∣ x) = e^{θ_1ᵀ x} / (e^{θ_1ᵀ x} + e^{θ_2ᵀ x}) = 1 / (1 + e^{−(θ_1 − θ_2)ᵀ x})
9. Summary Points
- The multinomial logistic regression model classifies inputs into one of k classes.
- Each class gets its own parameter vector θ_j.
- The softmax function converts linear scores into probabilities.
- Training optimizes the cross-entropy loss via gradient methods.
- The decision boundary between classes is linear (or piecewise linear), since it depends on the linear functions θ_jᵀ x.
- This approach generalizes the binary logistic regression model in an intuitive way.
10. Additional Notes
- Multi-class perceptrons can be implemented similarly, by learning a separate weight vector per class and picking the highest-scoring class.
- More complex multi-class classifiers can use neural networks that learn non-linear functions before the softmax output layer, as in the sketch below.
 
