1. Problem Setup
· Definition: In multi-class classification, the goal is to assign an input x ∈ R^d to one of k classes or categories. The label y takes values in the set y ∈ {1, 2, ..., k}.
· Examples:
  · Email classification into three classes: spam, personal, and work-related.
  · Handwritten digit recognition, where k = 10.
2. Modeling Multi-class Classification
· Output Representation: Unlike binary classification, where the output is a single scalar probability, in multi-class classification we model a probability distribution over k discrete classes: p(y = j | x; θ) for j = 1, …, k, where θ denotes the model parameters.
· Multinomial Distribution: The output distribution for a given x is modeled as a multinomial distribution over the k classes: p(y | x; θ) = Multinomial(ϕ1, ϕ2, …, ϕk), with parameters (probabilities) ϕj = p(y = j | x; θ) satisfying ϕj ≥ 0 and ∑_{j=1}^{k} ϕj = 1.
3. Parameterization of the Model
· Parameter Vectors: We have k parameter vectors θ1, θ2, …, θk, with each θj ∈ R^d.
· Scores for each class: For an input x, compute the score for class j as s_j = θj^T x. These scores measure the model's confidence that x belongs to class j. (A small sketch of this computation follows below.)
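As a concrete illustration, all k scores can be computed in one matrix-vector product by stacking the parameter vectors into a k×d matrix. A minimal NumPy sketch (the names Theta and x are illustrative placeholders, not taken from any particular library):

```python
import numpy as np

# Stack the k parameter vectors theta_1, ..., theta_k as the rows of a (k, d) matrix.
k, d = 3, 4
Theta = np.random.randn(k, d)   # illustrative placeholder parameters
x = np.random.randn(d)          # a single input vector

# s_j = theta_j^T x for every class j, computed in one matrix-vector product.
scores = Theta @ x              # shape (k,)
print(scores)
```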
4. The Softmax Function
· To convert the scores s_j into probabilities ϕj, we use the softmax function (an implementation sketch follows this list):
  ϕj = e^{s_j} / ∑_{l=1}^{k} e^{s_l}
· Properties of Softmax:
  · Outputs a valid probability distribution (non-negative entries that sum to 1).
  · Exponentiates the scores, so the highest-scoring classes dominate the resulting probabilities.
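A small NumPy sketch of the softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores s_j to probabilities phi_j."""
    shifted = scores - np.max(scores)   # subtracting the max guards against overflow in exp
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)        # phi_j = e^{s_j} / sum_l e^{s_l}

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())               # non-negative entries that sum to 1
```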
5. Loss Function: Cross-Entropy Loss
· Given training examples {(x^(i), y^(i))}_{i=1}^{n}, the loss function is:
  L(θ) = −∑_{i=1}^{n} log p(y^(i) | x^(i); θ)
· Plugging in the softmax probabilities:
  L(θ) = −∑_{i=1}^{n} log [ e^{θ_{y^(i)}^T x^(i)} / ∑_{j=1}^{k} e^{θj^T x^(i)} ]
· Goal: Minimize this negative log-likelihood (or, equivalently, maximize the likelihood) over θ1, …, θk. (A vectorized computation of this loss is sketched below.)
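For concreteness, one way to compute this negative log-likelihood over a dataset is sketched below (NumPy, with labels assumed to be 0-indexed integers; Theta, X, and y are illustrative names):

```python
import numpy as np

def cross_entropy_loss(Theta, X, y):
    """L(theta) = -sum_i log p(y^(i) | x^(i); theta).

    Theta: (k, d) parameter matrix (one row per class),
    X: (n, d) inputs, y: (n,) integer labels in {0, ..., k-1} (0-indexed here).
    """
    scores = X @ Theta.T                                   # (n, k) matrix of theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_norm = np.log(np.exp(scores).sum(axis=1))          # log sum_j e^{theta_j^T x^(i)}
    log_probs = scores[np.arange(len(y)), y] - log_norm    # log phi_{y^(i)} for each example
    return -log_probs.sum()
```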
6. Training via Gradient Descent
· Gradient Computation: The gradient of the loss with respect to each parameter vector θj is:
  ∇_{θj} L = −∑_{i=1}^{n} x^(i) (1{y^(i) = j} − p(y = j | x^(i); θ))
  where 1{·} is the indicator function.
· Update Rule: Each parameter vector is updated in the direction opposite to its gradient, scaled by the learning rate η (a bare-bones training loop is sketched below):
  θj ← θj − η ∇_{θj} L
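Combining the gradient and the update rule, a batch gradient descent loop might look like the following sketch (illustrative NumPy code; the one-hot matrix makes the indicator term 1{y^(i) = j} explicit):

```python
import numpy as np

def train_softmax_regression(X, y, k, lr=0.1, n_steps=500):
    """Batch gradient descent on the cross-entropy loss.

    X: (n, d) inputs, y: (n,) integer labels in {0, ..., k-1}.
    Returns Theta of shape (k, d), one parameter vector per class.
    """
    n, d = X.shape
    Theta = np.zeros((k, d))
    Y = np.eye(k)[y]                                   # one-hot labels: Y[i, j] = 1{y^(i) = j}
    for _ in range(n_steps):
        scores = X @ Theta.T                           # (n, k) scores theta_j^T x^(i)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)      # p(y = j | x^(i); theta)
        grad = -(Y - probs).T @ X                      # rows are grad_{theta_j} L
        Theta -= lr * grad                             # theta_j <- theta_j - eta * grad
    return Theta
```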
7. Making Predictions
- Given a new input x, predict the class ŷ as:
  ŷ = argmax_{j ∈ {1,…,k}} θj^T x
- This corresponds to selecting the class with the highest linear score; since the softmax is monotonically increasing, the highest score also has the highest probability.
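A short prediction sketch under the same placeholder conventions as above:

```python
import numpy as np

def predict(Theta, X):
    """Return the highest-scoring class index for each row of X.

    The softmax is monotone, so taking the argmax of the raw scores theta_j^T x
    gives the same answer as taking the argmax of the probabilities.
    """
    return np.argmax(X @ Theta.T, axis=1)
```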
8. Relationship to Binary Classification
- Softmax regression (the multi-class generalization) reduces to logistic regression for k = 2, where the softmax collapses to the sigmoid function:
  p(y = 1 | x) = e^{θ1^T x} / (e^{θ1^T x} + e^{θ2^T x}) = 1 / (1 + e^{−(θ1 − θ2)^T x})
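A quick numeric check of this identity, using arbitrary illustrative values:

```python
import numpy as np

theta1, theta2 = np.array([1.0, -2.0]), np.array([0.5, 0.3])
x = np.array([0.7, 1.2])

# Two-class softmax probability of class 1 ...
softmax_p1 = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))
# ... equals the sigmoid applied to the score difference (theta1 - theta2)^T x.
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(theta1 - theta2) @ x))
print(np.isclose(softmax_p1, sigmoid_p1))   # True
```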
9. Summary Points
- The multinomial logistic regression model classifies inputs into one of k classes.
- Each class gets its own parameter vector θj.
- The softmax function converts linear scores into probabilities.
- Training optimizes the cross-entropy loss via gradient methods.
- The decision boundaries between classes are linear (piecewise linear overall), since they depend on the linear functions θj^T x.
- This approach generalizes binary logistic regression in an intuitive way.
10. Additional Notes
- Multi-class perceptrons can be implemented similarly, by learning a separate weight vector per class and predicting the class with the maximum score.
- More complex multi-class classifiers can use neural networks that learn non-linear feature representations before the softmax output layer.