1. Problem Setup
- Definition: In multi-class classification, the goal is to assign an input x ∈ R^d to one of k classes or categories. The label y takes values in the set y ∈ {1, 2, ..., k}.
- Examples:
  - Email classification into three classes: spam, personal, and work-related.
  - Handwritten digit recognition, where k = 10.
2. Modeling Multi-class Classification
- Output representation: Unlike binary classification, where the output is a single scalar probability, in multi-class classification we model a probability distribution over k discrete classes: p(y = j ∣ x; θ) for j = 1, ..., k, where θ denotes the model parameters.
- Multinomial distribution: For a given x, the output distribution is modeled as a multinomial distribution over the k classes: p(y ∣ x; θ) = Multinomial(ϕ_1, ϕ_2, ..., ϕ_k), with parameters (probabilities) ϕ_j = p(y = j ∣ x; θ) satisfying ϕ_j ≥ 0 and ∑_{j=1}^{k} ϕ_j = 1.
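- Example: with k = 3, the model might output ϕ = (0.7, 0.2, 0.1); every entry is non-negative, the entries sum to 1, and class 1 is judged most likely. (The specific numbers here are purely illustrative.)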
3. Parameterization of the Model
- Parameter vectors: We have k parameter vectors θ_1, θ_2, ..., θ_k, with each θ_j ∈ R^d.
- Scores for each class: For an input x, compute the score for class j as s_j = θ_jᵀ x. These scores measure the model's confidence that x belongs to class j.
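The scores for all classes can be computed in a single matrix-vector product. Here is a minimal NumPy sketch (not from the notes) that assumes the k parameter vectors are stacked as the rows of a hypothetical k×d matrix `Theta`:

```python
import numpy as np

# Hypothetical setup: Theta stacks the k parameter vectors as rows (shape k x d),
# and x is a single input of dimension d.
k, d = 3, 4
rng = np.random.default_rng(0)
Theta = rng.normal(size=(k, d))   # rows are theta_1, ..., theta_k
x = rng.normal(size=d)

# Scores s_j = theta_j^T x, computed for all classes at once.
scores = Theta @ x                # shape (k,)
print(scores)
```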
4. The Softmax Function
- To convert the scores s_j into probabilities ϕ_j, we use the softmax function (a code sketch follows this list): ϕ_j = e^{s_j} / ∑_{l=1}^{k} e^{s_l}
- Properties of softmax:
  - It outputs a valid probability distribution (non-negative entries that sum to 1).
  - It emphasizes the highest-scoring classes exponentially, making them more likely.
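As referenced above, here is a minimal softmax sketch in NumPy. Subtracting the maximum score before exponentiating is an implementation detail I am adding for numerical stability; it leaves the result unchanged.

```python
import numpy as np

def softmax(scores):
    """Map class scores s_j to probabilities phi_j = e^{s_j} / sum_l e^{s_l}."""
    shifted = scores - np.max(scores)     # subtract the max score to avoid overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# The highest score gets the largest probability, and the outputs sum to 1.
print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.66, 0.24, 0.10]
```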
5. Loss Function: Cross-Entropy Loss
- Given training examples {(x^(i), y^(i))}_{i=1}^{n}, the loss function is the negative log-likelihood: L(θ) = −∑_{i=1}^{n} log p(y^(i) ∣ x^(i); θ)
- Plugging in the softmax probabilities (see the sketch after this list): L(θ) = −∑_{i=1}^{n} log [ e^{θ_{y^(i)}ᵀ x^(i)} / ∑_{j=1}^{k} e^{θ_jᵀ x^(i)} ]
- Goal: Minimize this negative log-likelihood (or, equivalently, maximize the likelihood) over θ_1, ..., θ_k.
6. Training via Gradient Descent
- Gradient computation: The gradient of the loss with respect to each parameter vector θ_j is: ∇_{θ_j} L = −∑_{i=1}^{n} x^(i) (1{y^(i) = j} − p(y = j ∣ x^(i); θ)), where 1{·} is the indicator function.
- Update rule: Parameters are moved in the direction opposite to the gradient, by an amount proportional to the learning rate η: θ_j ← θ_j − η ∇_{θ_j} L
7. Making Predictions
- Given a new input x, predict the class ŷ as: ŷ = argmax_{j ∈ {1, ..., k}} θ_jᵀ x
- This corresponds to selecting the class with the highest linear score, as in the sketch below.
8. Relationship to Binary Classification
- Softmax regression (the multi-class generalization) reduces to logistic regression for k = 2, where the softmax collapses to the sigmoid function. Dividing the numerator and denominator by e^{θ_1ᵀ x}: p(y = 1 ∣ x) = e^{θ_1ᵀ x} / (e^{θ_1ᵀ x} + e^{θ_2ᵀ x}) = 1 / (1 + e^{−(θ_1 − θ_2)ᵀ x})
9. Summary Points
- The multinomial logistic regression model classifies inputs into one of k classes.
- Each class gets its own parameter vector θ_j.
- The softmax function converts linear scores into probabilities.
- Training optimizes the cross-entropy loss via gradient methods.
- The decision boundary between classes is linear (or piecewise linear), since it depends on the linear functions θ_jᵀ x.
- This approach generalizes the binary logistic regression model in an intuitive way.
10. Additional Notes
- Multi-class perceptrons can be implemented similarly, by learning a separate weight vector per class and picking the highest-scoring class.
- More complex multi-class classifiers can use neural networks that learn non-linear functions before the softmax output layer, as in the sketch below.
 
