1. Classification Problem
- Definition: Classification is a supervised learning task where the output variable y is discrete-valued rather than continuous.
- In particular, consider binary classification where y ∈ {0, 1} (e.g., spam detection: spam = 1, not spam = 0).
- Each training example is a pair (x(i), y(i)), where x(i) ∈ R^d is a feature vector and y(i) is the label.
2. Why Not Use Linear Regression for Classification?
- Linear regression predicts continuous values, which is problematic for classification because the prediction can fall outside [0, 1].
- For example, predicting y ≈ 1.5 or y ≈ −0.2 is meaningless when y is binary.
- Instead, we want the output hθ(x) to be interpretable as the probability that y = 1 given x.
3. Logistic Regression Model
Hypothesis:
hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)),
where:
- g(z) = 1 / (1 + e^(−z)) is the sigmoid function, which maps any real value to the interval (0, 1).
- θ ∈ R^(d+1) is the parameter vector (including the intercept term).
- hθ(x) can be interpreted as the estimated probability P(y = 1 ∣ x; θ).
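As a minimal NumPy sketch of the hypothesis (the function names `sigmoid` and `h` are illustrative, not from the notes; the intercept term is assumed to be folded into x as a constant feature x₀ = 1):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x): estimated P(y = 1 | x; theta).
    Assumes x already includes the intercept feature x_0 = 1."""
    return sigmoid(theta @ x)
```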
Decision Boundary:
- Predict y = 1 if hθ(x) ≥ 0.5; otherwise, predict y = 0.
- Since g(z) ≥ 0.5 exactly when z ≥ 0, the decision boundary corresponds to θᵀx = 0, which is a linear boundary in input space.
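A minimal sketch of this rule (the helper name `predict` is illustrative; X is assumed to be an n × d matrix with one example per row, intercept feature included):

```python
import numpy as np

def predict(theta, X):
    """Predict 0/1 labels for the rows of X: y = 1 iff theta^T x >= 0,
    which is equivalent to h_theta(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```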
4. Loss Function and Cost Function
Probability Model:
- Logistic regression models the conditional probability directly:
P(y = 1 ∣ x; θ) = hθ(x),  P(y = 0 ∣ x; θ) = 1 − hθ(x).
- Equivalently, the likelihood of a data point (x(i), y(i)) is:
p(y(i) ∣ x(i); θ) = (hθ(x(i)))^y(i) · (1 − hθ(x(i)))^(1 − y(i)).
Cost (Loss) Function:
- Use the negative log-likelihood (cross-entropy loss) as the cost per example:
J(i)(θ) = −[y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i)))].
- Overall cost function (average over n examples):
J(θ) = (1/n) ∑_(i=1)^n J(i)(θ).
- This loss is convex in θ, enabling efficient optimization.
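A minimal sketch of this cost in NumPy (the name `cross_entropy_cost` and the `eps` clipping parameter are illustrative additions; clipping keeps log() finite when the sigmoid saturates):

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """Average cross-entropy J(theta) over the n rows of X (labels y in {0, 1}).
    eps clips probabilities away from 0 and 1 so the logs stay finite."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every row
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```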
5. Training Logistic Regression
- Use methods such as gradient descent or more advanced optimization (Newton's method, quasi-Newton methods) to minimize the cost J(θ).
- The gradient of the cost function is:
∇θ J(θ) = (1/n) ∑_(i=1)^n (hθ(x(i)) − y(i)) x(i).
- Update rule for gradient descent:
θ := θ − α ∇θ J(θ),
where α is the learning rate.
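The following sketch implements plain batch gradient descent with this gradient and update rule (the name `fit_logistic` and the default values of α and the iteration count are illustrative, not tuned recommendations):

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression.
    X: (n, d) design matrix (first column all ones for the intercept),
    y: (n,) array of 0/1 labels, alpha: learning rate."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x(i)) for all i
        grad = X.T @ (p - y) / n                # (1/n) sum (h - y) x
        theta -= alpha * grad                   # theta := theta - alpha * grad
    return theta
```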
6. Multi-class Classification
- When y ∈ {1, 2, ..., k} for k > 2, logistic regression generalizes to multinomial logistic regression, also called Softmax regression.
- The model outputs a vector h̄θ(x) ∈ R^k, called the logits.
- The Softmax function converts the logits into probabilities:
P(y = j ∣ x; θ) = exp(h̄θ(x)_j) / ∑_(s=1)^k exp(h̄θ(x)_s).
- The loss for example (x(i), y(i)) is the negative log-likelihood:
J(i)(θ) = −log P(y(i) ∣ x(i); θ).
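A minimal NumPy sketch of the Softmax probabilities and the per-example loss (the names `softmax` and `nll`, and the max-subtraction stabilization, are standard but not from the notes; classes are zero-indexed here):

```python
import numpy as np

def softmax(logits):
    """Map a length-k vector of logits to a probability vector.
    Subtracting the max logit first is the usual numerical-stability trick."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def nll(logits, j):
    """Negative log-likelihood -log P(y = j | x; theta) for true class j
    (j is the zero-based class index)."""
    return -np.log(softmax(logits)[j])
```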
7. Discriminative vs. Generative Classification Algorithms
- Discriminative algorithms (like logistic regression) model p(y ∣ x) directly, or learn a direct mapping from x to y.
- Generative algorithms model the joint distribution p(x, y) = p(x ∣ y) p(y).
- Example: Gaussian Discriminant Analysis (GDA).
- Logistic regression is an example of a discriminative approach, focusing purely on p(y ∣ x).
8. Linear Hypothesis Class and Decision Boundaries
- The logistic regression hypothesis class is:
H = {hθ : hθ(x) = 1{θᵀx ≥ 0}},
i.e., classifiers with linear decision boundaries.
- More generally, the hypothesis class can be extended to neural networks or other more complex architectures.
9. Perceptron Learning as a Contrast to Logistic Regression
- The perceptron also uses a linear classifier, but with a different loss and update rule.
- Logistic regression provides probabilistic outputs and optimizes a convex cost function, generally yielding better statistical properties.
10. Practical Considerations
- Feature scaling often improves numerical stability.
- Regularization (e.g., L2) is frequently added to the cost to prevent overfitting.
- Logistic regression handles input features linearly; non-linear decision boundaries require feature engineering or kernel methods.
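As an illustration of the L2-regularized cost (the name `cost_with_l2`, the λ/(2n) scaling convention, and leaving the intercept unpenalized are common choices, assumed here rather than taken from the notes):

```python
import numpy as np

def cost_with_l2(theta, X, y, lam=1.0, eps=1e-12):
    """Cross-entropy cost plus an L2 penalty (lam / (2n)) * ||theta[1:]||^2.
    The intercept theta[0] is conventionally left unregularized."""
    n = X.shape[0]
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), eps, 1.0 - eps)
    data_term = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return data_term + (lam / (2 * n)) * np.sum(theta[1:] ** 2)
```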
Summary:
Logistic regression is a fundamental classification algorithm that models the conditional probability of the positive class as a sigmoid of a linear function of the input features. It is trained by maximizing the likelihood (equivalently, minimizing the cross-entropy loss) and extends naturally to multi-class problems via Softmax. As a discriminative model, it focuses directly on p(y ∣ x) and yields linear decision boundaries, in contrast to generative models, which take the indirect route through the joint distribution p(x, y).