Logistic regression is a fundamental algorithm, widely used for binary and multi-class classification problems.
1. What is Logistic Regression?
Logistic regression is a supervised learning algorithm designed for classification tasks, especially binary classification where the response variable $y$ takes values in $\{0, 1\}$. Unlike linear regression, which predicts continuous outputs, logistic regression predicts the probability that an input $x$ belongs to the positive class ($y = 1$).
2. Hypothesis Function and Model Formulation
In logistic regression, the hypothesis function $h_\theta(x)$ models the probability $p(y = 1 \mid x; \theta)$ using the logistic (sigmoid) function applied to a linear combination of the input features:

$$h_\theta(x) = P(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^T x}}$$
where:
- $\theta \in \mathbb{R}^{d+1}$ are the parameters (weights),
- $x \in \mathbb{R}^{d+1}$ is the augmented feature vector (usually including a bias term),
- $\theta^T x$ is the linear predictor,
- the function $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic or sigmoid function.
This design ensures the output is always between 0 and 1, which can be interpreted as a probability.
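As a concrete illustration, here is a minimal NumPy sketch of the hypothesis function. The particular parameter values, the two-feature example, and the function names (`sigmoid`, `hypothesis`) are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = P(y = 1 | x; theta) for an augmented feature vector x."""
    return sigmoid(theta @ x)

# Example: two input features plus a leading bias term (x_0 = 1) -- illustrative values.
theta = np.array([0.5, -1.0, 2.0])   # parameters in R^(d+1)
x = np.array([1.0, 0.3, 0.8])        # augmented feature vector in R^(d+1)
print(hypothesis(theta, x))          # a probability strictly between 0 and 1
```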
3. Statistical Model and Bernoulli Distribution
Logistic regression assumes that the conditional distribution of $y$ given $x$ follows a Bernoulli distribution parameterized by $\phi = h_\theta(x)$:

$$y \mid x; \theta \sim \mathrm{Bernoulli}(h_\theta(x))$$

The expectation of $y$ is:

$$E[y \mid x; \theta] = \phi = h_\theta(x)$$
The use of the Bernoulli distribution leads naturally to the logistic function through the generalized linear model (GLM) framework and the exponential family of distributions (a short derivation is sketched after this list):
- The canonical response function for the Bernoulli distribution is the logistic sigmoid $g(\eta) = \frac{1}{1 + e^{-\eta}}$,
- The canonical link function is the inverse of the response function, $g^{-1}$.
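To make that connection concrete, here is a brief sketch: writing the Bernoulli pmf in exponential-family form exposes the natural parameter $\eta$, and inverting it recovers the sigmoid.

$$p(y; \phi) = \phi^{y} (1 - \phi)^{1 - y} = \exp\!\left( y \log\frac{\phi}{1 - \phi} + \log(1 - \phi) \right)$$

so the natural parameter is $\eta = \log\frac{\phi}{1 - \phi}$ (the log-odds), and solving for $\phi$ gives

$$\phi = \frac{1}{1 + e^{-\eta}} = g(\eta).$$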
4. Parameter Estimation via Maximum Likelihood
Parameters $\theta$ are typically estimated by maximizing the likelihood of the observed data, or equivalently, minimizing the negative log-likelihood (also called the cross-entropy loss). For training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, the loss for a single example is:

$$J^{(i)}(\theta) = -\log p(y^{(i)} \mid x^{(i)}; \theta) = -\left( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right)$$

and the total cost function is the average loss over all examples:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}(\theta)$$

The optimization is usually done using gradient descent or its variants.
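The following is a minimal NumPy sketch of the average cross-entropy loss $J(\theta)$. The toy data matrix, the small `eps` guard against $\log(0)$, and the function name `cross_entropy_loss` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    """Average negative log-likelihood J(theta) over n examples.

    X: (n, d+1) matrix of augmented feature vectors; y: (n,) vector of 0/1 labels.
    """
    p = sigmoid(X @ theta)          # h_theta(x^(i)) for every example
    eps = 1e-12                     # numerical guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy example with n = 4 examples and d = 2 features (plus a bias column).
X = np.array([[1.0, 0.2, 1.5],
              [1.0, 2.0, 0.3],
              [1.0, 1.1, 1.1],
              [1.0, 0.5, 0.4]])
y = np.array([1, 0, 1, 0])
theta = np.zeros(3)
print(cross_entropy_loss(theta, X, y))   # ~ log(2) when theta = 0
```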
5. Multi-class Logistic Regression (Softmax Regression)
For multi-class classification where $y \in \{1, 2, \ldots, k\}$, logistic regression generalizes to the softmax function, which maps the model outputs to a probability distribution over the $k$ classes. Let the model outputs be logits $\bar{h}_\theta(x) \in \mathbb{R}^k$, where each component corresponds to a class:

$$P(y = j \mid x; \theta) = \frac{\exp\left(\bar{h}_\theta(x)_j\right)}{\sum_{s=1}^{k} \exp\left(\bar{h}_\theta(x)_s\right)}$$

The loss function per training example is then the negative log-likelihood:

$$J^{(i)}(\theta) = -\log P(y^{(i)} \mid x^{(i)}; \theta) = -\log \frac{\exp\left(\bar{h}_\theta(x^{(i)})_{y^{(i)}}\right)}{\sum_{s=1}^{k} \exp\left(\bar{h}_\theta(x^{(i)})_s\right)}$$

The overall loss is again the average over all training examples.
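Here is a small NumPy sketch of the softmax mapping and the per-example negative log-likelihood. Note the code uses 0-indexed class labels (unlike the $1, \ldots, k$ convention above), and the specific logit values and function names are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Map a length-k vector of logits to a probability distribution over k classes."""
    z = logits - np.max(logits)      # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_nll(logits, label):
    """Per-example loss: negative log of the probability assigned to the true class."""
    return -np.log(softmax(logits)[label])

# Example with k = 3 classes; the logits h_bar_theta(x) would come from the model.
logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))               # probabilities summing to 1
print(softmax_nll(logits, label=0))  # loss when the true class is class 0
```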
6. Discriminative vs. Generative Learning Algorithms
Logistic regression is classified as a discriminative algorithm because it models $p(y \mid x)$ directly, learning the boundary between classes without modeling the data distribution $p(x)$. This contrasts with generative algorithms, which model $p(x \mid y)$ and $p(y)$ in order to classify.
7. Hypothesis Class and Decision Boundaries
The set of all classifiers corresponding to logistic regression forms the hypothesis class $\mathcal{H}$:

$$\mathcal{H} = \left\{ h_\theta : h_\theta(x) = \mathbb{1}\{\theta^T x \ge 0\} \right\}$$

Here, $\mathbb{1}\{\cdot\}$ denotes the indicator function (output is 1 if the condition holds, 0 otherwise). Note that thresholding the predicted probability at 0.5 is equivalent to checking the sign of $\theta^T x$, since $g(0) = 0.5$. The decision boundary is the hyperplane $\theta^T x = 0$, which is linear in the input space.
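A short snippet illustrating this equivalence, under the same illustrative parameter values used earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Hard classifier 1{theta^T x >= 0}; equivalent to sigmoid(theta^T x) >= 0.5."""
    return int(theta @ x >= 0)

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])
print(predict(theta, x), sigmoid(theta @ x) >= 0.5)   # both give the same decision
```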
8. Learning Algorithm
In practice, logistic regression parameters are learned by maximizing the likelihood, or equivalently minimizing the cross-entropy loss, using optimization algorithms such as batch gradient descent, stochastic gradient descent, or more advanced variants. The gradient of the loss with respect to $\theta$ can be computed explicitly; for the cross-entropy loss above it takes the simple form

$$\nabla_\theta J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)},$$

which enables efficient learning.
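Putting the pieces together, here is a minimal batch gradient descent sketch under the assumptions of the earlier examples (an augmented design matrix and 0/1 labels). The learning rate, iteration count, synthetic data, and the function name `fit_logistic_regression` are illustrative choices, not prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the average cross-entropy loss.

    X: (n, d+1) augmented design matrix; y: (n,) vector of 0/1 labels.
    Gradient: (1/n) * X^T (h_theta(X) - y).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ theta)          # predicted probabilities for all examples
        grad = X.T @ (p - y) / n        # gradient of J(theta)
        theta -= lr * grad              # gradient descent step
    return theta

# Toy data: label depends on whether x_1 + x_2 exceeds 1 (purely illustrative).
rng = np.random.default_rng(0)
X_raw = rng.uniform(size=(200, 2))
y = (X_raw.sum(axis=1) > 1.0).astype(float)
X = np.hstack([np.ones((200, 1)), X_raw])   # prepend the bias column
theta = fit_logistic_regression(X, y)
print(theta)
```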
9. Extensions and Relations to Other Learning Models
- Logistic regression can be derived as a Generalized Linear Model (GLM) where the link function is the logit (the inverse of the sigmoid).
- It is closely related to the perceptron algorithm and other linear classifiers, but unlike the perceptron, logistic regression outputs probabilities and has a probabilistic interpretation.
- Logistic regression models can be generalized further as components of neural network architectures, representing hypothesis classes of more complex models.