Context
· Recall that in logistic regression, we model the probability of the binary label y ∈ {0, 1} given input x as hθ(x) = g(θᵀx) = 1 / (1 + exp(−θᵀx)), where g(·) is the sigmoid function.
· The log-likelihood for n data points is ℓ(θ) = ∑_{i=1}^{n} log p(y^(i) | x^(i); θ), where p(y = 1 | x; θ) = hθ(x) and p(y = 0 | x; θ) = 1 − hθ(x).
· Maximizing ℓ(θ) corresponds to minimizing the negative log-likelihood, which yields the logistic loss (a short code sketch follows this list).
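For concreteness, here is a minimal NumPy sketch of the sigmoid and this log-likelihood. It assumes X is an n×d design matrix, y a length-n vector of 0/1 labels, and theta a length-d parameter vector; these names are illustrative, not from the source.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    h = sigmoid(X @ theta)  # predicted probabilities h_theta(x^(i))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```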
Common Approach: Gradient Ascent/Descent
- Typically, logistic regression parameters are estimated by gradient ascent on ℓ(θ) (equivalently, gradient descent on the negative log-likelihood), updating the parameters iteratively based on the gradient of ℓ(θ).
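As a rough illustration (not the source's own code), a plain gradient-ascent loop for ℓ(θ) might look as follows, reusing numpy and the sigmoid helper from the sketch above; the learning rate and iteration count are arbitrary placeholders.

```python
def gradient_ascent(X, y, lr=0.1, n_iters=1000):
    """Maximize l(theta) by gradient ascent: theta <- theta + lr * grad l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))  # grad l(theta) = X^T (y - h)
        theta = theta + lr * grad
    return theta
```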
Alternative Algorithm: Newton's Method (Iteratively Reweighted Least Squares - IRLS)
· Another algorithm for maximizing ℓ(θ) relies on second-order information, specifically the Hessian matrix of second derivatives of ℓ(θ).
· This method is Newton's method applied to maximizing the log-likelihood; for logistic regression it is often called Iteratively Reweighted Least Squares (IRLS).
Key Idea of the Algorithm
· Newton's method iteratively updates the parameters by θ^(t+1) = θ^(t) − H^{-1}(θ^(t)) ∇ℓ(θ^(t)), where
· ∇ℓ(θ^(t)) is the gradient of the log-likelihood at iteration t, and
· H(θ^(t)) is the Hessian matrix of its second derivatives at iteration t.
· The Hessian captures the curvature of the log-likelihood, enabling more efficient and often faster convergence (in iteration count) than gradient ascent alone (see the sketch after this list).
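A hedged sketch of a single Newton update for this log-likelihood, under the same X, y, theta conventions and sigmoid helper as before: it forms the gradient Xᵀ(y − h) and the Hessian −XᵀWX with W = diag(h(1 − h)), and solves a linear system instead of explicitly inverting H. This is an illustrative implementation, not code from the source.

```python
def newton_step(theta, X, y):
    """One Newton update theta - H^{-1}(theta) grad l(theta) for the log-likelihood."""
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)                     # gradient of l(theta)
    H = -(X.T * (h * (1.0 - h))) @ X         # Hessian of l(theta): -X^T W X
    return theta - np.linalg.solve(H, grad)  # solve for H^{-1} grad without forming H^{-1}
```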
Logistic Regression (IRLS) Specifics
- At each iteration (a code sketch follows this list):
  - Compute the predicted probabilities under the current parameters: p_i^(t) = hθ^(t)(x^(i)).
  - Construct a diagonal weight matrix W with entries W_ii = p_i^(t)(1 − p_i^(t)).
  - Calculate the adjusted response variable z = Xθ^(t) + W^{-1}(y − p^(t)).
  - Update θ by solving the weighted least squares problem θ^(t+1) = (XᵀWX)^{-1}XᵀWz.
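Below is a minimal IRLS loop that follows the four steps above, again assuming the X, y conventions and sigmoid helper from the earlier sketches. It is a sketch rather than a production implementation: in particular, the weights W_ii approach zero when predicted probabilities saturate, which a robust implementation would need to guard against.

```python
def irls(X, y, n_iters=25, tol=1e-8):
    """IRLS for logistic regression: repeat weighted least squares until convergence."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)       # step 1: p_i = h_theta(x^(i))
        w = p * (1.0 - p)            # step 2: diagonal entries W_ii
        z = X @ theta + (y - p) / w  # step 3: adjusted response z
        WX = X * w[:, None]          # rows of X scaled by W_ii, i.e. W X
        # step 4: theta = (X^T W X)^{-1} X^T W z via a linear solve
        theta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```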
Benefits
- Faster convergence near the optimum, because Newton's method uses curvature information.
- Quadratic convergence once close to the solution, compared to the linear convergence of gradient methods.
Tradeoffs
- Involves computing and inverting the Hessian matrix, which is computationally expensive for large datasets or high-dimensional feature spaces.
- Gradient-based methods (stochastic gradient descent, mini-batch methods) are more scalable in big-data settings.
Summary
- The alternative algorithm for maximizing ℓ(θ) is Newton's method (IRLS), which uses both first and second derivatives.
- At each iteration, the parameters are updated by solving a weighted least squares problem built from the gradient and Hessian of the log-likelihood.
- For logistic regression model fitting, this typically converges in far fewer iterations than first-order methods.