Cost Function And Hypothesis for Logistic Regression

Hypothesis

We want a hypothesis that is bounded between zero and one; the straight line of a linear regression hypothesis extends beyond these limits. The hypothesis here also represents the probability of observing an outcome.

ISLR and Andrew Ng both use the logistic (sigmoid) function as the hypothesis.
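The original equation images are not reproduced here, so the following is a reconstruction of the standard forms rather than a copy of the post's figures. It assumes a single predictor X with coefficients b_0 and b_1 (ISLR's beta_0 and beta_1), and Ng's notation with parameter vector theta and feature vector x.

```latex
% ISLR form: probability of the positive class given X
p(X) = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}

% Andrew Ng form: sigmoid g applied to a linear score
h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}

% The two coincide for \theta = (b_0, b_1)^{T} and x = (1, X)^{T}:
% multiply numerator and denominator of p(X) by e^{-(b_0 + b_1 X)}.
```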

Odds and log-odds/logit

The odds of an outcome are p(x) / (1 - p(x)), and the log of the odds is called the logit. The model is therefore logit(p(x)) = log(p(x) / (1 - p(x))) = b0 + b1X.

In GLM terms this is the logit link function. The logit maps the open interval (0, 1) onto (-inf, inf). A plain log does not have that property: it maps (0, 1) only onto (-inf, 0).
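A small sketch of the range argument, using only the standard library. The function names `logit` and `sigmoid` are chosen here for illustration and are not from the original post.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z: float) -> float:
    """Inverse of logit: maps any real z back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# log(p) is never positive on (0, 1); logit(p) covers both signs.
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(f"p={p:<6} log(p)={math.log(p):8.3f}  logit(p)={logit(p):8.3f}")
```

Because the logit covers the whole real line, it can be set equal to an unbounded linear predictor such as b0 + b1X.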

In linear regression, b1 gives the average change in y for a unit change in x. Here, a unit increase in x changes the log-odds by b1, which multiplies the odds by exp(b1). The resulting change in p(x) depends on the current value of the odds, and is therefore not linear in x.
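To make the interpretation concrete, here is a minimal sketch with made-up coefficients (the values of `b0` and `b1` are assumptions for illustration only). The odds ratio for a unit step in x is the same everywhere, while the change in probability is not.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.0, 0.7

def prob(x: float) -> float:
    """p(x) under the logistic model with the coefficients above."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x: float) -> float:
    """Odds p / (1 - p) at x."""
    p = prob(x)
    return p / (1.0 - p)

for x in (0.0, 1.0, 4.0):
    ratio = odds(x + 1.0) / odds(x)      # constant, equal to exp(b1)
    delta_p = prob(x + 1.0) - prob(x)    # varies with x
    print(f"x={x}: odds ratio={ratio:.4f}, change in p={delta_p:.4f}")
```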

Cost Function

From the ISLR perspective, the coefficients are chosen to maximize the likelihood of the observed data.

Andrew Ng approaches it from the perspective of modifying the cost function of linear regression, because squared error combined with the sigmoid hypothesis gives a non-convex cost.

The two views agree: Andrew Ng's cost function is the negative log-likelihood averaged over the training examples, so minimizing it is the same as maximizing ISLR's log-likelihood.
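The equivalence can be written out as follows. This is a reconstruction in standard notation, not the post's original figures; it assumes labels y in {0, 1}, n training examples in the ISLR form and m in Ng's form (with m = n), and h_theta(x) = p(x).

```latex
% ISLR: likelihood of the observed labels
\ell(b_0, b_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)

% Taking logs turns the product into a sum
\log \ell(b_0, b_1) = \sum_{i=1}^{n} \Bigl[ y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]

% Andrew Ng: cost function to minimize
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

% Hence J(\theta) = -\frac{1}{m} \log \ell, and
% \arg\min_\theta J(\theta) = \arg\max_\theta \log \ell.
```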

Least squares in linear regression is likewise a special case of maximum likelihood: if the errors are assumed to be Gaussian, maximizing the likelihood yields the least-squares solution.
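A short sketch of that derivation, assuming independent errors with zero mean and a common variance sigma^2 (the symbols epsilon_i and sigma are introduced here for illustration):

```latex
% Model with Gaussian noise
y_i = b_0 + b_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Log-likelihood of n independent observations
\log L(b_0, b_1) = -\frac{n}{2} \log(2 \pi \sigma^2)
                   - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2

% The first term does not depend on b_0, b_1, so maximizing \log L
% is the same as minimizing the residual sum of squares:
\arg\max_{b_0, b_1} \log L = \arg\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
```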

Logistic Regression for Well-Separated Classes

Maximum likelihood estimation becomes unstable when classes are well separated. While this is acceptable for classification tasks, it is not ideal for risk estimation, because the predicted probabilities are pushed to 0 and 1. Regularization techniques can help avoid this instability. Support Vector Machines (SVM) also perform well for well-separated classes. As illustrated in [1], the fitted sigmoid approaches a step function: the intercept goes to -inf and the slope goes to inf, because the likelihood keeps improving as the coefficients grow. In effect, the separating feature receives an infinitely large weight.
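The divergence can be seen on a toy problem. The following is a minimal sketch, not the setup from [1]: the four data points, the learning rate, and the `fit` helper are all assumptions made for illustration. It runs plain gradient descent on the mean negative log-likelihood, optionally with an L2 penalty on the slope.

```python
import math

# Hypothetical 1-D data, perfectly separated at x = 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(steps: int, l2: float = 0.0, lr: float = 0.5):
    """Gradient descent on mean negative log-likelihood + (l2 / 2) * b1**2."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / n
        g1 = sum(e * x for e, x in zip(errors, xs)) / n + l2 * b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Without a penalty the slope keeps growing with more iterations;
# with a penalty it settles at a finite value.
for steps in (100, 1000, 10000):
    print(steps, "plain:", fit(steps)[1], "L2:", fit(steps, l2=0.1)[1])
```

The data here are symmetric around zero, so the intercept stays near zero and only the slope grows; with asymmetric data the intercept would drift as well.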

Why can’t we use square loss?

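Labels are +1 and -1.
Prediction: +1 when w*x > 0, else -1.
Consider this scenario [2]:
- The label is +1
- w*x = 100
- The prediction is correct
- Square loss still charges (100 - 1)^2 = 9801, penalising a confident, correct prediction
Hinge loss and logistic loss fix this: both depend on y * (w*x), so a large score with the correct sign costs little or nothing.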
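The scenario above can be checked numerically. This is a minimal sketch using the usual textbook definitions of the three losses for labels in {+1, -1}; the function names are chosen here for illustration.

```python
import math

def square_loss(y: float, score: float) -> float:
    """Squared difference between the raw score w*x and the label."""
    return (score - y) ** 2

def hinge_loss(y: float, score: float) -> float:
    """max(0, 1 - y * score): zero once the margin is at least 1."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y: float, score: float) -> float:
    """log(1 + exp(-y * score)), computed in a numerically stable way."""
    m = y * score
    if m > 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

y, score = 1.0, 100.0  # the scenario from the text: correct and confident
print("square  :", square_loss(y, score))
print("hinge   :", hinge_loss(y, score))
print("logistic:", logistic_loss(y, score))
```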

Why is logistic loss not 0 for a correct prediction?
- There are many training samples, and we want parameters that balance all of them
- If the loss were exactly 0, correctly classified examples would no longer contribute to it (as in SVM)
  - That is why hinge loss creates support vectors. More about that in the SVM blog post
- This is also the property that drives the coefficients higher for well-separated classes
  - To control that we use regularisation
  - Regularisation applies during training, not prediction; it only changes the loss function
  - The prediction is still +1 if w*x > 0, else -1
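The contrast with hinge loss in the list above shows up in the gradients. Below is a small sketch, assuming the standard definitions hinge = max(0, 1 - m) and logistic = log(1 + exp(-m)), where m = y * (w*x) is the margin; the helper names are illustrative.

```python
import math

def hinge_grad(margin: float) -> float:
    """Derivative of max(0, 1 - margin) w.r.t. the margin (taking 0 at the kink)."""
    return -1.0 if margin < 1.0 else 0.0

def logistic_grad(margin: float) -> float:
    """Derivative of log(1 + exp(-margin)) w.r.t. the margin."""
    return -1.0 / (1.0 + math.exp(margin))

# Past margin 1 the hinge gradient vanishes, so those points stop
# influencing the fit; the logistic gradient shrinks but never hits zero.
for margin in (-1.0, 0.5, 2.0, 5.0):
    print(f"margin={margin:5.1f}  hinge'={hinge_grad(margin):5.2f}  "
          f"logistic'={logistic_grad(margin):8.5f}")
```

Since every correctly classified point keeps a small pull on the weights, the logistic fit on separable data keeps growing unless a penalty term counteracts it.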

Why can’t we use step loss?

When we incur a loss, we also want to know in which direction to move the parameters to reduce it.
The gradient (slope) is what provides this information. [2]
Step loss (0-1 loss) has zero gradient almost everywhere, so it gives no such direction.
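A finite-difference check illustrates the point. This is a sketch on made-up 1-D data with a single weight w; the data, the starting weight, and the helper names are assumptions for illustration.

```python
import math

# Hypothetical 1-D data with labels in {+1, -1}; any w > 0 classifies it perfectly.
xs = [-2.0, -0.5, 0.3, 1.5]
ys = [-1, -1, 1, 1]

def step_loss(w: float) -> float:
    """Number of misclassified points under the rule: +1 if w*x > 0 else -1."""
    return float(sum(1 for x, y in zip(xs, ys) if (1 if w * x > 0 else -1) != y))

def logistic_loss(w: float) -> float:
    """Sum of log(1 + exp(-y * w * x)) over the data."""
    return sum(math.log1p(math.exp(-y * w * x)) for x, y in zip(xs, ys))

def numeric_grad(f, w: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w = -0.5  # a bad weight: every point is misclassified
print("step loss    :", step_loss(w), "gradient:", numeric_grad(step_loss, w))
print("logistic loss:", logistic_loss(w), "gradient:", numeric_grad(logistic_loss, w))
```

Even though every point is wrong at this weight, the step loss is flat around it, whereas the logistic loss has a negative slope that points toward larger w.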

References

[1] https://stats.stackexchange.com/questions/254124/why-does-logistic-regression-become-unstable-when-classes-are-well-separated

[2] Notes on the different loss functions are taken from Mike Gelbart’s course: https://github.com/UBC-CS/cpsc340/blob/master/lectures/L19demo.ipynb
