Correlation and Regression Slope

June 18, 2019 | May 18, 2023 | Archit Vora

In a simple regression model, the regression slope (β) represents the estimated change in the dependent variable (Y) corresponding to a one-unit increase in the independent variable (X). It quantifies the linear relationship between X and Y and indicates the direction and magnitude of the relationship.

The correlation coefficient (r) measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1, with a value of 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

When the standard deviations of both X and Y are equal (SD(X) = SD(Y)), the regression slope (β) and the correlation coefficient (r) coincide.

The slope can be calculated as the correlation coefficient multiplied by the ratio of the standard deviations (β = r * SD(Y) / SD(X)). The correlation coefficient essentially represents the slope you would obtain from a regression of standardized variables (Y / SD(Y) on X / SD(X) or vice versa).

However, when the standard deviations of X and Y are not equal, the regression slope and the correlation coefficient provide distinct information:

- The correlation coefficient is a bounded measure that can be interpreted independently of the scale of the variables. It indicates the strength of the linear relationship between X and Y, with values closer to ±1 indicating a stronger linear relationship. The regression slope, on its own, does not provide this information.
- The regression slope represents the estimated change in the expected value of Y for a one-unit increase in X. It provides information about the direction and magnitude of the relationship between X and Y in the original units of measurement. This information cannot be deduced from the correlation coefficient alone.

The correlation coefficient is also related to the covariance: r = Cov(X, Y) / [ SD(X) * SD(Y) ], i.e., the covariance normalised by the standard deviation of each variable, where SD = sqrt(Var). Substituting this into the slope formula above gives β = Cov(X, Y) / Var(X).
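
The identities above can be checked numerically. The following is a minimal sketch on synthetic data (the data-generating numbers are made up for illustration, and NumPy is assumed to be available). It compares the slope obtained from r, from the covariance, and from an ordinary least-squares fit, and also fits the standardized variables.

```python
import numpy as np

# Synthetic data, purely illustrative (the constants below are assumptions).
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=500)
y = 3.0 * x + rng.normal(scale=4.0, size=500)

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance
sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)    # sample standard deviations

r = cov_xy / (sd_x * sd_y)                   # r = Cov(X, Y) / (SD(X) * SD(Y))
slope_from_r = r * sd_y / sd_x               # beta = r * SD(Y) / SD(X)
slope_from_cov = cov_xy / x.var(ddof=1)      # beta = Cov(X, Y) / Var(X)
slope_ols = np.polyfit(x, y, deg=1)[0]       # slope of the least-squares line

# Regressing Y / SD(Y) on X / SD(X) should give a slope equal to r.
slope_standardized = np.polyfit(x / sd_x, y / sd_y, deg=1)[0]

print(r, slope_from_r, slope_from_cov, slope_ols, slope_standardized)
```

All three slope values should agree up to floating-point error, and the standardized slope should match r.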

References

[0] : https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope

[1] : https://www.quora.com/Is-there-a-relationship-between-the-correlation-coefficient-and-the-slope-of-a-linear-regression-line

VIF and Multicollinearity

January 23, 2019 | October 25, 2020 | Archit Vora

VIF = Variance Inflation Factor

In linear regression, collinearity can make the coefficients unstable
- Prediction accuracy is usually not affected, but the coefficients become less reliable: their standard errors grow and so do their p-values
- Pairwise correlation coefficients detect correlation between pairs of features, but not a multiple correlation such as x1 = 2*x3 + 4*x7
- PCA is one option, but it transforms the variables, and we want to keep them untransformed so the model stays interpretable
- So we want another way to reduce dimensions
In VIF, each feature is regressed against all the other features. A high R2 means this feature is well explained by (correlated with) the other features. [0]
- VIF = 1 / (1 - R2)
- As R2 approaches 1, VIF approaches infinity
A common rule of thumb is to remove features for which VIF > 5
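
As a rough sketch of the procedure just described, the snippet below regresses each feature on all the others (with an intercept) and applies VIF = 1 / (1 - R2). The helper name `vif` and the synthetic features are assumptions made for illustration. x1 is built as roughly 2*x3 + 4*x7, and x5 is an unrelated feature.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (shape: n_samples x n_features)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features (illustrative only): x1 is almost a linear combination.
rng = np.random.default_rng(1)
x3 = rng.normal(size=300)
x7 = rng.normal(size=300)
x5 = rng.normal(size=300)                              # unrelated feature
x1 = 2 * x3 + 4 * x7 + rng.normal(scale=0.5, size=300)

X = np.column_stack([x1, x3, x7, x5])
vifs = vif(X)
for name, value in zip(["x1", "x3", "x7", "x5"], vifs):
    print(name, round(float(value), 2))
```

In this setup x1, x3 and x7 should show inflated VIF values while x5 stays close to 1, even though no single pairwise correlation is close to 1.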

The example at [1] shows the use of VIF to reduce the number of features.
Once we identify features with high VIF, we need to bring it down
- We can do it by eliminating some features
- How do we identify which feature to remove?
  - Check which features are correlated with the feature that has high VIF
  - In the example at [1], weight and BSA were correlated
  - In practice weight is easier to measure, so we kept it
    - So such a decision depends on the practical implications
  - It can also be the case that one feature is correlated with many others, and we might want to remove that one

Reference

[0] : https://www.youtube.com/watch?v=0SBIXgPVex8

[1] : https://newonlinecourses.science.psu.edu/stat501/node/347/

Generalized Linear Models (GLM)

September 18, 2018 | October 25, 2020 | Archit Vora

In standard linear regression we make two assumptions:

P(Y|X) is a normal distribution
The mean is a linear function of the parameters: µ = β*X
P(Y|X) = N(µ, σ^2 * I), where σ is the standard deviation and I is the identity matrix

In a GLM we relax both assumptions:

P(Y|X) can come from any exponential family distribution
The mean is some function of β*X
1. µ = f(β*X)
2. g(µ) = β*X
3. g = f^(-1)
4. g is called the link function

Example of link functions:

log link
reciprocal link
logit link

The log-likelihood is derived in the same way as for the normal distribution. However, there is generally no closed-form solution, so the estimate is found numerically, typically with iteratively reweighted least squares or another convex optimization method.

Here is one example from the MIT course mentioned in the references.

[Figure: Poisson example]

Logistic

In Gaussian regression we predict μ for each sample
- This μ comes from β0, β1, β2, which are the same for every sample
For binomial regression we want to predict p for each sample
- This p comes from β0, β1, β2, which are the same for every sample
Option one:
- p = β0 + β1*x1 + β2*x2
Option two:
- p = sigmoid (β0 + β1*x1 + β2*x2)
- f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
- This is the logit link function
What are the other options apart from sigmoid?
- Step function (not differentiable, which is why we use sigmoid)
- tanh is sometimes used in deep learning
What if we go with option one:
- The binomial distribution requires p to be in (0,1), but a linear function of the inputs can fall outside that range
Example:
- How many fish survive (alive/dead) given food and water
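
To make the difference between the two options concrete, here is a small sketch with made-up coefficients and inputs (all numbers are hypothetical). It evaluates the linear predictor directly (option one) and through the sigmoid (option two), and checks that the logit undoes the sigmoid.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and inputs, chosen only for illustration.
beta0, beta1, beta2 = -1.0, 0.8, 0.5
samples = [(0.0, 0.0), (2.0, 1.0), (10.0, 10.0), (-10.0, -5.0)]

results = []
for x1, x2 in samples:
    eta = beta0 + beta1 * x1 + beta2 * x2     # linear predictor
    p_option_one = eta                        # may fall outside (0, 1)
    p_option_two = sigmoid(eta)               # always inside (0, 1)
    results.append((eta, p_option_one, p_option_two))
    print(f"eta={eta:6.2f}  option one={p_option_one:6.2f}  option two={p_option_two:.6f}")
```

With these inputs option one produces values such as 12.0 and -11.5, which are not valid probabilities, while option two always stays between 0 and 1.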

Poisson

The Poisson distribution models the probability of observing a count k
- P(k) = exp(-λ) * (λ^k) / k!
Parameter λ > 0
Option one:
- λ = β0 + β1*x1 + β2*x2
Option two:
- λ = exp ( β0 + β1*x1 + β2*x2 )
- f ( λ ) = log ( λ ) = β0 + β1*x1 + β2*x2
- This is the log link function
What if we go with option one:
- We need λ > 0, but a linear function of the inputs can become negative
- The relationship between input and output is often multiplicative rather than additive
  - Suppose enough water makes 1.5 times as many seeds germinate, and enough fertilizer makes 1.2 times as many germinate. With both enough water and enough fertilizer, would 1.5 + 1.2 = 2.7 times as many seeds germinate?
    No. The estimated value would be 1.5 * 1.2 = 1.8 times. [3]
Example:
- How many seeds will germinate given water and fertilizer
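
The multiplicative behaviour of the log link can be sketched in a few lines. The baseline of 10 seeds and the coefficient values are hypothetical, chosen so that water multiplies the expected count by 1.5 and fertilizer by 1.2, matching the example above.

```python
import math

# Hypothetical coefficients: exp(beta) is the multiplicative effect of each input.
beta0 = math.log(10.0)        # assumed baseline: 10 seeds germinate on average
beta_water = math.log(1.5)    # enough water multiplies the count by 1.5
beta_fert = math.log(1.2)     # enough fertilizer multiplies the count by 1.2

def expected_count(water, fertilizer):
    """lambda = exp(beta0 + beta_water * water + beta_fert * fertilizer)."""
    return math.exp(beta0 + beta_water * water + beta_fert * fertilizer)

def poisson_pmf(k, lam):
    """P(k) = exp(-lambda) * lambda^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

base = expected_count(0, 0)
only_water = expected_count(1, 0)
only_fert = expected_count(0, 1)
both = expected_count(1, 1)

print(only_water / base, only_fert / base, both / base)
print(poisson_pmf(18, both))  # probability of exactly 18 seeds when both are given
```

Because the inputs are added inside the exponent, their effects multiply on the scale of λ, so the ratio for both inputs comes out as 1.5 * 1.2 rather than 1.5 + 1.2.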

Parameter Estimation

We can use maximum likelihood estimation to find the parameters β0, β1, β2
Deriving the likelihood for binomial:
- likelihood = Multiply (likelihood of y_actual of each sample under the predicted distribution)
- likelihood = Multiply ( Binomial(p) )
- likelihood = Multiply ( p if y=1 else (1-p) )
- log(likelihood) = Summation ( y*log(p) + (1-y)*log(1-p) )
Deriving the likelihood for Poisson, where λ is the predicted rate for each sample:
- likelihood = Multiply (likelihood of y_actual of each sample under the predicted distribution)
- likelihood = Multiply ( Poisson(λ) )
- likelihood = Multiply ( exp(-λ) * λ^y / y! )
- log(likelihood) = Summation ( -λ + y*log(λ) - log(y!) )
The two derivations above are rough, but they convey the idea
For Gaussian, this turns out to be OLS (Ordinary Least Squares), which has a closed-form solution
For the others, we solve it numerically via gradient or Newton's method.
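
As an illustration of the numerical route, here is a minimal sketch of Newton's method for Poisson regression with a log link. The data are synthetic and the "true" coefficients are made up. The log(y!) term is dropped from the log-likelihood because it does not depend on β. This is one possible implementation under those assumptions, not a reference one.

```python
import numpy as np

# Synthetic data; beta_true is an assumption used only to generate counts.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])          # intercept + one feature
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))

def poisson_loglik(beta):
    """Sum of (y * eta - exp(eta)); the constant log(y!) term is omitted."""
    eta = X @ beta
    return float(np.sum(y * eta - np.exp(eta)))

beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)                     # predicted rate for each sample
    grad = X.T @ (y - mu)                     # gradient of the log-likelihood
    neg_hess = X.T @ (X * mu[:, None])        # negative Hessian: X^T diag(mu) X
    step = np.linalg.solve(neg_hess, grad)    # Newton step
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta, poisson_loglik(beta))
```

At convergence the gradient is numerically zero and the estimate should land near the coefficients used to generate the data.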

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

Interpreting Statistical Values

January 15, 2017 | May 21, 2023 | Archit Vora

In this post, we will explore the values in the summary(model) output in R and understand their significance.

Here is a screenshot illustrating the summary:

Significance of Residuals

We want our residuals to be normally distributed and centered around zero.
It’s similar to aiming at the bullseye on a dartboard.
- If the residuals are biased in one direction, there is room for improvement.
- If the residuals are spread equally in all directions, we can attempt to reduce the standard deviation.
- Irreducible error should show up as scatter in all directions, with no systematic pattern.
The residual quantiles give an initial insight into symmetry.
R also provides the standard deviation of the residuals, known as RSE (residual standard error).
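
The residual quantiles and the RSE can be reproduced by hand. The sketch below uses Python with NumPy rather than R, and fits synthetic data whose coefficients and noise level are assumptions for illustration. It computes the same five quantiles that the R summary prints (Min, 1Q, Median, 3Q, Max) together with RSE = sqrt(RSS / (n - p - 1)).

```python
import numpy as np

# Synthetic data, illustrative only.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])     # intercept + two predictors
p = X.shape[1] - 1                            # number of predictors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
residuals = y - X @ coef

# Min, 1Q, Median, 3Q, Max of the residuals.
quantiles = np.quantile(residuals, [0.0, 0.25, 0.5, 0.75, 1.0])

# Residual standard error.
rse = float(np.sqrt(residuals @ residuals / (n - p - 1)))

print(quantiles)
print(rse)
```

Roughly symmetric quantiles around a median near zero suggest the residuals are centered, and the RSE should be close to the noise level used to generate the data.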

The Relationship between t-value and p-value in the Coefficient Section

The t-value and p-value test whether a variable has a relationship with the output.
This is a preset statistical question (the null hypothesis that the coefficient is zero) that cannot be changed.
If the coefficient is zero, the variable does not contribute; otherwise, it does.
The t-value is the number of standard errors the estimated coefficient is away from zero.
A larger absolute t-value signifies a more significant variable.

Calculating p-values

A common misconception is that we must repeatedly draw samples from a larger population to see how a coefficient varies.
Each sample would indeed yield a different coefficient estimate, which could be close to zero for some samples.
However, the variance of the estimated parameters can be derived mathematically as σ^2 * (X^T * X)^(-1).
σ^2 can be estimated from the residual standard error (σ^2 ≈ RSE^2).
A Bayesian view helps to appreciate a distribution over the coefficients rather than a point estimate.
- P-values are then calculated from the t-distribution, which assumes normally distributed errors.
In the R result display, we have a mean and a standard deviation for each coefficient.
- The coefficient is treated as a random variable centered at the mean (Estimate in the R summary), with its standard deviation given by Std. Error.
- The mean is t standard errors away from zero.
- The p-value is the probability, assuming the true coefficient is zero, of observing an estimate at least |t| standard errors away from zero.
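
The steps above can be sketched numerically. The example below (Python with NumPy, synthetic data with made-up coefficients) estimates σ^2 from the residuals, forms σ^2 * (X^T * X)^(-1), and derives standard errors and t-values. To stay dependency-free, the p-value uses a normal approximation via `math.erfc`, which is only reasonable for large n; R itself uses the t-distribution, so small-sample values will differ.

```python
import math
import numpy as np

# Synthetic data: x1 has a real effect, x2 has none (illustrative assumption).
rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1] - 1
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # "Estimate" column
resid = y - X @ coef

sigma2 = resid @ resid / (n - p - 1)           # estimate of sigma^2 (RSE squared)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # sigma^2 * (X^T X)^(-1)
std_err = np.sqrt(np.diag(cov_beta))           # "Std. Error" column
t_values = coef / std_err                      # "t value" column

# Two-sided p-value under a NORMAL approximation (assumption: large n).
p_approx = [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_values]

for name, est, se, t, pv in zip(["intercept", "x1", "x2"], coef, std_err, t_values, p_approx):
    print(f"{name:9s} est={est:7.3f} se={se:.3f} t={t:7.2f} p~{pv:.4f}")
```

The coefficient of x1 should have a large t-value and a p-value near zero, while the result for x2 depends on the random draw.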

Role of R^2

R^2 indicates how much of the variance is explained by the model: R^2 = 1 - RSS / TSS, where RSS is the residual sum of squares and TSS is the total sum of squares.
R^2 has an advantage over RSE as it always falls between 0 and 1.

Determining a Good Value of R^2

A good value of R^2 depends on the problem setting.
When we make perfect predictions, RSS = 0 and hence R^2 = 1.
In physics, if we are confident the data follows a linear model, R^2 close to 1 is desirable.
In marketing, often only a small proportion of the variance can be explained by the predictors, so R^2 = 0.1 can be realistic.

Difference between Absolute and Adjusted R^2

R^2 never decreases when a variable is added, while adjusted R^2 can decrease if the added variable contributes little.
The formula is Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1). It incorporates the number of variables p, so when a non-significant variable is added, the result can decrease.
Similarly, RSE = sqrt(RSS / (n - p - 1)), so RSE may increase while RSS decreases, because the denominator shrinks as p grows. Plain R^2 depends only on RSS and TSS and never shows this effect.
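
A quick way to see this is to fit the same target twice, once with a single real predictor and once with five extra pure-noise columns. The sketch below uses Python with NumPy. The helper `fit_stats` and the synthetic data are assumptions for illustration. It uses R^2 = 1 - RSS / TSS, Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1) and RSE = sqrt(RSS / (n - p - 1)).

```python
import numpy as np

def fit_stats(X, y):
    """OLS fit; X must already contain the intercept column."""
    n, k = X.shape
    p = k - 1                                   # predictors, excluding intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    rse = float(np.sqrt(rss / (n - p - 1)))
    return {"rss": rss, "r2": r2, "adj_r2": adj_r2, "rse": rse}

# Synthetic data: one real predictor plus five columns of pure noise.
rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=2.0, size=n)
noise = rng.normal(size=(n, 5))

small = fit_stats(np.column_stack([np.ones(n), x1]), y)
big = fit_stats(np.column_stack([np.ones(n), x1, noise]), y)

print("one predictor   :", small)
print("plus noise cols :", big)
```

RSS can only go down and R^2 can only go up in the larger model, whereas adjusted R^2 and RSE may move in either direction depending on how much the noise columns happen to fit.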

Significance of F Statistics

The F-test determines if a group of variables is jointly significant, whereas the t-test examines the significance of individual variables.
F-statistics also have associated p-values.
The null hypothesis for the F-test is that your model fits no better than the intercept-only model, i.e., all coefficients other than the intercept are zero.
While R-squared provides an estimate of the relationship strength between the model and response variable, it does not offer a formal hypothesis test. This test is provided by the F-test.

Why Use F Statistics when Individual Coefficient p-values are Available?

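It may seem that if one coefficient is significant (good p-value), the overall model will also be significant.
However, this reasoning breaks down when the number of variables is large: with many predictors, some individual p-values will fall below the threshold purely by chance (about 5% of them at the 0.05 level), even if no predictor is truly related to the response. The F-statistic accounts for the number of predictors.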

Determining Good Values of F-statistics

It depends on the values of n (number of observations in the training set) and p (number of independent variables).
When n is large, an F-value slightly greater than 1 is sufficient to reject the null hypothesis.
It is advisable to base decisions on corresponding p-values, which consider both n and p.
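
For reference, the overall F-statistic can be computed from RSS and TSS as F = ((TSS - RSS) / p) / (RSS / (n - p - 1)). The sketch below (Python with NumPy, synthetic data, hypothetical helper name `f_statistic`) compares a target that really depends on a predictor with a pure-noise target. Turning F into a p-value requires the F-distribution with (p, n - p - 1) degrees of freedom, which is left out here.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic of an OLS fit; X must contain the intercept column."""
    n, k = X.shape
    p = k - 1
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    f = ((tss - rss) / p) / (rss / (n - p - 1))
    r2 = 1.0 - rss / tss
    return f, r2, n, p

# Synthetic data, illustrative only.
rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

y_real = 1.0 + 1.0 * x1 + rng.normal(scale=1.0, size=n)   # depends on x1
y_noise = rng.normal(size=n)                              # unrelated to x1, x2

f_real, r2_real, _, p = f_statistic(X, y_real)
f_noise, r2_noise, _, _ = f_statistic(X, y_noise)

print("real signal:", f_real, r2_real)
print("pure noise :", f_noise, r2_noise)
```

The same F value can also be written in terms of R^2 as (R^2 / p) / ((1 - R^2) / (n - p - 1)), which shows how F, n and p are tied together.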

Degrees of Freedom:

Suppose you have two features, x1 and x2, and a target variable y.
The fitted equation is y = a1*x1 + a2*x2 + a3, which describes a plane.
In a 3D space, three points (not all on one line) define a unique plane.
With n points, p (= 2) features, and 1 target, three points can always be fitted exactly by the plane, while the remaining (n - p - 1) points can deviate from it. This difference represents the degrees of freedom.
Degrees of freedom are the difference between n and the number of estimated coefficients, including the intercept.
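
This can be seen directly by fitting y = a1*x1 + a2*x2 + a3 to exactly three points and then to ten. The points below are arbitrary values chosen for illustration, and the sketch uses Python with NumPy.

```python
import numpy as np

def fit_residuals(x1, x2, y):
    """Fit y = a1*x1 + a2*x2 + a3 by least squares and return the residuals."""
    X = np.column_stack([x1, x2, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Three arbitrary points: n - p - 1 = 3 - 2 - 1 = 0 degrees of freedom.
x1_three = np.array([0.0, 1.0, 0.0])
x2_three = np.array([0.0, 0.0, 1.0])
y_three = np.array([1.0, 3.0, -2.0])
res_three = fit_residuals(x1_three, x2_three, y_three)

# Ten arbitrary points: n - p - 1 = 10 - 2 - 1 = 7 degrees of freedom.
rng = np.random.default_rng(3)
x1_ten = rng.normal(size=10)
x2_ten = rng.normal(size=10)
y_ten = rng.normal(size=10)
res_ten = fit_residuals(x1_ten, x2_ten, y_ten)

print(res_three)   # numerically zero: the fit passes through all three points
print(res_ten)     # non-zero: the extra points are free to deviate
```

With three points every residual is zero whatever the target values are, so the residuals carry no information about the error. Only the points beyond p + 1 contribute to it.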

Significance Score “***” in the Coefficient Section

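R indicates the significance of a p-value by displaying stars.
The stars are shorthand for p-value thresholds, listed in the "Signif. codes" legend under the coefficient table (*** for p < 0.001, ** for p < 0.01, * for p < 0.05). The p-values themselves come from the t-distribution, not from bootstrapping.
Bootstrapping is an alternative approach that allows assigning measures of accuracy to sample estimates, such as bias, variance, confidence intervals, or prediction error.
In Bayesian inference, parameter distributions are obtained, which allows comparable probability statements about the coefficients.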

References

Found the formula for adjusted R2 here
