Correlation and Regression Slope

In a simple regression model, the regression slope (β) represents the estimated change in the dependent variable (Y) corresponding to a one-unit increase in the independent variable (X). It quantifies the linear relationship between X and Y and indicates the direction and magnitude of the relationship.

The correlation coefficient (r) measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1, with a value of 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

When the standard deviations of both X and Y are equal (SD(X) = SD(Y)), the regression slope (β) and the correlation coefficient (r) coincide.

The slope can be calculated as the correlation coefficient multiplied by the ratio of the standard deviations (β = r * SD(Y) / SD(X)). The correlation coefficient essentially represents the slope you would obtain from a regression of standardized variables (Y / SD(Y) on X / SD(X) or vice versa).
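The identity above can be checked numerically. The following is a minimal sketch on synthetic data; the sample size, seed, and coefficients are arbitrary assumptions and do not come from the references.

```python
import numpy as np

# Synthetic data: all numbers here are arbitrary choices for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=500)
y = 2.0 * x + rng.normal(scale=4.0, size=500)

r = np.corrcoef(x, y)[0, 1]            # correlation coefficient
slope = np.polyfit(x, y, deg=1)[0]     # least-squares slope of Y on X

# Slope rebuilt from r and the ratio of standard deviations.
slope_from_r = r * np.std(y, ddof=1) / np.std(x, ddof=1)

# Regressing the scaled variables Y / SD(Y) on X / SD(X) gives r as the slope.
zx = x / np.std(x, ddof=1)
zy = y / np.std(y, ddof=1)
slope_scaled = np.polyfit(zx, zy, deg=1)[0]

print(slope, slope_from_r)   # these two should agree
print(r, slope_scaled)       # and so should these two
```

The two printed pairs should match up to floating-point error, whichever seed is used.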

However, when the standard deviations of X and Y are not equal, the regression slope and the correlation coefficient provide distinct information:

  1. The correlation coefficient is a bounded measure that can be interpreted independently of the scale of the variables. It indicates the strength of the linear relationship between X and Y, with values closer to ±1 indicating a stronger linear relationship. The regression slope, on its own, does not provide this information.
  2. The regression slope represents the estimated change in the expected value of Y for a given unit increase in X. It provides information about the direction and magnitude of the relationship between X and Y in the original units of measurement. This information cannot be deduced from the correlation coefficient alone.
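A small sketch of the two points above: changing the unit of X changes the slope but leaves the correlation untouched. The variable names (`height_m`, `weight_kg`) and the data are hypothetical, chosen only to make the unit change concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: height in metres (X) and weight in kilograms (Y).
height_m = rng.normal(1.7, 0.1, size=300)
weight_kg = 60.0 * height_m - 35.0 + rng.normal(0.0, 5.0, size=300)

def slope_and_r(x, y):
    """Return the least-squares slope of y on x and the correlation r."""
    slope = np.polyfit(x, y, deg=1)[0]
    r = np.corrcoef(x, y)[0, 1]
    return slope, r

slope_m, r_m = slope_and_r(height_m, weight_kg)            # X in metres
slope_cm, r_cm = slope_and_r(height_m * 100.0, weight_kg)  # X in centimetres

print(slope_m, slope_cm)  # slope shrinks by a factor of 100 (kg per m vs kg per cm)
print(r_m, r_cm)          # correlation is the same in both cases
```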

The correlation coefficient is also related to the covariance: r = Cov(X, Y) / [SD(X) * SD(Y)]. In other words, r is the covariance normalised by the standard deviation of each variable, where SD = sqrt(variance). The slope can be written in the same terms: β = Cov(X, Y) / Var(X).
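As a short worked step, the two covariance formulas combine to give back the slope formula stated earlier. Substituting Cov(X, Y) = r * SD(X) * SD(Y) and Var(X) = SD(X)^2 into the expression for the slope:

```latex
\begin{align*}
\beta &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
       = \frac{r \,\mathrm{SD}(X)\,\mathrm{SD}(Y)}{\mathrm{SD}(X)^{2}}
       = r \,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}
\end{align*}
```

So when SD(X) = SD(Y) the ratio is 1 and the slope equals r.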

References

[0] : https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope

[1] : https://www.quora.com/Is-there-a-relationship-between-the-correlation-coefficient-and-the-slope-of-a-linear-regression-line

VIF and Multicollinearity

VIF = Variance Inflation Factor

  • In linear regression, collinearity can make the coefficients unstable
    • Prediction accuracy is generally not affected, but the coefficients become less reliable and their p-values larger
    • Pairwise correlation coefficients detect correlation between pairs of features, but not a dependence on several features at once, e.g. x1 = 2*x3 + 4*x7
    • PCA is one option, but we don’t want to transform the variables, so that interpretability stays intact
    • We still want some way to reduce dimensions
  • For VIF, each feature is regressed against all the other features. A high R2 means that the feature is correlated with the other features.  [0]
    • VIF = 1 / (1 – R2)
    • As R2 approaches 1, VIF approaches infinity
  • We try to remove features for which VIF > 5
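The VIF definition above can be sketched with plain NumPy. This is an illustrative implementation, not the code used in [0] or [1]; the helper name `vif` and the synthetic features are assumptions. The first column is built as a sum of four others plus a little noise, so no single pairwise correlation is very high while the VIF still is.

```python
import numpy as np

def vif(X):
    """VIF of each column: regress it on all other columns (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()       # R2 of this regression
        scores[j] = 1.0 / (1.0 - r2)                # VIF = 1 / (1 - R2)
    return scores

# Hypothetical features: x1 depends on x2..x5 jointly, x6 is unrelated.
rng = np.random.default_rng(3)
n = 1000
x2, x3, x4, x5, x6 = rng.normal(size=(5, n))
x1 = x2 + x3 + x4 + x5 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3, x4, x5, x6])

corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(X.shape[1])).max()  # largest off-diagonal |r|
vifs = vif(X)

print("largest pairwise |r|:", max_pairwise)
print("VIF per feature:", vifs)
```

In this setup the pairwise correlations stay moderate, yet the VIF of x1 is far above 5, which is the situation pairwise correlation alone would miss.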

  • The example at [1] shows the use of VIF to reduce the number of features.
  • Once we identify features with a high VIF, we need to bring it down
    • We can do this by eliminating some features
    • How do we identify which feature to remove?
      • For a feature with a high VIF, check which other features it is correlated with
      • In the example at [1], weight and BSA (body surface area) were correlated
      • In practice weight is easier to measure, so it was kept
        • So the decision depends on practical considerations
      • It can also happen that one feature is correlated with many others, in which case we might want to remove that one
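The elimination step can also be automated in a purely mechanical way. The sketch below is a greedy heuristic (drop the highest-VIF feature, recompute, repeat), which is an assumption of this example and not the procedure followed in [1], where the choice was made on practical grounds. The function names, the threshold default, and the synthetic `weight`, `bsa`, and `age` columns are all hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        target = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        scores[j] = 1.0 / (resid.var() / target.var())  # same as 1 / (1 - R2)
    return scores

def drop_high_vif(X, names, threshold=5.0):
    """Greedy heuristic: repeatedly drop the feature with the largest VIF."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        names.pop(worst)
        X = np.delete(X, worst, axis=1)
    return X, names

# Synthetic stand-ins for the features discussed above (not the data from [1]).
rng = np.random.default_rng(4)
n = 500
weight = rng.normal(80.0, 10.0, size=n)
bsa = 0.6 + 0.017 * weight + rng.normal(0.0, 0.03, size=n)  # tracks weight closely
age = rng.normal(50.0, 8.0, size=n)                         # unrelated to both

X = np.column_stack([weight, bsa, age])
X_kept, kept = drop_high_vif(X, ["weight", "bsa", "age"])
print("kept features:", kept)
print("VIF after dropping:", vif(X_kept))
```

Whether the heuristic drops `weight` or `bsa` depends on which happens to have the marginally larger VIF, so it cannot replace the practical judgment described above; it only guarantees that the remaining VIFs fall under the threshold.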

 

References

[0] : https://www.youtube.com/watch?v=0SBIXgPVex8

[1] : https://newonlinecourses.science.psu.edu/stat501/node/347/