correlation | Data Stories

In a simple regression model, the regression slope (β) represents the estimated change in the dependent variable (Y) corresponding to a one-unit increase in the independent variable (X). It quantifies the linear relationship between X and Y and indicates the direction and magnitude of the relationship.

The correlation coefficient (r) measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1, with a value of 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

When the standard deviations of both X and Y are equal (SD(X) = SD(Y)), the regression slope (β) and the correlation coefficient (r) coincide.

The slope can be calculated as the correlation coefficient multiplied by the ratio of the standard deviations (β = r * SD(Y) / SD(X)). The correlation coefficient essentially represents the slope you would obtain from a regression of standardized variables (Y / SD(Y) on X / SD(X) or vice versa).
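The two identities above can be checked numerically. The following is a minimal sketch on synthetic data, assuming NumPy is available; the variable names, the seed, and the data-generating line are illustrative choices, not part of the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are synthetic
x = rng.normal(loc=10.0, scale=2.0, size=500)
y = 3.0 * x + rng.normal(scale=4.0, size=500)

# Least-squares slope of Y on X (polyfit with deg=1 returns [slope, intercept]).
slope, intercept = np.polyfit(x, y, 1)

# Correlation coefficient and sample standard deviations.
r = np.corrcoef(x, y)[0, 1]
sd_x = np.std(x, ddof=1)
sd_y = np.std(y, ddof=1)

# Slope reconstructed from r and the ratio of standard deviations.
slope_from_r = r * sd_y / sd_x

# Regressing standardized Y on standardized X should give a slope equal to r.
zx = (x - x.mean()) / sd_x
zy = (y - y.mean()) / sd_y
std_slope, _ = np.polyfit(zx, zy, 1)

print(slope, slope_from_r)  # these two should agree
print(std_slope, r)         # and so should these two
```

Swapping the roles of x and y changes the raw slope but leaves the standardized slope equal to r, which is the "or vice versa" in the text.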

However, when the standard deviations of X and Y are not equal, the regression slope and the correlation coefficient provide distinct information:

- The correlation coefficient is a bounded measure that can be interpreted independently of the scale of the variables. It indicates the strength of the linear relationship between X and Y, with values closer to ±1 indicating a stronger linear relationship. The regression slope, on its own, does not provide this information.
- The regression slope represents the estimated change in the expected value of Y for a given unit increase in X. It provides information about the direction and magnitude of the relationship between X and Y in the original units of measurement. This information cannot be deduced from the correlation coefficient alone.

The correlation coefficient is also related to the covariance: r = Cov(X, Y) / [SD(X) * SD(Y)]. In other words, r is the covariance normalised by the standard deviation of each variable, where SD = sqrt(variance). The slope can likewise be written as β = Cov(X, Y) / Var(X).
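Putting the covariance formulas together recovers the slope formula given earlier. This short derivation uses only the definitions stated in this post, with Var(X) = SD(X)^2:

```latex
\beta
  = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  = \frac{r \,\mathrm{SD}(X)\,\mathrm{SD}(Y)}{\mathrm{SD}(X)^{2}}
  = r \,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}
```

When SD(X) = SD(Y) the ratio is 1, so β = r, which is the equal-standard-deviation case described above.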

References

[0] : https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope

[1] : https://www.quora.com/Is-there-a-relationship-between-the-correlation-coefficient-and-the-slope-of-a-linear-regression-line

VIF = Variance Inflation Factor

In linear regression, collinearity can make the coefficients unstable.
- Prediction accuracy is largely unaffected, but the coefficients become less reliable and their p-values larger
- Pairwise correlation coefficients detect correlation between pairs of features, but not a multiple correlation such as x1 = 2*x3 + 4*x7
- PCA is one option, but it transforms the variables, and we want to keep interpretability intact
- We still want some way to reduce dimensions
In VIF, each feature is regressed against all the other features. A high R2 means the feature is correlated with the other features. [0]
- VIF = 1 / (1 - R2)
- As R2 approaches 1, VIF approaches infinity
We try to remove features for which VIF > 5
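
The VIF calculation described above can be sketched directly with NumPy least squares. This is an illustrative implementation, not the code from the referenced video: the function name `vif`, the synthetic features, and the seed are all assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on all other features, with an intercept column.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)  # VIF = 1 / (1 - R2)
    return out

# Synthetic data: x3 is almost a linear combination of x1 and x2,
# while x4 is unrelated to the rest.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 + 4 * x2 + rng.normal(scale=0.1, size=n)
x4 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3, x4])
vifs = vif(X)
print(vifs)
```

In this setup the three features involved in the linear combination should all exceed the VIF > 5 threshold, while the independent feature should stay close to 1.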

The example at [1] shows the use of VIF to reduce the number of features.
Once we identify features with a high VIF, we need to reduce it
- We can do this by eliminating some features
- How do we identify which feature to remove?
  - Check which features are correlated with the feature that has a high VIF
  - In the example at [1], weight and BSA were correlated
  - In practice weight is easier to measure, so we kept it
    - So such a decision depends on the practical implications
  - It can also happen that one feature is correlated with many others, and we might want to remove it
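
The step of checking which features are correlated with a high-VIF feature can be sketched as follows. The data here are synthetic and only borrow the names weight and BSA from the example at [1]; the helper `correlated_partners`, the `age` column, and the coefficients are hypothetical.

```python
import numpy as np

# Synthetic stand-ins: bsa is built to track weight, age is independent.
rng = np.random.default_rng(1)
n = 300
weight = rng.normal(size=n)
bsa = 0.9 * weight + rng.normal(scale=0.3, size=n)
age = rng.normal(size=n)

names = ["weight", "bsa", "age"]
X = np.column_stack([weight, bsa, age])

# Pairwise correlation matrix; columns are features, so rowvar=False.
corr = np.corrcoef(X, rowvar=False)

def correlated_partners(corr, names, feature):
    """List (other_feature, correlation) pairs, strongest first."""
    j = names.index(feature)
    pairs = [(names[k], corr[j, k]) for k in range(len(names)) if k != j]
    return sorted(pairs, key=lambda t: abs(t[1]), reverse=True)

partners = correlated_partners(corr, names, "bsa")
for name, value in partners:
    print(name, round(value, 3))
```

The listing only shows which features move together. Which of a correlated pair to drop remains a practical decision, as with keeping weight because it is easier to measure.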

References

[0] : https://www.youtube.com/watch?v=0SBIXgPVex8

[1] : https://newonlinecourses.science.psu.edu/stat501/node/347/
