VIF and Multicollinearity

VIF = Variance Inflation Factor

  • In linear regression, collinearity can make the coefficient estimates unstable
    • Prediction accuracy is usually not affected, but the coefficients become less reliable: their standard errors grow, so their p-values get larger
    • Pairwise correlation coefficients detect correlation between pairs of features, but not a multiple correlation such as x1 = 2*x3 + 4*x7
    • PCA is one option, but it transforms the variables; we want to keep the original variables so the model stays interpretable
    • So we want a way to reduce dimensions by dropping features rather than transforming them
  • In VIF, each feature is regressed against all the other features. A high R² means the feature is well explained by (correlated with) the other features.  [0]
    • VIF = 1 / (1 – R²)
    • As R² approaches 1, VIF approaches infinity
  • As a rule of thumb, we try to remove features for which VIF > 5
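
The definition above can be sketched in a few lines of Python. This is a minimal illustration and not code from [0] or [1]: the data is synthetic, the column names (x1, x3, x7, x5) are hypothetical and only mirror the x1 = 2*x3 + 4*x7 example, and the helper name `vif` is an assumption. It uses the identity 1 / (1 – R²) = SS_total / SS_residual for a regression with an intercept.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: regress it on all other columns (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid                       # unexplained variation
        ss_tot = ((y - y.mean()) ** 2).sum()         # total variation
        # 1 / (1 - R^2) with R^2 = 1 - ss_res / ss_tot
        out[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return out

# Synthetic data (assumption): x1 is almost a linear combination of x3 and x7.
rng = np.random.default_rng(0)
n = 500
x3 = rng.normal(size=n)
x7 = rng.normal(size=n)
x5 = rng.normal(size=n)                              # unrelated feature
x1 = 2 * x3 + 4 * x7 + rng.normal(scale=0.1, size=n)

names = ["x1", "x3", "x7", "x5"]
X = np.column_stack([x1, x3, x7, x5])
vifs = vif(X)

print("pairwise correlations:\n", np.round(np.corrcoef(X, rowvar=False), 2))
for name, v in zip(names, vifs):
    print(f"VIF({name}) = {v:.1f}")
```

In this setup x1, x3 and x7 should all come out far above the VIF > 5 threshold, while the unrelated x5 stays close to 1.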


  • The example at [1] shows how VIF is used to reduce the number of features.
  • Once we find features with a high VIF, we need to bring it down
    • We can do this by eliminating some features
    • How do we decide which feature to remove?
      • For a feature with a high VIF, check which other features it is correlated with
      • In the example at [1], weight and BSA (body surface area) were correlated
      • In practice weight is easier to measure, so it was kept
        • So the decision depends on practical considerations
      • It can also happen that one feature is correlated with many others, in which case we may want to remove that one
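
The elimination step can be sketched as a simple loop. This is only one possible heuristic, not the procedure from [1]: it drops the feature with the highest VIF until every VIF is at or below the threshold, and a hypothetical `keep` argument protects features we prefer for practical reasons (such as weight being easy to measure). The data is synthetic, and the names `weight`, `bsa` and `age` are borrowed only as labels.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: regress it on all other columns (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        out[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return out

def drop_high_vif(X, names, threshold=5.0, keep=()):
    """Repeatedly drop the highest-VIF feature not listed in `keep`."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif(X)
        # Features above the threshold that we are allowed to remove
        candidates = [j for j, v in enumerate(vifs)
                      if v > threshold and names[j] not in keep]
        if not candidates:
            break
        worst = max(candidates, key=lambda j: vifs[j])
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Synthetic data (assumption): bsa is strongly correlated with weight.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(size=n)
bsa = 0.95 * weight + 0.2 * rng.normal(size=n)
age = rng.normal(size=n)

X = np.column_stack([weight, bsa, age])
X_reduced, kept = drop_high_vif(X, ["weight", "bsa", "age"], keep=("weight",))
print("kept features:", kept)
print("VIFs after elimination:", np.round(vif(X_reduced), 2))
```

Protecting `weight` makes the loop remove `bsa` instead, which matches the practical choice described above. Without `keep`, the loop would simply remove whichever of the two has the larger VIF.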

 

References

[0] : https://www.youtube.com/watch?v=0SBIXgPVex8

[1] : https://newonlinecourses.science.psu.edu/stat501/node/347/