Correlation and Regression Slope

In a simple regression model, the regression slope (β) represents the estimated change in the dependent variable (Y) corresponding to a one-unit increase in the independent variable (X). It quantifies the linear relationship between X and Y and indicates the direction and magnitude of the relationship.

The correlation coefficient (r) measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1, with a value of 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

When the standard deviations of both X and Y are equal (SD(X) = SD(Y)), the regression slope (β) and the correlation coefficient (r) coincide.

The slope can be calculated as the correlation coefficient multiplied by the ratio of the standard deviations (β = r * SD(Y) / SD(X)). The correlation coefficient essentially represents the slope you would obtain from a regression of standardized variables (Y / SD(Y) on X / SD(X) or vice versa).

However, when the standard deviations of X and Y are not equal, the regression slope and the correlation coefficient provide distinct information:

  1. The correlation coefficient is a bounded measure that can be interpreted independently of the scale of the variables. It indicates the strength of the linear relationship between X and Y, with values closer to ±1 indicating a stronger linear relationship. The regression slope, on its own, does not provide this information.
  2. The regression slope represents the estimated change in the expected value of Y for a given unit increase in X. It provides information about the direction and magnitude of the relationship between X and Y in the original units of measurement. This information cannot be deduced from the correlation coefficient alone.

The correlation coefficient is also related to the covariance: r = Cov(X, Y) / [SD(X) * SD(Y)]. In other words, the covariance is normalised by the standard deviation of each variable, where SD = sqrt(Var). Equivalently, the slope can be written as β = Cov(X, Y) / Var(X).
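The identities above can be checked numerically. The following is a minimal sketch in plain Python; the data values are made up for illustration, and the sample (n - 1) versions of variance and covariance are assumed.

```python
import math

# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance and variances (n - 1 in the denominator).
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

r = cov_xy / (sd_x * sd_y)        # correlation coefficient
slope_from_cov = cov_xy / var_x   # slope = Cov(X, Y) / Var(X)
slope_from_r = r * sd_y / sd_x    # slope = r * SD(Y) / SD(X)

# Standardize both variables; the regression slope of zy on zx then equals r.
zx = [(a - mean_x) / sd_x for a in x]
zy = [(b - mean_y) / sd_y for b in y]
slope_standardized = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

print("r =", r)
print("slope =", slope_from_cov)
print("slope on standardized data =", slope_standardized)
```

Both slope formulas agree, and the slope on the standardized data equals r, which is why the two quantities coincide when SD(X) = SD(Y).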

References

[0] : https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope

[1] : https://www.quora.com/Is-there-a-relationship-between-the-correlation-coefficient-and-the-slope-of-a-linear-regression-line

VIF and Multicollinearity

VIF = Variance Inflation Factor

  • In linear regression, collinearity can make the coefficients unstable
    • Prediction accuracy is usually not affected, but the coefficients become less reliable and their p-values larger
    • Pairwise correlation coefficients detect correlation between pairs of features, but not a multiple correlation such as x1 = 2*x3 + 4*x7
    • PCA is one option, but we don’t want to transform the variables, so that interpretability stays intact
    • We still want some way to reduce dimensions
  • In VIF, each feature is regressed against all other features. A high R2 means that this feature is correlated with the other features.  [0]
    • VIF = 1 / (1 - R2)
    • As R2 approaches 1, VIF approaches infinity
  • As a rule of thumb, we try to remove features for which VIF > 5
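The VIF computation described above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the helper names (`solve`, `r_squared`, `vif`) and the data are hypothetical, and the least-squares fit uses the normal equations with a small Gaussian elimination routine.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def r_squared(target, predictors):
    """R^2 of an OLS fit of target on the predictor columns plus an intercept."""
    n = len(target)
    cols = [[1.0] * n] + predictors
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, target)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[k] * cols[k][i] for k in range(len(cols))) for i in range(n)]
    mean_t = sum(target) / n
    rss = sum((t - f) ** 2 for t, f in zip(target, fitted))
    tss = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - rss / tss


def vif(features):
    """VIF = 1 / (1 - R^2), with each feature regressed on all the others."""
    result = {}
    for name, col in features.items():
        others = [c for other, c in features.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Made-up data: x1 and x2 are nearly uncorrelated, but x3 is close to x1 + x2.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
    "x3": [6.1, 4.9, 11.2, 4.8, 12.1, 8.0, 12.9, 12.2],
}

all_vifs = vif(features)
reduced_vifs = vif({k: v for k, v in features.items() if k != "x3"})
print(all_vifs)      # all three are far above 5
print(reduced_vifs)  # close to 1 once x3 is dropped
```

With all three features present, every VIF is large because each feature is nearly a linear combination of the other two. Dropping x3 brings the remaining VIFs back to about 1.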

  • The example at [1] shows the use of VIF to reduce the number of features.
  • Once we identify features with a high VIF, we need to reduce it
    • We can do this by eliminating some features
    • How do we identify which feature to remove?
      • Check which features are correlated with the feature that has a high VIF
      • In the example at [1], weight and BSA were correlated
      • In practice weight is easier to measure, so it was kept
        • So such a decision depends on the practical implications
      • It can also happen that one feature is correlated with many others, in which case we might want to remove it

 

Reference

[0] : https://www.youtube.com/watch?v=0SBIXgPVex8

[1] : https://newonlinecourses.science.psu.edu/stat501/node/347/

 

 

Generalized Linear Models (GLM)

In standard linear regression we make two assumptions:

  1. P(Y|X) is a normal distribution
  2. The mean µ is a linear function of the inputs: µ = β*X
  3. Combining the two: P(Y|X) = N(µ, σ^2 * I)       # σ is the standard deviation and I is the identity matrix

 

In a GLM we relax both:

  1. P(Y|X) can be any distribution from the exponential family
  2. The mean is some function of β*X
    1. µ = f(β*X)
    2. g(µ) = β*X
    3. g = f^(-1)
    4. g is called the link function

 

Examples of link functions:

  1. log link
  2. reciprocal link
  3. logit link
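As a small illustration of g and f = g^(-1), the sketch below pairs each of the three links with its inverse and checks the round trip µ → g(µ) → f(g(µ)). The dictionary name `links` and the test value of µ are assumptions made for this example.

```python
import math

# Each link g maps the mean mu to the linear predictor eta = beta * X;
# its inverse f maps eta back to mu.
links = {
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}

mu = 0.3  # a mean that is valid for all three links (positive and below 1)
for name, (g, f) in links.items():
    eta = g(mu)
    print(f"{name:10s} eta = {eta: .4f}   f(eta) = {f(eta):.4f}")
```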

 

The log-likelihood is derived in the same way as for the normal distribution. However, a closed-form solution generally does not exist, so the parameters are found numerically, for example with iteratively reweighted least squares or another convex optimization method.

Here is one example from the MIT course mentioned in the references.

[Figure: Poisson example (image not available)]

Logistic

  • In Gaussian regression we predict μ for each sample
    • This μ comes from β0, β1, β2, which are the same for every sample
  • For binomial regression we want to predict p for each sample
    • This p comes from β0, β1, β2, which are the same for every sample
  • Option one:
    • p = β0 + β1*x1 + β2*x2
  • Option two:
    • p = sigmoid(β0 + β1*x1 + β2*x2)
    • f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
    • This is the logit link function
  • What are the other options apart from sigmoid?
    • Step function (not differentiable, which is why we use sigmoid)
    • tanh is sometimes used in deep learning
  • What if we go with option one?
    • The binomial distribution requires p to be in [0,1], but a linear function of the inputs can fall outside that range
  • Example:
    • How many fish survive (alive/dead) given food and water
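The difference between the two options can be seen with a few numbers. In the sketch below the coefficients and inputs are hypothetical; the point is only that the raw linear predictor can leave the valid range for a probability, while the sigmoid of it cannot.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
b0, b1, b2 = -1.0, 0.8, 0.5
inputs = [(0.0, 0.0), (2.0, 1.0), (5.0, 4.0)]

rows = []
for x1, x2 in inputs:
    eta = b0 + b1 * x1 + b2 * x2  # option one would use this directly as p
    p = sigmoid(eta)              # option two squashes it into (0, 1)
    rows.append((eta, p))
    print(f"linear = {eta:5.2f}   sigmoid = {p:.3f}   logit(sigmoid) = {logit(p):5.2f}")
```

For these inputs the linear predictor is -1.0, 1.1 and 5.0, none of which is a valid probability, while the sigmoid values all lie strictly between 0 and 1. Applying the logit to the sigmoid output recovers the linear predictor.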

 

Poisson

  • The Poisson distribution models the probability of observing a count k
    • P(k) = exp(-λ) * (λ^k) / k!
    • The parameter λ must be > 0
  • Option one:
    • λ = β0 + β1*x1 + β2*x2
  • Option two:
    • λ = exp ( β0 + β1*x1 + β2*x2 )
    • f ( λ ) = log ( λ ) = ( β0 + β1*x1 + β2*x2 )
    • This is the log link function
  • What if we go with option one?
    • We want λ > 0, but a linear function of the inputs can become negative
    • The relationship between inputs and output is often multiplicative rather than additive
      • Suppose enough water multiplies the number of germinated seeds by 1.5 and enough fertilizer multiplies it by 1.2. With both enough water and enough fertilizer, the factor is not 1.5 + 1.2 = 2.7 but 1.5 * 1.2 = 1.8. [3]
  • Example:
    • How many seeds will germinate given water and fertilizer
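The multiplicative behaviour follows directly from the log link, since exp(a + b) = exp(a) * exp(b). A minimal sketch, assuming a hypothetical baseline of 10 germinated seeds and 0/1 indicators for water and fertilizer:

```python
import math

# Hypothetical coefficients: exp(beta) is the multiplicative effect of each input.
beta0 = math.log(10.0)       # assumed baseline rate: 10 seeds on average
beta_water = math.log(1.5)   # enough water multiplies the rate by 1.5
beta_fert = math.log(1.2)    # enough fertilizer multiplies the rate by 1.2


def rate(water, fertilizer):
    """Poisson mean under the log link: lambda = exp(b0 + b1*x1 + b2*x2)."""
    return math.exp(beta0 + beta_water * water + beta_fert * fertilizer)


baseline = rate(0, 0)
print("water only:     ", rate(1, 0) / baseline)
print("fertilizer only:", rate(0, 1) / baseline)
print("both:           ", rate(1, 1) / baseline)
```

The combined factor comes out as 1.5 * 1.2 = 1.8 (up to floating-point rounding), and the rate stays positive for any choice of coefficients.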

 

Parameter Estimation

  • We can use maximum likelihood estimation to find the parameters β0, β1, β2
  • Deriving the likelihood for binomial:
    • likelihood = Multiply ( likelihood of y_actual of each sample under the predicted distribution )
    • likelihood = Multiply ( Binomial(p) )
    • likelihood = Multiply ( p if y=1 else (1-p) )
    • log(likelihood) = Summation ( y*log(p) + (1-y)*log(1-p) )
  • Deriving the likelihood for Poisson:
    • likelihood = Multiply ( likelihood of y_actual of each sample under the predicted distribution )
    • likelihood = Multiply ( Poisson(u) ), where u is the predicted mean λ
    • likelihood = Multiply ( exp(-u) * u^y / y! )
    • log(likelihood) = Summation ( -u + y*log(u) - log(y!) )
  • The two derivations above are rough, but they convey the idea
  • For Gaussian this reduces to OLS (Ordinary Least Squares), which has a closed-form solution
  • For the others we maximize the log-likelihood numerically, e.g. with gradient-based or Newton’s method.
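To make the numerical route concrete, here is a minimal sketch of maximum likelihood for the binomial (logistic) case with one feature, using plain gradient ascent on the log-likelihood. The data, the learning rate and the number of steps are assumptions chosen for illustration; real GLM software typically uses Newton-type methods instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, xs, ys):
    """Sum over samples of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta[0] + beta[1] * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent; the gradient of the log-likelihood is sum((y - p) * x)."""
    beta = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(beta[0] + beta[1] * x)
            g0 += err
            g1 += err * x
        beta[0] += lr * g0 / n
        beta[1] += lr * g1 / n
    return beta


# Made-up data; the classes overlap, so the maximum likelihood estimate is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat = fit(xs, ys)
print("beta:", beta_hat)
print("log-likelihood at start:", log_likelihood([0.0, 0.0], xs, ys))
print("log-likelihood at fit:  ", log_likelihood(beta_hat, xs, ys))
```

The log-likelihood at the fitted parameters is higher than at the starting point, and the slope is positive because larger x values go with y = 1 more often in this data.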

 

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

 

Interpreting Statistical Values

In this post, we will explore the values in the summary(model) output in R and understand their significance.

Here is a screenshot illustrating the summary:

[Figure: screenshot of the R summary(model) output (image not available)]

Significance of Residuals

  • We want our residuals to be normally distributed and centered around zero.
  • It’s similar to aiming at the bullseye on a dartboard.
    • If the residuals are biased in one direction, there is room for improvement.
    • If the residuals are spread equally in all directions, we can attempt to reduce the standard deviation.
    • Irreducible error shows up as scatter spread evenly in all directions.
  • The residual quantiles provide an initial insight into symmetry.
  • R also reports the standard deviation of the residuals, known as RSE (residual standard error).

The Relationship between t-value and p-value in the Coefficient Section

  • These values test whether a variable has a relationship with the output.
  • This is a preset statistical question (null hypothesis: the coefficient is zero) that cannot be changed.
  • If the coefficient is zero, the variable does not contribute; otherwise, it does.
  • The t-value is the number of standard errors the estimated coefficient is away from zero (t = Estimate / Std. Error).
  • A larger absolute t-value signifies a more significant variable.
Calculating p-values

  • Intuition (not how it is actually computed): imagine taking repeated samples from a larger population.
  • Each sample yields a different coefficient, which can be close to zero for some samples.
  • The variance of the estimated parameters can be derived mathematically as σ^2 * (X^T * X)^(-1).
  • σ^2 can be estimated from the residuals, as RSS / (n - p - 1).
  • A Bayesian view helps appreciate a distribution over coefficients rather than a point estimate.
    • In R, p-values are calculated from the t-distribution with n - p - 1 degrees of freedom, which relies on the assumption of normally distributed errors.
  • In the R result display, we have an estimate and a standard error for each coefficient.
    • The coefficient can be seen as a random variable centered at the Estimate in the R summary.
    • The estimate is t standard errors away from zero.
    • The p-value is the probability, if the true coefficient were zero, of observing an estimate at least |t| standard errors away from zero.
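The chain from residuals to standard error, t-value and p-value can be sketched for a simple regression with one feature. Everything below is an illustration under assumptions: the data is made up, and the helper `t_two_sided_p` is a hypothetical function that integrates the t density numerically with Simpson's rule instead of calling a statistics library.

```python
import math


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic, by Simpson integration of the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


# Made-up data for a simple regression y = b0 + b1 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.3, 2.9, 4.1, 4.6, 6.2, 6.8, 8.1, 8.7, 10.4, 10.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
b1 = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x

rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
df = n - 2                        # n - p - 1 with p = 1 feature
sigma2 = rss / df                 # estimate of sigma^2 from the residuals
se_b1 = math.sqrt(sigma2 / sxx)   # slope entry of sigma^2 * (X^T X)^(-1)
t_value = b1 / se_b1
p_value = t_two_sided_p(t_value, df)

print("estimate:", b1, "std. error:", se_b1)
print("t value:", t_value, "p-value:", p_value)
```

For one feature, the slope entry of σ^2 * (X^T * X)^(-1) simplifies to σ^2 / Sxx, which is what the sketch uses.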

[Figure: formulas (image not available)]

Role of R^2

  • R^2 indicates how much of the variance is explained by the model. Refer to formulas above for a better understanding.
  • R^2 has an advantage over RSE: it always falls between 0 and 1, whereas RSE is measured in the units of Y.

Determining a Good Value of R^2

  • A good value of R^2 depends on the problem setting.
  • When we make perfect predictions, RSS = 0 and hence R^2 = 1
  • In physics, if we are confident the data follows a linear model, R^2 close to 1 is desirable.
  • In marketing, a small proportion of the variance can be explained by predictors, so R^2 = 0.1 can be realistic.

Difference between Absolute and Adjusted R^2

  • R^2 never decreases when a variable is added, while adjusted R^2 decreases if the added variable does not improve the fit enough.
  • The formula for adjusted R^2 incorporates the number of variables p, so adding a variable that barely reduces RSS lowers the result.
  • The formulas below show that RSE may increase even while RSS decreases, because RSE divides RSS by n - p - 1, which shrinks as variables are added. Plain R^2 carries no such penalty.
[Figure: adjusted R^2 formula (adR2.PNG)]
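The quantities discussed above can be computed with the standard definitions R^2 = 1 - RSS/TSS, adjusted R^2 = 1 - (RSS/(n-p-1)) / (TSS/(n-1)) and RSE = sqrt(RSS/(n-p-1)). The sketch below uses hypothetical RSS and TSS values to show a case where adding a variable lowers RSS slightly, yet adjusted R^2 drops and RSE rises.

```python
import math


def fit_stats(rss, tss, n, p):
    """R^2, adjusted R^2 and RSE for a model with p predictors plus an intercept."""
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
    rse = math.sqrt(rss / (n - p - 1))
    return {"r2": r2, "adj_r2": adj_r2, "rse": rse}


# Hypothetical numbers: 12 observations, total sum of squares 100.
small = fit_stats(rss=20.0, tss=100.0, n=12, p=1)
large = fit_stats(rss=19.5, tss=100.0, n=12, p=2)  # one extra, weak variable

print("1 variable: ", small)
print("2 variables:", large)
```

Here R^2 rises from 0.80 to 0.805, while adjusted R^2 falls from 0.78 to about 0.762 and RSE grows from about 1.41 to about 1.47.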

Significance of F Statistics

  • The F-test determines if a group of variables is jointly significant, whereas the t-test examines the significance of individual variables.
  • The F-statistic also has an associated p-value.
  • The null hypothesis for the F-test is that your model fits no better than the intercept-only model, i.e. that all coefficients other than the intercept are zero.
  • While R-squared provides an estimate of the relationship strength between the model and response variable, it does not offer a formal hypothesis test. This test is provided by the F-test.

Why Use F Statistics when Individual Coefficient p-values are Available?

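  • It may seem that if one coefficient is significant (small p-value), the overall model must also be significant.
  • However, this reasoning breaks down when the number of variables is large: at a 5% threshold, about 5% of variables with no real effect will still show a small p-value by chance. The F-test accounts for the number of variables.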

Determining Good Values of F-statistics

  • It depends on the values of n (number of observations in the training set) and p (number of independent variables).
  • When n is large, an F-value slightly greater than 1 is sufficient to reject the null hypothesis.
  • It is advisable to base decisions on corresponding p-values, which consider both n and p.
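The dependence on n and p is visible in the formula itself: F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), which can also be written in terms of R^2. A minimal sketch, with hypothetical values of R^2, n and p:

```python
def f_statistic(r2, n, p):
    """Overall F-statistic from R^2: (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# The same R^2 = 0.1 with p = 3 variables, for a small and a large sample.
f_small_n = f_statistic(0.1, n=20, p=3)
f_large_n = f_statistic(0.1, n=2000, p=3)

print("n = 20:  ", f_small_n)
print("n = 2000:", f_large_n)
```

The same R^2 gives an F-value below 1 for n = 20 and an F-value above 70 for n = 2000, so an F-value cannot be judged without n and p. Turning F into a p-value requires the F-distribution with (p, n - p - 1) degrees of freedom, which this sketch leaves out.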

Degrees of Freedom:

  • Suppose you have two features, x1 and x2, and a target variable y.
  • The fitted equation is y = a1*x1 + a2*x2 + a3, which describes a plane.
  • In a 3D space, three points (not on one line) define a unique plane.
  • With n points, p (= 2) features, and 1 target, three points can always be fitted exactly by the plane, while the remaining (n-p-1) points can deviate from it. This count is the residual degrees of freedom.
  • Degrees of freedom are the difference between n and the number of estimated coefficients, including the intercept.

Significance Score “***” in the Coefficient Section

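  • R indicates the significance of a p-value by displaying stars.
  • The stars are not a separate calculation: they mark which threshold the p-value falls below (*** for p < 0.001, ** for p < 0.01, * for p < 0.05), and the p-value itself comes from the t-distribution.
  • Bootstrapping is an alternative way of assigning measures of accuracy to sample estimates, such as bias, variance, confidence intervals, or prediction error.
  • In Bayesian inference, a posterior distribution is obtained for each parameter, from which comparable measures of uncertainty can be calculated.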
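The legend printed under the coefficient table in R has the form `Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`. A minimal sketch of that mapping follows; the function name `signif_stars` is hypothetical, and how values exactly on a threshold are treated is an assumption here.

```python
def signif_stars(p_value):
    """Map a p-value to the significance code shown in R's coefficient table."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.1:
        return "."
    return " "


for p in [0.0004, 0.003, 0.02, 0.07, 0.4]:
    print(f"p = {p:<7} code = '{signif_stars(p)}'")
```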

References

Found the formula for adjusted R2 here