Interpreting Statistical Values

In this post, we will explore the values in the summary(model) output in R and understand their significance.

Here is a screenshot illustrating the summary:

rsummary

Significance of Residue

  • We desire our residues to be normally distributed and centered around zero.
  • It’s similar to aiming at the bullseye on a dartboard.
    • If the residues are biased in one direction, there is room for improvement.
    • If the residues are equally biased in all directions, we can attempt to reduce the standard deviation.
    • Irreducible error should be observed in all directions simultaneously.
  • Residues quantile provides an initial insight into symmetry.
  • R also provides the standard deviation of residuals, known as RSE (residue standard error).

The Relationship between t-value and p-value in the Coefficient Section

  • The values test if a variable has a relationship with the output.
  • This is a preset statistical question (null hypothesis) that cannot be changed.
  • If the coefficient is zero, it does not contribute; otherwise, it does.
  • The t-value indicates the number of standard deviations the mean is away from zero.
  • A larger t-value signifies a more significant variable.

Calculating p-values

  • Incorrect thinking: Taking samples from a larger population.
  • Each sample yields a different coefficient, which can be zero for some samples.
  • The variance of the estimated parameter can be mathematically derived using (X^T * X)-1 with σ2.
  • σ2 can be obtained from the residue error.
  • Bayesian view helps appreciate the distribution of coefficients rather than point estimation.
    • P-values can be calculated naturally using the T-distribution, as there are no assumptions.
  • In the R result display, we have a mean and standard deviation.
    • The coefficient is a probabilistic variable centered at the mean (Estimate in R summary).
    • The mean is t standard deviations away from zero.
    • The p-value represents the probability of observing a coefficient beyond t standard deviations from the mean.

formulas

Role of R^2

  • R^2 indicates how much of the variance is explained by the model. Refer to formulas above for a better understanding.
  • R^2 has an advantage over RSE as it always falls between 0 and 1.

Determining a Good Value of R^2

  • A good value of R^2 depends on the problem setting.
  • When we make perfect predictions, RSS = 0 and hence R^2 = 1
  • In physics, if we are confident the data follows a linear model, R^2 close to 1 is desirable.
  • In marketing, a small proportion of the variance can be explained by predictors, so R^2 = 0.1 can be realistic.

Difference between Absolute and Adjusted R^2

  • R^2 always increases with the number of variables, while adjusted R^2 decreases if the added variable is not significant.
  • The formula of adjusted R^2 incorporates the number of variables, so when a non-significant variable is added, the result decreases.
  • The formulas below illustrate that RSE may increase while RSS decreases, but they are not directly related to R^2.
adR2.PNG

Significance of F Statistics

  • The F-test determines if a group of variables is jointly significant, whereas the t-test examines the significance of individual variables.
  • F-statistics also have associated p-values.
  • The null hypothesis for the F-test is that the intercept-only model and your model are equal.
  • While R-squared provides an estimate of the relationship strength between the model and response variable, it does not offer a formal hypothesis test. This test is provided by the F-test.

Why Use F Statistics when Individual Coefficient p-values are Available?

  • It may seem that if one coefficient is significant (good p-value), the overall model will also be significant.
  • However, this assumption breaks down when the number of variables with poor p-values is large.

Determining Good Values of F-statistics

  • It depends on the values of n (number of observations in the training set) and p (number of independent variables).
  • When n is large, an F-value slightly greater than 1 is sufficient to reject the null hypothesis.
  • It is advisable to base decisions on corresponding p-values, which consider both n and p.

Degrees of Freedom:

  • Suppose you have two features, x1 and x2, and a target variable y.
  • The line equation is y = a1x1 + a2x2 + a3.
  • In a 3D space, three points define a unique line.
  • With n points, p(2) features, and 1 target, three points will always lie on the line, while (n-p-1) points can deviate from it. This difference represents the degrees of freedom.
  • Degrees of freedom are the difference between n and the number of non-zero coefficients, including the intercept.

Significance Score “***” in the Coefficient Section

  • R indicates the significance of a p-value by displaying stars.
  • The calculation of this value is likely done through bootstrapping.
  • Bootstrapping allows assigning measures of accuracy to sample estimates, such as bias, variance, confidence intervals, or prediction error.
  • In Bayesian inference, parameter distributions are obtained, allowing the calculation of p-values.

References

Found the formula for adjusted R2 here

One thought on “Interpreting Statistical Values

Leave a comment