Generalized Linear Models (GLM)

In standard linear regression we make two assumption :

  1. P(Y/X) is a normal distribution
  2. Mean is a linear function of parameter µ  = β*X
  3. P(Y/X) = Ν(µ, σ^2* I)       # σ is standard deviation and I is identity matrix

 

In GLM we relax two things :

  1. P(Y/X) is from any exponential family
  2. Mean is some function of β*X
    1. µ = f(β*X)
    2. g(µ) = β*X
    3. g = f^(-1)
    4. g is called link function

 

Example of link functions:

  1. log link
  2. reciprocal link
  3. logistic link

 

Derivation of log-likelihood matches that of normal distribution. However closed form solution is not defined and is generally solved by least square and convex optimization.

Here is one example from MIT course mentioned in references.

poisson

Logistic

  • In gaussian regression we predict μ for each sample
    • This μ comes from β0, β1, β2 which are same for each sample
  • For binomial regression we want to predict p for each sample
    • This p comes from β0, β1, β2 which are same for each sample
  • One option :
    • p = β0 + β1*x1 + β2*x2
  • Second option
    • p = sigmoid (β0 + β1*x1 + β2*x2)
    • f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
    • It is logit link function
  • What are other options apart from sigmoid
    • step function (Not differentiable, that is why we use (sigmoid)
    • tanh is sometime used in deep learning
  • What if we go with option 1:
    • Binomial distribution requires p to be in (0,1)
  • Example :
    • How many fishes survive (alive/dead) given food and water

 

Poisson

  • Poisson distribution models probability of observing count
      • P(k) = exp(-λ) * (λ^k) / k !

    Parameter λ >= 0

  • Option one:
    • λ = β0 + β1*x1 + β2*x2
  • Option two:
    • λ = exp ( β0 + β1*x1 + β2*x2 )
    • f ( λ ) = log ( λ ) = ( β0 + β1*x1 + β2*x2 )
    • It is log link function
  • What if we go with option one:
    • We want λ > 0
    • Relationship between input and output is not additive but multiplicative ?
      • Suppose the seeds have germinated as many as 1.5 times by the enough water and as many as 1.2 times by the enough fertilizer. When you give both enough water and enough fertilizer, the seeds would germinate as many as 1.5 + 1.2 = 2.7 times ?
        Of course, it’s not. The estimated value would be 1.5 * 1.2 = 1.8 times. [3]
  • Example:
    • How many seed will germinate given water and fertilizer

 

Parameter Estimation

  • We can do maximum likelihood estimate and find parameters β0, β1, β2
  • Deriving maximum likelihood for binomial:
    • max_lh = Multiply (Likelihood of y_acutal of each sample for predicted distribution)
    • max_lh = Multiply ( Binomial(p) )
    • max_lh = Multiply ( p if y=1 else (1-p) )
    • log(max_lh) = Summation (y*logp + (1-y) log (1-p))
  • Deriving maximum likelihood for Poisson:
    • max_lh = Multiply (Likelihood of y_acutal of each sample for predicted distribution)
    • max_lh = Multiply ( Poisson (u) )
    • max_lh = Multiply (exp(-u) * u^y / y! )
    • log(max_lh) = summation ( -u + y*log(u) – log (y!) )
  • Above two are rough derivations but conveys the idea
  • For Gaussian it turns out to OLS (Ordinary Least Squares) and has closed form solution
  • For other we solve it via gradient/newton’s method.

 

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

 

Exponential Family

Here is the basic concept :

simple_exponential

  • θ are parameters and X is data, both can be multidimensional
  • We want to restrict terms inside exponential to the form θ*X

 

Formal Definition:

formal_defination

  • η and T functions also help for the case when there is a mismatch in the dimension of θ and X.
  • g(θ) in basic concept above has been taken into exponential as B(θ).
    • It vaguely serves as normalization factor.
  • h(x) serves a distribution and exponential transfer this basic distribution

 

examples

 

Further Reading

Click to access chapter8.pdf

Click to access lecture12.pdf

 

 

 

 

 

Grubb’s Test for Anomaly Detection

Problem Statement:

We are receiving time series of count data everyday and we want to detect whenever there is drastic change in this count.

 

Grubb’s test assumes a t-distribution of input and find out the outliers for required confidence interval. We remove this outlier and repeat the test again. Here is the pseudo code:

 

Grubbs Test(X, p-val=0.05):
    Repeat :
        Z <- zscore(X)
        n < len(X)
        zi, index <- max(abs(Z)), index(max(abs(Z)))
        if zi > threshold(N, p-val):
            remove X[index] 
        else:
            break

 

Traditionally Grubb’s tests has a alternate hypothesis that exactly one outlier is present in data. In above we modified it get all possible outliers.

 

Test Hypothesis
Grubb’s Test H0: There are no outliers in the data set
Ha: There is exactly one outlier in the data
Tietjen-Moore test H0: There are no outliers in the data set
Ha: There are exactly k outliers
Generalized ESD test H0: There are no outliers in the data set
Ha: There are up to r outliers

 

Above can be extended to two sided tests as well.

 

We had followed this in time series based anomaly detection and following approach were considered for pre processing before applying Grubb’s test:

  • Raw Count (No processing)
  • Residuals after STL decomposition
  • Residuals after fitting ARIMA

In our case raw count had worked well enough.

 

Reference:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

Github Gist

 

IID Assumption

This is an interesting thread on the topic : https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning

 

Some notes:

 

1) Method of cross validation changes based in i.i.d assumption.

2)

tom

Looking at the last three equation, p(D/h) = p((d1, d2, d3)/h) can be reduced to multiplication only when d1, d2, d3 are independent.

 

Panel Data And Analysis

From Wikipedia

In statistics and econometrics, panel data or longitudinal data[1][2] are multi-dimensional data involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals

Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).

A study that uses panel data is called a longitudinal study or panel study.

 

Cross sectional data:

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects.

 

Examples

 

Panel Data

panel

Cross Sectional Data

cs

Time Series

ts

ARIMA model

ARMA to ARIMA

  • When there is a trend in data we take differences
  • ARIMA – Auto regressive Integrated Moving average
  • Integrated term includes order of difference, In the example below it is d=2

arima

Below is the sample github gist and output pdf is avaialble at ARIMA model.pdf


title: "ARIMA model"
author: "Archit Vora"
date: "April 3, 2018"
output: html_document
“`{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
“`
We will try to fit ARIMA model to BJsales dataset from 'dataset' package in R.
Here is the plot of it.
“`{r}
library(datasets)
plot(BJsales)
“`
Mean in changing over time and seems time series is not statioary. Let's take the difference.
“`{r}
plot(diff(BJsales))
“`
Seems It is still not staionary, let's take one more diff.
“`{r}
plot(diff(diff(BJsales)))
“`
Now it seems stationary. Let's plot acf and pacf of this doubly differenced series.
“`{r}
focus<- diff(diff(BJsales))
acf(focus)
pacf(focus)
“`
From ACF lag 1, 8 and 11 seems significant, while from PACF seems lag 1, 2, 3, 10 and 19 seems significant.
Keeping parsimonious principle in mind we shall consider order of 0 and 1 for MA terms and oreder of (0, 1, 2, 3) for AR terms.
Now let's try differenct models and check their AIC.
“`{r}
d <- 2
for (p in 0:3){
for (q in 0:1){
if(p+d+q<6){ # Ensures simple model
mm<- arima(x = BJsales, order = c(p, d, q))
pval <- Box.test(mm$residuals, lag = log(length(mm$residuals)))
sse <- sum(mm$residuals^2)
aic <- mm$aic
cat(p, d, q, "AIC = ", aic, "SSE = ", sse, "P-value = ", pval$p.value, "\n")
}
}
}
“`
Seems like (0, 2, 1) has smallest AIC but does not have significant p-value.
Let's leave it for this post and plots residual.
“`{r}
model <- arima(x = BJsales, order = c(0, 2, 1))
par(mfrow = c(2, 2))
plot(model$residuals)
acf(model$residuals)
pacf(model$residuals)
qqnorm(model$residuals)
“`
There seems no significant correlation as well as QQ-plot also looks okay.
Now let's plot the forecast.
“`{r}
library(forecast)
fc <- forecast(model, h=20)
plot(fc)
“`
All this can also be done by auto.arima routine of forecast package.
“`{r}
model <- auto.arima(BJsales, seasonal = FALSE)
model
fc <- forecast(model, h = 20)
plot(fc)
“`
auto.arima has come up with different model, but it is okay.

view raw

arima.Rmd

hosted with ❤ by GitHub

Fitting AR Processes

Yule Walker Equation in Matrix Form

 

ym1

  • If we write and above equation for k=1, 2, . . ., n and use the fact that ρ(k) = ρ(-k), we can write it in a matrix form.
  • Using the data we have we can estimate values of ρ  (auto correlation coefficients)
  • acf() routine in R gives us that
  • Using values of ρ we can then estimate values of Φ (parameters of AR process)

 

ym2

  • Above is an example for AR process
  • We can solve these equation for values of Φ1, Φ2 and Φ3

 

Reference:

Moving average and Auto-regressive Processes

Moving Average Processes MA(q)

  • Stock price depends on announcements of last two days
  • Auto correlation function cuts off at q

maq

 

Auto regressive Processes AR(p)

 

1

2

3

 

  • Below are the plots for AR(2) process
  • Depending upon the value of phi1 and phi2 ACF has alternative positive and negative values

 

56

 

Writing AR(p) process as MA process by substituting values of X(t-1). And yes phi is constant, we don’t need phi1, phi2 anymore.

78

 

Mean, variance and auto-correlation of AR(p) process, we have assumed Z = Norm(0, sigma2)

9

 

 

ACF of AR-p using Yule-Walker Equation

  • It is a method of solving difference equation in recursive relation
  • We first obtained auxiliary equation (also known as characteristic equation) which is polynomial and find root of that
  • Using these root we get weighted geometric series and find weights using some initial condition
  • We had learned in mathematics that this way of solving difference equation also related to solving differential equations
  • In the course they had solved it for Fibonacci series and root had come out to be golden ratio
  • For AR(p) ACF comes out to be difference equation, solving which can give us ACF for different values of lag

 

 

Reference

https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

 

 

 

 

 

Stationarity Conditions for MA(q) and AR(p) Processes

Sequence and Series

Convergent Sequence

1/2, 2/3, 3/4, . . . , n/(n+1)

Divergent Sequence

3, 9, 27, . . . . , 3^n

Series => Partial Sum of sequence

Convergent Series => if sum converges

Convergence Test

  • Integral Test
  • Comparison Test
  • Limit comparison test
  • Alternating Series Test
  • Ratio test
  • Root test
Geometric Series

  • a, ar, ar^2, . . . , ar^n
  • Convergent if r < 1
Representing function as (geometric) series

seriesRepresntation

Backward shift operator

  • B^kX(t) = X(t – k)

backOp

Invertibility

  • Two models have same ACF
  • Given ACF how to find out the model
  • We will go for model that is invertible
  • We can invert MA(1) into AR(∞)
  • Inverting is basically act of expanding function in geometric series
  • It is possible when growth r<1
  • Out of two models only one satisfies this condition
  • We will select that model given ACF

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

dual

How to check if series is both invertible and stationary

  • Check AR(p) polynomial for stationarity
  • Check MA(q) polynomial for invertibility
  • Both should hold

Reference

https://www.coursera.org/learn/practical-time-series-analysis/exam/ITocA/series-backward-shift-operator-invertibility-and-duality

 

[Time Series] Correlation and Stationarity

Co-variance vs Correlation

  • Correlation is co-variance divided by standard deviation of both variables
  • Hence it is independent of units and is always between -1 and 1, which makes comparison easier
  • Formula on the right is time series specific
    • It is auto correlation coefficient at lag k
    • It is define as ration of auto-correlation at lag k divide by auto-correlation at lag 0
    • This values are plotted on correlogram  (See one for MA(2) process below)

acf

 

Stationary Time Series

  • No systematic change in mean (No trend)
  • No systematic change in Variance
  • No periodic variation (Seasonality)

If time series is not stationary we apply several transformation to make it stationary.

For example applying difference operator to random walk makes it stationary.

 

 

Random Walk

  • Previous value of noise
  • If first value is zero then current value is summation of all the noises so far
  • X(t) = X(t-1) + Z(t)
  • Z(t) = Normal (mu, sigma2)
  • if X(0) = 0 then X(t) = sum(Z(k)) k form 0 to t
  • Expectation[X(t)] = t*mu   – –  Changes with time
  • Variance[X(t)] = t*sigma2   – – Changes with time
  • Not a stationary process
  • let Y(t) = X(t) – X(t-1) = Z(t)  – – Y(t) is a stationary process

 

Example of Stationary Process

Moving average and Auto regressive processes described here can be stationary under conditions described here.

 

 

References

 

Further reading