Generalized Linear Models (GLM)

September 18, 2018October 25, 2020Archit Vora 1 Comment

In standard linear regression we make two assumption :

P(Y/X) is a normal distribution
Mean is a linear function of parameter µ = β*X
P(Y/X) = Ν(µ, σ^2* I) # σ is standard deviation and I is identity matrix

In GLM we relax two things :

P(Y/X) is from any exponential family
Mean is some function of β*X
1. µ = f(β*X)
2. g(µ) = β*X
3. g = f^(-1)
4. g is called link function

Example of link functions:

log link
reciprocal link
logistic link

Derivation of log-likelihood matches that of normal distribution. However closed form solution is not defined and is generally solved by least square and convex optimization.

Here is one example from MIT course mentioned in references.

poisson

Logistic

In gaussian regression we predict μ for each sample
- This μ comes from β0, β1, β2 which are same for each sample
For binomial regression we want to predict p for each sample
- This p comes from β0, β1, β2 which are same for each sample
One option :
- p = β0 + β1*x1 + β2*x2
Second option
- p = sigmoid (β0 + β1*x1 + β2*x2)
- f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
- It is logit link function
What are other options apart from sigmoid
- step function (Not differentiable, that is why we use (sigmoid)
- tanh is sometime used in deep learning
What if we go with option 1:
- Binomial distribution requires p to be in (0,1)
Example :
- How many fishes survive (alive/dead) given food and water

Poisson

Poisson distribution models probability of observing count
- - P(k) = exp(-λ) * (λ^k) / k !
Parameter λ >= 0
Option one:
- λ = β0 + β1*x1 + β2*x2
Option two:
- λ = exp ( β0 + β1*x1 + β2*x2 )
- f ( λ ) = log ( λ ) = ( β0 + β1*x1 + β2*x2 )
- It is log link function
What if we go with option one:
- We want λ > 0
- Relationship between input and output is not additive but multiplicative ?
  - Suppose the seeds have germinated as many as 1.5 times by the enough water and as many as 1.2 times by the enough fertilizer. When you give both enough water and enough fertilizer, the seeds would germinate as many as 1.5 + 1.2 = 2.7 times ?
    Of course, it’s not. The estimated value would be 1.5 * 1.2 = 1.8 times. [3]
Example:
- How many seed will germinate given water and fertilizer

Parameter Estimation

We can do maximum likelihood estimate and find parameters β0, β1, β2
Deriving maximum likelihood for binomial:
- max_lh = Multiply (Likelihood of y_acutal of each sample for predicted distribution)
- max_lh = Multiply ( Binomial(p) )
- max_lh = Multiply ( p if y=1 else (1-p) )
- log(max_lh) = Summation (y*logp + (1-y) log (1-p))
Deriving maximum likelihood for Poisson:
- max_lh = Multiply (Likelihood of y_acutal of each sample for predicted distribution)
- max_lh = Multiply ( Poisson (u) )
- max_lh = Multiply (exp(-u) * u^y / y! )
- log(max_lh) = summation ( -u + y*log(u) – log (y!) )
Above two are rough derivations but conveys the idea
For Gaussian it turns out to OLS (Ordinary Least Squares) and has closed form solution
For other we solve it via gradient/newton’s method.

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

Exponential Family

September 18, 2018September 18, 2018Archit Vora Leave a comment

Here is the basic concept :

simple_exponential

θ are parameters and X is data, both can be multidimensional
We want to restrict terms inside exponential to the form θ*X

Formal Definition:

formal_defination

η and T functions also help for the case when there is a mismatch in the dimension of θ and X.
g(θ) in basic concept above has been taken into exponential as B(θ).
- It vaguely serves as normalization factor.
h(x) serves a distribution and exponential transfer this basic distribution

examples

Grubb’s Test for Anomaly Detection

August 29, 2018August 29, 2018Archit Vora Leave a comment

Problem Statement:

We are receiving time series of count data everyday and we want to detect whenever there is drastic change in this count.

Grubb’s test assumes a t-distribution of input and find out the outliers for required confidence interval. We remove this outlier and repeat the test again. Here is the pseudo code:

Grubbs Test(X, p-val=0.05):
    Repeat :
        Z <- zscore(X)
        n < len(X)
        zi, index <- max(abs(Z)), index(max(abs(Z)))
        if zi > threshold(N, p-val):
            remove X[index]  
        else:
            break

Traditionally Grubb’s tests has a alternate hypothesis that exactly one outlier is present in data. In above we modified it get all possible outliers.

Test	Hypothesis
Grubb’s Test	H0: There are no outliers in the data set Ha: There is exactly one outlier in the data
Tietjen-Moore test	H0: There are no outliers in the data set Ha: There are exactly k outliers
Generalized ESD test	H0: There are no outliers in the data set Ha: There are up to r outliers

Above can be extended to two sided tests as well.

We had followed this in time series based anomaly detection and following approach were considered for pre processing before applying Grubb’s test:

Raw Count (No processing)
Residuals after STL decomposition
Residuals after fitting ARIMA

In our case raw count had worked well enough.

Reference:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

Github Gist

IID Assumption

July 7, 2018July 7, 2018Archit Vora Leave a comment

This is an interesting thread on the topic : https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning

Some notes:

1) Method of cross validation changes based in i.i.d assumption.

For time series we use one step ahead cross validation.
Also in panel data when we have clustered structure we sample whole cluster to preserve shape of the data

tom

Looking at the last three equation, p(D/h) = p((d1, d2, d3)/h) can be reduced to multiplication only when d1, d2, d3 are independent.

Panel Data And Analysis

July 7, 2018October 25, 2020Archit Vora Leave a comment

From Wikipedia

In statistics and econometrics, panel data or longitudinal data^[1]^[2] are multi-dimensional data involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals

Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).

A study that uses panel data is called a longitudinal study or panel study.

Cross sectional data:

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects.

Examples

Panel Data

panel

Cross Sectional Data

Time Series

ARIMA model

April 3, 2018April 3, 2018Archit Vora Leave a comment

ARMA to ARIMA

When there is a trend in data we take differences
ARIMA – Auto regressive Integrated Moving average
Integrated term includes order of difference, In the example below it is d=2

arima

Below is the sample github gist and output pdf is avaialble at ARIMA model.pdf

	—
	title: "ARIMA model"
	author: "Archit Vora"
	date: "April 3, 2018"
	output: html_document
	—

	“`{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	“`

	We will try to fit ARIMA model to BJsales dataset from 'dataset' package in R.
	Here is the plot of it.

	“`{r}
	library(datasets)
	plot(BJsales)
	“`

	Mean in changing over time and seems time series is not statioary. Let's take the difference.

	“`{r}
	plot(diff(BJsales))
	“`

	Seems It is still not staionary, let's take one more diff.

	“`{r}
	plot(diff(diff(BJsales)))
	“`
	Now it seems stationary. Let's plot acf and pacf of this doubly differenced series.

	“`{r}
	focus<- diff(diff(BJsales))
	acf(focus)
	pacf(focus)
	“`

	From ACF lag 1, 8 and 11 seems significant, while from PACF seems lag 1, 2, 3, 10 and 19 seems significant.

	Keeping parsimonious principle in mind we shall consider order of 0 and 1 for MA terms and oreder of (0, 1, 2, 3) for AR terms.

	Now let's try differenct models and check their AIC.

	“`{r}
	d <- 2
	for (p in 0:3){
	for (q in 0:1){
	if(p+d+q<6){ # Ensures simple model
	mm<- arima(x = BJsales, order = c(p, d, q))
	pval <- Box.test(mm$residuals, lag = log(length(mm$residuals)))
	sse <- sum(mm$residuals^2)
	aic <- mm$aic
	cat(p, d, q, "AIC = ", aic, "SSE = ", sse, "P-value = ", pval$p.value, "\n")
	}
	}
	}

	“`

	Seems like (0, 2, 1) has smallest AIC but does not have significant p-value.
	Let's leave it for this post and plots residual.

	“`{r}
	model <- arima(x = BJsales, order = c(0, 2, 1))
	par(mfrow = c(2, 2))
	plot(model$residuals)
	acf(model$residuals)
	pacf(model$residuals)
	qqnorm(model$residuals)
	“`
	There seems no significant correlation as well as QQ-plot also looks okay.
	Now let's plot the forecast.

	“`{r}
	library(forecast)
	fc <- forecast(model, h=20)
	plot(fc)
	“`

	All this can also be done by auto.arima routine of forecast package.
	“`{r}
	model <- auto.arima(BJsales, seasonal = FALSE)
	model
	fc <- forecast(model, h = 20)
	plot(fc)
	“`


	auto.arima has come up with different model, but it is okay.

view raw

arima.Rmd

hosted with ❤ by GitHub

Fitting AR Processes

March 31, 2018April 3, 2018Archit Vora Leave a comment

Yule Walker Equation in Matrix Form

ym1

If we write and above equation for k=1, 2, . . ., n and use the fact that ρ(k) = ρ(-k), we can write it in a matrix form.
Using the data we have we can estimate values of ρ (auto correlation coefficients)
acf() routine in R gives us that
Using values of ρ we can then estimate values of Φ (parameters of AR process)

ym2

Above is an example for AR process
We can solve these equation for values of Φ1, Φ2 and Φ3

Reference:

https://www.coursera.org/learn/practical-time-series-analysis/home/week/4

Moving average and Auto-regressive Processes

March 24, 2018July 26, 2018Archit Vora 1 Comment

Moving Average Processes MA(q)

Stock price depends on announcements of last two days
Auto correlation function cuts off at q

maq

Auto regressive Processes AR(p)

Below are the plots for AR(2) process
Depending upon the value of phi1 and phi2 ACF has alternative positive and negative values

Writing AR(p) process as MA process by substituting values of X(t-1). And yes phi is constant, we don’t need phi1, phi2 anymore.

Mean, variance and auto-correlation of AR(p) process, we have assumed Z = Norm(0, sigma2)

ACF of AR-p using Yule-Walker Equation

It is a method of solving difference equation in recursive relation
We first obtained auxiliary equation (also known as characteristic equation) which is polynomial and find root of that
Using these root we get weighted geometric series and find weights using some initial condition
We had learned in mathematics that this way of solving difference equation also related to solving differential equations
In the course they had solved it for Fibonacci series and root had come out to be golden ratio
For AR(p) ACF comes out to be difference equation, solving which can give us ACF for different values of lag

Reference

https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

Stationarity Conditions for MA(q) and AR(p) Processes

March 19, 2018April 3, 2018Archit Vora 1 Comment

Sequence and Series

Convergent Sequence

1/2, 2/3, 3/4, . . . , n/(n+1)

Divergent Sequence

3, 9, 27, . . . . , 3^n

Series => Partial Sum of sequence

Convergent Series => if sum converges

Convergence Test

Integral Test
Comparison Test
Limit comparison test
Alternating Series Test
Ratio test
Root test

Geometric Series

a, ar, ar^2, . . . , ar^n
Convergent if r < 1

Representing function as (geometric) series

seriesRepresntation

Backward shift operator

B^kX(t) = X(t – k)

backOp

Invertibility

Two models have same ACF
Given ACF how to find out the model
We will go for model that is invertible
We can invert MA(1) into AR(∞)
Inverting is basically act of expanding function in geometric series
It is possible when growth r<1
Out of two models only one satisfies this condition
We will select that model given ACF

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

dual

How to check if series is both invertible and stationary

Check AR(p) polynomial for stationarity
Check MA(q) polynomial for invertibility
Both should hold

Reference

https://www.coursera.org/learn/practical-time-series-analysis/exam/ITocA/series-backward-shift-operator-invertibility-and-duality

[Time Series] Correlation and Stationarity

March 18, 2018April 3, 2018Archit Vora Leave a comment

Co-variance vs Correlation

Correlation is co-variance divided by standard deviation of both variables
Hence it is independent of units and is always between -1 and 1, which makes comparison easier
Formula on the right is time series specific
- It is auto correlation coefficient at lag k
- It is define as ration of auto-correlation at lag k divide by auto-correlation at lag 0
- This values are plotted on correlogram (See one for MA(2) process below)

acf

Stationary Time Series

No systematic change in mean (No trend)
No systematic change in Variance
No periodic variation (Seasonality)

If time series is not stationary we apply several transformation to make it stationary.

For example applying difference operator to random walk makes it stationary.

Random Walk

Previous value of noise
If first value is zero then current value is summation of all the noises so far
X(t) = X(t-1) + Z(t)
Z(t) = Normal (mu, sigma2)
if X(0) = 0 then X(t) = sum(Z(k)) k form 0 to t
Expectation[X(t)] = t*mu – – Changes with time
Variance[X(t)] = t*sigma2 – – Changes with time
Not a stationary process
let Y(t) = X(t) – X(t-1) = Z(t) – – Y(t) is a stationary process

Example of Stationary Process

Moving average and Auto regressive processes described here can be stationary under conditions described here.

Data Stories

Category Statistics

Generalized Linear Models (GLM)

Logistic

Poisson

Parameter Estimation

References :

Exponential Family

Here is the basic concept :

Formal Definition:

Further Reading

Grubb’s Test for Anomaly Detection

IID Assumption

Panel Data And Analysis

ARIMA model

Fitting AR Processes

Yule Walker Equation in Matrix Form

Moving average and Auto-regressive Processes

Moving Average Processes MA(q)

Auto regressive Processes AR(p)

ACF of AR-p using Yule-Walker Equation

Reference

Stationarity Conditions for MA(q) and AR(p) Processes

Sequence and Series

Backward shift operator

Invertibility

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

How to check if series is both invertible and stationary

Reference

[Time Series] Correlation and Stationarity

Co-variance vs Correlation

Stationary Time Series

Random Walk

References

Further reading