Interpretation of Multiple Regression Coefficients

August 29, 2018August 29, 2018Archit Vora Leave a comment

In case of simple linear regression with one variable we interpret slop coefficient as follows
- y = b0*x0 + c
- b0 is increase in y for unit increase in x0
In case of multiple regression:
- y = b0*x0 + b1*x1 + c
- b0 is increase in y for unit increase in x0 keeping x1 constant

Implication of keeping other variable constant:

Consider
- house_price = b0 * no_of_bedrooms + c ….. (1)
- house_price = b0 * no_of_bedrooms + b1*square_feet + c …..(2)
In (1) there are high chances that b0 will be positive
In (2) it can be negative
- If you increase no of bedrooms keeping square feet constant each room will be smaller
- This may decrease house price

Reference :

Coursera course on regression by University of Washington DC : https://www.coursera.org/specializations/machine-learning

Grubb’s Test for Anomaly Detection

August 29, 2018August 29, 2018Archit Vora Leave a comment

Problem Statement:

We are receiving time series of count data everyday and we want to detect whenever there is drastic change in this count.

Grubb’s test assumes a t-distribution of input and find out the outliers for required confidence interval. We remove this outlier and repeat the test again. Here is the pseudo code:

Grubbs Test(X, p-val=0.05):
    Repeat :
        Z <- zscore(X)
        n < len(X)
        zi, index <- max(abs(Z)), index(max(abs(Z)))
        if zi > threshold(N, p-val):
            remove X[index]  
        else:
            break

Traditionally Grubb’s tests has a alternate hypothesis that exactly one outlier is present in data. In above we modified it get all possible outliers.

Test	Hypothesis
Grubb’s Test	H0: There are no outliers in the data set Ha: There is exactly one outlier in the data
Tietjen-Moore test	H0: There are no outliers in the data set Ha: There are exactly k outliers
Generalized ESD test	H0: There are no outliers in the data set Ha: There are up to r outliers

Above can be extended to two sided tests as well.

We had followed this in time series based anomaly detection and following approach were considered for pre processing before applying Grubb’s test:

Raw Count (No processing)
Residuals after STL decomposition
Residuals after fitting ARIMA

In our case raw count had worked well enough.

Reference:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

Github Gist

Ensemble, Bagging and Boosting

July 23, 2018October 25, 2020Archit Vora Leave a comment

Ensembling is a method of combining more than one models to generate a final output.(Reference [1])

There are two ways of doing that:

Bagging
Boosting

Bagging	Boosting
We take subset of data and train different models Example Random forest It takes subset of data as well as subset of features Pros of random forest Handles high dimensions Handles missing values Cons of random forest It won’t give precise value regression because final value is mean from subset tress None the less people are using it for regression depending upon domain	We train different model on with same data. Each sample is assigned different weight in each iteration Example AdaBoost, XgBoost Pros of XgBoost Supports different loss function Works well with interactions Cons of XbBoost Prone to overfitting Tuning of hyper parameters is critical

Pros-cons of bagging vs boosting:

Bagging is easy to parallelize and hence training is faster
Boosting is more efficient for fixed no of iterations (classifiers)

AdaBoost vs XgBoost

Reference : [2]

ada_vs_xg

Quote from Tianqi Chen, one of the developers of XGBoost:

Adaboost and gradboosting [XGBoost] are two different ways to derive boosters. Both are generic. I like gradboosting better because it works for generic loss functions, while adaboost is derived mainly for classification with exponential loss.

Reference :

[1] https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9

[2] https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788295758/4/ch04lvl1sec34/comparison-between-adaboosting-versus-gradient-boosting

AdaBoost

July 23, 2018October 25, 2020Archit Vora 2 Comments

AdaBoost stands for adaptive boosting
Boosting is a technique for creating better classifier form several regular/weak/base classifiers
Thought:
- We want to train second classifier in such a way that training data has more samples for the case where first classifier was wrong
- Third classifier will have more samples where first and second classifier contradicts
How do we accomplish above:
- We assign weights to each sample
  - This weights will be used in error function for each sample
  - Error function would be sum of weights of incorrectly classified samples
  - This weights keep changing in each iteration
- We also assign wight to each base (weak) classifier (α In below algorithm)
  - Formula for calculating α below is valid for binary classification
    - It will be different for multi class classification
When do we stop iteration ?
- Combined classifier is predicting all sample correctly
- No more base classifier left
Most machine learning algorithms allows to apply weight to data
- In logistic regression we can multiply weight in error fucntion
- In decision tree, classification error can be calculated by summing weights of missclassified samples

adboost

Formulations in above algorithm:

Formula of α
- if ε=0.5 then α=0. It is just a random chance classifier.
- Else it can positive/negative.
- ε < 0.5 => negative α => we invert the prediction in weighted average
Formula of D_(t+1)(i)
- y*h is just a sign
  - y and h are {-1, 1}
  - y*h will be positive for correct prediction
- Why it is function of ε
  - Random classifier updating weights does not make sense
By Z_t we are normalizing weights, which ensures numerical stability

Adaboost Theorem

Error -> 0 as no of iteration -> infinity at certain condition
Condition is here is that each learner has error < 0.5
Each iteration means adding one weak learner

Over-fitting

Contrary to weak classifier (decision stumps) we saw that rate of increase test error is less with boosting
However no of iteration still should be selected via cross validation for small data-set and validation for large data-set

References:

Click to access MIT6_034F10_boosting.pdf

Click to access msri.pdf

IID Assumption

July 7, 2018July 7, 2018Archit Vora Leave a comment

This is an interesting thread on the topic : https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning

Some notes:

1) Method of cross validation changes based in i.i.d assumption.

For time series we use one step ahead cross validation.
Also in panel data when we have clustered structure we sample whole cluster to preserve shape of the data

tom

Looking at the last three equation, p(D/h) = p((d1, d2, d3)/h) can be reduced to multiplication only when d1, d2, d3 are independent.

Panel Data And Analysis

July 7, 2018October 25, 2020Archit Vora Leave a comment

From Wikipedia

In statistics and econometrics, panel data or longitudinal data^[1]^[2] are multi-dimensional data involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals

Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).

A study that uses panel data is called a longitudinal study or panel study.

Cross sectional data:

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects.

Examples

Panel Data

panel

Cross Sectional Data

Time Series

Gradient Descent vs Netwon’s Method

July 1, 2018May 23, 2020Archit Vora Leave a comment

Comparison is moved to : https://datastoriesweb.wordpress.com/2018/03/11/iterative-method-for-unconstrained-optimization/

Regression Trees

May 1, 2018May 21, 2023Archit Vora 2 Comments

Figure below shows how decision tree creates rectangle in predictor space:

As we have describe classification tree, the only difference here is how to come up with output instead of label and how to define a metric other than entropy/gini.

For output it uses the mean of leaves.

For metric CART methodology uses simple SSE.

SSE = ∑ (y_i – y1)² + ∑(y_i – y2)²
- where y1 and y2 are means of two newly crated groups.
- It is simple sum of squares, not standard deviation
- While deriving R2 in linear recession we were concerned with sum of squares
- Ward linkage criterion in hierarchical clustering also uses difference in sum of squares
diff = SSE_before – SSE_after
- We choose a predictor which minimizes SSE the most.
- diff is highest

My hand-coded regression tree is available at : https://github.com/arcarchit/datastories/tree/master/regression_tree

Reference

Applied predictive modeling by Max Kuhn and Kjell Johnson

Classification Trees

May 1, 2018May 21, 2023Archit Vora 3 Comments

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

Formula:
- H = ∑ -p * log(p)
About :
- Randomness
- Information Carries
- Highest when p=0.5
- Higher Entropy => Higher Randomness => Carries more information
- After splitting we want entropy to reduce

Initial :
- n0 positive and m0 negative sample
- g0 = (n+m) total samples
After Splitting
- Group 1:
  - n1 positive
  - m1 negative
  - g1 = (n1 + m1) total
- Group 2:
  - n2 positive
  - m2 negative
  - g2 = (n2 + m2) total

Before Entropy :
- H_before = -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
After Entropy :
- H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0)*H2
diff = H_before – H_after
Select a predictor for which diff is highest and split on it
- Which means we are selecting a predictor which reduces randomness more
- Which also mean we are selecting a predictor which reduces information more
  - So we are selecting a variable which carries more information
    - More important feature

Gini

Gini for 2 class :
- G = p1 * (1-p1) + p2 * (1 – p2)
- Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
Gini in general:
- G = ∑ p * (1 – p) = 1 – ∑ p²
Like entropy gini is maximum when p1 = p2 = 0.5
- In this case nodes are least pure
Gini is a measure of impurity which we want to reduce by splitting on predictor
diff = G_before – G_after
- We will select a predictor for which diff is highest

Entropy vs Gini

Given a choice gini will be better as computing log is costlier.
Both of them are better than “classification error as metric” [1]
- Which would be selecting majority class for calculating classification error
- It is useful while pruning but not while growing the tree

ROC Curve

ROC curve can be constructed easily for classifier which outputs ranking.
- For example logistic recession outputs probability where we can change the threshold.
On the contrary decision tree outputs label
However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

limit max. depth of trees
Cost complexity pruning
ensembles / bag more than just 1 tree
set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

Q-Learning

April 16, 2018May 7, 2020Archit Vora Leave a comment

Value Function

This represents value of a given state for the policy π.

Optimal Policy

Learning Via Value Function is not possible when we don’t know immediate rewards and next state. If they are known we can use dynamic programming methods described at Dynamic Programming for RL.

Q function

q42

How it solve the problem of unknown reward and next state functions?

To calculate value function we need to know immediate reward for next state and probability of moving into next state
In Q learning we are don’t need them
- We just take best action, then observe the reward for whichever state we go in.
- It is possible to know reward after taking action, not before that.

Training Rule for Q

We initialize Q table with random values and keep updating it on each action based on reward get.

We have trained grid world with above equation. Notebook is available at [2]. One thing to note in that code is that, we don’t need backup. We don’t need to update entire q table simultaneously. We can calculate new value for once cell and write it at once.

Also [1] is a very good resource. It shows changes in the Q-update equation in non deterministic world. (Reward and next state are non deterministic).

qq3

And also derives TD updates (Temporal Difference)

qq4

Convergence Conditions

System is deterministic MDP (Markov Decision Process)
Immediate reward values are bounded
Agent visit every state action pair infinitely often

Derivation for Convergence

Experimental Strategies

We also need to do exploration
So we pick up action with a probability such that
- action having more Q value has higher probability
One way to formulate this is as follows where in value of k control exploration-exploitation trade-off.
- Initially k is small (exploration) and then is slowly increased (exploitation)

How can we speed up Updating Sequence

Strategy 1
- Consider the case for single absorbing goal state case. (What it means that if agent’s goal is to reach particular state) and initially all (q,a) pairs are initialized with 0.
- So only one (q,a) pair will be update when we reach to reward state,
- To speed up things, we can store intermediate state and action and then update backward till origin once we reach reward state.
- Of course this will increase memory requirement
Strategy 2
- Store immediate reward of past along with state-action pair and retrain periodically
- If one Q(s, a) changes it will help change others as well
Strategy 3
- If performing real environment action in environment is difficult we can perform simulation

Reference :

[0] Machine Learning by Tom Mitchell

[1] http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf

[2] https://github.com/arcarchit/datastories/blob/master/RL_GridWorld.ipynb

Data Stories

Page 6 of 10

Interpretation of Multiple Regression Coefficients

Grubb’s Test for Anomaly Detection

Ensemble, Bagging and Boosting

AdaBoost vs XgBoost

AdaBoost

IID Assumption

Panel Data And Analysis

Gradient Descent vs Netwon’s Method

Regression Trees

Reference

Classification Trees

Criterion for Choosing the Splitting Predictor

Multi-class

Entropy

Gini

Entropy vs Gini

ROC Curve

Regularization

Reference

Q-Learning

Value Function

Optimal Policy

Q function

How it solve the problem of unknown reward and next state functions?

Training Rule for Q

Convergence Conditions

Derivation for Convergence

Experimental Strategies

How can we speed up Updating Sequence

Reference :