Power Analysis

June 14, 2021August 3, 2021Archit Vora Leave a comment

When we run A/B test, we need to know how long we should keep it running. After 14 days we observe that test is not conclusive ? Should we still keep it running in the hope that we will get evidence in one more week or conclude that treatment is not significant. Power Analysis helps answer this question.

We can build our own guidelines – 5 M visitors or 14 days

Relation with Type 1 and Type 2 Error

Type 1 error is probability to wrongly reject null hypothesis
- We can fix this by reducing p-value from 5 % to say 1 %
Type 2 error is probability of failing to accept alternate hypothesis
- Alternate hypothesis is correct but we could not accept it
- We failed to reject null hypothesis
Power = 1 – (Type 2 error)
- Probability that we will reject null hypothesis when we should

4 Parameters Involved in Power Analysis

Alpha (p-val)
- When we increase p-val from 5 % to 10 % power increases
- We reduce probability of type 2 error
- But type 1 error increases
Power
- We defined it above
- Typically we set it to 0.8
Number of samples, N
- We want to find out number sample for which we should keep running tests
- After which we can conclude it.
Effect Size
- One feature brings 1 % lift in conversion
- Other feature is brining 5 % lift in conversion
- 2nd test will converge faster and hence has high power

There is formula connecting above 4 parameters. Once we fix 3 of them, we can find the 4th one.

How can I know effect size before the test ?

In most cases it is expected to have pilot data.
If not other source would be similar experiments in literature.
If none of above is available it is wise to do pilot study.

Types of Power Analysis

There are four types based on which parameter we are trying to find out.

A priori analysis
- compute N given other three
- Most common analysis
Post-hoc power analysis
- Find power given other three
- It is wise to do once test is completed
- While calculating N before test you don’t have exact estimation of effect size, post test you have it
- This helps you verify if you had sufficient samples to detect statistics you found
Criterion analysis
- Compute alpha given other three
- Rarely used
Sensitivity Analysis
- Find effect size given other three
- It is important when sample size is bounded
  - In online a/b test thing generally is not the case
  - For clinical trial it would be
- For fixed samples you can estimate what effect you would find at max
  - If it is too less, don’t conduct an experiment
- This is known as minimum detectable effect (MDE).

Calculation of effect size and power analysis formula

It is beyond scope of this post right now. Will add it in future.

Anomaly Detection Basics

March 21, 2021November 16, 2021Archit Vora Leave a comment

These are notes form Andrew N.G’s machine learning course on coursera.

Anomaly Detection Problem

Assembly line prepared aircraft engine, you want to check if it okay.
- Features can be
  - heat generated
  - vibration intensity
Fraud detection in finance/retail
- Feature would be based on user’s activity
Monitoring CPUs in data center
- Features would be memory used, CPU load, network traffic.
Generally we have less number of positive (anomalous example) compared to negative ones (normal).

Solution Based on density Estimation

We try to fit normal examples in gaussian distribution
For new engine we estimate the probability of it p
If p < epsilon – we flag the new engine as anomalous
We generally train one gaussian per features and multiply like naiver bayes
- That is to say off diagonal elements in multivariate gaussian are zero
- More details in the last section of this post ( multivariate gaussian )

Model Evaluation

Once we have multiple models having different features how to evaluate which one is better ?
We also need to tune epsilon parameter.
We can use standard setup of train-set, test-set and cross validation set
Train set would have normal examples only. It would okay if few anomalous samples slips in
- So training is unsupervised only
- Predict y = 1 if p(x) < epsilon else 0
Bad metric
- classification accuracy (because classes are imbalanced)
Good metric
- TP, FP, TN, FN
- Precision/recall
- F1 score
Cross validation set is used for tuning epsilon

Anomaly Detection vs Supervised Learning

Supervised model like logistic regression would require
- 1) More training examples
- 2) Somewhat balanced classes

Feature Engineering

Since we are fitting gaussian we need to do some transformation if feature distribution does not look like one. Popular transformations are
- log (x)
- log (x + c)
- sqrt (x)
How to introduce new feature
- We need to do this when p(x) is comparable for normal and anomalous sample
- Once you find anomalous sample for which p(x) is not low enough, try looking deep into it.
- Property which is making it anomalous would be a new feature to add
Feature Engineering Recommendation
- Think about features which will be too high or too low in case of anomaly
- x5 and x6 can be a good feature in below image.

Multivariate Gaussian

Shortcoming of individual gaussian is that in case of correlated features it won’t be able to detect the anomaly.
- Green sample in above image will not be detected
To mitigate this we can hand-code ratio based features
Original model is more popular because it scales well with no of features.

Recommendation System Fundamentals

March 14, 2021November 8, 2021Archit Vora Leave a comment

This are summaries from Andrew N.G’s course on machine learning.

Problem formulation of recommendation system

For this post we will use movie recommendation as example
We want to recommend movies to user
We will have ratings for some (user, movie) pair and would want to predict ratings for other pair
- Then for a given user we will recommend movies sorted by highest prediction ratings

Two kinds of recommendation system
- Content based
- Collaborative filtering

Content based recommendation
- Each movie has some features
- Train regression model for each user to predict ratings based on movie’s features

Collaborative Filtering

We need to know two things
- Feature values for each movie
- Weight parameter for each user
If we know one, we can calculate other
- We will of-course consider (customer, movie) ratings
- We have partially filled matrix as training set.
Can we do a joint optimisation for both ? That’s the idea behind collaborative filtering.
Why it is called collaborative filtering ?
- Users are collaborating with each other to generate predictions for each other.

Notes of collaborative filtering equations
- Both summations are equal.
- x0 = 1 is not needed. x0 is constant term in linear regression
  - If model thinks it needs a feature which is constant it will create this on its own.

Collaborative filtering is essentially a matrix factorisation. We can construct matrix for movie and user based on vectors x and theta in above formulation.

Low rank matrix factorisation
- Let number of user’s be nu
  - Number of movies nm
  - Number of movie feature k
- User matrix dimension -> nu x k
  - For each user we have vector representing feature weights
- Movie matrix dimension -> nm x k
  - For each movie we have feature value
- Original matrix dimension -> nu x nm
- We have factorised matrix into low rank k <= min(nu, nm)
- It is very popular mathematical technique to discover latent structure with sparse data. [1]
Further application
- Showing related item is different problem than recommendation
- With collaborative filtering we are obtaining feature vectors for each movie/item
- We can apply nearest neighbour on top of it

Cold start Problem

Mean normalisation helps combating that
In most ML algorithm we normalise feature (column)
Here you would like to normalise row from product perspective. Mathematically you can do both.
Example below
- 5th user has not rated any movie (blue column in first matrix)
- Regularised objective will make theta 0 for 5th user, which in turn produce 0 rating while making prediction
- Mean normalisation transform Input matrix as shown below
- During prediction time we add mean value for that movie
  - Theta will still be 0 for 5th user from matrix factorisation
  - After multiplying with 0 we are adding mean value to it

Reference

[1] https://arxiv.org/pdf/1507.00333.pdf

Elimination with matrices

March 13, 2021August 3, 2021Archit Vora Leave a comment

We use it to find solution of linear equations
- Elimination method is also popular among software packages like Matlab.
We would perform row operations to make matrix upper triangular
We can do this with equations itself (without putting in matrix)
- We don’t want to keep writing variable names
Process
- Iteratively select diagonal element as pivot
- Make all entires below it 0
- Repeat
Allowed row operations are the ones which does not change systems
- Switching rows
- Multiplying row with constant
- Add rows

What would happen when we have 2 variable and 100 equations (observation) ?

Here are four terms
- Underdetermined System (no of equation > no of unknowns)
- Overdetermined system (no of equation < no of unknowns)
- Consistent System (one/more set of values satisfy equations)
- Inconsistent System (no set of values satisfy equations)
Overdetermined system is generally inconsistent except
- Same equation multiplied by constant
- One equation is linear combination of other
More specifically, according to the Rouché–Capelli theorem, any system of linear equations (underdetermined or otherwise) is inconsistent if the rank of the augmented matrix is greater than the rank of the coefficient matrix.

Snippets for Prof Gilbert’s course

Elimination matrices – we can create matrix for each elimination operation

Remember – Multiply on left is row operation, multiply on right is column operation

Permutation matrix (P) – Exchange the rows of identity matrix

It is easy to find inverse of this elimination matrices

Reference

These are notes from Prof. Gilbert Strang’s lecture on MIT open courseware. https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/lecture-1-the-geometry-of-linear-equations/

[Paper] RankNet to LambdaRank to LambdaMART

January 1, 2021November 19, 2021Archit Vora Leave a comment

Microsoft had published a paper on RankNet in 2005, which also won ICML’s test of time award in 2015. Subsequent improvement to that algorithm were lambdaRank and lambdaMART. The later had also won 2010 – learning to rank challenge. Current paper[1] summaries all three algorithms and incremental improvement they are mapping. It is written by Chris Burges who was involved in these development since RankNet.

Quick Gist

RankNet was trained using Neural Net.
RankNet was trained to minimize pair-wise differences.
LambdaNet was designed to maximise NDCG.
- Which is more relevant to IR compared to pair-wise differences.
LambdaMART started using tree based boosting algorithms (MART, XgBoost etc)

RankNet

For a given query (q), we have two items (i and j)
- I am writing down item but it would be any document (web page for example)
- We will have features for
  - query
  - item
  - query item pairs
For both (q, i) and (q, j) we pass them through NeuralNet and get scores -> si and sj
- We will also pass pairs which has same supervised label
We need to decided a loss function based on si and sj, gradient of which we will back propagate.
Computed probability formula y_hat

Supervised probability label y = (P_ij_bar) = (1 + S_ij) / 2
- S_ij = { 0, 1, -1}
  - 1 if ( i is more relevant then j)
  - 0 if (i is less relevant then j)
  - 1/2 if both have same relevant scores
- y will be in {0, 1/2, 1}
sigma determines shape of sigmoid
- We would go to difference of (si – sj) on x-axis and read corresponding value from y axis.
- sigma determines steepness around 0
- This technique is known as temperature scaling. In tower network people use is when passing cosine similarity value to sigmoid.
- Technique is useful when input has limited range
  - Cosine similarity has range (-1, 1)
  - s_i – s_j would also have smaller range
Cost function is cross entropy

One observation here is that when si = sj
- cost = log 2 > 0
- Document with different label but same score are still pushed aways from each other.

Speeding up RankNet Training by factoring

Defining lambda_ij

Defining lambda_i

What speedup did we achieve
- Earlier for a given query we were computing lambda_ij for each pair and updating weights
- Now we are computing lambda_i for each document and then update weight for given query
  - You can correlate this to going from SGD to mini-batch
  - We can do SGD at query, pair level or query level. We choose later.
Intuition of lambda_i
- Direction(+Ve/-Ve) and magnitude where we want document to move
  - Move up/down in the ranking

LambdaRank

Figure below and caption explains why reducing pairwise difference does not necessarily results in increase of IR metrics. And LambdaRank is a way to optimize for IR metrics faster.

LamdaRank modified lamda_ij equation empirically as follows and it seemed to work well. We could do this because we don’t need to know actual cost function as long as we compute lamda_ij and then lamda_i

This becomes maximisation problem so we move in direction of gradient.

LambdaMART

MART – Multiple Additive Regression. Trees. For more info read the post on gradient boosting
To work with MART we need to specify two things
- Gradient of loss function, which will be used to fit the trees
- For the region in tree compute leaf value, that is done using line search/newton’s method
We had empirically defined lambda as gradient in lambdaRank, we use same lambda as gradient here as well.

For above lambda gradient, paper defines empirical cost function and then derives value at leaf nodes using newton’s method.

Key difference to observe between lambdaRank and lambdaMART
- Lambda rank updates all weights (of classifier) after each query is examined
- Leaf value at lambdaMART is computed for all query, item pairs that falls into that leaf
  - This allows lambdaMART to update splits and leaf values such that it may decrease utility for some queries but overall utility increases.

Reference

[0] : https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/

[1] : https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf

On Interleaving

December 12, 2020May 16, 2023Archit Vora Leave a comment

This post summarises some concepts around interleaving. Take away for me in the topic is realisation that in some cases interleaving can be conclusive while a/b test is not. (examples of search improvement on small % of queries) and the intuition that why it converges faster. Also there is a broader topic – design of experiments. It is generally used in search systems and recommender systems. Both can come under umbrella of retrieval systems.

What is interleaving ?

It is a method to evaluate ranking system.
If we want to measure effectiveness of two ranker, we prepare a merged list and observe customer’s interaction on that.

Why do we need interleaving ?

It allows us to experiment faster.
- It converges faster than a/b tests
For search systems many times a/b test might not converge at all
- Say an improvement is affecting small % of search queries
Search system has low convergence say 4 %. How to measure algorithm that improves 0.5 % on top of that.
- Lots of impressions are needed for convergence in a/b test.

Illustrative Examples

Shoe
- Suppose for a given shoe design we have two compelling soles, we want to measure which one has better resistance to wear/tear
- We have 100 people to test on
- option 1 – classic a/b test
  - 50 people get each type of shoes
  - Two variability
    - How active the person under tests it
    - Quality of sole
- option 2 – interleaved test
  - Different sole in left and right pair of shoe
  - Just one variability
    - Quality of sole
Soda [0]
- We want to test what is more popular among population – Pepsi or Coke ?
- option 1 – classic a/b test
  - Split population into two groups
    - One group receives Pepsi
    - Another one receives Coke
  - Measure soda consumption between two
  - Variability
    - Wide variation in soda consumption habit
    - Heavy soda consumer would be small % of population but large % of contributor
- option 2 – interleaving
  - Allow people to choose either Pepsi or Coke.
  - Don’t mark a label, but can be visually distinguishable

For some tests interleaving is not possible.

In the shoe examples instead of sole, if we have two different design of shoe
In search systems, if you are changing some UI elements

Two tests Netflix did while introducing new interleaving system

Compare sensitivity against traditional a/b tests [0]
1. Before the test it was known that algorithm B is better than A
2. They measure no of impressions it took to converge

2. Correlation of interleaving result with a/b tests [0]

On Metrics

Search systems have business metric like conversion & revenue
Netflix has business metrics like retention and streaming hours

Reference

[0] Netflix blog

[Paper] Semantic Product Search

November 21, 2020November 21, 2021Archit Vora Leave a comment

Paper is related to search systems for e-commerce.
- It tries to replace BM-25 inverted index based recall with query-item embeddings.
- It is written by team at Amazon
- Paper URL – https://arxiv.org/pdf/1907.00937.pdf
Shortcoming of BM-25
- lack of understanding of hypernyms, synonyms, antonyms
  - hypernym – general/broad word
  - Color is hypernym for red
  - Examples
    - Red dress and burgundy dress
    - sneakers and running shoes
- Morphological variants (woman vs women)
  - Stemming and lemmatization is typical solution
  - Requires numerous hand-crafted rules
    - Converts reading glasses to reading glass
- Sensitivity to spelling error
  - Typically solved by spell correction algorithms

Architecture

Sharing embedding layer in siamese network work well, capturing local word level matching even before training this network.
Average pooling
- Comparing to LSTM performance of average pooling reduced by just 0.5 %
- Evaluation metric were MAP and Recall@100
- It helped computationally
- Unlike web search query and items titles have smaller length
- Query also does not have stop words
Batch Normalisation
- Length of items are generally longer than query
- They had observed difference in magnitude after avg pooling of query and item
- I have seen people using FC layer instead to have separate parameters

Loss Function

They first tried with 2 part hinge loss
Analysing the overlap items they found that they were clicked but not purchased.
- They note that from matching standpoint these products are okay to show to customer
So they introduced third term and decided it’s threshold to be 0
EmpIrical values used episolon (+) = 0.9, episilon (-) = 0.2
My view on this
- For the products that has clicks but not orders, new loss tries to keep it below 0.55
- 2 part hinge loss was for purchased vs random items
- Some of those random items had got clicks
- For positive items when we get similarity above 0.9 it should contribute 0 to loss
- For negative items if we get similarity below 0.2 it should contribute 0 to loss
- For neutral items if we get similarity below 0.55 it should contribute 0 to loss
y_hat is cosine similarity
- Based on which loss is calculated
- Instead of passing cosine similarity to sigmoid they are passing it through hinge function.

Tokenization Methods

Word unigram
word n-gram
- To preserve information about word ordering
- They preferred word n-gram over LSTM or CNN
Character trigram
- Robust typos (iphone & iphonr)
- Compound words (amazontv & firetvstick)
- Item parts and sizes
  - Tires have different sizes
  - TVs have different model
Handling unseen words
- Common NLP techniques
  - Mask the input
  - use same embedding for all unseen words
- What they choose
  - Keep more embedding space for unseen words
  - Hash them to get the bucket
- Enough space is allocated for frequent words and they are not hashed
  - 10k or 100k of n-grams are kept
Combining tokenization
- One approach
  - Separate embedding for unigram, biagram and character trigrams
  - Compute. weighted sum of cosine similarity individually
- Their approach
  - Consider them as bag or words

Data

Time range
- Training data – 11 month
- Eval data – 1 month
Trick of aggregating count as weight
- Weight while computing the loss, simple multiplication ( I had done that in synonyms as well )
- 54 billion query product training pairs
- Reduced to 650 million
Ratio
- 1 purchased
- 6 impressed
- 7 random
- They plan for both matching and ranking
- Matching -> Differentiate between (purchased + impressed) from random
- Ranking -> Differentiate between purchased and impressed
Query Preprocessing
- Changes queries to list of token IDs
- array length = 99 % tile of query token length
- Smaller tokens are padded to right
Item preprocessing
- One approach
  - Embed every attribute independently and concat
- Their approach
  - Ordered Bag of words for title and attributes
  - Issue was variability in attributes and lack of structured data across items

Model Evaluations

Evaluation Metrics

Matching
- Recall@100 and MAP
  - Used for tuning model hyper parameter as well
- 20k queries and sub-corpus for those queries
  - Around 1M items
  - Ordered + Impressed + Random
Ranking
- NDCG and MRR
- Purchased + Impressed

Evaluation Results

Table 1
- L2 variant of hinge loss outperforms L1
  - They hypothesise that L2 is more robust to outliers
- 3 part hinge loss outperforms 2 part in matching but not in ranking
  - Introduction of 3rd part forced purchased and random to be more separated
Table 2
- LSTM, GRU vs avg Pooling for aggregation
  - Comparison is made for various losses
  - Average pooling is performing similar or slightly better
    - Definitely reduces computation time
  - Since query and product title are relatively short this works well
  - Recurrent methods are expressive but introduces specialisation between query and title
    - Some may argue it is a good thing
    - They argue that it might not capture local word level mapping
Table 3
- Compares different tokenization methods
- Vocab sizes in embedding space
  - 125k – unigram
  - 25k – bigram
  - 64k – character trigram
  - 500k – Out of vocabulary (OOV) bins
- Advantage of adding bigrams
  - Example – Chocolate milk vs milk chocolate
- Advantage of character trigrams
  - Robustness to spelling error
- OOV adds additional parameters
  - case 1 : 500k unigrams
  - case 2 : 125k unigrams and 375k OOV
  - case 2 performs better
Table 4
- Batch vs layer vs no normalisation
- query sorted data vs shuffled data
- Optimum combination was using batch normalisation with shuffled data
- Query sorted data did not have much effect on layer normalisation but was on batch normalisation
  - Batch estimates for mean and variant would be biased for batch normalisation
Table 5
- Comparing with baseline papers
  - DSSM
    - Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
    - https://posenhuang.github.io/papers/cikm2013_DSSM_fullversion.pdf
  - ARC-II
    - Convolutional neural network architectures for matching natural language sentences
    - https://dl.acm.org/doi/10.5555/2969033.2969055
  - Match Pyramid
    - Text Matching as Image Recognition
    - https://arxiv.org/pdf/1602.06359.pdf
  - Two variation of their approach
    - Frozen embedding
      - Using pretrained word2vec and glove
    - Randomly initialised embedding
      - Lead 3X improvement in Recall@100 and MAP

Online evaluations

Three categories
- toys and games, kitchen and pets
Conversion rate, revenue and other KPI increased with statistical significance
Challenges
- Precision was decreased
- To mitigate they added guardrails using some heuristic and ranking models to filter non relevant items
- Not that recall@100 was metric for matching task and not precision.

Training Acceleration

Increasing query product pairs from 200 million to 1.2 billion had improved matching metric by 10 %
Data parallel training
- Idea
  - Split data
  - Same model on multiple GPUs
  - Take the average of gradient in the end
- Cons for this architecture
  - Limits embedding metric
    - It has to fit on single GPU
  - Data parallel approach has high communication overhead
    - We need to weight till forward pass from each GPU is available
Model parallel training
- Pros in this architecture
  - Simplicity of average pooling
  - Siamese has some parallelism by default
- How they implemented
  - Embedding metric is split across GPUs, across embedding dimension
    - Hence we need to concat to get final embedding vector for each token
      - Concatenation requires O(2k) communication of floating point numbers
    - Partial token embeddings are averaged at each GPUs
  - Full dataset is sent to all GPUs
In the result section they have told that embedding dimension was 256

Explaining above image
- For low embedding dimension time-to-train increases with number of GPUs but for high embedding dimension time-to-train decreases with number of GPUs
- Dotted line
  - Following two should have same training time
    - 2048 embedding with 1 GPU
    - 1024 embedding with 2 GPU
  - Doted line is for the ration of embedding_size/GPU
  - With ideal scaling it would be horizontal

Exampels

Exponential, Poisson and Gamma Distribution

November 18, 2020May 16, 2023Archit Vora Leave a comment

All three distribution models different aspect of same process – poisson process.

Poisson Distribution

It is used to predict probability of number of events occurring in fixed amount of time
Binomial distribution also models similar thing
- No of heads in n coin flips
- It has two parameters, n and p. Where p is probability of success.
Shortcoming of binomial
- We want a single number i.e. k events per hour. Binomial has two – n & p
- More than one event can occur in unit time. 1 like per hour, 1 like per minute.
Below formula
- lambda events occur in unit time
- Below PDF is probability of k events in unique time

Properties
- Popularly it is used to model rare events so we see small values of lambda often. But that is not restriction
- Distribution is asymmetric. There is no such thing as # of events < 0
- As lambda increases, it looks like normal distribution.
Poisson Model Assumptions
- Average rate of events per unit time is constant
- Events are independent

Exponential Distribution

Poisson – prob (k events in unit time)
Exponential – Prob (Amount of time between events) = Prob(amount of time until first event)
lambda – # no of events in unit time
- rate
- same as poisson

Derivation from Poisson
- Exponential CDF can be derived from Poisson PDF
- Differentiating it gives exponential PDF
Memoryless property
- P(T > (a+b) | P(a) ) = P( T > b )
- When memory is require we use weibull distribution
  - Older the car, more likely break down
When memory is not required
- Probability that next bus arrives in less than 10 minutes
- Probability that server will run without restart for 10k hours
- Time for cook to prepare potato chips (probability not at a point, some range always)
Geometric distribution is counter part of exponential in discrete space
- Corresponding poisson counterpart is binomial distribution
- No of throws required to observe heads
- It is also memory less
It is monotonically decreasing distribution
Expected value = mean = 1 / lambda

Gamma Distribution

Exponential – wait time till first. event
Gamma – wait time till k events
- Two params – k and lambda
- Probability of observing k events in time t

Applications
- You are in a queue for medical checkup. There are 7 people in front of you. Avg time to check one person is 5 minute. (rate = lambada = 1/5 and k = 7)
Literature uses different symbols for above parameters
- alpha, beta
- theta, k
K can be real number in gamma distribution
- To restrict k to be integer there is Erlang distribution
Gamma function – Gamma ( k ) = ( k – 1 ) !

Reference

Three wonderful posts by same author [1]

Quantile Function (Inverse CDF)

October 22, 2020November 3, 2020Archit Vora Leave a comment

Introduction

CDF maps input between in [0,1]. That is CDF(x) -> (0,1)

Quantile function takes input in (0,1) and return x.

Not all functionals are invertible.
Continuous distribution easily satisfies this property
For discrete distributions we take innfimum of all values [0]

Application in Sampling

Suppose we want to sample from a given distribution
We can make a quantile function of it
Sample uniformly from [0,1], call it p
x = quatile(p)
x is the sampled value [1]

Application in point estimation

Suppose you want to model CTR (click through rate)
CTR lies in (0,1) and can be presented via beta distribution
You have clicks and impressions for two ads/items anything
We are more confident about CTR when you have more impressions
We can construct a beta distribution with mean as point CTR.
- ctr = beta(clicks, impression-clicks)
- Recall in thompson sampling we were increasing alpha and beta by 1
Now we take 1%, 5%, 10% from quantile function. Call it
When we have more impressions ctr_x would be close to mean ctr (point estimate), else it will be less
On side node – this way of constructing distribution gets us away from point estimation and can be used in bayesian approaches

Reference

[0] https://stats.stackexchange.com/questions/212813/help-me-understand-the-quantile-inverse-cdf-function

[1] https://stats.stackexchange.com/questions/184325/how-does-the-inverse-transform-method-work/184337#184337

Gradient Boosting

September 26, 2020April 26, 2023Archit Vora 1 Comment

Some people use term bagging and boosting, some use parallel and sequential ensembles. I have written a post on AdaBoost previously. So we are familiar with boosting idea.

In AdaBoost we had a formula for generating labels (label/sample weight in loss function to be precise) for next iteration. In gradient boosting we fit gradient of loss function with respect to predictions.

Prof Balaraman notes

Most of this post contains info from [0]. However Prof. Balaraman offers intuitive perspective in one of his lectures. [3]
My handwritten note on that is available at [2]
We can write decision tree as equation where regions and value in the region are parameters.
Two steps of gradient boosting
1. Fit gradient of loss function using regression trees (least square loss)
2. Compute values in region using loss defined for application (ndcg for ranking, log-loss for classifier anything)
Whatever is application loss, we fit regression tree using least square
Parametric gradient descent can be seen as series of summation starting from initial guess.
Related paper [4]

Basis Functions

Basis functions
- If we want to fit non linear curve with simple linear regression, we will create polynomial features
- It is not limited to regression though
Adaptive Basis function
- Instead of we defining above functions, we learn them
- This can be framed as optimisation problem and can be solved by gradient descent/newton’s method for some hypothesis space
How to include Tree’s in base hypothesis space
- Something which we can not differentiate
- Even if we fix parameters to split on output is still not continuous
- So it is difficult

FSAM

FSAM
- Forward Stage-wise Additive Modelling
- AdaBoost is example of this, another example is L2 boosting
- We can say weak function is step direction and coefficient is step size

L2 Boosting also example of FSAM
- We are fitting residual
- To prevent overfitting we can reduce step size
- If v*h(x) is also in H, we can remove v. (closed under rescaling)
- h(x) can be decision stump we can keep going as long as residuals exist (It will certainly overfit after few iterations)

Adaboost
- It works for classification problem
- When the loss function in FSAM is exponential loss of margin, it reduces adaboost
- margin = y*f(x)
  - We had encountered that in SVM

Disadvantages of AdaBoost
- Gives too much penalty for negative margin, look at the curve of exponential above
- Mislabel training data
- Sometimes there is noise in world
- Does not work when there is high baye’s error
- Also it seems AdaBoost formulation is very much tied to classification
  - We probably needs to tweak it to apply it to ranking problem
  - But gradient boosting can be applied to any problem (loss function)
    - Because Tree is alway using L2 loss

Functional Gradient Descent

Gradient is pseudo residuals
- For L2 loss it is actually residuals (difference between y and f)
- Examples where residuals are not just difference between y and f
  - logistic loss
  - ranking using ndcg
We fit this gradient(pseudo-residuals) with weak function
- Regression Tree is most common
- We find closest approximation to gradient step in base hypothesis space (which is functional space)
- Here we minimize square loss (regression problem)
Earlier tree used to be just stumps
- That is num of leaf node = 2
- Now a days xgboost/light gbm takes upto 8/16 leaf nodes as well
Note that while calculating gradient the hypothesis function is not involved anywhere
How do we calculate gradient over function space
- In close form expression it might look complicated
- From numerical computation perspective we need gradient at just one point (of dimension training data size)
- This is just like taking gradient of weights in linear regression
  - Here size of gradient vector would be number of training examples
- In most of the method we take gradient of parameters, here we are taking gradient of predictions
What should be step size
- Line search
- Regularisation parameter (similar to what we use for any gradient based optimisation)
  - This is more common

Reference

[0] https://www.youtube.com/watch?v=fz1H03ZKvLM

[1] https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/8b.gradient-boosting.pdf

[2] https://github.com/arcarchit/datastories/blob/master/notes/gradient_boost_iitm.pdf

[3] https://www.youtube.com/watch?v=FJ-MA0EfJ9k

[4] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf