Papers | Data Stories

Microsoft had published a paper on RankNet in 2005, which also won ICML’s test of time award in 2015. Subsequent improvement to that algorithm were lambdaRank and lambdaMART. The later had also won 2010 – learning to rank challenge. Current paper[1] summaries all three algorithms and incremental improvement they are mapping. It is written by Chris Burges who was involved in these development since RankNet.

Quick Gist

RankNet was trained using Neural Net.
RankNet was trained to minimize pair-wise differences.
LambdaNet was designed to maximise NDCG.
- Which is more relevant to IR compared to pair-wise differences.
LambdaMART started using tree based boosting algorithms (MART, XgBoost etc)

RankNet

For a given query (q), we have two items (i and j)
- I am writing down item but it would be any document (web page for example)
- We will have features for
  - query
  - item
  - query item pairs
For both (q, i) and (q, j) we pass them through NeuralNet and get scores -> si and sj
- We will also pass pairs which has same supervised label
We need to decided a loss function based on si and sj, gradient of which we will back propagate.
Computed probability formula y_hat

Supervised probability label y = (P_ij_bar) = (1 + S_ij) / 2
- S_ij = { 0, 1, -1}
  - 1 if ( i is more relevant then j)
  - 0 if (i is less relevant then j)
  - 1/2 if both have same relevant scores
- y will be in {0, 1/2, 1}
sigma determines shape of sigmoid
- We would go to difference of (si – sj) on x-axis and read corresponding value from y axis.
- sigma determines steepness around 0
- This technique is known as temperature scaling. In tower network people use is when passing cosine similarity value to sigmoid.
- Technique is useful when input has limited range
  - Cosine similarity has range (-1, 1)
  - s_i – s_j would also have smaller range
Cost function is cross entropy

One observation here is that when si = sj
- cost = log 2 > 0
- Document with different label but same score are still pushed aways from each other.

Speeding up RankNet Training by factoring

Defining lambda_ij

Defining lambda_i

What speedup did we achieve
- Earlier for a given query we were computing lambda_ij for each pair and updating weights
- Now we are computing lambda_i for each document and then update weight for given query
  - You can correlate this to going from SGD to mini-batch
  - We can do SGD at query, pair level or query level. We choose later.
Intuition of lambda_i
- Direction(+Ve/-Ve) and magnitude where we want document to move
  - Move up/down in the ranking

LambdaRank

Figure below and caption explains why reducing pairwise difference does not necessarily results in increase of IR metrics. And LambdaRank is a way to optimize for IR metrics faster.

LamdaRank modified lamda_ij equation empirically as follows and it seemed to work well. We could do this because we don’t need to know actual cost function as long as we compute lamda_ij and then lamda_i

This becomes maximisation problem so we move in direction of gradient.

LambdaMART

MART – Multiple Additive Regression. Trees. For more info read the post on gradient boosting
To work with MART we need to specify two things
- Gradient of loss function, which will be used to fit the trees
- For the region in tree compute leaf value, that is done using line search/newton’s method
We had empirically defined lambda as gradient in lambdaRank, we use same lambda as gradient here as well.

For above lambda gradient, paper defines empirical cost function and then derives value at leaf nodes using newton’s method.

Key difference to observe between lambdaRank and lambdaMART
- Lambda rank updates all weights (of classifier) after each query is examined
- Leaf value at lambdaMART is computed for all query, item pairs that falls into that leaf
  - This allows lambdaMART to update splits and leaf values such that it may decrease utility for some queries but overall utility increases.

Reference

[0] : https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/

[1] : https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf

Paper is related to search systems for e-commerce.
- It tries to replace BM-25 inverted index based recall with query-item embeddings.
- It is written by team at Amazon
- Paper URL – https://arxiv.org/pdf/1907.00937.pdf
Shortcoming of BM-25
- lack of understanding of hypernyms, synonyms, antonyms
  - hypernym – general/broad word
  - Color is hypernym for red
  - Examples
    - Red dress and burgundy dress
    - sneakers and running shoes
- Morphological variants (woman vs women)
  - Stemming and lemmatization is typical solution
  - Requires numerous hand-crafted rules
    - Converts reading glasses to reading glass
- Sensitivity to spelling error
  - Typically solved by spell correction algorithms

Architecture

Sharing embedding layer in siamese network work well, capturing local word level matching even before training this network.
Average pooling
- Comparing to LSTM performance of average pooling reduced by just 0.5 %
- Evaluation metric were MAP and Recall@100
- It helped computationally
- Unlike web search query and items titles have smaller length
- Query also does not have stop words
Batch Normalisation
- Length of items are generally longer than query
- They had observed difference in magnitude after avg pooling of query and item
- I have seen people using FC layer instead to have separate parameters

Loss Function

They first tried with 2 part hinge loss
Analysing the overlap items they found that they were clicked but not purchased.
- They note that from matching standpoint these products are okay to show to customer
So they introduced third term and decided it’s threshold to be 0
EmpIrical values used episolon (+) = 0.9, episilon (-) = 0.2
My view on this
- For the products that has clicks but not orders, new loss tries to keep it below 0.55
- 2 part hinge loss was for purchased vs random items
- Some of those random items had got clicks
- For positive items when we get similarity above 0.9 it should contribute 0 to loss
- For negative items if we get similarity below 0.2 it should contribute 0 to loss
- For neutral items if we get similarity below 0.55 it should contribute 0 to loss
y_hat is cosine similarity
- Based on which loss is calculated
- Instead of passing cosine similarity to sigmoid they are passing it through hinge function.

Tokenization Methods

Word unigram
word n-gram
- To preserve information about word ordering
- They preferred word n-gram over LSTM or CNN
Character trigram
- Robust typos (iphone & iphonr)
- Compound words (amazontv & firetvstick)
- Item parts and sizes
  - Tires have different sizes
  - TVs have different model
Handling unseen words
- Common NLP techniques
  - Mask the input
  - use same embedding for all unseen words
- What they choose
  - Keep more embedding space for unseen words
  - Hash them to get the bucket
- Enough space is allocated for frequent words and they are not hashed
  - 10k or 100k of n-grams are kept
Combining tokenization
- One approach
  - Separate embedding for unigram, biagram and character trigrams
  - Compute. weighted sum of cosine similarity individually
- Their approach
  - Consider them as bag or words

Data

Time range
- Training data – 11 month
- Eval data – 1 month
Trick of aggregating count as weight
- Weight while computing the loss, simple multiplication ( I had done that in synonyms as well )
- 54 billion query product training pairs
- Reduced to 650 million
Ratio
- 1 purchased
- 6 impressed
- 7 random
- They plan for both matching and ranking
- Matching -> Differentiate between (purchased + impressed) from random
- Ranking -> Differentiate between purchased and impressed
Query Preprocessing
- Changes queries to list of token IDs
- array length = 99 % tile of query token length
- Smaller tokens are padded to right
Item preprocessing
- One approach
  - Embed every attribute independently and concat
- Their approach
  - Ordered Bag of words for title and attributes
  - Issue was variability in attributes and lack of structured data across items

Model Evaluations

Evaluation Metrics

Matching
- Recall@100 and MAP
  - Used for tuning model hyper parameter as well
- 20k queries and sub-corpus for those queries
  - Around 1M items
  - Ordered + Impressed + Random
Ranking
- NDCG and MRR
- Purchased + Impressed

Evaluation Results

Table 1
- L2 variant of hinge loss outperforms L1
  - They hypothesise that L2 is more robust to outliers
- 3 part hinge loss outperforms 2 part in matching but not in ranking
  - Introduction of 3rd part forced purchased and random to be more separated
Table 2
- LSTM, GRU vs avg Pooling for aggregation
  - Comparison is made for various losses
  - Average pooling is performing similar or slightly better
    - Definitely reduces computation time
  - Since query and product title are relatively short this works well
  - Recurrent methods are expressive but introduces specialisation between query and title
    - Some may argue it is a good thing
    - They argue that it might not capture local word level mapping
Table 3
- Compares different tokenization methods
- Vocab sizes in embedding space
  - 125k – unigram
  - 25k – bigram
  - 64k – character trigram
  - 500k – Out of vocabulary (OOV) bins
- Advantage of adding bigrams
  - Example – Chocolate milk vs milk chocolate
- Advantage of character trigrams
  - Robustness to spelling error
- OOV adds additional parameters
  - case 1 : 500k unigrams
  - case 2 : 125k unigrams and 375k OOV
  - case 2 performs better
Table 4
- Batch vs layer vs no normalisation
- query sorted data vs shuffled data
- Optimum combination was using batch normalisation with shuffled data
- Query sorted data did not have much effect on layer normalisation but was on batch normalisation
  - Batch estimates for mean and variant would be biased for batch normalisation
Table 5
- Comparing with baseline papers
  - DSSM
    - Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
    - https://posenhuang.github.io/papers/cikm2013_DSSM_fullversion.pdf
  - ARC-II
    - Convolutional neural network architectures for matching natural language sentences
    - https://dl.acm.org/doi/10.5555/2969033.2969055
  - Match Pyramid
    - Text Matching as Image Recognition
    - https://arxiv.org/pdf/1602.06359.pdf
  - Two variation of their approach
    - Frozen embedding
      - Using pretrained word2vec and glove
    - Randomly initialised embedding
      - Lead 3X improvement in Recall@100 and MAP

Online evaluations

Three categories
- toys and games, kitchen and pets
Conversion rate, revenue and other KPI increased with statistical significance
Challenges
- Precision was decreased
- To mitigate they added guardrails using some heuristic and ranking models to filter non relevant items
- Not that recall@100 was metric for matching task and not precision.

Training Acceleration

Increasing query product pairs from 200 million to 1.2 billion had improved matching metric by 10 %
Data parallel training
- Idea
  - Split data
  - Same model on multiple GPUs
  - Take the average of gradient in the end
- Cons for this architecture
  - Limits embedding metric
    - It has to fit on single GPU
  - Data parallel approach has high communication overhead
    - We need to weight till forward pass from each GPU is available
Model parallel training
- Pros in this architecture
  - Simplicity of average pooling
  - Siamese has some parallelism by default
- How they implemented
  - Embedding metric is split across GPUs, across embedding dimension
    - Hence we need to concat to get final embedding vector for each token
      - Concatenation requires O(2k) communication of floating point numbers
    - Partial token embeddings are averaged at each GPUs
  - Full dataset is sent to all GPUs
In the result section they have told that embedding dimension was 256

Explaining above image
- For low embedding dimension time-to-train increases with number of GPUs but for high embedding dimension time-to-train decreases with number of GPUs
- Dotted line
  - Following two should have same training time
    - 2048 embedding with 1 GPU
    - 1024 embedding with 2 GPU
  - Doted line is for the ration of embedding_size/GPU
  - With ideal scaling it would be horizontal

Data Stories

Category Papers

[Paper] RankNet to LambdaRank to LambdaMART