[Paper] RankNet to LambdaRank to LambdaMART

Microsoft had published a paper on RankNet in 2005, which also won ICML’s test of time award in 2015. Subsequent improvement to that algorithm were lambdaRank and lambdaMART. The later had also won 2010 – learning to rank challenge. Current paper[1] summaries all three algorithms and incremental improvement they are mapping. It is written by Chris Burges who was involved in these development since RankNet.

Quick Gist

  • RankNet was trained using Neural Net.
  • RankNet was trained to minimize pair-wise differences.
  • LambdaNet was designed to maximise NDCG.
    • Which is more relevant to IR compared to pair-wise differences.
  • LambdaMART started using tree based boosting algorithms (MART, XgBoost etc)

RankNet

  • For a given query (q), we have two items (i and j)
    • I am writing down item but it would be any document (web page for example)
    • We will have features for
      • query
      • item
      • query item pairs
  • For both (q, i) and (q, j) we pass them through NeuralNet and get scores -> si and sj
    • We will also pass pairs which has same supervised label
  • We need to decided a loss function based on si and sj, gradient of which we will back propagate.
  • Computed probability formula y_hat
  • Supervised probability label y = (P_ij_bar) = (1 + S_ij) / 2
    • S_ij = { 0, 1, -1}
      • 1 if ( i is more relevant then j)
      • 0 if (i is less relevant then j)
      • 1/2 if both have same relevant scores
    • y will be in {0, 1/2, 1}
  • sigma determines shape of sigmoid
    • We would go to difference of (si – sj) on x-axis and read corresponding value from y axis.
    • sigma determines steepness around 0
    • This technique is known as temperature scaling. In tower network people use is when passing cosine similarity value to sigmoid.
    • Technique is useful when input has limited range
      • Cosine similarity has range (-1, 1)
      • s_i – s_j would also have smaller range
  • Cost function is cross entropy
  • One observation here is that when si = sj
    • cost = log 2 > 0
    • Document with different label but same score are still pushed aways from each other.

Speeding up RankNet Training by factoring

Defining lambda_ij

Defining lambda_i

  • What speedup did we achieve
    • Earlier for a given query we were computing lambda_ij for each pair and updating weights
    • Now we are computing lambda_i for each document and then update weight for given query
      • You can correlate this to going from SGD to mini-batch
      • We can do SGD at query, pair level or query level. We choose later.
  • Intuition of lambda_i
    • Direction(+Ve/-Ve) and magnitude where we want document to move
      • Move up/down in the ranking

LambdaRank

Figure below and caption explains why reducing pairwise difference does not necessarily results in increase of IR metrics. And LambdaRank is a way to optimize for IR metrics faster.

LamdaRank modified lamda_ij equation empirically as follows and it seemed to work well. We could do this because we don’t need to know actual cost function as long as we compute lamda_ij and then lamda_i

This becomes maximisation problem so we move in direction of gradient.

LambdaMART

  • MART – Multiple Additive Regression. Trees. For more info read the post on gradient boosting
  • To work with MART we need to specify two things
    • Gradient of loss function, which will be used to fit the trees
    • For the region in tree compute leaf value, that is done using line search/newton’s method
  • We had empirically defined lambda as gradient in lambdaRank, we use same lambda as gradient here as well.
  • For above lambda gradient, paper defines empirical cost function and then derives value at leaf nodes using newton’s method.
  • Key difference to observe between lambdaRank and lambdaMART
    • Lambda rank updates all weights (of classifier) after each query is examined
    • Leaf value at lambdaMART is computed for all query, item pairs that falls into that leaf
      • This allows lambdaMART to update splits and leaf values such that it may decrease utility for some queries but overall utility increases.

Reference

[0] : https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/

[1] : https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf

[Paper] Semantic Product Search

  • Paper is related to search systems for e-commerce.
  • Shortcoming of BM-25
    • lack of understanding of hypernyms, synonyms, antonyms
      • hypernym – general/broad word
      • Color is hypernym for red
      • Examples
        • Red dress and burgundy dress
        • sneakers and running shoes
    • Morphological variants (woman vs women)
      • Stemming and lemmatization is typical solution
      • Requires numerous hand-crafted rules
        • Converts reading glasses to reading glass
    • Sensitivity to spelling error
      • Typically solved by spell correction algorithms

Architecture

  • Sharing embedding layer in siamese network work well, capturing local word level matching even before training this network.
  • Average pooling
    • Comparing to LSTM performance of average pooling reduced by just 0.5 %
    • Evaluation metric were MAP and Recall@100
    • It helped computationally
    • Unlike web search query and items titles have smaller length
    • Query also does not have stop words
  • Batch Normalisation
    • Length of items are generally longer than query
    • They had observed difference in magnitude after avg pooling of query and item
    • I have seen people using FC layer instead to have separate parameters

Loss Function

  • They first tried with 2 part hinge loss
  • Analysing the overlap items they found that they were clicked but not purchased.
    • They note that from matching standpoint these products are okay to show to customer
  • So they introduced third term and decided it’s threshold to be 0
  • EmpIrical values used episolon (+) = 0.9, episilon (-) = 0.2
  • My view on this
    • For the products that has clicks but not orders, new loss tries to keep it below 0.55
    • 2 part hinge loss was for purchased vs random items
    • Some of those random items had got clicks
    • For positive items when we get similarity above 0.9 it should contribute 0 to loss
    • For negative items if we get similarity below 0.2 it should contribute 0 to loss
    • For neutral items if we get similarity below 0.55 it should contribute 0 to loss
  • y_hat is cosine similarity
    • Based on which loss is calculated
    • Instead of passing cosine similarity to sigmoid they are passing it through hinge function.

Tokenization Methods

  • Word unigram
  • word n-gram
    • To preserve information about word ordering
    • They preferred word n-gram over LSTM or CNN
  • Character trigram
    • Robust typos (iphone & iphonr)
    • Compound words (amazontv & firetvstick)
    • Item parts and sizes
      • Tires have different sizes
      • TVs have different model
  • Handling unseen words
    • Common NLP techniques
      • Mask the input
      • use same embedding for all unseen words
    • What they choose
      • Keep more embedding space for unseen words
      • Hash them to get the bucket
    • Enough space is allocated for frequent words and they are not hashed
      • 10k or 100k of n-grams are kept
  • Combining tokenization
    • One approach
      • Separate embedding for unigram, biagram and character trigrams
      • Compute. weighted sum of cosine similarity individually
    • Their approach
      • Consider them as bag or words

Data

  • Time range
    • Training data – 11 month
    • Eval data – 1 month
  • Trick of aggregating count as weight
    • Weight while computing the loss, simple multiplication ( I had done that in synonyms as well )
    • 54 billion query product training pairs
    • Reduced to 650 million
  • Ratio
    • 1 purchased
    • 6 impressed
    • 7 random
    • They plan for both matching and ranking
    • Matching -> Differentiate between (purchased + impressed) from random
    • Ranking -> Differentiate between purchased and impressed
  • Query Preprocessing
    • Changes queries to list of token IDs
    • array length = 99 % tile of query token length
    • Smaller tokens are padded to right
  • Item preprocessing
    • One approach
      • Embed every attribute independently and concat
    • Their approach
      • Ordered Bag of words for title and attributes
      • Issue was variability in attributes and lack of structured data across items

Model Evaluations

Evaluation Metrics

  • Matching
    • Recall@100 and MAP
      • Used for tuning model hyper parameter as well
    • 20k queries and sub-corpus for those queries
      • Around 1M items
      • Ordered + Impressed + Random
  • Ranking
    • NDCG and MRR
    • Purchased + Impressed

Evaluation Results

  • Table 1
    • L2 variant of hinge loss outperforms L1
      • They hypothesise that L2 is more robust to outliers
    • 3 part hinge loss outperforms 2 part in matching but not in ranking
      • Introduction of 3rd part forced purchased and random to be more separated
  • Table 2
    • LSTM, GRU vs avg Pooling for aggregation
      • Comparison is made for various losses
      • Average pooling is performing similar or slightly better
        • Definitely reduces computation time
      • Since query and product title are relatively short this works well
      • Recurrent methods are expressive but introduces specialisation between query and title
        • Some may argue it is a good thing
        • They argue that it might not capture local word level mapping
  • Table 3
    • Compares different tokenization methods
    • Vocab sizes in embedding space
      • 125k – unigram
      • 25k – bigram
      • 64k – character trigram
      • 500k – Out of vocabulary (OOV) bins
    • Advantage of adding bigrams
      • Example – Chocolate milk vs milk chocolate
    • Advantage of character trigrams
      • Robustness to spelling error
    • OOV adds additional parameters
      • case 1 : 500k unigrams
      • case 2 : 125k unigrams and 375k OOV
      • case 2 performs better
  • Table 4
    • Batch vs layer vs no normalisation
    • query sorted data vs shuffled data
    • Optimum combination was using batch normalisation with shuffled data
    • Query sorted data did not have much effect on layer normalisation but was on batch normalisation
      • Batch estimates for mean and variant would be biased for batch normalisation
  • Table 5

Online evaluations

  • Three categories
    • toys and games, kitchen and pets
  • Conversion rate, revenue and other KPI increased with statistical significance
  • Challenges
    • Precision was decreased
    • To mitigate they added guardrails using some heuristic and ranking models to filter non relevant items
    • Not that recall@100 was metric for matching task and not precision.

Training Acceleration

  • Increasing query product pairs from 200 million to 1.2 billion had improved matching metric by 10 %
  • Data parallel training
    • Idea
      • Split data
      • Same model on multiple GPUs
      • Take the average of gradient in the end
    • Cons for this architecture
      • Limits embedding metric
        • It has to fit on single GPU
      • Data parallel approach has high communication overhead
        • We need to weight till forward pass from each GPU is available
  • Model parallel training
    • Pros in this architecture
      • Simplicity of average pooling
      • Siamese has some parallelism by default
    • How they implemented
      • Embedding metric is split across GPUs, across embedding dimension
        • Hence we need to concat to get final embedding vector for each token
          • Concatenation requires O(2k) communication of floating point numbers
        • Partial token embeddings are averaged at each GPUs
      • Full dataset is sent to all GPUs
  • In the result section they have told that embedding dimension was 256
  • Explaining above image
    • For low embedding dimension time-to-train increases with number of GPUs but for high embedding dimension time-to-train decreases with number of GPUs
    • Dotted line
      • Following two should have same training time
        • 2048 embedding with 1 GPU
        • 1024 embedding with 2 GPU
      • Doted line is for the ration of embedding_size/GPU
      • With ideal scaling it would be horizontal

Exampels