[Paper] RankNet to LambdaRank to LambdaMART

Microsoft had published a paper on RankNet in 2005, which also won ICML’s test of time award in 2015. Subsequent improvement to that algorithm were lambdaRank and lambdaMART. The later had also won 2010 – learning to rank challenge. Current paper[1] summaries all three algorithms and incremental improvement they are mapping. It is written by Chris Burges who was involved in these development since RankNet.

Quick Gist

  • RankNet was trained using Neural Net.
  • RankNet was trained to minimize pair-wise differences.
  • LambdaNet was designed to maximise NDCG.
    • Which is more relevant to IR compared to pair-wise differences.
  • LambdaMART started using tree based boosting algorithms (MART, XgBoost etc)

RankNet

  • For a given query (q), we have two items (i and j)
    • I am writing down item but it would be any document (web page for example)
    • We will have features for
      • query
      • item
      • query item pairs
  • For both (q, i) and (q, j) we pass them through NeuralNet and get scores -> si and sj
    • We will also pass pairs which has same supervised label
  • We need to decided a loss function based on si and sj, gradient of which we will back propagate.
  • Computed probability formula y_hat
  • Supervised probability label y = (P_ij_bar) = (1 + S_ij) / 2
    • S_ij = { 0, 1, -1}
      • 1 if ( i is more relevant then j)
      • 0 if (i is less relevant then j)
      • 1/2 if both have same relevant scores
    • y will be in {0, 1/2, 1}
  • sigma determines shape of sigmoid
    • We would go to difference of (si – sj) on x-axis and read corresponding value from y axis.
    • sigma determines steepness around 0
    • This technique is known as temperature scaling. In tower network people use is when passing cosine similarity value to sigmoid.
    • Technique is useful when input has limited range
      • Cosine similarity has range (-1, 1)
      • s_i – s_j would also have smaller range
  • Cost function is cross entropy
  • One observation here is that when si = sj
    • cost = log 2 > 0
    • Document with different label but same score are still pushed aways from each other.

Speeding up RankNet Training by factoring

Defining lambda_ij

Defining lambda_i

  • What speedup did we achieve
    • Earlier for a given query we were computing lambda_ij for each pair and updating weights
    • Now we are computing lambda_i for each document and then update weight for given query
      • You can correlate this to going from SGD to mini-batch
      • We can do SGD at query, pair level or query level. We choose later.
  • Intuition of lambda_i
    • Direction(+Ve/-Ve) and magnitude where we want document to move
      • Move up/down in the ranking

LambdaRank

Figure below and caption explains why reducing pairwise difference does not necessarily results in increase of IR metrics. And LambdaRank is a way to optimize for IR metrics faster.

LamdaRank modified lamda_ij equation empirically as follows and it seemed to work well. We could do this because we don’t need to know actual cost function as long as we compute lamda_ij and then lamda_i

This becomes maximisation problem so we move in direction of gradient.

LambdaMART

  • MART – Multiple Additive Regression. Trees. For more info read the post on gradient boosting
  • To work with MART we need to specify two things
    • Gradient of loss function, which will be used to fit the trees
    • For the region in tree compute leaf value, that is done using line search/newton’s method
  • We had empirically defined lambda as gradient in lambdaRank, we use same lambda as gradient here as well.
  • For above lambda gradient, paper defines empirical cost function and then derives value at leaf nodes using newton’s method.
  • Key difference to observe between lambdaRank and lambdaMART
    • Lambda rank updates all weights (of classifier) after each query is examined
    • Leaf value at lambdaMART is computed for all query, item pairs that falls into that leaf
      • This allows lambdaMART to update splits and leaf values such that it may decrease utility for some queries but overall utility increases.

Reference

[0] : https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/

[1] : https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf

IR metrics

Information retrieval (IR) deals with fetching relevant documents given search query. Purpose of this blog is to list evaluation metric that can be used to measure performance of this system.

Metrics for unranked retrieval

For all these definition is same as that for classification. However they have some distinct characteristics for IR system.

Precision and Recall

  • Example of a system where precision is important
    • Web search
  • Example of a system where recall is important
    • Individual searching their hard disk

Accuracy

For IR system data is genrally very skewed. 99 % of the document are in non relvant category. Hence accuracy does not make sense.

F Measure

  • We can have F1 score, F2 score etc depending upon how much weight we want to precision and recall in harmonic mean.
  • As ß > 1, we start giving more weight to recall.

Metrics for ranked retrieval

Precision recall curves

  • Historically during classification we plot ROC by changing the threshold of binary classification. In case of IR we plot it by changing no of documents retrieved.

11 point interpolated average precision

  • For the recall values of (0,0.1,0.2,…,0.9,1.0) find out the precision and average it.

Mean average precision (MAP)

  • For a given query we calculate average precision. We take a mean of that for several queries.
  • Example:
    • There are 10 documents, document 1,2,5 are revalant.
    • System 1 retrieves : 1,5,4,6,7,2
      • Average precision = (1/1 + 2/2 + 2/6)/3 = 8/9 = 0.89
    • System 2 retrieves : 1,2,5,3,4
      • Average precision = (1/1 + 2/2 + 3/3)/3 = 1
    • System 3 retrieves : 6,7,1,2,3,4,5
      • Average precision = (1/3 + 2/4 + 3/7)/3 = 0.41
  • MAP values typically varies a lot for different query in the same system. Say between 0.1 to 0.7
  • For different systems and same query MAP values does not vary that much. Hence for testing which system is better using MAP, large no of queries are needed.

Precision at k

  • MAP measure precision at various recall levels (until all the documents are retrieved). For application like web search what matters is result on first page or first three pages.
  • Disadvantages
    • Least stable and does not average well
    • Reason : total no of relevant document for a query has a strong influence on this metric.

R precision

  • Same as precision at k where k = no of relevant documents in a given query
  • It adjusts for the relevant document for a query (disadvantage of precision at k)
  • Empirically R-precision and MAP turn out be highly correlated.

NDCG

  • Normalized Discounted Cumulative Gain
  • Like precision at k it is evaluated at some values of k
  • It constitutes of cumulative gain at each position which is discount by position and is normalized
    • It is not max or sum normalized
    • It is normlized by ideal ndcg, which is calculated by sorting document based on relevance score. [3]
  • NDCG ranges between 0 to 1. For perfect ranking ideal value of NDCG is 1.
  • Example
    • positions : 1 2 3 4 5
    • eval score : 2 1 2 0 1
NDCG(q, k) = DCG(k)/Ideal DCG(k)

DCG(k) = sum (Gain(i) * Discount factor(i)) for i in (1,k)

Discount factor is generally taken as 1/log(1 + pos)

Gain is generally takes as (2^r - 1) where r is relevance score given by humans say 0 - not relevant ,1-near relevant , or 2-relevant.

References

[0] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

[1] https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52

[2] https://blog.thedigitalgroup.com/measuring-search-relevance-using-ndcg

[3] https://www.geeksforgeeks.org/normalized-discounted-cumulative-gain-multilabel-ranking-metrics-ml/