evaluation metrics | Data Stories

Information retrieval (IR) deals with fetching relevant documents given search query. Purpose of this blog is to list evaluation metric that can be used to measure performance of this system.

Metrics for unranked retrieval

For all these definition is same as that for classification. However they have some distinct characteristics for IR system.

Precision and Recall

Example of a system where precision is important
- Web search
Example of a system where recall is important
- Individual searching their hard disk

Accuracy

For IR system data is genrally very skewed. 99 % of the document are in non relvant category. Hence accuracy does not make sense.

F Measure

We can have F1 score, F2 score etc depending upon how much weight we want to precision and recall in harmonic mean.
As ß > 1, we start giving more weight to recall.

Metrics for ranked retrieval

Precision recall curves

Historically during classification we plot ROC by changing the threshold of binary classification. In case of IR we plot it by changing no of documents retrieved.

11 point interpolated average precision

For the recall values of (0,0.1,0.2,…,0.9,1.0) find out the precision and average it.

Mean average precision (MAP)

For a given query we calculate average precision. We take a mean of that for several queries.
Example:
- There are 10 documents, document 1,2,5 are revalant.
- System 1 retrieves : 1,5,4,6,7,2
  - Average precision = (1/1 + 2/2 + 2/6)/3 = 8/9 = 0.89
- System 2 retrieves : 1,2,5,3,4
  - Average precision = (1/1 + 2/2 + 3/3)/3 = 1
- System 3 retrieves : 6,7,1,2,3,4,5
  - Average precision = (1/3 + 2/4 + 3/7)/3 = 0.41
MAP values typically varies a lot for different query in the same system. Say between 0.1 to 0.7
For different systems and same query MAP values does not vary that much. Hence for testing which system is better using MAP, large no of queries are needed.

Precision at k

MAP measure precision at various recall levels (until all the documents are retrieved). For application like web search what matters is result on first page or first three pages.
Disadvantages
- Least stable and does not average well
- Reason : total no of relevant document for a query has a strong influence on this metric.

R precision

Same as precision at k where k = no of relevant documents in a given query
It adjusts for the relevant document for a query (disadvantage of precision at k)
Empirically R-precision and MAP turn out be highly correlated.

NDCG

Normalized Discounted Cumulative Gain
Like precision at k it is evaluated at some values of k
It constitutes of cumulative gain at each position which is discount by position and is normalized
- It is not max or sum normalized
- It is normlized by ideal ndcg, which is calculated by sorting document based on relevance score. [3]
NDCG ranges between 0 to 1. For perfect ranking ideal value of NDCG is 1.
Example
- positions : 1 2 3 4 5
- eval score : 2 1 2 0 1

NDCG(q, k) = DCG(k)/Ideal DCG(k)

DCG(k) = sum (Gain(i) * Discount factor(i)) for i in (1,k)

Discount factor is generally taken as 1/log(1 + pos)

Gain is generally takes as (2^r - 1) where r is relevance score given by humans say 0 - not relevant ,1-near relevant , or 2-relevant.

References

[0] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

[1] https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52

[2] https://blog.thedigitalgroup.com/measuring-search-relevance-using-ndcg

[3] https://www.geeksforgeeks.org/normalized-discounted-cumulative-gain-multilabel-ranking-metrics-ml/

Here are some metric available for validating clustering, explanation of each one is available on sklearn. [0]

If ground truth labels are available:

Adjusted Rand Index
Mutual Information Based scores
Homogeneity, completeness and V-measure
Fowlkes-Mallows scores

If not available :

Silhouette Coefficient
- Range (-1,1)
- 1 means it is similar to data-points in each cluster
Calinski-Harabaz Index
Davies-Bouldin Index
Contingency Matrix

Calculating SSE

It is a sum of distance between each point and its cluster center.

Silhouette Score

It is calculated for each point and then we take an average of it.

a(i) is average distance of a point to other points in same cluster.

b(i) is minimum of above average in for point in other cluster. It given the distance to nearest cluster.

s(i) close to 1 means data point is appropriately clustered. -1 means it is very bad clustered.

Setting s(i) to 0 when cluster size is one ensures that curve is not monotonically decreasing.

Elbow method and Silhouette Analysis

Notebook is available at https://github.com/arcarchit/datastories/blob/master/Silhouette.ipynb

sil1 sil2

Rand Index

When cluster labels are available we can use this matrix
It basically checks the similarity between two cluster assignments
- Labels can also be seen as one type of cluster assignment
- Score basically tells us how similar to cluster assignments are
This works by taking pair of points
- Out of all pairs how many pairs are agreed in both clusters mechanism
- Agree mean both
  - They are in same cluster in both mechanism
  - They are in different cluster in both mechanism
The Rand index has a value between 0 and 1, with 0 indicating that the two data clustering do not agree on any pair of points and 1 indicating that the data clustering are exactly the same

rand_index

One drawback of Rand index is that it can given non zero value for random assignment of clusters. To mitigate that there is matrix called Adjusted Rand Index. [2]
- It specifically does not work when no of clusters are high

Reference

[0] : https://scikit-learn.org/stable/modules/clustering.html

[1] : https://github.com/anthonyng2/udemy-the-complete-machine-learning-course-with-python

[2] : https://davetang.org/muse/2017/09/21/adjusted-rand-index/

Data Stories

Tag evaluation metrics

IR metrics