
Search Engines - Evaluation



Evaluation

  • Evaluation is key to building effective and efficient search engines
    – measurement usually carried out in controlled laboratory experiments
    – online testing can also be done
  • Efficiency measures similar to those used in database systems
    – e.g., indexing time, query throughput, index size
  • Our focus here is on effectiveness metrics
    – Comparing systems
    – Parameter tuning

Evaluation Corpus

  • Test collections consisting of documents, queries, and relevance judgments
  • Some examples: the TREC test collections (see the sketch below)
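
To make the "relevance judgments" part concrete, here is a minimal Python sketch of loading judgments in the common TREC qrels layout (query id, iteration, document id, relevance grade); all IDs below are made up for illustration:

```python
# Minimal sketch: loading relevance judgments in the common TREC "qrels"
# layout: query_id, iteration, doc_id, relevance. All IDs here are made up.
from collections import defaultdict

qrels_lines = [
    "q1 0 d12 1",
    "q1 0 d56 2",
    "q2 0 d7 1",
]

judgments = defaultdict(dict)
for line in qrels_lines:
    qid, _, docid, rel = line.split()
    judgments[qid][docid] = int(rel)   # graded relevance per (query, document) pair

print(dict(judgments))   # {'q1': {'d12': 1, 'd56': 2}, 'q2': {'d7': 1}}
```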

Evaluation Measures


  • Precision
    – Proportion of a retrieved set that is relevant
    – Precision = |relevant ∩ retrieved| ÷ |retrieved| = P( relevant | retrieved )
  • Recall
    – proportion of all relevant documents in the collection included in the retrieved set
    – Recall = |relevant ∩ retrieved| ÷ |relevant| = P( retrieved | relevant )

EX) Precision and Recall example (see the sketch below)
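
A minimal Python sketch of the calculation, using made-up relevance judgments and a made-up ranked list:

```python
# Minimal sketch: precision and recall for a single query.
# The relevance judgments and the ranked list below are made up.
relevant = {"d3", "d9", "d25", "d56", "d123"}                           # judged relevant
retrieved = ["d123", "d84", "d56", "d6", "d9", "d511", "d129", "d187"]  # system output

hits = [d for d in retrieved if d in relevant]   # relevant AND retrieved
precision = len(hits) / len(retrieved)           # |relevant ∩ retrieved| / |retrieved|
recall = len(hits) / len(relevant)               # |relevant ∩ retrieved| / |relevant|

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")   # 0.38 and 0.60
```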

Average precision of a query

  • Provides a single-number effectiveness measure
    – E.g., for a machine-learning algorithm to detect improvement
  • Average precision is widely used in IR
    – assumes user interested in finding many relevant documents for each query, preferably at top of list
    – requires many relevance judgments in text collection
  • Calculate by averaging the precision values at the ranks where recall increases, i.e., at each relevant document retrieved (see the sketch below)

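A minimal sketch of the calculation in Python, again with made-up judgments and ranking; precision is sampled only at ranks where a relevant document appears:

```python
# Minimal sketch: average precision (AP) for one ranked list.
# Precision is sampled at each rank where a relevant document appears,
# and the sum is divided by the total number of relevant documents.
def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)        # precision at this recall point
    return sum(precisions) / len(relevant) if relevant else 0.0

relevant = {"d3", "d9", "d25", "d56", "d123"}         # made-up judgments
ranking = ["d123", "d84", "d56", "d6", "d9", "d511"]  # made-up system output
print(average_precision(ranking, relevant))           # (1/1 + 2/3 + 3/5) / 5 ≈ 0.453
```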

Another average: F Measure

  • Harmonic mean of recall and precision
    F = 2 · Recall · Precision ÷ (Recall + Precision)
    – harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large
  • More general form (for those who are curious)
    F_β = (β² + 1) · Recall · Precision ÷ (Recall + β² · Precision)
    – β is a parameter that determines relative importance of recall and precision
    – People usually use the balanced F1 measure (β = 1); see the sketch below

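A minimal sketch of both forms in Python; the precision and recall values are arbitrary, just to show the effect of β:

```python
# Minimal sketch: F measure from precision and recall.
def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    # (beta^2 + 1) * R * P / (R + beta^2 * P); beta = 1 gives the harmonic mean.
    return (b2 + 1) * precision * recall / (recall + b2 * precision)

p, r = 0.5, 0.25
print(round(f_measure(p, r), 3))             # F1 = 0.333 (harmonic mean)
print(round(f_measure(p, r, beta=2.0), 3))   # F2 weights recall more: 0.278
```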

Averaging across queries

  • Two main types of averaging
    – Micro-average - each relevant document is a point in the average
    – Macro-average - each query is a point in the average
  • Also done with average precision value
    – Average of many queries’ average precision values
    – Called mean average precision (MAP), since “average average precision” would sound odd (see the sketch below)
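
A minimal sketch of MAP as a macro-average over queries; it repeats the average_precision() helper from the sketch above so it runs on its own, and all query/document IDs are made up:

```python
# Minimal sketch: MAP as a macro-average of per-query average precision.
def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

rankings = {"q1": ["d123", "d84", "d56"], "q2": ["d7", "d3", "d11"]}   # made-up output
judgments = {"q1": {"d123", "d56", "d9"}, "q2": {"d3"}}                # made-up judgments

ap_values = [average_precision(rankings[q], judgments[q]) for q in rankings]
map_score = sum(ap_values) / len(ap_values)   # each query counts equally (macro-average)
print(round(map_score, 4))                    # ≈ 0.5278
```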

Discounted Cumulative Gain (DCG)

  • Popular measure for evaluating web search and related tasks
  • Two assumptions:
  1. Highly relevant documents are more useful than marginally relevant documents
    -> Implies we need multi-valued relevance judgments
  2. The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
    -> AP captures this, too, but normally binary judgments
  • Uses graded relevance as a measure of the usefulness, or gain, from examining a document
  • Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
  • Typical discount is 1/log (rank)
    – With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
  • DCG is the total gain accumulated at a particular rank p (see the code sketch after this list):
    DCG_p = rel_1 + Σ_{i=2..p} rel_i / log₂(i)
  • Alternative formulation:
    DCG_p = Σ_{i=1..p} (2^rel_i - 1) / log₂(1 + i)
    – used by some web search companies
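
A minimal sketch of both DCG formulations in Python; the graded judgments are illustrative, chosen so that the standard formulation reproduces the rank-5 and rank-10 values quoted in the next section:

```python
import math

# Minimal sketch: DCG at rank p from graded relevance gains (rel_1, rel_2, ...).
def dcg(gains, p):
    # Standard form: rel_1 + sum over i = 2..p of rel_i / log2(i).
    gains = gains[:p]
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def dcg_alt(gains, p):
    # Alternative form: sum over i = 1..p of (2^rel_i - 1) / log2(1 + i).
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains[:p], start=1))

# Illustrative graded judgments (0-3) for the top 10 results of one query.
gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(gains, 5), 2))        # 6.89
print(round(dcg(gains, 10), 2))       # 9.61
print(round(dcg_alt(gains, 10), 2))   # ≈ 16.8 (different scale from the standard form)
```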

Normalized DCG (NDCG)

  • DCG numbers are averaged across a set of queries at specific rank values
    – e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
  • DCG values are often normalized by comparing the DCG at each rank with the DCG value for the ideal (perfect) ranking
    – makes averaging easier for queries with different numbers of relevant documents (see the sketch below)
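
A minimal, self-contained sketch of NDCG using the same illustrative judgments as the DCG sketch above:

```python
import math

# Minimal sketch: NDCG at rank p, normalizing DCG by the DCG of the ideal
# (perfect) ordering of the same graded judgments.
def dcg(gains, p):
    gains = gains[:p]
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(gains, p):
    ideal_dcg = dcg(sorted(gains, reverse=True), p)   # perfect ranking of the same gains
    return dcg(gains, p) / ideal_dcg if ideal_dcg > 0 else 0.0

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]   # same illustrative judgments as above
print(round(ndcg(gains, 5), 2))          # ≈ 0.71
print(round(ndcg(gains, 10), 2))         # ≈ 0.88
```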

GitHub - IlMinCho/SE-Evaluation

Analysis

1.

| Query   | QL     | BM25   | BM25 vs QL % | DPR    | DPR vs QL % |
|---------|--------|--------|--------------|--------|-------------|
| 23849   | 0.0151 | 0.0186 | 23.2%        | 0.2391 | 1483.4%     |
| 42255   | 0.1987 | 0.2625 | 32.1%        | 0.4411 | 122.0%      |
| 47210   | 0.1997 | 0.2021 | 1.2%         | 0.3692 | 84.9%       |
| …       | …      | …      | …            | …      | …           |
| 1113256 | 0.4953 | 0.4969 | 0.3%         | 0.4651 | -6.1%       |
| 1115210 | 0.0915 | 0.0887 | -3.1%        | 0.0651 | -28.9%      |
| 1116380 | 0.0396 | 0.0111 | -72.0%       | 0.0587 | 48.2%       |
| 1119543 | 0.0000 | 0.0000 | 0.0%         | 0.0000 | 0.0%        |
| 1121353 | 0.2557 | 0.2349 | -8.1%        | 0.1002 | -60.8%      |
| 1122767 | 0.3460 | 0.3235 | -6.5%        | 0.2052 | -40.7%      |
| 1127540 | 0.2693 | 0.2764 | 2.6%         | 0.1705 | -36.7%      |
| 1131069 | 0.0288 | 0.0856 | 197.2%       | 0.2143 | 644.1%      |
| 1132532 | 0.1666 | 0.1044 | -37.3%       | 0.2442 | 46.6%       |
| 1133579 | 0.6677 | 0.6666 | -0.2%        | 0.7530 | 12.8%       |
| 1136043 | 0.0976 | 0.1569 | 60.8%        | 0.3695 | 278.6%      |
| 1136047 | 0.0666 | 0.0464 | -30.3%       | 0.0623 | -6.5%       |
| 1136769 | 0.0000 | 0.0000 | 0.0%         | 0.0000 | 0.0%        |
| 1136962 | 0.4689 | 0.4879 | 4.1%         | 0.4199 | -10.4%      |
| all     | 0.1952 | 0.1886 | -3.4%        | 0.2091 | 7.1%        |

2.

Looking at the per-query results, BM25 is often slightly higher than QL and DPR is often the highest, i.e., the relative improvement tends to grow as we move from QL to BM25 to DPR. However, in the "all" row the averages increase in the order BM25 (0.1886), QL (0.1952), DPR (0.2091): averaged over all queries, BM25 actually falls slightly below QL, while DPR remains the best.

Which model does better depends on the query. QL directly scores the probability of the query terms appearing in a document, while BM25 relies on a more complex combination of features. QL may therefore be better suited to situations where there is a clear semantic relationship between a query and its relevant documents, whereas for other data sets or other types of queries BM25 may work better.

DPR encodes queries and documents as dense vectors. Because it ranks documents by the similarity between these query and document vectors, it is more sophisticated than QL and BM25 and can handle complex queries and documents more effectively.

3.

MAP is intended to evaluate the effectiveness of the retrieval system. When a query has no relevant documents (or no relevance judgments at all), there is nothing to relate the retrieved documents to, so precision and recall cannot be computed and average precision is undefined for that query. There is then nothing to evaluate for such queries, and including them in the MAP calculation can lead to misleading results (see the sketch below).
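
A minimal sketch of that point: the hypothetical per-query AP values below include a query with no judgments, and the resulting MAP changes depending on whether such queries are skipped or scored as zero:

```python
# Minimal sketch: the effect of queries with no relevance judgments on MAP.
# ap_by_query maps query id -> average precision, or None when the query has
# no judged relevant documents (AP is undefined there). IDs and values are made up.
ap_by_query = {"q1": 0.56, "q2": 0.50, "q3": None, "q4": 0.10}

judged = [ap for ap in ap_by_query.values() if ap is not None]

map_skip = sum(judged) / len(judged)                                          # skip unjudged queries
map_zero = sum(ap or 0.0 for ap in ap_by_query.values()) / len(ap_by_query)   # score them as 0

print(round(map_skip, 3))   # 0.387
print(round(map_zero, 3))   # 0.29
```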

4.


Reference: Prof. James Allan