
Search Engines - Information Retrieval


Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

- General definition that can be applied to many types of information and search applications

- Primary focus of IR since the 1950s has been on text and documents


Basic Approach to IR

  1. Most (but not all!) successful approaches are statistical
    • Either directly statistical, or an effort to capture and use probabilities
  2. Why not natural language understanding?
    • i.e., computer understands docs and query and matches them
    • State of the art is brittle in unrestricted domains
    • Can be highly successful in predictable settings, though
    • ChatGPT suggests things are improving, but it still makes plenty of mistakes (and doesn’t really “understand” anything)
  3. Could use manually assigned headings
    • e.g., Library of Congress headings, Dewey Decimal headings
    • Hard to predict what headings are “interesting”
    • Expensive, and agreement between human indexers is poor

“Bag of Words”

  • An effective and popular approach
  • Compares words without regard to order
  • Consider reordering words in a headline
    • Random: beating takes points falling another Dow 355
    • Alphabetical: 355 another beating Dow falling points
    • “Interesting”: Dow points beating falling 355 another
    • Actual: Dow takes another beating, falling 355 points
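
A minimal sketch of the idea, assuming a very simple tokenizer (lowercase, strip punctuation, split on whitespace); the `bag_of_words` helper is a hypothetical name used only for illustration. It shows that the "Random" and "Actual" orderings of the headline collapse to the same bag once word order is ignored:

```python
from collections import Counter
import string


def bag_of_words(text: str) -> Counter:
    """Count words in `text`, ignoring order, case, and punctuation."""
    # Assumption: punctuation carries no meaning and whitespace separates words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


random_order = "beating takes points falling another Dow 355"
actual_order = "Dow takes another beating, falling 355 points"

random_bag = bag_of_words(random_order)
actual_bag = bag_of_words(actual_order)

# Both orderings contain the same seven words, each exactly once,
# so a bag-of-words comparison cannot tell them apart.
same_bag = random_bag == actual_bag
print(same_bag)
```

A `Counter` is used rather than a `set` so that repeated words keep their counts, which matters once term frequencies are needed.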

Statistical language model

  • Document comes from a topic

  • Topic (unseen) describes how words appear in documents on the topic

  • Use document to guess what the topic looks like
    - Words common in document are common in topic
    - Words not in document much less likely to be in the topic

  • Assign probability to words based on document

    • P(w|Topic) ≈ P(w|D) = tf(w,D) / len(D)
  • Index estimated topics
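
The estimate P(w|D) = tf(w,D) / len(D) can be sketched in a few lines. This is an illustration only: the `unigram_model` name, the whitespace tokenizer, and the sample document are all assumptions, not part of the original notes.

```python
from collections import Counter
from typing import Dict


def unigram_model(document: str) -> Dict[str, float]:
    """Estimate P(w|D) as term frequency divided by document length."""
    # Assumption: lowercase whitespace tokenization is good enough here.
    tokens = document.lower().split()
    counts = Counter(tokens)   # tf(w, D) for every word w in D
    total = len(tokens)        # len(D)
    return {word: tf / total for word, tf in counts.items()}


doc = "the dow fell and the dow fell again"
model = unigram_model(doc)

# "dow" occurs 2 times in an 8-word document, so P(dow|D) = 2/8.
p_dow = model["dow"]
# A word that never occurs in D gets probability 0 under this estimate.
p_unseen = model.get("nasdaq", 0.0)
print(p_dow, p_unseen)
```

The probabilities over all words in the document sum to 1, and any word absent from the document gets probability 0 under this unsmoothed estimate.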


What does LM look like implemented?

  • Hypothesis of statistical language model
    – Documents with topic models that are highly likely to generate the query are more likely to be relevant (to query)

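  • Index collection in advance (chs. 3 & 4)
    – Convert each document D into a set of probabilities P(t_i|D), one per term t_i
    – Store in an appropriate data structure for fast access

  • Query arrives (chs. 6 & 7)
    – Convert it to a set of P(q_i|D), one per query term q_i
    – Calculate P(Q|T_D), where T_D is the topic estimated from D, for all documents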
    – Sort documents by their topics’ probability
    – Present ranked list

  • Generally good results (ch. 8), though not with a version of the model as simple as this one (ch. 5)
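
The index-then-rank steps above can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the in-memory dictionary used as the "index", the sample documents, and the choice to compute P(Q|T_D) as the product of P(q_i|D) over query terms (treating terms as independent) are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Model = Dict[str, float]


def index_collection(docs: Dict[str, str]) -> Dict[str, Model]:
    """Indexing step: convert each document into its set of P(t_i|D)."""
    index: Dict[str, Model] = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        counts = Counter(tokens)
        index[doc_id] = {t: tf / len(tokens) for t, tf in counts.items()}
    return index


def query_likelihood(query: str, model: Model) -> float:
    """Assumed scoring rule: P(Q|T_D) as the product of P(q_i|D)."""
    score = 1.0
    for term in query.lower().split():
        score *= model.get(term, 0.0)  # unseen term contributes 0
    return score


def rank(query: str, index: Dict[str, Model]) -> List[Tuple[str, float]]:
    """Query step: score every document, then sort by probability."""
    scored = [(doc_id, query_likelihood(query, m)) for doc_id, m in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
docs = {
    "d1": "dow falls as stocks slide",
    "d2": "dow dow dow rallies",
    "d3": "rain falls on the city",
}
index = index_collection(docs)
ranking = rank("dow falls", index)
for doc_id, score in ranking:
    print(doc_id, score)
```

Only `d1` contains both query terms, so it is the only document with a non-zero score. `d2` mentions "dow" three times yet scores exactly 0 because it lacks "falls", which illustrates why a model this simple does not give good results on its own.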



Some issues that arise in IR

  • Text representation
    – what makes a “good” representation?
    – how is a representation generated from text?
    – what are retrievable objects and how are they organized?

  • Representing information needs
    – what is an appropriate query language?
    – how can interactive query formulation and refinement be supported?

  • Comparing representations
    – what is a “good” model of retrieval?
    – how is uncertainty represented?

  • Evaluating effectiveness of retrieval
    – what are good metrics?
    – what constitutes a good experimental test bed?


Reference: Prof. James Allan