Search Engines - Information Retrieval
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
- General definition that can be applied to many types of information and search applications
- Primary focus of IR since the 1950s has been on text and documents

Basic Approach to IR
- Most (but not all!) successful approaches are statistical
 – Directly, or an effort to capture and use probabilities
 
- Why not natural language understanding?
 – i.e., computer understands docs and query and matches them
 – State of the art is brittle in unrestricted domains
 – Can be highly successful in predictable settings, though
 – ChatGPT suggests things are improving, but it still makes plenty of mistakes (and doesn’t really “understand” anything)
 
- Could use manually assigned headings
 – e.g., Library of Congress headings, Dewey Decimal headings
 – Hard to predict what headings are “interesting”
 – Expensive, and agreement between human indexers is poor
 
“Bag of Words”
- An effective and popular approach
- Compares words without regard to order
- Consider reordering words in a headline
 – Random: beating takes points falling another Dow 355
 – Alphabetical: 355 another beating Dow falling points
 – “Interesting”: Dow points beating falling 355 another
 – Actual: Dow takes another beating, falling 355 points
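To make the "without regard to order" point concrete, here is a minimal Python sketch. The function name `bag_of_words` and the simple lowercase-and-split tokenization are assumptions made for illustration, not something the slides prescribe. It compares the actual headline with the random reordering above.

```python
from collections import Counter
import re


def bag_of_words(text: str) -> Counter:
    """Map each lowercased token to its count, discarding word order.

    Tokenization here (runs of letters and digits) is an assumption;
    real systems make more careful choices.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


actual = "Dow takes another beating, falling 355 points"
shuffled = "beating takes points falling another Dow 355"

# Both orderings contain exactly the same words, so the bags are equal
# even though only one of the two strings reads as a sentence.
same_bag = bag_of_words(actual) == bag_of_words(shuffled)
print(same_bag)  # True
```

A bag-of-words system therefore cannot tell these two strings apart, which is the trade-off this approach accepts.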
 

Statistical language model
- Document comes from a topic
- Topic (unseen) describes how words appear in documents on the topic
- Use document to guess what the topic looks like
 – Words common in document are common in topic
 – Words not in document much less likely to be in the topic
- Assign probability to words based on document
 – P(w|Topic) ≈ P(w|D) = tf(w,D) / len(D)
- Index estimated topics
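The estimate P(w|D) = tf(w,D) / len(D) can be sketched in a few lines of Python, reading tf(w,D) as the number of times w occurs in D and len(D) as the number of word tokens in D. The name `estimate_topic` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def estimate_topic(document: str) -> dict[str, float]:
    """Estimate P(w|D) = tf(w, D) / len(D) for every word w in D.

    Words that do not occur in D are simply absent from the result,
    which corresponds to an estimated probability of zero.
    """
    tokens = document.lower().split()  # assumed tokenization
    counts = Counter(tokens)
    return {word: tf / len(tokens) for word, tf in counts.items()}


model = estimate_topic("the dow fell and the dow rose")
print(model["dow"])  # 2 occurrences out of 7 tokens
```

The probabilities over the words in the document sum to 1, so the result is a proper distribution, though one that gives no probability at all to unseen words.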
What does a language model (LM) look like when implemented?
- Hypothesis of statistical language model
 – Documents with topic models that are highly likely to generate the query are more likely to be relevant (to query)
- Index collection in advance (chs. 3 & 4)
 – Convert documents into set of P(ti|D)
 – Store in an appropriate data structure for fast access
- Query arrives (chs. 6 & 7)
 – Convert it to the set of P(qi|D), one per query term qi
 – Calculate P(Q|TD) for all documents, where TD is the topic estimated from document D
 – Sort documents by the probability that their topic generates the query
 – Present ranked list
- Generally good results (ch. 8), though not with a version of the model that is this simple (ch. 5)
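The index-then-rank steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation the slides have in mind: the names `index_collection` and `rank` are hypothetical, the "index" is a plain in-memory dictionary, tokenization is a whitespace split, and P(Q|TD) is taken to be the product of the unsmoothed per-term estimates P(qi|D).

```python
from collections import Counter
import math


def index_collection(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Precompute P(t|D) = tf(t, D) / len(D) for every document (assumed layout)."""
    index = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        counts = Counter(tokens)
        index[doc_id] = {t: tf / len(tokens) for t, tf in counts.items()}
    return index


def rank(query: str, index: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Score each document by the product of P(q_i|D) and sort, best first."""
    terms = query.lower().split()
    scores = []
    for doc_id, model in index.items():
        # A query term missing from the document contributes probability 0.
        score = math.prod(model.get(q, 0.0) for q in terms)
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "dow falls as markets slide",
    "d2": "dow rises dow rallies",
    "d3": "weather turns cold",
}
index = index_collection(docs)
ranking = rank("dow rises", index)
print(ranking[0])  # ('d2', 0.125)
```

Note that d1 scores zero even though it mentions "dow", because "rises" never occurs in it. That behavior is one reason a model this simple does not perform well; smoothing the estimates is the usual remedy.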

Some issues that arise in IR
- Text representation
 – what makes a “good” representation?
 – how is a representation generated from text?
 – what are retrievable objects and how are they organized?
- Representing information needs
 – what is an appropriate query language?
 – how can interactive query formulation and refinement be supported?
- Comparing representations
 – what is a “good” model of retrieval?
 – how is uncertainty represented?
- Evaluating effectiveness of retrieval
 – what are good metrics?
 – what constitutes a good experimental test bed?
Reference: Prof. James Allan