Ryerson Computer Science CPS842 Info Retrieval



Overview

What is IR?

Deals with the representation, storage, organization of, and access to information items

Questions to answer

  • how does the IR system collect and store information?
  • how does it process a user's query?

Apply the same text processing and transformations to both the input documents and the user queries; the resulting document terms are indexed

IR system does matchmaking and ranks the relevant documents

	query      |                       |-> query terms |
	           | -> text processing -> |               |-> matchmaking -> return ranked docs
	documents  |                       |-> index terms |

Basic terminology

  • index term - word or group of words from the document
  • document - set of representative index terms
  • collection (or corpus) - group of documents
  • vocabulary - set of distinct index terms
  • query - set of index terms transformed from a user's information need

IR Model Representation

quadruple representation: [D, Q, F, R(qi, dj)], where D is the set of document representations, Q is the set of query representations, F is the framework for modeling documents, queries, and their relationships, and R(qi, dj) is the ranking function

Main Parts of the System / Workflow

  1. Index the corpus
    • build the inverted index
  2. Process user queries (apply the same text transformations as for documents)
  3. Do matchmaking & ranking against the inverted index, return ranked results

Models - Overview

Boolean

  • (+) Easy to use
  • (+) Easy to implement
  • (-) No ranking
  • (-) No partial matches

Vector Space

  • (+) Simple
  • (+) Ranked
  • (-) Assumes term independence

Probabilistic

  • (+) Strong theoretical basis
  • (+) Theoretically gives the best matches
  • (-) Relevance information is required
  • (-) Requires ongoing collection of relevance judgements

Okapi BM25 -> best

Boolean Model

Based on set theory, Boolean algebra

  • Boolean operators, proximity operators, simple regular expressions
  • problem: no ranking
  • Improvement area: implement "skip pointers"

Term-document Matrix

  • Not very efficient, not scalable

Inverted Index

  • dictionary of lists of document id's
  • query processing: match up documents by id
  • query processing optimization: start the boolean query by ordering the terms in order of increasing number of associated documents
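
A minimal sketch (toy corpus and names are illustrative, not from the course) of a dictionary-of-postings inverted index and AND-query processing that starts with the rarest term:

```python
# Minimal inverted-index sketch: dictionary of sorted doc-id lists (postings).
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(list)              # term -> sorted list of doc ids
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def boolean_and(terms):
    """Intersect postings, starting with the rarest term (smallest df)."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    return sorted(result)

print(boolean_and(["home", "sales", "july"]))   # -> [2, 3]
```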

Westlaw example queries

  • disab! /p access! /s work-site work-place (employment /3 place)

Vector Space Model

Use a vector to represent a document, and also to represent the query. Measure the similarity between the query vector and the document vector. Rank is based on similarity

Lucene uses the Vector Space Model

Bag of words model

Order of words is not important: "John runs faster than Mary" is treated the same as "Mary runs faster than John"

Zipf's Law

  • i-th most frequent term has frequency proportional to 1/i
  • see the distribution pattern: "heavy tailed" (a few very frequent terms, a long tail of rare terms)

TF-IDF Weighting

TF - term frequency

take log (base 10) of the raw frequency:

		$tf_{ij} = \begin{cases}1 + \log_{10} f_{ij} & \text{if } f_{ij} > 0\\ 0 & \text{otherwise}\end{cases}$

		alternative (max-normalized):
		$tf_{ij} = 0.5 + 0.5\,\cfrac{f_{ij}}{\max_i f_{ij}}$

		D1: I know you know
		D2: you know him
		V = { him, I, know, you} // complete vocab from all docs
		f11 = 0, f21 = 1, f31 = 2, f41 = 1
		f12 = 1, f22 = 0, f32 = 1, f42 = 1
		tf11 = 0, tf21 = 1, tf31 = 1.3, tf41 = 1
		etc..
		

IDF - Inverse document frequency

IDF is larger for terms that appear in fewer documents: $idf_i = \log_{10}\cfrac{N}{n_i}$, where N is the total number of documents and $n_i$ (the document frequency, df) is the number of documents containing term i

finally, $w_{ij} = tf_{ij} \times idf_i$
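
A small sketch that reproduces the tf and w = tf x idf calculations for the D1/D2 example above (log base 10, N = 2 documents; note that terms appearing in every document get idf = 0):

```python
import math

docs = {"D1": "I know you know", "D2": "you know him"}
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()}, key=str.lower)  # him, I, know, you

def tf(f):                                  # 1 + log10(f) if f > 0, else 0
    return 1 + math.log10(f) if f > 0 else 0.0

df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

weights = {}
for name, text in docs.items():
    counts = {t: text.split().count(t) for t in vocab}
    weights[name] = {t: tf(counts[t]) * idf[t] for t in vocab}

# tf for "know" in D1 is 1 + log10(2) ~= 1.30; its idf is 0 here because it appears in both docs
print(weights["D1"])
```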

Vector space

Where t is the number of terms, the vector space has t dimensions

can use the Euclidean distance formula to find the distance between the query and each document

d: ( 1 1 1 0 1 1 1 )
q: ( 1 1 1 0 0 0 0 )
		

Euclidean distance is not so good, so use a similarity score based on the angle between the vectors instead

smaller angle means more similar

$sim(d_j, q) = \vec{d_j} \cdot \vec{q} = \sum_i w_{ij}\, w_{iq}$

Cosine Similarity model

$sim(d,q) = \cfrac{\sum{w_{ij} w_{iq}}}{\sqrt{\sum{w_{ij}^2}\sum{w_{iq}^2}}}$

Longer documents tend to have higher term frequencies and would otherwise be favored, so we normalize by the vector lengths.

The result is normalized -> between 0 and 1

			d: (2 3 0)
			q: (1 0 1)

			sim = (2x1 + 3x0 + 0x1) / (sqrt(2^2 + 3^2 + 0^2) x sqrt(1^2 + 0^2 + 1^2)) = 2 / (sqrt(13) x sqrt(2)) ≈ 0.39

			Table Filling:
			f - given (raw term frequency)
			tf - calculated: 1 + log(f)
			df - given; the number of documents containing the term
			idf - calculated only once per term, shared by all docs and queries
			w = tf x idf
			length of a doc is the square root of the sum of the squares of its weights
			then use the length to normalize each weight:  nw = weight / length

			take the product of document nw and query nw for each term, then sum the per-term products ->> exactly the same as the long formula above


			can calculate from the inverted index data
			 t1(5000) ---[1][3]--> [2][4]
			 t2(1000) ---[1][4]--> [2][4]
			 t3(500)  ---[1][2]--> [2][4]
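
A sketch of the table-filling procedure above: compute tf-idf weights, length-normalize each vector, then sum the per-term products. The tiny vocabulary and idf values are assumed for illustration:

```python
import math

def tfidf_vector(term_freqs, idf):
    """tf-idf weights for one doc/query: w = (1 + log10(f)) * idf."""
    return {t: (1 + math.log10(f)) * idf[t] for t, f in term_freqs.items() if f > 0}

def normalize(vec):
    """Divide each weight by the vector length (sqrt of sum of squares)."""
    length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / length for t, w in vec.items()}

def cosine(doc_vec, query_vec):
    """Dot product of the two length-normalized vectors."""
    d, q = normalize(doc_vec), normalize(query_vec)
    return sum(d[t] * q.get(t, 0.0) for t in d)

# Toy idf values (assumed, not from the course slides)
idf = {"car": 1.65, "insurance": 2.08, "auto": 2.38, "best": 1.30}
doc   = tfidf_vector({"car": 1, "insurance": 2, "auto": 3}, idf)
query = tfidf_vector({"car": 1, "insurance": 1, "best": 1}, idf)
print(round(cosine(doc, query), 3))
```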
		

Probabilistic Model

	if P(R|D) > P(NR|D), D is returned
	Background
		Given two events A & B,
		P(A,B) - joint probability
		P(A|B) - conditional probability of A, given that B has occurred
	Bayes' Theorem
		$P(A|B) = \cfrac{P(B|A)\,P(A)}{P(B)} = \cfrac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|\bar{A})\,P(\bar{A})}$

Binary Independence Model

Ranking with Estimation

Ranking without Estimations

n_i is the same as df, so the Ranking-without-Estimation formula reduces to the idf formula

Okapi BM (best match) 25

Most popular probabilistic model

  • Considers term frequency
  • also does document length normalization
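
A rough sketch of BM25 scoring with commonly used defaults (k1 = 1.5, b = 0.75) and one common idf variant; the exact formula and parameters in the course slides may differ:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, N, avg_dl, k1=1.5, b=0.75):
    """Score one document for a query; doc_freq maps term -> number of docs containing it."""
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in doc_freq:
            continue
        f = doc_terms.count(t)                 # term frequency in this document
        idf = math.log((N - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1)  # one common idf variant
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avg_dl))
    return score

docs = [d.split() for d in ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]]
df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1
avg_dl = sum(len(d) for d in docs) / len(docs)
print([round(bm25_score(["cat", "dog"], d, df, len(docs), avg_dl), 3) for d in docs])
```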

Document Pre-Processing

The major steps in document pre-processing include:

  1. Parsing & tokenization
  2. Normalization
  3. Stop word removal - usually articles, prepositions, conjunctions (the most frequent terms, at the top of the Zipf curve)
  4. Stemming
  5. Selection of index terms
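
A minimal sketch of this pipeline; the stop-word list and the crude suffix-stripping "stemmer" are placeholders, not the Porter stemmer a real system would use:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are"}  # tiny illustrative list

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())              # 1-2. tokenize + case-fold (normalize)
    tokens = [t for t in tokens if t not in STOP_WORDS]           # 3. stop-word removal
    stems = [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]    # 4. very crude stemming (placeholder)
    return [s for s in stems if s]                                # 5. keep the remaining index terms

print(preprocess("The cats are sleeping on the mats."))           # -> ['cat', 'sleep', 'mat']
```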

Indexing & Searching

Inverted index

Bitmaps

Signature files

Techniques

Map-Reduce work distribution

  • Term Partition Index
  • Document Partition Index

Distributed Indexing: Map Reduce

  1. Map (parsing distributed across multiple machines)
  2. Reduce (indexing distributed across multiple machines)

Used to generate the Document-Partitioned Index

  • first a term-partitioned index is generated
  • then the document-partitioned index

Example of converting from Term Partitioned Index to Document Partitioned Index:

Suppose the goal is 3 servers, with DocIds 1-10,000, 10,001 - 20,000, 20,001 - 30,000

  1. Parse phase: each parser splits its terms into a 3-part segment file - one part per document server
  2. Reduce phase: each indexer reads its doc-id range from all of the parsers' segment files and builds its portion of the document-partitioned index

Generally better because the query-processing workload is distributed over multiple servers: each server can do the intersection over its own documents, and the central server only needs to union the partial results and sort them
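
A toy sketch of the parse/reduce idea above: each parser splits its (term, docId) pairs by doc-id range, one segment per index server, and each indexer merges the matching segments from every parser. All structures and names are simplified assumptions:

```python
from collections import defaultdict

RANGES = [(1, 10000), (10001, 20000), (20001, 30000)]   # one doc-id range per index server

def server_for(doc_id):
    return next(i for i, (lo, hi) in enumerate(RANGES) if lo <= doc_id <= hi)

def parse(term_doc_pairs):
    """Parse phase: write (term, doc_id) pairs into one segment per target server."""
    segments = [defaultdict(list) for _ in RANGES]
    for term, doc_id in term_doc_pairs:
        segments[server_for(doc_id)][term].append(doc_id)
    return segments

def reduce_phase(all_segments, i):
    """Reduce phase: indexer i merges segment i from every parser into its partial index."""
    index = defaultdict(list)
    for segments in all_segments:
        for term, postings in segments[i].items():
            index[term].extend(postings)
    return {t: sorted(p) for t, p in index.items()}

parser1 = parse([("apple", 5), ("apple", 15000), ("pear", 25000)])
parser2 = parse([("apple", 9000), ("pear", 12)])
print(reduce_phase([parser1, parser2], 0))   # server 0 holds docs 1-10,000
```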

Merging Indexes (dynamic indexing)

Z0 --> I0 --> I1 --> I2 --> I3
  1. Fill Z0 (an in-memory index)
  2. When Z0 is full, write it out as I0
  3. The next time Z0 fills and I0 already exists, merge them into I1 (Z0 + I0)
  4. The next time Z0 fills and I0, I1 exist, merge into I2 (Z0 + I0 + I1), and so on
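
A sketch of this logarithmic merging scheme with a tiny in-memory buffer Z0 and a list of on-disk generations I0, I1, ...; real implementations merge posting files, not Python sets:

```python
class LogMergeIndex:
    """Toy logarithmic merging: Z0 is an in-memory buffer; generation I_k holds ~LIMIT * 2**k items."""
    LIMIT = 4                         # capacity of the in-memory buffer Z0

    def __init__(self):
        self.z0 = set()               # in-memory index
        self.generations = []         # generations[k] is I_k (None if empty)

    def add(self, term):
        self.z0.add(term)
        if len(self.z0) >= self.LIMIT:
            self._flush()

    def _flush(self):
        block, self.z0 = self.z0, set()
        k = 0
        # While I_k is already occupied, merge it into the block and move up to I_{k+1}
        while k < len(self.generations) and self.generations[k] is not None:
            block |= self.generations[k]
            self.generations[k] = None
            k += 1
        if k == len(self.generations):
            self.generations.append(None)
        self.generations[k] = block

idx = LogMergeIndex()
for i in range(20):
    idx.add(f"term{i}")
print([len(g) if g else 0 for g in idx.generations])   # -> [4, 0, 16]
```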

Blocked Sort-Based Indexing (BSBI)

Single-Pass In-Memory Indexing (SPIMI)

Dynamic indexing

  • Index Merging

Retrieval

Top-K Retrieval

reduce computation cost by eliminating a large number of documents without computing their cosine scores

Techniques:

  • index elimination - only consider documents containing terms whose idf exceeds some threshold
  • champion list

Google Algorithms

  • Page Rank
  • Tiered Index
  • Anchor Text (anchor text from linking pages is given higher priority than the page's own headings and body)
  • Proximity Constraints

Midterm
- short answers (compare models)
- calculation questions
- other questions
- no MC
----
Oct 8
1. know the IR system architecture model
2. don't have to memorize all the probability fmlas
	- need to know the benefits/disadv of each model, comparisons
3. preprocessing: parsing -> normalization (equivalence classes) -> stopwords -> stemming -> index term selection
4. indexing: just inverted index
 	BSBI - basic ideas from these
	SPIMI - basic ideas from these
	MapReduce: why document-partitioning is faster (more processing on the data servers, less on the app server)
	dynamic indexing:...
5. evaluation:
	reference collection

6. pseudo rel. feedback - assume the top results are relevant
	global vs local
	

Retrieval Evaluation

Precision and Recall

  • Recall is the fraction of the relevant documents that have been retrieved; R = numRelevantRetrieved / totalRelevant
  • Precision is the fraction of the retrieved documents that are relevant; P = numRelevantRetrieved / totalRetrieved
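
A quick sketch of computing precision and recall for one query, given the list of retrieved doc ids and the set of judged-relevant doc ids (toy ids):

```python
def precision_recall(retrieved, relevant):
    relevant_retrieved = len(set(retrieved) & set(relevant))
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [3, 7, 12, 15, 20]        # docs returned by the system
relevant  = {3, 12, 40, 55}           # judged relevant (ground truth)
print(precision_recall(retrieved, relevant))   # (0.4, 0.5): 2/5 retrieved are relevant, 2/4 relevant found
```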

See the Precision-Recall curve calculation - precision usually starts high and drops (sometimes to zero) as recall increases, often showing plateaus

Precision-recall tradeoff

  • fewer docs returned tends to have higher precision, lower recall
  • to get the full recall (100%), have to return many documents (including non-relevant), which means lower precision
  • ideal case of the precision-recall curve is top-left corner of the typical curve (never achieved in practice)

Interpolated Precision @ some %

Maximum precision at any recall level greater than or equal to the given recall level

Precision @ K

Precision computed over the top K results. E.g., P @ 5 is the precision over the first five results

Mean Average Precision

AP = SUM(precision at each relevant retrieved document) / |Relevant|; MAP = mean of the AP values over all queries
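
A sketch of average precision for one ranked result list; MAP is then just the mean of AP over all queries (doc ids are made up):

```python
def average_precision(ranked, relevant):
    """Sum of precision@k at each relevant retrieved doc, divided by |Relevant|."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k            # precision at this cut-off
    return total / len(relevant) if relevant else 0.0

ranked = [1, 5, 3, 9, 7]
relevant = {1, 3, 7, 8}
print(round(average_precision(ranked, relevant), 3))
# MAP over several queries: sum(average_precision(r, rel) for each query) / number_of_queries
```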

R-Precision

Given the total number of relevant documents |R|, take the top |R| from the results and calculate the fraction of relevant documents among that set

Mean Reciprocal Rank

F-measure

Harmonic mean of precision and recall: F = 2PR / (P + R). Not so common; the most common measures are MAP and NDCG

Discounted Cumulated Gain (DCG)

discounts the gain of relevant documents that appear lower in the ranked list

  • Ideal CG and DCG: order results from most relevant to least (if relevance information is available)
  • Normalized: position-by-position, divide CG by ICG and DCG by IDCG
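
A sketch of DCG and NDCG using the common log2 discount; the course slides may define the discount slightly differently (e.g. no discount for the first few positions):

```python
import math

def dcg(gains):
    """Discounted cumulated gain: later positions contribute less (divide by log2 of the rank)."""
    return sum(g / (math.log2(i) if i > 1 else 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = sorted(gains, reverse=True)          # ideal ordering: most relevant first
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

gains = [3, 2, 3, 0, 1, 2]     # graded relevance of the results, in ranked order
print(round(dcg(gains), 3), round(ndcg(gains), 3))
```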

Verification

  • A/B Testing
    • two versions of the system - the new version will be released only to a small number of users at first
  • Crowdsourcing
  • Clickthrough data

Relevance Feedback & Query Expansion

Rocchio Algorithm

Thesaurus Expansion

User provides feedback on their results

Only useful for that user's session

Pseudo relevance feedback

Assume the top-ranked results are relevant and feed them back automatically, without asking the user.

Feedback is generally only used for query & result optimization within a single user session, since one person's query may match another person's query but the two users may find different results valuable

Rocchio Algorithm

Formula provides the ideal query vector based on the user's input of what is relevant and what is not (depends on the user!)

Standard Rocchio: relevant documents are usually given more emphasis, so Beta is usually larger than Gamma

  • best for a lot of judgement data
  • last fmla (ide_dec_hi) okay for few judgement data
  • note: if the resultant vector has a term that is less than 0, just use 0
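
A sketch of the standard Rocchio update; the alpha/beta/gamma values below are typical defaults, not necessarily the ones from the slides, and negative weights are clipped to 0 as noted above:

```python
def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """New query = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)."""
    terms = set(query_vec) | {t for d in relevant_docs + nonrelevant_docs for t in d}
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs) if nonrelevant_docs else 0.0
        w = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
        new_q[t] = max(w, 0.0)            # clip negative weights to 0
    return new_q

q = {"cheap": 1.0, "flights": 1.0}
rel = [{"cheap": 0.8, "flights": 0.9, "airfare": 0.7}]
nonrel = [{"cheap": 0.9, "hotels": 0.8}]
print(rocchio(q, rel, nonrel))
```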

Probabilistic Method

Does re-weighting, but does not introduce new terms; does not consider the original query

Query expansion using Thesaurus

  • can increase recall
  • but can decrease precision

WordNet - Princeton University

Building our own correlation matrix (of related terms)

Fill out a matrix of term-term relationships calculated with the formulas given in the notes:
     t1  t2  t3  t4  ... tn
t1   1   0.1 0.8 0.2
t2
t3
...
				

Nearest neighbours in this matrix are not necessarily terms that appear close to each other within a document; they merely have a high correlation (across docs)
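
One way to sketch such a matrix from a term-document frequency matrix: an association value c_uv = sum over documents of f_u,d * f_v,d, with a Jaccard-style normalization; this is a common formula but may not be the exact one from the notes:

```python
# Toy term-document frequency matrix: rows are terms, columns are documents
freqs = {
    "t1": [2, 0, 1, 3],
    "t2": [1, 1, 0, 2],
    "t3": [0, 4, 1, 0],
}

def correlation(u, v):
    """Unnormalized association: sum over documents of f_u,d * f_v,d."""
    return sum(a * b for a, b in zip(freqs[u], freqs[v]))

def normalized_correlation(u, v):
    cuv = correlation(u, v)
    return cuv / (correlation(u, u) + correlation(v, v) - cuv)   # Jaccard-style normalization

terms = sorted(freqs)
for u in terms:
    print(u, [round(normalized_correlation(u, v), 2) for v in terms])   # diagonal is 1
```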

Similarity Thesaurus

Each term is represented as a vector of documents; same idea as the Vector Space model but indexed by term instead of by document. See slides for the formula

From a sample term-doc matrix:

  1. Calculate itf (inverse term frequency) for each document
  2. t1 = (w11, w12, w13, ... w1,10)

Global Analysis vs Local Analysis

Local is only applied to the top-ranked documents; should help prevent the precision from dropping too much

Local has to be done in real-time based on the results, so it can be slower

GUI & Usage

Info seeking

There are several processes that people use IR systems to find things...

Seeking Specific facts

Like looking for a formula or a phone number; something concrete, so the search can be considered finished once it is found.

Exploratory Process

  • Learning
  • Investigating

Classic Process

  • Problem identification
  • Articulate needs
  • Query formulation
  • Results evaluation

Dynamic Process

  • Berry picking
  • Orienteering

Results display

Query reformulation

  • Spelling correction
  • Term expansion
  • Relevance feedback from the user

Keywords in Context

Show the keywords entered by the user, highlighted in a snippet from the document

Grouping

  • Cluster by similarity
  • Categories

Snippets

Design evaluation

  • Longitudinal testing: deploy to everyone for a period of time, get feedback through the logs and surveys
  • A/B testing: deploy to a random subset of users, compare their actions with users on the original system

Advertising

Can charge per view or per click

Sponsored results sometimes appear along with actual results (along the top or side)

Ad ranking

  • First cut
  • Bid price

Documents and Queries

Documents

Meta data

Doc formats

  • PDF, doc, etc

Information theory

  • Entropy (proportional to unpredictability)
    • Compression easier with predictable content

Compression

  • Should be able to search & randomly access compressed text without decoding everything
  • Decompression speed is usually more important than compression speed

Query

Intent

Session boundaries

Web Search

Web graph

Not strongly connected

Bow-tie structure

  • IN
  • SCC (strongly connected core)
  • OUT
  • Tube
  • Tendrils
  • A visualization of the graph of a large set of web pages forms the shape of a bow-tie

Index size

Large index size is a good sign of search engine quality

Issues in determining size

  • Dynamic pages
  • Duplicate pages
  • Server connections

Simple size estimation technique for getting the Lower Bound

  • Use OR between several frequently used words. Eg: a OR the OR some

Relative size estimation

  • E1/E2 ~ y/x, where x is the fraction of a sample of E1's pages found in E2, and y is the fraction of a sample of E2's pages found in E1 (both estimate the overlap)
  • Random search
    • Won't work unless you have access to the logs
    • Not really statistically sound
  • Random IP

Spam

Cloaking

Doorway

Link spam

SEO

Duplication checking with shingles

Search engine architecture

Caching is extensively used for the most common queries

Attributes

  • Cluster based architecture
  • Usually use multiple index tiers; but the problem with this is that less common queries will have slower processing time

Ranking signals

  • Content
  • Layout
  • Format indicators

Structural signals

Web Crawler

Must-have Features

  • Robustness
    • handles bot traps
    • deals with duplicate data
  • Politeness
    • respects Robots.txt
    • does not bombard a server with requests

Most Important Characteristics

  • Quality
  • Freshness
  • Volume

Should-have Features

  • Distributed
  • Scalable
  • Performance & Efficiency

Crawling process

Choose seeds

Fetch

Parse

Etc

Architecture

Scheduler

Downloader

Storage

Policies

Revisit

Politeness

  • Rules
    • Identify as a crawler
    • Obey robots.txt
    • Keep bandwidth usage low on any given site

Practical issues

  • Server performance
  • Poorly written HTML
  • Duplicate content

Link Analysis

Algorithms

PageRank

  • Recursive algorithm ... Repeat until convergence, then done
    • See PDF notes for example of rank propagation
    • Steps
      1. Adjacency matrix from web graph
      2. Create transition probability matrix for every node
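
A small sketch of PageRank by power iteration with a damping factor of 0.85; the course example may use a different damping value or a plain transition matrix without teleportation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict node -> list of nodes it links to. Returns node -> rank."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):                      # repeat until (approximate) convergence
        new_rank = {u: (1 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u] or nodes                  # dangling node: spread its rank evenly
            share = damping * rank[u] / len(out)
            for v in out:
                new_rank[v] += share
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print({k: round(v, 3) for k, v in pagerank(web).items()})
```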

Sample PageRank solving problem

Kleinberg HITS

  • Research model only
  • Query biased
  • Hubs and authorities
  • Steps
    1. initialization
    2. update authority scores
    3. updating the hub scores
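
A sketch of the HITS loop on a small query-biased subgraph: initialize to 1, alternately update authority and hub scores, and normalize each round (graph and iteration count are illustrative):

```python
import math

def hits(links, iterations=20):
    """links: dict page -> list of pages it links to. Returns (authority, hub) scores."""
    nodes = list(links)
    auth = {u: 1.0 for u in nodes}                      # 1. initialization
    hub = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # 2. authority score = sum of hub scores of pages linking to the page
        auth = {u: sum(hub[v] for v in nodes if u in links[v]) for u in nodes}
        # 3. hub score = sum of authority scores of the pages it links to
        hub = {u: sum(auth[v] for v in links[u]) for u in nodes}
        # normalize so the scores don't blow up
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {u: x / a_norm for u, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}
    return auth, hub

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
auth, hub = hits(graph)
print({u: round(auth[u], 3) for u in graph}, {u: round(hub[u], 3) for u in graph})
```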

Sample HITS solving problem

Recommender Systems

Recommend items based on algorithmically estimated preferences and interests

Approaches

1 Content-Based

  • Estimates a user's rating for an item based on the ratings they gave to similar items
  • Drawbacks
    • Over specialization

2 User-Based Collaborative Filtering

Find the similarity of users based on their ratings on a set of items

  1. Find the similarity of the target user to each of the other users
    1. For each other user, calculate similarity using Cosine Similarity or the Pearson Correlation Coefficient over the items both have rated
  2. Calculate the predicted score of an item for the target user by feeding the "most similar" users' ratings into an aggregate function such as a Weighted sum

3 Item-Based Collaborative Filtering

Same approach as the User-Based CF, but users and items are inverted in the formulas

Example Item-Based Collaborative Filtering problem

Collaborative Filtering Drawbacks

  • New users and new items have no ratings
  • Scalability and computation time

Hybrid

A third option is to combine both the user-based and content-based approaches

Similarity Functions

Pearson Correlation Coefficient

$sim(x,y) = {{\sum{(r_{xs} - \bar{r_{x}})(r_{ys} - \bar{r_y})}}\over{\sqrt{\sum{(r_{xs} - \bar{r_x})^2}\sum{(r_{ys} - \bar{r_y})^2}}}}$

Cosine Similarity

See fmla earlier in this note

Aggregate Functions

Weighted Sum

$rating_{xy} = {1 \over{\sum{sim(x,other)}}} \times \ \sum{sim(x,other)(otherRating_{y})} $
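
A sketch of user-based CF that combines the Pearson correlation over co-rated items with the weighted-sum prediction above (toy ratings; a real system would typically also mean-center the ratings):

```python
import math

ratings = {                                   # user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 3, "m2": 1, "m3": 2, "m4": 3},
    "carol": {"m1": 4, "m2": 3, "m3": 4, "m4": 5},
}

def pearson(x, y):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings[x]) & set(ratings[y])
    if len(common) < 2:
        return 0.0
    mx = sum(ratings[x][s] for s in common) / len(common)
    my = sum(ratings[y][s] for s in common) / len(common)
    num = sum((ratings[x][s] - mx) * (ratings[y][s] - my) for s in common)
    den = math.sqrt(sum((ratings[x][s] - mx) ** 2 for s in common)) * \
          math.sqrt(sum((ratings[y][s] - my) ** 2 for s in common))
    return num / den if den else 0.0

def predict(user, item):
    """Weighted sum of other users' ratings for the item, weighted by similarity to `user`."""
    sims = [(pearson(user, other), ratings[other][item])
            for other in ratings if other != user and item in ratings[other]]
    denom = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / denom if denom else 0.0

print(round(predict("alice", "m4"), 2))   # predicted rating of m4 for alice
```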

Concerns

  • No incentive for rating more items after X many because the recommendations won't change
  • Privacy
  • Spamming

Clustering

Separating documents into groups

Cluster hypothesis

Documents in the same cluster behave similarly with respect to relevance to information needs

Applications of Clustering

  • Searching
  • Browsing
  • Scatter-Gather

Algorithms

K-Means (Flat)

Input 'K' is the number of clusters that will be produced. Algorithm is:

  1. Select 'k'-many documents as seed centroids
  2. Partition the docs among the k clusters (assign each doc to its nearest centroid)
  3. Re-compute centroids
  4. Repeat until clustering converges
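
A minimal K-means sketch on 2-D points; document clustering would use tf-idf vectors and often cosine similarity, but the loop is the same:

```python
import math, random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                      # 1. pick k seed centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                      # 2. assign each point to its nearest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):                     # 3. recompute centroids
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters                                # 4. stop after a fixed number of rounds

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```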

Example K-Means Problem

Agglomerative (Hierarchical)

A 'bottom-up' clustering algorithm.

  • Cluster similarity
    • "Single Link"
      • Find the closest cluster by measuring from 'inner side' of clusters
      • Lots of small clusters
      • Not very balanced
      • Produces long thin chains - undesirable
    • "Complete Link"
      • Find the closest cluster by measuring from 'far side' of clusters

Divisive (Hierarchical)

Hard vs soft

  • Hard: a document belongs to only 1 cluster
  • Soft: a document may be a member of more than 1 cluster

Evaluating clusters

Internal criteria

  • Tighter is better

External criteria

  • Purity
    • Simple calc
  • Rand index
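
A sketch of purity and the Rand index for a clustering against gold-standard class labels (toy data); the Rand index is the fraction of document pairs on which the clustering and the gold standard agree:

```python
from collections import Counter
from itertools import combinations

clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6"]]        # system clustering
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B", "d6": "A"}  # gold classes

def purity(clusters, labels):
    """Sum over clusters of the majority-class count, divided by the total number of docs."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[d] for d in c).most_common(1)[0][1] for c in clusters) / n

def rand_index(clusters, labels):
    """Fraction of pairs where 'same cluster' and 'same class' agree (TP + TN over all pairs)."""
    cluster_of = {d: i for i, c in enumerate(clusters) for d in c}
    docs = list(cluster_of)
    agree, total = 0, 0
    for a, b in combinations(docs, 2):
        total += 1
        same_cluster = cluster_of[a] == cluster_of[b]
        same_class = labels[a] == labels[b]
        agree += same_cluster == same_class
    return agree / total

print(round(purity(clusters, labels), 3), round(rand_index(clusters, labels), 3))
```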

Sample RandIndex solving problem