Ryerson Computer Science CPS842 Info Retrieval



Overview

What is IR?

Deals with the representation, storage, organization of, and access to information items

Questions to answer

  • how does the IR system collect and store information?
  • how does it process a user's query?

Apply the same text processing and transformations to both the input documents and the user queries; the resulting document terms are indexed

IR system does matchmaking and ranks the relevant documents

	query      |                       |-> query terms |
	           | -> text processing -> |               |-> matchmaking -> return ranked docs
	documents  |                       |-> index terms |

Basic terminology

  • index term - word or group of words from the document
  • document - set of representative index terms
  • collection (or corpus) - group of documents
  • vocabulary - set of distinct index terms
  • query - set of index terms transformed from a user's information need

IR Model Representation

quadruple representation: [D, Q, F, R(qi, dj)], where D is the set of document representations, Q is the set of query representations, F is the framework for modeling documents, queries, and their relationships, and R(qi, dj) is the ranking function

Main Parts of the System / Workflow

  1. Index the corpus
    • build the inverted index
  2. Process user queries (apply the same text transformations as for documents)
  3. Do matchmaking & ranking against the inverted index, return ranked results

Models - Overview

Boolean

  • (+) Easy to use
  • (+) Easy to implement
  • (-) No ranking
  • (-) No partial matches

Vector Space

  • (+) Simple
  • (+) Ranked
  • (-) Assumes term independence

Probabilistic

  • (+) Strong theoretical basis
  • (+) Theoretically gives the best matches
  • (-) Relevance information is required
  • (-) Requires ongoing collection of relevance judgements

Okapi BM25 -> best

Boolean Model

Based on set theory, Boolean algebra

  • Boolean operators, proximity operators, simple regular expressions
  • problem: no ranking
  • Improvement area: implement "skip pointers"

Term-document Matrix

  • Not very efficient, not scalable

Inverted Index

  • dictionary of lists of document id's
  • query processing: match up documents by id
  • query processing optimization: start the boolean query by ordering the terms in order of increasing number of associated documents
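
A minimal sketch (toy corpus and names are illustrative, not from the course) of a dictionary-of-postings inverted index and AND-query processing that starts with the rarest term:

```python
# Minimal inverted-index sketch: dictionary of sorted doc-id lists (postings).
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(list)              # term -> sorted list of doc ids
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def boolean_and(terms):
    """Intersect postings, starting with the rarest term (smallest df)."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    return sorted(result)

print(boolean_and(["home", "sales", "july"]))   # -> [2, 3]
```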

Westlaw example queries

  • disab! /p access! /s work-site work-place (employment /3 place)

Vector Space Model

Use a vector to represent a document, and also to represent the query. Measure the similarity between the query vector and the document vector. Rank is based on similarity

Lucene uses the Vector Space Model

Bag of words model

Order of words is not important: "John runs faster than Mary" is treated the same as "Mary runs faster than John"

Zipf's Law

  • i-th most frequent term has frequency proportional to 1/i
  • see the distribution pattern: "heavy tailed" (a few very frequent terms, a long tail of rare terms)

TF-IDF Weighting

TF - term frequency

take log (base 10) of the raw frequency:

		$tf_{ij} = \begin{cases}1 + \log_{10} f_{ij} & \text{if } f_{ij} > 0\\ 0 & \text{otherwise}\end{cases}$

		alternative (max-normalized):
		$tf_{ij} = 0.5 + 0.5\,\cfrac{f_{ij}}{\max_i f_{ij}}$

		D1: I know you know
		D2: you know him
		V = { him, I, know, you} // complete vocab from all docs
		f11 = 0, f21 = 1, f31 = 2, f41 = 1
		f12 = 1, f22 = 0, f32 = 1, f42 = 1
		tf11 = 0, tf21 = 1, tf31 = 1.3, tf41 = 1
		etc..
		

IDF - Inverse document frequency

IDF is larger for terms that appear in fewer documents: $idf_i = \log_{10}\cfrac{N}{n_i}$, where N is the total number of documents and $n_i$ (the document frequency, df) is the number of documents containing term i

finally, $w_{ij} = tf_{ij} \times idf_i$
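
A small sketch that reproduces the tf and w = tf x idf calculations for the D1/D2 example above (log base 10, N = 2 documents; note that terms appearing in every document get idf = 0):

```python
import math

docs = {"D1": "I know you know", "D2": "you know him"}
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()}, key=str.lower)  # him, I, know, you

def tf(f):                                  # 1 + log10(f) if f > 0, else 0
    return 1 + math.log10(f) if f > 0 else 0.0

df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

weights = {}
for name, text in docs.items():
    counts = {t: text.split().count(t) for t in vocab}
    weights[name] = {t: tf(counts[t]) * idf[t] for t in vocab}

# tf for "know" in D1 is 1 + log10(2) ~= 1.30; its idf is 0 here because it appears in both docs
print(weights["D1"])
```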

Vector space

Where t is the number of terms, the vector space has t dimensions

can use the Euclidean distance formula to find the distance between the query and each document

d: ( 1 1 1 0 1 1 1 )
q: ( 1 1 1 0 0 0 0 )
		

Euclidean distance is not so good, so use a similarity score based on the angle between the vectors instead

smaller angle means more similar

$sim(d_j, q) = \vec{d_j} \cdot \vec{q} = \sum_i w_{ij}\, w_{iq}$

Cosine Similarity model

$sim(d,q) = \cfrac{\sum{w_{ij} w_{iq}}}{\sqrt{\sum{w_{ij}^2}\sum{w_{iq}^2}}}$

Longer documents tend to have higher term frequencies and would otherwise be favored, so we normalize by the vector lengths.

The result is normalized -> between 0 and 1

			d: (2 3 0)
			q: (1 0 1)

			sim = (2x1 + 3x0 + 0x1) / (sqrt(2^2 + 3^2 + 0^2) x sqrt(1^2 + 0^2 + 1^2)) = 2 / (sqrt(13) x sqrt(2)) ≈ 0.39

			Table Filling:
			f - given (raw term frequency)
			tf - calculated: 1 + log(f)
			df - given; the number of documents containing the term
			idf - calculated only once per term, shared by all docs and queries
			w = tf x idf
			length of a doc is the square root of the sum of the squares of its weights
			then use the length to normalize each weight:  nw = weight / length

			take the product of document nw and query nw for each term, then sum the per-term products ->> exactly the same as the long formula above


			can calculate from the inverted index data
			 t1(5000) ---[1][3]--> [2][4]
			 t2(1000) ---[1][4]--> [2][4]
			 t3(500)  ---[1][2]--> [2][4]
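
A sketch of the table-filling procedure above: compute tf-idf weights, length-normalize each vector, then sum the per-term products. The tiny vocabulary and idf values are assumed for illustration:

```python
import math

def tfidf_vector(term_freqs, idf):
    """tf-idf weights for one doc/query: w = (1 + log10(f)) * idf."""
    return {t: (1 + math.log10(f)) * idf[t] for t, f in term_freqs.items() if f > 0}

def normalize(vec):
    """Divide each weight by the vector length (sqrt of sum of squares)."""
    length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / length for t, w in vec.items()}

def cosine(doc_vec, query_vec):
    """Dot product of the two length-normalized vectors."""
    d, q = normalize(doc_vec), normalize(query_vec)
    return sum(d[t] * q.get(t, 0.0) for t in d)

# Toy idf values (assumed, not from the course slides)
idf = {"car": 1.65, "insurance": 2.08, "auto": 2.38, "best": 1.30}
doc   = tfidf_vector({"car": 1, "insurance": 2, "auto": 3}, idf)
query = tfidf_vector({"car": 1, "insurance": 1, "best": 1}, idf)
print(round(cosine(doc, query), 3))
```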
		

Probabilistic Model

	if P(R|D) > P(NR|D), D is returned
	Background
		Given two events A & B,
		P(A,B) - joint probability
		P(A|B) - conditional probability of A, given that B has occurred
	Bayes' Theorem
		$P(A|B) = \cfrac{P(B|A)\,P(A)}{P(B)} = \cfrac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|\bar{A})\,P(\bar{A})}$

Binary Independence Model

Ranking with Estimation

Ranking without Estimations

n_i is the same as df, so the Ranking-without-Estimation formula reduces to the idf formula

Okapi BM (best match) 25

Most popular probabilistic model

  • Considers term frequency
  • also does document length normalization
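
A rough sketch of BM25 scoring with commonly used defaults (k1 = 1.5, b = 0.75) and one common idf variant; the exact formula and parameters in the course slides may differ:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, N, avg_dl, k1=1.5, b=0.75):
    """Score one document for a query; doc_freq maps term -> number of docs containing it."""
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in doc_freq:
            continue
        f = doc_terms.count(t)                 # term frequency in this document
        idf = math.log((N - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1)  # one common idf variant
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avg_dl))
    return score

docs = [d.split() for d in ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]]
df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1
avg_dl = sum(len(d) for d in docs) / len(docs)
print([round(bm25_score(["cat", "dog"], d, df, len(docs), avg_dl), 3) for d in docs])
```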

Document Pre-Processing

The major steps in document pre-processing include:

  1. Parsing & tokenization
  2. Normalization
  3. Stop word removal - usually articles, prepositions, conjunctions (the most frequent terms, at the top of the Zipf curve)
  4. Stemming
  5. Selection of index terms
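
A minimal sketch of this pipeline; the stop-word list and the crude suffix-stripping "stemmer" are placeholders, not the Porter stemmer a real system would use:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are"}  # tiny illustrative list

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())              # 1-2. tokenize + case-fold (normalize)
    tokens = [t for t in tokens if t not in STOP_WORDS]           # 3. stop-word removal
    stems = [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]    # 4. very crude stemming (placeholder)
    return [s for s in stems if s]                                # 5. keep the remaining index terms

print(preprocess("The cats are sleeping on the mats."))           # -> ['cat', 'sleep', 'mat']
```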

Indexing & Searching

Inverted index

Bitmaps

Signature files

Techniques

Map-Reduce work distribution

  • Term Partition Index
  • Document Partition Index

Distributed Indexing: Map Reduce

  1. Map (parsing distributed across multiple machines)
  2. Reduce (indexing distributed across multiple machines)

Used to generate the Document-Partitioned Index

  • first a term-partitioned index is generated
  • then the document-partitioned index

Example of converting from Term Partitioned Index to Document Partitioned Index:

Suppose the goal is 3 servers, with DocIds 1-10,000, 10,001 - 20,000, 20,001 - 30,000

  1. Parse phase: each parser splits its terms into a 3-part segment file - one part per document server
  2. Reduce phase: each indexer reads its doc-id range from all of the parsers' segment files and builds its portion of the document-partitioned index

Generally better because the query-processing workload is distributed over multiple servers: each server can do the intersection over its own documents, and the central server only needs to union the partial results and sort them
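
A toy sketch of the parse/reduce idea above: each parser splits its (term, docId) pairs by doc-id range, one segment per index server, and each indexer merges the matching segments from every parser. All structures and names are simplified assumptions:

```python
from collections import defaultdict

RANGES = [(1, 10000), (10001, 20000), (20001, 30000)]   # one doc-id range per index server

def server_for(doc_id):
    return next(i for i, (lo, hi) in enumerate(RANGES) if lo <= doc_id <= hi)

def parse(term_doc_pairs):
    """Parse phase: write (term, doc_id) pairs into one segment per target server."""
    segments = [defaultdict(list) for _ in RANGES]
    for term, doc_id in term_doc_pairs:
        segments[server_for(doc_id)][term].append(doc_id)
    return segments

def reduce_phase(all_segments, i):
    """Reduce phase: indexer i merges segment i from every parser into its partial index."""
    index = defaultdict(list)
    for segments in all_segments:
        for term, postings in segments[i].items():
            index[term].extend(postings)
    return {t: sorted(p) for t, p in index.items()}

parser1 = parse([("apple", 5), ("apple", 15000), ("pear", 25000)])
parser2 = parse([("apple", 9000), ("pear", 12)])
print(reduce_phase([parser1, parser2], 0))   # server 0 holds docs 1-10,000
```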

Merging Indexes (dynamic indexing)

Z0 --> I0 --> I1 --> I2 --> I3
  1. Fill Z0 (an in-memory index)
  2. When Z0 is full, write it out as I0
  3. The next time Z0 fills and I0 already exists, merge them into I1 (Z0 + I0)
  4. The next time Z0 fills and I0, I1 exist, merge into I2 (Z0 + I0 + I1), and so on
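
A sketch of this logarithmic merging scheme with a tiny in-memory buffer Z0 and a list of on-disk generations I0, I1, ...; real implementations merge posting files, not Python sets:

```python
class LogMergeIndex:
    """Toy logarithmic merging: Z0 is an in-memory buffer; generation I_k holds ~LIMIT * 2**k items."""
    LIMIT = 4                         # capacity of the in-memory buffer Z0

    def __init__(self):
        self.z0 = set()               # in-memory index
        self.generations = []         # generations[k] is I_k (None if empty)

    def add(self, term):
        self.z0.add(term)
        if len(self.z0) >= self.LIMIT:
            self._flush()

    def _flush(self):
        block, self.z0 = self.z0, set()
        k = 0
        # While I_k is already occupied, merge it into the block and move up to I_{k+1}
        while k < len(self.generations) and self.generations[k] is not None:
            block |= self.generations[k]
            self.generations[k] = None
            k += 1
        if k == len(self.generations):
            self.generations.append(None)
        self.generations[k] = block

idx = LogMergeIndex()
for i in range(20):
    idx.add(f"term{i}")
print([len(g) if g else 0 for g in idx.generations])   # -> [4, 0, 16]
```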

Blocked Sort-Based Indexing (BSBI)

Single-Pass In-Memory Indexing (SPIMI)

Dynamic indexing

  • Index Merging

Retrieval

Top-K Retrieval

reduce computation cost by eliminating a large number of documents without computing their cosine scores

Techniques:

  • index elimination - only consider documents containing terms whose idf exceeds some threshold
  • champion list

Google Algorithms

  • Page Rank
  • Tiered Index
  • Anchor Text (anchor text from linking pages is given higher priority than the page's own headings and body)
  • Proximity Constraints

Midterm
- short answers (compare models)
- calculation questions
- other questions
- no MC
----
Oct 8
1. know the IR system architecture model
2. don't have to memorize all the probability fmlas
	- need to know the benefits/disadv of each model, comparisons
3. preprocessing: parsing -> normalization (equivalence classes) -> stopwords -> stemming -> index term selection
4. indexing: just inverted index
 	BSBI - basic ideas from these
	SPIMI - basic ideas from these
	MapReduce: why document-partitioning is faster (more processing on the data servers, less on the app server)
	dynamic indexing:...
5. evaluation:
	reference collection

6. pseudo rel. feedback - assume the top results are relevant
	global vs local
	

Retrieval Evaluation

Precision and Recall

  • Recall is the fraction of the relevant documents that have been retrieved; R = numRelevantRetrieved / totalRelevant
  • Precision is the fraction of the retrieved documents that are relevant; P = numRelevantRetrieved / totalRetrieved
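
A quick sketch of computing precision and recall for one query, given the list of retrieved doc ids and the set of judged-relevant doc ids (toy ids):

```python
def precision_recall(retrieved, relevant):
    relevant_retrieved = len(set(retrieved) & set(relevant))
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [3, 7, 12, 15, 20]        # docs returned by the system
relevant  = {3, 12, 40, 55}           # judged relevant (ground truth)
print(precision_recall(retrieved, relevant))   # (0.4, 0.5): 2/5 retrieved are relevant, 2/4 relevant found
```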

See the Precision-Recall curve calculation - precision usually starts high and drops (sometimes to zero) as recall increases, often showing plateaus

Precision-recall tradeoff

  • fewer docs returned tends to have higher precision, lower recall
  • to get the full recall (100%), have to return many documents (including non-relevant), which means lower precision
  • ideal case of the precision-recall curve is top-left corner of the typical curve (never achieved in practice)

Interpolated Precision @ some %

Maximum precision at any recall level greater than or equal to the given recall level

Precision @ K

Precision computed over the top K results. E.g., P @ 5 is the precision over the first five results

Mean Average Precision

AP = SUM(precision at each relevant retrieved document) / |Relevant|; MAP = mean of the AP values over all queries
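
A sketch of average precision for one ranked result list; MAP is then just the mean of AP over all queries (doc ids are made up):

```python
def average_precision(ranked, relevant):
    """Sum of precision@k at each relevant retrieved doc, divided by |Relevant|."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k            # precision at this cut-off
    return total / len(relevant) if relevant else 0.0

ranked = [1, 5, 3, 9, 7]
relevant = {1, 3, 7, 8}
print(round(average_precision(ranked, relevant), 3))
# MAP over several queries: sum(average_precision(r, rel) for each query) / number_of_queries
```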

R-Precision

Given the total number of relevant documents |R|, take the top |R| from the results and calculate the fraction of relevant documents among that set

Mean Reciprocal Rank

F-measure

Harmonic mean of precision and recall: F = 2PR / (P + R). Not so common; the most common measures are MAP and NDCG

Discounted Cumulated Gain (DCG)

discounts the gain of relevant documents that appear lower in the ranked list

  • Ideal CG and DCG: order results from most relevant to least (if relevance information is available)
  • Normalized: position-by-position, divide CG by ICG and DCG by IDCG
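
A sketch of DCG and NDCG using the common log2 discount; the course slides may define the discount slightly differently (e.g. no discount for the first few positions):

```python
import math

def dcg(gains):
    """Discounted cumulated gain: later positions contribute less (divide by log2 of the rank)."""
    return sum(g / (math.log2(i) if i > 1 else 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = sorted(gains, reverse=True)          # ideal ordering: most relevant first
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

gains = [3, 2, 3, 0, 1, 2]     # graded relevance of the results, in ranked order
print(round(dcg(gains), 3), round(ndcg(gains), 3))
```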

Verification

  • A/B Testing
    • two versions of the system - the new version will be released only to a small number of users at first
  • Crowdsourcing
  • Clickthrough data

Relevance Feedback & Query Expansion

Rocchio Algorithm

Thesaurus Expansion

User provides feedback on their results

Only useful for that user's session

Pseudo relevance feedback

Assume the top-ranked results are relevant and feed them back automatically, without asking the user.

Feedback is generally only used for query & result optimization within a single user session, since one person's query may match another person's query but the two users may find different results valuable

Rocchio Algorithm

Formula provides the ideal query vector based on the user's input of what is relevant and what is not (depends on the user!)

Standard Rocchio: relevant documents are usually given more emphasis, so Beta is usually larger than Gamma

  • best for a lot of judgement data
  • last fmla (ide_dec_hi) okay for few judgement data
  • note: if the resultant vector has a term that is less than 0, just use 0
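
A sketch of the standard Rocchio update; the alpha/beta/gamma values below are typical defaults, not necessarily the ones from the slides, and negative weights are clipped to 0 as noted above:

```python
def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """New query = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)."""
    terms = set(query_vec) | {t for d in relevant_docs + nonrelevant_docs for t in d}
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs) if nonrelevant_docs else 0.0
        w = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
        new_q[t] = max(w, 0.0)            # clip negative weights to 0
    return new_q

q = {"cheap": 1.0, "flights": 1.0}
rel = [{"cheap": 0.8, "flights": 0.9, "airfare": 0.7}]
nonrel = [{"cheap": 0.9, "hotels": 0.8}]
print(rocchio(q, rel, nonrel))
```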

Probabilistic Method

Does re-weighting, but does not introduce new terms; does not consider the original query

Query expansion using Thesaurus

  • can increase recall
  • but can decrease precision

WordNet - Princeton University

Building our own correlation matrix (of related terms)

Fill out a matrix of term-term relationships calculated with the formulas given in the notes:
     t1  t2  t3  t4  ... tn
t1   1   0.1 0.8 0.2
t2
t3
...
				

Nearest neighbours in this matrix are not necessarily terms that appear close to each other within a document; they merely have a high correlation (across docs)
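
One way to sketch such a matrix from a term-document frequency matrix: an association value c_uv = sum over documents of f_u,d * f_v,d, with a Jaccard-style normalization; this is a common formula but may not be the exact one from the notes:

```python
# Toy term-document frequency matrix: rows are terms, columns are documents
freqs = {
    "t1": [2, 0, 1, 3],
    "t2": [1, 1, 0, 2],
    "t3": [0, 4, 1, 0],
}

def correlation(u, v):
    """Unnormalized association: sum over documents of f_u,d * f_v,d."""
    return sum(a * b for a, b in zip(freqs[u], freqs[v]))

def normalized_correlation(u, v):
    cuv = correlation(u, v)
    return cuv / (correlation(u, u) + correlation(v, v) - cuv)   # Jaccard-style normalization

terms = sorted(freqs)
for u in terms:
    print(u, [round(normalized_correlation(u, v), 2) for v in terms])   # diagonal is 1
```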

Similarity Thesaurus

Each term is represented as a vector of documents; same idea as the Vector Space model but indexed by term instead of by document. See slides for the formula

From a sample term-doc matrix:

  1. Calculate itf (inverse term frequency) for each document
  2. t1 = (w11, w12, w13, ... w1,10)

Global Analysis vs Local Analysis

Local is only applied to the top-ranked documents; should help prevent the precision from dropping too much

Local has to be done in real-time based on the results, so it can be slower

GUI & Usage

Info seeking

There are several processes that people use IR systems to find things...

Seeking Specific facts

Like looking for a formula or a phone number; something concrete, so the search can be considered finished once it is found.

Exploratory Process

  • Learning
  • Investigating

Classic Process

  • Problem identification
  • Articulate needs
  • Query formulation
  • Results evaluation

Dynamic Process

  • Berry picking
  • Orienteering

Results display

Query reformulation

  • Spelling correction
  • Term expansion
  • Relevance feedback from the user

Keywords in Context

Show the keywords entered by the user, highlighted in a snippet from the document

Grouping

  • Cluster by similarity
  • Categories

Snippets

Design evaluation

  • Longitudinal testing: deploy to everyone for a period of time, get feedback through the logs and surveys
  • A/B testing: deploy to a random subset of users, compare their actions with users on the original system

Advertising

Can charge per view or per click

Sponsored results sometimes appear along with actual results (along the top or side)

Ad ranking

  • First cut
  • Bid price

Documents and Queries

Documents

Meta data

Doc formats

  • PDF, doc, etc

Information theory

  • Entropy (proportional to unpredictability)
    • Compression easier with predictable content

Compression

  • Should be able to search & randomly access compressed text without decoding everything
  • Decompression speed is usually more important than compression speed

Query

Intent

Session boundaries

Web Search

Web graph

Not strongly connected

Bow-tie structure

  • IN
  • SCC (strongly connected core)
  • OUT
  • Tube
  • Tendrils
  • A visualization of the graph of a large set of web pages forms the shape of a bow-tie

Index size

Large index size is a good sign of search engine quality

Issues in determining size

  • Dynamic pages
  • Duplicate pages
  • Server connections

Simple size estimation technique for getting the Lower Bound

  • Use OR between several frequently used words. Eg: a OR the OR some

Relative size estimation

  • E1/E2 ~ y/x, where x is the fraction of a sample of E1's pages found in E2, and y is the fraction of a sample of E2's pages found in E1 (both estimate the overlap)
  • Random search
    • Won't work unless you have access to the logs
    • Not really statistically sound
  • Random IP

Spam

Cloaking

Doorway

Link spam

SEO

Duplication checking with shingles

Search engine architecture

Caching is extensively used for the most common queries

Attributes

  • Cluster based architecture
  • Usually use multiple index tiers; but the problem with this is that less common queries will have slower processing time

Ranking signals

  • Content
  • Layout
  • Format indicators

Structural signals

Web Crawler

Must-have Features

  • Robustness
    • handles bot traps
    • deals with duplicate data
  • Politeness
    • respects Robots.txt
    • does not bombard a server with requests

Most Important Characteristics

  • Quality
  • Freshness
  • Volume

Should-have Features

  • Distributed
  • Scalable
  • Performance & Efficiency

Crawling process

Choose seeds

Fetch

Parse

Etc

Architecture

Scheduler

Downloader

Storage

Policies

Revisit

Politeness

  • Rules
    • Identify as a crawler
    • Obey robots.txt
    • Keep bandwidth usage low on any given site

Practical issues

  • Server performance
  • Poorly written HTML
  • Duplicate content

Link Analysis

Algorithms

PageRank

  • Recursive algorithm ... Repeat until convergence, then done
    • See PDF notes for example of rank propagation
    • Steps
      1. Adjacency matrix from web graph
      2. Create transition probability matrix for every node
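
A small sketch of PageRank by power iteration with a damping factor of 0.85; the course example may use a different damping value or a plain transition matrix without teleportation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict node -> list of nodes it links to. Returns node -> rank."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):                      # repeat until (approximate) convergence
        new_rank = {u: (1 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u] or nodes                  # dangling node: spread its rank evenly
            share = damping * rank[u] / len(out)
            for v in out:
                new_rank[v] += share
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print({k: round(v, 3) for k, v in pagerank(web).items()})
```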

Sample PageRank solving problem

Kleinberg HITS

  • Research model only
  • Query biased
  • Hubs and authorities
  • Steps
    1. initialization
    2. update authority scores
    3. updating the hub scores
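
A sketch of the HITS loop on a small query-biased subgraph: initialize to 1, alternately update authority and hub scores, and normalize each round (graph and iteration count are illustrative):

```python
import math

def hits(links, iterations=20):
    """links: dict page -> list of pages it links to. Returns (authority, hub) scores."""
    nodes = list(links)
    auth = {u: 1.0 for u in nodes}                      # 1. initialization
    hub = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # 2. authority score = sum of hub scores of pages linking to the page
        auth = {u: sum(hub[v] for v in nodes if u in links[v]) for u in nodes}
        # 3. hub score = sum of authority scores of the pages it links to
        hub = {u: sum(auth[v] for v in links[u]) for u in nodes}
        # normalize so the scores don't blow up
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {u: x / a_norm for u, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}
    return auth, hub

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
auth, hub = hits(graph)
print({u: round(auth[u], 3) for u in graph}, {u: round(hub[u], 3) for u in graph})
```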

Sample HITS solving problem

Recommender Systems

Recommend items based on algorithmically estimated preferences and interests

Approaches

1 Content-Based

  • Estimates a user's rating for an item based on the ratings they gave to similar items
  • Drawbacks
    • Over specialization

2 User-Based Collaborative Filtering

Find the similarity of users based on their ratings on a set of items

  1. Find the similarity of the target user to each of the other users
    1. For each other user, calculate similarity using Cosine Similarity or the Pearson Correlation Coefficient over the items both have rated
  2. Calculate the predicted score of an item for the target user by feeding the "most similar" users' ratings into an aggregate function such as a Weighted sum

3 Item-Based Collaborative Filtering

Same approach as the User-Based CF, but users and items are inverted in the formulas

Example Item-Based Collaborative Filtering problem

Collaborative Filtering Drawbacks

  • New users and new items have no ratings
  • Scalability and computation time

Hybrid

A third option is to combine both the user-based and content-based approaches

Similarity Functions

Pearson Correlation Coefficient

$sim(x,y) = {{\sum{(r_{xs} - \bar{r_{x}})(r_{ys} - \bar{r_y})}}\over{\sqrt{\sum{(r_{xs} - \bar{r_x})^2}\sum{(r_{ys} - \bar{r_y})^2}}}}$

Cosine Similarity

See fmla earlier in this note

Aggregate Functions

Weighted Sum

$rating_{xy} = {1 \over{\sum{sim(x,other)}}} \times \ \sum{sim(x,other)(otherRating_{y})} $
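
A sketch of user-based CF that combines the Pearson correlation over co-rated items with the weighted-sum prediction above (toy ratings; a real system would typically also mean-center the ratings):

```python
import math

ratings = {                                   # user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 3, "m2": 1, "m3": 2, "m4": 3},
    "carol": {"m1": 4, "m2": 3, "m3": 4, "m4": 5},
}

def pearson(x, y):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings[x]) & set(ratings[y])
    if len(common) < 2:
        return 0.0
    mx = sum(ratings[x][s] for s in common) / len(common)
    my = sum(ratings[y][s] for s in common) / len(common)
    num = sum((ratings[x][s] - mx) * (ratings[y][s] - my) for s in common)
    den = math.sqrt(sum((ratings[x][s] - mx) ** 2 for s in common)) * \
          math.sqrt(sum((ratings[y][s] - my) ** 2 for s in common))
    return num / den if den else 0.0

def predict(user, item):
    """Weighted sum of other users' ratings for the item, weighted by similarity to `user`."""
    sims = [(pearson(user, other), ratings[other][item])
            for other in ratings if other != user and item in ratings[other]]
    denom = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / denom if denom else 0.0

print(round(predict("alice", "m4"), 2))   # predicted rating of m4 for alice
```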

Concerns

  • No incentive for rating more items after X many because the recommendations won't change
  • Privacy
  • Spamming

Clustering

Separating documents into groups

Cluster hypothesis

Documents in the same cluster behave similarly with respect to relevance to information needs

Applications of Clustering

  • Searching
  • Browsing
  • Scatter-Gather

Algorithms

K-Means (Flat)

Input 'K' is the number of clusters that will be produced. Algorithm is:

  1. Select 'k'-many documents as seed centroids
  2. Partition the docs among the k clusters (assign each doc to its nearest centroid)
  3. Re-compute centroids
  4. Repeat until clustering converges
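
A minimal K-means sketch on 2-D points; document clustering would use tf-idf vectors and often cosine similarity, but the loop is the same:

```python
import math, random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                      # 1. pick k seed centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                      # 2. assign each point to its nearest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):                     # 3. recompute centroids
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters                                # 4. stop after a fixed number of rounds

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```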

Example K-Means Problem

Agglomerative (Hierarchical)

A 'bottom-up' clustering algorithm.

  • Cluster similarity
    • "Single Link"
      • Find the closest cluster by measuring from 'inner side' of clusters
      • Lots of small clusters
      • Not very balanced
      • Produces long thin chains - undesirable
    • "Complete Link"
      • Find the closest cluster by measuring from 'far side' of clusters

Divisive (Hierarchical)

Hard vs soft

  • Hard: a document belongs to only 1 cluster
  • Soft: a document may be a member of more than 1 cluster

Evaluating clusters

Internal criteria

  • Tighter is better

External criteria

  • Purity
    • Simple calc
  • Rand index
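
A sketch of purity and the Rand index for a clustering against gold-standard class labels (toy data); the Rand index is the fraction of document pairs on which the clustering and the gold standard agree:

```python
from collections import Counter
from itertools import combinations

clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6"]]        # system clustering
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B", "d6": "A"}  # gold classes

def purity(clusters, labels):
    """Sum over clusters of the majority-class count, divided by the total number of docs."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[d] for d in c).most_common(1)[0][1] for c in clusters) / n

def rand_index(clusters, labels):
    """Fraction of pairs where 'same cluster' and 'same class' agree (TP + TN over all pairs)."""
    cluster_of = {d: i for i, c in enumerate(clusters) for d in c}
    docs = list(cluster_of)
    agree, total = 0, 0
    for a, b in combinations(docs, 2):
        total += 1
        same_cluster = cluster_of[a] == cluster_of[b]
        same_class = labels[a] == labels[b]
        agree += same_cluster == same_class
    return agree / total

print(round(purity(clusters, labels), 3), round(rand_index(clusters, labels), 3))
```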

Sample RandIndex solving problem