Information Retrieval
CSC 9010: Special Topics: Natural Language Processing
Paula Matuszek, Mary-Angela Papalaskari
Spring 2005

Finding Out About
• There are many large corpora of information that people use. The web is the obvious example. Others include:
– scientific journals
– patent databases
– Medline
– Usenet groups
• People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
• Information Retrieval (IR) is the process of answering that information need.
• Simplest approach:
– Knowledge is organized into chunks (pages or documents).
– The goal is to return appropriate chunks.

Information Retrieval Systems
• The goal of an information retrieval system is to return appropriate chunks.
• The steps involved include:
– asking a question
– finding answers
– evaluating answers
– presenting answers
• The value of an IR tool depends on how well it does on all of these.
• Web search engines are the IR tools most familiar to most people.

Asking a Question
• A query reflects some information need.
• The query syntax needs to allow that information need to be expressed:
– Keywords
– Combining terms:
» Simple: "required" and NOT (+ and -)
» Boolean expressions with and/or/not and nested parentheses
» Variations: strings, NEAR, capitalization
– The simplest syntax that works is best.
– Typically more acceptable if predictable.
• A further set of problems arises when the information isn't text: graphics, music.

Finding the Information
• The goal is to retrieve all relevant chunks. This is too time-consuming to do in real time, so IR systems index pages.
• Two basic approaches:
– Index and classify by hand.
– Automate.
• For BOTH approaches, deciding what to index on (e.g., what counts as a keyword) is a significant issue.
• Many IR tools, such as search engines, provide both.

IR Basics
• A retriever collects a page or chunk. This may involve spidering web pages, extracting documents from a DB, etc.
• A parser processes each chunk and extracts individual words.
• An indexer creates/updates a hash table which connects words with documents.
• A searcher uses the hash table to retrieve documents based on words.
• A ranking system decides the order in which to present the documents: their relevance.

How Good Is The IR?
• Information retrieval systems are evaluated with two basic metrics:
– Precision: what percent of the documents returned are actually relevant to the information need?
– Recall: what percent of the documents relevant to the information need are returned?
• These typically can't be measured exactly; they are usually estimated on test sets.
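A minimal sketch of how the two metrics are computed over a labeled test set; the document IDs and relevance judgments here are invented purely for illustration:

```python
# Precision and recall against a labeled test set.
# Document IDs are hypothetical, purely for illustration.

relevant = {"d1", "d2", "d3", "d4", "d5"}   # judged relevant to the information need
returned = {"d2", "d3", "d6", "d7"}         # what the IR system retrieved

hits = relevant & returned                  # relevant documents actually returned

precision = len(hits) / len(returned)       # 2/4 = 0.50
recall = len(hits) / len(relevant)          # 2/5 = 0.40

print(f"precision={precision:.2f} recall={recall:.2f}")
```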
Selecting Relevant Documents
• Assume:
– we already have a corpus of documents defined;
– the goal is to return a subset of those documents;
– individual documents have been separated into individual files.
• The remaining components must parse, index, find, and rank documents.
• The traditional approach is based on the words in the documents (it predates the web).

Extracting Lexical Features
• Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
• This is a standard lexical analysis problem.
• Can use a lexical analyser generator, such as lex.

Lexical Analyser
• The basic idea is a finite state machine.
• Triples of (input state, transition token, output state).
(Figure: a three-state machine; from state 0, a character A–Z moves to state 1, which loops on A–Z to accumulate a token; a blank returns to state 0, and blank/EOF leads to final state 2.)
• Must be very efficient; it gets used a LOT.

Design Issues for Lexical Analyser
• Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
• Case: fold?
• Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?

Lexical Analyser
• The output of the lexical analyser is a string of tokens.
• The remaining operations are all on these tokens.
• We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search.

Stemming
• Additional processing at the token level (covered earlier this semester).
• Turns words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
• Decreases the total number of different tokens to be processed.
• Decreases the precision of a search, but increases its recall.

Noise Words (Stop Words)
• Function words that contribute little or nothing to meaning.
• Very frequent words:
– If a word occurs in every document, it is not useful in choosing among documents.
– However, we need to be careful, because this is corpus-dependent.
• Often implemented as a discrete list.
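A minimal sketch of the tokenize → stem → stop-word pipeline described on the last few slides. It assumes a crude regular-expression tokenizer that folds case and treats punctuation and digits as whitespace; the suffix-stripping stemmer is a toy stand-in for a real algorithm such as Porter's (it handles "cars" and "walked" but not irregulars like "children"), and the stop list is deliberately tiny:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # toy list; real lists are corpus-tuned

def tokenize(text):
    """Crude lexical analysis: fold case, treat punctuation/digits as whitespace."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Toy suffix-stripping stemmer; a stand-in for e.g. Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokens that survive stop-word removal, in canonical (stemmed) form."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The cars walked... to the walking CARS!"))
# ['car', 'walk', 'walk', 'car']
```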
Example Corpora
• We are assuming a fixed corpus. Some sample corpora:
– Medline abstracts
– Email (anyone's email)
– the Reuters corpus
– the Brown corpus
• A corpus will contain textual fields and possibly structured attributes:
– Textual: free, unformatted, no meta-information. NLP is mostly needed here.
– Structured: additional information beyond the content.

Structured Attributes for Medline
• Pubmed ID
• Author
• Year
• Keywords
• Journal

Textual Fields for Medline
• Abstract:
– Reasonably complete standard academic English.
– Captures the basic meaning of the document.
• Title:
– Short, formalized.
– Captures the most critical part of the meaning.
– A proxy for the abstract.

Structured Fields for Email
• To, From, Cc, Bcc
• Dates
• Content type
• Status
• Content length
• Subject (partially)

Text Fields for Email
• Subject:
– Format is structured; content is arbitrary.
– Captures the most critical part of the content.
– A proxy for the content, but may be inaccurate.
• Body of email:
– Highly irregular, informal English.
– The entire document, not a summary.
– Spelling and grammar irregularities.
– Structure and length vary.

Indexing
• We have a tokenized, stemmed sequence of words.
• The next step is to parse each document, extracting index terms.
– Assume that each token is a word and that we don't want to recognize any structures more complex than single words.
• When all documents have been processed, create the index.

Basic Indexing Algorithm
• For each document in the corpus:
– Get the next token.
– Create or update an entry in a list of (doc ID, frequency) pairs.
• For each token found in the corpus:
– calculate the number of documents and the total frequency
– sort by frequency
• Often called a "reverse index" (or inverted index), because it reverses the "words in a document" index to become a "documents containing words" index.
• May be built on the fly or created after indexing.
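A minimal sketch of the indexing algorithm above. It assumes each document has already been reduced to a normalized token list (for example by the pipeline sketched earlier); the tiny corpus is invented for illustration:

```python
from collections import defaultdict

# Toy corpus: doc ID -> already-normalized token list (invented for illustration).
corpus = {
    "d1": ["car", "walk", "car"],
    "d2": ["walk", "child"],
    "d3": ["car", "child", "child"],
}

# Reverse (inverted) index: term -> {doc ID: frequency in that document}.
index = defaultdict(dict)
for doc_id, tokens in corpus.items():
    for token in tokens:
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

# Per-term statistics: number of documents and total frequency, sorted by frequency.
for term, postings in sorted(index.items(), key=lambda kv: -sum(kv[1].values())):
    print(term, "docs:", len(postings), "total:", sum(postings.values()))

# A searcher can now look up documents by word in one hash-table probe:
print(index["car"])   # {'d1': 2, 'd3': 1}
```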
Fine Points
• Dynamic corpora (e.g., the web) require incremental algorithms.
• Higher-resolution data (e.g., character position):
– Supports highlighting.
– Supports phrase searching.
– Useful in relevance ranking.
• Giving extra weight to proxy text (typically by doubling or tripling the frequency count).
• Document-type-specific processing:
– In HTML, we want to ignore tags.
– In email, we may want to ignore quoted material.

Choosing Keywords
• We don't necessarily want to index on every word:
– It takes more space for the index.
– It takes more processing time.
– It may not improve our resolving power.
• How do we choose keywords?
– Manually
– Statistically
• Exhaustivity vs. specificity.

Manually Choosing Keywords
• Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants:
– "best" match
– captures new terms easily
– easiest for the person choosing the keywords
• Constrained vocabulary: hand-crafted ontologies:
– can include hierarchical and other relations
– more consistent
– easier for searching; possible "magic bullet" search

Examples of Constrained Vocabularies
• ACM headings (www.acm.org/class/1998):
– H: Information Systems
– H.3: Information Storage and Retrieval
– H.3.3: Information Search and Retrieval
» Clustering
» Query formulation
» Relevance feedback
» Search process, etc.
• Medline headings (www.nlm.nih.gov/mesh/meshhome.html):
– L: Information Science
– L01: Information Science
– L01.700: Medical Informatics
– L01.700.508: Medical Informatics Applications
– L01.700.508.280: Information Storage and Retrieval
» Grateful Med [L01.700.508.280.400]

Automated Vocabulary Selection
• Frequency: Zipf's Law:
– Pn = 1/n^a, where Pn is the frequency of occurrence of the nth-ranked item and a is close to 1.
– Within one corpus, words with middle frequencies are typically "best".
• Document-oriented representation bias: lots of keywords per document.
• Query-oriented representation bias: only the "most typical" words. Assumes that we are comparing across documents.

Choosing Keywords
• "Best" depends on actual use; if a word occurs in only one document, it may be very good for retrieving that document, but it is not very effective overall.
• Words which have no resolving power within a corpus may be the best choices across corpora.
• Not very important for web searching; more relevant for some text mining.

Keyword Choice for WWW
• We don't have a fixed corpus of documents.
• New terms appear fairly regularly, and are likely to be common search terms.
• The queries people want to make are wide-ranging and unpredictable.
• Therefore: we can't limit keywords, except possibly to eliminate stop words.
• Even stop words are language-dependent, so determine the language first.

Comparing and Ranking Documents
• Once our IR system has retrieved a set of documents, we may want to:
• Rank them by relevance:
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it.
• Compare them:
– "Show me more like this."
– This involves determining what the document is about.

Determining Relevance by Keyword
• The typical document retrieval query consists entirely of keywords.
• Retrieval can be binary: present or absent.
• More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
• Simple strategies:
– How many times does the word occur in the document?
– How close to the head of the document is it?
– If there are multiple keywords, how close together are they?

Keywords for Relevance Ranking
• Count: repetition is an indication of emphasis.
– Very fast (usually in the index)
– A reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
• Position: lead paragraphs summarize content.
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length

Keywords for Relevance Ranking
• Proximity for multiple keywords:
– Requires even more computation
– Obviously relevant only if we have multiple keywords
– The effectiveness of this heuristic varies with the information need; it is typically either excellent or not very helpful at all
• All keyword methods:
– are computationally simple and adequately fast
– are effective heuristics
– typically perform as well as in-depth natural language methods for standard IR
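A minimal sketch of the three heuristics above (count, position, proximity) scored over a token list. The normalizations are invented for illustration; a production system would tune and combine such scores empirically:

```python
def count_score(tokens, keyword):
    """Repetition as emphasis; normalized by length to blunt the document-length bias."""
    return tokens.count(keyword) / len(tokens)

def position_score(tokens, keyword):
    """Earlier occurrence scores higher: lead text tends to summarize content."""
    if keyword not in tokens:
        return 0.0
    return 1.0 - tokens.index(keyword) / len(tokens)

def proximity_score(tokens, kw1, kw2):
    """Closeness of two keywords: 1.0 when adjacent, approaching 0 when far apart."""
    pos1 = [i for i, t in enumerate(tokens) if t == kw1]
    pos2 = [i for i, t in enumerate(tokens) if t == kw2]
    if not pos1 or not pos2:
        return 0.0
    gap = min(abs(i - j) for i in pos1 for j in pos2)
    return 1.0 if gap == 0 else 1.0 / gap

doc = "information retrieval systems rank documents by relevance to the query".split()
print(count_score(doc, "relevance"))                      # 0.1
print(position_score(doc, "information"))                 # 1.0 (first token)
print(proximity_score(doc, "information", "retrieval"))   # 1.0 (adjacent)
```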
Comparing Documents
• "Find me more like this one" really means that we are using the document as a query.
• This requires that we have some conception of what a document is about overall.
• This depends on the context of the query. We need to:
– characterize the entire content of this document;
– discriminate between this document and others in the corpus.

Characterizing a Document: Term Frequency
• A document can be treated as a sequence of words.
• Each word characterizes that document to some extent.
• When we have eliminated stop words, the most frequent words tend to be what the document is about.
• Therefore: f_kd (the number of occurrences of word k in document d) will be an important measure.
• This is also called the term frequency.

Characterizing a Document: Document Frequency
• What makes this document distinct from others in the corpus?
• The terms which discriminate best are not those which occur with high frequency!
• Therefore: D_k (the number of documents in which word k occurs) will also be an important measure.
• This is also called the document frequency.

TF*IDF
• This can all be summarized as follows: words are the best discriminators when they
– occur often in this document (term frequency)
– don't occur in a lot of documents (document frequency)
• One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
• There are multiple formulas for actually computing this; the underlying concept is the same in all of them.

Describing an Entire Document
• So what is a document about?
• TF*IDF: we can simply list keywords in order of their TF*IDF values.
• The document is about all of them to some degree: it lies at some point in some vector space of meaning.

Vector Space
• Any corpus has a defined set of terms (the index).
• These terms define a knowledge space.
• Every document is somewhere in that knowledge space: it is or is not about each of those terms.
• Consider each term as a vector. Then:
– we have an n-dimensional vector space,
– where n is the number of terms (very large!), and
– each document is a point in that vector space.
• The document's position in this vector space can be treated as what the document is about.

Similarity Between Documents
• How similar are two documents?
• Measures of association:
– How much do the feature sets overlap?
– Modified for length: the DICE coefficient compares the intersection to the total number of terms: DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) ).
– Simple matching coefficient: takes exclusions into account.
• Cosine similarity:
– similarity of the angle between the two document vectors
– not sensitive to vector length
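A minimal sketch tying the last few slides together: TF*IDF weights computed from raw counts, then cosine similarity between the resulting document vectors. The corpus is invented, and the formula used (tf × log(N/df)) is just one of the many TF*IDF variants mentioned above:

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (invented for illustration).
docs = {
    "d1": ["car", "engine", "car", "road"],
    "d2": ["car", "road", "travel"],
    "d3": ["cooking", "recipe", "food"],
}

N = len(docs)
# Document frequency D_k: the number of documents containing word k.
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf(tokens):
    """TF*IDF vector using one common variant: raw tf times log(N/df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; insensitive to length."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

v1, v2, v3 = (tfidf(docs[d]) for d in ("d1", "d2", "d3"))
print(cosine(v1, v2))  # high: shared "car"/"road" vocabulary
print(cosine(v1, v3))  # 0.0: no overlapping terms
```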
Bag of Words
• All of these techniques are what are known as bag-of-words approaches.
• Keywords are treated in isolation.
• The difference between "man bites dog" and "dog bites man" is non-existent.
• If better discrimination is needed, IR systems can add semantic tools:
– use POS tags;
– parse into basic NP-VP structure.
– This requires that the query be more complex.

Improvements
• The two big problems with short queries are:
– Synonymy: poor recall results from missing documents that contain synonyms of the search terms, but not the terms themselves.
– Polysemy/homonymy: poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents.
(Martin: www.cs.colorado.edu/~martin/csci5832.html)

Query Expansion
• Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall:
– Use a dictionary/thesaurus.
– Use relevance feedback.
(Martin: www.cs.colorado.edu/~martin/csci5832.html)

Dictionary/Thesaurus Example
(Figure: a thesaurus-based expansion example; the image was not preserved in this transcript.)

Relevance Feedback
• Ask the user to identify a few documents which appear to be related to their information need.
• Extract terms from those documents and add them to the original query.
• Run the new query and present those results to the user.
• Typically converges quickly. (A sketch of this loop appears at the end of these notes.)
(Based on Martin: www.cs.colorado.edu/~martin/csci5832.html)

Blind Feedback
• Assume that the first few documents returned are the most relevant, rather than having users identify them.
• Proceed as for relevance feedback.
• Tends to improve recall at the expense of precision.
(Based on Martin: www.cs.colorado.edu/~martin/csci5832.html)

Post-Hoc Analyses
• When a set of documents has been returned, it can be analyzed to improve its usefulness in addressing the information need:
– grouped by meaning for polysemic queries (using n-gram-type approaches);
– grouped by extracted information (named entities, for instance);
– grouped into an existing hierarchy, if structured fields are available;
– filtered (e.g., to eliminate spam).

Additional IR Issues
• In addition to improved relevance, overall information retrieval can be improved with some other factors:
– Eliminate duplicate documents.
– Provide good context.
– Use ontologies to provide synonym lists.
• For the web:
– Eliminate multiple documents from one site.
– Clearly identify paid links.

Summary
• Information retrieval is the process of returning documents to meet a user's information need, based on a query.
• Typical methods are BOW (bag-of-words) methods, which rely on keyword indexing with little semantic processing.
• NLP techniques used include tokenizing, stemming, and some parsing.
• Results can be improved by adding semantic information (such as thesauri) and by filtering and other post-hoc analyses.
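As promised under Relevance Feedback above, a minimal sketch of the expand-and-rerun loop. Everything here is hypothetical scaffolding: the user-marked documents are invented, and expansion terms are chosen by simple frequency; the caller is assumed to re-run its own search with the expanded query:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, n_new=3):
    """Relevance feedback: add the most frequent terms from documents the
    user marked as relevant; the caller then re-runs the search."""
    counts = Counter(t for doc in feedback_docs for t in doc
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_new)]

# Hypothetical user-marked documents, already tokenized and stop-word filtered.
marked = [
    ["jaguar", "cat", "habitat", "rainforest", "cat"],
    ["jaguar", "predator", "cat", "habitat"],
]
print(expand_query(["jaguar"], marked))
# ['jaguar', 'cat', 'habitat', 'rainforest']
```

Blind feedback would use the same function, but with `marked` replaced by the top few documents from the initial search rather than user-identified ones.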