Part of Speech (POS) Tagging
CSC 9010: Special Topics: Natural Language Processing
Paula Matuszek, Mary-Angela Papalaskari
Spring 2005

Sources (and Resources)
• Some slides adapted from:
– Dorr: www.umiacs.umd.edu/~christof/courses/cmsc723-fall04
– Jurafsky: www.stanford.edu/class/linguist238
– McCoy: www.cis.udel.edu/~mccoy/courses/cisc882.03f
• With some additional examples and ideas from:
– Martin: www.cs.colorado.edu/~martin/csci5832.html
– Hearst: www.sims.berkeley.edu/courses/is290-2/f04/resources.html
– Litman: www.cs.pitt.edu/~litman/courses/cs2731f03/cs2731.html
– Rich: www.cs.utexas.edu/users/ear/cs378NLP
• You may find some or all of these to be useful resources throughout the course.

Word Classes and Part-of-Speech Tagging
• What is POS tagging?
• Why do we need POS?
• Word classes
• Rule-based tagging
• Stochastic tagging
• Transformation-based tagging
• Tagging unknown words
• Evaluating POS taggers

Parts of Speech
• 8 traditional parts of speech (more or less):
– Noun, verb, adjective, preposition, adverb, article, pronoun, conjunction
– The idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
– Also called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
– The actual categories vary by language, by the reason for tagging, and by who you ask!

POS Examples
• N (noun): chair, bandwidth, pacing
• V (verb): study, debate, munch
• ADJ (adjective): purple, tall, ridiculous
• ADV (adverb): unfortunately, slowly
• P (preposition): of, by, to
• PRO (pronoun): I, me, mine
• DET (determiner): the, a, that, those

Definition of POS Tagging
• "The process of assigning a part-of-speech or other lexical class marker to each word in a corpus" (Jurafsky and Martin)
• The input is WORDS (the girl kissed the boy on the cheek); the output is TAGS drawn from a tagset (N, V, P, DET, …).

POS Tagging Example
• the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

What Does Tagging Do?
1. Collapses distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
• Ambiguities may be removed
• e.g., deal tagged with NN or VB
• e.g., deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction
(Modified from Diane Litman's version of Steve Bird's notes)

Significance of Parts of Speech
• A word's POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), of pronunciations (OBject the noun vs. obJECT the verb), or both (wind)
– Helps in stemming
– Limits the range of following words for speech recognition
– Can help select nouns from a document for IR
– Is the basis for partial (chunked) parsing
– Parsers can build trees directly on the POS tags instead of maintaining a lexicon
(Modified from Diane Litman's version of Steve Bird's notes)
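To see what a tagger's input and output look like in practice, here is a minimal sketch that runs an off-the-shelf tagger over the koala sentence above. It assumes NLTK is installed; the model resource names and the exact tags returned can vary by NLTK version, so treat the output as illustrative.

```python
# Minimal POS-tagging demo (a sketch, not the course's tooling;
# NLTK resource names may vary by version).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # default tagger model

tokens = nltk.word_tokenize("the koala put the keys on the table")
print(nltk.pos_tag(tokens))
# Roughly: [('the', 'DT'), ('koala', 'NN'), ('put', 'VBD'), ('the', 'DT'),
#           ('keys', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN')]
```

Note that NLTK's default tagger uses the Penn Treebank tagset (DT, NN, …) rather than the broad classes (DET, N, …) on the slide; choosing a tagset is discussed below.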
Word Classes
• What are we trying to classify words into?
• Classes are based on:
– Syntactic properties: what can precede and follow a word
– Morphological properties: what affixes it takes
– Not primarily on semantic coherence (Conjunction Junction notwithstanding!)
• The broad "grammar" categories are familiar
• NLP uses much richer "tagsets"

Open and Closed Class Words
• Two major categories of classes:
– Closed class: a relatively fixed membership
• Prepositions: of, in, by, …
• Auxiliaries: may, can, will, had, been, …
• Pronouns: I, you, she, mine, his, them, …
• Usually function words (short, common words which play a role in grammar)
– Open class: new ones can be created all the time
• English has 4: nouns, verbs, adjectives, adverbs
• Many languages have all 4, but not all!

Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
– Classes of nouns: proper vs. common, count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: a hodgepodge!
– Unfortunately, John walked home extremely slowly yesterday

Closed Class Words
• Idiosyncratic; they differ more from language to language
• Language strongly resists additions
• Examples:
– Prepositions: on, under, over, …
– Particles: up, down, on, off, …
– Determiners: a, an, the, …
– Pronouns: she, who, I, …
– Conjunctions: and, but, or, …
– Auxiliary verbs: can, may, should, …
– Numerals: one, two, three, third, …

[Five slides followed with reference tables not reproduced here: prepositions from CELEX, English single-word particles, pronouns in CELEX, conjunctions, and auxiliaries.]

POS Tagging: Choosing a Tagset
• There are many parts of speech and many potential distinctions.
• To do POS tagging, we need to choose a standard set of tags to work with.
• Tagsets vary in size, from about a dozen tags to over 200.
• The size of the tagset depends on the language, the objectives, and the purpose.
• We need to strike a balance between:
– Getting better information about context (argues for more distinctions)
– Making it possible for classifiers to do their job (argues for fewer distinctions)

Some of the Best-Known Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61 tags
• Lancaster C7: 145 tags
(Slide modified from Massimo Poesio's)
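The rich-versus-coarse tradeoff is easy to see by collapsing a fine-grained tagset into the traditional broad classes. The sketch below maps a handful of Penn Treebank tags onto coarse categories with a hand-written table; the mapping is illustrative only, not a standard correspondence.

```python
# Illustrative (not standard) mapping from a few Penn Treebank tags
# to coarse word classes; richer tagsets carry more context information,
# coarser ones are easier for classifiers to predict.
PENN_TO_COARSE = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",      # nouns
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V",
    "VBP": "V", "VBZ": "V",                              # verbs
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",             # adjectives
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",             # adverbs
    "DT": "DET", "IN": "P", "PRP": "PRO", "PRP$": "PRO",
}

tagged = [("the", "DT"), ("grand", "JJ"), ("jury", "NN"),
          ("commented", "VBD"), ("on", "IN"), ("a", "DT"),
          ("number", "NN"), ("of", "IN"), ("other", "JJ"),
          ("topics", "NNS")]
print([(word, PENN_TO_COARSE.get(tag, "OTHER")) for word, tag in tagged])
```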
The Brown Corpus
• The first digital corpus (1961)
– Francis and Kučera, Brown University
• Contents: 500 texts, each 2000 words long
– From American books, newspapers, and magazines
– Representing genres such as science fiction, romance fiction, press reportage, scientific writing, and popular lore
(Modified from Diane Litman's version of Steve Bird's notes)

Penn Treebank
• The first syntactically annotated corpus
• 1 million words from the Wall Street Journal
• Part-of-speech tags and syntax trees
(Modified from Diane Litman's version of Steve Bird's notes)

Tag Set Example: Penn Treebank
[The slide showed the Penn Treebank tag table, distinguishing, e.g., PRP (personal pronoun) from PRP$ (possessive pronoun); the full table is not reproduced here.]

Example of Penn Treebank Tagging of a Brown Corpus Sentence
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Book/VB that/DT flight/NN ./.
• Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

POS Tagging
• Words often have more than one POS; consider back:
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Word Class Ambiguity (in the Brown Corpus)
• Unambiguous (1 tag): 35,340 word types
• Ambiguous (2-7 tags): 4,100 word types
– 2 tags: 3,760
– 3 tags: 264
– 4 tags: 61
– 5 tags: 12
– 6 tags: 2
– 7 tags: 1
(DeRose, 1988)

Part-of-Speech Tagging Approaches
• Rule-based tagger: ENGTWOL
• Stochastic tagger: HMM-based
• Transformation-based tagger: Brill

Rule-Based Tagging
• Basic idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of this type: "if word+1 is an adjective, adverb, or quantifier, and the following item is a sentence boundary, and word-1 is not a verb like consider, then eliminate the non-adverb tags; else eliminate the adverb tag."
– Typically more than 1000 hand-written rules, though they may also be machine-learned.

Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• … and so on for the ~100,000 words of English.

Assign All Possible Tags
• She/PRP promised/{VBN, VBD} to/TO back/{VB, JJ, RB, NN} the/DT bill/{NN, VB}

Write Rules to Eliminate Tags
• Rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
• Result: She/PRP promised/VBD to/TO back/{VB, JJ, RB, NN} the/DT bill/{NN, VB}
(A code sketch of this assign-then-eliminate pipeline follows below.)

Sample ENGTWOL Lexicon
[The slide showed sample entries from the ENGTWOL lexicon; not reproduced here.]
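As promised above, a minimal sketch of the assign-then-eliminate pipeline: look up each word's possible tags in the small dictionary from the slides, then apply the single sample rule. A real rule-based tagger such as ENGTWOL uses a far richer lexicon and more than a thousand such constraints.

```python
# Rule-based tagging sketch: assign all dictionary tags, then eliminate.
# The dictionary fragment is from the slides; the one rule is the sample
# "eliminate VBN if VBD is an option when VBN|VBD follows <start> PRP".
DICTIONARY = {"she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
              "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"},
              "bill": {"NN", "VB"}}

def assign_all_tags(words):
    return [set(DICTIONARY[w.lower()]) for w in words]

def eliminate_vbn_after_initial_prp(candidates):
    # If word 0 is unambiguously PRP and word 1 allows both VBN and VBD,
    # drop the VBN reading of word 1.
    if len(candidates) > 1 and candidates[0] == {"PRP"} \
            and {"VBN", "VBD"} <= candidates[1]:
        candidates[1].discard("VBN")
    return candidates

sentence = ["She", "promised", "to", "back", "the", "bill"]
print(eliminate_vbn_after_initial_prp(assign_all_tags(sentence)))
# [{'PRP'}, {'VBD'}, {'TO'}, {'VB','JJ','RB','NN'}, {'DT'}, {'NN','VB'}]
```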
Stochastic Tagging
• Based on the probability of a certain tag occurring, given the various possibilities
• Necessitates a training corpus
• Problems:
– No probabilities for words not in the corpus
– The training corpus may be too different from the test corpus

Stochastic Tagging (cont.)
• Simple method: choose the most frequent tag in the training text for each word!
– Result: about 90% accuracy
– Why does it do so well?
– This is a baseline; other methods will do better
– The HMM tagger is one example

HMM Tagger
• Intuition: pick the most likely tag for this word.
• HMM taggers choose the tag sequence that maximizes
P(word | tag) × P(tag | previous n tags)
• Let T = t_1, t_2, …, t_n and W = w_1, w_2, …, w_n.
• Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.

Conditional Probability
• A brief digression…
• Conditional probability: how do we determine the likelihood of one event following another if they are not independent?
• Example:
– I am trying to diagnose a rash in a 6-year-old child.
– Is it measles?
– In other words: given that the child has a rash, what is the probability that it is measles?

Conditional Probabilities (cont.)
• What would affect your decision?
– The overall frequency of rashes in 6-year-olds
– The overall frequency of measles in 6-year-olds
– The frequency with which 6-year-olds with measles have rashes
• P(measles | rash) = P(rash | measles) P(measles) / P(rash)

Bayes' Theorem
• Bayes' theorem (Bayes' rule) formalizes this intuition:
P(X | Y) = P(Y | X) P(X) / P(Y)
• P(X) is known as the "prior probability" or "prior"; P(Y) normalizes by the overall probability of the evidence.

Probabilities
• We want the best set of tags T for a sequence of words W (a sentence):
argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W)
• Since P(W) is the same for every candidate tag sequence, we can drop it:
argmax_T P(T | W) = argmax_T P(W | T) P(T)

Tag Sequence: P(T)
• How do we get the probability of a specific tag sequence?
– Count the number of times the sequence occurs and divide by the number of sequences of that length? Not likely to work: almost every long sequence is unseen.
– Instead, make a Markov assumption and use N-grams over tags…
• P(T) is then a product of the probabilities of the N-grams that make it up.

N-Grams
• The N stands for how many terms are used:
– Unigram: 1 term; bigram: 2 terms; trigram: 3 terms
– We usually don't go beyond 3.
• You can use different kinds of terms, e.g.:
– Character-based n-grams
– Word-based n-grams
– POS-based n-grams
• Ordering: the terms are often adjacent, but they are not required to be.
• We use N-grams to help determine the context in which some linguistic phenomenon happens.
– E.g., look at the words before and after a period to decide whether it ends a sentence.
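Returning to Bayes' rule for a moment, here is a worked version of the measles example. The slides give no numbers, so the figures below are invented purely for illustration.

```python
# Worked Bayes'-rule example; all three numbers are made up for
# illustration (the slides give none).
p_rash_given_measles = 0.95   # most 6-year-olds with measles have a rash
p_measles = 0.001             # measles itself is rare
p_rash = 0.05                 # rashes in general are fairly common

p_measles_given_rash = p_rash_given_measles * p_measles / p_rash
print(round(p_measles_given_rash, 3))  # 0.019: a rash alone is weak evidence
```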
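And to connect the Markov assumption to data before the worked bigram example on the next slide, here is a sketch that estimates bigram tag probabilities from a toy tagged corpus and uses them to score a tag sequence. The toy sequences are invented.

```python
# Estimating P(t_i | t_{i-1}) from toy data, then scoring a tag sequence
# under the bigram (Markov) assumption. The corpus is invented.
from collections import Counter

tag_sequences = [
    ["<s>", "Det", "Adj", "Noun", "</s>"],
    ["<s>", "Det", "Adj", "Adj", "Noun", "</s>"],
    ["<s>", "Det", "Noun", "</s>"],
]

bigram_counts, context_counts = Counter(), Counter()
for tags in tag_sequences:
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p(cur, prev):
    return bigram_counts[(prev, cur)] / context_counts[prev]

# P(Det|<s>) * P(Adj|Det) * P(Adj|Adj) * P(Noun|Adj)
sequence = ["<s>", "Det", "Adj", "Adj", "Noun"]
prob = 1.0
for prev, cur in zip(sequence, sequence[1:]):
    prob *= p(cur, prev)
print(prob)   # 1.0 * 2/3 * 1/3 * 2/3 = ~0.148
```

Unsmoothed estimates like these assign zero to any unseen bigram; the "Counts" slide below notes that real taggers smooth the counts.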
P(T): Bigram Example
• Given a tag sequence: <s> Det Adj Adj Noun </s>
• Its probability is the product of four N-grams:
P(Det | <s>) × P(Adj | Det) × P(Adj | Adj) × P(Noun | Adj)

Counts
• Where do you get the N-gram counts?
• From a large hand-tagged corpus:
– For bigrams, count all the Tag_i Tag_{i+1} pairs
– Then smooth the counts to get rid of the zeroes
• Alternatively, you can learn them from an untagged corpus

What About P(W|T)?
• At first it seems odd: it asks for the probability of seeing "The big red dog" given "Det Adj Adj Noun".
– Collect all the times you see that tag sequence and count how often "The big red dog" shows up? Again, not likely to work.

P(W|T)
• We make the following assumption (because it's easy): each word in the sequence depends only on its corresponding tag. So:
P(W | T) = ∏_{i=1..n} P(w_i | t_i)
• How do you get the statistics for that? The same way: count word/tag pairs in a hand-tagged corpus.

So…
• We start with
argmax_T P(T | W) = argmax_T P(W | T) P(T)
• and, applying the two independence assumptions, get
argmax_T ∏_{i=1..n} P(w_i | t_i) × P(t_1) × ∏_{i=2..n} P(t_i | t_{i-1})

HMMs
• This is a Hidden Markov Model (HMM):
argmax_T ∏_{i=1..n} P(w_i | t_i) × P(t_1) × ∏_{i=2..n} P(t_i | t_{i-1})
• The states in the model are the tags, and the observations are the words.
– The state-to-state transitions are driven by the bigram statistics.
– The observed words depend solely on the state you are in.

An Example
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• to/TO race/??? vs. the/DT race/???
• t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
– max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
• From the Brown corpus:
– P(NN|TO) = .021, P(race|NN) = .00041, so the NN reading scores .021 × .00041 = .0000086
– P(VB|TO) = .34, P(race|VB) = .00003, so the VB reading scores .34 × .00003 = .00001
• The VB reading wins, as it should. (A code version of this calculation follows the next slide.)

Performance
• This method achieves 95-96% correct with reasonably complex English tagsets and reasonable amounts of hand-tagged training data.
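As flagged above, here is a minimal sketch reproducing the race disambiguation arithmetic, using the Brown-corpus estimates quoted on the example slide:

```python
# Scoring the two candidate tags for "race" after "to/TO" with the
# Brown-corpus estimates quoted on the slide above.
scores = {
    "NN": 0.021 * 0.00041,   # P(NN|TO) * P(race|NN)
    "VB": 0.34  * 0.00003,   # P(VB|TO) * P(race|VB)
}
print(scores)                       # {'NN': ~8.6e-06, 'VB': ~1.0e-05}
print(max(scores, key=scores.get))  # 'VB' -- the HMM picks the verb reading
```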
Transformation-Based Tagging (Brill Tagging)
• A combination of rule-based and stochastic tagging methodologies:
– Like the rule-based approach, because rules are used to specify tags in a certain environment
– Like the stochastic approach, because machine learning is used, with a tagged corpus as input
– Known as Transformation-Based Learning (TBL)
• Input:
– a tagged corpus
– a dictionary (with the most frequent tags)

Transformation-Based Tagging (cont.)
• Basic idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order
• Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the candidate rules, find the one with the highest score
3. Repeat step 2 until the score falls below a threshold
4. Keep the ordered set of rules
• Rules make errors that are corrected by later rules

TBL Rule Application
• The tagger labels every word with its most likely tag.
– For example, race has the following probabilities in the Brown corpus:
• P(NN | race) = .98
• P(VB | race) = .02
• Transformation rules then change tags, e.g., "change NN to VB when the previous tag is TO":
– … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
– … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
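A minimal sketch of applying that single transformation after most-frequent-tag initialization; real TBL applies a whole learned, ordered list of such rules.

```python
# Applying one Brill-style transformation -- "change NN to VB when the
# previous tag is TO" -- to a most-frequent-tag initialization.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("is", "VBZ"), ("expected", "VBN"),
           ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(initial, "NN", "VB", "TO"))
# [('is','VBZ'), ('expected','VBN'), ('to','TO'),
#  ('race','VB'), ('tomorrow','NN')]
```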
TBL: The Rule-Learning Algorithm
• Step 1: Label every word with its most likely tag (from the dictionary)
• Step 2: Check every possible transformation and select the one that most improves the tagging
• Step 3: Re-tag the corpus, applying the rules
• Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus
• RESULT: a sequence of transformation rules

TBL: Rule Learning (cont.)
• Problem: we could apply transformations ad infinitum!
• Constrain the set of transformations with "templates":
– Replace tag X with tag Y, provided tag Z or word Z' appears in some position
• Rules are learned in an ordered sequence
• Rules may interact
• Rules are compact and can be inspected by humans

Templates for TBL
[The slide showed Brill's transformation templates; not reproduced here.]

TBL: Problems
• The first 100 rules achieve 96.8% accuracy; the first 200 rules achieve 97.0%
• Execution speed: the TBL tagger is slower than the HMM approach
• Learning speed: Brill's implementation can take over a day (600k tokens)
• BUT:
1. It learns a small number of simple, non-stochastic rules
2. It can be made to run faster with finite-state transducers (FSTs)
3. It is the best-performing algorithm on unknown words

Tagging Unknown Words
• A major continuing issue for taggers:
– New words are added to the language at 20+ per month
– Plus many proper names…
– Unknown words increase error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume unknown words have a probability distribution similar to words occurring only once in the training set
• Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN

Evaluating Performance
• How do we know how well a tagger does?
• Say we have a test sentence, or a set of test sentences, already tagged by a human (a "Gold Standard").
• We run the tagger on this set of test sentences and see how many of the tags it got right.
• This is called "tag accuracy" or "tag percent correct".

Test Set
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a "Gold Standard" test set
• Who does this?
– The Brown corpus: done at U Penn
– By grad students in linguistics
• Don't they disagree?
– Yes! But on about 97% of tags there are no disagreements.
– And if you let the taggers discuss the remaining 3%, they often reach agreement.

So What's "Good"?
• If we tag every word with its most frequent POS, we get about 90% accuracy; that is the minimum (baseline) tagger.
• Human taggers (without discussion) agree about 97% of the time, so if we can get to 97% we have done about as well as we can.

Training and Test Sets
• But we can't train our frequencies on the test-set sentences. (Why?)
• So to test the Most-Frequent-Tag algorithm (or any other stochastic algorithm), we need two things:
– A hand-labeled training set: the data we compute frequencies (etc.) from
– A hand-labeled test set: the data we use to compute our percent correct

Computing % Correct
• Of all the words in the test set, for what percent did the tag chosen by the tagger equal the human-selected tag?
%correct = (# of words tagged correctly in test set) / (total # of words in test set)
• The human-tagged set is the "Gold Standard" set.

Training and Test Sets (cont.)
• Often they come from the same labeled corpus!
• We just use 90% of the corpus for training and save out the other 10% for testing.
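Finally, a minimal sketch tying the last few slides together: split a labeled corpus 90/10, train the most-frequent-tag baseline on the 90%, and compute tag accuracy on the held-out 10%. The five-sentence corpus is invented; real experiments use corpora of thousands of sentences.

```python
# 90/10 split + most-frequent-tag training + tag-accuracy evaluation.
# The toy corpus is invented for illustration.
from collections import Counter, defaultdict

corpus = [
    [("the", "DT"), ("dog", "NN"), ("ran", "VBD")],
    [("the", "DT"), ("cat", "NN"), ("ran", "VBD")],
    [("a", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "DT"), ("cat", "NN"), ("ran", "VBD")],
]

split = int(0.9 * len(corpus))            # 90% train, 10% test
train, test = corpus[:split], corpus[split:]

counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        counts[word][tag] += 1

def tag(word):
    # Most frequent training tag; unknown words default to NN
    # ("method 1" from the unknown-words slide).
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

correct = total = 0
for sentence in test:
    for word, gold in sentence:
        total += 1
        correct += (tag(word) == gold)

print("tag accuracy: %.0f%%" % (100.0 * correct / total))  # 100% on this toy
```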