Transcript (PPTX)

Carnegie Mellon
Diversifiable Bootstrapping
for Acquiring High-Coverage
Paraphrase Resource
Hideki Shima
Teruko Mitamura
LREC 2012, May 24th, 2012
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, USA
Can a machine recognize the meaning similarity?
 John killed Mary.
passivization
 Mary was killed by John.
nominalization
 John is the killer of Mary.
entailment
 John assassinated Mary.
slang
 John is the 187 suspect of Mary.
187 means: “California penal code for murder, made
popular in west coast gangsta rap”.
– From The Urban Dictionary dot com
Usage: “This is Gavilan. In pursuit of possible 187 suspects.”
–From the movie, Hollywood Homicide
euphemism
 John terminated Mary with extreme prejudice.
“In military and other covert operations, terminate with extreme
prejudice is a euphemism for execution” – Wikipedia
Humans use various expressions to convey the same or similar
meaning, which makes it difficult for machines to “read” text.
Can a machine recognize the meaning similarity?
 X killed Y.
passivization
 Y was killed by X.
nominalization
 X is the killer of Y.
entailment
 X assassinated Y.
slang
 X is the 187 suspect of Y.
euphemism
 X terminated Y with extreme prejudice.
Goal: automatically acquire paraphrase patterns
that are lexically diverse
Paraphrase Recognition / Generation
is a common need in various applications
 Automatic Evaluation
– In Machine Translation [Kauchak & Barzilay, 2006][Padó et al., 2009]
– In Text Summarization [Zhou et al., 2006]
– In Question Answering [Ibrahim et al., 2003] [Dalmas, 2007]
 Text Summarization [Lloret et al., 2008][Tatar et al., 2009]
 Information Retrieval [Parapar et al., 2005][Riezler et al., 2007]
 Information Extraction [Romano et al., 2006]
 Question Answering [Harabagiu & Hickl, 2006][Dogdan et al., 2008]
 Collocation Error Correction [Dahlmeier & Ng, 2011]
Outline
 Motivation
 Method: Diversifiable Bootstrapping
 Experiment
 Related Works
 Conclusion
Bootstrap Paraphrase Learning
INPUT: seed instances, monolingual plain corpus
BOOTSTRAP LEARNING ALGORITHM
OUTPUT: more instances, patterns
Bootstrap Paraphrase Learning
INPUT: seed instances
X (killer) | Y (victim)
John Wilkes Booth | Abraham Lincoln
Mark David Chapman | John Lennon
Nathuram Godse | Mahatma Gandhi
Yigal Amir | Yitzhak Rabin
John Bellingham | Spencer Perceval
Mohammed Bouyeri | Theo van Gogh
Dan White | Mayor George Moscone
Sirhan Sirhan | Robert F. Kennedy
El Sayyid Nosair | Meir Kahane
Mijailo Mijailovic | Anna Lindh
Bootstrap Paraphrase Learning
OUTPUT: patterns
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
:
Unlike many other bootstrapping works, the goal is to acquire patterns, not instances.
Bootstrap Learning Algorithm
1st iteration: Seed Instances -> Sentences -> Extracted Patterns -> Ranked Patterns -> Sentences -> Extracted Instances -> Ranked Instances
2nd iteration: ...
This framework is based on ESPRESSO [Pantel & Pennacchiotti, 2006]
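The loop in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the capitalized-name heuristic (NAME), and all helper names are hypothetical, raw frequency stands in for the reliability score r(p), and the instance-ranking step is omitted.

```python
import re
from collections import Counter

# Assumption: an instance term is a run of capitalized words or initials.
NAME = r"[A-Z](?:\w+|\.)(?: [A-Z](?:\w+|\.))*"

# Toy corpus: three sentences from the slides plus one made-up sentence (the last).
CORPUS = [
    "Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.",
    "John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.",
    "Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.",
    "John Bellingham assassinated Spencer Perceval in 1812.",
]
SEEDS = [("John Wilkes Booth", "Abraham Lincoln")]


def extract_pattern(sentence, x, y):
    """Return the span covering both terms, with the terms replaced by X and Y."""
    ix, iy = sentence.find(x), sentence.find(y)
    if ix < 0 or iy < 0:
        return None
    start, end = min(ix, iy), max(ix + len(x), iy + len(y))
    pattern = sentence[start:end].replace(x, "X").replace(y, "Y")
    # Keep only patterns with exactly one X slot and one Y slot.
    slots = re.findall(r"\b[XY]\b", pattern)
    return pattern if sorted(slots) == ["X", "Y"] else None


def pattern_to_regex(pattern):
    """Compile a slot pattern such as 'X, the assassin of Y' into a regex."""
    parts = re.split(r"\b([XY])\b", pattern)
    slot = {"X": f"(?P<x>{NAME})", "Y": f"(?P<y>{NAME})"}
    return re.compile("".join(slot.get(part, re.escape(part)) for part in parts))


def bootstrap(seeds, corpus, iterations=2, top_k=5):
    """Alternate between pattern extraction and instance extraction."""
    instances = set(seeds)
    counts = Counter()
    for _ in range(iterations):
        # Search sentences by instances and extract patterns.
        counts = Counter()
        for x, y in instances:
            for sentence in corpus:
                pattern = extract_pattern(sentence, x, y)
                if pattern:
                    counts[pattern] += 1
        # Rank patterns; frequency is only a stand-in for r(p).
        ranked = [p for p, _ in counts.most_common(top_k)]
        # Search sentences by patterns and extract new instances.
        for pattern in ranked:
            regex = pattern_to_regex(pattern)
            for sentence in corpus:
                for match in regex.finditer(sentence):
                    instances.add((match.group("x"), match.group("y")))
    return [p for p, _ in counts.most_common()], instances


patterns, instances = bootstrap(SEEDS, CORPUS)
print(patterns)
print(sorted(instances))
```

In this toy run the first iteration finds a new instance through the seed pattern, and the second iteration uses that instance to reach a pattern the seed alone never matched.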
Bootstrap Learning Algorithm
Search sentences by instances
 Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.
 John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.
 In 1969 Berman was part of the defense team of Sirhan Sirhan, the assassin of Robert F. Kennedy.
:
Bootstrap Learning Algorithm
Search sentences by instances
 Edwin Booth was brother of X, the assassin of Y.
 X, the assassin of Y, was inspired by Brutus.
 In 1969 Berman was part of the defense team of X, the assassin of Y.
:
Bootstrap Learning Algorithm
Extract patterns from sentences
 … brother of X, the assassin of Y.
 X, the assassin of Y, was …
 … team of X, the assassin of Y.
Extracted Pattern: Longest Common Substring among retrieved sentences
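The longest-common-substring step can be illustrated with a short sketch. It works at the character level and uses brute force over the shortest sentence, which is an assumption made for brevity; the slides do not say whether the original system compares characters or words, and the function name is hypothetical.

```python
def longest_common_substring(sentences):
    """Return the longest string that occurs contiguously in every sentence."""
    if not sentences:
        return ""
    shortest = min(sentences, key=len)
    # Try candidate substrings of the shortest sentence, longest first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in sentence for sentence in sentences):
                return candidate
    return ""


# Slotted sentences from the slides.
SLOTTED = [
    "Edwin Booth was brother of X, the assassin of Y.",
    "X, the assassin of Y, was inspired by Brutus.",
    "In 1969 Berman was part of the defense team of X, the assassin of Y.",
]

print(longest_common_substring(SLOTTED))
```

A word-level variant would avoid cutting a pattern in the middle of a word, and a practical system would also need to check that the result still contains both slots.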
Bootstrap Learning Algorithm
Score and rank patterns
Rank by reliability of pattern: r(p).
r(p) is based on an association measure with each instance in the corpus.
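One way to read "association measure" is the ESPRESSO-style score, which, as we understand it, averages the pointwise mutual information (PMI) between a pattern and each instance, normalized by the maximum PMI and weighted by instance reliability. The sketch below follows that reading. The co-occurrence counts are made up, the function names are hypothetical, and treating unseen pairs as PMI 0 is a simplifying assumption.

```python
import math

# Made-up co-occurrence counts |i, p| for illustration only.
COOC = {
    ("Booth/Lincoln", "X, the assassin of Y"): 3,
    ("Chapman/Lennon", "X, the assassin of Y"): 3,
    ("Chapman/Lennon", "X tells his version of Y"): 1,
    ("Booth/Lincoln", "X and Y"): 1,
    ("Chapman/Lennon", "X and Y"): 1,
    ("Paris/France", "X and Y"): 5,
}


def pmi(cooc, instance, pattern):
    """Pointwise mutual information between an instance and a pattern."""
    joint = cooc.get((instance, pattern), 0)
    if joint == 0:
        return 0.0  # assumption: an unseen pair carries no association
    total = sum(cooc.values())
    n_instance = sum(c for (i, _), c in cooc.items() if i == instance)
    n_pattern = sum(c for (_, p), c in cooc.items() if p == pattern)
    return math.log(joint * total / (n_instance * n_pattern))


def pattern_reliability(cooc, pattern, instance_reliability):
    """Average normalized PMI with each instance, weighted by r(i)."""
    max_pmi = max(pmi(cooc, i, p) for (i, p) in cooc)
    if max_pmi <= 0 or not instance_reliability:
        return 0.0
    total = sum(
        pmi(cooc, instance, pattern) / max_pmi * r_i
        for instance, r_i in instance_reliability.items()
    )
    return total / len(instance_reliability)


# Seed instances are given reliability 1.0.
seeds = {"Booth/Lincoln": 1.0, "Chapman/Lennon": 1.0}
for p in ["X, the assassin of Y", "X tells his version of Y", "X and Y"]:
    print(p, round(pattern_reliability(COOC, p, seeds), 3))
```

Plain PMI tends to favor rare events, so a pattern seen only once with a single seed can outrank a frequent one. ESPRESSO reportedly applies a discounting factor for this reason, which the sketch omits.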
Bootstrap Learning Algorithm
Score and rank patterns
1. 0.422 X, the assassin of Y
2. 0.324 assassination of Y by X
3. 0.312 X assassinated Y
4. 0.231 the assassination of Y by X
5. 0.208 of X, the assassin of Y
:
Bootstrap Learning Algorithm
Search sentences by pattern(s)
 Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
 Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.
Bootstrap Learning Algorithm
Extract instances from sentences
 Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
 Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.
Bootstrap Learning Algorithm
Score and rank instances
Rank instances by reliability: r(i) (similar to pattern reliability scoring)
Issue: Lack of Lexical Diversity
Words participating in patterns are skewed
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
As a solution, we propose
the Diversifiable Bootstrapping
Diversifiable Bootstrapping
$r'(p) = \lambda \cdot r(p) + (1 - \lambda) \cdot \mathrm{diversity}(p)$
$r(p)$: original reliability score of a pattern
$\mathrm{diversity}(p)$: how lexically different is this pattern from the other patterns originally ranked higher than it?
Interpolation parameter: $0 \le \lambda \le 1$
Key contribution: by tweaking the parameter $\lambda$, the patterns to acquire can be diversified to a specific degree that one can control.
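The interpolation can be sketched as a re-ranking step. The diversity function used in the experiments is based on Shima & Mitamura [2011] and is not reproduced here; the stand-in below (the share of a pattern's content-word stems not seen in any higher-ranked pattern, with a crude five-character prefix as the stem) is purely an assumption, as are the function names, the stopword list, and the score given to the last pattern.

```python
import re

# Assumption: slot variables and function words carry no lexical content.
STOPWORDS = {"x", "y", "s", "the", "of", "by", "in", "and", "to", "a", "who", "was"}


def content_stems(pattern):
    """Crude stems (first five letters) of the content words in a pattern."""
    words = re.findall(r"[a-z]+", pattern.lower())
    return {word[:5] for word in words if word not in STOPWORDS}


def diversity(pattern, higher_ranked):
    """Share of this pattern's stems not seen in any higher-ranked pattern."""
    stems = content_stems(pattern)
    if not stems:
        return 0.0
    seen = set()
    for other in higher_ranked:
        seen |= content_stems(other)
    return len(stems - seen) / len(stems)


def rerank(scored_patterns, lam):
    """Apply r'(p) = lam * r(p) + (1 - lam) * diversity(p) and sort by r'(p)."""
    original = sorted(scored_patterns, key=lambda item: item[1], reverse=True)
    rescored = []
    for rank, (pattern, r) in enumerate(original):
        higher = [p for p, _ in original[:rank]]
        rescored.append((pattern, lam * r + (1 - lam) * diversity(pattern, higher)))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# The first five scores come from the ranked-pattern slide; the last is hypothetical.
SCORED = [
    ("X, the assassin of Y", 0.422),
    ("assassination of Y by X", 0.324),
    ("X assassinated Y", 0.312),
    ("the assassination of Y by X", 0.231),
    ("of X, the assassin of Y", 0.208),
    ("X shot and killed Y", 0.150),
]

for lam in (1.0, 0.3):
    print(lam, [p for p, _ in rerank(SCORED, lam)])
```

With lam set to 1.0 the original order is kept, while with 0.3 the lexically new pattern moves up to second place, which mirrors the effect the slides describe.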
Experimental Settings
 Bootstrapping Algorithm
– Based on ESPRESSO framework [Pantel & Pennacchiotti, 2006]
– Unlike ESPRESSO, we aim to obtain patterns not instances
 Lexical diversity scoring function:
– Based on Shima & Mitamura [2011]
 Seed instances: [Schlaefer et al., 2006]
 Corpus: English Wikipedia
Acquired Paraphrases: killed
$\lambda = 1$ (no diversification)
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
X, the man who assassinated Y
Y's assassin, X
of Y's assassin X
of the assassination of Y by X
X shot and killed Y
Y was assassinated by X
named X assassinated Y
Y was shot by X
X to assassinate Y
Acquired Paraphrases: killed
$\lambda = 1$
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
X, the man who assassinated Y
Y's assassin, X
of Y's assassin X
of the assassination of Y by X
X shot and killed Y
Y was assassinated by X
named X assassinated Y
Y was shot by X
X to assassinate Y
$\lambda = 0.7$
X, the assassin of Y
X assassinated Y
assassination of Y by X
Y was shot by X
X, who killed Y
the assassination of Y by X
X assassinated Y in
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X to assassinate Y
of X, the assassin of Y
$\lambda = 0.3$
X, the assassin of Y
X, who killed Y
Y was shot by X
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X assassinated Y
assassination of Y by X
X to assassinate Y
X kills Y
of X shooting Y
X assassinated Y in
Acquired Paraphrases: died-of
$\lambda = 1$
X died of Y
X died of Y in
X died of Y on
X died of lung Y
X died of lung Y in
X died of lung Y on
X died of Y in the
X died of Y at
X died of stomach Y
X died of natural Y
X died of breast Y in
X died of a Y
X died of Y in his
X passed away from Y
X died of a Y in
$\lambda = 0.7$
X died of Y in
X died of Y
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of lung Y
X died of Y on
X died of lung Y in
$\lambda = 0.3$
X died of Y in
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X succumbed to lung Y
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of Y
X's death from Y in
X died of lung Y
Acquired Paraphrases: was-led-by
$\lambda = 1$
Y came to power in X in
Y came to power in X
Y to power in X
Y came to power in X in the
when Y came to power in X in
when Y came to power in X
Y took power in X
Y rose to power in X
after Y came to power in X
Y became chancellor of X
Y came to power in X and
Y seized power in X
Y gained power in X
to power of Y in X
Y's rise to power in X
$\lambda = 0.7$
Y came to power in X
Y to power in X
regime of Y in X
Y came to power in X in
Y to power in X in
Y became chancellor of X
the rise of Y in X
X's dictator Y
X's president Y
Y took control of X
Y, who ruled X
Y's success and X's saviour
Y declared that X had
X's leader Y
government of Y in X
$\lambda = 0.3$
Y came to power in X in
regime of Y in X
X's dictator Y
Y became chancellor of X
X's president Y
the rise of Y in X
X's leader Y
Y, who ruled X
Y took control of X
government of Y in X
X, led by Y
quisling had visited Y in X
to flee X after Y
Y in X the year before
X, under the leadership of Y
Related Works – Use of Thesaurus
 E.g., WordNet [Miller, 1995], FrameNet [Baker et al., 1998], Nomlex [Macleod et al., 1998], VerbNet [Kipper et al., 2006]
WEAKNESS: Need WSD or contexts to avoid false-positives.
Synonyms of “lead (v)” in WordNet
ID | Words | Definition
S1 | lead, take, direct, conduct, guide | take somebody somewhere
S2 | leave, result, lead | produce as a result or residue
:
S6 | run, go, pass, lead, extend | stretch out over a distance, space, time, or scope
:
S14 | moderate, chair, lead | preside over
Related Works – Paraphrase Acquisition
 Alignment Approach
– Monolingual Comparable Corpus [Shinyama et al, 2002]
– Bilingual Parallel Corpus [Barzilay & McKeown, 2001][Bannard
& Callison-Burch, 2005][Callison-Burch, 2008]
 Distributional Approach
– Context as Vector Space [Pasca & Dienes, 2005][Bhagat &
Ravichandran, 2008]
– Context as Surface Pattern [Lin & Pantel, 2001][Ravichandran
& Hovy, 2002]
Related Works – Paraphrase Acquisition
[Bannard & Callison-Burch, 2005]
[Callison-Burch, 2008]
[Bhagat & Ravichandran, 2008]
[Pasca & Dienes, 2005]
murdered
died
beaten
been killed
are
lost
were killed
kill
have died
murdered
dead
death
deaths
died
victims
killing
been killed
killed in
killed ,
that killed
killed NN people
killed NN
killed by
were wounded in
and wounding
dead , including
, hundreds
used
made
involved
found
born
done
injured
seen
taken
released
Paraphrases acquired by Metzler et al. [2011]
Differences from Related Works
 Our work requires just a plain non-parallel corpus
– Language portability:
• Good news for resource/tool-scarce languages
– There is potential to learn words used in a closed
community (slang, technical terms, etc.) by providing a
domain-specific corpus
 Bootstrapping works iteratively with minimal
supervision
– Less human effort is required than with heavily
supervised learning methods or with relying on domain
experts to hand-craft patterns.
Conclusion
We proposed Diversifiable Bootstrapping,
which can acquire lexically diverse paraphrase
patterns.
We gave initial experimental results on a few relations,
which look promising.
As future work, we hope to conduct formal evaluations
on a larger set of relations in different languages.
Acknowledgment
This publication was made possible in part by
an NPRP grant (No: 09-873-1-129) from the
Qatar National Research Fund (a member of
The Qatar Foundation). The statements made
herein are solely the responsibility of the
authors.
We also gratefully acknowledge the support of
Defense Advanced Research Projects Agency
(DARPA) Machine Reading Program under Air
Force Research Laboratory (AFRL) prime
contract no. FA8750-09-C-0172. Any opinions,
findings, and conclusion or recommendations
expressed in this material are those of the
authors and do not necessarily reflect the view
of the DARPA, AFRL, or the US government.
Questions?