MWE
Re-examination of Association Measures for identifying Verb Particle and Light Verb Constructions
Supervisors: Dr. Kan Min-Yen, Dr. Su Nam Kim, (Dr. Timothy Baldwin)
Advisor: Lin Ziheng
Student: Hoang Huu Hung
25-Apr-20
Outline
Verb Particle and Light Verb Constructions
Motivations
Association Measures
MI and PMI
The high-frequency constituents
Context measures
Conclusions & Future work
Multiword Expressions
Multiple simplex words
Idiosyncratic
Lexically: ad hoc (ad?, hoc?)
Syntactically: by and large (prep. + conj. + adj.?)
Semantically: spill the beans
Statistically: strong coffee (powerful coffee?)
Obstacles to language understanding, translation, generation, etc.
Verb Particle and Light Verb Constructions
Verb Particle Constructions (VPCs): Verb + Particle(s)
bolster up, put off, put up with, get on with, cut short, let go
Light Verb Constructions (LVCs): Light verb + Complement
Light verbs: do, get, give, have, make, put, take
make a speech, give a demo
Syntactically flexible: inflections, passive, internal modifications
Extensive research has been carried out on this topic.
He has given many excellent speeches in his career.
Semantically:
Non-compositional: carry out, give up
(Semi-?)compositional: walk off, finish up
Compositional (meaning from the de-verbal noun): give a demo
Subtle meaning deviations: have a read vs. read
Identification of VPCs and LVCs
≠ Free combinations
VPCs: free verb + preposition (Leave it to me.)
LVCs: free light verb + noun (make a decision ≠ make drugs)
VPCs ≠ Prepositional verbs: look after, search for
VPCs:
Joint & split configurations: Look up the word / Look the word up / Look it up
No intervening manner adverb: Look up the word carefully / *Look carefully up the word
Prepositional verbs:
Only joint configuration: Look after your mum / Look after her
Flexible adverb positions: Look after your mum carefully / Look carefully after your mum
Outline
Verb Particle and Light Verb Constructions
Motivations
Association Measures
MI and PMI
The high-frequency constituents
Context measures
Conclusion & Future work
Motivation
Pecina, P. and Schlesinger, P.: Combining Association Measures for Collocation Extraction (COLING/ACL 2006)
An exhaustive list of 82 lexical association measures for bigrams
Lexical Association Measures (AM)
Mathematical formulae devised to capture the degree of association between the words of a phrase
• Association ~ dependence
• Degree ~ score
Input: statistical information. Output: scores
A comparison
Pecina and Schlesinger vs. our project:
Data: Czech bigram "collocations" (mixed idiomatic exps, LVCs, terminologies, stock phrases, ...) vs. English bigram VPCs and LVCs (separately and mixed)
Task: ranking extracted bigrams vs. ranking VPC & LVC candidates
Corpus: Prague Dependency Treebank (1.5 million words) vs. Wall Street Journal corpus (1 million words)
Metric: Average Precision (AP) for both
Approach: machine-learning based combination vs. analysis of AMs (categorization, modifications)
Gold-standard Evaluation data
Bigrams with frequency ≥ 6
413 WSJ-attested VPC candidates: (verb + particle) pairs; annotations from Baldwin, T. (2005)
100 WSJ-attested LVC candidates: (light verb + noun) pairs; annotations from Tan, Y. F., Kan, M. Y. and Cui, H. (2006)

Evaluation set            VPC      LVC     Mixed
Size                      413      100     513
Negative instances        296      72      368
Positive instances        117      28      145
% of positive instances   28.33%   28%     28.26%

A random ranker has an AP ~ 0.28
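Average Precision, the metric used throughout, can be sketched as follows (the list-of-labels input format is an assumption of this illustration):

```python
def average_precision(ranked_labels):
    """AP of a ranked candidate list: the mean of precision@k taken
    at each rank k where the candidate is a true positive."""
    hits, precisions = 0, []
    for k, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A ranker that places all positives first scores 1.0:
print(average_precision([True, True, False, False]))  # 1.0
```

A ranker that shuffles candidates at random on a set with ~28% positives has an expected AP close to 0.28, which is the baseline quoted above.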
Rank-equivalence
Idea: refer to AMs that have the same AP
Uses: simplification, categorization
Rank-equivalence over a set C: "ranking all members of C in the same way"
Notation: M1 =r M2; M(c): score assigned by M to instance c
Property: a strictly increasing transform preserves ranking, e.g. ad/bc =r ad/(ad + bc)
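Rank-equivalence can be checked directly by comparing the orderings two measures induce. The odds-ratio pair below (with its strictly increasing log transform) is an illustration, not necessarily the slide's original example:

```python
import math

def rank_equivalent(m1, m2, candidates):
    """M1 =r M2 over a set C iff both measures order C identically."""
    return (sorted(candidates, key=m1, reverse=True)
            == sorted(candidates, key=m2, reverse=True))

def odds_ratio(t):
    a, b, c, d = t          # contingency-table counts of a bigram
    return (a * d) / (b * c)

def log_odds_ratio(t):
    return math.log(odds_ratio(t))  # log is strictly increasing

tables = [(10, 2, 3, 40), (5, 5, 5, 5), (8, 1, 2, 20)]
print(rank_equivalent(odds_ratio, log_odds_ratio, tables))  # True
```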
Example of AMs
Contingency table of bigram (x y):
a = f(xy), b = f(x ¬y), c = f(¬x y), d = f(¬x ¬y)
¬w: any word except w; *: any word; f(.): frequency of (.); N: total number of bigrams
Null hypothesis of independence: expected frequency f̂(xy) = f(x*) f(*y) / N
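A sketch of how the contingency counts and the expected frequency under independence are computed (the toy corpus is invented for illustration):

```python
from collections import Counter

def contingency(bigrams, x, y):
    """Counts a = f(xy), b = f(x ¬y), c = f(¬x y), d = f(¬x ¬y)
    from a list of (first word, second word) bigram tokens."""
    counts = Counter(bigrams)
    N = len(bigrams)
    a = counts[(x, y)]
    b = sum(n for (u, v), n in counts.items() if u == x and v != y)
    c = sum(n for (u, v), n in counts.items() if u != x and v == y)
    d = N - a - b - c
    # Expected frequency under the null hypothesis of independence:
    # f̂(xy) = f(x*) f(*y) / N
    expected = (a + b) * (a + c) / N
    return a, b, c, d, expected

corpus = [("look", "up"), ("look", "up"), ("look", "after"), ("give", "up")]
print(contingency(corpus, "look", "up"))  # (2, 1, 1, 0, 2.25)
```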
Outline
Verb Particle and Light Verb Constructions
Motivations
Association Measures
MI and PMI
The high-frequency constituents
Context measures
Conclusion & Future work
Categorization: 4 main groups
Group 1: Dependence ~ Reduction in uncertainty
MI and PMI, Salience, etc.
Group 2: Dependence ~ set similarity; ↑↓ marginal frequencies
Dice, Minimum Sensitivity, Laplace, Sokal-Michiner, Odds ratio, etc.
Group 3: Compare observed frequency with expected frequency
Null hypothesis of independence: t-test, z-test, Pearson's chi-squared, Fisher's exact test, etc.
Group 4: Dependence ~ Non-compositionality
Non-compositionality ~ context similarity
• Cosine similarity in tf.idf space, idf space, etc.
Non-compositionality ↑↓ context entropy
MI and PMI
Mutual Information (MI)
MI(U; V) = Σ_u Σ_v P(uv) log [ P(uv) / (P(u) P(v)) ]
Reduction in uncertainty of U given knowledge of V
Empirical estimate: MI = (1/N) Σ_ij f_ij log ( f_ij / f̂_ij )
Point-wise MI (PMI): "MI at a specific point"
PMI(x, y) = log [ P(xy) / (P(x) P(y)) ]
2 known drawbacks
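PMI from contingency counts, as a minimal sketch (base-2 logs assumed):

```python
import math

def pmi(a, b, c, d):
    """PMI(x, y) = log2 [ P(xy) / (P(x) P(y)) ] from contingency counts:
    P(xy) = a/N, P(x) = (a+b)/N, P(y) = (a+c)/N."""
    N = a + b + c + d
    return math.log2((a / N) / (((a + b) / N) * ((a + c) / N)))

# Perfect dependence (x and y always co-occur) gives a high score;
# independence gives 0:
print(pmi(10, 0, 0, 90))  # log2(10) ≈ 3.32
print(pmi(1, 9, 9, 81))   # 0.0
```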
The first drawback
PMI with the joint probability raised to a power k: log [ P(xy)^k / (P(x) P(y)) ]
Higher performance for k in [0, 1] than for k in [2, 100]
[Table: AP of this PMI^k family for several values of k on VPCs, LVCs and Mixed; the extracted numbers could not be realigned]
The second drawback
Besides the degree of dependence:
MI grows with entropy
PMI varies with frequency
Mathematically, under perfect dependence P(xy) = P(x) = P(y):
PMI(x, y) = log ( 1 / P(x) ) and MI = P(x) log ( 1 / P(x) )
Comparing such scores directly is therefore not appropriate!
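The frequency sensitivity of PMI can be seen numerically (a small illustration, assuming base-2 logs):

```python
import math

# Under perfect dependence, P(xy) = P(x) = P(y) = p, so
# PMI = log2(p / p^2) = log2(1 / p): the score is driven by frequency
# alone, even though the degree of dependence is identical.
def pmi_perfect(p):
    return math.log2(p / (p * p))

print(pmi_perfect(0.1))    # ~3.32: frequent pair
print(pmi_perfect(0.001))  # ~9.97: rare pair scores much higher
```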
Proposed Solution
Normalizing scores so that AMs share the same unit
Proposed normalization factor (NF): a function of P(x) and P(y) that maps scores into [0, 1]
Variant: NF-α, parameterized on P(x) and P(y)
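As a sketch of the normalization idea, one standard choice (an assumption for illustration, not necessarily the slide's NF) divides PMI by -log P(xy), which pins perfectly dependent pairs at 1 regardless of their frequency:

```python
import math

def normalized_pmi(p_xy, p_x, p_y):
    """A common normalization (illustrative; not necessarily the
    slide's NF): PMI / -log P(xy), bounded in [-1, 1]."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Perfect dependence now scores 1.0 at any frequency:
print(normalized_pmi(0.1, 0.1, 0.1))        # 1.0
print(normalized_pmi(0.001, 0.001, 0.001))  # 1.0
```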
Against high-frequency constituents
[M35] Simpson: a / (a + min(b, c))
Modified variant: a / (a + max(b, c))
Insight: penalizing the more productive constituent
Confirmed by [M49] Laplace and [M41] S cost
[Table: AP of these AMs on VPCs, LVCs and Mixed; extracted values 0.478, 0.249, 0.578, 0.382, 0.486, 0.260, 0.577, 0.493, 0.241, 0.388, 0.254; row/column alignment not recoverable]
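A sketch of Simpson-style scores with min vs. max in the denominator (the exact slide formulas are partly garbled in extraction, so treat these as assumptions consistent with the stated insight):

```python
def simpson_min(a, b, c):
    """[M35] Simpson: a / (a + min(b, c))."""
    return a / (a + min(b, c))

def simpson_max(a, b, c):
    """Modified variant: a / (a + max(b, c)), penalizing the more
    productive (higher marginal frequency) constituent."""
    return a / (a + max(b, c))

# A pair whose second word is very productive (large c) keeps a high
# min-based score but is penalized by the max-based variant:
print(simpson_min(10, 2, 50))  # ~0.833
print(simpson_max(10, 2, 50))  # ~0.167
```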
Against high-frequency constituents
[M18] Sokal-Michiner: (a + d) / (a + b + c + d)
Variant penalizing one constituent: a + d - max(b, c)
Better to penalize both constituents?
Proposed modification involving both b and c
[Table: AP on VPCs, LVCs and Mixed; extracted values 0.565, 0.540, 0.433, 0.546, 0.519; alignment not recoverable]
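A sketch of the "penalize both constituents" idea; subtracting both b and c is an assumed reading of the garbled formula, not a confirmed reconstruction:

```python
def sokal_michiner(a, b, c, d):
    """[M18] Sokal-Michiner: (a + d) / (a + b + c + d)."""
    return (a + d) / (a + b + c + d)

def penalize_both(a, b, c, d):
    """Assumed variant penalizing both marginal frequencies:
    (a + d - b - c) / (a + b + c + d)."""
    return (a + d - b - c) / (a + b + c + d)

print(sokal_michiner(5, 1, 1, 5))  # 10/12
print(penalize_both(5, 1, 1, 5))   # 8/12
```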
Context-based Measures
Non-compositionality of (x y):
Context of (x y) ≠ context of x and y
Context of x ≠ context of y
E.g.: Dutch courage, hot dog
Context as a distribution of words:
Relative entropy (KL divergence), Jensen-Shannon divergence, Dice similarity
Context as a point/vector in R^N:
Euclidean, Manhattan, etc. distances; angle distance (cosine similarity)
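Cosine similarity of two context vectors, as a minimal sketch:

```python
import math

def cosine(u, v):
    """Angle-based similarity between two context vectors in R^N."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

print(cosine([1, 2], [2, 4]))  # 1.0: same direction
print(cosine([1, 0], [0, 1]))  # 0.0: orthogonal contexts
```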
Representation of context
Context of z: C_z = (w_1, w_2, ..., w_n)
Common representation schemes:
Tf.idf: tf(w_i) . idf(w_i), with idf(w_i) = log [ N / df(w_i) ]
Scaled tf: tf rescaled into [0.5, 1] (Salton and Buckley, 1987)
Dice similarity: dice(c_x, c_y) = 2 Σ_i x_i y_i / ( Σ_i x_i² + Σ_i y_i² )
Dice in (scaled tf).idf space: AP 0.568 (VPCs), 0.488 (LVCs), 0.553 (Mixed)
[Two further extracted AP values, 0.367 and 0.374, could not be aligned to a row]
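A sketch of the (scaled tf).idf weighting and Dice similarity (function names are illustrative):

```python
import math

def scaled_tf_idf(tf, max_tf, N, df):
    """Weight in (scaled tf).idf space: tf is rescaled into [0.5, 1]
    (after Salton and Buckley, 1987), then multiplied by idf = log(N/df)."""
    return (0.5 + 0.5 * tf / max_tf) * math.log(N / df)

def dice(u, v):
    """Dice similarity: 2 * sum(u_i v_i) / (sum(u_i^2) + sum(v_i^2))."""
    num = 2 * sum(x * y for x, y in zip(u, v))
    den = sum(x * x for x in u) + sum(y * y for y in v)
    return num / den if den else 0.0

print(dice([1, 1], [1, 1]))  # 1.0: identical context vectors
```

The scaling keeps even a single occurrence of a context word at weight ≥ 0.5, which damps the dominance of very frequent context words.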
Outline
Verb Particle and Light Verb Constructions
Motivations
Association Measures
MI and PMI
The high-frequency constituents
Context measures
Conclusion & Future work
Conclusions
The 82 AMs: 4 main groups, by meaning and rank-equivalence
Group 1: Dependence ~ Reduction in uncertainty: effective
Group 2: Dependence ~ set similarity, marginal frequencies: simple but most effective
Group 3: Compare observed frequency with expected frequency: not effective
Group 4: Non-compositionality ~ context similarity, entropy: compromised by the ubiquity of particles and light verbs
Conclusions
Co-occurrence frequency f(xy):
Not useful for VPCs (AP 0.13); OK for LVCs (AP 0.85)
Marginal frequencies f(x*) and f(*y):
Effective to discriminate against high-frequency constituents
Useful discriminative units: VPCs: -b, -c; LVCs: 1/(bc)
MI and PMI:
An indicator of independence, not of dependence (Manning and Schutze, 1999, p. 67)
As an indicator of dependence: PMI normalized for VPCs; MI normalized for VPCs and LVCs
Tf.idf:
Effective to normalize tf to [0.5, 1] (Salton and Buckley, 1987)
Future work
More types of VPC particles:
Adjective: cut short, put straight
Verb: let go, make do
Trigram models:
Phrasal-prepositional verbs (verb + adverb + preposition): look forward to, get away with
Idioms: kick the bucket, spill the beans
Adaptation of bigram AMs
A larger corpus and evaluation data set
References
Baldwin, T. (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. dissertation, University of Stuttgart.
Kim, S. N. (2008). Statistical Modeling of Multiword Expressions. Ph.D. thesis, University of Melbourne, Australia.
Lin, D. (1999). Automatic identification of non-compositional phrases. In Proc. of the 37th Annual Meeting of the ACL.
Manning, C. D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Pecina, P. and Schlesinger, P. (2006). Combining association measures for collocation extraction. COLING/ACL 2006.
Tan, Y. F., Kan, M. Y. and Cui, H. (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. EACL 2006 Workshop on Multiword Expressions in a Multilingual Context.
Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97).
Q&A
Bigram idioms: in all, after all, later on, as such
Types of MWEs
MWEs: Lexicalized phrases (Fixed exps, Semi-fixed exps, Syntactically-flexible exps) and Institutionalized phrases
Subtypes include Idioms, NCs, PP-Ds, LVCs, VPCs
Pecina (2005, 2006, 2008)
Combination of 82 association measures
Statistical tests: mutual information, statistical independence, likelihood measures
Semantic tests:
Entropy of immediate context
• Immediate context: the immediately preceding/following words
Diversity of empirical context
• Empirical context: words within a certain specified window
Pecina (2005, 2006, 2008)
Combination of 82 association measures
Result:
MAP: 80.81%
"Equivalent" measures reduced to 17
Issues:
Possible conflicting predictions
• Is combining all measures best?
Unclear linguistic and/or statistical significance
Other linguistics-driven tests
Are MWEs lexically fixed?
Substitutability test (Lin, 1999)
Are MWEs order-specific?
Permutation entropy (PE) (Zhang et al., 2006)
Entropy of Permutation and Insertion (EPI) (Aline et al., 2008)
• Not all permutations are valid
• "Permutation" ~ "syntactic variants"
Others?