MaxEnt: Training, Smoothing, Tagging
Advanced Statistical Methods in NLP
Ling572
February 7, 2012
Roadmap
- MaxEnt:
  - Training
  - Smoothing
- Case study:
  - POS tagging (redux)
  - Beam search
Training
- Learn the λs from the training data
- Challenge: usually can't be solved analytically
  - Employ numerical methods
- Main techniques:
  - Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72)
  - Improved Iterative Scaling (IIS; Della Pietra et al., '95)
  - L-BFGS, ...
Generalized Iterative Scaling
- GIS setup:
  - GIS required constraint:
      \forall (x,y) \in (X,Y): \sum_{j=1}^{k} f_j(x,y) = C, where C is a constant
  - If not, then set
      C = \max_{(x_i,y_i) \in S} \sum_{j=1}^{k} f_j(x_i, y_i)
  - and add a correction feature function f_{k+1}:
      \forall (x,y) \in (X,Y): f_{k+1}(x,y) = C - \sum_{j=1}^{k} f_j(x,y)
- GIS also requires at least one active feature for any event
  - Default feature functions solve this problem
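As a concrete sketch of the setup above (function names are illustrative; features are assumed to be plain lists of numeric values), the constant C and the correction feature can be computed as:

```python
def gis_constant(feature_vectors):
    """C = max over training events of the total feature mass."""
    return max(sum(fv) for fv in feature_vectors)

def add_correction_feature(feature_vectors, C):
    """Append f_{k+1}(x, y) = C - sum_j f_j(x, y) to every event,
    so that every padded vector sums to exactly C."""
    return [fv + [C - sum(fv)] for fv in feature_vectors]

events = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
C = gis_constant(events)                      # C = 3
padded = add_correction_feature(events, C)    # each sums to 3
```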
GIS Iteration
- Compute the empirical expectation
- Initialization: λ_j^(0); set to 0 or some value
- Iterate until convergence, for each j:
  - Compute p(y|x) under the current model
  - Compute the model expectation under the current model
  - Update the model parameters by the weighted ratio of the empirical and model expectations
GIS Iteration
- Compute
    d_j = E_{\tilde p}(f_j) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i, y_i)
- Initialization: λ_j^(0); set to 0 or some value
- Iterate until convergence:
  - Compute
      p^{(n)}(y|x) = \frac{e^{\sum_j \lambda_j^{(n)} f_j(x,y)}}{Z}
  - Compute
      E_{p^{(n)}}(f_j) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p^{(n)}(y|x_i) f_j(x_i, y)
  - Update
      \lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}}(f_j)}
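The loop above can be sketched as a toy trainer (illustrative names, not a reference implementation); f(x, y) is assumed to return a length-k feature vector that already sums to C, i.e. the correction feature is included:

```python
import math

def gis_train(samples, labels, f, k, C, n_iter=10):
    """samples: list of (x, gold_y) pairs; labels: all classes;
    f(x, y): length-k feature vector summing to C for every event."""
    N = len(samples)
    # empirical expectations d_j = (1/N) sum_i f_j(x_i, y_i)
    d = [0.0] * k
    for x, y in samples:
        for j, v in enumerate(f(x, y)):
            d[j] += v / N
    lam = [0.0] * k  # lambda_j^(0) initialized to 0
    for _ in range(n_iter):
        # model expectations E_{p^(n)}(f_j) under the current model
        E = [0.0] * k
        for x, _ in samples:
            scores = [math.exp(sum(l * v for l, v in zip(lam, f(x, y))))
                      for y in labels]
            Z = sum(scores)
            for y, s in zip(labels, scores):
                for j, v in enumerate(f(x, y)):
                    E[j] += (s / Z) * v / N
        # multiplicative update: lambda_j += (1/C) log(d_j / E_j)
        lam = [l + math.log(d[j] / E[j]) / C if d[j] > 0 else l
               for j, l in enumerate(lam)]
    return lam

# tiny example: one feature per class, so C = 1; gold labels are 3 A's, 1 B
samples = [(0, 'A'), (0, 'A'), (0, 'A'), (0, 'B')]
f = lambda x, y: [1.0 if y == 'A' else 0.0, 1.0 if y == 'B' else 0.0]
lam = gis_train(samples, ['A', 'B'], f, k=2, C=1)
p_A = math.exp(lam[0]) / (math.exp(lam[0]) + math.exp(lam[1]))
# p_A converges to 0.75, the empirical distribution
```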
Convergence
- Methods have convergence guarantees
- However, full convergence may take a very long time
- Frequently a threshold on the log-likelihood is used instead:
    L(p) = \sum_{(x,y) \in S} \tilde p(x,y) \log p(y|x)
    L(p^{(n)}) = \sum_{(x,y) \in S} \tilde p(x,y) \log p^{(n)}(y|x)
  Stop when
    L(p^{(n+1)}) - L(p^{(n)}) < threshold, or
    \frac{L(p^{(n+1)}) - L(p^{(n)})}{L(p^{(n)})} < threshold
Calculating LL(p)
- LL = 0
- For each sample x in the training data:
  - Let y be the true label of x
  - prob = p(y|x)
  - LL += 1/N * log(prob)
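The pseudocode above, assuming a model that exposes p(y|x) as a callable (the names here are hypothetical):

```python
import math

def avg_log_likelihood(samples, p):
    """Average conditional log-likelihood over (x, gold_y) samples.
    `p(y, x)` returns the model probability p(y|x)."""
    return sum(math.log(p(y, x)) for x, y in samples) / len(samples)

# toy model that always puts probability 0.8 on the gold label
ll = avg_log_likelihood([('x1', 'A'), ('x2', 'B')], lambda y, x: 0.8)
# ll == log(0.8)
```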
Running Time
- Each iteration runs in O(NPA), where:
  - N: number of training instances
  - P: number of classes
  - A: average number of active features for an instance (x, y)
L-BFGS
- Limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
- A quasi-Newton method for unconstrained optimization
- Well suited to optimization problems with many variables
- The "algorithm of choice" for MaxEnt and related models
L-BFGS
- References:
  - Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
  - Liu, D. C. and Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.
- Implementations:
  - Java, Matlab, Python via SciPy, R, etc.
  - See the Wikipedia page
Smoothing
(Based on Klein & Manning, 2003; F. Xia)
Smoothing
- Problems of scale:
  - Large numbers of features
    - Some NLP problems in MaxEnt involve ~1M features
    - Storage can be a problem
  - Sparseness problems
    - Easy to overfit
  - Optimization problems
    - Feature weights can grow without bound; convergence can take a long time
Smoothing
- Consider the coin-flipping problem:
  - Three empirical distributions
  - Models
(From K&M '03)
Need for Smoothing
- Two problems:
  - Optimization:
    - Optimal value of λ? ∞
    - Slow to optimize
  - No smoothing:
    - Learned distribution just as spiky (K&M '03)
(From K&M '03)
Possible Solutions
- Early stopping
- Feature selection
- Regularization
Early Stopping
- Prior use of early stopping:
  - Decision tree heuristics
- Similarly here:
  - Stop training after a few iterations
  - The λs will have increased, but remain finite
  - Guarantees bounded, finite training time
Feature Selection
- Approaches:
  - Heuristic: drop features based on fixed thresholds
    - e.g., number of occurrences
  - Wrapper methods:
    - Add feature selection to the training loop
- Heuristic approaches:
  - Simple and reduce the feature count, but could harm performance
Regularization
- In statistics and machine learning, regularization is any method of preventing overfitting of data by a model.
- Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines.
- In this case, we change the objective function:
    log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)
(From K&M '03, F. Xia)
Prior
- Possible prior distributions: uniform, exponential
- Gaussian prior:
    P(\lambda_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)
- log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)
    = \sum_{i=1}^{k} \log P(\lambda_i) + \log P(Y|X, \lambda)
    = -k \log(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{k} \frac{(\lambda_i - \mu)^2}{2\sigma^2} + \log P(Y|X, \lambda)
- Maximize P(Y|X, λ):
    E_p(f_j) = E_{\tilde p}(f_j)
- Maximize P(Y, λ|X):
    E_p(f_j) = E_{\tilde p}(f_j) - \frac{\lambda_j - \mu}{\sigma^2}
- In practice: μ = 0 and 2σ² = 1
L1 and L2 Regularization
    L_1 = \sum_i \log P(y_i, \lambda | x_i) - \sum_i \frac{|\lambda_i|}{\sigma}
    L_2 = \sum_i \log P(y_i, \lambda | x_i) - \sum_i \frac{\lambda_i^2}{\sigma^2}
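A minimal sketch of the L2-penalized objective, under the slides' Gaussian-prior convention μ = 0 (the helper name is hypothetical, not a library API):

```python
def l2_objective(log_likelihood, lambdas, sigma2=0.5):
    """Penalized objective: LL minus sum_i lambda_i^2 / (2*sigma^2).
    With 2*sigma^2 = 1 (the slides' default, i.e. sigma2 = 0.5),
    the penalty is simply the sum of squared weights."""
    penalty = sum(l * l for l in lambdas) / (2 * sigma2)
    return log_likelihood - penalty

# penalty = 1^2 + (-2)^2 = 5, so the objective is -3 - 5 = -8
obj = l2_objective(-3.0, [1.0, -2.0], sigma2=0.5)
```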
Smoothing: POS Example
Advantages of Smoothing
- Smooths distributions
- Moves weight onto more informative features
- Enables effective use of larger numbers of features
- Can speed up convergence
Summary: Training
- Many training methods:
  - Generalized Iterative Scaling (GIS)
- Smoothing:
  - Early stopping, feature selection, regularization
- Regularization:
  - Change the objective function: add a prior
  - Common prior: Gaussian prior
  - Maximizing the posterior is not equivalent to maximum entropy
MaxEnt POS Tagging
Notation
- (Ratnaparkhi, 1996)
- h: history (plays the role of x)
  - Word and tag history
- t: tag (plays the role of y)
POS Tagging Model
    P(t_1, \ldots, t_n | w_1, \ldots, w_n) = \prod_{i=1}^{n} P(t_i | w_1^n, t_1^{i-1}) \approx \prod_{i=1}^{n} P(t_i | h_i)
    p(t|h) = \frac{p(t, h)}{\sum_{t' \in T} p(t', h)}
- where h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}
MaxEnt Feature Set

Example
- Feature for 'about'
- Exclude features seen < 10 times
Training
- GIS
- Training time: O(NTA)
  - N: training set size
  - T: number of tags
  - A: average number of features active for event (h, t)
- 24 hours on a '96 machine
Finding Features
- In training, where do features come from?
- Where do features come from in testing?
  - Tag features come from classification of the prior word

               w-1     w0      w-1w0        w+1     t-1    y
  x1 (Time)    <s>     Time    <s> Time     flies   BOS    N
  x2 (flies)   Time    flies   Time flies   like    N      N
  x3 (like)    flies   like    flies like   an      N      V
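A simplified sketch of history-feature extraction in this style (the string feature names are illustrative, and rare-word orthographic features are omitted):

```python
def history_features(words, tags, i):
    """Features for tagging words[i], given the tags predicted so far.
    Uses surrounding words plus the previous two tags, as in the
    history h_i above (a sketch, not Ratnaparkhi's exact template)."""
    def w(k):
        return words[k] if 0 <= k < len(words) else ('<s>' if k < 0 else '</s>')
    def t(k):
        return tags[k] if 0 <= k < len(tags) else 'BOS'
    return {
        'w0=' + w(i), 'w-1=' + w(i - 1), 'w+1=' + w(i + 1),
        'w-1w0=' + w(i - 1) + ' ' + w(i),
        't-1=' + t(i - 1), 't-2t-1=' + t(i - 2) + ',' + t(i - 1),
    }

feats = history_features(['Time', 'flies', 'like'], ['N'], 1)
# includes 'w-1w0=Time flies', 'w+1=like', 't-1=N'
```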
Decoding
- Goal: identify the highest-probability tag sequence
- Issues:
  - Features include tags from previous words
    - Not immediately available
  - Uses tag history
    - Just knowing the highest-probability preceding tag is insufficient
Beam Search
- Intuition:
  - Breadth-first search explores all paths
  - Lots of paths are (pretty obviously) bad
  - Why explore bad paths?
  - Restrict to the (apparently) best paths
- Approach:
  - Perform breadth-first search, but
  - Retain only the k 'best' paths thus far
  - k: beam width
Beam Search, k=3
[Trellis diagrams over "<s> time flies like an arrow", stepping through extension and pruning; figures not reproduced]
Beam Search
- W = {w1, w2, ..., wn}: test sentence
- s_ij: j-th highest-probability sequence up to and including word w_i
- Generate tags for w1, keep the top N, and set s_1j accordingly
- For i = 2 to n:
  - For each s_(i-1)j:
    - Form the feature vector for w_i and keep the top N tags for w_i
  - Beam selection:
    - Sort sequences by probability
    - Keep only the top sequences, using the pruning on the next slide
- Return the highest-probability sequence s_n1
Beam Search
- Pruning and storage:
  - W = beam width
  - For each node, store:
    - The tag for w_i
    - The probability of the sequence so far:
        prob_{i,j} = \prod_{j=1}^{i} p(t_j | h_j)
  - For each candidate s_{i,j}:
    - Keep the node if prob_{i,j} is in the top K, and
    - prob_{i,j} is sufficiently high
      - e.g., lg(prob_{i,j}) + W >= lg(max_prob)
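The extension-plus-pruning loop above can be sketched as follows (the scorer API and all names here are assumptions, not Ratnaparkhi's code; both top-k selection and the log-probability margin prune are applied):

```python
import math

def beam_search(words, tags, score, k=3, beam_width=5.0):
    """`score(words, i, prev_tags, t)` should return p(t | history) for
    tagging words[i] as t given the tags chosen so far.  Keeps the top-k
    sequences, and also drops any whose log-prob falls more than
    `beam_width` below the current best."""
    beam = [([], 0.0)]  # (tag sequence, log probability)
    for i in range(len(words)):
        # extension: add every candidate tag to every surviving sequence
        candidates = [(seq + [t], lp + math.log(score(words, i, seq, t)))
                      for seq, lp in beam for t in tags]
        # beam selection: sort by probability, keep top k within the margin
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        beam = [c for c in candidates[:k] if c[1] + beam_width >= best]
    return beam[0][0]

# toy scorer: 'like' is a verb after a noun, everything else prefers noun
def toy_score(words, i, prev, t):
    if words[i] == 'like':
        return 0.9 if t == 'V' and prev[-1:] == ['N'] else 0.1
    return 0.8 if t == 'N' else 0.2

best_tags = beam_search(['time', 'flies', 'like'], ['N', 'V'], toy_score)
# best_tags == ['N', 'N', 'V']
```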
Decoding
- Tag dictionary:
  - Known word: returns the tags seen with the word in training
  - Unknown word: returns all tags
- Beam width = 5
- Running time: O(NTAB)
  - N, T, A as before
  - B: beam width
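A minimal sketch of the tag dictionary described above (names illustrative):

```python
def build_tag_dictionary(tagged_corpus):
    """Map each training word to the set of tags it was seen with."""
    tag_dict = {}
    for word, tag in tagged_corpus:
        tag_dict.setdefault(word, set()).add(tag)
    return tag_dict

def candidate_tags(tag_dict, word, all_tags):
    """Known word: tags seen in training; unknown word: all tags."""
    return tag_dict.get(word, set(all_tags))

td = build_tag_dictionary([('time', 'N'), ('flies', 'N'), ('flies', 'V')])
# candidate_tags(td, 'flies', ['N', 'V', 'P']) -> {'N', 'V'}
# candidate_tags(td, 'arrow', ['N', 'V', 'P']) -> {'N', 'V', 'P'}
```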
POS Tagging
- Overall accuracy: 96.3+%
- Unseen word accuracy: 86.2%
- Comparable to HMM or TBL tagging accuracy
- Provides:
  - A probabilistic framework
  - Better able to model different information sources
- Topline accuracy: 96–97%
  - Consistency issues
Beam Search
- Beam search decoding:
  - Variant of breadth-first search
  - At each layer, keep only the top k sequences
- Advantages:
  - Efficient in practice: a beam of 3–5 is near optimal
  - Empirically, the beam covers 5–10% of the search space, i.e. prunes 90–95%
  - Simple to implement
    - Just extensions + sorting; no dynamic programming
- Disadvantage: not guaranteed optimal (or complete)
MaxEnt POS Tagging
- Part-of-speech tagging by classification:
  - Feature design
    - Word and tag context features
    - Orthographic features for rare words
- Sequence classification problems:
  - Tag features depend on prior classifications
- Beam search decoding:
  - Efficient, but inexact
  - Near optimal in practice