Using Multi-Modality to Guide Visual Tracking
Jaco Vermaak
Cambridge University Engineering Department
Patrick Pérez, Michel Gangnet, Andrew Blake
Microsoft Research Cambridge
Paris, December 2002
Introduction
Visual tracking is difficult: changes in pose and illumination,
occlusion, clutter, inaccurate models, high-dimensional state
spaces, etc.
Tracking can be aided by combining information in multiple
measurement modalities
Illustrated here on head tracking using:
Sound and contour measurements
Colour and motion measurements
General Tracking
Tracking Equations
Objective: recursive estimation of the filtering distribution:
$p(x_t \mid y_{1:t})$, with $y_{1:t} \triangleq (y_1, \ldots, y_t)$
General solution:
Prediction step:
$p(x_t \mid y_{1:t-1}) = \int \underbrace{p(x_t \mid x_{t-1})}_{\text{dynamical prior}} \, \underbrace{p(x_{t-1} \mid y_{1:t-1})}_{\text{previous filtering}} \, dx_{t-1}$
Filtering/update step:
$p(x_t \mid y_{1:t}) \propto \underbrace{L(y_t \mid x_t)}_{\text{likelihood}} \, \underbrace{p(x_t \mid y_{1:t-1})}_{\text{prediction}}$
Problem: generally no analytic solutions available
Particle Filter Tracking
Monte Carlo implementation of general recursions.
Filtering distribution represented by samples/particles with
associated importance weights:
$p_N(x_t \mid y_{1:t}) = \sum_{i=1}^{N} \omega_t^i \, \delta_{x_t^i}(dx_t)$
Proposal step: new particles proposed from a suitable proposal
distribution:
$x_t^i \sim q(x_t \mid x_{t-1}^i, y_t)$
Reweighting step: particles reweighted with importance weights:
$\omega_t^i \propto \omega_{t-1}^i \, L(y_t \mid x_t^i) \, p(x_t^i \mid x_{t-1}^i) \, / \, q(x_t^i \mid x_{t-1}^i, y_t)$
Resampling step: multiply particles with high importance weights
and eliminate those with low importance weights.
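As a concrete illustration, here is a minimal sketch of one time step of this scheme in Python. The `propose`, `likelihood`, `dynamics` and `proposal_pdf` callables are placeholders for the model-specific ingredients listed on the next slides; none of them are specified at this point in the talk.

```python
import numpy as np

def particle_filter_step(particles, weights, y, propose, likelihood,
                         dynamics, proposal_pdf):
    """One propose / reweight / resample cycle over N particles."""
    N = len(particles)
    # Proposal step: draw each new particle from q(x_t | x_{t-1}^i, y_t).
    new_particles = np.array([propose(x, y) for x in particles])
    # Reweighting step: w_t^i ~ w_{t-1}^i * L(y_t|x_t^i) * p(x_t^i|x_{t-1}^i) / q.
    w = weights * np.array([
        likelihood(y, xn) * dynamics(xn, xo) / proposal_pdf(xn, xo, y)
        for xn, xo in zip(new_particles, particles)
    ])
    w /= w.sum()
    # Resampling step: multiply high-weight particles, eliminate low-weight ones.
    idx = np.random.choice(N, size=N, p=w)
    return new_particles[idx], np.full(N, 1.0 / N)
```

Resampling at every step, as here, is the simplest choice; in practice it is often triggered only when the effective sample size becomes too small.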
Particle Filter Building Blocks
Sampling from a conditional density:
$\{x^i, \omega^i\} \sim p(x) \;\longrightarrow\; \{\tilde{x}^i, \omega^i\} \sim \int q(x' \mid x) \, p(dx)$, with $\tilde{x}^i \sim q(\cdot \mid x^i)$
Resampling:
$\{x^i, \omega^i\} \sim p(x) \;\longrightarrow\; \{x^{j(i)}, 1/N\} \sim p(x)$, with the indices $j(i)$ drawn multinomially according to the weights $(\omega^1, \ldots, \omega^N)$
Reweighting with a positive function $h(x) \geq 0$:
$\{x^i, \omega^i\} \sim p(x) \;\longrightarrow\; \{x^i, h(x^i)\,\omega^i\} \sim h(x)\,p(x) \, \big/ \int h(x)\,p(dx)$
Particle Filter Implementation
Requires specification of:
System configuration and state space
Likelihood model
Dynamical model for state evolution
State proposal distribution
Particle filter architecture
Head Tracking using Sound and Contour Measurements
Problem Formulation
Objective: track the head of a person in a video sequence using
audio and image cues
Audio: time delay of arrival (TDOA) measurements at microphone
pair orthogonal to optical axis of camera
Image: edge events along normal lines to a hypothesised contour
Complementary modalities: audio good for (re)initialisation; image
good for fine localisation
System Configuration
[Figure: camera with a microphone pair orthogonal to its optical axis, and the image plane]
Model Ingredients
Low-dimensional state space: similarity transform applied to a
reference template:
$x = (x, y, \theta, s)$
Dynamical prior: integrated Langevin equation, i.e. second-order
Markov kernel:
$p(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}, x_{t-2})$
Multi-modal data likelihoods:
$p(y \mid x) = L(x) \propto L^{\mathrm{TDOA}}(x) \, L^{\mathrm{EDGE}}(r_x)$
Sound-based likelihood: TDOA at the microphone pair
Contour-based likelihood: edge events along the hypothesised contour $r_x$
Contour Likelihood
Input: maxima of the projected luminance gradient along normals to the
hypothesised contour $r_x$ ($N_j$ such events on the $j$-th normal, at
distances $d_{i,j}$ from the contour)
$L^{\mathrm{EDGE}}(r_x) \propto \prod_j \left[ q_0 + \frac{1 - q_0}{N_j} \sum_{i=1}^{N_j} \mathcal{N}(d_{i,j};\, 0,\, \sigma_c^2) \right]$
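A minimal sketch of this likelihood in Python, assuming the edge events along each normal have already been detected; the clutter probability `q0` and the scale `sigma_c` are illustrative values, not taken from the talk.

```python
import numpy as np

def edge_likelihood(distances_per_normal, q0=0.1, sigma_c=5.0):
    """L^EDGE: per normal, a clutter term q0 plus a Gaussian mixture over the
    N_j detected edge distances d_{i,j}; the product is taken over normals."""
    log_lik = 0.0
    for d in distances_per_normal:
        d = np.asarray(d, dtype=float)
        if d.size == 0:
            log_lik += np.log(q0)   # no edge events on this normal: clutter only
            continue
        gauss = np.exp(-0.5 * (d / sigma_c) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_c)
        log_lik += np.log(q0 + (1.0 - q0) * gauss.mean())
    return np.exp(log_lik)
```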
Contour Likelihood
Advantages
Low computational cost
Robust to illumination changes
Drawbacks
Fragile because of narrow support (especially with only a
similarity transform on a fixed shape space)
Sensitive to background clutter
Extension
Multiply gradient by inter-frame difference to reduce influence
of background clutter
$\nabla I \;\longrightarrow\; \nabla I \cdot \frac{|\Delta I|}{\max |\Delta I|}, \qquad \Delta I = I_t - I_{t-1}$
Inter-Frame Difference
[Video: tracking without the frame difference vs. with the frame difference]
Audio Likelihood
Input: positions of peaks in generalised cross-correlation function
(GCCF)
Reverberation leads to multiple peaks
[Figure: GCCF of the microphone pair; reverberation produces multiple peaks, giving candidate TDOAs $d_1, \ldots, d_N$ that induce the multi-modal likelihood $L^{\mathrm{TDOA}}$]
Audio Likelihood
Deterministic mapping from Time Delay of Arrival (TDOA) to
bearing angle (microphone calibration) to X-coordinate in image
plane (camera calibration)
$G : d \mapsto x$
Audio likelihood follows in a similar manner to the contour likelihood:
$L^{\mathrm{TDOA}}(x) \propto q_0 + \frac{1 - q_0}{N} \sum_{i=1}^{N} \mathcal{N}\!\left(d_i;\, G^{-1}(x),\, \sigma_s^2\right)$
Likelihood assumes a uniform clutter model
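A sketch of one plausible instance of this likelihood in Python. The inverse mapping $G^{-1}$ below is built from generic far-field and pinhole-camera geometry; the speed of sound, microphone separation, focal length, `q0` and `sigma_s` are all assumed values, since the talk only says the mapping comes from calibration.

```python
import numpy as np

C_SOUND = 343.0    # speed of sound, m/s (assumed)
MIC_SEP = 0.20     # microphone separation, m (assumed)
FOCAL_PX = 700.0   # focal length in pixels (assumed camera calibration)

def image_x_to_tdoa(x):
    """G^{-1}: image X-coordinate -> bearing angle -> TDOA."""
    theta = np.arctan(x / FOCAL_PX)
    return MIC_SEP * np.sin(theta) / C_SOUND

def tdoa_likelihood(x, tdoa_peaks, q0=0.2, sigma_s=2e-5):
    """L^TDOA(x): uniform clutter term plus a Gaussian mixture over the GCCF
    peaks d_1..d_N, centred on the predicted delay G^{-1}(x)."""
    d = np.asarray(tdoa_peaks, dtype=float)
    if d.size == 0:
        return q0
    gauss = np.exp(-0.5 * ((d - image_x_to_tdoa(x)) / sigma_s) ** 2) \
            / (np.sqrt(2.0 * np.pi) * sigma_s)
    return q0 + (1.0 - q0) * gauss.mean()
```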
Particle Filter Architecture
[Diagram: layered sampling architecture; the sound likelihood $L^{\mathrm{TDOA}}$ and proposal $q_X$ act on the X-position first, then the contour likelihood $L^{\mathrm{EDGE}}$ on the remaining state]
Layered sampling: first the X-position with the sound likelihood, then the rest of the state
X-position proposal: mixture of diffusion dynamics and sound
proposal:
$q_X(x) = \alpha \, p_X^{\mathrm{LANG}}(x) + (1 - \alpha) \, q_X^{\mathrm{TDOA}}(x)$
with the sound proposal a Gaussian mixture centred on the mapped TDOA candidates:
$q_X^{\mathrm{TDOA}}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}\!\left(x;\, G(d_i),\, \sigma_G^2\right)$
To admit “jumps” from the proposal, the X-dynamics have to be
augmented with a uniform component:
$p_X(x) = \beta \, p_X^{\mathrm{LANG}}(x) + (1 - \beta) \, U(x)$
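A minimal sketch of the mixture proposal in Python, with a one-line version of the assumed TDOA-to-image mapping from the audio-likelihood sketch above; the mixture weight and the two noise scales are illustrative.

```python
import numpy as np

def tdoa_to_image_x(d, c=343.0, a=0.20, f=700.0):
    """G: TDOA -> image X, under the same assumed calibration as before."""
    return f * np.tan(np.arcsin(np.clip(c * d / a, -1.0, 1.0)))

def propose_x(x_prev, tdoa_peaks, alpha=0.8, sigma_lang=3.0, sigma_g=10.0):
    """With probability alpha take a diffusion step from the dynamics;
    otherwise jump to a Gaussian centred on G(d_i) for a random GCCF peak."""
    if np.random.rand() < alpha or len(tdoa_peaks) == 0:
        return x_prev + sigma_lang * np.random.randn()
    d = tdoa_peaks[np.random.randint(len(tdoa_peaks))]
    return tdoa_to_image_x(d) + sigma_g * np.random.randn()
```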
Examples
Effect of inter-frame difference:
Conversational ping-pong:
Examples
Conversational ping-pong and sound based reinitialisation:
Head Tracking using Colour and Motion Measurements
Problem Formulation
Objective: detect and track the head of a single person in a
video sequence taken from a stationary camera
Modality fusion:
Motion and colour measurements are complementary
Motion: when the object is moving, colour is unreliable
Colour: when the object is stationary, motion information
disappears
Automatic object detection and tracker initialisation using motion
measurements
Individualisation of the colour model to the object:
Initialised with a generic skin colour model
Adapted to object colour during periods of motion: motion
model acts as “anchor”
Object Description and Motion
Head modelled as an ellipse that is free to translate and scale in
the image
Binary indicator variable $r$ to signal whether the object is present in the
image or not, so the object state becomes $x = (x, y, s, r)$
State components assumed to have independent motion models
Indicator: discrete Markov chain
Position and scale: Langevin motion with uniform initialisation:
$p(x_t \mid x_{t-1}, r_t, r_{t-1}) =
\begin{cases}
\text{undefined} & \text{if } r_t = 0 \\
p_L(x_t \mid x_{t-1}) & \text{if } r_t = 1 \text{ and } r_{t-1} = 1 \\
U_R(x_t) & \text{if } r_t = 1 \text{ and } r_{t-1} = 0
\end{cases}$
where $U_R$ is uniform over the image region $R$.
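A sketch of sampling from this switching prior in Python, with a first-order random walk standing in for the Langevin dynamics; the transition probabilities, noise scale and region bounds are illustrative assumptions.

```python
import numpy as np

def sample_prior(z_prev, r_prev, p_birth=0.05, p_death=0.05,
                 lo=(0.0, 0.0, 10.0), hi=(320.0, 240.0, 60.0), sigma=2.0):
    """Sample (r_t, z_t): two-state Markov chain on the indicator r,
    random-walk dynamics while alive, uniform draw over region R on birth."""
    if r_prev == 1:
        r = 0 if np.random.rand() < p_death else 1
    else:
        r = 1 if np.random.rand() < p_birth else 0
    if r == 0:
        return 0, None                                   # state undefined when absent
    if r_prev == 1:
        return 1, z_prev + sigma * np.random.randn(3)    # continue the previous track
    return 1, np.random.uniform(np.array(lo), np.array(hi))  # uniform over R
```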
Image Measurements
Measurements taken on a regular grid of isotropic Gaussian filters
applied to the hue, saturation and frame-difference images:
$y^i = (H^i, S^i, D^i)$
Measurement vector: $y = (y^1, \ldots, y^G)$
Observation Likelihood Model
Measurements at gridpoints assumed to be independent
Unique background (object absent) likelihood model for each
gridpoint
All gridpoints covered by the object share the same foreground
likelihood model:
$L(y \mid x) = \prod_{i=1}^{G} L^i(y^i \mid x) = \prod_{i \in F(x)} L^F(y^i) \prod_{i \in B(x)} L_i^B(y^i)$
where $F(x)$ and $B(x)$ index the gridpoints covered by the foreground and background under state $x$.
At each gridpoint the measurements are also assumed to be
independent:
$L^F(y^i) = L^{FH}(H^i) \, L^{FS}(S^i) \, L^{FM}(D^i)$
$L_i^B(y^i) = L_i^{BH}(H^i) \, L_i^{BS}(S^i) \, L^{BM}(D^i)$
Note that the background motion model is shared by all the
gridpoints
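A minimal sketch of this factorised grid likelihood in Python; representing each model as a dict of callables is an assumed interface, not the talk's implementation.

```python
import numpy as np

def observation_likelihood(y, fg_mask, L_F, L_B):
    """L(y|x): shared foreground model L_F at gridpoints covered by the object
    (fg_mask[i] True), per-gridpoint background model L_B[i] elsewhere.
    y is a sequence of (H, S, D) tuples, one per gridpoint."""
    log_lik = 0.0
    for i, (H, S, D) in enumerate(y):
        model = L_F if fg_mask[i] else L_B[i]
        # Per-gridpoint independence: the likelihood factorises over H, S, D.
        log_lik += (np.log(model['hue'](H)) + np.log(model['sat'](S))
                    + np.log(model['motion'](D)))
    return np.exp(log_lik)
```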
Colour Likelihood Model
Normalised histograms for both foreground and background
colour likelihood models:
$L^c(c) = \lambda_{b(c)}$
$c$: colour measurement; $b(c)$: bin index corresponding to the measurement;
$\lambda_i$: normalised count for bin $i = 1, \ldots, B$
Background models trained on a sequence without objects
Foreground models trained on a set of labelled face images
Histogram models supplied with a small uniform component to
prevent numerical problems associated with empty bins
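A sketch of building such a histogram likelihood in Python; the bin count, the value range and the size of the uniform floor `eps` are illustrative choices.

```python
import numpy as np

def make_histogram_likelihood(samples, n_bins=32, lo=0.0, hi=1.0, eps=1e-3):
    """Build L^c(c) = lambda_{b(c)} from training samples, mixed with a small
    uniform component so that empty bins never return exactly zero."""
    counts, _ = np.histogram(samples, bins=n_bins, range=(lo, hi))
    lam = counts / max(counts.sum(), 1)
    lam = (1.0 - eps) * lam + eps / n_bins      # uniform floor against empty bins
    def likelihood(c):
        b = int((c - lo) / (hi - lo) * n_bins)  # bin index b(c)
        return lam[min(max(b, 0), n_bins - 1)]
    return likelihood
```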
Motion Likelihood Model
Background frame-difference measurements empirically found to
be gamma distributed:
$L^{BM}(D^i) \propto (D^i)^{a-1} \exp(-b D^i)$
Foreground frame-difference depends on magnitude of motion,
number and orientation of foreground edges, etc.
Modelling these effects accurately is difficult
In general: if the object is moving foreground frame-difference
measurements are substantially larger than those for background
Thus a two-component uniform distribution is adopted for the
foreground frame-difference measurements (outlier model)
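A sketch of both motion models in Python; the gamma shape and rate would be fitted empirically, and the breakpoint, range and mixture weight of the two-component uniform model are likewise assumed values.

```python
from scipy.stats import gamma

A_SHAPE, B_RATE = 2.0, 0.5   # illustrative gamma parameters

def L_BM(D):
    """Background motion likelihood: gamma-distributed frame differences."""
    return gamma.pdf(D, A_SHAPE, scale=1.0 / B_RATE)

def L_FM(D, split=10.0, D_max=255.0, w_low=0.2):
    """Foreground motion likelihood: two-component uniform 'outlier' model
    putting most of its mass on large frame differences."""
    if D < split:
        return w_low / split
    return (1.0 - w_low) / (D_max - split)
```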
Particle Proposal
Three stages of operation:
Birth: object first enters scene; proposal should detect object
and spawn particles in the object region
Alive: object persists in scene; proposal should allow object to
be tracked, whether it is stationary or moves around
Death: object leaves scene; proposal should kill particles
associated with the object
Form of the particle proposal:
$q(x \mid x', y, P') = q(r \mid r', P') \, q(z \mid z', r, r', y)$, with $z = (x, y, s)$
$P' = \sum_{i=1}^{N} \omega'^{(i)} r'^{(i)}$: empirical probability of the
object being alive
Particle Proposal
Indicator proposal:
$q(r = 1 \mid r' = 0, P') = \begin{cases} P_{\mathrm{birth}} & \text{if } P' = 0 \\ 0 & \text{otherwise} \end{cases}$
$q(r = 0 \mid r' = 1) = P_{\mathrm{death}}$
Birth only allowed if there is no object currently in the scene
All particles alive are subjected to a fixed death probability
State proposal:
$q(z \mid z', r, r', y) = \begin{cases} \text{undefined} & \text{if } r = 0 \\ p_L(z \mid z') & \text{if } r = 1 \text{ and } r' = 1 \\ \mathcal{N}(z;\, \hat{\mu}, \hat{\Sigma}) & \text{if } r = 1 \text{ and } r' = 0 \end{cases}$
Langevin dynamics if object is alive
Gaussian birth proposal: parameters from detection module
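A minimal sketch of the two-part proposal in Python; the birth and death probabilities are illustrative, and a random walk again stands in for the Langevin dynamics.

```python
import numpy as np

def propose_indicator(r_prev, p_alive, p_birth=0.1, p_death=0.05):
    """Birth only when no particle is currently alive (P' == 0);
    death with a fixed probability."""
    if r_prev == 0:
        return 1 if (p_alive == 0 and np.random.rand() < p_birth) else 0
    return 0 if np.random.rand() < p_death else 1

def propose_state(z_prev, r, r_prev, mu_det, cov_det, sigma=2.0):
    """Random-walk step while alive; on birth, a Gaussian whose mean and
    covariance come from the detection module (next slide)."""
    if r == 0:
        return None                                # state undefined when absent
    if r_prev == 1:
        return z_prev + sigma * np.random.randn(*np.shape(z_prev))
    return np.random.multivariate_normal(mu_det, cov_det)
```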
Object Detection
Object region detected by probabilistic segmentation of the
horizontal and vertical projections of the frame-difference
measurements:
Region location and size determine parameters for birth proposal
distribution
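A crude sketch of such a detector in Python, using simple thresholding of the projections in place of the talk's probabilistic segmentation; the threshold is an assumed value.

```python
import numpy as np

def detect_object(frame_diff, thresh=5.0):
    """Locate a moving region from the horizontal and vertical projections of
    the frame-difference image; its centre and extent parameterise the
    Gaussian birth proposal N(z; mu, Sigma)."""
    col_proj = frame_diff.mean(axis=0)      # horizontal projection
    row_proj = frame_diff.mean(axis=1)      # vertical projection
    cols = np.where(col_proj > thresh)[0]
    rows = np.where(row_proj > thresh)[0]
    if cols.size == 0 or rows.size == 0:
        return None                         # nothing moving: no birth proposal
    x, y = cols.mean(), rows.mean()         # region centre
    s = 0.25 * (np.ptp(cols) + np.ptp(rows))  # rough scale from region extent
    return np.array([x, y, s])
```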
Colour Model Adaptation
Why:
Generic skin colour model may be too broad for accurate
localisation
Model sensitive to colour changes due to changes in pose and
illumination
When:
Object present and moving: largest variations in colour
expected
Motion likelihood “anchors” particles around moving object
How:
Gradual, to avoid fitting to the background: enforced with a prior
Stochastic EM: contribution of particles proportional to
likelihood
Colour Model Adaptation
Unknown parameters: normalised bin values for object hue and
saturation histograms
EM Q-function for MAP estimation:
$Q(\theta_t, \hat{\theta}_t) = \mathbb{E}_{p(x_t \mid y_{1:t}, \hat{\theta}_{1:t})}\!\left[ \log L(y_t \mid x_t, \theta_t) \right] + \log \underbrace{p(\theta_t \mid \hat{\theta}_{t-1})}_{\text{dynamical prior}}$
No analytic solution but particle approximation yields:
$Q_N(\theta_t, \hat{\theta}_t) = \sum_{i=1}^{N} \omega_t^i \log L(y_t \mid x_t^i, \theta_t) + \log p(\theta_t \mid \hat{\theta}_{t-1})$
Monte Carlo approximation only performed over particles that
are currently alive
Colour Model Adaptation
Dirichlet prior used for parameter updates:
$p(\theta_t \mid \hat{\theta}_{t-1}) = \mathcal{D}(\theta_t \mid C \hat{\theta}_{t-1})$
Prior centred on old parameter values
Variance controlled by multiplicative constant
Update rule for the normalised bin values becomes:
$\lambda_i = \frac{n_i + \alpha_i - 1}{\sum_{j=1}^{B} (n_j + \alpha_j) - B}, \qquad n_i = \sum_{j=1}^{N} \omega^j n_i^j$
$n_i^j$: $i$-th bin count for the $j$-th particle; $\alpha_i$: Dirichlet prior parameter
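A sketch of this update in Python, following the reconstruction of the rule above; the vector `alpha` corresponds to the Dirichlet prior parameters $C \hat{\theta}_{t-1}$.

```python
import numpy as np

def adapt_histogram(particle_counts, weights, alpha):
    """MAP update of the normalised bin values: lambda_i ~ n_i + alpha_i - 1,
    with n_i the weighted average of the per-particle bin counts over the
    currently live particles (particle_counts has shape (N, B))."""
    n = np.einsum('j,jb->b', weights, particle_counts)   # n_i = sum_j w^j n_i^j
    lam = np.clip(n + alpha - 1.0, 0.0, None)            # guard against negative mass
    return lam / lam.sum()
```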
What Happens?
[Figure: (1) per-particle histograms; (2) the weighted-average histogram used for the update]
Implementation
Colour model adaptation iterations occur between the particle
prediction and particle reweighting steps of the standard particle filter
Stochastic EM algorithm initialised with parameters from previous
time step
A single stochastic EM iteration is sufficient at each time step
The number of particles is fixed at 100
The non-optimised algorithm runs at 15 fps on a standard desktop PC
Examples
No adaptation: tracker gets stuck on skin-coloured carpet in the background
Adaptation: tracker successfully adapts to
changes in pose and illumination and lock is
maintained
No motion likelihood: tracker fails, illustrating
the need for the “anchor” likelihood
Examples
Tracking is successful despite substantial variations
in pose and illumination and the subject
temporarily leaving the scene
Particles are killed when the subject leaves the
scene; upon re-entering, the individualised
colour model allows lock to be re-established
within a few frames
The End