A Dynamic Probabilistic Multimedia Retrieval Model
Tzvetanka I. Ianeva, Arjen P. de Vries, Thijs Westerveld
ICME 2004

Introduction
• Video representation schemes used for retrieval:
  – Static
  – Spatio-temporal
• Video is a temporal medium, so a ‘good’ model should overcome the limitations of keyframe-based shot representation

Spatio-temporal grouping
• Spatial priority and tracking of regions from frame to frame
• Joint spatial and temporal segmentation
  – Human vision finds salient structures jointly in space and time (Gepshtein and Kubovy, 2000)

Motivation
• Pursue video retrieval instead of image (keyframe) retrieval
• Extension of the Static Probabilistic Multimedia Retrieval model (2003)
• GMM in DCT-space-time domain
  – Diagonal covariance

Static Model
• Indexing
  – Estimate Gaussian Mixture Models from images using EM
  – Based on feature vectors with colour, texture and position information from pixel blocks
  – Fixed number of components
[Figure: panels labelled "Docs" and "Models"]

Static Model
• Indexing
  – Estimate a Gaussian Mixture Model from each keyframe (using EM)
  – Fixed number of components (C=8)
  – Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>

Static Model
• Retrieval
  – Calculate conditional probabilities of query samples given models in collection
[Figure: a "Query" scored against "Models", labelled P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)]

Dynamic Model
• Selecting frames
  – 1 second sequence around the keyframe
  – Entire video shot as sequence of frames sampled at regular intervals
• Features: <x,y,t,DCT>

Dynamic Model
• Indexing:
  – GMM of multiple frames around keyframe
  – Feature vectors extended with timestamp normalized in [0,1]: <x,y,t,DCT>
[Figure: axis with ticks at 0, .5, and 1]

Query example: A single image
• Artificial sequence of 29 images as the single query example, where the time is normalized between 0 and 1
• Extend the query example image’s features with a fixed temporal feature value of 0.5
  – Better results and lower computational cost

Dynamic Model Advantages
• More training data for models
  – Less sensitive to random initialization
• Reduced dependency upon selecting appropriate keyframe
• Some spatio-temporal aspects of shot are captured
  – (Dis-)appearance of objects

Dynamic Model
[Three figure-only slides; no text survives]

Retrieval Framework
• Smoothing
  $$\mathrm{RSV}(\omega_i) = \frac{1}{N} \sum_{j=1}^{N} \log\left[\kappa\, P(x_j \mid \omega_i) + (1-\kappa)\, P(x_j)\right]$$
• Building dynamic GMMs
  $$P(x \mid \omega_i) = \sum_{c=1}^{N_C} P(C_{i,c})\, G(x, \mu_{i,c}, \Sigma_{i,c})$$
  $$G(x, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
  – Likelihood goes to infinity???

Experimental Set-up
• Build models for each shot
  – Static, Dynamic, Language
• Build queries from topics
  – Construct simple keyword text query
  – Select visual example
  – Rescale and compress example images to match video size and quality

Combining Modalities
• Independence assumption textual/visual
  – P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
• Combination works if both runs are useful [CWI:TREC:2002]
• Dynamic run more useful than static run

Combining Modalities
Run            MAP
ASR only       .130
Static only    .022
Static+ASR     .105
Dynamic only   .022
Dynamic+ASR    .132

Dynamic: Higher Initial Precision
[Figure: panels "Static run" and "Dynamic run"]

Dow Jones Topic (120)
• “Dow Jones Industrial Average rise day points”
[Figure: "+" and "=" join the text query with images that did not survive extraction]

Conclusions
• Dynamic model captures visual similarity better
  – Spatio-temporal aspects
  – More training data
  – Appropriate keyframe less critical
  – Less sensitive to the random initialization
• ASR + dynamic better than either alone

Future work
• More data needs more computation effort
  – Optimizations?
• Avoid the singular solutions
  – Dynamic number of components?
• Full covariance in space-time <x,y,t>
• Integration of audio

Thanks!!!
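
The scoring on the Retrieval Framework slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the score is the mean log of a kappa-weighted mix of the shot model's likelihood and a background probability, and that each mixture component is a diagonal-covariance Gaussian. The function names, the default `kappa=0.9`, the `[x, y, DCT...]` feature layout, and the idea of passing background probabilities in as a list are all assumptions made for the example.

```python
import math


def diag_gaussian(x, mean, var):
    """Density G(x, mu, Sigma) for a diagonal covariance given as variances."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_likelihood(x, components):
    """P(x | model): sum over components of prior * G(x, mean, var).

    components is a list of (prior, mean, var) tuples (hypothetical layout).
    """
    return sum(prior * diag_gaussian(x, mean, var)
               for prior, mean, var in components)


def rsv(samples, components, background, kappa=0.9):
    """Mean over samples of log(kappa * P(x|model) + (1 - kappa) * P(x)).

    background[j] is the assumed collection probability P(x_j) of sample j;
    kappa=0.9 is an illustrative value, not one taken from the slides.
    """
    total = 0.0
    for x, p_bg in zip(samples, background):
        p_model = gmm_likelihood(x, components)
        total += math.log(kappa * p_model + (1.0 - kappa) * p_bg)
    return total / len(samples)


def add_fixed_time(block, t=0.5):
    """Turn a static [x, y, DCT...] vector into [x, y, t, DCT...]."""
    return block[:2] + [t] + block[2:]


# Usage: score one query block against a toy two-component shot model.
query = [add_fixed_time([0.25, 0.75, 1.5])]
model = [
    (0.6, [0.2, 0.7, 0.5, 1.0], [0.05, 0.05, 0.1, 1.0]),
    (0.4, [0.8, 0.3, 0.5, -1.0], [0.05, 0.05, 0.1, 1.0]),
]
print(rsv(query, model, background=[0.01]))
```

The sketch also makes the "likelihood goes to infinity" concern concrete: if a component's mean sits on a sample and its variances shrink toward zero, `diag_gaussian` grows without bound, which is the singular solution listed under future work.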

Merging Run Results
• Combining (conflicting) examples is difficult [CWI:TREC:2002]
• Single example: misses relevant shots
• Round-Robin Merging
[Figure: two ranked lists (1 to 10) interleaved into a "Combined" list: 1, 1, 2, 2, 3, 3, 4, 4, ...]

Merging Run Results
          Single   All    Selected   Best
          .022     .031   .039       .050
+ASR      .132     .149   .151       .155

Conclusions
• Visual aspects of an information need are best captured by using multiple examples
• Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics
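
The round-robin merging named on the Merging Run Results slides can be sketched as below. This is a hedged illustration: the slides only show two ranked lists being interleaved rank by rank, so the function name, the `depth` parameter, and the choice to skip a shot that an earlier run already contributed are assumptions made for the example.

```python
def round_robin_merge(runs, depth=None):
    """Interleave ranked lists: rank 1 of every run, then rank 2, and so on.

    runs  -- list of ranked lists of shot identifiers, best first
    depth -- optional cut-off for the merged list (hypothetical parameter)
    Duplicate handling (keep the first occurrence) is an assumption.
    """
    merged = []
    seen = set()
    longest = max((len(run) for run in runs), default=0)
    for rank in range(longest):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                merged.append(run[rank])
    return merged if depth is None else merged[:depth]


# Usage: merge the rankings obtained for two query examples.
run_a = ["shot12", "shot7", "shot3"]
run_b = ["shot40", "shot7", "shot9"]
print(round_robin_merge([run_a, run_b]))
```

Because each example contributes its top-ranked shots in turn, a shot that only one example retrieves well still reaches the head of the merged list, which is the motivation the slides give for merging over relying on a single example.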