COMP9314 Lecture Notes Stabbing the Sky: Efficient Skyline Computation over Sliding Windows
COMP9314 Xuemin [email protected]

Outline
• Introduction
• n-of-N Queries
• (n1, n2)-of-N Queries
• Performance Evaluation
• Conclusions

Skyline
Skyline Query:
• Input: a set of points in d-dimensional space.
• Output: the points not dominated by any other point.
• (x1, x2, …, xd) dominates (y1, y2, …, yd) iff xi ≤ yi for all 1 ≤ i ≤ d and ∃k such that xk < yk.

Applications
Multi-criteria decision making…
Stock Trading Example:
• What are the top deals?

Skyline Query Over Sliding Window
Stock Trading Example:
• Top deals of a stock in the last 5 mins? last 4 mins, …
• Top deals of a stock in the last 10K deals? …
Queries:
• n-of-N model (n ≤ N): the most recent n elements
• (n1, n2)-of-N model
• One-time queries
• Continuous queries

Challenges
• Insertions & deletions (possibly high speed).
• On-line information
  – memory requirement
  – processing speed
• Existing techniques do not support n-of-N [Borzsonyi et al. (ICDE01), Tan et al. (VLDB01), Kossmann et al. (VLDB 02), Papadias et al. (SIGMOD03), Kapoor (SIAM J. comp00)]:
  – they support computation over the whole dataset only
  – O(n log^(d-2) n) for d ≥ 4 and O(n log n) otherwise

Results
n-of-N:
• Keep N' (N' ≤ N) elements, where N' = O(log^d N) if the data distribution on each dimension is independent.
• A novel encoding scheme, with O(N') space, leads to an n-of-N query time of O(log N' + s) instead of O(n log^(d-2) n).
• A new trigger-based technique for continuously processing an n-of-N query.
  – trigger update time: O(log s)
  – result update time: O(log δ), where δ is the change in the result
(n1, n2)-of-N: similar results.

n-of-N Queries
• A point e in PN (the most recent N elements) is redundant iff
  – e expires w.r.t. PN, or
  – ∃e' s.t.
e' ≺ e (e' dominates e), and e' is younger than e
[Figure: example with N = 6]

Optimality
Theorem: Non-redundant Points (RN) vs. n-of-N Skyline Query Result (Qn,N)
  – (PN − RN) does not appear in any Qn,N
  – Qn,N must be a subset of RN
  – ∀x ∈ RN, ∃n such that x ∈ Qn,N
  – |RN| = O(log^(d-1) N) for "independent" distributions
• Only RN needs to be kept: it is the minimum set of elements to keep.

Querying RN
• Critical dominance: e ≺c e', where e is the youngest element of RN that dominates e'.
• Dominance graph GRN: RN and the critical dominance relationships.
[Figure: points 1 to 7 in the X-Y plane and the dominance graph GRN, M = 7, N = 7]

Querying RN
e ∈ Qn,N iff
• e is a root in GRN, or
• e' ≺c e in GRN and e' has expired, i.e. t(e') < M − n + 1 ≤ t(e)

Example (M = 7, N = 7):
n    RN in window       M − n + 1   Qn,N
5    {3, 4, 5, 6, 7}    3           {3, 4}
4    {4, 5, 6, 7}       4           {4, 7}
3    {5, 6, 7}          5           {5, 6, 7}

Querying RN: Optimal Algorithm
To answer an n-of-N query, encode GRN using intervals: each e ∈ RN gets the interval (t(e'), t(e)], where e' ≺c e, or (0, t(e)] if e is a root.
• Stab the intervals with (M − n + 1).
• For every returned interval (x, y], return the point whose timestamp is y.
• Technique: use an interval tree index to achieve the optimal O(log |RN| + s) query time.
[Figure: RN for M = 7, N = 7 with its intervals (0,3], (0,4], (3,7], (4,5], (4,6]; e.g., n = 6]

Maintaining RN
When a new element enew arrives:
• If the oldest element eold ∈ RN expires, remove eold and update RN and GRN (interval tree).
• Find D ⊆ RN, the elements dominated by enew; update RN and GRN
  – depth-first search on an R-tree of RN
• Find the element e with e ≺c enew (e critically dominates enew); update GRN
  – best-first search on the R-tree of RN
[Figure: RN before and after the arrival of enew, M = 8, N = 5: enew = 8, eold = 3, D = {6}, e = 4]

Continuous n-of-N Query
Trigger-based algorithm:
• Deletion: Qn,N − {eold} and Qn,N − D
• Insertion: Qn,N ∪ {enew}, unless e' ≺c enew and t(e') ≥ M − n + 1 (the critical dominator of enew is still inside the n-window)
• Maintain a min-heap of Qn,N for efficiency
[Figure: N = 5, enew = 8, eold = 3, D = {6}, e' = 4. At M = 7: Q4,5 = {4, 7}, Q5,5 = {3, 4}. At M = 8: Q4,5 = {5, 7, 8}, Q5,5 = {4, 7}]

(n1,n2)-of-N Query
More complicated than the n-of-N query:
• PN needs to be kept!
• (Old) critical dominance: t(ae) = max { t(e') : e' ≺ e and t(e') < t(e) }
• Backward critical dominance: t(be) = min { t(e') : e' ≺ e and t(e') > t(e) }
• e ∈ Q(n1,n2),N iff ae < M − n2 + 1 ≤ e ≤ M − n1 + 1 < be (elements are identified with their timestamps)
• CBC dominance graph: PN and the two kinds of dominance relationships
[Figure: CBC dominance graph over points 1 to 7, M = 7, N = 7. For the (2,4)-of-7 query, M − n2 + 1 = 4 and M − n1 + 1 = 6; a5 = 3, b5 = 6; result {4, 6}]

Processing (n1,n2)-of-N Query
Encode the CBC dominance graph:
  – e → ((ae, e], be)
  – build an interval tree on (ae, e] only
Stab the interval tree with M − n2 + 1 and check (e ≤ M − n1 + 1 < be) on the fly:
  – O(log N + s*), sub-optimal
[Figure: intervals (1,6], (3,4], (3,5], …; M = 7, N = 7. Stabbing at M − n2 + 1 = 4 gives the candidates {4, 5, 6}; the on-the-fly check leaves (2,4)-of-7 = {4, 6}]

More on (n1,n2)-of-N Query
Maintenance: similar to that of the n-of-N query, but
  – always expire the oldest element in PN, and maintain the interval tree and the R-tree on RN
  – implementation-wise: use two interval trees to index RN and PN − RN, respectively
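The membership test from the (n1,n2)-of-N slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper: ae and be are found by brute force over PN rather than through interval trees and an R-tree, and the names `window`, `encode`, `query` and `brute_force`, as well as the point coordinates, are hypothetical (they are not the points of the slide figure).

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
INF = float("inf")


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: p <= q on every dimension and p < q on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def encode(window: Dict[int, Point]) -> Dict[int, Tuple[float, float]]:
    """Map each timestamp t in P_N to (a_e, b_e), computed by brute force.

    a_e: timestamp of the youngest older dominator (0 if there is none).
    b_e: timestamp of the oldest younger dominator (infinity if none).
    """
    enc = {}
    for t, p in window.items():
        older = [u for u, q in window.items() if u < t and dominates(q, p)]
        younger = [u for u, q in window.items() if u > t and dominates(q, p)]
        enc[t] = (max(older, default=0), min(younger, default=INF))
    return enc


def query(enc: Dict[int, Tuple[float, float]], M: int, n1: int, n2: int) -> List[int]:
    """Timestamps e with a_e < M-n2+1 <= e <= M-n1+1 < b_e."""
    lo, hi = M - n2 + 1, M - n1 + 1
    return sorted(t for t, (a, b) in enc.items() if a < lo <= t <= hi < b)


def brute_force(window: Dict[int, Point], M: int, n1: int, n2: int) -> List[int]:
    """Reference answer: skyline of the elements with timestamps in [lo, hi]."""
    lo, hi = M - n2 + 1, M - n1 + 1
    inside = {t: p for t, p in window.items() if lo <= t <= hi}
    return sorted(t for t, p in inside.items()
                  if not any(dominates(q, p) for q in inside.values()))


# Hypothetical 2-d data (smaller is better), keyed by timestamp 1..7, so M = N = 7.
window = {1: (5, 1), 2: (1, 6), 3: (3, 4), 4: (4, 5),
          5: (6, 6), 6: (2, 3), 7: (7, 2)}
enc = encode(window)
print(query(enc, 7, 2, 4))
```

With n1 = 1 the upper bound M − n1 + 1 equals M, so the test reduces to the n-of-N condition: no younger dominator exists and the youngest older dominator lies outside the n-window.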
Continuous queries
  – More complicated
    • A new skyline point might be neither a skyline point in the previous result,
    • nor critically dominated by a skyline point in the previous result,
    • nor a newly arrived point
  – Basic idea
    • Maintain additional Candidate Solutions (minimization) & triggers
  – Details in the full paper

Experiment Setup
• Hardware
  – P4 2.8 GHz CPU, 1 GB memory
• Datasets
  – correlated, independent, and anti-correlated
  – d = 2 to 5, N = 10^6
• Algorithms
  – KLP, nN, mnN, cnN, n12N, mn12N
• Metrics
  – processing time, streaming rate

n-of-N Query
• Varying dimensionality
  – M up to 2 million, N = 1 million, n chosen uniformly from [1K, 1M], #queries = 1000

n-of-N Query (cont'd)
• Varying n
• For correlated, independent, and anti-correlated datasets

Maintenance Costs
• 2d and 5d datasets
• Measure average and max time
• N = i × 10^5

Scalability
• M (total number of elements) = 2 million, N = 1 million, #queries = 2 million
[Figure: panels for the independent and anti-correlated datasets]

Continuous n-of-N Queries
• 2d & 5d datasets
• N = 10K and 1M
• 10 queries with n = i × (N/10)
• Measures: cnN avg, cnN max, nN avg, nN max

(n1,n2)-of-N Queries
Varying dimensionality
  – M up to 2 million, N = 1 million, #queries = 1000
  – restricting n2 − n1 ≥ 500
Scalability
  – M = 2 million, N = 1 million, #queries = 2 million

Maintenance
• 2d and 5d datasets
• Measure average and max time
• N = i × 10^5

Conclusions
• Efficient algorithms for various sliding-window skyline queries
  – keep only the minimum number of points
  – encode and index those points
  – maintain all the data structures
• The proposed solutions
  – have theoretical guarantees on performance, and
  – have demonstrated efficiency and scalability in the experiments
• Future work
  – improve the current solution for (n1,n2)-of-N queries
  – approximate skyline queries

Q&A
Thank You!

Reference
• [ICDE01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. ICDE, 2001.
• [VLDB01] K. Tan, P. Eng, and B. Ooi. Efficient progressive skyline computation. VLDB, 2001.
• [VLDB 02] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. VLDB, 2002.
• [SIGMOD03] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal progressive algorithm for skyline queries. SIGMOD, 2003.
• [SIAM J. comp00] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM J. Comput., 2000.
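Supplementary sketch: the interval encoding from the "Querying RN: Optimal Algorithm" slide can be tried out in Python on the five intervals shown in that slide's figure (M = 7, N = 7). The code below is an illustration under assumptions, not the authors' implementation: it uses a small static centered interval tree, it does not cover maintenance under insertions and expirations, and the names `IntervalNode`, `INTERVALS` and `n_of_n` are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (a, t) stands for the half-open interval (a, t]

# Intervals read off the slide figure; a = 0 marks a root of the dominance graph.
INTERVALS: List[Interval] = [(0, 3), (0, 4), (3, 7), (4, 5), (4, 6)]


class IntervalNode:
    """Static centered interval tree over non-empty intervals (a, t] with a < t."""

    def __init__(self, intervals: List[Interval]) -> None:
        ends = sorted(t for _, t in intervals)
        # The center is a right endpoint, so at least one interval contains it.
        self.center = ends[len(ends) // 2]
        here = [iv for iv in intervals if iv[0] < self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] >= self.center]
        self.by_start = sorted(here)  # ascending left endpoint
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left: Optional["IntervalNode"] = IntervalNode(left) if left else None
        self.right: Optional["IntervalNode"] = IntervalNode(right) if right else None

    def stab(self, q: int, out: List[int]) -> None:
        """Append to out the right endpoint t of every (a, t] with a < q <= t."""
        if q < self.center:
            for a, t in self.by_start:  # all stored here have t >= center > q
                if a >= q:
                    break
                out.append(t)
            if self.left:
                self.left.stab(q, out)
        elif q > self.center:
            for a, t in self.by_end:  # all stored here have a < center < q
                if t < q:
                    break
                out.append(t)
            if self.right:
                self.right.stab(q, out)
        else:
            out.extend(t for _, t in self.by_start)


def n_of_n(tree: IntervalNode, M: int, n: int) -> List[int]:
    """Timestamps of Q_{n,N}: the intervals stabbed by M - n + 1."""
    out: List[int] = []
    tree.stab(M - n + 1, out)
    return sorted(out)


tree = IntervalNode(INTERVALS)
for n in (5, 4, 3):
    print(n, n_of_n(tree, 7, n))
```

For n = 5, 4 and 3 the stabbing points are 3, 4 and 5, and the returned timestamps match the Qn,N sets listed on the "Querying RN" slide.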