Scalable Data Partitioning Techniques for Parallel Sliding Window Processing over Data Streams
DMSN 2011
Cagri Balkesen & Nesime Tatbul

Talk Outline
• Intro & Motivation
• Stream Partitioning Techniques
  – Basic window partitioning
  – Batch partitioning
  – Pane-based partitioning
• Ring-based Query Evaluation
• Experimental Evaluation
• Conclusions & Future Work

Intro & Motivation
[Figure: overview of a data stream management system (DSMS)]

Architectural Overview
[Figure: input stream → Split stage (split node) → query nodes, each running a copy of the query → Merge stage (merge node) → output stream; QoS example: latency < 5 seconds, disorder < 3 tuples]
• Classical split-merge pattern from parallel databases
• Adjustable parallelism level, d
• QoS constraints on maximum latency & ordering

Related Work: How to Partition?
• Content-sensitive
  – Flux: fault-tolerant, load-balancing Exchange [1,2]
  – Uses the group-by values from the query to partition
  – Needs explicit load balancing due to skewed data
• Content-insensitive
  – GDSM: window-based parallelization for fixed-size tumbling windows [3]
  – Win-Distribute: partition at window boundaries
  – Win-Split: partition each window into equal-length subwindows
• The problem:
  – How to handle sliding windows?
  – How to handle queries without a group-by, or with only a few groups?
[1] Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE '03
[2] Highly-Available, Fault-Tolerant, Parallel Dataflows. SIGMOD '04
[3] Customizable Parallel Execution of Scientific Stream Queries. VLDB '05

Stream Partitioning Techniques

Approach 1: Basic Sliding Window Partitioning
• Chunk the stream into independently processable pieces
  – Window-aware splitting of the stream
• Each window has an id & tuples are marked with
  – (first-winid, last-winid, is-win-closer)
• Tuples are replicated for each of their windows
[Figure: stream t1, t2, ..., t10 cut into overlapping windows W1–W4, which Split distributes over Node1–Node3; w = 6 units, s = 2 units, replication = w/s = 3]

The problem with basic sliding window partitioning:
• Tuples belong to many windows, depending on the slide
• Excessive replication of tuples, once for each of their windows
• Increased output data volume at the split
[Figure: same example as above, with each tuple copied to w/s = 3 windows]

Approach 2: Batch-based Partitioning
• Batch several consecutive windows together to reduce replication
• "Batch-window": wb = w + (B−1)·s ; sb = B·s
  – All the tuples in a batch go to the same partition
  – Only tuples overlapping between batches are replicated
• Replication is reduced to wb/sb instead of w/s (see the second sketch after the panes technique below)
• Definitions: w = window size, s = slide size, B = batch size
[Figure: stream t1, ..., t10 with w = 3, s = 1, B = 3, hence wb = 5, sb = 3; windows w1–w8 grouped into batches B1, B2; replication drops from 3 to 5/3]

The Panes Technique [1]
• Divide overlapping windows into disjoint panes
• Reduce cost by sub-aggregation and sharing
• Each window has w/gcd(w,s) panes of size gcd(w,s)
• The query is decomposed into a pane-level query (PLQ) and a window-level query (WLQ)
[Figure: panes p1, p2, ..., p8 composed into overlapping windows w1, w2, ..., w5]
[1] No Pane, No Gain: Efficient Evaluation of Sliding Window Aggregates over Data Streams. SIGMOD Record '05
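To make the PLQ/WLQ decomposition concrete, here is a minimal Python sketch for a sum aggregate over the slide's parameters (w = 6, s = 2). The helper names plq and wlq, the plain-number tuples, and the single-node, in-memory setting are illustrative assumptions, not the talk's Borealis implementation:

```python
from math import gcd

w, s = 6, 2                 # window size and slide, in tuples
p = gcd(w, s)               # pane size: disjoint chunks shared by all windows
panes_per_win = w // p      # each window spans w/gcd(w,s) = 3 panes
step = s // p               # consecutive windows shift by s/gcd(w,s) panes

def plq(pane_tuples):
    """Pane-level query: sub-aggregate one disjoint pane (here: sum)."""
    return sum(pane_tuples)

def wlq(pane_results):
    """Window-level query: combine the pane results of one window."""
    return sum(pane_results)

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

# PLQ: every tuple is aggregated into exactly one pane (a splitter would
# tag tuple i with pane-id i // p and route on that id).
pane_results = [plq(stream[j:j + p]) for j in range(0, len(stream), p)]

# WLQ: window k covers panes [k*step, k*step + panes_per_win); overlapping
# windows share pane results instead of re-reading the raw tuples.
n_windows = (len(pane_results) - panes_per_win) // step + 1
for k in range(n_windows):
    first = k * step
    print(f"window {k}: sum = {wlq(pane_results[first:first + panes_per_win])}")
```

For an aggregate like sum, the WLQ touches only w/gcd(w,s) pane partials per window instead of w raw tuples, which is where the sharing savings come from.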
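And the batching arithmetic referenced on the batch-based partitioning slide, using that slide's example values (w = 3, s = 1, B = 3). batches_of is a hypothetical helper showing which batches a splitter would have to copy each tuple to:

```python
w, s, B = 3, 1, 3            # window size, slide, batch size (slide example)
wb = w + (B - 1) * s         # batch-window size: 5
sb = B * s                   # batch-window slide: 3

print(f"replication without batching: {w / s:.2f} copies/tuple")    # 3.00
print(f"replication with batching:    {wb / sb:.2f} copies/tuple")  # 1.67

def batches_of(i):
    """Batch ids containing the tuple at 0-based position i: exactly the
    batches b whose span [b*sb, b*sb + wb) covers i, mirroring window
    membership but over the larger, less-overlapping batch-windows."""
    first = max(0, (i - wb) // sb + 1)
    return [b for b in range(first, i // sb + 1) if b * sb <= i < b * sb + wb]

for i in range(8):
    print(f"t{i + 1} -> batches {[f'B{b + 1}' for b in batches_of(i)]}")
```

Running this reproduces the slide's figure: only t4 and t5, which fall in the overlap between B1 and B2, are copied twice; all other tuples go to a single batch.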
Approach 3: Pane-based Partitioning
• Mark each tuple with a pane-id and a win-id
  – Treat panes as tumbling windows with wp = sp = gcd(w, s)
• Route tuples to a node based on the pane-id
• Nodes compute the PLQ over their pane tuples
• Combine all PLQ results of a window to form the WLQ result
  – Requires an organized topology of nodes
  – We propose organizing the nodes in a ring
[Figure: Split routes the panes of a stream with w = 6 units, s = 2 units to Node1–Node3]

Ring-based Query Evaluation
[Figure: input tuples 1–14 grouped into panes Pane1–Pane7 and windows Window1–Window3, with W = 6 tuples, S = 4 tuples, pane size P = gcd(6, 4) = 2 tuples; Split routes pane tuples P1, P2, ... to Node1–Node3 arranged in a ring; each node computes pane results (R3, R5, R7, ...) and forwards those shared with the next window to its ring successor; Merge collects the window results W1, W2, W3, ...]
• High amount of pipelined result sharing among the nodes
• Organized communication topology

Assignment of Windows and Panes to Nodes
• All non-local pane results of a node arrive only from its predecessor
• Only locally computed pane results are sent to the successor
  – Each node is assigned n consecutive windows
  – Minimum n such that n · sw ≥ ww − sw
• Definitions: ww = window size in # of panes, sw = slide size in # of panes

Flexible Result Merging
[Figure: spectrum of merge guarantees, from fully-ordered (k = 0) to FIFO]
• k-ordered: the k-ordering constraint [1]; a bounded amount of disorder is allowed
• Definition: for any tuple s, every tuple s' that arrives at least k + 1 tuples after s satisfies s'.A ≥ s.A
• A sketch of a k-ordered merge buffer is given after the last slide
[1] Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams. ACM TODS '04

Experimental Evaluation
• Implementation of the techniques in Borealis
• Workload adapted from the Linear Road Benchmark
  – Slightly modified segment statistics queries
  – Basic aggregation functions with different window/slide ratios

Scalability of the Split Operator
[Figure: maximum input rate (tuples/second) vs. window-size/slide ratio (window overlap)]
• Pane-based partitioning: cost & throughput stay constant regardless of the overlap ratio
• Window- and batch-based partitioning: cost increases and throughput decreases as the overlap increases
• The excessive replication of window-based partitioning is reduced by batching

Scalability of the Partitioning Techniques
[Figure: throughput vs. number of nodes, with w/s = overlap ratio = 100]
• Pane-based scales close to linearly until the split is saturated
  – Per-tuple cost is constant
• Window- and batch-based: extremely high replication
  – The split is not saturated, but throughput scales very slowly

Summary & Conclusions
1) Window-based  2) Batch-based  3) Pane-based
• Pane-based partitioning is the method of choice
  – Avoids tuple replication
  – Incurs less overhead in split and aggregate
  – Scales close to linearly

Ongoing & Future Work
• Generalization of the framework
• Support for adaptivity at runtime
• Extending the complexity of the query plans
• Extending the performance analysis & experiments

Thank You!
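Addendum: a minimal sketch of how a merge node can exploit the k-ordering constraint from the Flexible Result Merging slide to restore full order with a bounded buffer. The generator name and the plain-integer tuples (ordered on their own value rather than an attribute A) are illustrative assumptions, not the talk's merge operator:

```python
import heapq

def k_ordered_merge(stream, k):
    """Turn a k-ordered stream into a fully ordered one.
    If every tuple s' arriving at least k + 1 positions after s satisfies
    s'.A >= s.A, then once k + 1 tuples are buffered, the smallest one
    can never be undercut by a future arrival and is safe to emit."""
    heap = []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:          # buffer at most k + 1 tuples at a time
            yield heapq.heappop(heap)
    while heap:                    # flush the buffer at end of stream
        yield heapq.heappop(heap)

# Window results arriving from parallel query nodes with disorder <= 3:
print(list(k_ordered_merge([1, 4, 2, 3, 5, 8, 6, 7, 9], k=3)))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With k = 0 the buffer degenerates to a pass-through, matching the fully-ordered end of the spectrum; larger k trades ordering strictness for lower merge latency and memory.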