SALSA Group Research Activities
April 27, 2011

Research Overview
• MapReduce Runtime: Twister, Azure MapReduce
• Dryad and Parallel Applications
• NIH Projects: Bioinformatics Workflow
• Data Visualization – GTM/MDS/PlotViz
• Education

Twister & Azure MapReduce

What is Twister?
Twister is an iterative MapReduce framework that supports:
• Customized static input data partitioning
• Cacheable map/reduce tasks
• A combine operation that converges intermediate outputs back to the main program
• Fault recovery between iterations
(Further slides cover the Twister programming model, the Twister architecture, and applications and performance; a C# sketch of the iterative pattern appears below, after the DryadLINQ overview.)

MapReduceRoles for Azure
A MapReduce framework for the Azure cloud, built on highly available and scalable Azure cloud services:
• Distributed, highly scalable, and highly available services
• Minimal management/maintenance overhead; reduced footprint
• Hides the complexity of the cloud and cloud services from users
• Coexists with the eventual consistency and high latency of cloud services
• Decentralized control that avoids a single point of failure
Features:
• Dynamic scaling up and down of compute resources
• Fault tolerance
• Combiner step
• Web-based monitoring console
• Easy testing and deployment

Twister for Azure
An iterative MapReduce framework for the Microsoft Azure cloud:
• Merge step after reduce
• In-memory caching of static data
• Cache-aware hybrid scheduling using queues as well as a bulletin board
[Architecture diagram: at job start, worker roles pick map tasks from a map scheduling queue; on later iterations a job bulletin board drives the hybrid scheduling of the new iteration. Map workers hold static data in an in-memory data cache; map and reduce task tables track task status, a monitoring role watches progress, and combine/reduce outputs flow to a merge step that decides whether to add another iteration before the job finishes.]

Performance Comparisons
[Figure: Kmeans performance with/without data caching, time (s).]
[Figure: Kmeans scaling speedup, relative parallel efficiency vs. number of instances x number of data points (8 x 16M up to 64 x 128M).]
[Figure: Kmeans with an increasing number of iterations, relative parallel efficiency.]
[Figure: BLAST sequence search, parallel efficiency vs. number of query files (128-728), for Hadoop-Blast, DryadLINQ-Blast, and Twister4Azure.]
[Figure: Cap3 sequence assembly, adjusted time (s) vs. number of cores x number of files, for Twister4Azure, Amazon EMR, and Apache Hadoop.]
[Figure: Smith-Waterman sequence alignment, time (s) vs. number of cores x number of blocks, for Twister4Azure, Amazon EMR, and Apache Hadoop.]

Dryad & Parallel Applications

DryadLINQ CTP Evaluation
The beta (CTP) version was released in December 2010.
Motivation:
• Evaluate the key features and interfaces of DryadLINQ
• Study the parallel programming model of DryadLINQ
Three applications:
• SW-G bioinformatics application
• Matrix-matrix multiplication
• PageRank

Parallel programming model
DryadLINQ stores input data as DistributedQuery<T> objects and splits distributed objects into partitions with APIs such as AsDistributed() and RangePartition() (see the sketch at the end of this section).
Common LINQ providers (data provider: base class):
• LINQ-to-objects: IEnumerable<T>
• PLINQ: ParallelQuery<T>
• LINQ-to-SQL: IQueryable<T>
• LINQ-to-?: IQueryable<T>
• DryadLINQ: DistributedQuery<T>
[Architecture diagram: a Windows HPC Server 2008 R2 cluster; a workstation computer runs the client with the HPC client utilities, the DryadLINQ provider, and the DSC client service; the head node runs the HPC job scheduler and DSC services; the Dryad graph manager dispatches vertices 1..n to compute nodes, which store the distributed data.]
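To make the iterative model from the Twister and Twister4Azure slides concrete, here is a minimal, self-contained C# sketch of the control flow, using k-means (the benchmark from the performance figures): static data is partitioned once and reused across iterations, map tasks emit partial results, a combine/merge step converges them back to the main program, and the merge decides whether to add another iteration. This is not the Twister API (Twister itself is a Java framework); the names MapPartition, Merge, and Converged are hypothetical.

```csharp
// A sketch of the iterative MapReduce pattern (map -> combine/reduce ->
// merge -> iterate) described on the Twister/Twister4Azure slides,
// illustrated with 1-D k-means. All names here are hypothetical.
using System;
using System.Collections.Generic;
using System.Linq;

class IterativeKMeansSketch
{
    static void Main()
    {
        // Static input data: partitioned once and, in Twister, cached in
        // memory across iterations instead of being re-read each time.
        var partitions = new List<double[]>
        {
            new[] { 1.0, 1.1, 0.9 },
            new[] { 8.0, 8.2, 7.9 },
        };
        double[] centroids = { 0.0, 10.0 };   // dynamic data, sent each iteration

        for (int iter = 0; iter < 20; iter++)
        {
            // Map: each cached partition computes partial sums per centroid.
            var partials = partitions.Select(p => MapPartition(p, centroids));

            // Combine/Reduce: converge intermediate outputs back to the
            // main program by merging the partial sums.
            double[] newCentroids = Merge(partials, centroids.Length);

            // Merge step decides whether to add another iteration.
            bool done = Converged(centroids, newCentroids);
            centroids = newCentroids;
            if (done) break;
        }
        Console.WriteLine("Centroids: " + string.Join(", ", centroids));
    }

    // Assign each point to its nearest centroid; accumulate (sum, count).
    static (double[] sums, int[] counts) MapPartition(double[] points, double[] centroids)
    {
        var sums = new double[centroids.Length];
        var counts = new int[centroids.Length];
        foreach (var x in points)
        {
            int best = 0;
            for (int k = 1; k < centroids.Length; k++)
                if (Math.Abs(x - centroids[k]) < Math.Abs(x - centroids[best])) best = k;
            sums[best] += x;
            counts[best]++;
        }
        return (sums, counts);
    }

    static double[] Merge(IEnumerable<(double[] sums, int[] counts)> partials, int k)
    {
        var sums = new double[k];
        var counts = new int[k];
        foreach (var (s, c) in partials)
            for (int i = 0; i < k; i++) { sums[i] += s[i]; counts[i] += c[i]; }
        return sums.Zip(counts, (s, c) => c > 0 ? s / c : 0.0).ToArray();
    }

    static bool Converged(double[] a, double[] b) =>
        a.Zip(b, (x, y) => Math.Abs(x - y)).Max() < 1e-6;
}
```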
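And here is a minimal sketch of the DryadLINQ programming model just described: AsDistributed() and RangePartition() (both named on the slides) turn local data into a partitioned DistributedQuery<T>, and ordinary LINQ operators over it are compiled into a Dryad dataflow graph. The namespace and exact signatures are assumptions, so treat this as illustrative rather than a verified build against the CTP.

```csharp
// Illustrative only: the Microsoft.Research.DryadLinq namespace and the
// exact operator signatures are assumed, not verified against the CTP.
using System;
using System.Linq;
using Microsoft.Research.DryadLinq;   // assumed CTP namespace

class DryadLinqSketch
{
    static void Main()
    {
        int[] data = Enumerable.Range(0, 1000000).ToArray();

        // Turn a local collection into a DistributedQuery<int> and
        // repartition it by key range across the cluster.
        var distributed = data
            .AsDistributed()
            .RangePartition(x => x);

        // Ordinary LINQ operators become a Dryad dataflow graph; here each
        // partition squares its elements and the partial sums are combined.
        long sumOfSquares = distributed
            .Select(x => (long)x * x)
            .Sum();

        Console.WriteLine(sumOfSquares);
    }
}
```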
SW-G bioinformatics application
Workload balance issue:
• SW-G tasks are inhomogeneous in CPU time.
• A skewed distribution of the input data causes an imbalanced workload distribution.
• Randomizing the distribution of the input data alleviates the issue.
• Static and dynamic optimization in Dryad/DryadLINQ.
[Figure: skewed/randomized execution-time ratio vs. standard deviation of sequence length (mean 400, standard deviations 0-250), for the previous Dryad and the new Dryad.]

Matrix-Matrix Multiplication
Parallel programming algorithms:
• Row split
• Row-column split
• Two-dimensional block decomposition (Fox algorithm)
Multi-core technologies in .NET: TPL, PLINQ, thread pool.
Hybrid parallel model: port multi-core parallelism into each Dryad task to improve performance (see the row-split sketch after the NIH Projects material below).
[Figure: execution times for Fox-DSC, RowColumn-DSC, and RowSplit-DSC, each with TPL, Thread, Task, and PLINQ in-task parallelism.]

PageRank
Grouped aggregation is a core primitive of many distributed programming models. It has two stages: 1) partition the data into groups by some key; 2) perform an aggregation over each group.
DryadLINQ provides two types of grouped aggregation:
• GroupBy(), without the partial-aggregation optimization
• GroupAndAggregate(), with partial aggregation (a sketch of the difference also follows the NIH Projects material below)
[Figure: PageRank execution time in seconds vs. number of AM files (320-1280), for GroupAndAggregate, TwoApplyPerPartition, OneApplyPerPartition, GroupBy, and HierarchicalAggregation.]

NIH Projects

Sequence Clustering
Pipeline: gene sequences → pairwise alignment & distance calculation → distance matrix → pairwise clustering (cluster indices) and multidimensional scaling (3D plot coordinates) → visualization.
• Pairwise alignment & distance calculation: Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity (MPI.NET implementation)
• Pairwise clustering: Chi-Square / Deterministic Annealing (MPI.NET implementation)
• Multidimensional scaling: MPI.NET implementation
• Visualization: C# desktop application based on VTK
* Note: the implementations of the Smith-Waterman and Needleman-Wunsch algorithms are from the Microsoft Biology Foundation library.

Scale-up Sequence Clustering with Twister
From N = 1 million gene sequences (e.g., eventually 25 million), select a reference sequence set of M = 100K. Pairwise alignment and distance calculation over the reference set builds an O(M x M) distance matrix, and multidimensional scaling (MDS), also O(M x M), produces the reference coordinates (x, y, z). The remaining N - M sequences (900K) are placed by interpolative MDS with pairwise distance calculation against the references, an O(M x (N - M)) step that yields the N - M coordinates (x, y, z). All coordinates are then visualized as a 3D plot. (A toy sketch of the interpolation step appears below.)

Services and Support
Web portal and metadata management; CGB work. // todo - Ryan
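Returning to the matrix-matrix multiplication slides above: the following runnable C# sketch illustrates the hybrid parallel model under the row-split decomposition, i.e. the block of rows a single Dryad task would own, with TPL's Parallel.For exploiting the cores within the node. It illustrates the decomposition only and is not the actual Dryad/DryadLINQ implementation.

```csharp
// Row-split decomposition with in-task TPL parallelism: each call to
// MultiplyRowBlock stands in for the work one Dryad vertex would receive.
using System;
using System.Threading.Tasks;

class RowSplitMatMulSketch
{
    // Multiply rows [rowStart, rowEnd) of A by B -- one vertex's share.
    static void MultiplyRowBlock(double[,] A, double[,] B, double[,] C,
                                 int rowStart, int rowEnd)
    {
        int n = B.GetLength(0), m = B.GetLength(1);
        // TPL parallelism inside the task (one node's cores).
        Parallel.For(rowStart, rowEnd, i =>
        {
            for (int j = 0; j < m; j++)
            {
                double sum = 0;
                for (int k = 0; k < n; k++) sum += A[i, k] * B[k, j];
                C[i, j] = sum;
            }
        });
    }

    static void Main()
    {
        int size = 512, tasks = 4;          // 4 row blocks ~ 4 vertices
        var A = new double[size, size];
        var B = new double[size, size];
        var C = new double[size, size];
        var rnd = new Random(0);
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
            {
                A[i, j] = rnd.NextDouble();
                B[i, j] = rnd.NextDouble();
            }

        int block = size / tasks;
        for (int t = 0; t < tasks; t++)     // each block = one Dryad task
            MultiplyRowBlock(A, B, C, t * block, (t + 1) * block);

        Console.WriteLine(C[0, 0]);         // touch the result
    }
}
```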
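The PageRank slides above contrast GroupBy() with GroupAndAggregate(). The following plain-LINQ simulation (the partitions and data are made up for illustration) shows why partial aggregation matters: the GroupBy-style plan shuffles every (key, value) record before aggregating, while the GroupAndAggregate-style plan pre-reduces each partition first, so fewer records cross partition boundaries; both produce the same answers.

```csharp
// Plain-LINQ simulation of grouped aggregation with and without the
// partial-aggregation optimization, on (page, partialRank) pairs as in
// PageRank. This models the data movement, not the DryadLINQ runtime.
using System;
using System.Collections.Generic;
using System.Linq;

class GroupedAggregationSketch
{
    static void Main()
    {
        // Two "partitions" of (page, partialRank) records.
        var partitions = new List<(string page, double rank)[]>
        {
            new (string page, double rank)[] { ("a", 0.1), ("b", 0.2), ("a", 0.3) },
            new (string page, double rank)[] { ("b", 0.4), ("a", 0.5), ("c", 0.6) },
        };

        // GroupBy()-style: all 6 raw records are shuffled, then aggregated.
        var naive = partitions.SelectMany(p => p)
            .GroupBy(r => r.page)
            .Select(g => (page: g.Key, rank: g.Sum(r => r.rank)));

        // GroupAndAggregate()-style: aggregate within each partition first
        // (partial aggregation), then combine the 5 partial records.
        var partial = partitions
            .Select(p => p.GroupBy(r => r.page)
                          .Select(g => (page: g.Key, rank: g.Sum(r => r.rank))))
            .SelectMany(p => p)
            .GroupBy(r => r.page)
            .Select(g => (page: g.Key, rank: g.Sum(r => r.rank)));

        foreach (var (page, rank) in naive.OrderBy(r => r.page))
            Console.WriteLine($"GroupBy            {page}: {rank}");
        foreach (var (page, rank) in partial.OrderBy(r => r.page))
            Console.WriteLine($"Partial aggregate  {page}: {rank}");
    }
}
```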
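For the scale-up pipeline just described, here is a toy sketch of the interpolation step: each out-of-sample sequence is placed using only its distances to the M already-embedded references, which is what keeps the cost at O(M) per point. The distance-weighted nearest-reference placement used here is a crude stand-in for the actual majorization-based interpolative MDS, and every name in it is hypothetical.

```csharp
// Toy interpolative-MDS placement: embed a new point from its distances
// to an already-embedded reference set. Crude stand-in, not the real
// SALSA interpolation algorithm.
using System;
using System.Linq;

class InterpolativeMdsSketch
{
    // Place one out-of-sample point from its distances to the references.
    static double[] Interpolate(double[] distToRefs, double[][] refCoords, int k = 3)
    {
        // k nearest references by input distance.
        var nearest = Enumerable.Range(0, distToRefs.Length)
            .OrderBy(i => distToRefs[i])
            .Take(k)
            .ToArray();

        int dim = refCoords[0].Length;
        var coord = new double[dim];
        double wSum = 0;
        foreach (int i in nearest)
        {
            double w = 1.0 / (distToRefs[i] + 1e-9);   // closer refs weigh more
            wSum += w;
            for (int d = 0; d < dim; d++) coord[d] += w * refCoords[i][d];
        }
        for (int d = 0; d < dim; d++) coord[d] /= wSum;
        return coord;
    }

    static void Main()
    {
        // Pretend full MDS already produced 3D coordinates for M = 4 refs.
        double[][] refCoords =
        {
            new[] { 0.0, 0.0, 0.0 }, new[] { 1.0, 0.0, 0.0 },
            new[] { 0.0, 1.0, 0.0 }, new[] { 0.0, 0.0, 1.0 },
        };
        // Distances from one new sequence to each reference.
        double[] dists = { 0.2, 0.9, 1.0, 1.1 };
        Console.WriteLine(string.Join(", ", Interpolate(dists, refCoords)));
    }
}
```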
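For reference before the GTM vs. MDS comparison that follows, these are the two objective functions named in the table, in their standard textbook forms (not transcribed from the slides):

```latex
% GTM: maximize the log-likelihood of the data under a constrained
% Gaussian-mixture latent-variable model with K latent points, K << N
\mathcal{L}(W,\beta) = \sum_{n=1}^{N} \ln\!\left[ \frac{1}{K} \sum_{k=1}^{K}
  \mathcal{N}\!\left(x_n \mid y(z_k; W),\, \beta^{-1} I\right) \right]

% MDS (SMACOF): minimize STRESS over the embedded coordinates X, where
% \delta_{ij} is the input dissimilarity and d_{ij}(X) the embedded
% Euclidean distance; SSTRESS uses squared distances in both places
\sigma(X) = \sum_{i<j} w_{ij} \left( d_{ij}(X) - \delta_{ij} \right)^2
```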
GTM vs. MDS

Purpose (both): non-linear dimension reduction; find an optimal configuration in a lower dimension; iterative optimization method.

                      GTM                       MDS (SMACOF)
Input                 Vector-based data         Non-vector (pairwise similarity matrix)
Objective function    Maximize log-likelihood   Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)            O(N^2)
Optimization method   EM                        Iterative Majorization (EM-like)

PlotViz
• Light-weight client
• Visualization algorithms: parallel dimension reduction algorithms
• Aggregated public databases: PubChem, Chem2Bio2RDF, DrugBank, CTD, QSAR

Education

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrates the concept of Science on Clouds using a FutureGrid cluster.
[Architecture diagram: dynamic cluster architecture with a monitoring infrastructure; SW-G runs using Hadoop on bare-system Linux, using Hadoop on Linux on Xen, and using DryadLINQ on bare-system Windows Server 2008; the XCAT infrastructure manages 32 iDataplex bare-metal nodes; the monitoring & control infrastructure links a monitoring interface through a pub/sub broker network to a summarizer and a switcher over the virtual/physical clusters.]
http://salsahpc.indiana.edu/b534
http://salsahpc.indiana.edu/b534projects