Paraprox: Pattern-Based Approximation for Data Parallel Applications
Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke
University of Michigan, March 2014
Electrical Engineering and Computer Science, Compilers Creating Custom Processors
Approximate Computing
• 100% accuracy is not always necessary
• Less work means better performance and lower power consumption
• There are many domains where approximate output is acceptable

Data Parallelism Is Everywhere
Financial modeling, games, medical imaging, physics simulation, image processing, audio processing, machine learning, statistics, video processing
• Mostly regular applications
• Work on large data sets
• Exact output is not required for operation
Good opportunity for automatic approximation

Approximating KMeans
[Animation: as approximate cluster centers drift from the exact centers, points near cluster boundaries become mislabeled.]
[Chart: mislabeling error (0-50%) vs. error in computing the clusters' centers (0-100%); mislabeling grows with center error.]
Approximating alone is not enough; we need a way to control the output quality.

Approximate Computing
• Ask the programmer to do it
  – Not easy or practical
  – Hard to debug
• Automatic approximation
  – One solution does not fit all
• Paraprox: pattern-based approximation
  – Pattern-specific approximation methods
  – Provides knobs to control the output quality

Common Patterns
• Map: image processing, finance, …
• Scatter/Gather: statistics, …
• Reduction: image processing, physics, …
• Scan: signal processing, physics, …
• Stencil: machine learning, physics, …
• Partitioning: machine learning, search, …
M. McCool et al., "Structured Parallel Programming: Patterns for Efficient Computation." Morgan Kaufmann, 2012.
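The mislabeling metric from the KMeans slides can be sketched as follows. This is a minimal one-dimensional Python illustration, not Paraprox code; the helper names `assign` and `mislabeling` are hypothetical. Each point is labeled by its nearest center, and the mislabeling error is the fraction of points whose label changes when approximate centers replace the exact ones.

```python
def assign(points, centers):
    # Label each point with the index of its nearest center (1-D distance).
    return [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points]

def mislabeling(points, exact_centers, approx_centers):
    # Fraction of points whose cluster label changes under approximate centers.
    exact = assign(points, exact_centers)
    approx = assign(points, approx_centers)
    return sum(e != a for e, a in zip(exact, approx)) / len(points)
```

For example, with points at 0, 4, and 10 and exact centers at 0 and 10, moving the second center to 6 flips the label of the middle point, giving a mislabeling error of one third. This is the quantity a quality-control knob would monitor.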
Paraprox Overview
A parallel program (OpenCL/CUDA) goes through Paraprox's pattern detection and approximation methods, which produce approximate kernels and tuning parameters; a runtime system then selects among them.

Approximate Memoization
[Diagram: the BlackScholes kernel's dataflow graph, with inputs S, X, T, R, and V feeding Sqrt, Div, Log, Exp, Mul, Add, Sub, and Cnd() operations that produce CallResult and PutResult. The entire computation is replaced by a lookup table: each input is quantized to q_i bits, the concatenated bits (q4 q3 q2 q1 q0) form the table address, and the table entry holds the float2 result (CallResult, PutResult).]

Flow:
1. Identify candidate functions
2. Find the table size
3. Determine q_i for each input
4. Fill the table
5. Check the quality
6. Execution

Candidate Functions
• Pure functions do not:
  – read or write any global or static mutable state
  – call an impure function
  – perform I/O
• In CUDA/OpenCL:
  – No global/shared memory access
  – No thread-ID-dependent computation

Table Size
[Chart: output quality vs. speedup for 64 KB, 32 KB, and 16 KB tables; larger tables preserve more quality, smaller tables run faster.]

How Many Bits per Input?
With a 32 KB table (15 address bits), Paraprox searches over how to split the address bits among the inputs. [Table: bit allocations for inputs A/B/C and the resulting output quality, e.g. 6/4/5 gives 91.3%, 5/5/5 gives 95.2%, and 4/7/4 gives 96.5%.] Inputs that do not need high precision get fewer bits. [Quantization levels: A with 5 bits has 32 levels, B with 6 bits has 64, C with 4 bits has 16.]
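The memoization flow above (quantize each input to q_i bits, concatenate the bits into a table address, precompute the table) can be sketched in Python. This is an illustrative sketch under stated assumptions, not the Paraprox implementation, and all function names are hypothetical; Paraprox performs the equivalent transformation on CUDA/OpenCL kernels and tunes the per-input bit counts against a quality target.

```python
import itertools

def quantize(x, lo, hi, bits):
    # Map x in [lo, hi] to an integer code with the given number of bits.
    levels = 1 << bits
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    return min(int(t * levels), levels - 1)

def dequantize(code, lo, hi, bits):
    # Representative input value for a code: the midpoint of its bucket.
    levels = 1 << bits
    return lo + (code + 0.5) / levels * (hi - lo)

def build_table(f, ranges, bits):
    # Precompute f for every combination of quantized inputs.
    # Table size is the product of 2^bits over all inputs.
    table = {}
    for codes in itertools.product(*[range(1 << b) for b in bits]):
        args = [dequantize(c, (r := rng)[0], r[1], b)
                for c, rng, b in zip(codes, ranges, bits)]
        table[codes] = f(*args)
    return table

def approx_call(table, ranges, bits, *args):
    # Replace a call to f by a table lookup on the quantized arguments.
    codes = tuple(quantize(a, lo, hi, b)
                  for a, (lo, hi), b in zip(args, ranges, bits))
    return table[codes]
```

With 6 bits per input the quantization step is 1/64 of the input range, so for a smooth function like multiplication the lookup stays within a fraction of a percent of the exact result; coarser allocations trade quality for a smaller, more cache-friendly table, which is exactly the knob the "How Many Bits per Input?" slide tunes.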
Tile Approximation
[Histogram: per-pixel difference from neighboring pixels, bucketed [0-10) through [90-100]; most pixels fall in the lowest buckets, i.e., they differ little from their neighbors.]

Stencil/Partitioning
A 3×3 stencil reads nine neighbors of element (i, j):
C = Input[i][j], W = Input[i][j-1], E = Input[i][j+1],
NW = Input[i-1][j-1], N = Input[i-1][j], NE = Input[i-1][j+1],
SW = Input[i+1][j-1], S = Input[i+1][j], SE = Input[i+1][j+1]
• Paraprox looks for global/texture/shared load accesses to arrays with affine addresses
• It controls the output quality by changing the number of accesses per tile
[Animation: the tile's loads are reduced step by step; first the center row (W, C, E) stands in for the bottom row, then for the top row as well, and finally the center value C stands in for all nine accesses.]
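The per-tile access reduction on the stencil slides can be illustrated with a 3×3 mean filter. This is a minimal Python sketch with hypothetical function names, not Paraprox output: the exact version issues nine loads per output pixel, while the approximate version reads only the center row (W, C, E) and reuses it for the other two rows, cutting the loads from nine to three. Because neighboring pixels are usually similar, the result changes little.

```python
def mean_filter_exact(img, i, j):
    # Average of the full 3x3 neighborhood: nine loads per output pixel.
    return sum(img[i + di][j + dj]
               for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0

def mean_filter_approx(img, i, j):
    # Tile approximation: the center row (W, C, E) stands in for all
    # three rows, so only three loads are issued per output pixel.
    row = sum(img[i][j + dj] for dj in (-1, 0, 1))
    return row * 3 / 9.0
```

On a smoothly varying image (e.g., a linear gradient) the two versions agree exactly; on real images the difference-with-neighbors histogram above is what bounds the error this substitution introduces.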
Scan / Prefix Sum
• Prefix sum: Output[i] = Input[0] + Input[1] + … + Input[i]
• Used for cumulative histograms, list ranking, …
• Data-parallel implementation:
  1. Divide the input into smaller subarrays
  2. Compute the prefix sum of each subarray in parallel

Data Parallel Scan
[Diagram, for an input of sixteen 1s split into four subarrays: Phase I scans each subarray in parallel, producing 1, 2, 3, 4 in each; Phase II scans the subarray totals (4, 4, 4, 4) into running sums (4, 8, 12, 16); Phase III adds each subarray's offset to its elements, yielding 1 through 16.]

Scan Approximation
[Diagram: output elements 0 to N; Paraprox predicts the offsets of later subarrays instead of computing every subarray's total exactly.]

Evaluation

Experimental Setup
• Compiler: Clang 3.3 with a CUDA driver; an AST visitor performs pattern detection, and an action generator rewrites the code into approximate kernels
• GPU: NVIDIA GTX 560
• CPU: Intel Core i7
• Benchmarks: NVIDIA SDK, Rodinia, …

Runtime System
The runtime checks output quality against a quality target and trades quality for speedup, in the spirit of Green [PLDI 2010] and SAGE [MICRO 2013].

Speedups for Both CPU and GPU
[Chart: CPU and GPU speedups at a 90% quality target for Cumulative Histogram, Mean Filter, Gaussian Filter, Convolution Separable, HotSpot, Kernel Density, Naïve Bayes, Image Denoising, Matrix Multiplication, BoxMuller, Gamma Correction, Quasirandom Generator, and BlackScholes, plus the geometric mean; most speedups fall between 1x and 5x, with one benchmark reaching 7.9x.]

One Solution Does Not Fit All!
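The three-phase data-parallel scan above, and the kind of shortcut a scan approximation takes, can be sketched in Python (sequentially here; on a GPU, Phases I and III run in parallel across subarrays). `scan_exact` follows the slides; `scan_approx` is a hypothetical simplification of the idea, computing only the first subarray's total and predicting the remaining offsets from it, which is accurate when the data is roughly uniform. Both assume the input length is a multiple of the chunk size.

```python
def scan_exact(data, chunk):
    # Phase I: prefix-scan each subarray independently (parallel on a GPU).
    parts = []
    for i in range(0, len(data), chunk):
        out, acc = [], 0
        for v in data[i:i + chunk]:
            acc += v
            out.append(acc)
        parts.append(out)
    # Phase II: scan the subarray totals to get each subarray's offset.
    offsets, acc = [], 0
    for p in parts:
        offsets.append(acc)
        acc += p[-1]
    # Phase III: add the offset to every element of its subarray.
    return [v + off for p, off in zip(parts, offsets) for v in p]

def scan_approx(data, chunk):
    # Approximation sketch: predict every subarray's total from the first
    # subarray's sum, so Phase III no longer waits on exact Phase II results.
    first_total = sum(data[:chunk])
    parts = []
    for i in range(0, len(data), chunk):
        out, acc = [], 0
        for v in data[i:i + chunk]:
            acc += v
            out.append(acc)
        parts.append(out)
    return [v + k * first_total for k, p in enumerate(parts) for v in p]
```

On the slides' example of sixteen 1s in chunks of four, both versions produce 1 through 16; the approximation only diverges when subarray sums differ, which is the error the runtime's quality check must catch.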
[Chart: Paraprox vs. loop perforation speedups (0 to 4x) for BlackScholes, Quasirandom Generator, Gamma Correction, BoxMuller, HotSpot, Gaussian Filter, Mean Filter, Cumulative Histogram, and the geometric mean.]

We Have Control on Output Quality
[Charts: speedup (1x to 5x) vs. output quality (100% down to 90%) for Kernel Density, Matrix Multiplication, Gaussian Filter, Quasirandom Generator, Convolution Separable, and BlackScholes; lowering the quality target raises the speedup.]

Distribution of Errors
[Chart: percentage of output elements vs. error (0-100%) for Cumulative Histogram, Gamma Correction, Matrix Multiplication, Image Denoising, Naïve Bayes, Kernel Density, Hotspot, Gaussian Filter, and Mean Filter; most output elements have small errors.]

Conclusion
• Manual approximation is not easy or practical; we need tools for approximation.
• One approximation method does not fit all applications.
• Using pattern-based approximation, Paraprox achieves a 2.6x speedup while maintaining 90% of the output quality.