Application-Specific Customization of Parameterized FPGA Soft-Core Processors

David Sheldon (a), Rakesh Kumar (b), Roman Lysecky (c), Frank Vahid (a, *), Dean Tullsen (b)
(a) Department of Computer Science and Engineering, University of California, Riverside
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
(*) Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

FPGA Soft-Core Processors
A soft-core processor is distributed as an HDL description, so its implementation is flexible and technology independent: the same description can target an FPGA or an ASIC.
[Figure: one HDL description mapped to several targets - Spartan 3, Virtex 2, and Virtex 4 FPGAs, or an ASIC.]

FPGA Soft-Core Processors
Soft-core processors can have configurable options: datapath units, cache, and bus architecture.
Current commercial FPGA soft-core processors include the Xilinx MicroBlaze and the Altera Nios.
[Figure: a microprocessor on an FPGA with optional FPU, MAC, and cache units.]

Goal
Tune an FPGA soft-core microprocessor for a given application.
[Figure: an application plus chosen parameter values feed microprocessor synthesis, producing a configured microprocessor with a given execution time and size on the FPGA.]

MicroBlaze - Xilinx FPGA Soft-Core
Instantiatable units: multiplier, barrel shifter, divider, FPU, cache.
Instantiating a unit does not necessarily yield the fastest design, because the added unit can lengthen the critical path.
There are significant size/performance tradeoffs among configurations.
[Charts: speedup (0-7x) of the base, full, and optimal MicroBlaze configurations over the benchmarks aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk, and their average; and application runtime (ms) versus size (equivalent LUTs) for configurations such as base, bs, mul, mul+bs, bs+cache, and mul+bs+cache.]

Problem
Fast exploration is needed: each synthesis run can take roughly 20-60 minutes.
This talk covers two approaches, then results:
Approach 1: Using traditional CAD techniques.
Approach 2: Synthesis-in-the-loop.
[Figure: exploration chooses parameter values, which feed microprocessor synthesis (~20-60 minutes) to produce the configured microprocessor.]

Constraints on Configurations
Size constraints may prevent use of all possible units.
[Figure: the multiplier, barrel shifter, divider, FPU, and cache compete for the MicroBlaze's area under a maximum area constraint.]

Approach 1: Traditional CAD Techniques
Create a model of the problem (slow, because it includes synthesis), then solve the model with extensive search heuristics (fast, considering thousands of configurations).
We model the problem as a 0-1 knapsack problem.

Approach 1: Traditional CAD Techniques
Creating the model: synthesize and run the base MicroBlaze, then the MicroBlaze with each unit (multiplier, barrel shifter, divider, FPU, cache) added individually, recording each configuration's size and performance. This yields a per-unit performance increment and size increment for the given application:

Unit             BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80

Approach 1: Traditional CAD Techniques
The 0-1 knapsack model:
An object's benefit = the unit's performance increment / size increment.
An object's weight = the unit's size.
The knapsack's size constraint = the FPGA size constraint.

Approach 1: Traditional CAD Techniques
We solved the 0-1 knapsack problem using established methods (Toth, P., "Dynamic Programming Algorithms for the Zero-One Knapsack Problem," Computing, 1980).
Running time: 6 MicroBlaze configuration synthesis runs to create the model, then O(n*p) to solve it, where n is the number of factors and p is the available area. Solving the model takes seconds, negligible compared to synthesis runtimes (~an hour each).
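The knapsack step can be made concrete with a short sketch. The Python fragment below is a minimal illustration of the 0-1 knapsack formulation described above, not the authors' actual tool: each unit becomes an object whose benefit is its perf/size value and whose weight is its size, and a dynamic program over the available area picks the best subset. The function name select_units and the LUT weights are hypothetical placeholders; only the benefit values come from the table above.

```python
# Minimal sketch of the 0-1 knapsack step (illustration only, not the authors' tool).
# Each unit is an object: benefit = perf increment / size increment (from the table
# above); weight = the unit's size. The LUT weights below are made-up placeholders.

def select_units(units, area_budget):
    """Classic dynamic program: O(n*p) for n units and p units of available area."""
    # best[w] = (highest total benefit achievable within area w, units chosen)
    best = [(0.0, frozenset())] * (area_budget + 1)
    for name, benefit, weight in units:
        # sweep the area downward so each unit is used at most once (0-1 knapsack)
        for w in range(area_budget, weight - 1, -1):
            candidate = best[w - weight][0] + benefit
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] | {name})
    return best[area_budget]

units = [               # (name, perf/size benefit, size in LUTs -- placeholder values)
    ("BS",    0.96,  400),
    ("FPU",   0.34, 1800),
    ("MUL",   0.63,  600),
    ("DIV",   0.93,  500),
    ("CACHE", 0.80, 1000),
]
print(select_units(units, area_budget=2000))   # best subset that fits in 2000 LUTs
```

The downward sweep over the remaining area is what makes this a 0-1 (rather than unbounded) knapsack, and the table of size area_budget+1 times n iterations gives the O(n*p) solve time quoted above.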
Approach 1: Traditional CAD Techniques
Problems with the model approach:
There are hundreds of target FPGAs, with different hard-core resources (multipliers, block RAM), so one model does not carry across devices.
The model must estimate size and performance when two or more units are combined. For example, a MUL speedup of 1.3 and a DIV speedup of 1.6 give an estimated MUL+DIV speedup of 1.9, but the real speedup may be 1.7.
Model inaccuracies may therefore be large.
Example devices (LUTs, PowerPC hard cores): XC2V2000 (21504, 0), XC2VP2 (2816, 0), XC4VLX80 (71680, 0), XC4VLX15 (12288, 0), XC2S300E (6140, 0), XC2V4000 (46080, 0), XC2VP40 (38784, 2), XC4VSX25 (20480, 0), XC4VSX35 (30720, 0), XC4VFX20 (17088, 1), XC2S150E (3456, 0), XC2VP30 (27392, 2), XC4VLX60 (53248, 0), XC2S600E (13824, 0), XC2VP20 (18560, 2), XC2V500 (6144, 0), XC2VPX70 (66176, 2), XC4VLX40 (36864, 0), XC2V6000 (67584, 0), XC4VFX60 (50560, 2), XC4VFX100 (84352, 2), XC2VP4 (6016, 1), XC2VP70 (66176, 2).

Approach 2: Synthesis-in-the-Loop
The problems with the traditional CAD approach: hundreds of target FPGAs, a model that must estimate size and performance for combinations of two or more units, and potentially large model inaccuracies.
The solution is synthesis in the loop: no abstract model; exploration is guided by actual size and performance data obtained by synthesizing and executing candidate configurations. The drawback is speed: each synthesis takes tens of minutes, so only a few configurations can be explored.

Approach 2: Synthesis-in-the-Loop
First, pre-analyze the units to guide the heuristic, using the same per-unit synthesis runs and calculations as when creating the knapsack model (the perf increment, size increment, and perf/size table above).

Approach 2: Synthesis-in-the-Loop
Build an "impact-ordered tree" structure. The tree is specific to the given application: sort the units by their perf/size impact.
Application-specific impact ordering: BS 0.96, DIV 0.93, CACHE 0.80, MUL 0.63, FPU 0.34.

Approach 2: Synthesis-in-the-Loop
Run a tree-based search heuristic with synthesis in the loop (sketched below). At each level of the tree, the next unit in impact order is either included or excluded, based on the size and performance measured by synthesizing and executing the candidate configuration.
Example outcome for one application: BS (0.96) useful - yes; DIV (0.93) - no; CACHE (0.80) - no; MUL (0.63) - yes; FPU (0.34) - no.
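A minimal sketch of this tree search follows, assuming Python; it is an interpretation of the slides rather than the authors' exact heuristic. The inclusion test used here (keep a unit only if the synthesized candidate fits the area budget and lowers the measured runtime) is an assumption, synthesize_and_run is a placeholder for the real synthesis-plus-execution flow, and the base configuration's size and runtime are assumed to be reused from the pre-analysis runs.

```python
# Minimal sketch of the impact-ordered, synthesis-in-the-loop search (an
# interpretation of the slides, not the authors' exact heuristic). The inclusion
# test -- keep a unit only if the candidate fits the area budget and lowers the
# measured runtime -- is an assumption made for illustration.

def synthesize_and_run(units):
    """Placeholder: synthesize a MicroBlaze with the given units, run the
    application, and return (size_in_LUTs, runtime_ms). Tens of minutes per call."""
    raise NotImplementedError("hook this up to the actual synthesis/execution flow")

def impact_ordered_search(units_by_impact, base_size, base_runtime, area_budget):
    # base_size/base_runtime are assumed to come from the pre-analysis runs,
    # so exploration synthesizes at most one candidate per unit (5 runs here).
    chosen, size, runtime = [], base_size, base_runtime
    for unit in units_by_impact:                 # e.g. ["BS", "DIV", "CACHE", "MUL", "FPU"]
        cand_size, cand_runtime = synthesize_and_run(chosen + [unit])
        if cand_size <= area_budget and cand_runtime < runtime:
            chosen, size, runtime = chosen + [unit], cand_size, cand_runtime   # include
        # otherwise exclude the unit and continue down the impact-ordered tree
    return chosen, size, runtime
```

Walking the units in impact order keeps exploration down to a handful of synthesis runs while still being guided by real size and performance data.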
Comparison of Approaches
Approach 1 (traditional CAD): 6 synthesis runs to build the model, an O(np) knapsack solution, and thousands of configurations examined during exploration.
Approach 2 (synthesis in the loop): 11 synthesis runs (6 for pre-analysis, 5 for exploration), with at most 5 configurations examined during exploration.

Results
10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk.
Average results are shown for the Virtex 2 Pro at a particular size constraint.
The knapsack approach is sub-optimal due to multi-unit estimation inaccuracy.
The application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime.
[Chart: tool run time (minutes, up to ~800) versus speedup (1 to 2.5) for the exhaustive, application-specific, and knapsack approaches.]

Results
Results were obtained for six different size constraints; the chart shows a second size constraint, with similar findings for all six constraints.
[Chart: tool run time (minutes, up to ~800) versus speedup (1 to 2.5) for the exhaustive, application-specific, and knapsack approaches.]

Results
The experiments were also run for a different FPGA, the Xilinx Spartan2, with similar findings.
[Chart: tool run time (minutes, up to ~300) versus speedup (1 to 1.6) for the exhaustive, application-specific, and knapsack approaches.]

Conclusions
The synthesis-in-the-loop approach outperformed the traditional CAD approach: better results at a slightly longer runtime.
The application-specific impact-ordered tree heuristic served well for the synthesis-in-the-loop approach.
Future work: extend the approach to highly-configurable soft-core processors, and to multiple processors competing for and/or sharing resources.

Questions?