A High Performance SoC: PKUnity™
Chen Jie
Peking University Microprocessor R&D Center
ICSoC2005, Aug 05

Contents
• PKUnity SoC Introduction
• PKUnity SoC Low Power Design

Introduction
[Figure: roadmap of frequency (MHz, 0 to 600) against year (00 to 06). Milestones: finish development platform; processor development; SoC component development; communication chip and router chip development; chip mass production. Labeled designs: UniCore16 processor, UniCore32 processor, UniCoreF64, PKUnity-1 SoC, PKUnity-2 SoC, PKUnity-3 SoC.]

PKUnity-3 Architecture
[Figure: PKUnity-3 architecture diagram, not captured in the transcript.]

UniCore fixed-point processor
• UniCore frequency: 600 MHz
• 32-bit Harvard-architecture RISC CPU
• UniCore32 instruction set compatible
• Adds conditional mov and BLX instructions
• 8-stage instruction pipeline
• Dynamic prediction policy: Gshare
• Pipelined I and D caches
• Two-level TLB

Performance Evaluation
• UniCore-II CPI increases by 10% to 15%
• Gshare prediction, the pipelined cache, and the two-level TLB limit the CPI increase caused by the deeper pipeline
[Figure: "CPI Increase" bar chart, UniCore-I vs. UniCore-II, CPI axis 0 to 7. Benchmarks: 164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 254.gap, 255.vortex, 256.bzip2, 300.twolf. Bar labels range from 1.684 to 6.0026; the pairing of values to bars is not recoverable from the transcript.]
• UniCore-II MIPS increases by 70% to 80%
• The performance improvement comes from improvements in micro-architecture and process technology
[Figure: "MIPS Increase" bar chart, series MIPS-unicore1 and MIPS-unicore2, MIPS axis 0 to 350, same ten benchmarks. Bar labels range from 61.22 to 295.94; the pairing of values to bars is not recoverable from the transcript.]

SoC Design Platform
• To build:
  – a chip-based infrastructure
  – an integrated development environment
  – a design and verification flow
• In PKUnity-3:
  – CPU configurable
  – Bus configurable
  – Interrupt system configurable
  – DMA configurable
  – Frequency configurable
  – Power management

Verification
• Coverage-oriented VERA verification flow
• SystemC-based HW/SW co-verification methodology
• FPGA prototype

Contents
• PKUnity SoC Introduction
• PKUnity SoC Low Power Design
  – Power research status
  – PKUnity low power design and power estimation
  – Future work

Power: New Challenge
– Power is a dramatic issue for SoCs with billions of transistors
– Power has to be reduced for portable devices that require a dramatic increase in computation power
– Deep submicron technologies (90 and 65 nm) will bring a dramatic increase in leakage power
– Power is still too high for most SoCs
– SoC architectures (HW/SW, multiprocessor, multiple memories) are not well supported by CAD tools
– Reconfigurability and flexibility compromise low power
– Leakage and very low Vdd are dramatic problems

LP Research Condition
Low power design technology and research topics, by level:
• Technology: feature size shrink, low dielectric constant material, SOI technology
• Circuit design: low power standard cell library
• Gate design: low power logic chain (gated clock, gated Vdd)
• RTL: reduce switching activity (gated clock, state machine and glitch optimization)
• Micro-arch: parallelism, pipelining, pre-computing
• Instruction: good task partition between HW and SW, low power instruction set design
• Compiler: saving power while improving performance, memory organization
• OS: dynamic voltage scaling, I/O devices, power and energy analysis of the OS
• Application: task partition, algorithm optimization

Power Estimation Research
Power Estimation Hierarchy
[Figure: abstraction levels System, Algorithm, Register Transfer, Logic, Circuit, arranged between an "Analysis Speed" axis and an "Analysis Precision" axis. Tools shown: SimplePower, Wattch, CACTI (high-level architectural model), PrimePower, PowerCompiler, HSPICE. Annotations: simulation vs. analysis; simulation with timing info; extract circuit parameters; adding technology info; gate-level simulation; analysis with extracted parameters.]

• Embedded processor: high performance vs. low power
• Three methods to reduce chip power:
  – Turn off unused modules
  – Frequency scaling
  – Turn off the PLL
• PKUnity-3 objective:
  – CPU < [value garbled in source]/600MHz
  – SoC < [value garbled in source]/600MHz
[Figure: "Power of PKUnity" bar chart, power (mW) axis 0 to 1800, for PKUnity1, PKUnity2, PKUnity3.]
[Figure: pie charts "PKUnity2 CPU Power" and "PKUnity3 SoC Power". Components: Unicore, FPU, CP1, CP0, FPU_reg, BIU, BIUIU, DCache, Icache, DMMU, IMMU. Slice labels: 2%, 2%, 1%, 3%, 0%, 0%, 1%, 14%, 2%, 38%, 37%; the pairing of labels to components is not recoverable from the transcript.]

Power Estimation
[Figure: estimation flow. Boxes: SPEC, RTL, TestBench, Executable File, VCS Simulation, SAIF file, PowerCompiler, Gate-level Netlist, Floorplan, CTS&Router, ECO, PrimePower, Power Report, Signoff.]

Power_hier.rpt (the transcript contains this report twice with identical values)
Operating conditions: typical
Library: typical
Wire load model mode: top
Global operating Voltage = 1.8

Hierarchy    Switch Power   Int Power   Leak Power   Total Power   %
Top_pad      208.267        161.355     6.83e+07     369.691       100
……
U_unity_1    11.482         91.724      6.08e+07     103.268       27.9
……
U_fpu        1.983          16.305      3.25e+07     18.321        5.0
……
U_unicore    0.711          3.572       2.20e+07     4.284         1.2

Why are they so different?

Power optimization
• Turn off unused modules through clock gating
• Reduce chip power by switching among multiple run modes
  – Run
  – Idle
  – Sleep
• Change chip frequency through dynamic PLL configuration
• Input vector control in execution components
[Figure: "Clock gating vs. non clock gating" bar chart, axis 0 to 60, series Non-gated and Gated. Components: Unicore, FPU, CP1, CP0, FPU_reg, BIU, BIUIU, DCache, Icache, DMMU, IMMU.]

Work Flow
Low power design and estimation flow
[Figure: flow diagram. Boxes: SPEC, LP microarch design, System power simulator, RTL, PC insert clock gating, Netlist, Gate-level Estimation, Floorplan, CTS&Router, ECO, Estimation with timing and load, Signoff.]

Future Work
Low Power Design
• Memory architecture (cache, TLB, register file)
• Clock system (synchronous vs. asynchronous)
• Bus system
• Instruction set selection
• Voltage and frequency scaling
• Compiler optimization
• Task movement
Power Estimation
• Pre-analyze architecture and micro-architecture designs with a fast and accurate architectural-level power simulator
• Build a full-chip power simulator
• Make the power simulator parameters reconfigurable
• Build an accurate leakage power estimation model
• Build component-specific power models

Thank you
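The UniCore slide names Gshare as the dynamic prediction policy but gives no parameters. As a rough illustration of the general technique only, the sketch below indexes a table of 2-bit saturating counters with the branch address XORed with a global history register. The table size, the counter initial value, the PC shift, and the names `GsharePredictor` and `accuracy` are all assumptions made for this example, not details of the UniCore design.

```python
class GsharePredictor:
    """Generic Gshare sketch: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits: int = 10) -> None:
        # index_bits is a hypothetical size; the slides do not give one.
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop the two low PC bits (assumes 4-byte instructions), then XOR
        # with the global history and keep index_bits bits.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor: GsharePredictor, trace) -> float:
    """Fraction of correctly predicted branches in a (pc, taken) trace."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# A single branch that alternates taken / not taken: a plain 2-bit counter
# keeps mispredicting this pattern, while the history bits let Gshare learn it.
trace = [(0x1000, i % 2 == 0) for i in range(200)]
result = accuracy(GsharePredictor(), trace)
```

On this synthetic trace the predictor mispredicts only during the warm-up of the history register, so `result` ends up above 0.9. Real workloads such as the SPEC programs in the charts would behave differently.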
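The rows of Power_hier.rpt can be cross-checked with a little arithmetic. The sketch below assumes that total power is the sum of the switch, internal, and leakage columns, and that the leakage column is in picowatts while the other columns are in milliwatts. The report excerpt does not state units, so this is an inference from the magnitudes rather than something the slides confirm. The function names are illustrative.

```python
# Rows copied from the Power_hier.rpt excerpt:
# (hierarchy, switch, internal, leakage, reported total, reported percent)
ROWS = [
    ("Top_pad",   208.267, 161.355, 6.83e7, 369.691, 100.0),
    ("U_unity_1",  11.482,  91.724, 6.08e7, 103.268,  27.9),
    ("U_fpu",       1.983,  16.305, 3.25e7,  18.321,   5.0),
    ("U_unicore",   0.711,   3.572, 2.20e7,   4.284,   1.2),
]

# ASSUMPTION: leakage is reported in pW, everything else in mW.
PW_PER_MW = 1e9


def recompute_total(switch_mw: float, internal_mw: float, leak_pw: float) -> float:
    """Total power in mW as switch + internal + leakage (converted from pW)."""
    return switch_mw + internal_mw + leak_pw / PW_PER_MW


def share_of_top(total: float, top_total: float) -> float:
    """Percentage of the top-level total consumed by one hierarchy block."""
    return 100.0 * total / top_total


def check_rows(rows=ROWS):
    """Return (name, recomputed total, reported total, recomputed percent) per row."""
    top_total = rows[0][4]
    return [
        (name, recompute_total(sw, internal, leak), total, share_of_top(total, top_total))
        for name, sw, internal, leak, total, _pct in rows
    ]


checked = check_rows()
```

Under these assumptions the recomputed totals for Top_pad, U_unity_1, and U_fpu agree with the reported totals to within about 0.002, and the recomputed percentages round to the reported ones. The U_unicore row differs by about 0.02, which the excerpt alone does not explain.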