Transcript (.ppt)
MAPG: Memory Access Power Gating
Kwangok Jeong, Andrew B. Kahng, Seokhyeong Kang, Tajana S. Rosing, and Richard Strong
University of California, San Diego

The Problem
• Relative increase of leakage power in advanced nodes
• Leakage cost even when cores stall
• Cores can stall quite often when accessing memory!
  – Memory access (L1->DDR3->L1) can take 80ns
  – Access latency grows as more threads contend for the memory resource
• Previous work
  – Fine-grain power gating of functional units: 4-6ns wake-up latency, but wastes leakage in other parts of the core
  – High-overhead core power gating: 100ms to wake up due to state restoration from main memory
1

Motivation: Stalled Core Energy Waste
• Leakage power percentage increasing in smaller technology nodes
Figure: stalled-core energy waste. In-order core: max energy waste 52.3%. EV6 core: max energy waste 39.1%.
2

Goals of This Work
• Power gate cores during long memory accesses to reduce leakage waste
• New mechanism:
  – Power-gate the whole core with minimum wake-up cost
  – Avoid long restore latency: keep the cache on, store key data in, e.g., retention FFs
• Low-overhead wake-up
  – Within the threshold latency to save energy during a memory access
• Satisfy voltage noise constraints
  – Active core voltage drop: < 5%
  – Idle core voltage drop: < 40%
  – PDN core peak current limits
3

Power Gating Introduction
Figure: logic block between Vdd_core and Vss with a sleep switch (pg_enable) and internal node Vdd_int; voltage and current vs. time over active, idle, and wake-up periods, without and with power gating
• Energy: A = stall energy, B = retention energy, C = wake-up
• Break-even point (minimum idle time) decreases in advanced technology nodes
4

Figure labels: power-gating controller with enable_few and enable_rest; rush current vs. time against Ilimit
• Best possible wake-up
• With two control signals, the current profile is programmable
• Optimal
wake-up profile requires complex control logic
PPGS Design: Wake-up Delay and Noise
• Safer wake-up with longer time
• PPGS: provides multiple wake-up modes subject to the Ilimit that may be used to charge core logic capacitance
5

PPGS: Programmable Power Gating Switch
Figure: PPGS circuit with mode inputs m[0]-m[9], outputs mout[0]-mout[9], and enable_rest; rush current (Ilimit up to 10·Ilimit) vs. time for enable_few and enable_rest
• Depending on the mode, the time difference between enable_few and enable_rest changes
  – Mode settings shown: m[0]=0, m[1-9]=1; m[0-1]=0, m[2-9]=1; …; m[0-9]=0
  – Time differences shown: t1/10 (Mode 10), t1/2 (Mode 2), t1 (Mode 1)
6

Core Model and PDN Analysis
• McPAT: core area, core transistor counts, core power
• ITRS 2009-2010 Update, McPAT: transistor capacitance, VDD
• Qcore = (Clogic + Cint) VDD
7

Core Wake-up Latency Results
• PPGS wake-up modes for a 32nm HP in-order core

  Mode          1      2      3      4      5      6     7     8     9     10
  Latency (ns)  14.16  13.14  12.12  11.11  10.09  9.08  8.06  7.05  6.03  5.02

• Core wake-up latency for varying system utilization
Figure: core wake-up latency (ns) vs. number of idle cores, for 2-, 4-, 6-, and 8-core systems
8

Core State Retention and Restoration
• Interface for power gating and data retention
Figure labels: head switch, Vdd switch control, controller (PPGS), retention domain, CORE, level shifter, collapsible domain, retention flip-flops (RET, D, PC), clock, retention control, pipeline-stage flip-flops (IF, ID, EX, MEM), I$/D$, architectural and misc. priv.
registers, WB flip-flops, Vdd(sram), SRAM register files, I$, D$, Vss
• Three power domains
  – Collapsible domain: supply voltage is disconnected during power gating
  – Retention domain: retain data with supply voltage
  – SRAM domain: source biasing during standby mode
9

WUC: Wake-up Controller
PPGS State Diagram
  States: Core Off; PPGS Waiting for WUC; PPGS Wakeup Mode Ready
  (1) Query WUC
  (2) WUC Provides Wake-up Mode
  (3) PG Mem Stall
  (4) Release Wake-up to WUC
  (5) WUC Updates All Core Wakeup Modes
10

MAPG-Controller
Controller Design
  if (cur_stall_cycles > latency_LLC-hit-response):
      latency_pred-stall = latency_row-buffer-miss + δ
      β = latency_stall – latency_pred-stall
      if (β < 0): δ = δ + β            (avoid future performance hit)
      else:       δ = 0.8δ + 0.2β      (adapt to increasing mem latency)
Power States: Active, Stalled, Power Gated
Figure: timeline (0 to 80 nanoseconds) with labels: Mem Request; Core Stalls; In-rush Current; Mem Response, Update δ; PPGS Power Gates Core; Restore State & Fill Pipeline; Core Saves State; MAPG Controller Predicted Wake-up
11

Methodology
• Tools
  – GEM5: architectural simulation
  – DRAMSIM2: memory hierarchy tool
  – McPAT: area and power analysis tool
  – HSPICE: core wake-up and PDN analysis
• Comparison points
  – FUPG: functional unit power gating
  – Oracle: PPGS with oracle core stall knowledge
  – MAPG-Counter: PPGS with practical controller
• System
  – 4 in-order cores @ 2GHz (2IALU, 1IMULT, 1FPALU)
  – 32KB-2way 0.5ns L1, 256KB-8way 4ns L2, 8MB-16way 13ns L3
  – DDR3 50ns memory
  – 32nm HP
• Benchmarks: SPEC2006
4-Aug-16 Your Name / Affiliation
12

Results: Energy Comparison
• Oracle saves 8.8% energy on average, up to 38% max
• MAPG saves 1.68X the energy savings of FUPG
Figure: energy savings per benchmark; labeled values: 38%, 21%, 14%, -0.2%, -2.0%
13

Results: Time Breakdown
• 0.08% average execution overhead
• 11% average power gate time (47% max for lbm)
• 0.6% and 1% average core restore and wake-up time
14

Conclusions
• Developed new power gating mechanism in between FUPG and long latency wake-up core power gating
  – Composed of PPGS,
WUC, and a predictive MAPG-Controller
• Modeled safe core wake-up latencies between 5.02ns and 14.16ns
• Showed oracle energy savings as high as 38%
• Demonstrated practical MAPG energy savings as high as 21%
• Currently, we are extending our work to:
  – Power gate out-of-order cores
  – Create a model for wake-up delay given core states and location
  – Analyze the benefits of staggered core wake-up
  – Apply our technique to thermal management
15

Thank You
16

Backup Slides
17

Methodology
• Core
  – ISA: DEC-Alpha
  – Model: EV4
  – Execution: In-order
  – Clock: 2.0GHz
  – Icache: 32kB-2way
  – Dcache: 32KB-2way
  – Width: 2
  – Functional units: 2IALU, 1IMULT, 1FPALU
• Memory Hierarchy
  – L2 cache: 256kB-8way, 4ns
  – L3 cache: 8MB-16way, 13ns
  – DDR3 latency: 50ns
  – DDR3 size: 2GB
• System Configuration
  – Total cores: 4, 8, 16
  – Tech node: 32nm
  – Private caches: L1, L2
  – Shared cache: L3
  – OS: Vanilla-Linux2.6.27
• Tools
  – GEM5: architectural simulation
  – DRAMSIM2: memory hierarchy tool
  – McPAT: area and power analysis tool
  – HSPICE: core wake-up and PDN analysis
18

Results: Energy Comparison
• Oracle: oracle knowledge of core stall periods
• TAP: Token-Based Adaptive Power Gating
• MAPG-Counter: Adaptive Stall Counter Mechanism
• FUPG: Function Unit Power Gating
• Wake-up mode: 8ns charge delay
• 0% performance hit
• TAP in-order: up to 25.26% energy savings
• TAP EV6: up to 23.18% energy savings
• 1.5% of max oracle savings
• 2x the energy savings of FUPG
Figure: energy savings per benchmark for the in-order core and the EV6 core
19

Summary
• PPGS provides a flexible mechanism for reducing core leakage power
• TAP provides core stall duration information to allow the PPGS to power gate
• WUC manages reliability constraints to prevent core logic corruption
• Power gating will be an important mechanism for providing energy-proportional processors
• Waking up a core from the power-gated state is possible in less than 10ns
20

PPGS: Programmable Power Gating Switch
• Wake-up time vs.
rush current: header case
  – During sleep mode, charges in all circuit nodes are discharged
  – During wake-up, all nodes need to be charged to the correct states
• The amount of charge depends on design size (not wake-up time)
• Fast (slow) wake-up -> large (small) rush current
Figure: rush current vs. time against Ilimit for slow wake-up, fast wake-up, and optimal wake-up; the area under each curve is the same
• How to make this waveform? It needs a good wake-up control technique with the fewest signals.
21

Power Gating/Wake-up Sequence
Figure: timing diagram across ACTIVE MODE, POWER DOWN, WAKE UP, RESTORE, and ACTIVE MODE, with numbered events 1-8 on the CLOCK, power down, retention, clamp, enable few, enable rest, and async-reset signals; power down trigger and power up trigger; intervals 1T, Tcharge, and Trestore; power down sequence and wake up sequence
  – Tcharge (between 4 & 5): charge time for the Vdd_int node (10-50 cycles)
  – Trestore: pipeline refill (6 cycles)
  – We exploit variable wake-up time at different system utilization levels
22

Safe Wake-up Mode Analysis
• Minimum wake-up time for EV6 16-core (SPICE simulation)
Figure: eight core-placement scenarios with minimum wake-up times: (a) 7.8ns, (b) 11.1ns, (c) 13.8ns, (d) 16.1ns, (e) 16.5ns, (f) 3.9ns, (g) 3.4ns, (h) 8.3ns
  A: critical active core; Wa: adjacent woken-up core; Wd: woken-up core in the diagonal position; Wn: non-adjacent woken-up core
• Minimum wake-up latency model:
    T = T_0 (w + \beta x + \gamma y + \delta z)^{\alpha}
  T_0, \alpha, \beta, \gamma, \delta: coefficients; w, x, y: number of Wa, Wd, Wn cores; z: number of woken-up cores at the edge
• Modeled wake-up latency and error from SPICE
23

T_0 Dependency
24

Staggered Wake-up
• Different start times between two woken-up cores
• Staggered wake-up can reduce wake-up time significantly
• Minimum wake-up latency with some interval time (delta)
25

TAP: Adapting to Memory Latency
• DDR3 access latency experiences variability from:
  – Bank queue length
  – Row buffer hit / row buffer miss
  – Channel contention
  – Refresh cycle
• TAP adapts to this variability via a two-step process:
  – On a last-level cache miss, TAP sends a token with
unknown ETA and PPGS power gates the core immediately.
  – The memory controller sends an updated ETA once the memory operation is scheduled; PPGS then schedules core wake-up.
26

Results: Core Execution Time Breakdown
• TAP has 0% performance impact, as all execution time is normalized to the original, unmodified execution time
• TAP power gates in-order and out-of-order cores for 16.23% and 8.25% of simulation time respectively, when averaged across all benchmarks
• TAP max power-gated time is 64.4% for lbm on the in-order core and 49.98% for mcf on the EV6 core
• TAP spends 5.62% and 1.20% of time waking up and restoring the in-order and EV6 cores respectively, when averaged across all benchmarks
Figure: execution time breakdown per benchmark for the EV6 core and the in-order core
27

Results: Energy Savings vs. Wake-up Latency
• Note: higher wake-up modes denote higher wake-up latency
• System: 4-core CMP
• As wake-up latency of the core increases from 2ns to 16ns, max energy savings decrease from:
  – TAP in-order: 31.5% to 22.3% (lbm)
  – TAP EV6: 25.8% to 20.1% (mcf)
• Greater than 20% energy savings even when core wake-up latency is 16ns
Figure: energy savings vs. wake-up latency for the in-order core and the EV6 core
28

Results: Staggered Wake-up Energy Savings
• System: 16-core EV6 CMP
• A stagger of 0.9ns can reduce wake-up latency by 7.7ns and improve energy savings from 18.92% to 22.06% for mcf
Figure: wake-up time (ns) for 8-core and 16-core systems; energy savings with no stagger and with 0.3ns, 0.6ns, and 0.9ns stagger
29

Results: Adapting to Memory Latency
• From 1 to 32 threads, average core stall latency increases from 36.77ns to 287.63ns
• TAP can increase core power-gated time from 10.12% to 34.23% of total execution time
Figure: average stall duration (nanoseconds) and core power gate time (%) vs. number of threads, for TAP and MAPG-Counter
30

Power Gating and Data Retention
• Interface
  – Three power domains
  – Vdd_int domain is collapsible during power gating
  – Vdd_core domain supplies retention latches
  – Vdd_sram domain supplies SRAM; source biasing can be used during standby mode
Figure labels: head switch, Vdd_core, enable_few, enable_rest, Vdd_sram, CORE, Vdd_int, retention flip-flops, level shifter, clamp, SRAM, clock, reset, controller, Vss
• Power gating/wake-up sequence
  – Tcharge (between 4 & 5): charge time for the Vdd_int node
  – Trestore: cycles for data restoration
  – We exploit variable wake-up time at different system utilization levels
Figure: timing diagram across ACTIVE MODE, POWER DOWN, WAKE UP, RESTORE, and ACTIVE MODE, with numbered events 1-8; power down sequence and wake up sequence
31

Power Gating Design: Enable Signal
• Enable signal topologies: single daisy chain, fish bone, star
Figure labels: enable, power gating switch, group of switches
• With two-signal wake-up, each cell needs to be controlled as fast as possible
  – Rush current is controlled by the time difference between enable_few and enable_rest, not by the topology
32

Core Modeling
Figure labels (modeling flow): Power Gating Strategies; Energy, Performance; IR-drop rule; Switch: Ron, Ioff; total gate cap., total interconnect cap., total charge (Q); #switches, rush current, wake-up time, energy, break-even point; Design / Tech Spec; Freq.: 2 GHz; #tr; ITRS; logic gate model (following ITRS MPU power/freq.
model); logic area, runtime dynamic power, peak dynamic power, leakage power; 7.8M; McPAT; device parameters: M1 half pitch, Lgate, Vdd, Cg,total, Jg,limit, Ioff
33

Safe Mode w.r.t. Location
• Multi-core wake-up case
Figure: grids of safe wake-up modes by core location (values between 5 and 8, mostly 7 and 8)
• The mode does not differ much w.r.t. location
• Regarding the turn-on location, is temperature analysis possible? (e.g., S. Reda's work)
34

Wake-up Time Control with WUC
• WUC determines the optimal wake-up mode and staggered offset for PPGS
• Packet interface between WUC and PPGS
  – Packet: 1. wake-up mode, 2. staggered offset
  – Packet: expected latency (memory miss / expected latency)
Figure labels: memory controller, CORE, PPGS, WUC
• Run-time (dynamic) wake-up scenario with core information
Figure: timeline for CORE 1 and CORE 2 showing ON, OFF, wake-up, and active phases (TON, TOFF, TWAKE per core), a memory miss on each core, the expected memory latency, the interval for staggered wake-up, and the wake-up mode update (T'WAKE,1) due to CORE 2; victim core labeled
35
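
The stall-prediction update on the MAPG-Controller slide can be sketched in Python as below, combined with the in-order PPGS mode latencies from the Core Wake-up Latency Results slide. This is only an illustrative sketch: the class name, the method names, the use of nanoseconds instead of cycles, and the rule that wake-up starts at the predicted stall end minus the wake-up latency are all assumptions. The slides give only the δ/β update and the mode table, not an implementation.

```python
from dataclasses import dataclass

# PPGS wake-up latency (ns) per mode for the 32nm HP in-order core,
# copied from the Core Wake-up Latency Results slide.
PPGS_LATENCY_NS = {
    1: 14.16, 2: 13.14, 3: 12.12, 4: 11.11, 5: 10.09,
    6: 9.08, 7: 8.06, 8: 7.05, 9: 6.03, 10: 5.02,
}


@dataclass
class StallPredictor:
    """Hypothetical holder for the MAPG-Controller prediction state."""

    row_buffer_miss_ns: float  # latency_row-buffer-miss on the slide
    delta_ns: float = 0.0      # adaptive correction term (delta)

    def predicted_stall_ns(self) -> float:
        # latency_pred-stall = latency_row-buffer-miss + delta
        return self.row_buffer_miss_ns + self.delta_ns

    def update(self, observed_stall_ns: float) -> float:
        """Apply the slide's update rule after a memory response; returns beta."""
        beta = observed_stall_ns - self.predicted_stall_ns()
        if beta < 0:
            # Memory answered earlier than predicted: pull the prediction
            # in by the full error to avoid a future performance hit.
            self.delta_ns += beta
        else:
            # Memory answered later: move toward the longer latency gradually.
            self.delta_ns = 0.8 * self.delta_ns + 0.2 * beta
        return beta

    def wakeup_start_ns(self, mode: int) -> float:
        # Assumption (not from the slides): begin charging early enough that
        # the core is awake when the predicted stall ends, never before t = 0.
        return max(0.0, self.predicted_stall_ns() - PPGS_LATENCY_NS[mode])
```

With a 50 ns row-buffer-miss latency, an observed 40 ns stall gives β = -10 and δ = -10, so the next prediction is 40 ns. A following 60 ns stall gives β = 20 and δ = 0.8·(-10) + 0.2·20 = -4. The asymmetry follows the slide's comments: early responses are corrected in full, late ones are smoothed.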