IBM Research
MPI for BG/L
George Almási
BG/L Day, Feb 6 2004
© 2004 IBM Corporation
Outline
- Preliminaries
- BG/L MPI software architecture
- Optimization framework
- Status & future direction

BG/L MPI: Who's Who
- Developers
  - BG/L port, library core: Charles Archer (Rochester), George Almasi, Xavier Martorell
  - Torus primitives: Philip Heidelberger, Nils Smeds
  - Tree primitives: Chris Erway, Burk Steinmacher
  - MPICH2 framework: Bill Gropp, Rusty Lusk, Brian Toonen, Rajeev Thakur, others (ANL)
- Performance analysis: Jesus Labarta (UPC), Nils Smeds, Bob Walkup, Gyan Bhanot, Frank Suits
- Performance testing: Kurt Pinnow, Joe Ratterman (Rochester)
- Functionality testing: Glenn Leckband, Jeff Garbisch (Rochester)
- Users: John Gunnels, BlueMatter team (Watson); John Dennis, Henry Tufo (NCAR); Adolfy Hoisie, Fabrizio Petrini, Darren Kerbyson (LANL); Meeta Sharma, Rahul Garg (IBM India); LLNL; Astron
- Enablers: system software group (you know who you are)

The BG/L MPI Design Effort
- Started off with constraints and ideas from everywhere, pulling in every direction:
  - Use algorithm X for HW feature Y
  - MPI package choice, battles over required functionality
  - Operating system and job-start management constraints
- 90% of the work was figuring out which ideas made immediate sense. Each idea was triaged: implement immediately; implement in the long term, but ditch for the first year; evaluate only when hardware becomes available; or forget it.
- Development framework established by January 2003
- Project grew alarmingly:
  - January 2003: 1 full-time developer + 1 postdoc + 1 summer student
  - January 2004: ~30 people (implementation, testing, performance)

MPICH2-Based BG/L Software Architecture
[Diagram: the MPI layer (pt2pt, datatypes, topo, collectives) sits on MPICH2's Abstract Device Interface. The bgltorus device implements the ADI through the Message Layer and Packet Layer, which drive the Torus, Tree and GI devices; CH3 (socket, MM) is an alternative device for other platforms. Process management: mpd, uniprocessor, and simple managers, connected through PMI and the CIO protocol.]
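To make the layering concrete, here is an illustrative sketch (not BG/L code) of what the message layer and packet layer do: one send queue per destination rank, and messages split into fixed-size torus packet payloads. The 256-byte maximum payload matches the BG/L torus packet size; the class and function names are hypothetical.

```python
PACKET_PAYLOAD = 256  # bytes; a BG/L torus packet carries at most 256 bytes

def packetize(data: bytes, payload: int = PACKET_PAYLOAD):
    """Split a message into fixed-size packet payloads (last one may be short)."""
    return [data[i:i + payload] for i in range(0, len(data), payload)]

class MessageLayer:
    """Toy model: per-destination send queues feeding a packet layer."""
    def __init__(self):
        self.send_queues = {}  # destination rank -> list of packetized messages

    def post_send(self, dest_rank: int, data: bytes):
        self.send_queues.setdefault(dest_rank, []).append(packetize(data))

ml = MessageLayer()
ml.post_send(1, b"x" * 1000)            # 1000 bytes -> 4 packets
print(len(ml.send_queues[1][0]))        # 4
```

A progress engine would then drain these queues into the torus FIFOs and dispatch incoming packets to the matching receive, as the Message Layer slide below shows.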
Architecture Detail: Message Layer
[Diagram: the Connection Manager maps each rank (rank 0 at (0,0,0), rank 1 at (0,0,1), ... rank n at (x,y,z)) to a per-rank send queue. The Progress Engine drives a Send Manager and a Dispatcher for receives. Each MPID_Request carries the message data, a send queue (msg1, msg2, ... msgP), the user buffer, an (un)packetizer, and protocol & state info.]

Performance Limiting Factors in the MPI Design

Hardware                        Value / behavior
Torus network link bandwidth    0.25 Bytes/cycle/link (theoretical), 0.22 effective; 12 * 0.22 = 2.64 Bytes/cycle/node
Network ordering and routing    Deterministic routing: in order, but bad torus performance. Adaptive routing: excellent network performance, but out-of-order packets. In-order semantics is expensive.
CPU/network interface           204 cycles to read a packet, 50-100 cycles to write a packet. Alignment restrictions: handling badly aligned data is expensive. Short FIFOs: the network needs frequent attention.
Streaming memory bandwidth      4.3 Bytes/cycle/CPU; memory copies are expensive
Dual-core setup                 Explicit software memory coherency, managed via the "blind device" and cache-flush primitives. Requires communication between the processors, best done in large chunks. The coprocessor cannot manage MPI data structures.
Tree network                    Only tree channel 1 is available to MPI
Operating system                CNK is single-threaded; MPICH2 is not thread safe. Context switches are expensive; interrupt-driven execution is slow.

Optimizing Short-Message Latency
- The thing to watch is overhead (CPU load), not bandwidth
- Network load is not a factor: there is not enough network traffic; the coprocessor does not help here
- Today: 1/2 nearest-neighbor roundtrip latency is about 3000 cycles, i.e. 6 us @ 500 MHz; within SOW specs @ 700 MHz
- Composition of roundtrip latency: HW 32%, high level (MPI) 26%, message layer 13%, per-packet overhead 29%
- Memory copies take care of alignment
- Deterministic routing ensures MPI in-order semantics; adaptive routing would double the message-layer overhead
- Balance here may change as we scale to 64k nodes
- Can improve 20-25% by shortening packets
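The latency figures above can be checked with back-of-the-envelope arithmetic; this small sketch just reproduces the slide's numbers:

```python
# Half of the nearest-neighbor roundtrip, converted from cycles to time.
cycles = 3000                  # ~1/2 nearest-neighbor roundtrip latency
clock_hz = 500e6               # 500 MHz CPU clock
latency_s = cycles / clock_hz
print(latency_s * 1e6)         # 6.0 (microseconds)

# Roundtrip-latency composition from the slide; shortening packets attacks
# the per-packet share, which is consistent with a 20-25% overall gain.
composition = {"hardware": 32, "MPI (high level)": 26,
               "message layer": 13, "per-packet overhead": 29}
print(sum(composition.values()))   # 100 (percent)
```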
Optimizing MPI for High Network Traffic (neighbor-to-neighbor communication)
- Most important thing to optimize: CPU per-packet overhead
  - At maximum torus utilization, only ~90 CPU cycles are available to prepare/handle a packet (coprocessor mode allows ~180 cycles/CPU/packet)
  - Sad (measured) reality: READ: 204 cycles, WRITE: 50-100 cycles, plus MPI overhead
- Coprocessor mode essential
  - Explicit cache management: ~5000 cycles/message; system support necessary (coprocessor library, scratchpad library); lingering RIT1 memory issues
- Adaptive routing essential
  - MPI in-order semantics achieved with an initial deterministically routed scout packet
- Packet overhead reduction
  - "Cooked" packets contain the destination address and assume an initial dialog (rendezvous)
  - Rendezvous costs 3000 cycles but saves ~100 cycles/packet, allows adaptively routed packets, and permits coprocessor mode
- Packet alignment issues handled with 0 memory copies: realignment is overlapped with reading from the torus
- Drawback: only works well for long messages (10 KBytes+)

Per-Node Asymptotic Bandwidth in MPI
[Two surface plots of per-node bandwidth (Bytes/cycle, 0-2) versus number of senders (0-6) and receivers (0-6): one for heater mode, one for coprocessor mode.]

The Cost of Packet Re-Alignment
[Plot: cycles (0-600) to read a packet from the torus into unaligned memory, versus alignment offset 0-15; curves for non-aligned receive, receive + copy, and the ideal case.]
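The per-packet cycle budget quoted above follows directly from the link numbers; this sketch assumes the maximum 256-byte torus packet payload:

```python
# Cycle budget per packet at full torus utilization (illustrative).
node_bw = 12 * 0.22            # 2.64 Bytes/cycle/node, from the factors table
packet_bytes = 256             # maximum torus packet payload
cycles_per_packet = packet_bytes / node_bw
print(round(cycles_per_packet))        # 97 -- roughly the ~90 cycles quoted

# Measured reality: reading a packet alone costs 204 cycles, so a single
# CPU cannot keep up; hence coprocessor mode and "cooked" packets that
# shave ~100 cycles each.
read_cost = 204
print(read_cost > cycles_per_packet)   # True
```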
Optimizing for High Network Traffic, Short Messages
- High network traffic: adaptive routing is an absolute necessity
- Short messages: cannot use the rendezvous protocol; CPU load is not a limiting factor, so the coprocessor does not help
- Situation not prevalent on the 8x8x8 network; will be one of the scaling problems (the network cross-section increases with n^2 while #CPUs increases with n^3; for nearest-neighbor traffic the cross-section is irrelevant)
- Message reordering solution: worst case up to 1000 cycles/packet; per-CPU bandwidth limited to 10% of nominal peak
- Flow control solution: quasi-sync protocol, with ack packets for each unordered message; only works for messages long enough that Tmsg > latency

MPI Communication Protocols
A mechanism to optimize MPI behavior based on communication requirements:

Protocol     Status     Routing    NN BW    Dyn BW   Latency   Copro.   Range
Eager        Deployed   Det.       High     Low      Good      No       0.2-10KB
Short        Deployed   Det.       Low      Low      V. good   No       0-240B
Rendezvous   Deployed   Adaptive   V. high  Max.     Bad       Yes      3KB-
Quasi-sync   Planned    Hybrid     Good     High     ?         No       0.5-3KB

MPI Communication Protocols and Their Uses
[Diagram: protocol choice as a function of message size, network load and CPU load; the eager protocol covers small messages and low load, quasi-sync the middle range, and the rendezvous protocol large messages and high network load.]

MPI in Virtual Node Mode
- Splits resources between the two CPUs: 50% each of memory and cache, 50% each of the torus hardware
- Tree channel 0 is used by CNK; tree channel 1 is shared by the CPUs
- Common memory: the scratchpad
- Virtual node mode is good for computationally intensive codes with a small memory footprint and small/medium network traffic
- Deployed; used by the BlueMatter team
[Chart: effect of L3 sharing on virtual node mode; NAS performance (MOps/s/processor), heater mode versus virtual node mode, for cg, ep, ft, lu, bt, sp.]
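The protocol table above can be read as a simple size-based dispatch. The sketch below is hypothetical: it models only the three deployed protocols, and since the table's eager and rendezvous ranges overlap between 3 KB and 10 KB, the crossover is assumed at 10 KB (the real choice would also weigh network load).

```python
def pick_protocol(message_bytes: int) -> str:
    """Hypothetical protocol selection mirroring the deployed table rows."""
    if message_bytes <= 240:           # "Short": 0-240 B, very good latency
        return "short"
    if message_bytes <= 10 * 1024:     # "Eager": deterministic routing
        return "eager"
    return "rendezvous"                # adaptive routing, coprocessor-capable

print(pick_protocol(100))      # short
print(pick_protocol(4096))     # eager
print(pick_protocol(65536))    # rendezvous
```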
Optimal MPI Task-to-Torus Mapping
- NAS BT has a 2D mesh communication pattern; how to map it onto a 3D mesh/torus?
- Solution: folding and inverting planes in the 3D mesh
- NAS BT scaling: computation scales down with n^-2, communication with n^-1
[Chart: NAS BT scaling in virtual node mode; per-CPU performance (MOps/s/CPU) versus number of processors (121 to 961), naive mapping versus optimized mapping.]

Optimizing MPI Collective Operations
- MPICH2 comes with default collective algorithms: functionally we are covered, but the defaults were written with Ethernet-like networks in mind and are not suitable for the torus topology
- Work has started on optimized collectives: broadcast and alltoall for the torus network; barrier, broadcast and allreduce for the tree network
- Work on testing for functionality and performance has just begun (Rochester performance testing team)

Broadcast on a Mesh (Torus)
- Based on ideas from Vernon Austel, John Gunnels, Phil Heidelberger, Nils Smeds; implemented & measured by Nils Smeds
[Diagram: pipelined broadcast phases labeled 4S+2R, 3S+2R, 2S+2R, 1S+2R, 0S+2R.]

Optimized Tree Collectives
- Implementation with Chris Erway & Burk Steinmacher; measurements from Kurt Pinnow
[Charts: tree integer allreduce bandwidth and tree broadcast bandwidth (Bytes/s) versus message size (8 Bytes to 4 MBytes) and processor count (32 to 512); allreduce bandwidth is flat at about 2.40E+08 Bytes/s, broadcast rises toward 2.5E+08 Bytes/s.]
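The plane-folding idea from the task-to-torus mapping slide can be illustrated in 2D with a hypothetical boustrophedon ("snake") layout: lay ranks out row by row and invert every other row, so consecutive ranks are always mesh neighbors. The optimized BG/L mappings apply the same fold-and-invert trick to planes of the 3D mesh; all names below are illustrative.

```python
def snake_mapping(x_dim: int, y_dim: int):
    """Map ranks 0..x_dim*y_dim-1 onto an x_dim x y_dim mesh, inverting odd rows."""
    coords = {}
    for rank in range(x_dim * y_dim):
        y, x = divmod(rank, x_dim)
        if y % 2 == 1:
            x = x_dim - 1 - x          # invert every other row
        coords[rank] = (x, y)
    return coords

def are_neighbors(a, b):
    """True if two mesh coordinates differ by one hop."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

m = snake_mapping(4, 4)
# Every pair of consecutive ranks lands on adjacent mesh nodes:
print(all(are_neighbors(m[r], m[r + 1]) for r in range(15)))   # True
```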
BG/L MPI: Status Today (2/6/2004)
- MPI-1 compliant; passes the large majority of the Intel/ANL MPI test suite
- Coprocessor mode available: 50-70% improvement in bandwidth; regularly tested; not fully deployed, hampered by BLC 1.0 bugs
- Virtual node mode available: deployed; not tested regularly
- User-defined process-to-torus mappings available
- Process management
- Optimized collectives:
  - Optimized torus broadcast: ready for deployment pending code review and optimizations
  - Optimized tree broadcast, barrier, allreduce: almost ready for deployment
- Functionality: OK. Performance: a good foundation.

Where Are We Going to Hurt Next?
Lessons from last year's problems:
- Coprocessor mode is a coding nightmare: overlapping computation with communication; the coprocessor cannot touch data without the main processor cooperating
- Excessive CPU load is hard to handle: even with the coprocessor, we still cannot handle 2.6 Bytes/cycle/node (yet)
- Flow control: unexpected messages slow reception down
- Alignment
Anticipating this year:
- 4 racks in the near (?) future; we don't anticipate major scaling problems
- CEO milestone at the end of the year
- We are up to 2^9 of 2^16 nodes. That's about halfway on a log scale. We have not hit any "unprecedented" sizes yet: LLNL can run MPI jobs on more machines than we have.
- Fear factor: the combination of a congested network and short messages

Conclusion
- We are in the middle of moving from functionality mode to a performance-centric mode
- We don't know yet how to run 64k MPI processes
- Rochester is taking over functionality and routine performance testing; teams in Watson & Rochester are collaborating on collective performance
- Imperative to keep the design fluid enough to counter surprises
- Establishing a large community for measuring and analyzing behavior
- A lot of performance work is still needed: new protocol(s); collectives on the torus and tree
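The "halfway on a log scale" remark about machine size can be checked numerically: 2^9 = 512 nodes out of a full 2^16 = 65536 is 9 of 16 doublings, i.e. a bit past halfway.

```python
import math

current, full = 2 ** 9, 2 ** 16
print(current, full)                           # 512 65536
print(math.log2(current) / math.log2(full))    # 0.5625 (9 of 16 doublings)
```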