The Computational Plant
9th ORAP Forum, Paris (CNRS)
Rolf Riesen, Sandia National Laboratories, Scalable Computing Systems Department
March 21, 2000
Tech Report SAND98-2221

Distributed and Parallel Systems

Distributed systems (heterogeneous):
• Gather (unused) resources
• Steal cycles
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared

Massively parallel systems (homogeneous):
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is the maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared

Massively Parallel Processors

Intel Paragon:
• 1,890 compute nodes
• 3,680 i860 processors
• 143/184 GFLOPS
• 175 MB/sec network
• SUNMOS lightweight kernel

Intel TeraFLOPS:
• 4,576 compute nodes
• 9,472 Pentium II processors
• 2.38/3.21 TFLOPS
• 400 MB/sec network
• Puma/Cougar lightweight kernel

Cplant Goals
• Production system
• Multiple users
• Scalable (easy-to-use buzzword)
• Large scale (proof of the above)
• General purpose for scientific applications (not a Beowulf dedicated to a single user)
• 1st step: TFLOPS look and feel for users

Cplant Strategy
• Hybrid approach combining commodity cluster technology with MPP technology
• Build on the design of the TFLOPS:
  – large systems should be built from independent building blocks
  – large systems should be partitioned to provide specialized functionality
  – large systems should have significant resources dedicated to system maintenance

Why Cplant?
• Modeling and simulation, essential to stockpile stewardship, require significant computing power
• Commercial supercomputers are a dying breed
• Pooling of SMPs is expensive and more complex
• The commodity PC market is closing the performance gap
• Web services and e-commerce are driving high-performance interconnect technology

Cplant Approach
• Emulate the ASCI Red environment
  – Partition model (functional decomposition)
  – Space sharing (reduce turnaround time)
  – Scalable services (allocator, loader, launcher)
  – Ephemeral user environment
  – Complete resource dedication
• Use existing software when possible
  – Red Hat distribution, Linux/Alpha
  – Software developed for ASCI Red

Conceptual Partition View
[Diagram: the machine is divided into File I/O, Service, Compute, Net I/O, /home, and System Support partitions. Users work in the Service partition; system administrators work in the System Support partition.]

User View
[Diagram: users rlogin to the name "alaska"; a load-balancing daemon directs each session to one of the service nodes alaska0 through alaska4. The service partition connects to the File I/O, Net I/O, and /home partitions. A minimal sketch of the load-balancing idea follows the Scalable Unit slide below.]

System Support Hierarchy
[Diagram: the top-level system support station sss1 provides admin access and holds the master copy of the system software. Each scalable unit has an sss0 that holds the in-use copy; the nodes of a scalable unit NFS-mount their root file systems from their sss0.]

Scalable Unit
[Diagram: a scalable unit contains compute and service nodes connected by two 16-port Myrinet switches, with 8 Myrinet LAN cables leaving the unit. A 100BaseT hub carries Ethernet, a terminal server carries the serial console lines, and a power controller handles power; all are tied to the unit's sss0 and to the system support network. Link types shown: power, serial, Ethernet, Myrinet.]
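The User View slide above mentions a load-balancing daemon that maps the cluster name "alaska" onto one of the service nodes alaska0 through alaska4. The slides do not show how that daemon works; the following is only a minimal sketch under the assumption of a simple round-robin policy, with hypothetical node names and no relation to the real Cplant daemon.

```c
/*
 * Minimal sketch of a login load-balancing policy, NOT the actual
 * Cplant daemon.  Assumes a fixed list of service nodes and a plain
 * round-robin choice; a real daemon could also consider node load.
 */
#include <stdio.h>

static const char *service_nodes[] = {
    "alaska0", "alaska1", "alaska2", "alaska3", "alaska4"
};
static const int num_nodes = sizeof(service_nodes) / sizeof(service_nodes[0]);

/* Return the service node that should receive the next login session. */
static const char *next_service_node(void)
{
    static int next = 0;                 /* rotates through the node list */
    const char *node = service_nodes[next];
    next = (next + 1) % num_nodes;
    return node;
}

int main(void)
{
    /* Simulate directing a few rlogin sessions aimed at "alaska". */
    for (int i = 0; i < 7; i++)
        printf("session %d -> %s\n", i, next_service_node());
    return 0;
}
```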
"Virtual Machines"
[Diagram: sss1 holds the SU configuration database and uses rdist to push system software down to the sss0 of each scalable unit. Scalable units are grouped into virtual machines such as Production, Alpha, and Beta; within each unit, the nodes NFS-mount root from their sss0, which holds the in-use copy of the system software.]

Runtime Environment
• yod - service node parallel job launcher
• bebopd - compute node allocator
• PCT - process control thread, compute node daemon
• pingd - compute node status tool
• fyod - independent parallel I/O

Phase I - Prototype (Hawaii)
• 128 Digital PWS 433a (Miata)
• 433 MHz Alpha 21164 CPU
• 2 MB L3 cache
• 128 MB ECC SDRAM
• 24 Myrinet dual 8-port SAN switches
• 32-bit, 33 MHz LANai-4 NIC
• Two 8-port serial cards per SSS0 for console access
• I/O - six 9 GB disks
• Compile server - 1 DEC PWS 433a
• Integrated by SNL

Phase II - Production (Alaska)
• 400 Digital PWS 500a (Miata)
• 500 MHz Alpha 21164 CPU
• 2 MB L3 cache, 192 MB RAM
• 16-port Myrinet switch
• 32-bit, 33 MHz LANai-4 NIC
• 6 DEC AS1200, 12 RAID (0.75 TByte) parallel file server
• 1 DEC AS4100 compile & user file server
• Integrated by Compaq
• 125.2 GFLOPS on MPLINPACK (350 nodes)
  – would place 53rd on the June 1999 Top 500

Phase III - Production (Siberia)
• 624 Compaq XP1000 (Monet)
• 500 MHz Alpha 21264 CPU
• 4 MB L3 cache
• 256 MB ECC SDRAM
• 16-port Myrinet switch
• 64-bit, 33 MHz LANai-7 NIC
• 1.73 TB disk I/O
• Integrated by Compaq and Abba Technologies
• 247.6 GFLOPS on MPLINPACK (572 nodes)
  – would place 40th on the Nov 1999 Top 500

Phase IV (Antarctica?)
• ~1350 DS10 Slates (NM + CA)
• 466 MHz EV6, 256 MB RAM
• Myrinet, 33 MHz 64-bit LANai 7.x
• Will be combined with Siberia for a ~1600-node system
• Red, black, green switchable

Myrinet Switch
• Based on a 64-port Clos switch
• 8x2 16-port switches in a 12U rack-mount case
• 64 LAN cables to nodes
• 64 SAN cables (96 links) to the mesh
[Diagram: a 16-port switch with 4 nodes attached.]

One Switch Rack = One Plane
• 4 Clos switches in one rack
• 256 nodes per plane (8 racks)
• Wrap-around in the x and y directions
• 128 + 128 links in the z direction
[Diagram: the y and z directions of a plane; the 4 nodes per x position are not shown.]

Cplant 2000: "Antarctica"
[Diagram: sections of the system are connected to the classified, unclassified, and open networks; compute nodes swing between red, black, or green. Wrap-around links, z links, and nodes are not shown.]

Cplant 2000: "Antarctica" cont.
• 1056 + 256 + 256 nodes = ~1600 nodes, 1.5 TFLOPS
• 320 "64-port" switches + 144 16-port switches from Siberia
• 40 + 16 system support stations

Portals
• Data movement layer from SUNMOS and PUMA
• Flexible building blocks for supporting many protocols
• Elementary constructs that support MPI semantics well
• Low-level message-passing layer (not a wire protocol)
• API intended for library writers, not application programmers
• Tech report SAND99-2959

Interface Concepts (illustrated by the sketch after this list)
• One-sided operations
  – Put and Get
• Zero-copy message passing
  – Increased bandwidth
• OS bypass
  – Reduced latency
• Application bypass
  – No polling, no threads
  – Reduced processor utilization
  – Reduced software complexity
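The interface concepts above can be pictured with a small model: the target pre-declares a memory region through a descriptor, the initiator's put deposits data directly into that region, and the target learns of the arrival later from an event queue rather than by posting a receive or polling. The sketch below is a simplified illustration with made-up types and function names; it is not the Portals 3.0 API (see tech report SAND99-2959 for the real interface).

```c
/*
 * Conceptual model of a one-sided put with application bypass.
 * All types and names here are illustrative only; the real Portals 3.0
 * API (SAND99-2959) differs in structure and detail.
 */
#include <stdio.h>
#include <string.h>

#define EQ_SIZE 8

typedef struct {                  /* records completed operations */
    size_t lengths[EQ_SIZE];
    int    count;
} event_queue_t;

typedef struct {                  /* describes a target memory region */
    void          *start;
    size_t         length;
    event_queue_t *eq;            /* optional event queue */
} mem_desc_t;

/* One-sided put: deposit data into the target's region and record an
 * event.  The target does not post a receive and never polls. */
static int model_put(mem_desc_t *target, const void *data, size_t len)
{
    if (len > target->length)
        return -1;                           /* descriptor rejects it */
    memcpy(target->start, data, len);        /* data lands in place   */
    if (target->eq && target->eq->count < EQ_SIZE)
        target->eq->lengths[target->eq->count++] = len;
    return 0;
}

int main(void)
{
    char buffer[64] = {0};
    event_queue_t eq = { .count = 0 };
    mem_desc_t md = { .start = buffer, .length = sizeof(buffer), .eq = &eq };

    model_put(&md, "hello from node 12", 19);    /* initiator side */

    /* Target side: inspect the event queue whenever convenient. */
    printf("events: %d, first length: %zu, data: %s\n",
           eq.count, eq.lengths[0], buffer);
    return 0;
}
```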
MPP Network: Paragon and TFLOPS
[Diagram: the network interface sits on the memory bus, next to the memory and the processors; a second processor can act as a message-passing or computational coprocessor.]

Commodity: Myrinet
[Diagram: the network is far from the memory; the NIC sits on the PCI bus behind a bridge, away from the processor and memory bus. OS bypass lets applications reach the NIC directly.]

"Must" Requirements
• Common protocols (MPI, system protocols)
• Portability
• Scalability to 1,000s of nodes
• High performance
• Multiple process access
• Heterogeneous processes (binaries)
• Runtime independence
• Memory protection
• Reliable message delivery
• Pairwise message ordering

"Will" Requirements
• Operational API
• Zero-copy MPI
• Myrinet
• Sockets implementation
• Unrestricted message size
• OS bypass, application bypass
• Put/Get
• Packetized implementations
• Receive uses start and length
• Receiver managed
• Sender managed
• Gateways
• Asynchronous operations
• Threads

"Should" Requirements
• No message alignment restrictions
• Striping over multiple channels
• Socket API
• Implement on ST
• Implement on VIA
• No consistency/coherency
• Ease of use
• Topology information

Portal Addressing
[Diagram: a portal table indexes into a match list; each match entry points to memory descriptors, which describe memory regions in application space and may reference an event queue. The operational boundary separates Portal API space from application space.]

Portal Address Translation
[Flowchart: for an incoming message, walk the match list. Get the next match entry; if it does not match and more entries remain, continue; if no entry matches, discard the message and increment the drop count. When an entry matches, check whether its first memory descriptor accepts the message; if not, the search continues. If it accepts, perform the operation, unlink the memory descriptor if requested, unlink the match entry if it is empty and marked for unlinking, record an event if an event queue is attached, and exit. A code sketch of this flow closes the transcript.]

Implementing MPI
• Short message protocol
  – Send message (expect receipt)
  – Unexpected messages
• Long message protocol
  – Post receive
  – Send message
  – On ACK or Get, release message
• Event includes the memory descriptor

Implementing MPI
[Diagram: the MPI receive match list holds pre-posted receives ("match none" until posted), followed by "match any" entries with short, unlink buffers for unexpected messages and an attached event queue, and a final "match any" entry of length 0 that truncates and sends no ACK.]

Flow Control
• Basic flow control
  – Drop messages that the receiver is not prepared for
  – Long messages might waste network resources
  – Good performance for well-behaved MPI apps
• Managing the network - packets
  – Packet size big enough to hold a short message
  – First packet is an implicit RTS
  – Flow control ACK can indicate that the message will be dropped

Portals 3.0 Status
• Currently testing Cplant Release 0.5
  – Portals 3.0 kernel module using the RTS/CTS module over Myrinet
  – Port of MPICH 1.2.0 over Portals 3.0
• TCP/IP reference implementation ready
• Port to LANai begun

http://www.cs.sandia.gov/cplant
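The Portal Address Translation flowchart described earlier boils down to one pass over the match list. As a closing illustration, here is a compilable sketch of that control flow using simplified, made-up types (an integer tag in place of match bits, a plain buffer in place of a full memory descriptor); it follows the decisions in the flowchart but is not the actual Portals 3.0 implementation.

```c
/*
 * Sketch of the portal address translation flow with simplified types.
 * It mirrors the flowchart in the slides; it is NOT the real Portals code.
 */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef struct {                 /* simplified memory descriptor */
    char   *start;
    size_t  length;
    bool    unlink_on_use;
    bool    has_event_queue;
} mem_desc_t;

typedef struct match_entry {     /* simplified match list entry */
    int                 tag;     /* stands in for the match bits  */
    bool                match_any;
    mem_desc_t         *md;
    bool                unlink_if_empty;
    struct match_entry *next;
} match_entry_t;

static unsigned long drop_count;

static void record_event(const mem_desc_t *md, size_t len)
{
    printf("event: %zu bytes landed in MD of size %zu\n", len, md->length);
}

/* Walk the match list for one incoming message. */
static void portal_translate(match_entry_t *list, int tag,
                             const char *body, size_t len)
{
    for (match_entry_t *me = list; me != NULL; me = me->next) {
        if (!me->match_any && me->tag != tag)
            continue;                         /* no match: next entry   */
        mem_desc_t *md = me->md;
        if (md == NULL || len > md->length)
            continue;                         /* MD rejects: keep going */

        memcpy(md->start, body, len);         /* perform the operation  */
        if (md->unlink_on_use)
            me->md = NULL;                    /* unlink the descriptor  */
        if (me->unlink_if_empty && me->md == NULL) {
            me->match_any = false;            /* "unlink" the entry     */
            me->tag = -1;
        }
        if (md->has_event_queue)
            record_event(md, len);            /* deliver completion     */
        return;
    }
    drop_count++;                             /* no taker: discard      */
    printf("message with tag %d dropped (drop count %lu)\n", tag, drop_count);
}

int main(void)
{
    char posted[32], unexpected[32];
    mem_desc_t md0 = { posted,     sizeof posted,     true, true };
    mem_desc_t md1 = { unexpected, sizeof unexpected, true, true };

    match_entry_t any   = { 0, true,  &md1, true, NULL };   /* catch-all  */
    match_entry_t exact = { 7, false, &md0, true, &any };   /* pre-posted */

    portal_translate(&exact, 7, "expected payload", 17);    /* matches exact */
    portal_translate(&exact, 9, "unexpected payload", 19);  /* falls to any  */
    portal_translate(&exact, 9, "second unexpected", 18);   /* dropped       */
    return 0;
}
```

The example keeps the same ordering as the flowchart: perform the operation, unlink the descriptor and (if empty) the entry, then record the event; a message that matches nothing, or whose descriptors reject it, is discarded and counted.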