IBM Research
MPI for BG/L
George Almási
BG/L Day, Feb 6 2004
© 2004 IBM Corporation
Outline
- Preliminaries
- BG/L MPI software architecture
- Optimization framework
- Status & future direction

BG/L MPI: Who's Who
- Developers
  - BG/L port, library core: Charles Archer (Rochester), George Almasi, Xavier Martorell
  - Torus primitives: Philip Heidelberger, Nils Smeds
  - Tree primitives: Chris Erway, Burk Steinmacher
  - MPICH2 framework: Bill Gropp, Rusty Lusk, Brian Toonen, Rajeev Thakur, others (ANL)
- Performance analysis: Jesus Labarta (UPC), Nils Smeds, Bob Walkup, Gyan Bhanot, Frank Suits
- Performance testing: Kurt Pinnow, Joe Ratterman (Rochester)
- Functionality testing: Glenn Leckband, Jeff Garbisch (Rochester)
- Users: John Gunnels, BlueMatter team (Watson); John Dennis, Henry Tufo (NCAR); Adolfy Hoisie, Fabrizio Petrini, Darren Kerbyson (LANL); Meeta Sharma, Rahul Garg (IBM India); LLNL; Astron
- Enablers: system software group (you know who you are)

The BG/L MPI Design Effort
- Started off with constraints and ideas from everywhere, pulling in every direction:
  - Use algorithm X for HW feature Y
  - MPI package choice, battles over required functionality
  - Operating system and job-start management constraints
- 90% of the work was figuring out which ideas made immediate sense. Each idea was triaged: implement immediately; implement in the long term, but ditch for the first year; evaluate only when hardware becomes available; or forget it.
- Development framework established by January 2003
- Project grew alarmingly:
  - January 2003: 1 full-time developer + 1 postdoc + 1 summer student
  - January 2004: ~30 people (implementation, testing, performance)

MPICH2-Based BG/L Software Architecture
[Diagram: the MPI layer (pt2pt, datatypes, topo, collectives) sits on MPICH2's Abstract Device Interface. The bgltorus device implements the ADI through the Message Layer and Packet Layer, which drive the Torus, Tree and GI devices; CH3 (socket, MM) is an alternative device for other platforms. Process management: mpd, uniprocessor, and simple managers, connected through PMI and the CIO protocol.]
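To make the layering concrete, here is an illustrative sketch (not BG/L code) of what the message layer and packet layer do: one send queue per destination rank, and messages split into fixed-size torus packet payloads. The 256-byte maximum payload matches the BG/L torus packet size; the class and function names are hypothetical.

```python
PACKET_PAYLOAD = 256  # bytes; a BG/L torus packet carries at most 256 bytes

def packetize(data: bytes, payload: int = PACKET_PAYLOAD):
    """Split a message into fixed-size packet payloads (last one may be short)."""
    return [data[i:i + payload] for i in range(0, len(data), payload)]

class MessageLayer:
    """Toy model: per-destination send queues feeding a packet layer."""
    def __init__(self):
        self.send_queues = {}  # destination rank -> list of packetized messages

    def post_send(self, dest_rank: int, data: bytes):
        self.send_queues.setdefault(dest_rank, []).append(packetize(data))

ml = MessageLayer()
ml.post_send(1, b"x" * 1000)            # 1000 bytes -> 4 packets
print(len(ml.send_queues[1][0]))        # 4
```

A progress engine would then drain these queues into the torus FIFOs and dispatch incoming packets to the matching receive, as the Message Layer slide below shows.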
Architecture Detail: Message Layer
[Diagram: the Connection Manager maps each rank (rank 0 at (0,0,0), rank 1 at (0,0,1), ... rank n at (x,y,z)) to a per-rank send queue. The Progress Engine drives a Send Manager and a Dispatcher for receives. Each MPID_Request carries the message data, a send queue (msg1, msg2, ... msgP), the user buffer, an (un)packetizer, and protocol & state info.]

Performance Limiting Factors in the MPI Design

Hardware                        Value / behavior
Torus network link bandwidth    0.25 Bytes/cycle/link (theoretical), 0.22 effective; 12 * 0.22 = 2.64 Bytes/cycle/node
Network ordering and routing    Deterministic routing: in order, but bad torus performance. Adaptive routing: excellent network performance, but out-of-order packets. In-order semantics is expensive.
CPU/network interface           204 cycles to read a packet, 50-100 cycles to write a packet. Alignment restrictions: handling badly aligned data is expensive. Short FIFOs: the network needs frequent attention.
Streaming memory bandwidth      4.3 Bytes/cycle/CPU; memory copies are expensive
Dual-core setup                 Explicit software memory coherency, managed via the "blind device" and cache-flush primitives. Requires communication between the processors, best done in large chunks. The coprocessor cannot manage MPI data structures.
Tree network                    Only tree channel 1 is available to MPI
Operating system                CNK is single-threaded; MPICH2 is not thread safe. Context switches are expensive; interrupt-driven execution is slow.

Optimizing Short-Message Latency
- The thing to watch is overhead (CPU load), not bandwidth
- Network load is not a factor: there is not enough network traffic; the coprocessor does not help here
- Today: 1/2 nearest-neighbor roundtrip latency is about 3000 cycles, i.e. 6 us @ 500 MHz; within SOW specs @ 700 MHz
- Composition of roundtrip latency: HW 32%, high level (MPI) 26%, message layer 13%, per-packet overhead 29%
- Memory copies take care of alignment
- Deterministic routing ensures MPI in-order semantics; adaptive routing would double the message-layer overhead
- Balance here may change as we scale to 64k nodes
- Can improve 20-25% by shortening packets
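The latency figures above can be checked with back-of-the-envelope arithmetic; this small sketch just reproduces the slide's numbers:

```python
# Half of the nearest-neighbor roundtrip, converted from cycles to time.
cycles = 3000                  # ~1/2 nearest-neighbor roundtrip latency
clock_hz = 500e6               # 500 MHz CPU clock
latency_s = cycles / clock_hz
print(latency_s * 1e6)         # 6.0 (microseconds)

# Roundtrip-latency composition from the slide; shortening packets attacks
# the per-packet share, which is consistent with a 20-25% overall gain.
composition = {"hardware": 32, "MPI (high level)": 26,
               "message layer": 13, "per-packet overhead": 29}
print(sum(composition.values()))   # 100 (percent)
```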
Optimizing MPI for High Network Traffic (neighbor-to-neighbor communication)
- Most important thing to optimize: CPU per-packet overhead
  - At maximum torus utilization, only ~90 CPU cycles are available to prepare/handle a packet (coprocessor mode allows ~180 cycles/CPU/packet)
  - Sad (measured) reality: READ: 204 cycles, WRITE: 50-100 cycles, plus MPI overhead
- Coprocessor mode essential
  - Explicit cache management: ~5000 cycles/message; system support necessary (coprocessor library, scratchpad library); lingering RIT1 memory issues
- Adaptive routing essential
  - MPI in-order semantics achieved with an initial deterministically routed scout packet
- Packet overhead reduction
  - "Cooked" packets contain the destination address and assume an initial dialog (rendezvous)
  - Rendezvous costs 3000 cycles but saves ~100 cycles/packet, allows adaptively routed packets, and permits coprocessor mode
- Packet alignment issues handled with 0 memory copies: realignment is overlapped with reading from the torus
- Drawback: only works well for long messages (10 KBytes+)

Per-Node Asymptotic Bandwidth in MPI
[Two surface plots of per-node bandwidth (Bytes/cycle, 0-2) versus number of senders (0-6) and receivers (0-6): one for heater mode, one for coprocessor mode.]

The Cost of Packet Re-Alignment
[Plot: cycles (0-600) to read a packet from the torus into unaligned memory, versus alignment offset 0-15; curves for non-aligned receive, receive + copy, and the ideal case.]
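The per-packet cycle budget quoted above follows directly from the link numbers; this sketch assumes the maximum 256-byte torus packet payload:

```python
# Cycle budget per packet at full torus utilization (illustrative).
node_bw = 12 * 0.22            # 2.64 Bytes/cycle/node, from the factors table
packet_bytes = 256             # maximum torus packet payload
cycles_per_packet = packet_bytes / node_bw
print(round(cycles_per_packet))        # 97 -- roughly the ~90 cycles quoted

# Measured reality: reading a packet alone costs 204 cycles, so a single
# CPU cannot keep up; hence coprocessor mode and "cooked" packets that
# shave ~100 cycles each.
read_cost = 204
print(read_cost > cycles_per_packet)   # True
```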
Optimizing for High Network Traffic, Short Messages
- High network traffic: adaptive routing is an absolute necessity
- Short messages: cannot use the rendezvous protocol; CPU load is not a limiting factor, so the coprocessor does not help
- Situation not prevalent on the 8x8x8 network; will be one of the scaling problems (the network cross-section increases with n^2 while #CPUs increases with n^3; for nearest-neighbor traffic the cross-section is irrelevant)
- Message reordering solution: worst case up to 1000 cycles/packet; per-CPU bandwidth limited to 10% of nominal peak
- Flow control solution: quasi-sync protocol, with ack packets for each unordered message; only works for messages long enough that Tmsg > latency

MPI Communication Protocols
A mechanism to optimize MPI behavior based on communication requirements:

Protocol     Status     Routing    NN BW    Dyn BW   Latency   Copro.   Range
Eager        Deployed   Det.       High     Low      Good      No       0.2-10KB
Short        Deployed   Det.       Low      Low      V. good   No       0-240B
Rendezvous   Deployed   Adaptive   V. high  Max.     Bad       Yes      3KB-
Quasi-sync   Planned    Hybrid     Good     High     ?         No       0.5-3KB

MPI Communication Protocols and Their Uses
[Diagram: protocol choice as a function of message size, network load and CPU load; the eager protocol covers small messages and low load, quasi-sync the middle range, and the rendezvous protocol large messages and high network load.]

MPI in Virtual Node Mode
- Splits resources between the two CPUs: 50% each of memory and cache, 50% each of the torus hardware
- Tree channel 0 is used by CNK; tree channel 1 is shared by the CPUs
- Common memory: the scratchpad
- Virtual node mode is good for computationally intensive codes with a small memory footprint and small/medium network traffic
- Deployed; used by the BlueMatter team
[Chart: effect of L3 sharing on virtual node mode; NAS performance (MOps/s/processor), heater mode versus virtual node mode, for cg, ep, ft, lu, bt, sp.]
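The protocol table above can be read as a simple size-based dispatch. The sketch below is hypothetical: it models only the three deployed protocols, and since the table's eager and rendezvous ranges overlap between 3 KB and 10 KB, the crossover is assumed at 10 KB (the real choice would also weigh network load).

```python
def pick_protocol(message_bytes: int) -> str:
    """Hypothetical protocol selection mirroring the deployed table rows."""
    if message_bytes <= 240:           # "Short": 0-240 B, very good latency
        return "short"
    if message_bytes <= 10 * 1024:     # "Eager": deterministic routing
        return "eager"
    return "rendezvous"                # adaptive routing, coprocessor-capable

print(pick_protocol(100))      # short
print(pick_protocol(4096))     # eager
print(pick_protocol(65536))    # rendezvous
```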
Optimal MPI Task-to-Torus Mapping
- NAS BT has a 2D mesh communication pattern; how to map it onto a 3D mesh/torus?
- Solution: folding and inverting planes in the 3D mesh
- NAS BT scaling: computation scales down with n^-2, communication with n^-1
[Chart: NAS BT scaling in virtual node mode; per-CPU performance (MOps/s/CPU) versus number of processors (121 to 961), naive mapping versus optimized mapping.]

Optimizing MPI Collective Operations
- MPICH2 comes with default collective algorithms: functionally we are covered, but the defaults were written with Ethernet-like networks in mind and are not suitable for the torus topology
- Work has started on optimized collectives: broadcast and alltoall for the torus network; barrier, broadcast and allreduce for the tree network
- Work on testing for functionality and performance has just begun (Rochester performance testing team)

Broadcast on a Mesh (Torus)
- Based on ideas from Vernon Austel, John Gunnels, Phil Heidelberger, Nils Smeds; implemented & measured by Nils Smeds
[Diagram: pipelined broadcast phases labeled 4S+2R, 3S+2R, 2S+2R, 1S+2R, 0S+2R.]

Optimized Tree Collectives
- Implementation with Chris Erway & Burk Steinmacher; measurements from Kurt Pinnow
[Charts: tree integer allreduce bandwidth and tree broadcast bandwidth (Bytes/s) versus message size (8 Bytes to 4 MBytes) and processor count (32 to 512); allreduce bandwidth is flat at about 2.40E+08 Bytes/s, broadcast rises toward 2.5E+08 Bytes/s.]
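The plane-folding idea from the task-to-torus mapping slide can be illustrated in 2D with a hypothetical boustrophedon ("snake") layout: lay ranks out row by row and invert every other row, so consecutive ranks are always mesh neighbors. The optimized BG/L mappings apply the same fold-and-invert trick to planes of the 3D mesh; all names below are illustrative.

```python
def snake_mapping(x_dim: int, y_dim: int):
    """Map ranks 0..x_dim*y_dim-1 onto an x_dim x y_dim mesh, inverting odd rows."""
    coords = {}
    for rank in range(x_dim * y_dim):
        y, x = divmod(rank, x_dim)
        if y % 2 == 1:
            x = x_dim - 1 - x          # invert every other row
        coords[rank] = (x, y)
    return coords

def are_neighbors(a, b):
    """True if two mesh coordinates differ by one hop."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

m = snake_mapping(4, 4)
# Every pair of consecutive ranks lands on adjacent mesh nodes:
print(all(are_neighbors(m[r], m[r + 1]) for r in range(15)))   # True
```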
BG/L MPI: Status Today (2/6/2004)
- MPI-1 compliant; passes the large majority of the Intel/ANL MPI test suite
- Coprocessor mode available: 50-70% improvement in bandwidth; regularly tested; not fully deployed, hampered by BLC 1.0 bugs
- Virtual node mode available: deployed; not tested regularly
- User-defined process-to-torus mappings available
- Process management
- Optimized collectives:
  - Optimized torus broadcast: ready for deployment pending code review and optimizations
  - Optimized tree broadcast, barrier, allreduce: almost ready for deployment
- Functionality: OK. Performance: a good foundation.

Where Are We Going to Hurt Next?
Lessons from last year's problems:
- Coprocessor mode is a coding nightmare: overlapping computation with communication; the coprocessor cannot touch data without the main processor cooperating
- Excessive CPU load is hard to handle: even with the coprocessor, we still cannot handle 2.6 Bytes/cycle/node (yet)
- Flow control: unexpected messages slow reception down
- Alignment
Anticipating this year:
- 4 racks in the near (?) future; we don't anticipate major scaling problems
- CEO milestone at the end of the year
- We are up to 2^9 of 2^16 nodes. That's about halfway on a log scale. We have not hit any "unprecedented" sizes yet: LLNL can run MPI jobs on more machines than we have.
- Fear factor: the combination of a congested network and short messages

Conclusion
- We are in the middle of moving from functionality mode to a performance-centric mode
- We don't know yet how to run 64k MPI processes
- Rochester is taking over functionality and routine performance testing; teams in Watson & Rochester are collaborating on collective performance
- Imperative to keep the design fluid enough to counter surprises
- Establishing a large community for measuring and analyzing behavior
- A lot of performance work is still needed: new protocol(s); collectives on the torus and tree
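The "halfway on a log scale" remark about machine size can be checked numerically: 2^9 = 512 nodes out of a full 2^16 = 65536 is 9 of 16 doublings, i.e. a bit past halfway.

```python
import math

current, full = 2 ** 9, 2 ** 16
print(current, full)                           # 512 65536
print(math.log2(current) / math.log2(full))    # 0.5625 (9 of 16 doublings)
```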