MPI
Louisiana Tech University
Ruston, Louisiana
Charles Grassl
IBM
January, 2006
© 2005 IBM
Agenda
• Hardware and Software
• Compilers and POE
• Configuration variables
• Control
• Tuning
• Performance characteristics
Compiling and Running an MPI Program
mpcc_r -c mpi_prog.c
mpcc_r -o a.out mpi_prog.o
cat host.list
nodename0
nodename1
nodename2
nodename3
poe a.out -procs 4 -hostfile host.list
export MP_PROCS=4
export MP_HOSTFILE=$PWD/host.list
a.out
Submitting A LoadLeveler Batch Job
#!/bin/ksh
#
# @ error            = Error
# @ output           = Output
# @ notification     = never
# @ wall_clock_limit = 00:59:00
# @ job_type         = parallel
# @ node             = 2
# @ tasks_per_node   = 8
# @ network.mpi      = sn_single,shared,US
# @ node_usage       = not_shared
# @ class            = standard
# @ queue

for i in mpi_prog1 mpi_prog2
do
  $i
done
LoadLeveler Commands
• llq
  • Show queue
• llsubmit
  • Submit a LoadLeveler job
• llclass
  • Show classes
• llstatus
  • List status of nodes
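As a minimal usage sketch (the script name mpi_job.cmd is illustrative, assuming the batch script from the previous slide is saved under that name):

$ llsubmit mpi_job.cmd    # submit the job script
$ llq -u $USER            # list this user's queued and running jobs
$ llstatus                # show the status of the nodes
$ llclass                 # list the available classes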
Message Passing Network Heritage
Switch              Available   Latency [microsec.]   Bandwidth [Mbyte/s]
TPMX                1998        25                    125
Switch2 (Colony)    2001        15                    320
HPS (Federation)    2003        5                     1800
Message Passing on Clusters
• On node:
• Flat address space
• Uniform memory access times (approximately)
• Distributed memory programming (MPI)
• Shared memory MPI
• Between nodes
• Disjoint address spaces
• Distributed memory programming (MPI)
Shared Memory
• Characteristics
• Single address space
• Single operating system
• Limitations
• Memory
• Contention
• Bandwidth
• Cache coherency
• Benefits
• Memory size
• Programming models
Distributed Memory
• Characteristics
• Multiple address spaces
• Multiple operating systems
[Diagram: shared-memory nodes connected by a network]
• Limitations
• Switch
• Contention
• Bandwidth
• Local memory size
• Benefits
• Cache coherency
Emphasize
• Message Passing (MPI) works VERY well on
shared memory nodes
• Uses shared memory messages
• SMP (OpenMP) does not work between
nodes
• Only scales up to number of CPUs on node
Message Passing Software
• Parallel Environment (PE)
• LoadLeveler
• Distributed batch queuing system
• Parallel Operating Environment (POE)
• Message passing library (MPI)
• Parallel debuggers
• Parallel Operating Environment (POE)
• Generalization of mpirun...
• Runs on:
• pSeries systems
• AIX workstations
• PATH=/usr/lpp/LoadL/full/bin
Invoking the MPI Compiler
• MPI usage: prefix the compiler name with 'mp'
• mpxlf ....

Language     Compiler
Fortran 77   mpxlf
Fortran 90   mpxlf90
C            mpcc
C++          mpCC

• mpcc, mpCC
  • Set the include path
  • Link libmpi.a
Parallel Operating Environment (POE)
• Takes the place of the "mpirun" command
• Also distributes local environment
• Local environment variables are exported to other
nodes
• Example:
• $ poe a.out -procs ...
• or
• $ a.out -procs ...
# /usr/bin/poe is implied
• $ poe ksh myscript.ksh ...
• Runs "myscript.ksh" on nodes listed in hostlist
Compile and Run an MPI Program
$ mpcc mpiprogram.c
# Edit host.list or batch queue script
$ cat host.list
r36n11.pbm.ihost.com
r36n11.pbm.ihost.com
r36n11.pbm.ihost.com
r36n11.pbm.ihost.com
$ poe a.out -procs 4 -hfile host.list
Or
$ a.out -procs 4 -hostfile host.list
Or
$ export MP_PROCS=4 MP_HOSTFILE=host.list
$ a.out
Specifying MPI Control Parameters
• Interactive
• poe a.out -procs 2 -hostfile myhost.list
• LoadLeveler or LSF
• #@ node=4
• #@ tasks_per_node=8
• Environment variables
• MP_PROCS=32
• MP_TASKS_PER_NODE=8
Specifying MPI Control Parameters
Method                               Example
Interactive                          $ poe a.out -procs 2 -hostfile myhost.list
Batch queues (LoadLeveler, LSF)      #@ node=4
                                     #@ tasks_per_node=8
Env. variables                       MP_PROCS=32
                                     MP_TASKS_PER_NODE=8
Control Environment Variables
Parameter            Values      Description
MP_NODES             1-n         Number of nodes
MP_TASKS_PER_NODE    1-m         Tasks per node
MP_PROCS             1-m*n       Number of processes
MP_TASKS             1-m*n       Number of processes
MP_HOSTFILE          host.list   Host file name for interactive use
MP_LABELIO           {yes,no}    Label I/O with task numbers
Configuration Strategy
• Be aware of node concept
• p5-575 nodes have eight processors
• Use shared memory for on-node MPI communication
• MPI configurations:
• Nodes and tasks per node
• Procs and tasks per node
• Procs
• CPUs
MPI Tasks and Processors
• MPI tasks are not necessarily associated 1:1
with processors
• System has configuration for “mpistarters”
• Specify number of MPI tasks possible on each processor
• “Often” set to 1 MPI task per processor
• User concerns:
• Total number of MPI tasks
• Number of nodes
• Number of tasks per node
• Number of tasks per processor
• SMT concerns
Number of Tasks (processors)
• MP_PROCS = MP_NODES * MP_TASKS_PER_NODE
• MP_PROCS : Total number of processes
• MP_NODES : Number of nodes to use
• MP_TASKS_PER_NODE : Number of processes per node
• Any two of the three variables can be specified
• MP_TASKS_PER_NODE is (usually) the number of
processors per node
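For example, to run 32 MPI tasks as 4 nodes with 8 tasks per node, any two of the three settings determine the third:

$ export MP_NODES=4
$ export MP_TASKS_PER_NODE=8
# MP_PROCS is then 4 * 8 = 32; setting MP_PROCS=32 and MP_TASKS_PER_NODE=8
# instead would determine MP_NODES=4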
Environment, Statistics and Information
Env. Variable    Values             Comment
MP_PRINTENV      yes, no            Echo environment variables
MP_STATISTICS    yes, no            Low-level statistics
MP_INFOLEVEL     {0,1,2,3,4,5,6}    Errors, warnings, information
MPI Environment Variables
• MP_PRINTENV=yes
• Job ID: MP_PARTITION
• Tasks: MP_PROCS
• Nodes: MP_NODES
• Tasks per node: MP_TASKS_PER_NODE
• Library: MP_EUILIB
• Adaptor Name
• IP Address
• Striping setup
• 64-bit Mode
• Thread Scope: AIXTHREAD_SCOPE
• Shared memory MPI: MP_SHARED_MEMORY
• Memory Affinity: MEMORY_AFFINITY
• Thread Usage: MP_SINGLE_THREAD
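For example, the full list above is printed at startup with:

$ MP_PRINTENV=yes poe a.out -procs 2 -hostfile host.list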
MPI Statistics
#include "mpi.h“
#include "pm_util.h"
MPI_Init(&argc, &argv);
mpc_statistics_zero();
…
MPI_Send(…)
MPI_Recv(…)
...
mpc_statistics_write(stdout);
MPI_Finalize();
$ MP_STATISTICS=yes a.out
Start of task (pid=410098) statistics
MPCI: sends = 108000
MPCI: sendsComplete = 178000
MPCI: sendWaitsComplete = 108000
MPCI: recvs = 108000
MPCI: recvWaitsComplete = 108000
MPCI: earlyArrivals = 2
MPCI: earlyArrivalsMatched = 2
MPCI: lateArrivals = 107998
MPCI: shoves = 100000
MPCI: pulls = 138000
MPCI: threadedLockYields = 0
MPCI: unorderedMsgs = 0
MPCI: EA buffer high water mark= 1770784
MPCI: token starvation= 0
MPCI: envelope buffer used=53424
MPI INFO LEVEL
$ MP_INFOLEVEL=6 a.out
Task 0-1:Hostname: r36n11.pbm.ihost.com
Task 0-1:Job ID (MP_PARTITION): 1134540874
Task 0-1:Number of Tasks (MP_PROCS): 2
Task 0-1:Number of Nodes (MP_NODES): NOT SET
Task 0-1:Number of Tasks per Node (MP_TASKS_PER_NODE): NOT SET
Task 0-1:64 Bit Mode: YES
Task 0-1:Threaded Library: YES
Task 0-1:Polling Interval (MP_POLLING_INTERVAL/sec): 0.400000
Task 0-1:Buffer Memory (MP_BUFFER_MEM/Bytes): 2800000
Task 0-1:Max. Buffer Memory (MP_BUFFER_MEM_MAX/Bytes): 2800000
Task 0-1:Message Eager Limit (MP_EAGER_LIMIT/Bytes): 32768
D3<L4>: Message type 20 from source 0
D1<L4>: All remote tasks have exited: maxx_errcode = 0
Tuning Environment Variables
Parameter           Values                Description
MP_BUFFER_MEM       0 - 64,000,000        Buffer for early arrivals
MP_EAGER_LIMIT      0 - 262144            Threshold for rendezvous protocol
MP_SHARED_MEMORY    {yes,no}              Use of shared memory on node
MP_WAIT_MODE        {poll,yield,sleep}    US default: poll; IP default: yield
MP_USE_BULK_XFER    {yes,no}              Block transfer: message striping
MP_EUILIB           {us,ip}               Communication method
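As an illustrative combination of these settings (the values are examples rather than recommendations for any particular system):

$ export MP_EUILIB=us              # user space protocol over the switch
$ export MP_SHARED_MEMORY=yes      # shared memory messages within a node
$ export MP_EAGER_LIMIT=65536      # eager protocol for messages up to 64 Kbyte
$ export MP_USE_BULK_XFER=yes      # bulk transfer (striping) for large messages
$ poe a.out -procs 32 -hostfile host.list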
MPI Tuning
• Message Passing System Library
• Internet Protocol (IP)
• Ethernet protocol
• User Space (US)
• IBM Switch
• Shared Memory
• System V shared memory for on-node messages
• Eager or Rendezvous Protocol
• Small or large messages
Message Passing Library
• MP_EUILIB={us,ip}
• us: user space
• Much faster: 5 microseconds latency, 2000 Mbyte/s bandwidth
• ip: usable with Ethernet
• Much slower: 50 microsecond latency
• US mode is usually the default
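A minimal sketch of selecting the library, assuming the job is started through POE:

$ export MP_EUILIB=us    # user space library for the switch
Or
$ export MP_EUILIB=ip    # IP, e.g. over Ethernet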
MP_EUILIB
[Diagram: in User Space (US) mode the user task talks to the switch adaptor
directly; in Internet Protocol (IP) mode the user task goes through AIX
before reaching the adaptor and the network.]
Effect of US Mode: Latency
[Bar chart: latency in microseconds on the HPS switch for IP (2 nodes),
US (2 nodes), and US or IP (1 node).]
Effect of US Mode: Latency
[Chart: time (microseconds) vs. message length (bytes), 0 - 12000 bytes,
IP vs. US; p5-575 1.9 GHz, HPS.]
Effect of US Mode: Bandwidth
[Chart: rate (Mbyte/s) vs. message length (bytes), 0 - 500000 bytes,
IP vs. US; POWER5 1.9 GHz.]
Shared Memory MPI
• MP_SHARED_MEMORY={yes,no}
• High bandwidth:
• 2000 Mbyte/s
• Low latency:
• 2 microseconds
• Not always the default
• SHOULD ALWAYS BE SET TO yes
MPI Performance on HPS
[Chart: rate (Mbyte/s) vs. message length (bytes), 0 - 500000 bytes,
1 node vs. 2 nodes; POWER5 1.9 GHz.]
Protocol
• MP_EAGER_LIMIT=[0 - 262144]
• Smaller messages are passed directly to the other task (eager protocol)
• Larger messages use the rendezvous protocol
MPI Transfer Protocols
[Diagram: large messages use the rendezvous protocol — the sender sends a
header, the receiver acknowledges, then the message itself is sent.]
MPI Transfer Protocols
[Diagram: small messages use the eager protocol — the header and the message
are sent together without waiting for the receiver.]
Flow Control
• Small messages:
• Lower latency
• MPI_Send is more like MPI_Isend
• Large Messages:
• Rendezvous protocol
• MPI_Send is equivalent to MPI_Ssend
Strategy:
Develop application with MP_EAGER_LIMIT=0
Run application with MP_EAGER_LIMIT=65536
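A minimal sketch of that strategy (program and host file names are carried over from the earlier examples):

# Development: force the rendezvous protocol for every message so that unsafe
# send/receive orderings deadlock immediately instead of being hidden by buffering
$ MP_EAGER_LIMIT=0 poe a.out -procs 4 -hostfile host.list

# Production: restore eager sends for small messages to recover low latency
$ MP_EAGER_LIMIT=65536 poe a.out -procs 4 -hostfile host.list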
Default Eager Limits
• Small Message
(MP_EAGER_LIMIT)
• Send header and message
• Large Message
• Send header
• Acknowledge
• Send message
No. Tasks      MP_EAGER_LIMIT (default, bytes)
1 - 256        32768
257 - 512      16384
513 - 1024     8192
1025 - 2048    4096
2049 - 4096    2048
4097 - 8192    1024
Setting Eager Limits
• MPI library checks eager limits:
• MP_BUFFER_MEM=2^26 (~64 Mbyte) (default)
• MP_EAGER_LIMIT={default from table}
• Calculate Credits:
• MP_BUFFER_MEM / (MP_PROCS * MAX(MP_EAGER_LIMIT, 64))
• Credits must be greater than or equal to 2.
• MPI reduces MP_EAGER_LIMIT or increases MP_BUFFER_MEM
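A worked example of that check with the defaults for a 256-task job:

credits = MP_BUFFER_MEM / (MP_PROCS * MAX(MP_EAGER_LIMIT, 64))
        = 67,108,864 / (256 * 32,768)
        = 8

Since 8 >= 2, the requested eager limit is kept. If MP_EAGER_LIMIT were raised to 32768 for a 1024-task job, the same buffer would give 67,108,864 / (1024 * 32,768) = 2 credits, the minimum allowed, which is why the default eager limit shrinks as the task count grows.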
Effect of MP_EAGER_LIMIT
[Chart: time (microseconds) vs. message length (bytes), 0 - 10000 bytes,
default eager limit vs. MP_EAGER_LIMIT=0.]
Effect of MP_EAGER_LIMIT
[Chart: time (microseconds) vs. message length (bytes), 0 - 100000 bytes,
default eager limit vs. MP_EAGER_LIMIT=0.]
Effect of MP_EAGER_LIMIT
• MP_EAGER_LIMIT affects latency
• Lower time for "small" messages
• Latency reduced from 50 to 5 microsec.
• Effect is noticeable only for messages of
size 0 - 256000 bytes
Bulk Transfer: RDMA vs. no RDMA
[Chart: rate (Mbyte/s) vs. message length (bytes), 0 - 2000000 bytes,
RDMA vs. no RDMA; p5-575 1.9 GHz, HPS.]
Other MPI Tuning
Env. Variable       Values       Comment
MP_SINGLE_THREAD    yes, no      1 microsec. lower latency for non-threaded MPI tasks

Setting             Values       Comment
Address Mode        -q32, -q64   Enhanced MPI collectives performance with 64-bit addresses
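As a sketch of applying both settings (the source file name is carried over from the earlier compile examples):

$ mpcc_r -q64 -o a.out mpi_prog.c    # 64-bit addressing for better collective performance
$ export MP_SINGLE_THREAD=yes        # only safe if no task makes threaded MPI calls
$ poe a.out -procs 32 -hostfile host.list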
MPI Bcast: 32-bit vs. 64-bit
[Chart: time (microseconds) vs. message length (bytes), 0 - 750000 bytes,
32 tasks MPI_Bcast, 64-bit vs. 32-bit addressing; p655 1.5 GHz, HPS, RDMA.]
MPI Bcast: 32-bit vs. 64-bit
[Chart: time (microseconds) vs. message length (bytes), 0 - 750000 bytes,
64 tasks MPI_Bcast, 64-bit vs. 32-bit addressing; p655 1.5 GHz, HPS, RDMA.]
MPI Bcast: 32-bit vs. 64-bit
[Chart: time (microseconds) vs. message length (bytes), 0 - 750000 bytes,
MPI_Bcast with 64-task 64-bit, 64-task 32-bit, 32-task 64-bit, and 32-task
32-bit configurations; p655 1.5 GHz, HPS, RDMA.]
MPI Portability
• Problems arise when porting MPI
applications
• Blocking sends and receives
• MPI Cartesian coordinates
Unsafe Send - Receive
• Example: circular shift
• Does not always work
• With the rendezvous protocol the tasks deadlock
• The blocking send doesn't return
• Blocking communication calls:
MPI_SEND(sbuf,size,MPI_INTEGER,next,0,MPI_COMM_WORLD,...)
MPI_RECV(rbuf,size,MPI_INTEGER,prev,0,MPI_COMM_WORLD,...)
[Diagram: each task sends to the next task and receives from the previous one in a ring]
Safe Send - Receive
• Always works
• MPI_ISEND returns in all cases
• Best performance
• Nonblocking communication calls:
MPI_ISEND(sbuf,size,MPI_INTEGER,next,0,MPI_COMM_WORLD,ireq(1),ierr)
MPI_IRECV(rbuf,size,MPI_INTEGER,prev,0,MPI_COMM_WORLD,ireq(2),ierr)
... (as much computation as possible) ...
MPI_WAITALL(2,ireq,stat,ierr)
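For a self-contained illustration, here is the same circular shift written in C with nonblocking calls; this sketch is not from the original deck, and it can be built with mpcc_r as shown earlier:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sbuf, rbuf;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;          /* right neighbour in the ring */
    int prev = (rank - 1 + size) % size;   /* left neighbour in the ring  */
    sbuf = rank;

    /* Post both nonblocking calls before waiting: no deadlock regardless
       of eager or rendezvous protocol. */
    MPI_Isend(&sbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&rbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &req[1]);

    /* ... overlap computation here ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    printf("Task %d received %d from task %d\n", rank, rbuf, prev);

    MPI_Finalize();
    return 0;
}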
MPI Cartesian Coordinates
• MPI_CART_CREATE
• "Full" implementation on AIX
• "No-operation" on most other (all?) operating
systems
• Be careful with usage
• Generally not tested on other operating systems
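For reference, a short C sketch of the MPI_CART_CREATE call; the reorder argument is where a "full" implementation (which may renumber ranks for locality) differs from a no-operation one. The grid dimensions below are illustrative and assume the job is run with at least 32 tasks:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int dims[2]    = {4, 8};   /* 4 x 8 grid for 32 tasks (illustrative)    */
    int periods[2] = {1, 1};   /* wrap around in both dimensions            */
    int reorder    = 1;        /* allow MPI to renumber ranks for locality  */
    MPI_Comm cart_comm;

    MPI_Init(&argc, &argv);
    /* On a full implementation, reorder=1 lets MPI place neighbouring grid
       ranks on nearby processors; a no-op implementation leaves ranks unchanged. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart_comm);
    if (cart_comm != MPI_COMM_NULL)
        MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}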
Use of Multiple Program Multiple Data (MPMD)
• Each task in MPI session can be a unique
program:
• export MP_PGMMODEL=<mpmd/spmd>
• export MP_CMDFILE=cmdfile
cmdfile:
a.out
b.out
c.out

Host file:
node1
node2
node3

Execution command:
$ export MP_PGMMODEL=mpmd
$ export MP_CMDFILE=cmdfile
$ poe -procs 3
Performance Summary
• Latency is low
• Bandwidth is high
• Low contention
• Programming
• Some portability problems
• Deadlocks
• Tuning:
• EAGER_LIMIT
• RDMA
MPI Single Messages
[Chart: rate (Mbyte/s) vs. message length (bytes), 0 - 1000000 bytes;
p5-575 1.8 GHz, HPS, RDMA, LP.]
MPI Double Messages
[Chart: rate (Mbyte/s) vs. message length (bytes), 0 - 1000000 bytes;
p5-575 1.8 GHz, HPS, RDMA, LP.]
Summary
• Exploit shared memory MPI
• Low Latency
• High bandwidth
• Use MP_EAGER_LIMIT for small messages