Parallel Libraries and Parallel I/O
John Urbanic
Pittsburgh Supercomputing Center
September 14, 2004

Outline
Libraries
I/O Solutions (code level)
Parallel Filesystems

Scientific Libraries
Leveraging libraries for your code.

Libraries
Math libraries: parallel and serial
Graphics libraries
File I/O libraries
Communication: MPI, Grid
Application specific: protein/nucleic sequencing

Serial Math Libraries
CXML (Alphas)
BLAS
EISPACK
LAPACK
SCILIB (portable version)

Some "Preferred" Parallel Math Libraries
PDE solvers (PETSc)
Parallel linear algebra (ScaLAPACK)
Fourier transforms (FFTW)

PETSc
PETSc, the Portable, Extensible Toolkit for Scientific Computation, is a suite of data structures and routines for the uniprocessor and parallel solution of large-scale scientific application problems modeled by partial differential equations. PETSc employs the MPI standard for all message-passing communication. As a framework, it does have a learning curve. Very scalable.

PETSc Codes
Some examples of applications that use PETSc:
Quake - earthquake simulation code, this year's Gordon Bell prize winner; runs at over 1 TFLOP on Lemieux.
Multiflow - curvilinear, multiblock, multiprocessor flow solver for multiphase flows.
FIDAP 8.5 - Fluent's commercial finite element fluid code uses PETSc for parallel linear solves.
Many, many others.

PETSc Design
PETSc integrates a hierarchy of components, enabling the user to employ the level of abstraction that is most natural for a particular problem. Some of the components are:
Mat - a suite of data structures and code for the manipulation of parallel sparse matrices
PC - a collection of preconditioners
KSP - data-structure-neutral implementations of many popular Krylov subspace iterative methods
SLES - a higher-level interface for the solution of large-scale linear systems
SNES - data-structure-neutral implementations of Newton-like methods for nonlinear systems
Further details at http://www-unix.mcs.anl.gov/petsc
Parallel Programming with MPI, Peter Pacheco, Morgan Kaufmann, 1997, devotes a couple of sections to PETSc.

ScaLAPACK
ScaLAPACK is a linear algebra library for parallel computers. Routines are available to solve the linear system A*x=b, or to find the matrix eigensystem, for a variety of matrix types. One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their LAPACK equivalents as much as possible. ScaLAPACK implements the block-oriented LAPACK linear algebra routines, adding a special set of communication routines to copy blocks of data between processors as needed. As with LAPACK, a single subroutine call typically carries out the requested computation. However, ScaLAPACK requires the user to configure the processor grid and distribute the matrix data before the problem can be solved. As with PETSc, the user is otherwise spared the mechanics of the parallelization.

ScaLAPACK Project
The ScaLAPACK project was a collaborative effort involving several institutions and comprised four components:
dense and band matrix software (ScaLAPACK)
large sparse eigenvalue software (PARPACK and ARPACK)
sparse direct systems software (CAPSS and MFACT)
preconditioners for large sparse iterative solvers (ParPre)
Includes parallel versions of the EISPACK routines.
TCS and general information at http://www.psc.edu/general/software/packages/scalapack/scalapack.html and http://www.netlib.org/scalapack/

FFTW
FFTW is a C subroutine library for computing the Discrete Fourier Transform in one or more dimensions, of both real and complex data, of arbitrary input size. FFTW is callable from Fortran. It works on any platform with a C compiler. Parallelization is through library calls. The API of FFTW 3.x is incompatible with that of FFTW 2.x, for reasons of performance and generality (see the FAQ and manual). MPI parallel transforms are still only available in 2.1.5.
FFTW Web Page at http://www.fftw.org/
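For concreteness, here is a minimal serial sketch against the FFTW 3.x C API (the transform size n = 1024 is an arbitrary choice, not from the slides); the MPI parallel transforms mentioned above still use the older 2.1.5 interface, which differs.

    /* Minimal FFTW 3.x usage: plan once, then execute. */
    #include <complex.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 1024;

        /* fftw_malloc gives alignment suitable for SIMD transforms. */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Planning is where FFTW tunes itself for the platform... */
        fftw_plan plan = fftw_plan_dft_1d(n, in, out,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

        for (int i = 0; i < n; i++)
            in[i] = (double)i;       /* fill with sample data */

        fftw_execute(plan);          /* ...execution can be repeated cheaply */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }

Link with -lfftw3 -lm; the same plan/execute pattern carries over to the parallel interfaces.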
Other Common Packages
CACTUS
CHOMBO
NAG - parallel version (built on ScaLAPACK)

Resources
At PSC: staff (Hotline, [email protected]) and the web (www.psc.edu/general/software/categories/categories.html)
In general: Netlib (http://netlib.belllabs.com/netlib/master/readme.html)

Parallel I/O
Achieving scalable I/O.

Motivation
Many best-in-class codes spend significant amounts of time doing file I/O. By significant I mean upwards of 20%, and often approaching 40%, of total run time. These are mainstream applications running on dedicated parallel computing platforms.

Terminology
A few terms will be useful here: start/restart file, checkpoint file, visualization file.
Start/Restart File(s): the file(s) used by the application to start or restart a run. May be about 25% of total application memory.
Checkpoint File(s): a periodically saved file used to restart a run that was disrupted in some way. May be exactly the same as a start/restart file, but may also be larger if it stores higher-order terms. If it is automatically or system generated, it will be 100% of application memory.
Visualization File(s): used to generate interim data, usually for visualization or similar analysis. These are often only a small fraction of total application memory (5-15%) each.

How Often Are These Generated?
Start/Restart File: once at startup, and perhaps at completion of the run.
Checkpoint: depends on the MTBF of the machine environment. This is getting worse, and will not be better on a PFLOP system. On the order of hours.
Visualization: depends on data analysis requirements, but can easily be several times per minute.

Latest (Most Optimistic) Numbers
Blue Gene/L: 16 TB memory, 40 GB/s I/O bandwidth, 400 s to checkpoint memory.
ASCI Purple: 50 TB memory, 40 GB/s, 1250 s to checkpoint memory.
The latest machines will still take on the order of minutes to tens of minutes to do any substantial I/O.
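The checkpoint times above are just memory size divided by aggregate I/O bandwidth; a trivial sketch makes the arithmetic explicit (it treats 1 TB as 1000 GB to match the slide's round numbers, and the helper name is mine).

    #include <stdio.h>

    /* Time to checkpoint the full memory image at the quoted aggregate
     * I/O bandwidth, with no compression and no overlap with compute. */
    static double checkpoint_seconds(double memory_tb, double bandwidth_gb_s)
    {
        return (memory_tb * 1000.0) / bandwidth_gb_s;  /* TB -> GB, then GB / (GB/s) */
    }

    int main(void)
    {
        printf("Blue Gene/L: %.0f s\n", checkpoint_seconds(16.0, 40.0)); /*  400 s */
        printf("ASCI Purple: %.0f s\n", checkpoint_seconds(50.0, 40.0)); /* 1250 s */
        return 0;
    }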
Example Numbers
We'll use Lemieux, PSC's main machine, as most of these high-demand applications have similar requirements on other platforms, and we'll pick an application (earthquake modeling) that won the Gordon Bell prize this past year.

3000 PE Earthquake Run
Start/Restart: 3000 files totaling 150 GB
Checkpoint: 40 GB every 8 hours
Visualization: 1.2 GB every 30 seconds
Although this is the largest unstructured mesh ever run, it still doesn't push the available memory limit. Many applications are closer to being memory bound.

A Slight Digression: Visualization Cluster
What was once a neat idea has now become a necessity. Real-time volume rendering is the only way to render these enormous data sets down to a storable size.

Actual Route
Pre-load startup data from FAR to SCRATCH (~12 hr)
Start holding breath (no node remapping)
Move from SCRATCH to LOCAL (4 hr)
Run (16 hours, little I/O time with a 70 GB/s path)
Move from LOCAL to SCRATCH (6 hr)
Release breath
Move to FAR/offsite (~12 hr)

Bottom Line (which is always some bottleneck)
Like most of the TFLOP-class machines, we have several hierarchical levels of file systems. In this case we want to leverage the local disks to keep the application humming along (which it does), but we eventually need to move the data off (and on to) these drives. The machine does not give us free cycles to do this. This pre/post-run file migration is the bottleneck here. Skip local disk? Only if we want to spend 70x more time during the run. Although users love a nice DFS solution, it is prohibitive for 3000 PEs writing simultaneously and frequently.

Where's the DFS?
It's on our giant SMP. ☺ Just like the difficulty in creating a massive SMP revolves around contention, so does making a DFS (NFS, AFS, GPFS, etc.) that can deal with thousands of simultaneous file writes. Our SCRATCH (~1 GB/s) is as close as we get. It is a globally accessible filesystem. But we still use locally attached disks when it really counts.

Parallel Filesystem Test Results
Parallel filesystems were tested with a simple MPI program that reads and writes a file from each rank. These tests were run in January 2004 on the clusters while they were in production mode. The filesystems and clusters were not in dedicated mode, so these results are only a snapshot.

    Hosts * ppn   Approx. size of test file   Filesystem       Agg. transfer rate [MB/s]
    32*4          4 GB                        PSC /scratch     3000 (5/2/04)
    110*2         5 GB                        SDSC /gpfs       753
    128*2         5 GB                        NCSA /gpfs       423
    32*2          2.5 GB                      Caltech /pvfs    99

The data path jumps through hoops; how about the code?
Most parallel code has naturally modular, isolated I/O routines. This makes the above issue much less painful. This is very unlike computational algorithm scalability issues, which often permeate a code.

How many lines/hours?
Quake, which has thousands of lines of code, has only a few dozen lines of I/O code in several routines (startup, checkpoint, viz). To accommodate this particular mode of operation (as compared to the default "magic DFS" mode) took only a couple of hours of recoding.

How Portable?
This is one area where we have to forego strict portability. However, once we modify these isolated areas of code to deal with the notion of local/fragmented disk spaces, we can bend to any new environment with relative ease.

Pseudo Code (writing to local)
    synch
    if (not subgroup #X master)
        send data to subgroup #X master
    else
        openfile datafile.data.X
        for (1 to number_in_subgroup)
            receive data
            write data

Pseudo Code (reading from local)
    synch
    if (not subgroup #X master)
        receive data
    else
        openfile datafile.data.X
        for (1 to number_in_subgroup)
            read data
            send data

Pseudo Code (writing to DFS)
    synch
    openfile SingleGiantFile
    Setfilepointer(based on PE #)
    write data
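To make the local-write pattern concrete, here is a minimal MPI/C sketch of the "writing to local" pseudocode. The function name, the fixed per-PE chunk of n doubles, and the contiguous-rank subgroup layout are illustrative assumptions, not Quake's actual I/O code.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Each subgroup of `ranks_per_group` PEs funnels its data to a subgroup
     * master, which writes one file (datafile.data.X) on its local disk. */
    void io_write_local(const double *chunk, int n, int ranks_per_group)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int group  = rank / ranks_per_group;       /* subgroup #X          */
        int master = group * ranks_per_group;      /* lowest rank in group */

        MPI_Barrier(MPI_COMM_WORLD);               /* "synch"              */

        if (rank != master) {
            /* Non-masters just ship their chunk to the subgroup master. */
            MPI_Send(chunk, n, MPI_DOUBLE, master, 0, MPI_COMM_WORLD);
        } else {
            char fname[64];
            snprintf(fname, sizeof fname, "datafile.data.%d", group);
            FILE *fp = fopen(fname, "wb");         /* on local scratch disk */

            double *buf = malloc(n * sizeof(double));
            int last = master + ranks_per_group;
            if (last > size) last = size;

            fwrite(chunk, sizeof(double), n, fp);  /* master's own data     */
            for (int src = master + 1; src < last; src++) {
                MPI_Recv(buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                fwrite(buf, sizeof(double), n, fp); /* receive, then write  */
            }
            free(buf);
            fclose(fp);
        }
    }

Reading back is the mirror image (master reads and sends), and only this handful of lines changes when the disk arrangement does.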
Platform and Run Size Issues
Various platforms will strongly suggest different numbers or patterns of designated I/O nodes (sometimes all nodes, sometimes very few). This is simple to accommodate in code. Different numbers of total PEs or I/O PEs will require different distributions of data in the local files; this can be done offline.

File Migration Mechanics
ftp, scp, gridco, gridftp, etc.
tcsio (a local solution)

How about MPI-IO?
There are not many (any?) full MPI-2 implementations. It is more that some vendor/site combinations have implemented the features needed to accomplish the above type of customization for a particular disk arrangement. The alternative is portable-looking code that runs very, very slowly. You can explore this separately via ROMIO: http://www-unix.mcs.anl.gov/romio/ (see the MPI-IO sketch at the end of this transcript).

Parallel Filesystems
PVFS: http://www.parl.clemson.edu/pvfs/index.html
LUSTRE: http://www.lustre.org/

Current Deployments
Summer 2003 (3 of the top 8 machines run Linux; Lustre on all 3):
LLNL MCR: 1,100 node cluster
LLNL ALC: 950 node cluster
PNNL EMSL: 950 node cluster
Installing in 2004:
NCSA: 1,000 nodes
SNL/ASCI Red Storm: 8,000 nodes
LANL Pink: 1,000 nodes

LUSTRE = Linux + Cluster
Provides caching, failover, QOS, a global namespace, and security and authentication. Built on Portals. Requires kernel mods. Interface (for striping control): shell (lstripe) and code (ioctl).

Performance
From http://www.lustre.org/docs/lustre-datasheet.pdf

    File I/O, % of raw bandwidth:    >90%
    Achieved client I/O:             >650 MB/s
    Aggregate I/O, 1,000 clients:    11.1 GB/s
    Attribute retrieval rate:        7500/s (in a 10M-file directory, 1,000 clients)
    Creation rate:                   5000/s (one directory, 1,000 clients)

Benchmarks
FLASH: http://flash.uchicago.edu/~zingale/flash_benchmark_io/#intro
PFS: http://www.osc.edu/~djohnson/gelato/pfbs-0.0.1.tar.gz

Didn't Cover (too trivial for us)
Formatted/unformatted I/O
Floating point representations
Byte ordering
XDF - no help for parallel performance
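For reference, a minimal MPI-IO sketch of the "writing to DFS" pseudocode, of the kind ROMIO supports: every PE writes its chunk into one shared file at an offset derived from its rank. The file name, chunk size, and function name are illustrative, and whether this runs fast depends entirely on the filesystem underneath, as discussed above.

    #include <mpi.h>

    /* Every PE writes n doubles into one shared file at a rank-based offset. */
    void io_write_dfs(const double *chunk, int n)
    {
        int rank;
        MPI_File fh;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);                       /* "synch" */

        MPI_File_open(MPI_COMM_WORLD, "SingleGiantFile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* "Setfilepointer(based on PE #)": explicit offset per rank. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

        /* A collective write lets the MPI-IO layer aggregate the
         * thousands of simultaneous requests the slides warn about. */
        MPI_File_write_at_all(fh, offset, chunk, n, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }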