Scientific Computing on Linux and Tru64
Wherein we explore the whys and wherefores of various software for enabling the performance of really spiffy scientific feats on Linux and Tru64 machines.
The categories thus far explored include:
This includes pointers to various incarnations of the LAPACK/BLAS software for performing various linear algebra tasks. Eventually it might even contain things like binaries for specific platform/compiler combinations, but don't hold your breath.
"LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision."
Both LAPACK and BLAS were originally written in Fortran 77, and that is still the language of the reference implementations. They have been made available in other languages both by direct translation and by providing interfaces that call the Fortran routines. This is a list of some of those efforts.
A Fortran 95 interface to LAPACK. There is a LAPACK95 Users' Guide.
"LAPACK95 is a Fortran 95 interface to the Fortran 77 LAPACK library. It improves upon the original user-interface to the LAPACK package, taking advantage of the considerable simplifications which Fortran 95 allows. The design of LAPACK95 exploits assumed-shape arrays, optional arguments, and generic interfaces. The Fortran 95 interface has been implemented by writing Fortran 95 `wrappers' to call existing routines from the LAPACK package. This interface can persist unchanged even if the underlying Fortran 77 LAPACK code is rewritten to take advantage of the new features of Fortran 95."
"Our primary motivation is to provide numerical linear algebra software originally written in Fortran as Java class files. The numerical libraries will be distributed as class files produced by a Fortran-to-Java translator, f2j. The f2j translator is a formal compiler that translates programs written using a subset of Fortran77 into a form that may be executed on Java virtual machines(JVM). The first priority of f2j is to translate the BLAS and LAPACK numerical libraries from their Fortran77 reference source code to Java class files."
A C version of LAPACK built using the Fortran to C conversion utility f2c. Various details are supplied in readme, readme.install and readme.maintain.
"The entire Fortran 77 LAPACK library is run through f2c to obtain C code, and then modified to improve readability. CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler. However, f2c is designed to create C code that is still callable from Fortran, so all arguments must be passed using Fortran calling conventions and data structures."
This is a set of C++ wrappers that allows object-oriented techniques to be used with either LAPACK or CLAPACK. The developers recommend that those wishing to solve such problems in C++ instead use the Template Numerical Toolkit, which has been designed as a better mousetrap for such things.
"LAPACK++ (Linear Algebra PACKage in C++) is a software library for numerical linear algebra that solves systems of linear equations and eigenvalue problems on high performance computer architectures. Computational support is provided for supports various matrix classes for vectors, non-symmetric matrices, SPD matrices, symmetric matrices, banded, triangular, and tridiagonal matrices; however, it does not include all of the capabilities of original f77 LAPACK. Emphasis is given to routines for solving linear systems consisting of non-symmetric matrices, symmetric positive definite systems, and solving linear least-square systems."
A portable parallel implementation of some of the core LAPACK routines, designed to run on top of MPI or PVM and on machines such as the Intel Paragon, the IBM SP and the SGI O2K. The ScaLAPACK component of the project...
"...includes a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. It assumes matrices are laid out in a two-dimensional block cyclic decomposition.
ScaLAPACK is designed for heterogeneous computing and is portable on any computer that supports MPI or PVM.
Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. (For such machines, the memory hierarchy includes the off-processor memory of other processors, in addition to the hierarchy of registers, cache, and local memory on each processor.) The fundamental building blocks of the ScaLAPACK library are distributed memory versions (PBLAS) of the Level 1, 2 and 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS. One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their LAPACK equivalents as much as possible."
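To give a feel for what that two-dimensional block cyclic decomposition means, here is a toy sketch (the names are invented for illustration; this is not part of the ScaLAPACK API) that computes which process in a Pr x Pc grid owns a given global matrix entry:

/* Illustrative sketch of the 2D block-cyclic data layout used by
   ScaLAPACK/PBLAS.  Given a global index (i, j), block sizes MB x NB,
   and a Pr x Pc process grid, report which process row/column owns the
   entry.  Function and variable names are made up for illustration;
   they are not part of the ScaLAPACK API. */
#include <stdio.h>

static int owner(int global_index, int block_size, int nprocs)
{
    /* Blocks are dealt out to processes round-robin ("cyclically"). */
    return (global_index / block_size) % nprocs;
}

int main(void)
{
    int MB = 64, NB = 64;   /* block sizes              */
    int Pr = 2,  Pc = 3;    /* process grid dimensions  */
    int i = 200, j = 500;   /* a global entry (0-based) */

    printf("entry (%d,%d) lives on process (%d,%d) of the %dx%d grid\n",
           i, j, owner(i, MB, Pr), owner(j, NB, Pc), Pr, Pc);
    return 0;
}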
"The ScaLAPACK project was a collaborative effort involving several institutions. ... [It] comprised four components:
An effort to provide specifications for standard reference implementations of BLAS. The functionality and language bindings of three planned BLAS categories, along with a fourth category covering legacy work, are distinguished:
Some details on various reference implementations:
Part of a reference implementation for the dense and banded BLAS routines, along with their extended and mixed precision versions.
"Extended precision is only used internally in the BLAS; the input and output arguments remain just Single or Double as in the existing BLAS. For the internal precision, we allow Single, Double, Indigenous, or Extra. In our reference implementation we assume that Single and Double are the corresponding IEEE floating point formats. Indigenous means the widest hardware-supported format available. Its intent is to let the machine run close to top speed, while being as accurate as possible. On some machines this would be a 64-bit (Double) format, but on others such as Intel machines it means the 80-bit IEEE format (which has 64 fraction bits). Our reference implementation currently supports machines on which Indigenous is the same as Double. Extra means anything at least 1.5 times as accurate as Double, and in particular wider than 80 bits. An existing quadruple precision format could be used to implement Extra precision, but we cannot assume that the language or compiler supports any format wider than Double. So for our reference implementation, we use a technique called double-double in which extra precise numbers are represented by a pair of double precision numbers, providing about 106 bits of precision.
We have designed all our routines assuming that single precision arithmetic is actually done in IEEE single precision (32 bits) and that double precision arithmetic is actually done in IEEE double precision (64 bits). It also passes our tests on the Intel machine with 80-bit floating point registers.
Mixed precision permits some input/output arguments to be of different mathematical types, meaning real and complex, or different precisions, meaning single and double. This permits such operations as real-by-complex matrix multiplication, which can be rather faster than using alternatives that do not mix precision."
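To make the double-double idea concrete, here is a tiny sketch (an illustration of the technique, not the XBLAS reference code) of the error-free addition it is built from: an extra-precise value is carried as an unevaluated sum of a high and a low double.

/* Minimal sketch of the "double-double" idea: an extra-precise value
   is carried as an unevaluated sum hi + lo of two doubles.  The
   two_sum step below (Knuth's algorithm) recovers the rounding error
   of a double addition exactly; extra-precision arithmetic is built
   out of such error-free transformations.  This is an illustration,
   not the library's actual code. */
#include <stdio.h>

typedef struct { double hi, lo; } dd_t;

/* Error-free addition: hi + lo == a + b exactly (round-to-nearest). */
static dd_t two_sum(double a, double b)
{
    dd_t r;
    r.hi = a + b;
    double v = r.hi - a;
    r.lo = (a - (r.hi - v)) + (b - v);
    return r;
}

int main(void)
{
    /* 1.0 + 1e-17 rounds to 1.0 in double; the low word keeps the rest. */
    dd_t s = two_sum(1.0, 1e-17);
    printf("hi = %.17g, lo = %.17g\n", s.hi, s.lo);
    return 0;
}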
"The ASCI project has 3 Intel Computational scientists. They have generated a number of utilities and single node libraries for the Intel Pentium Pro(TM) processors used on the ASCI Option Red machine at SNL. The utilities are built in a Unix-like environment in ELF format, and are therefore portable to most Linux-based Intel-inside workstations. These include things like the BLAS, FFTs, extended precision kernels, and hardware performance monitoring utilities. These come free of charge, and we only ask that you register your usage, give us your feedback, and pay close attention to the disclaimers."
"The Math Kernel Library contains the following groups of routines:
"ATLAS stands for Automatically Tuned Linear Algebra Software. ATLAS is both a research project and a software package. ... ATLAS's purpose is to provide portably optimal linear algebra software. The current version provides a complete BLAS API (for both C and Fortran77), and a very small subset of the LAPACK API. For all supported operations, ATLAS achieves performance on par with machine-specific tuned libraries."
There are performance problems when using certain newer versions of GCC with ATLAS. These compiler and other issues are discussed in the ATLAS Errata, which should be read before attempting to build or use ATLAS.
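Since ATLAS supplies the standard C BLAS interface, calling it looks the same as calling any other CBLAS. A minimal sketch (the link flags, often something like -lcblas -latlas, vary by installation):

/* Minimal use of the C BLAS interface that ATLAS provides:
   C := alpha*A*B + beta*C via cblas_dgemm on 2x2 row-major matrices. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = { 1.0, 2.0,
                    3.0, 4.0 };
    double B[4] = { 5.0, 6.0,
                    7.0, 8.0 };
    double C[4] = { 0.0, 0.0,
                    0.0, 0.0 };

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K        */
                1.0, A, 2,        /* alpha, A, lda  */
                B, 2,             /* B, ldb         */
                0.0, C, 2);       /* beta, C, ldc   */

    printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
    return 0;
}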
According to Greg Henry (3/14/2001): "Go to Intel's new official MKL webpage: Intel MKL where they now (since 3/13/01) support Linux. My personal observation is that the MKL 5.1 BETA release is not as fully optimized for the Pentium IV as the MKL 5.1 full release is. Please keep this in mind and check back to make certain you are using the most recent release."
According to Greg Henry (3/14/2001): "There are 3 choices:
Being faced with the task of implementing one of these things on a mess o' hardware, I've cobbled together a list of candidates in the hopes of lessening the confusion.
"In this review, I will try to go through most of the over 100 projects that are listed in freshmeat's Clustering/Distributed Networks category that relate to Linux clustering. In order to do this effectively, I will break down the projects into a few categories. Here is a quick outline of how this review will be structured:
Software for building and using clusters
A weblog of articles on the subject as well as a repository of scads of related projects.
"OSCAR version 1.4 is a snapshot of the best known methods for building, programming, and using clusters. It consists of a fully integrated and easy to install software bundle designed for high performance cluster computing. Everything needed to install, build, maintain, and use a modest sized Linux cluster is included in the suite, making it unnecessary to download or even install any individual software packages on your cluster."
The components of OSCAR include:
"System Installation Suite ("SIS") is not just another mass install tool, it is built with the idea that a large number of applications will interface with the various aspects of SIS to create a robust, flexible application stack capable of more than just installing and setting up a variety of workstations. Long term goals include providing a solution to the problems of maintaining, upgrading, and recovering entire networks of dissimilar workstations and servers."
"ORNL is developing user interfaces for assisting in the system management of PC clusters. C3 is a command line interface that may also be called within programs. M3C provides a web based GUI, that among other features, may invoke the C3 tools. The Cluster Command and Control (C3) tool suite was developed for use in operating the HighTORC Linux cluster at Oak Ridge National Laboratory. This suite implements a number of command line based tools that have been shown to increase system manager scalability by reducing time and effort to operate and manage the cluster."
"LAM (Local Area Multicomputer) is an MPI programming environment and development system for heterogeneous computers on a network. With LAM, a dedicated cluster or an existing network computing infrastructure can act as one parallel computer solving one problem. LAM features extensive debugging support in the application development cycle and peak performance for production applications. LAM features a full implementation of the MPI communication standard."
"The Linux Utility for cluster Installation (LUI) is an open source utility for installing Linux workstations remotely, over an ethernet network. What distinguishes LUI is that it is "resource based". LUI provides tools to manage installation resources on the server, that can be allocated and applied to installing clients, allowing users to select just which resources are right for each client. Examples of resources supported in LUI 1.1 are the the linux kernel and associated system map, the disk partition table, RPMs, user exits, and local and remote (NFS) file systems. LUI supports both the BOOTP protocol for diskette based client installation, as well as true network installation, using DHCP and PXE."
"Maui is an advanced job scheduler for use on clusters and supercomputers. It is a highly optimized and configurable tool capable of supporting a large array of scheduling policies, dynamic priorities, extensive reservations, and fairshare and is acknowledged by many as 'the most advanced scheduler in the world'. It is currently in use at hundreds of leading government, academic, and commercial sites throughout the world. It improves the manageability and efficiency of machines ranging from clusters of a few processors to multi-teraflop supercomputers."
"OpenSSH is a FREE version of the SSH protocol suite of network connectivity tools that increasing numbers of people on the Internet are coming to rely on. Many users of telnet, rlogin, ftp, and other such programs might not realize that their password is transmitted across the Internet unencrypted, but it is. OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks. Additionally, OpenSSH provides a myriad of secure tunneling capabilities, as well as a variety of authentication methods."
"PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost effectively by using the aggregate power and memory of many computers. The software is very portable. The source, which is available free thru netlib, has been compiled on everything from laptops to CRAYs."
The components of Beowulf include:
"The Beowulf Distributed Process Space (BProc) is set of kernel modifications, utilities and libraries which allow a user to start processes on other machines in a Beowulf-style cluster. Remote processes started with this mechanism appear in the process table of the front end machine in a cluster. This allows remote process management using the normal UNIX process control facilities. Signals are transparently forwarded to remote processes and exit status is received using the usual wait() mechanisms."
"One of the goals of the goals of the Beouwlf project is to demonstrate scalable I/O using commodity subsystems. For scaling network I/O we devised a method to join multiple low-cost networks into a single logical network with higher bandwidth. The only additional work over a using single network interface is the computationally simple task of distributing the packets over the available device transmit queues."
SCE is a set of interoperable open source tools that enables users to easily build and deploy a Beowulf cluster. SCE comes in two flavors:
The components of NPACI Rocks include:
"MPICH is a freely available, portable implementation of MPI, the Standard for message-passing libraries."
An advanced job scheduler for use on clusters and supercomputers.
"The Portable Batch System (PBS) is a flexible batch queueing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multi-platform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems."
An effort to provide a high-performance and scalable parallel file system for PC clusters.
"OpenMosix is a a set of extensions to the standard Linux kernel allowing you to build a cluster of out of off-the-shelf PC hardware. openMosix scales perfectly up to thousands of nodes. You do not need to modify your applications to benefit from your cluster (unlike PVM, MPI, Linda, etc.). Processes in openMosix migrate transparently between nodes and the cluster will always auto-balance."
The new cluster architecture design replaces legacy mechanisms for booting (LinuxBIOS) and runs an operating system that provides a single system image of the entire cluster (BProc). Contrast this with the traditional cluster architecture which is a loose coupling of many individual single user workstations.
Clustermatic is a collection of new technologies being developed specifically for our new cluster architecture. Each technology can be used separately, and thus does not preclude integration with other clustering efforts or even other types of computing environments. The components of Clustermatic include:
"Replaces the normal BIOS bootstrap mechanism with a Linux kernel that can be booted from a cold start. Cluster nodes can now be as simple as they need to be -- perhaps as simple as a CPU and memory, no disk, no floppy, and no file system (though it does not preclude these things). As a side effect, they are up and running in under 3 seconds."
"Provides a single system image of the entire cluster. LinuxBIOS cluster nodes come up autonomously and contact the "front end" node which sends them a BProc kernel to boot and registers them as part of the cluster. Users run programs on the front end, which migrates the jobs to the other cluster nodes."
"A high speed cluster monitoring tool that can collect 1000 samples per node per second without noticeable affect on the cluster nodes. The data from Supermon can be used to monitor node health and perform remote node maintenance. In addition, the monitoring information can be used to predictively react to node failures."
"For application support, we have added automatic checkpointing in the ZPL compiler. ZPL is a high level parallel programming language developed at the University of Washington. The compiler inserts checkpoint calls in the user's source code at places with a minimum number of live variables, greatly reducing the checkpoint size as compared to other systems that use the virtual memory system to checkpoint dirty pages. The compiler can also guarantee that there are no in-flight messages during the checkpoint; this eliminates the need for message logging for recovery."
"The SSI project leverages both HP's NonStop Clusters for Unixware technology and other open source technology to provide a full, highly available SSI environment for Linux. Goals for SSI Clusters include availability, scalability and manageability, built from standard servers. Technology pieces will include: membership, single root and single init, cluster filesystems and DLM, single process space and process migration, load leveling, single and shared IPC space, device space and networking space, and single management space. The SSI project was seeded with HP's NonStop Clusters for UnixWare (NSC) technology. It also leverages other open source technologies, such as Cluster Infrastructure (CI), Global File System (GFS), keepalive/spawndaemon, Linux Virtual Server (LVS), and the Mosix load-leveler, to create the best general-purpose clustering environment on Linux.
"CplantTM system software is a collection of code designed, with an emphasis on scalability, to provide a full-featured environment for cluster computing on commodity hardware components. For example, CplantTM system software provides a scalable message passing layer, scalable runtime utilities, and scalable debugging support. CplantTM system software is distributed as source code which can be built for a specific hardware configuration. This source code consists of operating system code (in the form of Linux modules and driver), application support libraries and compiler tools, an MPI port, user-level runtime utilities, support for application debugging, and scripts for configuring and installing the built software."
The major Cplant system software components are:
Standards
"MPI/RT-1.0 is a new Standard for real-time communications and message exchanges between multiple processes that combine forces and work as a group to obtain the solution to a common numerical problem."
"IMPI is an industrial-led effort to create a standard to enable interoperability of different implementations of the Message Passing Interface (MPI)."
"OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared memory parallelism in Fortran and C/C++ programs."
Implementations
"MPICH is a freely available, portable implementation of MPI, the Standard for message-passing libraries."
"MPICH-G2 is a grid-enabled implementation of the MPI v1.1 standard. That is, using Globus services (e.g., job startup, security), MPICH-G2 allows you to couple multiple machines, potentially of different architectures, to run MPI applications. MPICH-G2 automatically converts data in messages sent between machines of different architectures and supports multiprotocol communication by automatically selecting TCP for intermachine messaging and (where available) vendor-supplied MPI for intramachine messaging."
"MP-MPICH stands for Multi-Platform MPICH. It is a modification and extension to the MPICH distribution (currently release 1.2.0). MP-MPICH compiles and runs on all common UNIX platforms (just like the original MPICH) and also on Windows NT and 2000/XP Professional (tested with Visual C++ 6.0)."
"However, with MPICH, the recently arising class of parallel platforms, commonly referred to as Clusters, can only be utilized with TCP/IP as interconnect between the nodes. To use more sophisticated cluster interconnects like the Scalable Coherent Interface (SCI), a commercially developed MPI implementation has to be used which has to be purchased and does not come with source code. Support of SCI-connected clusters by MPICH would help to make this platform more affordable and thus more commonly used. Therefore, we developed an ADI-2 device for SCI-adapters which enables MPICH on SCI-connected clusters."
"A port of MPICH atop GAMMA."
"LAM (Local Area Multicomputer) is an MPI programming environment and development system for heterogeneous computers on a network. With LAM, a dedicated cluster or an existing network computing infrastructure can act as one parallel computer solving one problem. LAM features extensive debugging support in the application development cycle and peak performance for production applications. LAM features a full implementation of the MPI communication standard."
"A Threads-Only MPI Implementation written by Erik Demaine. TOMPI is designed to run MPI programs on a single computer, either a single processor or an SMP. It is designed to be efficient in this environment, allowing effective testing, debugging, and tuning of parallel programs on a workstation. Typically, you can get 100,000+ processes using user-level threads (depending on the stack size you choose), and 512 processes using system-level threads. Experimentally, the slowdown is sub-logarithmic (for example, running 1,024 processes on the same problem is under ten times slower than the serial version)."
Alternative Language Bindings
"An overloaded interface for Fortran 90 and MPI. It defined the same derived types as for the MPICH implementation, includes automatic generation of MPI derived type definitions and overloads all MPI functions."
"Pypar is a efficient but easy-to-use module that allows programs/scripts written in the Python programming language to run in parallel on multiple processors and communicate using message passing. Pypar provides bindings to an important subset of the message passing interface standard MPI."
"The module allows you to send nested data structures between nodes without having to worry about serializing them. So far the module is known to work on: Unicos, Irix, FreeBSD, Linux, Windows 2000 and Compaq Alpha Servers."
"mpiJava is an object-oriented Java interface to the standard Message Passing Interface (MPI)."
"MPI Ruby is a Ruby binding of MPI. The primary goal in making this binding was to make the power of MPI available to Ruby users in a way that fits into the language's object oriented model. In order to do this, the buffer and datatype management necessary in the C, C++, and Fortran bindings have been removed. What this means is that MPI Ruby allows you to treat objects as messages."
"The Object Oriented MPI (OOMPI) package is an object oriented approach to the Message Passing Interface (MPI). OOMPI is a class library specification that encapsulates the functionality of MPI into a functional class hierarchy to provide a simple, flexible, and intuitive interface. The "thin layer" approach to providing powerful class abstractions has been proven, with reasonable optimizing C++ compilers, to add virtually no overhead to the underlying MPI."
"hMPI is an acronym for HaskellMPI. It is a Haskell binding conforming to MPI (Message Passing Interface) standard 1.1/1.2 (LAM and MPICH, which are both implementations of the standard, are supported)."
MPI Tools
"SMS is a directive-based parallelization tool that translates Fortran code into a parallel version that runs efficiently on both shared and distributed memory systems including the IBM SP2, Cray T3E, SGI Origin, Sun Clusters, Alpha Linux clusters and Intel Clusters. SMS was designed to reduce the time required to parallelize serial codes and to maintain them. FSL has parallelized several weather and ocean models using SMS."
"MagPIe is a library of MPI collective communication operations that are optimized for wide-area systems. Version 2.0 is designed to be an add-on to any MPI implementation. MagPIe is built as a separate library and calls the underlying MPI via the profiling interface. Applications just have to be linked with MagPIe and with MPI; changes to application source are not necessary for using MagPIe. However, it is required that you implement two functions that tell MagPIe how many clusters your wide-area system has and which MPI process is located in which cluster."
"Cactus is an open source problem solving environment designed for scientists and engineers. Its modular structure easily enables parallel computation across different architectures and collaborative code development between different groups."
"PETSc is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the MPI standard for all message-passing communication."
"MPIMap is a free tool that graphically displays the structure of user-defined MPI data types such as vectors and structs."
"The SKaMPI-Benchmark is a suite of tests designed to measure the performance of MPI."
"The Global Arrays (GA) toolkit provides an efficient and portable "shared-memory" programming interface for distributed-memory computers. Each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed dense multi-dimensional arrays, without need for explicit cooperation by other processes. Unlike other shared-memory environments, the GA model exposes to the programmer the non-uniform memory access (NUMA) characteristics of the high performance computers and acknowledges that access to a remote portion of the shared data is slower than to the local portion. The locality information for the shared data is available and a direct access to the local portions of shared data is provided.
Global Arrays have been designed to complement rather than substitute the message-passing programming model. The programmer is free to use both the shared-memory and message-passing paradigms in the same program, and to take advantage of existing message-passing software libraries. Global Arrays are compatible with the Message Passing Interface (MPI)."
"NetPIPE is a protocol independent performance tool that encapsulates the best of ttcp and netperf and visually represents the network performance under a variety of conditions. It performs simple ping-pongxi tests, bouncing messages of increasing size between two processes, whether across a network or within an SMP system. Message sizes are chosen at regular intervals, and with slight perturbations, to provide a complete test of the communication system. Each data point involves many ping-pong tests to provide an accurate timing. Latencies are calculated by dividing the round trip time in half for small messages."
The mission of the LTC Linux Performance Team is to make Linux better by improving Linux kernel performance, with special emphasis on SMP scalability. We measure, analyze, and improve the performance and scalability of the Linux kernel, focusing on platform-independent issues.