KLAT2, FNN & SWAR
No, those aren't the names of characters in a bad fantasy
trilogy (is there any other kind?), but rather abbreviations
associated with the neatest advance in
Beowulf clusters
since someone said, "By golly! We've got a bunch of computers,
a barn, and a script, so let's put on a show!"
The Kentucky Linux Athlon Testbed 2 is a cluster of 66 700 MHz AMD
Athlon machines that uses a novel combination of hardware and
software tricks to attain 64 GigaFLOPS at roughly $650 per GigaFLOPS.
You heard correctly - the folks at the
Aggregate project
squeezed 64 GigaFLOPS out of a cluster they put together for
just over $41,000.
The
FAQ defines a Beowulf cluster thusly:
It's a kind of high-performance massively parallel computer built
primarily out of commodity hardware components, running a free-software
operating system like Linux or FreeBSD, interconnected by a private
high-speed network. It consists of a cluster of PCs or workstations
dedicated to running high-performance computing tasks. The nodes in
the cluster don't sit on people's desks; they are dedicated to running
cluster jobs. It is usually connected to the outside world through
only a single node.
As you might guess, the performance one can coax out of such
a beastie is a function of the interconnection topology and bandwidth as well as the software used.
The Aggregate folks came up with a new network topology called
a Flat
Neighborhood
Network that minimizes the latency between
any pair of machines in the cluster, i.e. the amount of time a signal
is delayed due to traveling through the wires and, more significantly,
waiting at any of the switches in the network.
The ideal solution would be to implement a direct connection
between each pair of machines, but the nonexistence of commodity
motherboards with space for 65 network interface cards
(NICs) rules out that choice.
The next best choice would be a one-switch-delay solution requiring
a 66-way switch, although that sort of thing isn't exactly on the
shelves down at Best Buy either.
Some damned clever cogitating about possible cheap and
efficient alternatives brought the Aggregate folks to the FNN:
The "Flat Neighborhood" network topology came from the realization that it was sufficient to share at least one switch with each PC -- all PCs do not
have to share the same switch. A switch defines a local network neighborhood, or subnet. If a PC has several NICs, it can belong to several
neighborhoods. For two PCs to communicate directly, they simply use NICs that are in a neighborhood that the two PCs have in common.
Coincidentally, this flat, interleaved arrangement of the switches results in spectacular bisection bandwidth -- approaching the same bisection
bandwidth that we would have gotten if we had wire-speed switches that were wide enough to span the entire cluster!
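To make the sharing rule concrete, here's a minimal Python sketch of
the FNN property - every pair of PCs must have at least one switch in
common, even though no single switch spans the whole cluster. The
four-PC, three-switch topology below is made up for illustration and
is far smaller than KLAT2's.

```python
# Toy illustration of the Flat Neighborhood idea: every pair of PCs
# shares at least one switch ("neighborhood"), though no switch spans
# the whole cluster. This topology is hypothetical, not KLAT2's.

# Map each PC to the set of switches its NICs plug into.
fnn = {
    "pc0": {"sw_a", "sw_b"},
    "pc1": {"sw_a", "sw_c"},
    "pc2": {"sw_b", "sw_c"},
    "pc3": {"sw_a", "sw_b"},
}

def common_neighborhoods(pc1, pc2):
    """Switches both PCs reach directly -- a one-hop path between them."""
    return fnn[pc1] & fnn[pc2]

def is_flat(topology):
    """True if every pair of PCs shares at least one switch."""
    pcs = list(topology)
    return all(topology[a] & topology[b]
               for i, a in enumerate(pcs) for b in pcs[i + 1:])

print(common_neighborhoods("pc1", "pc2"))  # {'sw_c'}
print(is_flat(fnn))                        # True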
A further advantage of this network topology is that the multiple
NICs in each machine can be combined via a technique called
channel bonding into a single datapath with a larger
bandwidth. This allows 100Mb NICs - obtainable for $10 in
mass quantities these days - to be used instead of the much more
expensive Gigabit or Myrinet cards, with the total cost in switches,
cabling and NICs coming out to $125 per node in KLAT2.
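Channel bonding itself lives down in the Linux kernel's bonding
driver, but the core idea - dealing chunks of one message out across
several NICs and reassembling them on the far side - fits in a few
lines. This toy Python sketch only illustrates the round-robin
striping; the chunk size and lane count are arbitrary.

```python
from itertools import zip_longest

def stripe(message, n_nics, chunk=4):
    """Deal fixed-size chunks of one message across n_nics links."""
    lanes = [bytearray() for _ in range(n_nics)]
    for i in range(0, len(message), chunk):
        lanes[(i // chunk) % n_nics] += message[i:i + chunk]
    return lanes

def gather(lanes, chunk=4):
    """Reassemble the original message on the receiving side."""
    parts = [[bytes(lane[i:i + chunk]) for i in range(0, len(lane), chunk)]
             for lane in lanes]
    # Interleave the chunks back in round-robin order.
    return b"".join(piece for group in zip_longest(*parts, fillvalue=b"")
                    for piece in group)

msg = b"KLAT2 channel bonding demo!"
lanes = stripe(msg, n_nics=3)
assert gather(lanes) == msg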
Seems really spiffy so far, doesn't it? But, as the song says,
"every form of refuge has its price," and it's a real sumbitch in
this case. The wiring pattern needed to both minimize latency
(i.e. switching) and maximize the bandwidth provided by each
switch is relatively easy to figure out for networks with relatively
few nodes, but when you have 66 nodes the difficulty in solving
the analogous graph covering problem increases
combinatorially. While this sort of nonlinear optimization problem
is difficult or impossible to solve via traditional methods, it's just
the thing at which one should throw a genetic search algorithm (GA).
Another bonus is that additional constraints can be added
to the aforementioned problem to further optimize the connection topology
for solving specific problems. Yep, you can tune the Beowulf
cluster to specific problems by simply running the genetic
search algorithm again, although the attendant rewiring can be a bit
of a pain in the ass so it's not something you want to do very
often.
They've even got an
interactive version of a simplified FNN designing GA you
can play with.
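For the curious, here's a minimal sketch of what such a genetic
search might look like in Python - far simpler than the Aggregate's
actual design tool, with made-up cluster dimensions and penalty
weights. A genome assigns each NIC to a switch; fitness rewards PC
pairs that share a switch and punishes wirings that demand more
ports than a switch has.

```python
import random

# Illustrative sizes only -- nothing like the real KLAT2 parameters.
N_PCS, NICS_PER_PC, N_SWITCHES, PORTS = 8, 2, 4, 5

def random_genome():
    """A genome maps every NIC in the cluster to some switch."""
    return [random.randrange(N_SWITCHES) for _ in range(N_PCS * NICS_PER_PC)]

def fitness(genome):
    # Each PC's neighborhood is the set of switches its NICs touch.
    hoods = [set(genome[p * NICS_PER_PC:(p + 1) * NICS_PER_PC])
             for p in range(N_PCS)]
    shared = sum(1 for i in range(N_PCS) for j in range(i + 1, N_PCS)
                 if hoods[i] & hoods[j])
    # Penalize any switch asked to supply more ports than it has.
    overload = sum(max(0, genome.count(s) - PORTS) for s in range(N_SWITCHES))
    return shared - 10 * overload

def evolve(generations=200, pop_size=40):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(len(a))      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:           # point mutation
                child[random.randrange(len(child))] = random.randrange(N_SWITCHES)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("best fitness:", fitness(best))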
The software trick they're using is exploiting the
instruction-level parallelism (ILP) available within the Athlon
processor in the guise of its 3DNow! technology.
They give the generic acronym SIMD Within A Register (SWAR) to
all such techniques that apply SIMD parallel processing across
sections of a CPU register, e.g. MMX, SSE, AltiVec and 3DNow!.
These techniques were all developed to make for faster games,
but there's no law saying they can't be used for scientific
programming as well. Such programming is of course difficult, but
they've developed a
toolset including a compiler that transforms
a vector dialect of C into SWAR code to make life a bit easier.
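The SWAR principle itself is easy to demonstrate even without 3DNow!
hardware: pack several small values into one wide register and
operate on them all with a single instruction, keeping carries from
leaking between lanes. The Python sketch below does four 16-bit
integer additions with one 64-bit add - 3DNow! applies the same idea
to pairs of 32-bit floats in hardware, which this integer toy merely
approximates.

```python
# SWAR with plain integers: four 16-bit lanes in one 64-bit "register",
# added in a single operation, with inter-lane carries masked off.

LANE = 16
LO_MASK = 0x7FFF_7FFF_7FFF_7FFF   # every lane bit except each lane's top bit
HI_MASK = 0x8000_8000_8000_8000   # the top bit of each lane

def pack(vals):
    """Pack four 16-bit values into one 64-bit word."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xFFFF) << (i * LANE)
    return word

def unpack(word):
    return [(word >> (i * LANE)) & 0xFFFF for i in range(4)]

def swar_add(x, y):
    """Four 16-bit additions at once; carries stay inside their lanes."""
    low = (x & LO_MASK) + (y & LO_MASK)          # add low 15 bits per lane
    return low ^ (x & HI_MASK) ^ (y & HI_MASK)   # fold top bits back in, carry-free

a = pack([1, 2, 3, 40000])
b = pack([10, 20, 30, 30000])
print(unpack(swar_add(a, b)))  # [11, 22, 33, 4464] -- last lane wraps mod 2**16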
Using
ScaLAPACK (a relative of the LINPACK benchmark used to compile the
Top500 Supercomputer Sites list), the Aggregate folks
got 22.8 GigaFLOPS of performance for the double precision
benchmark using no SWAR techniques.
When SWAR was applied, they obtained an amazing 64 GigaFLOPS
for the single precision ScaLAPACK benchmark.
That's as in 64 700 MHz Athlons producing 64 GigaFLOPS - and since
64 processors at 700 MHz supply only 44.8 billion clock cycles per
second between them, that performance level can only be achieved by
using SWAR to perform two operations simultaneously in most of the
code. The 64 GigaFLOPS number also includes the time taken for
interprocessor communication, by the way.
All this research has led to the concept of
Personalized Turnkey Supercomputers (PeTS), which are
customized cluster hardware and software combinations, each of which
appears to its scientific users as a dedicated piece of "laboratory
equipment" that directly solves their most important computational
problems. That is, a cheap and very fast supercomputer designed
to be maximally efficient for a specific task.
To echo the words of a certain chunky native of
Colorado, "Sweeeeeeeeet."
posted by Steven Baum
12/8/2000 03:04:05 PM