Active since 1984, APE ("ah-pei") is the collective name of several
generations of massively parallel supercomputers optimized for
theoretical physics simulations, mainly Lattice Gauge Theory (LGT) simulations.
They are built on custom processors connected by a high-bandwidth,
low-latency communication network. The APE machines are recognized as a leading
platform for Lattice QCD (LQCD),
one of the most demanding High Performance Computing (HPC) applications.
apeNEXT is the latest generation of APE supercomputers and is a
joint project of INFN, DESY and Université Paris-Sud 11. It is currently in the mass production and
deployment phase.
The apeNEXT development sites are INFN Rome, INFN Ferrara, INFN Parma, DESY Zeuthen and Université Paris-Sud 11 in Orsay.
Overview of the APE Architecture
The APE machines are massively parallel 3D arrays of custom
computing nodes with periodic boundary conditions. Four
generations of APE supercomputing engines have been characterized
by high sustained performance and by favourable performance/volume,
performance/power and performance/price ratios. The APE group made
extensive use of VLSI design, developing VLIW processor
architectures with a native implementation of the complex-type
operator that performs the AxB+C operation, and
large multi-port register files that provide high bandwidth to the
arithmetic block. The interconnection system is optimized for
low-latency, high-bandwidth nearest-neighbour communication. The
result is a dense system based on a reliable and safe hardware
solution. Custom mechanics had to be developed to allow
large-scale integration of low-cost systems with very low
maintenance costs.
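
As an illustration of what "3D array with periodic boundary conditions" means for nearest-neighbour communication, the following minimal sketch lists the six neighbours of a node on a 3D torus. The function name `torus_neighbours` and the coordinate convention are hypothetical and chosen only for this example; the 8x8x8 grid is the size of one apeNEXT rack quoted in the image captions.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbours(node: Coord, shape: Coord) -> List[Coord]:
    """Return the six nearest neighbours of `node` on a 3D grid of size
    `shape` with periodic (wrap-around) boundaries."""
    neighbours = []
    for axis in range(3):
        for step in (+1, -1):
            coord = list(node)
            # The modulo implements the periodic boundary: stepping past
            # the last node of an axis wraps around to the first one.
            coord[axis] = (coord[axis] + step) % shape[axis]
            neighbours.append(tuple(coord))
    return neighbours


if __name__ == "__main__":
    # A node on the edge of an 8x8x8 grid still has six neighbours.
    print(torus_neighbours((0, 0, 7), (8, 8, 8)))
```

Every node, including those on the faces of the grid, has exactly six neighbours, so the same communication pattern applies to all nodes.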
The APE family of supercomputers
| | APE | APE100 | APEmille | apeNEXT |
|---|---|---|---|---|
| Year | 1984-1988 | 1989-1993 | 1994-1999 | 2000-2005 |
| Number of processors | 16 | 2048 | 2048 | 4096 |
| Topology | Flexible 1D | Next Neighbour 3D | Flexible 3D | Flexible 3D |
| Total Memory | 256 MB | 8 GB | 64 GB | 1 TB |
| Clock | 8 MHz | 25 MHz | 66 MHz | 200 MHz |
| Peak Processing Power | 1 GFlops | 100 GFlops | 1 TFlops | 7 TFlops |
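
The figures in the table can be turned into per-processor values with simple arithmetic. The sketch below is only an illustration: the dictionary layout and the function name `per_processor` are hypothetical, and MB/GB/TB are assumed to be binary multiples (1 GB = 1024 MB).

```python
# Figures copied from the table of APE generations.
# ASSUMPTION: memory units are binary multiples (1 GB = 1024 MB).
GENERATIONS = {
    "APE":      {"processors": 16,   "memory_mb": 256,         "peak_gflops": 1},
    "APE100":   {"processors": 2048, "memory_mb": 8 * 1024,    "peak_gflops": 100},
    "APEmille": {"processors": 2048, "memory_mb": 64 * 1024,   "peak_gflops": 1000},
    "apeNEXT":  {"processors": 4096, "memory_mb": 1024 * 1024, "peak_gflops": 7000},
}


def per_processor(name):
    """Return (memory in MB, peak in MFlops) per processor for one generation."""
    g = GENERATIONS[name]
    memory_mb = g["memory_mb"] / g["processors"]
    peak_mflops = 1000 * g["peak_gflops"] / g["processors"]
    return memory_mb, peak_mflops


if __name__ == "__main__":
    for name in GENERATIONS:
        memory_mb, peak_mflops = per_processor(name)
        print(f"{name:9s} {memory_mb:6.0f} MB/processor {peak_mflops:8.1f} MFlops/processor")
```

Under these assumptions the apeNEXT row gives 256 MB per processor, which agrees with the 256 MB of memory per node listed among the processor highlights.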
As pointed out earlier, one of the most critical issues in the design
of the next generations of systems for high-performance numerical applications
will be power dissipation. A typical figure of merit is the power
dissipation per unit area. The following picture illustrates the
efficiency of the approach adopted by INFN:
On the other hand, the main limiting factor for the density of
processing nodes is the inter-processor bandwidth density. In the
traditional approach, used in top computing systems such as INFN
apeNEXT and IBM
BlueGene/L, the inter-processor connections use a 3D toroidal
point-to-point network. In these systems the density of processing
nodes is limited by the number of connectors
and the related surface area required to implement such networks. The
physical implementation distributes the set of
motherboard connectors along one (1D) edge of the board and uses a backplane to form a
multi-board system.
As a consequence, the density of processing nodes per
motherboard is 16 nodes for apeNEXT and 64 nodes for BlueGene/L,
while the density per rack is 512 nodes for
apeNEXT and 1024 for BlueGene/L.
The J&T processor has been designed in standard-cell
technology using the Atmel
0.18 µm, 7-metal process. The chip die has a surface of 16x16
mm² and contains 520K gates. The chip package is a BGA with
600 pins. The chip target frequency is 200 MHz.
Processor highlights are:
- Low power consumption: 5W at 200MHz.
- Complex algebra is supported in HW, to accelerate LQCD computation.
- All data types have full 64 bit precision.
- A*B+C operator for all data types: integer, double, vector integer, vector double, complex.
- 8 KWord of on-chip instruction cache.
- 256 MB DDR DRAM plus error correction (EDAC).
- 128 bit, up to 200 MHz, local memory channel.
- 6+1 on-chip, bidirectional communication links with 200 MB/s of bandwidth in LVDS technology.
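
To make the A*B+C operator concrete, the sketch below spells out a complex multiply-add in real arithmetic and counts its floating point operations. The peak estimate rests on an assumption that is not stated in the list above: that the processor completes one complex A*B+C per clock cycle. The function and constant names are hypothetical.

```python
def complex_multiply_add(a: complex, b: complex, c: complex) -> complex:
    """Compute a*b + c using real arithmetic only:
    4 multiplications and 4 additions/subtractions."""
    re = a.real * b.real - a.imag * b.imag + c.real
    im = a.real * b.imag + a.imag * b.real + c.imag
    return complex(re, im)


FLOPS_PER_COMPLEX_MADD = 8   # 4 multiplies + 4 adds/subtracts, as counted above
CLOCK_HZ = 200_000_000       # 200 MHz target frequency, from the text
PROCESSORS = 4096            # apeNEXT column of the table of APE generations

# ASSUMPTION: one complex A*B+C is completed per clock cycle.
peak_per_processor = FLOPS_PER_COMPLEX_MADD * CLOCK_HZ
peak_system = PROCESSORS * peak_per_processor

if __name__ == "__main__":
    print(complex_multiply_add(1 + 2j, 3 + 4j, 5 + 6j))
    print(f"{peak_per_processor / 1e9:.1f} GFlops per processor")
    print(f"{peak_system / 1e12:.2f} TFlops for {PROCESSORS} processors")
```

Under that assumption the estimate is 1.6 GFlops per processor and about 6.6 TFlops for 4096 processors, the same order as the 7 TFlops peak quoted for apeNEXT.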
QCD (Quantum Chromodynamics) is the field theory that
describes the physics of the strong force, which rules the behavior of
quarks and gluons. In general, the theory cannot be solved
by purely analytical methods and requires simulation
on computers. LQCD (Lattice QCD) is
the discretized version of the theory and is the method of
choice. The related algorithms are considered among the most
demanding numerical applications. A large part of the scientific
community that historically developed the LQCD set of algorithms
is now trying to apply similar numerical methods to other
challenging fields requiring physical modelling,
e.g. biocomputing. Up to now, LQCD has been the driving
force for the design of systems like QCDSP and
QCDOC
(Columbia University), CP-PACS
(Tsukuba University) and several generations of systems designed by
INFN (the APE family of massively parallel computers).
Present simulations run on 4D (64³x128) physical
lattices, which map well onto a 3D lattice of processors. On
a 10+ year scale, theoretical physics will need to grow to
256³x512 physical lattices.
Since the required computational power scales with the seventh
power of the lattice size, PetaFlops systems are needed.
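
The mapping and the scaling argument above can be sketched numerically. The example makes three assumptions that go beyond the text: the three spatial dimensions are split evenly over the 3D processor grid while the fourth dimension stays local to each node, the processor grid is 8x8x8 (the size of one apeNEXT rack quoted in the image captions), and "lattice size" means the linear extent of the lattice. The function names are hypothetical.

```python
from typing import Tuple


def local_sublattice(lattice: Tuple[int, int, int, int],
                     grid: Tuple[int, int, int]) -> Tuple[int, ...]:
    """Split the three spatial extents of a 4D lattice evenly over a 3D
    processor grid; the fourth (time) extent is kept whole on each node."""
    *space, time = lattice
    for extent, procs in zip(space, grid):
        if extent % procs:
            raise ValueError(f"extent {extent} is not divisible by {procs}")
    return tuple(e // p for e, p in zip(space, grid)) + (time,)


def cost_ratio(current_extent: int, future_extent: int, exponent: int = 7) -> float:
    """Growth in required computational power when the linear lattice
    extent grows, assuming cost scales as extent**exponent."""
    return (future_extent / current_extent) ** exponent


if __name__ == "__main__":
    # 64^3 x 128 lattice on an 8x8x8 processor grid
    print(local_sublattice((64, 64, 64, 128), (8, 8, 8)))
    # growth factor from 64^3 x 128 to 256^3 x 512 (linear extent x4)
    print(cost_ratio(64, 256))
```

With these assumptions each node holds an 8x8x8x128 sublattice, and a fourfold increase in linear extent multiplies the required computational power by 4^7 = 16384.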
Notably, architectures designed for the efficient solution of LQCD
have also proved efficient on the numerical kernels of a few
demanding Digital Signal Processing (DSP) algorithms, but their
transformation into DSP engines required the
addition of critical real-time and system interface features. In
the future the similarity between the requirements of DSP and physical
modelling will increase. This aspect must be understood and
exploited. A starting point for exploring this convergence will be
the inclusion of LQCD and other physical modelling algorithms in
the benchmark suite of SHAPES. The participation of INFN Roma in
the SHAPES
consortium is meant to develop basic technologies for the next
generation of LQCD engines and to carry out a bi-directional
technology transfer.
The goal is to develop (2006-2009) a new generation of parallel machines
dedicated to a broad range of numerical simulation applications
and characterized by:
- A peak performance of the order of 1 PetaFlops.
- A number of processors of the order of 100,000.
- Very high computational efficiency on target applications.
- Reduced power consumption.
- Very high density of computing power per cubic meter.
A Small Image collection
A photo of a latest-generation, mass-production apeNEXT rack with 512 processors (8x8x8).

A photo of the apeNEXT J&T processor, mounted on a piggy-back module together with 256 MB of SDRAM.

A photo showing two APEmille racks installed at INFN Roma.

A computer graphics image computed with a SIMD parallel ray tracing algorithm, with a texture-mapped sea bottom and mirror spheres.