Active since 1984, APE (pronounced "ahpei") is the collective name of several
generations of massively parallel supercomputers optimized for
theoretical physics simulations, mainly Lattice Gauge Theory (LGT) simulations.
They are built on custom processors connected by a high-bandwidth,
low-latency communication network. The APE machines are recognized as a leading
platform for LQCD,
one of the most demanding High Performance Computing (HPC) applications.
apeNEXT is the latest generation of APE supercomputers and a
joint project of INFN, DESY and Université Paris-Sud 11. It is currently in the mass production and
deployment phase.
The apeNEXT development sites are: INFN Rome, INFN Ferrara, INFN Parma, DESY Zeuthen, and Université Paris-Sud 11 in Orsay.
Overview of the APE Architecture
The APE machines are massively parallel 3D arrays of custom
computing nodes with periodic boundary conditions. Four
generations of APE supercomputing engines have been characterized
by impressive sustained performance and performance/volume,
performance/power and performance/price ratios. The APE group made
extensive use of VLSI design, designing VLIW processor
architectures with a native implementation of the complex-type
operator performing the A×B+C operation and
large multiport register files providing high bandwidth to the
arithmetic block. The interconnection system is optimized for
low-latency, high-bandwidth nearest-neighbour communication. The
result is a dense system, based on a reliable and safe hardware
solution. It has been necessary to develop custom mechanics for
the wide integration of inexpensive systems with a very low cost of
maintenance.
The APE family of supercomputers

                         APE          APE100                APEmille     apeNEXT
Year                     1984-1988    1989-1993             1994-1999    2000-2005
Number of processors     16           2048                  2048         4096
Topology                 Flexible 1D  Nearest-neighbour 3D  Flexible 3D  Flexible 3D
Total Memory             256 MB       8 GB                  64 GB        1 TB
Clock                    8 MHz        25 MHz                66 MHz       200 MHz
Peak Processing Power    1 GFlops     100 GFlops            1 TFlops     7 TFlops

As pointed out before, one of the most critical issues for the design
of the next generations of high-performance numerical engines
will be power dissipation. A typical figure of merit is the power
dissipation per unit area. The following picture illustrates the
efficiency of the approach adopted by INFN:
On the other hand, the most limiting factor for the density of
processing nodes is the inter-processor bandwidth density. In the
traditional approach, used in top computing systems such as INFN
apeNEXT and IBM
BlueGene/L, the inter-processor connections form a 3D toroidal
point-to-point network. In these systems the limiting factor for
the density of processing nodes is the number of connectors,
and the related surface area, required to implement such networks. The
physical implementation distributes the set of
motherboard connectors along one edge of the board, using a backplane to form a
multi-board system.
As a consequence the density of processing nodes per
motherboard is 16 nodes for apeNEXT and 64 nodes for BlueGene/L,
while the density per rack is 512 nodes for
apeNEXT and 1024 for BlueGene/L.

The J&T processor has been designed in standard-cell
technology using the Atmel
0.18 µm, 7-metal process. The chip die has a surface of 16x16
mm^{2} and contains 520K gates. The chip package is a BGA with
600 pins. The chip target frequency is 200 MHz.
Processor highlights are:
- Low power consumption: 5 W at 200 MHz.
- Complex algebra supported in hardware, to accelerate LQCD computation.
- All data types have full 64-bit precision.
- A*B+C operator for all data types: integer, double, vector integer, vector double, complex.
- 8 Kwords of on-chip instruction cache.
- 256 MB DDR DRAM plus error correction (EDAC).
- 128-bit, up to 200 MHz, local memory channel.
- 6+1 on-chip, bidirectional communication links with 200 MB/s of bandwidth, in LVDS technology.

QCD (Quantum ChromoDynamics) is the field theory which
describes the physics of the strong force, ruling the behaviour of
quarks and gluons. In general, the theory cannot be solved
by purely analytical methods and requires simulation
on computers. LQCD (Lattice QCD) is
the discretized version of the theory and is the method of
choice. The related algorithms are considered among the most
demanding numerical applications. A large part of the scientific
community which historically worked out the LQCD set of algorithms
is now trying to apply similar numerical methods to other
challenging fields requiring physical modelling,
e.g. bio-computing. Up to now, LQCD has been the driving
force for the design of systems like QCDSP and
QCDOC
(Columbia University), CP-PACS
(University of Tsukuba) and several generations of systems designed by
INFN (the APE family of massively parallel computers).
Present simulations run on 4D (64^{3}x128) physical
lattices, which map well onto a 3D lattice of processors. On
a 10+ year scale, theoretical physics will need to grow to
256^{3}x512 physical lattices.
Since the required computational power scales with the seventh
power of the lattice size, PETAFlops systems are needed.
Notably, architectures designed for the efficient solution of LQCD
have also proved efficient on the numerical kernels of a few
demanding Digital Signal Processing algorithms, but their
transformation into DSP engines required the
addition of critical real-time and system-interface features. In
the future the similarity between the requirements of DSP and physical
modelling will increase. This aspect must be understood and
exploited. A starting point to explore this convergence will be
the inclusion of LQCD and other physical modelling algorithms in
the benchmark suite of SHAPES. The participation of INFN Roma in
the SHAPES
consortium is meant to develop basic technologies for the next
generation of LQCD engines and to perform a bidirectional
technology-transfer action.
The goal is to develop (2006-2009) a new generation of parallel machine
dedicated to a broad range of numerical simulation applications
and characterized by:
- A peak performance of the order of 1 PetaFlops.
- A number of processors of the order of 100,000.
- Very high computational efficiency on target applications.
- Reduced power consumption.
- An impressive density of computing power per cubic meter.
A Small Image Collection

- A photo of a last-generation, mass-production-level apeNEXT rack, sporting 512 processors (8x8x8).
- A photo of the apeNEXT J&T processor, mounted on a piggyback module together with 256 MB of SDRAM.
- A photo showing 2 APEmille racks installed at INFN Roma.
- A Computer Graphics image computed via a SIMD parallel ray-tracing algorithm, with texture-mapped sea bottom and mirror spheres.