
APE - The Array Processor Experiment


 

Active since 1984, APE ("ah-pei") is the collective name of several generations of massively parallel supercomputers optimized for theoretical physics simulations, mainly Lattice Gauge Theory (LGT) simulations. They are built on custom processors connected by a high-bandwidth, low-latency communication network. The APE machines are recognized as a leading platform for Lattice QCD (LQCD), one of the most demanding High Performance Computing (HPC) applications.

apeNEXT is the latest generation of APE supercomputers and is a joint project of INFN, DESY and Université Paris-Sud 11. It is currently in the mass production and deployment phase.

apeNEXT development sites are: INFN Rome, INFN Ferrara, INFN Parma, DESY Zeuthen, Université Paris-Sud 11 in Orsay.


Overview of the APE Architecture

The APE machines are massively parallel 3D arrays of custom computing nodes with periodic boundary conditions. Four generations of APE supercomputing engines have been characterized by impressive sustained performance and by high performance/volume, performance/power and performance/price ratios. The APE group made extensive use of VLSI design, developing VLIW processor architectures with a native implementation of the A x B + C operation on complex operands and large multi-port register files that provide high bandwidth to the arithmetic block. The interconnection system is optimized for low-latency, high-bandwidth nearest-neighbour communication. The result is a dense system based on a reliable and safe hardware solution. Custom mechanics had to be developed to allow wide integration of cheap systems with very low maintenance cost.
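
To make the complex A x B + C operation concrete, the following minimal Python sketch spells it out in real arithmetic. The function name and the tuple representation are illustrative assumptions; the source does not describe how the hardware evaluates the operation internally.

```python
def complex_madd(a, b, c):
    """Return a*b + c, with each complex number given as a (re, im) tuple."""
    ar, ai = a
    br, bi = b
    cr, ci = c
    # 4 real multiplications and 4 real additions/subtractions in total.
    re = ar * br - ai * bi + cr
    im = ar * bi + ai * br + ci
    return (re, im)


# Real floating-point operations in one complex multiply-add (4 mul + 4 add).
FLOPS_PER_COMPLEX_MADD = 8

result = complex_madd((1.0, 2.0), (3.0, 4.0), (0.5, -0.5))
print(result)
```

Counted this way, one complex multiply-add corresponds to eight real floating-point operations, which is why a native implementation pays off for complex-valued workloads.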

The APE family of supercomputers

 

                        APE          APE100             APEmille     apeNEXT
Year                    1984-1988    1989-1993          1994-1999    2000-2005
Number of processors    16           2048               2048         4096
Topology                Flexible 1D  Next Neighbour 3D  Flexible 3D  Flexible 3D
Total Memory            256 MB       8 GB               64 GB        1 TB
Clock                   8 MHz        25 MHz             66 MHz       200 MHz
Peak Processing Power   1 GFlops     100 GFlops         1 TFlops     7 TFlops

  

  

  

  

  


Architecture Design Tradeoffs

Power vs Performance
picture

As pointed out earlier, one of the most critical issues in the design of the next generations of high performance numerical systems will be power dissipation. A typical figure of merit is the power dissipated per unit area. The accompanying picture illustrates the efficiency of the approach adopted by INFN:


3D Mesh of Processors

On the other hand, the factor that most limits the density of processing nodes is the inter-processor bandwidth density. In the traditional approach, used in top computing systems such as INFN apeNEXT and IBM BlueGene/L, the inter-processor connections form a 3-D toroidal point-to-point network. In these systems the density of processing nodes is limited by the number of connectors and the surface area required to implement such networks. Physically, the motherboard connectors are distributed along one (1-D) edge of the board, and a backplane joins the boards into a multi-board system.

As a consequence, the density of processing nodes per motherboard is 16 nodes for apeNEXT and 64 nodes for BlueGene/L, while the density per rack is 512 nodes for apeNEXT and 1024 for BlueGene/L.
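
As an illustration of the 3-D toroidal topology described above, the sketch below lists the six nearest neighbours of a node, wrapping around at the edges (periodic boundary conditions). The function name, the coordinate convention and the default 8x8x8 grid (one 512-node rack) are assumptions made for the example, not a description of the actual apeNEXT routing logic.

```python
def torus_neighbours(x, y, z, dims=(8, 8, 8)):
    """Return the six nearest neighbours of node (x, y, z) on a 3-D torus."""
    nx, ny, nz = dims
    # The modulo wraps coordinates around, so edge nodes also have six neighbours.
    return [
        ((x + 1) % nx, y, z),
        ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z),
        (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz),
        (x, y, (z - 1) % nz),
    ]


print(torus_neighbours(0, 0, 0))
```

Every node has exactly six point-to-point links regardless of its position, which is what makes the connector count per board the limiting factor discussed above.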


apeNEXT J&T processor design

J&T chip
floorplan

  

J&T chip scheme

The J&T processor has been designed in standard-cell technology using the Atmel 0.18 µm, 7-metal-layer process. The chip die measures 16 x 16 mm² and contains 520K gates. The chip package is a BGA with 600 pins. The target clock frequency is 200 MHz.

Processor highlights are:

  • Low power consumption: 5W at 200MHz.
  • Complex algebra is supported in HW, to accelerate LQCD computation.
  • All data types have full 64 bit precision.
  • A*B+C operator for all data types: integer, double, vector integer, vector double, complex.
  • 8 KWord of on-chip instruction cache.
  • 256 MB DDR DRAM plus error correction (EDAC).
  • 128 bit, up to 200 MHz, local memory channel.
  • 6+1 on-chip, bidirectional communication links in LVDS technology, with 200 MB/s of bandwidth.
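
The figures in the list above can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that the processor completes one complex A*B+C per clock cycle, counted as 8 floating-point operations, and that the memory channel moves one 128-bit word per cycle. Neither assumption is stated in this page, so the results are indicative only.

```python
CLOCK_HZ = 200_000_000        # target frequency: 200 MHz
FLOPS_PER_CYCLE = 8           # assumption: one complex A*B+C per cycle (4 mul + 4 add)
MEM_CHANNEL_BITS = 128        # local memory channel width
POWER_WATT = 5                # quoted power consumption at 200 MHz
N_PROCESSORS = 4096           # apeNEXT processor count from the family table

peak_flops_per_chip = CLOCK_HZ * FLOPS_PER_CYCLE
peak_flops_machine = peak_flops_per_chip * N_PROCESSORS
# Assumption: one 128-bit transfer per clock cycle.
mem_bandwidth_bytes_per_s = CLOCK_HZ * MEM_CHANNEL_BITS // 8
flops_per_watt = peak_flops_per_chip / POWER_WATT

print(f"peak per chip:    {peak_flops_per_chip / 1e9:.1f} GFlops")
print(f"peak, 4096 chips: {peak_flops_machine / 1e12:.2f} TFlops")
print(f"memory bandwidth: {mem_bandwidth_bytes_per_s / 1e9:.1f} GB/s")
print(f"efficiency:       {flops_per_watt / 1e6:.0f} MFlops/W")
```

Under these assumptions a 4096-processor machine comes out at roughly 6.6 TFlops, the same order as the 7 TFlops peak quoted for apeNEXT in the family table.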

Lattice QCD

QCD (Quantum Chromo-Dynamics) is the field theory that describes the physics of the strong force, which governs the behavior of quarks and gluons. In general, the theory cannot be solved by purely analytical methods and requires computer simulation. LQCD (Lattice QCD), the discretized version of the theory, is the method of choice. The related algorithms are considered among the most demanding numerical applications. A large part of the scientific community that historically developed the LQCD algorithms is now trying to apply similar numerical methods to other challenging fields that require physical modelling, e.g. biocomputing.

Up to now, LQCD has been the driving force behind the design of systems like QCDSP and QCDOC (Columbia University), CP-PACS (Tsukuba University) and several generations of systems designed by INFN (the APE family of massively parallel computers).

Present simulations run on 4-D (64^3 x 128) physical lattices, which map well onto a 3D lattice of processors. On a 10+ year time scale, theoretical physics will need to grow to 256^3 x 512 physical lattices.
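
One common way to map a 4-D lattice onto a 3D processor grid is to split the three spatial extents across the processors and keep the whole time extent local to each node. The sketch below assumes that scheme and an 8x8x8 processor grid; the function name and the choice of decomposition are illustrative and are not taken from the apeNEXT software.

```python
import math


def local_sublattice(lattice=(64, 64, 64, 128), procs=(8, 8, 8)):
    """Split the three spatial extents over the processor grid; keep time local."""
    lx, ly, lz, lt = lattice
    px, py, pz = procs
    if lx % px or ly % py or lz % pz:
        raise ValueError("spatial extents must be divisible by the processor grid")
    return (lx // px, ly // py, lz // pz, lt)


local = local_sublattice()
sites_per_processor = math.prod(local)
print(local, sites_per_processor)
```

With these example values each processor holds an 8 x 8 x 8 x 128 sub-lattice, and only the sites on the faces of its spatial block need data from the neighbouring nodes.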

Since the required computational power scales with the seventh power of the lattice size, PetaFlops systems are needed.
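
A quick calculation shows how steep this scaling is. The sketch below assumes that "lattice size" means the linear extent of the lattice (64 growing to 256) and applies the seventh-power law quoted above; the helper names are illustrative.

```python
def cost_ratio(old_extent, new_extent, exponent=7):
    """Growth in required computing power when the linear lattice extent grows."""
    return (new_extent / old_extent) ** exponent


def site_count(spatial, temporal):
    """Number of sites in a spatial^3 x temporal lattice."""
    return spatial ** 3 * temporal


power_growth = cost_ratio(64, 256)                            # 4 ** 7
sites_growth = site_count(256, 512) / site_count(64, 128)     # 4 ** 4

print(f"sites grow by a factor of {sites_growth:.0f}")
print(f"required computing power grows by a factor of {power_growth:.0f}")
```

The number of lattice sites grows by a factor of 256, but under the seventh-power law the required computing power grows by a factor of about 16,000, which is why a jump of several orders of magnitude beyond TeraFlops machines is called for.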


Application to scalable platforms for Embedded Digital Signal Processing

Notably, architectures designed for the efficient solution of LQCD have also proved efficient on the numerical kernels of a few demanding Digital Signal Processing (DSP) algorithms, although turning them into DSP engines required the addition of critical real-time and system interface features. In the future, the requirements of DSP and of physical modelling will become increasingly similar. This convergence must be understood and exploited. A starting point for exploring it will be the inclusion of LQCD and other physical modelling algorithms in the benchmark suite of SHAPES. The participation of INFN Roma in the SHAPES consortium is meant to develop basic technologies for the next generation of LQCD engines and to carry out a bi-directional technology transfer.


Future Plans (Hypothesis)

The plan is to develop, in 2006-2009, a new generation of parallel machines dedicated to a broad range of numerical simulation applications and characterized by:

  • Peak performance of the order of 1 PetaFlops.
  • A number of processors of the order of 100,000.
  • Very high computational efficiency on target applications.
  • Reduced power consumption.
  • Impressive density of computing power per cubic meter.
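
The first two targets together imply a per-processor performance figure. The sketch below derives it and compares it with apeNEXT, using the peak power and processor count from the family table; it is plain arithmetic on the quoted order-of-magnitude figures, not a design specification.

```python
TARGET_PEAK_FLOPS = 10**15      # 1 PetaFlops
TARGET_PROCESSORS = 100_000

APENEXT_PEAK_FLOPS = 7 * 10**12  # 7 TFlops, from the family table
APENEXT_PROCESSORS = 4096

per_processor_target = TARGET_PEAK_FLOPS / TARGET_PROCESSORS
per_processor_apenext = APENEXT_PEAK_FLOPS / APENEXT_PROCESSORS
ratio = per_processor_target / per_processor_apenext

print(f"target:  {per_processor_target / 1e9:.1f} GFlops per processor")
print(f"apeNEXT: {per_processor_apenext / 1e9:.1f} GFlops per processor")
print(f"ratio:   {ratio:.1f}x")
```

The targets therefore call for about 10 GFlops per processor, roughly a sixfold increase over apeNEXT, combined with about 25 times as many processors.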


A Small Image collection

  

  

  

  

  


A photo of a latest-generation, mass-production apeNEXT rack, holding 512 processors (8x8x8).

  

  


A photo of the apeNEXT J&T processor, mounted on a piggy-back module together with 256 MB of SDRAM.

  

  


A photo showing two APEmille racks installed at INFN Roma.

  

  


A computer graphics image computed with a SIMD parallel ray tracing algorithm, showing a texture-mapped sea bottom and mirror spheres.

  

  

  

  

  

  


  

 
