Difference between revisions of "APEmille project"

From APEWiki
Revision as of 09:46, 5 October 2006

APEmille

A PARALLEL PROCESSOR IN THE TERAFLOPS RANGE


This document describes APEmille, a 3-D SIMD scalable parallel processor in the TeraFlops range. The machine is very efficient for lattice gauge theory (LGT) simulations, as well as for a broader class of numerical applications requiring massive, intensive floating-point computation.


Architectural Overview


An APEmille machine can be viewed as a 3-D processing grid with periodic boundaries, composed of Processing Nodes. Each Processing Node is directly connected to its 6 nearest neighbours through synchronous data communication channels. The Processing Nodes are optimised for single-precision floating-point arithmetic; integer and double-precision floating-point operations are also supported. The SIMD paradigm (groups of nodes executing the same instruction on different data) is complemented by a local addressing feature: each Processing Node may access its own local memory using a different local address. This new feature is the most important extension to the existing APE100 architecture, opening a path to coding algorithms that could not be implemented efficiently on APE100 machines.
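The neighbour structure of a periodic 3-D grid can be illustrated with a short Python sketch. This is only a model of the topology described above, not APEmille system software; the function name `neighbours` and the coordinate convention are assumptions made for the example.

```python
def neighbours(node, dims):
    """Return the six nearest neighbours of `node` on a periodic 3-D grid.

    `node` is an (x, y, z) coordinate and `dims` the grid size per axis.
    Both names are illustrative assumptions, not APEmille identifiers.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),  # +x, -x
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),  # +y, -y
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),  # +z, -z
    ]


# A node on the edge of an 8*8*8 grid wraps around to the opposite face.
print(neighbours((0, 0, 7), (8, 8, 8)))
# [(1, 0, 7), (7, 0, 7), (0, 1, 7), (0, 7, 7), (0, 0, 0), (0, 0, 6)]
```

On a grid with only 2 nodes along an axis, the "+" and "-" neighbours on that axis are the same node, so the six links reach fewer than six distinct nodes.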

The local addressing capability is a valuable addition to features already present in the APE100 architecture: local conditional statements (Where(local condition) ... Endwhere) and global conditions derived from the set of local conditions (e.g. If(All(local condition)) ... Endif). A further enhancement over the APE100 architecture is more general data routing among nodes that are not first neighbours.
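The semantics of these constructs can be emulated in plain Python, with one list element standing for the state of one node. This is a sketch of the ideas only: it is not the APEmille programming language, and the helper `where` and all variable names are assumptions made for the example.

```python
def where(mask, update, values):
    """Apply `update` only on nodes whose local condition is true.

    Every node sees the same instruction; nodes with a false mask keep
    their old value, which mimics Where(local condition) ... Endwhere.
    """
    return [update(v) if m else v for v, m in zip(values, mask)]


values = [0.5, -1.0, 2.0, -3.0]             # one value per node
mask = [v < 0 for v in values]              # local condition, per node
values = where(mask, lambda v: -v, values)  # Where(v < 0) v = -v Endwhere

# Global condition derived from all local conditions: If(All(v > 0)) ...
all_positive = all(v > 0 for v in values)

# Local addressing: the same load instruction, but each node supplies
# its own address into its own local memory.
memories = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
addresses = [0, 2, 1, 0]
loaded = [mem[addr] for mem, addr in zip(memories, addresses)]

print(values, all_positive, loaded)
```

Without local addressing, every node would have to use the same address in the last step, which is the restriction the text attributes to APE100.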

While the computational kernel of APE100 is built from replicas of three kinds of boards (Controller Boards, Processing Boards and Communication Boards), an APEmille computational engine is based on multiple instances of just one board type, the Processing Board (PB). This design brings scalability and engineering advantages over the APE100 arrangement.

Each PB integrates all system functionality: flow control, data processing, internode communication and host<->APEmille I/O. A Root Board provides global synchronisation of the Processing Boards.

The host consists of one or more networked workstations, each controlling a group of PBs.

An APEmille machine can be partitioned into smaller SIMD machines, each comprising one or more Processing Boards, with each partition executing its own instruction stream.

Each host in the network uses a high-performance communication channel to map the memories of a portion of APEmille (up to 128 nodes) onto its own bus. The close integration with a network of host workstations allows a high input/output bandwidth to disks and peripherals, in the range of 100 MByte per second per I/O device. Close integration of APEmille with standard workstations also adds the flexibility needed to customise the I/O system to the requirements of specific applications.
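Since each host maps at most 128 nodes, the limit implies a minimum number of hosts for a machine of a given size. The sketch below works this out; the function name `min_hosts` is an assumption, and the calculation considers only the 128-node limit stated above, not any other constraint on host placement.

```python
MAX_NODES_PER_HOST = 128  # limit stated in the text


def min_hosts(total_nodes):
    """Smallest number of hosts able to map `total_nodes` nodes."""
    # Ceiling division written with integers only.
    return -(-total_nodes // MAX_NODES_PER_HOST)


print(min_hosts(8))     # a single Processing Board
print(min_hosts(2048))  # a 32*8*8 machine
```

For the 32*8*8 configuration (2048 nodes) this gives 16 hosts, each mapping a 128-node portion of the machine.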

Topology


An APEmille machine is a 3-D grid of processing nodes with hardware data links between first-neighbour nodes. The smallest machine is a single Processing Board (PB), which holds 8 processing nodes in a 2x2x2 topology (a cube). Larger and faster machines are built by assembling several PBs.

3D grid


Hierarchical Hardware Topology

The yellow cubes in this representation are single APEmille boards, with their 8 processing nodes placed at the vertices of the cube.

  • Board: 8 Nodes (2*2*2 topology)

Cube1.gif

  • SubCrate: 4 Boards (2*2*8 topology)

Cube4.gif

  • Crate: 4 SubCrates - 16 Boards (2*8*8 topology)

Cube16.gif

  • Tower: 4 Crates - 64 Boards (8*8*8 topology)
  • APEmille: 4 Towers - 256 Boards (32*8*8 topology)
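The node and board counts in this hierarchy follow directly from the grid dimensions, as the following Python sketch checks. The dictionary `TOPOLOGY` and the function `node_count` are names chosen for this example only.

```python
from math import prod

NODES_PER_BOARD = 8  # one board is a 2*2*2 cube of nodes

# Grid dimensions of each level, as listed above.
TOPOLOGY = {
    "Board": (2, 2, 2),
    "SubCrate": (2, 2, 8),
    "Crate": (2, 8, 8),
    "Tower": (8, 8, 8),
    "APEmille": (32, 8, 8),
}


def node_count(dims):
    """Number of processing nodes in a grid of the given dimensions."""
    return prod(dims)


for name, dims in TOPOLOGY.items():
    nodes = node_count(dims)
    print(f"{name}: {nodes} nodes, {nodes // NODES_PER_BOARD} boards")
```

The loop reports 1, 4, 16, 64 and 256 boards, matching the board counts in the list.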


Performance

The following list reports the performance of the different configurations of an APEmille system: the maximum performance is about 1 TeraFlops for the 66 MHz machine, and could reach 1.6 TeraFlops in a future 100 MHz release.

Comparing these figures with an APE100 machine, we find that an entire APE100 tower (25 GigaFlops) can be replaced by an APEmille SubCrate.

Processing node:

  • 528 MegaFlops (66 MHz release)
  • 800 MegaFlops (100 MHz release)

Board (8 nodes, 2*2*2):

  • 4.2 GigaFlops (66 MHz)
  • 6.4 GigaFlops (100 MHz)

SubCrate (4 Boards, 2*2*8):

  • 16.8 GigaFlops (66 MHz)
  • 25.6 GigaFlops (100 MHz)

Crate (16 Boards, 2*8*8, ~ 0.5 m^3):

  • 66 GigaFlops (66 MHz)
  • 100 GigaFlops (100 MHz)

Tower (4 Crates, 8*8*8, ~ 2 m^3):

  • 264 GigaFlops (66 MHz)
  • 400 GigaFlops (100 MHz)

APEmille (4 Towers, 32*8*8, ~ 8 m^3):

  • 1 TeraFlops (66 MHz)
  • 1.6 TeraFlops (100 MHz)
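The figures above scale linearly with the number of nodes, which a short Python sketch can reproduce. The per-node values are taken from the list; the names `NODE_MFLOPS`, `NODES` and `machine_gflops` are assumptions for this example, and the observation that the quoted figures are rounded is an inference from the arithmetic rather than a statement from the source.

```python
# Per-node performance in MegaFlops, keyed by clock in MHz (from the list).
NODE_MFLOPS = {66: 528, 100: 800}

# Node count of each configuration (product of its grid dimensions).
NODES = {"Board": 8, "SubCrate": 32, "Crate": 128, "Tower": 512, "APEmille": 2048}


def machine_gflops(nodes, clock_mhz):
    """Aggregate performance in GigaFlops, assuming linear scaling."""
    return nodes * NODE_MFLOPS[clock_mhz] / 1000


# Both releases correspond to 8 floating-point operations per clock cycle.
flops_per_cycle = {mhz: mflops / mhz for mhz, mflops in NODE_MFLOPS.items()}
print(flops_per_cycle)

for name, nodes in NODES.items():
    print(name, machine_gflops(nodes, 66), machine_gflops(nodes, 100))
```

The unrounded products (for example 67.584 GigaFlops for a 66 MHz Crate and 1081.344 GigaFlops for the full 66 MHz machine) differ slightly from the listed values, which appear to have been rounded at each level.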