APEmille project

From APEWiki
Revision as of 10:05, 5 October 2006 by Biagioni (talk | contribs) (Scheme of connections among the 8 nodes of a PB)
Jump to: navigation, search

A PARALLEL PROCESSOR IN THE TERAFLOPS RANGE


This document describes APEmille, a 3-D SIMD scalable parallel processor in the Teraflop range. This machine is very efficient for LGT simulations as well as for a broader class of numeric applications requiring massive intensive floating point computations.


Architectural Overview

An APEmille machine can be viewed as a 3-D processing grid with periodic boundaries, composed of Processing Nodes. Each Processing Node is directly connected to its own 6 near-neighbours through synchronous data communication channels. The Processing Nodes are optimised for single precision floating point arithmetics and integer and double precision floating point operations are supported too. The SIMD paradigm (groups of nodes which execute the same instruction on different data) is complemented with a local addressing feature. Therefore all Processing Nodes may access their own local memories using different local addresses. This new feature is the most important extension to the present APE100 architecture, opening a path for coding of algorithms which could not be efficiently implemented on APE100 machines.

The local addressing capability is a valuable addition to the power, already present in the APE100 architecture, of the local conditional statements (Where(local condition)... Endwhere) and of global conditions derived by the set of local conditions (e.g. If(All(local condition)) ... Endif). A further enhancement to the APE100 architecture is represented by more general data routing capabilities among non-first-neighbour nodes.

While the computational kernel of APE100 is built using replicas of three kinds of boards (Controller Boards, Processing Boards and Communication Boards) an APEmille computational engine is based on multiple instances of just one Processing Board (PB). This design solution implies scalability and engineering advantages in comparison with the APE100 arrangement.

Each PB integrates all system functionality: flow control, data processing, internode communication and host<->APEmille I/O. A Root Board provides global synchronisation of the Processing Boards.

The host consists of one or more networked workstations, each controlling a group of PBs.

An APEmille machine can be partitioned into smaller SIMD machines, each comprising one or more Processing Boards, executing different instructions on each partition.

Each host in the network, using a high performance communication channel, maps the memories of a portion of APEmille (up to 128 nodes) on its own bus. The close integration with a network of host workstations allows a high input/output bandwidth with disks and peripherals in the range of 100 MByte per second per I/O device. Close integration of APEmille with standard workstation also adds the flexibility needed to customise the I/O system to the requirements of specific applications.

Topology

An APEmille machine is a 3D grid of processing nodes with hardware data links between the 3D first neighbouring nodes. The smallest machine is the single Processing Board (PB) where are placed 8 processing nodes with a 2x2x2 topology, (cube). Arranging togheter more PBs it's possible to have more complicated, greater and faster machines.

3D grid


Hierarchical Hardware Topology

The yellow cubes in this rappresentation are single APEmille boards with their 8 processing nodes supposed as placed at the vertices of the cube.

  • Board: 8 Nodes (2*2*2 topology)

Cube1.gif

  • SubCrate: 4 Boards (2*2*8 topology)

Cube4.gif

  • Crate: 4 SubCrates -16 Boards (2*8*8 topology)

Cube16.gif

  • Tower: 4 Crates - 64 Boards (8*8*8)
  • APEMille: 4 Towers - 256 Boards(32*8*8)


Performances

The following table reports the performance of the different configuration of an APEmille system: the maximum performance are about 1TeraFlops for the 66 MHz machine, and could be 1.6TFlops in a future 100 MHz release.

Comparing these performances, with an APE100 machine we find out that it's possible to replace an entire APE100 tower (25 GigaFloaps) with a APEmille SubCrate.

Processing node:

  • 528 MegaFlops (66 MHz release)
  • 800 MegaFlops (100 MHz release)

Board (8 nodes, 2*2*2):

  • 4.2 GigaFlops (66 Mhz)
  • 6.4 GigaFlops(100 Mhz)

SubCrate (4 Boards, 2*2*8):

  • 16.8 GigaFlops (66 Mhz)
  • 25.6 GigaFlops(100 Mhz)

Crate (16 Boards, 2*8*8, ~ 0.5 m^3):

  • 66 GigaFlops (66 Mhz)
  • 100 GigaFlops(100 Mhz)

Tower (4 Crates, 8*8*8, ~ 2 m^3):

  • 264 GigaFlops (66 Mhz)
  • 400 GigaFlops(100 Mhz)

APEMille (4 Towers, 32*8*8, ~ 8 m^3):

  • 1 TeraFlops(66 Mhz)
  • 1.6 TeraFlops(100Mhz)

Processing Board

Fore each APEmille PB a CPU takes care of the program flow and drives the communication protocol with the host network. The CPU broadcasts a VLIW (Very Long Instruction Word) to the 8 Processing Nodes located on each PB. At the same time, as a result of its calculations, the CPU broadcasts a Global Address to the Processing Nodes. The Nodes can access their own local memory using this Global Address either directly or after adding a local offset to it, thus producing a local address.

The 8 Processing Nodes of a Board are arranged in a cubic lattice (a 2x2x2 topology). Each processing node has an arithmetic and logic unit including floating point adders, multipliers, a large multiport register file plus a memory controller with address generation capability, interfacing to local memory. Integer and bitwise operations and local addressing are some of the new main features of the Processing Nodes.

The first APEmille systems, based on presently available Syncronous Dynamic Ram and VLSI ASIC technologies, will adopt a 66 MHz, 528 MFlops processor. We expect however that the clock frequency may eventually increase up to 100 MHz. A 2048 100 MHz nodes system would have a peak performance of 1.6 TeraFlops.

According to our application requirements, the size of the local memory of each Processing Node will range from 2 to 8 MWord. Through a Communication Device each Processing Node directly accesses the data on its own 6 neighbouring Nodes. Some routing capabilities allow of the comunication device (slower) access to far-away nodes.

The Processing Board hosts:

  • 1 CPU
  • 8 Processing Nodes,
  • 1 Communication Device
  • required memories.

Scheme of connections among the 8 nodes of a PB

Xyz02.gif

Xyz01.gif


Cmille is the custom chip which takes care of the remote data communications between processing nodes.

Tmille is the custom processor which controls the instruction flow and global addressing of each APEMille SIMD partition (a minimum of 8 processing nodes). Tmille is connected to the Host, to its own data memory, to an instruction memory and to the Communication Device. It also drives the Address Bus. There is a major difference from the Ape100 machine where one controller CPU normally drives 128 nodes and is housed in a different board.

The Processing Node is composed of a floating point processor (called JMILLE) which drives its own Local Memory, receives Instructions and Global Addresses from TMILLE and communicates with the Communication Device (CMILLE).

The Communication Device also has a channel connected to the CPU, and through this pathway, to the Host. Through this pathway the Host loads the program and data memories of the floating point processor.

All the CPUs and all the Nodes of a SIMD machine partition execute the same instruction. Each PB sends its Local Signals to a Root Board and receives Global Signals from it, in order to manage Synchronisation, Exceptions and Global Flow Control Conditions.

Each Processing Board interfaces the Host through a dedicated synchronous channel (APE Channel) going from the CPU to the Host Interface.

Processing Node

Tmille

Jmille

Cmille

PB interconnections

Memory

Software