Leveraging the know-how and insights gained in networking, a spin-off project called APEnet developed an FPGA-based interconnect board that makes it possible to assemble an APE-like PC cluster from off-the-shelf components.

The design of the APEnet interconnect is easily portable and can be configured for different environments: (i) APEnet was the first point-to-point, low-latency, high-throughput network interface card for dedicated LQCD clusters; (ii) the Distributed Network Processor (DNP) was one of the key elements of the RDT (RISC+DSP+DNP) chip, implementing a tiled architecture in the framework of the EU FP6 SHAPES project; (iii) the APEnet+ network interface card, based on an Altera Stratix IV FPGA, was used in QUonG, a hybrid, GPU-accelerated x86_64 cluster with a 3D toroidal mesh topology able to scale up to 10^4–10^5 nodes, in the framework of the EU FP7 EURETILE project. APEnet+ was the first device to directly access the memory of NVIDIA GPUs, providing GPUDirect RDMA capabilities and achieving a significant performance boost in GPU-to-GPU latency tests; (iv) the APEnet network IP (i.e. the routing logic and the link controller) is responsible for data transmission at Tier 0/1/2 in the framework of the H2020 ExaNeSt project.
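To make the 3D toroidal mesh mentioned above concrete, the following C sketch (not APEnet code; torus dimensions and the coordinate-to-id mapping are chosen arbitrarily for illustration) shows how node coordinates can be mapped to linear identifiers and how a node's six nearest neighbours (X+, X-, Y+, Y-, Z+, Z-) are obtained with wrap-around at the torus edges.

/*
 * Illustrative sketch only: addressing of nodes in a 3D toroidal mesh and
 * derivation of the six nearest neighbours with wrap-around. Dimensions and
 * layout are assumptions for the example, not the QUonG configuration.
 */
#include <stdio.h>

typedef struct { int x, y, z; } coord3d;

/* Torus size chosen arbitrarily for the example. */
enum { DIM_X = 4, DIM_Y = 4, DIM_Z = 8 };

/* Map 3D coordinates to a linear node id (row-major). */
static int coord_to_id(coord3d c)
{
    return c.x + DIM_X * (c.y + DIM_Y * c.z);
}

/* Neighbour along one axis (0=X, 1=Y, 2=Z) with toroidal wrap-around. */
static coord3d neighbour(coord3d c, int axis, int dir)
{
    int dim[3] = { DIM_X, DIM_Y, DIM_Z };
    int *p[3]  = { &c.x, &c.y, &c.z };

    *p[axis] = (*p[axis] + dir + dim[axis]) % dim[axis];
    return c;
}

int main(void)
{
    coord3d node = { 0, 0, 0 };
    const char *axis_name[3] = { "X", "Y", "Z" };

    for (int axis = 0; axis < 3; axis++)
        for (int dir = -1; dir <= 1; dir += 2) {
            coord3d n = neighbour(node, axis, dir);
            printf("%s%+d neighbour of node 0: id %d (%d,%d,%d)\n",
                   axis_name[axis], dir, coord_to_id(n), n.x, n.y, n.z);
        }
    return 0;
}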

            APEnet              DNP        APEnet+            APEnet+ v5         ExaNet
Year        2003                2007       2012               2014               2017
FPGA        Altera Stratix III  ASIC       Altera Stratix IV  Altera Stratix V   Xilinx Ultrascale+
BUS         PCI-X               AMBA-AHB   PCIe Gen2          PCIe Gen3          AXI
Computing   Intel CPU           RISC+DNP   NVIDIA GPU         NVIDIA GPU         ARM+FPGA
Bandwidth   6.4 Gbps            --         34 Gbps            45 Gbps            32 Gbps
Latency     6.5 us              --         6.5 us             5 us               1.1 us

APEnet Interconnect Architecture based on a layered model

GPU I/O accelerator

APEnet+ was the first-of-its-kind device to implement an RDMA protocol that directly reads/writes data from/to Fermi- and Kepler-class NVIDIA GPUs using the NVIDIA peer-to-peer and GPUDirect RDMA protocols, obtaining true zero-copy GPU-to-GPU transfers over the network. This means that the APEnet+ network board can target GPU memory with ordinary RDMA semantics, with no CPU involvement and dispensing entirely with intermediate copies. In this way, true zero-copy, inter-node GPU-to-host, host-to-GPU or GPU-to-GPU transfers can be achieved, with substantial reductions in latency.
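The host-side flow of such a zero-copy transfer can be sketched as follows. The CUDA runtime calls are real; the apenet_* functions are hypothetical stand-ins for the board's RDMA API (stubbed here so the example is self-contained), since the actual APEnet+ driver interface is not detailed in this document.

/*
 * Minimal host-side sketch of the zero-copy pattern described above: the GPU
 * buffer is allocated once, registered with the NIC, and then sent directly
 * from GPU memory with no intermediate host copy.
 */
#include <cuda_runtime.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { void *dev_ptr; size_t len; } apenet_mr_t;   /* memory region */

/* Hypothetical: the driver pins the GPU pages via GPUDirect RDMA and returns
 * a registered memory region that the NIC can read from directly. */
static apenet_mr_t apenet_register_gpu_buffer(void *dev_ptr, size_t len)
{
    apenet_mr_t mr = { dev_ptr, len };
    printf("registered %zu bytes of GPU memory with the NIC\n", len);
    return mr;
}

/* Hypothetical: the NIC DMA-reads the GPU buffer and writes it into the
 * remote node's memory; the CPU is not involved in the data path. */
static int apenet_rdma_put(const apenet_mr_t *mr, int remote_node,
                           uint64_t remote_vaddr)
{
    printf("RDMA PUT of %zu bytes to node %d @ 0x%llx\n",
           mr->len, remote_node, (unsigned long long)remote_vaddr);
    return 0;
}

int main(void)
{
    const size_t len = 1 << 20;            /* 1 MiB payload */
    void *gpu_buf = NULL;

    /* The buffer lives in GPU memory; no host staging buffer is allocated. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    apenet_mr_t mr = apenet_register_gpu_buffer(gpu_buf, len);
    apenet_rdma_put(&mr, /* remote_node */ 1, /* remote_vaddr */ 0x1000);

    cudaFree(gpu_buf);
    return 0;
}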

APEnet Performance