NaNet - APE LAB

The goal of the NaNet project is to design and implement a family of FPGA-based PCIe Network Interface Cards for High Energy Physics that bridge the front-end electronics and the software trigger computing nodes.

The design supports both standard and custom channels: GbE (1000BASE-T), 10GbE (10GBASE-KR), 40GbE, APElink (a custom 34 Gbps link dedicated to HPC systems) and KM3link (a deterministic-latency 2.5 Gbps link used in the data acquisition system of the KM3NeT-IT experiment).

The RDMA feature, combined with a transport protocol layer offload module and a data stream processing stage, makes NaNet a low-latency NIC suitable for online processing of data streams.

NaNet's GPUDirect/RDMA capability enables the connected processing system to exploit the high computing performance of modern GPUs in real-time applications.

I/O Interface

Router

Network Interface

PCIe Core

I/O Interface

The I/O Interface performs a four-stage processing on the data stream. Following the OSI model, the Physical Link Coding stage implements, as the name suggests, the physical layer of the channel (e.g. 1000BASE-T), while the Protocol Manager stage handles, depending on the kind of channel, the data link, network and transport layers (e.g. Time Division Multiplexing or UDP). The Data Processing stage implements application-dependent transformations on the data streams (e.g. compression/decompression). Finally, the APEnet Protocol Encoder performs protocol adaptation: it encapsulates inbound payload data in the APElink packet protocol, which is used in the inner NaNet logic, and decapsulates outbound APElink packets before re-encapsulating their payload in the transport protocol of the output channel (e.g. UDP).
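The inbound path through the four stages can be sketched as a chain of functions. This is only a software model under stated assumptions: the APElink packet format is not described here, so the header layout (`APELINK_HEADER`) is hypothetical, the physical stage is a pass-through, the input is a bare UDP datagram without Ethernet/IP headers, and zlib stands in for whatever decompression the hardware actually implements.

```python
import struct
import zlib

# Hypothetical APElink-style header: destination address (u64) + payload length (u32).
# The real APElink packet layout is not given in this page; this is an assumption.
APELINK_HEADER = struct.Struct("<QI")
UDP_HEADER_LEN = 8  # a UDP header is always 8 bytes


def physical_link_decode(frame: bytes) -> bytes:
    """Stage 1: stand-in for the physical layer; bytes pass through unchanged."""
    return frame


def protocol_manager(datagram: bytes) -> bytes:
    """Stage 2: parse the UDP header and return only the payload."""
    _src, _dst, length, _checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    # The UDP length field counts header plus payload.
    return datagram[UDP_HEADER_LEN:length]


def data_processing(payload: bytes) -> bytes:
    """Stage 3: application-dependent transform; zlib decompression as an example."""
    return zlib.decompress(payload)


def apenet_protocol_encode(payload: bytes, dest_addr: int) -> bytes:
    """Stage 4: encapsulate the payload in the (hypothetical) internal packet format."""
    return APELINK_HEADER.pack(dest_addr, len(payload)) + payload


def inbound(frame: bytes, dest_addr: int) -> bytes:
    """Run the four stages in order on one inbound frame."""
    datagram = physical_link_decode(frame)
    payload = protocol_manager(datagram)
    data = data_processing(payload)
    return apenet_protocol_encode(data, dest_addr)


def make_udp_datagram(payload: bytes, src_port: int = 1000, dst_port: int = 2000) -> bytes:
    """Helper for the example: build a UDP datagram with a zero (unused) checksum."""
    header = struct.pack("!HHHH", src_port, dst_port, UDP_HEADER_LEN + len(payload), 0)
    return header + payload
```

Calling `inbound(make_udp_datagram(zlib.compress(data)), dest_addr=0x1000)` returns the decompressed data prefixed with the internal header. The outbound path would apply the inverse steps in the opposite order.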

	NaNet-1	NaNet³	NaNet-10
Year	Q3 - 2013	Q1 - 2017	Q1 - 2017
Device Family	Altera Stratix IV Development Kit	Altera Stratix V Terasic DE5	Altera Stratix V Terasic DE5
Channel Technology	1 GbE	KM3link	10 GbE
Transmission Protocol	UDP	TDM	UDP
Number of channels	1	4	4
PCIe	Gen2 x8	Gen2 x8	Gen3 x8
NVIDIA GPUDirect RDMA	YES	YES	YES
Real-time Processing	Decompressor	Decompressor	Decompressor; Merger
HEP experiment	NA62	KM3NeT-IT	NA62
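As a rough sanity check on the table, the aggregate channel rate of each board can be compared with the raw rate of its PCIe link. The sketch below uses the standard PCIe figures (5 GT/s per lane with 8b/10b encoding for Gen2, 8 GT/s per lane with 128b/130b encoding for Gen3) and the nominal channel rates quoted above. It ignores PCIe packet and protocol overhead, so the result is an upper bound, not a measured throughput.

```python
def pcie_gbps(gen: int, lanes: int) -> float:
    """PCIe rate in Gb/s after line-encoding overhead, before protocol overhead."""
    rate_gt, efficiency = {2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}[gen]
    return rate_gt * efficiency * lanes


# Channel count and PCIe configuration from the table; nominal per-channel rates in Gb/s.
BOARDS = {
    "NaNet-1": {"channels": 1, "link_gbps": 1.0, "pcie": (2, 8)},
    "NaNet³": {"channels": 4, "link_gbps": 2.5, "pcie": (2, 8)},
    "NaNet-10": {"channels": 4, "link_gbps": 10.0, "pcie": (3, 8)},
}


def summary(name: str) -> tuple:
    """Return (aggregate channel rate, PCIe rate) in Gb/s for one board."""
    board = BOARDS[name]
    aggregate = board["channels"] * board["link_gbps"]
    return aggregate, pcie_gbps(*board["pcie"])


for name in BOARDS:
    aggregate, pcie = summary(name)
    print(f"{name}: channels {aggregate:.1f} Gb/s, PCIe {pcie:.1f} Gb/s")
```

In all three configurations the PCIe rate exceeds the aggregate channel rate, which is consistent with NaNet-10 being the board that moves to Gen3.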

NaNet Software Stack

Software components for NaNet operation are needed both on the x86 host and on the Nios II FPGA-embedded μcontroller. On the x86 host, a GNU/Linux kernel driver and an application library are present.

The application library provides an API for three main tasks: opening and closing the device; registering and deregistering circular lists of persistent receiving buffers (CLOPs) in GPU and/or host memory, where registration means allocating and pinning the buffers and returning their virtual addresses to the application; and signalling receive events on these registered buffers to the application (e.g. to invoke a GPU kernel to process data just received in GPU memory).
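The behaviour of a CLOP as seen by the application can be illustrated with a small model. This is not the NaNet library API: the class `Clop` and its methods `receive`, `wait_event` and `data` are hypothetical names, ordinary Python byte arrays stand in for pinned GPU or host memory, and the event format (buffer index, byte count) is an assumption.

```python
from collections import deque


class Clop:
    """Toy model of a circular list of persistent receive buffers (CLOP)."""

    def __init__(self, num_buffers: int, buffer_size: int):
        # "Registration": allocate the buffers once; they persist across receives.
        self.buffers = [bytearray(buffer_size) for _ in range(num_buffers)]
        self.buffer_size = buffer_size
        self.next_index = 0      # buffer the next payload will land in
        self.events = deque()    # pending receive events: (buffer index, byte count)

    def receive(self, payload: bytes) -> None:
        """Model the NIC writing a payload into the next buffer and signalling it."""
        if len(payload) > self.buffer_size:
            raise ValueError("payload larger than a CLOP buffer")
        index = self.next_index
        self.buffers[index][:len(payload)] = payload
        self.events.append((index, len(payload)))
        # The list is circular: after the last buffer, the first one is reused.
        self.next_index = (index + 1) % len(self.buffers)

    def wait_event(self):
        """Return the oldest pending receive event, or None if there is none."""
        return self.events.popleft() if self.events else None

    def data(self, index: int, length: int) -> bytes:
        """Copy out the received bytes of one buffer."""
        return bytes(self.buffers[index][:length])
```

In this model an application would loop on `wait_event()` and hand each signalled buffer to its processing step. Because buffers are reused in order, the application has to consume a buffer before the list wraps around to it again.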

On the μcontroller, a single-process application is in charge of device configuration, generation of the destination virtual address inside the CLOP for each incoming packet payload, and the virtual-to-physical memory address translation that is performed before the PCIe DMA transaction to the destination buffer takes place.
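The two address-related tasks can be sketched as follows. This is an illustrative model, not the firmware: the names `ClopAddressGenerator` and `virt_to_phys`, the 4 KiB page size and the dictionary-based page table are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size; the real value depends on the target memory


class ClopAddressGenerator:
    """Hands out destination virtual addresses inside a CLOP, one buffer at a time."""

    def __init__(self, base_vaddr: int, num_buffers: int, buffer_size: int):
        self.base_vaddr = base_vaddr
        self.num_buffers = num_buffers
        self.buffer_size = buffer_size
        self.index = 0

    def next_vaddr(self) -> int:
        """Return the virtual address for the next payload and advance circularly."""
        vaddr = self.base_vaddr + self.index * self.buffer_size
        self.index = (self.index + 1) % self.num_buffers
        return vaddr


def virt_to_phys(vaddr: int, page_table: dict) -> int:
    """Translate using a {virtual page number: physical page base address} map."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError(f"virtual page {vpn:#x} is not registered")
    return page_table[vpn] + offset
```

With `page_table = {0x10: 0x7000, 0x11: 0x3000}`, the virtual address `0x11800` translates to `0x3800`. Since consecutive virtual pages may map to non-contiguous physical pages, a transfer that crosses a page boundary would need one translation per page.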

The control flow through kernel and user space is detailed below:

The NaNet NIC DMA-writes a “receiving done” event into a memory region called the “event queue”. The kernel-space device driver traps the event and notifies the user application, which launches a CUDA kernel to process the data on the GPU;
The results of the processing are eventually sent to the network via the NaNet board:
data are DMA-read directly from GPU memory;
the kernel device driver (invoked by the user application on the host) instructs the NIC by writing a “descriptor” into a dedicated, DMA-accessible memory region called the “TX ring”;
the presence of new descriptors is notified to NaNet by writing to a doorbell register over PCIe;
the NaNet NIC issues a “tx done” completion event in the “event queue”.
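The transmit side of this flow (descriptor, TX ring, doorbell, completion event) can be modelled in a few lines. Everything here is an assumption made for illustration: the `Descriptor` fields, the `NicModel` class, the ring-full policy and the event tuple format are hypothetical, and a Python `bytes` object stands in for GPU memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    gpu_addr: int  # source offset of the data in (modelled) GPU memory
    length: int    # number of bytes to send


class NicModel:
    """Toy NIC: consumes TX-ring descriptors on a doorbell write, posts completions."""

    def __init__(self, ring_size: int, gpu_memory: bytes):
        self.tx_ring = [None] * ring_size
        self.head = 0              # next slot the NIC will consume
        self.tail = 0              # next slot the driver will fill
        self.event_queue = deque()
        self.gpu_memory = gpu_memory
        self.wire = []             # payloads "sent to the network"

    def post_descriptor(self, desc: Descriptor) -> None:
        """Driver side: write a descriptor into the next free TX-ring slot."""
        if (self.tail + 1) % len(self.tx_ring) == self.head:
            raise BufferError("TX ring full")
        self.tx_ring[self.tail] = desc
        self.tail = (self.tail + 1) % len(self.tx_ring)

    def ring_doorbell(self) -> None:
        """Model the doorbell write: the NIC processes all pending descriptors."""
        while self.head != self.tail:
            desc = self.tx_ring[self.head]
            # "DMA-read" the payload straight from the modelled GPU memory.
            end = desc.gpu_addr + desc.length
            self.wire.append(self.gpu_memory[desc.gpu_addr:end])
            self.event_queue.append(("tx done", self.head))
            self.head = (self.head + 1) % len(self.tx_ring)
```

Nothing is sent until `ring_doorbell()` is called, which mirrors the role of the doorbell register: descriptors can be batched in the ring and announced with a single PCIe write.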

NaNet performance / implementation

NaNet-1

NaNet-10

NaNet³
