The goal of the NaNet project is the design and implementation of a family of FPGA-based PCIe Network Interface Cards for High Energy Physics, bridging the front-end electronics and the software trigger computing nodes.

The design supports both standard and custom channels: GbE (1000BASE-T), 10GbE (10GBASE-KR), 40GbE, APElink (a custom 34 Gbps link dedicated to HPC systems), and KM3link (a deterministic-latency 2.5 Gbps link used in the KM3NeT-IT experiment data acquisition system).

The RDMA feature, combined with a transport protocol layer offload module and a data stream processing stage, makes NaNet a low-latency NIC suitable for online processing of data streams.

The NaNet GPUDirect RDMA capability enables the connected processing system to exploit the high computing performance of modern GPUs in real-time applications.

 

                       NaNet-1            NaNet3            NaNet-10
Year                   Q3 2013            Q1 2017           Q1 2017
Device Family / Board  Altera Stratix IV  Altera Stratix V  Altera Stratix V
                       Development Kit    (Terasic DE5)     (Terasic DE5)
Channel Technology     1 GbE              KM3link           10 GbE
Transmission Protocol  UDP                TDM               UDP
Number of Channels     1                  4                 4
PCIe                   Gen2 x8            Gen2 x8           Gen3 x8
NVIDIA GPUDirect RDMA  YES                YES               YES
Real-time Processing   Decompressor       Decompressor      Decompressor; Merger
HEP Experiment         NA62               KM3NeT-IT         NA62

NaNet Software Stack

Software components for NaNet operation are needed both on the x86 host and on the Nios II FPGA-embedded μcontroller. On the x86 host, a GNU/Linux kernel driver and an application library are provided.

The application library provides an API mainly for open/close device operations, for registration and deregistration of circular lists of persistent receiving buffers (CLOPs) in GPU and/or host memory (registration means allocating and pinning the buffers and returning their virtual addresses to the application), and for signalling receive events on these registered buffers to the application (e.g. to invoke a GPU kernel that processes data just received in GPU memory).
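A minimal usage sketch of such an API is shown below; all nanet_* identifiers and the event fields are hypothetical names standing in for the actual library symbols, which are not reproduced here:

    /* Hypothetical usage sketch of the NaNet application library API.
     * All nanet_* identifiers and the event fields are illustrative
     * assumptions, not the real library symbols. */
    #include <cuda_runtime.h>

    #define N_BUFS   8
    #define BUF_SIZE (1 << 20)            /* 1 MiB per receive buffer */

    __global__ void process_packets(char *data, unsigned int len)
    {
        /* placeholder GPU processing of the received payload */
    }

    int main(void)
    {
        nanet_dev_t *dev = nanet_open("/dev/nanet0");

        /* Allocate receive buffers in GPU memory and register them as a
         * CLOP: the library pins them and returns their virtual addresses. */
        void *gpu_bufs[N_BUFS];
        for (int i = 0; i < N_BUFS; ++i)
            cudaMalloc(&gpu_bufs[i], BUF_SIZE);
        nanet_clop_t *clop = nanet_clop_register(dev, gpu_bufs, N_BUFS, BUF_SIZE);

        for (;;) {
            nanet_recv_event_t ev;
            nanet_wait_recv(clop, &ev);                /* block on "receive done" */

            /* Data are already in GPU memory: process them in place. */
            process_packets<<<64, 256>>>((char *)ev.buf_addr, ev.len);
            cudaDeviceSynchronize();

            nanet_clop_release(clop, ev.buf_idx);      /* recycle the buffer */
        }
    }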

On the μcontroller, a single-process application is in charge of device configuration, of generating the destination virtual address inside the CLOP for the payload of incoming packets, and of the virtual-to-physical memory address translation performed before the PCIe DMA transaction towards the destination buffer takes place.
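The fragment below illustrates this step under invented names and a deliberately simplified page table: it generates the destination address inside the current circular buffer for an incoming payload and translates it to the physical address handed to the PCIe DMA engine. It is a sketch of the concept, not the NaNet firmware code:

    /* Illustrative sketch only: structure layout and names are assumptions. */
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    struct clop_buf {
        uint64_t  virt_base;   /* page-aligned virtual address registered by the host app */
        uint64_t *page_phys;   /* physical address of each page backing the buffer        */
        uint32_t  size;        /* buffer size in bytes                                     */
        uint32_t  wr_offset;   /* current write offset inside the circular buffer          */
    };

    /* Return the physical address where the next payload is DMA-written and
     * advance the write pointer (payloads crossing a page boundary would be
     * split into multiple DMA transactions in a real implementation). */
    static uint64_t next_dma_target(struct clop_buf *buf, uint32_t payload_len)
    {
        if (buf->wr_offset + payload_len > buf->size)
            buf->wr_offset = 0;                            /* wrap around */

        uint32_t off  = buf->wr_offset;
        uint64_t phys = buf->page_phys[off >> PAGE_SHIFT] + (off & (PAGE_SIZE - 1));

        buf->wr_offset += payload_len;
        return phys;
    }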

The control flow of processes through kernel and user space is detailed below:

  • the NaNet NIC DMA-writes a “receive done” event into a memory region called the “event queue”; the event is trapped by the kernel-space device driver and notified to the user application, which launches a CUDA kernel to process the received data on the GPU;
  • the results of the processing are eventually sent back to the network via the NaNet board: data are DMA-read directly from GPU memory;
  • the kernel device driver (invoked by the user application on the host) instructs the NIC by filling a “descriptor” into a dedicated, DMA-accessible memory region called the “TX ring” (a TX-side sketch of this step follows the list);
  • the presence of new descriptors is notified to NaNet by writing on a doorbell register over PCIe;
  • the NaNet NIC issues a “TX done” completion event in the “event queue”.
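The TX side of this flow can be pictured with the sketch below; the descriptor layout, field names and doorbell handling are assumptions chosen for illustration (a real driver would also use proper I/O accessors and memory barriers), not the actual NaNet driver code:

    /* Illustrative TX-path sketch: descriptor ring, doorbell, completion.
     * All layouts and names are assumptions, not the NaNet driver code. */
    #include <stdint.h>

    #define TX_RING_ENTRIES 256

    struct tx_desc {                  /* one entry of the DMA-accessible "TX ring" */
        uint64_t src_addr;            /* bus address of the data (GPU memory in
                                         the GPUDirect RDMA case)                  */
        uint32_t len;                 /* payload length in bytes                   */
        uint32_t flags;               /* e.g. a "valid" bit, written last          */
    };

    struct nanet_tx {
        struct tx_desc    *ring;      /* DMA-accessible descriptor ring            */
        volatile uint32_t *doorbell;  /* memory-mapped doorbell register (PCIe BAR)*/
        uint32_t           head;      /* next free slot, driver side               */
    };

    /* Post one send request: fill a descriptor, then notify the NIC. */
    static void nanet_tx_post(struct nanet_tx *tx, uint64_t src, uint32_t len)
    {
        uint32_t slot = tx->head % TX_RING_ENTRIES;

        tx->ring[slot].src_addr = src;
        tx->ring[slot].len      = len;
        tx->ring[slot].flags    = 1;  /* mark the descriptor valid                 */

        tx->head++;
        *tx->doorbell = tx->head;     /* doorbell write over PCIe: new descriptors */
    }

    /* The NIC then DMA-reads the data, sends it to the network and writes a
     * "TX done" completion event into the event queue, which the driver
     * consumes exactly as it does for receive events. */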
Circular List Of Persistent buffers (CLOPs)
Control flow detail of processes through kernel/user space

NaNet performance / implementation