Difference between revisions of "NaNet overview"

From APEWiki
 
(126 intermediate revisions by 4 users not shown)
Line 1: Line 1:
====Introduction====
+
[[File:NaNet-logo-1.png|thumb|left|200px]]
NaNet is a modular design of a low-latency NIC dedicated to real-time GPU-based systems and
 
supporting a number of different physical links; its design baseline comes from the [[APEnet+_project|APEnet+]] PCIe
 
Gen 2 x8 3D NIC.
 
  
The board is able to manage either 34 Gbps APElink channel or 1/10 GbE interfaces and exploit the GPUDirect
+
'''NaNet''' is an ''FPGA-based'' PCIe Network Interface Card ([http://en.wikipedia.org/wiki/Network_interface_controller NIC]) design with ''Remote Direct Memory Access'' ([http://en.wikipedia.org/wiki/Remote_direct_memory_access RDMA]) and [https://developer.nvidia.com/gpudirect GPUDirect P2P/RDMA] capabilities featuring a configurable and extensible set of network channels.
P2P capabilities of NVIDIA Fermi/Kepler GPUs equipping a hosting PC to directly inject into
+
 
 +
The design supports both standard and custom channels:
 +
*'''GbE''' (1000BASE-T)
 +
*'''10GbE''' (10GBASE-KR)
 +
*'''40GbE''' 
 +
*'''APElink''' (custom 34 Gbps link dedicated to HPC systems)
 +
*'''KM3link''' (deterministic latency 2.5Gbps link used in the KM3Net-IT experiment data acquisition system)
 +
 
 +
The RDMA feature, combined with a transport protocol layer offload module and a data stream processing stage, makes NaNet a low-latency NIC suitable for '''real-time processing''' of data streams.
 +
 
 +
NaNet's GPUDirect capability enables the connected processing system to exploit the high computing performance of modern
 +
'''GPUs on real-time applications'''.
 +
 
 +
Since January 2015 NaNet is an [[image:Logo-infn.png|30px]][https://web2.infn.it/csn5/index.php/en/ INFN Scientific Committee 5] funded experiment.
 +
 
 +
[[NaNet_Publications|'''Here you can find a list of publications and talks about the project'''.]]
 +
 
 +
===NaNet Architecture===
 +
 
 +
The NaNet design is partitioned into '''4 main modules''': Router, Network Interface, PCIe Core, and I/O Interface (see fig. 1).
 +
 
 +
[[File:NaNetInternal-1.png|thumb|left|550px|Figure 1 - NaNet General Architecture]]
 +
 
 +
#The '''Router''' module supports a configurable number of ports implementing a full crossbar switch responsible for data routing and dispatch. Number and ''bit-width'' of the switch ports and the routing algorithm can all be defined by the user to automatically achieve a desired configuration. The Router block dynamically interconnects the ports and comprises a fully connected switch, plus routing and arbitration blocks managing multiple data flows @2.8 GB/s.
 +
#The '''Network Interface''' block acts on the transmitting side by gathering data incoming from the PCIe port and forwarding them to the Router destination ports; on the receiving side it provides support for RDMA in communications involving both the host and the GPU (via a dedicated '''GPU I/O Accelerator''' module). A NIOS-II microcontroller handles configuration and runtime operations.
 +
#The '''PCIe Core''' module is built upon a powerful commercial core from PLDA that sports a simplified but efficient backend interface and multiple DMA engines.
 +
#The '''I/O Interface''' module is the discriminating component among the cards in the NaNet family. It is redesigned each time to satisfy the requirements of the readout system data transmission protocol, optimizing the data movement process for the different experiments. The I/O Interface module performs a 4-stage processing on the data stream: following the OSI model, the Physical Link Coding stage implements, as the name suggests, the channel physical layer (e.g. 1000BASE-T, 10GBASE-R, etc.); the Protocol Manager stage handles the data/network/transport layers (e.g. Time Division Multiplexing or UDP), depending on the kind of channel; the Data Processing stage implements application-dependent reshuffling of data streams (e.g. performing de/compression); and the APEnet Protocol Encoder performs protocol adaptation, encapsulating inbound payload data into the APElink packet protocol (used in the inner NaNet logic) and decapsulating outbound APElink packets before re-encapsulating their payload into the output channel transport protocol (e.g. UDP).
 +
 
 +
This general architecture has been specialized into several configurations to match the requirements of different experimental setups:
 +
*[[#NaNet-1|NaNet-1]] featuring a '''PCIe Gen2 x8''' host interface plus a '''GbE''' one and three optional 34 Gbps APElink channels, and implemented on the Altera Stratix IV FPGA Development Kit.
 +
*[[#NaNet3|NaNet<sup>3</sup>]] implemented on the Terasic DE5-NET Stratix V FPGA development board sporting four SFP+ cages. It supports '''four 2.5 Gbps deterministic latency optical KM3link channels''' and a '''PCIe Gen2 x8''' host interface.
 +
*[[#NaNet-10|NaNet-10]] featuring '''four 10GbE SFP+ ports''' along with a '''PCIe x8 Gen2/Gen3''' host interface and also implemented on the Terasic DE5-NET board.
 +
*[[#NaNet-40|NaNet-40]] featuring '''two 40GbE QSFP+ ports''' along with a '''PCIe x8 Gen3''' host interface, implemented on the Bittware S5-PCIe-HQ board (with Altera Stratix V)
 +
 
 +
The board is able to manage either 34 Gbps APElink channel or 1/10/40 GbE interfaces and exploit the GPUDirect
 +
P2P capabilities of NVIDIA Fermi/Kepler/Maxwell/Pascal GPUs equipping a hosting PC to directly inject into
 
their memory a UDP input data stream from the detector front-end, with rates compatible
 
their memory a UDP input data stream from the detector front-end, with rates compatible
 
with the low latency real-time requirements of the trigger system.
 
with the low latency real-time requirements of the trigger system.
Line 13: Line 45:
 
CPU can be offloaded from any data communication or computing task, leaving to it only system
 
CPU can be offloaded from any data communication or computing task, leaving to it only system
 
configuration and GPU kernel launch tasks. Within NaNet, this meant that data communication
 
configuration and GPU kernel launch tasks. Within NaNet, this meant that data communication
tasks were entirely offloaded to a dedicated UDP protocol-handling block directly communicating
+
tasks were entirely offloaded to a dedicated IP/UDP protocol-handling block directly communicating
 
with the P2P logic: this allows a direct (no data coalescing or staging is performed) data transfer
 
with the P2P logic: this allows a direct (no data coalescing or staging is performed) data transfer
 
with low and predictable latency on the GbE link → GPU data path.
 
with low and predictable latency on the GbE link → GPU data path.
 
The UDP OFFLOAD block comes from an open core module <ref>http://www.alterawiki.com/wiki/Nios_II_UDP_Offload_Example</ref> built for a Stratix II 2SGX90
 
development board. Focus of that design is the unburdening of the Nios II soft-core
 
microprocessor onboard the Stratix II from UDP packet management duties by a module
 
that collects data coming from the Avalon Streaming Interface (Avalon-ST) of the Altera
 
Triple-Speed Ethernet Megacore (TSE MAC) and redirects UDP packets along a hardware
 
processing data path. The Nios II subsystem executes the InterNiche TCP/IP stack to setup
 
and tear down UDP packet streams which are processed in hardware at the maximum data rate
 
achievable over the GbE network.
 
 
Bringing the open core into the NaNet design required some modifications, first of all the
 
hardware code was upgraded to work on the Stratix IV FPGA family; this upgrade made
 
available the improved performance of an FPGA which is two technology steps ahead with respect
 
to the Stratix II.
 
 
The synthesis performed on a Stratix IV achieves the target frequency of 200 MHz (in the
 
current APEnet+ implementation, the Nios II subsystem operates at the same frequency).
 
The current NaNet implementation provides a single 32-bit wide channel; it achieves 6.4 Gbps
 
at the present operating frequency, 6 times greater than what is required for a GbE channel.
 
Data coming from the single channel of the modified UDP OFFLOAD are collected by the NaNet
 
CTRL. NaNet CTRL is a hardware module in charge of managing the GbE flow by encapsulating packets in the typical APEnet+ protocol (''header'', ''footer'' and a ''payload'' whose maximum size is 4096 bytes).
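
The 6.4 Gbps figure quoted above follows directly from the channel width and the operating frequency. A minimal sketch of that arithmetic is given here; the function name is illustrative only and is not part of the NaNet code base.

```python
def channel_bandwidth_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw throughput of a parallel channel moving width_bits per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


# Values taken from the text: one 32-bit channel clocked at 200 MHz.
udp_offload_gbps = channel_bandwidth_gbps(32, 200.0)

# Headroom with respect to the nominal 1 Gbps of a GbE link.
headroom = udp_offload_gbps / 1.0

print(f"{udp_offload_gbps:.1f} Gbps, about {headroom:.1f}x a GbE channel")
```

The ratio comes out slightly above 6, in line with the "6 times greater" statement in the text.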
 
  
 
Incoming data streams are processed by a Physical Link Coding block feeding the Data Protocol Manager that
 
Incoming data streams are processed by a Physical Link Coding block feeding the Data Protocol Manager that
 
in turn extracts the payload data. These payload data are encapsulated by the NaNet Controller and sent to the APEnet+ Network Interface.
 
in turn extracts the payload data. These payload data are encapsulated by the NaNet Controller and sent to the APEnet+ Network Interface.
  
The ''Distributed Network Processor'' (DNP) is the APEnet+ core logic, acting as
+
The ''Distributed Network Processor'' (DNP)<ref>A. Biagioni, F. Lo Cicero, A. Lonardo, P.S. Paolucci, M. Perra, D. Rossetti, C. Sidore, F. Simula, L. Tosoratto and P. Vicini - '''The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture''', March 2012. (http://arxiv.org/abs/1203.1536).</ref> is the APEnet+ core logic, acting as an off-loading engine for the computing node in performing inter-node communications. The DNP provides hardware support for the Remote Direct Memory Access (RDMA) protocol, guaranteeing low-latency data transfers. Moreover, APEnet+ is also able to directly access the memory of Fermi/Kepler/Pascal-class NVIDIA GPUs (provided that both devices share the same upstream PCIe root complex), leveraging their peer-to-peer capabilities. This is a first-of-its-kind feature for a non-NVIDIA device (GPUDirect RDMA being its commercial name), allowing unstaged off-board GPU-to-GPU transfers with unprecedented low latency.
an off-loading engine for the computing node in performing inter-node communications. The DNP provides hardware support for the Remote Direct Memory Access (RDMA) protocol guaranteeing low-latency data transfers. Moreover, APEnet+ is also able to directly access the Fermi- and
 
Kepler-class NVIDIA GPUs memory (provided that both devices share the same upstream PCIe root complex) leveraging their peer-to-peer capabilities. This is a first-of-its-kind feature for a non-NVIDIA device (GPUDirect RDMA being its commercial name), allowing unstaged off-board GPU-to-GPU transfers with unprecedented low latency.
 
  
 +
Fig. 2 shows a summary of the FPGA logic resources used, as measured by the synthesis software.
  
<gallery heights= 380px widths=600px mode="packed-hover">
 
File:NaNet_internals_red.png      | NaNet Architecture and Data Flow.
 
File:ApenetPlus_Board.jpg                | NaNet-1 implemented on an Altera Stratix IV coupled with a custom mezzanine card sporting 3 APE link channels.
 
</gallery>
 
  
 
=====NaNet Architecture and Data Flow=====
 
=====NaNet Architecture and Data Flow=====
Line 60: Line 66:
 
*A Performance Counter is used to analyze the latency of the GbE data flow inside the NIC.
 
*A Performance Counter is used to analyze the latency of the GbE data flow inside the NIC.
  
====NaNet-1====
+
=====Software Stack=====
This version of the NIC features GPUDirect RDMA over 1 GbE and optionally 3 APElink channels.
+
Software components for NaNet operation are needed both on the x86 host and on the Nios II FPGA-embedded μcontroller. On the x86 host, a GNU/Linux kernel driver and an application
The design employs SGMII standard interface to connect the MAC to the PHY including Management Data I/O (MDIO); the MAC is a
+
library are present.
single module in FIFO mode for both the receive and the transmit sides (2048x32 bits).
 
The logic resources consumption is shown in fig. (__)
 
[[File:NaNet_1_10_Resource_Consumption.png|right|500px|thumb|NaNet-1 and NaNet-10 logic resources consumption]]
 
  
=====Software Stack=====
+
The application library provides an API mainly for open/close device
Software components for NaNet-1 operation are needed both on the x86 host and on the Nios II
 
FPGA-embedded μcontroller. On the x86 host, a GNU/Linux kernel driver and an application
 
library are present. The application library provides an API mainly for open/close device
 
 
operations, registration (''i.e.'' allocation, pinning and returning of virtual addresses of buffers
 
operations, registration (''i.e.'' allocation, pinning and returning of virtual addresses of buffers
 
to the application) and deregistration of circular lists of persistent receiving buffers (CLOPs)
 
to the application) and deregistration of circular lists of persistent receiving buffers (CLOPs)
 
in GPU and/or host memory and signalling of receive events on these registered buffers to
 
in GPU and/or host memory and signalling of receive events on these registered buffers to
the application (''e.g.'' to invoke a GPU kernel to process data just received in GPU memory).
+
the application (''e.g.'' to invoke a GPU kernel to process data just received in GPU memory, see fig. 5).
 +
 
 
On the μcontroller, a single process application is in charge of device configuration, generation
 
On the μcontroller, a single process application is in charge of device configuration, generation
 
of the destination virtual address inside the CLOP for incoming packets payload and virtual
 
of the destination virtual address inside the CLOP for incoming packets payload and virtual
Line 80: Line 81:
 
destination buffer takes place.
 
destination buffer takes place.
  
 +
<gallery heights= 340px widths=410px mode="traditional">
 +
File:NaNet_resource_consumption.png      | Figure 2. An overview of NaNet resource consumption.
 +
File:ApenetPlus_Board.jpg                | Figure 3. NaNet-1 implemented on an Altera Stratix IV coupled with a custom mezzanine card sporting 3 APE link channels.
 +
File:TerasicDE5_plain.png                | Figure 4. NaNet-10 implemented on an Altera Stratix V and featuring four 10GbE SFP+ ports.
 +
</gallery>
 +
 +
====NaNet-1====
 +
This version of the NIC features GPUDirect RDMA over 1 GbE and optionally 3 APElink channels.
 +
The design employs SGMII standard interface to connect the MAC to the PHY including Management Data I/O (MDIO); the MAC is a
 +
single module in FIFO mode for both the receive and the transmit sides (2048x32 bits).
 +
The logic resources consumption is shown in fig. 2.
 +
 +
 +
<gallery widths=320px heights=300px mode="traditional">
 +
File:CLOP_NaNet_GPU.png                                | Figure 5. CLOPs scheme and their implementation in the NA62-NaNet-GPU work flow.
 +
File:A60_R08_D2_0_udpsop2comp.jpg                      | Figure 6. Distribution plot over 60000 samples of a NaNet traversal time.
 +
File:LeCroy_Latenza_Nanet_TEL62_32packets_clop8.png    | Figure 7. TEL62 to NaNet communication latency measurements performed with an oscilloscope.
 +
</gallery>
 +
 +
=====NIC packets traversal latency=====
 +
In order to characterize the host+NIC system, a “system loopback” configuration was used: by connecting one GbE interface of the hosting PC to the NaNet, it was possible to generate and receive a UDP stream in a single host process, measuring latency as the difference between the values of the host processor Time Stamp Counter register at send and receive time of the same UDP packet.
  
=====Performance Analysis=====
 
In order to characterize the host+NIC system, a “system loopback” configuration was used: by connecting one GbE interface of the hosting PC to the NaNet, it was possible to generate and receive a UDP stream in a single host process, measuring latency as the difference between the values of the host processor Time Stamp Counter register at send and receive time of the same UDP packet (fig. __).
 
 
Latency inside the NIC was measured by adding 4 cycle counters at different stages of packet processing; their values are stored in a profiling packet footer with a resolution of 4 ns; for a
 
Latency inside the NIC was measured by adding 4 cycle counters at different stages of packet processing; their values are stored in a profiling packet footer with a resolution of 4 ns; for a
standard 1472 bytes UDP packet, traversal time ranges between 7.3 us and 8.6 us from input of NaNet CTRL to the completion signal of the DMA transaction on the PCIe bus (fig. __).
+
standard 1472 bytes UDP packet, traversal time ranges between 7.3 us and 8.6 us from input of NaNet CTRL to the completion signal of the DMA transaction on the PCIe bus (fig. 6).
 +
 
 
For the same packet size, saturation of the GbE channel is achieved, with 119.7 MB/s of sustained bandwidth.
 
For the same packet size, saturation of the GbE channel is achieved, with 119.7 MB/s of sustained bandwidth.
 +
 
For scheduled improvements on the NaNet design see [[#NaNet-10| NaNet-10]].
 
For scheduled improvements on the NaNet design see [[#NaNet-10| NaNet-10]].
 +
 +
=====Preliminary tests for integration in a working environment=====
 +
 +
Tests for the integration in the [[#A physics case: the Low Level trigger in NA62 Experiment's RICH Detector|NA62 experimental setup]] have been performed with the NaNet-1 board, connecting a TEL62 to the NaNet GbE port and sending some Monte Carlo-generated events.
 +
 +
An example of latency measurements performed with an oscilloscope is shown in fig. 7: a bunch of 32 UDP packets is sent from the TEL62 (red signal), and 4 PCIe completions (yellow signal) pinpoint the end of the DMA write transactions towards the GPU memory buffers, each sized 8 times the UDP packet payload size.
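
The relation between the number of packets sent and the number of PCIe completions observed can be checked with a few lines of arithmetic. This is a sketch under the assumption that one completion is raised each time a receive buffer has been filled; the function name is hypothetical.

```python
def dma_completions(n_packets: int, payloads_per_buffer: int) -> tuple[int, int]:
    """Return (completed buffers, packets left in a partially filled buffer).

    Assumes, for illustration, that a PCIe completion is signalled once per
    filled buffer and that every packet carries a payload of the same size.
    """
    return divmod(n_packets, payloads_per_buffer)


# Values from the measurement described above: 32 UDP packets,
# buffers sized 8 times the UDP payload.
completed, leftover = dma_completions(32, 8)
print(completed, leftover)
```

With the numbers from the text this gives 4 completed buffers and no leftover packets, matching the 4 completions seen on the oscilloscope.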
  
 
====NaNet-10====
 
====NaNet-10====
  
 +
This version of our NIC is implemented on the Terasic DE5-net board equipped with an Altera Stratix V FPGA and featuring four 10GbE SFP+ ports and a PCIe Gen2 x8 connector (see fig. 3).
 +
The network adapter offers hardware support for both direct CPU/GPU memory access and an offloading engine managing the network protocol stack.
 +
 +
=====Data transmission system=====
 +
 +
Following the design guidelines for the NaNet I/O interface described in [[#NaNet Architecture| NaNet Architecture]] the Physical Link Coding is implemented by two Altera IPs, the 10GBASE-R PHY
 +
and the 10 Gbps MAC. The 10GBASE-R PHY IP delivers serialized data to an optical module
 +
that drives optical fiber at a line rate of 10.3125 Gbps.  PCS and PMA are implemented as hard
 +
IP blocks in Stratix V devices, using dedicated FPGA resources.
 +
The 10 Gbps MAC supports 10 Mbps, 100 Mbps, 1 Gbps, 10 Gbps operating modes with Avalon-Streaming up to 64-bit wide
 +
client interface running at 156.25 MHz and MII/GMII/SDR XGMII on the network side.
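
The figures quoted for the 10 Gbps data path are mutually consistent, as the short sketch below shows. It relies on one fact not stated in the text, namely that 10GBASE-R uses 64b/66b line coding; the function names are illustrative.

```python
def datapath_gbps(width_bits: int, clock_mhz: float) -> float:
    """Throughput of a parallel interface moving width_bits per clock cycle."""
    return width_bits * clock_mhz / 1e3


def line_rate_64b66b_gbps(data_gbps: float) -> float:
    """Serial line rate after 64b/66b encoding (66 line bits per 64 data bits)."""
    return data_gbps * 66 / 64


# 64-bit client interface at 156.25 MHz, as described above.
mac_gbps = datapath_gbps(64, 156.25)
serial_gbps = line_rate_64b66b_gbps(mac_gbps)
print(mac_gbps, serial_gbps)
```

The 64-bit interface at 156.25 MHz carries exactly 10 Gbps, and the encoding overhead accounts for the 10.3125 Gbps line rate on the fiber.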
 +
 +
We developed a custom 10 Gbps UDP/IP Core as a Protocol Manager of the I/O interface, providing full UDP, IPv4 and ARP protocols. It is derived and adapted from the FPGA-proven 1 Gbps UDP/IP open core<ref> 1g eth udp/ip stack (https://opencores.org/project/udp_ip_stack).</ref> and provides an AXI-based 64-bit data interface at an operating frequency of 156.25 MHz. Several registers are exposed for UDP header settings (e.g. source/destination port and destination IP address) both in the transmit and receive side.
 +
 +
IP and MAC addresses are also fully customizable. The core offers ARP-level functionality, with a 256-entry cache for IP-to-MAC address translation. The underlying ARP communication is automatic: when the first packet transfer occurs, sender and receiver mutually exchange information about their own IP and MAC addresses.
 +
 +
There  is  no  data  buffering  internally,
 +
allowing zero latency between the Data Processing block and the Physical layer.  For this reason
 +
packet segmentation and reassembly are not supported.
 +
 +
The Multi-Stream and Decompressor hardware components apply application-dependent modifications to accelerate the GPU computing task. The Multi-Stream module analyses the received data stream and separates the packets according to the UDP destination port. A Decompressor stage was added in the I/O interface to reformat event data in a GPU-friendly fashion on the fly.
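
The behaviour of the Multi-Stream module can be pictured in software as a demultiplexer keyed on the UDP destination port. The following is only a behavioural sketch: the real module is FPGA logic, and the packet representation and port numbers used here are hypothetical.

```python
from collections import defaultdict


def split_by_destination_port(packets):
    """Group payloads by UDP destination port, preserving arrival order.

    `packets` is an iterable of (dst_port, payload) pairs; this tuple layout
    is an assumption made for the example.
    """
    streams = defaultdict(list)
    for dst_port, payload in packets:
        streams[dst_port].append(payload)
    return dict(streams)


# Hypothetical input: two interleaved streams on ports 6000 and 6001.
received = [(6000, b"a0"), (6001, b"b0"), (6000, b"a1"), (6001, b"b1")]
streams = split_by_destination_port(received)
print(streams)
```

Each resulting stream can then be handed to its own destination buffer, which is the purpose the text attributes to the separation step.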
 +
 +
The NaNet Transmission Control Logic (NaNet TCL) encapsulates the received streams into
 +
the APEnet Protocol allowing for reuse of the overall APEnet+ architecture.  Several parameters
 +
are used to configure the NaNet TCL (i.e. packet size, port id, target device) and whatever
 +
is needed to fulfill the key task of virtual address generation for the APEnet packets.  All the
 +
information for the virtual memory management is provided by the on-board micro-controller
 +
(base address, buffer size, number of available buffers).
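
As an illustration of how a destination virtual address could be derived from the three parameters listed above, the sketch below walks a circular list of equally sized buffers. The contiguous layout, the wrap-around policy and all names are assumptions made for this example and do not describe the actual NaNet firmware.

```python
class ClopAddressGenerator:
    """Toy model of address generation over a circular list of buffers.

    Assumes buffers are contiguous in virtual memory, starting at
    base_address, and that a payload never straddles two buffers.
    """

    def __init__(self, base_address: int, buffer_size: int, n_buffers: int):
        self.base_address = base_address
        self.buffer_size = buffer_size
        self.n_buffers = n_buffers
        self._index = 0   # buffer currently being filled
        self._offset = 0  # write offset inside that buffer

    def next_address(self, payload_size: int) -> int:
        if payload_size > self.buffer_size:
            raise ValueError("payload larger than a buffer")
        # Move to the next buffer (wrapping around) when the payload does not fit.
        if self._offset + payload_size > self.buffer_size:
            self._index = (self._index + 1) % self.n_buffers
            self._offset = 0
        address = self.base_address + self._index * self.buffer_size + self._offset
        self._offset += payload_size
        return address


# Hypothetical configuration: 2 buffers of 2048 bytes starting at 0x1000.
gen = ClopAddressGenerator(base_address=0x1000, buffer_size=2048, n_buffers=2)
addresses = [gen.next_address(1024) for _ in range(5)]
print([hex(a) for a in addresses])
```

After the last buffer is filled the generator wraps back to the base address, which is why the consumer must process a buffer before the list comes around to it again.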
 +
 +
=====Performance Analysis=====
 +
NaNet-10 performances are assessed on a SuperMicro Server. The setup comprises a X9DRG-HF
 +
dual socket motherboard — Intel C602 Patsburg chipset — populated with Intel Xeon E5-2630
 +
@2.60 GHz CPUs (i.e. Ivy Bridge micro-architecture), 64 GB of DDR3 memory and a
 +
Kepler-class NVIDIA K40m GPU.
 +
 +
Measurements are performed in “loop-back” configuration closing 2 out of 4 available ports
 +
on the Terasic DE5-net board.  The outgoing packet payload is generated via a custom hardware
 +
module directly feeding the UDP TX. Packet size, number of transmitted packets, delay between
 +
packets and UDP protocol destination port are configured setting several NaNet-10 registers.
 +
First we measure the time-of-flight from UDP TX to UDP RX exploiting the SignalTap II Logic Analyzer tool of the Altera Quartus II suite: this is 64 clock cycles @156.25 MHz (409.6 ns).
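
The conversion from clock cycles to time used for the figure above can be reproduced as follows; the helper name is illustrative.

```python
def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a clock-cycle count into nanoseconds for a given clock frequency."""
    return cycles * 1e3 / clock_mhz


# 64 cycles at 156.25 MHz, as measured with the logic analyzer.
time_of_flight_ns = cycles_to_ns(64, 156.25)
print(f"{time_of_flight_ns:.1f} ns")
```

The same helper applies to any of the cycle counters mentioned in this page once the relevant clock frequency is known.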
 +
 +
The receiving hardware path traversal latency is profiled via cycle counter values recorded at
 +
different stages during packet processing.  The values are patched into the completion payload
 +
and stored in the event queue.  The adoption of a data transmission custom hardware module
 +
ensures that results are not affected by external latency sources (i.e. DMA memory reading
 +
process).  The custom module is always ready to send data exploiting the entire link capability
 +
mimicking the detector readout system “worst-case”.
 +
 +
A comparison between NaNet-10 and NaNet-1 latency results in the range of interest is shown
 +
in fig. 8.  The NaNet-10 NIC experiences sub-microsecond hardware latency moving data to the
 +
GPU/CPU memory for buffer sizes up to 1 kByte.
 +
Focusing on bandwidth, the maximum capability of the data transmission system, 10 Gbps, is
 +
already reached for a 1 kByte buffer size (fig. 9).
 +
 +
Finally, we notice that the performance of moving data from the I/O interface of the NIC to the target device memory is the same for both the CPU and the GPU.
 +
 +
 +
<gallery widths=400px heights=350px mode="traditional">
 +
File:Nanet10_lat-1.png                                | Figure 8. A comparison  between  NaNet-10 and NaNet-1 hardware latency.
 +
File:Nanet10_bw-1.png                                  | Figure 9. A comparison  between  NaNet-10 and NaNet-1 bandwidth.
 +
</gallery>
 +
 +
=====A physics case: the Low Level trigger in NA62 Experiment's RICH Detector=====
 +
 +
NaNet is currently being used in a pilot project within the [https://home.cern/about/experiments/na62 CERN NA62 experiment] aiming at investigating GPUs usage in the central Level 0 trigger processor (L0TP) <ref>G. Lamanna, G. Collazuol, Marco Sozzi - '''GPUs for fast triggering and pattern matching at the CERN experiment NA62''', Nuclear Science Symposium Conference Record, 2009</ref>. This is a synchronous real-time
 +
system implemented in hardware through FPGAs on the readout boards with a time budget for
 +
trigger decision of 1 ms.
 +
 +
The GPU-RICH processing stage is positioned between the RICH readout and the L0 Trigger Processor (L0TP) with the
 +
task of computing in real-time physics-related primitives (i.e. centers and radii of Čerenkov ring
 +
patterns on the photomultipliers arrays), in order to improve the low level trigger discrimination
 +
capability.
 +
Data from the detector photomultipliers (PMTs) are collected by four readout boards
 +
(TEL62) sending primitives to NaNet-10 as UDP datagram streams over two GbE channels (for each board, see
 +
fig. 10) connected to a GbE/10GbE switch. Packets are then routed on a 10GbE channel towards one of the NaNet-10 ports.
 +
 +
A processing stage on the onboard FPGA decompresses
 +
and coalesces the events fragments scattered among the several UDP streams; zero-copy DMAs
 +
towards the GPU memory are then instantiated to transmit the reconstructed events. Events
 +
are gathered and arranged in the GPU memory as a Circular List Of Persistent buffers (CLOP),
 +
according to a configurable time window which must be shorter than the total processing time to
 +
avert the overwriting of the buffers before they are consumed by the pattern recognition CUDA
 +
kernel.
 +
 +
A system like this, obtained retrofitting the RICH detector with a processing system
 +
capable of exploiting different kinds of computing units, can be regarded as a heterogeneous
 +
“smart detector”.
 +
 +
=====Results from NA62 RUN 2017=====
 +
The NaNet based system is installed close to the
 +
RICH readout rack in the NA62 experimental hall. After preliminary tests performed with NaNet-1 and a Kepler-class GPU (NVIDIA K20c) in 2013-2015, it was updated to NaNet-10 in 2016 and then further improved with a Pascal-class GPU (NVIDIA P100) during the NA62 2017 Run.
 +
 +
The test-bed is made up of an HP2920 switch and a NaNet-10 PCIe board plugged into a server made of an X9DRG-QF dual-socket motherboard populated with Intel Xeon E5-2620 @2.00 GHz CPUs (i.e. Ivy Bridge architecture), 32 GB of DDR3 RAM, and an NVIDIA Pascal P100 GPU.
 +
 +
Latencies of
 +
the stages corresponding to GPU processing (event indexing and ring reconstruction) and sending UDP packets with the results to the L0TP are shown in fig. 11.
 +
Those are data regarding a single burst at beam intensity
 +
∼ 19 × 10<sup>11</sup> pps and with a gathering time for NaNet-10
 +
of 250 μs (dashed line on plot).
 +
 +
The overall time is always well below the time budget limit, which makes this an ideal working point. Since the quality of the ring reconstruction is similar to that of the offline reconstruction and the processing time stays below the maximum latency allowed by the system, we conclude that the GPU trigger system as implemented can be used in future runs to design more selective and efficient trigger conditions.
 +
 +
 +
<gallery widths=400px heights=360px mode="traditional">
 +
File:NaNet_Internal_ReadoutScheme_plain.png            | Figure 10. GPU-RICH readout system.
 +
File:Latency_stripes_data_03102017_r8196_burst1263.png  | Figure 11. GPU system heterogeneous processing pipeline latency. The red area represents the latency of event indexing in the CLOP buffer (almost constant), the cyan area shows the ring reconstruction kernel latency and the blue is the sending stage.
 +
</gallery>
  
 
====NaNet<sup>3</sup>====
 
====NaNet<sup>3</sup>====
 +
[[File:StratixV_nanet3_testbed-edit-oriz.jpg|thumb|right|350px|NaNet<sup>3</sup> testbed: board is connected to off-shore Read-Out system via optical cable.]]
 +
To be complete, an overview of the NaNet board family must mention the ongoing development of the NaNet<sup>3</sup> board for the KM3 HEP experiment<ref>M. Ageron et al., '''Technical Design Report for a Deep-Sea Research Infrastructure in the Mediterranean Sea Incorporating a Very Large Volume Neutrino Telescope''', Tech. Rep. ISBN 978-90-6488-033-9.</ref>. In KM3 the board is tasked with delivering global clock and synchronization signals to the underwater electronic system and receiving photomultiplier data via optical cables. The design employs Altera Deterministic Latency Transceivers with an 8B10B encoding scheme as Physical Link Coding and a Time Division MultiPlexing (TDMP) data transmission protocol. The current implementation is being developed on the Altera Stratix V development board with a Terasic SFP-HSMC daughtercard plugged on top and sporting 4 transceiver-based SFP ports.
 +
 +
<br clear=all>
 +
===Further enhancements===
 +
====NaNet-40====
 +
To enable 40 Gb Ethernet with UDP/IP protocol hardware offload we developed this new member of the NaNet family, based on a Bittware S5-PCIe-HQ board with Altera Stratix V. It is equipped with 2 QSFP+ ports and a PCIe x8 Gen3 connector.
 +
 +
==NaNet Public Documentation==
 +
 +
* [[NaNet_Publications|NaNet Publications and Talks]]
 +
 +
*; M.S. Theses
 +
*: [[Media:Tesi_Pontisso_2014.pdf|Luca Pontisso, Master Thesis in Physics, Sapienza - Università di Roma. Title: "Caratterizzazione della scheda di comunicazione NaNet e suo utilizzo nel Trigger di Livello 0 basato su GPU dell’esperimento NA62"]] (2014)
  
 +
* [[NaNet_init_procedure|NaNet_init_procedure]]
  
 +
----
  
====Notes:====
+
==References==
  
 
<references />
 
<references />

Latest revision as of 16:15, 5 June 2018


NaNet is an FPGA-based PCIe Network Interface Card (NIC) design with Remote Direct Memory Access (RDMA) and GPUDirect P2P/RDMA capabilities featuring a configurable and extensible set of network channels.

The design supports both standard and custom channels:

  • GbE (1000BASE-T)
  • 10GbE (10GBASE-KR)
  • 40GbE
  • APElink (custom 34 Gbps link dedicated to HPC systems)
  • KM3link (deterministic latency 2.5Gbps link used in the KM3Net-IT experiment data acquisition system)

The RDMA feature, combined with a transport protocol layer offload module and a data stream processing stage, makes NaNet a low-latency NIC suitable for real-time processing of data streams.

NaNet's GPUDirect capability enables the connected processing system to exploit the high computing performance of modern GPUs in real-time applications.

Since January 2015 NaNet is an INFN Scientific Committee 5 funded experiment.

Here you can find a list of publications and talks about the project.

NaNet Architecture

The NaNet design is partitioned into 4 main modules: Router, Network Interface, PCIe Core, and I/O Interface (see fig. 1).

Figure 1 - NaNet General Architecture
  1. The Router module supports a configurable number of ports implementing a full crossbar switch responsible for data routing and dispatch. Number and bit-width of the switch ports and the routing algorithm can all be defined by the user to automatically achieve a desired configuration. The Router block dynamically interconnects the ports and comprises a fully connected switch, plus routing and arbitration blocks managing multiple data flows @2.8 GB/s.
  2. The Network Interface block acts on the transmitting side by gathering data incoming from the PCIe port and forwarding them to the Router destination ports; on the receiving side it provides support for RDMA in communications involving both the host and the GPU (via a dedicated GPU I/O Accelerator module). A NIOS-II microcontroller handles configuration and runtime operations.
  3. The PCIe Core module is built upon a powerful commercial core from PLDA that sports a simplified but efficient backend interface and multiple DMA engines.
  4. The I/O Interface module is the discriminating component among the cards in the NaNet family. It is redesigned each time to satisfy the requirements of the readout system data transmission protocol, optimizing the data movement process for the different experiments. The I/O Interface module performs a 4-stage processing on the data stream: following the OSI model, the Physical Link Coding stage implements, as the name suggests, the channel physical layer (e.g. 1000BASE-T, 10GBASE-R, etc.); the Protocol Manager stage handles the data/network/transport layers (e.g. Time Division Multiplexing or UDP), depending on the kind of channel; the Data Processing stage implements application-dependent reshuffling of data streams (e.g. performing de/compression); and the APEnet Protocol Encoder performs protocol adaptation, encapsulating inbound payload data into the APElink packet protocol (used in the inner NaNet logic) and decapsulating outbound APElink packets before re-encapsulating their payload into the output channel transport protocol (e.g. UDP).

This general architecture has been specialized into several configurations to match the requirements of different experimental setups:

  • NaNet-1, featuring a PCIe Gen2 x8 host interface, a GbE channel and three optional 34 Gbps APElink channels, implemented on the Altera Stratix IV FPGA Development Kit.
  • NaNet3, implemented on the Terasic DE5-NET Stratix V FPGA development board sporting four SFP+ cages. It supports four 2.5 Gbps deterministic latency optical KM3link channels and a PCIe Gen2 x8 host interface.
  • NaNet-10, featuring four 10GbE SFP+ ports along with a PCIe x8 Gen2/Gen3 host interface, also implemented on the Terasic DE5-NET board.
  • NaNet-40, featuring two 40GbE QSFP+ ports along with a PCIe x8 Gen3 host interface, implemented on the Bittware S5-PCIe-HQ board (with Altera Stratix V).

The board is able to manage either 34 Gbps APElink channels or 1/10/40 GbE interfaces and to exploit the GPUDirect P2P capabilities of NVIDIA Fermi/Kepler/Maxwell/Pascal GPUs equipping a hosting PC to directly inject into their memory a UDP input data stream from the detector front-end, with rates compatible with the low-latency real-time requirements of the trigger system.

In order to render harmless the unavoidable OS jitter effects that usually hinder system response time stability, the main design rule is to partition the system so that the hosting PC CPU can be offloaded from any data communication or computing task, leaving to it only system configuration and GPU kernel launch tasks. Within NaNet, this meant that data communication tasks were entirely offloaded to a dedicated IP/UDP protocol-handling block directly communicating with the P2P logic: this allows a direct (no data coalescing or staging is performed) data transfer with low and predictable latency on the GbE link → GPU data path.

Incoming data streams are processed by a Physical Link Coding block feeding the Data Protocol Manager that in turn extracts the payload data. These payload data are encapsulated by the NaNet Controller and sent to the APEnet+ Network Interface.

The Distributed Network Processor (DNP)[1] is the APEnet+ core logic, acting as an off-loading engine for the computing node in performing inter-node communications. The DNP provides hardware support for the Remote Direct Memory Access (RDMA) protocol, guaranteeing low-latency data transfers. Moreover, APEnet+ is also able to directly access the memory of Fermi/Kepler/Pascal-class NVIDIA GPUs (provided that both devices share the same upstream PCIe root complex), leveraging their peer-to-peer capabilities. This is a first-of-its-kind feature for a non-NVIDIA device (GPUDirect RDMA being its commercial name), allowing unstaged off-board GPU-to-GPU transfers with unprecedented low latency.

Fig. 2 shows a recap of the FPGA logic resources used, as measured by the synthesis software.


NaNet Architecture and Data Flow
  • APEnet+ Firmware Customization.
  • UDP offload collects data coming from the Altera Triple-Speed Ethernet Megacore (TSE MAC) and extracts the UDP packet payload, providing a 32-bit wide channel achieving 6.4 Gbps and relieving the Nios II of data protocol management.
  • NaNet Controller (CTRL) encapsulates the UDP payload in a newly forged APEnet+ packet, sending it to the RX Network Interface logic.
  • RX DMA CTRL manages the CPU/GPU memory write process, providing hardware support for the Remote Direct Memory Access (RDMA) protocol.
  • Nios II handles all the details pertaining to buffers registered by the application to implement a zero-copy approach of the RDMA protocol (it stays out of the data stream).
  • EQ DMA CTRL generates a DMA write transfer to communicate the completion of the CPU/GPU memory write process.
  • A Performance Counter is used to analyze the latency of the GbE data flow inside the NIC.
Software Stack

Software components for NaNet operation are needed both on the x86 host and on the Nios II FPGA-embedded μcontroller. On the x86 host, a GNU/Linux kernel driver and an application library are present.

The application library provides an API mainly for open/close device operations, registration (i.e. allocation, pinning and returning of virtual addresses of buffers to the application) and deregistration of circular lists of persistent receiving buffers (CLOPs) in GPU and/or host memory and signalling of receive events on these registered buffers to the application (e.g. to invoke a GPU kernel to process data just received in GPU memory, see fig. 5).

On the μcontroller, a single process application is in charge of device configuration, generation of the destination virtual address inside the CLOP for incoming packets payload and virtual to physical memory address translation performed before the PCIe DMA transaction to the destination buffer takes place.

NaNet-1

This version of the NIC features GPUDirect RDMA over 1 GbE and optionally 3 APElink channels. The design employs the SGMII standard interface to connect the MAC to the PHY, including Management Data I/O (MDIO); the MAC is a single module in FIFO mode for both the receive and the transmit sides (2048x32 bits). The logic resource consumption is shown in fig. 2.


NIC packets traversal latency

In order to characterize the host+NIC system, a “system loopback” configuration was used: by connecting one GbE interface of the hosting PC to the NaNet, it was possible to generate and receive a UDP stream in a single host process, measuring latency as the difference between the values of the host processor Time Stamp Counter register at send and receive time of the same UDP packet.
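The single-process measurement idea can be sketched as follows. This is only an illustration of the method, not the code used for the measurements: it assumes the kernel loopback interface in place of the GbE cable and NaNet receive path, and uses `time.perf_counter_ns` as a stand-in for the Time Stamp Counter; the function name is hypothetical.

```python
import socket
import time

def udp_loopback_latency_ns(payload=b"x" * 1472, host="127.0.0.1"):
    """Send one UDP datagram to ourselves and time send-to-receive.

    Assumption: kernel loopback replaces the real host GbE -> NaNet path,
    and perf_counter_ns replaces the processor Time Stamp Counter.
    """
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.bind((host, 0))            # let the OS pick a free port
        rx.settimeout(1.0)            # do not hang if the datagram is lost
        t_send = time.perf_counter_ns()
        tx.sendto(payload, rx.getsockname())
        data, _ = rx.recvfrom(65535)
        t_recv = time.perf_counter_ns()
    finally:
        rx.close()
        tx.close()
    if data != payload:
        raise RuntimeError("received datagram does not match the one sent")
    return t_recv - t_send

if __name__ == "__main__":
    print(f"latency: {udp_loopback_latency_ns() / 1000:.1f} us")
```

Because sender and receiver share one process and one clock, no clock synchronization between machines is needed, which is the point of the loopback configuration.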

Latency inside the NIC was measured by adding four cycle counters at different stages of packet processing; their values are stored in a profiling packet footer with a resolution of 4 ns. For a standard 1472-byte UDP packet, traversal time ranges between 7.3 μs and 8.6 μs from the input of NaNet CTRL to the completion signal of the DMA transaction on the PCIe bus (fig. 6).

For the same packet size, saturation of the GbE channel is achieved, with 119.7 MB/s of sustained bandwidth.
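The saturation figure can be cross-checked with a back-of-the-envelope estimate of the peak UDP payload rate on a GbE link. The sketch below assumes standard untagged Ethernet framing and an IPv4 header without options; the constant and function names are illustrative.

```python
# Per-frame overhead on the wire, in bytes (untagged Ethernet, IPv4, UDP).
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # minimum idle time between frames, in byte times
IPV4_HEADER = 20      # assumption: no IP options
UDP_HEADER = 8

def udp_payload_rate_mb_s(payload_bytes, line_rate_bps=1_000_000_000):
    """Peak UDP payload throughput in MB/s (1 MB = 10**6 bytes)."""
    wire_bytes = (PREAMBLE_SFD + ETH_HEADER + IPV4_HEADER + UDP_HEADER
                  + payload_bytes + FCS + INTERFRAME_GAP)
    efficiency = payload_bytes / wire_bytes
    return efficiency * line_rate_bps / 8 / 1e6

print(f"{udp_payload_rate_mb_s(1472):.1f} MB/s")
```

For a 1472-byte payload each frame occupies 1538 byte times on the wire, so the estimate lands at roughly 119.6 MB/s, close to the measured value.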

For scheduled improvements on the NaNet design see NaNet-10.

Preliminary tests for integration in a working environment

Tests for the integration in the NA62 experimental setup have been performed with the NaNet-1 board, connecting a TEL62 to the NaNet GbE port and sending some Monte Carlo-generated events.

An example of latency measurements performed with an oscilloscope is shown in fig. 7: a bunch of 32 UDP packets is sent from the TEL62 (red signal), then 4 PCIe completions (yellow signal) pinpoint the end of the DMA write transactions towards the GPU memory buffers, each sized 8 times the UDP packet payload size.

NaNet-10

This version of our NIC is implemented on the Terasic DE5-net board equipped with an Altera Stratix V FPGA and featuring four 10GbE SFP+ ports and a PCIe Gen2 x8 connector (see fig. 3). The network adapter offers hardware support for both direct CPU/GPU memory access and the offloading engine managing the network stack protocol.

Data transmission system

Following the design guidelines for the NaNet I/O interface described in NaNet Architecture, the Physical Link Coding is implemented by two Altera IPs, the 10GBASE-R PHY and the 10 Gbps MAC. The 10GBASE-R PHY IP delivers serialized data to an optical module that drives optical fiber at a line rate of 10.3125 Gbps. PCS and PMA are implemented as hard IP blocks in Stratix V devices, using dedicated FPGA resources. The 10 Gbps MAC supports 10 Mbps, 100 Mbps, 1 Gbps and 10 Gbps operating modes, with an Avalon-Streaming client interface up to 64 bits wide running at 156.25 MHz and MII/GMII/SDR XGMII on the network side.

We developed a custom 10 Gbps UDP/IP Core as a Protocol Manager of the I/O interface, providing full UDP, IPv4 and ARP protocols. It is derived and adapted from the FPGA-proven 1 Gbps UDP/IP open core[2] and provides an AXI-based 64-bit data interface at an operating frequency of 156.25 MHz. Several registers are exposed for UDP header settings (e.g. source/destination port and destination IP address) on both the transmit and receive sides.

IP and MAC addresses are also fully customizable. The core offers ARP-level functionalities, with a 256-entry cache for IP-to-MAC address translation. The underlying ARP communication is automatic: when the first packet transfer occurs, sender and receiver mutually exchange information about their own IP and MAC addresses.

There is no data buffering internally, allowing zero latency between the Data Processing block and the Physical layer. For this reason packet segmentation and reassembly are not supported.
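As a software illustration of what the Protocol Manager does in hardware, the sketch below extracts the UDP payload from an Ethernet frame and rejects fragmented IPv4 packets, mirroring the absence of reassembly support. It is not the core's logic: the function names are hypothetical, checksums are not verified, and only the standard Ethernet II, IPv4 and UDP header layouts are assumed.

```python
import struct

ETH_HEADER_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IP_PROTO_UDP = 17

def extract_udp_payload(frame):
    """Return (src_port, dst_port, payload) or None if the frame is not
    an unfragmented IPv4/UDP packet. Checksums are not verified."""
    if len(frame) < ETH_HEADER_LEN + 20 + 8:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip = ETH_HEADER_LEN
    version, ihl = frame[ip] >> 4, (frame[ip] & 0x0F) * 4
    if version != 4 or ihl < 20:
        return None
    (flags_frag,) = struct.unpack_from("!H", frame, ip + 6)
    # 0x3FFF masks the More Fragments flag and the 13-bit fragment offset.
    if frame[ip + 9] != IP_PROTO_UDP or flags_frag & 0x3FFF:
        return None
    udp = ip + ihl
    if len(frame) < udp + 8:
        return None
    src_port, dst_port, udp_len = struct.unpack_from("!HHH", frame, udp)
    if udp_len < 8 or len(frame) < udp + udp_len:
        return None
    return src_port, dst_port, frame[udp + 8:udp + udp_len]

def build_test_frame(payload, dst_port, flags_frag=0x4000):
    """Build a minimal frame for the demo (zeroed MACs and checksums)."""
    udp = struct.pack("!HHHH", 1234, dst_port, 8 + len(payload), 0) + payload
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(udp), 0, flags_frag,
                     64, IP_PROTO_UDP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    return eth + ip + udp

print(extract_udp_payload(build_test_frame(b"hello", 6666)))
```

A frame with the More Fragments flag set (for instance `flags_frag=0x2000`) is dropped by the sketch, which is the software analogue of not supporting segmentation and reassembly.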

The Multi-Stream and Decompressor hardware components apply application-dependent modifications to accelerate the GPU computing task. The Multi-Stream module analyses the received data stream and separates the packets according to the UDP destination port. A Decompressor stage was added in the I/O interface to reformat event data in a GPU-friendly fashion on the fly.
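The separation performed by the Multi-Stream module amounts to demultiplexing on the UDP destination port. A minimal software sketch of that idea follows; the function name, the port numbers and the tuple layout are illustrative assumptions, not part of the firmware.

```python
from collections import defaultdict

def split_by_destination_port(packets):
    """Group payloads by UDP destination port, preserving arrival order.

    `packets` is an iterable of (dst_port, payload) tuples.
    """
    streams = defaultdict(list)
    for dst_port, payload in packets:
        streams[dst_port].append(payload)
    return dict(streams)

# Two interleaved streams, told apart only by their destination port.
packets = [(6666, b"a0"), (6667, b"b0"), (6666, b"a1")]
streams = split_by_destination_port(packets)
print(streams)
```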

The NaNet Transmission Control Logic (NaNet TCL) encapsulates the received streams into the APEnet Protocol, allowing for reuse of the overall APEnet+ architecture. Several parameters are used to configure the NaNet TCL (i.e. packet size, port id, target device) and whatever is needed to fulfill the key task of virtual address generation for the APEnet packets. All the information for the virtual memory management is provided by the on-board micro-controller (base address, buffer size, number of available buffers).
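To make the role of the three memory management parameters concrete, here is one plausible address generation scheme over a circular list of equally sized buffers. The actual firmware policy is not described here, so the class name, the packing of consecutive packets into a buffer and the wrap-around rule are assumptions.

```python
class ClopAddressGenerator:
    """Hypothetical destination address generator for a circular list
    of `num_buffers` contiguous buffers of `buffer_size` bytes each."""

    def __init__(self, base_address, buffer_size, num_buffers):
        self.base_address = base_address
        self.buffer_size = buffer_size
        self.num_buffers = num_buffers
        self.index = 0    # buffer currently being filled
        self.offset = 0   # bytes already written into that buffer

    def next_address(self, packet_size):
        if packet_size > self.buffer_size:
            raise ValueError("packet does not fit in a single buffer")
        if self.offset + packet_size > self.buffer_size:
            # Current buffer is full: move on, wrapping around at the end.
            self.index = (self.index + 1) % self.num_buffers
            self.offset = 0
        address = (self.base_address
                   + self.index * self.buffer_size
                   + self.offset)
        self.offset += packet_size
        return address

gen = ClopAddressGenerator(base_address=0x1000, buffer_size=4096, num_buffers=2)
addresses = [gen.next_address(2048) for _ in range(5)]
print([hex(a) for a in addresses])
```

With two 4096-byte buffers and 2048-byte packets, the fifth packet lands back at the base address, which is why buffers must be consumed before the list wraps.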

Performance Analysis

NaNet-10 performance is assessed on a SuperMicro server. The setup comprises an X9DRG-HF dual-socket motherboard (Intel C602 Patsburg chipset) populated with Intel Xeon E5-2630 @2.60 GHz CPUs (i.e. Ivy Bridge micro-architecture), 64 GB of DDR3 memory and a Kepler-class NVIDIA K40m GPU.

Measurements are performed in “loop-back” configuration, connecting 2 of the 4 available ports on the Terasic DE5-net board to each other. The outgoing packet payload is generated via a custom hardware module directly feeding the UDP TX. Packet size, number of transmitted packets, delay between packets and UDP protocol destination port are configured by setting several NaNet-10 registers. First we measure the time-of-flight from UDP TX to UDP RX exploiting the SignalTap II Logic Analyzer tool of the Altera Quartus II suite: this is 64 clock cycles @156.25 MHz (409.6 ns).
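The conversion from cycle counts to time, and the match between the 64-bit data path at 156.25 MHz and the 10 Gbps line, can be checked with a couple of lines of arithmetic. The helper names below are illustrative.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Duration in nanoseconds of `cycles` clock periods at `clock_mhz`."""
    return cycles * 1000.0 / clock_mhz

def bus_throughput_gbps(width_bits, clock_mhz):
    """Raw throughput of a bus moving `width_bits` per clock cycle."""
    return width_bits * clock_mhz / 1000.0

print(cycles_to_ns(64, 156.25))          # time-of-flight, UDP TX to UDP RX
print(bus_throughput_gbps(64, 156.25))   # 64-bit interface at 156.25 MHz
```

The first value reproduces the 409.6 ns time-of-flight; the second shows that a 64-bit interface clocked at 156.25 MHz carries exactly 10 Gbps.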

The receiving hardware path traversal latency is profiled via cycle counter values recorded at different stages during packet processing. The values are patched into the completion payload and stored in the event queue. The adoption of a custom hardware module for data transmission ensures that results are not affected by external latency sources (i.e. the DMA memory reading process). The custom module is always ready to send data, exploiting the entire link capability and mimicking the “worst case” of the detector readout system.

A comparison between NaNet-10 and NaNet-1 latency results in the range of interest is shown in fig. 8. The NaNet-10 NIC experiences sub-microsecond hardware latency moving data to the GPU/CPU memory for buffer sizes up to ∼1 kByte. Focusing on bandwidth, the maximum capability of the data transmission system, 10 Gbps, is already reached for a ∼1 kByte buffer size (fig. 9).

Finally, we notice that the performance of moving data from the I/O interface of the NIC to the target device memory is the same for both the CPU and the GPU.


A physics case: the Low Level trigger in NA62 Experiment's RICH Detector

NaNet is currently being used in a pilot project within the CERN NA62 experiment aiming at investigating GPUs usage in the central Level 0 trigger processor (L0TP) [3]. This is a synchronous real-time system implemented in hardware through FPGAs on the readout boards with a time budget for trigger decision of 1 ms.

The GPU-RICH processing stage is positioned between the RICH readout and the L0 Trigger Processor (L0TP) with the task of computing in real-time physics-related primitives (i.e. centers and radii of Čerenkov ring patterns on the photomultipliers arrays), in order to improve the low level trigger discrimination capability. Data from the detector photomultipliers (PMTs) are collected by four readout boards (TEL62) sending primitives to NaNet-10 as UDP datagram streams over two GbE channels (for each board, see fig. 10) connected to a GbE/10GbE switch. Packets are then routed on a 10GbE channel towards one of the NaNet-10 ports.

A processing stage on the onboard FPGA decompresses and coalesces the event fragments scattered among the several UDP streams; zero-copy DMAs towards the GPU memory are then instantiated to transmit the reconstructed events. Events are gathered and arranged in the GPU memory as a Circular List Of Persistent buffers (CLOP), according to a configurable time window which must be shorter than the total processing time to avert the overwriting of the buffers before they are consumed by the pattern recognition CUDA kernel.
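The coalescing step can be pictured in software as grouping fragments first by gathering window and then by event. The sketch below is only a conceptual model: it assumes each fragment carries a timestamp identifying its event and the id of the stream it came from, and every name in it is hypothetical.

```python
from collections import defaultdict

def gather_events(fragments, window_ns):
    """Merge event fragments from several streams.

    `fragments` is an iterable of (timestamp_ns, stream_id, data).
    Returns {window_index: {timestamp_ns: {stream_id: data}}}, where
    fragments sharing a timestamp are treated as parts of one event
    (an assumption of this sketch).
    """
    windows = defaultdict(lambda: defaultdict(dict))
    for timestamp_ns, stream_id, data in fragments:
        window_index = timestamp_ns // window_ns
        windows[window_index][timestamp_ns][stream_id] = data
    # Convert nested defaultdicts to plain dicts for the caller.
    return {w: {t: dict(parts) for t, parts in events.items()}
            for w, events in windows.items()}

fragments = [
    (100, 0, b"a"),        # event at t=100 ns, fragment from stream 0
    (100, 1, b"b"),        # same event, fragment from stream 1
    (300_000, 0, b"c"),    # later event, falls in the next window
]
gathered = gather_events(fragments, window_ns=250_000)
print(gathered)
```

Each window index would then map onto one buffer of the circular list, so the number of buffers and the window length together set how long the GPU kernel has before a buffer is reused.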

A system like this, obtained by retrofitting the RICH detector with a processing system capable of exploiting different kinds of computing units, can be regarded as a heterogeneous “smart detector”.

Results from NA62 RUN 2017

The NaNet-based system is installed close to the RICH readout rack in the NA62 experimental hall. After preliminary tests performed with NaNet-1 and a Kepler-class GPU (NVIDIA K20c) in 2013-2015, it was updated to NaNet-10 in 2016 and then further improved with a Pascal-class GPU (NVIDIA P100) during the NA62 2017 Run.

The test-bed is made up of an HP2920 switch and a NaNet-10 PCIe board plugged into a server based on an X9DRG-QF dual-socket motherboard populated with Intel Xeon E5-2620 @2.00 GHz CPUs (i.e. Ivy Bridge architecture), 32 GB of DDR3 RAM, and an NVIDIA Pascal P100 GPU.

Latencies of the stages corresponding to GPU processing (event indexing and ring reconstruction) and sending UDP packets with the results to the L0TP are shown in fig. 11. These data refer to a single burst at beam intensity ∼19 × 10^11 pps and with a gathering time for NaNet-10 of 250 μs (dashed line on plot).

The overall time is always well below the time budget limit, which makes this an ideal working point. Since the quality of the ring reconstruction is similar to that of the offline reconstruction and the overall time stays below the maximum latency allowed by the system, we conclude that the implemented GPU trigger system can be used in future runs to design more selective and efficient trigger conditions.


NaNet3

NaNet3 testbed: board is connected to off-shore Read-Out system via optical cable.

For completeness, an overview of the NaNet board family must mention the ongoing development of the NaNet3 board for the KM3 HEP experiment[4]. In KM3 the board is tasked with delivering global clock and synchronization signals to the underwater electronic system and receiving photomultiplier data via optical cables. The design employs Altera Deterministic Latency Transceivers with an 8B10B encoding scheme as Physical Link Coding and the Time Division MultiPlexing (TDMP) data transmission protocol. The current implementation is being developed on the Altera Stratix V development board with a Terasic SFP-HSMC daughtercard plugged on top and sporting 4 transceiver-based SFP ports.


Further enhancements

NaNet-40

To enable 40 Gb Ethernet with UDP/IP protocol hardware offload we developed this new member of the NaNet family, based on a Bittware S5-PCIe-HQ board with Altera Stratix V. It is equipped with 2 QSFP+ ports and a PCIe x8 Gen3 connector.

NaNet Public Documentation


References

  1. A. Biagioni, F. Lo Cicero, A. Lonardo, P.S. Paolucci, M. Perra, D. Rossetti, C. Sidore, F. Simula, L. Tosoratto and P. Vicini - The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture, March 2012. (http://arxiv.org/abs/1203.1536).
  2. 1g eth udp/ip stack (https://opencores.org/project/udp_ip_stack).
  3. G. Lamanna, G. Collazuol and M. Sozzi - GPUs for fast triggering and pattern matching at the CERN experiment NA62, Nuclear Science Symposium Conference Record, 2009.
  4. M. Ageron et al. - Technical Design Report for a Deep-Sea Research Infrastructure in the Mediterranean Sea Incorporating a Very Large Volume Neutrino Telescope, Tech. Rep. ISBN 978-90-6488-033-9.