Stop Thinking, Just Do!

Sung-Soo Kim's Blog

Remote Direct Memory Access


8 June 2014

Article Source

Accelerating Big Data over RDMA

In this video from the 2013 Open Fabrics Developer Workshop, Sreev Doddabalapur from Mellanox presents: Accelerating Big Data over RDMA.

Remote Direct Memory Access

In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one’s operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.


RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.


This strategy presents several problems related to the fact that the target node is not notified of the completion of the request (1-sided communications). The common way to notify it is to change a memory byte when the data has been delivered, but it requires the target to poll on this byte. Not only does this polling consume CPU cycles, but the memory footprint and the latency increases linearly with the number of possible other nodes, which limits use of RDMA in High-Performance Computing (HPC).

RDMA reduces network protocol overhead, leading to improvements in communication latency. Reductions in protocol overhead can increase a network’s ability to move data quickly, allowing applications to get the data they need faster, in turn leading to more scalable clusters. However, one must be aware of the tradeoff between this reduction in network protocol overhead and additional overhead that may be incurred on each node due to the need for pinning virtual memory pages. In particular, zero-copy RDMA protocols require that the memory pages involved in a transaction be pinned, at least for the duration of the transfer. If this is not done, RDMA pages might be paged out to disk and replaced with other data by the operating system, causing the DMA engine (which knows nothing of the virtual memory system maintained by the operating system) to send the wrong data. The net result of not pinning the pages in a zero-copy RDMA system can be corruption of the contents of memory in the distributed system. Pinning memory takes time and additional memory to set up, reduces the quantity of memory the operating system can allocate to processes, limits the overall flexibility of the memory system to adapt over time, and could even lead to underutilization of memory if processes unnecessarily pin pages. The net result is the introduction of latency, sometimes in linear proportion to the number of pages of data pinned in memory. In order to mitigate these problems, several techniques for interfacing with RDMA devices were developed:

  • using caching techniques to keep data pinned as long as possible, producing overhead reductions for applications that repeatedly communicate in the same memory area
  • pipelining memory pinning operations and data transfer (as done on Infiniband or Myrinet)
  • deferring memory pinning out of the critical path, thus hiding the latency increase
  • entirely removing the need for pinning (as Quadrics does)

In contrast, the Send/Recv model used by other zero-copy HPC interconnects, such as Myrinet or Quadrics, does not have the 1-sided communication problem or the memory paging problem described above, yet provides comparable reductions in latency when used in conjunction with HPC communication frameworks that expose the Send/Recv model to the programmer (such as MPI).


Much like other HPC interconnects, RDMA has achieved limited acceptance as of 2013 due to the need to install a different networking infrastructure. However, new standards enable Ethernet RDMA implementation at the physical layer using TCP/IP as the transport, thus combining the performance and latency advantages of RDMA with a low-cost, standards-based solution. The RDMA Consortium and the DAT Collaborative have played key roles in the development of RDMA protocols and APIs for consideration by standards groups such as the Internet Engineering Task Force and the Interconnect Software Consortium.

Hardware vendors have started working on higher-capacity RDMA-based network adapters, with rates of 40Gbit/s reported. Software vendors such as Red Hat and Oracle Corporation support these APIs in their latest products, and as of 2013 engineers have started developing network adapters that implement RDMA over Ethernet. Both Red Hat Enterprise Linux and Red Hat Enterprise MRG have support for RDMA. Microsoft supports RDMA in Windows Server 2012 via SMB Direct.

Common RDMA implementations include the Virtual Interface Architecture, InfiniBand, and iWARP.


  1. DAT Collaborative website
  2. The Interconnect Software Consortium website

comments powered by Disqus