RDMA Network Fabric

Overview

RDMA (Remote Direct Memory Access) allows one computer to directly access memory on another without involving the operating system or CPU, providing high bandwidth and low latency. The network card reads from and writes to memory directly, with no extra copying or buffering. RDMA is an integral part of the Exadata high-performance architecture; it has been tuned and enhanced over the past decade and underpins several Exadata-only technologies such as the Exafusion Direct-to-Wire Protocol.
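To make this concrete, the sketch below uses the standard Linux libibverbs API (C, link with -libverbs) to register a memory buffer for RDMA. Registration is the step that pins the buffer and gives the NIC the local and remote keys it needs to read and write that memory directly, with no CPU copies. This is a generic verbs example under assumed defaults (first device, arbitrary buffer size), not Exadata-specific code.

    /* Minimal sketch: register a buffer so the NIC can DMA to/from it directly. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (!devs || num_devices == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */
        if (!ctx || !pd) {
            fprintf(stderr, "failed to open device or allocate PD\n");
            return 1;
        }

        size_t len = 4096;
        void *buf = malloc(len);

        /* Register the buffer: the NIC can now read/write it without CPU copies. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }

        /* lkey/rkey are the handles local and remote peers use for this memory. */
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }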

RDMA over Converged Ethernet (RoCE)

RDMA over Converged Ethernet (RoCE) is the latest generation of Exadata's RDMA Network Fabric. RoCE (pronounced "rocky") runs the RDMA protocol on top of Ethernet. Prior to Exadata X8M, the internal fabric ran the RDMA protocol on top of InfiniBand. Because the RoCE API infrastructure is identical to InfiniBand's, all existing Exadata performance features are available on RoCE. RoCE combines the scalability and bandwidth of Ethernet with the speed of RDMA. The RoCE protocol is defined and maintained by the InfiniBand Trade Association (IBTA), an open consortium of companies; it is developed as open source, maintained in the upstream Linux kernel, and supported by major network card and switch vendors.

Benefits of using RoCE in Exadata include:

  • Exadata RoCE internal fabric provides an extremely fast and low-latency connection for database and storage servers
  • Exadata RoCE provides all the benefits previously unique to InfiniBand
  • Exadata transparently prioritizes traffic by type, ensuring best performance for critical messages
  • Exadata automatically optimizes network communications by ensuring packets are delivered on the first try without costly retransmissions
  • Exadata eliminates stalls due to failures by immediately detecting server failures without waiting for lengthy timeouts

Exadata RoCE provides RDMA speed and reliability on an Ethernet fabric, including 100 Gb/sec throughput, zero packet loss messaging, prioritization of critical database messages, and the latest KVM-based virtualization.

Network prioritization for latency-sensitive database algorithms ensures that messages requiring low latency are not slowed by high-throughput messages. Using RoCE's Class of Service (CoS), packets can be sent on multiple classes of service (think of them as lanes of a freeway), each with separate network buffers for independence. For example, cluster heartbeats, transaction commits, and cache fusion require low latency, while backups, reporting, and batch processing require high throughput. Exadata uniquely chooses the optimal Class of Service for each database message.
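As an illustration of the underlying mechanism, the sketch below shows how an RDMA connection can be tagged with a class of service through the standard verbs API: on RoCE, the GRH traffic_class byte maps to the IP DSCP/ECN field that switches use to select a CoS queue (on InfiniBand the service level plays the same role). The DSCP values and the helper function here are hypothetical; Exadata selects the class per message type internally, and its exact settings are not shown here.

    /* Sketch: steer a reliable-connected QP into a given class of service.   */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical DSCP values, shifted into the upper 6 bits of the
     * traffic_class byte (the low 2 bits carry ECN). */
    #define TCLASS_LOW_LATENCY     (46 << 2)  /* e.g. cache fusion, commits  */
    #define TCLASS_HIGH_THROUGHPUT (10 << 2)  /* e.g. backups, batch reports */

    /* Move a reliable-connected QP to RTR, tagging it with a traffic class.
     * Usage (illustrative): connect_qp_with_tclass(qp, TCLASS_LOW_LATENCY, ...). */
    int connect_qp_with_tclass(struct ibv_qp *qp, uint8_t tclass,
                               union ibv_gid remote_gid, uint32_t remote_qpn,
                               uint32_t remote_psn, uint8_t port, uint8_t gid_idx)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));

        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_1024;
        attr.dest_qp_num        = remote_qpn;
        attr.rq_psn             = remote_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;

        /* RoCE uses the GRH; traffic_class selects the CoS "lane" on the wire. */
        attr.ah_attr.is_global         = 1;
        attr.ah_attr.port_num          = port;
        attr.ah_attr.grh.dgid          = remote_gid;
        attr.ah_attr.grh.sgid_index    = gid_idx;
        attr.ah_attr.grh.hop_limit     = 1;
        attr.ah_attr.grh.traffic_class = tclass;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }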

Packet loss is usually caused by network congestion, where packets are sent faster than the receiver or switch can process them. Conventional Ethernet silently drops packets and expects retransmission if data is sent too fast, and it is this packet drop that causes large hits to latency and throughput. Exadata avoids packet drops by using two protocols: RoCE Priority-based Flow Control (PFC), which enables the switch to tell the sender to pause when a switch buffer is full, and RoCE Explicit Congestion Notification (ECN), which enables the switch to mark a packet flow as too fast, telling the source to slow down its packet sends.
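The effect of ECN can be illustrated with a toy rate-control loop: when congestion is signaled, the sender backs off sharply; when the path is clear, it probes back up gradually. The constants and feedback pattern below are invented for illustration and this is not the actual congestion-control algorithm used by RoCE NICs or Exadata.

    /* Toy model of ECN-driven rate control: cut the send rate on a congestion
     * mark, recover gradually otherwise. Illustrative only. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINE_RATE_GBPS 100.0

    static double send_rate_gbps = LINE_RATE_GBPS;

    /* Called once per feedback interval with whether congestion was signaled. */
    void on_feedback(bool ecn_marked)
    {
        if (ecn_marked)
            send_rate_gbps *= 0.5;             /* back off quickly  */
        else if (send_rate_gbps < LINE_RATE_GBPS)
            send_rate_gbps += 5.0;             /* probe back up slowly */

        if (send_rate_gbps > LINE_RATE_GBPS)
            send_rate_gbps = LINE_RATE_GBPS;
    }

    int main(void)
    {
        /* Simulate a burst of congestion followed by recovery. */
        bool marks[] = { false, true, true, false, false, false, false };
        for (unsigned i = 0; i < sizeof(marks) / sizeof(marks[0]); i++) {
            on_feedback(marks[i]);
            printf("interval %u: rate = %.1f Gb/s\n", i, send_rate_gbps);
        }
        return 0;
    }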

Exadata RoCE provides an extremely fast, low-latency internal fabric for database and storage servers. It delivers all the benefits previously unique to InfiniBand, transparently prioritizes traffic by type to ensure the best performance for critical messages, and automatically optimizes network communications so that packets are delivered on the first try without costly retransmissions.

RDMA over InfiniBand

RDMA over InfiniBand (IB) was first announced with Exadata V1 in 2008 and provided the base for Exadata's scale-out network fabric. Over the next 10 years, the RDMA capability provided by InfiniBand became central to many Exadata performance features, including the Exafusion Direct-to-Wire Protocol.

Exadata IB provided RDMA speed and reliability, including 40 Gb/sec throughput, zero packet loss messaging, prioritization of critical database messages, and Xen-based virtualization.