
coro_rpc, from yalantinglibs, is a high-performance RPC library based on C++20 coroutines. It provides a concise and easy-to-use interface, allowing users to implement RPC communication with just a few lines of code. In addition to TCP, coro_rpc now also supports RDMA communication (ibverbs). Let's use a simple example to get a feel for coro_rpc with RDMA communication.

Example

Start the RPC server:

```cpp
std::string_view echo(std::string_view str) { return str; }

coro_rpc_server server(/*thread_number*/ std::thread::hardware_concurrency(),
                       /*port*/ 9000);
server.register_handler<echo>();
server.init_ibv();  // initialize RDMA resources
server.start();
```

The client sends an RPC request:

```cpp
Lazy<void> async_request() {
  coro_rpc_client client{};
  client.init_ibv();  // initialize RDMA resources
  co_await client.connect("127.0.0.1:9000");
  auto result = co_await client.call<echo>("hello rdma");
  assert(result.value() == "hello rdma");
}

int main() { syncAwait(async_request()); }
```

With just a few lines of code, you can set up an RPC server and client based on RDMA communication. If users need to configure more RDMA-related parameters, they can pass a config object when calling init_ibv(). Various ibverbs-related parameters can be set in this object. For details, see the documentation.

How do you enable TCP communication? Simply don't call init_ibv(). TCP is the default communication protocol; RDMA communication is enabled only after init_ibv() is called.

Benchmark

We conducted some performance tests on coro_rpc between two hosts in a 180Gb RDMA (RoCE v2) environment. In high-concurrency, small-packet scenarios, the QPS can reach 1.5 million. When sending slightly larger packets (256KB and above), the bandwidth can be easily saturated with fewer than 10 concurrent connections.

| Request Data Size | Concurrency | Throughput (Gb/s) | P90 (us) | P99 (us) | QPS |
| --- | --- | --- | --- | --- | --- |
| 128B | 1 | 0.04 | 24 | 26 | 43,394 |
| 128B | 4 | 0.15 | 29 | 44 | 149,130 |
| 128B | 16 | 0.40 | 48 | 61 | 393,404 |
| 128B | 64 | 0.81 | 100 | 134 | 841,342 |
| 128B | 256 | 1.47 | 210 | 256 | 1,533,744 |
| 4K | 1 | 1.21 | 35 | 39 | 37,017 |
| 4K | 4 | 4.50 | 37 | 48 | 137,317 |
| 4K | 16 | 11.64 | 62 | 74 | 355,264 |
| 4K | 64 | 24.47 | 112 | 152 | 745,242 |
| 4K | 256 | 42.36 | 244 | 312 | 1,318,979 |
| 32K | 1 | 8.41 | 39 | 41 | 32,084 |
| 32K | 4 | 29.91 | 42 | 55 | 114,081 |
| 32K | 16 | 83.73 | 58 | 93 | 319,392 |
| 32K | 64 | 148.66 | 146 | 186 | 565,878 |
| 32K | 256 | 182.74 | 568 | 744 | 697,849 |
| 256K | 1 | 28.59 | 81 | 90 | 13,634 |
| 256K | 4 | 100.07 | 96 | 113 | 47,718 |
| 256K | 16 | 182.58 | 210 | 242 | 87,063 |
| 256K | 64 | 181.70 | 776 | 864 | 87,030 |
| 256K | 256 | 180.98 | 3072 | 3392 | 88,359 |
| 1M | 1 | 55.08 | 158 | 172 | 6,566 |
| 1M | 4 | 161.90 | 236 | 254 | 19,299 |
| 1M | 16 | 183.41 | 832 | 888 | 21,864 |
| 1M | 64 | 184.29 | 2976 | 3104 | 21,969 |
| 1M | 256 | 184.90 | 11648 | 11776 | 22,041 |
| 8M | 1 | 78.64 | 840 | 1488 | 1,171 |
| 8M | 4 | 180.88 | 1536 | 1840 | 2,695 |
| 8M | 16 | 185.01 | 5888 | 6010 | 2,756 |
| 8M | 64 | 185.01 | 23296 | 23552 | 2,756 |
| 8M | 256 | 183.47 | 93184 | 94208 | 2,733 |

The specific benchmark code can be found here.

GPU-direct RDMA Support

GPU-direct RDMA allows direct memory access between GPU memory and remote nodes via RDMA, eliminating the dependency on CPU during data transfers. This feature significantly reduces latency and improves throughput for GPU-related applications.

Initialization

To enable GPU-direct RDMA support, you need to:

  1. Initialize the CUDA Environment: First get the available CUDA devices and initialize the GPU environment:

     ```cpp
     auto cuda_dev_list = coro_io::cuda_device_t::get_cuda_devices();
     ```

  2. Create an IB Device with GPU Memory Support: Create an InfiniBand device that supports GPU memory buffers:

     ```cpp
     auto dev = coro_io::ib_device_t::create(
         {.buffer_pool_config = {.gpu_id = 0 /* GPU ID */}});
     ```

  3. Initialize the Server and Client with the GPU-capable IB Device:

     • Server: server.init_ibv({.device = dev})
     • Client: client.init_ibv({.device = dev})

RPC Client

  • Set Request Attachment: Use set_req_attachment2 to send GPU data:

    ```cpp
    coro_io::data_view gpu_attachment; // = ...;
    client.set_req_attachment2(gpu_attachment);
    ```

  • Access Response Attachment: Use get_resp_attachment2 to retrieve GPU data sent by the server in the response:

    ```cpp
    coro_io::data_view resp_attachment = client.get_resp_attachment2();
    ```

  • Optional: Set Response Attachment Buffer: Use set_resp_attachment_buf2 to pre-allocate the buffer that will receive the response attachment. If the buffer is too small, an internal buffer is automatically allocated instead.

    ```cpp
    coro_io::data_view gpu_attachment_buf; // = ...;
    client.set_resp_attachment_buf2(gpu_attachment_buf);
    ```

    data_view is a view type that, in addition to the traditional data() and size() interfaces, provides a gpu_id() interface indicating the ID of the GPU on which the memory resides. An ID of -1 means the data is located in system memory.

RPC Server

On the RPC server side, you can access request attachments and set response attachments through the context of the RPC function:

```cpp
// in the handler function
void rpc_function() {
  coro_io::data_view attachment =
      coro_rpc::get_context()->get_request_attachment2();
  coro_rpc::get_context()->set_response_attachment2(attachment);
}
```

Performance Advantages

GPU-direct RDMA eliminates CPU-GPU memory copying during network transmission, reducing latency and CPU overhead. Data flows directly from GPU memory to the network interface and vice versa, making it ideal for high-performance computing and AI applications where large amounts of GPU data need to be shared across nodes.

RDMA Performance Optimization

RDMA Memory Pool

RDMA requests require pre-registering memory for sending and receiving data. In our tests, the overhead of registering RDMA memory is much higher than that of memory copying. A more reasonable approach, compared to registering memory for each send or receive operation, is to use a memory pool to cache already-registered RDMA memory. Each time a request is initiated, data is divided into multiple chunks for receiving/sending. The maximum length of each chunk corresponds to the size of the pre-registered memory blocks. A registered block is retrieved from the pool, and a memory copy is performed between this block and the actual data address.
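The pooling idea above can be sketched with a toy model. The names and structure below are illustrative only (not coro_rpc's actual code): registered blocks are reused across operations instead of being re-registered, and a payload larger than one block is copied through the pool chunk by chunk.

```cpp
// Toy sketch of an RDMA memory pool: blocks stand in for ibv_reg_mr'd memory.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

constexpr std::size_t kBlockSize = 256 * 1024;  // default chunk size from the text

struct toy_rdma_pool {
  std::vector<std::vector<char>> free_blocks;  // cached "registered" blocks

  std::vector<char> acquire() {
    if (free_blocks.empty())
      return std::vector<char>(kBlockSize);  // "register" a new block once
    auto blk = std::move(free_blocks.back());
    free_blocks.pop_back();
    return blk;  // reuse: no re-registration cost
  }
  void release(std::vector<char> blk) { free_blocks.push_back(std::move(blk)); }
};

// Send `data` by copying it chunk-by-chunk through pooled blocks.
// Returns the number of chunks (i.e., send operations) needed.
std::size_t send_via_pool(toy_rdma_pool& pool, const std::string& data) {
  std::size_t chunks = 0;
  for (std::size_t off = 0; off < data.size(); off += kBlockSize) {
    auto blk = pool.acquire();
    std::size_t len = std::min(kBlockSize, data.size() - off);
    std::memcpy(blk.data(), data.data() + off, len);  // copy beats re-registering
    // ... the real code would post `blk` to the RDMA send queue here ...
    pool.release(std::move(blk));
    ++chunks;
  }
  return chunks;
}
```

The key trade-off: a memcpy per chunk is accepted in exchange for never paying the (much larger) registration cost on the hot path.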

RNR and the Receive Buffer Queue

RDMA operates directly on remote memory. If the remote memory is not ready, it triggers an RNR (Receiver Not Ready) error. To handle an RNR error, one must either disconnect or sleep for a period. Clearly, avoiding RNR errors is key to improving RDMA transfer performance and stability.

coro_rpc uses the following strategy to address the RNR issue: For each connection, we prepare a receive buffer queue. This queue contains several memory blocks (e.g., 8 blocks of 256KB by default). Whenever a notification of a completed data transfer is received, a new memory block is immediately added to the buffer queue, and this new block is posted to RDMA's receive queue.
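The replenishment strategy can be modeled as follows. This is a simplified simulation, not the real implementation: as long as every completion immediately posts a fresh buffer, the receive queue never drains and the peer never hits RNR.

```cpp
// Toy model of the RNR-avoidance strategy: a receive buffer queue that is
// pre-posted at connect time and replenished on every completion.
#include <cstddef>
#include <deque>
#include <vector>

constexpr std::size_t kRecvQueueDepth = 8;      // default: 8 blocks...
constexpr std::size_t kRecvBlockSize = 256 * 1024;  // ...of 256KB each

struct toy_recv_queue {
  std::deque<std::vector<char>> posted;  // buffers currently in the receive queue
  std::size_t rnr_errors = 0;

  toy_recv_queue() {
    for (std::size_t i = 0; i < kRecvQueueDepth; ++i)
      posted.emplace_back(kRecvBlockSize);  // pre-post the full queue up front
  }

  // Called when the completion queue signals that one buffer was filled.
  std::vector<char> on_recv_complete() {
    if (posted.empty()) {  // on the wire this would be an RNR error
      ++rnr_errors;
      return {};
    }
    auto filled = std::move(posted.front());
    posted.pop_front();
    posted.emplace_back(kRecvBlockSize);  // replenish immediately: never drains
    return filled;
  }
};
```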

Send Buffer Queue

In the send path, the most straightforward approach is to copy data into an RDMA buffer, post it to the RDMA send queue, wait until the data has been written to the peer, and then repeat the process for the next block.

This approach has two bottlenecks: first, memory copying and network transmission are serialized rather than overlapped; second, the NIC is idle between finishing one block and the CPU posting the next, so bandwidth utilization is not maximized.

To improve sending performance, we introduce the concept of a send buffer. For each I/O operation, instead of waiting for the peer to complete the write, we complete the send operation immediately after posting the memory to the RDMA send queue. This allows the upper-level code to send the next request/data block until the number of outstanding sends reaches the send buffer queue's limit. Only then do we wait for a send request to complete before posting a new memory block to the RDMA send queue.

For large packets, this algorithm allows memory copying and network transmission to occur concurrently. Since multiple blocks are sent simultaneously, the NIC can send another pending block during the time it takes for the application layer to post a new one after the previous one is sent, thus maximizing bandwidth utilization.
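The bounded in-flight window can be sketched like this (names and the limit of 2 mirror the text; the structure is illustrative, not coro_rpc's API): a post returns immediately unless the window is full, in which case the sender first waits for one completion.

```cpp
// Toy model of the send buffer queue: at most kSendLimit posts outstanding.
#include <cstddef>
#include <queue>

constexpr std::size_t kSendLimit = 2;  // default send buffer queue limit

struct toy_sender {
  std::queue<int> in_flight;  // blocks posted to the RDMA send queue
  std::size_t waits = 0;      // times the app had to wait for a completion

  void post(int block_id) {
    if (in_flight.size() == kSendLimit) {  // window full: wait for one
      in_flight.pop();                     // (models reaping the completion queue)
      ++waits;
    }
    in_flight.push(block_id);  // post and return immediately, so the app can
  }                            // copy the next block while the NIC transmits
};
```

Because the post returns before the peer acknowledges the write, the memcpy of block N+1 overlaps the transmission of block N, which is exactly the concurrency described above.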

Small Packet Write Coalescing

RDMA throughput is relatively low when sending small packets. For small packet requests, a strategy that improves throughput without introducing extra latency is to coalesce multiple small packets into a larger one. If the application layer posts a send request when the send queue is full, the data is not immediately sent but is temporarily stored in a buffer. If the application then posts another request, we can merge the new data into the same buffer as the previous data, achieving data coalescing for sending.
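A minimal sketch of the coalescing behavior (illustrative types only, not coro_rpc's code): while the send queue is full, new payloads accumulate in a pending buffer, and the next freed slot flushes them as one larger packet.

```cpp
// Toy model of small-packet write coalescing.
#include <cstddef>
#include <string>
#include <vector>

struct toy_coalescer {
  std::size_t free_slots;         // free entries in the RDMA send queue
  std::string pending;            // data waiting because the queue was full
  std::vector<std::string> wire;  // packets actually handed to the NIC

  explicit toy_coalescer(std::size_t slots) : free_slots(slots) {}

  void post(const std::string& payload) {
    if (free_slots == 0) {
      pending += payload;  // queue full: merge with earlier pending data
      return;
    }
    --free_slots;
    wire.push_back(payload);  // queue has room: send as-is, no added latency
  }

  // A send completed: one slot frees up, so flush everything coalesced so far.
  void on_send_complete() {
    ++free_slots;
    if (!pending.empty()) {
      --free_slots;
      wire.push_back(std::move(pending));  // many small posts, one packet
      pending.clear();
    }
  }
};
```

Note that coalescing only kicks in under back-pressure, which is why it improves throughput without adding latency to uncontended sends.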

Inline Data

Some RDMA NICs can send small packets using the inline data method, which does not require registering RDMA memory and offers better transfer performance. coro_rpc uses this method for packets smaller than 256 bytes if the NIC supports inline data.
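The dispatch described above amounts to a simple size check. In this sketch the capability flag is a stand-in for querying the NIC (e.g. its maximum inline data size); the 256-byte threshold comes from the text.

```cpp
// Sketch of choosing between inline data and the registered-buffer path.
#include <cstddef>
#include <string_view>

constexpr std::size_t kInlineThreshold = 256;

enum class send_path { inline_data, registered_buffer };

send_path choose_send_path(std::string_view packet, bool nic_supports_inline) {
  if (nic_supports_inline && packet.size() < kInlineThreshold)
    return send_path::inline_data;  // carried in the work request itself,
                                    // no registered memory needed
  return send_path::registered_buffer;  // copied into pooled registered memory
}
```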

Memory Usage Control

RDMA communication requires managing memory buffers manually. Currently, coro_rpc uses a default memory chunk size of 256KB. The initial size of the receive buffer is 8, and the send buffer limit is 2. Therefore, the memory usage per connection is 10 * 256KB, approximately 2.5MB. Users can control memory usage by adjusting the buffer queue sizes and the buffer block size.
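The per-connection estimate above is just the buffer counts times the chunk size, using the defaults stated in the text:

```cpp
// Worked example of the per-connection memory estimate:
// (receive queue depth + send buffer limit) * chunk size.
#include <cstddef>

constexpr std::size_t kChunkSize = 256 * 1024;  // default buffer block size
constexpr std::size_t kRecvQueueDepth = 8;      // initial receive buffer count
constexpr std::size_t kSendBufLimit = 2;        // send buffer limit

constexpr std::size_t per_connection_bytes() {
  return (kRecvQueueDepth + kSendBufLimit) * kChunkSize;  // 10 * 256KB
}

static_assert(per_connection_bytes() == 2'621'440);  // = 2.5MB per connection
```

Shrinking either the queue depths or the chunk size reduces this footprint linearly, at the cost of more chunking or a higher chance of back-pressure.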

Additionally, the RDMA memory pool provides a watermark configuration to cap total memory usage. When the pool's usage exceeds the high watermark, a connection attempting to get a new memory block from the pool will fail and be closed.

Using a Connection Pool

In high-concurrency scenarios, the connection pool provided by coro_rpc can be used to reuse connections, avoiding the overhead of repeated connection creation. Furthermore, since coro_rpc supports connection multiplexing, we can submit multiple small packet requests over a single connection. This enables pipelined sending and leverages the underlying small packet write coalescing to improve throughput.

```cpp
auto pool = coro_io::client_pool<coro_rpc::coro_rpc_client>::create(
    conf.url, pool_conf);
auto ret = co_await pool->send_request(
    [&](coro_io::client_reuse_hint, coro_rpc::coro_rpc_client& client) {
      return client.send_request<echo>("hello");
    });
if (ret.has_value()) {
  auto result = co_await std::move(ret.value());
  if (result.has_value()) {
    assert(result.value() == "hello");
  }
}
```
