
coro_rpc, from yalantinglibs, is a high-performance RPC library based on C++20 coroutines. It provides a concise and easy-to-use interface, allowing users to implement RPC communication with just a few lines of code. In addition to TCP, coro_rpc now also supports RDMA communication (ibverbs). Let's use a simple example to get a feel for coro_rpc with RDMA communication.

Example

Start the RPC server:

cpp
#include <thread>
#include <ylt/coro_rpc/coro_rpc_server.hpp>
using namespace coro_rpc;

// Take and return std::string_view so the returned view stays valid.
std::string_view echo(std::string_view str) { return str; }
coro_rpc_server server(/*thread_number*/ std::thread::hardware_concurrency(), /*port*/ 9000);
server.register_handler<echo>();
server.init_ibv(); // Initialize RDMA resources
server.start();

The client sends an RPC request:

cpp
#include <cassert>
#include <ylt/coro_rpc/coro_rpc_client.hpp>
using namespace coro_rpc;
using namespace async_simple::coro; // provides Lazy and syncAwait

Lazy<void> async_request() {
  coro_rpc_client client{};
  client.init_ibv(); // Initialize RDMA resources
  co_await client.connect("127.0.0.1:9000");
  auto result = co_await client.call<echo>("hello rdma");
  assert(result.value() == "hello rdma");
}

int main() {
  syncAwait(async_request());
}

With just a few lines of code, you can set up an RPC server and client based on RDMA communication. If users need to configure more RDMA-related parameters, they can pass a config object when calling init_ibv(). Various ibverbs-related parameters can be set in this object. For details, see the documentation.

How do you enable TCP communication? Simply don't call init_ibv(). TCP is the default communication protocol; RDMA communication is enabled only after init_ibv() is called.

Benchmark

We conducted some performance tests on coro_rpc between two hosts in a 180Gb RDMA (RoCE v2) environment. In high-concurrency, small-packet scenarios, the QPS can reach 1.5 million. When sending slightly larger packets (256KB and above), the bandwidth can be easily saturated with fewer than 10 concurrent connections.

| Request Data Size | Concurrency | Throughput (Gb/s) | P90 (us) | P99 (us) | QPS |
|---|---|---|---|---|---|
| 128B | 1 | 0.04 | 24 | 26 | 43,394 |
| | 4 | 0.15 | 29 | 44 | 149,130 |
| | 16 | 0.40 | 48 | 61 | 393,404 |
| | 64 | 0.81 | 100 | 134 | 841,342 |
| | 256 | 1.47 | 210 | 256 | 1,533,744 |
| 4K | 1 | 1.21 | 35 | 39 | 37,017 |
| | 4 | 4.50 | 37 | 48 | 137,317 |
| | 16 | 11.64 | 62 | 74 | 355,264 |
| | 64 | 24.47 | 112 | 152 | 745,242 |
| | 256 | 42.36 | 244 | 312 | 1,318,979 |
| 32K | 1 | 8.41 | 39 | 41 | 32,084 |
| | 4 | 29.91 | 42 | 55 | 114,081 |
| | 16 | 83.73 | 58 | 93 | 319,392 |
| | 64 | 148.66 | 146 | 186 | 565,878 |
| | 256 | 182.74 | 568 | 744 | 697,849 |
| 256K | 1 | 28.59 | 81 | 90 | 13,634 |
| | 4 | 100.07 | 96 | 113 | 47,718 |
| | 16 | 182.58 | 210 | 242 | 87,063 |
| | 64 | 181.70 | 776 | 864 | 87,030 |
| | 256 | 180.98 | 3072 | 3392 | 88,359 |
| 1M | 1 | 55.08 | 158 | 172 | 6,566 |
| | 4 | 161.90 | 236 | 254 | 19,299 |
| | 16 | 183.41 | 832 | 888 | 21,864 |
| | 64 | 184.29 | 2976 | 3104 | 21,969 |
| | 256 | 184.90 | 11648 | 11776 | 22,041 |
| 8M | 1 | 78.64 | 840 | 1488 | 1,171 |
| | 4 | 180.88 | 1536 | 1840 | 2,695 |
| | 16 | 185.01 | 5888 | 6010 | 2,756 |
| | 64 | 185.01 | 23296 | 23552 | 2,756 |
| | 256 | 183.47 | 93184 | 94208 | 2,733 |

The specific benchmark code can be found here.

RDMA Performance Optimization

RDMA Memory Pool

RDMA requests require pre-registering the memory used for sending and receiving data. In our tests, the overhead of registering RDMA memory is much higher than that of a memory copy. Rather than registering memory for every send or receive operation, a more reasonable approach is to use a memory pool that caches already-registered RDMA memory. Each time a request is initiated, the data is split into multiple chunks for receiving/sending, where the maximum length of each chunk matches the size of the pre-registered memory blocks. A registered block is taken from the pool, and the data is copied between that block and the actual data address.
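
For illustration, here is a minimal sketch of such a pool built directly on ibverbs. The pool type and its members are hypothetical, not coro_rpc's implementation; the point is that blocks are registered once with ibv_reg_mr and then recycled, so the per-request cost is a memory copy rather than a registration.

cpp
#include <infiniband/verbs.h>
#include <cstdlib>
#include <vector>

// Hypothetical pool sketch: registration happens at most once per block.
struct rdma_buffer_pool {
  ibv_pd* pd;                      // protection domain owned by the caller
  size_t block_size;               // maximum chunk size, e.g. 256KB
  std::vector<ibv_mr*> free_list;  // already-registered blocks

  ibv_mr* acquire() {
    if (!free_list.empty()) {      // reuse: no ibv_reg_mr on the hot path
      ibv_mr* mr = free_list.back();
      free_list.pop_back();
      return mr;
    }
    void* buf = std::aligned_alloc(4096, block_size);
    return ibv_reg_mr(pd, buf, block_size, IBV_ACCESS_LOCAL_WRITE);
  }

  void release(ibv_mr* mr) { free_list.push_back(mr); }  // keep registration
};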

RNR and the Receive Buffer Queue

RDMA operates directly on remote memory. If the remote memory is not ready, it triggers an RNR (Receiver Not Ready) error. To handle an RNR error, one must either disconnect or sleep for a period. Clearly, avoiding RNR errors is key to improving RDMA transfer performance and stability.

coro_rpc uses the following strategy to address the RNR issue: For each connection, we prepare a receive buffer queue. This queue contains several memory blocks (e.g., 8 blocks of 256KB by default). Whenever a notification of a completed data transfer is received, a new memory block is immediately added to the buffer queue, and this new block is posted to RDMA's receive queue.
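 
The sketch below shows the standard ibverbs call this strategy relies on: keeping buffers posted via ibv_post_recv so the peer never finds the receive queue empty. The helper is shown for illustration only; the actual queue management lives inside coro_rpc.

cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Post one registered block to the QP's receive queue. Whenever a receive
// completion is consumed, a fresh block is posted immediately so the queue
// never runs dry and the sender never sees RNR.
void post_one_recv(ibv_qp* qp, ibv_mr* mr) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(mr->addr);
  sge.length = static_cast<uint32_t>(mr->length);  // e.g. a 256KB block
  sge.lkey = mr->lkey;

  ibv_recv_wr wr{};
  wr.wr_id = reinterpret_cast<uintptr_t>(mr);  // identify the block later
  wr.sg_list = &sge;
  wr.num_sge = 1;

  ibv_recv_wr* bad = nullptr;
  ibv_post_recv(qp, &wr, &bad);
}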

Send Buffer Queue

In the send path, the most straightforward approach is to first copy data to an RDMA buffer, then post it to the RDMA send queue. After the data is written to the peer, repeat the process for the next block.

This process has two bottlenecks: first, memory copying and network transmission happen serially instead of in parallel; second, the NIC sits idle between finishing one block and the CPU posting the next, so bandwidth is never fully utilized.

To improve sending performance, we introduce the concept of a send buffer. For each I/O operation, instead of waiting for the peer to complete the write, we complete the send operation immediately after posting the memory to the RDMA send queue. This allows the upper-level code to send the next request/data block until the number of outstanding sends reaches the send buffer queue's limit. Only then do we wait for a send request to complete before posting a new memory block to the RDMA send queue.

For large packets, this algorithm allows memory copying and network transmission to occur concurrently. Since multiple blocks are sent simultaneously, the NIC can send another pending block during the time it takes for the application layer to post a new one after the previous one is sent, thus maximizing bandwidth utilization.
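
As a rough illustration of this bounded pipeline, the sketch below uses plain ibverbs and busy-polls the completion queue for brevity; coro_rpc's real implementation is coroutine-based. Up to `limit` sends stay in flight, and a completion is drained only once the limit is reached.

cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <vector>

// `blocks` are pre-registered chunks that already hold the copied payload.
void pipelined_send(ibv_qp* qp, ibv_cq* cq,
                    const std::vector<ibv_mr*>& blocks, size_t limit = 2) {
  size_t outstanding = 0;
  for (ibv_mr* mr : blocks) {
    ibv_sge sge{};
    sge.addr = reinterpret_cast<uintptr_t>(mr->addr);
    sge.length = static_cast<uint32_t>(mr->length);
    sge.lkey = mr->lkey;

    ibv_send_wr wr{};
    ibv_send_wr* bad = nullptr;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;  // request a completion per send

    ibv_post_send(qp, &wr, &bad);       // returns immediately; NIC keeps working
    if (++outstanding >= limit) {       // back-pressure: drain one completion
      ibv_wc wc;
      while (ibv_poll_cq(cq, 1, &wc) == 0) {
      }
      --outstanding;
    }
  }
}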

Small Packet Write Coalescing

RDMA throughput is relatively low when sending small packets. For small requests, a strategy that improves throughput without adding extra latency is to coalesce multiple small packets into a larger one. If the application posts a send request while the send queue is full, the data is not sent immediately but is staged in a buffer. If the application then posts further requests, the new data is merged into that same buffer, so the accumulated data is eventually sent as one larger packet.
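
A simplified sketch of that idea follows; the type and member names are illustrative only, not coro_rpc's API. While the send queue is full, payloads accumulate in a pending buffer and are flushed as one larger send when a slot frees up.

cpp
#include <string>
#include <string_view>

// Illustrative coalescing writer (not coro_rpc's API).
struct coalescing_writer {
  std::string pending;           // bytes staged while the send queue is full
  size_t inflight = 0;
  size_t limit = 2;              // send buffer queue limit

  void write(std::string_view payload) {
    if (inflight >= limit) {     // queue full: merge into the pending buffer
      pending.append(payload);
      return;
    }
    post_send(payload);
    ++inflight;
  }

  void on_send_complete() {      // a slot freed: flush everything coalesced
    --inflight;
    if (!pending.empty()) {
      post_send(pending);        // one larger packet instead of many small ones
      ++inflight;
      pending.clear();
    }
  }

  void post_send(std::string_view bytes) {
    // Copy `bytes` into a registered block and post it to the RDMA send
    // queue (omitted here; see the send buffer queue sketch above).
    (void)bytes;
  }
};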

Inline Data

Some RDMA NICs can send small packets using the inline data method, which does not require registering RDMA memory and offers better transfer performance. coro_rpc uses this method for packets smaller than 256 bytes if the NIC supports inline data.
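
In ibverbs terms this corresponds to setting IBV_SEND_INLINE on the work request, as in the sketch below (standard verbs usage, not coro_rpc-specific code); the payload must fit within the QP's max_inline_data and needs no registered memory.

cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Send a small payload inline: the NIC copies the bytes out of the work
// request itself, so `data` can live in ordinary, unregistered memory.
int send_inline(ibv_qp* qp, const void* data, uint32_t len) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(data);
  sge.length = len;                              // e.g. < 256 bytes

  ibv_send_wr wr{};
  ibv_send_wr* bad = nullptr;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_SEND;
  wr.send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED;
  return ibv_post_send(qp, &wr, &bad);           // 0 on success
}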

Memory Usage Control

RDMA communication requires managing memory buffers manually. Currently, coro_rpc uses a default memory block size of 256KB; the receive buffer queue initially holds 8 blocks, and the send buffer queue is limited to 2 blocks. The memory usage per connection is therefore 10 * 256KB = 2.5MB. Users can control memory usage by adjusting the buffer queue sizes and the buffer block size.
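
A quick sanity check of those defaults; the constants below simply restate the numbers from this section and are not coro_rpc configuration fields.

cpp
#include <cstddef>

constexpr std::size_t block_size   = 256 * 1024;  // default buffer block size
constexpr std::size_t recv_buffers = 8;           // initial receive buffer queue size
constexpr std::size_t send_buffers = 2;           // send buffer queue limit
constexpr std::size_t per_connection_bytes =
    (recv_buffers + send_buffers) * block_size;   // 10 * 256KB
static_assert(per_connection_bytes == 2560 * 1024, "2.5MB per connection");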

Additionally, the RDMA memory pool provides a watermark setting to cap overall memory usage. Once the pool's usage exceeds the high watermark, a connection that attempts to obtain a new memory block from the pool will fail and be closed.

Using a Connection Pool

In high-concurrency scenarios, the connection pool provided by coro_rpc can be used to reuse connections, avoiding the overhead of repeated connection creation. Furthermore, since coro_rpc supports connection multiplexing, we can submit multiple small packet requests over a single connection. This enables pipelined sending and leverages the underlying small packet write coalescing to improve throughput.

cpp
auto pool = coro_io::client_pool<coro_rpc::coro_rpc_client>::create(
    conf.url, pool_conf);
auto ret = co_await pool->send_request(
    [&](coro_io::client_reuse_hint, coro_rpc::coro_rpc_client& client) {
        return client.send_request<echo>("hello");
    });
if (ret.has_value()) {
    auto result = co_await std::move(ret.value());
    if (result.has_value()) {
        assert(result.value() == "hello");
    }
}
