fhtr: April 2018

2018-04-21

rdma-pipe

Haven't you always wanted to create UNIX pipes that run from one machine to another? Well, you're in luck. Of sorts. For I have spent my Saturday hacking on an InfiniBand RDMA pipeline utility that lets you pipe data between commands running on remote hosts at multi-GB/s speeds.

Unimaginatively named, rdma-pipe comes with the rdpipe utility that coordinates the rdsend and rdrecv programs that do the data piping. The rdpipe program uses SSH as the control channel and starts the send and receive programs on the remote hosts, piping the data through your commands.

For example:

  # The fabulous "uppercase on host1, reverse on host2"-pipeline.
  $ echo hello | rdpipe host1:'tr "[:lower:]" "[:upper:]"' host2:'rev'
  OLLEH

  # Send ZFS snapshots fast-like from fileserver to backup_server.
  $ rdpipe backup@fileserver:'zfs send -I tank/media@last_backup tank/media@head' backup_server:'zfs recv tank-backup/media'

  # Decode video on localhost, stream raw video to remote host.
  $ ffmpeg -i sintel.mpg -pix_fmt yuv420p -f rawvideo - | rdpipe playback_host:'ffplay -f rawvideo -pix_fmt yuv420p -s 720x480 -'

  # And who could forget the famous "pipe page-cached file over the network to /dev/null" benchmark!
  $ rdpipe -v host1:'</scratch/zeroes' host2:'>/dev/null'
  Bandwidth 2.872 GB/s

Anyhow, it's quite raw, new and exciting, and it needs more eyeballs and tire-kicking. Have a look if you're on InfiniBand and need to pipe data across hosts.

2018-04-12

RDMA cat

Today I wrote a small RDMA test program using libibverbs. That library has a pretty steep learning curve.

Anyhow. To use libibverbs and librdmacm on CentOS, install rdma-core-devel and compile your things with -lrdmacm -libverbs.

My test setup is two IBM-branded Mellanox ConnectX-2 QDR InfiniBand adapters connected over a Voltaire 4036 QDR switch. These things are operating at PCIe 2.0 x8 speed, which is around 3.3 GB/s. Netcat and friends get around 1 GB/s transfer rates piping data over the network. Iperf3 manages around 2.9 GB/s. With that in mind, let's see what we can reach.
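For reference, here is a rough sketch of where the ceiling sits. It assumes the standard line rates (PCIe 2.0 at 5 GT/s per lane, QDR InfiniBand at 10 Gb/s per lane over 4 lanes, both with 8b/10b encoding); the function name is made up for illustration. Both links come out at 4 GB/s of payload, so the gap down to the roughly 3.3 GB/s quoted above is presumably packet and protocol overhead.

```python
def payload_gb_per_s(gigatransfers_per_lane, lanes, encoding_efficiency=8 / 10):
    """Usable GB/s (decimal) of a serial link after line encoding overhead."""
    gbit_per_s = gigatransfers_per_lane * lanes * encoding_efficiency
    return gbit_per_s / 8  # bits -> bytes


pcie2_x8 = payload_gb_per_s(5.0, 8)    # PCIe 2.0 x8 (assumed standard rate)
ib_qdr_4x = payload_gb_per_s(10.0, 4)  # QDR InfiniBand 4x (assumed standard rate)

practical_limit = 3.3                       # GB/s, figure quoted in the post
measured = {"netcat": 1.0, "iperf3": 2.9}   # GB/s, figures quoted in the post

print(f"PCIe 2.0 x8 payload rate: {pcie2_x8:.1f} GB/s")
print(f"QDR IB 4x payload rate:   {ib_qdr_4x:.1f} GB/s")
for name, rate in measured.items():
    print(f"{name}: {rate / practical_limit:.0%} of the practical limit")
```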

I was basing my test programs on these amazingly useful examples: https://github.com/linzion/RDMA-example-application https://github.com/jcxue/RDMA-Tutorial http://www.digitalvampire.org/rdma-tutorial-2007/notes.pdf and of course http://www.rdmamojo.com/ . At one point after banging my head on the ibverbs library for a bit too long I was thinking of just using MPI to write the thing and wound up on http://mpitutorial.com - but I didn't have the agility to jump from host-to-host programs to strange new worlds, so kept on using ibverbs for these tests.

First light

The first test program was just reading some data from STDIN, sending it to the server, which reverses it and sends it back. From there I worked towards sending multiple blocks of data (my goal here was to write an RDMA version of cat).

I had some trouble figuring out how to make the two programs have a repeatable back-and-forth dialogue. At first I was waiting for too many events with the blocking ibv_get_cq_event call, which hung the program. Only call it as many times as you're expecting replies.

The other bug was that my send and receive work requests shared the sge struct, and the send part of the dialogue was setting the sge buffer length to 1, since it was only sending acks back to the other server. Set the length back to the right size before posting each work request, problem solved.
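The shared-descriptor mistake is easy to reproduce in miniature. The snippet below is a toy Python model, not libibverbs: `Sge`, `post`, `BUFFER_SIZE` and `ACK_SIZE` are hypothetical stand-ins that only show how a length written for one direction leaks into the other.

```python
from dataclasses import dataclass

BUFFER_SIZE = 4096  # hypothetical full message size
ACK_SIZE = 1        # one-byte ack


@dataclass
class Sge:
    """Toy scatter-gather entry: where the buffer is and how long it is."""
    addr: int
    length: int


def post(kind, sge):
    """Pretend to post a work request; return the length the hardware would see."""
    return (kind, sge.length)


# Bug: one descriptor shared by the send and receive work requests.
shared = Sge(addr=0x1000, length=BUFFER_SIZE)
shared.length = ACK_SIZE        # shrink it to send a one-byte ack
post("send", shared)
buggy = post("recv", shared)    # the receive is now posted with room for 1 byte

# Fix: set the length back before posting each work request.
shared.length = BUFFER_SIZE
fixed = post("recv", shared)

print(buggy, fixed)
```

Giving each direction its own descriptor would avoid the reset entirely, at the cost of one more struct to keep track of.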

Optimization

Once I got the rdma-cat working, performance wasn't great. I was reading in a file from page cache, sending it to the receiver, and writing it to the STDOUT of the receiver. The program was sending 4k messages, replying with 4k acks, and doing a mutex-requiring event ack after each message. This ran at around 100 MB/s. Changing the 4k acks to single-byte acks and doing the event acks for all the events at once got me to 140 MB/s.

How about doing larger messages? Change the message size to 65k and the cat goes at 920 MB/s. That's promising! One-megabyte messages and 1.4 GB/s. With eight meg messages I was up to 1.78 GB/s and stuck there.
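One way to read those numbers is as time spent per message. The sketch below only divides message size by measured throughput; it assumes "4k", "65k", "one megabyte" and "eight meg" mean 4 KiB, 64 KiB, 1 MiB and 8 MiB. The 4k case works out to around 29 microseconds per message, which suggests the fixed per-message cost of the send/ack round trip dominates at small sizes.

```python
# (message size in bytes, measured throughput in bytes/s), figures from the post
measurements = [
    (4 * 1024, 140e6),
    (64 * 1024, 920e6),
    (1024 * 1024, 1.4e9),
    (8 * 1024 * 1024, 1.78e9),
]


def microseconds_per_message(size, throughput):
    """Average wall-clock time spent on one message at the given throughput."""
    return size / throughput * 1e6


for size, throughput in measurements:
    us = microseconds_per_message(size, throughput)
    print(f"{size // 1024:>5} KiB: {us:8.1f} us per message, "
          f"{1e6 / us:8.0f} messages/s")
```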

I did another test program that was just sending an 8 meg buffer to the other machine, which didn't do anything to the data. This is useful to get an optimal baseline and gauge perf for a single process use case. The test program was running at 2.9 GB/s.

Adding a memcpy to the receive loop roughly halved the bandwidth to 1.3 GB/s. Moving to a round-robin setup with one buffer receiving data while another buffer is having the data copied out of it boosted the bandwidth to 3 GB/s.
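The round-robin idea can be sketched without any RDMA at all. In this minimal Python model the main loop stands in for incoming messages and a worker thread stands in for the memcpy; the thread, the queues and the function name are assumptions of the sketch, not how the original C program is structured.

```python
import queue
import threading


def double_buffered_copy(chunks, buffer_size, n_buffers=2):
    """Fill rotating buffers while a second thread copies filled ones out."""
    buffers = [bytearray(buffer_size) for _ in range(n_buffers)]
    free = queue.Queue()    # indices of buffers ready to receive into
    filled = queue.Queue()  # (index, length) of buffers ready to copy out
    for i in range(n_buffers):
        free.put(i)
    out = bytearray()

    def consumer():
        while True:
            item = filled.get()
            if item is None:            # sentinel: no more data
                return
            idx, length = item
            out.extend(buffers[idx][:length])  # the "memcpy" step
            free.put(idx)               # hand the buffer back to the receiver

    worker = threading.Thread(target=consumer)
    worker.start()
    for chunk in chunks:                # stands in for incoming messages
        assert len(chunk) <= buffer_size
        idx = free.get()                # blocks only if every buffer is busy
        buffers[idx][:len(chunk)] = chunk
        filled.put((idx, len(chunk)))
    filled.put(None)
    worker.join()
    return bytes(out)


data = bytes(range(256)) * 100
chunks = [data[i:i + 1000] for i in range(0, len(data), 1000)]
print(double_buffered_copy(chunks, 1000) == data)
```

With two buffers the receiver only stalls when the copy is slower than the incoming data; raising `n_buffers` adds slack for bursts.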

The send loop could read in data at 5.8 GB/s from the page cache, but the RDMA pipe was only doing 1.8 GB/s. Moving the read to happen right after each send got them both running in parallel, which got the full pipeline (rdma_send < inputfile on one host, rdma_recv | wc -c on the other) running at 2.8 GB/s.

There was an issue with the send buffer contents getting mangled by an incoming receive. Gee, it's almost like I shouldn't use the same buffer for sending and receiving messages. Using a different buffer for the received messages resolved the issue.

It works!

I sent a 4 gig file and ran diff on it, no problems. Ditto for files smaller than the buffer size and small strings sent with echo.

RDMA cat! 2.9 GB/s over the network.

Let's try sending video frames next. Based on these CUDA bandwidth timings, I should be able to do 12 GB/s up and down. Now I just need to get my workstation on the IB network (read: buy a new workstation with more than one PCIe slot.)

[Update] For the heck of it, I tried piping through two hosts.

[A]$ rdma_send B < inputfile
[B]$ rdma_recv | rdma_send C
[C]$ rdma_recv | wc -c

2.5 GB/s. Not bad, could do networked stream processing. Wonder if it would help if I zero-copy passed the memory regions along the pipe.

And IB is capable of multicast as well...

2018-04-06

4k over IB

So, technically, I could stream uncompressed 4k@60Hz video over the InfiniBand network. 4k60 needs about 2 GB/s of bandwidth, the network goes at 3 GB/s.
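The 2 GB/s figure checks out if you assume 3840x2160 frames at 4 bytes per pixel (8-bit RGBA); both are assumptions, since the resolution and pixel format aren't spelled out here. A quick sketch of the arithmetic:

```python
def video_bandwidth_gb_per_s(width, height, bytes_per_pixel, fps):
    """Raw bandwidth of an uncompressed video stream in decimal GB/s."""
    return width * height * bytes_per_pixel * fps / 1e9


rgba = video_bandwidth_gb_per_s(3840, 2160, 4, 60)  # assumed 8-bit RGBA
rgb = video_bandwidth_gb_per_s(3840, 2160, 3, 60)   # assumed 8-bit RGB

print(f"4k60 RGBA: {rgba:.2f} GB/s")  # 1.99 GB/s
print(f"4k60 RGB:  {rgb:.2f} GB/s")   # 1.49 GB/s
```

Dropping the alpha channel would leave more headroom under the 3 GB/s the network can do.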

This... how would I try this?

I'd need a source of 4k frames. Draw on the GPU to a framebuffer, then glReadPixels (or CUDA GPUDirect RDMA). Then use IB verbs to send the framebuffer to another machine. Upload it to the GPU to display with glTexImage (or GPUDirect from the IB card).

And make sure that everything in the data path runs at above 2 GB/s.

Use cases? Extreme VNC? Combining images from a remote GPU and local GPU? With a 100Gb network, you could pull in 6 frames at a time and composite in real time I guess. Bringing in raw 4k camera streams to a single box over a switched COTS fabric.

Actually, this would be "useful" for me: I could stream interactive video from a big beefy workstation to a small installation box. The installation box could handle stereo camera processing and other input, then send the control signals to the rendering station. (But at that point, why not just get longer HDMI and USB cables?)

2018-04-02

Quick timings

NVMe and NFS, cold cache on client and server. 4.3 GiB in under three seconds.

$ cat /nfs/nvme/Various/UV.zip | pv > /dev/null
 4.3GiB 0:00:02 [1.55GiB/s]

The three-disk HDD pool gets around 300 MB/s, but once the ARC picks up the data it goes at NFS + network speed. Cold cache on the client.

$ echo 3 > /proc/sys/vm/drop_caches
$ cat /nfs/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:10 [ 1.5GiB/s]

Samba is heavier somehow.

$ cat /smb/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:13 [1.26GiB/s]

NFS over RDMA from the ARC, direct to /dev/null (which, well, it's not a very useful benchmark). But 2.8 GB/s!

$ time cat /nfs/hdd/Videos/*.mp4 > /dev/null

real    0m6.269s
user    0m0.007s
sys     0m4.056s
$ cat /nfs/hdd/Videos/*.mp4 | wc -c
17722791869
$ python -c 'print(17.7 / 6.269)'
2.82341681289

$ time cat /nfs/hdd/Videos/*.mp4 > /nfs/nvme/bigfile

real    0m15.538s
user    0m0.016s
sys     0m9.731s

# Streaming read + write at 1.14 GB/s

How about some useful work? Parallel grep at 3 GB/s. Ok, we're at the network limit, call it a day.

$ echo 3 > /proc/sys/vm/drop_caches
$ time (for f in /nfs/hdd/Videos/*.mp4; do grep -o --binary-files=text XXXX "$f" & done; for job in `jobs -p`; do wait $job; done)
XXXX
XXXX
XXXX
XXXX
XXXX

real    0m5.825s
user    0m3.567s
sys     0m5.929s
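
The throughput figures for these runs all come from the same division: the byte count reported by wc -c over the wall-clock time of each run. A small sketch that redoes it with the exact byte count (the labels and the function name are mine):

```python
TOTAL_BYTES = 17722791869  # from the `wc -c` output above

# Wall-clock ("real") seconds from the `time` outputs above
timings = {
    "NFS read to /dev/null": 6.269,
    "NFS read + NVMe write": 15.538,
    "parallel grep": 5.825,
}


def gb_per_s(nbytes, seconds):
    """Throughput in decimal GB/s."""
    return nbytes / seconds / 1e9


for label, seconds in timings.items():
    print(f"{label}: {gb_per_s(TOTAL_BYTES, seconds):.2f} GB/s")
```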
