art with code

2018-03-26

InfiniBanding, pt. 4. Almost there

Got my PCIe-M.2 adapters and plugged 'em in. One of them runs at PCIe 1.0 lane speeds instead of PCIe 3.0, capping perf at 850 MB/s, and causes a Critical Interrupt #0x18 | Bus Fatal Error that resets the machine. Then the thing overheats, melts its connection solder, shorts, releases the magic smoke and makes the SSD PCB look like wet plastic. Yeah, it's dead. The SSD's dead too.

[Edit: OR.. IS IT? Wiped the melted goop off the SSD and tried it in the other adapter and it seems to be fine. Just has high controller temps, apparently a thing with the 960s. Above 90 C after 80 GB of writes. It did handle 800 GB of /dev/zero written to it fine and read everything back in order as well. Soooo, maybe my two-NVMe pool lives to ride once more? Shock it, burn it, melt it, it just keeps on truckin'. Disk label: "The Terminator"]

I only noticed this because the server rebooted and one of the mirrors in the ZFS pool was offline. Hooray for redundancy?

Anyway, the fast NVMe work pool is online, and it can kinda saturate the connection. It's got one Samsung 960 EVO in it, which is fast for reads, if maybe not the best for synced writes.

I also got a 280 gig Optane 900p. It feels like Flash Done Right. Low queue depths, low parallelism, whatever, it just burns through everything. Optane also survives many more writes than flash, it's really quite something. And it hasn't caught on fire yet. I set up two small partitions (10 GB) as ZFS slog devices for the two pools and now the pools can handle five-second write bursts at 2 GB/s.
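
For reference, adding a partition as a separate log device is a one-liner per pool. A rough sketch, with made-up pool names and partition paths:

# Add an Optane partition as a SLOG for each pool; names are placeholders.
zpool add work log /dev/nvme0n1p1
zpool add archive log /dev/nvme0n1p2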

Played with BeeGFS over the weekend. Easy-ish to get going, sort of resilient to nodes going down if you tune it that way, good performance with RDMA (netbench mode dd ran at 2-3 GB/s). The main thing lacking seems to be the "snapshots, backups and rebuild from nodes gone bye" story.

Samba and NFS get to 1.4 to 1.8 GB/s per client, around 2.5 GB/s aggregate, high CPU usage on the server somehow, even with NFSoRDMA. I'll see if I can hit 2 GB/s on a single client. Not really native-level random access perf though. The NVMe drive can do two 1 GB/s clients. Fast filestore experiment mostly successful, if not completely.
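
In case it helps, an NFSoRDMA mount looks roughly like this; a sketch where the server name, export and mount point are made up, and the RDMA modules are assumed to be loaded on both ends:

# Mount an NFS export over RDMA instead of TCP.
mount -t nfs -o rdma,port=20049 server:/tank/work /mnt/work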

Next up, wait for a bit of a lull in work, and switch my workstation to a machine that's got more than one PCIe slot. And actually hook it up to the fast network to take advantage of all this bandwidth. Then plonk the backup disks into one box, take it home, off-site backups yaaay.

Somehow I've got nine cables, four network cards, three computers, and a 36-port switch. Worry about that in Q3, time to wrap up with this build.

2018-03-20

InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw 3.3 GB/s between two native CentOS boxes. The server has two cards so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)
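
(ib_send_bw comes from the perftest package; the measurement is roughly this, with the device name and server address as placeholders:

# On the server:
ib_send_bw -d mlx4_0 --report_gbits
# On the client, pointing at the server:
ib_send_bw -d mlx4_0 --report_gbits 10.0.0.1
)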

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the firmware to the latest version on the ConnectX-2 cards makes ESXi detect them, which somehow breaks the PCI pass-through. With ESXi drivers, they work as ~20 GbE network cards that can be used by all of the VMs. But trying to use the pass-through from a VM fails with an IRQ error, and sometimes the entire machine locks up. So, I dropped ESXi from the machines for now.

ZFS

Been playing with ZFS with Sanoid for automated hourly snapshots and Syncoid for backup sync. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to snapshot, backup to local replica, backup to remote server, recovery from backup, per-file recovery from .zfs/snapshot, hot spares. Backup syncs seem to even work between pools of different sizes, I guess as long as the data doesn't exceed pool size. Hacked Sanoid to make it take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).
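
The check itself is simple enough. A sketch of the idea (not the actual Sanoid patch), with a made-up dataset name:

# Take a snapshot only if the dataset has been written to since the last one.
ds=tank/projects
written=$(zfs get -H -p -o value written "$ds")
if [ "$written" -gt 0 ]; then
    zfs snapshot "$ds@autosnap_$(date +%Y-%m-%d_%H:%M)"
fi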

Copied over data from the workstations, then corrupted the live mounted pool with dd if=/dev/zero over a couple disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even restored the snapshots, A+++.
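
The restore direction is just Syncoid pointed the other way; a sketch with placeholder host and pool names:

# Pull the backed-up datasets (snapshots included) back from the backup box.
syncoid -r root@backupserver:backup/tank tank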

After the successful restore, my Windows laptop bluescreened on me and corrupted the effect file I was working on. Git got me back to a 30-min-old version, which wasn't great. So, hourly snapshots aren't good enough. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour. Resilvering a full disk should be somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do get corrupted, buy new disks and restore from the backup server :(

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs: an over-the-weekend scrub for the archive pool, and a nightly scrub for the fast SSD work pool. That is, the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe adapters are still on back order.
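
Roughly what the crontab looks like; pool names, hosts and times here are placeholders:

# Hourly backup sync, nightly scrub on the (future) SSD work pool,
# weekend scrub on the archive pool.
0 * * * *   syncoid -r tank root@backupserver:backup/tank
0 2 * * *   zpool scrub work
0 2 * * 6   zpool scrub archive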

Plans?

I'll likely go with two pools: a work pool with two SSDs in RAID-1, and an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.
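
Creating those would look something like this; device names are placeholders and I'm assuming whole-disk vdevs:

# SSD mirror for current work, three-disk RAIDZ1 for the archive.
zpool create work mirror /dev/nvme0n1 /dev/nvme1n1
zpool create archive raidz1 /dev/sda /dev/sdb /dev/sdc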

The reason for the two-volume setup is to get predictable performance out of the SSD volume, without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive, so having 10 TB of usable space would go a long way. Doing parity array rebuilds on 6TB drives, ugh. Maybe a three-disk mirror instead.

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.

2018-03-12

Unified Interconnect

Playing with InfiniBand got me thinking. This thing is basically a PCIe-to-PCIe bridge. The old kit runs at x4 PCIe 3.0 speeds, the new stuff is x16 PCIe. The next generation is x16 PCIe 4.0 and 5.0.

Why jump through all the hoops? Thunderbolt is x4 PCIe over a USB connector. What if you bundle four of those and throw in some fiber and transceivers for long distance? You get x16 PCIe between two devices.

And once you start thinking of computers as a series of components hanging off a PCIe bus, your system architecture clarifies dramatically. A CPU is a PCIe 3 device with 40 to 64 lanes. DRAM uses around 16 lanes per channel.

GPUs are now hooked up as 16-lane devices, but could saturate 256 to 1024 lanes. Because of that, GPU RAM is on the GPU board. If the GPU had enough lanes, you could hook GPU RAM up to the PCIe bus with 32 lanes per GDDR5 chip. HBM is probably too close to the GPU to separate.

You could build a computer with 1024 lanes, then start filling them up with the mix of GPUs, CPUs, DRAM channels, NVMe and connectivity that you require. Three CPUs with seven DRAM channels? Sure. Need an extra CPU or two? Just plug them in. How about a GPU-only box with the CPU OS on another node? Or CPUs connected as accelerators via 8 lanes per socket, if you'd like to use the extra lanes for other stuff. Or a mix of x86 and ARM CPU cards to let you run mixed workloads at max speed and power efficiency.

Think of a rack of servers, sharing a single PCIe bus. It'd be like one big computer with everything hotpluggable. Or a data center, running a single massive OS instance with 4 million cores and 16 PB of RAM.

Appendix, more devices

Then you've got the rest of the devices, and they're pretty well on the bus already. NVMe comes in 4-lane blocks. Thunderbolt 3, Thunderbolt 2 and USB 3.1 are 4-, 2- and 1-lane devices. SAS and SATA are a bit awkward, taking up a bit more than one lane and a bit more than half a lane respectively. I'd replace them with NVMe connectors.

Display connectors could live on the bus as well (given some QoS to keep up the interactivity). HDMI 2.1 uses 6 lanes, HDMI 2 is a bit more than 2 lanes. DisplayPort's next generation might go up to 7-8 lanes.

Existing kit

[Edit] Hey, someone's got products like this already. Dolphin ICS produces a range of PCIe network devices. They've even got an IP-over-PCIe driver.

[Edit #2] Hey, the Gen-Z Interconnect looks a bit like this: "Gen-Z Interconnect Core Specification 1.0 Published".

2018-03-11

InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- a very annoying lane spec outside of server gear: either it eats up a PCIe 3.0 x16 slot, or you end up running at half speed, and it's too slow to hit the full 32 Gbps of QDR InfiniBand. Oh yes, I plugged one into a slot with only four lanes wired, and it got half the RDMA bandwidth in tests.)

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.

IPoIB iperf2 does 2.9 GB/s with four threads. The single-threaded iperf3 goes 2.5 GB/s if I luck out on the CPU affinity lottery (some cores / NUMA nodes do 1.8 GB/s..)
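
Pinning the benchmark to a known-good NUMA node takes the luck out of it; a sketch where the node number and address are placeholders:

# Run iperf3 with CPU and memory pinned to one NUMA node.
numactl --cpunodebind=0 --membind=0 iperf3 -c 10.0.0.1
# Same for the four-thread iperf2 run.
numactl --cpunodebind=0 --membind=0 iperf -c 10.0.0.1 -P 4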

NFS over RDMA, 64k random reads with QD8 and one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.
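
A reconstruction of that kind of fio run (64k random reads, QD8, one job); the file path and size are placeholders:

fio --name=randread --filename=/mnt/nfs/testfile --size=4G \
    --rw=randread --bs=64k --iodepth=8 --numjobs=1 \
    --ioengine=libaio --direct=1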

The bandwidth limit of PCIe 2.0 x8 that these InfiniBand QDR cards use is around 25.6 Gbps, or 3.2 GB/s. Testing with ib_read_bw, it maxes out at around 3 GB/s.

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only 128 byte MaxPayload), but can't complain.

And... There's an upgrade path composed of generations of obsoleted enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 5.6 GB/s per link. Then, pick up EDR InfiniBand / 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate a 16x slot into two 8x slots for IB or four 4x slots for NVMe. Or bifurcate both slots. Or get a dual-FDR/EDR card with a 16x connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)


2018-03-07

Quick test with old Infiniband kit

Two IBM ConnectX-2 cards, hooked up to a Voltaire 4036 switch that sounds like a turbocharged hair dryer. CentOS 7, one host on bare metal, the other on top of ESXi 6.5.

The best I've seen so far: a 3009 MB/s RDMA transfer. Around 2.4 GB/s with iperf3. These things seem to be CPU capped; top shows 100% CPU use. Made an iSER ramdisk too, it was doing 1.5 GB/s-ish with ext4.

Will examine more next week, with a later kernel and firmware and whatnot.

The end goal here would be to get 2+ GB/s file transfers over Samba or NFS. Probably not going to happen, but eh, worth a try.

That switch though. Need a soundproof cabinet.

2018-03-04

OpenVPN settings for 1 Gbps tunnel

Here are the relevant parts of the OpenVPN 2.4 server config that got me 900+ Mbps iperf3 on GbE LAN. The tunnel was between two PCs with high single-core performance, a Xeon 2450v2 and an i7-3770. OpenVPN uses 50% of a CPU core on the client & server when the tunnel is busy. For reference, I tried running the OpenVPN server on my WiFi router, it peaked out at 60 Mbps.

# Use TCP, I couldn't get good perf out of UDP. 

proto tcp

# tun or tap, roughly same perf
dev tun 

# Use AES-256-GCM:
#  - more secure than 128 bit
#  - GCM has built-in authentication, see https://en.wikipedia.org/wiki/Galois/Counter_Mode
#  - AES-NI accelerated, the raw crypto runs at GB/s speeds per core.

cipher AES-256-GCM

# Don't split the jumbo packets traversing the tunnel.
# This is useful when tun-mtu is different from 1500.
# With the default value my tunnel runs at 630 Mbps; with mssfix 0 it goes to 930 Mbps.

mssfix 0

# Use jumbo frames over the tunnel.
# This reduces the number of packets sent, which reduces CPU load.
# On the other hand, now you need 6-7 MTU 1500 packets to send one tunnel packet. 
# If one of those gets lost, it delays the entire jumbo packet.
# Digression:
#   Testing between two VBox VMs on a i7-7700HQ laptop, MTU 9000 pegs the vCPUs to 100% and the tunnel runs at 1 Gbps.
#   A non-tunneled iperf3 runs at 3 Gbps between the VMs.
#   Upping this to 65k got me 2 Gbps on the tunnel and half the CPU use.

tun-mtu 9000

# Send packets right away instead of bundling them into bigger packets.
# Improves latency over the tunnel.

tcp-nodelay

# Increase the transmission queue length.
# Keeps the TUN busy to get higher throughput.
# Without QoS, you may get worse latency though.

txqueuelen 15000

# Increase the TCP queue size in OpenVPN.
# When OpenVPN overflows the TCP queue, it drops the overflow packets.
# Which kills your bandwidth unless you're using a fancy TCP congestion algo.
# Increase the queue limit to reduce packet loss and TCP throttling.

tcp-queue-limit 256

And here is the client config, pretty much the same except that we only need to set tcp-nodelay on the server:

proto tcp
cipher AES-256-GCM
mssfix 0
tun-mtu 9000
txqueuelen 15000
tcp-queue-limit 256

To test, run iperf3 -s on the server and connect to it over the tunnel from the client: iperf3 -c 10.8.0.1. For more interesting tests, run the iperf server on a different host on the endpoint LAN, or try to access network shares.

I'm still tuning this (and learning about the networking stack) to get a Good Enough connection between the two sites, let me know if you got any tips or corrections.

P.S. Here's the iperf3 output.

$ iperf3 -c 10.8.0.1
Connecting to host 10.8.0.1, port 5201
[  4] local 10.8.2.10 port 39590 connected to 10.8.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec    0   3.01 MBytes
[  4]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   6.00-7.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver

iperf Done.

2018-03-02

Fast-ish OpenVPN tunnel

500 Mbps OpenVPN throughput over the Internet, nice. Was aiming for 900 Mbps, which seems to work on LAN, but no cigar. [Edit: --tcp-nodelay --tcp-queue-limit 256 got me to 680 Mbps, which is very close to non-tunneled line speed as measured by an HTTP download.]

OpenVPN performance is very random too. I seem to be getting different results just because I restarted the client or the server.

The config is two wired machines, each with a 1 Gbps fibre Internet connection. The server is a Xeon E3-1231v3, a 3.4 GHz Haswell Xeon. The client is my laptop with a USB3 GbE adapter and i7-7700HQ. Both machines get 900+ Mbps on Speedtest, so the Internet connections are fine.

My OpenVPN is set to TCP protocol (faster and more reliable than UDP in my testing), and uses AES-256-GCM as the cipher. Both machines are capable of pushing multiple gigaBYTES per second over openssl AES-256, so crypto isn't a bottleneck AFAICT. The tun-mtu is set to 9000, which performs roughly as well as 48000 or 60000, but has smaller packets which seems to be less flaky than big mtus. The mssfix setting is set to 0 and fragment to 0 as well, though fragment shouldn't matter over TCP.
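
The crypto claim is easy to check with openssl's built-in benchmark, which reports per-core AES-GCM throughput (single-threaded unless you pass -multi):

# AES-256-GCM throughput with AES-NI; should report multiple GB/s per core.
openssl speed -evp aes-256-gcm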

Over iperf3 I get 500 Mbps between the endpoints. With HTTP, roughly that too. Copying from a remote SMB share on another host goes at 30 MB per second, but the remote endpoint can transfer from the file server at 110 MB per second (protip: mount.cifs -o vers=3.0). Thinking about it a bit, I need to test with a second NIC in the VPN box, since right now VPN traffic might be competing with LAN traffic.
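
For reference, the protip spelled out as a mount command; server, share, mount point and user are placeholders:

# Force SMB3 when mounting the share.
sudo mount -t cifs //fileserver/projects /mnt/projects -o vers=3.0,username=me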
