fhtr: 2018

2018-10-06

Hardware hacking

I got into these Pi devices recently. At first it was a simple "I want an easy way to control sites accessible on my office WiFi to stop wasting time when I should be working", so I set up an old broken laptop to prototype a simple service to do that. Then I replaced the laptop with a small Orange Pi hacker board. And got some wires and switches and breadboard and LEDs and resistors and ... hey, is that a Raspberry Pi 3B+? I'll get that too, maybe I can use it for something else...

Well.

I took apart a cheap RC car. Bought a soldering station to desolder the wires from the motor control board. Then got a motor controller board (an L298N, a big old chip for controlling 9-35V motors with 2A current draw -- not a good match for 3V 2A Tamiya FA-130 motors in the RC car), wired the motors to it, and the control inputs to the Raspberry Pi GPIOs.

Add some WebSockets and an Intel RealSense camera I had in a desk drawer and hey FPV web controlled car that sees in the dark with a funky structured light IR pattern. (The Intel camera is .. not really the right thing for this, it's more of a mini Kinect and outputs only 480x270 video over the RPi's USB 2 bus. And apparently Z16 depth data as well, but I haven't managed to read that out.) Getting the video streaming low-latency enough to use for driving was a big hassle.

Experimented with powering the car from the usual 5 AA batteries, then the same USB power bank that's powering the RPi (welcome to current spike crash land, the motors can suck up to 2A when stalling), and a separate USB power bank ("Hmm, it's a bit sluggish." The steering motor has two 5.6 ohm resistors wired to it and the L298N has a voltage drop of almost 1.5V at 5V, giving me about 1W of steering power with the USB. The original controller uses a tiny MX1508, which has a voltage drop something like 0.05V. Coupled with the 7.5V battery pack, the car originally steers at 5W. So, yeah, 5x loss in snappiness. Swap motor controller for a MX1508 and replace the resistors with 2.7R or 2.2R? Or keep the L298N and use 1.2R resistors.) Then went back to the 5 AA batteries. Screw it, got some NiMHs.

Tip: Don't mix up GND and +7.5V in the L298N. It doesn't work and gets very hot after a few seconds. Thankfully that didn't destroy the RPi. Nor did plugging the L298N +5V and GND to RPi +5V and GND -- you're supposed to use a jumper to bridge the +12V and GND pins on the L298N, then plug just the GND to the RPi GND (at least that's my latest understanding). I.. might be wrong on the RPi GND part, the hypothesis is that having shared ground for the L298N and the RPi gives a ground reference for the motor control pins coming from the RPi.

Tip 2: Don't wipe off the solder from the tip of the iron, then leave it at 350C for a minute. It'll turn black. The black stuff is oxides. Oxides don't conduct heat well and solder doesn't stick to it. Wipe / buff it off, then re-tin the tip of the iron. The tin should stick to the iron and form a protective layer around it.

Destroyed the power switch of the car. A big power bank in a metal shell, sliding around in a crash, crush. It was used to control the circuit of the AA battery pack. Replaced it with a heavy-duty AC switch of doom.

Cut a USB charging cable in half to splice the power wires into the motor controller. Hey, it works! Hey, it'd be nicer if it was 10 cm longer.

Cut a CAT6 in half and spliced the ends to two RJ45 wall sockets. Plugged the other two into a router. He he he, in-socket firewall.

Got a cheapo digital multimeter. Feel so EE.

Thinking of surface mount components. Like, how to build with them without the pain of soldering and PCB production. Would be neat to build the circuit on the surface of the car seats.

4-color printer with conductive, insulating, N, and P inks. And a scanner to align successive layers.

The kid really likes buttons that turn on LEDs. Should add those to the car.

Hey, the GPIO lib has stuff for I2C and SPI. Hey, there are these super-cheap ESP32 / ESP8266 WiFi boards look neat. Hey, cool, a tiny laser ToF rangefinder.

Man, the IoT rabbit hole runs deep.

(Since writing the initial part, I swapped the L298N for a MX1508 motor controller, and the D415 for a small Raspberry Pi camera module. And got a bunch of sensors, including an ultrasound rangefinder and the tiny laser rangefinder.)

2018-09-27

1 GB/s from Samba

Hey, finally on the fast network (GPU died, so I replaced it with an IB card :P)

In other news, new workstation time.

2018-08-05

Three thoughts

Energy

We're living on a grain of sand, five meters away from a light bulb. Of all the solar energy falling onto our tiny grain of sand, we are currently able to utilize 0.01%. The total energy falling onto our tiny grain of sand is roughly 0.00000005% of the total output of the light bulb.

Matter

Gravity sorting. In a gravity well, denser particles end up below lighter particles given time. Here's a list of densities of common elements. It matches the distribution of elements on Earth: air above water, water above silicon, silicon above iron. Note that the rare elements tend to be at the bottom of the list. There are three reasons for that. 1) Heavy elements are rare in the Universe. You need supernovas to produce them. 2) Heavy elements get gravity sorted below iron and mostly reside in the middle of the planet. 3) Some elements - like gold - don't oxidize easily, so they don't bubble up as lighter oxides. For a counter-example, tungsten has a similar density as gold (19.3 g/cm³), but oxidizes as wolframite (Fe,Mn)WO₄, which has a density of 7.5 g/cm³ (close to elemental iron). As a result, the annual global tungsten production is 75,000 tons, whereas the annual gold production is 3,100 tons.

Elemental abundances from Wikipedia

The Earth formed from dust in the protoplanetary disk, similarly to other planets and objects in the Solar System. As a result, asteroids should have similar overall composition as the Earth.

Gravity sorting acts slower on less massive objects. You should see relatively more heavy elements at the surface of less massive objects. The force of gravity is also weaker, making the grains of matter less densely packed. This should make mining asteroids less energy-intensive compared to mining on Earth. Asteroid ores should also be more abundant in metals compared to Earth ores. At the extreme end, you'd have iridium, which is 500x more common on asteroids compared to the Earth's crust. Couple that with - let's say - factor of two energy reduction compared to Earth mining, and an asteroid mine could generate 1000x the iridium per watt compared to an Earth-based operation.

Suppose you want to have reusable space launch vehicles to minimize launch costs. Rockets that land. Your customers pay you to launch their stuff to space, then you recover your rocket to save on costs. You don't want to recover the stage that gets to LEO because it'd be too expensive. But what if the returning LEO vehicle brought back a ton of gold. That'd be worth $40 million at today's prices. And the amount of gold brought back this way would be small enough (say, 25 tons per year) that it wouldn't put much dent into gold demand or affect the price of gold. If anything, you could sell it as Star Gold for jewelry at 10x the spot price. Even at spot price, it'd go a long way towards covering the cost of the entire launch.

Computation

In a few decades, it looks likely that we'll be able to simulate humans in real-time using a computer the size of a match box. Match box humans. If it takes half a square meter of solar panel to power one match box human, you could sustain a population of 2 trillion in the US alone. If you were to use fusion to convert all the cosmic dust that falls on Earth (let's say, 100 tons) to energy at 1% mass-to-energy ratio, it would generate around 1 PW of energy, which could sustain 10 trillion match box humans running at 100 watts per person.

2018-08-01

The carbon capture gambit

To control the climate of the planet, stopping the loss of carbon to air is not enough. To gain control of the temperature of the planet and the carbon economy, we need to mine the carbon that's been lost into the air.

There are proposed air-mining mechanisms for carbon that run at a cost of 3-7% of GDP. And that's before significant optimizations. To gain control of the climate and amass vast carbon reserves, we would only need a sum of money equivalent to a few years' worth of economic growth. If we invest now, we'll get a massive head start on future mining operations by other countries, and will end up in an immensely powerful geopolitical position. Not only will we have control over the climate of the planet, we also end up in the possession of all the carbon lost into the air.

Control over carbon will give us control over the carbon-based fuel economy. Most of the carbon accessible carbon on the planet is right there in the air, ready for the taking.

Every year, more than 7 gigatons of carbon is lost in the atmosphere in the form of CO2. We can estimate the market value of this lost carbon using the price of coal. Coal is around 50% carbon, with the rest composed of impurities. The price of coal is around $40 per ton. Pure carbon could be twice as valuable as coal. At $80 per ton, 7 gigatons of carbon would be worth $560 billion dollars. More than half a trillion dollars. Every. Single. Year.

2018-07-02

Some more rubbish on P != NP

Feeling crank today, so:

Non-deterministic machines are one-bit oracles. Given a branch in execution, they always follow the correct path. You can use this property to write a universal non-deterministic machine:

readInput();

NonDeterministic done, correctOutput;

while (!done) {
  outputBit(correctOutput);
}

Its execution time is linear to the size of the output. There are still problems that take non-polynomial time compared to the input. For instance, "given an input of n bits, print out n! bits".

The question in P ?= NP then is: "given a problem where the output size is a polynomial of the input size, is it possible to emulate a non-deterministic machine using a deterministic machine with at most a polynomial slowdown". Since the ND machine takes a number of steps linear to the output size, which is polynomial of the input size, it solves the problem in polynomial time to the input. If the deterministic machine can emulate the ND machine at a polynomial slowdown, its running time is a polynomial of the output size, which is a polynomial of a polynomial of the input size.

Maybe you could also ask if it's possible to prove that there exists a problem for which an optimal program running on a deterministic machine requires a number of steps exponential to the size of its output.

Make a bridge from output size to input size? ...exponential to the size of its output and where the output size grows linearly with input size.

2018-04-21

rdma-pipe

Haven't you always wanted to create UNIX pipes that run from one machine to another? Well, you're in luck. Of sorts. For I have spent my Saturday hacking on an InfiniBand RDMA pipeline utility that lets you pipe data between commands running on remote hosts at multi-GB/s speeds.

Unimaginatively named, rdma-pipe comes with the rdpipe utility that coordinates the rdsend and rdrecv programs that do the data piping. The rdpipe program uses SSH as the control channel and starts the send and receive programs on the remote hosts, piping the data through your commands.

For example


  # The fabulous "uppercase on host1, reverse on host2"-pipeline.
  $ echo hello | rdpipe host1:'tr [:lower:] [:upper:]' host2:'rev'
  OLLEH

  # Send ZFS snapshots fast-like from fileserver to backup_server.
  $ rdpipe backup@fileserver:'zfs send -I tank/media@last_backup tank/media@head' backup_server:'zfs recv tank-backup/media'

  # Decode video on localhost, stream raw video to remote host.
  $ ffmpeg -i sintel.mpg -pix_fmt yuv420p -f rawvideo - | rdpipe playback_host:'ffplay -f rawvideo -pix_fmt yuv420p -s 720x480 -'

  # And who could forget the famous "pipe page-cached file over the network to /dev/null" benchmark!
  $ rdpipe -v host1:'</scratch/zeroes' host2:'>/dev/null'
  Bandwidth 2.872 GB/s

Anyhow, it's quite raw, new, exciting, needs more eyeballs and tire-kicking. Have a look if you're on InfiniBand and need to pipe data across hosts.

2018-04-12

RDMA cat

Today I wrote a small RDMA test program using libibverbs. That library has a pretty steep learning curve.

Anyhow. To use libibverbs and librdmacm on CentOS, install rdma-core-devel and compile your things with -lrdmacm -libverbs.

My test setup is two IBM-branded Mellanox ConnectX-2 QDR InfiniBand adapters connected over a Voltaire 4036 QDR switch. These things are operating at PCIe 2.0 x8 speed, which is around 3.3 GB/s. Netcat and friends get around 1 GB/s transfer rates piping data over the network. Iperf3 manages around 2.9 GB/s. With that in mind, let's see what we can reach.

I was basing my test programs on these amazingly useful examples: https://github.com/linzion/RDMA-example-application https://github.com/jcxue/RDMA-Tutorial http://www.digitalvampire.org/rdma-tutorial-2007/notes.pdf and of course http://www.rdmamojo.com/ . At one point after banging my head on the ibverbs library for a bit too long I was thinking of just using MPI to write the thing and wound up on http://mpitutorial.com - but I didn't have the agility to jump from host-to-host programs to strange new worlds, so kept on using ibverbs for these tests.

First light

The first test program was just reading some data from STDIN, sending it to the server, which reverses it and sends it back. From there I worked towards sending multiple blocks of data (my goal here was to write an RDMA version of cat).

I had some trouble figuring out how to make the two programs have a repeatable back-and-forth dialogue. First I was listening to too many events with the blocking ibv_get_cq_event -call, and that was hanging the program. Only call it as many times as you're expecting replies.

The other fib was that my send and receive work requests shared the sge struct, and the send-part of the dialogue was setting the sge buffer length to 1 since it was only sending acks back to the other server. Set it back to the right size before sending each work request, problem solved.

Optimization

Once I got the rdma-cat working, performance wasn't great. I was reading in a file from page cache, sending it to the receiver, and writing it to the STDOUT of the receiver. The program was sending 4k messages, doing a 4k acks, and a mutex-requiring event ack after each message. This ran at around 100 MB/s. Changing the 4k acks to single-byte acks and doing the event acks for all the events at once got me to 140 MB/s.

How about doing larger messages? Change the message size to 65k and the cat goes at 920 MB/s. That's promising! One-megabyte messages and 1.4 GB/s. With eight meg messages I was up to 1.78 GB/s and stuck there.

I did another test program that was just sending an 8 meg buffer to the other machine, which didn't do anything to the data. This is useful to get an optimal baseline and gauge perf for a single process use case. The test program was running at 2.9 GB/s.

Adding a memcpy to the receive loop roughly halved the bandwidth to 1.3 GB/s. Moving to a round-robin setup with one buffer receiving data while another buffer is having the data copied out of it boosted the bandwidth to 3 GB/s.

The send loop could read in data at 5.8 GB/s from the page cache, but the RDMA pipe was only doing 1.8 GB/s. Moving the read to happen right after each send got them both moving in parallel, which got the full rdma_send < inputfile ; rdma_recv | wc -c -pipe running at 2.8 GB/s.

There was an issue with the send buffer contents getting mangled by an incoming receive. Gee, it's almost like I shouldn't use the same buffer for sending and receiving messages. Using a different buffer for the received messages resolved the issue.

It works!

I sent a 4 gig file and ran diff on it, no probs. Ditto for files less than buffer size in size and small strings sent with echo.

RDMA cat! 2.9 GB/s over the network.

Let's try sending video frames next. Based on these CUDA bandwidth timings, I should be able to do 12 GB/s up and down. Now I just need to get my workstation on the IB network (read: buy a new workstation with more than one PCIe slot.)

[Update] For the heck of it, I tried piping through two hosts.

[A]$ rdma_send B < inputfile
[B]$ rdma_recv | rdma_send C
[C]$ rdma_recv | wc -c

2.5 GB/s. Not bad, could do networked stream processing. Wonder if it would help if I zero-copy passed the memory regions along the pipe.

And IB is capable of multicast as well...

2018-04-06

4k over IB

So, technically, I could stream uncompressed 4k@60Hz video over the Infiniband network. 4k60 needs about 2 GB/s of bandwidth, the network goes at 3 GB/s.

This... how would I try this?

I'd need a source of 4k frames. Draw on the GPU to a framebuffer, then glReadPixels (or CUDA GPUDirect RDMA). Then use IB verbs to send the framebuffer to another machine. Upload it to the GPU to display with glTexImage (or GPUDirect from the IB card).

And make sure that everything in the data path runs at above 2 GB/s.

Use cases? Extreme VNC? Combining images from a remote GPU and local GPU? With a 100Gb network, you could pull in 6 frames at a time and composite in real time I guess. Bringing in raw 4k camera streams to a single box over a switched COTS fabric.

Actually, this would be "useful" for me, I could stream interactive video from a big beefy workstation to a small installation box. The installation box could handle stereo camera processing and other input, then send the control signals to the rendering station. (But at that point, why not just get longer HDMI and USB cables.)

2018-04-02

Quick timings

NVMe and NFS, cold cache on client and server. 4.3 GiB in under three seconds.

$ cat /nfs/nvme/Various/UV.zip | pv > /dev/null
 4.3GiB 0:00:02 [1.55GiB/s]

The three-disk HDD pool gets around 300 MB/s, but once the ARC picks up the data it goes at NFS + network speed. Cold cache on the client.

$ echo 3 > /proc/sys/vm/drop_caches
$ cat /nfs/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:10 [ 1.5GiB/s]

Samba is heavier somehow.

$ cat /smb/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:13 [1.26GiB/s]

NFS over RDMA from the ARC, direct to /dev/null (which, well, it's not a very useful benchmark). But 2.8 GB/s!

$ time cat /nfs/hdd/Videos/*.mp4 > /dev/null

real    0m6.269s
user    0m0.007s
sys     0m4.056s
$ cat /nfs/hdd/Videos/*.mp4 | wc -c
17722791869
$ python -c 'print(17.7 / 6.269)'
2.82341681289

$ time cat /nfs/hdd/Videos/*.mp4 > /nfs/nvme/bigfile

real    0m15.538s
user    0m0.016s
sys     0m9.731s

# Streaming read + write at 1.13 GB/s

How about some useful work? Parallel grep at 3 GB/s. Ok, we're at the network limit, call it a day.

$ echo 3 > /proc/sys/vm/drop_caches
$ time (for f in /nfs/hdd/Videos/*.mp4; do grep -o --binary-files=text XXXX "$f" & done; for job in `jobs -p`; do wait $job; done)
XXXX
XXXX
XXXX
XXXX
XXXX

real    0m5.825s
user    0m3.567s
sys     0m5.929s

2018-03-26

Infinibanding, pt 4. Almost there

Got my PCIe-M.2 adapters, plugged 'em in, one of them runs at PCIe 1.0 lane speeds instead of PCIe 3.0, capping perf to 850 MB/s. And causes a Critical Interrupt #0x18 | Bus Fatal Error that resets the machine. Then the thing overheats, melts its connection solder, shorts, releases the magic smoke and makes the SSD PCB look like wet plastic. Yeah, it's dead. The SSD's dead too.

[Edit: OR.. IS IT? Wiped the melted goop off the SSD and tried it in the other adapter and it seems to be fine. Just has high controller temps, apparently a thing with the 960s. Above 90 C after 80 GB of writes. It did handle 800 GB of /dev/zero written to it fine and read everything back in order as well. Soooo, maybe my two-NVMe pool lives to ride once more? Shock it, burn it, melt it, it just keeps on truckin'. Disk label: "The Terminator"]

I only noticed this because the server rebooted and one of the mirrors in the ZFS pool was offline. Hooray for redundancy?

Anyway, the fast NVMe work pool is online, it can kinda saturate the connection. It's got ~~two~~ one Samsung 960 EVOs in it, which ~~are~~ is fast for reads, if maybe not the best for synced writes.

I also got a 280 gig Optane 900p. It feels like Flash Done Right. Low queue depths, low parallelism, whatever, it just burns through everything. Optane also survives many more writes than flash, it's really quite something. And it hasn't caught on fire yet. I set up two small partitions (10 GB) as ZFS slog devices for the two pools and now the pools can handle five-second write bursts at 2 GB/s.

Played with BeeGFS over the weekend. Easyish to get going, sort of resilient to nodes going down if you tune it that way, good performance with RDMA (netbench mode dd went at 2-3GB/s). The main thing lacking seems to be the "snapshots, backups and rebuild from nodes gone bye"-story.

Samba and NFS get to 1.4 to 1.8 GB/s per client, around 2.5 GB/s aggregate, high CPU usage on the server somehow, even with NFSoRDMA. I'll see if I can hit 2 GB/s on a single client. Not really native-level random access perf though. The NVMe drive can do two 1 GB/s clients. Fast filestore experiment mostly successful, if not completely.

Next up, wait for a bit of a lull in work, and switch my workstation to a machine that's got more than one PCIe slot. And actually hook it up to the fast network to take advantage of all this bandwidth. Then plonk the backup disks into one box, take it home, off-site backups yaaay.

Somehow I've got nine cables, four network cards, three computers, and a 36-port switch. Worry about that in Q3, time to wrap up with this build.

2018-03-20

InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw 3.3 GB/s between two native CentOS boxes. The server has two cards so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the firmware to the latest version on the ConnectX-2 cards makes ESXi detect them, which somehow breaks the PCI pass-through. With ESXi drivers, they work as ~20 GbE network cards that can be used by all of the VMs. But trying to use the pass-through from a VM fails with an IRQ error and with luck the entire machine locks up. So, I dropped ESXi from the machines for now.

ZFS

Been playing with ZFS with Sanoid for automated hourly snapshots and Syncoid for backup sync. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to snapshot, backup to local replica, backup to remote server, recovery from backup, per-file recovery from .zfs/snapshot, hot spares. Backup syncs seem to even work between pools of different sizes, I guess as long as the data doesn't exceed pool size. Hacked Sanoid to make it take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).

Copied over data from the workstations, then corrupted the live mounted pool with dd if=/dev/zero over a couple disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even restored the snapshots, A+++.

After the successful restore, my Windows laptop bluescreen on me and corrupted the effect file I was working on. Git got me back to a 30-min-old version, which wasn't so great. So, hourly snapshots aren't good enough. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour. Resilvering a full disk should be somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep either of the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do, buy new disks and restore from the backup server :(

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs. An over-the-weekend scrub for the archive pool, and a nightly scrub on the fast SSD work pool. That is the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe -adapters are still on back order.

Plans?

I'll likely go with two pools: a work pool with two SSDs in RAID-1, and the other an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.

The reason for the two volume system is to get predictable performance out of the SSD volume, without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive. Having 10 TB of usable space would go a long way. Doing parity array rebuilds on 6TB drives, ugh. Three-disk mirror.

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.

2018-03-11

InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- very annoying lane spec outside of server gear: either it eats up a PCIe 3.0 x16 slot, or you end up running at half speed, and it's too slow to hit the full 32 Gbps of QDR InfiniBand. Oh yes, I plugged one into a PCIe 2.0 x8 four-lane slot, it gets half the RDMA bandwidth in tests.)

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.

IPoIB iperf2 does 2.9 GB/s with four threads. The single-threaded iperf3 goes 2.5 GB/s if I luck out on the CPU affinity lottery (some cores / NUMA nodes do 1.8 GB/s..)

NFS over RDMA, 64k random reads with QD8 and one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.

The bandwidth limit of PCIe 2.0 x8 that these InfiniBand QDR cards use is around 25.6 Gbps, or 3.2 GB/s. Testing with ib_read_bw, it maxes out at around 3 GB/s.

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only 128 byte MaxPayload), but can't complain.

And... There's an upgrade path composed of generations of obsoleted enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 5.6 GB/s per link. Then, pick up EDR 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate a 16x slot into two 8x slots for IB or four 4x slots for NVMe. Or bifurcate both slots. Or get a dual-FDR/EDR card with a 16x connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)

2018-03-07

Quick test with old Infiniband kit

Two IBM ConnectX-2 cards, hooked up to a Voltaire 4036 switch that sounds like a turbocharged hair dryer. CentOS 7, one host bare metal, other on top of ESXi 6.5.

Best I saw thus far: 3009 MB/s RDMA transfer. Around 2.4 GB/s with iperf3. These things seem to be CPU capped, top is showing 100% CPU use. Made an iSER ramdisk too, it was doing 1.5 GB/s-ish with ext4.

Will examine more next week. With later kernel and firmwares and whatnot.

The end goal here would be to get 2+ GB/s file transfers over Samba or NFS. Probably not going happen but eh, give it a try.

That switch though. Need a soundproof cabinet.

2018-03-04

OpenVPN settings for 1 Gbps tunnel

Here are the relevant parts of the OpenVPN 2.4 server config that got me 900+ Mbps iperf3 on GbE LAN. The tunnel was between two PCs with high single-core performance, a Xeon 2450v2 and an i7-3770. OpenVPN uses 50% of a CPU core on the client & server when the tunnel is busy. For reference, I tried running the OpenVPN server on my WiFi router, it peaked out at 60 Mbps.

# Use TCP, I couldn't get good perf out of UDP. 

proto tcp

# tun or tap, roughly same perf
dev tun 

# Use AES-256-GCM:
#  - more secure than 128 bit
#  - GCM has built-in authentication, see https://en.wikipedia.org/wiki/Galois/Counter_Mode
#  - AES-NI accelerated, the raw crypto runs at GB/s speeds per core.

cipher AES-256-GCM

# Don't split the jumbo packets traversing the tunnel.
# This is useful when tun-mtu is different from 1500.
# With default value, my tunnel runs at 630 Mbps, with mssfix 0 it goes to 930 Mbps.

mssfix 0

# Use jumbo frames over the tunnel.
# This reduces the number of packets sent, which reduces CPU load.
# On the other hand, now you need 6-7 MTU 1500 packets to send one tunnel packet. 
# If one of those gets lost, it delays the entire jumbo packet.
# Digression:
#   Testing between two VBox VMs on a i7-7700HQ laptop, MTU 9000 pegs the vCPUs to 100% and the tunnel runs at 1 Gbps.
#   A non-tunneled iperf3 runs at 3 Gbps between the VMs.
#   Upping this to 65k got me 2 Gbps on the tunnel and half the CPU use.

tun-mtu 9000

# Send packets right away instead of bundling them into bigger packets.
# Improves latency over the tunnel.

tcp-nodelay

# Increase the transmission queue length.
# Keeps the TUN busy to get higher throughput.
# Without QoS, you should get worse latency though.

txqueuelen 15000

# Increase the TCP queue size in OpenVPN.
# When OpenVPN overflows the TCP queue, it drops the overflow packets.
# Which kills your bandwidth unless you're using a fancy TCP congestion algo.
# Increase the queue limit to reduce packet loss and TCP throttling.

tcp-queue-limit 256

And here is the client config, pretty much the same except that we only need to set tcp-nodelay on the server:

proto tcp
cipher AES-256-GCM
mssfix 0
tun-mtu 9000
txqueuelen 15000
tcp-queue-limit 256

To test, run iperf3 -s on the server and connect to it over the tunnel from the client: iperf3 -c 10.8.0.1. For more interesting tests, run the iperf server on a different host on the endpoint LAN, or try to access network shares.

I'm still tuning this (and learning about the networking stack) to get a Good Enough connection between the two sites, let me know if you got any tips or corrections.

P.S. Here's the iperf3 output.

$ iperf3 -c 10.8.0.1
Connecting to host 10.8.0.1, port 5201
[  4] local 10.8.2.10 port 39590 connected to 10.8.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec    0   3.01 MBytes
[  4]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   6.00-7.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver

iperf Done.

2018-03-02

Fast-ish OpenVPN tunnel

500 Mbps OpenVPN throughput over the Internet, nice. Was aiming for 900 Mbps, which seems to work on LAN, but no cigar. [Edit: --tcp-nodelay --tcp-queue-limit 256 got me to 680 Mbps. Which is very close to non-tunneled line speed as measured by a HTTP download.]

OpenVPN performance is very random too. I seem to be getting different results just because I restarted the client or the server.

The config is two wired machines, each with a 1 Gbps fibre Internet connection. The server is a Xeon E3-1231v3, a 3.4 GHz Haswell Xeon. The client is my laptop with a USB3 GbE adapter and i7-7700HQ. Both machines get 900+ Mbps on Speedtest, so the Internet connections are fine.

My OpenVPN is set to TCP protocol (faster and more reliable than UDP in my testing), and uses AES-256-GCM as the cipher. Both machines are capable of pushing multiple gigaBYTES per second over openssl AES-256, so crypto isn't a bottleneck AFAICT. The tun-mtu is set to 9000, which performs roughly as well as 48000 or 60000, but has smaller packets which seems to be less flaky than big mtus. The mssfix setting is set to 0 and fragment to 0 as well, though fragment shouldn't matter over TCP.

Over iperf3 I get 500 Mbps between the endpoints. With HTTP, roughly that too. Copying from a remote SMB share on another host goes at 30 MB per second, but the remote endpoint can transfer from the file server at 110 MB per second (protip: mount.cifs -o vers=3.0). Thinking about it a bit, I need to test with a second NIC in the VPN box, now VPN traffic might be competing with LAN traffic.

2018-02-28

Building a storage pyramid

The office IT infrastructure plan is something like this: build interconnected storage pyramids with compute. The storage pyramids consist of compute hooked up to fast memory, then solid state memory to serve mid-IOPS and mid-bandwidth workloads, then big spinning disks as archive. The different layers of the pyramid are hooked up via interconnects that can be machine-local or over the network.

The Storage Pyramid

Each layer of the storage pyramid has different IOPS and bandwidth characteristics. Starting from the top, you've got GPUs with 500 GB/s memory, connected via a 16 GB/s PCIe bus to the CPU, which has 60 GB/s DRAM. The next layer is also on the PCIe bus: Optane-based NVMe SSDs, which can hit 3 GB/s on streaming workloads and 250 MB/s on random workloads (parallelizable to maybe 3x that). After Optane, you've got flash-based SSDs that push 2-3 GB/s streaming accesses and 60 MB/s random accesses. At the next level, you could have SAS/SATA SSDs which are limited to 1200/600 MB/s streaming performance by the bus. And at the bottom lie the HDDs that can do somewhere between 100 to 240 MB/s streaming accesses and around 0.5-1 MB/s random accesses.

The device speeds guide us in picking the interconnects between them. Each HDD can fill a 120 MB/s GbE port. SAS/SATA SSDs plug into 10GbE ports, with their 1 GB/s performance. For PCIe SSDs and Optane, you'd go with either 40GbE or InfiniBand QDR, and hit 3-4 GB/s. After the SSD layer, the interconnect bottlenecks start rearing their ugly heads.

You could use 200Gb InfiniBand to connect single DRAM channels at 20 GB/s, but even then you're starting to get bottlenecked at high DDR4 frequencies. Plus you have to traverse the PCIe bus, which further knocks you down to 16 GB/s over PCIe 3.0 x16. It's still sort of feasible to hook up a cluster with shared DRAM pool, but you're pushing the limits.

Usually you're stuck inside the local node for performance at the DRAM-level. The other storage layers you can run over the network without much performance lost.

The most unbalanced bottleneck in the system is the CPU-GPU interconnect. The GPU's 500 GB/s memory is hooked to the CPU's 60 GB/s memory via a 16 GB/s PCIe bus. Nvidia's NVLink can hook up two GPUs together at 40 GB/s (up to 150 GB/s for Tesla V100), but there's nothing to get faster GPU-to-DRAM access. This is changing with the advent of PCIe 4.0 and PCIe 5.0, which should be able to push 128 GB/s and create a proper DRAM interconnect between nodes and between the GPU and the CPU. The remaining part of the puzzle would be some sort of 1 TB/s interconnect to link GPU memories together. [Edit] NVSwitch goes at 300 GB/s, which is way fast.

The Plan

Capacity-wise, my plan is to get 8 GB of GPU RAM, 64 GB of CPU RAM, 256 GB of Optane, 1 TB of NVMe flash, and 16 TB of HDDs. For a nicer-cleaner-more-satisfying progression, you could throw in a 4 TB SATA flash layer but SATA flash is kind of DOA as long as you have NVMe and PCI-E slots to use -- the price difference between NVMe flash and SATA flash is too small compared to the performance difference.

If I can score an InfiniBand interconnect or 40GbE, I'll stick everything from Optane on down to a storage server. It should perform at near-local speeds and simplify storage management. Shared pool of data that can be expanded and upgraded without having to touch the workstations. Would be cool to have a shared pool of DRAM too but eh.

Now, our projects are actually small enough (half a gig each, maybe 2-3 of them under development at once) that I don't believe we will ever hit disk in daily use. All the daily reads and writes should be to client DRAM, which gets pushed to server DRAM and written down to flash / HDD at some point later. That said, those guys over there *points*, they're doing some video work now...

The HDDs are mirrored to an off-site location over GbE. The HDDs are likely capable of saturating a single GbE link, so 2-3 GbE links would be better for live mirroring. For off-site backup (maybe one that runs overnight), 1 GbE should be plenty.

In addition to the off-site storage mirror, there's some clouds and stuff for storing compiled projects, project code and documents. These don't need to sync fast or are small enough to do so.

Business Value

Dubious. But it's fun. And opens up possible uses that are either not doable on the cloud or way too expensive to maintain on the cloud. (As in, a single month of AWS is more expensive than what I paid for the server hardware...)

2018-02-27

Figments

Ultraviolet Fairies

"Can you make them dance?", Pierre asked. Innocent question, but this was a cloud of half a million particles. Dance? If I could make the thing run in the first place it would be cause for celebration.

The red grid of the Kinect IR illuminator came on. Everything worked perfectly again. Exactly twelve seconds later, it blinked out, as it had done a dozen times before. The visiting glass artist wasn't impressed with our demo.

Good tidings from France. The app works great on Pierre's newly-bought Acer laptop. A thunderhead was building in the horizon. The three-wall cave projection setup comes out with a wrong aspect ratio. I sipped my matcha latte and looked at the sun setting behind the cargo ships moored off West Kowloon. There's still 20 hours before the gig.

The motion was mesmerizing. Tiny fairies weaving around each other, hands swatting them aside on long trajectories off-screen. I clenched my fist and the fairies formed a glowing ring of power, swirling around my hand like a band of living light. The keyboard was bringing the escaping clouds to life, sending electric pulses through the expanding shells of fairies knocked off-course.

Beat. The music and Isabelle's motion become one, the cloud of fairies behind her blows apart from the force of her hands, like sand thrown in the air. Cut to the musicians, illuminated by the gridlines of the projection. Fingers beating the neon buttons of the keyboard, shout building in the microphone. The tension running through the audience is palpable. Beat. The flowing dancer's dress catches a group of fairies. Isabelle spins and sends them flying.

The AI

A dot. I press space. The dot blinks out and reappears. One, two, three, four, five, six, seven, eight, nine, ten. I press space. The dot blinks out and reappears. Human computation.

Sitting at a desk in Zhuzhou. The visual has morphed into a jumble of sharp lines, rhythmically growing every second. The pulse driving it makes it glow. Key presses simulate the drums we want to hook up to it. Rotate, rotate, zoom, disrupt, freeze. The rapid typing beat pushes the visual to fill-rate destroying levels and my laptop starts chugging.

Sharp lines of energy, piercing the void around them. A network of connections shooting out to other systems. Linear logic, strictly defined, followed with utmost precision. The lines begin to _bend_, almost imperceptibly at first. A chaotic flow pulls the lines along its turbulent path. And, stop. Frozen in time, the lines turn. Slowly, they begin to grow again, but straight and rigid, linear logic revitalized. Beginning of the AI Winter.

The fucking Kinect dropped out again! I start kicking the wire and twisting the connector. That fucker, it did it once in the final practice already, of course it has to drop out at the start of the performance as well. Isabelle's taking it in stride, she's such a pro. If the AI doesn't want to dance with her, she'll dance around it. I push myself down from my seat. How about if I push the wire at the Kinect end, maybe taping it to the floor did ... oh it works again. I freeze, not daring to move, lying on the floor of the theater. Don't move don't move don't move, keep working you bastard! The glowing filter bubbles envelop Isabelle's hands, the computer is responsive again. Hey, it's not all bad. We could use this responsiveness toggle for storytelling, one more tool in the box.

We're the pre-war interexpressionist movement. Beautiful butterflies shining in the superheated flashes of atomic explosions.

2018-02-11

Office build, pt. 2

With pretty much all the furniture built (apart from fixing a shelf and a filing cabinet to a wall), the office is getting close to version 1. That is, more than half empty and running at 10% utilization.

What do I mean? Well, what I've got in there now are two big desks and a table, two office chairs and two random chairs. Add in a filing cabinet, a shelf and a couple of drawers, and .. that's it. There's a room in the back that's going to be the demo build area / VR dev area. And probably where the servers and printers are going to live.

As it is now, the main office has space for 3-4 more big desks. I'll haul one desk from home there. And that's it for now. Room to expand.

If you're looking for a place to work from in Tsuen Wan, Hong Kong, come check it out. Or if you want to work on building awesome interactive generative visuals -- fhtr.org & fhtr.net -style but with more art direction, interactivity and actual clients. Sales genius, UI developer and 3D artist could use right away.

2018-02-09

Office build

Recently I've been setting up an office in Hong Kong, a.k.a. setting money on fire. Generating heat that enables us to set up installations in-house, tackle bigger projects and expand the company from its current staff of 1.25. Good plan? Great plan!

Very IKEA, much birch veneer. Spent a couple enjoyable days building furniture, almost done for now. Plan is to get a big board in front of the office's glass door and mount a big TV on it to run a demo loop of interactive content for the other tenants in the building to enjoy.

Also planning some network infrastructure build. The place has a wired network from the previous tenant, with five Cat5e cables strung along the walls, which I hooked up to an 8-port gigabit switch. There are another five cables running inside the walls that terminate at ethernet sockets, but those seem to be broken somehow.

Getting gigabit fibre internet there next week, which was surprisingly affordable. Like, $100/month affordable. For WiFi, I got a small WiFi router, 802.11ac. It's all very small-time setup at the moment.

So hey, how about some servers and a network upgrade? Get a couple of used old Xeon servers, fill them with SSDs and high-capacity drives for a proper storage tower. Run a bunch of VMs for dev servers and NAS. And, hmm, how about QDR InfiniBand cards and cables for 32Gbps network speed between servers and wired workstations, with the GigE and WiFi for others. Sort of like a 2012 supercomputing node setup. The best part? All of this kit (apart from the new drives) would be around $1500, total.

That said, I'll start with one server and a direct connection to the workstation. It'll keep the budget down to $400ish and let me try it out to see if it actually works.

Next to find the parts and get that thing built and running.

And um, yeah, do some sales too.

Here are some recent visuals done with our friends and partners at cityshake.com and plot.ly. For awesome sites and corporate app development in Austria, contact raade.at, who we have also worked with on various projects. Interested? Of course you are!

Mail me at my brand spanking new email address hei@heichen.hk and buy some awesome visuals or a fancy interactive office/retail/hotel installation or a fabulous fashion show. Or your very own VM hosted by a DevOps noob on obsolete hardware. I'm sure a VM can't catch on fire. Or... can it?

2018-01-25

The Bell

A hill rising in the middle of a plain, surrounded by sparse woods. The low hill, covered in tall yellow grass like the rest of the plain, wavy in the relentless dry season heat. On top of the hill rises a lonely figure sits before a bell. The shadow of the bell shielding the figure from the sun's rays. A round clearing surrounds the bell, a staircase leading down from the hill.

The figure rises up and walks down into the shallow pit under the bell. Standing under the bell, the figure reaches up and grabs a rope tied to a large log. Swinging it back and forth, slowly gaining speed, until the log nearly touches the inside wall of the bell. With a violent jerk, he brings the log to crash into the other side of the bell.

A ring to be heard up in heaven and down in hell, the bell rings, deep and clear. With each ring echoing in the skies above, the ghosts shift in the earth below.