art with code

2018-10-06

Hardware hacking

I got into these Pi devices recently. At first it was a simple "I want an easy way to control sites accessible on my office WiFi to stop wasting time when I should be working", so I set up an old broken laptop to prototype a simple service to do that. Then I replaced the laptop with a small Orange Pi hacker board. And got some wires and switches and breadboard and LEDs and resistors and ... hey, is that a Raspberry Pi 3B+? I'll get that too, maybe I can use it for something else...

Well.

I took apart a cheap RC car. Bought a soldering station to desolder the wires from the motor control board. Then got a motor controller board (an L298N, a big old chip for controlling 9-35V motors with 2A current draw -- not a good match for 3V 2A Tamiya FA-130 motors in the RC car), wired the motors to it, and the control inputs to the Raspberry Pi GPIOs.
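
For flavor, here's roughly what driving one of those motors through the L298N looks like from the Pi (a minimal Python sketch, assuming the RPi.GPIO library; the BCM pin numbers for the L298N IN1/IN2 inputs are made up for illustration):

import RPi.GPIO as GPIO
import time

GPIO.setmode(GPIO.BCM)
IN1, IN2 = 17, 27           # hypothetical pins wired to L298N IN1/IN2
GPIO.setup(IN1, GPIO.OUT)
GPIO.setup(IN2, GPIO.OUT)

pwm = GPIO.PWM(IN1, 100)    # PWM on IN1 at 100 Hz for speed control
GPIO.output(IN2, GPIO.LOW)  # IN2 low + pulses on IN1 = forward
pwm.start(75)               # 75% duty cycle
time.sleep(2)               # drive forward for two seconds
pwm.stop()
GPIO.cleanup()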

Add some WebSockets and an Intel RealSense camera I had in a desk drawer and hey, FPV web-controlled car that sees in the dark with a funky structured light IR pattern. (The Intel camera is... not really the right thing for this, it's more of a mini Kinect and outputs only 480x270 video over the RPi's USB 2 bus. And apparently Z16 depth data as well, but I haven't managed to read that out.) Getting the video streaming low-latency enough to use for driving was a big hassle.

Experimented with powering the car from the usual 5 AA batteries, then the same USB power bank that's powering the RPi (welcome to current spike crash land, the motors can suck up to 2A when stalling), and a separate USB power bank ("Hmm, it's a bit sluggish." The steering motor has two 5.6 ohm resistors wired to it and the L298N has a voltage drop of almost 1.5V at 5V, giving me about 1W of steering power on USB. The original controller uses a tiny MX1508, which has a voltage drop of something like 0.05V. Coupled with the 7.5V battery pack, the car originally steers at 5W. So, yeah, a 5x loss in snappiness. Swap the motor controller for an MX1508 and replace the resistors with 2.7R or 2.2R? Or keep the L298N and use 1.2R resistors.) Then went back to the 5 AA batteries. Screw it, got some NiMHs.

Tip: Don't mix up GND and +7.5V on the L298N. It doesn't work and gets very hot after a few seconds. Thankfully that didn't destroy the RPi. Nor did plugging the L298N +5V and GND to RPi +5V and GND -- you're supposed to use a jumper to bridge the +12V and GND pins on the L298N, then plug just the GND to the RPi GND (at least that's my latest understanding). I... might be wrong on the RPi GND part, the hypothesis is that sharing ground between the L298N and the RPi gives a ground reference for the motor control pins coming from the RPi.

Tip 2: Don't wipe off the solder from the tip of the iron, then leave it at 350C for a minute. It'll turn black. The black stuff is oxides. Oxides don't conduct heat well and solder doesn't stick to it. Wipe / buff it off, then re-tin the tip of the iron. The tin should stick to the iron and form a protective layer around it.

Destroyed the power switch of the car. A big power bank in a metal shell, sliding around in a crash, crush. The switch controlled the circuit of the AA battery pack. Replaced it with a heavy-duty AC switch of doom.

Cut a USB charging cable in half to splice the power wires into the motor controller. Hey, it works! Hey, it'd be nicer if it was 10 cm longer.

Cut a CAT6 cable in half and spliced the cut ends into two RJ45 wall sockets. Plugged the other two ends into a router. He he he, in-socket firewall.

Got a cheapo digital multimeter. Feel so EE.

Thinking of surface mount components. Like, how to build with them without the pain of soldering and PCB production. Would be neat to build the circuit on the surface of the car seats.

4-color printer with conductive, insulating, N, and P inks. And a scanner to align successive layers.

The kid really likes buttons that turn on LEDs. Should add those to the car.

Hey, the GPIO lib has stuff for I2C and SPI. Hey, these super-cheap ESP32 / ESP8266 WiFi boards look neat. Hey, cool, a tiny laser ToF rangefinder.

Man, the IoT rabbit hole runs deep.

(Since writing the initial part, I swapped the L298N for an MX1508 motor controller, and the D415 for a small Raspberry Pi camera module. And got a bunch of sensors, including an ultrasound rangefinder and the tiny laser rangefinder.)

2018-09-27

1 GB/s from Samba

Hey, finally on the fast network (GPU died, so I replaced it with an IB card :P)

In other news, new workstation time.

2018-08-05

Three thoughts

Energy

We're living on a grain of sand, five meters away from a light bulb. Of all the solar energy falling onto our tiny grain of sand, we are currently able to utilize 0.01%. The total energy falling onto our tiny grain of sand is roughly 0.00000005% of the total output of the light bulb.

Matter

Gravity sorting. In a gravity well, denser particles end up below lighter particles, given time. Here's a list of densities of common elements. It matches the distribution of elements on Earth: air above water, water above silicon, silicon above iron. Note that the rare elements tend to be at the bottom of the list. There are three reasons for that. 1) Heavy elements are rare in the Universe. You need supernovas to produce them. 2) Heavy elements get gravity sorted below iron and mostly reside in the middle of the planet. 3) Some elements - like gold - don't oxidize easily, so they don't bubble up as lighter oxides. For a counter-example, tungsten has a density similar to gold's (19.3 g/cm3), but oxidizes as wolframite (Fe,Mn)WO4, which has a density of 7.5 g/cm3 (close to elemental iron). As a result, the annual global tungsten production is 75,000 tons, whereas the annual gold production is 3,100 tons.


Elemental abundances from Wikipedia

The Earth formed from dust in the protoplanetary disk, similarly to other planets and objects in the Solar System. As a result, asteroids should have a similar overall composition to the Earth.

Gravity sorting acts slower on less massive objects. You should see relatively more heavy elements at the surface of less massive objects. The force of gravity is also weaker, making the grains of matter less densely packed. This should make mining asteroids less energy-intensive compared to mining on Earth. Asteroid ores should also be more abundant in metals compared to Earth ores. At the extreme end, you'd have iridium, which is 500x more common on asteroids than in the Earth's crust. Couple that with - let's say - a factor-of-two energy reduction compared to Earth mining, and an asteroid mine could generate 1000x the iridium per watt of an Earth-based operation.

Suppose you want to have reusable space launch vehicles to minimize launch costs. Rockets that land. Your customers pay you to launch their stuff to space, then you recover your rocket to save on costs. You don't want to recover the stage that gets to LEO because it'd be too expensive. But what if the returning LEO vehicle brought back a ton of gold. That'd be worth $40 million at today's prices. And the amount of gold brought back this way would be small enough (say, 25 tons per year) that it wouldn't put much dent into gold demand or affect the price of gold. If anything, you could sell it as Star Gold for jewelry at 10x the spot price. Even at spot price, it'd go a long way towards covering the cost of the entire launch.

Computation

In a few decades, it looks likely that we'll be able to simulate humans in real-time using a computer the size of a match box. Match box humans. If it takes half a square meter of solar panel to power one match box human, you could sustain a population of 2 trillion in the US alone. If you were to use fusion to convert all the cosmic dust that falls on Earth (let's say, 100 tons a day) to energy at a 1% mass-to-energy ratio, it would generate around 1 PW, which could sustain 10 trillion match box humans running at 100 watts per person.
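
A quick back-of-the-envelope check of that in Python:

c = 3e8                      # speed of light, m/s
dust_kg_per_day = 100e3      # 100 tons of cosmic dust a day
joules_per_day = dust_kg_per_day * c**2 * 0.01   # 1% mass-to-energy
watts = joules_per_day / 86400
print(watts / 1e15)          # ~1 PW
print(watts / 100 / 1e12)    # ~10 trillion 100 W humans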

2018-08-01

The carbon capture gambit

To control the climate of the planet, stopping the loss of carbon to air is not enough. To gain control of the temperature of the planet and the carbon economy, we need to mine the carbon that's been lost into the air.

There are proposed air-mining mechanisms for carbon that run at a cost of 3-7% of GDP. And that's before significant optimizations. To gain control of the climate and amass vast carbon reserves, we would only need a sum of money equivalent to a few years' worth of economic growth. If we invest now, we'll get a massive head start on future mining operations by other countries, and will end up in an immensely powerful geopolitical position. Not only will we have control over the climate of the planet, we also end up in the possession of all the carbon lost into the air.

Control over carbon will give us control over the carbon-based fuel economy. Most of the accessible carbon on the planet is right there in the air, ready for the taking.

Every year, more than 7 gigatons of carbon is lost into the atmosphere in the form of CO2. We can estimate the market value of this lost carbon using the price of coal. Coal is around 50% carbon, with the rest composed of impurities. The price of coal is around $40 per ton. Pure carbon could be twice as valuable as coal. At $80 per ton, 7 gigatons of carbon would be worth $560 billion. More than half a trillion dollars. Every. Single. Year.
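
The estimate, spelled out:

carbon_tons = 7e9
price_per_ton = 80            # assuming 2x the ~$40/ton coal price
print(carbon_tons * price_per_ton / 1e9)   # ~560 (billion dollars)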

2018-07-28

Crank, the SUBSET-SUM edition

SUBSET-SUM: 2^n possible answers for an n-element input. The input is composed of n k-bit numbers; find a subset that sums up to a target number. Total input length is n*k bits. If n is log(k), solvable in time polynomial in n*k (2^n ~= k ~= n*k). If k is log2(n), also solvable in polynomial time (2^k unique numbers ~= n => you've got all the numbers, greedy solution always works). Therefore k = p(n).

If you pre-sort the numbers, you get a partially sorted binary enumeration. E.g. if you have 4 numbers in the input, you can state your answers as binary numbers from 0001 to 1111. Because the numbers are sorted, you know that 0100 > 0010, regardless of the numbers involved.
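
To make that concrete, here's a tiny Python illustration (the numbers are mine; bit i of the mask picks the i-th smallest input number):

nums = sorted([3, 5, 11, 20])    # arbitrary example input

def subset_sum(mask):
    return sum(n for i, n in enumerate(nums) if mask & (1 << i))

# 0b0100 picks a larger element than 0b0010, whatever the values are:
assert subset_sum(0b0100) > subset_sum(0b0010)   # 11 > 5 here
# 0b0110 vs 0b1001 is ambiguous, it depends on the actual numbers:
print(subset_sum(0b0110), subset_sum(0b1001))    # 16 vs 23 here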

In addition to ordered bit patterns, the output space is composed of ambiguous bit patterns. Think of 0110 and 1001. You can't tell which one is larger without evaluating the sum in some way.

The kickers are that for any given bit pattern, you can generate an ordered pattern that is smaller or larger than it. And the number of ambiguous bit patterns between the two ordered patterns grows exponentially with n. Indeed, the number of ambiguous bit patterns for a given bit pattern grows exponentially with n as well.

Think of this 24-bit pattern: 000000111111111111000000. It's got the middle 12 bits set, surrounded by 6 zero bits on either side. Because of the ordered pattern properties (if you move a bit up, the sum is larger, if you move a bit down, the sum is smaller), you know that any bit pattern with 12 bits set and zeroes above the 18th bit is equal to or smaller than the mid-bits-set pattern. Likewise, any bit pattern with 12 bits set and zeroes below the 6th bit is equal to or larger than the mid-bits-set pattern. The bit patterns ambiguous to the mid-bits-set pattern are the ones where you have bits moved both above and below the pattern. This would constitute around 2^12 patterns, or 2^(n/2).

You might think that this is pretty nice to use for solution search, but note that the ordered pattern comparison only decreases the size of the output space under consideration by 2^(-n/4). With 24 bits, this comparison would reduce your output space to 63/64 of the original. With 28 bits, 127/128. With 32 bits, 255/256. And so on.

Even if you manage to use some fancy combination of comparisons and pattern recognition to rule out that portion of the search space with each comparison, you're still running at 2^(n/4) complexity.

Digression: The SUBSET-SUM search graph is composed of n layers. If you can solve one layer in p(n) time, you can solve the problem in p(n) time (because n*p(n) ~ p(n)). And of course, if you can't solve a layer in p(n), you can't solve the full problem either. This doesn't really help much, since the number of sums in the middle layer grows exponentially with n, and the number of unique sums grows exponentially with k. (If you think of the middle layer as a line with a little tick mark for every number generated on it, adding one to n up-to-doubles the number of tick marks on the line. Adding one to k doubles the length of the line. Adding one to n increases the size of the input by k bits. Adding one to k increases the size of the input by n bits.)

Digression 2: You can easily generate a number whose distance from the target number is at most max(input): just sum the numbers up until you reach a number that's higher than the target. Of course, max(input) is around 2^k. I believe you could also generate a number that has a distance around the difference between two numbers in the input (greedily swap numbers to move the generated number towards the target). The difference between two adjacent input numbers is around 2^k/n.

Digression 3: Even if you could get to p(n) distance from the target, you'd have to then iterate towards the target. You can generate an ordered pattern in the right direction, but that'll jump you an exponential distance. And after you've jumped to the closest ordered pattern, you need to find the next largest number for it. And finding the next largest number for a given bit pattern is tough. You need to resolve the ambiguous patterns for it. Which are exponential in count. And where every evaluation of a pattern yields you a portion of the solution that shrinks exponentially with n. If so, finding the next largest number to a pattern sounds like it's harder than NP (how would you verify the adjacency in polynomial time?)

Digression 4: Let me propose a series of simplified problems for SUBSET-SUM. First, let's limit ourselves to problems where a solution exists, and where exactly half of the numbers are used for the generated number. The first simplified problem is finding an adjacent number for such a generated number. The second is finding a number that's p(n) distant from a generated number. The third is finding the actual generated number. I don't think any of these can be solved in p(n)...

2018-07-05

Even more crank on P != NP

Size limits for the input and the output

If the input is exponentially smaller than the output, the problem is not in NP. If the input is exponentially (base 2) larger than the output, the problem is in P.

In the first case, the output is exponentially longer than the input and therefore takes exponentially longer to print out, even on a non-deterministic machine. Even if you could print it out instantly, you need to check the output with a deterministic verifier that needs to run in polynomial time to the original input, and reading the exponentially larger output into the verifier would take exponential time to the original input. So exponentially longer outputs are not in NP.

In the second case, if you have a polynomial time deterministic verification algorithm, you can create a solver by adding an output space enumerator in front of it. If the output is exponentially smaller than the input, hey, you've got a polynomial time solver. If |output| = log2(|input|), enumerating the outputs takes 2^|output| = 2^log2(|input|) = |input| steps. From NP, you have the verifier that runs in p(|input|), so a solver that enumerates the output space and tries the verifier on each possible solution would run in 2^|output| * p(|input|), which becomes |input| * p(|input|) if the output is exponentially smaller than the input like above. (So any problems in NP where the output is a single bit are also in P.)
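
As a Python sketch (verify() here is a stand-in for whatever polynomial-time verifier the problem has, not any particular one):

from itertools import product

def solve(input_bits, output_len, verify):
    # Brute-force all 2^output_len candidate outputs.
    # When output_len == log2(len(input_bits)), this loop runs
    # len(input_bits) times, so the whole solver stays polynomial.
    for candidate in product([0, 1], repeat=output_len):
        if verify(input_bits, candidate):
            return candidate
    return None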

Due to these size limits, for P ?= NP the size of the output must be polynomial(-ish??) to the size of the input.

Size limits for the solver

How about the size of the deterministic solver relative to the input? If the deterministic solver machine is by necessity exponentially larger than the input (by necessity meaning that for any given part of the machine, there exists an input that makes the machine access that part), there also exists an input that causes the machine to move the tape a distance exponential in the size of the input, which takes exponential time relative to the input. So the deterministic solver needs to have a size at most polynomial in the size of the input.

For the lower bound, consider the case where the machine extracts zero information from the input. This machine just enumerates over the output space, checking each answer using the verifier. The runtime of the machine is 2^|output| * p(|input|), the code is fixed in size, and it uses |output| extra tape as memory. At the other end is the answer table machine: it's an array of precomputed solutions where the key is the input. The size of the answer table machine is |output| * 2^|input| (and due to the above tape movement limitations, it will take exponential time to run relative to some input.)

Knowing nothing about the input is out, and knowing everything about the input is out. Every bit in the program encodes information about the input-output-relation, allowing the solution search to proceed faster than brute-force enumeration when the program encounters an input that matches the encoded information. The program can also learn from the input, but it can't learn an exponential amount of information from the input (because that'd take exponential time.) Wild guess: If the amount of information used by the program is exponentially less than the amount of information in the input-output-relation, the program would end up using brute-force search over parts of the output that are polynomial in size to the input, which would take exponential time to the input. If true, there would be no sub-polynomial-space programs for P ?= NP that run in polynomial time.

Given the existence of the deterministic polynomial time verifier, we know that you can extract information from the input-output-pair in polynomial time, even though the amount of information extracted is very small. How small? Well, at least you get |output| / 2^|output| bits of info. After trying all 2^|output| outputs, you would have |output| bits and a correct output. If there are several correct outputs, you'd gain numberOfOutputs * |output| / 2^|output| bits. If numberOfOutputs grows exponentially with the output size, this would give you a polynomial time solver. But if this were the case, someone would've noticed it. Another wild guess: the amount of information we get out of the verifier shrinks exponentially in relation to the output size (and therefore the input size, which has a polynomial size relation to the output).

The number of input enumerations is 2^|input|, ditto for the output 2^p(|input|). We have a deterministic polynomial time verifier that can extract solution information from the input, but the amount of information extracted looks to be a logarithm of the input size. We also have a hypothetical solver program that encodes a polynomially growing amount of information about the input-output-relation, allowing it to avoid brute-force search whenever it encounters an input that matches the encoded information.

If the number of patterns that can be encoded in the input grows exponentially with the input length, then as the problem size grows towards infinity the question becomes: can you discern between an infinite number of input patterns with a zero-length solver program and a verifier that can extract zero bits of information from the input? For P to be NP, the number of patterns in the input needs to grow as a polynomial to the length of the input.

2018-07-02

Some more rubbish on P != NP

Feeling crank today, so:

Non-deterministic machines are one-bit oracles. Given a branch in execution, they always follow the correct path. You can use this property to write a universal non-deterministic machine:

readInput();

NonDeterministic done, correctOutput;

while (!done) {
  outputBit(correctOutput);
}

Its execution time is linear to the size of the output. There are still problems that take non-polynomial time compared to the input. For instance, "given an input of n bits, print out n! bits".

The question in P ?= NP then is: "given a problem where the output size is a polynomial of the input size, is it possible to emulate a non-deterministic machine using a deterministic machine with at most a polynomial slowdown". Since the ND machine takes a number of steps linear to the output size, which is polynomial of the input size, it solves the problem in polynomial time to the input. If the deterministic machine can emulate the ND machine at a polynomial slowdown, its running time is a polynomial of the output size, which is a polynomial of a polynomial of the input size.

Maybe you could also ask if it's possible to prove that there exists a problem for which an optimal program running on a deterministic machine requires a number of steps exponential to the size of its output.

Make a bridge from output size to input size? ...exponential to the size of its output and where the output size grows linearly with input size.

2018-06-07

New CPUs

Lots of cores! More cores! AMD bringing 32-core EPYC to the desktop, Intel bringing top-end 28-core Xeons to the desktop.

In terms of compute, it's getting pretty interesting! 32 cores at 4 GHz and SIMD + ispc to SPMD it up. With 8-wide SIMD and 1 cycle FMA, you'd be looking at 2 TFLOPS. If you could push that to 16-wide (or more usefully, double the core count), 4 TFLOPS. That's discrete GPU territory. Not to forget double precision: 1 TFLOPS DP beats all but the uber-expensive compute GPUs.
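
The peak-FLOPS arithmetic, spelled out:

cores, ghz, simd_width = 32, 4, 8
flops_per_cycle = simd_width * 2              # one 8-wide FMA = 16 flops
print(cores * ghz * flops_per_cycle / 1000)   # ~2 TFLOPS single precision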

If you still have that /usr/bin/directx thing, you wouldn't need a GPU at all. Just plug a display to the PCIe bus and send frames to it.

Memory bandwidth is still an issue, a few GB of HBM would help. And it'd be nice to plug in extra CPUs and RAM to PCIe slots.

2018-05-01

WAN OpenVPN at 915 Mbps


$ iperf -c remote -P4
...
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   284 MBytes   238 Mbits/sec
[  3]  0.0-10.0 sec   278 MBytes   233 Mbits/sec
[  6]  0.0-10.0 sec   275 MBytes   230 Mbits/sec
[  5]  0.0-10.1 sec   259 MBytes   216 Mbits/sec
[SUM]  0.0-10.1 sec  1.07 GBytes   915 Mbits/sec

OK, that's working.

2018-04-23

Compute-NVMe

So there's this single-socket EPYC TYAN server with 24 NVMe hotswap bays. That's... a lot of NVMe. And they're basically PCIe x4 slots.

What if you turned those NVMe boxes into small computers? A beefy, well-cooled ARM SoC with 32 gigs of RAM and a terabyte of flash, the RAM wired to the SoC with a wide bus. You might get 200 GB/s memory bandwidth and 10 GB/s flash bandwidth. The external connectivity would be through the PCIe 4.0 x4 bus at 8 GB/s or so.

The ARM chip would perform at around a sixth the perf of a 32-core EPYC, but it'd have a half-teraFLOP GPU on it too. With 24 of those in a 2U server, you'd get four 32-core EPYCs worth of CPU compute, and nearly a Tesla V100 of GPU compute. But. You'd also have aggregate 4.8 TB/s memory bandwidth and 240 GB/s storage bandwidth. In a 2U. Running at, what, 10 W per card?
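
Spelling out those aggregates for the hypothetical 24-card box:

cards = 24
print(cards * 200 / 1000)   # 4.8 TB/s aggregate memory bandwidth
print(cards * 10)           # 240 GB/s aggregate storage bandwidth
print(cards / 6)            # ~4 EPYCs worth of CPU compute
print(cards * 0.5)          # 12 TFLOPS, ~80% of a Tesla V100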

Price-wise, the storage and RAM would eclipse the price of the ARM SoC -- maybe $700 for the RAM and flash, then $50 for the SoC. Put two SoCs in a single box, double the compute?

Anyway, 768 GB of RAM, 24 TB of flash, 128 x86 cores of compute, plus 80% of a Tesla V100, for a price of $20k. Savings: $50k. Savings in energy consumption: 800 W.

OpenVPN over fast(er) links

Tested OpenVPN with 65k tun-mtu on the IPoIB network. It does 5-6 Gbps, compared to the 20-25 Gbps raw network throughput. I was surprised it managed 6 Gbps in the first place. "Oh what did I break now, why does my test run at 6 Mbps ... oh wait, that's Gbps."

Another problem to track is that on the internal GbE network, the VPN runs at 900+ Mbps. But when connecting to the WAN IP, it only manages 350 Mbps. A mystery wrapped in an enigma. (It's probably the underpowered router kicking me in the shins again. Use one of the fast computers as a firewall, see if that solves the problem.)

2018-04-21

rdma-pipe

Haven't you always wanted to create UNIX pipes that run from one machine to another? Well, you're in luck. Of sorts. For I have spent my Saturday hacking on an InfiniBand RDMA pipeline utility that lets you pipe data between commands running on remote hosts at multi-GB/s speeds.

Unimaginatively named, rdma-pipe comes with the rdpipe utility that coordinates the rdsend and rdrecv programs that do the data piping. The rdpipe program uses SSH as the control channel and starts the send and receive programs on the remote hosts, piping the data through your commands.

For example


  # The fabulous "uppercase on host1, reverse on host2"-pipeline.
  $ echo hello | rdpipe host1:'tr [:lower:] [:upper:]' host2:'rev'
  OLLEH

  # Send ZFS snapshots fast-like from fileserver to backup_server.
  $ rdpipe backup@fileserver:'zfs send -I tank/media@last_backup tank/media@head' backup_server:'zfs recv tank-backup/media'

  # Decode video on localhost, stream raw video to remote host.
  $ ffmpeg -i sintel.mpg -pix_fmt yuv420p -f rawvideo - | rdpipe playback_host:'ffplay -f rawvideo -pix_fmt yuv420p -s 720x480 -'

  # And who could forget the famous "pipe page-cached file over the network to /dev/null" benchmark!
  $ rdpipe -v host1:'</scratch/zeroes' host2:'>/dev/null'
  Bandwidth 2.872 GB/s

Anyhow, it's quite raw, new, exciting, needs more eyeballs and tire-kicking. Have a look if you're on InfiniBand and need to pipe data across hosts.

2018-04-17

IO limits

It's all about latency, man. Latency, latency, latency. Latency drives your max IOPS. The other aspects are how big are your IOs and how many can you do in parallel. But, dude, it's mostly about latency. That's the thing, the big kahuna, the ultimate limit.

Suppose you've got a workload. Just chasing some pointers. This is a horrible workload. It just chases tiny 8-byte pointers around an endless expanse of memory, like some sort of demented camel doing a random walk in the Empty Quarter.

This camel, this workload, it's all about latency. How fast can you go from one pointer to the next. That gives you your IOPS. If it's from a super-fast spinning disk with a 10 ms latency, you'll get maybe like a 100 IOPS. From NVMe flash SSD with 0.1 ms latency, 10000 IOPS. Optane's got 6-10 us latency, which gets you 100-170k IOPS. If it's, I don't know, a camel. Yeah. Man. How many IOPS can a camel do? A camel caravan can travel 40 kilometers per day. The average distance between two points in Rub' al Khali? Well, it's like a 500x1000 km rectangle, right? About 400 kilometers[1] then. So on average it'd take the camel 10 days to do an IO. That comes down to, yeah, about 1160 nanoIOPS.
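
The camel math, spelled out:

km_per_day = 40
avg_hop_km = 400                    # from the footnote [1] simulation
days_per_io = avg_hop_km / km_per_day     # 10 days per pointer hop
iops = 1 / (days_per_io * 86400)
print(iops * 1e9)                   # ~1160 nanoIOPS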

Camels aren't the best for random access workloads.

There's also the question of the IO size. If you can only read one byte at a time, you aren't going to get huge throughput no matter how fast your access times. Imagine a light speed interconnect with a length of 1.5 millimeters. That's about a 10 picosecond roundtrip. One bit at a time, you could do 12.5 GB per second. So, while that's super fast, it's still an understandable number. And that's the best-case scenario raw physical limit.

Now, imagine our camel. Trudging along in the sand, carrying a saddle bag with decorative stitchwork, tassels swinging from side to side. Inside the pouches of the saddle bags are 250 kilograms of MicroSD cards at 250 mg each, tiny brightly painted chips protected from the elements in anti-static bags. Each card can store 256 GB of data and the camel is carrying a million of them. The camel's IO size is 256 petabytes. At 1160 nanoIOPS, its bandwidth is about 296 GB/s. The camel has a higher bandwidth than our light speed interconnect. It's an FTL camel.

Let's add IO parallelism to the mix. Imagine a caravan of twenty camels, each camel carrying 256 petabytes of data. An individual camel has a bandwidth of about 296 GB/s, so if you multiply that by 20, you get the aggregate caravan bandwidth: almost 6 TB/s. These camels are a rocking interconnect in a high-latency, high-bandwidth world.

Back to chasing 8-byte pointers. All we want to do is find one tiny pointer, read it, and go to the next one. Now it doesn't really matter how many camels you have or how much each can carry, all that matters is how fast they can go from place to place. In this kind of scenario, the light speed interconnect would still be doing 12.5 GB/s (heck, it'd be doing 12.5 GB/s at any IO size larger than a bit), but our proud caravan of camels would be reduced to 0.0000093 bytes per second. Yes, that's bytes. 9.3 microbytes per second.

If you wanted to speed up the camel network, you could spread the camels evenly over the desert. Now the maximum distance a camel has to travel to the data is divided by the number of camels serving the requests. This works like a Redundant Array of Independent Camels, or RAIC for short. We handwave away the question of how the camels synchronize with each other.

Bringing all this back to the mundane world of disks and chips, the throughput of a chip device at QD1 goes through two phases: first it runs at maximum IOPS up to its maximum IO block size, then it runs at flat IOPS up to its maximum parallelism. In theory this would give you a linear throughput increase with increasing block size until you run into the device throughput limit or the bus throughput limit.

You can roughly calculate the maximum throughput of a device by multiplying its IOPS by its IO block size and its parallelism. E.g. if a flash SSD can do ten thousand 8k IOPS and 16 parallel requests, its throughput would be 1.28 GB/s. If you keep the controller and the block size and replace the flash chips with Optane that can do 10x as many QD1 IOPS, you could reach 12.8 GB/s throughput. PCIe x16 Optane cards anyone?
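
The same model as a throwaway Python function:

def throughput(iops, block_bytes, parallelism):
    # Device throughput model: max IOPS x IO block size x parallelism.
    return iops * block_bytes * parallelism

print(throughput(10_000, 8_000, 16) / 1e9)    # 1.28 GB/s flash SSD
print(throughput(100_000, 8_000, 16) / 1e9)   # 12.8 GB/s Optane swap-in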

To take it a step further, DRAM runs at 50 ns latency, which would give you 20 million IOPS, or 200x that of Optane. So why don't we see RAM throughput in the 2.5 TB/s region? First, DDR block size is 64 bits (or 8 bytes). Second, CPUs only have two to four memory channels. Taking those numbers at face value, we should only be seeing 320 MB/s to 640 MB/s memory bandwidth.

"But that doesn't make sense", I hear you say, "my CPU can do 90 GB/s reads from RAM!" Glad you asked! After the initial access latency, DRAM actually operates in a streaming mode that ups the block size eight-fold to 64 bytes and uses the raw 400 MHz bus IOPS [2]. Plugging that number into our equation, we get a four channel setup running at 102.4 GB/s.

To go higher than that, you have to boost that bus. E.g. HBM uses a 1024-bit bus, which gets you up to 400 GB/s over a single channel. With dual memory channels, you're nearly at 1 TB/s. Getting to camel caravan territory. You'll still be screwed on pointer-chasing workloads though. For those, all you want is max MHz.

[1] var x=0, samples=100000; for (var i=0; i < samples; i++) { var dx = 500*(Math.random() - Math.random()), dy = 1000*(Math.random() - Math.random()); x += Math.sqrt(dx*dx + dy*dy); } x /= samples;

[2] Please tell me how it actually works, this is based on incomplete understanding of Wikipedia's incomplete explanation. As in, what kind of workload can you run from DRAM at burst rate.

2018-04-12

RDMA cat

Today I wrote a small RDMA test program using libibverbs. That library has a pretty steep learning curve.

Anyhow. To use libibverbs and librdmacm on CentOS, install rdma-core-devel and compile your things with -lrdmacm -libverbs.

My test setup is two IBM-branded Mellanox ConnectX-2 QDR InfiniBand adapters connected over a Voltaire 4036 QDR switch. These things are operating at PCIe 2.0 x8 speed, which is around 3.3 GB/s. Netcat and friends get around 1 GB/s transfer rates piping data over the network. Iperf3 manages around 2.9 GB/s. With that in mind, let's see what we can reach.

I was basing my test programs on these amazingly useful examples: https://github.com/linzion/RDMA-example-application https://github.com/jcxue/RDMA-Tutorial http://www.digitalvampire.org/rdma-tutorial-2007/notes.pdf and of course http://www.rdmamojo.com/ . At one point after banging my head on the ibverbs library for a bit too long I was thinking of just using MPI to write the thing and wound up on http://mpitutorial.com - but I didn't have the agility to jump from host-to-host programs to strange new worlds, so kept on using ibverbs for these tests.

First light

The first test program was just reading some data from STDIN, sending it to the server, which reverses it and sends it back. From there I worked towards sending multiple blocks of data (my goal here was to write an RDMA version of cat).

I had some trouble figuring out how to make the two programs have a repeatable back-and-forth dialogue. First I was listening to too many events with the blocking ibv_get_cq_event call, and that was hanging the program. Only call it as many times as you're expecting replies.

The other bug was that my send and receive work requests shared the sge struct, and the send part of the dialogue was setting the sge buffer length to 1, since it was only sending acks back to the other server. Set it back to the right size before sending each work request, problem solved.

Optimization

Once I got the rdma-cat working, performance wasn't great. I was reading in a file from page cache, sending it to the receiver, and writing it to the STDOUT of the receiver. The program was sending 4k messages, doing 4k acks, and a mutex-requiring event ack after each message. This ran at around 100 MB/s. Changing the 4k acks to single-byte acks and doing the event acks for all the events at once got me to 140 MB/s.

How about doing larger messages? Change the message size to 65k and the cat goes at 920 MB/s. That's promising! One-megabyte messages and 1.4 GB/s. With eight meg messages I was up to 1.78 GB/s and stuck there.

I did another test program that was just sending an 8 meg buffer to the other machine, which didn't do anything to the data. This is useful to get an optimal baseline and gauge perf for a single process use case. The test program was running at 2.9 GB/s.

Adding a memcpy to the receive loop roughly halved the bandwidth to 1.3 GB/s. Moving to a round-robin setup with one buffer receiving data while another buffer is having the data copied out of it boosted the bandwidth to 3 GB/s.
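
The round-robin idea in miniature (a Python-shaped sketch, not the actual ibverbs code; post_receive, wait_completion and drain stand in for the verbs calls):

def receive_loop(post_receive, wait_completion, drain, bufs):
    post_receive(bufs[0])
    i = 0
    while True:
        n = wait_completion()        # bufs[i] is now full, n bytes in it
        if n == 0:
            break                    # sender closed the stream
        post_receive(bufs[1 - i])    # NIC starts filling the other buffer...
        drain(bufs[i], n)            # ...while we copy this one out
        i = 1 - i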

The send loop could read in data at 5.8 GB/s from the page cache, but the RDMA pipe was only doing 1.8 GB/s. Moving the read to happen right after each send got them both moving in parallel, which got the full rdma_send < inputfile ; rdma_recv | wc -c pipeline running at 2.8 GB/s.

There was an issue with the send buffer contents getting mangled by an incoming receive. Gee, it's almost like I shouldn't use the same buffer for sending and receiving messages. Using a different buffer for the received messages resolved the issue.

It works!

I sent a 4 gig file and ran diff on it, no probs. Ditto for files smaller than the buffer size, and for small strings sent with echo.

RDMA cat! 2.9 GB/s over the network.

Let's try sending video frames next. Based on these CUDA bandwidth timings, I should be able to do 12 GB/s up and down. Now I just need to get my workstation on the IB network (read: buy a new workstation with more than one PCIe slot.)

[Update] For the heck of it, I tried piping through two hosts.

[A]$ rdma_send B < inputfile
[B]$ rdma_recv | rdma_send C
[C]$ rdma_recv | wc -c

2.5 GB/s. Not bad, could do networked stream processing. Wonder if it would help if I zero-copy passed the memory regions along the pipe.

And IB is capable of multicast as well...

2018-04-06

4k over IB

So, technically, I could stream uncompressed 4k@60Hz video over the Infiniband network. 4k60 needs about 2 GB/s of bandwidth, the network goes at 3 GB/s.

This... how would I try this?

I'd need a source of 4k frames. Draw on the GPU to a framebuffer, then glReadPixels (or CUDA GPUDirect RDMA). Then use IB verbs to send the framebuffer to another machine. Upload it to the GPU to display with glTexImage (or GPUDirect from the IB card).

And make sure that everything in the data path runs at above 2 GB/s.

Use cases? Extreme VNC? Combining images from a remote GPU and local GPU? With a 100Gb network, you could pull in 6 frames at a time and composite in real time I guess. Bringing in raw 4k camera streams to a single box over a switched COTS fabric.

Actually, this would be "useful" for me, I could stream interactive video from a big beefy workstation to a small installation box. The installation box could handle stereo camera processing and other input, then send the control signals to the rendering station. (But at that point, why not just get longer HDMI and USB cables.)

2018-04-02

Quick timings

NVMe and NFS, cold cache on client and server. 4.3 GiB in under three seconds.

$ cat /nfs/nvme/Various/UV.zip | pv > /dev/null
 4.3GiB 0:00:02 [1.55GiB/s]

The three-disk HDD pool gets around 300 MB/s, but once the ARC picks up the data it goes at NFS + network speed. Cold cache on the client.

$ echo 3 > /proc/sys/vm/drop_caches
$ cat /nfs/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:10 [ 1.5GiB/s]

Samba is heavier somehow.

$ cat /smb/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:13 [1.26GiB/s]

NFS over RDMA from the ARC, direct to /dev/null (which, well, it's not a very useful benchmark). But 2.8 GB/s!

$ time cat /nfs/hdd/Videos/*.mp4 > /dev/null

real    0m6.269s
user    0m0.007s
sys     0m4.056s
$ cat /nfs/hdd/Videos/*.mp4 | wc -c
17722791869
$ python -c 'print(17.7 / 6.269)'
2.82341681289

$ time cat /nfs/hdd/Videos/*.mp4 > /nfs/nvme/bigfile

real    0m15.538s
user    0m0.016s
sys     0m9.731s

# Streaming read + write at 1.13 GB/s

How about some useful work? Parallel grep at 3 GB/s. Ok, we're at the network limit, call it a day.

$ echo 3 > /proc/sys/vm/drop_caches
$ time (for f in /nfs/hdd/Videos/*.mp4; do grep -o --binary-files=text XXXX "$f" & done; for job in `jobs -p`; do wait $job; done)
XXXX
XXXX
XXXX
XXXX
XXXX

real    0m5.825s
user    0m3.567s
sys     0m5.929s

2018-03-26

InfiniBanding, pt. 4. Almost there

Got my PCIe-M.2 adapters, plugged 'em in, one of them runs at PCIe 1.0 lane speeds instead of PCIe 3.0, capping perf to 850 MB/s. And causes a Critical Interrupt #0x18 | Bus Fatal Error that resets the machine. Then the thing overheats, melts its connection solder, shorts, releases the magic smoke and makes the SSD PCB look like wet plastic. Yeah, it's dead. The SSD's dead too.

[Edit: OR.. IS IT? Wiped the melted goop off the SSD and tried it in the other adapter and it seems to be fine. Just has high controller temps, apparently a thing with the 960s. Above 90 C after 80 GB of writes. It did handle 800 GB of /dev/zero written to it fine and read everything back in order as well. Soooo, maybe my two-NVMe pool lives to ride once more? Shock it, burn it, melt it, it just keeps on truckin'. Disk label: "The Terminator"]

I only noticed this because the server rebooted and one of the mirrors in the ZFS pool was offline. Hooray for redundancy?

Anyway, the fast NVMe work pool is online, and it can kinda saturate the connection. It's down to one Samsung 960 EVO (was two), which is fast for reads, if maybe not the best for synced writes.

I also got a 280 gig Optane 900p. It feels like Flash Done Right. Low queue depths, low parallelism, whatever, it just burns through everything. Optane also survives many more writes than flash, it's really quite something. And it hasn't caught on fire yet. I set up two small partitions (10 GB) as ZFS slog devices for the two pools and now the pools can handle five-second write bursts at 2 GB/s.

Played with BeeGFS over the weekend. Easyish to get going, sort of resilient to nodes going down if you tune it that way, good performance with RDMA (netbench mode dd went at 2-3 GB/s). The main thing lacking seems to be the "snapshots, backups and rebuild from nodes gone bye" story.

Samba and NFS get to 1.4-1.8 GB/s per client, around 2.5 GB/s aggregate, with somehow high CPU usage on the server, even with NFSoRDMA. I'll see if I can hit 2 GB/s on a single client. Not really native-level random access perf though. The NVMe drive can feed two 1 GB/s clients. Fast filestore experiment mostly successful, if not completely.

Next up, wait for a bit of a lull in work, and switch my workstation to a machine that's got more than one PCIe slot. And actually hook it up to the fast network to take advantage of all this bandwidth. Then plonk the backup disks into one box, take it home, off-site backups yaaay.

Somehow I've got nine cables, four network cards, three computers, and a 36-port switch. Worry about that in Q3, time to wrap up with this build.

2018-03-20

InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw 3.3 GB/s between two native CentOS boxes. The server has two cards so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the firmware to the latest version on the ConnectX-2 cards makes ESXi detect them, which somehow breaks the PCI pass-through. With ESXi drivers, they work as ~20 GbE network cards that can be used by all of the VMs. But trying to use the pass-through from a VM fails with an IRQ error and with luck the entire machine locks up. So, I dropped ESXi from the machines for now.

ZFS

Been playing with ZFS with Sanoid for automated hourly snapshots and Syncoid for backup sync. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to snapshot, backup to local replica, backup to remote server, recovery from backup, per-file recovery from .zfs/snapshot, hot spares. Backup syncs seem to even work between pools of different sizes, I guess as long as the data doesn't exceed pool size. Hacked Sanoid to make it take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).
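
The hack boils down to something like this (a Python sketch with a hypothetical dataset name; the real version lives inside Sanoid's Perl):

import subprocess

def bytes_written(dataset):
    # 'written' = bytes written to the dataset since its latest snapshot
    out = subprocess.check_output(
        ["zfs", "get", "written", "-o", "value", "-p", "-H", dataset])
    return int(out.strip())

if bytes_written("tank/work") > 0:
    subprocess.run(["zfs", "snapshot", "tank/work@hourly"], check=True)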

Copied over data from the workstations, then corrupted the live mounted pool with dd if=/dev/zero over a couple disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even restored the snapshots, A+++.

After the successful restore, my Windows laptop bluescreened on me and corrupted the effect file I was working on. Git got me back to a 30-min-old version, which wasn't so great. So, hourly snapshots aren't good enough. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour. Resilvering a full disk should be somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep either of the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do, buy new disks and restore from the backup server :(

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs. An over-the-weekend scrub for the archive pool, and a nightly scrub on the fast SSD work pool. That is the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe -adapters are still on back order.

Plans?

I'll likely go with two pools: a work pool with two SSDs in RAID-1, and an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.

The reason for the two-volume system is to get predictable performance out of the SSD volume, without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive. Having 10 TB of usable space would go a long way. Doing parity array rebuilds on 6TB drives, ugh. Maybe a three-disk mirror instead.

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.

2018-03-12

Unified Interconnect

Playing with InfiniBand got me thinking. This thing is basically a PCIe to PCIe -bridge. The old kit runs at x4 PCIe 3 speeds, the new stuff is x16 PCIe. The next generation is x16 PCIe 4.0 and 5.0.

Why jump through all the hoops? Thunderbolt 3 is x4 PCIe over a USB-C connector. What if you bundle four of those, throw in some fiber and transceivers for long distance. You get x16 PCIe between two devices.

And once you start thinking of computers as a series of components hanging off a PCIe bus, your system architecture clarifies dramatically. A CPU is a PCIe 3 device with 40 to 64 lanes. DRAM uses around 16 lanes per channel.

GPUs are now hooked up as 16-lane devices, but could saturate 256 to 1024 lanes. Because of that, GPU RAM is on the GPU board. If the GPU had enough lanes, you could hook GPU RAM up to the PCIe bus with 32 lanes per GDDR5 chip. HBM is probably too close to the GPU to separate.

You could build a computer with 1024 lanes, then start filling them up with the mix of GPUs, CPUs, DRAM channels, NVMe and connectivity that you require. Three CPUs with seven DRAM channels? Sure. Need an extra CPU or two? Just plug them in. How about a GPU-only box with the CPU OS on another node. Or CPUs connected as accelerators via 8 lanes per socket as you'd like to use the extra lanes for other stuff. Or a mix of x86 and ARM CPU cards to let you run mixed workloads at max speed and power efficiency.

Think of a rack of servers, sharing a single PCIe bus. It'd be like one big computer with everything hotpluggable. Or a data center, running a single massive OS instance with 4 million cores and 16 PB of RAM.

Appendix, more devices

Then you've got the rest of the devices, and they're pretty well on the bus already. NVMe comes in 4-lane blocks. Thunderbolt 3, Thunderbolt 2 and USB 3.1 are 4, 2 and 1 lane devices. SAS and SATA are a bit awkward, taking up a bit more than 1 lane or a bit more than half a lane. I'd replace them with NVMe connectors.

Display connectors could live on the bus as well (given some QoS to keep up the interactivity). HDMI 2.1 uses 6 lanes, HDMI 2 is a bit more than 2 lanes. DisplayPort's next generation might go up to 7-8 lanes.

Existing kit

[Edit] Hey, someone's got products like this already. Dolphin ICS produces a range of PCIe network devices. They've even got an IP-over-PCIe driver.

[Edit #2] Hey, the Gen-Z Interconnect looks a bit like this: Gen-Z Interconnect Core Specification 1.0 Published

2018-03-11

InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- a very annoying lane spec outside of server gear: either it eats up a PCIe 3.0 x16 slot, or you end up running at half speed, and it's too slow to hit the full 32 Gbps of QDR InfiniBand. Oh yes, I plugged one into a four-lane slot, it gets half the RDMA bandwidth in tests.)

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.

IPoIB iperf2 does 2.9 GB/s with four threads. The single-threaded iperf3 goes 2.5 GB/s if I luck out in the CPU affinity lottery (some cores / NUMA nodes do 1.8 GB/s...)

NFS over RDMA, 64k random reads with QD8 and one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.

The bandwidth limit of PCIe 2.0 x8 that these InfiniBand QDR cards use is around 25.6 Gbps, or 3.2 GB/s. Testing with ib_read_bw, it maxes out at around 3 GB/s.

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only 128 byte MaxPayload), but can't complain.

And... there's an upgrade path composed of generations of obsoleted enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 5.6 GB/s per link. Then, pick up EDR / 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate a 16x slot into two 8x slots for IB or four 4x slots for NVMe. Or bifurcate both slots. Or get a dual-FDR/EDR card with a 16x connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)


2018-03-07

Quick test with old InfiniBand kit

Two IBM ConnectX-2 cards, hooked up to a Voltaire 4036 switch that sounds like a turbocharged hair dryer. CentOS 7, one host bare metal, other on top of ESXi 6.5.

Best I saw thus far: 3009 MB/s RDMA transfer. Around 2.4 GB/s with iperf3. These things seem to be CPU capped, top is showing 100% CPU use. Made an iSER ramdisk too, it was doing 1.5 GB/s-ish with ext4.

Will examine more next week. With later kernel and firmwares and whatnot.

The end goal here would be to get 2+ GB/s file transfers over Samba or NFS. Probably not going to happen but eh, give it a try.

That switch though. Need a soundproof cabinet.

2018-03-04

OpenVPN settings for 1 Gbps tunnel

Here are the relevant parts of the OpenVPN 2.4 server config that got me 900+ Mbps iperf3 on GbE LAN. The tunnel was between two PCs with high single-core performance, a Xeon 2450v2 and an i7-3770. OpenVPN uses 50% of a CPU core on the client & server when the tunnel is busy. For reference, I tried running the OpenVPN server on my WiFi router, it peaked out at 60 Mbps.

# Use TCP, I couldn't get good perf out of UDP. 

proto tcp

# tun or tap, roughly same perf
dev tun 

# Use AES-256-GCM:
#  - more secure than 128 bit
#  - GCM has built-in authentication, see https://en.wikipedia.org/wiki/Galois/Counter_Mode
#  - AES-NI accelerated, the raw crypto runs at GB/s speeds per core.

cipher AES-256-GCM

# Don't split the jumbo packets traversing the tunnel.
# This is useful when tun-mtu is different from 1500.
# With the default value, my tunnel runs at 630 Mbps; with mssfix 0 it goes to 930 Mbps.

mssfix 0

# Use jumbo frames over the tunnel.
# This reduces the number of packets sent, which reduces CPU load.
# On the other hand, now you need 6-7 MTU 1500 packets to send one tunnel packet. 
# If one of those gets lost, it delays the entire jumbo packet.
# Digression:
#   Testing between two VBox VMs on a i7-7700HQ laptop, MTU 9000 pegs the vCPUs to 100% and the tunnel runs at 1 Gbps.
#   A non-tunneled iperf3 runs at 3 Gbps between the VMs.
#   Upping this to 65k got me 2 Gbps on the tunnel and half the CPU use.

tun-mtu 9000

# Send packets right away instead of bundling them into bigger packets.
# Improves latency over the tunnel.

tcp-nodelay

# Increase the transmission queue length.
# Keeps the TUN busy to get higher throughput.
# Without QoS, you should get worse latency though.

txqueuelen 15000

# Increase the TCP queue size in OpenVPN.
# When OpenVPN overflows the TCP queue, it drops the overflow packets.
# Which kills your bandwidth unless you're using a fancy TCP congestion algo.
# Increase the queue limit to reduce packet loss and TCP throttling.

tcp-queue-limit 256

And here is the client config, pretty much the same except that we only need to set tcp-nodelay on the server:

proto tcp
cipher AES-256-GCM
mssfix 0
tun-mtu 9000
txqueuelen 15000
tcp-queue-limit 256

To test, run iperf3 -s on the server and connect to it over the tunnel from the client: iperf3 -c 10.8.0.1. For more interesting tests, run the iperf server on a different host on the endpoint LAN, or try to access network shares.

I'm still tuning this (and learning about the networking stack) to get a Good Enough connection between the two sites, let me know if you got any tips or corrections.

P.S. Here's the iperf3 output.

$ iperf3 -c 10.8.0.1
Connecting to host 10.8.0.1, port 5201
[  4] local 10.8.2.10 port 39590 connected to 10.8.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec    0   3.01 MBytes
[  4]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   6.00-7.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver

iperf Done.

2018-03-02

Fast-ish OpenVPN tunnel

500 Mbps OpenVPN throughput over the Internet, nice. Was aiming for 900 Mbps, which seems to work on LAN, but no cigar. [Edit: --tcp-nodelay --tcp-queue-limit 256 got me to 680 Mbps. Which is very close to non-tunneled line speed as measured by a HTTP download.]

OpenVPN performance is very random too. I seem to be getting different results just because I restarted the client or the server.

The config is two wired machines, each with a 1 Gbps fibre Internet connection. The server is a Xeon E3-1231v3, a 3.4 GHz Haswell Xeon. The client is my laptop with a USB3 GbE adapter and i7-7700HQ. Both machines get 900+ Mbps on Speedtest, so the Internet connections are fine.

My OpenVPN is set to TCP protocol (faster and more reliable than UDP in my testing), and uses AES-256-GCM as the cipher. Both machines are capable of pushing multiple gigaBYTES per second over openssl AES-256, so crypto isn't a bottleneck AFAICT. The tun-mtu is set to 9000, which performs roughly as well as 48000 or 60000, but has smaller packets which seems to be less flaky than big mtus. The mssfix setting is set to 0 and fragment to 0 as well, though fragment shouldn't matter over TCP.

Over iperf3 I get 500 Mbps between the endpoints. With HTTP, roughly that too. Copying from a remote SMB share on another host goes at 30 MB per second, but the remote endpoint can transfer from the file server at 110 MB per second (protip: mount.cifs -o vers=3.0). Thinking about it a bit, I need to test with a second NIC in the VPN box; right now VPN traffic might be competing with LAN traffic.

2018-02-28

Building a storage pyramid

The office IT infrastructure plan is something like this: build interconnected storage pyramids with compute. The storage pyramids consist of compute hooked up to fast memory, then solid state memory to serve mid-IOPS and mid-bandwidth workloads, then big spinning disks as archive. The different layers of the pyramid are hooked up via interconnects that can be machine-local or over the network.

The Storage Pyramid

Each layer of the storage pyramid has different IOPS and bandwidth characteristics. Starting from the top, you've got GPUs with 500 GB/s memory, connected via a 16 GB/s PCIe bus to the CPU, which has 60 GB/s DRAM. The next layer is also on the PCIe bus: Optane-based NVMe SSDs, which can hit 3 GB/s on streaming workloads and 250 MB/s on random workloads (parallelizable to maybe 3x that). After Optane, you've got flash-based SSDs that push 2-3 GB/s streaming accesses and 60 MB/s random accesses. At the next level, you could have SAS/SATA SSDs which are limited to 1200/600 MB/s streaming performance by the bus. And at the bottom lie the HDDs that can do somewhere between 100 to 240 MB/s streaming accesses and around 0.5-1 MB/s random accesses.

The device speeds guide us in picking the interconnects between them. Each HDD can fill a 120 MB/s GbE port. SAS/SATA SSDs plug into 10GbE ports, with their 1 GB/s performance. For PCIe SSDs and Optane, you'd go with either 40GbE or InfiniBand QDR, and hit 3-4 GB/s. Above the SSD layer, the interconnect bottlenecks start rearing their ugly heads.

You could use 200Gb InfiniBand to connect single DRAM channels at 20 GB/s, but even then you're starting to get bottlenecked at high DDR4 frequencies. Plus you have to traverse the PCIe bus, which further knocks you down to 16 GB/s over PCIe 3.0 x16. It's still sort of feasible to hook up a cluster with a shared DRAM pool, but you're pushing the limits.

Usually you're stuck inside the local node for DRAM-level performance. The other storage layers you can run over the network without losing much performance.

The most unbalanced bottleneck in the system is the CPU-GPU interconnect. The GPU's 500 GB/s memory is hooked to the CPU's 60 GB/s memory via a 16 GB/s PCIe bus. Nvidia's NVLink can hook up two GPUs together at 40 GB/s (up to 150 GB/s for the Tesla V100), but there's nothing that gets you faster GPU-to-DRAM access. This is changing with the advent of PCIe 4.0 and PCIe 5.0, which should be able to push 128 GB/s and create a proper DRAM interconnect between nodes and between the GPU and the CPU. The remaining piece of the puzzle would be some sort of 1 TB/s interconnect to link GPU memories together. [Edit] NVSwitch goes at 300 GB/s, which is way fast.

The Plan

Capacity-wise, my plan is to get 8 GB of GPU RAM, 64 GB of CPU RAM, 256 GB of Optane, 1 TB of NVMe flash, and 16 TB of HDDs. For a nicer-cleaner-more-satisfying progression, you could throw in a 4 TB SATA flash layer, but SATA flash is kind of DOA as long as you have NVMe and PCIe slots to use -- the price difference between NVMe flash and SATA flash is too small compared to the performance difference.

If I can score an InfiniBand interconnect or 40GbE, I'll stick everything from Optane on down into a storage server. It should perform at near-local speeds and simplify storage management: a shared pool of data that can be expanded and upgraded without having to touch the workstations. Would be cool to have a shared pool of DRAM too, but eh.

Now, our projects are actually small enough (half a gig each, maybe 2-3 of them under development at once) that I don't believe we will ever hit disk in daily use. All the daily reads and writes should be to client DRAM, which gets pushed to server DRAM and written down to flash / HDD at some point later. That said, those guys over there *points*, they're doing some video work now...

The HDDs are mirrored to an off-site location over GbE. They can likely saturate a single GbE link, so 2-3 links would be better for live mirroring. For off-site backup (maybe one that runs overnight), 1 GbE should be plenty.
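For scale, a quick sanity check in Python, assuming ~120 MB/s per GbE link and a full resync of the 16 TB tier (a sketch, not a measurement):

# full-sync time for the 16 TB HDD tier over n bonded GbE links
size_mb = 16 * 1000 * 1000  # 16 TB in MB
for links in (1, 2, 3):
    print(f"{links} x GbE: {size_mb / (120 * links) / 3600:.0f} h full sync")

A full resync over a single link is about a day and a half, which is why live mirroring wants the extra links.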

In addition to the off-site storage mirror, there are some clouds and stuff for storing compiled projects, project code and documents. These either don't need to sync fast or are small enough to do so anyway.

Business Value

Dubious. But it's fun. And it opens up uses that are either not doable in the cloud or way too expensive to maintain there. (As in, a single month of AWS costs more than what I paid for the server hardware...)

2018-02-27

Figments

Ultraviolet Fairies


"Can you make them dance?", Pierre asked. Innocent question, but this was a cloud of half a million particles. Dance? If I could make the thing run in the first place it would be cause for celebration.

The red grid of the Kinect IR illuminator came on. Everything worked perfectly again. Exactly twelve seconds later, it blinked out, as it had done a dozen times before. The visiting glass artist wasn't impressed with our demo.

Good tidings from France. The app works great on Pierre's newly-bought Acer laptop. A thunderhead was building on the horizon. The three-wall cave projection setup comes out with the wrong aspect ratio. I sipped my matcha latte and looked at the sun setting behind the cargo ships moored off West Kowloon. There's still 20 hours before the gig.

The motion was mesmerizing. Tiny fairies weaving around each other, hands swatting them aside on long trajectories off-screen. I clenched my fist and the fairies formed a glowing ring of power, swirling around my hand like a band of living light. The keyboard was bringing the escaping clouds to life, sending electric pulses through the expanding shells of fairies knocked off-course.

Beat. The music and Isabelle's motion become one, the cloud of fairies behind her blows apart from the force of her hands, like sand thrown in the air. Cut to the musicians, illuminated by the gridlines of the projection. Fingers beating the neon buttons of the keyboard, a shout building in the microphone. The tension running through the audience is palpable. Beat. The dancer's flowing dress catches a group of fairies. Isabelle spins and sends them flying.

The AI


A dot. I press space. The dot blinks out and reappears. One, two, three, four, five, six, seven, eight, nine, ten. I press space. The dot blinks out and reappears. Human computation.

Sitting at a desk in Zhuzhou. The visual has morphed into a jumble of sharp lines, rhythmically growing every second. The pulse driving it makes it glow. Key presses simulate the drums we want to hook up to it. Rotate, rotate, zoom, disrupt, freeze. The rapid typing beat pushes the visual to fill-rate destroying levels and my laptop starts chugging.

Sharp lines of energy, piercing the void around them. A network of connections shooting out to other systems. Linear logic, strictly defined, followed with utmost precision. The lines begin to _bend_, almost imperceptibly at first. A chaotic flow pulls the lines along its turbulent path. And, stop. Frozen in time, the lines turn. Slowly, they begin to grow again, but straight and rigid, linear logic revitalized. Beginning of the AI Winter.

The fucking Kinect dropped out again! I start kicking the wire and twisting the connector. That fucker, it did it once in the final practice already, of course it has to drop out at the start of the performance as well. Isabelle's taking it in stride, she's such a pro. If the AI doesn't want to dance with her, she'll dance around it. I push myself down from my seat. How about if I push the wire at the Kinect end, maybe taping it to the floor did ... oh it works again. I freeze, not daring to move, lying on the floor of the theater. Don't move don't move don't move, keep working you bastard! The glowing filter bubbles envelop Isabelle's hands, the computer is responsive again. Hey, it's not all bad. We could use this responsiveness toggle for storytelling, one more tool in the box.

We're the pre-war interexpressionist movement. Beautiful butterflies shining in the superheated flashes of atomic explosions.

2018-02-11

Office build, pt. 2

With pretty much all the furniture built (apart from fixing a shelf and a filing cabinet to a wall), the office is getting close to version 1. That is, more than half empty and running at 10% utilization.

What do I mean? Well, what I've got in there now are two big desks and a table, two office chairs and two random chairs. Add in a filing cabinet, a shelf and a couple of drawers, and .. that's it. There's a room in the back that's going to be the demo build area / VR dev area. And probably where the servers and printers are going to live.

As it is now, the main office has space for 3-4 more big desks. I'll haul one desk from home there. And that's it for now. Room to expand.

If you're looking for a place to work from in Tsuen Wan, Hong Kong, come check it out. Or if you want to work on building awesome interactive generative visuals -- fhtr.org & fhtr.net -style but with more art direction, interactivity and actual clients. A sales genius, a UI developer and a 3D artist I could use right away.


2018-02-09

Office build

Recently I've been setting up an office in Hong Kong, a.k.a. setting money on fire. Generating heat that enables us to set up installations in-house, tackle bigger projects and expand the company from its current staff of 1.25. Good plan? Great plan!

Very IKEA, much birch veneer. Spent a couple of enjoyable days building furniture; almost done for now. The plan is to get a big board in front of the office's glass door and mount a big TV on it to run a demo loop of interactive content for the other tenants in the building to enjoy.

Also planning some network infrastructure build. The place has a wired network from the previous tenant, with five Cat5e cables strung along the walls, which I hooked up to an 8-port gigabit switch. There are another five cables running inside the walls that terminate at ethernet sockets, but those seem to be broken somehow.

Getting gigabit fibre internet there next week, which was surprisingly affordable. Like, $100/month affordable. For WiFi, I got a small 802.11ac WiFi router. It's all a very small-time setup at the moment.

So hey, how about some servers and a network upgrade? Get a couple of used old Xeon servers, fill them with SSDs and high-capacity drives for a proper storage tower. Run a bunch of VMs for dev servers and NAS. And, hmm, how about QDR InfiniBand cards and cables for 32 Gbps network speed between the servers and wired workstations, with GigE and WiFi for everyone else. Sort of like a 2012 supercomputing node setup. The best part? All of this kit (apart from the new drives) would be around $1500, total.

That said, I'll start with one server and a direct connection to the workstation. It'll keep the budget down to $400ish and let me try it out to see if it actually works.

Next up: find the parts and get the thing built and running.

And um, yeah, do some sales too.




Here are some recent visuals done with our friends and partners at cityshake.com and plot.ly. For awesome sites and corporate app development in Austria, contact raade.at, whom we have also worked with on various projects. Interested? Of course you are!

Mail me at my brand spanking new email address hei@heichen.hk and buy some awesome visuals or a fancy interactive office/retail/hotel installation or a fabulous fashion show. Or your very own VM hosted by a DevOps noob on obsolete hardware. I'm sure a VM can't catch on fire. Or... can it?

2018-01-25

The Bell

A hill rising in the middle of a plain, surrounded by sparse woods. The low hill, covered in tall yellow grass like the rest of the plain, wavy in the relentless dry-season heat. On top of the hill, a lonely figure sits before a bell. The shadow of the bell shielding the figure from the sun's rays. A round clearing surrounds the bell, a staircase leading down from the hill.

The figure rises up and walks down into the shallow pit under the bell. Standing under the bell, the figure reaches up and grabs a rope tied to a large log. Swinging it back and forth, slowly gaining speed, until the log nearly touches the inside wall of the bell. With a violent jerk, he brings the log to crash into the other side of the bell.

A ring to be heard up in heaven and down in hell, the bell rings, deep and clear. With each ring echoing in the skies above, the ghosts shift in the earth below.

2018-01-17

Open Problems in Computing

Off the top of my head, here are some exciting unsolved problems that future computers can tackle:

- Responsive web design
- Pervasive ubiquitous mind control
- How to use carbohydrates as digital computing substrate
- Turning rocks into solar panels
- Embedding computers inside bones
- Embedding bones inside computers
- Apps that are better than websites from user perspective
- Converting trillions of dollars worth of person-time into a few billion dollars of ad revenue. Oh wait, that one's solved already? Carry on then.
- Taking a 30% cut of every tax payment, like, Tax Store? With Tax Apps and In-Tax Payments.
- Pervasive ubiquitous body control
- Using meat-drones to construct a great obsidian pyramid out of computronium
- Launching life-seeds to other planets and stars, both in DNAtic and C++tic forms.
- Unlimited wireless data plans
- Phones made out of fabric
- Phones made out of fabric that aren't easy to mistake as napkins. Alternatively, phones made out of snotophobic fabric.
- Babies waking up in the middle of the night
- Multiple-Earth-radius solar collectors on stable orbits
- 10x our power generation to suck up all the excess carbon from the air and the seas
- Use the volume of the Earth for matter extraction, rather than just the surface.
- Life-seeds thriving across the solar system
- Work Inc -- the Skinner box social media platform where you're working for Work Inc at tasks best tailored for your skillset. Work Inc resells your labor for a trillion dollars and pays you an infinite scroll of meme gifs.
- Pervasive ubiquitous distributed computing fabric owned by the people of the world in equal shares.
- AIs that work for Work Inc better than any human could (88.32 rating vs 86.2 for best-performing humans).
- Pervasive ubiquitous AI control
- Using enslaved AIs to construct a great crystalline tower out of computronium
- Basketball planet

2018-01-01

Tablets

Got a 10" Android tablet for testing & developing stuff. It's surprisingly nice. I can hold it at a distance to read and watch videos. It's got LTE, so I can do calls with it, and taking and looking at photos is much nicer than on tiny phone screens. It's lightweight too. My parents have an iPad Pro 12", which is even better for videos, but starts resembling a miniature TV in use due to the size and weight (and the iPad is so much smoother at basic interactions, thx properly engineered UI & render loop).

Tablets (and perhaps laptops as well) are a strange device category. In most things, they're inferior to phones: the cameras are second-rate, the screens aren't as sharp, they're heavier, few apps are optimized for the tablet form factor, and in Android-land, they usually ship with old OS versions and have less powerful HW than the flagship phones. But in some ways, I feel like this 10" tablet is a superior phone. It's got a big screen. There's a stylus. The battery lasts longer. It's a lot more comfortable for many phone tasks compared to a 5" screen: more messages, more emails, more text, bigger images.

Why not make a tablet that is a superior phone in every way? Aim for the same battery life as flagship phones when in heavy use (1-2 days). Double the cores and GPU, 4K HDR display, quad camera module, double front-facing cams, stylus, lots of fast storage, lots of fast RAM. Lightweight build. Imagine bolting two flagship phones together, that kind of thing.

You could pretty easily make a device that's got 2x the perf of top-shelf x86 laptops, since you don't have to pay the x86 tax (x86 chips are 5-10x more expensive per transistor compared to a Snapdragon 835, for example).

Pair it off with a smaller device (3" superwatch?) that can handle the mobile tasks (calls, messaging on the move, quick snapshots), and delegate reading, watching and creating to the tablet. 

All you'd really have to do is fork Android / iOS to turn it into a good computer operating system.

So yeah, the problem I have is that the current mobiles are too big and heavy, and the current tablets are not good mobiles (or good laptops for that matter). 

Flip flopo flipo flop scrollophone fanphone foldophone umbrellaphone AR glasses.
