art with code
2020-07-16
Compute shader daemon
2020-07-06
GPU IO library design thoughts
Thinking about the design of file.glsl, a file IO library for GLSL.
For a taste, how about calling node.js from a GPU shader:
string r, fn = concat("node-", str(ThreadID), ".txt"); awaitIO(runCmd(concat( "node -e 'fs=require(`fs`); fs.writeFileSync(`", fn, `, http://Date.now().toString())'" ))); r = readSync(fn, malloc(16)); println(concat("Node says ", r));
There are more examples in the test_file.glsl unit tests.
Design of GPU IO
Hardware considerations
- GPUs have around a hundred processors, each with a 32-wide SIMD unit.
- The SIMD unit executes 32-thread threadgroups and juggles around ten threadgroups at a time for latency hiding.
- GPU cacheline is 128 bytes.
- CPU cacheline is 64 bytes.
- GPU memory bandwidth is 400 - 1000 GB/s.
- CPU memory bandwidth is around 50 GB/s.
- PCIe 3 x16 bandwidth is 11-13 GB/s. PCIe 4 gets you around 20 GB/s.
- NVMe flash can do 2.5-10 GB/s on 4-16 channels. PCIe4 could boost to 5-20 GB/s.
- The CPU can do 30 GB/s memcpy with multiple threads, so it’s possible to keep PCIe4 saturated even with x16 -> x16.
- GPUDirect access to other PCIe devices is only available on server GPUs. Other GPUs need a roundtrip via the CPU.
- CPU memory accesses require several threads of execution to hit full memory bandwidth (single thread can do ~15 GB/s)
- DRAM is good at random access at >cacheline chunks with ~3-4x the bandwidth of PCIe3 x16, ~2x PCIe4 x16.
- Flash SSDs are good at random access at >128kB chunks, perform best with sequential accesses, can deal with high amounts of parallel requests. Writes are converted to log format.
- Optane is good at random access at small sizes >4kB and low parallelism. The performance of random and sequential accesses is similar.
- => Large reads to flash should be executed in sequence (could be done by prefetching the entire file to page cache and only serving requests once the prefetcher has passed them)
- => Small scattered reads should be dispatched in parallel (if IO rate < prefetch speed, just prefetch the whole file)
- => Writes can be dispatched in parallel with more freedom, especially without fsync. Sequential and/or large block size writes will perform better on flash.
- => Doing 128 small IO requests in parallel may perform better than 16 parallel requests.
- => IOs to page cache should be done in parallel and ASAP.
- => Caching data into GPU RAM is important for performance.
- => Programs that execute faster than the PCIe bus should be run on the CPU if the GPU doesn’t have the data in cache.
- => Fujitsu A64FX-type designs with lots of CPU cores with wide vector units and high bandwidth memory are awesome. No data juggling, no execution environment weirdness.
Software
The IO queue works by using spinlocks on both the CPU and GPU sides.
The fewer IO requests you make, the less time you spend spinning.
Sending data between CPU and GPU works best in large chunks.
To avoid issues with cacheline clashes, align messages on GPU cacheline size.
IO request spinlocks that read across the PCIe bus should have small delays between checks to avoid hogging the PCIe bus.
Workgroups (especially subgroups) should bundle their IOs into a single scatter/gather.
When working with opened files, reads and writes should be done with pread/pwrite. Sharing a FILE* across threads isn’t a great idea.
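A minimal sketch of what that means on the CPU side: each request carries its own offset, and the handler uses pread/pwrite on a plain fd, so any number of IO threads can hit the same file without sharing a seek position. The struct and field names here are invented for illustration, not file.glsl's actual layout.

// Hypothetical IO request record; the real file.glsl layout differs.
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    int     fd;        // file descriptor cached on the CPU side
    int     is_write;  // 0 = read, 1 = write
    off_t   offset;    // absolute file offset, no shared file position
    size_t  length;
    char   *buffer;    // slice of the CPU-GPU transfer buffer
} io_request;

// Safe to call concurrently from many threads on the same fd, because
// pread/pwrite take the offset explicitly instead of mutating the seek
// position the way fread/fwrite on a shared FILE* would.
ssize_t service_request(io_request *req) {
    return req->is_write
        ? pwrite(req->fd, req->buffer, req->length, req->offset)
        : pread(req->fd, req->buffer, req->length, req->offset);
}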
The cost of opening and closing files with every IO is eclipsed by transfer speeds with large (1 MB) block sizes.
The IO library should be designed for big instructions with minimal roundtrips.
E.g. directory listings should send the entire file list with file stats, and there should be a recursive version to transfer entire hierarchies.
Think more shell utilities than syscalls. Use CPU as IO processor that can do local data processing without involving the GPU.
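For instance, a recursive directory listing could be assembled into a single reply on the CPU and shipped to the GPU in one transfer. A sketch, with an invented plain-text wire format:

// Sketch: answer a "recursive ls" request with one flat reply blob of
// "<size> <mtime> <path>" lines. The format is made up for illustration;
// the point is that one GPU-side request gets the whole hierarchy back.
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static char  *reply_buf;
static size_t reply_len, reply_cap;

static int add_entry(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf) {
    (void)typeflag; (void)ftwbuf;
    int n = snprintf(reply_buf + reply_len, reply_cap - reply_len,
                     "%lld %lld %s\n",
                     (long long)sb->st_size, (long long)sb->st_mtime, path);
    if (n > 0 && reply_len + (size_t)n < reply_cap) reply_len += (size_t)n;
    return 0;  // keep walking even if the buffer fills up
}

// Walks the tree and fills buf; the caller copies it to the GPU in one go.
size_t list_recursive(const char *root, char *buf, size_t cap) {
    reply_buf = buf; reply_len = 0; reply_cap = cap;
    nftw(root, add_entry, /*max open fds*/ 32, FTW_PHYS);
    return reply_len;
}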
Workgroup concurrency can be used to run the same code on CPU and GPU in parallel. This extends to multi-GPU and multi-node quite naturally.
The IO queue could be used to exchange data between running workgroups.
There's a limited amount of memory that can be shared between the CPU and GPU (I start seeing issues with >64 MB allocations).
Having a small IO heap for each thread or even threadgroup, while easy to parallelize, limits IO sizes severely.
A 32 MB transfer buffer split across 32k threads means a 1 kB max IO per thread, or 32 kB per 32-wide subgroup.
Preferable to do 1+ MB IOs.
Design with a concurrently running IO manager program that processes IO transfers?
The CPU could also manage this by issuing copyBuffer calls to move data.
Workgroups submit tasks in sync -> readv / writev approach is beneficial for sequential reads/writes.
Readv/writev are internally single-threaded, so probably limited by memcpy to 6-8 GB/s.
Ordering of writes across workgroups requires a way to sequence IOs (either reduce to order on the GPU or reassemble correct order on the CPU.)
IOs could have sequence ids.
Compression of data on the PCIe bus could help. 32 * zstd --format=lz4 --fast -T1 file -o /dev/null goes at 38 GB/s.
[Update] Did a quick test with libzstd (create compressor context for each 1 MB read, compress, send data to GPU) with mixed results. Getting 3.4 GB/s throughput with Zstd. Which is good for zero-effort, but I should really be using liblz4. Running 32 instances of zstd --fast in parallel got 6.4 GB/s, with --format=lz4 16 GB/s.
[Update 2] Libzstd with fast strategy and compression level -9 can do 12.7 GB/s grep.glsl throughput. So if I had a GPU-side decompressor, I might see a performance benefit vs raw data (11.4 GB/s). On already-compressed files, there's a 10-15% perf penalty vs raw.
[Update 3] LZ4 with streaming block compressor and compression level 9 reaches 17.5 GB/s grep.glsl throughput on the above file, 22.5 GB/s on a 13 GB kern.log file (lots of similar errors). About 5% perf penalty on compressed files. Feels like a GPU decompressor could actually be a good idea.
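A sketch of what an LZ4 read path like that can look like; this is not the exact code behind the numbers above, and the 1 MB block size, acceleration setting, and send_to_gpu callback are stand-ins.

// Sketch: compress each 1 MB read with the LZ4 streaming block API before
// pushing it over PCIe. Assumes a matching block decompressor on the GPU
// side. Not reentrant (static buffers); illustration only.
#include <lz4.h>
#include <stdio.h>

#define BLOCK_SIZE (1 << 20)

void compress_stream(FILE *in, void (*send_to_gpu)(const char *, int)) {
    // Double-buffer the input: the streaming compressor uses the previous
    // block as its dictionary, so it has to stay resident in memory.
    static char src[2][BLOCK_SIZE];
    static char dst[LZ4_COMPRESSBOUND(BLOCK_SIZE)];
    int cur = 0;

    LZ4_stream_t *ctx = LZ4_createStream();
    for (;;) {
        int n = (int)fread(src[cur], 1, BLOCK_SIZE, in);
        if (n <= 0) break;
        int c = LZ4_compress_fast_continue(ctx, src[cur], dst, n,
                                           (int)sizeof(dst), /*accel*/ 1);
        if (c <= 0) break;
        send_to_gpu(dst, c);   // copy into the CPU-GPU transfer buffer
        cur ^= 1;
    }
    LZ4_freeStream(ctx);
}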
Caching file data on the GPU is important for performance: GPU RAM has roughly 40x the bandwidth of reading the CPU page cache over PCIe.
Without GPU-side caching, you’ll likely get better perf on the CPU on bandwidth-limited tasks (>50 GB/s throughput.)
In those tasks, using memory bandwidth to send data to GPU wouldn’t help any, best you could achieve is zero slowdown.
(Memory bandwidth 50 GB/s. CPU processing speed 50 GB/s. Use 10 GB/s of bandwidth to send data to GPU =>
CPU has only 40 GB/s bandwidth left, GPU can do 10 GB/s => CPU+GPU processing speed 50 GB/s.)
Benchmark suite
- Different block sizes
- Different access patterns (sequential, random)
  - Scatter writes
  - Sequential writes
  - Gather reads
  - Sequential reads
  - Combined reads & writes
- Different levels of parallelism
  - 1 IO per thread group
  - Each thread does its own IO
  - 1 IO on ThreadID 0
  - IOs across all invocations
- Compression
- From hot cache on CPU
- From cold cache
- With GPU-side cache
- Repeated access to same file
- Access to multiple files
Does it help to combine reads & writes into sequential blocks on CPU-side when possible, or is it faster to do IOs ASAP?
Caching file descriptors, helps or not?
2020-06-29
grep.glsl
grep.glsl is sort of working. It's a GLSL compute shader version of grep. A very simple one at that. It tests a string against a file's contents and prints out the byte offsets where the string was found.
The awesome part of this is that the shader is running the IO. And it performs reasonably well after tuning. You could imagine a graphics shader dynamically loading geometry and textures when it needs them, then poll for load completion in following frames.
Here are the few first lines of grep.glsl
string filename = aGet(argv, 2);
string pattern = aGet(argv, 1);

if (ThreadLocalID == 0) done = 0;

if (ThreadID == 0) {
    FREE(
        println(concat("Searching for pattern ", pattern));
        println(concat("In file ", filename));
    )
    setReturnValue(1);
}
Not your run-of-the-mill shader, eh?
This is the file reading part:
while (done == 0) {
    FREE(FREE_IO(
        barrier(); memoryBarrier();

        // Read the file segment for the workgroup.
        if (ThreadLocalID == 0) {
            wgBuf = readSync(filename, wgOff, wgBufSize, string(wgHeapStart, wgHeapStart + wgBufSize));
            if (strLen(wgBuf) != wgBufSize) {
                atomicAdd(done, strLen(wgBuf) == 0 ? 2 : 1);
            }
        }
        barrier(); memoryBarrier();
        if (done == 2) break; // Got an empty read.

        // Get this thread's slice of the workgroup buffer.
        string buf = string(
            min(wgBuf.y, wgBuf.x + ThreadLocalID * blockSize),
            min(wgBuf.y, wgBuf.x + (ThreadLocalID+1) * blockSize + patternLength)
        );

        // Iterate through the buffer slice and add found byte offsets to the search results.
        int start = startp;
        i32heapPtr = startp;
        for (int i = 0; i < blockSize; i++) {
            int idx = buf.x + i;
            if (startsWith(string(idx, buf.y), pattern)) {
                i32heap[i32heapPtr++] = int32_t(i);
                found = true;
            }
        }
        int end = i32heapPtr;
        ...
Performance is complicated. Vulkan compute shaders have a huge 200 ms startup cost and a 160 ms cleanup cost. About 60 ms of that is creating the compute pipeline, the rest is instance and device creation.
Once you get the shader running, performance continues to be complicated. The main bottleneck is IO, as you might imagine. The file.glsl IO implementation uses a device-local host-visible volatile buffer to communicate between the CPU and GPU. The GPU tells the CPU that it has some IO work for the CPU by writing into the buffer, using atomics to prevent several GPU threads from writing into the same request. The CPU spinlocks to grab new requests, then spins until the request contents become available. After processing a request, the CPU writes the results to the buffer. The GPU spinlocks until the IO completes, then copies the IO results to its device-local heap buffer.
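In rough C terms, the CPU half of that protocol looks something like the sketch below. The field names and status codes are made up for illustration (the real buffer layout differs), but the two load-bearing details are there: one request per 128-byte GPU cacheline, and a status word both sides spin on.

// Sketch of the CPU side of the IO queue; names and protocol are illustrative.
#include <stdint.h>

enum { IO_FREE = 0, IO_SUBMITTED = 1, IO_IN_FLIGHT = 2, IO_DONE = 3 };

// One request per 128-byte GPU cacheline, so polling on one request
// never thrashes the line another request lives on.
typedef struct __attribute__((aligned(128))) {
    volatile int32_t status;      // spun on by both CPU and GPU
    int32_t opcode;               // read, write, open, runCmd, ...
    int64_t offset;
    int64_t length;
    int64_t data_offset;          // payload location in the transfer buffer
} io_request;

// CPU polling loop over the mapped device-local host-visible buffer.
// Real code needs memory barriers around the status transitions and a
// small pause per round so the polling doesn't hog the PCIe bus.
void io_service_loop(io_request *queue, int queue_len, volatile int *running) {
    while (*running) {
        for (int i = 0; i < queue_len; i++) {
            if (queue[i].status == IO_SUBMITTED) {
                queue[i].status = IO_IN_FLIGHT;
                // ...pread/pwrite the transfer buffer at data_offset...
                queue[i].status = IO_DONE;   // the GPU spins until it sees this
            }
        }
    }
}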
The GPU-GPU copies execute reasonably fast (200+ GB/s), but waiting for the IO drops the throughput to around 6 GB/s. This is way better than the 1.6 GB/s it used to be, when it was using int8_t for IO and ~200 kB transfers. Now the transfer buffer is 1 MB and the IO copies are done with i64vec4.
Once you have the data on the GPU, performance continues to be complicated. Iterating through the data one byte at a time goes at roughly 30 GB/s. If the buffer is host-visible, the speed drops to 10 GB/s. Searching a device-local buffer 32 bytes at a time using an i64vec4 goes at 220 GB/s.
The CPU-GPU transfer buffer has a max size of 256 MB. The design of file.glsl causes issues here, since it slices the transfer buffer across shader instances, making it difficult to do full-buffer transfers (and minimize IO spinning). Now grep.glsl does transfers in per-workgroup slices, where 100 workgroups each do a 1 MB read from the searched file, then distribute the search work across 255 workers per workgroup.
This achieves 6 GB/s shader throughput. The PCIe bus is capable of 12 GB/s. If you remove the IO waits and just do buffer copies, the shader runs at 130 GB/s. Taking that into account, the shader PCIe transfers should be happening at 6.3 GB/s. Removing the buffer copies had a very minimal effect on throughput, so doing the search directly on the IO buffer shouldn't improve performance by much.
Thoughts
Why are the transfers not hitting the PCIe bus limits? Suspiciously hovering at around half of PCIe bandwidth too. Could it be that the device-local buffer is slow to write to from the CPU side? Previously I had host-cached memory for the buffer, which lets you do flushes and invalidations to (hopefully) transfer in more efficient chunks. The reason that I'm not using a host-cached memory buffer is that GLSL volatile doesn't work for host-side memory: the GPU seems to cache fetched memory into GPU RAM and volatile only bypasses the GPU L1/L2 caches, so you'll never see CPU writes that landed after your first read. And there doesn't seem to be a buffer type that's host-cached device-local.
Maybe have a CPU-side buffer for IO, and a transfer queue to submit copies to GPU memory. This should land the data into GPU RAM and volatile should work. Or do the fread to a separate buffer first, then memcpy it to the GPU buffer.
Vulkan's startup latency makes the current big binary approach bad for implementing short-lived shell commands. The anemic PCIe bandwidth makes compute-poor programs starve for data. GNU grep runs at 4 GB/s, but you can run 64 instances and achieve 50 GB/s [from page cache]. This isn't possible with grep.glsl. All you have is 12 GB/s.
Suppose this: you've got a long-running daemon. You send it SPIR-V or GLSL. It runs them on the GPU. It also maintains a GPU-side page cache for file I/O. Now your cold-cache data would still run at roughly the speed of storage (6 GB/s isn't _that_ bad.) But a cached file, a cached file would fly. 400 GB/s and more.
Creating the compute pipeline takes around 50 ms. Caching binary pipelines for programs would give you faster startup times.
The GPU driver terminates running programs after around 15 seconds. And they hang the graphics while running. Where's our pre-emptive multitasking?
This should really be running on a CPU because of the PCIe bottleneck and the Vulkan startup bottleneck. Going to try compiling it to ISPC or C++ and see how it goes.
[Update] Tested memcpy performance. Single thread CPU-CPU, about 8 GB/s. With 8 threads: 31 GB/s. Copying to the device-local host-visible buffer: 8 GB/s with one thread. With 4 threads, 10.2 GB/s. Cuda bandwidthTest can do 13.1 GB/s with pinned memory and 11.5 GB/s with pageable memory. The file.glsl IO transfer system doesn't seem to be all that slow, it just needs multi-threaded copies on the CPU side.
Built a simple threaded IO system that spawns a thread for every read IO up to a maximum of 16 threads in flight. Grep.glsl throughput: 10 GB/s. Hooray!
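Roughly what that reader looks like, as a sketch; the semaphore cap, the names, and the completion flag are illustrative rather than the actual implementation.

// Sketch: one detached thread per read request, at most 16 in flight.
// Each thread preads into a staging buffer and memcpys the result into
// the mapped CPU-GPU transfer buffer. Names and sizes are illustrative.
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static sem_t g_inflight;  // initialized elsewhere: sem_init(&g_inflight, 0, 16)

typedef struct {
    int    fd;
    off_t  offset;
    size_t length;
    char  *mapped_dst;          // slice of the mapped transfer buffer
    volatile int32_t *status;   // completion flag the GPU spins on
} read_job;

static void *read_worker(void *arg) {
    read_job *job = arg;
    char *staging = malloc(job->length);
    ssize_t n = pread(job->fd, staging, job->length, job->offset);
    if (n > 0) memcpy(job->mapped_dst, staging, (size_t)n);
    *job->status = (int32_t)(n < 0 ? 0 : n);  // signal completion to the GPU side
    free(staging);
    free(job);
    sem_post(&g_inflight);
    return NULL;
}

// job must be heap-allocated; the worker frees it when done.
static void submit_read(read_job *job) {
    sem_wait(&g_inflight);           // cap the number of in-flight reads at 16
    pthread_t t;
    pthread_create(&t, NULL, read_worker, job);
    pthread_detach(t);
}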
[Update 2] Lots of tweaking, debugging and banging-head-to-the-wall later, we're at 11.45 GB/s.
2020-06-25
file.glsl
2020-06-16
Real-time path tracing
2020-06-12
GLSL strings
for (int i = str.x; i < str.y; i++) {
    heap[i]...
}
Malloc
Performance
2020-05-31
Raspberry Pi RC car
Here's my software package to turn your Raspberry Pi into an RC car: https://github.com/kig/rpi-car-control
rpi-car-control
Use a Raspberry Pi to drive an RC car from a web page.
Fisheye camera with two IR lamps, a white USB power bank underneath. The wires go inside the Raspberry Pi case.
A ToF laser range finder for the reversing distance indicator.
How does it work?
Open up a cheap RC toy car. Connect the motors to a Raspberry Pi. Add a camera. Run a web server on the Raspberry Pi that controls the car.
In more detail, you need to replace the car PCB with a motor controller board (say, a tiny cheap MX1508 module). Then solder the motors and the car battery pack to the motor controller. Solder M-F jumper cables to the motor controller's control connectors. Plug the other end of the jumpers to the Raspberry Pi GPIOs. Now you can control the motors from the Raspberry Pi.
Expose the Raspberry Pi camera as an MJPEG stream so that you can directly view it as an IMG on the browser. This is the easiest low-latency, low-CPU, high-quality streaming format.
If the car has lights, you can drive them from the GPIOs as well (either directly or via a proper LED controller). Add a bunch of sensors to the car for the heck of it. I've got a tiny VL53L1X ToF laser-ranging sensor as a reversing radar, and a DHT temperature and humidity sensor. There's code in the repo to hook up an ultrasonic range finder too (it can even use the DHT sensor to calculate the speed of sound for the given temperature and humidity - and has a Kalman filter of sorts, so you can reach ~mm accuracy), and some bits and bobs for using a PIR sensor.
There was also a microphone input and playback either through wired speakers or to a Bluetooth speaker, but that's not enabled at the moment. There was also a WebRTC-based streaming solution for doing 2-way video calls, but that was such a pain I gave up on it. I was using RWS which is pretty easy to set up, but the STUN/TURN stuff was tough.
Add a USB battery pack to power the Raspberry Pi and you're about done. If you're feeling adventurous, you could use a 5V step-up/step-down regulator to run the Raspberry Pi directly from the car batteries.
Install
raspi-config # Enable I2C to use the VL53L1X sensor
sh install.sh
The install script installs the car service and its dependencies. This is best done on a fresh install of Raspbian. The install script overwrites NGINX's default site configuration.
After starting the car control app with sudo systemctl start car, you can connect to http://raspberrypi/car/ and play with the controls web page.
The car control app is installed in /opt/rpi-car-control.
To use an SSH tunnel server, edit /etc/rpi-car-control/env.sh and change the line RPROXY_SERVER= to RPROXY_SERVER=my.server.
With the SSH tunnel, you can access the car from http://my.server:9999/car/. Best to firewall this port and add an HTTPS reverse proxy that points to it. Look at etc/remote_nginx.conf for a snippet that sets up an authenticated NGINX reverse proxy on the remote server. (Run htpasswd -c /etc/nginx/car_htpasswd my_username to create the password file.)
Configuration
See /etc/rpi-car-control/env.sh for settings.
# SSH tunnel reverse proxy
RPROXY_SERVER=my.server
# One of v4l2-mjpeg, v4l2-raw, raspivid
VIDEO_MODE=v4l2-raw
# Which camera to use in the v4l2 modes
V4L2_DEVICE=/dev/video2
# Video settings
VIDEO_WIDTH=480
VIDEO_HEIGHT=270
VIDEO_FPS=60
VIDEO_ROTATION=0
Controls
The circle on the left is the accelerator indicator, and the circle on the right is the steering indicator. The bar in the bottom middle is the reversing distance indicator. The sensor data readout is at top left. The little square at the bottom right toggles the full screen mode.
The controls are defined near the bottom of html/main.js.
Touch controls
- Use left thumb to accelerate and reverse, right thumb to steer.
Keyboard controls
- Use arrow keys to drive.
- The numbers 1-4 control front lights intensity and 0 turns the rear lights on and off.
- The z key blinks the left front light, the c key blinks the right front light, and the x key turns off the blinkers.
Requirements
The app is very modular, so you can run it without an actual car or camera and just play with a web page of controls that do nothing.
If you wire up the motors, you should be able to drive. If you wire up the lights, they should light up.
Wire up the sensors and you should start seeing sensor data in the HUD.
Add a camera and you'll see a live video stream.
Wiring
See control/car.py and sensors/sensors_websocket.py for the pin definitions. The VCC and GND connections have been left out. Just remember to use the correct voltage when wiring those.
Component | GPIO | Notes |
---|---|---|
Motor forward (A) | 17 | |
Motor backward (B) | 27 | |
Steering left (A) | 24 | |
Steering right (B) | 23 | |
Left headlight | 5 | The headlights turn on when you connect |
Right headlight | 6 | They can also blink a turning signal |
Rear lights | 13 | Rear lights light up when you reverse |
Power PWM | 12 | Disabled, for use with L298N |
DHT11 signal | 14 | |
PIR signal | 22 | |
VL53L1X power | 4 | Use a GPIO and you can turn it off when not in use |
VL53L1X SDA | 2 | I2C bus 1 |
VL53L1X SCL | 3 | I2C bus 1 |
Features
- FPV stream web page with keyboard & touch controls to drive the car, along with a reversing distance indicator and a thermometer.
- Low latency video stream for driving (down to 50 ms glass-to-glass when using a 90 Hz camera and a 240 Hz display.)
- Bunch of websocket servers to send out sensor data and receive car controls.
- Nginx reverse proxy config to tie all the servers together.
- Systemd service to start the car control server on boot.
- SSH tunnel to a remote control server to drive the car from anywhere.
- Low-power tweaks to increase battery life (disables HDMI, Ethernet and USB.)
- Use RaspiCam or a V4L2 USB webcam, either with raw video (eats CPU) or camera-supplied MJPEG
Disabled
- Bluetooth speaker pairing for playing audio.
- Stream car microphone to the browser.
- Speak to the car from the browser by sending audio with Web Audio API.
- WebRTC call between browser and car.
In progress
- PoseNet with Coral USB accelerator for "point and I'll drive there"
Wanted
- OMX JPEG encoder for raw video cameras
- SLAM and "click on a map position to drive there"
- Good small microphone + speaker solution
- Small display to do two-way video calls
- Non-sucky camera mount (duct tape doesn't really work)
- Power car and computer from one battery
- Automatic wireless charging when battery is low
- Shutdown when battery critical
- Speech controls
Customize
Take a look at run.sh first. It starts the web server and optionally the reverse proxy tunnel. The web server is in web/web_server.py and starts up bin/start_control_server.sh and bin/start_server.sh when needed. The sensors are controlled by sensors/sensors_websocket.py, and the car controls are in control/car_websockets.py. For video streaming, have a look at video/start_stream.sh. The HUD is in html/, see html/main.js for the car controls and how the video and sensor data are streamed.
License
MIT
Ilmari Heikkinen © 2020