art with code

2020-07-16

Compute shader daemon

Riffing on the WebCompute experiment (run GLSL compute shaders across CPUs, GPUs and the network), I'm now thinking of making the GLSL IO runtime capable of hanging around and running shaders on demand. In WebCompute, the Vulkan runner ran a single shader and read work units from stdin (the network server fed stdin from a WebSocket).

With GLSL IO, that goal extends to handling arbitrary new shaders arriving at arbitrary times. At a high level, you'd send it a shader file, argv and fds for stdin/stderr/stdout through a socket. It would then create a new compute pipeline, allocate and bind buffers, and run the pipeline on a free compute queue. On completing a dispatch, it'd delete the pipeline and free the buffers. This recreation cycle might be expensive, so the daemon should keep a cache of buffers and compute pipelines.
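As a rough sketch of what the request handling could look like on the CPU side (the message layout, the cache keyed by shader hash, and all the names here are assumptions for illustration, not the actual runtime):

// Hypothetical request message and pipeline cache for the shader daemon.
// All names here are illustrative; this is not the actual runtime code.
#include <string>
#include <unordered_map>
#include <vector>

struct ShaderRequest {
    std::string spirv;               // compiled shader received over the socket
    std::vector<std::string> argv;   // argv for the shader
    int stdinFd, stderrFd, stdoutFd; // fds passed over a Unix socket (SCM_RIGHTS)
};

struct CachedPipeline { /* VkPipeline, descriptor sets, bound buffers, ... */ };

std::unordered_map<size_t, CachedPipeline> pipelineCache; // keyed by shader hash

void handleRequest(const ShaderRequest& req) {
    size_t key = std::hash<std::string>{}(req.spirv);
    auto it = pipelineCache.find(key);
    if (it == pipelineCache.end()) {
        // First time we see this shader: create the compute pipeline and
        // allocate/bind buffers, then cache them so the next dispatch is cheap.
        it = pipelineCache.emplace(key, CachedPipeline{}).first;
    }
    // Dispatch on a free compute queue, pump IO until the shader finishes,
    // then either free the resources or keep them cached for reuse.
    (void)it;
}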

The compute shaders could share a global IO processor, or each could have its own IO processor thread. A global IO processor could be tuned to the IO CPU and have better coordination of the IO operations, but could end up with slow IO requests from one shader clogging the pipe (well, hogging the threadpool) for others. This could be worked around with async IO in the IO processor.

The other issue is the cooperative multitasking model of compute shaders. If your shader is stuck in an infinite loop on all compute units, other shaders can't run. To remedy this, the GPU driver allows compute shaders to run only a few seconds before it terminates them. On mobiles this can be as low as 2 seconds, on the desktop 15 seconds. If a discrete GPU has no display connected to it, it may allow compute shaders to run as long as they want.

If your shader needs to run longer than 10 seconds, this is a problem. The usual way around it is to build programs that run a few milliseconds at a time and are designed to be run several times in succession. With the IO runtime, this sounds painful. An IO request might take longer than 10 seconds to complete. You'd write a program that issues a bunch of async IOs, terminates, polls on the IOs on successive runs, does less than 10 seconds of processing on the results (restarting the loop if it's running out of runtime), and finally tells the runtime that it doesn't need to be run again. In the absence of anything better, this is how the first version of long-running shaders will work.
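On the runtime side, that boils down to a re-dispatch loop: keep running the shader in short slices until it signals it's done. A minimal sketch, assuming a hypothetical dispatchOnce() call and a run-again flag that the shader writes into the IO buffer:

// Host-side restart loop for long-running shaders (sketch).
// dispatchOnce() and readRunAgainFlag() are stand-ins, not the real API.
#include <cstdint>

void dispatchOnce();          // submit one short compute dispatch and wait for it
uint32_t readRunAgainFlag();  // read the status word the shader writes into the IO buffer

void runUntilDone() {
    for (;;) {
        // One slice: the shader issues/polls its async IOs, does a bounded
        // amount of processing, and exits before the GPU watchdog fires.
        dispatchOnce();
        // The shader clears the flag once it has consumed all its IO results
        // and doesn't need to be run again.
        if (readRunAgainFlag() == 0) break;
    }
}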

The second version would be a more automated take on that: a yield keyword that gets turned into a call to saveRegisters() followed by program exit. On program start, it'd call loadRegisters() and jump to the stored instruction pointer to continue execution. The third version would insert periodic checks for how long the program has been running, and yield if it's been running longer than the scheduler slice time.

Of course, this is only useful on GPU shaders. If you run the shaders on the CPU, the kernel's got you covered. The IO runtime is still useful since high-performance IO doesn't just happen.

I think the key learning from writing the GLSL IO runtime has been that IO bandwidth is the only thing that matters for workloads like grep. You can grep on the CPU at 50 GB/s. You can grep on the GPU at 200 GB/s. But if you need to transfer data from the CPU to the GPU, the GPU grep is limited to 11 GB/s. If you do a compress-decompress pipe from CPU to GPU, you can grep at 24 GB/s (if the compression ratio is good enough). GPUs give you density, but they don't have enough bandwidth to DRAM to really make use of the compute in common tasks.

Getting to even 11 GB/s requires doing multithreaded IO since memcpy is limited to 7 GB/s per thread. You need to fetch multiple blocks of data in parallel to get to 30 GB/s. Without the memcpy (just reading), you should be able to reach double the speed.
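Here's roughly what the multithreaded read looks like: split the file into blocks and have each thread pread() its own share, so no single memcpy stream becomes the bottleneck (the file name, block size and thread count are placeholders):

// Read a file with several threads, each pread()ing its own blocks (sketch).
// Error handling and thread/NUMA affinity tuning are omitted.
#include <algorithm>
#include <fcntl.h>
#include <sys/stat.h>
#include <thread>
#include <unistd.h>
#include <vector>

int main() {
    const char* path = "data.bin";   // placeholder file
    const size_t block = 1 << 20;    // 1 MB blocks
    const int nthreads = 8;

    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    std::vector<char> buf(st.st_size);

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; t++) {
        threads.emplace_back([&, t] {
            // Thread t handles blocks t, t+nthreads, t+2*nthreads, ...
            for (off_t off = (off_t)t * block; off < st.st_size; off += (off_t)nthreads * block) {
                size_t len = std::min<size_t>(block, (size_t)(st.st_size - off));
                pread(fd, buf.data() + off, len, off);
            }
        });
    }
    for (auto& th : threads) th.join();
    close(fd);
    return 0;
}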

2020-07-06

GPU IO library design thoughts

Thinking about the design of file.glsl, a file IO library for GLSL.

For a taste, how about calling Node.js from a GPU shader:

string r, fn = concat("node-", str(ThreadID), ".txt");
awaitIO(runCmd(concat(
  "node -e 'fs=require(`fs`); fs.writeFileSync(`", 
  fn,
  "`, Date.now().toString())'"
)));
r = readSync(fn, malloc(16));
println(concat("Node says ", r));

There are more examples in the test_file.glsl unit tests.

Design of GPU IO


Hardware considerations

  • GPUs have around a hundred processors, each with a 32-wide SIMD unit.
  • The SIMD unit can execute a 32-thread threadgroup and juggle around ten threadgroups for latency hiding.
  • GPU cacheline is 128 bytes.
  • CPU cacheline is 64 bytes.
  • GPU memory bandwidth is 400 - 1000 GB/s.
  • CPU memory bandwidth is around 50 GB/s.
  • PCIe3 x16 bandwidth is 11-13 GB/s; PCIe4 x16 around 20 GB/s.
  • NVMe flash can do 2.5-10 GB/s on 4-16 channels. PCIe4 could boost to 5-20 GB/s.
  • The CPU can do 30 GB/s memcpy with multiple threads, so it’s possible to keep PCIe4 saturated even with x16 -> x16.
  • GPUdirect access to other PCIe devices is only available on server GPUs. Other GPUs need a roundtrip via CPU.
  • CPU memory accesses require several threads of execution to hit full memory bandwidth (single thread can do ~15 GB/s)
  • DRAM is good at random access at >cacheline chunks with ~3-4x the bandwidth of PCIe3 x16, ~2x PCIe4 x16.
  • Flash SSDs are good at random access at >128kB chunks, perform best with sequential accesses, can deal with high amounts of parallel requests. Writes are converted to log format.
  • Optane is good at random access at small sizes >4kB and low parallelism. The performance of random and sequential accesses is similar.
  • => Large reads to flash should be executed in sequence (could be done by prefetching the entire file to page cache and only serving requests once the prefetcher has passed them; see the sketch after this list)
  • => Small scattered reads should be dispatched in parallel (if IO rate < prefetch speed, just prefetch the whole file)
  • => Writes can be dispatched in parallel with more freedom, especially without fsync. Sequential and/or large block size writes will perform better on flash.
  • => Doing 128 small IO requests in parallel may perform better than 16 parallel requests.
  • => IOs to page cache should be done in parallel and ASAP.
  • => Caching data into GPU RAM is important for performance.
  • => Programs that execute faster than the PCIe bus should be run on the CPU if the GPU doesn’t have the data in cache.
  • => Fujitsu A64FX-type designs with lots of CPU cores with wide vector units and high bandwidth memory are awesome. No data juggling, no execution environment weirdness.
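One way to do the prefetch-the-whole-file idea from the list above is to hint the kernel with posix_fadvise and stream the file through the page cache on a background thread; a minimal sketch (the 1 MB scratch size is arbitrary):

// Prefetch a file into the page cache so later scattered reads hit DRAM (sketch).
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

void prefetchFile(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    fstat(fd, &st);
    // Tell the kernel we're going to need the whole file; it starts readahead.
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    // Read through the file sequentially to make sure it lands in the page cache.
    std::vector<char> scratch(1 << 20);
    while (read(fd, scratch.data(), scratch.size()) > 0) {}
    close(fd);
}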

Software

The IO queue works by using spinlocks on both the CPU and GPU sides.
The fewer IO requests you make, the less time you spend spinning.
Sending data between CPU and GPU works best in large chunks.

To avoid issues with cacheline clashes, align messages on GPU cacheline size.
IO request spinlocks that read across the PCIe bus should have small delays between checks to avoid hogging the PCIe bus.
Workgroups (especially subgroups) should bundle their IOs into a single scatter/gather.
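A CPU-side sketch of what polling such a queue could look like: request slots padded to the 128-byte GPU cacheline, and a scan loop with a short sleep between passes so the polling doesn't hog the bus. The slot layout and status values are assumptions, not the library's actual format:

// CPU-side polling loop over a shared request buffer (sketch only).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

struct alignas(128) IORequestSlot {   // pad to the 128-byte GPU cacheline
    std::atomic<uint32_t> status;     // 0 = free, 1 = submitted by GPU, 2 = completed by CPU
    uint32_t opcode;                  // read / write / ls / ...
    uint64_t offset, length;
    // ... filename offset, result offset into the transfer buffer, etc.
};

void pollRequests(IORequestSlot* slots, int count, std::atomic<bool>& running) {
    while (running.load(std::memory_order_relaxed)) {
        for (int i = 0; i < count; i++) {
            if (slots[i].status.load(std::memory_order_acquire) == 1) {
                // ... perform the IO and copy the result into the transfer buffer ...
                slots[i].status.store(2, std::memory_order_release);
            }
        }
        // Small delay between scans so the spin loop doesn't eat a core or the bus.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
}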

When working with opened files, reads and writes should be done with pread/pwrite. Sharing a FILE* across threads isn’t a great idea.
The cost of opening and closing files with every IO is eclipsed by transfer speeds with large (1 MB) block sizes.
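For example, serving a single read request could look roughly like this; with ~1 MB blocks the open/close cost disappears into the transfer time (a sketch, not the actual request handler):

// Serve one read request with open + pread + close (sketch).
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t serveRead(const char* path, char* dst, size_t length, off_t offset) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    // pread is safe to call concurrently from many IO threads, and doesn't
    // need the seek-then-read dance of a shared FILE*.
    ssize_t n = pread(fd, dst, length, offset);
    close(fd);
    return n;
}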

The IO library should be designed for big instructions with minimal roundtrips.
E.g. directory listings should send the entire file list with file stats, and there should be a recursive version to transfer entire hierarchies.
Think more shell utilities than syscalls. Use CPU as IO processor that can do local data processing without involving the GPU.
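For the directory-listing case, the response could be one packed buffer of per-file records plus a string table, instead of a roundtrip per file. A hypothetical record layout (field names are made up for illustration):

// Hypothetical packed record for a (recursive) directory listing response.
// One request returns an array of these followed by a string table of paths.
#include <cstdint>

struct FileRecord {
    uint64_t size;        // st_size
    uint64_t mtime;       // st_mtime
    uint32_t mode;        // st_mode: file type + permissions
    uint32_t pathOffset;  // offset into the string table after the records
    uint32_t pathLength;
    uint32_t pad;         // keep records 8-byte aligned
};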

Workgroup concurrency can be used to run the same code on CPU and GPU in parallel. This extends to multi-GPU and multi-node quite naturally.
The IO queue could be used to exchange data between running workgroups.

There's a limited amount of memory that can be shared between the CPU and GPU (I start seeing issues with > 64 MB allocations).
Having a small IO heap for each thread or even threadgroup, while easy to parallelize, limits IO sizes severely.
32 MB transfer buffer, 32k threads -> 1 kB max IO per thread, or 32 kB per 32-wide subgroup.
Preferable to do 1+ MB IOs.
Design with a concurrently running IO manager program that processes IO transfers?
The CPU could also manage this by issuing copyBuffer calls to move data.

Workgroups submit tasks in sync -> readv / writev approach is beneficial for sequential reads/writes.
Readv/writev are internally single-threaded, so probably limited by memcpy to 6-8 GB/s.
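When a subgroup's reads land back to back in the same file, the CPU side can serve the whole batch with one preadv() call instead of 32 separate reads; a sketch with an assumed per-thread destination layout:

// Gather a batch of adjacent reads into a single preadv() call (sketch).
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <vector>

// One destination pointer + length per thread in the subgroup (assumed layout).
ssize_t gatherRead(int fd, off_t start, char** dsts, size_t* lens, int n) {
    std::vector<iovec> iov(n);
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = dsts[i];
        iov[i].iov_len = lens[i];
    }
    // One syscall fills all destinations from a contiguous file range.
    return preadv(fd, iov.data(), n, start);
}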

Ordering of writes across workgroups requires a way to sequence IOs (either reduce to order on the GPU or reassemble correct order on the CPU.)
IOs could have sequence ids.

Compressing the data that crosses the PCIe bus could help. 32 parallel instances of zstd --format=lz4 --fast -T1 file -o /dev/null go at 38 GB/s.

[Update] Did a quick test with libzstd (create compressor context for each 1 MB read, compress, send data to GPU) with mixed results. Getting 3.4 GB/s throughput with Zstd. Which is good for zero-effort, but I should really be using liblz4. Running 32 instances of zstd --fast in parallel got 6.4 GB/s, with --format=lz4 16 GB/s.

[Update 2] Libzstd with fast strategy and compression level -9 can do 12.7 GB/s grep.glsl throughput. So if I had a GPU-side decompressor, I might see a performance benefit vs raw data (11.4 GB/s). On already-compressed files, there's a 10-15% perf penalty vs raw.

[Update 3] LZ4 with streaming block compressor and compression level 9 reaches 17.5 GB/s grep.glsl throughput on the above file, 22.5 GB/s on a 13 GB kern.log file (lots of similar errors). About 5% perf penalty on compressed files. Feels like a GPU decompressor could actually be a good idea.
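For reference, a minimal version of the CPU-side compression step, squeezing each 1 MB read with the plain LZ4 block API before it goes over the bus. The measured numbers above used the streaming compressor at level 9; this only shows the basic shape:

// Compress one block with LZ4 before pushing it over PCIe (sketch).
#include <lz4.h>
#include <vector>

std::vector<char> compressBlock(const char* src, int srcSize) {
    std::vector<char> dst(LZ4_compressBound(srcSize));
    int n = LZ4_compress_default(src, dst.data(), srcSize, (int)dst.size());
    dst.resize(n > 0 ? n : 0);   // n == 0 means compression failed
    return dst;
}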

Caching file data on the GPU is important for performance, 40x higher bandwidth than CPU page cache over PCIe.
Without GPU-side caching, you’ll likely get better perf on the CPU on bandwidth-limited tasks (>50 GB/s throughput.)
In those tasks, using memory bandwidth to send data to GPU wouldn’t help any, best you could achieve is zero slowdown.
(Memory bandwidth 50 GB/s. CPU processing speed 50 GB/s. Use 10 GB/s of bandwidth to send data to GPU =>
CPU has only 40 GB/s bandwidth left, GPU can do 10 GB/s => CPU+GPU processing speed 50 GB/s.)

Benchmark suite

  • Different block sizes
  • Different access patterns (sequential, random)
    • Scatter writes
    • Sequential writes
    • Gather reads
    • Sequential reads
    • Combined reads & writes
  • Different levels of parallelism
    • 1 IO per thread group
    • Each thread does its own IO
    • 1 IO on ThreadID 0
    • IOs across all invocations
  • Compression
  • From hot cache on CPU
  • From cold cache
  • With GPU-side cache
  • Repeated access to same file
  • Access to multiple files

Does it help to combine reads & writes into sequential blocks on CPU-side when possible, or is it faster to do IOs ASAP?

Caching file descriptors, helps or not?
