art with code

2020-07-16

Compute shader daemon

Riffing on the WebCompute experiment (running GLSL compute shaders across CPUs, GPUs and the network), I'm now thinking of making the GLSL IO runtime capable of hanging around and running shaders on demand. In WebCompute, the Vulkan runner ran a single shader and read work units from stdin, with the network server feeding stdin from a WebSocket.

With GLSL IO, that goal is extended to handle arbitrary new shaders arriving at any time. At a high level, you'd send it a shader file, argv, and fds for stdin/stderr/stdout through a socket. It would then create a new compute pipeline, allocate and bind buffers, and run the pipeline on a free compute queue. On completing a dispatch, it'd delete the pipeline and free the buffers. This create-destroy cycle might be expensive, so the daemon should have a cache for buffers and compute pipelines.
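For the socket side, one standard way to hand the daemon the client's fds is SCM_RIGHTS over a Unix domain socket. Here's a minimal sketch of the receiving end under that assumption; the RequestHeader wire format and function name are made up for illustration, only the fd-passing mechanism is standard.

// Receive a request header plus the client's stdin/stdout/stderr fds over a
// Unix domain socket. SCM_RIGHTS is the standard fd-passing mechanism; the
// RequestHeader layout is a hypothetical wire format for this sketch.
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

struct RequestHeader {
    uint32_t shaderLength; // bytes of shader source/SPIR-V that follow
    uint32_t argvLength;   // bytes of NUL-separated argv that follow
};

// Reads one header and the three fds sent alongside it. Returns false if the
// message is malformed. The daemon would then read the shader and argv bytes,
// look up or build the compute pipeline, and dispatch it on a free queue.
bool recvRequest(int sock, RequestHeader* hdr, int fds[3]) {
    char cmsgBuf[CMSG_SPACE(3 * sizeof(int))];
    iovec iov = { hdr, sizeof(*hdr) };

    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgBuf;
    msg.msg_controllen = sizeof(cmsgBuf);

    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*hdr)) return false;

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return false;
    // Assumes the client sent exactly three fds: stdin, stdout, stderr.
    memcpy(fds, CMSG_DATA(cmsg), 3 * sizeof(int));
    return true;
}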

The compute shaders could share a global IO processor, or each could have its own IO processor thread. A global IO processor could be tuned to the CPU doing the IO and coordinate IO operations better, but slow IO requests from one shader could end up clogging the pipe (well, hogging the thread pool) for the others. This could be worked around with async IO in the IO processor.
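As a concrete, if simplified, picture of the shared-IO-processor option, here's a sketch of a global IO thread pool. The IORequest fields and class names are invented for illustration; a real version would complete reads into GPU-visible buffers and report completions back to the shaders.

// A minimal sketch of a shared IO processor: worker threads pull read
// requests from a queue and complete them with pread().
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <unistd.h>

struct IORequest {
    int fd;        // file to read from
    off_t offset;  // byte offset of the block
    size_t length; // block size
    char* dst;     // destination (e.g. a mapped GPU buffer)
};

class IOProcessor {
public:
    explicit IOProcessor(int threads) {
        for (int i = 0; i < threads; i++)
            workers.emplace_back([this] { run(); });
    }
    ~IOProcessor() {
        { std::lock_guard<std::mutex> l(m); done = true; }
        cv.notify_all();
        for (auto& t : workers) t.join();
    }
    void submit(IORequest req) {
        { std::lock_guard<std::mutex> l(m); queue.push(req); }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            IORequest req;
            {
                std::unique_lock<std::mutex> l(m);
                cv.wait(l, [this] { return done || !queue.empty(); });
                if (done && queue.empty()) return;
                req = queue.front();
                queue.pop();
            }
            // A slow request only ties up this one worker, but enough slow
            // requests from one shader can still starve the others; async IO
            // in the processor would avoid that.
            pread(req.fd, req.dst, req.length, req.offset);
        }
    }
    std::vector<std::thread> workers;
    std::queue<IORequest> queue;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};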

The other issue is the cooperative multitasking model of compute shaders. If your shader is stuck in an infinite loop on all compute units, other shaders can't run. To remedy this, the GPU driver allows compute shaders to run only a few seconds before it terminates them. On mobiles this can be as low as 2 seconds, on the desktop 15 seconds. If a discrete GPU has no display connected to it, it may allow compute shaders to run as long as they want.

If your shader needs to run longer than 10 seconds, this is a problem. The usual way around it is to build programs that run a few milliseconds at a time and are designed to be run several times in succession. With the IO runtime, this sounds painful. An IO request might take longer than 10 seconds to complete. You'd have to write a program that issues a bunch of async IOs, terminates, polls the IOs on successive runs, and does less than 10 seconds of processing on the results (going back into the restart loop if it's running out of runtime), until it finally tells the runtime that it doesn't need to be run again. In the absence of anything better, this is how the first version of long-running shaders will work.
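A sketch of that restart-loop structure, written here as C++ rather than shader code to keep it short. All the names (issueAsyncRead, ioIsReady, processChunk, ShaderState) are hypothetical stand-ins for the IO runtime.

// The per-dispatch entry point of a long-running job: each run does a bounded
// amount of work and returns whether the runtime should dispatch it again.
// Everything below is a hypothetical stand-in for the real IO runtime.
#include <cstdint>

static int  issueAsyncRead()           { return 1; }    // would enqueue a read
static bool ioIsReady(int)             { return true; } // would poll completion
static void processChunk(uint32_t)     {}               // well under the watchdog limit
static bool moreChunksLeft(uint32_t c) { return c < 8; }

enum class Phase : uint32_t { Start, WaitIO, Process, Done };

struct ShaderState {         // persisted in a buffer between dispatches
    Phase phase = Phase::Start;
    uint32_t nextChunk = 0;
    int ioHandle = -1;
};

bool runOnce(ShaderState& s) {
    switch (s.phase) {
    case Phase::Start:
        s.ioHandle = issueAsyncRead();
        s.phase = Phase::WaitIO;
        return true;                              // terminate, run me again
    case Phase::WaitIO:
        if (!ioIsReady(s.ioHandle)) return true;  // still waiting, poll next run
        s.phase = Phase::Process;
        return true;
    case Phase::Process:
        processChunk(s.nextChunk++);              // bounded slice of processing
        if (moreChunksLeft(s.nextChunk)) return true;
        s.phase = Phase::Done;
        return false;                             // done, no re-run needed
    case Phase::Done:
        return false;
    }
    return false;
}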

The second version would be a more automated take on that: a yield keyword that gets turned into a call to saveRegisters() followed by program exit. On program start, it'd do loadRegisters() and jump to the stored instruction pointer to continue execution. The third version would insert periodic checks for how long the program has been running, and yield if it's been running longer than the scheduler slice time.
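Here's a sketch of what the yield transform could lower to, emulated with a resume-point switch in the style of stackless coroutines instead of literal saveRegisters()/loadRegisters(). The Checkpoint layout, the slice budget, and the use of std::chrono are all illustrative; a GPU shader would need a different timer source.

// Emulates "yield": live variables and a resume label are saved to a
// persistent Checkpoint, the function exits, and the next dispatch jumps back
// to where it left off. The third-version time check is inlined in the loop.
#include <chrono>
#include <cstdint>

struct Checkpoint {            // persisted in a buffer between dispatches
    uint32_t resumePoint = 0;  // which yield we stopped at
    uint32_t i = 0;            // live loop variable
};

// Returns true if the shader needs another dispatch to finish.
bool run(Checkpoint& cp, uint32_t workItems, std::chrono::nanoseconds sliceBudget) {
    auto start = std::chrono::steady_clock::now();
    switch (cp.resumePoint) {                     // loadRegisters() + jump
    case 0:
        for (cp.i = 0; cp.i < workItems; cp.i++) {
            // ... do one item of work ...

            // Periodic check inserted by the compiler (the "third version").
            if (std::chrono::steady_clock::now() - start > sliceBudget) {
                cp.resumePoint = 1;               // saveRegisters()
                return true;                      // yield: exit, ask to be re-run
            }
    case 1:;                                      // execution resumes here
        }
    }
    cp.resumePoint = 0;
    return false;                                 // finished
}

Real saveRegisters()/loadRegisters() would snapshot the actual live registers rather than a hand-picked struct, but the control flow is the same.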

Of course, this is only useful on GPU shaders. If you run the shaders on the CPU, the kernel's got you covered. The IO runtime is still useful since high-performance IO doesn't just happen.

I think the key learning from writing the GLSL IO runtime has been that IO bandwidth is the only thing that matters for workloads like grep. You can grep on the CPU at 50 GB/s. You can grep on the GPU at 200 GB/s. But if you need to transfer the data from the CPU to the GPU, the GPU grep is limited to 11 GB/s. If you do a compress-decompress pipe from CPU to GPU, you can grep at 24 GB/s, provided the compression ratio is good enough: to feed 24 GB/s of uncompressed data over an 11 GB/s link, the data has to compress at better than roughly 2.2:1. GPUs give you density, but they don't have enough bandwidth to system DRAM to really make use of their compute in common tasks.

Getting to even 11 GB/s requires doing multithreaded IO, since memcpy is limited to 7 GB/s per thread. You need to fetch multiple blocks of data in parallel to get to 30 GB/s. Without the memcpy (just reading the data), you should be able to reach double that speed.
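For illustration, here's one way to split a single large copy across threads so it isn't limited by one core; the thread count and slicing are arbitrary, the point is just that each core streams its own slice.

// Split a large memcpy into per-thread slices so the copy isn't limited by a
// single core's ~7 GB/s. Thread count is illustrative, not a tuned value.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

void parallelCopy(char* dst, const char* src, size_t size, int threads = 4) {
    std::vector<std::thread> pool;
    size_t slice = (size + threads - 1) / threads;
    for (int t = 0; t < threads; t++) {
        size_t begin = t * slice;
        if (begin >= size) break;
        size_t len = std::min(slice, size - begin);
        // Each thread streams its own contiguous slice of the buffer.
        pool.emplace_back([=] { std::memcpy(dst + begin, src + begin, len); });
    }
    for (auto& th : pool) th.join();
}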
