art with code

2020-06-29

grep.glsl

grep.glsl is sort of working. It's a GLSL compute shader version of grep. A very simple one at that. It tests a string against a file's contents and prints out the byte offsets where the string was found.

The awesome part of this is that the shader is running the IO. And it performs reasonably well after tuning. You could imagine a graphics shader dynamically loading geometry and textures when it needs them, then polling for load completion in the following frames.

Here are the first few lines of grep.glsl:

    string filename = aGet(argv, 2);
    string pattern = aGet(argv, 1);

    if (ThreadLocalID == 0) done = 0;

    if (ThreadID == 0) {
        FREE(
            println(concat("Searching for pattern ", pattern));
            println(concat("In file ", filename));
        )
        setReturnValue(1);
    }

Not your run-of-the-mill shader, eh?

This is the file reading part:

    while (done == 0) {
        FREE(FREE_IO(
            barrier(); memoryBarrier();

            // Read the file segment for the workgroup.
            if (ThreadLocalID == 0) {
                wgBuf = readSync(filename, wgOff, wgBufSize, string(wgHeapStart, wgHeapStart + wgBufSize));
                if (strLen(wgBuf) != wgBufSize) {
                    atomicAdd(done, strLen(wgBuf) == 0 ? 2 : 1);
                }
            }

            barrier(); memoryBarrier();

            if (done == 2) break; // Got an empty read.
            
            // Get this thread's slice of the workGroup buffer
            string buf = string(
                min(wgBuf.y, wgBuf.x + ThreadLocalID * blockSize),
                min(wgBuf.y, wgBuf.x + (ThreadLocalID+1) * blockSize + patternLength)
            );

            // Iterate through the buffer slice and add found byte offsets to the search results.
            int start = startp;
            i32heapPtr = startp;
            for (int i = 0; i < blockSize; i++) {
                int idx = buf.x + i;
                if (startsWith(string(idx, buf.y), pattern)) {
                    i32heap[i32heapPtr++] = int32_t(i);
                    found = true;
                }
            }
            int end = i32heapPtr;

            ...

Performance is complicated. Vulkan compute shaders have a huge 200 ms startup cost and a 160 ms cleanup cost. About 60 ms of that is creating the compute pipeline, the rest is instance and device creation.

Once you get the shader running, performance continues to be complicated. The main bottleneck is IO, as you might imagine. The file.glsl IO implementation uses a device-local, host-visible volatile buffer to communicate between the CPU and GPU. The GPU tells the CPU that it has IO work by writing a request into the buffer, using atomics to prevent several GPU threads from writing into the same request slot. The CPU spinlocks waiting for new requests to become available and grabs them for processing. After processing a request, the CPU writes the results to the buffer. The GPU spinlocks waiting for the IO completion, then copies the IO results to its device-local heap buffer.

The GPU-GPU copies execute reasonably fast (200+ GB/s), but waiting for the IO drops the throughput to around 6 GB/s. This is way better than the 1.6 GB/s it used to be, when it was using int8_t for IO and ~200 kB transfers. Now the transfer buffer is 1 MB and the IO copies are done with i64vec4.

Once you have the data on the GPU, performance continues to be complicated. Iterating through the data one byte at a time goes at roughly 30 GB/s. If the buffer is host-visible, the speed drops to 10 GB/s. Searching a device-local buffer 32 bytes at a time using an i64vec4 goes at 220 GB/s.

The CPU-GPU transfer buffer has a max size of 256 MB. The design of file.glsl causes issues here, since it slices the transfer buffer across shader instances, making it difficult to do full-buffer transfers (and minimize IO spinning). Now grep.glsl does transfers in per-workgroup slices, where 100 workgroups each do a 1 MB read from the searched file, then distribute the search work across 255 workers per workgroup.

This achieves 6 GB/s shader throughput. The PCIe bus is capable of 12 GB/s. If you remove the IO waits and just do buffer copies, the shader runs at 130 GB/s. Taking that into account (1/6 ≈ 1/130 + 1/6.3), the PCIe transfers alone should be happening at around 6.3 GB/s. Removing the buffer copies had a very minimal effect on throughput, so doing the search directly on the IO buffer shouldn't improve performance by much.

Thoughts

Why are the transfers not hitting the PCIe bus limits? Suspiciously hovering at around half of PCIe bandwidth too. Could it be that the device-local buffer is slow to write to from the CPU side? Previously I had host-cached memory for the buffer, which lets you do flushes and invalidations to (hopefully) transfer in more efficient chunks. The reason that I'm not using a host-cached memory buffer is that GLSL volatile doesn't work for host-side memory: the GPU seems to cache fetched memory into GPU RAM and volatile only bypasses the GPU L1/L2 caches, so you'll never see CPU writes that landed after your first read. And there doesn't seem to be a buffer type that's host-cached device-local.

Maybe have a CPU-side buffer for IO, and a transfer queue to submit copies to GPU memory. This should land the data into GPU RAM and volatile should work. Or do the fread to a separate buffer first, then memcpy it to the GPU buffer.

Vulkan's startup latency makes the current big binary approach bad for implementing short-lived shell commands. The anemic PCIe bandwidth makes compute-poor programs starve for data. GNU grep runs at 4 GB/s, but you can run 64 instances and achieve 50 GB/s [from page cache]. This isn't possible with grep.glsl. All you have is 12 GB/s.

Suppose this: you've got a long-running daemon. You send it SPIR-V or GLSL. It runs them on the GPU. It also maintains a GPU-side page cache for file I/O. Now your cold-cache data would still run at roughly the speed of storage (6 GB/s isn't _that_ bad.) But a cached file, a cached file would fly. 400 GB/s and more.

Creating the compute pipeline takes around 50 ms. Caching binary pipelines for programs would give you faster startup times.

The GPU driver terminates running programs after around 15 seconds. And they hang the graphics while running. Where's our pre-emptive multitasking?

This should really be running on a CPU because of the PCIe bottleneck and the Vulkan startup bottleneck. Going to try compiling it to ISPC or C++ and see how it goes.

[Update] Tested memcpy performance. Single thread CPU-CPU, about 8 GB/s. With 8 threads: 31 GB/s. Copying to the device-local host-visible buffer: 8 GB/s with one thread. With 4 threads, 10.2 GB/s. Cuda bandwidthTest can do 13.1 GB/s with pinned memory and 11.5 GB/s with pageable memory. The file.glsl IO transfer system doesn't seem to be all that slow, it just needs multi-threaded copies on the CPU side.

Built a simple threaded IO system that spawns a thread for every read IO up to a maximum of 16 threads in flight. Grep.glsl throughput: 10 GB/s. Hooray!

[Update 2] Lots of tweaking, debugging and banging-head-to-the-wall later, we're at 11.45 GB/s.
