art with code

2020-06-12

GLSL strings

Started writing a small string library for GLSL. Because it might be nice to do massively parallel string processing. It's difficult to make it fast on a GPU though: lots of variable-length reads and variable-length writes.

You can follow the progress on the spirv-wasm repo at https://github.com/kig/spirv-wasm/tree/master/string

It's still in the shambles-phase. That is, I haven't actually run the code. Adding string literals to GLSL is done with a small pre-processing script that will eventually turn string literals into heap allocations. The approach is similar to the GLSL HTTP parser hack.

In the HTTP parser pre-processor, string literals were done with "abcd" becoming an array of ints and 'abcd' turning into an int with four ASCII characters packed into it. The array approach is difficult with GLSL, since arrays need to have a size known at compile time. Packing four chars into an int and four ints into an ivec4 is, uh, fast? But a lot of hassle to work with.
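For illustration, here's roughly what the packed form looks like and how you'd pull a character back out. This is just a sketch, not the pre-processor's actual output; in particular, putting the first character in the low byte is an assumption.
// 'abcd' packed into one int, first char in the lowest byte (assumed order).
int abcd = 0x61 | (0x62 << 8) | (0x63 << 16) | (0x64 << 24);

// Extract character i (0..3) from a packed int.
int charAt(int packed, int i) {
    return (packed >> (i * 8)) & 0xff;
}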

In string.glsl I'm trying a different approach: representing strings as an ivec2 that holds start and end pointers into the heap buffer. So there are a lot of loops that look like
for (int i = str.x; i < str.y; i++) {
    // ... do something with heap[i] ...
}
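Fleshed out into a real function, a string operation might look something like this equality check (a sketch: heap is the int heap buffer, and the name and signature are made up rather than taken from string.glsl):
// Hypothetical sketch: compare two heap-backed strings for equality.
bool strEq(ivec2 a, ivec2 b) {
    if (a.y - a.x != b.y - b.x) return false; // different lengths
    for (int i = 0; i < a.y - a.x; i++) {
        if (heap[a.x + i] != heap[b.x + i]) return false;
    }
    return true;
}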
This is nice in that you can have strings of varying lengths and do the whole slice-lowercase-replace-split-join kaboodle. But you need malloc. And GLSL doesn't have malloc. So we need to make a malloc. The malloc is a super simple stack-style malloc.

Malloc


We've got a heap buffer that's split into slices. Each program instance gets its own slice of the heap. Each program instance also has a heapPtr that points to the heap slice of the instance. To allocate N bytes of memory, malloc increments heapPtr by N and returns an ivec2 with the previous heapPtr and the new heapPtr. There's no free. If you allocate too much, you'll run out of space and corrupt the next instance's slice. To reclaim memory, make a copy of heapPtr before doing your allocations, then set heapPtr to the old value once you're done.
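A minimal sketch of that allocator, assuming heapPtr is a per-instance global initialized to the start of the instance's slice (the names here are illustrative, not necessarily the ones in string.glsl):
int heapPtr; // initialized elsewhere to the start of this instance's slice

// Bump allocator: hand out the next size elements as a (start, end) pair.
ivec2 malloc(int size) {
    ivec2 str = ivec2(heapPtr, heapPtr + size);
    heapPtr += size;
    return str;
}

void example() {
    // Reclaiming memory: save heapPtr, do temporary allocations, restore it.
    int saved = heapPtr;
    ivec2 tmp = malloc(64);
    // ... use tmp ...
    heapPtr = saved; // everything allocated after 'saved' is reclaimed
}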

This kind of malloc is not perfect by any means but it's fast and predictable. GLSL's lack of recursion and the intended <16 ms program runtime also make it unlikely that you're going to need something more sophisticated. Free by terminating the program.

In case you do need something fancier, how to improve on this? If you wanted to share the heap between program instances, you could use atomics. To free allocations, you'd need a data structure to keep track of them. Probably a two-level setup where program instances reserve pages, and then internally allocate from the pages. Use virtual memory to allow allocations to span several pages, or reserve sequential pages to avoid VM. On first malloc, a program instance would reserve a page by setting a bit in the page table bitmap. On program exit, the instance would unset the bits for the pages it had reserved.
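A sketch of what the page reservation could look like, with the page table as a bitmap in a shared buffer and GLSL buffer atomics to flip the bits. The buffer layout and names are hypothetical; nothing like this exists in the repo yet.
// Hypothetical shared page table: one bit per page, packed into uints.
layout(std430, binding = 2) buffer PageTable { uint pageBits[]; };

// Try to reserve page p; returns true if this instance got it.
bool reservePage(int p) {
    uint mask = 1u << uint(p & 31);
    uint old = atomicOr(pageBits[p >> 5], mask);
    return (old & mask) == 0u; // the bit was clear, so the page is ours
}

// On program exit, unset the bit again.
void releasePage(int p) {
    atomicAnd(pageBits[p >> 5], ~(1u << uint(p & 31)));
}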

Performance


Working on string performance in the HTTP compute shader produced some insights. The CPU doesn't like gathers, so striping the HTTP requests runs quite a bit better in ISPC: you take requests laid out as AAAABBBBCCCCDDDD... and rearrange the bytes to ABCDABCDABCDABCD..., so that the SIMD threads can each read their next byte with a single vector load. On the GPU this was less helpful. Both GPU and CPU performance were boosted significantly by doing a larger amount of work per program instance: processing 1024 requests per instance instead of one could give a >10x speed boost.
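The index math for the striped layout is simple, assuming every request is padded to a fixed size and reqCount instances run side by side (names here are hypothetical):
// Linear layout: instance i finds its j'th byte at i * reqSize + j.
// Striped layout: the j'th bytes of all requests sit next to each other,
// so instance i finds its j'th byte at j * reqCount + i.
int stripedIndex(int i, int j, int reqCount) {
    return j * reqCount + i;
}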

The general performance of the HTTP compute shader is fun. Compiling GLSL for the CPU via ISPC produces code that runs very fast. Running the same code on the GPU is slower. We're talking about the CPU (16-core TR2950X) doing 18 million requests per second and the GPU (RTX2070) doing 9 million requests per second, on a 66/33 read/write sequential workload to an in-memory key-value-store of half a million 1kB blocks. Granted, this has very low computational intensity, it's basically "read a few bytes, branch, parse an int, branch, memcpy." Perhaps the tables would turntables if you rendered mandelbrots instead. Also likely is that I haven't found the right way to write this shader for the GPU. I believe the GPU would also benefit from using a separate transfer queue for sending and receiving the request and response buffers asynchronously. With sync transfers & transfer times included, the GPU perf drops further to 5 Mreqs/s.

Feeding the beast is left as an exercise for the reader. You'd need to receive 18 million packets per second and send the same out, so 36 million packets per second in total. You could have a bunch of edge servers collect their requests into a buffer and send the buffer over to the DB server, which would process the requests and send the response buffers back to the edge servers. If one edge server could deal with a million packets per second, you'd need 36 of them for a single CPU server.

Scaling should be mostly linear. I don't have a 128-core server handy, but you might well be able to handle over 100 million requests per second with one. That'd be 100 GB/s of 1 kB responses. Or one request per minute for every person on Earth.

[Edit] On further investigation, the CPU performance seems to be memory limited. Running the same task in multi-threaded C++ (via spirv-cross) runs at nearly the same speed regardless of whether your thread count is 8, 16 or 32. At 4 threads, the performance roughly halves. So while a multi-socket server might help (more memory channels!), you might not need more than two cores per memory channel for this.
