art with code

2020-06-29

grep.glsl

grep.glsl is sort of working. It's a GLSL compute shader version of grep. A very simple one at that. It tests a string against a file's contents and prints out the byte offsets where the string was found.

The awesome part of this is that the shader is running the IO. And it performs reasonably well after tuning. You could imagine a graphics shader dynamically loading geometry and textures when it needs them, then polling for load completion in the following frames.

Here are the first few lines of grep.glsl:

    string filename = aGet(argv, 2);
    string pattern = aGet(argv, 1);

    if (ThreadLocalID == 0) done = 0;

    if (ThreadID == 0) {
        FREE(
            println(concat("Searching for pattern ", pattern));
            println(concat("In file ", filename));
        )
        setReturnValue(1);
    }

Not your run-of-the-mill shader, eh?

This is the file reading part:

    while (done == 0) {
        FREE(FREE_IO(
            barrier(); memoryBarrier();

            // Read the file segment for the workgroup.
            if (ThreadLocalID == 0) {
                wgBuf = readSync(filename, wgOff, wgBufSize, string(wgHeapStart, wgHeapStart + wgBufSize));
                if (strLen(wgBuf) != wgBufSize) {
                    atomicAdd(done, strLen(wgBuf) == 0 ? 2 : 1);
                }
            }

            barrier(); memoryBarrier();

            if (done == 2) break; // Got an empty read.
            
            // Get this thread's slice of the workGroup buffer
            string buf = string(
                min(wgBuf.y, wgBuf.x + ThreadLocalID * blockSize),
                min(wgBuf.y, wgBuf.x + (ThreadLocalID+1) * blockSize + patternLength)
            );

            // Iterate through the buffer slice and add found byte offsets to the search results.
            int start = startp;
            i32heapPtr = startp;
            for (int i = 0; i < blockSize; i++) {
                int idx = buf.x + i;
                if (startsWith(string(idx, buf.y), pattern)) {
                    i32heap[i32heapPtr++] = int32_t(i);
                    found = true;
                }
            }
            int end = i32heapPtr;

            ...

Performance is complicated. Vulkan compute shaders have a huge 200 ms startup cost and a 160 ms cleanup cost. About 60 ms of that is creating the compute pipeline; the rest is instance and device creation.

Once you get the shader running, performance continues to be complicated. The main bottleneck is IO, as you might imagine. The file.glsl IO implementation uses a device-local, host-visible volatile buffer to communicate between the CPU and GPU. The GPU tells the CPU that it has IO work by writing into the buffer, using atomics to prevent several GPU threads from writing into the same request. The CPU spinlocks to grab new requests, then spinlocks until the request data becomes available. After processing a request, the CPU writes the results to the buffer. The GPU spinlocks waiting for IO completion, then copies the IO results to its device-local heap buffer.
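
To make the handshake concrete, here's a minimal sketch of the GPU side of such a protocol. The request layout, the names (ioRequestCount, ioRequests, IO_READ, and friends) and the fixed-size IO heap slots are assumptions for illustration, not the actual file.glsl API:

    // Hypothetical GPU-side request submission. The IO buffers are volatile so
    // the spin loop sees the CPU's writes; the main heap isn't, so it can be cached.
    struct IORequest { int kind; int status; int offset; int count; };

    layout(std430, binding = 0) volatile buffer IORequestBuffer {
        int ioRequestCount;
        IORequest ioRequests[];
    };
    layout(std430, binding = 1) volatile buffer IOHeapBuffer { int ioHeap[]; };
    layout(std430, binding = 2) buffer HeapBuffer { int heap[]; };

    const int IO_READ = 1;
    const int IO_PENDING = 1;
    const int IO_DONE = 2;
    const int IO_SLOT_SIZE = 8192;   // assumed per-request IO heap slot

    void requestRead(int fileOffset, int count, int heapDst) {
        // The atomic keeps two GPU threads from claiming the same request slot.
        int slot = atomicAdd(ioRequestCount, 1);
        ioRequests[slot].kind = IO_READ;
        ioRequests[slot].offset = fileOffset;
        ioRequests[slot].count = count;
        ioRequests[slot].status = IO_PENDING;   // the CPU flips this to IO_DONE

        // Spin on the volatile request buffer until the CPU has finished the IO.
        while (ioRequests[slot].status != IO_DONE) {}

        // Copy the result from the host-visible IO heap to the device-local heap.
        for (int i = 0; i < count; i++) {
            heap[heapDst + i] = ioHeap[slot * IO_SLOT_SIZE + i];
        }
    }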

The GPU-GPU copies execute reasonably fast (200+ GB/s), but waiting for the IO drops the throughput to around 6 GB/s. This is way better than the 1.6 GB/s it used to be, when it was using int8_t for IO and ~200 kB transfers. Now the transfer buffer is 1 MB and the IO copies are done with i64vec4.

Once you have the data on the GPU, performance continues to be complicated. Iterating through the data one byte at a time goes at roughly 30 GB/s. If the buffer is host-visible, the speed drops to 10 GB/s. Searching a device-local buffer 32 bytes at a time using an i64vec4 goes at 220 GB/s.
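
For reference, the wide access pattern looks something like this; a sketch with assumed buffer names, requiring an int64 extension such as GL_EXT_shader_explicit_arithmetic_types_int64:

    // Each iteration moves 4 * 8 = 32 bytes instead of a single byte.
    layout(std430, binding = 3) readonly  buffer WideSrc { i64vec4 wideSrc[]; };
    layout(std430, binding = 4) writeonly buffer WideDst { i64vec4 wideDst[]; };

    void wideCopy(uint vecCount) {
        for (uint i = gl_LocalInvocationID.x; i < vecCount; i += gl_WorkGroupSize.x) {
            wideDst[i] = wideSrc[i];
        }
    }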

The CPU-GPU transfer buffer has a max size of 256 MB. The design of file.glsl causes issues here, since it slices the transfer buffer across shader instances, making it difficult to do full-buffer transfers (and minimize IO spinning). Now grep.glsl does transfers in per-workgroup slices, where 100 workgroups each do a 1 MB read from the searched file, then distribute the search work across 255 workers per workgroup.
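
The slicing itself is simple index arithmetic; roughly this, with assumed names rather than the exact grep.glsl code:

    // 1 MB read per workgroup per pass; workgroup i reads slice i, then everyone
    // advances by the total amount read per pass (numWorkgroups * wgBufSize).
    const int wgBufSize = 1 << 20;

    int wgReadOffset(int pass) {
        return int(gl_WorkGroupID.x + uint(pass) * gl_NumWorkGroups.x) * wgBufSize;
    }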

This achieves 6 GB/s shader throughput. The PCIe bus is capable of 12 GB/s. If you remove the IO waits and just do buffer copies, the shader runs at 130 GB/s. Accounting for that copy overhead, the PCIe transfers themselves should be happening at around 6.3 GB/s. Removing the buffer copies had a minimal effect on throughput, so doing the search directly on the IO buffer shouldn't improve performance by much.

Thoughts

Why are the transfers not hitting the PCIe bus limits? Suspiciously hovering at around half of PCIe bandwidth too. Could it be that the device-local buffer is slow to write to from the CPU side? Previously I had host-cached memory for the buffer, which lets you do flushes and invalidations to (hopefully) transfer in more efficient chunks. The reason that I'm not using a host-cached memory buffer is that GLSL volatile doesn't work for host-side memory: the GPU seems to cache fetched memory into GPU RAM and volatile only bypasses the GPU L1/L2 caches, so you'll never see CPU writes that landed after your first read. And there doesn't seem to be a buffer type that's host-cached device-local.

Maybe have a CPU-side buffer for IO, and a transfer queue to submit copies to GPU memory. This should land the data into GPU RAM and volatile should work. Or do the fread to a separate buffer first, then memcpy it to the GPU buffer.

Vulkan's startup latency makes the current big binary approach bad for implementing short-lived shell commands. The anemic PCIe bandwidth makes compute-poor programs starve for data. GNU grep runs at 4 GB/s, but you can run 64 instances and achieve 50 GB/s [from page cache]. This isn't possible with grep.glsl. All you have is 12 GB/s.

Suppose this: you've got a long-running daemon. You send it SPIR-V or GLSL. It runs them on the GPU. It also maintains a GPU-side page cache for file I/O. Now your cold-cache data would still run at roughly the speed of storage (6 GB/s isn't _that_ bad.) But a cached file, a cached file would fly. 400 GB/s and more.

Creating the compute pipeline takes around 50 ms. Caching binary pipelines for programs would give you faster startup times.

The GPU driver terminates running programs after around 15 seconds. And they hang the graphics while running. Where's our pre-emptive multitasking?

This should really be running on a CPU because of the PCIe bottleneck and the Vulkan startup bottleneck. Going to try compiling it to ISPC or C++ and see how it goes.

[Update] Tested memcpy performance. Single thread CPU-CPU, about 8 GB/s. With 8 threads: 31 GB/s. Copying to the device-local host-visible buffer: 8 GB/s with one thread. With 4 threads, 10.2 GB/s. Cuda bandwidthTest can do 13.1 GB/s with pinned memory and 11.5 GB/s with pageable memory. The file.glsl IO transfer system doesn't seem to be all that slow, it just needs multi-threaded copies on the CPU side.

Built a simple threaded IO system that spawns a thread for every read IO up to a maximum of 16 threads in flight. Grep.glsl throughput: 10 GB/s. Hooray!

[Update 2] Lots of tweaking, debugging and banging-head-to-the-wall later, we're at 11.45 GB/s.

2020-06-25

file.glsl

Current design: Each shader invocation has an 8k heap slice and an 8k IO heap slice. There's an IO request buffer. GPU writes IOs to the req buffer, CPU picks them up and writes the result to the IO heap. The GPU copies the result from the IO heap to the main heap. The IO heap and the IO request buffer are marked volatile. The main heap isn't, so it can benefit from caches.

Now trying to handle argv nicely. Allocate an extra slice in the heap and copy the argv there before shader start. 

This could also be used to store string literals. Right now string literals are malloc'd and filled in every invocation at shader start, which is a total waste of time. On the other hand, because the string lib has funcs that do in-place modification, the per-invocation copies do avoid one class of errors. Switching to immutable strings in the lib is an enticing option.

Memory allocation is done with a simple bump malloc. Freeing heap memory after use is a hassle. I have macros FREE() and FREE_IO() that free whatever heap / IO heap allocations you did inside the call. 
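
With a bump allocator, those macros can be as simple as saving and restoring the heap pointers around the body; a minimal sketch of the idea (the heapPtr / ioHeapPtr names are assumptions and the real file.glsl macros may differ):

    // Save the bump pointer, run the body, then roll the pointer back,
    // releasing everything the body allocated.
    #define FREE(body)    { int _heapTop = heapPtr; body; heapPtr = _heapTop; }
    #define FREE_IO(body) { int _ioTop = ioHeapPtr; body; ioHeapPtr = _ioTop; }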

It might be nice to have a global heap with a proper malloc to store large allocations that are re-used across threads. E.g. for loading texture data. This would probably have to be a fixed size buffer. I doubt that it's possible to allocate more buffer memory on the CPU side while the shader is running. Would be nice though!

Wrote a sketch of grep.glsl. Very educational, as it exposed a bunch of missing features and "this'd be nice to have" things. Handling argv and the program return value falls into the first category. Helpers for reductions across all invocations and sorted-order IO fall into the second.

The current grep.glsl sketch is: each thread does a read for a 4kB+strLen(pattern) chunk, then runs indexOf(chunk, pattern) to find all occurrences of the pattern in it. The threads then iterate through all invocationIDs in order (with a barrier() call to keep them in lockstep) and when the thread id matches the current id, the thread prints out its results. This should keep the workgroup threads in order, but the workgroups might still finish in a different order. Then the threads or-reduce whether they found any hits to set the program return value to 0 or 1. Advance the read offset by the total thread count times 4kB, and repeat until a thread reads less than 4kB (EOF).
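
The in-order printing part of that sketch boils down to a lockstep loop over invocation IDs, roughly like this (hitCount and printResults are hypothetical placeholders for the per-thread results):

    // Every invocation runs the full loop so barrier() stays in uniform control
    // flow; only the invocation whose turn it is prints its results.
    for (uint id = 0u; id < gl_WorkGroupSize.x; id++) {
        barrier();
        if (gl_LocalInvocationID.x == id && hitCount > 0) {
            printResults();
        }
    }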

This makes a bunch of design issues apparent. The IOs should be run concurrently. Copying IO bytes to heap has limited value here, so having a way to run string functions directly on the IO heap would be nice. The read could be done with a single IO instead of threadCount IOs. Overlapping IO and execution: could issue two IOs, wait for the first, issue third IO, process first IO, wait for the second, issue fourth IO, process second IO, ... How to do line numbers. Reads should have an EOF flag. Can we hit 12 GB/s for files in page cache? Can we cache files in GPU RAM and match more patterns later at 400 GB/s?


2020-06-16

Real-time path tracing

[WIP - I'll update this over time]

The key problem in real-time path tracing is that you have enough compute to do a couple of paths per pixel, but the quality you want requires ten thousand paths per pixel.

There are three ways to solve this correctly. One is to wait a few decades for computers to get 10 000 times faster. One is to use 10 000 computers to do the rendering. And one is to make paths 10 000 times cheaper.

Or you could render at a lower resolution and frame rate. If you can do 1 sample per pixel at 4k resolution at 60 Hz, then you can do 10k samples per pixel at 30 Hz at 60x25 resolution. Yes, you can have real-time movie quality graphics on a thumbnail-sized display. If you want to hit C64 resolutions like 320x200, you could get 3-4 cryptomining PCs with 10+ GPUs each. For a paltry $15000 you can have the C64 of your dreams!

But, well, what if you want to do high-resolution rendering at high frame rates? Without the budget for a massive cluster, and without having to find The Secret of Monte Carlo which would enable incredibly fast and accurate integration of the lighting equation.

There are ways. You could put together four techniques that each give you a 10x perceptual sample count boost. Or you could put together 14 techniques with a 2x perceptual sample count boost each. Or perhaps a mix of the two. Then there are optimizations and mathematical techniques to improve the performance of the Monte Carlo integration used in path tracing. Even ways to arrive at better approximations of the light field in the same amount of time. What's important is picking a set of techniques that compound.

Monte Carlo integration is a process where you integrate a function by calling it with random values, summing up the results, and dividing the sum by the number of results. The nice thing about MC integration is that it will eventually converge to the correct solution, no matter what initial value and weight you start with.
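
In code, the estimator is just a running average; a minimal sketch with hypothetical samplePath and randomDir helpers:

    // Monte Carlo estimate for one pixel: average the integrand over random samples.
    vec3 integratePixel(ivec2 pixel, int sampleCount) {
        vec3 sum = vec3(0.0);
        for (int i = 0; i < sampleCount; i++) {
            sum += samplePath(pixel, randomDir(pixel, i));   // hypothetical helpers
        }
        return sum / float(sampleCount);
    }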

Images generated with MC integration start off noisy, since the initial samples of the function have a high weight and neighboring pixels often end up with quite different values at any given sample count because each pixel picks its own random sample points. This noise gets smoothed out as the number of samples increases and the pixel neighborhood ends up sampling a similar hemisphere.

The relationship between noise and sample count is roughly that noise is inversely proportional to the square root of the sample count. Quadruple the sample count and the noise halves. At 10k samples, you've halved the noise about 6.6 times. If you've got an 8-bit range, your noise level should be about 1.4 bits, or roughly +-1.3 units.

If you want to get rid of noise in the early stages of the integration, you have to down-weight the initial samples relative to the neighborhood. The easiest way is to blur the image. Another way is to pick a non-noisy starting point and give it a high weight, so that the new MC samples don't cause as much noise. Yet another way is to correlate the sampling of the neighboring pixels so that they sample the same regions - this will trade high frequency noise for low frequency noise.

E.g. use a sampling strategy for a 64x64 region where you first take 100 samples of the hemisphere at the center of the region, keep the top-5 highest energy samples, weigh them down by their probability, and use shadow rays to connect the 64x64 region to only the high-energy values. You'll get no pixel-to-pixel noise in the 64x64 region, but you might end up with a blotchy image when transitioning from one region to another. 

Fireflies are rare high-energy samples that jump a pixel value much higher than the neighborhood. Fireflies are actually good for the energy search part of the integration: you can use the firefly path as a guiding star for the neighborhood. But they cause bright noise that takes a large number of samples to tone down. One way to deal with fireflies is to clamp them. This way you'll still get bright pixels but they won't slow convergence towards darkness. The issue with clamping is that fireflies can be an important contributor of energy in the neighborhood. If a firefly boosts a pixel by 10 units and the path occurs only on every 100th sample, the firefly's contribution should be 0.1 units. If you clamp the firefly to 1, it'll contribute only 0.01 units, and the pixel ends up too dark.
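
The clamp itself is a one-liner; a sketch of luminance-based clamping, with the threshold left as a tunable parameter:

    // Scale overly bright samples down to a maximum luminance.
    vec3 clampFirefly(vec3 s, float maxLum) {
        float lum = dot(s, vec3(0.2126, 0.7152, 0.0722));   // Rec. 709 luminance
        return lum > maxLum ? s * (maxLum / lum) : s;
    }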

A better way might be to estimate the probability of the firefly path based on the neighborhood samples. If only every 100th pixel has a firefly of 10 units, scale the fireflies down to 0.1 units and add 0.1 to the entire neighborhood. As the sample count increases, you can start passing the fireflies straight through and converge to an unbiased result.

Temporal reprojection lets you use samples computed in previous frames as the starting point of the integration in the current frame. Adaptive sampling allows you to focus your ray budget to the areas of the scene that need it the most. Sparse sampling lets you skip some pixels completely.

Multiple importance sampling guides your integrator towards high-energy regions of the scene. Bidirectional path tracing tries to connect high-energy paths with camera paths. Photon mapping caches high-energy paths in the scene. Vertex connection and merging combines BDPT and progressive photon mapping.

Path reuse techniques like bidirectional path tracing can give you additional paths at the expense of a single shadow ray and BSDF evaluation. You generate a camera path and then connect the vertices on the camera path to generated path suffixes. In simple BDPT, you'd generate a camera path and a light path, then connect each vertex of the camera path to each vertex of the light path. If you have three connecting vertices in the camera path and three in the light path, you'd get ten full paths at the expense of one camera path, one light path, and nine shadow rays. If you reuse light paths across multiple pixels, you can amortize the light path cost. In static scenes, you could also reuse camera paths. This way you might be able to generate decent paths with a single shadow ray per path.
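
The connection step is just the double loop over stored path vertices; a rough sketch with hypothetical path and visibility helpers:

    // Connect every camera vertex to every light vertex with one shadow ray each.
    for (int ci = 0; ci < cameraPathLength; ci++) {
        for (int li = 0; li < lightPathLength; li++) {
            if (visible(cameraVertices[ci], lightVertices[li])) {   // one shadow ray
                radiance += connectionContribution(ci, li);         // BSDFs, pdfs, MIS weight
            }
        }
    }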

For outdoor scenes, you'll start seeing diminishing returns after five bounces, so path reuse techniques for easy light paths might be limited to around a 5x speed boost. In indoor scenes with longer paths, path reuse becomes more useful. If you're also dealing with difficult lighting conditions, the generated high-energy path suffixes are very helpful.

Temporal reprojection can allow you to start integration at a high effective sample count, based on integration results from previous frames. It is most helpful with diffuse surfaces, where camera angle and position don't change the integration result. On glossy and reflective surfaces and refractive volumes you'll see significant ghosting.

Adaptive sampling is quite crucial for fast renders at low sample counts. Not all parts of the scene require the same number of samples to converge. If you can skip low-noise, low-contrast regions, you can allocate more samples to noisy regions and high-contrast regions. You can borrow some tricks from artists and blur out dark regions of the image since the eye is going to skip them anyhow and seek out high-contrast regions and bright areas. Throwing extra samples at determining whether a shadow region has a value of 0.01 or 0.02 is a waste of time, especially in animated scenes. You can use foveated rendering tricks and spend your ray budget where the viewer is looking. If you don't have eye tracking, you can make educated guesses (center of the screen, anything moving, faces, silhouettes.)

Sparse sampling is the extreme cousin of adaptive sampling. With sparse sampling you can completely skip rendering some pixels, filling them from the temporal reprojection buffer and the nearest sampled pixels. Necessary for hitting a target framerate. Handy for foveal rendering. If you have sparse sampling, you can run your interactive scene at a locked 60 FPS.

Combining the above techniques gives you some nice synergies. Adaptive sampling can focus on parts of the scene that are lacking previous frame data. Sparse sampling can be used to generate a low-resolution render with a high sample count, to be used as the start for integration and as a low-res variance estimate to guide the adaptive sampler. Temporal reprojection could pull samples out of the path reuse cache.

The path reuse technique can prioritize path variants at low bounce counts (say, 10 variants at bounce 1, 2 variants at bounce 2, 1 at bounce 3 -- this'd get you a good sampling of the first bounce hemisphere). The path caches can be used to estimate energy at different parts of the scene and the probability of connecting two parts of the scene, this could be used for MIS pdfs and to guide paths towards high energy regions with high probability of getting there. 

Denoisers work by looking at the variance of a region and the level of convergence (how much each new sample changes the value at the pixel, and how close the pixel's value is to its neighbors.) If the convergence is low and the variance is high, the pixel's light estimate is likely off, and the region's light estimate likewise. A denoiser can pool the samples in the neighborhood and distribute the energy so that the variation between samples is low. That is, blur the region.

By looking at geometry and texture changes, the denoiser can avoid denoising high-contrast regions that should be high-contrast (e.g. you've got a sharp edge, so the normals on both sides of the edge are very different => denoiser can avoid blurring across the edge). Textures have high-frequency signal that also should be there. Denoising without the texture applied and then applying the texture on top can give you a smooth lighting solution without blurring the texture.

Denoising can give you overly smooth results. You could make a denoiser that brings the noise down to a target level, but not all the way down. Keep some of the noise, don't blur out natural noise. Use a variable amount of denoising depending on the region's variance, convergence and sample count.

Noise in rendering is tricky. Let's say you're using blue noise to get a nicer sampling pattern. And you render an animation with a moving camera. If you fix the noise to the screen, it will look like the screen is a bit dirty. If you fix the noise to world coordinates, it will look like your object is dirty. If you animate the noise, it will look like your renderer is broken.

At one 5-ray path per pixel at 4k, you've got 40 million rays worth of scene connectivity data in every frame. Figuring out ways to make use of these rays for MC integration might well give you a high-quality render at a low cost.

Acceleration structures make ray-scene intersection cheaper to evaluate. They have a build cost. For dynamic scenes where you have to update or rebuild the acceleration structure every frame, it's important to use one that's fast to build. For static scenes, you can go with a more expensive acceleration structure that makes ray intersection faster.

Acceleration structures map a 5D ray (origin, direction) to primitives intersected by the ray. Usually this is done either with a tree that splits the space into hierarchical regions, or with a tree that splits the scene into bounding boxes of primitives. The difficulty, as you might imagine, is that a 3D tree doesn't give you exact matches for a 5D ray. Instead you end up with a list of tree nodes to test against the ray. If the tree doesn't have an ordering property (i.e. "if the ray hits something in this node, it's the nearest object"), you might even need to do several intersections to find the closest one.

The trees do have some directionality. In a BVH, sorting the primitives recursively along different coordinate axes creates lists that are ordered along a dimension. Once you know that you're dealing with an ordered list, you can exit early if your hit (or the ray's exit point) is before the start of the next untested element.

Grids are nice in that they're directional, so you can exit early on the first hit. And you've got a fixed maximum number of steps it takes to walk through a grid.

What you'd really like to have is a proper 5D lookup structure where you can directly find the next hit for a ray. This is difficult. 

Ray classification (Arvo & Kirk 87) assigns objects into a lazily-constructed 5D grid. On each bounce, you assign your rays into 5D grid cells based on their origin and direction. Then for each grid cell, find all objects that match with it (a 5D grid cell is basically a frustum, so you'd do frustum-to-BVH checks). Then trace your rays, intersecting only against the objects inside the ray's grid cell.

You could split the scene into a hierarchical tree of beams that cover the direction space, then put the primitives inside each beam into a tree sorted along the beam direction. The good part about this approach is that you can find the beam for a ray fast, then jump to the right tree position, and there's your intersection. The bad part is that it uses memory like crazy, since you're storing references to the entire scene N times, where N is the number of beam directions you have.

You'll start hitting diminishing returns pretty soon. BVH intersection takes a few dozen steps and half a dozen intersections. If you're slower than that, just use a BVH. If you can optimize your structure to half a dozen steps and a single intersection, you'll be 5-6x faster than a BVH. Nothing to scoff at but it won't get you from 1 spp to 10k spp alone.

Let's look at memory bandwidth. Doing 10k paths of length 5, you'd need to read 50k triangles per pixel. One triangle is nine floats, or 36 bytes. That'd add up to 1.8 MB of triangle reads per pixel. A GPU with 1 TB/s memory bandwidth could process half a million pixels per second. Or about 100x100 pixels per frame. And then you need to do the compute and fetch the materials and textures.

If you want to render 10k paths per pixel at 4k@60 with 1 TB/s memory bandwidth, each ray can use only 0.04 bytes of memory bandwidth, or a third of a bit. How to fit three rays in a bit is left as an exercise to the reader. If you limit yourself to 100 paths per frame at 1080p, you'll have 16 bytes of memory bandwidth per ray. Still not enough to fit a 9-float triangle, but maybe you could trace SDFs or something.

Compute-wise, well, 10k paths per pixel at 4k is about 80 billion paths per frame. At 60 Hz, that's a cool five trillion paths per second. A 13 TFLOPS RTX 2080 Ti could spare 2.5 ops per path. If each path has five rays, that's half an op per ray. Going down to 100 paths at 1080p30, your budget becomes 432 ops per ray. If you also go from f32 to f16, you'd have 860 ops per ray and 32 bytes of bandwidth per ray @ 1TB/s, for 16 half-floats. A memory-dense acceleration structure with avg 1 triangle per traversal, and you could theoretically fit the acceleration structure and one triangle in that.

If you could somehow do path tracing with INT4s, you'd have a lot more headroom for memory bandwidth and compute. 32 bytes would fit 64 INT4s. And an A100 card can do 1248 trillion tensor ops per second on INT4s. That'd be 50 ops per ray at 4k60 and 10k paths per pixel. You could render something! I don't know what it would look like, but something at least!

Three types of paths: high-energy paths, high-probability paths and connecting paths. High-energy paths start from emissive surfaces. High-probability paths start from the camera. Connecting paths connect the two. High-P paths change when the camera moves and the scene in front of the camera changes. High-E paths change when the scene around the light sources moves. Connecting paths can be weighted via MIS. You multiply the contribution of a path by its probability.
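
For two sampling strategies, the standard MIS balance heuristic weight is a tiny function (this is textbook MIS, not something specific to this post):

    // Balance heuristic: weight for a sample drawn from strategy A when
    // strategy B could also have generated the same path.
    float misWeight(float pdfA, float pdfB) {
        return pdfA / (pdfA + pdfB);
    }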

Unidirectional path tracing only generates high-P paths. Path tracing with next event estimation (shooting shadow rays at lights) tries to connect high-P paths with high-E regions. Bidirectional path tracing generates high-E paths and tries to connect high-P paths with high-E path suffixes. Photon mapping propagates high-E regions throughout the scene. Radiosity is another method to propagate high-E regions.

We know how to generate high-P paths and high-E paths. What about generating connecting paths? You could pick two random surfaces in the scene and try to connect them with a ray. Do this a few million times and you should have a decent connectivity map to estimate the transmission ratio from one arbitrary ray to another, and what the connecting path would look like. Then figure out high-P regions (what parts of the scene have camera rays) and sum up the energy in high-E regions, then search for high transmission connective paths from high-P regions to high-E regions and try to do connections that would contribute the most to camera pixels. That is, you've got a high-transmission, high-probability (according to the BSDF) connection from a camera path to a light path.

More tricks. Scene traversal on a GPU. You've got 32 rays flying at the same time, they're stepping through the acceleration structure to find the next hit. The hit happens at different depths of the acceleration structure, so the whole thing proceeds at the speed of the slowest ray. Then you bounce all the rays, they go all over the place. Now your rays are diverging even more. Not only do they finish at different times, they also access memory in different places. Another bounce. Some rays finished their paths and the execution lanes for those rays are just idling until all the 32 rays have finished.

You could make the rays fly in a bundle, forcing them to bounce in the same direction. That'd get you lower divergence among the bundle. This'll give you a very biased ray bunch though, and it'll look like your renderer is doing some weird hackery. Maybe you could scramble the starting pixels for the rays. Use a dither pattern to pick a bunch of pixels from a region, trace them in a coherent fashion. Pick another dither bunch of pixels, trace them in a different coherent fashion. Repeat until you've got the region covered. Now neighboring pixels won't have traversed a coherent path, so you get something more of a dithered noise pattern. And hopefully get the performance gains of coherent ray bundles without the biased render look at low sample counts.


2020-06-12

GLSL strings

Started writing a small string library for GLSL. Because it might be nice to do massively parallel string processing. It's difficult to make it fast on a GPU though, lots of variable-length reads and variable-length writes.

You can follow the progress on the spirv-wasm repo at https://github.com/kig/spirv-wasm/tree/master/string

It's still in the shambles phase. That is, I haven't actually run the code. Adding string literals to GLSL is done with a small pre-processing script that will eventually turn string literals into heap allocations. The approach is similar to the GLSL HTTP parser hack.

In the HTTP parser pre-processor, string literals were done with "abcd" becoming an array of ints and 'abcd' turning into an int with four ASCII characters packed into it. The array approach is difficult with GLSL, since arrays need to have a size known at compile time. Packing four chars into an int and four ints into an ivec4 is, uh, fast? But a lot of hassle to work with.
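
The packing is straightforward; a sketch of what the pack/unpack could look like (the byte order is an assumption):

    // Pack four ASCII chars into one int, lowest byte first.
    int packChars(int c0, int c1, int c2, int c3) {
        return c0 | (c1 << 8) | (c2 << 16) | (c3 << 24);
    }

    // Pull char i (0..3) back out.
    int unpackChar(int packed, int i) {
        return (packed >> (8 * i)) & 0xFF;
    }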

In string.glsl I'm trying a different approach, representing strings with an ivec2 holding a start pointer and an end pointer into the heap buffer. So there are a lot of loops that look like

    for (int i = str.x; i < str.y; i++) {
        heap[i]...
    }
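
With that representation, string functions become loops over heap[] between the two pointers; for example (a sketch that assumes `string` is an alias for ivec2 and a global heap[] buffer, so the real string.glsl may differ):

    #define string ivec2

    int strLen(string s) { return s.y - s.x; }

    bool strEq(string a, string b) {
        if (strLen(a) != strLen(b)) return false;
        for (int i = 0; i < strLen(a); i++) {
            if (heap[a.x + i] != heap[b.x + i]) return false;
        }
        return true;
    }
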
This is nice in that you can have strings of varying lengths and do the whole slice-lowercase-replace-split-join kaboodle. But you need malloc. And GLSL doesn't have malloc. So we need to make a malloc. The malloc is a super simple stack-style malloc.

Malloc


We've got a heap buffer that's split into slices. Each program instance gets its own slice of the heap. Each program instance also has a heapPtr that points to the heap slice of the instance. To allocate N bytes of memory, malloc increments heapPtr by N and returns an ivec2 with the previous heapPtr and the new heapPtr. There's no free. If you allocate too much, you'll run out of space and corrupt the next instance's slice. To reclaim memory, make a copy of heapPtr before doing your allocations, then set heapPtr to the old value once you're done.
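
The whole allocator fits in a few lines; a minimal sketch with assumed names:

    // Bump allocator: hand out the next len slots from this invocation's heap slice.
    int heapPtr;   // set to the start of the invocation's slice at startup

    ivec2 malloc(int len) {
        int start = heapPtr;
        heapPtr += len;
        return ivec2(start, heapPtr);   // (start, end) of the allocation
    }

    // To reclaim: int top = heapPtr; ... allocate and use ...; heapPtr = top;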

This kind of malloc is not perfect by any means but it's fast and predictable. GLSL's lack of recursion and the intended <16 ms program runtime also make it unlikely that you're going to need something more sophisticated. Free by terminating the program.

In case you do need something fancier, how to improve on this? If you wanted to share the heap between program instances, you could use atomics. To free allocations, you'd need a data structure to keep track of them. Probably a two-level setup where program instances reserve pages, and then internally allocate from the pages. Use virtual memory to allow allocations to span several pages, or reserve sequential pages to avoid VM. On first malloc, a program instance would reserve a page by setting a bit in the page table bitmap. On program exit, the instance would unset the bits for the pages it had reserved.
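
Page reservation could be a loop over a shared bitmap with atomicOr; a hypothetical sketch:

    // Shared page table: one bit per page, a set bit means the page is reserved.
    layout(std430, binding = 5) buffer PageTable { uint pageBits[]; };

    // Try to grab a free page; returns the page index, or -1 if the heap is full.
    int reservePage(int pageCount) {
        for (int p = 0; p < pageCount; p++) {
            uint mask = 1u << uint(p % 32);
            uint prev = atomicOr(pageBits[p / 32], mask);
            if ((prev & mask) == 0u) return p;   // the bit was clear, the page is ours
        }
        return -1;
    }

    // On exit, release with atomicAnd(pageBits[p / 32], ~mask) for each reserved page.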

Performance


String performance in the HTTP compute shader produced some insights. The CPU doesn't like gathers, so striping the HTTP requests runs quite a bit better in ISPC. You take the requests that are AAAABBBBCCCCDDDD... etc. and rearrange the bytes to ABCDABCDABCDABCD... This way the SIMD threads can read their next byte with a single vector load. On the GPU this was less helpful. GPU and CPU performance was boosted significantly by doing a larger amount of work per program instance. Processing one request per instance vs. processing 1024 requests per instance could give a >10x speed boost.
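
The striped layout is just a transposed index; a sketch:

    // Request r, byte i: contiguous layout vs. striped layout.
    int contiguousIndex(int r, int i, int requestSize) { return r * requestSize + i; }
    int stripedIndex(int r, int i, int numRequests)    { return i * numRequests + r; }

    // With striping, lane r reading byte i of its request lands on consecutive
    // addresses across lanes, so the SIMD program can use a single vector load.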

The general performance of the HTTP compute shader is fun. Compiling GLSL for the CPU via ISPC produces code that runs very fast. Running the same code on the GPU is slower. We're talking about the CPU (16-core TR2950X) doing 18 million requests per second and the GPU (RTX2070) doing 9 million requests per second, on a 66/33 read/write sequential workload to an in-memory key-value store of half a million 1kB blocks. Granted, this has very low computational intensity, it's basically "read a few bytes, branch, parse an int, branch, memcpy." Perhaps the tables would turn if you rendered Mandelbrots instead. Also likely is that I haven't found the right way to write this shader for the GPU. I believe the GPU would also benefit from using a separate transfer queue for sending and receiving the request and response buffers asynchronously. With sync transfers & transfer times included, the GPU perf drops further to 5 Mreqs/s.

Feeding the beast is left as an exercise for the reader. You'd need to receive 18 million packets per second and send the same out. You could have a bunch of edge servers collect their requests into a buffer and send the buffer over to the DB server, which would process the requests and send the response buffers back to the edge servers. If one edge server could deal with a million packets per second, you'd need 36 of them for a single CPU server.

Scaling should be mostly linear. I don't have a 128-core server handy, but you might well be able to handle over 100 million requests per second with one. That'd be 100 GB/s of 1 kB responses. Or one request per minute for every person on Earth.

[Edit] On further investigation, the CPU performance seems to be memory limited. Running the same task in multi-threaded C++ (via spirv-cross) runs at nearly the same speed regardless of whether your thread count is 8, 16 or 32. At 4 threads, the performance roughly halves. So while a multi-socket server might help (more memory channels!), you might not need more than two cores per memory channel for this.
