art with code
2020-07-16
Compute shader daemon
2020-07-06
GPU IO library design thoughts
Thinking about the design of file.glsl, a file IO library for GLSL.
For a taste, how about calling node.js from a GPU shader:
string r, fn = concat("node-", str(ThreadID), ".txt"); awaitIO(runCmd(concat( "node -e 'fs=require(`fs`); fs.writeFileSync(`", fn, `, http://Date.now().toString())'" ))); r = readSync(fn, malloc(16)); println(concat("Node says ", r));
There are more examples in the test_file.glsl unit tests.
Design of GPU IO
Hardware considerations
- GPUs have around a hundred processors, each with a 32-wide SIMD unit.
- The SIMD unit executes 32-thread threadgroups and juggles around ten threadgroups at a time for latency hiding.
- GPU cacheline is 128 bytes.
- CPU cacheline is 64 bytes.
- GPU memory bandwidth is 400 - 1000 GB/s.
- CPU memory bandwidth is around 50 GB/s.
- PCIe 3 x16 bandwidth is 11-13 GB/s. PCIe 4 gets you around 20 GB/s.
- NVMe flash can do 2.5-10 GB/s on 4-16 channels. PCIe4 could boost to 5-20 GB/s.
- The CPU can do 30 GB/s memcpy with multiple threads, so it’s possible to keep PCIe4 saturated even with x16 -> x16.
- GPUDirect access to other PCIe devices is only available on server GPUs. Other GPUs need a roundtrip via the CPU.
- CPU memory accesses require several threads of execution to hit full memory bandwidth (single thread can do ~15 GB/s)
- DRAM is good at random access at >cacheline chunks with ~3-4x the bandwidth of PCIe3 x16, ~2x PCIe4 x16.
- Flash SSDs are good at random access at >128kB chunks, perform best with sequential accesses, can deal with high amounts of parallel requests. Writes are converted to log format.
- Optane is good at random access at small sizes >4kB and low parallelism. The performance of random and sequential accesses is similar.
- => Large reads to flash should be executed in sequence (could be done by prefetching the entire file to page cache and only serving requests once the prefetcher has passed them)
- => Small scattered reads should be dispatched in parallel (if IO rate < prefetch speed, just prefetch the whole file)
- => Writes can be dispatched in parallel with more freedom, especially without fsync. Sequential and/or large block size writes will perform better on flash.
- => Doing 128 small IO requests in parallel may perform better than 16 parallel requests.
- => IOs to page cache should be done in parallel and ASAP.
- => Caching data into GPU RAM is important for performance.
- => Programs that execute faster than the PCIe bus should be run on the CPU if the GPU doesn’t have the data in cache.
- => Fujitsu A64FX-type designs with lots of CPU cores with wide vector units and high bandwidth memory are awesome. No data juggling, no execution environment weirdness.
Software
The IO queue works by using spinlocks on both the CPU and GPU sides.
The fewer IO requests you make, the less time you spend spinning.
Sending data between CPU and GPU works best in large chunks.
To avoid issues with cacheline clashes, align messages on GPU cacheline size.
IO request spinlocks that read across the PCIe bus should have small delays between checks to avoid hogging the PCIe bus.
Workgroups (especially subgroups) should bundle their IOs into a single scatter/gather.
When working with opened files, reads and writes should be done with pread/pwrite. Sharing a FILE* across threads isn’t a great idea.
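A minimal sketch of what that means on the CPU side: each request carries its own offset, and the handler uses pread/pwrite on a plain fd, so any number of IO threads can hit the same file without sharing a seek position. The struct and field names here are invented for illustration, not file.glsl's actual layout.

// Hypothetical IO request record; the real file.glsl layout differs.
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    int     fd;        // file descriptor cached on the CPU side
    int     is_write;  // 0 = read, 1 = write
    off_t   offset;    // absolute file offset, no shared file position
    size_t  length;
    char   *buffer;    // slice of the CPU-GPU transfer buffer
} io_request;

// Safe to call concurrently from many threads on the same fd, because
// pread/pwrite take the offset explicitly instead of mutating the seek
// position the way fread/fwrite on a shared FILE* would.
ssize_t service_request(io_request *req) {
    return req->is_write
        ? pwrite(req->fd, req->buffer, req->length, req->offset)
        : pread(req->fd, req->buffer, req->length, req->offset);
}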
The cost of opening and closing files with every IO is eclipsed by transfer speeds with large (1 MB) block sizes.
The IO library should be designed for big instructions with minimal roundtrips.
E.g. directory listings should send the entire file list with file stats, and there should be a recursive version to transfer entire hierarchies.
Think more shell utilities than syscalls. Use CPU as IO processor that can do local data processing without involving the GPU.
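For instance, a recursive directory listing could be assembled into a single reply on the CPU and shipped to the GPU in one transfer. A sketch, with an invented plain-text wire format:

// Sketch: answer a "recursive ls" request with one flat reply blob of
// "<size> <mtime> <path>" lines. The format is made up for illustration;
// the point is that one GPU-side request gets the whole hierarchy back.
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static char  *reply_buf;
static size_t reply_len, reply_cap;

static int add_entry(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf) {
    (void)typeflag; (void)ftwbuf;
    int n = snprintf(reply_buf + reply_len, reply_cap - reply_len,
                     "%lld %lld %s\n",
                     (long long)sb->st_size, (long long)sb->st_mtime, path);
    if (n > 0 && reply_len + (size_t)n < reply_cap) reply_len += (size_t)n;
    return 0;  // keep walking even if the buffer fills up
}

// Walks the tree and fills buf; the caller copies it to the GPU in one go.
size_t list_recursive(const char *root, char *buf, size_t cap) {
    reply_buf = buf; reply_len = 0; reply_cap = cap;
    nftw(root, add_entry, /*max open fds*/ 32, FTW_PHYS);
    return reply_len;
}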
Workgroup concurrency can be used to run the same code on CPU and GPU in parallel. This extends to multi-GPU and multi-node quite naturally.
The IO queue could be used to exchange data between running workgroups.
There's a limited amount of memory that can be shared between the CPU and GPU (I start seeing issues with >64 MB allocations).
Having a small IO heap for each thread or even threadgroup, while easy to parallelize, limits IO sizes severely.
A 32 MB transfer buffer split across 32k threads means a 1 kB max IO per thread, or 32 kB per 32-wide subgroup.
Preferable to do 1+ MB IOs.
Design with a concurrently running IO manager program that processes IO transfers?
The CPU could also manage this by issuing copyBuffer calls to move data.
Workgroups submit tasks in sync -> readv / writev approach is beneficial for sequential reads/writes.
Readv/writev are internally single-threaded, so probably limited by memcpy to 6-8 GB/s.
Ordering of writes across workgroups requires a way to sequence IOs (either reduce to order on the GPU or reassemble correct order on the CPU.)
IOs could have sequence ids.
Compression of data on the PCIe bus could help. 32 * zstd --format=lz4 --fast -T1 file -o /dev/null goes at 38 GB/s.
[Update] Did a quick test with libzstd (create compressor context for each 1 MB read, compress, send data to GPU) with mixed results. Getting 3.4 GB/s throughput with Zstd. Which is good for zero-effort, but I should really be using liblz4. Running 32 instances of zstd --fast in parallel got 6.4 GB/s, with --format=lz4 16 GB/s.
[Update 2] Libzstd with fast strategy and compression level -9 can do 12.7 GB/s grep.glsl throughput. So if I had a GPU-side decompressor, I might see a performance benefit vs raw data (11.4 GB/s). On already-compressed files, there's a 10-15% perf penalty vs raw.
[Update 3] LZ4 with streaming block compressor and compression level 9 reaches 17.5 GB/s grep.glsl throughput on the above file, 22.5 GB/s on a 13 GB kern.log file (lots of similar errors). About 5% perf penalty on compressed files. Feels like a GPU decompressor could actually be a good idea.
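A sketch of what an LZ4 read path like that can look like; this is not the exact code behind the numbers above, and the 1 MB block size, acceleration setting, and send_to_gpu callback are stand-ins.

// Sketch: compress each 1 MB read with the LZ4 streaming block API before
// pushing it over PCIe. Assumes a matching block decompressor on the GPU
// side. Not reentrant (static buffers); illustration only.
#include <lz4.h>
#include <stdio.h>

#define BLOCK_SIZE (1 << 20)

void compress_stream(FILE *in, void (*send_to_gpu)(const char *, int)) {
    // Double-buffer the input: the streaming compressor uses the previous
    // block as its dictionary, so it has to stay resident in memory.
    static char src[2][BLOCK_SIZE];
    static char dst[LZ4_COMPRESSBOUND(BLOCK_SIZE)];
    int cur = 0;

    LZ4_stream_t *ctx = LZ4_createStream();
    for (;;) {
        int n = (int)fread(src[cur], 1, BLOCK_SIZE, in);
        if (n <= 0) break;
        int c = LZ4_compress_fast_continue(ctx, src[cur], dst, n,
                                           (int)sizeof(dst), /*accel*/ 1);
        if (c <= 0) break;
        send_to_gpu(dst, c);   // copy into the CPU-GPU transfer buffer
        cur ^= 1;
    }
    LZ4_freeStream(ctx);
}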
Caching file data on the GPU is important for performance: GPU RAM has roughly 40x the bandwidth of reading the CPU page cache over PCIe.
Without GPU-side caching, you’ll likely get better perf on the CPU on bandwidth-limited tasks (>50 GB/s throughput.)
In those tasks, using memory bandwidth to send data to GPU wouldn’t help any, best you could achieve is zero slowdown.
(Memory bandwidth 50 GB/s. CPU processing speed 50 GB/s. Use 10 GB/s of bandwidth to send data to GPU =>
CPU has only 40 GB/s bandwidth left, GPU can do 10 GB/s => CPU+GPU processing speed 50 GB/s.)
Benchmark suite
- Different block sizes
- Different access patterns (sequential, random)
  - Scatter writes
  - Sequential writes
  - Gather reads
  - Sequential reads
  - Combined reads & writes
- Different levels of parallelism
  - 1 IO per thread group
  - Each thread does its own IO
  - 1 IO on ThreadID 0
  - IOs across all invocations
- Compression
- From hot cache on CPU
- From cold cache
- With GPU-side cache
- Repeated access to same file
- Access to multiple files
Does it help to combine reads & writes into sequential blocks on CPU-side when possible, or is it faster to do IOs ASAP?
Caching file descriptors, helps or not?
2020-06-29
grep.glsl
grep.glsl is sort of working. It's a GLSL compute shader version of grep. A very simple one at that. It tests a string against a file's contents and prints out the byte offsets where the string was found.
The awesome part of this is that the shader is running the IO. And it performs reasonably well after tuning. You could imagine a graphics shader dynamically loading geometry and textures when it needs them, then poll for load completion in following frames.
Here are the few first lines of grep.glsl
string filename = aGet(argv, 2);
string pattern = aGet(argv, 1);

if (ThreadLocalID == 0) done = 0;

if (ThreadID == 0) {
    FREE(
        println(concat("Searching for pattern ", pattern));
        println(concat("In file ", filename));
    )
    setReturnValue(1);
}
Not your run-of-the-mill shader, eh?
This is the file reading part:
while (done == 0) {
    FREE(FREE_IO(
        barrier(); memoryBarrier();

        // Read the file segment for the workgroup.
        if (ThreadLocalID == 0) {
            wgBuf = readSync(filename, wgOff, wgBufSize, string(wgHeapStart, wgHeapStart + wgBufSize));
            if (strLen(wgBuf) != wgBufSize) {
                atomicAdd(done, strLen(wgBuf) == 0 ? 2 : 1);
            }
        }
        barrier(); memoryBarrier();
        if (done == 2) break; // Got an empty read.

        // Get this thread's slice of the workgroup buffer.
        string buf = string(
            min(wgBuf.y, wgBuf.x + ThreadLocalID * blockSize),
            min(wgBuf.y, wgBuf.x + (ThreadLocalID+1) * blockSize + patternLength)
        );

        // Iterate through the buffer slice and add found byte offsets to the search results.
        int start = startp;
        i32heapPtr = startp;
        for (int i = 0; i < blockSize; i++) {
            int idx = buf.x + i;
            if (startsWith(string(idx, buf.y), pattern)) {
                i32heap[i32heapPtr++] = int32_t(i);
                found = true;
            }
        }
        int end = i32heapPtr;
        ...
Performance is complicated. Vulkan compute shaders have a huge 200 ms startup cost and a 160 ms cleanup cost. About 60 ms of that is creating the compute pipeline, the rest is instance and device creation.
Once you get the shader running, performance continues to be complicated. The main bottleneck is IO, as you might imagine. The file.glsl IO implementation uses a device-local host-visible volatile buffer to communicate between the CPU and GPU. The GPU tells the CPU that it has some IO work for the CPU by writing into the buffer, using atomics to prevent several GPU threads from writing into the same request. The CPU spinlocks to grab new requests, then spins until the request contents become available. After processing a request, the CPU writes the results to the buffer. The GPU spinlocks until the IO completes, then copies the IO results to its device-local heap buffer.
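In rough C terms, the CPU half of that protocol looks something like the sketch below. The field names and status codes are made up for illustration (the real buffer layout differs), but the two load-bearing details are there: one request per 128-byte GPU cacheline, and a status word both sides spin on.

// Sketch of the CPU side of the IO queue; names and protocol are illustrative.
#include <stdint.h>

enum { IO_FREE = 0, IO_SUBMITTED = 1, IO_IN_FLIGHT = 2, IO_DONE = 3 };

// One request per 128-byte GPU cacheline, so polling on one request
// never thrashes the line another request lives on.
typedef struct __attribute__((aligned(128))) {
    volatile int32_t status;      // spun on by both CPU and GPU
    int32_t opcode;               // read, write, open, runCmd, ...
    int64_t offset;
    int64_t length;
    int64_t data_offset;          // payload location in the transfer buffer
} io_request;

// CPU polling loop over the mapped device-local host-visible buffer.
// Real code needs memory barriers around the status transitions and a
// small pause per round so the polling doesn't hog the PCIe bus.
void io_service_loop(io_request *queue, int queue_len, volatile int *running) {
    while (*running) {
        for (int i = 0; i < queue_len; i++) {
            if (queue[i].status == IO_SUBMITTED) {
                queue[i].status = IO_IN_FLIGHT;
                // ...pread/pwrite the transfer buffer at data_offset...
                queue[i].status = IO_DONE;   // the GPU spins until it sees this
            }
        }
    }
}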
The GPU-GPU copies execute reasonably fast (200+ GB/s), but waiting for the IO drops the throughput to around 6 GB/s. This is way better than the 1.6 GB/s it used to be, when it was using int8_t for IO and ~200 kB transfers. Now the transfer buffer is 1 MB and the IO copies are done with i64vec4.
Once you have the data on the GPU, performance continues to be complicated. Iterating through the data one byte at a time goes at roughly 30 GB/s. If the buffer is host-visible, the speed drops to 10 GB/s. Searching a device-local buffer 32 bytes at a time using an i64vec4 goes at 220 GB/s.
The CPU-GPU transfer buffer has a max size of 256 MB. The design of file.glsl causes issues here, since it slices the transfer buffer across shader instances, making it difficult to do full-buffer transfers (and minimize IO spinning). Now grep.glsl does transfers in per-workgroup slices, where 100 workgroups each do a 1 MB read from the searched file, then distribute the search work across 255 workers per workgroup.
This achieves 6 GB/s shader throughput. The PCIe bus is capable of 12 GB/s. If you remove the IO waits and just do buffer copies, the shader runs at 130 GB/s. Taking that into account, the shader PCIe transfers should be happening at 6.3 GB/s. Removing the buffer copies had a very minimal effect on throughput, so doing the search directly on the IO buffer shouldn't improve performance by much.
Thoughts
Why are the transfers not hitting the PCIe bus limits? Suspiciously hovering at around half of PCIe bandwidth too. Could it be that the device-local buffer is slow to write to from the CPU side? Previously I had host-cached memory for the buffer, which lets you do flushes and invalidations to (hopefully) transfer in more efficient chunks. The reason that I'm not using a host-cached memory buffer is that GLSL volatile doesn't work for host-side memory: the GPU seems to cache fetched memory into GPU RAM and volatile only bypasses the GPU L1/L2 caches, so you'll never see CPU writes that landed after your first read. And there doesn't seem to be a buffer type that's host-cached device-local.
Maybe have a CPU-side buffer for IO, and a transfer queue to submit copies to GPU memory. This should land the data into GPU RAM and volatile should work. Or do the fread to a separate buffer first, then memcpy it to the GPU buffer.
Vulkan's startup latency makes the current big binary approach bad for implementing short-lived shell commands. The anemic PCIe bandwidth makes compute-poor programs starve for data. GNU grep runs at 4 GB/s, but you can run 64 instances and achieve 50 GB/s [from page cache]. This isn't possible with grep.glsl. All you have is 12 GB/s.
Suppose this: you've got a long-running daemon. You send it SPIR-V or GLSL. It runs them on the GPU. It also maintains a GPU-side page cache for file I/O. Now your cold-cache data would still run at roughly the speed of storage (6 GB/s isn't _that_ bad.) But a cached file, a cached file would fly. 400 GB/s and more.
Creating the compute pipeline takes around 50 ms. Caching binary pipelines for programs would give you faster startup times.
The GPU driver terminates running programs after around 15 seconds. And they hang the graphics while running. Where's our pre-emptive multitasking?
This should really be running on a CPU because of the PCIe bottleneck and the Vulkan startup bottleneck. Going to try compiling it to ISPC or C++ and see how it goes.
[Update] Tested memcpy performance. Single thread CPU-CPU, about 8 GB/s. With 8 threads: 31 GB/s. Copying to the device-local host-visible buffer: 8 GB/s with one thread. With 4 threads, 10.2 GB/s. Cuda bandwidthTest can do 13.1 GB/s with pinned memory and 11.5 GB/s with pageable memory. The file.glsl IO transfer system doesn't seem to be all that slow, it just needs multi-threaded copies on the CPU side.
Built a simple threaded IO system that spawns a thread for every read IO up to a maximum of 16 threads in flight. Grep.glsl throughput: 10 GB/s. Hooray!
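Roughly what that reader looks like, as a sketch; the semaphore cap, the names, and the completion flag are illustrative rather than the actual implementation.

// Sketch: one detached thread per read request, at most 16 in flight.
// Each thread preads into a staging buffer and memcpys the result into
// the mapped CPU-GPU transfer buffer. Names and sizes are illustrative.
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static sem_t g_inflight;  // initialized elsewhere: sem_init(&g_inflight, 0, 16)

typedef struct {
    int    fd;
    off_t  offset;
    size_t length;
    char  *mapped_dst;          // slice of the mapped transfer buffer
    volatile int32_t *status;   // completion flag the GPU spins on
} read_job;

static void *read_worker(void *arg) {
    read_job *job = arg;
    char *staging = malloc(job->length);
    ssize_t n = pread(job->fd, staging, job->length, job->offset);
    if (n > 0) memcpy(job->mapped_dst, staging, (size_t)n);
    *job->status = (int32_t)(n < 0 ? 0 : n);  // signal completion to the GPU side
    free(staging);
    free(job);
    sem_post(&g_inflight);
    return NULL;
}

// job must be heap-allocated; the worker frees it when done.
static void submit_read(read_job *job) {
    sem_wait(&g_inflight);           // cap the number of in-flight reads at 16
    pthread_t t;
    pthread_create(&t, NULL, read_worker, job);
    pthread_detach(t);
}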
[Update 2] Lots of tweaking, debugging and banging-head-to-the-wall later, we're at 11.45 GB/s.
2020-06-25
file.glsl
2020-06-16
Real-time path tracing
2020-06-12
GLSL strings
for (int i = str.x; i < str.y; i++) {
    heap[i]...
}
Malloc
Performance
2020-05-31
Raspberry Pi RC car
Here's my software package to turn your Raspberry Pi into an RC car: https://github.com/kig/rpi-car-control
rpi-car-control
Use a Raspberry Pi to drive an RC car from a web page.
Fisheye camera with two IR lamps, a white USB power bank underneath. The wires go inside the Raspberry Pi case.
A ToF laser range finder for the reversing distance indicator.
How does it work?
Open up a cheap RC toy car. Connect the motors to a Raspberry Pi. Add a camera. Run a web server on the Raspberry Pi that controls the car.
In more detail, you need to replace the car PCB with a motor controller board (say, a tiny cheap MX1508 module). Then solder the motors and the car battery pack to the motor controller. Solder M-F jumper cables to the motor controller's control connectors. Plug the other end of the jumpers to the Raspberry Pi GPIOs. Now you can control the motors from the Raspberry Pi.
Expose the Raspberry Pi camera as an MJPEG stream so that you can directly view it as an IMG on the browser. This is the easiest low-latency, low-CPU, high-quality streaming format.
If the car has lights, you can drive them from the GPIOs as well (either directly or via a proper LED controller). Add a bunch of sensors to the car for the heck of it. I've got a tiny VL53L1X ToF laser-ranging sensor as a reversing radar, and a DHT temperature and humidity sensor. There's code in the repo to hook up an ultrasonic range finder too (it can even use the DHT sensor to calculate the speed of sound for the given temperature and humidity - and has a Kalman filter of sorts, so you can reach ~mm accuracy), and some bits and bobs for using a PIR sensor.
There was also a microphone input and playback either through wired speakers or to a Bluetooth speaker, but that's not enabled at the moment. There was also a WebRTC-based streaming solution for doing 2-way video calls, but that was such a pain I gave up on it. I was using RWS which is pretty easy to set up, but the STUN/TURN stuff was tough.
Add a USB battery pack to power the Raspberry Pi and you're about done. If you're feeling adventurous, you could use a 5V step-up/step-down regulator to run the Raspberry Pi directly from the car batteries.
Install
raspi-config # Enable I2C to use the VL53L1X sensor
sh install.sh
The install script installs the car service and its dependencies. This is best done on a fresh install of Raspbian. The install script overwrites NGINX's default site configuration.
After starting the car control app with sudo systemctl start car, you can connect to http://raspberrypi/car/ and play with the controls web page.
The car control app is installed in /opt/rpi-car-control.
To use an SSH tunnel server, edit /etc/rpi-car-control/env.sh and change the line RPROXY_SERVER= to RPROXY_SERVER=my.server.
With the SSH tunnel, you can access the car from http://my.server:9999/car/. Best to firewall this port and add an HTTPS reverse proxy that points to it. Look at etc/remote_nginx.conf for a snippet that sets up an authenticated NGINX reverse proxy on the remote server. (Run htpasswd -c /etc/nginx/car_htpasswd my_username to create the password file.)
Configuration
See /etc/rpi-car-control/env.sh for settings.
# SSH tunnel reverse proxy
RPROXY_SERVER=my.server
# One of v4l2-mjpeg, v4l2-raw, raspivid
VIDEO_MODE=v4l2-raw
# Which camera to use in the v4l2 modes
V4L2_DEVICE=/dev/video2
# Video settings
VIDEO_WIDTH=480
VIDEO_HEIGHT=270
VIDEO_FPS=60
VIDEO_ROTATION=0
Controls
The circle on the left is the accelerator indicator, and the circle on the right is the steering indicator. The bar in the bottom middle is the reversing distance indicator. The sensor data readout is at top left. The little square at the bottom right toggles the full screen mode.
The controls are defined near the bottom of html/main.js.
Touch controls
- Use left thumb to accelerate and reverse, right thumb to steer.
Keyboard controls
- Use arrow keys to drive.
- The numbers 1-4 control front lights intensity and 0 turns the rear lights on and off.
- The z key blinks the left front light, the c key blinks the right front light, and the x key turns off the blinkers.
Requirements
The app is very modular, so you can run it without an actual car or camera and just play with a web page of controls that do nothing.
If you wire up the motors, you should be able to drive. If you wire up the lights, they should light up.
Wire up the sensors and you should start seeing sensor data in the HUD.
Add a camera and you'll see a live video stream.
Wiring
See control/car.py and sensors/sensors_websocket.py for the pin definitions. The VCC and GND connections have been left out. Just remember to use the correct voltage when wiring those.
Component | GPIO | Notes |
---|---|---|
Motor forward (A) | 17 | |
Motor backward (B) | 27 | |
Steering left (A) | 24 | |
Steering right (B) | 23 | |
Left headlight | 5 | The headlights turn on when you connect |
Right headlight | 6 | They can also blink a turning signal |
Rear lights | 13 | Rear lights light up when you reverse |
Power PWM | 12 | Disabled, for use with L298N |
DHT11 signal | 14 | |
PIR signal | 22 | |
VL53L1X power | 4 | Use a GPIO and you can turn it off when not in use |
VL53L1X SDA | 2 | I2C bus 1 |
VL53L1X SCL | 3 | I2C bus 1 |
Features
- FPV stream web page with keyboard & touch controls to drive the car, along with a reversing distance indicator and a thermometer.
- Low latency video stream for driving (down to 50 ms glass-to-glass when using a 90 Hz camera and a 240 Hz display.)
- Bunch of websocket servers to send out sensor data and receive car controls.
- Nginx reverse proxy config to tie all the servers together.
- Systemd service to start the car control server on boot.
- SSH tunnel to a remote control server to drive the car from anywhere.
- Low-power tweaks to increase battery life (disables HDMI, Ethernet and USB.)
- Use RaspiCam or a V4L2 USB webcam, either with raw video (eats CPU) or camera-supplied MJPEG
Disabled
- Bluetooth speaker pairing for playing audio.
- Stream car microphone to the browser.
- Speak to the car from the browser by sending audio with Web Audio API.
- WebRTC call between browser and car.
In progress
- PoseNet with Coral USB accelerator for "point and I'll drive there"
Wanted
- OMX JPEG encoder for raw video cameras
- SLAM and "click on a map position to drive there"
- Good small microphone + speaker solution
- Small display to do two-way video calls
- Non-sucky camera mount (duct tape doesn't really work)
- Power car and computer from one battery
- Automatic wireless charging when battery is low
- Shutdown when battery critical
- Speech controls
Customize
Take a look at run.sh first. It starts the web server and optionally the reverse proxy tunnel. The web server is in web/web_server.py and starts up bin/start_control_server.sh and bin/start_server.sh when needed. The sensors are controlled by sensors/sensors_websocket.py, and the car controls are in control/car_websockets.py. For video streaming, have a look at video/start_stream.sh. The HUD is in html/, see html/main.js for the car controls and how the video and sensor data are streamed.
License
MIT
Ilmari Heikkinen © 2020