art with code

2020-06-16

Real-time path tracing

[WIP - I'll update this over time]

The key problem in real-time path tracing is that you have enough compute to do a couple of paths per pixel, but the quality you want requires ten thousand paths per pixel.

There are three ways to solve this correctly. One is to wait a few decades for computers to get 10 000 times faster. One is to use 10 000 computers to do the rendering. And one is to make paths 10 000 times cheaper.

Or you could render at a lower resolution and frame rate. If you can do 1 sample per pixel at 4k resolution at 60 Hz, then you can do 10k samples per pixel at 30 Hz at 60x25 resolution. Yes, you can have real-time movie quality graphics on a thumbnail-sized display. If you want to hit C64 resolutions like 320x200, you could get 3-4 cryptomining PCs with 10+ GPUs each. For a paltry $15000 you can have the C64 of your dreams!

But, well, what if you want to do high-resolution rendering at high frame rates? Without the budget for a massive cluster, and without having to find The Secret of Monte Carlo which would enable incredibly fast and accurate integration of the lighting equation.

There are ways. You could put together four techniques that each give you a 10x perceptual sample count boost, compounding to 10,000x. Or you could put together 14 techniques with a 2x perceptual sample count boost each (2^14 is roughly 16,000x). Or perhaps a mix of the two. Then there are optimizations and mathematical techniques to improve the performance of the Monte Carlo integration used in path tracing. Even ways to arrive at better approximations of the light field in the same time. What's important is picking a set of techniques that compound.

Monte Carlo integration is a process where you integrate a function by calling it with random values, summing up the results, and dividing the sum by the number of samples. The nice thing about MC integration is that it will eventually converge to the correct solution, no matter what initial value and weight you start with.
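
To make that concrete, here's a minimal sketch of MC integration in Python, estimating a 1D integral by averaging random samples:

    import random

    def mc_integrate(f, a, b, n=10000):
        # Estimate the integral of f over [a, b] by averaging f at
        # uniformly random points, scaled by the interval length.
        total = 0.0
        for _ in range(n):
            x = random.uniform(a, b)
            total += f(x)
        return (b - a) * total / n

    # Example: integrate x^2 over [0, 1]; the exact answer is 1/3.
    print(mc_integrate(lambda x: x * x, 0.0, 1.0))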

Images generated with MC integration start off noisy: the initial samples of the function have a high weight, and neighboring pixels often end up with quite different values at any given sample count because each pixel picks its own random values of the function. This noise gets smoothed out as the number of samples increases and the pixel neighborhood ends up sampling a similar hemisphere.

The relationship between noise and sample count is roughly that noise falls off as one over the square root of the sample count. Quadruple the sample count and the noise halves. At 10k samples, you halve the noise about 6.6 times. If you've got an 8-bit range, your noise level should be about 1.4 bits, or about +-1.3 units.
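
You can sanity-check those numbers with a couple of lines of Python:

    import math

    samples = 10_000
    noise_reduction = math.sqrt(samples)   # noise ~ 1/sqrt(n), so 100x less noise
    halvings = math.log2(noise_reduction)  # ~6.64 halvings
    residual = 256 / noise_reduction       # ~2.56 units on an 8-bit range
    print(halvings, residual / 2)          # ~6.6 halvings, about +-1.3 units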

If you want to get rid of noise in the early stages of the integration, you have to down-weight the initial samples relative to the neighborhood. The easiest way is to blur the image. Another way is to pick a non-noisy starting point and give it a high weight, so that the new MC samples don't cause as much noise. Yet another way is to correlate the sampling of the neighboring pixels so that they sample the same regions - this trades high-frequency noise for low-frequency noise.
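
Here's a sketch of the high-weight starting point idea: treat the prior (say, a blurred previous-frame value) as if it were a batch of samples you've already taken. The function name and interface are made up for illustration.

    def integrate_with_prior(samples, prior, prior_weight):
        # Treat the prior as if it were prior_weight samples already taken:
        # early noisy samples get a small relative weight, and the estimate
        # still converges to the true value as real samples pile up.
        total = prior * prior_weight
        count = prior_weight
        for s in samples:
            total += s
            count += 1
            yield total / count  # running estimate after each sample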

E.g. use a sampling strategy for a 64x64 region where you first take 100 samples of the hemisphere at the center of the region, keep the top-5 highest energy samples, weigh them down by their probability, and use shadow rays to connect the 64x64 region to only the high-energy values. You'll get no pixel-to-pixel noise in the 64x64 region, but you might end up with a blotchy image when transitioning from one region to another. 

Fireflies are rare high-energy samples that jump a pixel value much higher than the neighborhood. Fireflies are actually good for the energy search part of the integration, you can use the firefly path as a guiding star for the neighborhood. But they cause bright noise that takes a large number of samples to tone down. One way to deal with fireflies is to clamp them. This way you'll still get bright pixels but they won't slow convergence towards darkness. The issue with clamping is that fireflies can be an important contributor of energy in the neighborhood. If a firefly boosts a pixel by 10 units and the path occurs only on every 100th sample, the firefly's contribution should be 0.1 units. If you clamp the firefly to 1, it'll contribute only 0.01 units, and the pixel ends up too dark.

A better way might be to estimate the probability of the firefly path based on the neighborhood samples. If only every 100th pixel has a firefly of 10 units, scale the fireflies down to 0.1 units and add 0.1 to the entire neighborhood. As the sample count increases, you can start passing the fireflies straight through and converge to an unbiased result.
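
A sketch of that idea, with a hypothetical rescale_fireflies helper: clamp the fireflies, but estimate their rate from the neighborhood and pay the clamped-away energy back as an ambient term.

    def rescale_fireflies(pixel_samples, threshold, neighborhood):
        # Estimate how often fireflies occur in the surrounding region,
        # then spread their average excess energy across the region
        # instead of clamping it away entirely.
        fireflies = [s for s in neighborhood if s > threshold]
        rate = len(fireflies) / len(neighborhood)          # e.g. 1/100
        avg_excess = (sum(fireflies) / len(fireflies) - threshold) if fireflies else 0.0
        ambient = rate * avg_excess                        # energy owed to everyone
        return [min(s, threshold) + ambient for s in pixel_samples]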

Temporal reprojection lets you use samples computed in previous frames as the starting point of the integration in the current frame. Adaptive sampling allows you to focus your ray budget to the areas of the scene that need it the most. Sparse sampling lets you skip some pixels completely.

Multiple importance sampling guides your integrator towards high-energy regions of the scene. Bidirectional path tracing tries to connect high-energy paths with camera paths. Photon mapping caches high-energy paths in the scene. Vertex connection and merging (VCM) combines BDPT and photon mapping.

Path reuse techniques like bidirectional path tracing can give you additional paths at the expense of a single shadow ray and BSDF evaluation. You generate a camera path and then connect the vertices on the camera path to generated path suffixes. In simple BDPT, you'd generate a camera path and a light path, then connect each vertex of the camera path to each vertex of the light path. If you have three connecting vertices in the camera path and three in the light path, you'd get ten complete paths at the expense of one camera path, one light path, and nine shadow rays. If you reuse light paths across multiple pixels, you can amortize the light path cost. In static scenes, you could also reuse camera paths. This way you might be able to generate decent paths with a single shadow ray per path.
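
Here's what the naive connection step might look like, ignoring MIS weights and pdfs for brevity; visible, bsdf and geometry_term are stand-ins for the real machinery.

    def connect_paths(camera_path, light_path, visible, bsdf, geometry_term):
        # Naive BDPT connections: link every camera vertex to every light
        # vertex with one shadow ray. Each path entry is a
        # (vertex, throughput) pair.
        contributions = []
        for cv, c_throughput in camera_path:
            for lv, l_throughput in light_path:
                if visible(cv, lv):  # one shadow ray per connection
                    f = bsdf(cv, lv) * geometry_term(cv, lv) * bsdf(lv, cv)
                    contributions.append(c_throughput * f * l_throughput)
        # 3 camera vertices x 3 light vertices = 9 shadow rays and 9 extra
        # paths, plus the unidirectional camera path itself = 10 paths.
        return sum(contributions)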

For outdoor scenes, you'll start seeing diminishing returns after five bounces, so path reuse techniques for easy light paths might be limited to around a 5x speed boost. In indoor scenes with longer paths, path reuse becomes more useful. If you're also dealing with difficult lighting conditions, the generated high-energy path suffixes are very helpful.

Temporal reprojection can allow you to start integration at a high effective sample count, based on integration results from previous frames. It is most helpful with diffuse surfaces, where camera angle and position don't change the integration result. On glossy and reflective surfaces and refractive volumes you'll see significant ghosting.
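
A minimal numpy sketch of reprojection, assuming you keep the previous frame's view-projection matrix and know the world-space position of each pixel's primary hit. A real version would also reject history on depth and normal mismatches.

    import numpy as np

    def reproject(prev_frame, prev_view_proj, world_pos, width, height):
        # Project this frame's hit positions with the previous frame's
        # view-projection matrix and fetch last frame's integration result.
        # world_pos is (H, W, 3); prev_view_proj is a 4x4 matrix.
        p = np.concatenate([world_pos, np.ones(world_pos.shape[:2] + (1,))], axis=-1)
        clip = p @ prev_view_proj.T
        ndc = clip[..., :2] / clip[..., 3:4]            # perspective divide
        x = ((ndc[..., 0] * 0.5 + 0.5) * width).astype(int)
        y = ((ndc[..., 1] * 0.5 + 0.5) * height).astype(int)
        valid = (x >= 0) & (x < width) & (y >= 0) & (y < height) & (clip[..., 3] > 0)
        out = np.zeros_like(prev_frame)
        out[valid] = prev_frame[y[valid], x[valid]]
        return out, valid  # 'valid' marks pixels with usable history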

Adaptive sampling is quite crucial for fast renders at low sample counts. Not all parts of the scene require the same number of samples to converge. If you can skip low-noise, low-contrast regions, you can allocate more samples to noisy and high-contrast regions. You can borrow some tricks from artists and blur out dark regions of the image, since the eye is going to skip them anyhow and seek out high-contrast regions and bright areas. Throwing extra samples at a shadow region to determine whether its value is 0.01 or 0.02 is a waste of time, especially in animated scenes. You can use foveated rendering tricks and spend your ray budget where the viewer is looking. If you don't have eye tracking, you can make educated guesses (center of the screen, anything moving, faces, silhouettes).
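
A sketch of the budget split, assuming you have a per-pixel variance estimate from the accumulated samples:

    import numpy as np

    def allocate_samples(variance, budget, min_spp=1):
        # Hand out the per-frame ray budget in proportion to a per-pixel
        # variance estimate, with a floor so no pixel starves completely.
        weights = variance / variance.sum()
        spp = min_spp + np.floor(weights * (budget - min_spp * variance.size))
        return spp.astype(int)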

Sparse sampling is the extreme cousin of adaptive sampling. With sparse sampling you can completely skip rendering some pixels, filling them from the temporal reprojection buffer and the nearest sampled pixels. Necessary for hitting a target framerate. Handy for foveal rendering. If you have sparse sampling, you can run your interactive scene at a locked 60 FPS.
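
A sketch: trace only the worst fraction of pixels each frame, prioritizing pixels with no usable history. history_valid is assumed to be a boolean mask from the reprojection step.

    import numpy as np

    def sparse_mask(variance, history_valid, fraction=0.25):
        # Trace only the worst 'fraction' of pixels this frame; everything
        # else gets filled from reprojected history or nearby pixels.
        score = variance + 10.0 * ~history_valid  # pixels with no history go first
        cutoff = np.quantile(score, 1.0 - fraction)
        return score >= cutoff                    # True = trace this pixel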

Combining the above techniques gives you some nice synergies. Adaptive sampling can focus on parts of the scene that are lacking previous frame data. Sparse sampling can be used to generate a low-resolution render with a high sample count, to be used as the start for integration and as a low-res variance estimate to guide the adaptive sampler. Temporal reprojection could pull samples out of the path reuse cache.

The path reuse technique can prioritize path variants at low bounce counts (say, 10 variants at bounce 1, 2 variants at bounce 2, 1 at bounce 3 -- this'd get you a good sampling of the first bounce hemisphere). The path caches can be used to estimate the energy in different parts of the scene and the probability of connecting two parts of the scene; this could be used for MIS pdfs and to guide paths towards high-energy regions with a high probability of getting there.

Denoisers work by looking at the variance of a region and the level of convergence (how much each new sample changes the value at the pixel, and how close the pixel's value is to its neighbors.) If the convergence is low and the variance is high, the pixel's light estimate is likely off, and the region's light estimate likewise. A denoiser can pool the samples in the neighborhood and distribute the energy so that the variation between samples is low. That is, blur the region.
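
A single-channel sketch of variance-driven blurring; the radii and thresholds here are made up.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def variance_blur(image, variance, max_radius=4):
        # Blur harder where the variance estimate is high, not at all where
        # the pixel has converged. 'variance' is normalized to [0, 1].
        out = image.copy()
        for r in range(1, max_radius + 1):
            blurred = uniform_filter(image, size=2 * r + 1)
            mask = variance >= r / max_radius  # noisier pixels get bigger kernels
            out[mask] = blurred[mask]
        return out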

By looking at geometry and texture changes, the denoiser can avoid denoising high-contrast regions that should be high-contrast (e.g. you've got a sharp edge, so the normals on both sides of the edge are very different => denoiser can avoid blurring across the edge). Textures have high-frequency signal that also should be there. Denoising without the texture applied and then applying the texture on top can give you a smooth lighting solution without blurring the texture.
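
The demodulation trick is short enough to show in full; denoise here stands in for whatever spatial filter you use.

    import numpy as np

    def demodulated_denoise(radiance, albedo, denoise, eps=1e-4):
        # Divide the texture (albedo) out, denoise the smooth lighting-only
        # signal, then multiply the texture back in, so the denoiser's blur
        # never touches texture detail.
        lighting = radiance / np.maximum(albedo, eps)
        return denoise(lighting) * albedo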

Denoising can give you overly smooth results. You could make a denoiser to bring down noise to a wanted level but not all the way down. Keep some of the noise, don't blur out natural noise. Use a variable amount of denoising depending on the region variance, convergence and sample count.

Noise in rendering is tricky. Let's say you're using blue noise to get a nicer sampling pattern. And you render an animation with a moving camera. If you fix the noise to the screen, it will look like the screen is a bit dirty. If you fix the noise to world coordinates, it will look like your object is dirty. If you animate the noise, it will look like your renderer is broken.

At one 5-ray path per pixel at 4k, you've got 40 million rays worth of scene connectivity data in every frame. Figuring out ways to make use of these rays for MC integration might well give you a high-quality render at a low cost.

Acceleration structures make ray-scene intersection cheaper to evaluate. They have a build cost. For dynamic scenes where you have to update or rebuild the acceleration structure every frame, it's important to use one that's fast to build. For static scenes, you can go with a more expensive acceleration structure that makes ray intersection faster.

Acceleration structures map a 5D ray (3D origin, 2D direction) to the primitives intersected by the ray. Usually this is done either with a tree that splits the space into hierarchical regions, or with a tree that splits the scene into bounding boxes of primitives. The difficulty, as you might imagine, is that a 3D tree doesn't give you exact matches for a 5D ray. Instead you end up with a list of tree nodes to test against the ray. If the tree doesn't have an ordering property (i.e. "if the ray hits something in this node, it's the nearest object"), you might even need to do several intersection tests to find the closest one.

The trees do have some directionality. For a BVH, by sorting the primitives recursively along different coordinate axes, you're creating lists ordered along a dimension. Once you know that you're dealing with an ordered list, you can do an early exit if your hit or exit point is before the start of the next untested element.

Grids are nice in that they're directional, so you can do an early exit on the first hit. And you've got a fixed maximum number of steps to march through a grid.
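
Here's a sketch of the classic 3D DDA grid march (Amanatides & Woo style), which visits cells in strict front-to-back order, so the first hit found is the nearest one.

    import math

    def grid_march(origin, direction, cell_size, n_cells):
        # Step through grid cells along the ray, always advancing across
        # the nearest cell boundary. Yields cells front to back.
        cell = [int(origin[i] // cell_size) for i in range(3)]
        step, t_max, t_delta = [], [], []
        for i in range(3):
            if direction[i] > 0:
                step.append(1)
                t_max.append(((cell[i] + 1) * cell_size - origin[i]) / direction[i])
            elif direction[i] < 0:
                step.append(-1)
                t_max.append((cell[i] * cell_size - origin[i]) / direction[i])
            else:
                step.append(0)
                t_max.append(math.inf)
            t_delta.append(cell_size / abs(direction[i]) if direction[i] else math.inf)
        while all(0 <= cell[i] < n_cells for i in range(3)):
            yield tuple(cell)             # test primitives in this cell;
            i = t_max.index(min(t_max))   # on a hit, the caller can stop.
            cell[i] += step[i]
            t_max[i] += t_delta[i]
        # bounded: at most ~3 * n_cells steps before the ray leaves the grid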

What you'd really like to have is a proper 5D lookup structure where you can directly find the next hit for a ray. This is difficult. 

Ray classification (Arvo & Kirk 87) assigns objects into a lazily-constructed 5D grid. On each bounce, you assign your rays into 5D grid cells based on their origin and direction. Then for each grid cell, find all objects that match with it (a 5D grid cell is basically a frustum, so you'd do frustum-to-BVH checks). Then trace your rays, intersecting only against the objects inside the ray's grid cell.
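
A sketch of the cell assignment, with made-up resolutions; the direction part maps the ray onto the dominant face of a direction cube.

    def ray_cell(origin, direction, bounds_min, cell_size, n_dir_cells=4):
        # Quantize a ray into a 5D cell a la Arvo & Kirk: 3 indices from
        # the origin, 2 from the direction mapped onto the dominant face
        # of a direction cube.
        ox, oy, oz = [int((origin[i] - bounds_min[i]) // cell_size) for i in range(3)]
        ax = max(range(3), key=lambda i: abs(direction[i]))  # dominant axis
        face = ax * 2 + (0 if direction[ax] > 0 else 1)      # one of 6 cube faces
        u_axis, v_axis = [i for i in range(3) if i != ax]
        u = direction[u_axis] / abs(direction[ax])           # in [-1, 1]
        v = direction[v_axis] / abs(direction[ax])
        ui = min(int((u * 0.5 + 0.5) * n_dir_cells), n_dir_cells - 1)
        vi = min(int((v * 0.5 + 0.5) * n_dir_cells), n_dir_cells - 1)
        return (ox, oy, oz, face, ui, vi)  # rays in the same cell share a frustum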

You could split the scene into a hierarchical tree of beams that cover the direction space, then put the primitives inside each beam into a tree sorted along the beam direction. The good part about this approach is that you can find the beam for a ray fast, then jump to tree position, and there's your intersect. The bad part is that it uses memory like crazy since you're storing references to the entire scene N times where N is the number of beam directions you have.

You'll start hitting diminishing returns pretty soon. BVH intersection takes a few dozen steps and half a dozen intersections. If you're slower than that, just use a BVH. If you can optimize your structure to half a dozen steps and a single intersection, you'll be 5-6x faster than a BVH. Nothing to scoff at but it won't get you from 1 spp to 10k spp alone.

Let's look at memory bandwidth. Doing 10k paths of length 5, you'd need to read 50k triangles per pixel. One triangle is nine floats, or 36 bytes. That'd add up to 1.8 MB of triangle reads per pixel. A GPU with 1 TB/s memory bandwidth could process half a million pixels per second. Or about 100x100 pixels per frame. And then you need to do the compute and fetch the materials and textures.

If you want to render 10k paths per pixel at 4k@60 with 1 TB/s memory bandwidth, each ray can use only 0.04 bytes of memory bandwidth, or a third of a bit. How to fit three rays in a bit is left as an exercise to the reader. If you limit yourself to 100 paths per frame at 1080p, you'll have 16 bytes of memory bandwidth per ray. Still not enough to fit a 9-float triangle, but maybe you could trace SDFs or something.

Compute-wise, well, 10k paths per pixel at 4k is about 80 billion paths per frame. At 60 Hz, that's a cool five trillion paths per second. A 13 TFLOPS RTX 2080 Ti could spare 2.5 ops per path. If each path has five rays, that's half an op per ray. Going down to 100 paths at 1080p30, your budget becomes about 432 ops per ray. If you also go from f32 to f16, you'd have around 860 ops per ray and 32 bytes of bandwidth per ray @ 1 TB/s, enough for 16 half-floats. With a memory-dense acceleration structure that averages one triangle fetched per traversal, you could theoretically fit the traversal and one triangle in that.
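
All of the above budget numbers in one place, so you can re-run them with your own targets:

    pixels_4k, pixels_1080p = 3840 * 2160, 1920 * 1080

    # Bandwidth: 10k paths x 5 bounces x 36-byte triangles, 1 TB/s budget.
    bytes_per_pixel = 10_000 * 5 * 36               # 1.8 MB per pixel
    print(1e12 / bytes_per_pixel / 60)              # ~9.3k pixels per frame, ~96x96

    # Per-ray bandwidth at the two operating points.
    print(1e12 / (pixels_4k * 10_000 * 5 * 60))     # ~0.04 bytes per ray
    print(1e12 / (pixels_1080p * 100 * 5 * 60))     # ~16 bytes per ray

    # Compute: ~13.4 TFLOPS RTX 2080 Ti.
    print(13.4e12 / (pixels_4k * 10_000 * 60))      # ~2.7 ops per path
    print(13.4e12 / (pixels_1080p * 100 * 5 * 30))  # ~430 ops per ray at 1080p30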

If you could somehow do path tracing with INT4s, you'd have a lot more headroom for memory bandwidth and compute. 32 bytes would fit 64 INT4s. And an A100 card can do 1248 trillion tensor ops per second on INT4s. That'd be 50 ops per ray at 4k60 and 10k paths per pixel. You could render something! I don't know what it would look like, but something at least!

Three types of paths: high-energy paths, high-probability paths and connecting paths. High-energy paths start from emissive surfaces. High-probability paths start from the camera. Connecting paths connect the two. High-P paths change when the camera moves and the scene in front of the camera changes. High-E paths change when the scene around the light sources moves. Connections can be weighted with MIS: you weight each path's contribution by the probability of the strategy that sampled it, relative to the other strategies.
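
The standard MIS balance heuristic is a one-liner; the example numbers below are made up.

    def balance_heuristic(pdfs, i):
        # MIS balance heuristic: the weight for strategy i's sample, given
        # the pdf that each strategy would assign to the same path.
        return pdfs[i] / sum(pdfs)

    # A path sampled by strategy i contributes f(path) / pdfs[i] * weight.
    # E.g. BSDF sampling vs. light sampling of the same direction:
    print(balance_heuristic([0.8, 0.1], 0))  # BSDF strategy gets weight ~0.89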

Unidirectional path tracing only generates high-P paths. Path tracing with next event estimation (shooting shadow rays at lights) tries to connect high-P paths with high-E regions. Bidirectional path tracing generates high-E paths and tries to connect high-P paths with high-E path suffixes. Photon mapping propagates high-E regions throughout the scene. Radiosity is another method to propagate high-E regions.

We know how to generate high-P paths and high-E paths. What about generating connecting paths? You could pick two random surfaces in the scene and try to connect them with a ray. Do this a few million times and you should have a decent connectivity map to estimate the transmission ratio from one arbitrary ray to another, and what the connecting path would look like. Then figure out high-P regions (what parts of the scene have camera rays) and sum up the energy in high-E regions, then search for high transmission connective paths from high-P regions to high-E regions and try to do connections that would contribute the most to camera pixels. That is, you've got a high-transmission, high-probability (according to the BSDF) connection from a camera path to a light path.
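
A sketch of building that connectivity map by brute force; sample_surface and visible are stand-ins for your scene sampling and shadow ray code.

    def build_connectivity(sample_surface, visible, n_pairs=1_000_000):
        # Estimate scene connectivity by testing random surface pairs with
        # shadow rays and tallying visibility per region pair.
        # sample_surface() returns a (point, region_id) pair.
        hits, tries = {}, {}
        for _ in range(n_pairs):
            (a, ra), (b, rb) = sample_surface(), sample_surface()
            key = (min(ra, rb), max(ra, rb))
            tries[key] = tries.get(key, 0) + 1
            if visible(a, b):
                hits[key] = hits.get(key, 0) + 1
        # transmission estimate between any two regions of the scene
        return {k: hits.get(k, 0) / t for k, t in tries.items()}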

More tricks. Scene traversal on a GPU. You've got 32 rays flying at the same time, they're stepping through the acceleration structure to find the next hit. The hit happens at different depths of the acceleration structure, so the whole thing proceeds at the speed of the slowest ray. Then you bounce all the rays, they go all over the place. Now your rays are diverging even more. Not only do they finish at different times, they also access memory in different places. Another bounce. Some rays finished their paths and the execution lanes for those rays are just idling until all the 32 rays have finished.

You could make the rays fly in a bundle, forcing them to bounce in the same direction. That'd get you lower divergence among the bundle. This'll give you a very biased ray bunch though, and it'll look like your renderer is doing some weird hackery. Maybe you could scramble the starting pixels for the rays. Use a dither pattern to pick a bunch of pixels from a region, trace them in a coherent fashion. Pick another dither bunch of pixels, trace them in a different coherent fashion. Repeat until you've got the region covered. Now neighboring pixels won't have traversed a coherent path, so you get something more of a dithered noise pattern. And hopefully get the performance gains of coherent ray bundles without the biased render look at low sample counts.
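
A sketch of picking the dither bunches with a 2x2 pattern; trace_coherent is hypothetical.

    import numpy as np

    def dither_bunches(width, height):
        # Split the frame into 4 interleaved pixel sets with a 2x2 dither
        # pattern. Each set is traced as its own coherent bundle, so
        # neighboring pixels come from differently-bounced bundles and the
        # correlation shows up as a dither pattern instead of blotches.
        ys, xs = np.mgrid[0:height, 0:width]
        phase = (ys % 2) * 2 + (xs % 2)  # 0..3, Bayer-style
        return [np.argwhere(phase == p) for p in range(4)]

    # Trace each bunch with its own shared bounce directions:
    # for bunch in dither_bunches(1920, 1080): trace_coherent(bunch)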
