
2020-06-25

file.glsl

Current design: each shader invocation has an 8k heap slice and an 8k IO heap slice, and there's a shared IO request buffer. The GPU writes IO requests to the request buffer, the CPU picks them up and writes the results to the IO heap, and the GPU copies the results from the IO heap to the main heap. The IO heap and the IO request buffer are marked volatile. The main heap isn't, so it can benefit from caches.
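A minimal sketch of that request flow in GLSL; the struct fields, binding points, and the status-polling convention here are my guesses for illustration, not the actual file.glsl layout:

```glsl
#define IO_READ 1

struct IORequest {
    int status;   // 0 = free, 1 = submitted by the GPU, 2 = completed by the CPU
    int opcode;   // e.g. IO_READ
    int offset;   // file offset to read from
    int count;    // number of words to read (word granularity for brevity; the real heap is byte-addressed)
    int result;   // index into ioHeap where the CPU wrote the result
};

layout(std430, binding = 0) buffer HeapBuffer { int heap[]; };                 // not volatile: can use caches
layout(std430, binding = 1) volatile buffer IOHeapBuffer { int ioHeap[]; };    // CPU writes results here
layout(std430, binding = 2) volatile buffer IORequestBuffer { IORequest ioRequests[]; };

// Issue a read and block until the CPU has serviced it, then copy the
// result from the volatile IO heap into the cached main heap.
int readToHeap(int offset, int count, int heapPtr) {
    int slot = int(gl_GlobalInvocationID.x);              // one request slot per invocation
    ioRequests[slot] = IORequest(1, IO_READ, offset, count, 0);
    while (ioRequests[slot].status != 2);                 // spin on the volatile completion flag
    int src = ioRequests[slot].result;
    for (int i = 0; i < count; i++) heap[heapPtr + i] = ioHeap[src + i];
    return heapPtr;
}
```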

Now I'm trying to handle argv nicely: allocate an extra slice in the heap and copy argv there before the shader starts.

This slice could also be used to store string literals. Right now, string literals are malloc'd and filled in by each invocation at shader start, which is a total waste of time. But because the string lib has funcs that do in-place modification, this avoids one class of errors. Switching to immutable strings in the lib is an enticing option.
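A sketch of the idea, assuming strings are (start, end) slices into the heap and the argv offsets are prefilled by the CPU before dispatch; the lib's actual representation may differ:

```glsl
struct string { int start; int end; };   // a slice into the heap (assumed representation)

layout(std430, binding = 3) readonly buffer ArgvBuffer {
    int argc;
    int argv[];   // 2*i = start offset of arg i in the heap, 2*i+1 = its end offset
};

string arg(int i) {
    return string(argv[2 * i], argv[2 * i + 1]);   // no per-invocation copying needed
}
```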

Memory allocation is done with a simple bump malloc. Freeing heap memory after use is a hassle, so I have macros FREE() and FREE_IO() that free whatever heap / IO heap allocations you made inside the wrapped call.
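Roughly, the bump malloc and the FREE() macro could look like this; a sketch with illustrative names, assuming a per-invocation cursor:

```glsl
int heapStart;   // start of this invocation's 8k heap slice
int heapPtr;     // bump cursor, set to heapStart at shader start

int malloc(int size) {
    int ptr = heapPtr;
    heapPtr += size;   // no free list: allocating is just a pointer bump
    return ptr;
}

// FREE(f) snapshots the cursor, runs the call, and rewinds: everything
// f allocated is released in one go.
#define FREE(f) { int _fp = heapPtr; f; heapPtr = _fp; }
```

The snapshot-and-rewind approach means frees nest LIFO-style around calls instead of freeing individual pointers, which is why it's a macro wrapping a call rather than a free(ptr).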

It might be nice to have a global heap with a proper malloc to store large allocations that are re-used across threads, e.g. for loading texture data. This would probably have to be a fixed-size buffer. I doubt that it's possible to allocate more buffer memory on the CPU side while the shader is running. Would be nice though!

Wrote a sketch of grep.glsl. Very educational, as it exposed a bunch of missing features and "this'd be nice to have" things. Handling argv and the program return value fall into the first category. Helpers for reductions across all invocations and sorted-order IO fall into the second.

The current grep.glsl sketch: each thread reads a 4kB + strLen(pattern) chunk, then runs indexOf(chunk, pattern) to find all occurrences of the pattern in it. The threads then iterate through all invocation IDs in order (with a barrier() call to keep them in lockstep), and when a thread's ID matches the current ID, that thread prints out its results. This should keep the threads within a workgroup in order, but the workgroups might still run in a different order. The threads then or-reduce whether any of them found a hit, to set the program return value to 0 or 1. Advance the read offset by the total thread count times 4kB and repeat until a thread reads less than 4kB (EOF).
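In condensed form, the sketch looks something like this; read(), indexOf(), printResults() and the string helpers are stand-ins for the lib's functions, not verbatim grep.glsl:

```glsl
layout(std430, binding = 4) buffer ResultBuffer { int returnValue; };   // initialized to 1 by the CPU

const int CHUNK = 4096;

void grepMain(string filename, string pattern) {
    int threadCount = int(gl_WorkGroupSize.x * gl_NumWorkGroups.x);
    int threadId = int(gl_GlobalInvocationID.x);
    bool found = false;
    int offset = threadId * CHUNK;

    while (true) {
        // Overlap the chunk by strLen(pattern) so matches straddling a
        // chunk boundary are still caught.
        string chunk = read(filename, offset, CHUNK + strLen(pattern));
        found = found || (indexOf(chunk, pattern) >= 0);

        // Print hits in invocation order by stepping through the IDs in
        // lockstep. barrier() only orders threads within a workgroup;
        // workgroups may still interleave.
        for (int id = 0; id < threadCount; id++) {
            barrier();
            if (id == threadId) printResults(chunk, pattern);   // hypothetical helper: prints this thread's matches
        }

        if (strLen(chunk) < CHUNK) break;   // short read = EOF
        offset += threadCount * CHUNK;
    }

    if (found) atomicMin(returnValue, 0);   // or-reduce: grep exits 0 on any hit, 1 otherwise
}
```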

This makes a bunch of design issues apparent. The IOs should run concurrently. Copying IO bytes to the heap has limited value here, so having a way to run string functions directly on the IO heap would be nice. The read could be done with a single IO instead of threadCount IOs. Overlapping IO and execution: issue two IOs, wait for the first, issue a third IO, process the first, wait for the second, issue a fourth, process the second, and so on. How to do line numbers? Reads should have an EOF flag. Can we hit 12 GB/s for files in the page cache? Can we cache files in GPU RAM and match more patterns later at 400 GB/s?
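For the IO/execution overlap, something like a double-buffered read loop could work. readAsync() and await() are hypothetical here: readAsync would only enqueue the request and return its slot, await would do the spin-and-copy.

```glsl
void grepLoop(string filename, string pattern, int stride, int chunkSize) {
    int offset = int(gl_GlobalInvocationID.x) * 4096;
    int a = readAsync(filename, offset, chunkSize);             // request 1 in flight
    int b = readAsync(filename, offset + stride, chunkSize);    // request 2 in flight
    bool eof = false;
    while (!eof) {
        string chunk = await(a);                                   // wait for the oldest IO
        a = b;
        b = readAsync(filename, offset + 2 * stride, chunkSize);   // keep two requests in flight
        eof = process(chunk, pattern);                             // CPU services a and b while we work
        offset += stride;
    }
}
```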

