fhtr: Fast code

I was thinking of the characteristics of high-performance language runtimes (read: execution times close to optimal for hardware) and came up with this list:

Flat data structures (err, like an array of structs where struct N is in memory right before struct N+1)

streaming memory reads are prefetcher-friendly, spend less time chasing pointers

Tightly-packed data

memory fetches happen in cache line -sized chunks, tightly-packed data gives you more payload per memory fetch
fit more payload into cache, faster subsequent memory accesses

Reusable memory

keep more of the working set of data in cache

Unboxed values

spend less time chasing pointers
generate tight code for data manipulation because data type known (float/double/int/short/byte/vector)

Vector instructions

more bytes manipulated per instruction, data moves faster through an execution unit

Parallel execution

more bytes manipulated per clock cycle, data moves faster through the processor

Keep data in registers when possible

less time spent waiting for caches

Keep data in cache when possible

less time spent waiting for memory
instead of going over the full data set several times end-to-end, split it into cache-sized chunks and process each one fully before moving onto the next one

Minimize amount of unnecessary data movement between processing units

keep data close to processor until you're done with it, even more important with GPUs

Flat code layout

low amount of jumps per byte processed

Tight code

keep more of the program in cache

Interleaved I/O

work on already loaded data while loading in new data
minimum amount of time spent waiting for I/O

You might notice that the major theme is optimizing memory use. I started thinking of program execution as a way to read in the input data set and write out the output data set. The sizes of the input and output data give you a nice optimum execution time by dividing the data set size by memory bandwidth (or I/O bandwidth if you're working on big things). The flow of the program then becomes pushing this river of data through the CPU.

Suck in the data in cache line -sized chunks, process entire cache line before moving to the next, preload next cache line while processing the current one. Use vector instructions to manipulate several bytes of data at the same time, use parallelism to manipulate several streams of data at the same time. Make your processing kernel fit into L1 instruction cache

Gnnnn ok, back to writing JavaScript.

fhtr

2012-01-21

Fast code

No comments:

Blog Archive

About Me