art with code

2010-06-27

Optimization madness

This paper, Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications!, from IBM Research is kinda fun. They did a benchmark with an Nvidia GTX 285 pitted against a dual quad-core Nehalem Xeon and a Power7. The benchmark is doing O(n²) dot products from two 4 MB (not MiB, mind) source images to a 250 kB destination image. Maybe you can see where this is going?

TL;DR


The aggregate cache bandwidth of two high-end $1386 quad-core Xeons almost reaches the main memory bandwidth of a $300 graphics card. The cache bandwidth of an eight-core Power7 blade (the cheapest system I could find started at $9788.00) actually exceeds the main memory bandwidth of the graphics card by 40%. A cache-conscious version of the GPU implementation might make it 5x faster (according to this blog post, the GTX 285 has 1.5 TB/s cache bandwidth). I'm currently trying to write one, testing on my 2 GHz Core 2 Duo and a Radeon HD 4850 that has 40% of the GTX 285's memory bandwidth.

My results as of now


Hardware tested


  • ATI Radeon HD 4850, 67.9 GBps memory bandwidth, 384 GBps L2, 480 GBps L1, 1000 mul-add SP GFLOPS, hobbled by lack of local mem and the lack of image support in AMD's OpenCL drivers, resulting in most reads coming from main memory
  • Core 2 Duo E6400, 6.4 GBps memory bandwidth, 18 GBps L2, 90.5 GBps L1, 37.5 mul-add SP GFLOPS, hobbled by lack of cache bandwidth and computing performance (but hey, maybe there's some way to unlock the last 16 GFLOPS)


Table of current results (8 FLOPS per 32 bytes memory read)


The results from the paper are on 500x500 input images. The paper reported
runtimes in seconds; I calculated the bandwidth numbers by dividing 281 GB
by the reported runtimes.

The OpenCL results are measured without the context and kernel creation
and release times, as those are likely only relevant for a single-run
workload (and a 0.7 sec one-time overhead on top of a 1 sec operation would
be a bit skewy). The OpenCL results do include the data write and read times.
GLSL times are similarly measured without global setup overhead.

359 GBps (90 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (1024x1024 input)
337 GBps (84 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (500x500 input)
278 GBps (70 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (640x640 input)
[276 GBps (69 GFLOPS) the paper's optimized impl on 8-core 3 GHz Power7]
259 GBps (65 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (512x512 input)
243 GBps (61 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (500x500 input)
240 GBps (60 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (250x250 input)
[230 GBps (58 GFLOPS) the paper's GTX 285 implementation using local mem]
204 GBps (51 GFLOPS) naive GLSL implementation on Radeon HD 4850
[159 GBps (40 GFLOPS) the paper's GTX 285 implementation]
[150 GBps (38 GFLOPS) the paper's optimized impl on 2x4-core 2.93 GHz Nehalem]
86 GBps (22 GFLOPS) naive OpenCL GPU implementation (1D workgroup)
84 GBps (21 GFLOPS) optimized OpenCL CPU on Core 2 Duo E6400
45 GBps (11 GFLOPS) manually optimized OpenMP SSE implementation
25 GBps (6 GFLOPS) naive OpenCL GPU implementation (2D workgroup)
5 GBps (1 GFLOPS) naive OpenMP SSE implementation
5 GBps (1 GFLOPS) naive scalar C implementation
1.1 GBps (0.3 GFLOPS) naive OCaml implementation
0.03 GBps (0.008 GFLOPS) naive pure-Python implementation


Notes


The GPU data transfer overhead scales linearly with problem size, but the computational complexity of the algorithm grows quadratically, so the effect of the overhead is more visible at small problem sizes, as you can see above. The CPU doesn't suffer from the data transfer and setup overheads, so below some problem size the CPU is faster. With the current results for the GLSL implementation and the server CPUs, that crossover is somewhere around a 200x200 input image. For my Core 2, it's around 150x150. For single-image runs (with all the OpenGL overhead factored into the runtime), the GLSL-vs-Power7 crossover is at around 600x600.

CPU performance went up from a 5.5 GBps naive parallelized SSE implementation to an 84 GBps optimized OpenCL version (93% of L1 bandwidth, 56% of peak GFLOPS). The fastest manual SSE code I managed topped out at 43 GBps. GPU performance went from an 86 GBps naive implementation to a 337 GBps optimized version (70% of L1 bandwidth, 8% of peak GFLOPS). Note that the naive CPU implementation is written in C, parallelized and vectorized. Written in pure Python instead, it runs at around 0.03 GBps, or 180x slower than the C baseline. For another baseline, a naive OCaml implementation manages 1.1 GBps. The OCaml and Python implementations don't get faster from reordering the memory accesses because they're bottlenecked by computation, not memory access.

But in vectorized C, this algorithm is all about optimizing your memory accesses. The CPU got a 6x boost from changing the loop nesting (4x from the order optimization itself, the rest from it letting each core work out of its own L1 cache, which made parallelization more effective) and another 2x boost from reusing already-loaded values (again, reducing average read latency). The GPU seems to be mostly bound by the memory access pattern (and the lack of local memory and global memory caching on the Radeon 4xxx series). On the GPU I got a 2.2x performance increase from updating 8 values at a time and reading the current block of used values into a local array. Heck, a more optimal OpenCL access pattern with _more_ reads and computation (but, sadly, _wrong_ reads) nearly doubles the GPU performance. A mostly lock-step-iterating version that does more reads got me an extra 20 GBps on OpenCL (kinda freaky, that). GLSL does use the caches, so it's a good bit faster.
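To make the blocking idea concrete, here's a rough C sketch of computing a small block of output values at a time so that loaded input values get reused. This is not the actual benchmark code: the indexing is simplified, the function name and BLOCK size are made up, and it assumes the output width is a multiple of BLOCK.

#include <stddef.h>

#define BLOCK 4

/* Illustrative only: out[dy][dx] = sum over a win x win window of
   a[y][x] * b[y+dy][x+dx]. */
void correlate_blocked(const float *a, const float *b, float *out,
                       int in_w, int win, int out_w, int out_h) {
  #pragma omp parallel for
  for (int dy = 0; dy < out_h; dy++) {
    for (int dx = 0; dx < out_w; dx += BLOCK) {
      float acc[BLOCK] = {0};                /* accumulators stay in registers */
      for (int y = 0; y < win; y++) {
        for (int x = 0; x < win; x++) {
          float av = a[y * in_w + x];        /* loaded once...                 */
          for (int i = 0; i < BLOCK; i++)    /* ...used for BLOCK outputs      */
            acc[i] += av * b[(y + dy) * in_w + (x + dx + i)];
        }
      }
      for (int i = 0; i < BLOCK; i++)
        out[dy * out_w + dx + i] = acc[i];
    }
  }
}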

The AMD OpenCL implementation in Stream SDK 2.1 doesn't support images on the HD 4850, and global memory accesses aren't cached (if I understood correctly; the generated assembly has texture fetches with the UNCACHED keyword, so probably!), which might be the cause of the low bandwidth. Guess I should buy an HD 5xxx / Nvidia card and test the performance with images (I hear they have real local memory as well :P). Testing with a faster CPU would be nice as well... Not that I have money for any of those, so pfft.

Some interesting things from my POV: Slow CPU code is slow. It's possible to reach at least 93% of L1 bandwidth on the CPU, though even that doesn't max out the ALU (56% of peak GFLOPS). And it's still kinda slow even when you do reach it. OpenCL is pretty nice for writing high-performance CPU code, as it takes the pain out of SSE and parallelization. The setup is a pain though, and there's a 0.1 s binary program loading overhead. GPUs are kinda fast, even when hobbled with incomplete driver support. I'm stuck at 8% of peak GFLOPS because I can't read from memory fast enough (and the GLSL implementation doesn't reuse loaded values to compute several output values in parallel, so it's doing cache reads instead of register reads; fixing that might well help).

The algorithm used in the benchmark is massively amenable to optimization: it has an embarrassingly parallel map pass and a completely associative reduce pass, it's completely vectorizable, each output value is basically a single N-wide mul with a horizontal add, there's a large amount of shared reads between adjacent output values, it benefits a lot from cache, it benefits a lot from prefetch, etc. Any numbers you see for a benchmark like that should be taken with a huge grain of salt; it's just too easy to get 2x performance differences with small optimizations. The difference between my scalar implementation and the parallelized SSE implementation was 15%. The difference between the parallelized SSE implementation and the optimized implementation was 1700%. The GPU went from a pessimized version that ran at 20 GBps to a regular version that ran at 85 GBps to an optimized version that ran at 240 GBps to a GLSL version that ran at 337 GBps. And the OpenCL GPU number is very sensitive to the data access pattern, the problem size and the size of the OpenCL workgroup: I'm getting 240 GBps with 64 workers on 512x512 images, 190 GBps with 32 workers, 180 GBps with 128 workers, 30 GBps on 500x500 images (which rises to 150 GBps by changing the stride size), 150 GBps on 256x256 images, etc. The GLSL version is much more stable with regard to input size, though it too suffers from lower performance at smaller input sizes.

The CPU is less prone to wild swings in throughput from small changes, which is nice. The only way to get decent performance on this algorithm seems to be memory access optimization, SSE, parallelization and collecting several values at a time. So you pretty much need to write it in C / C++ with OpenMP and SSE intrinsics, or OpenCL. And if you're writing it in OpenCL, why not get the GPU in the mix as well...

GPUs would be so much better if they didn't hard-lock when encountering an infinite loop. Over the past few days I've done Alt-SysRq-REISUB more times than I can remember. The OpenCL GPU implementation on the HD 48xx is ... finicky. Small changes in code can lead to either the GPU hard-locking, a 10x performance drop, or just plain incorrect behaviour. I'd like to see either GPUs with CPU levels of usability or CPUs with GPU levels of performance. Or both. I'm not picky.

Based on my initial experience, I'd say that if your algorithm is amenable to GPUs (i.e. it does lots of single-precision FP math or needs high memory bandwidth), it's not too difficult to have a naive GPU implementation that performs better than a naive CPU implementation, especially if your CPU implementation is in a slow language. In my case, optimizing the CPU and GPU implementations happened in parallel and the optimizations were quite often portable between the different implementations. The CPU was easier to optimize for and had a much better development experience (i.e. no lockups), but my cheapo CPU didn't have the performance to match the GPU at any point. The CPU-GPU performance delta went from 15x between naive implementations (40x with GLSL) to 4x between the optimized implementations.

In terms of price-performance, a $100 GPU and a $100 quad-core CPU (hmm, should you include the memory price in the CPU prices?) should perform about the same on this algorithm when using OpenCL. With GLSL, the GPU would be around 50% faster. A 2.9 GHz Athlon II X4 should get something like 220 GBps (247 GBps L1 bandwidth) and a 3 GHz quad-core Core i7 should get around 175 GBps (192 GBps L1 bandwidth). At the $200 price point, the gap between the GPU and the CPU will likely increase: you get around 2x the GPU L1 bandwidth, but only 50% extra CPU L1 bandwidth (with a 6-core Phenom; opting for a higher-clocked quad-core instead would likely show very little benefit, as you'd only get +20% out of going from 2.9 GHz to 3.5 GHz, for example). But lacking the hardware to test, this is mere speculation.

Development log


Update #0: Source code.

Update #1: GPU at 86 GBps. Managed to drop the CPU runtime from 49 s (5.5 GBps) to 30 s (9 GBps) by making the inner loop update the values in blocks. The intuition there being that you're going to have better cache locality if you're updating a bunch of values that use data from the same area. It also returns bad values for some input sizes, so it probably doesn't work right.

Update #2: The OpenCL GPU kernel is kinda hard to optimize, and most of my naive tries ended up hurting performance. But I did manage to get the performance up by 40% using a similar block-based approach as on the CPU (by using a local cache array instead of relying implicitly on L1). Currently at 118 GBps with roughly the same optimization effort as with the CPU implementation. I don't really know much of anything about optimizing OpenCL though, so I think the number could go considerably higher. On the CPU, there's maybe another 10 GBps squeezable out of it, I'll have to see about that too.

Update #3: GPU at 165 GBps (note that this is on a HD 4850 that has 40% of the memory bandwidth of the GTX 285. If the results scale linearly, GTX 285 would get 391 GBps.) CPU at 27 GBps, but the CPU results are wrong so I need to fix them to see if that's a real score.

Update #4: Fixed the optimized CPU version. It's at 22 GBps, but could probably go to 27 GBps with little work. The upper limit for the CPU version is 42 GBps where all reads come from L1. The L2 limit is less than half the L1 limit, so the optimized version is already beating pure L2 bandwidth.

Update #5: According to this article on iXBT Labs, the L2 bandwidth of the HD 4850 is 384 GBps and the L1 bandwidth is 480 GBps, so those would be the limits for the GPU score. If the CPU scores are any indication, 400 GBps lies within the realms of possibility.

Update #6: Wow, a less math-heavy version of the OpenCL kernel absolutely flies on the CPU. 33 GBps. That's equivalent to 8 cycles per cache line. If the Nehalem was doing 10 cycles per cache line, it might go 20% faster with this version and reach 187 GBps. Anyone with an 8-core 2.93 GHz Nehalem want to give it a go (:?

The difference between the GPU version and the CPU version is that the CPU version does the dot product with an accumulator vector and does the horizontal add at the very end of the loop. This is because a float4 dot product is a mul + horizontal add + float add on the CPU, and that's going to take 6 cycles or something. On the other hand, acc += v0 * v1 executes in one cycle. The GPU does the dot product in a single cycle and the add in another, so it's just 2 cycles. I tried running the CPU version on the GPU but it strokes it the wrong way and results in the GPU hanging (or just running at 0.1 GBps through a benchmark designed for a thousand times that). And the ATI Linux drivers don't do the Windows 7 thing of resetting the GPU if it goes unresponsive for several seconds.
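Roughly, the difference between the two inner loops looks like this. These are illustrative OpenCL kernels, not my actual ones: the names are made up and the indexing is stripped down to the bare accumulation loop.

/* GPU-style: float4 dot product per element pair, scalar accumulator. */
__kernel void corr_dot(__global const float4 *a, __global const float4 *b,
                       __global float *out, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += dot(a[i], b[i]);
  out[get_global_id(0)] = sum;
}

/* CPU-style: float4 accumulator and a single horizontal add after the loop,
   so the loop body compiles down to one mul-add. */
__kernel void corr_acc(__global const float4 *a, __global const float4 *b,
                       __global float *out, int n) {
  float4 acc = (float4)(0.0f);
  for (int i = 0; i < n; i++)
    acc += a[i] * b[i];
  out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;
}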

There's also a weird data alignment issue (or something) on the GPU: 496x496 images run at 152 GBps, 500x500 images run at 145 GBps, 512x512 runs at 165 GBps. The CPU doesn't care all that much.

Update #7: On trying to optimize the GPU kernel, managed to get the CPU kernel up to 42 GBps by refactoring a bit. That's nuts. That's a Core 2 Duo 2.12 GHz from four years ago beating the paper's two-thread Nehalem 2.93 GHz version by 13%. Maybe I should write a paper titled "Believe it or Not! 2 GHz Core 2 CPUs can Match 3 GHz Nehalems and POWER7s for FLOP-intensive Applications!"

Update #8: Did a few bogus GPU kernels to figure out where the realistic limits are. If the 160 GBps kernel did all its reads from L1, it could achieve 270 GBps.

Update #9: GPU kernel at 182 GBps, achieved by breaking each row into its own extra work unit. Fixed the algorithm as well: the paper was offsetting both images on Y, but the idea is to keep one image static and move the other image over it. Anyhow, that had no effect on performance. It's funny how an extra 15 GBps on the GPU is kinda meh, but a 7 GBps increase on the CPU is WHOAH AWESOMESAUCE. Hurr durr, still lacking the L1 breakthrough. If the new kernel did all its reads from L1, 350 GBps.

Update #10: Oh right, += isn't atomic. Now I get to rewrite things.

Update #11: Actually, += not being atomic wasn't the problem. The problem was that the HD 4850 OpenCL implementation only gives you 16k of local mem. And that's across all the workgroups. And I can't get it to work. Oh well. Anyhow, optimized work group size and got 190 GBps from the GPU (including data transfer to-from main mem), and the kernel bandwidth broke 200 GBps! Woop!

Update #12: CPU kernel at *drumroll* 78 GBps! Insane! That's equivalent to a single quad-core Nehalem socket. From a previous-gen chip with 66% of the clock-speed, half as many cores and quarter of the cache. Applying the same optimizations to the GPU kernel also got a nice 30 GBps performance boost, bringing it up to 235 GBps.

Update #13: Tuned the GPU kernel a bit and got 250 GBps at 640x640 problem size. The kernel is doing 260 GBps internally. Still need to eke out 16 GBps and it'd match the paper's Power7 numbers. Man, that thing is a beast. Or maybe they were using a version that was more optimized for it than for the Nehalem. My CPU kernel should do, hmm, maybe 500 GBps on the Nehalem, if it scales in a linear fashion. Who knows. [Update: 92% of an 8-core Nehalem's 370 GBps L1 bandwidth would be 340 GBps.]

Update #14: Fixed the kernels, as I had optimized correctness away. 6 GBps drop for the CPU, GPU speed dropped 10-20 GBps.

Update #15: CPU kernel at 84 GBps, GPU kernel at 220 GBps. The CPU kernel is now about as fast as the naive GPU kernel, but still 3x slower than the optimized one. And the CPU kernel produces different results from the GPU kernel and the reference implementation because it does the additions in a different order.

Update #16: GPU kernel at 207 GBps (500x500), 245 GBps (512x512), 274 GBps (640x640), with internal kernel bandwidth reaching 285 GBps on 640x640. The 640x640 bandwidth is about the same as the Power7 number in the paper, yay!

Update #17: 500x500 kernel at 243 GBps. Also, before, I was removing the context creation times and kernel build times from the OpenCL timings, but not the context and kernel release times. That meant the results weren't really descriptive of real multi-image run performance, as both the creation and the release are done only once regardless of the number of images you process. The creation overhead is 0.4 s and the release overhead is around 0.3 s, which will skew numbers for a test that takes ~1 s to run. To make the times more accurate for multi-image runs, I'm now deducting both the creation and release overheads from the OpenCL runtimes. The runtimes do include the data read and write overheads, as those are image-specific. (Hey, now all my OpenCL implementations have 0.3 s shorter times, yay?)

Update #18: Wrote a GLSL version of the GPU kernel to see if it might be able to use the texture caches better. Short answer: Yes. 204 GBps on a naive kernel.

Update #19: GLSL version at 337 GBps for 500x500 images (350 GBps on 640x640). 70% of L1 bandwidth. Cached memory reads sure are awesome. The performance improvement came from reading in 2x2 blocks; it's faster because a texture memory read also pulls the adjacent values into the cache. It might also be possible to use a similar load-reusing algorithm as I have in the CPU and OpenCL implementations and make the GLSL version faster still. But GLSL is painful for general computation, so maybe not.

The paper


By my calculations[1] they achieved 160 GBps read bandwidth on the GTX 285 (advertised memory bandwidth 159 GBps), 150 GBps on the Nehalem (advertised peak 2x22 GBps) and 276 GBps on the eight-core Power7 (advertised peak 100 GBps). The algorithm does 1 cycle of computation (an SSE mul-add) per 32 bytes of memory read, so it's pretty much a pure bandwidth benchmark. As the total problem size is 8.25 MB, it fits snugly inside the 8.38 MB L3 caches of the CPUs, so maybe the CPU L3 caches are made out of infinite-bandwidth magic?

But, uh, the L3 latency on the Nehalem is 39 cycles, which'd give it 37.5 GBps bandwidth at 8 cores. 150 GBps is 10 cycles per cache line. The L2 latency on the Nehalem is 11 cycles. Plus you actually have to do computation for at least four cycles per cache line, though that's going to be masked by prefetch. How can it get 150 GBps?

The paper had a table that compared the different stages of optimization. A single-threaded scalar version ran in 30 s, compared to a single-threaded SSE version that ran in 12 s. A 16-threaded SSE version took 1.8 seconds, which is 6.7x faster than the single-threaded one. Pretty nice with 8 cores. But this is a memory benchmark. How does a memory benchmark get 7x faster by increasing the thread count? That's only going to happen if each core has its own copy of the memory block it's working on. The same thing happened on the Power7 implementation: SIMD + OpenMP resulted in a 14x performance boost(!) So maybe their algorithm is really good at caching, who knows. There's no compilable source code in the paper so...

I did a quick reimplementation of the algorithm in C++ and got 69 s for the scalar version and 58 s for the SSE version on my Core 2 Duo 2.12 GHz. Parallelizing it by tossing a '#pragma omp parallel for' above the top-level loop brought the SSE execution time down to 49 s. Why so slow? The 8 MB input images don't fit in the CPU cache, turning the program from a cache bandwidth benchmark to a memory bandwidth one. By quick calculation, it's plowing through the benchmark at 5.6 GBps.

[Update: A more cache-optimal memory access pattern got the CPU up to 84 GBps, so it's definitely possible to do most reads from L1.]

Now, I also tested the algorithm on uninitialized allocations. And the results were very strange. On uninitialized data, the SSE version completed in 13 seconds. The parallel version ran in 7.5 seconds, for a bandwidth of 37.5 GBps. That's a 7 cycle latency. L2 latency is 14 cycles. It's actually reading from L1.

So how can two 4 MB images fit into 32 kB of L1 cache? The easy answer? They don't. I think what we're seeing here is the Linux VM subsystem doing optimistic allocation. That is, when you allocate a block of memory, all the 4 kiB pages in that block map to The Empty Page at first. The pages are only actually created when you write to them. So as all the pages of a 4 MB image actually map to the same empty page of RAM, the benchmark acts as if the whole image is in L1.
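If you want to see the effect yourself, here's a quick standalone C sketch (not part of the benchmark; the buffer size and repeat count are arbitrary). It deliberately reads from a buffer that was only malloc'd and compares that against a buffer that has actually been written to; on Linux the untouched one behaves as if it fits in cache.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SZ (4 * 1000 * 1000)   /* a 4 MB buffer, like one input image */
#define REPS 1000

static double sum_buffer(const float *buf, size_t n) {
  double s = 0.0;
  for (int r = 0; r < REPS; r++)
    for (size_t i = 0; i < n; i++)
      s += buf[i];
  return s;
}

int main(void) {
  size_t n = SZ / sizeof(float);
  /* reading untouched malloc'd memory on purpose: its pages all map to the zero page */
  float *untouched = malloc(SZ);
  float *touched = malloc(SZ);
  memset(touched, 0, SZ);        /* writing forces real pages to be allocated */

  clock_t t0 = clock();
  double a = sum_buffer(untouched, n);
  clock_t t1 = clock();
  double b = sum_buffer(touched, n);
  clock_t t2 = clock();

  printf("sums: %g %g\n", a, b);
  printf("untouched: %.2f s, touched: %.2f s\n",
         (double)(t1 - t0) / CLOCKS_PER_SEC,
         (double)(t2 - t1) / CLOCKS_PER_SEC);
  free(untouched); free(touched);
  return 0;
}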



I next wanted to check out the GPU numbers and wrote an OpenCL version of the algorithm. Which also gave me an excuse to learn OpenCL. I downloaded the ATI Stream SDK and a couple hours later had a working OpenCL version. There's no Linux installer in the package, so you get to do a strange dance setting LD_LIBRARY_PATH and ATISTREAMSDKROOT environment variables.

Writing the kernel was simple. I pretty much copy-pasted the C++ version of the algorithm to make the OpenCL kernel (only two things differ between the versions: vec4 turned into float4, and the offset loop counters are now job ids). The state setup needed to execute the kernel was quite tedious though, weighing in at 30 lines of code.
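For reference, the host-side boilerplate looks roughly like this. This is a hedged sketch from memory, not my actual code: the kernel name "correlate", the argument layout and the one-work-item-per-output launch are placeholders, and all error checking is omitted (a real version should test every err).

#include <CL/cl.h>
#include <stddef.h>

void run_kernel_once(const char *src, const float *a, size_t a_bytes,
                     const float *b, size_t b_bytes,
                     float *result, size_t out_bytes, size_t out_elems) {
  cl_int err;
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

  /* build the kernel from source (this is the one-time compilation overhead) */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  cl_kernel kernel = clCreateKernel(prog, "correlate", &err);

  /* copy the input images in, allocate the output */
  cl_mem img_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                a_bytes, (void*)a, &err);
  cl_mem img_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                b_bytes, (void*)b, &err);
  cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, NULL, &err);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &img_a);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &img_b);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &out);

  /* one work item per output value, default workgroup size */
  clEnqueueNDRangeKernel(q, kernel, 1, NULL, &out_elems, NULL, 0, NULL, NULL);
  clEnqueueReadBuffer(q, out, CL_TRUE, 0, out_bytes, result, 0, NULL, NULL);

  clReleaseMemObject(img_a); clReleaseMemObject(img_b); clReleaseMemObject(out);
  clReleaseKernel(kernel); clReleaseProgram(prog);
  clReleaseCommandQueue(q); clReleaseContext(ctx);
}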

The nice thing about OpenCL is that you can also run the kernels on the CPU. And, well, the CPU build of the OpenCL kernel actually ran faster than my parallelized SSE version, which is pretty awesome. The bad thing about OpenCL is that it's like C and OpenGL when it comes to error reporting: it fails silently and you have to milk the errors out. And when you manage to do an infinite loop, the GPU hangs and you have to cold-boot the computer. Additionally, when you're running something time-consuming on the GPU, it freezes compiz for the duration of the computation and you're left twiddling your thumbs. It would be really REALLY nice if the GPU could do pre-emptive multitasking. I've had to Magic-SysRq-boot my computer 4 times today. It's so Windows 95.

But how about the runtimes for the GPU? My first version took 3.5 seconds on the Radeon HD 4850: 14 times as fast as the CPU implementation and twice as fast as the bogus uninitialized version, and that's before accounting for the 0.6 sec kernel compilation and setup overhead. Caching a copy of the built kernel binary to disk and reusing it brought the GPU overhead down to 0.2 s. The CPU OpenCL overhead was 0.2 s with compilation, 0.1 s with a binary kernel.

The HD 4850 has a memory bandwidth of 67 GBps, but the benchmark achieves 91 GBps. Did I make an error in my bandwidth calculations? Or might it cache more of the input image than the CPU? Time to write a small memory benchmark... Yeah. 10.3 seconds to do 69 billion 16-byte reads from a 4 MB buffer, 106 GBps. Guess it caches it better than the CPU (which managed 7.6 GBps.)

By spending the same time optimizing the CPU implementation and the GPU implementation, I managed to further drop the realistic case CPU runtime to 12 s (23 GBps) by making the inner loop update the values in blocks. The intuition there being that you're going to have better cache locality if you're updating a bunch of values that use data from the same area. This kind of a thing could make the Nehalem keep the reads mostly in L2, which might well get it close to 150 GBps. [Update: As noted above, you can keep the reads mostly in L1, which might get the dual Nehalem to ~340 GBps, if the Core 2 results scale.]

Applying the same optimization to the GPU version brought its bandwidth up to 165 GBps. That's 10% faster than the $2600 Nehalem duo. On a $100 graphics card. And that's not even close to the 384 GBps L2 bandwidth limit for the HD 4850, so there is probably quite a bit of headroom there. [Update: Well, if you could use the cache with the HD 4xxx OpenCL implementation...]

To wrap up: Benchmarking a cache-optimized algorithm against an algorithm that reads from the main memory is kinda pointless. Or was the goal of the paper to say "Hey, we've got an awesome optimizing compiler that can reorder loops to be very cache-optimal"? Also, I should remember to initialize buffers when benchmarking. And a $100 graphics card trounces $2600 worth of CPU power in this benchmark. Finally, that $100 graphics card sure could use some damn hardware support for pre-emptive multitasking.

I put the sources on GitHub, go check 'em out.

[1] Estimated bandwidth use for the benchmark:

float_sz = 4
out_size = 250 * 250 = 62500

reads_per_out = 2 * (500-0.5*250) * (500-0.5*250) * 4*float_sz
= 2 * 375*375 * 16
= 4500000 bytes
writes_per_out = float_sz

total_reads = out_size * reads_per_out
total_writes = out_size * writes_per_out

total_bw = out_size * (reads_per_out + writes_per_out)
= 62500 * (4500000 + 4)
= 281 GB

2010-06-25

Democracy

You can only call an economy a market economy if at least 60% of the population can fully participate in it on a daily basis and own property.

By this measure, there are several functioning market economies in the world.

You can only call a country democratic if at least 60% of the population can fully participate and vote in a parliamentary meeting about the matters of the state at least once every two weeks.

By this measure, there are zero democracies in the world.

Suppose that you had an economic system similar to a representational republic. Only the members of the parliament are allowed to own property and participate in the economy. Every four years, the people vote to select a couple hundred people who are allowed to participate in the economy. Does that sound sane? A good system of economy?

I blame Socrates and Plato for the travesty that is known as the Republic. They were children of a democracy[1] and wanted to overthrow the system to install a philosopher dictatorship in its place. The only reason their writings have survived is that they were written in a democracy (and hence not summarily burned) and that they have provided a nice argument for rulers ever since.

"We actually have the best system of governance. Only us well-born/popular/rich/divine/ideologically correct should be the ones to have any political power. See, even the Ancient Greeks who lived in an oh-so-terrible democracy agree: the perfect political system is one governed by a perfect philosopher-king who is omniscient and makes perfect decisions. Not one that's governed by flawed people who know nothing and make imperfect decisions. QED."

Compare:

"We actually have the best system of economy. Only us well-born/popular/rich/divine/ideologically correct should be the ones to own any property. See, even the Ancient Greeks who lived in an oh-so-terrible free market agree: the perfect economic system is one governed by a perfect economist-king who is omniscient and makes perfect trades. Not one where property can be owned by flawed people who know nothing and make imperfect trades. QED."


[1] Athens had a population of around 200 000, of which only 30 000 were members of parliament, so it fails to reach the 60% standard. However, that was 2500 years ago, in the times of the Persian and Egyptian God-Kings, the Spartan military dictatorship, the Warring States Period, etc. and should perhaps be judged in that context. In a modern society, where you have equality of the sexes and no slavery, they should have had 120 000 members of parliament. Then again, where are the modern countries with even a paltry 30 000 members of parliament?

2010-06-24

Funny multiplication tricks

This kinda struck me yesterday: multiplication is additive pattern splatting (I just came up with that term). I've been seeing binary tricks that use multiplication as a copy operation but never really understood them before now.

With sparse binary patterns it's quite apparent. For example, let's multiply 00100010001001001 by 111. Sounds hard? It's actually easy, you replace each 001 in the long pattern with the short pattern and end up with 11101110111111111. In the same way you can copy a byte to different places in a word. This may sometimes be faster and/or less bother than bit-shifting:

#include <stdint.h>

uint64_t fillInt64WithByte_shift(uint8_t b) {
  uint64_t x = b;  /* widen first, otherwise the high shifts overflow */
  return x << 56 | x << 48 | x << 40 | x << 32 | x << 24 | x << 16 | x << 8 | x;
}

vs.

uint64_t fillInt64WithByte_mul(uint8_t b) {
  return 0x0101010101010101ULL * b;
}


It also works with decimal numbers: 10101 x 99 is 999999. If you have a number that's not a 1 in the target pattern, you need to multiply your source pattern with that: 001 002 003 0005 x 203 = 203 406 609 1015.

However, the simple copy splatting only works when you don't have overlaps. With overlaps you need to add the new splat to the target pattern. Let's consider non-carry addition first using 1001301001 x 211 as the example. First you fill in the non-overlapping parts (the 1001001001 pattern) and end up with 211 211 211 211. To integrate the 3 into the result you multiply it by 211 and add the result to the target pattern.

1001301001 x 211 =
 211211211211
+    633
 ------------
 211274511211


When you have carries, it gets a good deal more tedious. Consider binary 1111 x 1111. First we splat 1111 on the first 1 in the target pattern: 01111000, then on the second 1: 00111100, etc. and finally get 01111000 + 00111100 + 00011110 + 00001111 = 11100001. Drudgery.

We could move it to base-15 and get 10 x 10 = 100, then convert that back to base-2... for no benefit whatsoever. 1111^2 x 1 = 1111 x 1111. Or do a trick with 10000 x 1111 - 1111 = 11110000 - 1111 = 11100001.

Or convert to base-16 for F x F = E1, then convert that to base-2 for 1110 0001. Note the similarity between F x F = E1, 9 x 9 = 81, o7 x o7 = o61, b1 x b1 = b01. This can be explained by

For a base whose largest digit is N (i.e. N = 10 - 1 in that base):
N x N = N x (10 - 1)
= N x 10 - N
= N x 10 - (10 - 1)
= (N x 10 - 10) + 1
= (N-1) x 10 + 1


Back to splat-additive multiplication, it seems to work for positive integers. Does it work for negative ones? Well, yes, 1001001 * -101 = -101000000 + -101000 + -101 = -101101101. And there's that sign-trick to turn multiplication of negative integers to multiplication of positive integers: -A x B = -(A x B), -A x -B = -(-(AxB)) = AxB.

Does it work for matrices... not that I can tell. Matrix multiplication is not splatting anyhow. What would a splat-additive matrix multiplication operation look like? I don't know, let's start exploring by trying to extend the scalar splat-additive multiplication to vectors. If we treat vectors as bucket-based integers (that is, no carries, base-∞), integer-like splat-additive vector multiplication would be quite easy. For example, [1 0 0 2] x [3 6] = [3 6 0 6 12], but what can that be used for? Is there some sensible conversion to scalar from it? Can it be extended to matrices?
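As a sanity check, here's a tiny C sketch of that bucket-based splat-multiply (the name splat_mul is made up): each element of a splats a scaled copy of b into the result with no carrying, which is just polynomial-coefficient multiplication, i.e. a discrete convolution.

#include <stdio.h>

void splat_mul(const double *a, int alen, const double *b, int blen, double *out) {
  for (int i = 0; i < alen + blen - 1; i++) out[i] = 0.0;
  for (int i = 0; i < alen; i++)
    for (int j = 0; j < blen; j++)
      out[i + j] += a[i] * b[j];   /* splat b, scaled by a[i], at offset i */
}

int main(void) {
  double a[] = {1, 0, 0, 2}, b[] = {3, 6}, out[5];
  splat_mul(a, 4, b, 2, out);
  for (int i = 0; i < 5; i++) printf("%g ", out[i]);   /* prints: 3 6 0 6 12 */
  printf("\n");
  return 0;
}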

As an aside, scalar vector multiplication can be used to make integer multiplication a bit simpler. Using 32 x 1935 as an example: 32 x [1 9 3 5] = [32 288 96 160]. We can convert that to an integer value by doing a base conversion for each element and adding them together: 32*1000 + 288*100 + 96*10 + 160*1 = 61920 = 32 x 1935.

Another way to do splat multiplication with vectors would be to generate a matrix from vector multiplication, an approach not entirely without appeal. [A B] x [C D] would result in [ [AC AD] [BC BD] ]. It can express number multiplication as a special case: AxB would be |A| x |B| = AxB (I'm using |A| to say that |A| is a single-element vector of zero dimension). If you have dimensional numbers, it preserves the dimensionality of the result, [Length] x [Width] = [[Area]].

Can the vector-splatting multiplication be extended to vectors of vectors? Seems so:

[ [1 2] [3] ] x [ [5 5] [6 2] ] would be, uh, ...
[ [ [1 2]x[5 5] [1 2]x[6 2] ] [ [3]x[5 5] [3]x[6 2] ] ] =
[ [ [[5 5] [10 10]] [[6 2] [12 4]] ] [ [[15] [15]] [[18] [6]] ] ]

And to continue with the dimensional numbers,

[Height] x [[Area]] = [[[ Height x Area ]]]
[[Area]] x [[Area]] = [[[[ Hypercube! ]]]]


What properties would this vector-splat multiplication have?

[ A B ] x [ C D ] = [ [AC AD] [BC BD] ]
[ C D ] x [ A B ] = [ [CA CB] [DA DB] ]
Not commutative (unless the equality is sum-equality: A = B if sum(A) = sum(B).)

([ A B ] x [ C D ]) x [ E F ] =
[ [AC AD] [BC BD] ] x [ E F ] =
[ [[ACE ACF] [ADE ADF]] [[BCE BCF] [BDE BDF]] ]

[ A B ] x ([ C D ] x [ E F ]) =
[ A B ] x [ [CE CF] [DE DF] ] =
[ [[ACE ACF] [ADE ADF]] [[BCE BCF] [BDE BDF]] ]
Seems to be associative.

The identity element would be a dimensionless 1:
1 x [ A B ] = [ 1xA 1xB ] = [ A B ]

Note that a dimensional one won't do:
[1] x [ A B ] = [ [A] [B] ]

To do the inverse element, we need vectors with negative dimensions:
I x [ A B ] = 1 requires I to reduce the dimension of [ A B ]

Additionally, the inverse element requires sum-equality (i.e. [A B] = [B A])
[A B] = [C] iff A + B = C
|0.5 0.5| = 0.5 + 0.5 = 1, hello confusing notation for dimensionless vectors
[[A [B] C]] = |[[[B]]] [[A C]]|, hello polynomial notation

I x [ A B ] = 1
[1/2A 1/2B]^-1 x [ A B ] = |A/2A B/2B| = |1/2 1/2| = 1

So the inverse function would be
-- get inverse vector for which V x inv(V) = 1
inv(V) = (map (\i -> 1/(V_len*i)) V)^(-V_dim)


...ghhh. So, uh, I guess with sum-equality it could maybe form a field with some sort of vector addition. The real question is whether there is anything where this multidimensional splat-multiplication works better than existing methods. It looks a lot like polynomials. Maybe the vector addition could be..

|A B| = (A + B)^0
[A B] = (A + B)^1
|[[A B] C] D| = |[[A B]] [C] D| = (A+B)^2 + C^1 + D^0
= [[A B]] + [C] + |D|

That's kinda interesting: the (A+B)^2 isn't (A+B) squared but rather means that (A+B) is a two-dimensional number.

Oh well, back to the grind.

2010-06-23

Status update

Taking two days off, a mid-week weekend.

Estimated that doing an editable zoomable tilemap would take a week. 15 days of consecutive commits say that I was wrong by at least a factor of three. Zoomable tilemap, check. Editable, sort of. Lacking in genericity as rendering in browsers is either slow or requires more ugly performance hacks than you can shake a stick at. And 60 fps is the law of the land, so screw genericity.

OCaml is horrible on 32-bit. 16MB maximum string size. And strings are the intended way to handle binary data. And if you use BigArrays, they're not GC'd properly. Then again, the worst-case allocation in my use would be 2.5 gigs, so perhaps it would be reasonable to work very hard to do things in smaller increments. Loading 16k x 16k images, that is my nemesis (I had a Ruby + C version of this but...)

Next need to either make an interactive character animation system in JS or find one. Hrm.. Maybe look at the Big Buck Bunny Blender files, see how to do the whole model-rig-animate-thing, export to JS, write some simple engine to run it. Or just draw the animation by hand (On paper! With a reed! A broken reed!)

2010-06-06

WebGL stereo rendering demo


Cross your eyes and prepare for eyestrain, here comes the WebGL stereogram. Drag with LMB to rotate, drag with MMB to pan, wheel zooms, shift-wheel scales model.

The demo uses glScissor and glViewport to restrict drawing to one side of the viewport, then draws the scene to the left and right sides with the camera offset slightly. The camera offset is along the camera's right vector, which you get as the cross product of the look and up vectors. Basically right = cross((lookAt - position), up), then leftEye = position - (sep/2 * right); rightEye = position + (sep/2 * right). Then you render to the left viewport from rightEye and to the right viewport from leftEye.
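In code, the eye-offset math looks something like this. It's a plain-C sketch with hand-rolled vector helpers (the demo does the equivalent in JavaScript); the helper names are made up, and I normalize the right vector here so that sep is in world units.

#include <math.h>

typedef struct { float x, y, z; } vec3;

static vec3 vsub(vec3 a, vec3 b) { return (vec3){a.x-b.x, a.y-b.y, a.z-b.z}; }
static vec3 vadd(vec3 a, vec3 b) { return (vec3){a.x+b.x, a.y+b.y, a.z+b.z}; }
static vec3 vscale(vec3 a, float s) { return (vec3){a.x*s, a.y*s, a.z*s}; }
static vec3 vcross(vec3 a, vec3 b) {
  return (vec3){a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static vec3 vnormalize(vec3 a) {
  float len = sqrtf(a.x*a.x + a.y*a.y + a.z*a.z);
  return vscale(a, 1.0f / len);
}

/* Compute the two eye positions from the camera position, look-at point,
   up vector and eye separation. For the cross-eyed view, render the left
   viewport from rightEye and the right viewport from leftEye. */
void stereoEyes(vec3 position, vec3 lookAt, vec3 up, float sep,
                vec3 *leftEye, vec3 *rightEye) {
  vec3 right = vnormalize(vcross(vsub(lookAt, position), up));
  *leftEye  = vsub(position, vscale(right, sep * 0.5f));
  *rightEye = vadd(position, vscale(right, sep * 0.5f));
}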

2010-06-02

Loading WebGL models from tarballs



Plugged the TarGZ loader together with my old OBJ and binary model loaders and emerged with a couple demos that load a model and a texture from a gzipped tarball and render it using WebGL.

Verdict: use uncompressed tarballs and HTTP gzip compression whenever possible. Doing the gzip decompression in JS doesn't cache the uncompressed data transparently (you'd have to do an offline storage hack), and it's kinda slow. Parsing a tar vs. plain files is a bit of a different tradeoff: with a tar you only have to manage a single file, while with plain files you need to manage the whole directory hierarchy they live in, which may not be something you can or want to do with your file-serving host.

One solution is uncompressed tarballs and a .htaccess with a deflate filter set for them, though it seems to screw up caching somehow (thanks to Evgeny Demidov for the .htaccess tip):

AddOutputFilterByType DEFLATE text/html text/plain application/x-tar application/javascript

<Files *.bin>
SetOutputFilter DEFLATE
</Files>

which also helps for JSON models and OBJs and other textual things. The main problem with the tarballs is that loading the images takes a bit longer (read: hundredths to tenths of a second) due to having to dataURLize them. Well, if you pre-dataURL the images, that won't be a problem.

The main benefit of the tarballs is that everything's in a single file and you don't need a chain loader to initialize models in a single step. Instead of first showing an untextured model, then sometime later the model with a diffuse texture, then later the model with a diffuse texture and a normal map, etc., with single-step initialization you get the full textured model once all the data is loaded. Plus, if you have a standard model format of some sort with manifests and whatnot, you can swap models by changing a single URL. You can do that with a plain model.xml or somesuch as well, but then you need to manage multiple files. It's harder to reuse resources inside tarballs than resources as separate files, though.

Also updated the big whale model viewer to be mouse-controllable (drag with LMB to rotate, drag with MMB to pan, scroll wheel to zoom, shift-scroll wheel to scale).
