art with code

2009-03-10

4x4 matrix multiplication in JavaScript vs C

The benchmark: do a million 4x4 double matrix multiplications.

Update: Here's the C version and the ASM output. And here's JavaScript version.

Update #2: I wrote an SSE2 version of the 4x4 float matrix multiplication, check it out. It's roughly four times faster than the scalar version.

SpiderMonkey (as x86_64 doesn't get any TraceMonkey love)


7 seconds.

SquirrelFish (extrapolated from someone else's numbers on a faster computer)


1.6 seconds.

C (gcc -O3) with alloca'd double arrays (yes, this is a bad idea)


2.6 seconds.

C (gcc -O3) with malloc'd double arrays


0.100 seconds.

C (gcc -O3) with posix_memalign'd double arrays (get them on a 16-byte boundary)


0.082 seconds.

C using floats (and malloc'd arrays, posix_memalign made the times the same as with doubles... go figure.)


0.052 seconds.

What did I learn


SSE (which GCC outputs for fp math) really doesn't like unaligned data. Or stack data. Or both.
And that fast JavaScript math is 20x slower than normal C.
And that data allocation is a finicky beast.
The speed difference between malloc'd and posix_memalign'd data is just wack.

Edit: On reading the ASM generated for the float arrays: GCC inlines the multiplication using movaps (i.e. 128-bit wide float mov) when you use only malloc, but if you have a posix_memalign call on an array (you don't even need to use the result), it falls back to using movss (scalar float mov.) The throughput for both is the same, but movss moves 4 times less data.

Looking at the ASM generated for the double arrays, it's the same thing but opposite results. The malloc version uses movapd and movhpd (and some mulpd and addpd), while the posix_memalign version uses movsd and does scalar math. But the malloc version is jumpier and longer, so that probably screws its performance.

If you do an average of one mmult per active object per frame, and the rest of your engine overhead is a fixed 12 ms per 60fps frame, you could have 50000 objects with C, 1700 objects with SquirrelFish, and 370 objects with SpiderMonkey. And then you still would have to try and do something about the 150 ms GC pauses that SpiderMonkey hoists on you...
Post a Comment

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes, much bad code, much bad art.

Have freelanced for Verizon, Google, Mozilla, Warner Bros, Sony Pictures, Yahoo!, Microsoft, Valve Software, TDK Electronics.

Ex-Chrome Developer Relations.