fhtr: 4x4 matrix multiplication in JavaScript vs C

2009-03-10

4x4 matrix multiplication in JavaScript vs C

The benchmark: do a million 4x4 double matrix multiplications.

Update: Here's the C version and the ASM output. And here's JavaScript version.

Update #2: I wrote an SSE2 version of the 4x4 float matrix multiplication, check it out. It's roughly four times faster than the scalar version.

SpiderMonkey (as x86_64 doesn't get any TraceMonkey love)

7 seconds.

SquirrelFish (extrapolated from someone else's numbers on a faster computer)

1.6 seconds.

C (gcc -O3) with alloca'd double arrays (yes, this is a bad idea)

2.6 seconds.

C (gcc -O3) with malloc'd double arrays

0.100 seconds.

C (gcc -O3) with posix_memalign'd double arrays (get them on a 16-byte boundary)

0.082 seconds.

C using floats (and malloc'd arrays, posix_memalign made the times the same as with doubles... go figure.)

0.052 seconds.

What did I learn

SSE (which GCC outputs for fp math) really doesn't like unaligned data. Or stack data. Or both.
And that fast JavaScript math is 20x slower than normal C.
And that data allocation is a finicky beast.
The speed difference between malloc'd and posix_memalign'd data is just wack.

Edit: On reading the ASM generated for the float arrays: GCC inlines the multiplication using movaps (i.e. 128-bit wide float mov) when you use only malloc, but if you have a posix_memalign call on an array (you don't even need to use the result), it falls back to using movss (scalar float mov.) The throughput for both is the same, but movss moves 4 times less data.

Looking at the ASM generated for the double arrays, it's the same thing but opposite results. The malloc version uses movapd and movhpd (and some mulpd and addpd), while the posix_memalign version uses movsd and does scalar math. But the malloc version is jumpier and longer, so that probably screws its performance.

If you do an average of one mmult per active object per frame, and the rest of your engine overhead is a fixed 12 ms per 60fps frame, you could have 50000 objects with C, 1700 objects with SquirrelFish, and 370 objects with SpiderMonkey. And then you still would have to try and do something about the 150 ms GC pauses that SpiderMonkey hoists on you...

4 comments:

Paul said...: can we take a look at your test code ?; March 11, 2009 at 6:45 PM
Ilmari Heikkinen said...: Sure, added links to the top of the post.; March 11, 2009 at 7:25 PM
Anonymous said...: Less than a year later we can reduce it in Firefox from 1500 ms to just 200 ms. That makes it 12500 Objects in WebGL;).

And I think it will only get better coming months with jaegermonkey, GC and tracing improvements.

This is the used code:

var identity = new WebGLFloatArray([
1.0, 0.0, 0.0, 0.0,
0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 1.0
]);

function mul4x4InPlace (a, b, c) {
var a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4], a5 = a[5], a6 = a[6], a7 = a[7], a8 = a[8], a9 = a[9], a10 = a[10], a11 = a[11], a12 = a[12], a13 = a[13], a14 = a[14], a15 = a[15],
b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3], b4 = b[4], b5 = b[5], b6 = b[6], b7 = b[7], b8 = b[8], b9 = b[9], b10 = b[10], b11 = b[11], b12 = b[12], b13 = b[13], b14 = b[14], b15 = b[15];

c[0] = b0 * a0 +
b1 * a4 +
b2 * a8 +
b3 * a12;
c[1] = b0 * a1 +
b1 * a5 +
b2 * a9 +
b3 * a13;
c[2] = b0 * a2 +
b1 * a6 +
b2 * a10 +
b3 * a14;
c[3] = b0 * a3 +
b1 * a7 +
b2 * a11 +
b3 * a15;
c[4] = b4 * a0 +
b5 * a4 +
b6 * a8 +
b7 * a12;
c[5] = b4 * a1 +
b5 * a5 +
b6 * a9 +
b7 * a13;
c[6] = b4 * a2 +
b5 * a6 +
b6 * a10 +
b7 * a14;
c[7] = b4 * a3 +
b5 * a7 +
b6 * a11 +
b7 * a15;
c[8] = b8 * a0 +
b9 * a4 +
b10 * a8 +
b11 * a12;
c[9] = b8 * a1 +
b9 * a5 +
b10 * a9 +
b11 * a13;
c[10] = b8 * a2 +
b9 * a6 +
b10 * a10 +
b11 * a14;
c[11] = b8 * a3 +
b9 * a7 +
b10 * a11 +
b11 * a15;
c[12] = b12 * a0 +
b13 * a4 +
b14 * a8 +
b15 * a12;
c[13] = b12 * a1 +
b13 * a5 +
b14 * a9 +
b15 * a13;
c[14] = b12 * a2 +
b13 * a6 +
b14 * a10 +
b15 * a14;
c[15] = b12 * a3 +
b13 * a7 +
b14 * a11 +
b15 * a15;
return c;
}

function time(elementId, f) {
var s = document.getElementById(elementId);
var t0 = new Date().getTime();
f();
var t1 = new Date().getTime();
s.textContent = 'Elapsed: '+(t1-t0)+' ms';
}

function testMatrixMultiply() {
time("testMatrixMultiply", function() {
var dst = new WebGLFloatArray(16);
for (var i=0; i<1000000; i++)
mul4x4InPlace(identity, identity, dst);
});
return false;
}; March 28, 2010 at 6:57 PM
MendicantMonkey said...: Stumbled across this in a Google search. With Firefox 4.0 on a Core 2 Duo T7200 2.00 GHz laptop running Windows 7 64-bit Pro, I get between 500-550 ms. The C version of the test compiled through Cygwin gets about the same speed, which makes me think Cygwin overhead must be awful.; March 24, 2011 at 6:05 PM

fhtr