Update #2: I wrote an SSE2 version of the 4x4 float matrix multiplication; check it out. It's roughly four times faster than the scalar version.
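The usual SSE2 trick for a 4x4 multiply is to broadcast each element of a row of `a` and accumulate against whole rows of `b`, so each output row is computed with four multiplies and three adds. A minimal sketch of that approach (my reconstruction, not the linked code; assumes row-major 4x4 matrices in flat float arrays, and uses unaligned loads to sidestep the alignment issues discussed below):

```c
#include <emmintrin.h>

/* out = a * b for 4x4 row-major float matrices.
   Row i of out = sum over k of a[i][k] * (row k of b). */
void mmult_sse(const float *a, const float *b, float *out) {
    __m128 b0 = _mm_loadu_ps(b + 0);
    __m128 b1 = _mm_loadu_ps(b + 4);
    __m128 b2 = _mm_loadu_ps(b + 8);
    __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; i++) {
        /* Broadcast each element of row i of a across a whole register. */
        __m128 a0 = _mm_set1_ps(a[4 * i + 0]);
        __m128 a1 = _mm_set1_ps(a[4 * i + 1]);
        __m128 a2 = _mm_set1_ps(a[4 * i + 2]);
        __m128 a3 = _mm_set1_ps(a[4 * i + 3]);
        __m128 row = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(a0, b0), _mm_mul_ps(a1, b1)),
            _mm_add_ps(_mm_mul_ps(a2, b2), _mm_mul_ps(a3, b3)));
        _mm_storeu_ps(out + 4 * i, row);
    }
}
```

With 16-byte-aligned inputs you could swap `_mm_loadu_ps`/`_mm_storeu_ps` for the aligned `_mm_load_ps`/`_mm_store_ps` variants.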
SpiderMonkey (as x86_64 doesn't get any TraceMonkey love)
SquirrelFish (extrapolated from someone else's numbers on a faster computer)
C (gcc -O3) with alloca'd double arrays (yes, this is a bad idea)
C (gcc -O3) with malloc'd double arrays
C (gcc -O3) with posix_memalign'd double arrays (get them on a 16-byte boundary)
C using floats (and malloc'd arrays; posix_memalign made the times the same as with doubles... go figure)
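For reference, getting a heap block onto a 16-byte boundary with posix_memalign looks like this (a sketch; `alloc_matrix` is my name, not from the benchmark code):

```c
#include <stdlib.h>

/* Allocate a 4x4 double matrix on a 16-byte boundary, so SSE can use
   aligned 128-bit loads (movapd) on it. Returns NULL on failure. */
double *alloc_matrix(void) {
    void *p = NULL;
    /* posix_memalign reports failure via its return value, not errno. */
    if (posix_memalign(&p, 16, 16 * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}
```

The returned block is freed with plain `free()`, same as malloc'd memory.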
What did I learn?
SSE (which GCC outputs for fp math) really doesn't like unaligned data. Or stack data. Or both.
And that data allocation is a finicky beast.
The speed difference between malloc'd and posix_memalign'd data is just wack.
Edit: On reading the ASM generated for the float arrays: GCC inlines the multiplication using movaps (i.e. a 128-bit-wide float mov) when you use only malloc, but if you have a posix_memalign call on an array (you don't even need to use the result), it falls back to movss (a scalar float mov). The throughput of the two instructions is the same, but each movss moves a quarter of the data.
Looking at the ASM generated for the double arrays, it's the same story with the results reversed. The malloc version uses movapd and movhpd (plus some mulpd and addpd), while the posix_memalign version uses movsd and does scalar math. But the malloc version is jumpier and longer, so that probably screws its performance.
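For context, the multiplication GCC is compiling in both cases is just the textbook triple loop; presumably something along these lines (my guess at the shape of the benchmarked code, using row-major flat arrays):

```c
/* Plain scalar 4x4 multiply, out = a * b, row-major flat arrays.
   This is the kind of loop GCC vectorizes (or doesn't) as described above. */
void mmult(const double *a, const double *b, double *out) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            double sum = 0.0;
            for (int k = 0; k < 4; k++)
                sum += a[4 * i + k] * b[4 * k + j];
            out[4 * i + j] = sum;
        }
}
```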
If you do an average of one mmult per active object per frame, and the rest of your engine overhead is a fixed 12 ms per 60fps frame, you could have 50000 objects with C, 1700 objects with SquirrelFish, and 370 objects with SpiderMonkey. And then you'd still have to try to do something about the 150 ms GC pauses that SpiderMonkey foists on you...
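The arithmetic behind those counts: a 60fps frame is ~16.7 ms, minus the fixed 12 ms leaves ~4.7 ms for multiplications, and the object count is that budget divided by the per-mmult cost. A sketch (the microsecond costs in the usage note are back-derived from the counts above, not separate measurements; `objects_per_frame` is a made-up helper name):

```c
/* How many objects fit in a frame, at one mmult per object per frame,
   given a fixed 12 ms engine overhead in a 60fps frame.
   mmult_us is the cost of one 4x4 multiplication in microseconds. */
double objects_per_frame(double mmult_us) {
    double frame_ms = 1000.0 / 60.0;     /* ~16.67 ms per frame */
    double budget_ms = frame_ms - 12.0;  /* ~4.67 ms left for mmults */
    return budget_ms * 1000.0 / mmult_us;
}
```

Feeding in roughly 0.093 µs (C), 2.7 µs (SquirrelFish), and 12.6 µs (SpiderMonkey) per multiplication reproduces the ~50000, ~1700, and ~370 object figures.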