# fhtr

art with code

## 2010-02-11

### 4x4 float matrix multiplication using SSE intrinsics

Here's a bit of fun from earlier this week: 4x4 matrix multiplication using SSE. Benchmarked it to be 4x faster than the scalar version (on a Pentium M, using GCC 4.4.1 with flags -O3 -mfpmath=sse -msse2 -std=c99).

To refresh your memory, here's the definition of matrix multiplication: A x B = R, where R[i][j] = dot(row(A, i), column(B, j)) and dot(u, v) = sum(i=0 to N, u[i] * v[i]). Only defined when A's width equals B's height. Matrix multiplication is used for transformations in computer graphics, 4x4 matrix multiplication specifically in 3D graphics.

First the scalar version. Multiply A with B and store the result in R. The matrices are stored in column-major order in a flat float array (so that the 4x4 transform matrix translation column is the last four values in the array).
`void mmul(const float *a, const float *b, float *r){  for (int i=0; i<16; i+=4)    for (int j=0; j<4; j++)      r[i+j] = b[i]*a[j] + b[i+1]*a[j+4] + b[i+2]*a[j+8] + b[i+3]*a[j+12];}`

Now, a naive slow SSE version. Load a row from A and a column from B, multiply them together and add the values of the result vector together.

Note that we need to align all the matrices to 16-byte boundaries. You can use `memalign(16, 16*sizeof(float))` or posix_memalign or compiler pragmas to do that. If the matrices are not aligned to 16-byte boundaries, the SSE code will segfault. I'm using the SSE intrinsics, so you won't see any cool inline assembler tricks here.
`voidmmul_sse_dot(const float * a, const float * b, float * r){  __m128 a_line, b_line, r_line;  float mc __attribute__((aligned(16)));  // 16-byte aligned temp array  for (int i=0; i<16; i+=4) {    b_line = _mm_load_ps(&b[i]);              // b_line = vec4(column(b, i))                                              // remember that we are column-major    for (int j=0; j<4; j++) {      a_line = _mm_set_ps(a[j+12], a[j+8], a[j+4], a[j]);                                               // a_line = vec4(row(a, j))                                              // note that SSE is little-endian      r_line = _mm_mul_ps(a_line, b_line);    // r_line = a_line * b_line      _mm_store_ps(mc, r_line);               // copy r_line to memory      r[i+j] = mc+mc+mc+mc;       // r[i][j] = sum(r_line)                                              //         = dot(a_line, b_line)    }  }}`

Sadly, this way is slow in practice (30% slower than the scalar version, in fact) as it stores only one value at a time and there's no operation to sum the elements of an SSE vector.

What we must do instead is execute four dots in parallel. Multiply the first value of the row A[I] with the first row of B and store the result to row R[I] (R[I] = A[I] * B). Then multiply the second value of row A[I] with the second row of B, add to row R[I] (R[I] += A[I] * B). Repeat for all the remaining values in row A[I].

Now the first value of row R[I] is the sum of products of the row A[I] and the first values of the rows of B. I.e. the dot product of row A[I] and the first column of B.

Correspondingly, the second value of R[I] is the sum of products of row A[I] and the second values of the rows of B. I.e. the dot of row A[I] and the second column of B. And so forth.

In pseudo-code:
`R[I] = A[I] * B +        A[I] * B +        A[I] * B +        A[I] * B`

But as our matrices are column-major, we need to flip A and B, so that we're multiplying whole columns with the values of a row. Ok, time for code:
`voidmmul_sse(const float * a, const float * b, float * r){  __m128 a_line, b_line, r_line;  for (int i=0; i<16; i+=4) {    // unroll the first step of the loop to avoid having to initialize r_line to zero    a_line = _mm_load_ps(a);         // a_line = vec4(column(a, 0))    b_line = _mm_set1_ps(b[i]);      // b_line = vec4(b[i])    r_line = _mm_mul_ps(a_line, b_line); // r_line = a_line * b_line    for (int j=1; j<4; j++) {      a_line = _mm_load_ps(&a[j*4]); // a_line = vec4(column(a, j))      b_line = _mm_set1_ps(b[i+j]);  // b_line = vec4(b[i][j])                                     // r_line += a_line * b_line      r_line = _mm_add_ps(_mm_mul_ps(a_line, b_line), r_line);    }    _mm_store_ps(&r[i], r_line);     // r[i] = r_line  }}`

Now we're cooking. 4x faster than the scalar version. And it even produces the same results.

You can get the full code here: mmul.c.

Wrote a C++ version with a vec4 struct to make the code super short and simple:
`void mmul_vec4(const float * a, const float * b, float * r){  for (int i=0; i<16; i+=4) {    vec4 rl = vec4(a) * vec4(b[i]);    for (int j=1; j<4; j++)      rl += vec4(&a[j*4]) * vec4(b[i+j]);    rl >> &r[i];  }}`

The best thing? Vec4 is actually a wrapper on __m128 SSE operations, so mmul_vec4 runs as fast as mmul_sse. And the wrapper was dirt simple to write too, see below.
`struct vec4{  __m128 xmm;  vec4 (__m128 v) : xmm (v) {}  vec4 (float v) { xmm = _mm_set1_ps(v); }  vec4 (float x, float y, float z, float w)  { xmm = _mm_set_ps(w,z,y,x); }  vec4 (const float *v) { xmm = _mm_load_ps(v); }  vec4 operator* (const vec4 &v) const  { return vec4(_mm_mul_ps(xmm, v.xmm)); }  vec4 operator+ (const vec4 &v) const  { return vec4(_mm_add_ps(xmm, v.xmm)); }  vec4 operator- (const vec4 &v) const  { return vec4(_mm_sub_ps(xmm, v.xmm)); }  vec4 operator/ (const vec4 &v) const  { return vec4(_mm_div_ps(xmm, v.xmm)); }  void operator*= (const vec4 &v)  { xmm = _mm_mul_ps(xmm, v.xmm); }  void operator+= (const vec4 &v)  { xmm = _mm_add_ps(xmm, v.xmm); }  void operator-= (const vec4 &v)  { xmm = _mm_sub_ps(xmm, v.xmm); }  void operator/= (const vec4 &v)  { xmm = _mm_div_ps(xmm, v.xmm); }  void operator>> (float *v)  { _mm_store_ps(v, xmm); }};`

Here's the full code for the C++ version: mmul.cpp.

#### 6 comments: Anonymous said...

well done, great to see some useful overloading :-) Anonymous said...

Nice! Although the achievement I got on a Core i5 750 (compiled with VS2008, all speed optimizations on) was only about a factor of 2. My exact timings (in ms) 504, 279, 277, 624 (for the mmul, mmul_vec4, mmul_sse, mmul_sse_dot). BTW, have you tried the new dot product _mm_dp_ps from SSE4? Best regards, Rolf Anonymous said...

Excellently written, I can't beat this. Can I use this code in a BSD-licensed project? (particularly, the SDL 1.3 license.)

Ilmari Heikkinen said...

Sure thing, use as you like! Anonymous said...

Tried the _mm_dp_ps but very low perf. My solution is to load both matrices, transpose the second one with _MM_TRANSPOSE4_PS (negligible!) and perform multiplication and horizontal additions. Seems to be the fastest solution (Core2) but it requires at least SSE3 for the hadd intrinsic.
Here is the code for the first line of the resulting matrix m2, r0 to r3 are used to store the first matrix, r4 to r7 are used to store the second one after transposition, other registers are temporaries :

r8 = _mm_mul_ps(r0, r4);
r9 = _mm_mul_ps(r0, r5);
r10 = _mm_mul_ps(r0, r6);
r11 = _mm_mul_ps(r0, r7);

r8 = _mm_hadd_ps(r8, r9);
r9 = _mm_hadd_ps(r10, r11);
r12 = _mm_hadd_ps(r8, r9);
_mm_store_ps(&m2, r12);

Perform the same operations to obtain the 3 others lines !

Albert said...

Great stuff! I looked around at stack overflow and other places and the fastest way (at least on core-* cpu:s) seems to be the haddps method with the second matrix transposed, as suggested by Anonymous on March 11, 2012. I will try to imlement and benchmark this myself to be sure. The sse3 requirement is not so bad. Most modern computers have it nowadays. You could always call cpuid and use the legacy sse method as fallback if it is important to be able to run on old hardware.