To refresh your memory, here's the definition of matrix multiplication: A x B = R, where R[i][j] = dot(row(A, i), column(B, j)) and dot(u, v) = sum(i=0 to N, u[i] * v[i]). Only defined when A's width equals B's height. Matrix multiplication is used for transformations in computer graphics, 4x4 matrix multiplication specifically in 3D graphics.

First the scalar version. Multiply A with B and store the result in R. The matrices are stored in column-major order in a flat float array (so that the 4x4 transform matrix translation column is the last four values in the array).

void

mmul(const float *a, const float *b, float *r)

{

for (int i=0; i<16; i+=4)

for (int j=0; j<4; j++)

r[i+j] = b[i]*a[j] + b[i+1]*a[j+4] + b[i+2]*a[j+8] + b[i+3]*a[j+12];

}

Now, a naive slow SSE version. Load a row from A and a column from B, multiply them together and add the values of the result vector together.

Note that we need to align all the matrices to 16-byte boundaries. You can use

`memalign(16, 16*sizeof(float))`

or posix_memalign or compiler pragmas to do that. If the matrices are not aligned to 16-byte boundaries, the SSE code will segfault. I'm using the SSE intrinsics, so you won't see any cool inline assembler tricks here.

void

mmul_sse_dot(const float * a, const float * b, float * r)

{

__m128 a_line, b_line, r_line;

float mc[16] __attribute__((aligned(16))); // 16-byte aligned temp array

for (int i=0; i<16; i+=4) {

b_line = _mm_load_ps(&b[i]); // b_line = vec4(column(b, i))

// remember that we are column-major

for (int j=0; j<4; j++) {

a_line = _mm_set_ps(a[j+12], a[j+8], a[j+4], a[j]);

// a_line = vec4(row(a, j))

// note that SSE is little-endian

r_line = _mm_mul_ps(a_line, b_line); // r_line = a_line * b_line

_mm_store_ps(mc, r_line); // copy r_line to memory

r[i+j] = mc[0]+mc[1]+mc[2]+mc[3]; // r[i][j] = sum(r_line)

// = dot(a_line, b_line)

}

}

}

Sadly, this way is slow in practice (30% slower than the scalar version, in fact) as it stores only one value at a time and there's no operation to sum the elements of an SSE vector.

What we must do instead is execute four dots in parallel. Multiply the first value of the row A[I] with the first row of B and store the result to row R[I] (R[I] = A[I][0] * B[0]). Then multiply the second value of row A[I] with the second row of B, add to row R[I] (R[I] += A[I][1] * B[1]). Repeat for all the remaining values in row A[I].

Now the first value of row R[I] is the sum of products of the row A[I] and the first values of the rows of B. I.e. the dot product of row A[I] and the first column of B.

Correspondingly, the second value of R[I] is the sum of products of row A[I] and the second values of the rows of B. I.e. the dot of row A[I] and the second column of B. And so forth.

In pseudo-code:

R[I] = A[I][0] * B[0] +

A[I][1] * B[1] +

A[I][2] * B[2] +

A[I][3] * B[3]

But as our matrices are column-major, we need to flip A and B, so that we're multiplying whole columns with the values of a row. Ok, time for code:

void

mmul_sse(const float * a, const float * b, float * r)

{

__m128 a_line, b_line, r_line;

for (int i=0; i<16; i+=4) {

// unroll the first step of the loop to avoid having to initialize r_line to zero

a_line = _mm_load_ps(a); // a_line = vec4(column(a, 0))

b_line = _mm_set1_ps(b[i]); // b_line = vec4(b[i][0])

r_line = _mm_mul_ps(a_line, b_line); // r_line = a_line * b_line

for (int j=1; j<4; j++) {

a_line = _mm_load_ps(&a[j*4]); // a_line = vec4(column(a, j))

b_line = _mm_set1_ps(b[i+j]); // b_line = vec4(b[i][j])

// r_line += a_line * b_line

r_line = _mm_add_ps(_mm_mul_ps(a_line, b_line), r_line);

}

_mm_store_ps(&r[i], r_line); // r[i] = r_line

}

}

Now we're cooking. 4x faster than the scalar version. And it even produces the same results.

You can get the full code here: mmul.c.

[Edit]

Wrote a C++ version with a vec4 struct to make the code super short and simple:

void mmul_vec4

(const float * a, const float * b, float * r)

{

for (int i=0; i<16; i+=4) {

vec4 rl = vec4(a) * vec4(b[i]);

for (int j=1; j<4; j++)

rl += vec4(&a[j*4]) * vec4(b[i+j]);

rl >> &r[i];

}

}

The best thing? Vec4 is actually a wrapper on __m128 SSE operations, so mmul_vec4 runs as fast as mmul_sse. And the wrapper was dirt simple to write too, see below.

struct vec4

{

__m128 xmm;

vec4 (__m128 v) : xmm (v) {}

vec4 (float v) { xmm = _mm_set1_ps(v); }

vec4 (float x, float y, float z, float w)

{ xmm = _mm_set_ps(w,z,y,x); }

vec4 (const float *v) { xmm = _mm_load_ps(v); }

vec4 operator* (const vec4 &v) const

{ return vec4(_mm_mul_ps(xmm, v.xmm)); }

vec4 operator+ (const vec4 &v) const

{ return vec4(_mm_add_ps(xmm, v.xmm)); }

vec4 operator- (const vec4 &v) const

{ return vec4(_mm_sub_ps(xmm, v.xmm)); }

vec4 operator/ (const vec4 &v) const

{ return vec4(_mm_div_ps(xmm, v.xmm)); }

void operator*= (const vec4 &v)

{ xmm = _mm_mul_ps(xmm, v.xmm); }

void operator+= (const vec4 &v)

{ xmm = _mm_add_ps(xmm, v.xmm); }

void operator-= (const vec4 &v)

{ xmm = _mm_sub_ps(xmm, v.xmm); }

void operator/= (const vec4 &v)

{ xmm = _mm_div_ps(xmm, v.xmm); }

void operator>> (float *v)

{ _mm_store_ps(v, xmm); }

};

Here's the full code for the C++ version: mmul.cpp.