art with code

2009-01-09

Multi-threaded qsort in C

Edit: Added memory access pattern analysis.
Edit #2: And a note on _qsort_1 and _qsort_2 having horrible performance. Plus swap accesses in the graphs.
Edit #3: The slowness in sorting chars comes from [x, x, x, ..., x] arrays triggering worst case behaviour (reverse sorted triggers it as well.) And as the char alphabet is only 8 bits, you get [x,x,x,...,x] spans after a couple rounds of partitioning, triggering the worst case. E.g. with 1M array, you'll get 256 spans of 2^20 / 2^8 = 2^14 = 16k length, which means that it'll be doing a total of 2^8 * 2^14^2 = 2^8 * 2^28 = 2^36 = ~65 billion swaps and compares in total. Which'll take a while.

Reading through K&R, wrote a libc-style qsort based on the simple version in K&R. Then wrote special cases for hardware element sizes (8/4/2/1 bytes.) At that point it took 2.25s to run on a 10 million element array of random int64s, versus 2.75s for the glibc version. Adding in simple threading brought the time down to 1.4s on my dualcore.

The generic version of the K&R qsort took around 2.5s for the single-threaded case (again, vs. glibc's 2.75s.) The above numbers are with -O2 and -O3. With optimizations turned off, the glibc qsort takes 7.2 seconds[1], while the K&R qsort takes 6.8 seconds (of which 1.5 sec can be regained by manually doing the -O3 loop invariant optimizations and inlining the swaps.) Which is interesting. What does the glibc qsort do wrong?

To investigate, I instrumented the swap and comparison functions to generate graphs of the memory access patterns. Here's the graph for the glibc qsort and the graph for the K&R qsort (with trailing insertion sorts.) A dot to the left of the graph is an access to the start of the array, while a dot to the right is an access to the end of the array.

In the graph for the glibc qsort you can see its "collapsing walls"-strategy in the tornado-like structures. Another feature of interest is at the very end: it runs insertion sort on each qsorted 4-element span. The tornadoes might be what's causing the performance gap between the glibc qsort and the K&R qsort, as they do streaming access both from the start and the end.

Timings for a small memory access pattern benchmark:

$ time ./mem_bm forward

real 0m5.138s
user 0m5.024s
sys 0m0.064s

$ time ./mem_bm reverse

real 0m5.004s
user 0m4.868s
sys 0m0.064s

$ time ./mem_bm both

real 0m6.197s
user 0m5.932s
sys 0m0.080s


Showing the tornado access pattern to be 20% slower than streaming access. Do take that with a hefty dose of salt though.

Also, the _qsort_2 and _qsort_1 versions are very slow compared to the glibc qsort. _qsort_1 specifically does something very wrong.

[1] Note that you have to use the _quicksort from stdlib/qsort.c, using the stdlib.h qsort links in the compiled library and -O doesn't have much effect on that (it does give you a small performance improvement though, enough to fool you into thinking that it does something to the qsort.)

Here's the generic version of the K&R qsort:

#include <stdio.h>
#include <string.h>
#include <alloca.h>
#include <sys/types.h>
#include <stdlib.h>
#include <pthread.h>

typedef int (*comparer)(const void *, const void *);

#define THREAD_THRESHOLD 8192

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp);

struct qsort_args {
size_t element_size;
void *arr;
int left;
int right;
comparer cmp;
};

struct qsort_args _qsort_args
(size_t element_size, void *arr, int left, int right, comparer cmp)
{
struct qsort_args qa;
qa.element_size = element_size;
qa.arr = arr;
qa.left = left;
qa.right = right;
qa.cmp = cmp;
return qa;
}

void *_qsort_start (void *args)
{
struct qsort_args *a = (struct qsort_args *)args;
_qsort (a->element_size, a->arr, a->left, a->right, a->cmp);
return NULL;
}

inline size_t average2 (size_t a, size_t b)
{
return ((a ^ b) >> 1) + (a & b);
}

void swap (size_t sz, void *arr, size_t i, size_t j)
{
char *a = (char*)(arr + i*sz);
char *b = (char*)(arr + j*sz);
long tmp;
for (i=0; i<sz-sz%sizeof(long); i+=sizeof(long)) {
tmp = *(long*)&a[i];
*(long*)&a[i] = *(long*)&b[i];
*(long*)&b[i] = tmp;
}
for (; i<sz; i++) {
tmp = a[i];
a[i] = b[i];
b[i] = (char)tmp;
}
}

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp)
{
int i, last;
pthread_t thread;
struct qsort_args qa;

if (left >= right)
return;
swap(element_size, arr, left, average2(left, right));
last = left;
for (i = left + 1; i <= right; i++) {
if (cmp(arr + i*element_size, arr + left*element_size) < 0) {
last++;
swap(element_size, arr, last, i);
}
}
swap (element_size, arr, left, last);
if (right-left > THREAD_THRESHOLD) {
qa = _qsort_args(element_size, arr,left,last-1,cmp);
if (0 == pthread_create(&thread, PTHREAD_CREATE_JOINABLE,
_qsort_start, (void*)&qa)) {
_qsort (element_size, arr, last+1, right, cmp);
pthread_join(thread, NULL);
return;
}
}
_qsort (element_size, arr, left, last-1, cmp);
_qsort (element_size, arr, last+1, right, cmp);
}

void* my_qsort (void *arr, size_t length, size_t element_size, comparer cmp)
{
_qsort (element_size, arr, 0, length-1, cmp);
return arr;
}

And the whole kaboodle can be seen at my_qsort.c.
Post a Comment

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes, much bad code, much bad art.

Have freelanced for Verizon, Google, Mozilla, Warner Bros, Sony Pictures, Yahoo!, Microsoft, Valve Software, TDK Electronics.

Ex-Chrome Developer Relations.