art with code

2009-01-09

Multi-threaded qsort in C

Edit: Added memory access pattern analysis.
Edit #2: And a note on _qsort_1 and _qsort_2 having horrible performance. Plus swap accesses in the graphs.
Edit #3: The slowness in sorting chars comes from [x, x, x, ..., x] arrays triggering worst case behaviour (reverse sorted triggers it as well.) And as the char alphabet is only 8 bits, you get [x,x,x,...,x] spans after a couple rounds of partitioning, triggering the worst case. E.g. with 1M array, you'll get 256 spans of 2^20 / 2^8 = 2^14 = 16k length, which means that it'll be doing a total of 2^8 * 2^14^2 = 2^8 * 2^28 = 2^36 = ~65 billion swaps and compares in total. Which'll take a while.

Reading through K&R, wrote a libc-style qsort based on the simple version in K&R. Then wrote special cases for hardware element sizes (8/4/2/1 bytes.) At that point it took 2.25s to run on a 10 million element array of random int64s, versus 2.75s for the glibc version. Adding in simple threading brought the time down to 1.4s on my dualcore.

The generic version of the K&R qsort took around 2.5s for the single-threaded case (again, vs. glibc's 2.75s.) The above numbers are with -O2 and -O3. With optimizations turned off, the glibc qsort takes 7.2 seconds[1], while the K&R qsort takes 6.8 seconds (of which 1.5 sec can be regained by manually doing the -O3 loop invariant optimizations and inlining the swaps.) Which is interesting. What does the glibc qsort do wrong?

To investigate, I instrumented the swap and comparison functions to generate graphs of the memory access patterns. Here's the graph for the glibc qsort and the graph for the K&R qsort (with trailing insertion sorts.) A dot to the left of the graph is an access to the start of the array, while a dot to the right is an access to the end of the array.

In the graph for the glibc qsort you can see its "collapsing walls"-strategy in the tornado-like structures. Another feature of interest is at the very end: it runs insertion sort on each qsorted 4-element span. The tornadoes might be what's causing the performance gap between the glibc qsort and the K&R qsort, as they do streaming access both from the start and the end.

Timings for a small memory access pattern benchmark:

$ time ./mem_bm forward

real 0m5.138s
user 0m5.024s
sys 0m0.064s

$ time ./mem_bm reverse

real 0m5.004s
user 0m4.868s
sys 0m0.064s

$ time ./mem_bm both

real 0m6.197s
user 0m5.932s
sys 0m0.080s


Showing the tornado access pattern to be 20% slower than streaming access. Do take that with a hefty dose of salt though.

Also, the _qsort_2 and _qsort_1 versions are very slow compared to the glibc qsort. _qsort_1 specifically does something very wrong.

[1] Note that you have to use the _quicksort from stdlib/qsort.c, using the stdlib.h qsort links in the compiled library and -O doesn't have much effect on that (it does give you a small performance improvement though, enough to fool you into thinking that it does something to the qsort.)

Here's the generic version of the K&R qsort:

#include <stdio.h>
#include <string.h>
#include <alloca.h>
#include <sys/types.h>
#include <stdlib.h>
#include <pthread.h>

typedef int (*comparer)(const void *, const void *);

#define THREAD_THRESHOLD 8192

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp);

struct qsort_args {
size_t element_size;
void *arr;
int left;
int right;
comparer cmp;
};

struct qsort_args _qsort_args
(size_t element_size, void *arr, int left, int right, comparer cmp)
{
struct qsort_args qa;
qa.element_size = element_size;
qa.arr = arr;
qa.left = left;
qa.right = right;
qa.cmp = cmp;
return qa;
}

void *_qsort_start (void *args)
{
struct qsort_args *a = (struct qsort_args *)args;
_qsort (a->element_size, a->arr, a->left, a->right, a->cmp);
return NULL;
}

inline size_t average2 (size_t a, size_t b)
{
return ((a ^ b) >> 1) + (a & b);
}

void swap (size_t sz, void *arr, size_t i, size_t j)
{
char *a = (char*)(arr + i*sz);
char *b = (char*)(arr + j*sz);
long tmp;
for (i=0; i<sz-sz%sizeof(long); i+=sizeof(long)) {
tmp = *(long*)&a[i];
*(long*)&a[i] = *(long*)&b[i];
*(long*)&b[i] = tmp;
}
for (; i<sz; i++) {
tmp = a[i];
a[i] = b[i];
b[i] = (char)tmp;
}
}

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp)
{
int i, last;
pthread_t thread;
struct qsort_args qa;

if (left >= right)
return;
swap(element_size, arr, left, average2(left, right));
last = left;
for (i = left + 1; i <= right; i++) {
if (cmp(arr + i*element_size, arr + left*element_size) < 0) {
last++;
swap(element_size, arr, last, i);
}
}
swap (element_size, arr, left, last);
if (right-left > THREAD_THRESHOLD) {
qa = _qsort_args(element_size, arr,left,last-1,cmp);
if (0 == pthread_create(&thread, PTHREAD_CREATE_JOINABLE,
_qsort_start, (void*)&qa)) {
_qsort (element_size, arr, last+1, right, cmp);
pthread_join(thread, NULL);
return;
}
}
_qsort (element_size, arr, left, last-1, cmp);
_qsort (element_size, arr, last+1, right, cmp);
}

void* my_qsort (void *arr, size_t length, size_t element_size, comparer cmp)
{
_qsort (element_size, arr, 0, length-1, cmp);
return arr;
}

And the whole kaboodle can be seen at my_qsort.c.
Post a Comment

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes