art with code

2009-01-09

Multi-threaded qsort in C

Edit: Added memory access pattern analysis.
Edit #2: And a note on _qsort_1 and _qsort_2 having horrible performance. Plus swap accesses in the graphs.
Edit #3: The slowness in sorting chars comes from [x, x, x, ..., x] arrays triggering worst-case behaviour (reverse-sorted input triggers it as well.) And as the char alphabet is only 8 bits, you get [x,x,x,...,x] spans after a couple rounds of partitioning, triggering the worst case. E.g. with a 1M array, you'll get 256 spans of 2^20 / 2^8 = 2^12 = 4k length, which means that it'll be doing a total of around 2^8 * (2^12)^2 = 2^8 * 2^24 = 2^32 = ~4 billion swaps and compares. Which'll take a while.
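For reference, here's a rough sketch of how to reproduce that worst case. It's my own illustration, not code from my_qsort.c: it assumes the my_qsort interface shown later in the post and needs to be compiled together with that code (and -pthread).

#include <stdlib.h>

typedef int (*comparer)(const void *, const void *);
void *my_qsort (void *arr, size_t length, size_t element_size, comparer cmp);

static int cmp_char (const void *a, const void *b)
{
    return *(const char*)a - *(const char*)b;
}

int main (void)
{
    size_t len = 1 << 20;               /* 1M chars */
    char *arr = malloc(len);
    size_t i;
    for (i = 0; i < len; i++)
        arr[i] = (char)(rand() & 0xff); /* only 256 distinct values */
    my_qsort(arr, len, 1, cmp_char);    /* after a couple of partition rounds every span is one
                                           repeated value, and the partition degenerates to O(n^2) */
    free(arr);
    return 0;
}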

Reading through K&R, I wrote a libc-style qsort based on the simple version in the book. Then I wrote special cases for hardware element sizes (8/4/2/1 bytes.) At that point it took 2.25s to run on a 10-million-element array of random int64s, versus 2.75s for the glibc version. Adding in simple threading brought the time down to 1.4s on my dual-core.
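The specialized versions aren't shown in this post, but the idea is just that with a known element size the generic byte-copying swap collapses into direct loads and stores. A sketch of what the 8-byte case might look like (illustrative, not the actual my_qsort.c code; the 4/2/1-byte cases are the same with int32_t, int16_t and char):

#include <stdint.h>
#include <stddef.h>

/* Sketch of an 8-byte specialization: with a fixed element size the
   generic byte-loop swap becomes a single 64-bit load/store pair. */
static void swap8 (int64_t *arr, size_t i, size_t j)
{
    int64_t tmp = arr[i];
    arr[i] = arr[j];
    arr[j] = tmp;
}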

The generic version of the K&R qsort took around 2.5s for the single-threaded case (again, vs. glibc's 2.75s.) The above numbers are with -O2 and -O3. With optimizations turned off, the glibc qsort takes 7.2 seconds[1], while the K&R qsort takes 6.8 seconds (of which 1.5 sec can be regained by manually doing the -O3 loop invariant optimizations and inlining the swaps.) Which is interesting. What does the glibc qsort do wrong?

To investigate, I instrumented the swap and comparison functions to generate graphs of the memory access patterns. Here's the graph for the glibc qsort and the graph for the K&R qsort (with trailing insertion sorts.) A dot to the left of the graph is an access to the start of the array, while a dot to the right is an access to the end of the array.
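The instrumentation can be as simple as a wrapper comparator that logs the element indices of every comparison (the same trick works for the swap function.) A rough sketch, with illustrative names:

#include <stdio.h>

/* Illustrative globals, set up before the sort starts. */
static const char *log_base;        /* start of the array being sorted */
static size_t log_element_size;     /* element size passed to the sort */
static FILE *log_file;              /* one "i j" line per comparison   */
static int (*log_real_cmp)(const void *, const void *);

/* Wrapper comparator: log which element indices are touched, then defer
   to the real comparator. */
static int logging_cmp (const void *a, const void *b)
{
    fprintf(log_file, "%zu %zu\n",
            (size_t)((const char*)a - log_base) / log_element_size,
            (size_t)((const char*)b - log_base) / log_element_size);
    return log_real_cmp(a, b);
}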

In the graph for the glibc qsort you can see its "collapsing walls"-strategy in the tornado-like structures. Another feature of interest is at the very end: it runs insertion sort on each qsorted 4-element span. The tornadoes might be what's causing the performance gap between the glibc qsort and the K&R qsort, as they do streaming access both from the start and the end.

Timings for a small memory access pattern benchmark:

$ time ./mem_bm forward

real 0m5.138s
user 0m5.024s
sys 0m0.064s

$ time ./mem_bm reverse

real 0m5.004s
user 0m4.868s
sys 0m0.064s

$ time ./mem_bm both

real 0m6.197s
user 0m5.932s
sys 0m0.080s


That shows the tornado access pattern to be around 20% slower than streaming access. Do take that with a hefty dose of salt, though.
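The mem_bm source isn't included here, but a benchmark along these lines captures the idea (this is a reconstruction, not the original program): stream over a large array from the front, from the back, or from both ends at once.

#include <stdlib.h>
#include <string.h>

#define N (256L * 1024 * 1024)   /* a large array, well past the caches */

int main (int argc, char **argv)
{
    char *a = malloc(N);
    const char *mode = argc > 1 ? argv[1] : "forward";
    long sum = 0, i, j;
    int round;

    memset(a, 1, N);
    for (round = 0; round < 16; round++) {
        if (!strcmp(mode, "forward")) {          /* stream from the start */
            for (i = 0; i < N; i++) sum += a[i];
        } else if (!strcmp(mode, "reverse")) {   /* stream from the end */
            for (i = N-1; i >= 0; i--) sum += a[i];
        } else {                                 /* "both": walk in from both ends at once,
                                                    like the glibc tornado pattern */
            for (i = 0, j = N-1; i < j; i++, j--) sum += a[i] + a[j];
        }
    }
    free(a);
    return (int)(sum & 1);   /* keep the compiler from optimizing the loops away */
}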

Also, the _qsort_2 and _qsort_1 versions are very slow compared to the glibc qsort. _qsort_1 specifically does something very wrong.

[1] Note that you have to use the _quicksort from stdlib/qsort.c: calling qsort via stdlib.h links in the pre-compiled library version, so -O doesn't have much of an effect on it (it does give you a small performance improvement though, enough to fool you into thinking that it does something to the qsort.)

Here's the generic version of the K&R qsort:

#include <stdio.h>
#include <string.h>
#include <alloca.h>
#include <sys/types.h>
#include <stdlib.h>
#include <pthread.h>

typedef int (*comparer)(const void *, const void *);

#define THREAD_THRESHOLD 8192

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp);

struct qsort_args {
    size_t element_size;
    void *arr;
    int left;
    int right;
    comparer cmp;
};

struct qsort_args _qsort_args
(size_t element_size, void *arr, int left, int right, comparer cmp)
{
    struct qsort_args qa;
    qa.element_size = element_size;
    qa.arr = arr;
    qa.left = left;
    qa.right = right;
    qa.cmp = cmp;
    return qa;
}

/* Thread entry point: unpack the argument struct and run _qsort on it. */
void *_qsort_start (void *args)
{
    struct qsort_args *a = (struct qsort_args *)args;
    _qsort (a->element_size, a->arr, a->left, a->right, a->cmp);
    return NULL;
}

/* Overflow-safe average of two indices, used to pick the middle element as the pivot. */
static inline size_t average2 (size_t a, size_t b)
{
    return ((a ^ b) >> 1) + (a & b);
}

/* Swap the sz-byte elements at indices i and j. Copies long-sized chunks
   first, then finishes the remainder byte by byte. */
void swap (size_t sz, void *arr, size_t i, size_t j)
{
    char *a = (char*)arr + i*sz;
    char *b = (char*)arr + j*sz;
    long tmp;
    for (i=0; i<sz-sz%sizeof(long); i+=sizeof(long)) {
        tmp = *(long*)&a[i];
        *(long*)&a[i] = *(long*)&b[i];
        *(long*)&b[i] = tmp;
    }
    for (; i<sz; i++) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = (char)tmp;
    }
}

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp)
{
    int i, last;
    pthread_t thread;
    struct qsort_args qa;

    if (left >= right)
        return;
    /* Move the middle element to position left to act as the pivot. */
    swap(element_size, arr, left, average2(left, right));
    last = left;
    /* K&R-style partition: everything in (left, last] compares less than the pivot. */
    for (i = left + 1; i <= right; i++) {
        if (cmp((char*)arr + i*element_size, (char*)arr + left*element_size) < 0) {
            last++;
            swap(element_size, arr, last, i);
        }
    }
    swap (element_size, arr, left, last);
    /* For large spans, sort the left half in a new thread while this thread
       takes the right half. NULL attributes: threads are joinable by default. */
    if (right-left > THREAD_THRESHOLD) {
        qa = _qsort_args(element_size, arr, left, last-1, cmp);
        if (0 == pthread_create(&thread, NULL, _qsort_start, (void*)&qa)) {
            _qsort (element_size, arr, last+1, right, cmp);
            pthread_join(thread, NULL);
            return;
        }
    }
    _qsort (element_size, arr, left, last-1, cmp);
    _qsort (element_size, arr, last+1, right, cmp);
}

void* my_qsort (void *arr, size_t length, size_t element_size, comparer cmp)
{
    _qsort (element_size, arr, 0, length-1, cmp);
    return arr;
}

And the whole kaboodle can be seen at my_qsort.c.
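For completeness, here's a minimal driver sketch showing how my_qsort is called. The comparator and the driver are my own illustration (they're not part of my_qsort.c), and they assume the code above is linked in without its own main; compile with something like gcc -O3 -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef int (*comparer)(const void *, const void *);
void *my_qsort (void *arr, size_t length, size_t element_size, comparer cmp);

/* Comparator for int64_t; the two comparisons avoid the overflow that a
   plain subtraction could hit. */
static int cmp_int64 (const void *a, const void *b)
{
    int64_t x = *(const int64_t*)a, y = *(const int64_t*)b;
    return (x > y) - (x < y);
}

int main (void)
{
    size_t len = 10 * 1000 * 1000, i;
    int64_t *arr = malloc(len * sizeof(int64_t));
    for (i = 0; i < len; i++)
        arr[i] = ((int64_t)rand() << 32) ^ rand();
    my_qsort(arr, len, sizeof(int64_t), cmp_int64);
    printf("min %lld max %lld\n", (long long)arr[0], (long long)arr[len-1]);
    free(arr);
    return 0;
}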

3 comments:

Muxecoid said...

I'm new to multi-threading so forgive me if I say something stupid.
The results look interesting.
1.4 sec vs 2.5 means that you have little overhead and almost always have more than one thread running.

I want to ask a few questions.

What's the real overhead of creating a new thread? Switching threads within the same process?

How can you limit the number of threads to the number of cores? Would it improve the performance?


When I first thought about the problem yesterday, my first idea was a number of detachable threads that perform a qsort, then add the two unsorted chunks to a shared pool of chunks that require sorting, and request a new chunk, waiting for one if none are available. The weakness of this approach is the need to implement a sleep/wakeup mechanism for threads waiting for data. I haven't implemented it yet; I'll try it when I have more time.

Ilmari Heikkinen said...

The pthread create+join overhead was something like 9 us the last time I measured it, so around 20000 cycles.
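(Measured with a micro-benchmark roughly along these lines; this is an illustrative sketch, not the exact program used. Compile with gcc -O2 -pthread.)

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static void *noop (void *arg) { return arg; }

int main (void)
{
    const int rounds = 100000;
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < rounds; i++) {
        pthread_t t;
        pthread_create(&t, NULL, noop, NULL);  /* spawn a do-nothing thread */
        pthread_join(t, NULL);                 /* and wait for it */
    }
    gettimeofday(&t1, NULL);
    printf("%.2f us per create+join\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / rounds);
    return 0;
}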

For thread-switching overhead, I did a small counter test with several threads (with CPU affinity set to the first CPU via sched_setaffinity) incrementing a counter for one second. With one thread, the result was ~308 million increments per second. With 100 threads, the same. So I couldn't measure the switching overhead.

The worker-pool approach might have a lower overhead, but it is more fiddly to implement. The overhead there is waiting on the work-pool mutex, plus the top initiating thread waiting for completion.

Anonymous said...

Thanks for this post, just dropping a note about the performance.

I implemented this for a project on a node with 2x Interlagos processors (64 integer "cores" and 32 FPUs), which sorted ~135 million 64-bit integers in 5.5 sec (in place) or ~11 sec (to return permutation indices). This was about 5x the speed of single-threaded qsort() and about 10x that of the single-threaded sort() implementation in IDL.
