art with code

2009-01-09

Multi-threaded qsort in C

Edit: Added memory access pattern analysis.
Edit #2: And a note on _qsort_1 and _qsort_2 having horrible performance. Plus swap accesses in the graphs.
Edit #3: The slowness in sorting chars comes from [x, x, x, ..., x] arrays triggering worst-case behaviour (reverse-sorted input triggers it as well.) And as the char alphabet is only 8 bits, you get [x,x,x,...,x] spans after a couple rounds of partitioning, triggering the worst case. E.g. with a 1M array, you'll get 256 spans of 2^20 / 2^8 = 2^12 = 4k length, which means that it'll be doing a total of around 2^8 * (2^12)^2 = 2^8 * 2^24 = 2^32 = ~4 billion swaps and compares. Which'll take a while.
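For reference, here's a rough sketch of how to reproduce that worst case. It's my own illustration, not code from my_qsort.c: it assumes the my_qsort interface shown later in the post and needs to be compiled together with that code (and -pthread).

#include <stdlib.h>

typedef int (*comparer)(const void *, const void *);
void *my_qsort (void *arr, size_t length, size_t element_size, comparer cmp);

static int cmp_char (const void *a, const void *b)
{
    return *(const char*)a - *(const char*)b;
}

int main (void)
{
    size_t len = 1 << 20;               /* 1M chars */
    char *arr = malloc(len);
    size_t i;
    for (i = 0; i < len; i++)
        arr[i] = (char)(rand() & 0xff); /* only 256 distinct values */
    my_qsort(arr, len, 1, cmp_char);    /* after a couple of partition rounds every span is one
                                           repeated value, and the partition degenerates to O(n^2) */
    free(arr);
    return 0;
}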

Reading through K&R, I wrote a libc-style qsort based on the simple version in the book. Then I wrote special cases for hardware element sizes (8/4/2/1 bytes.) At that point it took 2.25s to run on a 10-million-element array of random int64s, versus 2.75s for the glibc version. Adding in simple threading brought the time down to 1.4s on my dual-core.
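The specialized versions aren't shown in this post, but the idea is just that with a known element size the generic byte-copying swap collapses into direct loads and stores. A sketch of what the 8-byte case might look like (illustrative, not the actual my_qsort.c code; the 4/2/1-byte cases are the same with int32_t, int16_t and char):

#include <stdint.h>
#include <stddef.h>

/* Sketch of an 8-byte specialization: with a fixed element size the
   generic byte-loop swap becomes a single 64-bit load/store pair. */
static void swap8 (int64_t *arr, size_t i, size_t j)
{
    int64_t tmp = arr[i];
    arr[i] = arr[j];
    arr[j] = tmp;
}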

The generic version of the K&R qsort took around 2.5s for the single-threaded case (again, vs. glibc's 2.75s.) The above numbers are with -O2 and -O3. With optimizations turned off, the glibc qsort takes 7.2 seconds[1], while the K&R qsort takes 6.8 seconds (of which 1.5 sec can be regained by manually doing the -O3 loop invariant optimizations and inlining the swaps.) Which is interesting. What does the glibc qsort do wrong?

To investigate, I instrumented the swap and comparison functions to generate graphs of the memory access patterns. Here's the graph for the glibc qsort and the graph for the K&R qsort (with trailing insertion sorts.) A dot to the left of the graph is an access to the start of the array, while a dot to the right is an access to the end of the array.
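The instrumentation can be as simple as a wrapper comparator that logs the element indices of every comparison (the same trick works for the swap function.) A rough sketch, with illustrative names:

#include <stdio.h>

/* Illustrative globals, set up before the sort starts. */
static const char *log_base;        /* start of the array being sorted */
static size_t log_element_size;     /* element size passed to the sort */
static FILE *log_file;              /* one "i j" line per comparison   */
static int (*log_real_cmp)(const void *, const void *);

/* Wrapper comparator: log which element indices are touched, then defer
   to the real comparator. */
static int logging_cmp (const void *a, const void *b)
{
    fprintf(log_file, "%zu %zu\n",
            (size_t)((const char*)a - log_base) / log_element_size,
            (size_t)((const char*)b - log_base) / log_element_size);
    return log_real_cmp(a, b);
}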

In the graph for the glibc qsort you can see its "collapsing walls"-strategy in the tornado-like structures. Another feature of interest is at the very end: it runs insertion sort on each qsorted 4-element span. The tornadoes might be what's causing the performance gap between the glibc qsort and the K&R qsort, as they do streaming access both from the start and the end.

Timings for a small memory access pattern benchmark:

$ time ./mem_bm forward

real 0m5.138s
user 0m5.024s
sys 0m0.064s

$ time ./mem_bm reverse

real 0m5.004s
user 0m4.868s
sys 0m0.064s

$ time ./mem_bm both

real 0m6.197s
user 0m5.932s
sys 0m0.080s


That shows the tornado access pattern to be around 20% slower than streaming access. Do take that with a hefty dose of salt, though.
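The mem_bm source isn't included here, but a benchmark along these lines captures the idea (this is a reconstruction, not the original program): stream over a large array from the front, from the back, or from both ends at once.

#include <stdlib.h>
#include <string.h>

#define N (256L * 1024 * 1024)   /* a large array, well past the caches */

int main (int argc, char **argv)
{
    char *a = malloc(N);
    const char *mode = argc > 1 ? argv[1] : "forward";
    long sum = 0, i, j;
    int round;

    memset(a, 1, N);
    for (round = 0; round < 16; round++) {
        if (!strcmp(mode, "forward")) {          /* stream from the start */
            for (i = 0; i < N; i++) sum += a[i];
        } else if (!strcmp(mode, "reverse")) {   /* stream from the end */
            for (i = N-1; i >= 0; i--) sum += a[i];
        } else {                                 /* "both": walk in from both ends at once,
                                                    like the glibc tornado pattern */
            for (i = 0, j = N-1; i < j; i++, j--) sum += a[i] + a[j];
        }
    }
    free(a);
    return (int)(sum & 1);   /* keep the compiler from optimizing the loops away */
}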

Also, the _qsort_2 and _qsort_1 versions are very slow compared to the glibc qsort. _qsort_1 specifically does something very wrong.

[1] Note that you have to use the _quicksort from stdlib/qsort.c: calling qsort via stdlib.h links in the pre-compiled library version, so -O doesn't have much of an effect on it (it does give you a small performance improvement though, enough to fool you into thinking that it does something to the qsort.)

Here's the generic version of the K&R qsort:

#include <stdio.h>
#include <string.h>
#include <alloca.h>
#include <sys/types.h>
#include <stdlib.h>
#include <pthread.h>

typedef int (*comparer)(const void *, const void *);

#define THREAD_THRESHOLD 8192

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp);

struct qsort_args {
    size_t element_size;
    void *arr;
    int left;
    int right;
    comparer cmp;
};

struct qsort_args _qsort_args
(size_t element_size, void *arr, int left, int right, comparer cmp)
{
    struct qsort_args qa;
    qa.element_size = element_size;
    qa.arr = arr;
    qa.left = left;
    qa.right = right;
    qa.cmp = cmp;
    return qa;
}

/* Thread entry point: unpack the argument struct and run _qsort on it. */
void *_qsort_start (void *args)
{
    struct qsort_args *a = (struct qsort_args *)args;
    _qsort (a->element_size, a->arr, a->left, a->right, a->cmp);
    return NULL;
}

/* Overflow-safe average of two indices, used to pick the middle element as the pivot. */
static inline size_t average2 (size_t a, size_t b)
{
    return ((a ^ b) >> 1) + (a & b);
}

/* Swap the sz-byte elements at indices i and j. Copies long-sized chunks
   first, then finishes the remainder byte by byte. */
void swap (size_t sz, void *arr, size_t i, size_t j)
{
    char *a = (char*)arr + i*sz;
    char *b = (char*)arr + j*sz;
    long tmp;
    for (i=0; i<sz-sz%sizeof(long); i+=sizeof(long)) {
        tmp = *(long*)&a[i];
        *(long*)&a[i] = *(long*)&b[i];
        *(long*)&b[i] = tmp;
    }
    for (; i<sz; i++) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = (char)tmp;
    }
}

void _qsort (size_t element_size, void *arr, int left, int right, comparer cmp)
{
    int i, last;
    pthread_t thread;
    struct qsort_args qa;

    if (left >= right)
        return;
    /* Move the middle element to position left to act as the pivot. */
    swap(element_size, arr, left, average2(left, right));
    last = left;
    /* K&R-style partition: everything in (left, last] compares less than the pivot. */
    for (i = left + 1; i <= right; i++) {
        if (cmp((char*)arr + i*element_size, (char*)arr + left*element_size) < 0) {
            last++;
            swap(element_size, arr, last, i);
        }
    }
    swap (element_size, arr, left, last);
    /* For large spans, sort the left half in a new thread while this thread
       takes the right half. NULL attributes: threads are joinable by default. */
    if (right-left > THREAD_THRESHOLD) {
        qa = _qsort_args(element_size, arr, left, last-1, cmp);
        if (0 == pthread_create(&thread, NULL, _qsort_start, (void*)&qa)) {
            _qsort (element_size, arr, last+1, right, cmp);
            pthread_join(thread, NULL);
            return;
        }
    }
    _qsort (element_size, arr, left, last-1, cmp);
    _qsort (element_size, arr, last+1, right, cmp);
}

void* my_qsort (void *arr, size_t length, size_t element_size, comparer cmp)
{
    _qsort (element_size, arr, 0, length-1, cmp);
    return arr;
}

And the whole kaboodle can be seen at my_qsort.c.
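For completeness, here's a minimal driver sketch showing how my_qsort is called. The comparator and the driver are my own illustration (they're not part of my_qsort.c), and they assume the code above is linked in without its own main; compile with something like gcc -O3 -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef int (*comparer)(const void *, const void *);
void *my_qsort (void *arr, size_t length, size_t element_size, comparer cmp);

/* Comparator for int64_t; the two comparisons avoid the overflow that a
   plain subtraction could hit. */
static int cmp_int64 (const void *a, const void *b)
{
    int64_t x = *(const int64_t*)a, y = *(const int64_t*)b;
    return (x > y) - (x < y);
}

int main (void)
{
    size_t len = 10 * 1000 * 1000, i;
    int64_t *arr = malloc(len * sizeof(int64_t));
    for (i = 0; i < len; i++)
        arr[i] = ((int64_t)rand() << 32) ^ rand();
    my_qsort(arr, len, sizeof(int64_t), cmp_int64);
    printf("min %lld max %lld\n", (long long)arr[0], (long long)arr[len-1]);
    free(arr);
    return 0;
}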

3 comments:

Muxecoid said...

I'm new to multi-threading so forgive me if I say something stupid.
The results look interesting.
1.4 sec vs 2.5 means that you have little overhead and almost always have more than one thread running.

I want to ask a few questions.

What's the real overhead of creating a new thread? Switching threads within the same process?

How can you limit the number of threads to the number of cores? Would it improve the performance?


When I first thought about the problem yesterday, my first idea was a number of detachable threads that perform a qsort, then add the two unsorted chunks to a shared pool of chunks that require sorting, and request a new chunk, waiting for one if none are available. The weakness of this approach is the need to implement a sleep/wakeup mechanism for threads waiting for data. I haven't implemented it yet; I'll try it when I have more time.

Ilmari Heikkinen said...

The pthread create+join overhead was something like 9 us the last time I measured it, so around 20000 cycles.
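(Measured with a micro-benchmark roughly along these lines; this is an illustrative sketch, not the exact program used. Compile with gcc -O2 -pthread.)

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static void *noop (void *arg) { return arg; }

int main (void)
{
    const int rounds = 100000;
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < rounds; i++) {
        pthread_t t;
        pthread_create(&t, NULL, noop, NULL);  /* spawn a do-nothing thread */
        pthread_join(t, NULL);                 /* and wait for it */
    }
    gettimeofday(&t1, NULL);
    printf("%.2f us per create+join\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / rounds);
    return 0;
}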

For thread-switching overhead, I did a small counter test with several threads (with CPU affinity set to the first CPU via sched_setaffinity) incrementing a counter for one second. With one thread, the result was ~308 million increments per second. With 100 threads, the same. So I couldn't measure the switching overhead.

The worker-pool approach might have a lower overhead, but it is more fiddly to implement. The overhead there is waiting on the work-pool mutex, plus the top initiating thread waiting for completion.

Anonymous said...

Thanks for this post, just dropping a note about the performance.

I implemented this for a project on a node with 2x Interlagos processors (64 integer "cores" and 32 FPUs), which sorted ~135 million 64-bit integers in 5.5 sec (in place) or ~11 sec (to return permutation indices). This was about 5x the speed of single-threaded qsort() and about 10x that of the single-threaded sort() implementation in IDL.
