There are some sorted regions that have a trivial merge operation: if the largest value in region X is smaller or equal to the smallest value in region Y, the merge operation is the concatenation X ++ Y. We can pre-process the input of the merge sort so that the two merged regions have this property. The pre-processing pass uses a pivot value and moves all values smaller than the pivot to the start of the region, leaving the end of the region for values larger than the pivot. Now you don't have to do anything to merge the regions together once they're sorted. This is the idea behind the quicksort algorithm.
Because pivoting can cause the regions to have different sizes, there is a worst-case scenario where the pivot is the only element in the other region. Because of this scenario, quicksort has O(n^2) worst-case time complexity. If you pick the pivot to be the median of the region with an O(n) median finding algorithm and further split the equal-to-pivot values equally between the subregions, you can get an O(n lg n) quicksort (but the O(n) median finding algorithm has a high overhead, so they say the resulting algorithm is slower than quicksort in the average case).
You can do a O(n) pre-processing pass over the input data to find all already sorted regions in the input data. If the number of found regions is small, you're dealing with nearly sorted data. You can directly merge all sorted regions that are larger than the region size for the region sorting operation. The merging should be done so that the smallest regions are merged together first. Merging regions of unequal size causes the worst-time complexity to increase (merging a 2-region and a X-region takes X+2 time, consider N/4 2-regions merged into a N/2-region: it takes N/4 * (N/2 + (2+N) / 2) time, which is in O(N^2).)
The pre-processing pass can also reverse regions that are sorted in the opposite direction. And it is probably possible to topologically sort the sorted region descriptors by their min-max-values to find regions that can be merged by concatenation. Another merge trick would be to use binary search on one region to find the span that the other region overlaps, then you could get away with
memcpy(dst,smaller,smaller_sz); merge(dst+smaller_sz,overlap,region2,overlap_sz,region2_sz); memcpy(dst+smaller_sz+overlap_sz+region2_sz, larger, larger_sz).
If you keep track of how equally the quicksort pivot pass splits a region, you can avoid the quicksort worst-case by handing pathological splits over to merge sort. That is a bit nasty though, as the resulting regions no longer have the concatenative property and must be merged the slow way. Another alternative is to do something like introsort and switch over to heapsort once recursion depth exceeds lg n.
Once merge sort or quicksort reaches the maximum unsorted region size (say, 8 elements), implementations usually switch over to a low-overhead O(n^2) sorting algorithm like insertion sort.
The partitioning operation of merge sort is a no-op, but the partitioning operation in quicksort takes O(n) time. In the same fashion, the merge operation of quicksort is a no-op, but the merge operation of merge sort takes O(n) time.
You can do the merge sort merge in parallel by picking a pivot and splitting the sorted regions into two regions: one with values smaller or equal to the pivot and the other with values larger or equal to the pivot. The splitting operation consists of finding the index of the pivot in each region using a binary search. You then merge the small values with the small values and the large values with the large values. The merge between the merged smalls and larges is concatenation, so you can just allocate an array with the beginning reserved for the small values and the end reserved for the large values, and run the small and large merges in parallel. Recurse to increase parallelism. The nice thing about this approach is that you get segregated in-place parallelism, each of the sub-partitions can write to its own part of the root target array. The problem it suffers from is the quicksort partitioning problem: the pivot should be the median of the data set. Because the sorted regions are sorted, we can get the medians of each region easily enough, the worst case with one of those should be a 25:75-split when the regions are of equal size.
If your input data is 256 elements long and you split it into 8-element insertion-sorted regions, executing each in parallel (on a 32-core), what's the merge time? The first merge level you want to split into 2-way parallel merges (16 merges -> 32 merges). The second level is split 4-way (8 merges -> 32 merges), then 8-way and 16-way? The split bsearch overhead is 2 log2 mergeSize per level (sub-splits execute in parallel).
To do the 16-way split (assuming we get a perfectly balanced split), we first do a 2 log2 128 = 14 op split, then 2 log2 64 = 12 op split, then 10 op and 8 op, for total split cost of 44 ops. The merge then takes 16 ops to run (8+8-merge), for total merge cost of 60 ops. If we stop the split one level short, it'd take 36 ops to split and 32 ops to merge, 68 ops in total. So I guess you do want to split it all the way down if you have negligible overhead.
The total runtime for the zero-overhead parallel merge tree would be 44 + 30 + 18 + 8 + 4*16 = 164 ops. Total sort time with 64-op 8-elem sorts would be 228 ops. Hey, sub-O(n). Quadruple the input size and the core count and it's 8+18+30+44+60+78+6*16 + 64 = 398 ops. Hey, sub-linear parallel scaling.
There is an another way to do parallel merging as well (cribbed from some lecture slides): for regions A and B, write A[i] to i + B.indexOfLarger(A[i]) and write B[i] to i + A.indexOfLarger(B[i]). N-parallelizable, serial O(N log N) time, parallel O(log N) time. However, it relies on scattered writes, so it doesn't seem very memory-access friendly (ideally each CPU core should be working on its own cache line, otherwise you get cache contention and traffic between the CPUs).
Fun reading: Heapsort, Quicksort, and Entropy.