Introduction
QuickSort, like other divide-and-conquer algorithms, is quite amenable to parallelization. In this article, we provide such a version using the CoWork parallelization class of the U++ framework.
Serial Version of the Algorithm
As the starting point, we shall use the standard Sort function template from the U++ Core library:
template <class I, class Less>
void ForwardSort(I begin, I end, const Less& less);

template <class I, class Less>
force_inline
void OrderIter2__(I a, I b, const Less& less)
{
    if(less(*b, *a))
        IterSwap(a, b);
}

dword Random(dword n);

template <class I, class Less>
void Sort(I l, I h, const Less& less)
{
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);
            OrderIter2__(l, middle, less);
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);
            IterSwap(l + 1, middle);
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1)))
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) > (max(ii - l, h - iih) >> pass)) {
                if(ii - l < h - iih - 1) {
                    Sort(l, ii, less);
                    l = iih + 1;
                }
                else {
                    Sort(iih + 1, h, less);
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}
This is a fairly standard implementation. For small final partitions (fewer than 8 elements), an optimized selection sort is used.
The pivot is chosen as the median of the first, middle, and last elements, and the algorithm detects degenerate cases, in which it makes up to two more attempts to find a better pivot, using random elements to compute the median.
The algorithm uses "fat" partitioning - central elements equal to the pivot are excluded from both partitions.
Finally, the algorithm uses recursion on the smaller partition and tail execution (looping) on the larger one.
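For readers who want the ideas without the U++-specific helpers, the same design - median-of-three pivot, "fat" (three-way) partitioning, recursion on the smaller partition and a loop on the larger - can be sketched in plain standard C++. QuickSortSketch is an illustrative name invented here, not the U++ implementation:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Simplified standard-C++ sketch of the same ideas (not the U++ code):
// median-of-three pivot, fat partitioning that excludes elements equal
// to the pivot, recursion on the smaller side and a loop on the larger.
template <class T>
void QuickSortSketch(T* l, T* h)
{
    while(h - l >= 2) {
        T* m = l + (h - l) / 2;
        // Median-of-three: order *l, *m, *(h - 1); the median ends up at *m.
        if(*m < *l) std::swap(*l, *m);
        if(*(h - 1) < *m) std::swap(*m, *(h - 1));
        if(*m < *l) std::swap(*l, *m);
        T pivot = *m;
        // Fat (three-way) partition: [l, lt) < pivot, [lt, gt) == pivot, [gt, h) > pivot
        T *lt = l, *gt = h, *i = l;
        while(i < gt) {
            if(*i < pivot) std::swap(*lt++, *i++);
            else if(pivot < *i) std::swap(*i, *--gt);
            else ++i;
        }
        // Recurse on the smaller partition, loop (tail execution) on the larger.
        if(lt - l < h - gt) {
            QuickSortSketch(l, lt);
            l = gt;
        }
        else {
            QuickSortSketch(gt, h);
            h = lt;
        }
    }
}
```

Because the equal-to-pivot band [lt, gt) is never empty, both subranges are strictly smaller than the input, which guarantees termination.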
Now, the feature of QuickSort that makes it particularly easy to parallelize is the fact that once partitioned, there is no data sharing between partitions. Practically, all we need to do is replace the recursion with parallel execution. The only problem here is how to reasonably manage threads. Thankfully, U++ provides a useful parallelization class, CoWork.
Meet the CoWork
CoWork can be thought of as an interface to the global thread pool. It has two fundamental operations:
Do - Schedules some code for execution. Note that Do can also be invoked using operator&.
Finish - Waits for all the code scheduled by Do to finish. Finish is also invoked from the CoWork destructor, so it is usually omitted from the code.
With respect to memory visibility, all changes made before Do are visible when the scheduled code runs, and all changes made in the scheduled code are visible after Finish.
How the code is executed by CoWork is an implementation detail; in any case, you can expect CoWork to try as hard as possible to use the global thread pool to run the scheduled code in parallel.
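The Do/Finish contract described above can be illustrated with a toy stand-in built on plain std::thread. This is not U++ code: the real CoWork dispatches jobs to a shared global thread pool, while ToyCoWork (a name invented here) simply starts one thread per job, which is far less efficient but has the same observable semantics:

```cpp
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in mimicking the CoWork interface with one std::thread per job.
// The real U++ CoWork instead reuses worker threads from a global pool.
class ToyCoWork {
    std::vector<std::thread> jobs;
public:
    // Schedule some code to be executed.
    void Do(std::function<void()> fn) { jobs.emplace_back(std::move(fn)); }
    // Alternative spelling of Do, as in CoWork.
    ToyCoWork& operator&(std::function<void()> fn) { Do(std::move(fn)); return *this; }
    // Wait for all scheduled jobs; thread join provides the memory
    // visibility guarantee described in the text.
    void Finish() { for(auto& t : jobs) if(t.joinable()) t.join(); jobs.clear(); }
    ~ToyCoWork() { Finish(); }  // Finish is implied by the destructor
};
```

Typical usage then looks like the real thing: schedule jobs with `cw & [...] { ... };` inside a scope and rely on the destructor to wait for them.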
Parallel QuickSort
Now that we have the algorithm and the tool to make it parallel, let us combine them:
template <class I, class Less>
void CoSort(CoWork& cw, I l, I h, const Less& less)
{
    const int PARALLEL_THRESHOLD = 80;
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);
            OrderIter2__(l, middle, less);
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);
            IterSwap(l + 1, middle);
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1)))
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) > (max(ii - l, h - iih) >> pass)) {
                if(ii - l < h - iih - 1) {
                    if(ii - l < PARALLEL_THRESHOLD)
                        Sort(l, ii, less);
                    else
                        cw & [=, &cw] { CoSort(cw, l, ii, less); };
                    l = iih + 1;
                }
                else {
                    if(h - iih - 1 < PARALLEL_THRESHOLD)
                        Sort(iih + 1, h, less);
                    else
                        cw & [=, &cw] { CoSort(cw, iih + 1, h, less); };
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

template <class I, class Less>
void CoSort(I l, I h, const Less& less)
{
    CoWork cw;
    CoSort(cw, l, h, less);
}
Just as promised, about the only change to the QuickSort code is the replacement of recursion with CoWork::Do scheduling (in operator& form).
The PARALLEL_THRESHOLD constant prevents parallel execution for small partitions: scheduling them in parallel has diminishing returns, and at some point false cache-line sharing and CoWork management costs outweigh the benefit of using more CPU cores. The value 80 used here is a combination of gut feeling and some experimentation and is not really critical; experiments with sorting Strings revealed that performance generally gets worse when this value is above 200 or below 20.
Benchmarks
We have benchmarked CoSort against the original serial version, sorting an Upp::Vector of N randomly generated short Upp::Strings. For reference, we have also benchmarked the standard library std::sort on a std::vector<std::string> with the same data set.
Benchmarks were performed on three platforms, using svn revision 9390 of the U++ framework. The benchmarking package was marked for 'fast' optimization.
Intel® Pentium(R) CPU G620 @ 2.60GHz (2 cores, 2 threads), 4GB RAM
Ubuntu Linux 64-bit
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
N | Sort | CoSort | Sort / CoSort | std::sort |
10^3 | 63.22 us | 53.12 us | 1.19x | 241.82 us |
10^4 | 968.93 us | 603.93 us | 1.60x | 3180 us |
10^5 | 11.50 ms | 6.58 ms | 1.75x | 38.32 ms |
10^6 | 137.20 ms | 77.00 ms | 1.78x | 427.70 ms |
10^7 | 1.63 s | 0.89 s | 1.83x | 4.85 s |
Intel® Core™ i7-4771 @ 3.50GHz (4 cores, 8 threads), 16GB RAM
Windows 10 64-bit
Visual C++ 2014 64-bit mode
N | Sort | CoSort | Sort / CoSort | std::sort |
10^3 | 47.87 us | 44.27 us | 1.08x | 193.37 us |
10^4 | 642.97 us | 301.97 us | 2.13x | 2490 us |
10^5 | 7.96 ms | 2.62 ms | 3.04x | 31.32 ms |
10^6 | 95.90 ms | 28.50 ms | 3.36x | 381.60 ms |
10^7 | 1.16 s | 0.30 s | 3.87x | 4.20 s |
10^8 | 13.63 s | 3.69 s | 3.69x | 49.08 s |
Broadcom BCM2836 ARM Cortex-A7 900 MHz, 4 cores, 4 threads, 1GB SDRAM (Raspberry Pi 2)
Ubuntu Linux
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010, 32-bit
N | Sort | CoSort | Sort / CoSort | std::sort |
10^3 | 305.87 us | 291.77 us | 1.04x | 1470 us |
10^4 | 4.58 ms | 2.33 ms | 1.96x | 20.00 ms |
10^5 | 69.05 ms | 33.64 ms | 2.05x | 250.48 ms |
10^6 | 1.02 s | 0.49 s | 2.08x | 3.13 s |
(The Pentium G620 system does not have enough memory to run the 10^8 test; the ARM system does not have enough memory for tests beyond 10^6.)
Conclusion
Benchmark results indicate that for large enough datasets, the performance ratio between the serial and parallel versions of the algorithm approaches the number of physical CPU cores on Intel CPUs. On the other hand, parallelization does not seem to make much sense for small datasets, where factors like false cache-line sharing and job/thread management costs in CoWork prevent any noticeable speedup and ultimately lead to a slowdown. In a production version of CoSort, it would probably be advisable to revert to purely serial code for fewer than 500 elements.
The ARM results are somewhat disappointing; the plausible explanation is a memory bandwidth bottleneck - unlike the Intel i7 system, which has about 20GB/s of bandwidth, the RAM bandwidth of the Raspberry Pi is only about 200MB/s.
History
- January 18th, 2016: ARM CPU benchmark results added
Mirek Fidler has been a C/C++ programmer for more than 20 years. He is a coauthor of the U++ framework.