Click here to Skip to main content
15,881,709 members
Articles / Programming Languages / C++

Parallel QuickSort with U++ CoWork

Rate me:
Please Sign up or sign in to vote.
3.50/5 (7 votes)
10 Jan 2016BSD5 min read 12.2K   111   5  
QuickSort like other divide-and-conquer algorithms is quite amenable for parallelization. We provide such version using CoWork parallelization class of U++ framework.

Introduction

QuickSort like other divide-and-conquer algorithms is quite amenable for parallelization. In this article, we shall provide such version using CoWork parallelization class of U++ framework.

Serial Version of the Algorithm

As the starting point, we shall use the standard Sort function template from U++ Core library:

C++
// highly optimized selection sort for small number of elements
template <class I, class Less>
void ForwardSort(I begin, I end, const Less& less);

template <class I, class Less>
force_inline
void OrderIter2__(I a, I b, const Less& less)
{
    if(less(*b, *a))
        IterSwap(a, b);
}

dword  Random(dword n); // High quality PRNG

template <class I, class Less>
void Sort(I l, I h, const Less& less)
{
    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {                         // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);        // get the middle element
            OrderIter2__(l, middle, less);      // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);      // median is now in middle
            IterSwap(l + 1, middle);            // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)   // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);                // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // Find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) > 
            (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) {       // recurse on smaller partition, tail on larger
                    Sort(l, ii, less);
                    l = iih + 1;
                }
                else {
                    Sort(iih + 1, h, less);
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));     // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

This is a fairly standard implementation. For small final partitions (<8 elements), optimized selection sort is used.

Pivot is decided as median of first, middle, last elements and algorithm contains detection of degenerate cases, in which case it does 2 more attempts to find a better pivot, using random elements to get median.

Algorithm uses "fat" partitioning - central elements equal to pivot are excluded from partitions.

Finally, algorithm uses recursion on smaller partition and tail execution on larger.

Now the feature of QuickSort that makes it particularly easy to parallelize is the fact that once partitioned, there is no data sharing between partitions. Practically, all we need to do is to replace recursion with parallel execution.

The only problem here is how to reasonably manage threads. Thankfully, U++ provides a useful parallelization class CoWork.

Meet the CoWork

CoWork can be thought of as an interface to the global thread pool. It has two fundamental operations:

Do

Schedules some code to be executed. Note that Do can also be invoked using operator &.

Finish

Waits for all the code scheduled by Do to be finished. Finish is also invoked from CoWork destructor, so usually is omitted from the code.

With respect to memory visibility, all changes made before Do are visible when the scheduled code is performed and all changes made in the scheduled code are visible after Finish.

Now, how the code is executed by CoWork is an implementation issue, anyway you can expect CoWork to try as hard as possible to use the global thread pool to run scheduled code in parallel.

Parallel QuickSort

Now we have the algorithm, and we have the tool to make it parallel, let us combine them:

C++
template <class I, class Less>
void CoSort(CoWork& cw, I l, I h, const Less& less)
{
    const int PARALLEL_THRESHOLD = 80;

    for(;;) {
        int count = int(h - l);
        if(count < 2)
            return;
        if(count < 8) {                         // Final optimized SelectSort
            ForwardSort(l, h, less);
            return;
        }
        int pass = 4;
        for(;;) {
            I middle = l + (count >> 1);        // get the middle element
            OrderIter2__(l, middle, less);      // sort l, middle, h-1 to find median of 3
            OrderIter2__(middle, h - 1, less);
            OrderIter2__(l, middle, less);      // median is now in middle
            IterSwap(l + 1, middle);            // move median pivot to l + 1
            I ii = l + 1;
            for(I i = l + 2; i != h - 1; ++i)   // do partitioning; already l <= pivot <= h - 1
                if(less(*i, *(l + 1)))
                    IterSwap(++ii, i);
            IterSwap(ii, l + 1);                // put pivot back in between partitions
            I iih = ii;
            while(iih + 1 != h && !less(*ii, *(iih + 1))) // Find middle range of elements equal to pivot
                ++iih;
            if(pass > 5 || min(ii - l, h - iih) > 
            (max(ii - l, h - iih) >> pass)) { // partition sizes ok or we have done max attempts
                if(ii - l < h - iih - 1) {       // schedule or recurse on smaller partition, tail on larger
                    if(ii - l < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(l, ii, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, l, ii, less); }; // schedule for parallel execution
                    l = iih + 1;
                }
                else {
                    if(h - iih - 1 < PARALLEL_THRESHOLD) // too small to run in parallel?
                        Sort(iih + 1, h, less); // resolve in this thread
                    else
                        cw & [=, &cw] { CoSort(cw, iih + 1, h, less); }; // schedule for parallel execution
                    h = ii;
                }
                break;
            }
            IterSwap(l, l + (int)Random(count));     // try some other random elements for median pivot
            IterSwap(middle, l + (int)Random(count));
            IterSwap(h - 1, l + (int)Random(count));
            pass++;
        }
    }
}

template <class I, class Less>
void CoSort(I l, I h, const Less& less)
{
    CoWork cw;
    CoSort(cw, l, h, less);
    // CoWork destructor waits until all scheduled work is finished
}

Just as promised, about the only change in the QuickSort code is replacement of recursion with CoWork::Do scheduling (in operator & form).

PARALLEL_THRESHOLD constant prevents parallel execution for small partitions - scheduling them in parallel has diminishing returns and at some point, false cacheline sharing and CoWork management costs are higher than benefits of using more CPU cores. The value 80 used here is a combination of gut feeling and some experimentation and is not really critical, experiments with sorting Strings revealed that performance generally is getting worse if this value is >200 or <20.

Benchmarks

We have benchmarked CoSort against the original serial version, sorting Upp::Vector of N randomly generated short Upp::Strings. For reference, we have also benchmarked standard library std::sort of std::vector<std::string> with the same data set.

Benchmarks were performed on two platforms, using svn revision 9390 of U++ framework. Benchmarking package was marked for 'fast' optimization.

Intel® Pentium(R) CPU G620 @ 2.60GHz (2 cores, 2 threads), 4GB RAM
Ubuntu Linux 64-bit
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010

N Sort CoSort Sort / CoSort std::sort
103 63.22 us 53.12 us 1.19x 241.82 us
104 968.93 us 603.93 us 1.60x 3180 us
105 11.50 ms 6.58 ms 1.75x 38.32 ms
106 137.20 ms 77.00 ms 1.78x 427.70 ms
107 1.63 s 0.89 s 1.83x 4.85 s

 

Intel® Pentium(R) Core i7-4771 @ 3.5GHz (4 cores, 8 threads), 16GB RAM
Windows 10 64-bit
Visual C++ 2014 64-bit mode

N Sort CoSort Sort / CoSort std::sort
103 47.87 us 44.27 us 1.08x 193.37 us
104 642.97 us 301.97 us 2.13x 2490 us
105 7.96 ms 2.62 ms 3.04x 31.32 ms
106 95.90 ms 28.50 ms 3.36x 381.60 ms
107 1.16 s 0.30 s 3.87x 4.20 s
108 13.63 s 3.69 s 3.69x 49.08 s


Broadcom BCM2836 ARM Cortex-A7 900 MHz, 4 cores, 4 threads, 1GB SDRAM (Raspberry Pi 2)
Ubuntu Linux
gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010, 32-bit

N Sort CoSort Sort / CoSort std::sort
103 305.87 us 291.77 us 1.04x 1470 us
104 4.58 ms 2.33 ms 1.96x 20.00 ms
105 69.05 ms 33.64 ms 2.05x 250.48 ms
106 1.02 s 0.49 s 2.08x 3.13 s

 

(Pentium G620 system does not have enough memory to run 108 test, ARM system does not have enough memory for more than 106 tests).

Conclusion

Benchmark results indicate that for large enough datasets, the performance ratio between serial and parallel version of algorithm is approaching the number of physical CPU cores on Intel CPUs. On the other hand, parallelization does not seem to make much sense for small datasets where factors like false cache-line sharing and job/thread management costs in CoWork prevent any noticeable speedup and ultimately lead to slowdown for small sets. In production code, it would probably be advisable to revert to purely serial code for less than 500 elements (in production version of CoSort).

ARM results are sort of disappointing, the plausible explanation is memory bandwidth bottleneck - unlike Intel i7 system, which has about 20GB/s bandwidth, RAM bandwidth of Rapsberry PI is only about 200MB/s.

Useful Links

History

  • January 18th, 2016: ARM CPU benchmark results added

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
Czech Republic Czech Republic
Mirek Fidler is C/C++ programmer for more than 20 years. He is a coauthor of U++ framework.

Comments and Discussions

 
-- There are no messages in this forum --