C Loop Optimization: Duff’s Device
Yesterday I stumbled across an interesting article on a code snippet known as "Duff’s Device". It’s a clever little C Macro to partially unroll a loop, offering pretty significant performance improvements in certain scenarios. The code is relatively simple, consisting of the single define shown here:
| #define DUFF_DEVICE_8(aCount, aAction) \ { \ int count_ = (aCount); \ int times_ = (count_ + 7) >> 3; \ switch (count_ & 7){ \ case 0: do { aAction; \ case 7: aAction; \ case 6: aAction; \ case 5: aAction; \ case 4: aAction; \ case 3: aAction; \ case 2: aAction; \ case 1: aAction; \ } while (–times_ > 0); \ } \ } |
The code itself is a little strange, placing a Do/While loop inside a Switch is something that most programmers have never seen before (Myself included). Upon closer inspection tho, there are several little optimizations in here aside from the main one:
- The loop iterator –times_ is a pre-decrement. This is a minor detail, but pre-decrement (and increment) are faster than post-decrement (and increment). Post requires the compiler to create a duplicate of the variable and copy the updated value back at the end, while a pre-operation can be done in-place immediately.
- Use of Shifts instead of Division at the beginning. This is a huge improvement when working with integers and powers of 2 (>>3 is equivalent to division by 8).
- Comparison against 0. This is a special-case in most hardware that is faster to evaluate than any other comparison.
These three optimizations are often done by the compiler for you, so specifying them won’t usually do much for you. But the entire "Duff’s Device" is somewhat mysterious. How does it work? Click inside for details…
[tag:c][tag:source][tag:optimization][tag:duffdevice]
| void Bitmap::Blank(COLORREF colour) { unsigned long *blitPtr = m_BitmapData; unsigned long *blitEnd = blitPtr + (m_Width*m_Height); unsigned int index=0; int numPixels = static_cast<int>(blitEnd-blitPtr); |
This is a simple function that you would see in alot of image or graphics libraries. It simply takes the image (a large array) and blanks it by setting all the values to a single color. The Call to the DUFF_DEVICE_8 Macro takes the number of iterations in the loop (numPixels) and the code to execute.
First the Duff Device computes times_, which is the requested number of loop iterations divided by 8. The Duff Device will work by executing the statement 8 times for every execution of the while loop, so we need to recompute the number of times to loop. Once that it started, the switch statement takes the remainder of the operation (count_ & 7) and uses the switch statement to jump within the do/while loop. This handles all the cases of loops iterate a number of times not evenly divisible by 8. Since there are no break statements within the switch, all the remaining statements are executed, and the loop continues to process as normal.
It’s a bit strange since most programmers are used to seeing looping/branching statements perfectly nested, but in this case they’re simply overlapping. It’s a clever little snippet of code that I plan to keep on hand for use later. The article even claims that the performance in this snippet beats memcpy and memset for doing similar work. I’m hoping to do some simple benchmarking tests (just like with InvSqrt) to get my own results, and I’ll post them here when I have them.
