Why a simple test can get parallel slowdown

By Alexey Kukanov (Intel) (2 posts) on March 4, 2008 at 9:04 am

 Those who read Russian may follow this link.

 With multicore processors becoming widespread, parallel programming moves to mainstream. As indirect evidence, recently there are seen many attempts to develop a simple parallel benchmarking test to see performance benefits from multicore and multithreading. And with no doubt this is good, but…

 I have observed a few such attempts where parallel code used the Threading Building Blocks library (TBB). Much to experimenters’ astonishment, not only their simple parallel programs sometimes expose no reasonable speedup but even those can be slower than sequential counterparts! And unfortunately, not every program author could (and wants to) understand why.

 Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, and I will start with writing about these cases. Those on the one hand are simple and comprehensible; most developers do not start their parallel experiments with complex algorithms but with something simple like a loop to sum all array elements. On the other hand, these examples can as well be interesting.

 The array summation is a good example to start :). The question about why so simple a program became slower with TBB had arisen at the TBB forum (see “performance of parallel_for” there). Here is relevant part of the code:

#define REAL double
#define MAX 10000000
class SumFoo {
    REAL *a;
    REAL sum;
    ...
    void operator()(const blocked_range<size_t>& r) {
        REAL *arr = a;
        for(size_t i=r.begin(); i!=r.end(); i++) sum+=arr[i];
    }
};
void ParallelSumFoo(REAL a[], size_t n, size_t gs) {
    SumFoo sf(a,n);
    parallel_reduce(blocked_range<size_t>(0,n,gs), sf);
    ...
}
class SerialApplyFoo {
    REAL *a;
    size_t len;
    double sum;
public :
    SerialApplyFoo(REAL *array, size_t length) : a(array), len(length){
        sum = 0.0;
        for(size_t i=0; i<len; i++)
            sum += a[i];
    }
    …
};
int main() {
    task_scheduler_init init;
    REAL *array = new REAL[MAX];
    ...
    SerialApplyFoo sf(array, MAX);
    ParallelSumFoo(array, MAX, GRAINSIZE);
    ...
}

On a system with Intel® Core® 2 Duo processor, instead of expected 2x speedup, ParallelSumFoo was almost twice slower than SerialApplyFoo, e.g. on my laptop it ran for 0.08 sec vs 0.044 in serial. Let’s see what was bad there.

 First, let’s examine what time the TBBfied code takes to run sequentially. There are two ways for that; one is to explicitly specify the number of threads when creating task_scheduler_init object for library initialization, and the other is to set the grain size of parallel_reduce equal to the array size. I used the first way:

    task_scheduler_init init(1);

And the result was kind of shocking: 0.21 sec, which is almost 5 times slower than the original serial code! Now no wonder that on two cores it was also slow; and TBB can’t be the direct reason of such slowdown.

 The root of the issue is in optimization by compilers. The serial loop is very simple to optimize; the compiler knows all about it, from the constant length to the fact that sum should only be visible after the constructor of SerialApplyFoo completes. So it can generate very efficient machine code for this loop; for example, Visual* C++ 2005 compiler produced the following:

            fldz
            lea      eax, DWORD PTR [edi+010h]
            mov      ecx, 0x2625A0h
main+0x105: fadd     QWORD PTR [eax-16]
            add      eax, 0x20h
            sub      ecx, 0x1h
            fadd     QWORD PTR [eax-40]
            fadd     QWORD PTR [eax-32]
            fadd     QWORD PTR [eax-24]
            jnz      main+0x105

The magic 0x2625A0 constant is 2500000, one fourth of the array length. The compiler applied loop enrolling optimization and replaced four iterations by one, thus decreasing overhead in the loop. The intermediate sum is accumulated in a register and only stored to memory once at the end (not shown). And there is exactly one memory load (of the needed array element) per iteration of the original loop.

 The TBBfied code is not that simple for the compiler. Start and end of the loop are specified in data fields of an object passed by reference; loop length is unknown; the result is stored in a data field of an active object. The compiler has to be more conservative:

execute+0x57: fld      QWORD PTR [edi]
              add      ecx, 0x1h
              fadd     QWORD PTR [edx+08h]
              add      edi, 0x8h
              fstp     QWORD PTR [edx+08h]
              cmp      ecx, DWORD PTR [esi+08h]
              jnz      execute+0x57

In this code, there are 3 memory loads (!) and 1 store to process just one array element. Besides the element itself, the loop end value is loaded every time, as well as the intermediate sum which is then stored back only to be reloaded at the next iteration.

 I wrote the above considerations in the forum, but to bad luck I said that the benchmark “favors” the serial case; by that I probably pointed the author to wrong direction. He tried to improve the test by reading the array length from console (which is good because it makes the test more realistic) and replaced SerialApplyFoo object to the following function:

REAL SerialSum(REAL *a_, unsigned long int start, unsigned long int end) {
    REAL sum = 0.0;
    for(unsigned long int i=start; i<end; i++)
        sum += a_[i];
    return sum;
}

But this change has next to no effect because now the intermediate sum is accumulated in a local variable and compiler uses a register for it with the same ease as before, and this is good. Slowing down the serial code shouldn't be the goal; to solve the problem, the parallel code should be improved to simplify optimizations for the compiler; namely, local variables should be used in the loop in operator():

    void operator()(blocked_range<size_t> &r) {
        REAL local_sum = 0;
        const size_t end = r.end();
        for(size_t i=r.begin(); i!=end; i++)
            local_sum+=a[i];
        sum += local_sum;
    }

 This is the corresponding machine code I got:

              fldz
execute+0x58: fadd        qword ptr [edx]
              add         edx,8
              sub         ecx,1
              jne         execute+58h
              fadd        qword ptr [edi+8]
              fstp        qword ptr [edi+8]

Like in the serial code, there is just one memory load per iteration, the intermediate sum is accumulated in a register and then added to the sum field of SumFoo (see the last two instructions). And it significantly improved the speed: now ParallelForSum completes in 0.033 sec if both cores are used, so the speedup of 1.33 is achieved. But why not 2x, you might ask. Well, it might be a subject of another post.

 Conclusion: when developing programs with TBB, you should take into account that using TBB classes and functions may impact compiler optimizations, which has especially bad impact on simple algorithms with small amount of work per iteration. Proper use of local variables helps optimization and improves parallel speedup.

Categories: Multicore, Software Engineering, Threading Building Blocks

Comments (9) Comments RSS Feed

By Martin Watt on March 4th, 2008 at 10:36 am
Thanks for the posting. I've always been a bit worried about the iterators in TBB parallel for constructs from an optimization perspective, and this appears to confirm it. So even though the blocked range structure is const, the it.end() cannot be optimized out of the loop? Is there some possibility it may be changed elsewhere?

Given this, it seems we should always recode those loops to extract the end condition as a const variable?

By Ray Zed Blog on March 4th, 2008 at 6:13 pm
links from Technoratiunknown: Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, and I will start with writing

By Jérôme Muffat-méridol on March 5th, 2008 at 12:45 am
This is a very useful and interesting post and I'm glad to read you'll be posting more like this.

When I talk about TBB to people, I tend to say that where we are used to optimising the inner loop, TBB offers a very flexible way to optimise the "outer loop(s)". The example you present today is at the boundary, where fine grain optimisation collides with what you can gain by spreading the work...

I think it is quite important to be able to know where that boundary is, to get a feel for the amount of work that should go in a TBB task (too little and you miss on fine grain optimisation, too much and work doesn't spread well...).

So thanks for the post, looking forward to reading more...

By Alexey Kukanov (Intel) on March 5th, 2008 at 3:56 pm
Thanks for the feedback!
Martin, I did some more experiments and it seems that replacing r.end() with a local variable has little impact on performance even in this case with small work inside. Most performance points are due to using local_sum. I will check a few more things and probably write about it next week.
Jérôme, your comment more relates to the question of right grainsize selection. Yes there should be a balance between enough work to justify overhead of task creation etc and enough tasks to feed all the threads. Fortunately the tbb::auto_partitioner works reasonably well in many cases so that you do not need to manually adjust grainsize unless fighting for every performance point. Another point in your comment seems to be on the choice between serial optimization and parallelism; and to me, the answer is that both make sense; as the example showed, after switching to TBB it makes sense to check that there is no significant decrease in serial performance.

By Martin Watt on March 5th, 2008 at 4:58 pm
Yes, I also tried replacing r.end() with a local variable in my existing TBB code and also could not see a difference in performance.

By may_max11 on March 10th, 2008 at 4:58 am
thnx Alexey ur posts helped me to understand the real problem behind the "poor performance of parallel_for" :-) .Trying out some new experiments and TBB is working fine with them.

By Jim on March 26th, 2008 at 2:29 am
This is cool stuff, but I do have a question. You are processing a tight loop many times over using TBB in parallel, but what if you want to process a longer algorithm a few times in parallel. Something like an FFT on block data or the fast hadamard algorithm?

By Alexey Kukanov (Intel) on March 27th, 2008 at 2:47 pm
Jim, I am not sure I fully understand what your question is. Are you asking if you can do such processing with TBB? I think yes. Or, do you wonder if these algorithms suite better as a benchmark? Probably yes; also there should be more computations per memory operation I believe. Or, are you asking something different?

By Arch Robison (Intel) on May 15th, 2008 at 1:33 pm
Replacing r.end() does not help further because local_sum is the only variable written by the loop. Since the compiler can tell that local_sum has no aliases, it can safely hoist the other reads out of the loop.


What do you think?

Name (required)

Email (required; will not be displayed on this page)

Your URL (optional)

Comments (required)