<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Intel® Software Network Blogs &#187; Alexey Kukanov (Intel)</title>
	<atom:link href="http://softwareblogs.intel.com/author/alexey-kukanov/feed/" rel="self" type="application/rss+xml" />
	<link>http://softwareblogs.intel.com</link>
	<description></description>
	<pubDate>Sun, 20 Jul 2008 16:43:44 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>The user-driven evolution of tbb::pipeline</title>
		<link>http://softwareblogs.intel.com/2008/06/09/the-user-driven-evolution-of-tbbpipeline/</link>
		<comments>http://softwareblogs.intel.com/2008/06/09/the-user-driven-evolution-of-tbbpipeline/#comments</comments>
		<pubDate>Mon, 09 Jun 2008 15:10:09 +0000</pubDate>
		<dc:creator>Alexey Kukanov (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA["open source"]]></category>

		<category><![CDATA[Community]]></category>

		<category><![CDATA[contribution]]></category>

		<category><![CDATA[multithreading]]></category>

		<category><![CDATA[parallel programming]]></category>

		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/06/09/the-user-driven-evolution-of-tbbpipeline/</guid>
		<description><![CDATA[The implementation of the tbb::pipeline algorithm has been significantly reworked for the next TBB version, and the improvements were mostly driven by customers and the community. Read this post to learn about the changes that made it into the product and those that did not, and comment on what other improvements you think should be made.]]></description>
			<content:encoded><![CDATA[<p>Pipelining is a common pattern for getting a job done by several workers, i.e. in parallel. Threading Building Blocks could not leave this pattern out, and we have provided tbb::pipeline since the first version. You can learn the basics of the algorithm and its use in <a href="http://www.threadingbuildingblocks.org/documentation.php">TBB documents</a> and in <a href="http://softwareblogs.intel.com/author/robert-reed/">Robert Reed’s</a> “<a href="http://softwareblogs.intel.com/2007/08/23/overlapping-io-and-processing-in-a-pipeline/">Overlapping I/O and processing in a pipeline</a>” blog series. Here I will tell you what has changed in the pipeline implementation since TBB 2.0.</p>
<p>Opening the TBB 2.0 code and providing it as <a href="http://www.fsf.org/licensing/essays/free-sw.html">free software</a> gave us an excellent source of new ideas for improvements and enhancements in the library, in addition to the feedback from customers. The pipeline implementation also received a few. All the proposals were reviewed by the team, and the following were implemented:</p>
<p><strong>Fixed idle spinning of worker threads in case of blocked input</strong><br />
Idle spinning of threads has always been an area of concern for TBB customers; I can recall a few places where we had to minimize it. The issue in the pipeline went unnoticed for some time, however, because we did not expect that a thread might block inside the input stage. It was clearly an oversight: pipeline users want it to read data that arrives dynamically at run time, e.g. from the network, so waiting for data inside the input filter is entirely possible. Fixing the issue required a significant overhaul of the implementation; however, it was so important that we released the fix as part of TBB 2.0 Update 3. The new implementation also made the next improvement possible.</p>
<p><strong>Pipeline input stage allowed to be parallel</strong><br />
In TBB 2.0 and before, the pipeline input stage was always serialized, even if the corresponding filter was constructed as parallel. After the aforementioned fix, it became possible to let input stages run in parallel, and we implemented this improvement.<br />
Be aware that if you used a “parallel” input stage with TBB 2.0, switching to TBB 2.1 may reveal some hidden pitfalls. To work correctly, a parallel input filter should conform to some informal requirements:<br />
- it should be thread-safe, as it will be called by different threads simultaneously;<br />
- it should be tolerant of latecomers, because it can be called even <strong>after</strong> one of the previous calls returned NULL;<br />
- it should preferably not block, otherwise every thread could block there.<br />
Interestingly, this change has some effect on serial stages. As you might know, serial stages of a TBB pipeline are in fact strictly ordered. In TBB 2.0, the order was defined exactly by the input stage, and all subsequent serial stages processed items (tokens) in this order. But now the input stage can be parallel; in this case, the token order is defined by the first serial stage encountered in the pipeline. This turned out to be easier than thread-safe assignment of token numbers in a parallel input stage; it is also better this way, because the first serial stage then runs on a "first come, first served" basis, which might improve total throughput.</p>
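<p>The first two requirements for a parallel input filter can be illustrated with a minimal sketch in standard C++. It does not use TBB: the class <em>ParallelInput</em> and the helper <em>drain_in_parallel</em> are hypothetical names invented for this illustration, and a shared atomic cursor is just one possible way to make the input thread-safe and tolerant of calls that arrive after the end of input.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the body of a parallel input filter; not TBB code.
class ParallelInput {
    const int* data_;
    std::size_t size_;
    std::atomic<std::size_t> next_;   // shared cursor, safe to bump from any thread
public:
    ParallelInput(const int* data, std::size_t size)
        : data_(data), size_(size), next_(0) {}

    // Returns the next item, or nullptr when the input is exhausted.
    // Latecomers that call after exhaustion simply get nullptr again.
    const int* next_item() {
        std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        return i < size_ ? data_ + i : nullptr;
    }
};

// Drains the input from several threads; returns how many items were handed out.
std::size_t drain_in_parallel(ParallelInput& in, unsigned n_threads) {
    std::atomic<std::size_t> handed_out(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            while (in.next_item() != nullptr)
                handed_out.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return handed_out.load();
}
```

<p>Each item is handed out exactly once regardless of how the threads interleave, and any call made after the end of input keeps returning a null pointer instead of failing.</p>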
<p><strong>Enumeration is used to distinguish filter types</strong><br />
This is an example of community influence on TBB: we were asked on the forum to change the Boolean parameter that specifies the filter type to an enumeration. Even though we only have serial and parallel filters, using an enumeration improves code readability and allows a compile-time check of the argument. So now you can write <em>MyFilter::MyFilter() : tbb::filter(<strong>serial</strong>) {}</em> instead of <em>MyFilter::MyFilter() : tbb::filter(/*is_serial*/ <strong>true</strong>) {}</em>. It really is easier to read, isn’t it? Thanks to the community for pointing the developers to these small but handy improvements.</p>
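<p>To show what the compile-time check amounts to, here is a small sketch with a hypothetical <em>Filter</em> base class. The names are illustrative and are not the actual TBB declarations. In C++ a <em>bool</em> does not convert implicitly to an enumeration type, so passing <em>true</em> by mistake no longer compiles.</p>

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a filter base class; names are illustrative, not TBB's.
class Filter {
public:
    enum Mode { parallel, serial };
    explicit Filter(Mode m) : mode_(m) {}
    virtual ~Filter() {}
    bool is_serial() const { return mode_ == serial; }
private:
    Mode mode_;
};

class MyFilter : public Filter {
public:
    MyFilter() : Filter(serial) {}   // states the intent; Filter(true) would not compile
};

// The compile-time check: no implicit conversion from bool to the enumeration.
static_assert(!std::is_convertible<bool, Filter::Mode>::value,
              "a bool argument is rejected by the compiler");
```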
<p><strong>Automatic removal of filters</strong><br />
This improvement in how the pipeline object should be destroyed was the last of the changes accepted for the next version of TBB. In TBB 2.0, one needs to call clear() for the pipeline prior to filter deletion; it is necessary for safe pipeline destruction, and it is documented. But why can’t the filter destructor remove the filter from its pipeline? Well, now it can. Doing this in a way that is compatible with the previous implementation was somewhat tricky, though. We needed to change the singly linked list of filters into a doubly linked one, and add a pointer to the pipeline. Thus the layout of the filter class changed, but user-defined filters inherit from this class, so how do we keep it backward compatible? Check the code to find the answer. :)</p>
<p>You can try the improved pipeline functionality with all the above features in <a href="http://www.threadingbuildingblocks.org/file.php?fid=77">the recent open-source stable release</a> (at the moment of writing, it was tbb21_20080605oss).</p>
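<p>The idea of a filter that removes itself from its pipeline can be sketched as follows. This is a simplified illustration under my own assumptions, not the TBB implementation: the classes <em>Filter</em> and <em>Pipeline</em> and all their members are hypothetical, and the sketch does not attempt to solve the layout compatibility puzzle mentioned above. It only shows why a doubly linked list and a back pointer to the pipeline are what the destructor needs.</p>

```cpp
#include <cassert>
#include <cstddef>

class Pipeline;

// Hypothetical stand-ins; the real tbb::filter and tbb::pipeline differ in detail.
class Filter {
    friend class Pipeline;
    Filter* prev_ = nullptr;
    Filter* next_ = nullptr;
    Pipeline* owner_ = nullptr;   // back pointer that makes self-removal possible
public:
    virtual ~Filter();            // unlinks this filter from its pipeline, if any
};

class Pipeline {
    friend class Filter;
    Filter* head_ = nullptr;
    Filter* tail_ = nullptr;
    // Unlinking needs both neighbours, hence the doubly linked list.
    void remove(Filter& f) {
        (f.prev_ ? f.prev_->next_ : head_) = f.next_;
        (f.next_ ? f.next_->prev_ : tail_) = f.prev_;
        f.prev_ = f.next_ = nullptr;
        f.owner_ = nullptr;
    }
public:
    ~Pipeline() { clear(); }      // surviving filters must not point to a dead pipeline
    void add_filter(Filter& f) {
        f.owner_ = this;
        f.prev_ = tail_;
        (tail_ ? tail_->next_ : head_) = &f;
        tail_ = &f;
    }
    void clear() { while (head_) remove(*head_); }
    std::size_t size() const {
        std::size_t n = 0;
        for (const Filter* f = head_; f; f = f->next_) ++n;
        return n;
    }
};

inline Filter::~Filter() { if (owner_) owner_->remove(*this); }
```

<p>With this arrangement the filters and the pipeline can be destroyed in either order, and an explicit call to clear() is no longer required for safety.</p>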
<p>There were also a couple of internal proposals that we declined, and there remains a list of things still to be done or under consideration, such as:<br />
- We added general support for cancellation and handling of exceptions to TBB (see <a href="http://softwareblogs.intel.com/author/andrey-marochko">Andrey Marochko’s blog</a> for interesting details); now the pipeline should be improved to work correctly in case of cancellation.<br />
- <a href="http://www.nf.ull.es/software/tpipeline">Type-safe filters</a> were also proposed on the forum, together with an implementation. We like the idea of compile-time type checks and operator&gt;&gt; semantics to build a pipeline, and would be glad to elaborate on it and improve the implementation; but unfortunately the author did not contribute it officially to the project.<br />
- Out-of-order serial filters were considered and postponed because we did not see a really good motivating case. A case was recently proposed on the forum where an unordered serial filter might improve output latency.<br />
- Automatic selection of the maximal number of tokens in flight was proposed, but it’s hard to find heuristics that would work well enough for every use case.<br />
- And some others.<br />
So we are not yet at the end of tbb::pipeline evolution.</p>
<p>And by the way, have you noticed how many times I mentioned the TBB community and the forum? If you have a great idea for TBB, you are very welcome to share it. We listen to you; leaving a comment in this blog is a simple way to make your contribution. :)</p>
]]></content:encoded>
			<wfw:commentRss>http://softwareblogs.intel.com/2008/06/09/the-user-driven-evolution-of-tbbpipeline/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Why a simple test can get parallel slowdown</title>
		<link>http://softwareblogs.intel.com/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/</link>
		<comments>http://softwareblogs.intel.com/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comments</comments>
		<pubDate>Tue, 04 Mar 2008 17:04:06 +0000</pubDate>
		<dc:creator>Alexey Kukanov (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Software Engineering]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[compiler optimizations]]></category>

		<category><![CDATA[multithreading]]></category>

		<category><![CDATA[parallel programming]]></category>

		<category><![CDATA[performance]]></category>

		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/</guid>
		<description><![CDATA[With multicore processors becoming widespread, there have recently been many attempts to develop a simple parallel benchmark that shows the performance benefits of multicore and multithreading. I have observed a few such attempts where the parallel code used the Threading Building Blocks library (TBB). Much to the experimenters’ astonishment, their simple parallel programs sometimes not only showed no reasonable speedup but were even slower than their sequential counterparts! And unfortunately, not every program author could (or wanted to) understand why.]]></description>
			<content:encoded><![CDATA[<p><em> Those who read Russian may </em><a href="http://softwareblogs-rus.intel.com/2008/03/03/43/"><em>follow this link</em></a><em>. </em></p>
<p> With multicore processors becoming widespread, parallel programming is moving into the mainstream. As indirect evidence, there have recently been many attempts to develop a simple parallel benchmark that shows the performance benefits of multicore and multithreading. No doubt this is good, but…</p>
<p> I have observed a few such attempts where the parallel code used the <a href="http://threadingbuildingblocks.org/">Threading Building Blocks</a> library (TBB). Much to the experimenters’ astonishment, their simple parallel programs sometimes not only showed no reasonable speedup but were even slower than their sequential counterparts! And unfortunately, not every program author could (or wanted to) understand why.</p>
<p> Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with this code. And since we were told that TBB developers should blog more, I eventually decided to try it out, and I will start by writing about these cases. On the one hand, they are simple and comprehensible; most developers do not start their parallel experiments with complex algorithms but with something simple, like a loop that sums all array elements. On the other hand, these examples can be interesting as well.</p>
<p> Array summation is a good example to start with :). The question of why so simple a program became slower with TBB arose on <a href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/2471/ShowForum.aspx">the TBB forum</a> (see “<a href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30250043/30249884/ShowThread.aspx">performance of parallel_for</a>” there). Here is the relevant part of the code:</p>
<blockquote>
<pre>#define REAL double
#define MAX 10000000
class SumFoo {
    REAL *a;
    REAL sum;
    ...
    void operator()(const blocked_range&lt;size_t&gt;&amp; r) {
        REAL *arr = a;
        for(size_t i=r.begin(); i!=r.end(); i++) sum+=arr[i];
    }
};
void ParallelSumFoo(REAL a[], size_t n, size_t gs) {
    SumFoo sf(a,n);
    parallel_reduce(blocked_range&lt;size_t&gt;(0,n,gs), sf);
    ...
}
class SerialApplyFoo {
    REAL *a;
    size_t len;
    double sum;
public :
    SerialApplyFoo(REAL *array, size_t length) : a(array), len(length){
        sum = 0.0;
        for(size_t i=0; i&lt;len; i++)
            sum += a[i];
    }
    ...
};
int main() {
    task_scheduler_init init;
    REAL *array = new REAL[MAX];
    ...
    SerialApplyFoo sf(array, MAX);
    ParallelSumFoo(array, MAX, GRAINSIZE);
    ...
}</pre>
</blockquote>
<p>On a system with an Intel® Core™2 Duo processor, <em>instead of the expected 2x speedup, ParallelSumFoo was almost twice as slow as SerialApplyFoo</em>; e.g. on my laptop it ran for 0.08 sec vs. 0.044 sec for the serial version. Let’s see what went wrong there.</p>
<p> First, let’s examine how long the TBBfied code takes to run sequentially. There are two ways to do that; one is to explicitly specify the number of threads when creating the task_scheduler_init object for library initialization, and the other is to set the grain size of parallel_reduce equal to the array size. I used the first way:</p>
<pre>    task_scheduler_init init(1);</pre>
<p>And the result was kind of shocking: 0.21 sec, which is almost 5 times slower than the original serial code! No wonder that it was also slow on two cores; and TBB can’t be the direct cause of such a slowdown.</p>
<p> The root of the issue is in compiler optimization. The serial loop is very simple to optimize; the compiler knows all about it, from the constant length to the fact that sum only needs to be visible after the constructor of SerialApplyFoo completes. So it can generate very efficient machine code for this loop; for example, the Visual* C++ 2005 compiler produced the following:</p>
<blockquote>
<pre>            fldz
            lea      eax, DWORD PTR [edi+010h]
            mov      ecx, 0x2625A0h
main+0x105: fadd     QWORD PTR [eax-16]
            add      eax, 0x20h
            sub      ecx, 0x1h
            fadd     QWORD PTR [eax-40]
            fadd     QWORD PTR [eax-32]
            fadd     QWORD PTR [eax-24]
            jnz      main+0x105</pre>
</blockquote>
<p>The magic 0x2625A0 constant is 2500000, one fourth of the array length. The compiler applied the loop unrolling optimization and replaced four iterations with one, thus decreasing the overhead of the loop. The intermediate sum is accumulated in a register and only stored to memory once at the end (not shown). And there is exactly one memory load (of the needed array element) per iteration of the original loop.</p>
<p> The TBBfied code is not that simple for the compiler. The start and end of the loop are specified in data fields of an object passed by reference; the loop length is unknown; the result is stored in a data field of an active object. The compiler has to be more conservative:</p>
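<p>For readers less used to assembly listings, the effect of unrolling by four can be written out by hand in C++. The functions <em>sum_simple</em> and <em>sum_unrolled4</em> below are hypothetical names for this illustration only; the sketch assumes the length is a multiple of four, as it is for 10000000, and it mirrors the shape of the generated code rather than reproducing it.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward loop: one add and one loop-control check per element.
double sum_simple(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i];
    return sum;
}

// Hand-unrolled by four: four adds per trip, so the loop-control overhead
// is paid n/4 times. Assumes n is a multiple of 4.
double sum_unrolled4(const double* a, std::size_t n) {
    double sum = 0.0;   // local accumulator the compiler can keep in a register
    for (std::size_t trips = n / 4; trips != 0; --trips, a += 4) {
        sum += a[0];
        sum += a[1];
        sum += a[2];
        sum += a[3];
    }
    return sum;
}

// The trip count seen in the listing is exactly one fourth of MAX.
static_assert(0x2625A0 == 10000000 / 4, "0x2625A0 is 2500000");
```

<p>The elements are added in the same order in both versions, so the unrolled loop only saves loop-control instructions and does not change the result.</p>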
<blockquote>
<pre>execute+0x57: fld      QWORD PTR [edi]
              add      ecx, 0x1h
              fadd     QWORD PTR [edx+08h]
              add      edi, 0x8h
              fstp     QWORD PTR [edx+08h]
              cmp      ecx, DWORD PTR [esi+08h]
              jnz      execute+0x57</pre>
</blockquote>
<p>In this code, there are 3 memory loads (!) and 1 store to process just one array element. Besides the element itself, the loop end value is loaded every time, as well as the intermediate sum which is then stored back only to be reloaded at the next iteration.</p>
<p> I posted the above considerations on the forum, but unfortunately I said that the benchmark “favors” the serial case, and by that I probably pointed the author in the wrong direction. He tried to improve the test by reading the array length from the console (which is good because it makes the test more realistic) and replaced the SerialApplyFoo object with the following function:</p>
<blockquote>
<pre>REAL SerialSum(REAL *a_, unsigned long int start, unsigned long int end) {
    REAL sum = 0.0;
    for(unsigned long int i=start; i&lt;end; i++)
        sum += a_[i];
    return sum;
}</pre>
</blockquote>
<p>But this change has next to no effect, because the intermediate sum is now accumulated in a local variable and the compiler keeps it in a register as easily as before, and this is good. <em>Slowing down the serial code shouldn't be the goal; to solve the problem, the parallel code should be improved </em>to make optimization easier for the compiler; namely, local variables should be used in the loop in operator():</p>
<blockquote>
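<pre>    void operator()(const blocked_range&lt;size_t&gt;&amp; r) {
        REAL local_sum = 0;            // accumulate in a local, not in the sum field
        const size_t end = r.end();    // read the loop bound once
        for(size_t i=r.begin(); i!=end; i++)
            local_sum+=a[i];
        sum += local_sum;              // update the field once per subrange
    }</pre>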
</blockquote>
<p> This is the corresponding machine code I got:</p>
<blockquote>
<pre>              fldz
execute+0x58: fadd        qword ptr [edx]
              add         edx,8
              sub         ecx,1
              jne         execute+58h
              fadd        qword ptr [edi+8]
              fstp        qword ptr [edi+8]</pre>
</blockquote>
<p>As in the serial code, there is just one memory load per iteration; the intermediate sum is accumulated in a register and then added to the sum field of SumFoo (see the last two instructions). And it significantly improved the speed: ParallelSumFoo now completes in 0.033 sec if both cores are used, so a speedup of 1.33 over the serial 0.044 sec is achieved. But why not 2x, you might ask. Well, that might be the subject of another post.</p>
<p> <strong>Conclusion:</strong> when developing programs with TBB, you should take into account that using TBB classes and functions may hinder compiler optimizations, which hurts most in simple algorithms with a small amount of work per iteration. Proper use of local variables helps optimization and improves parallel speedup.</p>
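<p>The listings above elide parts of SumFoo, so here is one possible complete shape of such a reduction body, as a sketch only. To keep it compilable without TBB, <em>Range</em> and <em>Split</em> are hypothetical stand-ins for tbb::blocked_range&lt;size_t&gt; and tbb::split, and <em>sum_in_two_halves</em> is a made-up helper that runs serially what a reduction over two subranges would do. The elided parts of the original program may well look different.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for tbb::blocked_range<size_t> and tbb::split,
// so that the sketch compiles without TBB installed.
struct Range {
    std::size_t b, e;
    std::size_t begin() const { return b; }
    std::size_t end() const { return e; }
};
struct Split {};

class SumFoo {
    const double* a;
public:
    double sum;
    explicit SumFoo(const double* array) : a(array), sum(0.0) {}
    // Splitting constructor: a body for another subrange starts from zero.
    SumFoo(SumFoo& other, Split) : a(other.a), sum(0.0) {}

    void operator()(const Range& r) {
        double local_sum = 0.0;            // local accumulator, register-friendly
        const std::size_t end = r.end();   // loop bound read once
        for (std::size_t i = r.begin(); i != end; ++i)
            local_sum += a[i];
        sum += local_sum;                  // single update of the data field
    }
    // Merges the partial result computed by a split-off body.
    void join(const SumFoo& other) { sum += other.sum; }
};

// Mimics, serially, a reduction over two halves of the array.
double sum_in_two_halves(const double* a, std::size_t n) {
    SumFoo left(a);
    SumFoo right(left, Split());
    Range lo = {0, n / 2};
    Range hi = {n / 2, n};
    left(lo);
    right(hi);
    left.join(right);
    return left.sum;
}
```

<p>Note that operator() adds to the sum field instead of overwriting it, because the same body object may be applied to more than one subrange before it is joined.</p>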
]]></content:encoded>
			<wfw:commentRss>http://softwareblogs.intel.com/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
