<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Why a simple test can get parallel slowdown</title>
	<atom:link href="http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/</link>
	<description></description>
	<pubDate>Mon, 13 Oct 2008 23:20:31 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
		<item>
		<title>By: Arch Robison (Intel)</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-12566</link>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		<pubDate>Thu, 15 May 2008 21:33:21 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-12566</guid>
		<description>Replacing r.end() does not help further because local_sum is the only variable written by the loop.  Since the compiler can tell that local_sum has no aliases, it can safely hoist the other reads out of the loop.</description>
		<content:encoded><![CDATA[<p>Replacing r.end() does not help further because local_sum is the only variable written by the loop.  Since the compiler can tell that local_sum has no aliases, it can safely hoist the other reads out of the loop.</p>
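<p>A minimal sketch of the difference being discussed, assuming a body object that keeps its running total in a member. The names <code>Range</code>, <code>SumBody</code>, <code>accumulate_member</code> and <code>accumulate_local</code> are hypothetical and are not taken from the original post or from TBB. Whether a given compiler actually keeps the local in a register depends on the compiler and its flags.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a TBB-style range: only begin() and end() are modeled.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical body object. `sum` is reached through `this`, and it has the
// same type as the elements of `data`, so a store to it may alias `data`.
struct SumBody {
    const float* data;
    float sum;

    // Accumulates directly into the member on every iteration.
    void accumulate_member(const Range& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            sum += data[i];
    }

    // Accumulates into a local whose address is never taken, then stores once.
    // The local is the only variable written inside the loop.
    void accumulate_local(const Range& r) {
        float local_sum = sum;
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            local_sum += data[i];
        sum = local_sum;
    }
};
```

<p>Both member functions compute the same total; the second form only makes it easier for the compiler to prove that nothing read in the loop is changed by the loop.</p>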
]]></content:encoded>
	</item>
	<item>
		<title>By: Alexey Kukanov (Intel)</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-11313</link>
		<dc:creator>Alexey Kukanov (Intel)</dc:creator>
		<pubDate>Thu, 27 Mar 2008 22:47:31 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-11313</guid>
		<description>Jim, I am not sure I fully understand what your question is. Are you asking if you can do such processing with TBB? I think yes. Or do you wonder if these algorithms would suit better as a benchmark? Probably yes; also, I believe there should be more computation per memory operation. Or are you asking something different?</description>
		<content:encoded><![CDATA[<p>Jim, I am not sure I fully understand what your question is. Are you asking if you can do such processing with TBB? I think yes. Or do you wonder if these algorithms would suit better as a benchmark? Probably yes; also, I believe there should be more computation per memory operation. Or are you asking something different?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jim</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-11276</link>
		<dc:creator>Jim</dc:creator>
		<pubDate>Wed, 26 Mar 2008 10:29:51 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-11276</guid>
		<description>This is cool stuff, but I do have a question.  You are processing a tight loop many times over using TBB in parallel, but what if you want to process a longer algorithm a few times in parallel?  Something like an FFT on block data or the fast Hadamard algorithm?</description>
		<content:encoded><![CDATA[<p>This is cool stuff, but I do have a question.  You are processing a tight loop many times over using TBB in parallel, but what if you want to process a longer algorithm a few times in parallel?  Something like an FFT on block data or the fast Hadamard algorithm?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: may_max11</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10958</link>
		<dc:creator>may_max11</dc:creator>
		<pubDate>Mon, 10 Mar 2008 12:58:02 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10958</guid>
		<description>Thanks Alexey, your posts helped me to understand the real problem behind the "poor performance of parallel_for" :-). I am trying out some new experiments and TBB is working fine with them.</description>
		<content:encoded><![CDATA[<p>Thanks Alexey, your posts helped me to understand the real problem behind the "poor performance of parallel_for" :-). I am trying out some new experiments and TBB is working fine with them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Watt</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10865</link>
		<dc:creator>Martin Watt</dc:creator>
		<pubDate>Thu, 06 Mar 2008 00:58:32 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10865</guid>
		<description>Yes, I also tried replacing r.end() with a local variable in my existing TBB code and also could not see a difference in performance.</description>
		<content:encoded><![CDATA[<p>Yes, I also tried replacing r.end() with a local variable in my existing TBB code and also could not see a difference in performance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alexey Kukanov (Intel)</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10862</link>
		<dc:creator>Alexey Kukanov (Intel)</dc:creator>
		<pubDate>Wed, 05 Mar 2008 23:56:27 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10862</guid>
		<description>Thanks for the feedback!
&lt;strong&gt;Martin&lt;/strong&gt;, I did some more experiments, and it seems that replacing r.end() with a local variable has little impact on performance, even in this case where the loop body does very little work. Most of the performance gain comes from using local_sum. I will check a few more things and probably write about it next week.
&lt;strong&gt;Jérôme&lt;/strong&gt;, your comment relates more to the question of choosing the right grainsize. Yes, there should be a balance between enough work per task to justify the overhead of task creation and the like, and enough tasks to feed all the threads. Fortunately, tbb::auto_partitioner works reasonably well in many cases, so you do not need to adjust the grainsize manually unless you are fighting for every performance point. Another point in your comment seems to be the choice between serial optimization and parallelism. To me, the answer is that both make sense: as the example showed, after switching to TBB it is worth checking that serial performance has not decreased significantly.</description>
		<content:encoded><![CDATA[<p>Thanks for the feedback!<br />
<strong>Martin</strong>, I did some more experiments, and it seems that replacing r.end() with a local variable has little impact on performance, even in this case where the loop body does very little work. Most of the performance gain comes from using local_sum. I will check a few more things and probably write about it next week.<br />
<strong>Jérôme</strong>, your comment relates more to the question of choosing the right grainsize. Yes, there should be a balance between enough work per task to justify the overhead of task creation and the like, and enough tasks to feed all the threads. Fortunately, tbb::auto_partitioner works reasonably well in many cases, so you do not need to adjust the grainsize manually unless you are fighting for every performance point. Another point in your comment seems to be the choice between serial optimization and parallelism. To me, the answer is that both make sense: as the example showed, after switching to TBB it is worth checking that serial performance has not decreased significantly.</p>
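<p>To make the grainsize trade-off concrete, here is a rough sketch that does not use TBB at all: it swaps in <code>std::async</code> from the C++ standard library and a hypothetical helper named <code>chunked_sum</code>. It only illustrates the idea of splitting a range until each piece is no larger than a grainsize; TBB's own splitting and scheduling work differently.</p>

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper, not a TBB API. Recursively halves [first, last) until a
// piece holds at most `grainsize` elements, then sums that piece serially.
double chunked_sum(const std::vector<double>& v, std::size_t first,
                   std::size_t last, std::size_t grainsize) {
    if (grainsize == 0) grainsize = 1;  // guard against endless splitting
    if (last - first <= grainsize) {
        return std::accumulate(v.begin() + first, v.begin() + last, 0.0);
    }
    const std::size_t mid = first + (last - first) / 2;
    // The left half runs as a separate task; the right half stays on this thread.
    auto left = std::async(std::launch::async, [&v, first, mid, grainsize] {
        return chunked_sum(v, first, mid, grainsize);
    });
    const double right = chunked_sum(v, mid, last, grainsize);
    return left.get() + right;
}
```

<p>A small grainsize creates many short tasks, so task creation overhead dominates; a grainsize close to the full range creates too few tasks to keep every thread busy.</p>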
]]></content:encoded>
	</item>
	<item>
		<title>By: Jérôme Muffat-méridol</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10840</link>
		<dc:creator>Jérôme Muffat-méridol</dc:creator>
		<pubDate>Wed, 05 Mar 2008 08:45:29 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10840</guid>
		<description>This is a very useful and interesting post and I'm glad to read you'll be posting more like this.

When I talk about TBB to people, I tend to say that where we are used to optimising the inner loop, TBB offers a very flexible way to optimise the "outer loop(s)". The example you present today is at the boundary, where fine grain optimisation collides with what you can gain by spreading the work... 

I think it is quite important to know where that boundary is, to get a feel for the amount of work that should go in a TBB task (too little and you miss out on fine-grained optimisation, too much and the work doesn't spread well...). 

So thanks for the post, looking forward to reading more...</description>
		<content:encoded><![CDATA[<p>This is a very useful and interesting post and I'm glad to read you'll be posting more like this.</p>
<p>When I talk about TBB to people, I tend to say that where we are used to optimising the inner loop, TBB offers a very flexible way to optimise the "outer loop(s)". The example you present today is at the boundary, where fine grain optimisation collides with what you can gain by spreading the work... </p>
<p>I think it is quite important to know where that boundary is, to get a feel for the amount of work that should go in a TBB task (too little and you miss out on fine-grained optimisation, too much and the work doesn't spread well...). </p>
<p>So thanks for the post, looking forward to reading more...</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Watt</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10819</link>
		<dc:creator>Martin Watt</dc:creator>
		<pubDate>Tue, 04 Mar 2008 18:36:25 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10819</guid>
		<description>Thanks for the posting. I've always been a bit worried about the iterators in TBB parallel for constructs from an optimization perspective, and this appears to confirm it. So even though the blocked range structure is const, the it.end() cannot be optimized out of the loop? Is there some possibility it may be changed elsewhere?

Given this, it seems we should always recode those loops to extract the end condition as a const variable?</description>
		<content:encoded><![CDATA[<p>Thanks for the posting. I've always been a bit worried about the iterators in TBB parallel for constructs from an optimization perspective, and this appears to confirm it. So even though the blocked range structure is const, the it.end() cannot be optimized out of the loop? Is there some possibility it may be changed elsewhere?</p>
<p>Given this, it seems we should always recode those loops to extract the end condition as a const variable?</p>
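<p>As an illustration of the question, here is a small sketch with a hypothetical <code>Range</code> type and two hypothetical functions, <code>scale_reload</code> and <code>scale_hoisted</code>; none of these names come from TBB. A <code>const</code> reference only says that the object is not modified through that reference, so when a loop writes through a pointer of the same type as the range bounds, the compiler may have to assume the bound can change and reload it.</p>

```cpp
#include <cstddef>

// Hypothetical stand-in for a TBB-style range: only begin() and end() are modeled.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Calls r.end() in the loop condition. The stores through `out` have the same
// type as Range::last, so the compiler may not be able to prove that the bound
// is unchanged between iterations.
void scale_reload(const Range& r, std::size_t* out, std::size_t k) {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
        out[i] = i * k;
}

// Reads the bound once into a const local before the loop starts.
void scale_hoisted(const Range& r, std::size_t* out, std::size_t k) {
    const std::size_t end = r.end();
    for (std::size_t i = r.begin(); i != end; ++i)
        out[i] = i * k;
}
```

<p>Both functions produce the same output as long as <code>out</code> does not overlap the range object; the second simply states that assumption in the code.</p>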
]]></content:encoded>
	</item>
	<item>
		<title>By:  Ray Zed Blog</title>
		<link>http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10850</link>
		<dc:creator> Ray Zed Blog</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/#comment-10850</guid>
		<description>unknown: Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, and I will start with writing</description>
		<content:encoded><![CDATA[<p>unknown: Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, and I will start with writing</p>
]]></content:encoded>
	</item>
</channel>
</rss>
