<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Under the hood: Learning more about task scheduling</title>
	<atom:link href="http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/feed/" rel="self" type="application/rss+xml" />
	<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/</link>
	<description></description>
	<pubDate>Wed, 09 Jul 2008 05:45:29 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Mick Turner</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12658</link>
		<dc:creator>Mick Turner</dc:creator>
		<pubDate>Tue, 20 May 2008 13:08:45 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12658</guid>
		<description>I have just been looking at the code for the task scheduler and I 'think' the following would work...

Add another state, similar to reexecute, perhaps delayed_reexecute. Normally a reexecute calls an immediate spawn on the task. delayed_reexecute would set a local task* delayed_task variable (initially set to NULL) and then after the next successful get_task or steal_task the delayed_task would be spawned and the delayed_task reset to NULL.

This method adds a couple of if tests against the delayed_task variable, one in the get_task middle loop and one in the steal_task outer loop, so reasonably small overhead.

Unfortunately, my app is not in a position to easily test this, but I will do so as soon as I can.</description>
		<content:encoded><![CDATA[<p>I have just been looking at the code for the task scheduler and I 'think' the following would work...</p>
<p>Add another state, similar to reexecute, perhaps delayed_reexecute. Normally a reexecute calls an immediate spawn on the task. delayed_reexecute would set a local task* delayed_task variable (initially set to NULL) and then after the next successful get_task or steal_task the delayed_task would be spawned and the delayed_task reset to NULL.</p>
<p>This method adds a couple of if tests against the delayed_task variable, one in the get_task middle loop and one in the steal_task outer loop, so reasonably small overhead.</p>
<p>Unfortunately, my app is not in a position to easily test this, but I will do so as soon as I can.</p>
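<p>A minimal sketch of the delayed_reexecute idea above, written against a toy single-threaded LIFO pool rather than the real TBB scheduler. ToyScheduler, Task, Outcome and run_all are hypothetical names invented for this illustration. The fallback that runs the stashed task when nothing else is available is also an assumption, since the proposal does not say what should happen in that case.</p>

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// What a task asks the scheduler to do after it has run.
enum class Outcome { Done, Reexecute, DelayedReexecute };

struct Task {
    std::string name;
    std::function<Outcome()> run;
};

// Toy stand-in for one worker's task pool (LIFO, no stealing).
class ToyScheduler {
public:
    void spawn(Task* t) { pool_.push_back(t); }

    // Runs until no task is left; returns the names in execution order.
    std::vector<std::string> run_all() {
        std::vector<std::string> order;
        while (Task* t = get_task()) {
            // The extra "if" test from the proposal: the stashed task is
            // re-spawned only after another task was obtained successfully.
            if (delayed_task_ != nullptr) {
                spawn(delayed_task_);
                delayed_task_ = nullptr;
            }
            order.push_back(t->name);
            switch (t->run()) {
                case Outcome::Done:             break;
                case Outcome::Reexecute:        spawn(t);          break;
                case Outcome::DelayedReexecute: delayed_task_ = t; break;
            }
        }
        return order;
    }

private:
    Task* get_task() {
        if (pool_.empty()) {
            // Assumption: with nothing else to run, retry the stashed task
            // so it is not lost. Returns nullptr when there is none.
            Task* t = delayed_task_;
            delayed_task_ = nullptr;
            return t;
        }
        Task* t = pool_.back();
        pool_.pop_back();
        return t;
    }

    std::deque<Task*> pool_;
    Task* delayed_task_ = nullptr;  // initially NULL, as in the proposal
};
```

<p>If tasks A, B and C are spawned in that order and C asks for a delayed reexecute on its first run, this toy pool runs C, then B, then C again, then A. The blocked task is invisible for exactly one "get next task" call.</p>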
]]></content:encoded>
	</item>
	<item>
		<title>By: Intel® Software Network Blogs &#187; Under the hood: Building hooks to explore TBB task scheduler</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12637</link>
		<dc:creator>Intel® Software Network Blogs &#187; Under the hood: Building hooks to explore TBB task scheduler</dc:creator>
		<pubDate>Mon, 19 May 2008 22:42:53 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12637</guid>
		<description>[...] did I suspect as I was introducing the topic of blocking in parallel computation in my last post that it would generate such interest, even though it seemed a common problem I’d been working on [...]</description>
		<content:encoded><![CDATA[<p>[...] did I suspect as I was introducing the topic of blocking in parallel computation in my last post that it would generate such interest, even though it seemed a common problem I’d been working on [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: randomizer</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12473</link>
		<dc:creator>randomizer</dc:creator>
		<pubDate>Sun, 11 May 2008 10:06:54 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12473</guid>
		<description>Re [Charles E. Leiserson]: Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span &#62;&#62; P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see 
http://supertech.csail.mit.edu/cilk/lecture-1.ppt.
Other lectures and materials can be found here:
http://supertech.csail.mit.edu/cilk/.


I read the Cilk papers some time ago. Could you please explain why
(A) work must be structured as a balanced tree (not ideally balanced, just sufficiently balanced)
does not follow from
(B) work/span &#62;&#62; P
?

I can't imagine an unbalanced tree with work/span &#62;&#62; P. Also take into account that stealing is an expensive operation.

Dmitriy V'jukov</description>
		<content:encoded><![CDATA[<p>Re [Charles E. Leiserson]: Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span &gt;&gt; P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see<br />
<a href="http://supertech.csail.mit.edu/cilk/lecture-1.ppt" rel="nofollow">http://supertech.csail.mit.edu/cilk/lecture-1.ppt</a>.<br />
Other lectures and materials can be found here:<br />
<a href="http://supertech.csail.mit.edu/cilk/" rel="nofollow">http://supertech.csail.mit.edu/cilk/</a>.</p>
<p>I read the Cilk papers some time ago. Could you please explain why<br />
(A) work must be structured as a balanced tree (not ideally balanced, just sufficiently balanced)<br />
does not follow from<br />
(B) work/span &gt;&gt; P<br />
?</p>
<p>I can't imagine an unbalanced tree with work/span &gt;&gt; P. Also take into account that stealing is an expensive operation.</p>
<p>Dmitriy V'jukov</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick Turner</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12446</link>
		<dc:creator>Mick Turner</dc:creator>
		<pubDate>Fri, 09 May 2008 20:41:07 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12446</guid>
		<description>I have a similar issue. My app is effectively an OLAP-type database. The task I am parallelising is the resolution of user data selections to a data set. So I have a tree of operators on picked items from dimensions. A node in the tree (possibly occurring more than once in the tree) could be a previously saved data set. This saved data set may or may not exist in memory and may require construction from its initial definition, thereby generating a new data-set resolution task. As soon as such a saved data set appears more than once I could have a second TBB thread looking to use it whilst it is being constructed. Without getting more sophisticated it would have to block and wait.

Wouldn't the following provide a solution?

If a tbb thread could recycle what is left of its task as a new task and then force a task steal rather than immediately checking to see what other work it had available then there is a chance it would do some useful work - possibly even helping the first tbb thread as it is busy resolving the saved data set (assuming that was broken into many tasks, which mine will be). Once it had completed the stolen task it would go back to normal processing of its task set.

I wouldn't claim my TBB knowledge to be anywhere near 100% yet... but I believe a TBB thread will always execute all tasks in its task set before looking to steal, so I don't think there is a way of simulating this process by playing around with levels in the stack of task lists.

It may be better if this thread moved onto some other task in its task set if such tasks were available and only looked to steal if there were no other tasks. Then what I am proposing is that the task that would otherwise block would temporarily become invisible to the thread, probably only for one 'get next task' call. 

I.e. Have a task.spawntemporarilyinvisible() call

Mick</description>
		<content:encoded><![CDATA[<p>I have a similar issue. My app is effectively an OLAP-type database. The task I am parallelising is the resolution of user data selections to a data set. So I have a tree of operators on picked items from dimensions. A node in the tree (possibly occurring more than once in the tree) could be a previously saved data set. This saved data set may or may not exist in memory and may require construction from its initial definition, thereby generating a new data-set resolution task. As soon as such a saved data set appears more than once I could have a second TBB thread looking to use it whilst it is being constructed. Without getting more sophisticated it would have to block and wait.</p>
<p>Wouldn't the following provide a solution?</p>
<p>If a tbb thread could recycle what is left of its task as a new task and then force a task steal rather than immediately checking to see what other work it had available then there is a chance it would do some useful work - possibly even helping the first tbb thread as it is busy resolving the saved data set (assuming that was broken into many tasks, which mine will be). Once it had completed the stolen task it would go back to normal processing of its task set.</p>
<p>I wouldn't claim my TBB knowledge to be anywhere near 100% yet... but I believe a TBB thread will always execute all tasks in its task set before looking to steal, so I don't think there is a way of simulating this process by playing around with levels in the stack of task lists.</p>
<p>It may be better if this thread moved onto some other task in its task set if such tasks were available and only looked to steal if there were no other tasks. Then what I am proposing is that the task that would otherwise block would temporarily become invisible to the thread, probably only for one 'get next task' call. </p>
<p>I.e. Have a task.spawntemporarilyinvisible() call</p>
<p>Mick</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robert Reed (Intel)</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12415</link>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		<pubDate>Fri, 09 May 2008 00:18:25 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12415</guid>
		<description>Thanks very much, Charles!  I think I understand the basics of your algorithm and I'd like to dig deeper as I have the time.  I'll certainly report back here anything I find interesting. Translating from the theoretical DAG to a practical object computing code, it seems to me that the grey-following threads would reexecute the serial regions in the object computation functions that without the following would only get executed once.  So once-per-object operations like reference counting, memory allocations and such would have to be guarded from the follower(s).  Also in that list, there'd need to be some mechanism for the code to figure out that the initial spawn for a successor had already occurred and a way to find the outgoing spans for that parallel region in, say the TBB task tree.  Another complication would be serial regions between parallel regions within a particular function that may require a synchronizer to ensure tasks associated with the first parallel block were complete before spawning tasks for the second parallel block.

Unfortunately, I do not have any data sets to share.  The customers suggesting these models have their own concerns about privacy and have either not shared more than the sample code I show above or have constrained me under a nondisclosure agreement.  But I can share anything I come up with personally, and I can point them to this exchange in the hope that they may participate to the level they are able.

Thanks for your comments.  I hope we'll continue to keep your interest.</description>
		<content:encoded><![CDATA[<p>Thanks very much, Charles!  I think I understand the basics of your algorithm and I'd like to dig deeper as I have the time.  I'll certainly report back here anything I find interesting. Translating from the theoretical DAG to a practical object computing code, it seems to me that the grey-following threads would reexecute the serial regions in the object computation functions that without the following would only get executed once.  So once-per-object operations like reference counting, memory allocations and such would have to be guarded from the follower(s).  Also in that list, there'd need to be some mechanism for the code to figure out that the initial spawn for a successor had already occurred and a way to find the outgoing spans for that parallel region in, say the TBB task tree.  Another complication would be serial regions between parallel regions within a particular function that may require a synchronizer to ensure tasks associated with the first parallel block were complete before spawning tasks for the second parallel block.</p>
<p>Unfortunately, I do not have any data sets to share.  The customers suggesting these models have their own concerns about privacy and have either not shared more than the sample code I show above or have constrained me under a nondisclosure agreement.  But I can share anything I come up with personally, and I can point them to this exchange in the hope that they may participate to the level they are able.</p>
<p>Thanks for your comments.  I hope we'll continue to keep your interest.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Charles E. Leiserson</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12410</link>
		<dc:creator>Charles E. Leiserson</dc:creator>
		<pubDate>Thu, 08 May 2008 19:24:00 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12410</guid>
		<description>I think you can actually do the algorithm I proposed easily using only synchronization through memory, because the state bits change monotonically: white =&#62; grey =&#62; black and no-value =&#62; value.  You might get some redundant computation, but it would likely be minimal.

Work-stealing need not be balanced.  The key issues are work (total ops) and span (critical-path length).  Parallelism is work/span.  As long as there's substantially more parallelism than processors (work/span &#62;&#62; P), work-stealing will give linear speed-up.  Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal.  For more details, see http://supertech.csail.mit.edu/cilk/lecture-1.ppt.  Other lectures and materials can be found here: http://supertech.csail.mit.edu/cilk/.</description>
		<content:encoded><![CDATA[<p>I think you can actually do the algorithm I proposed easily using only synchronization through memory, because the state bits change monotonically: white =&gt; grey =&gt; black and no-value =&gt; value.  You might get some redundant computation, but it would likely be minimal.</p>
<p>Work-stealing need not be balanced.  The key issues are work (total ops) and span (critical-path length).  Parallelism is work/span.  As long as there's substantially more parallelism than processors (work/span &gt;&gt; P), work-stealing will give linear speed-up.  Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal.  For more details, see <a href="http://supertech.csail.mit.edu/cilk/lecture-1.ppt" rel="nofollow">http://supertech.csail.mit.edu/cilk/lecture-1.ppt</a>.  Other lectures and materials can be found here: <a href="http://supertech.csail.mit.edu/cilk/" rel="nofollow">http://supertech.csail.mit.edu/cilk/</a>.</p>
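<p>To make the work/span measure concrete, here is a small sketch for a lopsided fork chain like the one in Dmitriy's example. The cost model is an assumption and not something stated in this thread: there are depth forking tasks in a chain, each costing spine_cost, and each forks one childless task costing leaf_cost plus (except for the last) the next forking task. The function names are hypothetical.</p>

```cpp
#include <cassert>

// Work: the total cost of every task in the chain.
long long comb_work(long long depth, long long spine_cost, long long leaf_cost) {
    return depth * (spine_cost + leaf_cost);
}

// Span: the costliest dependency path. It runs down the whole chain of
// forking tasks and ends in the deepest childless task.
long long comb_span(long long depth, long long spine_cost, long long leaf_cost) {
    return depth * spine_cost + leaf_cost;
}

// Parallelism = work / span.
double comb_parallelism(long long depth, long long spine_cost, long long leaf_cost) {
    return static_cast<double>(comb_work(depth, spine_cost, leaf_cost)) /
           static_cast<double>(comb_span(depth, spine_cost, leaf_cost));
}
```

<p>Under these assumptions, with unit costs and depth 1000 the work is 2000 and the span is 1001, so parallelism stays just under 2 however deep the chain gets. If each childless task instead costs 1000 units, the work is 1,001,000 and the span is 2000, which gives parallelism of about 500 from the same unbalanced shape.</p>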
]]></content:encoded>
	</item>
	<item>
		<title>By: Dmitriy V'jukov</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12409</link>
		<dc:creator>Dmitriy V'jukov</dc:creator>
		<pubDate>Thu, 08 May 2008 18:41:38 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12409</guid>
		<description>Code formatting in previous post is lost. Damn!

Re [Charles E. Leiserson]: Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies.

Btw, a work-stealing scheduler supports only parallelism structured as a *balanced* tree. Am I right here?
So the user MUST structure parallelism as a *balanced* tree. Given an extreme case of an unbalanced tree, the user can get extremely bad performance.

For example, the following structure *allows* parallelism, but will be executed extremely badly by a work-stealing scheduler:

work11 forks work21, work22

work21 forks nothing
work22 forks work31, work32

work31 forks nothing
work32 forks work41, work42
etc

Is my understanding correct?</description>
		<content:encoded><![CDATA[<p>Code formatting in previous post is lost. Damn!</p>
<p>Re [Charles E. Leiserson]: Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies.</p>
<p>Btw, a work-stealing scheduler supports only parallelism structured as a *balanced* tree. Am I right here?<br />
So the user MUST structure parallelism as a *balanced* tree. Given an extreme case of an unbalanced tree, the user can get extremely bad performance.</p>
<p>For example, the following structure *allows* parallelism, but will be executed extremely badly by a work-stealing scheduler:</p>
<p>work11 forks work21, work22</p>
<p>work21 forks nothing<br />
work22 forks work31, work32</p>
<p>work31 forks nothing<br />
work32 forks work41, work42<br />
etc</p>
<p>Is my understanding correct?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dmitriy V'jukov</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12406</link>
		<dc:creator>Dmitriy V'jukov</dc:creator>
		<pubDate>Thu, 08 May 2008 18:31:42 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12406</guid>
		<description>Re [Charles E. Leiserson]: But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies? I'd love to find that I'm overlooking an obvious solution that doesn't require locks. Got an idea?


What about the following approach?

// pseudo-code
void task::operator()()
{
  secondary_t* sec = get_associated_secondary_element();
  if (sec)
  {
    if (false == sec-&#62;is_computed())
    {
      if (sec-&#62;try_lock())
      {
         // re-check under the lock: another thread may have finished first
         if (false == sec-&#62;is_computed())
           sec-&#62;compute();
         sec-&#62;unlock();
      }
      else
      {
         // in fifo order, not in lifo!
         postpone_task(this);
         // maybe next time
         return;
      }
    }
  }
  // sec is computed (or absent), can use it
  fork_child1(sec);
  fork_child2(sec);
}</description>
		<content:encoded><![CDATA[<p>Re [Charles E. Leiserson]: But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies? I'd love to find that I'm overlooking an obvious solution that doesn't require locks. Got an idea?</p>
<p>What about the following approach?</p>
<p>// pseudo-code<br />
void task::operator()()<br />
{<br />
  secondary_t* sec = get_associated_secondary_element();<br />
  if (sec)<br />
  {<br />
    if (false == sec-&gt;is_computed())<br />
    {<br />
      if (sec-&gt;try_lock())<br />
      {<br />
         // re-check under the lock: another thread may have finished first<br />
         if (false == sec-&gt;is_computed())<br />
           sec-&gt;compute();<br />
         sec-&gt;unlock();<br />
      }<br />
      else<br />
      {<br />
         // in fifo order, not in lifo!<br />
         postpone_task(this);<br />
         // maybe next time<br />
         return;<br />
      }<br />
    }<br />
  }<br />
  // sec is computed (or absent), can use it<br />
  fork_child1(sec);<br />
  fork_child2(sec);<br />
}</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Charles E. Leiserson</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12400</link>
		<dc:creator>Charles E. Leiserson</dc:creator>
		<pubDate>Thu, 08 May 2008 15:58:20 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12400</guid>
		<description>You're right that there are issues of memory use.  A properly implemented work-stealing scheduler guarantees that if S1 is the serial stack space, with P processors the stack space is at most P*S1.  If you repurpose a processor every time it hits somewhere another processor is executing, you can easily blow out memory.

You can model the computation to be performed as a dag (directed acyclic graph).  Each vertex has a value to be computed which depends on the values of its successors in the dag.  Sinks (vertices with out-degree 0) can be computed with no further exploration.  A vertex whose value has been assigned can provide its value to its predecessors without further computation.

Here's a randomized nondeterministic algorithm to compute the dag which should work well using Cilk or TBB, although I haven't implemented it.  It uses at most P*S1 stack space.  Recursively walk the dag in parallel, spawning each exploration of a successor.  Label edges as white, grey, and black.  All edges are initially white.  If you traverse a white edge, color it grey.  If you return across a grey edge, color it black.  Now, in your parallel recursive tree walk, when you visit a vertex, if its value has already been computed, return the value.  If not and there exists at least one white edge, take a _random_ white edge.  Otherwise (all edges are black or grey), take a _random_ grey edge.  (You're now chasing the tail of the processor that colored the edge grey speculatively hoping to help out with that subdag.)  Use locks only to guarantee the atomicity of these operations on a single node.

I'll look at this problem with my students, both theoretically and in practice.  If you have any data sets, let me know.  Also, if you implement it, I'd be curious as to the results.</description>
		<content:encoded><![CDATA[<p>You're right that there are issues of memory use.  A properly implemented work-stealing scheduler guarantees that if S1 is the serial stack space, with P processors the stack space is at most P*S1.  If you repurpose a processor every time it hits somewhere another processor is executing, you can easily blow out memory.</p>
<p>You can model the computation to be performed as a dag (directed acyclic graph).  Each vertex has a value to be computed which depends on the values of its successors in the dag.  Sinks (vertices with out-degree 0) can be computed with no further exploration.  A vertex whose value has been assigned can provide its value to its predecessors without further computation.</p>
<p>Here's a randomized nondeterministic algorithm to compute the dag which should work well using Cilk or TBB, although I haven't implemented it.  It uses at most P*S1 stack space.  Recursively walk the dag in parallel, spawning each exploration of a successor.  Label edges as white, grey, and black.  All edges are initially white.  If you traverse a white edge, color it grey.  If you return across a grey edge, color it black.  Now, in your parallel recursive tree walk, when you visit a vertex, if its value has already been computed, return the value.  If not and there exists at least one white edge, take a _random_ white edge.  Otherwise (all edges are black or grey), take a _random_ grey edge.  (You're now chasing the tail of the processor that colored the edge grey speculatively hoping to help out with that subdag.)  Use locks only to guarantee the atomicity of these operations on a single node.</p>
<p>I'll look at this problem with my students, both theoretically and in practice.  If you have any data sets, let me know.  Also, if you implement it, I'd be curious as to the results.</p>
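<p>One possible reading of the edge-colouring walk described above, as a hedged sketch rather than a tested implementation of the proposal. Several details are assumptions made for illustration: the vertex value is taken to be the vertex's own weight plus the sum of its successors' values, each walker is an ordinary caller of visit() instead of a spawned Cilk or TBB task, and the names Vertex, Color, add_edge and visit are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <random>
#include <vector>

enum class Color { White, Grey, Black };

struct Vertex {
    long weight = 1;                 // assumed per-vertex contribution
    std::vector<std::size_t> succ;   // successor vertex indices (fixed after setup)
    std::vector<Color> color;        // one colour per outgoing edge
    std::vector<long> succ_value;    // value returned across each edge
    bool has_value = false;          // no-value => value, never reset
    long value = 0;
    std::mutex m;                    // guards this vertex only
};

void add_edge(std::vector<Vertex>& dag, std::size_t from, std::size_t to) {
    dag[from].succ.push_back(to);
    dag[from].color.push_back(Color::White);
    dag[from].succ_value.push_back(0);
}

// Returns the value of vertex v, exploring successors as needed.
long visit(std::vector<Vertex>& dag, std::size_t v, std::mt19937& rng) {
    Vertex& node = dag[v];
    for (;;) {
        std::size_t edge = 0;
        {
            std::lock_guard<std::mutex> lock(node.m);
            if (node.has_value) return node.value;   // already computed
            std::vector<std::size_t> white, grey;
            for (std::size_t i = 0; i < node.succ.size(); ++i) {
                if (node.color[i] == Color::White) white.push_back(i);
                else if (node.color[i] == Color::Grey) grey.push_back(i);
            }
            if (white.empty() && grey.empty()) {
                // Every edge is black, so every successor value is known.
                long total = node.weight;
                for (long x : node.succ_value) total += x;
                node.value = total;
                node.has_value = true;
                return total;
            }
            // Prefer a random white edge; otherwise chase a random grey one.
            const std::vector<std::size_t>& pool = white.empty() ? grey : white;
            std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
            edge = pool[pick(rng)];
            if (node.color[edge] == Color::White) node.color[edge] = Color::Grey;
        }
        // Explore outside the lock so other walkers can enter this vertex.
        long result = visit(dag, node.succ[edge], rng);
        std::lock_guard<std::mutex> lock(node.m);
        node.succ_value[edge] = result;   // idempotent if two walkers return
        node.color[edge] = Color::Black;
    }
}
```

<p>In this sketch a walker holds only the lock of the vertex it is inspecting and releases it before descending, so its recursion depth is bounded by the depth of the dag. Because colours and values only move forward, two walkers that return across the same edge write the same result, which matches the expectation of some redundant but harmless work.</p>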
]]></content:encoded>
	</item>
	<item>
		<title>By: Robert Reed (Intel)</title>
		<link>http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12385</link>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		<pubDate>Wed, 07 May 2008 21:38:33 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/05/06/under-the-hood-learning-more-about-task-scheduling/#comment-12385</guid>
		<description>Thanks for the correction, Alexey.  Somewhere in the back of my mind I recalled some suggestion of migrating code to the affinity_partitioner.  Must have been a memory fault ;-)

To Charles: yes, in general I agree with you.  Particularly driven by the philosophy of Intel Threading Building Blocks to reduce areas where caches can cool, locks are to be avoided as much as possible.  Yet locks are included in Threading Building Blocks so if they are evil, they are at least a necessary evil.  

Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies. Briefly I described a couple of these application structures in my intro, but let me try to elaborate one and see if anyone has an alternate approach to the problem.  Imagine an object processing system (could be a mechanical CAD package, or maybe a 3D renderer, or some physics simulation) wherein the objects to process have shared dependencies to secondary objects.  Those secondary objects may be sub-assemblies or caches of precomputed data that themselves may be computed in parallel.  Our example program may not know a priori what is the topology of those dependencies: a reasonable approach in this case is to describe a set of tasks where each describes processing for one of the primary objects.  Presumably these objects are localized in memory and so fit the general criteria for processing in parallel on separate processing elements (PEs).  Then one PE encounters a reference to a secondary object that requires its own computation.  What to do here?  We could dispatch this thread to compute the secondary object, but what if a second PE working on another primary object runs into a reference to the same secondary?  Do we duplicate the computation?  We could, at the cost of extra memory and duplicated processing.  Better if we could somehow suspend the second PE until the processing/preparation of the secondary object is completed (cached data of the primary cooling the whole time, admittedly), at which point both PEs could proceed on computation of their primary objects.  Sounds like a resource lock to me, although ideally we'd like to keep the second PE busy, not off in an idle loop or some other process while the thread it was running is suspended on that lock.  
What would be really cool is to structure things so that both PEs share the work of processing the secondary object and then each return to their primary object processing once that work is done.  But there are questions of oversubscription and undersubscription and memory use and stack splitting and more to deal with here.  I don't claim to have answers for all these but I think they're worth a discussion.

But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies?  I'd love to find that I'm overlooking an obvious solution that doesn't require locks.  Got an idea?</description>
		<content:encoded><![CDATA[<p>Thanks for the correction, Alexey.  Somewhere in the back of my mind I recalled some suggestion of migrating code to the affinity_partitioner.  Must have been a memory fault ;-)</p>
<p>To Charles: yes, in general I agree with you.  Particularly driven by the philosophy of Intel Threading Building Blocks to reduce areas where caches can cool, locks are to be avoided as much as possible.  Yet locks are included in Threading Building Blocks so if they are evil, they are at least a necessary evil.  </p>
<p>Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies. Briefly I described a couple of these application structures in my intro, but let me try to elaborate one and see if anyone has an alternate approach to the problem.  Imagine an object processing system (could be a mechanical CAD package, or maybe a 3D renderer, or some physics simulation) wherein the objects to process have shared dependencies to secondary objects.  Those secondary objects may be sub-assemblies or caches of precomputed data that themselves may be computed in parallel.  Our example program may not know a priori what is the topology of those dependencies: a reasonable approach in this case is to describe a set of tasks where each describes processing for one of the primary objects.  Presumably these objects are localized in memory and so fit the general criteria for processing in parallel on separate processing elements (PEs).  Then one PE encounters a reference to a secondary object that requires its own computation.  What to do here?  We could dispatch this thread to compute the secondary object, but what if a second PE working on another primary object runs into a reference to the same secondary?  Do we duplicate the computation?  We could, at the cost of extra memory and duplicated processing.  Better if we could somehow suspend the second PE until the processing/preparation of the secondary object is completed (cached data of the primary cooling the whole time, admittedly), at which point both PEs could proceed on computation of their primary objects.  Sounds like a resource lock to me, although ideally we'd like to keep the second PE busy, not off in an idle loop or some other process while the thread it was running is suspended on that lock.  
What would be really cool is to structure things so that both PEs share the work of processing the secondary object and then each return to their primary object processing once that work is done.  But there are questions of oversubscription and undersubscription and memory use and stack splitting and more to deal with here.  I don't claim to have answers for all these but I think they're worth a discussion.</p>
<p>But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies?  I'd love to find that I'm overlooking an obvious solution that doesn't require locks.  Got an idea?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
