<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Intel Software Network Blogs &#187; Arch Robison (Intel)</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/arch-robison/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<pubDate>Mon, 13 Oct 2008 20:26:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<item>
		<title>Implementing task_group interface in TBB</title>
		<link>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/#comments</comments>
		<pubDate>Wed, 02 Jul 2008 13:53:06 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/</guid>
		<description><![CDATA[The TBB class task was designed for high-performance implementations of the TBB templates.  Its efficiency, particularly its emphasis on continuation-passing style, comes at some price in convenience.  Rick Molloy of Microsoft has posted a description of a task_group interface that Microsoft is considering.  It's more convenient than the TBB interface, particularly when your compiler supports C++ [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.threadingbuildingblocks.org">TBB</a> class task was designed for high-performance implementations of the TBB templates.  Its efficiency, particularly its emphasis on continuation-passing style, comes at some price in convenience.  Rick Molloy of Microsoft has <a href="http://blogs.msdn.com/nativeconcurrency/">posted a description</a> of a <code>task_group</code> interface that Microsoft is considering.  It's more convenient than the TBB interface, particularly when your compiler supports C++ 200x lambda expressions (Section 5.1.1 of <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2606.pdf">N2606</a>).</p>
<p>I implemented a subset of <code>task_group</code> in TBB as a header tbb/task_group.h: 37 lines of C++ and 5 preprocessor lines.   It's a small subset.</p>
<ul>
<li>It does not support task_handle. </li>
<li>The <a href="http://software.intel.com/en-us/blogs/2008/06/11/exception-handling-and-cancellation-in-tbb-part-iv-using-context-objects/">exception/cancellation model</a> is still TBB's. </li>
<li><code>wait()</code> returns void, not <code>task_group_status</code>, since the blog does not detail <code>task_group_status</code>. </li>
</ul>
<p>But nonetheless, I think some TBB users will find this minimal form useful.  For example, it's enough of <code>task_group</code> to write the quicksort in Molloy's post.</p>
<p>The code for the header follows my signature.  I'd be interested to hear how useful it is.</p>
<p>- Arch</p>
<pre>#ifndef __TBB_task_group_H
#define __TBB_task_group_H

#include "tbb/task.h"

namespace tbb {

class task_group;

namespace internal {

// Suppress gratuitous warnings from icc 11.0 when lambda expressions are used in instances of function_task.
#pragma warning(disable: 588)

template&lt;typename Function&gt;
class function_task: public task {
    Function my_func;
    /*override*/ task* execute() {
        my_func();
        return NULL;
    }
public:
    function_task( Function&amp; f ) : my_func(f) {}
};

} // namespace internal

class task_group: internal::no_copy {
private:
    empty_task* root;
public:
    task_group() {
        root = new(task::allocate_root()) empty_task;
        root-&gt;set_ref_count(1);
    }
    ~task_group() {
        if( root-&gt;ref_count() )
            root-&gt;wait_for_all();
        root-&gt;destroy(*root);
    }
    template&lt;typename Function&gt;
    void run( Function f ) {
        task&amp; self = task::self();
        self.spawn(*new( self.allocate_additional_child_of( *root )) internal::function_task&lt;Function&gt;(f) );
    }
    void wait() {
        root-&gt;wait_for_all();
    }
};

} // namespace tbb

#endif /* __TBB_task_group_H */</pre>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tasks for Doing and Threads for Waiting</title>
		<link>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/#comments</comments>
		<pubDate>Fri, 06 Jun 2008 03:35:27 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/</guid>
		<description><![CDATA[TBB started out as a task-based framework for parallel programming.  TBB 2.1 adds threads.  This note explains the new threading interface, when to use it, and when to use tasks instead.
TBB tasks rely on non-preemptive cooperative scheduling based on work stealing, similar to Cilk. Once the TBB scheduler starts a task on a software thread, [...]]]></description>
			<content:encoded><![CDATA[<p>TBB started out as a task-based framework for parallel programming.  TBB 2.1 adds threads.  This note explains the new threading interface, when to use it, and when to use tasks instead.</p>
<p>TBB tasks rely on non-preemptive cooperative scheduling based on work stealing, similar to Cilk. Once the TBB scheduler starts a task on a software thread, it does not switch to another task except at well-defined points (specifically, while waiting for its child tasks to complete). For compute-bound workloads, this style of scheduling has multiple benefits:</p>
<ul>
<li>Cooperative scheduling has low context switch overhead.</li>
<li>Work stealing has good cache locality and certain space guarantees (see the Cilk papers).</li>
<li>Lastly, and most important, actual parallelism can be matched to available parallelism. In task-based programming, the programmer is expected to provide too much parallelism (“parallel slack” in Cilk parlance), and the scheduler will extract just enough parallelism to keep the machine humming, not swamped.</li>
</ul>
<p>But programs are not only about calculation.  Programs also often have to wait on external events.  For the sake of timely response, the wait needs to be done by a preemptively scheduled thread that can run as soon as the event occurs. Of course, interrupts or polling sometimes work. Interrupts are something of a combination of cooperative and preemptive scheduling.  The cooperative part is using an existing thread; the preemptive part is using it at any time. That can be the best of both (preemptive and low overhead) or the worst of both (interrupt handlers are typically constrained in what they are allowed to do).  Polling usually scales poorly: composing two polling components requires composing their polling loops or using separate threads for the two polling loops.</p>
<p>Tasks can block, but doing so has two problems:</p>
<ul>
<li>The underlying thread sits idle until the task unblocks.</li>
<li>Tasks (and their threads) waiting on it also sit idle.  I.e., blocking propagates along a dependence chain.</li>
</ul>
<p>In an ideal world, the scheduler would fire up other threads to run tasks in the meantime.  To do this efficiently requires user-level scheduling support that is not (at least yet) available in all operating systems targeted by TBB. But even with that support, there is another issue.  K blocked threads consume K stacks. Most of these stacks might be quite small, but current calling conventions require that programmers specify a fixed stack size (or use a default one) that is typically much larger than necessary for the common case. Changing the calling convention to be more like Cilk’s would solve this problem, but calling conventions take a long time to reform.</p>
<p>So in addition to tasks, TBB 2.1 has a class tbb_thread, which is a thin wrapper around a platform’s native thread. The interface is as close to the C++ 200x std::thread as we could make it given the limitations of C++ 1998. In particular:</p>
<ul>
<li>Lack of variadic templates restricts us to a fixed limit on template arguments.</li>
<li>Lack of rvalue references implies slightly more overhead because copy construction has to be used instead of move construction.</li>
<li>Time is measured in the existing TBB timing interface tick_count::interval_t instead of the templated time interface in C++ 200x. We had to draw the line somewhere on where to stop pulling in C++ 200x.</li>
</ul>
<p>We chose to call it tbb::tbb_thread and not tbb::thread to avoid name collisions when the ISO std::thread becomes available and a program liberally employs “using” directives.</p>
<p>Because tbb::tbb_thread is a thin wrapper around native threads, threads are heavier than tasks. They take longer to create and destroy.  They have associated stacks.  They are preemptively scheduled, so they guarantee concurrency, which is useful when you need it, but comes at the price of oversubscription if misused.  But they can block without impacting other threads or tasks. </p>
<p>So TBB 2.1 has two ways to get things done: tasks and threads. When designing a program, try to separate calculating work from waiting work. Use tasks for calculation and threads for waiting. When a thread needs to do calculations, it can do it with tasks. Avoid having a task block on an external event. Software components doing waiting should call on components doing calculation, not the other way around.</p>
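<p>The division of labor above can be sketched in standard C++. This is only an illustration under stated assumptions: std::thread stands in for tbb::tbb_thread, a plain loop stands in for task-based calculation, and the names Event, crunch, wait_then_crunch and demo are hypothetical, not part of TBB.</p>

```cpp
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an external event source (a socket, a device, a UI).
struct Event {
    std::mutex mutex;
    std::condition_variable arrived;
    bool ready = false;
    std::vector<int> payload;
};

// Calculating component: compute-bound and never blocks.
// In a TBB program this is where tasks would be used.
long long crunch(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

// Waiting component: blocks until the event fires, then calls the
// calculating component (not the other way around).
long long wait_then_crunch(Event& event) {
    std::vector<int> data;
    {
        std::unique_lock<std::mutex> lock(event.mutex);
        event.arrived.wait(lock, [&event] { return event.ready; });
        data = std::move(event.payload);
    }   // the lock is released before the calculation starts
    return crunch(data);
}

// A dedicated thread does the waiting; the caller plays the event source.
long long demo() {
    Event event;
    long long result = 0;
    std::thread waiter([&] { result = wait_then_crunch(event); });
    {
        std::lock_guard<std::mutex> lock(event.mutex);
        event.payload = {1, 2, 3, 4};
        event.ready = true;
    }
    event.arrived.notify_one();
    waiter.join();
    return result;
}
```

<p>The blocked thread costs a stack while it waits, but no task is ever parked on the condition variable, so nothing in the task scheduler idles behind it.</p>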
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Abstracting Thread Local Storage</title>
		<link>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 19:25:35 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</guid>
		<description><![CDATA[[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]
Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense that, given one, you can easily build the other. For example, thread local storage can be implemented [...]]]></description>
			<content:encoded><![CDATA[<p>[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]</p>
<p>Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense that, given one, you can easily build the other. For example, thread local storage can be implemented as a one-to-one map from thread ids to pieces of storage. And vice versa, the address of a variable in thread-local storage can be used as a thread id.</p>
<p>TBB by design has no thread id or thread-local storage. TBB is based on task-based parallelism, where the programmer breaks work up into tasks, and the task scheduler is free to map tasks to hardware threads. Furthermore, our OpenMP run-time group strongly recommended that we avoid explicit thread ids because of problems with nested parallelism and dealing with a dynamically growing or shrinking team of threads. For example, the nature of the OpenMP thread id interface implies that the number of threads in a thread team is fixed for the duration of the team.</p>
<p>However, thread local storage does have its uses. Don Webber <a href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30247002/30248017/ShowThread.aspx#30248017">posted</a> an excellent use case for thread local storage, which involves updating a sparse matrix. The problem involves doing many updates of the form *p += value, in parallel, where some updates might update the same location. Assuming that += is commutative and associative, one way to implement this is to have each thread sum its own updates privately, and then merge the sums. As Don notes, the alternative of locking *p on each update is prohibitively expensive. Using an atomic +=, even if available on current hardware, would likewise be prohibitively expensive, because cache lines would ping-pong severely.</p>
<p>I'd like to see TBB extended to provide the power of thread local storage without opening up a Pandora's box of raw thread ids. I think the solution needs to cleanly separate a high-level portion from a low-level portion, like we recently did for cache affinity. (Note: the type task::affinity_id might appear to have opened the box, but did not, because it is a <em>hint</em>, not a commandment.)</p>
<p>TBB's template parallel_reduce partially deals with the cited use case, because it is lazy about fork/joins. The user defines how to fork/join state information for the reduction. The template recursively decomposes the range and applies the user's fork/join operations. The laziness is that fork/join pairs are only used when task stealing occurs. For example, if there are P threads and N leaf subranges, it does not do the obvious N-1 fork operations, but instead does just enough to keep the threads busy. Specifically, it does a fork/join pair only when the two children would be processed by different threads.</p>
<p>However, parallel_reduce is not lazy enough. At the high level, the problem is that parallel_reduce cannot assume that the reduction operation is commutative. For a non-commutative reduction operation, the current implementation is close to optimal (maybe a factor of 2 off in the worst case) with respect to the number of fork/join pairs. If TBB added a reduction template that could assume a commutative reduction operation (e.g., parallel_unordered_reduce), then at most P-1 fork/join pairs would be necessary.</p>
<p>The good thing about using the hypothetical parallel_unordered_reduce instead of exposing thread local storage is that it keeps the abstraction at a high level. Explicitly using thread local storage would introduce irrelevant low-level details. For example, a typical implementation based on thread local storage can be sketched as:</p>
<blockquote>
<pre>forall updates (p,value)
    do  *p += value // *p points to thread-local partial sum
for each thread-local partial sum do
    update global-sum += thread-local partial sum</pre>
</blockquote>
<p>This level exposes issues such as "where are the thread-local partial sums that were generated in the first loop?" Since threads can come and go during execution of the first loop, iterating across ids of currently running threads is not enough. Some of the partial sums might outlive their threads, or some threads might come into existence after the partial sums were generated.  We'll need a container to hold the partial sums, and a means of iterating over the sums.</p>
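<p>To make the low-level version concrete, here is a minimal sketch of the two loops in standard C++, assuming a fixed team of std::thread workers indexed 0..num_threads-1. That fixed team is exactly the assumption the paragraph above warns about; Update and apply_updates are hypothetical names, and a dense vector stands in for the sparse matrix.</p>

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// One update of the form target[index] += value.
struct Update {
    std::size_t index;
    double value;
};

// Applies all updates to a zero-initialized array of n elements.
// Each worker sums into its own private copy; the copies are merged afterwards.
std::vector<double> apply_updates(const std::vector<Update>& updates,
                                  std::size_t n, unsigned num_threads) {
    // One private partial-sum array per worker (the "thread-local" storage).
    std::vector<std::vector<double>> partial(num_threads,
                                             std::vector<double>(n, 0.0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&updates, &partial, num_threads, t] {
            // First loop: worker t handles updates t, t+num_threads, ...
            for (std::size_t k = t; k < updates.size(); k += num_threads)
                partial[t][updates[k].index] += updates[k].value;
        });
    }
    for (std::thread& w : workers) w.join();

    // Second loop: merge every partial sum into the global result.
    std::vector<double> global(n, 0.0);
    for (const std::vector<double>& p : partial)
        for (std::size_t i = 0; i < n; ++i)
            global[i] += p[i];
    return global;
}
```

<p>The sketch only works because the team is fixed and every worker is joined before the merge, so the vector of partial sums is trivially "where the sums are". Once threads can come and go, that bookkeeping has to live in a real container.</p>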
<p>The interface for such a container seems straightforward. Define it as a sequence of T that has iterator capability. Add TBB-style ranges too, so that reductions over the container can be done in parallel. Add a special method "mine()" that returns a reference to the element that is owned by the invoking thread. If the element is not present, "mine()" would insert one and default-construct it. </p>
<p>Method mine() would most likely be implemented by hashing a thread id, so it's not going to be cheap, but it is probably inexpensive enough if the user hoists calls to it. </p>
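<p>A minimal sketch of such a container, assuming standard C++ threads rather than anything in TBB. The class name per_thread, the method for_each (a stand-in for real iterators and ranges) and the helper count_in_parallel are hypothetical; a mutex-protected hash map keyed by std::thread::id is just one possible way to do the hashing.</p>

```cpp
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical container: one element of T per thread that asks for one.
template <typename T>
class per_thread {
    std::mutex mutex_;
    std::unordered_map<std::thread::id, T> slots_;
public:
    // Returns the invoking thread's element, value-initializing it on first use.
    // References into an unordered_map stay valid when other elements are inserted.
    T& mine() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[std::this_thread::get_id()];
    }
    // Visits every element, including those whose threads have already exited.
    template <typename F>
    void for_each(F f) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& entry : slots_) f(entry.second);
    }
};

// Each worker hoists the call to mine() out of its loop, then sums privately.
int count_in_parallel(unsigned num_threads, int increments) {
    per_thread<int> sums;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&sums, increments] {
            int& local = sums.mine();      // one hash lookup per thread
            for (int i = 0; i < increments; ++i) ++local;
        });
    for (std::thread& w : workers) w.join();
    int total = 0;
    sums.for_each([&total](int& s) { total += s; });
    return total;
}
```

<p>Because the partial sums live in the container rather than in the threads, they survive their threads and can be merged after the workers are gone.</p>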
<p>There is an interesting alternative that weakens the guarantees in order to express intent at a slightly higher level. It's somewhat of an object-oriented extension of a semaphore that combines the semaphore with the resource it is protecting. It would work as follows. The method "mine()" could be replaced by two methods "acquire()" and "release()" [possibly sugared with RAII] such that: </p>
<ul>
<li>"acquire()" would grant access to an instance of T that is not being accessed by any other thread</li>
<li>"release()" would release access</li>
</ul>
<p>This interface permits an implementation to limit the number of "thread local" copies of T to what is actually necessary for concurrency, not what is necessary for one-per-thread. If T is really big, this could be advantageous. There's perhaps an issue with cache affinity.  However, a sufficiently clever implementation could bias towards grabbing the instance of T that the thread most recently had before. Of course, a conforming implementation could just use thread-local storage for each copy of T.</p>
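<p>Here is one way the acquire()/release() idea might look, again as a hedged sketch in standard C++. The names instance_pool, lease and instance_count are hypothetical. The idle list is used last-in first-out as a crude nod to the affinity concern; it does not track which thread last held an instance.</p>

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical pool: creates only as many instances of T as are ever in use at once.
template <typename T>
class instance_pool {
    std::mutex mutex_;
    std::vector<std::unique_ptr<T>> instances_;  // owns every T ever created
    std::vector<T*> idle_;                       // instances nobody is using
public:
    // Grants exclusive access to an instance of T, creating a new one only
    // if every existing instance is currently held by some thread.
    T* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) {
            instances_.emplace_back(new T());
            return instances_.back().get();
        }
        T* item = idle_.back();   // LIFO: prefer the most recently released one
        idle_.pop_back();
        return item;
    }
    // Gives the instance back so that another thread may acquire it.
    void release(T* item) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(item);
    }
    std::size_t instance_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return instances_.size();
    }
    // RAII sugar: acquires in the constructor, releases in the destructor.
    class lease {
        instance_pool& pool_;
        T* item_;
    public:
        explicit lease(instance_pool& pool) : pool_(pool), item_(pool.acquire()) {}
        ~lease() { pool_.release(item_); }
        lease(const lease&) = delete;
        lease& operator=(const lease&) = delete;
        T& operator*() const { return *item_; }
    };
};
```

<p>A thread that repeatedly acquires and releases without contention keeps getting the same instance back, so the pool grows only when accesses actually overlap.</p>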
<p>Comments? </p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Linked Lists - Incompatible with Parallel Programming?</title>
		<link>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/#comments</comments>
		<pubDate>Thu, 20 Dec 2007 23:08:55 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/</guid>
		<description><![CDATA[I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access.   The answer is that we'd add one if:

We could figure out semantics that are useful for parallel programming and
We could implement it reasonably efficiently on current hardware.

I usually try to avoid linked [...]]]></description>
			<content:encoded><![CDATA[<p>I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access.   The answer is that we'd add one if:</p>
<ul>
<li>We could figure out semantics that are useful for parallel programming <em>and</em></li>
<li>We could implement it reasonably efficiently on current hardware.</li>
</ul>
<p>I usually try to avoid linked lists even for <em>sequential </em>programming if I'm programming for performance. My reasons are:</p>
<ul>
<li>Linked lists are often misused in ways that create asymptotic slowness.  Usually the culprit is searching the list.  Yes, similar abuses can occur for other data structures too.  But lists seem to attract this abuse.</li>
<li> Linked lists are unfriendly to cache.  Adjacent items in a list tend to be scattered in memory.  With cache misses costing on the order of 100x a cache hit, this can be a significant performance issue.</li>
</ul>
<p>For parallel programming, add another flaw:</p>
<ul>
<li>Traversing a linked list is inherently serial.  [Theorists will point out that traversal <em>can </em>be done in parallel if you have a processor per node in the list. Feel free to buy that many Intel processors -- I own Intel stock.]</li>
</ul>
<p>Two traditional attractions of linked lists are:</p>
<ol>
<li>Linked lists are about the easiest dynamic data structure to write from scratch.</li>
<li> Prepending and appending take O(1) time.</li>
</ol>
<p>But modern polymorphic languages like C++ provide dynamic data structures like std::vector and std::deque. You don't have to write them from scratch.  Prepending or appending to a deque also takes O(1) time.  Appending to a vector takes O(1) amortized time. Amortized time is the time averaged over many append operations.</p>
<p>Here's a speed test you might want to try.  Construct a container, append n items, walk the container once, and destroy it.  Here's the code:</p>
<pre>template&lt;typename Container&gt;
int Iota( int n ) {
    Container container;
    for( int i=0; i&lt;n; ++i )
        container.push_back(i);
    int sum = 0;
    for( typename Container::const_iterator j=container.begin(); j!=container.end(); ++j )
        sum += *j;
    return sum;
}</pre>
<p>I tried this fragment on a Linux box and found that std::deque was slightly faster than std::list when n&gt;=3, and std::vector was slightly faster when n&gt;=10.  When n&gt;=100, std::deque was more than <em>10x </em>faster than std::list, and even std::vector was more than 3x faster than std::list.  So for very short collections, std::list might pay off.  But for big collections, it's second-rate.</p>
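<p>If you want to repeat the experiment, a timing harness along these lines should do. It is a sketch, not the harness used for the numbers above: Timing and time_iota are hypothetical names, std::chrono comes from later C++ standards, and the Iota template is repeated so the fragment compiles on its own. Results will vary with compiler, optimization level and machine.</p>

```cpp
#include <chrono>
#include <deque>
#include <list>
#include <vector>

// Same test as above: construct, append n items, walk once, destroy.
template <typename Container>
int Iota(int n) {
    Container container;
    for (int i = 0; i < n; ++i)
        container.push_back(i);
    int sum = 0;
    for (typename Container::const_iterator j = container.begin();
         j != container.end(); ++j)
        sum += *j;
    return sum;
}

struct Timing {
    double seconds_per_call;  // average wall-clock time of one Iota call
    long long checksum;       // sum of all results, so the work cannot be discarded
};

// Times `repetitions` calls of Iota<Container>(n) with a monotonic clock.
template <typename Container>
Timing time_iota(int n, int repetitions) {
    using clock = std::chrono::steady_clock;
    long long checksum = 0;
    const clock::time_point start = clock::now();
    for (int r = 0; r < repetitions; ++r)
        checksum += Iota<Container>(n);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return Timing{elapsed.count() / repetitions, checksum};
}
```

<p>Call time_iota with std::list&lt;int&gt;, std::deque&lt;int&gt; and std::vector&lt;int&gt; for a range of n, using enough repetitions that each measurement runs well above the clock resolution, and print or otherwise use the checksum so the compiler cannot optimize the loop away.</p>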
<p>Of course linked lists do have some virtues, notably when concatenating lists, splicing lists, and inserting items in the middle.  I use lists when I need to do that.  But getting back to parallelism, which of those operations make any sense in parallel programming?  Concurrent splicing and inserting seem awfully tricky to use correctly.  For example, if I really need to insert in the middle of the list, it must be because there is something special about the insertion context.  But if there are other threads inserting at the same time, how do I know the context will not be broken?</p>
<p>The two operations on lists that I think could be useful in parallel programming are:</p>
<ol>
<li> concatenating two lists, in constant time</li>
<li>splitting a list into two sublists, or at least viewing it as two sublists, in constant time</li>
</ol>
<p>For example, a parallel reduction could use "concatenate" as its reduction operation, and thus build a list of N items in O(N/P+log(P)) time.  The log(P) term arises from a tree reduction at the end.  The problem is the second operation.  To keep a list from becoming a serial bottleneck, we need a way to traverse it in parallel.  That probably means it is no longer a linked list, but some kind of (balanced?) tree structure.</p>
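<p>The first operation already exists in std::list: splicing a whole list onto the end of another takes constant time. The sketch below uses it to build a list in parallel, assuming P plain std::thread workers. The function name build_list is hypothetical, and for brevity the final merge is a serial loop of P splices rather than the log(P) tree reduction described above.</p>

```cpp
#include <list>
#include <thread>
#include <vector>

// Builds the list 0, 1, ..., n-1 using p worker threads.
std::list<int> build_list(int n, unsigned p) {
    std::vector<std::list<int>> parts(p);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < p; ++t) {
        workers.emplace_back([&parts, n, p, t] {
            // Worker t appends its contiguous block of about n/p items.
            const int begin = static_cast<int>(static_cast<long long>(n) * t / p);
            const int end = static_cast<int>(static_cast<long long>(n) * (t + 1) / p);
            for (int i = begin; i < end; ++i)
                parts[t].push_back(i);
        });
    }
    for (std::thread& w : workers) w.join();

    // Reduction with "concatenate": each whole-list splice is O(1)
    // and leaves the source list empty.
    std::list<int> result;
    for (std::list<int>& part : parts)
        result.splice(result.end(), part);
    return result;
}
```

<p>Note that the result is still an ordinary linked list, so walking it afterwards is serial again. That is the second operation's problem in miniature.</p>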
<p>I've had a recurring thought that we should add this kind of list, one that supports concatenation and splitting in constant time.  But we really need motivating use cases before implementing it.  Suggestions for good use cases or demos appreciated.</p>
<p>- Arch Robison</p>
<p>P.S. I thought about writing this blog as a political attack ad, but given the technical details, decided against it.  If I had done it that way, it would have started:</p>
<blockquote><p>Mr. Linked List is running for office. He's popular everywhere. But here's what Mr. Linked List doesn't want <em>you</em> to know... .</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Volatile: Almost Useless for Multi-Threaded Programming</title>
		<link>http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/#comments</comments>
		<pubDate>Fri, 30 Nov 2007 20:44:49 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/</guid>
		<description><![CDATA[There is a widespread notion that the keyword volatile is good for multi-threaded programming. I've seen interfaces with volatile qualifiers justified as "it might be used for multi-threaded programming". I thought it was useful until the last few weeks, when it finally dawned on me (or if you prefer, got through my thick head) that volatile [...]]]></description>
			<content:encoded><![CDATA[<p>There is a widespread notion that the keyword <code>volatile</code> is good for multi-threaded programming. I've seen interfaces with volatile qualifiers justified as "it might be used for multi-threaded programming". I thought it was useful until the last few weeks, when it finally dawned on me (or if you prefer, got through my thick head) that volatile is almost useless for multi-threaded programming. I'll explain here why you should scrub most of it from your multi-threaded code.</p>
<p>Hans Boehm points out that there are only <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html">three portable uses for volatile</a>. I'll summarize them here:</p>
<ul>
<li>marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp.</li>
<li>memory that is modified by an external agent or appears to be because of a screwy memory mapping</li>
<li>signal handler mischief</li>
</ul>
<p>None of these mention multi-threading. Indeed, Boehm's paper points to a 1997 <a href="http://groups.google.com/group/comp.programming.threads/browse_frm/thread/399797d84a5c37d5/eb60e71097dd5755">comp.programming.threads discussion </a>where two experts said it bluntly:</p>
<blockquote><p>"Declaring your variables volatile will have no useful effect, and will simply cause your code to run a *lot* slower when you turn on optimisation in your compiler." - Bryan O' Sullivan</p></blockquote>
<blockquote><p>"...the use of volatile accomplishes nothing but to prevent the compiler from making useful and desirable optimizations, providing no help whatsoever in making code "thread safe". " - David Butenhof</p></blockquote>
<p>If you are multi-threading for the sake of speed, slowing down code is definitely not what you want. For multi-threaded programming, there are two key issues that volatile is often mistakenly thought to address:</p>
<ol>
<li>atomicity</li>
<li>memory consistency, i.e. the order of a thread's operations as seen by another thread.</li>
</ol>
<p>Let's deal with (1) first. Volatile does <em>not</em> guarantee atomic reads or writes. For example, a volatile read or write of a 129-bit structure is not going to be atomic on most modern hardware. A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it. It would likely be atomic without the volatile. The atomicity is at the whim of the compiler. There's nothing in the C or C++ standards that says it has to be atomic.</p>
<p>Now consider issue (2). Sometimes programmers think of volatile as turning off optimization of volatile accesses. That's largely true in practice. But that's only the volatile accesses, not the non-volatile ones. Consider this fragment:</p>
<pre>
    volatile int Ready;
    int Message[100];
    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }</pre>
<p>It's trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with "gcc -O2 -S" using gcc 4.0, or icc. Both will do the store to Ready <em>first</em>, so it can be overlapped with the computation of i/10. The reordering is <em>not</em> a compiler bug. It's an aggressive optimizer doing its job.</p>
<p>You might think the solution is to mark all your memory references volatile. That's just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it <em>might not fix the problem</em>. Even if the compiler does not reorder the references, the hardware might. In this example, x86 hardware will not reorder it. Neither will an Itanium(TM) processor, because Itanium compilers insert memory fences for volatile stores. That's a clever Itanium extension. But chips like Power(TM) will reorder. What you really need for ordering are <em>memory fences</em>, also called <em>memory barriers</em>. A memory fence prevents reordering of memory operations across the fence, or in some cases, prevents reordering in one direction. Paul McKenney's article <a href="http://www.linuxjournal.com/article/8211">Memory Ordering in Modern Microprocessors</a> explains them. Sufficient for discussion here is that volatile has nothing to do with memory fences.</p>
<p>So what's the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:</p>
<ul>
<li>POSIX threads</li>
<li>Windows(TM) threads</li>
<li>OpenMP</li>
<li>TBB</li>
</ul>
<p>For example, the parallel reduction template in TBB does all the right fences so you don't have to worry about them.</p>
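<p>As an illustration of what "atomic and fence semantics" buy you, here is the Ready/Message fragment rewritten with the std::atomic interface from the C++11 standard, which postdates this post and is used only as an example of such a library. The function names producer, consumer and send_and_receive are hypothetical. The store-with-release and load-with-acquire pair is what orders the write to Message before the reader's read of it.</p>

```cpp
#include <atomic>
#include <thread>

int Message[100];
std::atomic<int> Ready{0};   // replaces "volatile int Ready"

// Writer: fill in the message, then publish it.
void producer(int i) {
    Message[i / 10] = 42;
    // Release: the store to Message cannot be reordered after this store,
    // by the compiler or by the hardware.
    Ready.store(1, std::memory_order_release);
}

// Reader: wait for the flag, then read the message.
int consumer(int i) {
    // Acquire: the read of Message cannot be reordered before this load.
    while (Ready.load(std::memory_order_acquire) == 0)
        std::this_thread::yield();
    return Message[i / 10];
}

// Runs the reader on a second thread and the writer on the calling thread.
int send_and_receive(int i) {
    Ready.store(0, std::memory_order_relaxed);
    int received = 0;
    std::thread reader([&received, i] { received = consumer(i); });
    producer(i);
    reader.join();
    return received;
}
```

<p>Only the flag is atomic. Message stays a plain array, and the compiler remains free to optimize accesses to it everywhere except across the release and acquire operations.</p>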
<p>I spent part of this week scrubbing <code>volatile</code> from the TBB task scheduler. We were using volatile for memory fences because version 1.0 targeted only x86 and Itanium. For Itanium, volatile did imply memory fences. And for x86, we were just using one compiler, and catering to it. All atomic operations were in the binary that we compiled. But now with the open source version, we have to pay heed to other compilers and other chips. So I scrubbed out the volatile qualifiers, replacing them with explicit load-with-acquire and store-with-release operations, or in some cases plain loads and stores. Those operations themselves are implemented using volatile, but that's largely for Itanium's sake.  Only one volatile remained, ironically on an unshared local variable! See file src/tbb/task.cpp in the latest download if you're curious about the oddball survivor.<br />
- Arch</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Supercomputing '07: Computer Environment and Evolution</title>
		<link>http://software.intel.com/en-us/blogs/2007/11/17/supercomputing-07-computer-environment-and-evolution/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/11/17/supercomputing-07-computer-environment-and-evolution/#comments</comments>
		<pubDate>Sun, 18 Nov 2007 04:50:51 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/11/17/supercomputing-07-computer-environment-and-evolution/</guid>
		<description><![CDATA[ I'm back from Supercomputing '07.  Gone is the heyday of wacky hyper-dimensional topologies and strange new architectures like the Connection Machine. Clusters built from commodity parts have become the ubiquitous denizens of the High Performance Computing (HPC)  ecological niche. The Cambrian explosion seems to be over.
The environment directs evolution.  It was [...]]]></description>
			<content:encoded><![CDATA[<p> I'm back from Supercomputing '07.  Gone is the heyday of wacky hyper-dimensional topologies and strange new architectures like the Connection Machine. Clusters built from commodity parts have become the ubiquitous denizens of the High Performance Computing (HPC)  ecological niche. The Cambrian explosion seems to be over.</p>
<p>The environment directs evolution.  It was good to walk around the exhibits and be reminded of how different the HPC environment is from desktop parallelism, and how TBB differs somewhat in its focus compared to classic HPC.</p>
<ul>
<li> HPC applications run on million-dollar machines.  In this environment, owners are willing to do a lot of code rewriting, and use labor-intensive techniques like message-passing to get good speedup on thousands of cores.  In a former life, I had fun doing that on a 256 node hypercube.  TBB, in contrast, targets desktop machines, where programmer time is at a premium.   TBB perhaps gives up a little performance short of optimal so you don't have to write message-passing code.</li>
<li>The high machine costs cause some strange cross-breeding.  E.g., Los Alamos' latest big machine "Roadrunner" marries AMD Opterons to IBM Cell processors. (Seems  more like a platypus than a bird to me.) Desktop software is typically expected to run on many desktops; its authors have neither the time nor the money to port it to the odd architecture du jour.</li>
<li>HPC codes seem to focus on crunching large regular data sets.  Exemplars are finite-element codes and fluid solvers.  Desktop software involves a lot more than marching an army in a straight line.  For example, my standard example of an application that I would like to speed up is program compilation.  Sure, you can get some speedup with parallel make, but why not parallelize the guts of the compiler?  Likewise a parallel <em>linker </em>would be really useful.</li>
<li>HPC codes run on dedicated machines, unperturbed by other stuff running on them like web browsers, spreadsheets, and video.  In such a pristine environment, the HPC programmers can orchestrate things in detail.  The desktop, in contrast, is a less controlled environment where many programs vie for processor resources.  So TBB focuses on techniques that enable automatic load balancing even if the machine is perturbed.</li>
<li>The Fortran dinosaur roams freely in HPC land.  Something like TBB is not practical with C or Fortran, because TBB depends upon compile-time polymorphism.</li>
</ul>
<p>Though the supercomputers seemed dully alike, there was a corner with "<a href="http://sc07.supercomputing.org/index.php?pg=disrupttech.html">disruptive technologies</a>".  Any of these could radically change the environment, and thus future evolution of computers. Four of them were based on using different physics than traditional computers:</p>
<ul>
<li><a href="http://www.nantero.com/" title="Nantero"> Nantero</a> is working on technology for building non-volatile memories out of nanotubes.  If it pans out, it could enable radically faster and denser non-volatile memories than what we get with flash.  Of course it's the newcomer versus well tuned established processes.  (Anyone remember how gallium arsenide was going to take over?)</li>
<li><a href="http://www.dwavesys.com/"> D-Wave</a> is working on quantum computing.  They are not trying to build a universal computer, but rather an accelerator for certain NP-complete problems like integer linear programming and 3SAT.</li>
<li> IBM Research was showing off an <a href="http://sc07.supercomp.org/schedule/event_detail.php?evid=11167">optical circuit board</a>, with one optical layer printed with a transparent polymer.  That could radically improve memory bandwidth.</li>
<li><a href="http://www.luxtera.com/">Luxtera</a> is making silicon photonics cheaper.  They were demonstrating 40 Gigabit/sec cables. The ends had electrical connections, but the cable itself was optical.</li>
</ul>
<p>It will be interesting to see which of these technologies survive and how they mutate, and which ones disrupt the ecosystem.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/11/17/supercomputing-07-computer-environment-and-evolution/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Have a Fish - How to Break from a Parallel Loop in TBB</title>
		<link>http://software.intel.com/en-us/blogs/2007/11/08/have-a-fish-how-break-from-a-parallel-loop-in-tbb/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/11/08/have-a-fish-how-break-from-a-parallel-loop-in-tbb/#comments</comments>
		<pubDate>Thu, 08 Nov 2007 15:57:29 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/11/08/have-a-fish-how-break-from-a-parallel-loop-in-tbb/</guid>
		<description><![CDATA["Give people a fish, and you feed them for a day. Teach people how to fish, and you feed them forever."
This blog does both. A recurring question from TBB users is how to break from a parallel loop. This blog shows one way to do it, by writing a new kind of range type. It's not a perfect solution, but [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>"Give people a fish, and you feed them for a day. Teach people how to fish, and you feed them forever."</p></blockquote>
<p>This blog does both. A recurring question from TBB users is how to break from a parallel loop. This blog shows one way to do it, by writing a new kind of range type. It's not a perfect solution, but has two merits:</p>
<ul>
<li>"Teach": It can be understood without knowing the internals of the task scheduler. The TBB development team is investigating more general approaches for cancellation and exceptions, but these involve significant changes to the scheduler's internals. What I'll present here only requires understanding the Range concept in TBB.</li>
<li>"Give": It can be cut and pasted from this blog, and used now.</li>
</ul>
<p>Also, fishermen like to show off their fish. My example shows off how the TBB loop templates are more general than they first appear. By contrast, OpenMP parallel loops cannot be terminated early like this.</p>
<p>The basic trick is to build a TBB recursive range that collapses when the program decides to break the loop. Any thread working on a subrange of the loop will suddenly see an empty subrange and quit. Better yet, the entire parallel_for invocation will quit early.</p>
<p>A recursive range in TBB is any type R that provides the following:</p>
<ul>
<li>A copy constructor</li>
<li>A destructor</li>
<li>A method R.empty() that tells whether the range is empty.</li>
<li>A method R.is_divisible() that tells whether the range can be subdivided.</li>
<li>A "splitting constructor" that splits the range into two subranges.</li>
</ul>
<p>I've called it "cancelable_range". It's like a blocked_range, but with the cancellation feature added. Cancellation is implemented as a reference to a shared boolean flag. When the flag becomes true, all cancelable_range objects that share the flag act as if they became empty.  Here's the complete code for a cancelable range. It's a bargain at only 18 lines of code.</p>
<blockquote>
<pre>
//! Like a blocked_range, but becomes immediately empty if "stop" flag is true.
template&lt;typename Value&gt;
class cancelable_range {
    tbb::blocked_range&lt;Value&gt; my_range;
    volatile bool&amp; <strong>my_stop</strong>;
public:
    // Constructor for client code
    /** Range becomes empty if stop==true. */
    cancelable_range( Value begin, Value end, std::size_t grainsize, volatile bool&amp; stop ) :
        my_range(begin,end,grainsize),
        my_stop(stop)
    {}
    //! Splitting constructor used by parallel_for
    cancelable_range( cancelable_range&amp; r, tbb::split ) :
        my_range(r.my_range,tbb::split()),
        my_stop(r.my_stop)
    {}
    //! Cancel the range.
    void cancel() const {<strong>my_stop</strong>=true;}
    //! True if range is empty.
    /** Range is empty if there is request to cancel the range. */
    bool empty() const {return <strong>my_stop</strong> || my_range.empty();}
    //! True if range is divisible
    /** Range becomes indivisible if there is request to cancel the range. */
    bool is_divisible() const {return !<strong>my_stop</strong> &amp;&amp; my_range.is_divisible();}
    //! Initial value in range.
    Value begin() const {return my_range.begin();}
    //! One past last value in range
    /** Note that end()==begin() if there is request to cancel the range.
        The value of end() may change asynchronously if another thread cancels the range. **/
    Value end() const {return <strong>my_stop</strong> ? my_range.begin() : my_range.end();}
};</pre>
</blockquote>
<p>Look at the occurrences of <strong>my_stop</strong> to see the key workings. I've only tested it with parallel_for, using the following example:</p>
<blockquote>
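<pre>#include &lt;cstdio&gt;
#include "tbb/atomic.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
// cancelable_range is the class template defined above.

//! Set to value that we are looking for.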
tbb::atomic&lt;int&gt; ValueThatWeFound;
//! Some random predicate.
/** For demonstration purposes, the predicate is trivial.  In real life, this should
    be something that takes some significant time. */
bool ShouldBreakFromLoop( int i ) {
    return i==1234567;
}
//! Loop body, as a function object.
struct Body {
    void operator()( const cancelable_range&lt;int&gt;&amp; r ) const {
        // Iterate over subrange.  It is important that "&lt;" be used for comparison,
        // because the value of r.end() changes to r.begin() if r is cancelled.
        for( int i=r.begin(); <strong>i&lt;r.end()</strong>; ++i ) {
            // Do test for whether we want to break from loop early.
            if( ShouldBreakFromLoop(i) ) {
                // Cancel the range
                <strong>r.cancel();</strong>
                // Record the value found.
                // If two values are found by different threads, no harm is done,
                // because we are storing into an atomic&lt;int&gt;.
                ValueThatWeFound = i;
            }
        }
    }
};
int main() {
    tbb::task_scheduler_init init;
    ValueThatWeFound = -1;
    bool stop = false;
    tbb::parallel_for( cancelable_range&lt;int&gt;(0,10000000,1,stop), Body() );
    std::printf("ValueThatWeFound=%d\n",int(ValueThatWeFound));
    return 0;
}</pre>
</blockquote>
<p>This approach takes O(lg N) time to cancel. The reason is that parallel_for traverses a tree of subranges, in a parallel depth-first manner. With P threads, each thread will have some O(lg N) size path down the tree. When the stop flag is set by any of the threads, each thread will march up its path, ceasing any further downward traversal. The O(lg N) time is much better than if we used a blocked_range and simply made the body poll the flag, because the full depth-first traversal would still occur.</p>
<p>In this example, each loop iteration tests against r.end(). This has the benefit of stopping a cancelled loop quickly, because when the loop is cancelled, r.end() becomes the same as r.begin(). It does incur the cost of a conditional branch (the ? in method end()) on each loop iteration. If each loop iteration is quick, this might be a relatively high cost to pay. I have not measured it; it could be small because branch predictors do wonderful things these days. But if the cost is high, you can capture r.end() in a temporary variable before entering the loop, and avoid the cost. But then the subrange will be run to completion. Take your pick of cost-per-iteration versus promptness of cancellation. There's no free lunch here. </p>
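<p>To make the trade-off concrete, here is a small single-threaded sketch of the two loop styles. It does not use TBB: <code>demo_range</code> is a made-up stand-in that mimics only <code>begin()</code>, <code>end()</code> and <code>cancel()</code> of <code>cancelable_range&lt;int&gt;</code>, and both functions simply count how many iterations run when the loop cancels itself at <code>target</code>.</p>

```cpp
// Hypothetical stand-in for cancelable_range<int>; no splitting, no TBB.
struct demo_range {
    int my_begin, my_end;
    volatile bool& my_stop;
    demo_range( int b, int e, volatile bool& stop ) :
        my_begin(b), my_end(e), my_stop(stop) {}
    void cancel() const { my_stop = true; }
    int begin() const { return my_begin; }
    // Collapses to begin() once the stop flag is set.
    int end() const { return my_stop ? my_begin : my_end; }
};

// Style 1: test against r.end() on every iteration.
// Pays for the conditional each time, but stops right after cancellation.
int count_polling( const demo_range& r, int target ) {
    int visited = 0;
    for( int i=r.begin(); i<r.end(); ++i ) {
        ++visited;
        if( i==target ) r.cancel();
    }
    return visited;
}

// Style 2: capture r.end() once before the loop.
// Cheaper per iteration, but the subrange runs to completion.
int count_cached( const demo_range& r, int target ) {
    int visited = 0;
    const int end = r.end();
    for( int i=r.begin(); i<end; ++i ) {
        ++visited;
        if( i==target ) r.cancel();
    }
    return visited;
}
```

<p>On the range [0,100) with cancellation at i==9, the polling style visits 10 iterations and the cached style visits all 100. In a real parallel_for the cached style still benefits from cancellation, because other subranges collapse; only the subrange already in flight runs to its end.</p>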
<p>One performance warning: If the flag "stop" is on a cache line that is being written frequently during the loop, the cache will thrash. Put padding around the flag if you are paranoid. E.g., use an array of type bool[128], and pass in element [64] as the flag. Then the flag will surely be on a separate cache line if the cache line size is 64 (common on recent Intel processors). But in practice, the padding is unnecessary in most cases.</p>
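<p>If you prefer something more self-describing than indexing into a bool[128], the same padding idea can be sketched as a small struct. This is only an illustration: <code>padded_flag</code> and <code>CacheLineSize</code> are names I made up, and the 64-byte line size is an assumption that you should check for your target.</p>

```cpp
#include <cstddef>

// Assumed cache line size in bytes; common, but not universal.
const std::size_t CacheLineSize = 64;

// Hypothetical helper: a stop flag with a full cache line of padding on
// each side, so no neighboring object can share the flag's cache line.
struct padded_flag {
    char pad_before[CacheLineSize];
    volatile bool value;
    char pad_after[CacheLineSize];
    padded_flag() : value(false) {}
};
```

<p>A <code>padded_flag stop;</code> would then be passed to the range as <code>stop.value</code>. The padding costs a little over 128 bytes per flag, which is trivial for a flag that exists once per loop.</p>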
<p>Enjoy the fish.</p>
<p>- Arch</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/11/08/have-a-fish-how-break-from-a-parallel-loop-in-tbb/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Boo! Three Parallelism Bogeymen for Halloween</title>
		<link>http://software.intel.com/en-us/blogs/2007/10/31/boo-three-parallelism-bogeymen-for-halloween/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/10/31/boo-three-parallelism-bogeymen-for-halloween/#comments</comments>
		<pubDate>Wed, 31 Oct 2007 17:00:16 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/10/31/boo-three-parallelism-bogeymen-for-halloween/</guid>
		<description><![CDATA[If you are trick-or-treating tonight, and visiting the house of a parallel programming researcher, just dress up as one of the bogeymen listed here and you'll give them a good scare!  Seriously, in discussions of new languages for parallel programming, pundits trot out various bogeymen, declare them evil, and imply that by removing them, [...]]]></description>
			<content:encoded><![CDATA[<p>If you are trick-or-treating tonight, and visiting the house of a parallel programming researcher, just dress up as one of the bogeymen listed here and you'll give them a good scare!  Seriously, in discussions of new languages for parallel programming, pundits trot out various bogeymen, declare them evil, and imply that by removing them, parallel programs will be easy to write and debug.   Here are three frequent bogeymen, and why I'm not scared of them.</p>
<p><strong>Pointers:</strong>  C++ pointers are a bogeyman, particularly in academic circles.   From a parallelism perspective, the problem is <em>not </em>pointers or pointer arithmetic.   They are a wonderful abstraction of addresses.   The real problems with pointers are abusive casting, such as <code>((void*)42)</code>, and aliasing. I.e., in <code>*p = *q</code>, do p and q refer to the same location? But aliasing is a problem even in good old Fortran; e.g. in <code>A(I) = A(J)</code>, are I and J the same? Pointers are frequently blamed as the root of all evil, yet basic context-insensitive alias analyzers get the pointer ambiguity problems down to the same level as in Fortran. So the real problem is aliasing. Don't blame pointers. No matter what the language is, programmers will still need linked data structures, so the equivalent of pointers will be around. For parallelism, the solution is not going to be to eliminate pointers, but to provide better information about how they are being used, or to constrain their use via a straitjacket.  Take a look at <a href="http://cyclone.thelanguage.org/">http://cyclone.thelanguage.org/</a> for how pointers can be reformed without hanging them.</p>
<p><strong>Destructive assignment:</strong> Another favorite, because destructive assignments introduce dependences that constrain parallelism.   But destructive assignment is the key to efficient execution.  Recycling a resource instead of cloning it often yields big gains.  Think of big arrays.   The related "histogram" problem is notorious in the functional programming community.  Take a sequence of N integers between 0 and M, where N is much larger than M.  The problem is to build a histogram that maps each integer to the number of times it has occurred.  In a language with destructive assignment, the problem has a simple O(N) solution.  In functional languages, the problem requires time O(N lg N), or ad-hoc primitives.   Destructive assignment is here to stay.  Like pointers, the solution will be to constrain destructive update, or indicate when recycling can be replaced by cloning.</p>
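<p>For the record, the simple solution looks something like the sketch below. It assumes the values lie in the half-open interval [0, m); the function name and signature are mine, chosen only for illustration.</p>

```cpp
#include <cstddef>
#include <vector>

// Counts occurrences of each value, assuming every value lies in [0, m).
// One pass over the input, so O(N) time plus O(M) to zero the counters.
std::vector<std::size_t> histogram( const std::vector<int>& values, std::size_t m ) {
    std::vector<std::size_t> counts( m, 0 );
    for( std::size_t i=0; i<values.size(); ++i )
        ++counts[values[i]];    // destructive update: the counter is overwritten in place
    return counts;
}
```

<p>The whole algorithm hinges on the in-place increment. Without it, each update has to build a new version of the table, which is where the extra lg N factor tends to come from.</p>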
<p><strong>Shared mutable state:</strong>  There's a camp that believes that parallel programming becomes easy if shared mutable state is eliminated. I'll claim that even if the programming language does not allow shared mutable state at the physical level, programmers will create it anyway at the abstraction level. I've done it myself once, implementing a tuple space on top of a distributed memory machine. Shared mutable objects are a powerful abstraction; they are not going away. The real problem is non-determinism. Somehow sharing needs to be constrained to eliminate unexpected nondeterminism.</p>
<p>It's easy to blame a few low-level language features for all that ails parallel programming.  But they are fundamental to practical programming.   Instead of treating the features as bogeymen to be banished, let's civilize them for parallel programming.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/10/31/boo-three-parallelism-bogeymen-for-halloween/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Retooling Exceptions for Parallelism - Part 2</title>
		<link>http://software.intel.com/en-us/blogs/2007/10/22/retooling-exceptions-for-parallelism-part-2/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/10/22/retooling-exceptions-for-parallelism-part-2/#comments</comments>
		<pubDate>Mon, 22 Oct 2007 17:44:27 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/10/22/retooling-exceptions-for-parallelism-part-2/</guid>
		<description><![CDATA[My previous blog discussed exceptions as alternative control flow versus exceptions as alternative data values.  Here I'll take that notion further and sketch how I think the TBB task scheduler should deal with exceptions.
A quick review of the TBB task scheduler. It's the low-level engine that drives the high-level templates like parallel_for. The high-level [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism-5/%22">previous blog</a> discussed exceptions as alternative control flow versus exceptions as alternative data values.  Here I'll take that notion further and sketch how I think the TBB task scheduler should deal with exceptions.</p>
<p>A quick review of the TBB task scheduler. It's the low-level engine that drives the high-level templates like <code>parallel_for</code>. The high-level templates are designed for convenience. The task scheduler is designed for speed, sometimes at the expense of convenience. It's <em>low-level</em>, after all, and we didn't want to penalize speed across the board for some convenience that only a few clients might use. Users define a task's work by overriding the virtual method <code>task::execute()</code>.<br />
Currently, if <code>execute()</code> throws an exception, the effect is undefined. "Undefined", as typically used in standards parlance, means anything can happen. Like crash your program or beam flying pigs into your room. So up to now, it's been the programmer's job to wrap the body of <code>execute()</code> in try-catch if exception safety is an issue. We've gotten negative comments on this.  Some programmers want exception safety.  Others claim exception safety is so complex that the cure is worse than the disease, and leads to a <a href="http://www.informit.com/content/images/020163371x/supplements/Exception_Handling_Article.html">false sense of security</a>. But nonetheless, since C++ offers the option of being exception safe, we should give users better support for making TBB tasks exception safe.<br />
Exceptions are a difficult subject, even for sequential programs. Exceptions were a late addition to C++, presented in Stroustrup's Annotated Reference Manual (ARM) as an experimental feature.  I implemented most of the exception optimizations in KAI C++, and so am familiar with the difficulties.  That said, the rest of this blog proposes how to improve exception-handling support in the TBB task scheduler.<br />
There are two patterns to consider, blocking style and continuation-passing style.  Let's start with blocking style, because it is easier to understand. Suppose task C has spawned two tasks A and B, like this:</p>
<pre>
      C
     / \
    A   B</pre>
<p>If A throws an exception, C's pending "wait" should act as if it threw the exception, <em>after</em> all children have completed. It's tempting to say that B should be cancelled immediately. But asynchronous cancellation invites a lot of problems. Having cancellation points, as in pthreads, might be a solution. For now, I'd like to keep things simple. Cancellation usually requires some high level coordination, something above what a low-level task is capable of. Maybe later TBB can add a notification hook to class task so that C can be notified that an exception is coming its way, and C can do the cancellation. But that's an additional complex frill to be added later [2].</p>
<p>If <em>both</em> A and B throw an exception, then one of the exceptions should be chosen. In principle, both could be glued together into some kind of list-of-exception object, but I doubt that's of much help to real code, and it is probably a gratuitous resource hog.  For example, remembering the exceptions from all 1000 iterations of a loop seems like overkill.  I'm inclined to let the first exception detected propagate, and suppress any later exceptions that collide with it in the task tree.<br />
Continuation-passing style is trickier. Consider the earlier picture, but where C is <em>not</em> currently waiting, because it is a continuation task that will start running <em>after</em> tasks A and B complete. Suppose A and B complete, but at least one threw an exception. C.execute() is not yet running, so there is no obvious point from which the exception should be thrown. The right thing to do is to skip <code>C.execute()</code> and simply destroy C. Maybe later TBB can have notification hooks so that C can catch the exception. The hook would be a virtual method, say <code>task::handle_exception()</code>, that would be called instead of <code>execute()</code> when a child throws an exception. But to keep the first version simple, let's omit that. The <a href="http://www.artima.com/intv/modern3.html">Resource Acquisition Is Initialization</a> (RAII) idiom probably suffices for cleanup in common cases. Let the destructor of C do any necessary cleanup.<br />
Is this a plausible design? Since the task scheduler is the engine that drives the high-level templates, we have to ask if the proposal enables us to write an exception-safe <code>parallel_for</code>. I think it does.  Below is the code where the TBB 2.0 <code>parallel_for</code> does its work. I've added comments.</p>
<pre>
     template&lt;typename Range, typename Body, typename Partitioner&gt;
     task* start_for&lt;Range,Body,Partitioner&gt;::execute() {
         if( my_partitioner.should_execute_range(my_range, *this) ) {
             <em>// Base case of recursion</em>.
             my_body( my_range );
             return NULL;
         } else {
             <em>// Recursive case.  Build continuation task c.</em>
             empty_task&amp; c = *new( allocate_continuation() ) empty_task;
             <em>// Child </em>a<em> is the recycled this, and hence is not named </em>a<em> explicitly.</em>
             recycle_as_child_of(c);
             c.set_ref_count(2);
             <em>// Create and spawn child </em>b<em> </em>
             start_for&amp; b = *new( c.allocate_child() ) start_for(Range(my_range,split()),my_body,Partitioner(my_partitioner,split()));
             c.spawn(b);
             <em>// Child </em>a<em> bypasses scheduler.</em>
             return this;
         }
     }</pre>
<p>It is <strong>NOT</strong> exception-safe under the proposal.  In particular, if the line with <code>allocate_child()</code> throws an exception, child b is not created. This would leave <code>c.ref_count==2</code>, but with only one child (the recycled <code>this</code>).  Furthermore, even that child will never execute, because the code is requesting execution via scheduler bypass (<code>return this</code>). <br />
But the proposal gives us the option to make it exception safe, by rewriting the else part like this: </p>
<pre>
             empty_task&amp; c = *new( allocate_continuation() ) empty_task;
             start_for* b;
             try {
                 b = new( c.allocate_child() ) start_for(Range(my_range,split()),my_body,Partitioner(my_partitioner,split()));
             } catch( ... ) {
                 destroy(c);
                 throw;
             }
             recycle_as_child_of(c);
             c.set_ref_count(2);
             c.spawn(*b);
             return this;</pre>
<p>This fragment is exception-safe because:</p>
<ul>
<li>If <code>allocate_continuation()</code> throws an exception, then there<br />
is nothing to clean up.</li>
<li>If any code inside the try block throws an exception, then the catch<br />
clause cleans up c, and rethrows.</li>
<li>Code after the try block cannot throw an exception.</li>
</ul>
<p>This example shows that adding the exception-propagation mechanism to TBB does<br />
<em>not</em> automatically make code exception-safe, but at least enables<br />
it.  That should be no surprise, because the same is true of C++.<br />
- Arch</p>
<h2>Notes</h2>
<p>[1] C++ does not offer a way to capture an exception on one thread and rethrow it in another. Thus the mechanism described will have to be limited to rethrowing some kind of summary of the original exception, or limited to exceptions that have some kind of "movable" property. See comments in <a href="http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism-5/%22">previous blog</a>.<br />
[2] C++ exception handling typically operates in two phases:</p>
<ol>
<li>Searching the stack for an appropriate handler</li>
<li>Unwinding the stack to that handler</li>
</ol>
<p>In a parallel environment, I suspect that the first step should be done <em>before</em> all children complete, and the second step done <em>after</em> all children complete. When the appropriate handler is found, a "prehandler" should be invoked. The prehandler could do cancellation.<br />
[3] A task is automatically deleted after its method <code>execute()</code> returns. If such a task throws an exception, it makes sense to automatically delete it too. But a task is not automatically deleted if it has been subjected to a <code>recycle_</code>... method. That raises the question of what to do if <code>execute()</code> throws an exception <em>after</em> the task has been recycled. I think the answer is to automatically delete the task nonetheless, as if the recycling never happened. We'll need some real use cases to drive this decision.</p>
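<p>The blocking-style rule proposed above (let every child finish, propagate only the first exception, suppress the rest) can be modeled in a few lines. This is a sequential toy model, not scheduler code. It leans on <code>std::exception_ptr</code>, which arrived in C++11 and so sidesteps the limitation in note [1]; <code>child_fn</code>, <code>run_children_then_rethrow</code>, and the two sample children are hypothetical names.</p>

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a child task: any function that may throw.
typedef void (*child_fn)();

// Runs every child to completion, remembers only the first exception seen,
// and rethrows it after all children are done.
void run_children_then_rethrow( child_fn children[], int n ) {
    std::exception_ptr first;
    for( int i=0; i<n; ++i ) {
        try {
            children[i]();
        } catch( ... ) {
            // Keep the first exception; later ones are suppressed.
            if( !first ) first = std::current_exception();
        }
    }
    if( first ) std::rethrow_exception( first );
}

// Sample children for experimenting; both count themselves, then throw.
int children_run = 0;
void child_a() { ++children_run; throw std::runtime_error("A failed"); }
void child_b() { ++children_run; throw std::runtime_error("B failed"); }
```

<p>Calling it on <code>{child_a, child_b}</code> runs both children and then throws A's exception; B's is dropped. In a real scheduler "first" would mean first detected, which is not necessarily A.</p>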
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/10/22/retooling-exceptions-for-parallelism-part-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Retooling Exceptions for Parallelism</title>
		<link>http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism/#comments</comments>
		<pubDate>Wed, 17 Oct 2007 00:44:15 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		
		<category><![CDATA[Multicore]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism/</guid>
		<description><![CDATA[Exception handling is one of the big improvements of C++ over C.  C code that checks for erroneous or unusual conditions is littered with tests for those conditions, making programs harder to read.  Worse yet, the programmer might forget to check one of those conditions.  Exceptions eliminated most of those problems, albeit [...]]]></description>
			<content:encoded><![CDATA[<p>Exception handling is one of the big improvements of C++ over C.  C code that checks for erroneous or unusual conditions is littered with tests for those conditions, making programs harder to read.  Worse yet, the programmer might forget to check one of those conditions.  Exceptions eliminate most of those problems, albeit at the cost of opening up the new issue of exception safety.</p>
<p>Unfortunately, exceptions as practiced in C++ tend to force sequential execution of code.  Consider:</p>
<pre>    float a[N];
    float b[N];
    ...
    for( int i=0; i&lt;N; ++i )
        a[i] = f(b[i]);</pre>
<p>If f has no side effects, this loop can be trivially parallelized while retaining the net effect of sequential execution.  But if f throws an exception, the loop must terminate early, and is really a <em>while</em> loop as far as parallelization is concerned.  Such a loop cannot be parallelized in a scalable way without resorting to speculative parallelization, which tends to be expensive in running time and/or complexity of implementation.<br />
We’ve run into a similar issue while revising TBB’s <code>concurrent_vector&lt;T&gt;</code> to make it completely exception safe.  For a <code>concurrent_vector&lt;T&gt; x</code>, the call <code>x.grow_by(n)</code> should append n consecutive elements to x, and return the index of the first element appended.  Multiple threads can invoke grow_by in parallel.  But what happens if the constructor <code>T()</code> throws an exception for one of the threads? We now have a vector where some elements have been constructed and some have not.</p>
<p>One solution is to make an invocation of grow_by wait until the previous invocation completes.  That’s what the 2006 beta version of <code>concurrent_vector&lt;T&gt;</code> did.  But that turned out to be a big mistake, because it introduces a badly serializing bottleneck.</p>
<p>The root problem is that exceptions are modeled in C++ as alternative control flow.  But another option is to model exceptions as alternative data values.  Indeed, this is the way they started out in functional programming languages, e.g. Backus’ FP.  A well-known example of exceptions as alternative data values is the NaN in IEEE arithmetic.  We should investigate extending this model to C++.  For example, we can define a class <code>value_or_error&lt;T&gt;</code> that represents a T or a special error value.  Its constructor would never throw an exception.  Here’s a sketch of what the class might look like:</p>
<pre>    template&lt;typename T&gt;
    class value_or_error {
    public:
        //! True if value() is okay to use
        bool is_valid() const;
        //! Reference to value.  Valid only if is_valid() is true.
        T&amp; value() {
            if(!is_valid()) throw some-kind-of-exception;
            return <em>reference-to-internal-value-of-type-T</em>;
        }
        //! Const variant of above.
        const T&amp; value() const;
        //! Default constructor
        value_or_error() throw() {
            try {
                new( &amp;<em>internal-space-for-value</em> ) T;
                mark as valid
            } catch (...) {
                mark as invalid
            }
        }
        ~value_or_error() {
            if( is_valid() )
                (&amp;<em>internal-space-for-value</em>)-&gt;~T();
        }
        ... add other appropriate constructors and assignments...
    };</pre>
<p>A nice feature in this approach is that it automatically converts back to a control-flow exception if the client accesses value() without checking is_valid().  In that respect, it behaves somewhat like a signaling NaN in IEEE arithmetic.  Getting back to <code>concurrent_vector&lt;T&gt;</code>, we can advise users to do one of three things:</p>
<ol>
<li> Ensure that T() does not throw an exception</li>
<li> If not (1), ensure that <code>~T()</code> works correctly on zeroed memory.</li>
<li> If (1) and (2) are impractical, use a <code>concurrent_vector&lt;value_or_error&lt;T&gt; &gt;</code> instead of a <code>concurrent_vector&lt;T&gt;</code>.</li>
</ol>
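<p>Here is one way the sketch above might be filled in so that it compiles. Treat it as a minimal illustration under stated assumptions, not a proposed TBB class: it assumes C++11 (<code>alignas</code>, <code>noexcept</code>, <code>nullptr</code>), it throws <code>std::logic_error</code> where the sketch says "some-kind-of-exception", it disables copying instead of defining it, and <code>always_throws</code> is a made-up type for trying out the failure path.</p>

```cpp
#include <new>
#include <stdexcept>

template<typename T>
class value_or_error {
    // Raw, suitably aligned space for one T.
    alignas(T) unsigned char my_storage[sizeof(T)];
    // Points into my_storage if construction succeeded; null otherwise.
    T* my_value;
public:
    //! Default constructor.  Never throws; records failure instead.
    value_or_error() noexcept : my_value(nullptr) {
        try {
            my_value = new( my_storage ) T;
        } catch( ... ) {
            // Construction of T failed; my_value stays null (invalid).
        }
    }
    ~value_or_error() {
        if( my_value ) my_value->~T();
    }
    //! True if value() is okay to use.
    bool is_valid() const { return my_value != nullptr; }
    //! Reference to value.  Throws if the object holds an error.
    T& value() {
        if( !my_value ) throw std::logic_error("value_or_error holds no value");
        return *my_value;
    }
    // Copying is left out of this sketch.
    value_or_error( const value_or_error& ) = delete;
    value_or_error& operator=( const value_or_error& ) = delete;
};

// Hypothetical element type whose default constructor always fails.
struct always_throws {
    always_throws() { throw std::runtime_error("construction failed"); }
};
```

<p>With this, <code>value_or_error&lt;always_throws&gt; x;</code> constructs without throwing, <code>x.is_valid()</code> is false, and only a careless <code>x.value()</code> turns the error back into a control-flow exception. Note that the original exception is discarded here; a fuller version would need to decide whether to keep some record of it.</p>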
<p>I’d be interested to hear other people’s comments on the value_or_error<br />
approach.</p>
<p>- Arch</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/10/16/retooling-exceptions-for-parallelism/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
