<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Abstracting Thread Local Storage</title>
	<atom:link href="http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/feed/" rel="self" type="application/rss+xml" />
	<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/</link>
	<description></description>
	<pubDate>Sun, 20 Jul 2008 16:35:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Intel® Software Network Blogs &#187; Under the hood: Building hooks to explore TBB task scheduler</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-13172</link>
		<dc:creator>Intel® Software Network Blogs &#187; Under the hood: Building hooks to explore TBB task scheduler</dc:creator>
		<pubDate>Mon, 16 Jun 2008 19:14:32 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-13172</guid>
		<description>[...] do so requires the implementation of some Thread Local Storage, which I do with caution.  Reread Arch’s blog on the subject to understand the [...]</description>
		<content:encoded><![CDATA[<p>[...] do so requires the implementation of some Thread Local Storage, which I do with caution.  Reread Arch’s blog on the subject to understand the [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Arch Robison (Intel)</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12451</link>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		<pubDate>Fri, 09 May 2008 22:40:08 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12451</guid>
		<description>task_scheduler_observer will be part of a commercial release this summer.</description>
		<content:encoded><![CDATA[<p>task_scheduler_observer will be part of a commercial release this summer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick Turner</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12444</link>
		<dc:creator>Mick Turner</dc:creator>
		<pubDate>Fri, 09 May 2008 20:04:15 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12444</guid>
		<description>Hi Arch
That looks like it would do the trick. I haven't been using the open source version (please make a version of it that compiles as an MSVC project; I don't want to have to install all kinds of Unix-like things on my development machine).

Any idea when that feature is likely to make it into the commercial release?

Thanks

Mick</description>
		<content:encoded><![CDATA[<p>Hi Arch<br />
That looks like it would do the trick. I haven't been using the open source version (please make a version of it that compiles as an MSVC project; I don't want to have to install all kinds of Unix-like things on my development machine).</p>
<p>Any idea when that feature is likely to make it into the commercial release?</p>
<p>Thanks</p>
<p>Mick</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Arch Robison (Intel)</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12214</link>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		<pubDate>Thu, 01 May 2008 15:36:07 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12214</guid>
		<description>We recently added a feature to the open source version of TBB that might be just what you are looking for: task_scheduler_observer.

If you download the Reference Manual for the open source version from http://threadingbuildingblocks.org/documentation.php , and go to section 8.6, you'll find a description of task_scheduler_observer.  It lets you define an object that gets notification when TBB creates or destroys a thread.  If you create the object after TBB has already created a thread, you get notification before the thread does anything on behalf of the next parallel loop.  Section 8.6.5 gives this guarantee in a somewhat oblique way.

If this solves your case, I'd like to hear about it, because we've been looking for example use cases for task_scheduler_observer.

Arch</description>
		<content:encoded><![CDATA[<p>We recently added a feature to the open source version of TBB that might be just what you are looking for: task_scheduler_observer.</p>
<p>If you download the Reference Manual for the open source version from <a href="http://threadingbuildingblocks.org/documentation.php" rel="nofollow">http://threadingbuildingblocks.org/documentation.php</a> , and go to section 8.6, you'll find a description of task_scheduler_observer.  It lets you define an object that gets notification when TBB creates or destroys a thread.  If you create the object after TBB has already created a thread, you get notification before the thread does anything on behalf of the next parallel loop.  Section 8.6.5 gives this guarantee in a somewhat oblique way.</p>
<p>If this solves your case, I'd like to hear about it, because we've been looking for example use cases for task_scheduler_observer.</p>
<p>Arch</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick Turner</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12206</link>
		<dc:creator>Mick Turner</dc:creator>
		<pubDate>Thu, 01 May 2008 10:58:39 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-12206</guid>
		<description>I have another case where I want access to a thread id.
My application is database-like: user requests are distributed to worker threads that each manage their own large chunks of memory, so that when a user task finishes, all memory associated with it is wiped immediately, with no long-term issues such as fragmentation. This user task is now to be parallelised with TBB, so I need to make sure that memory allocation by a TBB task uses only the memory dedicated to the user task, but also uses thread-specific allocs to reduce locking overhead.
To reduce this to its simplest form, I need a 2D thread-specific memory allocator. I currently have 1D over my user task threads, and the allocator provided with TBB is 1D over all threads.
The only way I can think of to do this is for the TBB threads to have an index that can be used to provide the 2nd dimension.
I can implement this with TLS (I'm Windows only), but I then have to check my TLS index at the beginning of each task just in case the current thread is new in some way.
What would be neat is if each TBB thread invoked a constructor/destructor hook that could be replaced. Then all appropriate initialisation could be done in this hook, rather than constantly having to worry about threads being constructed or terminated.

Mick</description>
		<content:encoded><![CDATA[<p>I have another case where I want access to a thread id.<br />
My application is database-like: user requests are distributed to worker threads that each manage their own large chunks of memory, so that when a user task finishes, all memory associated with it is wiped immediately, with no long-term issues such as fragmentation. This user task is now to be parallelised with TBB, so I need to make sure that memory allocation by a TBB task uses only the memory dedicated to the user task, but also uses thread-specific allocs to reduce locking overhead.<br />
To reduce this to its simplest form, I need a 2D thread-specific memory allocator. I currently have 1D over my user task threads, and the allocator provided with TBB is 1D over all threads.<br />
The only way I can think of to do this is for the TBB threads to have an index that can be used to provide the 2nd dimension.<br />
I can implement this with TLS (I'm Windows only), but I then have to check my TLS index at the beginning of each task just in case the current thread is new in some way.<br />
What would be neat is if each TBB thread invoked a constructor/destructor hook that could be replaced. Then all appropriate initialisation could be done in this hook, rather than constantly having to worry about threads being constructed or terminated.</p>
<p>Mick</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mikedeskevich</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-10047</link>
		<dc:creator>mikedeskevich</dc:creator>
		<pubDate>Mon, 11 Feb 2008 23:12:34 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-10047</guid>
		<description>After reading more of these comments, maybe it's better to have a way for TBB to return a CoreID, ThreadID, and any other kind of identifier, and just let the programmer hash that as necessary for his or her own use.  I'm taking back some of my earlier comments about trying to abstract everything; sometimes you do need low-level access.  That's the beauty of C++: you get everything in C for free, plus more.  Maybe TBB could be the same way: you get all the power of threads for free, plus more.</description>
		<content:encoded><![CDATA[<p>After reading more of these comments, maybe it's better to have a way for TBB to return a CoreID, ThreadID, and any other kind of identifier, and just let the programmer hash that as necessary for his or her own use.  I'm taking back some of my earlier comments about trying to abstract everything; sometimes you do need low-level access.  That's the beauty of C++: you get everything in C for free, plus more.  Maybe TBB could be the same way: you get all the power of threads for free, plus more.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bjoern Knafla</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9981</link>
		<dc:creator>Bjoern Knafla</dc:creator>
		<pubDate>Wed, 06 Feb 2008 15:22:29 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9981</guid>
		<description>Another idea is to offer a specialized version of the parallel algorithms (parallel_for, etc.) which, when called, takes an object (the thread storage prototype) as an argument. Each thread used by the algorithm's internal thread scheduler duplicates this object to obtain its own thread storage object. When the programmer's functor/task is called for a specific range, the thread's own thread storage object is also handed to the functor and can be used as a thread-specific cache, for example to count.
This would only work if tasks have thread affinity, or if the thread storage object is wrapped by a proxy that re-routes access to the thread storage object of the actual thread (and I am not sure if this could be done efficiently).
In a certain respect this is like parallel_reduce, but because of the thread storage objects only a minimal amount of copying, merging, and duplication is needed. Another difference is that no reduction takes place. After the algorithm has finished, the programmer could ask for a container of thread storage objects (or such a container is returned, or the thread storage objects are filled into a container handed to the algorithm when calling it) and then iterate over it sequentially or in parallel.

While such a design wouldn't be as mighty and flexible as the one drafted above it might be easier to use - though duplicating the algorithm interfaces for it might be unpleasant.


Code example:

struct task {

    void operator()(blocked_range&#38; r, TS&#38; thread_storage ) const {
        // Use thread_storage to cache data that is only needed local to the thread calling this task.
        // Thread affinity guarantees that the task isn't switched to another thread while running.
    }

}; // struct task


TS thread_storage_prototype;
TSC thread_storage_container;

parallel_for( range, task( in, out ), thread_storage_prototype, thread_storage_container );

// Look into thread_storage_container and use whatever is stored in the contained thread storage objects.


Well, the longer I think about the acquire/release container, the more I like the idea - mainly because it can be made to work correctly, and if thread affinity comes into play it could be made very efficient without locking. Then it would also be easy to program a thread-specific cache as described above, but without the need for algorithm or task interface changes. I am looking forward to it ;-)


Cheers,
Bjoern</description>
		<content:encoded><![CDATA[<p>Another idea is to offer a specialized version of the parallel algorithms (parallel_for, etc.) which, when called, takes an object (the thread storage prototype) as an argument. Each thread used by the algorithm's internal thread scheduler duplicates this object to obtain its own thread storage object. When the programmer's functor/task is called for a specific range, the thread's own thread storage object is also handed to the functor and can be used as a thread-specific cache, for example to count.<br />
This would only work if tasks have thread affinity, or if the thread storage object is wrapped by a proxy that re-routes access to the thread storage object of the actual thread (and I am not sure if this could be done efficiently).<br />
In a certain respect this is like parallel_reduce, but because of the thread storage objects only a minimal amount of copying, merging, and duplication is needed. Another difference is that no reduction takes place. After the algorithm has finished, the programmer could ask for a container of thread storage objects (or such a container is returned, or the thread storage objects are filled into a container handed to the algorithm when calling it) and then iterate over it sequentially or in parallel.</p>
<p>While such a design wouldn't be as mighty and flexible as the one drafted above it might be easier to use - though duplicating the algorithm interfaces for it might be unpleasant.</p>
<p>Code example:</p>
<pre><code>struct task {

    void operator()(blocked_range&amp; r, TS&amp; thread_storage ) const {
        // Use thread_storage to cache data that is only needed local to the thread calling this task.
        // Thread affinity guarantees that the task isn't switched to another thread while running.
    }

}; // struct task

TS thread_storage_prototype;
TSC thread_storage_container;

parallel_for( range, task( in, out ), thread_storage_prototype, thread_storage_container );

// Look into thread_storage_container and use whatever is stored in the contained thread storage objects.
</code></pre>
<p>Well, the longer I think about the acquire/release container, the more I like the idea - mainly because it can be made to work correctly, and if thread affinity comes into play it could be made very efficient without locking. Then it would also be easy to program a thread-specific cache as described above, but without the need for algorithm or task interface changes. I am looking forward to it ;-)</p>
<p>Cheers,<br />
Bjoern</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Arch Robison (Intel)</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9979</link>
		<dc:creator>Arch Robison (Intel)</dc:creator>
		<pubDate>Wed, 06 Feb 2008 05:36:30 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9979</guid>
		<description>The notion of core affinity might actually work here.

With the mine() kind of interface, core affinity does not work, because TBB has no control over when the OS switches a core from one thread to another.  E.g., suppose threads A and B are both executing X=X+1 on core-local storage.  We would not want to be in a situation where:
&lt;OL&gt;
&lt;LI&gt;core 0 is running thread A
&lt;LI&gt;thread A is running X = X+1, but has only done the load so far
&lt;LI&gt;core 0 switches to thread B
&lt;LI&gt;thread B does X = X+1
&lt;LI&gt;core 0 switches to thread A
&lt;LI&gt;thread A does its store into X
&lt;/OL&gt;
The net effect would be X=X+1 instead of the intended X=X+2.

But with the acquire()/release() kind of interface, the notion of mapping to cores is intriguing.  Call it an "affinitized_bag&lt;T&gt;".  It would be a collection of T, initially empty.  The acquire() operation would be guaranteed to return exclusive access to an object in the bag, and release() would release rights to it.  Call the acquire/release pair an "access".  Many conforming implementations would be possible:
&lt;UL&gt;
&lt;LI&gt; An implementation that has a single element, and serializes all accesses  to it.
&lt;LI&gt; An implementation that lazily creates an element per core, and serializes all accesses to the same element. 
&lt;LI&gt; An implementation that lazily creates an element per thread, and serializes all accesses to the same element.
&lt;LI&gt; Any of the above, but changed to resolve conflicts by creating multiple elements instead of serializing.
&lt;LI&gt; Random combinations of the above strategies, to shake loose programmer violations of the contract :-)
&lt;/UL&gt;
The big win from core affinity would be in situations where cache affinity is important.  The neat part is that it would address cache affinity without handing programmers explicit core ids.</description>
		<content:encoded><![CDATA[<p>The notion of core affinity might actually work here.</p>
<p>With the mine() kind of interface, core affinity does not work, because TBB has no control over when the OS switches a core from one thread to another.  E.g., suppose threads A and B are both executing X=X+1 on core-local storage.  We would not want to be in a situation where:</p>
<ol>
<li>core 0 is running thread A
</li>
<li>thread A is running X = X+1, but has only done the load so far
</li>
<li>core 0 switches to thread B
</li>
<li>thread B does X = X+1
</li>
<li>core 0 switches to thread A
</li>
<li>thread A does its store into X
</li>
</ol>
<p>The net effect would be X=X+1 instead of the intended X=X+2.</p>
<p>But with the acquire()/release() kind of interface, the notion of mapping to cores is intriguing.  Call it an "affinitized_bag&lt;T&gt;".  It would be a collection of T, initially empty.  The acquire() operation would be guaranteed to return exclusive access to an object in the bag, and release() would release rights to it.  Call the acquire/release pair an "access".  Many conforming implementations would be possible:</p>
<ul>
<li> An implementation that has a single element, and serializes all accesses to it.
</li>
<li> An implementation that lazily creates an element per core, and serializes all accesses to the same element.
</li>
<li> An implementation that lazily creates an element per thread, and serializes all accesses to the same element.
</li>
<li> Any of the above, but changed to resolve conflicts by creating multiple elements instead of serializing.
</li>
<li> Random combinations of the above strategies, to shake loose programmer violations of the contract :-)
</li>
</ul>
<p>The big win from core affinity would be in situations where cache affinity is important.  The neat part is that it would address cache affinity without handing programmers explicit core ids.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mikedeskevich</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9886</link>
		<dc:creator>mikedeskevich</dc:creator>
		<pubDate>Thu, 31 Jan 2008 22:40:01 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-9886</guid>
		<description>Disclaimer: I'm fairly new to the world of parallel programming, so my comments may be worthless.

I think we should keep the idea of threads as far away from the user as possible.  The beauty of TBB (in my mind) is the idea of "Task based programming" rather than "Thread based programming".  If I'm understanding the mine() idea right, I think that would be the way to go.  When writing a task, the programmer doesn't need to know anything about the thread it's running on.  mine() just gets a pointer to the chunk of memory that belongs to the thread the task is currently running on, and afterwards you can iterate over all existing instances and pull out the data there.

Or would it make sense to go one step lower and have something like a ProcessorID or CoreID (for example, having the mine() idea work with a hash of ProcessorID rather than ThreadID)?  Really, you're only ever going to have N threads running on N processors (cores), so you would only want N chunks of local storage.  So even if you have M&#62;N threads, only N are going to be active at any time, so you just have the thread update the local storage that belongs to the processor it's running on.  Wouldn't this eliminate the cache issues? You may have to watch out for deadlocks, though.</description>
		<content:encoded><![CDATA[<p>Disclaimer: I'm fairly new to the world of parallel programming, so my comments may be worthless.</p>
<p>I think we should keep the idea of threads as far away from the user as possible.  The beauty of TBB (in my mind) is the idea of "Task based programming" rather than "Thread based programming".  If I'm understanding the mine() idea right, I think that would be the way to go.  When writing a task, the programmer doesn't need to know anything about the thread it's running on.  mine() just gets a pointer to the chunk of memory that belongs to the thread the task is currently running on, and afterwards you can iterate over all existing instances and pull out the data there.</p>
<p>Or would it make sense to go one step lower and have something like a ProcessorID or CoreID (for example, having the mine() idea work with a hash of ProcessorID rather than ThreadID)?  Really, you're only ever going to have N threads running on N processors (cores), so you would only want N chunks of local storage.  So even if you have M&gt;N threads, only N are going to be active at any time, so you just have the thread update the local storage that belongs to the processor it's running on.  Wouldn't this eliminate the cache issues? You may have to watch out for deadlocks, though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By:  Ray Zed Blog</title>
		<link>http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-10520</link>
		<dc:creator> Ray Zed Blog</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://softwareblogs.intel.com/2008/01/31/abstracting-thread-local-storage/#comment-10520</guid>
		<description>Robert Johnson:   [Disclaimer: I’m sketching possibilities here. There is no commitment from the TBB group to implement any of this.] Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense</description>
		<content:encoded><![CDATA[<p>Robert Johnson:   [Disclaimer: I’m sketching possibilities here. There is no commitment from the TBB group to implement any of this.] Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense</p>
]]></content:encoded>
	</item>
</channel>
</rss>
