1,091 Posts served
4,540 Conversations started
[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]
Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense if given one, you can easily build the other. For example, thread local storage can be implemented as a one-to-one map from thread ids to pieces of storage. And vice versa, the address of a variable in thread-local storage can be used as a thread id.
TBB by design has no thread id or thread-local storage. TBB is based on task-based parallelism, where the programmer breaks work up into tasks, and the task scheduler is free to map tasks to hardware threads. Furthermore, our OpenMP run-time group strongly recommended that we avoid explicit thread ids because of problems with nested parallelism and dealing with a dynamically growing or shrinking team of threads. For example, the nature of the OpenMP thread id interface implies that the number of threads in a thread team is fixed for the duration of the team.
However, thread local storage does have its uses. Don Webber posted an excellent use case for thread local storage, which involves updating a sparse matrix. The problem involves doing many updates of the form *p += value, in parallel, where some updates might update the same location. Assuming that += is commutative and associative, one way to implement this is to have each thread sum its own updates privately, and then merge the sums. As Don notes, the alternative of locking *p on each update is prohibitively expensive. Using an atomic +=, even if available on current hardware, would likewise be prohibitively expensive, because cache lines would ping-pong severely.
I'd like to see TBB extended to provide the power of thread local storage without opening up a Pandora's box of raw thread ids. I think the solution needs to cleanly separate a high-level portion from a low-level portion, like we recently did for cache affinity. (Note: the type task::affinity_id might appear to have opened the box, but did not, because it is a hint, not a commandment.)
TBB's template parallel_reduce in TBB partially deals with the cited use case, because it is lazy about fork/joins. The user defines how to fork/join state information for the reduction. The template recursively decomposes the range and applies the users fork/join operations. The laziness is that fork/join pairs are only used when task stealing occurs. For example, if there are P threads and N leaf subranges, it does not do the obvious N-1 fork operations, but instead does just enough to keep the threads busy. Specifically, it does a fork/join pair only when the two children would be processed by different threads.
However, parallel_reduce is not lazy enough. At the high level, the problem is that parallel_reduce cannot assume that the reduction operation is commutative. For a non-commutative reduction operation, the current implementation is close to optimal (maybe a factor of 2 off in the worse case) with respect to the number of fork/join pairs. If TBB added a reduction template that could assume a commutative reduction operation (e.g.parallel_unordered_reduce), then at most P-1 fork/join pairs would be necessary.
The good thing about using the hypothetical parallel_unordered_reduce instead of exposing thread local storage is that it keeps the abstraction at a high level. Explicitly using thread local storage would introduce irrelevant low-level details. For example, a typical implementation based on thread local storage can be sketched as:
forall updates (p,value) do *p += value // *p points to thread-local partial sum for each thread-local partial sum do update global-sum+= thread-local partial sum
This level exposes issues such as "where are the thread-local partial sums that were generated in the first loop?" Since threads can come and go during execution of the first loop, iterating across ids of currently running threads is not enough. Some of the partial sums might outlive their threads, or some threads might come into existence after the partial sums were generated. We'll need a container to hold the partial sums, and a means of iterating over the sums.
The interface for such a container seems straightforward. Define it as a sequence of T that has iterator capability. Add TBB-style ranges too, so that reductions over the container can be done in parallel. Add a special method "mine()" that returns reference to the element that is owned by the invoking thread. If the element is not present, "mine()" would insert one and default-construct it.
Method mine() would be most likely implemented by hashing a thread id, so it's not going to be cheap, but probably inexpensive enough if the user hoists calls to it.
There is an interesting alternative that weakens guarantees, with the intent of expressing intent at a little higher level. It's somewhat an object-oriented extension of a semaphore that combines the semaphore with the resource it is protecting. It would work as follows. The method "mine() could be replaced by two methods "acquire()" and "release()" [possibly sugared with RAII] such that:
This interface permits an implementation to keep the limit the number of "thread local" copies of T to what is actually necessary for concurrency, not what is necessary for one-per-thread. If T is really big, this could be advantageous. There's perhaps an issue with cache affinity. However, a sufficiently clever implementation could bias towards grabbing the instance of T that the thread most recently had before. Of course, a conforming implementation could just use thread-local storage for each copy of T.
Comments?
By mikedeskevich on January 31st, 2008 at 2:40 pm
Disclaimer: I'm fairly new to the world of parallel programming, so my comments may be worthless.
I think we should keep the idea of threads as far away from the user as possible. The beauty of TBB (in my mind) is the idea of "Task based programming" rather than "Thread based programming". If I'm understanding the mine() idea right, I think that would be the way to go. When writing a task, the programmer doesn't need know know anything about the thread it's running on. mine() just gets a pointer to the chunk of memory that belongs to the current thread that the task is running on and then afterwards you can iterate over all existing instances and pull out the data there.
Or would it make sense to go even one step lower and have a something like a ProcessorID or CoreID or something like that (like having the mine() idea work with a hash of ProcessorID rather than TreadID). Really, you're only ever going to have N threads running for N processors (cores), so you would only want to have N chunks local storage. So even if you have M>N threads, only N are going to be active at any time, so you just have the thread update the local storage that belongs on the processor it's running on. Wouldn't this eliminate the cache issues? May have to watch out for deadlocks.
By Ray Zed Blog on January 31st, 2008 at 8:32 pm
Robert Johnson: [Disclaimer: I’m sketching possibilities here. There is no commitment from the TBB group to implement any of this.] Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense
By Arch Robison (Intel) on February 5th, 2008 at 9:36 pm
The notion of core affinity might actually work here.
With the mine() kind of interface, core affinity does not work, because TBB has no control over when the OS switches a core from one thread to another. E.g., supposed threads A and B are both executing X=X+1 on core-local storage. We would not want to be in a situation where:
The net effect would be X=X+1 instead of the intended X=X+2.
But with the acquire()/release() kind of interface, the notion of mapping to cores intriguing. Call it an "affinitized_bag
The big win from core affinity would be in situations where cache affinity is important. The neat part is that it would address cache affinity without handing programmers explicit core ids.
By Bjoern Knafla on February 6th, 2008 at 7:22 am
Another idea is to offer a specialized version of the parallel algorithms (parallel_for, etc.) which when called get an object (thread storage prototype) as an argument. This object is duplicated (thread storage object) by each thread used by the algorithm internal thread scheduler. When the functor/task of the programmer is called for a specific range the threads own thread storage object is also handed to the functor and can be used as a thread specific cache, for example to count.
This would only work if tasks have a thread-affinity or if the thread storage object is wrapped by a proxy that re-routes access to the thread storage object of the actual thread (and I am not sure if this could be done efficiently).
In a certain aspect this is like parallel_reduce but because of the thread storage objects only a minimal amount of copying, merging, and duplication is needed. Another difference is that no reduction takes place. After the algorithm finished the programmer could ask for a container of thread storage objects (or such a container is returned or the thread storage objects are filled into a container handed to the algorithm when calling it) and then iterate over it sequentially or in parallel.
While such a design wouldn't be as mighty and flexible as the one drafted above it might be easier to use - though duplicating the algorithm interfaces for it might be unpleasant.
Code example:
struct task {
void operator()(blocked_range& r, TS& thread_storage ) const {
// Use thread_storage to cache data that is only needed local to the thread calling this task.
// Thread affinity guarantees that the task isn't switched to another thread while running.
}
}; // struct task
TS thread_storage_prototype;
TSC thread_storage_container;
parallel_for( range, task( in, out ), thread_storage_prototype, thread_storage_container );
// Look into thread_storage_container and use whatever is stored in the contained thread storage objects.
Well, the longer I think about the aquire/release container the more I like the idea - mainly because it can be made to work correctly and if thread affinity comes into play it could be made very efficient without locking. Then it would also be easy to program a thread specific cache as described above but without the need for algorithm or task interface changes. I am looking forward to it
Cheers,
Bjoern
By mikedeskevich on February 11th, 2008 at 3:12 pm
After reading more of these comments, maybe it's better to have a way for TBB return a CoreID, ThreadID, and any other kind of identifier and just let the programmer hash that as necessary for his or her own use. I'm taking back some of my earlier comments about trying to abstract everything, sometimes you do need the low level access. That's the beauty of C++, you get everything in C for free plus more. Maybe TBB could be the same way, you get all the power of threads for free plus more.
By Mick Turner on May 1st, 2008 at 2:58 am
I have another case where I want access to a thread id.
My application is database like with user requests being distributed to worker threads that manage their own large chunks of memory so that when the user task finishes all memory associated with that task gets immediately wiped with no long term issues with fragmentation etc. This user task is now to be parallelised with TBB so I now need to make sure memory allocation by a TBB task only uses the memory dedicated to the user task but also uses thread specific allocs to reduce locking overhead.
To reduce this to its simplest form I need a 2D thread specific memory allocator. I currently have 1D over my user task threads and the allocator provided with TBB is 1D over all threads.
The only way I can think of to do this is for the TBB threads to have an index that can be used to provide the 2nd dimension.
I can implement this with TLS (I'm Windows only), but I then have to check my TLS index at the beginning of each task just in case the current thread is new in some way.
What would be neat is if each TBB thread called a constructor/destructor call that could be replaced. Then all appropriate initialisation could be done in this call rather than constantly having to worry about the possibility of threads being constructed or terminated.
Mick
By Arch Robison (Intel) on May 1st, 2008 at 7:36 am
We recently added a feature to the open source version of TBB that might be just what you are looking for: task_scheduler_observer.
If you download the Reference Manual for the open source version from http://threadingbuildingblocks.org/documentation.php , and go to section 8.6, you'll find a description of task_scheduler_observer. It lets you define an object that gets notification when TBB creates or destroys a thread. If you create the object after TBB has already created a thread, you get notification before the thread does anything on behalf of the next parallel loop. Section 8.6.5 gives this guarantee in a somewhat oblique way.
If this solves your case, I'd like to hear about it, because we've been looking for example use cases for task_scheduler_observer.
Arch
By Mick Turner on May 9th, 2008 at 12:04 pm
Hi Arch
That looks like it would do the trick. I haven't been using the open source version (please make a version of this that compiles as a MSVC project, I don't want to have to install kinds of unix like things on my development machine).
Any idea when that feature is likely to make it into the commercial release?
Thanks
Mick
By Arch Robison (Intel) on May 9th, 2008 at 2:40 pm
task_scheduler_observer will be part of a commercial release this summer.