1,364 Posts served
5,674 Conversations started
Little did I suspect as I was introducing the topic of blocking in parallel computation in my last post that it would generate such interest, even though it seemed a common problem I’d been working on privately with several Intel customers. Charles Leiserson amplified the pitfalls of employing blocking in a multi-threaded architecture and offered an alternate algorithm using an edge coloring scheme on a graph representing the computation. Mick Turner offered another example of the problem in Online Analytical Processing (OLAP) databases and speculated on a solution using task continuation and task stealing. Dmitriy V’jukov, who has been quite vocal on memory fencing and related issues in TBB on the forum, suggested a method to postpone and reschedule the potentially blocking tasks at some later time.
As I stated up front in my previous post, there’s lots of questions and I am far from able to claim I have any answers, but it appears we have a wealth of ideas and as I have the time to explore them, I’d like to share them with the community and garner more feedback. And the problems are formidable: just consider for example the situation where you have more top-level objects with secondary object (potentially blocking) dependencies than you have processing elements (HW threads or PEs): the usual Cilk/TBB approach of stealing from the top, even if you had a scheme where PEs are freed from waiting on blocked threads to schedule work elsewhere, would misdirect those PEs towards the big chunks of work at the top rather than sharing the load of the dependent objects—any scheme providing more threads than PEs would end in the same situation of being blocked save the threads computing the dependent object(s). Similar problems may exist if the object graph contains deep dependency chains, and either topology requires more memory to hold these contexts (stacks, dynamic memory allocations, OS resource tables, etc.) and run the risk of running out.
Given all these concerns, I want to gain a more practical understanding of how the Cilk task scheduler approach as implemented in Intel Threading Building Blocks works. My last post showed my first attempt to use Intel® Thread Profiler to explore such behavior; it showed the parallelism but little of the detail. Maybe there’s a scheme using Intel® Threading Tools User Events with a tailored labeling scheme that might provide more insightful details, but for now I want to talk about a new TBB feature, task scheduler observer. I’ve used it to implement a thread ID which I can use within the parallel control structures to report how tasks are divided among the threads. To do so requires the implementation of some Thread Local Storage, which I do with caution. Reread Arch’s blog on the subject to understand the issues.
This solution addresses the “problem” that within a parallel control structure there is no connection to the task structure or structures that ultimately compute it.
I can insert print statements in this functor that record the begin and end indices each time it is invoked but there are no hooks back to the tasks that actually run it. This would require more run-time overhead to maintain so it’s not something we want to have all the time, but would be useful to have available during development.
We start by subclassing task_scheduler_observer to create an observer object:
The parent class offers two virtual functions, on_scheduler_entry and on_scheduler_exit which when the observer is enabled will be called by each pool thread when they are created as a master or steal a task from another thread (entry) and when they are shut down (like in a task_scheduler_init::terminate() call).
For a thread ID, I want my on_scheduler_entry call to record a unique number for each of the threads that call it, but the first question that comes up: where to stick it? I have no handle on any per-thread data structures within TBB, so the obvious approach is to use some sort of thread local storage. As it happens, there’s a reasonable chunk of code already in the test routes, src/test/test_task_scheduler_observer.cpp. So I’ll just steal the pertinent bits.
Here I have practically everything I need, for Windows or Linux. I start with a structure that contains space for my thread ID, workerID, and code to get a pointer to it for the current thread, allocating it if necessary. For Windows, practically everything is handled by the __declspec(thread). For pthreads we need one additional bit of glue:
This creates the actual key used to access this thread local storage area for the current thread. All that’s left is to implement the on_scheduler_entry function:
There’s a master counter, declared as an atomic to ensure each thread gets a clean increment and I just assign the value to the field created earlier. I don’t need to distinguish the master thread from the worker threads, so my function ignores the value of is_worker. With that in place, I can complete my modified main function:
With a thread ID in place, next time I’ll start making use of it to explore how things get scheduled.
By Intel® Software Network Blogs » Under the hood: Learning more about task scheduling on May 19th, 2008 at 2:45 pm
[...] I’ve implemented an observer that lets me identify which thread executes which range. In my next post I describe it and start exploring the behavior of the [...]
By Alexey Kukanov (Intel) on May 20th, 2008 at 1:18 am
> This solution addresses the “problem” that within a parallel control structure there is no connection to the task structure or structures that ultimately compute it. I can insert print statements in this functor that record the begin and end indices each time it is invoked but there are no hooks back to the tasks that actually run it.
That's not quite true. tbb::task::self() gives you a reference to the tbb::task being executed. Don't know though how many useful info you would get out of this.
By Robert Reed (Intel) on May 23rd, 2008 at 11:00 am
Don't you know, Alexey, that I include at least one inaccuracy in my posts just to let me know that you're out there to keep me honest! ;-)
I'm still learning the internal details of tasks, but it looks from a quick peek that while the task doesn't have much of interest in itself, all kinds of stuff might be gleaned from the task_prefix. More fodder for a future blog. If there's anything here that would allow a program to generate some unique thread ID, I'm sure you'll let me know, but for the meantime, I'll continue exploring what I can discover with my task_scheduler_observer generated thread IDs.
Thanks for watching my back, Alexey.
By Intel® Software Network Blogs » Under the hood: Employing hooks to explore TBB task scheduler on June 16th, 2008 at 11:10 am
[...] block access to an object until it can get built), I’ve been building up tools to take a peek. Last time I showed a technique to use thread local storage to create a Thread ID that gets initialized (and [...]