1,090 Posts served
4,543 Conversations started
I’m back with another challenge, encountered during my support work for Intel® Threading Building Blocks. I’ve been working with several TBB users who appreciate the general philosophy of Cilk task scheduling embodied in TBB but have run into some practical challenges applying it to their applications. Often the issue revolves around the need to block some computations until other computations complete. It may be that they need to handle either inter-object or intra-object threading—their application may at different times encounter a bunch of objects to run in parallel or just one big object that could use several threads to chew on simultaneously. Or they may have objects they need to compute upon which other objects are dependent—here it would be great to suspend dependent object processing while the weight of the associated threads is thrown to computing the shared object.
These are tough problems and I don’t have answers for them yet. But I have a few ideas and I hope in the next few posts to share them and get community feedback that might inspire some solutions. So bear with me as I stumble about and maybe we can all learn something.
Anyone who has had much exposure to presentations about Intel TBB has probably heard in one form or another the motto, “process locally, steal globally.” The Cilk philosophy embodied in the TBB task scheduler is to constrain active threads to their own local region in memory to exploit any memory that may already reside in the caches of the processing element running that thread; meanwhile, idle threads should try to steal work from memory regions as yet untouched by the active thread(s) to avoid interrupting those running threads.
When one of my customers presented an example using a nested pair of parallel_for statements, I realized that all my knowledge on subject of scheduling was theoretical: I hadn’t dug down into the guts of this code. Now might be the time.
First, here’s a sketch of the test code. You’ll note an outer and inner loop with a lock on the outer and some spinning work on the inner. This is intended to represent the structure of a program operating on a set of objects that block because of some postulated resource contention, and the inner loop represents the work done to compute that shared resource. You’ll see that the lock is currently commented out. When running this code on an 8-core machine, it will sometimes lock up. Some ideas have been bounced around about why, but I’m curious whether I can demonstrate the workings of the problem rather than just speculating about it. There’s also explicit references to the auto_partitioner, but with the advent of the affinity_partitioner, use of the auto_partitioner has been deprecated. Still, we’ll start with this one and then see if we can tell the difference when looking under the hood.
So what’s going on here? As is my wont, I turned to Intel® Thread Profiler for a first look at the processing:
Hrumph! I can see eight threads cranking at the work, which is good because I’m running on an 8-core processor. 32% of the run time is spent at concurrency level 8. Most of the time is spent in that serial tail that must represent startup processing, but the whole run is under 0.2 seconds so I suspect that’s a fixed overhead that will be negligible compared to the overall time of real work. But that’s about all. Zooming in on the high concurrency zone:
Where the work is done, the test is keeping 8 threads busy 94% of the time. But still there’s no clue about how the task scheduler is dividing the work, though it looks like it gets interesting towards the end of the run. Guess I’ll have to revert to an old standard, inserting print statements to expose what is happening in the code:
Unfortunately, while I can print out the loop bounds that tell me where I am in the nested executions of the array, I can’t print which thread is handling which range. The TBB task object that might contain that information lies outside the scope of the outer_loop class. Stuck.
But not for long. In the evolving library that is Intel Threading Building Blocks, there is a new feature, available in at least the latest open source release (upgraded to Stable as tbb20_20080408oss), called task_scheduler_observer. I’ve implemented an observer that lets me identify which thread executes which range. In my next post I will describe it and start exploring the behavior of the scheduler. Stay tuned!
By Alexey Kukanov (Intel) on May 7th, 2008 at 5:51 am
I don't think we consider auto_partitioner as deprecated feature replaced by affinity_partitioner; I expect both to be used as appropriate. Also auto_partitioner could evolve using some ideas implemented in affinity_partitioner.
By Charles E. Leiserson on May 7th, 2008 at 12:24 pm
Work-stealing: good. Locks: bad. Locks held for a long time: really bad. What's the real app, and why is this style of programming a good way to solve the problem?
By Robert Reed (Intel) on May 7th, 2008 at 1:38 pm
Thanks for the correction, Alexey. Somewhere in the back of my mind I recalled some suggestion of migrating code to the affinity_partitioner. Must have been a memory fault
To Charles: yes, in general I agree with you. Particularly driven by the philosophy of Intel Threading Building Blocks to reduce areas where caches can cool, locks are to be avoided as much as possible. Yet locks are included in Threading Building Blocks so if they are evil, they are at least a necessary evil.
Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies. Briefly I described a couple of these application structures in my intro, but let me try to elaborate one and see if anyone has an alternate approach to the problem. Imagine an object processing system (could be a mechanical CAD package, or maybe a 3D renderer, or some physics simulation) wherein the objects to process have shared dependencies to secondary objects. Those secondary objects may be sub-assemblies or caches of precomputed data that themselves may be computed in parallel. Our example program may not know a priori what is the topology of those dependencies: a reasonable approach in this case is to describe a set of tasks where each describes processing for one of the primary objects. Presumably these objects are localized in memory and so fit the general criteria for processing in parallel on separate processing elements (PEs). Then one PE encounters a reference to a secondary object that requires its own computation. What to do here? We could dispatch this thread to compute the secondary object, but what if a second PE working on another primary object runs into a reference to the same secondary? Do we duplicate the computation? We could, at the cost of extra memory and duplicated processing. Better if we could somehow suspend the second PE until the processing/preparation of the secondary object is completed (cached data of the primary cooling the whole time, admittedly), at which point both PEs could proceed on computation of their primary objects. Sounds like a resource lock to me, although ideally we'd like to keep the second PE busy, not off in an idle loop or some other process while the thread it was running is suspended on that lock. What would be really cool is to structure things so that both PEs share the work of processing the secondary object and then each return to their primary object processing once that work is done. But there's questions of oversubscription and undersubscription and memory use and stack splitting and more to deal with here. I don't claim to have answers for all these but I think they're worth a discussion.
But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies? I'd love to find that I'm overlooking an obvious solution that doesn't require locks. Got an idea?
By Charles E. Leiserson on May 8th, 2008 at 7:58 am
You're right that there are issues of memory use. A properly implemented work-stealing scheduler guarantees that if S1 is the serial stack space, with P processors the stack space is at most P*S1. If you repurpose a processor every time it hits somewhere another processor is executing, you can easily blow out memory.
You can model the computation to performed as a dag (directed acyclic graph). Each vertex has a value to be computed which depends on the values of its successors in the dag. Sinks (vertices with out-degree 0) can be computed with no further exploration. A vertex whose value has been assigned can provide its value to its predecessors without further computation.
Here's a randomized nondeterministic algorithm to compute the dag which should work well using Cilk or TBB, although I haven't implemented it. It uses at most P*S1 stack space. Recursively walk the dag in parallel, spawning each exploration of a successor. Label edges as white, grey, and black. All edges are initially white. If you traverse a white edge, color it grey. If you return across a grey edge, color it black. Now, in your parallel recursive tree walk, when you visit a vertex, if its value has already been computed, return the value. If not and there exists at least one white edge, take a _random_ white edge. Otherwise (all edges are black or grey), take a _random_ grey edge. (You're now chasing the tail of the processor that colored the edge grey speculatively hoping to help out with that subdag.) Use locks only to guarantee the atomicity of these operations on a single node.
I'll look at this problem with my students, both theoretically and in practice. If you have any data sets, let me know. Also, if you implement it, I'd be curious as to the results.
By Dmitriy V'jukov on May 8th, 2008 at 10:31 am
Re [Charles E. Leiserson]: But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies? I'd love to find that I'm overlooking an obvious solution that doesn't require locks. Got an idea?
What do about following approach?
// pseudo-code
void task::operator()
{
secondary_t* sec = get_associated_secondary_element();
if (sec)
{
if (false == sec->is_computed())
{
if (sec->try_lock())
{
sec->compute();
sec->unlock();
}
else
{
// in fifo order, not in lifo!
postpone_task(this);
// maybe next time
return;
}
}
}
// sec is computed, can use it
fork_child1(sec);
fork_child2(sec);
}
By Dmitriy V'jukov on May 8th, 2008 at 10:41 am
Code formatting in previous post is lost. Damn!
Re [Charles E. Leiserson]: Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies.
Btw, work-stealing scheduler supports only parallelism structured as *balanced* tree. Am I right here?
So user MUST structure parallelism as *balanced* tree. Provided extreme case of unbalanced tree, user can get extremely bad performance.
For example following structure *allows* parallelism, but will be executed extremely badly by work-stealing scheduler:
work11 forks work21, work22
work21 forks nothing
work22 forks work31, work32
work31 forks nothing
work32 forks work41, work42
etc
Is my understanding correct?
By Charles E. Leiserson on May 8th, 2008 at 11:24 am
I think you can actually do the algorithm I proposed easily using only synchronization through memory, because the state bits change monotonically: white => grey => black and no-value => value. You might get some redundant computation, but it would likely be minimal.
Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span >> P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see http://supertech.csail.mit.edu/cilk/lecture-1.ppt. Other lectures and materials can be found here: http://supertech.csail.mit.edu/cilk/.
By Robert Reed (Intel) on May 8th, 2008 at 4:18 pm
Thanks very much, Charles! I think I understand the basics of your algorithm and I'd like to dig deeper as I have the time. I'll certainly report back here anything I find interesting. Translating from the theoretical DAG to a practical object computing code, it seems to me that the grey-following threads would reexecute the serial regions in the object computation functions that without the following would only get executed once. So once-per-object operations like reference counting, memory allocations and such would have to be guarded from the follower(s). Also in that list, there'd need to be some mechanism for the code to figure out that the initial spawn for a successor had already occurred and a way to find the outgoing spans for that parallel region in, say the TBB task tree. Another complication would be serial regions between parallel regions within a particular function that may require a synchronizer to ensure tasks associated with the first parallel block were complete before spawning tasks for the second parallel block.
Unfortunately, I do not have any data sets to share. The customers suggesting these models have their own concerns about privacy and have either not shared more than the sample code I show above or have constrained me under a nondisclosure agreement. But I can share anything I come up with personally, and I can point them to this exchange in the hope that they may participate to the level they are able.
Thanks for your comments. I hope we'll continue to keep your interest.
By Mick Turner on May 9th, 2008 at 12:41 pm
I have a similar issue. My app is effectively a OLAP type database. The task I am parallelising is resolution of user data selections to a data set. So I have a tree of operators on picked items from dimensions. A node in the tree (and possibly occuring more than once in a tree) could be a previously saved data set. This saved data set may or may not exist in memory and may require construction from its initial definition and thereby generating a new resolution of data set task. As soon as such a saved data set appears more than once I could have a second tbb thread looking to use it whilst it is being constructed. Without getting more sophisticated it would have to block and wait.
Wouldn't the following provide a solution?
If a tbb thread could recycle what is left of its task as a new task and then force a task steal rather than immediately checking to see what other work it had available then there is a chance it would do some useful work - possibly even helping the first tbb thread as it is busy resolving the saved data set (assuming that was broken into many tasks, which mine will be). Once it had completed the stolen task it would go back to normal processing of its task set.
I wouldn't claim my TBB knowledge to be anywhere near 100% yet... but I believe a TBB thread will always execute all task in its task set before looking to steal, so I don't think there is a way of simulating this process by playing around with levels in the stack of task lists.
It may be better if this thread moved onto some other task in its task set if such tasks were available and only looked to steal if there were no other tasks. Then what I am proposing is that the task that would otherwise block would temporarily become invisible to the thread, probably only for one 'get next task' call.
I.e. Have a task.spawntemporarilyinvisible() call
Mick
By randomizer on May 11th, 2008 at 2:06 am
Re [Charles E. Leiserson]: Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span >> P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see
http://supertech.csail.mit.edu/cilk/lecture-1.ppt.
Other lectures and materials can be found here:
http://supertech.csail.mit.edu/cilk/.
I've read Cilk papers some time ago. Can you, please, explain why
(A) work must be structured as balanced tree (I mean not ideally balanced, just sufficiently balanced)
not followed from
(B) work/span >> P
?
I can't imagine unbalanced tree with work/span >> P. Also take into account that stealing is expensive operation.
Dmitriy V'jukov