1,437 Posts served
I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access. The answer is that we'd add one if:
I usually try to avoid linked lists even for sequential programming if I'm programming for performance. My reasons are:
For parallel programming, add another flaw:
Two traditional attractions of linked lists are:
But modern polymorphic language like C++ provide dynamic data structures like std::vector and std::deque. You don't have to write them from scratch. Prepending or appending to a deque also takes O(1) time. Appending to a vector takes O(1) amortized time. Amortized time is the time averaged over many append operations.
Here's a speed test you might want to try. Construct a container, append n items, walk the container once, and destroy it. Here's the code:
template<typename Container>
int Iota( int n ) {
Container container;
for( int i=0; i<n; ++i )
container.push_back(i);
int sum = 0;
for( typename Container::const_iterator j=container.begin(); j!=container.end(); ++j )
sum += *j;
return sum;
}
I tried this fragment on a Linux box and found that std::deque was slightly faster than std::list when n>=3, and std::vector was slightly faster when n>=10. When n>=100, std::deque was more than 10x faster than std::list, and even std::vector more than 3x faster than std::list. So for very short collections, std::list might pay off. But for big collections, its second-rate.
Of course linked lists do have some virtues, notably when concatentating lists, splicing lists, and inserting items in the middle. I use lists when I need to do that. But getting back to parallelism, which set of those operations make any sense in parallel programming? Concurrent splicing and inserting seems awfully tricky to use correctly. For example, if I really need to insert in the middle of the list, it must be because there is something special about the insertion context. But if there are other threads inserting at the same time, how do I know the context will not be broken?
The two operations on lists that I think could be useful in parallel programming are:
For example, a parallel reduction could use "concatenate" as its reduction operation, and thus build a list of N items in O(N/P+log(P)) time. The log(P) term arises from a tree reduction at the end. The problem is the second operation. To keep a list from becoming a serial bottleneck, we need a way to traverse it in parallel. That probably means it is no longer a linked list, but some kind of (balanced?) tree structure.
I've had a recurring thought that we should add this kind list, one that supports concatenation and splitting in constant time. But we really need motivating use cases before implementing it. Suggestions for good use cases or demos appreciated.
- Arch Robison
P.S. I though about writing this blog as a politcal attack ad, but given the technical details, decided against it. If I had done it that way, it would have started:
Mr. Linked List is running for office. He's popular everywhere. But here's what Mr. Linked List doesn't want you to know... .
| December 21, 2007 11:25 AM PST
Arch Robison (Intel) | Thanks for the pointer to ropes. Ropes sound like what I'm looking for in the was of a data structure for concatenate and split. Yes, I'm the author of IFP. It was my master's degree project. It's been a while since I've thought about it. My big regret is not implementing proper tail recursion in it. There's been a lot of compiler/interpreter innovations since I wrote it. If I were to do it over again, I'd write a JIT for it, not an interpreter. Backus's FP, from which IFP was derived, was definitely a different way to think about programming. |
| December 21, 2007 12:03 PM PST
Arch Robison (Intel) | Yes, it is true that std::vector is not concurrency friendly. But TBB's tbb::concurrent_vector is, and indeed allows multiple readers and writers to work on it. It's not as contiguous as a std::vector, but certainly more contiguous than a std::list. A tbb::concurrent_vector of length N has O(lg(N)) discontinuities in its layout. I should have defined what I meant by "parallel programming", and distinguished it from "concurrent programming". By parallel programming, I meant programming for parallel speedup, not just the concurrent programming taught in OS classes. OS programming != parallel programming. OS programming is about virtualizing resources, in way that threads do not interfere with other threads. It has be that way, because most of those threads are running independent tasks authored by independent authors. Parallel programming is about orchestrating threads to achieve a common purpose. Indeed, if a parallel program spends most of its time in the kernel, it's probably not getting much speedup. (Parallel I/O is probably the exception to this rule.) Yes, compare-and-swap protocols can be used to concurrently prepend to a list. Indeed, we use that technique in the implementation of TBB. (Sometimes it takes some care to avoid ABA problems.) But I see compare-and-swap as useful for parallel programming only when the operations can be spread out across many cache lines. If they all contend for a single cache line, it's just another bottleneck. And no matter how clever the construction of a llinked ist, its traversal is inherently serial. I agree that linked lists are useful for OS and concurrent programming, but I see them as largely (but not completely) useless for parallel programming. That's not to rule out linked structures. Something like ropes sound like a good deal. |
| December 21, 2007 12:57 PM PST
Greg Buchholz | Data Parallel Algorithms. |
| December 21, 2007 1:44 PM PST
Arch Robison (Intel) | Thanks for the citation. It's an excellent discussion of parallel prefix. TBB in has the parallel prefix operation (tbb::parallel_scan), though I have found it slow in practice because (1) it is a two-pass algorithm and hence uses twice the memory bandwidth of a single pass and (2) TBB was lacking thread affinity support. Problem (2) is being addressed, and so I hope to see some significant improvement once I add the affinity logic to tbb::parallel_scan. When a machine has a processor per node in a linked list, amazing things can be done. Hillis and Steele's article assume that kind of machine, namely the CM-1 or CM-2, which had up to 2^16 nodes. Cynics will note, however, that Thinking Machines switched to more a conventional MIMD architecture with the CM-5. Coarse-grain MIMD machines seem to dominate the market for now. The processor per list node would seem to be an unreasonable assumption on those kinds of machines. A sublist per node would work. More recently, Uzi Vishkin has been promoting a very highly parallel "XMT" machine that features fast parallel prefix. |
| December 21, 2007 3:15 PM PST
Arch Robison (Intel) | Yes, it comes down to concurrent access vs. data parallel programming. I should have said "data parallel" in the title. One problem we're running into with TBB (and I think other data parallel systems have it too) is how to marry data-parallel programming with OS-style ("many clients") concurrent programming. E.g., someone in the TBB forum has a nice example where agents wait on an external message/event (definitely OS-style), but want process the message using data-parallel style processing. A single CAS beats a single lock/unlock pair. But a lot of the non-blocking algorithms seem to use more than one CAS. Three summers ago, an intern and I looked into non-blocking algorithms and timed some, specifically dual-queues and skip-lists. The results were disappointing. Unless the machine was grossly oversubscribed with many more software threads running than hardware threads available, the locked algorithms were substantially faster. The problems seemed to arise from the Pentium 4 processor's relatively slow CAS. We also suspected that CAS-based algorithms were causing a lot of cache line thrashing. I had a similar experience when experimenting with lock-free (or at least lock-reduced) algorithms in the core TBB scheduling algorithm. Cache-line ping-ponging seemed to be a problem. In contrast, locks not only force logical mutual exclusion, they also (if used carefully) tell waiting threads "keep your paws off my cache lines!". When the cache line has to be tranferred across sockets, that could conceivably be a big cost. We came to the conclusion that about 3 CAS was as bad as one lock/unlock pair, at least for short critical sections. (Sounds like geeky poker). The situation is probably somewhat better with Core-2, but I have not done the timings lately. Of course, for grossly oversubscribed situations, non-blocking algorithms are a win because they avoid problems where a lock holder is preempted. I recall that the CAS-based skip list had similar issues compared to a skip list with a big fat lock around it, at least for the 2-4 core systems we were looking at. Which was depressing. Having written both a red-black tree implementation (for KAI C++ <map> and a skip list (for the guts of the KAI C++ compiler), I'm definitely a fan of the simplicity of skip lists. Thanks for all the comments. My previous blog on how 'volatile' is almost useless for multi-threaded programming got no comments, so I was wondering if anyone was reading it. |
| December 24, 2007 12:30 PM PST
AJ Guillon
| Just as a note, I did read your blog on volatile. I read all of the blogs here, and learn a lot from them. |
| January 17, 2008 8:52 AM PST
Arch Robison (Intel) | Thanks for pointing out intrusive lists. I'm a big fan of them - most of the lists inside the KAI C++ compiler were intrusive. The template we had used a pointer-to-member to describe the link. For a list of type T the programmer declared a FrugalSListLink in their structure, say foo_link, and declared the list as FrugalSList<T,&T::foo_link> . For our new programmers, it was often their first introduction to the pointer-to-member feature in C++. I agree that my original posting narrowly addressing speeding up wallclock time for classic supercomputing-type CPU-intensive applications, and not addressing other important areas of concurrency. |
| May 13, 2008 1:34 AM PDT
Flying Ninja Monkey | Intel guy is talking about the relative merits of a linked list |
| June 2, 2008 12:13 AM PDT
Andrey Marochko (Intel) | Another tribute to the intrusive lists. Recently we've added two more of them into the TBB scheduler as part of the new cancellation propagation machanism. One is the list of master thread scheduler objects, and the other is the list of task group contexts. Using them allowed us to remove locks from the normal (that is hot) execution path. |
Kevin Farnham
The memory scattering problem with linked lists can also be an enormously significant issue. If the list is constructed from a large number of inserted links, with many other links deleted, the memory locations of "adjacent" list items will often be far apart -- meaning that efficient utilization of cache (which is readily possible with standard arrays or vectors) will be impossible with this kind of linked list.
Still, a lot of people like using linked lists!