|
<HTML> |
|
<HEAD> |
|
<TITLE>Garbage collector scalability</TITLE> |
|
</HEAD> |
|
<BODY> |
|
<H1>Garbage collector scalability</h1> |
|
In its default configuration, the Boehm-Demers-Weiser garbage collector |
|
is not thread-safe. It can be made thread-safe for a number of environments |
|
by building the collector with the appropriate |
|
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
|
flag. This has primarily two effects: |
|
<OL> |
|
<LI> It causes the garbage collector to stop all other threads when |
|
it needs to see a consistent memory state. |
|
<LI> It causes the collector to acquire a lock around essentially all |
|
allocation and garbage collection activity. |
|
</ol> |
|
Since a single lock is used for all allocation-related activity, only one |
|
thread can be allocating or collecting at a time. This inherently
|
limits performance of multi-threaded applications on multiprocessors. |
|
<P> |
|
On most platforms, the allocator/collector lock is implemented as a |
|
spin lock with exponential back-off. Longer wait times are implemented |
|
by yielding and/or sleeping. If a collection is in progress, the pure |
|
spinning stage is skipped. This has the advantage that uncontested,
and thus most uniprocessor, lock acquisitions are very cheap. It has the
|
disadvantage that the application may sleep for small periods of time |
|
even when there is work to be done. And threads may be unnecessarily |
|
woken up for short periods. Nonetheless, this scheme empirically |
|
outperforms native queue-based mutual exclusion implementations in most |
|
cases, sometimes drastically so. |
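<P>
As a rough illustration, the following is a minimal sketch of a
spin-then-yield lock with exponential back-off in the style just
described. It assumes GCC's <TT>__sync</tt> atomic builtins, and it is
not the collector's actual implementation (a real implementation would
also sleep, not just yield, for longer waits):
<PRE>
#include &lt;sched.h&gt;

static volatile int lock_word = 0;   /* 0 if free, 1 if held */

static void spin_lock(void)
{
    int pause = 1;
    while (__sync_lock_test_and_set(&amp;lock_word, 1)) {
        if (pause &lt; 1024) {
            volatile int i;
            for (i = 0; i &lt; pause; i++) {} /* brief busy wait      */
            pause *= 2;                    /* exponential back-off */
        } else {
            sched_yield(); /* longer waits: give up the processor */
        }
    }
}

static void spin_unlock(void)
{
    __sync_lock_release(&amp;lock_word);
}
</pre>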
|
<H2>Options for enhanced scalability</h2> |
|
Version 6.0 of the collector adds two facilities to enhance collector |
|
scalability on multiprocessors. As of 6.0alpha1, these are supported |
|
only under Linux on X86 and IA64 processors, though ports to other
Pthreads platforms that the collector otherwise supports should be straightforward.
|
They are intended to be used together. |
|
<UL> |
|
<LI> |
|
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to |
|
run the mark phase in parallel in multiple threads, and thus on multiple |
|
processors. The mark phase typically consumes the large majority of the |
|
collection time. Thus this largely parallelizes the garbage collector |
|
itself, though not the allocation process. Currently the marking is |
|
performed by the thread that triggered the collection, together with |
|
<I>N</i>-1 dedicated |
|
threads, where <I>N</i> is the number of processors detected by the collector. |
|
The dedicated threads are created once at initialization time. |
|
<P> |
|
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently.
(A usage sketch appears after this list.)
|
<LI> |
|
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread-local allocation. It does not, by itself, cause thread-local allocation
|
to be used. It simply allows the use of the interface in |
|
<TT>gc_local_alloc.h</tt>. |
|
<P> |
|
Memory returned from thread-local allocators is completely interchangeable |
|
with that returned by the standard allocators. It may be used by other |
|
threads. The only difference is that, if the thread allocates enough |
|
memory of a certain kind, it will build a thread-local free list for |
|
objects of that kind, and allocate from that. This greatly reduces |
|
locking. The thread-local free lists are refilled using |
|
<TT>GC_malloc_many</tt>. |
|
<P> |
|
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
|
This <I>reduces performance</i> for the standard allocation functions, |
|
though it usually improves performance when thread-local allocation is |
|
used heavily, and thus the number of short-duration lock acquisitions |
|
is greatly reduced. |
|
</ul> |
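<P>
A minimal usage sketch for the <TT>GC_malloc_many</tt> interface
mentioned in the first item above, assuming the
<TT>GC_malloc_many</tt> declaration and the <TT>GC_NEXT</tt>
link-extraction macro from <TT>gc.h</tt>:
<PRE>
#include "gc.h"

/* Fetch a batch of 32-byte objects with a single lock */
/* acquisition, then unlink and use them one by one.   */
void use_batch(void)
{
    void *batch = GC_malloc_many(32);

    while (batch != 0) {
        void *next = GC_NEXT(batch); /* link occupies the first word */
        GC_NEXT(batch) = 0;          /* object is now fully ours     */
        /* ... use the 32-byte object at "batch" ... */
        batch = next;
    }
}
</pre>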
|
<P> |
|
The easiest way to switch an application to thread-local allocation is to |
|
<OL> |
|
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, |
|
and then include the <TT>gc.h</tt> |
|
header in each client source file. |
|
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation. |
|
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, |
|
and/or <TT>GC_GCJ_MALLOC</tt>. |
|
</ol> |
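<P>
Putting these steps together, a client source file might look roughly
as follows (the surrounding code is hypothetical):
<PRE>
#define GC_REDIRECT_TO_LOCAL  /* must appear before gc.h is included */
#include "gc.h"

int main(void)
{
    GC_thr_init();  /* initialize before any allocation */

    /* These calls now use thread-local free lists where possible. */
    char *buf = (char *)GC_MALLOC(64);
    char *tmp = (char *)GC_MALLOC_ATOMIC(64); /* pointer-free data */

    buf[0] = 'x';
    tmp[0] = 'y';
    return 0;
}
</pre>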
|
<H2>The Parallel Marking Algorithm</h2> |
|
We use an algorithm similar to |
|
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by |
|
Endo, Taura, and Yonezawa</a> at the University of Tokyo. |
|
However, the data structures and implementation are different, |
|
and represent a smaller change to the original collector source, |
|
probably at the expense of extreme scalability. Some of |
|
the refinements they suggest, <I>e.g.</i> splitting large |
|
objects, were also incorporated into our approach.
|
<P> |
|
The global mark stack is transformed into a global work queue. |
|
Unlike the usual case, it never shrinks during a mark phase. |
|
The mark threads remove objects from the queue by copying them to a |
|
local mark stack and changing the global descriptor to zero, indicating |
|
that there is no more work to be done for this entry. |
|
This removal |
|
is done with no synchronization. Thus it is possible for more than |
|
one worker to remove the same entry, resulting in some work duplication. |
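<P>
In outline, the removal step looks something like the following
sketch. The structure and names here are illustrative, not the
collector's actual declarations:
<PRE>
struct mark_entry {
    void *start;                  /* base of the region to mark     */
    volatile unsigned long descr; /* descriptor; 0 when entry taken */
};

/* Copy a global queue entry to this marker's local stack, then  */
/* zero the global descriptor.  No locking is used: two markers  */
/* may occasionally copy the same entry, but marking an object   */
/* twice is harmless, so the race costs only duplicated work.    */
void steal(struct mark_entry *global_e, struct mark_entry *local_top)
{
    local_top->start = global_e->start;
    local_top->descr = global_e->descr;
    global_e->descr  = 0;
}
</pre>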
|
<P> |
|
The global work queue grows only if a marker thread decides to |
|
return some of its local mark stack to the global one. This |
|
is done if the global queue appears to be running low, or if |
|
the local stack is in danger of overflowing. It does require |
|
synchronization, but should be relatively rare. |
|
<P> |
|
The sequential marking code is reused to process local mark stacks. |
|
Hence the amount of additional code required for parallel marking |
|
is minimal. |
|
<P> |
|
It should be possible to use generational collection in the presence of the |
|
parallel collector, by calling <TT>GC_enable_incremental()</tt>. |
|
This does not result in fully incremental collection, since parallel mark |
|
phases cannot currently be interrupted, and doing so may be too |
|
expensive. |
|
<P> |
|
Gcj-style mark descriptors do not currently mix with the combination |
|
of local allocation and incremental collection. They should work correctly |
|
with one or the other, but not both. |
|
<P> |
|
The number of marker threads is set on startup to the number of |
|
available processors (or to the value of the <TT>GC_NPROCS</tt> |
|
environment variable). If only a single processor is detected, |
|
parallel marking is disabled. |
|
<P> |
|
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
|
the collector to immediately yield the processor instead of busy waiting |
|
first. In the case of a multiprocessor and a client with multiple |
|
simultaneously runnable threads, this may have disastrous performance |
|
consequences (e.g. a factor of 10 slowdown). |
|
<H2>Performance</h2> |
|
We conducted some simple experiments with a version of |
|
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to |
|
run multiple concurrent client threads in the same address space. |
|
Each client thread does the same work as the original benchmark, but they share |
|
a heap. |
|
This benchmark involves very little work outside of memory allocation. |
|
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine |
|
under Linux 2.2.12. |
|
<P> |
|
Running with a thread-unsafe collector, the benchmark ran in 9 |
|
seconds. With the simple thread-safe collector, |
|
built with <TT>-DLINUX_THREADS</tt>, the execution time |
|
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. |
|
(The times for the <TT>malloc</tt>/<TT>free</tt> version
|
with glibc <TT>malloc</tt> |
|
are 10.51 (standard library, pthreads not linked), |
|
20.90 (one thread, pthreads linked), |
|
and 24.55 seconds respectively. The benchmark favors a |
|
garbage collector, since most objects are small.) |
|
<P> |
|
The following table gives execution times for the collector built |
|
with parallel marking and thread-local allocation support |
|
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested |
|
the client using either one or two marker threads, and running |
|
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
|
avoided in time-critical code. |
|
<P> |
|
(The results using <TT>pthread_mutex_lock</tt> |
|
directly for allocation locking would have been worse still, at |
|
least for older versions of linuxthreads. |
|
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
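<P>
A sketch of that acquisition path (the retry bound and names here are
made up for illustration):
<PRE>
#include &lt;pthread.h&gt;

#define SPIN_TRIES 10  /* hypothetical bound on trylock attempts */

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

static void acquire_allocation_lock(void)
{
    int i;
    for (i = 0; i &lt; SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(&amp;alloc_lock) == 0)
            return;  /* acquired without blocking */
        /* a brief busy wait would go here */
    }
    pthread_mutex_lock(&amp;alloc_lock); /* give up and queue */
}
</pre>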
|
<P> |
|
These measurements do not use incremental collection, nor was prefetching |
|
enabled in the marker. We used the C version of the benchmark. |
|
All measurements are in elapsed seconds on an unloaded machine. |
|
<P> |
|
<TABLE BORDER ALIGN="CENTER"> |
|
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
|
</table> |
|
<P>
|
The execution time for the single-threaded case is slightly worse than with
simple locking. However, the single-threaded benchmark runs faster than
even the thread-unsafe version if a second processor is available.
|
The execution time for two clients with thread-local allocation is
|
only 1.4 times the sequential execution time for a single thread in a |
|
thread-unsafe environment, even though it involves twice the client work. |
|
That represents close to a |
|
factor of 2 improvement over the 2 client case with the old collector. |
|
The old collector clearly |
|
still suffered from some contention overhead, in spite of the fact that the |
|
locking scheme had been fairly well tuned. |
|
<P> |
|
Full linear speedup (i.e. the same execution time for 1 client on one |
|
processor as 2 clients on 2 processors) |
|
is probably not achievable on this kind of |
|
hardware even with such a small number of processors, |
|
since the memory system is |
|
a major constraint for the garbage collector, |
|
the processors usually share a single memory bus, and thus |
|
the aggregate memory bandwidth does not increase in |
|
proportion to the number of processors. |
|
<P> |
|
These results are likely to be very sensitive to both hardware and OS |
|
issues. Preliminary experiments with an older Pentium Pro machine running |
|
an older kernel were far less encouraging. |
|
|
|
</body> |
|
</html> |
|
|