Please test the parallel mark-region collector for SBCL

or, working with fire and steel banks

SBCL 2.3.8 has a new parallel mark-region collector which you might like to test. This is a brief guide on what to expect, and how to check said expectations.

If you just want to play with the new collector, bear in mind it currently only works on x86-64/Linux, x86-64/macOS and x86-64/OpenBSD (the last thanks to Sebastien Marie).

Theory of operation

A tracing garbage collector knows where the garbage is sometimes. It knows this because it knows where the garbage isn't. By tracing where the live objects are from where they aren't¹, it obtains a live set. The old and new collectors use different algorithms to implement this idea.

A copying collector, such as that used by the current SBCL collector (gencgc) does so by copying objects reachable from variables to a newspace, then copying all objects reachable from objects in newspace, and discarding the oldspace which now only contains dead objects. (This is slightly complicated by generational garbage collection, which I'll explain later.)

A mark-region collector, such as the one I'd appreciate if you tested, identifies live objects by associating a mark bit with every object and every region of memory. It recursively visits and sets both the mark bits of every object referenced from every marked object and the mark bit for the regions containing the object, and then lets every unmarked region be reused. The collector needs to compact in order to deal with regions which are only partly occupied by live objects, but it only needs to compact part of the heap, and only needs to compact infrequently.

The collectors in SBCL are also generational: they focus on collecting newer objects more often, because newer objects tend to be shorter lived, and so reclaiming them is usually more effective. Lisp code must record where references from old objects to new objects (might) reside, in order for the collector to correctly collect just the new objects. The implementations also must deal with conservative references: the SBCL compiler does not inform the collector of what registers and what parts of the stack have references rather than 64-bit integers and double-floats, which must not be mistaken for references. The ELS paper goes further in depth in implementation, but was written before I implemented compaction.

What to expect

If you keep lots of objects live, you hopefully will find that tracing those objects is faster, and the mark-region GC will run faster. If you don't keep lots of objects live, the mark-region GC might not run faster. It might run slower, as only the mark-region GC has to sweep regions, and scavenging where old objects reference new objects can be slower when old objects are more fragmented.

The collectors seem to produce different locality of reference; it seems to vary substantially between different programs and hardware. I'm not sure exactly when the locality will be better or worse, but it'll likely be different.

Allocation might be slower with multiple Lisp threads; threads take a global lock when they need to claim new pages to allocate into. A page which is partly used fits fewer objects, so threads need to claim pages more frequently.

A parallel GC only scales with an appropriately "parallel" heap with many objects that it can trace in parallel. The worst case would be to have a very large linked list with shallow elements: the collector only finds one more cons cell to trace by walking the prior cons cell, and so no parallelisation is possible. But hopefully you don't have such a list laying around. Right?

The mark-region collector may reserve more memory from the operating system, due to objects being more dispersed in memory, although the additional memory can be reused by SBCL. But more data should fit in the same heap size when using the mark-region collector, as the mark-region collector will not copy a lot of memory at once.

There is no support for the immobile space. The immobile space is used to give 32-bit pointers to various objects, such as layouts of instances; with immobile space SBCL crams a pointer to the layout into the header of an instance, and without immobile space SBCL must use another word for a full 64-bit pointer. Somehow lacking immobile space greatly slows down one application I have (and gencgc sans immobile space also has the same effect), but I can't work out exactly why.

Diagnostics

The time function now prints both the CPU time spent in collection (run time) and the real time spent in collection.

The C function mr_print_meters prints various meters of time taken in each phase of collection and other interesting figures. It may be called from Lisp by sb-alien::(alien-funcall (extern-alien "mr_print_meters" (function void))). The meters are written to stderr; the output looks like

collection 1129 (10% compacting):
  7us consider
  353 scavenge (996 prefixes) 993 trace (3820 alive 3161 running)
  197 sweep (106 lines 65 pages) 257 compact (156 copy 92 fix)
  40 raise; 954000B fresh 278pg pinned

1,129 garbage collections were performed while running this program, with 10% including a compacting pass. It took 7 microseconds on average to consider compaction, 353 microseconds to scavenge old-to-new references, 993 microseconds to trace (consisting of GC threads being awake for 3820 microseconds and working for 3161 microseconds), 197 microseconds to sweep, 257 microseconds to compact, and 40 microseconds to raise objects from younger to older generations. The collector had to compute where objects begin for 954kB of the heap on average, and could not move 278 pages on average.

Ideally most time should be spent in tracing, as opposed to in scavenging and sweeping; else there is little work to parallelise, and the mark-region collector is inherently disadvantaged by having to sweep.

How to test

Download the SBCL source code, and use ./make.sh --without-gencgc --with-mark-region-gc to build with the mark-region GC.

The environment variable GC_THREADS controls the number of helper threads; the Lisp thread which commenced GC and the helper threads all collect in parallel.² By default there are three helper threads, totalling four threads performing GC. This may well be too few or too many depending on your computer and workload - I'd appreciate suggestions for what a good default would be.

Acknowledgements

Garbage collectors don't grow on trees, and I couldn't have incubated one alone. Hopefully I got everyone:

Stas Boukarev and Douglas Katzman helped me get up to speed with the interactions between the garbage collector and the rest of SBCL. Doug heavily refactored the SBCL runtime to get the new GC fitting in nicely.
Steve Blackburn and Kunal Sareen had discussed Immix with me, leading to the novel approaches to implementing generations and conservative roots that are used in this collector. Elijah Stone also consistently had interesting ideas about GC.
Yukari Hafner helped me get auto-vectorisation to work, and kindly provided me with pre-release access to the Kandria source code for testing.
Paul Khuong and Larry Masinter helped me design the internal arena-based memory manager used by the garbage collector.
The Coffee Compiler Club helped me interpret the performance of parallel GC and programs.
Jan Moringen, Kunal Sareen, Elijah Stone, and Robert Strandh provided feedback on my ELS paper.
Jason Chandler, Vitaly Drogan, Gnuxie Lulamoon, Larry Masinter, Suzanne Mitchell, Roger Sen, and Robert Smith most kindly sponsored my work.

Footnotes:

A reference counting collector does the opposite, tracing the garbage from where it isn't.

Having to explain that it's off by one suggests that the environment variable really should be the total number of threads, i.e. GC_THREADS=4 means four threads run. Oops.