My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.
This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which uses a similar number of instructions on the (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. kmregion is about 10× the speed of malloc/free.
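For concreteness, here's a minimal sketch of the kind of bump-down fast path I mean, under the same assumptions (16-byte-aligned objects, out-of-line refill). The names are made up for illustration; this isn't kmregion's actual interface:

```c
#include <stddef.h>

/* Hypothetical arena, not kmregion's API: the current position is bumped
 * downward toward the start of the block, and a non-inlined slow path
 * takes over when the block is exhausted. */
typedef struct {
    char *pos;    /* current allocation position, bumped downward */
    char *start;  /* lowest usable address of the current block */
} arena;

void *arena_refill(arena *a, size_t n);   /* out-of-line slow path */

static inline void *arena_alloc(arena *a, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if ((size_t)(a->pos - a->start) < rounded)  /* not enough room left? */
        return arena_refill(a, rounded);
    a->pos -= rounded;          /* the whole fast path is a compare, a branch, */
    return a->pos;              /* and a subtract */
}
```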
I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.
______
† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c
https://godbolt.org/z/1oTrv1Y58
Messing with it a bit, it seems like yours has a slightly shorter dependency chain due to loading the two members separately, where UPB loads them as a pair (as it needs both in order to determine how much space is available). Yours also seems to have less register pressure. I think that's because yours bumps down. UPB supports in-place forward extension, so it needs to bump up.
If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.
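As a rough sketch of what that might look like on the hypothetical arena above (the refill function is still a placeholder; `__builtin_expect` is a GCC/Clang extension and `preserve_most` is Clang-only and target-dependent, so it's hedged behind a feature check):

```c
/* Keep the out-of-line refill from clobbering most caller-saved registers
 * (Clang-only; silently omitted elsewhere). */
#if defined(__clang__)
#define COLD_CC __attribute__((preserve_most))
#else
#define COLD_CC
#endif

COLD_CC void *arena_refill(arena *a, size_t n);

static inline void *arena_alloc(arena *a, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;
    /* Hint that the refill branch is rarely taken. */
    if (__builtin_expect((size_t)(a->pos - a->start) < rounded, 0))
        return arena_refill(a, rounded);
    a->pos -= rounded;
    return a->pos;
}
```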
However, there is an unfortunate compiler behavior here for both implementations: the compiler doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually taken, so it reloads the pointers on the hot path. This means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally they'd keep the current position in a register on the hot path and refill that register only after the cold path clobbers it.
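To spell out what keeping the position in a register would look like, here's a hand-written version of a short run of fixed-size allocations over the same hypothetical arena; a compiler that tracked the clobbering precisely could in principle generate this itself:

```c
/* Three 16-byte allocations with the bump pointer held in a local:
 * the arena struct is loaded once, stored once, and only touched in
 * between if the cold path actually runs and clobbers it. */
void make_three(arena *a, void *out[3]) {
    char *pos = a->pos;                         /* load once */
    for (int i = 0; i < 3; i++) {
        if ((size_t)(pos - a->start) < 16) {
            a->pos = pos;                       /* write back before the cold call */
            out[i] = arena_refill(a, 16);
            pos = a->pos;                       /* refill clobbered it: reload */
        } else {
            pos -= 16;                          /* pure register work on the hot path */
            out[i] = pos;
        }
    }
    a->pos = pos;                               /* single store at the end */
}
```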
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
And is the speedup over malloc/free due to large block allocation as opposed to individual malloc?
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
The reason why their allocator is faster than Boehm isn't because of conservative stack scanning.
You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.
You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
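For a rough sense of what a non-moving fast path can look like, here's a sketch of the "pop a free cell if the sweep left one, otherwise bump within the block" shape. This is only my illustration of the general idea, not JavaScriptCore's actual bump'n'pop code or Fil-C's turbosweep:

```c
#include <stddef.h>

/* Cells reclaimed by the sweep are threaded into a free list; untouched
 * space at the end of the block is handed out by bumping. */
typedef struct free_cell { struct free_cell *next; } free_cell;

typedef struct {
    free_cell *free_list;   /* cells the last sweep found dead */
    char *bump, *end;       /* never-yet-used tail of the block */
    size_t cell_size;       /* blocks are segregated by size class */
} block;

void *block_alloc_slow(block *b);   /* grab a new block, maybe collect */

static inline void *block_alloc(block *b) {
    free_cell *c = b->free_list;
    if (c) {                                           /* pop path */
        b->free_list = c->next;
        return c;
    }
    if ((size_t)(b->end - b->bump) >= b->cell_size) {  /* bump path */
        void *p = b->bump;
        b->bump += b->cell_size;
        return p;
    }
    return block_alloc_slow(b);                        /* cold path */
}
```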
I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...
Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload for me to start to believe.
You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.
Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though for interactive UIs such as videogames, tail latency is often more important. For applications like compilers and SMT solvers, on the other hand, throughput matters more.
You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
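For a sense of what that per-mutation cost looks like, here is a minimal card-marking write barrier sketch. The table layout and granularity are made up for illustration, and real collectors vary in the exact scheme they use:

```c
#include <stdint.h>

/* Every pointer store into the heap pays a couple of extra instructions
 * so a later minor collection can find old-to-young pointers.  Card
 * marking shown here; remembered-set barriers have a similar cost. */
extern uint8_t *card_table;   /* one dirty byte per 512-byte "card" of heap */

static inline void store_ref(void **slot, void *value) {
    *slot = value;
    card_table[(uintptr_t)slot >> 9] = 1;   /* mark the card containing the slot */
}
```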
There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.
But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.
Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.
Sorry, I refuse to accept a trivial loop of generating up to 2 objects at a time as ever being a relevant benchmark. I didn’t say throughput wasn’t relevant. I said this test isn’t convincing.
> coming to the conclusion that GC was somewhat faster than malloc/free
Many people have made that conclusion. It's certainly not true for real-time cases where tail latency is critical.
Modern GCs are certainly better than old ones.
For malloc/free, the specific workload matters much more than it does for a bump allocator; malloc potentially has branching and different cache access/utilization patterns depending on requested sizes or the distance/pattern between mallocs and frees, whereas a bump allocator is just bumping a single pointer, with basically zero mispredicts and a very simple linear cache-utilization pattern that's easily prefetched.
Yes bump allocators are very fast. (It would have been helpful if the blog post explained what type of allocator it was benchmarking).
But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle. And pushing onto alternating lists/vec/etc is notoriously tricky.
Workload always matters. Always.
> But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle.
Yes, actually, you can, if your pointer-bumping allocator is the nursery allocator for a generational GC, which is the case in PyPy. In fact, you can free objects in the middle so freely that it takes literally zero instructions to free them; all you have to do is not retain a reference to them, and when it is time to move the live objects out of the nursery, the GC will avoid moving them because it never finds a pointer to them. That's how this microbenchmark is able to spend only 2% of its time in the GC, even though it's doing nothing but allocating and initializing objects!
This allows you to, precisely, take a random Python GC'd workload and satisfy almost all of its allocations with a simple pointer-bumping allocator. It avoids all the trickiness of pushing onto alternating lists/vec/etc.
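A sketch of why the dead objects cost nothing, with everything simplified and made up (single pointer field per object, recursive evacuation, no write barrier shown); this is the general shape of a copying minor collection, not RPython's actual collector:

```c
#include <stddef.h>
#include <string.h>

typedef struct obj {
    size_t size;          /* total size in bytes */
    struct obj *field;    /* the one outgoing pointer, for simplicity */
    struct obj *forward;  /* set once the object has been copied out */
} obj;

typedef struct {
    char *nursery, *nursery_pos, *nursery_end;  /* bump region for new objects */
    char *old_pos;                              /* bump pointer into the old space */
    obj **roots; size_t nroots;
} gc;

static int in_nursery(gc *g, obj *o) {
    return (char *)o >= g->nursery && (char *)o < g->nursery_end;
}

static obj *evacuate(gc *g, obj *o) {
    if (!o || !in_nursery(g, o)) return o;   /* old objects stay put */
    if (o->forward) return o->forward;       /* already copied this cycle */
    obj *copy = (obj *)g->old_pos;           /* copy the live object out */
    g->old_pos += o->size;
    memcpy(copy, o, o->size);
    o->forward = copy;
    copy->field = evacuate(g, copy->field);  /* trace its outgoing pointer */
    return copy;
}

static void minor_collect(gc *g) {
    for (size_t i = 0; i < g->nroots; i++)
        g->roots[i] = evacuate(g, g->roots[i]);
    /* Everything nobody pointed to was simply never visited; this one
     * store reclaims all of it at once. */
    g->nursery_pos = g->nursery;
}
```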
It's true that in production workloads the time spent in the GC is usually more than 2%, typically more like 15%. The time wasted on write barriers, or on writing your code to copy objects unnecessarily to avoid write barriers, is harder to quantify, much like the time wasted on mutating objects unnecessarily to avoid allocation.
It is also true that workload always matters for allocator performance. So it's instructive to see how well RPython's allocator does with a workload which would be a pathological worst case for the kinds of allocators you're accustomed to, yielding performance so dramatically superior that you dismissed it as impossible, like when GMail launched with a gigabyte of email storage per user and Hotmail assumed it was a joke.
> The allocation fast path of the RPython GC is a simple bump pointer,
and the post shows the assembly for it.
Indeed a bump allocator has the disadvantage that it effectively thrashes the lower cache layers continuously, but I'd think that'd be a pretty universal, workload-independent issue. (And the cache's LRU will still keep recently used parts in cache, and the thrashing is by definition spread pretty evenly, giving the user code plenty of time to do other things in between touches of whatever it wants to keep cached, which slightly lessens the issue.)
And given that a Python impl will necessarily need to have a GC path anyway, which already will thrash cache (and Python users aren't and won't be writing free()s regardless), it's probably not a bad trade overall.
I'd assume that for pushing to lists the runtime just explicitly reserves capacity ahead of time instead of trying to integrate with the memory manager in any way. You'd want to do that anyway even if you were using malloc and malloc_usable_size or similar, as you'll end up reallocating way too much without it. (In a quick test, my system malloc with malloc_usable_size and realloc, used to push 100000 8-byte elements, ended up reallocating about 8500 times, with the entire code being just that; realloc settled on rounding up to the next 4 KB. The actual pointer only changes 10–50 times, but that's still pretty bad, and it would presumably get much worse if the code did anything other than just push to a single vec.)
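My guess at the shape of that quick test (glibc-specific because of malloc_usable_size, error handling omitted; the exact counts will obviously differ by allocator):

```c
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* malloc_usable_size, glibc-specific */
#include <stdint.h>

/* Push 100000 8-byte elements, treating malloc_usable_size as the
 * capacity and calling realloc only when it's exhausted; count how often
 * realloc runs and how often the pointer actually moves. */
int main(void) {
    size_t n = 0, reallocs = 0, moves = 0;
    int64_t *v = malloc(sizeof *v);
    size_t cap = malloc_usable_size(v) / sizeof *v;
    for (size_t i = 0; i < 100000; i++) {
        if (n == cap) {
            int64_t *w = realloc(v, (n + 1) * sizeof *v);
            reallocs++;
            if (w != v) moves++;
            v = w;
            cap = malloc_usable_size(v) / sizeof *v;
        }
        v[n++] = (int64_t)i;
    }
    printf("reallocs=%zu, pointer moves=%zu\n", reallocs, moves);
    free(v);
    return 0;
}
```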
The bump allocator declaration occurs after the 2.1 cycles per allocation line. Perhaps all the information is in the article. But it is not presented clearly, imho. YMMV.
FWIW I’m super supportive of bump allocators and arenas. I just don’t find this particular analysis compelling or sufficiently informative. YMMV.
The presentation clearly presumes a lot of implicit knowledge about generational garbage collectors that you not only don't have but firmly believed to be false.
Yeah, it is a good bit down; that said, bump allocators are a very common (most common?) option around GC anyway; I personally scrolled straight to the disassembly, and it's pretty clear from that, and the explicit mention of bump allocator is just around there too.
In defense of the OP: bump allocators are part of the CPython runtime. They're used to avoid calling malloc a bazillion times a second. Of course, the runtime still needs to defer to libc for certain objects and to grab more memory for its pools/arena/nursery.
Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.
> I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
I just want to make the point here that making bump allocators fast is useful because they're an essential part of modern allocators and will be called A LOT in the case of CPython.
That benchmark is useful to make sure the codegen/JIT is performing like it should, but it is still just measuring bump-pointer allocation in the most ideal situation. 2 cycles/alloc is entirely reasonable on modern uarchs: the critical path is two one-cycle uops.
Now, I agree with your point that this is far from meaningful for the p95/p99 of actual workloads. That is indeed the metric that users or customers actually care about. Modern GCs, like Go's, have moved away from using throughput as a metric. It's "easy" to make bump allocators fast; the hard problem is everything else around it: you still need to allocate your arenas/pools and manage the object lifetimes. In the case of CPython that means refcounting, deallocs, and GCs: work that modern CPUs really suck at (bad locality, bad branching).
I measured ~19.5% of the wallclock time of the pyperformance benchmarks being related to memory management (ISPASS25 poster). That's not taking into account the extra boxing and struct slots created for bookkeeping purposes, which add more second-order effects related to cache pressure. It's insanely hard to accurately measure memory management overheads.