My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.
This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which uses a similar number of instructions on the (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. kmregion is about 10× the speed of malloc/free.
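For concreteness, here's a minimal sketch of the kind of bump-down fast path I mean, under the same assumptions (16-byte-aligned objects, out-of-line refill). The names are made up for illustration; this isn't kmregion's actual interface:

```c
#include <stddef.h>

/* Hypothetical arena, not kmregion's API: the current position is bumped
 * downward toward the start of the block, and a non-inlined slow path
 * takes over when the block is exhausted. */
typedef struct {
    char *pos;    /* current allocation position, bumped downward */
    char *start;  /* lowest usable address of the current block */
} arena;

void *arena_refill(arena *a, size_t n);   /* out-of-line slow path */

static inline void *arena_alloc(arena *a, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if ((size_t)(a->pos - a->start) < rounded)  /* not enough room left? */
        return arena_refill(a, rounded);
    a->pos -= rounded;          /* the whole fast path is a compare, a branch, */
    return a->pos;              /* and a subtract */
}
```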
I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.
______
† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c
https://godbolt.org/z/1oTrv1Y58
Messing with it a bit, it seems like yours has a slightly shorter dependency chain due to loading the two members separately, where UPB loads them as a pair (as it needs both in order to determine how much space is available). Yours also seems to have less register pressure. I think that's because yours bumps down. UPB supports in-place forward extension, so it needs to bump up.
If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.
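As a rough sketch of what that might look like on the hypothetical arena above (the refill function is still a placeholder; `__builtin_expect` is a GCC/Clang extension and `preserve_most` is Clang-only and target-dependent, so it's hedged behind a feature check):

```c
/* Keep the out-of-line refill from clobbering most caller-saved registers
 * (Clang-only; silently omitted elsewhere). */
#if defined(__clang__)
#define COLD_CC __attribute__((preserve_most))
#else
#define COLD_CC
#endif

COLD_CC void *arena_refill(arena *a, size_t n);

static inline void *arena_alloc(arena *a, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;
    /* Hint that the refill branch is rarely taken. */
    if (__builtin_expect((size_t)(a->pos - a->start) < rounded, 0))
        return arena_refill(a, rounded);
    a->pos -= rounded;
    return a->pos;
}
```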
However, there is an unfortunate compiler behavior here for both implementations: the compiler doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually taken, so it reloads the pointers on the hot path. This means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally they'd keep the current position in a register on the hot path and refill that register only after the cold path clobbers it.
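To spell out what keeping the position in a register would look like, here's a hand-written version of a short run of fixed-size allocations over the same hypothetical arena; a compiler that tracked the clobbering precisely could in principle generate this itself:

```c
/* Three 16-byte allocations with the bump pointer held in a local:
 * the arena struct is loaded once, stored once, and only touched in
 * between if the cold path actually runs and clobbers it. */
void make_three(arena *a, void *out[3]) {
    char *pos = a->pos;                         /* load once */
    for (int i = 0; i < 3; i++) {
        if ((size_t)(pos - a->start) < 16) {
            a->pos = pos;                       /* write back before the cold call */
            out[i] = arena_refill(a, 16);
            pos = a->pos;                       /* refill clobbered it: reload */
        } else {
            pos -= 16;                          /* pure register work on the hot path */
            out[i] = pos;
        }
    }
    a->pos = pos;                               /* single store at the end */
}
```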
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
And is the speedup over malloc/free due to large block allocation as opposed to individual malloc?
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
The reason why their allocator is faster than Boehm isn't because of conservative stack scanning.
You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.
You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
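For a rough sense of what a non-moving fast path can look like, here's a sketch of the "pop a free cell if the sweep left one, otherwise bump within the block" shape. This is only my illustration of the general idea, not JavaScriptCore's actual bump'n'pop code or Fil-C's turbosweep:

```c
#include <stddef.h>

/* Cells reclaimed by the sweep are threaded into a free list; untouched
 * space at the end of the block is handed out by bumping. */
typedef struct free_cell { struct free_cell *next; } free_cell;

typedef struct {
    free_cell *free_list;   /* cells the last sweep found dead */
    char *bump, *end;       /* never-yet-used tail of the block */
    size_t cell_size;       /* blocks are segregated by size class */
} block;

void *block_alloc_slow(block *b);   /* grab a new block, maybe collect */

static inline void *block_alloc(block *b) {
    free_cell *c = b->free_list;
    if (c) {                                           /* pop path */
        b->free_list = c->next;
        return c;
    }
    if ((size_t)(b->end - b->bump) >= b->cell_size) {  /* bump path */
        void *p = b->bump;
        b->bump += b->cell_size;
        return p;
    }
    return block_alloc_slow(b);                        /* cold path */
}
```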
I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...
Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload for me to start to believe.
You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.
Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though for interactive UIs such as videogames, tail latency is often more important. For applications like compilers and SMT solvers, on the other hand, throughput matters more.
You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
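For a sense of what that per-mutation cost looks like, here is a minimal card-marking write barrier sketch. The table layout and granularity are made up for illustration, and real collectors vary in the exact scheme they use:

```c
#include <stdint.h>

/* Every pointer store into the heap pays a couple of extra instructions
 * so a later minor collection can find old-to-young pointers.  Card
 * marking shown here; remembered-set barriers have a similar cost. */
extern uint8_t *card_table;   /* one dirty byte per 512-byte "card" of heap */

static inline void store_ref(void **slot, void *value) {
    *slot = value;
    card_table[(uintptr_t)slot >> 9] = 1;   /* mark the card containing the slot */
}
```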
There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.
But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.
Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.
Sorry, I refuse to accept a trivial loop of generating up to 2 objects at a time as ever being a relevant benchmark. I didn’t say throughput wasn’t relevant. I said this test isn’t convincing.
> coming to the conclusion that GC was somewhat faster than malloc/free
Many people have made that conclusion. It's certainly not true for real-time cases where tail latency is critical.
Modern GCs are certainly better than old ones.
For malloc/free, the specific workload matters much more than it does for a bump allocator; malloc potentially has branching and different cache access/utilization patterns depending on requested sizes or the distance/pattern between mallocs and frees, whereas a bump allocator is just bumping a single pointer, with basically zero mispredicts and a very simple linear cache-utilization pattern that's easily prefetched.
Yes bump allocators are very fast. (It would have been helpful if the blog post explained what type of allocator it was benchmarking).
But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle. And pushing onto alternating lists/vec/etc is notoriously tricky.
Workload always matters. Always.
> But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle.
Yes, actually, you can, if your pointer-bumping allocator is the nursery allocator for a generational GC, which is the case in PyPy. In fact, you can free objects in the middle so freely that it takes literally zero instructions to free them; all you have to do is not retain a reference to them, and when it is time to move the live objects out of the nursery, the GC will avoid moving them because it never finds a pointer to them. That's how this microbenchmark is able to spend only 2% of its time in the GC, even though it's doing nothing but allocating and initializing objects!
This allows you to, precisely, take a random Python GC'd workload and satisfy almost all of its allocations with a simple pointer-bumping allocator. It avoids all the trickiness of pushing onto alternating lists/vec/etc.
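A sketch of why the dead objects cost nothing, with everything simplified and made up (single pointer field per object, recursive evacuation, no write barrier shown); this is the general shape of a copying minor collection, not RPython's actual collector:

```c
#include <stddef.h>
#include <string.h>

typedef struct obj {
    size_t size;          /* total size in bytes */
    struct obj *field;    /* the one outgoing pointer, for simplicity */
    struct obj *forward;  /* set once the object has been copied out */
} obj;

typedef struct {
    char *nursery, *nursery_pos, *nursery_end;  /* bump region for new objects */
    char *old_pos;                              /* bump pointer into the old space */
    obj **roots; size_t nroots;
} gc;

static int in_nursery(gc *g, obj *o) {
    return (char *)o >= g->nursery && (char *)o < g->nursery_end;
}

static obj *evacuate(gc *g, obj *o) {
    if (!o || !in_nursery(g, o)) return o;   /* old objects stay put */
    if (o->forward) return o->forward;       /* already copied this cycle */
    obj *copy = (obj *)g->old_pos;           /* copy the live object out */
    g->old_pos += o->size;
    memcpy(copy, o, o->size);
    o->forward = copy;
    copy->field = evacuate(g, copy->field);  /* trace its outgoing pointer */
    return copy;
}

static void minor_collect(gc *g) {
    for (size_t i = 0; i < g->nroots; i++)
        g->roots[i] = evacuate(g, g->roots[i]);
    /* Everything nobody pointed to was simply never visited; this one
     * store reclaims all of it at once. */
    g->nursery_pos = g->nursery;
}
```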
It's true that in production workloads the time spent in the GC is usually more than 2%, typically more like 15%. The time wasted on write barriers, or on writing your code to copy objects unnecessarily to avoid write barriers, is harder to quantify, much like the time wasted on mutating objects unnecessarily to avoid allocation.
It is also true that workload always matters for allocator performance. So it's instructive to see how well RPython's allocator does with a workload which would be a pathological worst case for the kinds of allocators you're accustomed to, yielding performance so dramatically superior that you dismissed it as impossible, like when GMail launched with a gigabyte of email storage per user and Hotmail assumed it was a joke.
> The allocation fast path of the RPython GC is a simple bump pointer,
and the post shows the assembly for it.
Indeed a bump allocator has the disadvantage that it effectively thrashes the lower cache layers continuously, but I'd think that'd be a pretty universal, workload-independent issue. (And the cache's LRU will still keep recently used parts in cache, and the thrashing is by definition spread pretty evenly, giving the user code plenty of time to do other things in between touches of whatever it wants to keep cached, which slightly lessens the issue.)
And given that a Python impl will necessarily need to have a GC path anyway, which already will thrash cache (and Python users aren't and won't be writing free()s regardless), it's probably not a bad trade overall.
I'd assume that for pushing to lists the runtime just explicitly reserves capacity ahead of time instead of trying to integrate with the memory manager in any way. You'd want to do that anyway even if you were using malloc and malloc_usable_size or similar, as you'll end up reallocating way too much without it. (In a quick test, my system malloc with malloc_usable_size and realloc, used to push 100000 8-byte elements, ended up reallocating about 8500 times, with the entire code being just that; realloc settled on rounding up to the next 4 KB. The actual pointer only changes 10–50 times, but that's still pretty bad, and it would presumably get much worse if the code did anything other than just push to a single vec.)
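My guess at the shape of that quick test (glibc-specific because of malloc_usable_size, error handling omitted; the exact counts will obviously differ by allocator):

```c
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* malloc_usable_size, glibc-specific */
#include <stdint.h>

/* Push 100000 8-byte elements, treating malloc_usable_size as the
 * capacity and calling realloc only when it's exhausted; count how often
 * realloc runs and how often the pointer actually moves. */
int main(void) {
    size_t n = 0, reallocs = 0, moves = 0;
    int64_t *v = malloc(sizeof *v);
    size_t cap = malloc_usable_size(v) / sizeof *v;
    for (size_t i = 0; i < 100000; i++) {
        if (n == cap) {
            int64_t *w = realloc(v, (n + 1) * sizeof *v);
            reallocs++;
            if (w != v) moves++;
            v = w;
            cap = malloc_usable_size(v) / sizeof *v;
        }
        v[n++] = (int64_t)i;
    }
    printf("reallocs=%zu, pointer moves=%zu\n", reallocs, moves);
    free(v);
    return 0;
}
```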
The bump allocator declaration occurs after the 2.1 cycles per allocation line. Perhaps all the information is in the article. But it is not presented clearly, imho. YMMV.
FWIW I’m super supportive of bump allocators and arenas. I just don’t find this particular analysis compelling or sufficiently informative. YMMV.
The presentation clearly presumes a lot of implicit knowledge about generational garbage collectors that you not only don't have but firmly believed to be false.
Yeah, it is a good bit down; that said, bump allocators are a very common (most common?) option around GC anyway; I personally scrolled straight to the disassembly, and it's pretty clear from that, and the explicit mention of bump allocator is just around there too.
In defense of the OP: bump allocators are part of the CPython runtime. They're used to avoid calling malloc a bazillion times a second. Of course, the runtime still needs to defer to libc for certain objects and to grab more memory for its pools/arena/nursery.
Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.
> I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
I just want to make the point here that making bump allocators fast is useful because they're an essential part of modern allocators and will be called A LOT in the case of CPython.
That benchmark is useful to make sure the codegen/JIT is performing like it should, but it is still just measuring bump-pointer allocation in the most ideal situation. 2 cycles/alloc is entirely reasonable on modern uarchs: the critical path is two one-cycle uops.
Now, I agree with your point that this is far from meaningful for the p95/p99 of actual workloads. That is indeed the metric that users or customers actually care about. Modern GCs, like Go's, have moved away from using throughput as a metric. It's "easy" to make bump allocators fast; the hard problem is everything else around it: you still need to allocate your arenas/pools and manage the object lifetimes. In the case of CPython that means refcounting, deallocs, and GCs: work that modern CPUs really suck at (bad locality, bad branching).
I measured ~19.5% of the wallclock time of the pyperformance benchmarks being related to memory management (ISPASS25 poster). That's not taking into account the extra boxing and struct slots created for bookkeeping purposes, which add more second-order effects related to cache pressure. It's insanely hard to accurately measure memory management overheads.