pymalloc allocates small objects from contiguous regions called arenas. On 64-bit platforms each arena is 1 MiB, obtained via mmap(MAP_PRIVATE|MAP_ANONYMOUS) and backed by 256 standard 4 KiB pages. Each page needs its own TLB entry, and a typical x86_64 first-level dTLB holds only 64-128 entries for 4 KiB pages, so a single arena already exceeds its capacity, and any non-trivial Python program touches many arenas.
Most modern operating systems support "huge pages": memory pages much larger than the default 4 KiB. On x86_64 Linux the standard huge page size is 2 MiB. A single 2 MiB huge page is covered by one TLB entry instead of 512 entries for the equivalent range of 4 KiB pages. This dramatically reduces TLB pressure for workloads that touch large contiguous allocations. On Linux, explicit huge pages are allocated via mmap with the MAP_HUGETLB flag (available since kernel 2.6.32) from a pre-reserved pool configured through /proc/sys/vm/nr_hugepages. On Windows, the equivalent is VirtualAlloc with MEM_LARGE_PAGES.
I'd like to propose adding a ./configure --with-pymalloc-hugepages option that increases ARENA_BITS from 20 to 21 (1 MiB -> 2 MiB) and makes _PyMem_ArenaAlloc() try mmap(MAP_HUGETLB) first, falling back to regular mmap if the huge page pool is exhausted. On Windows the equivalent would be VirtualAlloc(MEM_LARGE_PAGES) with fallback. _PyMem_ArenaFree() needs no changes since munmap handles huge pages identically. All derived constants (ARENA_SIZE, MAX_POOLS_IN_ARENA, radix tree bit widths, nfp2lasta sizing) adjust automatically from ARENA_BITS.
The flag is opt-in and off by default. MAP_HUGETLB requires the kernel to have huge pages pre-allocated; without them the fallback path produces identical behavior to a non-hugepages build. On Linux, huge pages are managed through /proc/sys/vm/nr_hugepages. To allocate 128 huge pages (256 MiB on x86_64 where the default huge page size is 2 MiB):
```shell
# Allocate (requires root)
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
# Verify
grep HugePages /proc/meminfo
# HugePages_Total: 128
# HugePages_Free: 128
# Make persistent across reboots by adding to /etc/sysctl.conf:
# vm.nr_hugepages = 128
```
Each arena consumes one huge page. If the pool runs out, obmalloc falls back to regular 4K pages transparently.
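Which path a given arena ended up on can be confirmed at runtime: on Linux, `/proc/self/smaps` reports `KernelPageSize: 2048 kB` for MAP_HUGETLB mappings and `4 kB` for regular ones. The helper below is illustrative only (not part of the proposal) and is Linux-specific.

```c
/* Illustrative, Linux-only: report the kernel page size (in kB) backing
 * the mapping that contains `addr`, by scanning /proc/self/smaps. */
#include <inttypes.h>
#include <stdio.h>

static long
kernel_page_size_kb(const void *addr)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (f == NULL) {
        return -1;
    }
    uintptr_t target = (uintptr_t)addr;
    char line[256];
    int in_range = 0;
    long result = -1;
    while (fgets(line, sizeof line, f)) {
        uintptr_t lo, hi;
        /* Mapping header lines look like "7f12...-7f13... rw-p ...". */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) == 2) {
            in_range = (lo <= target && target < hi);
        }
        else if (in_range &&
                 sscanf(line, "KernelPageSize: %ld kB", &result) == 1) {
            break;  /* 4 for normal pages, 2048 for 2 MiB huge pages */
        }
    }
    fclose(f);
    return result;
}
```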
I benchmarked on an i9-14900KS, Linux 6.18.3, GCC 15.2.1 on main with nr_hugepages=128. Measured with perf stat -r 10 using cpu_core counters. GC disabled during benchmarks.
Wall-clock results:
| Benchmark | Default | Hugepages | Change |
|---|---|---|---|
| list_of_tuples (1M 3-tuples) | 0.172s | 0.121s | -29.5% |
| fragmentation (500K alloc/free/realloc) | 0.162s | 0.119s | -26.5% |
| mixed_sizes (500K, 12 size classes) | 0.141s | 0.106s | -25.1% |
| bulk_small_alloc (1M bytearrays) | 0.205s | 0.160s | -22.1% |
| class_instances (500K `__slots__`) | 0.120s | 0.096s | -20.0% |
| arena_pressure (10x200K objects) | 0.509s | 0.448s | -12.1% |
| random_walk (1M, shuffled access) | 0.822s | 0.759s | -7.6% |
dTLB miss reductions:
| Benchmark | dTLB Load Miss | dTLB Store Miss | Page Faults |
|---|---|---|---|
| fragmentation | -95.9% | -94.7% | -94.5% |
| random_walk | -93.1% | -98.9% | -91.6% |
| bulk_small_alloc | -91.4% | -94.5% | -93.5% |
| list_of_tuples | -88.0% | -93.7% | -94.1% |
| class_instances | -84.3% | -91.8% | -92.1% |
| mixed_sizes | -80.8% | -76.5% | -78.2% |
The perf command used per benchmark:
```shell
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
perf stat -r 10 -e "$EVENTS" ./python bench_obmalloc.py fragmentation
```
bench_obmalloc.py
```python
import sys, gc

def bench_small_object_churn():
    objs = []
    for _ in range(200_000):
        objs.append(bytearray(64))
    for _ in range(200_000):
        objs.append(bytearray(64))
        objs.pop(0)

def bench_bulk_small_alloc():
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs:
        o[0] = 1

def bench_dict_churn():
    for _ in range(500_000):
        d = {"a": 1, "b": 2, "c": 3, "d": 4}
        del d

def bench_mixed_sizes():
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2):
        objs[i] = None
    for i in range(0, len(objs), 2):
        objs[i] = bytearray(128)

def bench_list_of_tuples():
    objs = [(i, i + 1, i + 2) for i in range(1_000_000)]

def bench_class_instances():
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(s, x, y, z):
            s.x = x; s.y = y; s.z = z
    objs = [Pt(i, i + 1, i + 2) for i in range(500_000)]

def bench_arena_pressure():
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    import random
    random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs)))
    random.shuffle(idx)
    for i in idx:
        objs[i][0] = i & 0xff

BENCHMARKS = dict(
    small_object_churn=bench_small_object_churn,
    bulk_small_alloc=bench_bulk_small_alloc,
    dict_churn=bench_dict_churn,
    mixed_sizes=bench_mixed_sizes,
    fragmentation=bench_fragmentation,
    list_of_tuples=bench_list_of_tuples,
    class_instances=bench_class_instances,
    arena_pressure=bench_arena_pressure,
    random_walk=bench_random_walk,
)

if __name__ == "__main__":
    gc.collect()
    gc.disable()
    BENCHMARKS[sys.argv[1]]()
    gc.enable()
```
Full reproduction:
```shell
./configure && make -j$(nproc) && cp python python_default
./configure --with-pymalloc-hugepages && make -j$(nproc) && cp python python_hugepages
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
for b in bulk_small_alloc mixed_sizes fragmentation list_of_tuples class_instances arena_pressure random_walk; do
    echo "=== $b ==="
    perf stat -r 10 -e "$EVENTS" ./python_default bench_obmalloc.py "$b"
    perf stat -r 10 -e "$EVENTS" ./python_hugepages bench_obmalloc.py "$b"
done
```
Linked PRs
- madvise(MADV_HUGEPAGE) #144353