gh-95004: specialize access to enums and fix scaling on free-threading#148184
kumaraditya303 merged 7 commits into python:main
Conversation
Fidget-Spinner
left a comment
Thanks for doing this, I have just 3 comments.
Fidget-Spinner
left a comment
Pretty close, just one question and one minor nit.
Misc/NEWS.d/next/Core_and_Builtins/2026-04-06-18-25-53.gh-issue-95004.CQeT_H.rst
colesbury
left a comment
I just tried this on a 96-core AWS machine and the scaling is not great: ~2.9x faster.
- Given that, I'm pretty ambivalent about the change.
- I don't like to add benchmarks to Tools/ftscalingbench/ftscalingbench.py that don't scale well.
On my mac I see 4.2x faster:

❯ ./python.exe Tools/ftscalingbench/ftscalingbench.py
Running benchmarks with 10 threads
object_cfunction 3.2x faster
cmodule_function 2.8x faster
object_lookup_special 4.3x faster
context_manager 4.9x faster
mult_constant 2.3x faster
generator 2.3x faster
pymethod 3.0x faster
pyfunction 2.7x faster
module_function 2.7x faster
load_string_const 4.1x faster
load_tuple_const 3.6x faster
create_pyobject 4.4x faster
create_closure 4.4x faster
create_dict 3.8x faster
create_frozendict 4.0x faster
thread_local_read 3.5x faster
method_caller 3.3x faster
instantiate_dataclass 4.9x faster
instantiate_namedtuple 4.9x faster
instantiate_typing_namedtuple 4.9x faster
super_call 4.6x faster
classmethod_call 3.8x faster
staticmethod_call 3.5x faster
deepcopy 1.8x slower
setattr_non_interned 4.5x faster
enum_attr 4.2x faster
I can remove the benchmark if you prefer.
Could it be that the Enum itself is not using deferred refcounting? It's a LOAD_GLOBAL_MODULE which increfs and then decrefs at each LOAD_ATTR.
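For context, a minimal sketch (with hypothetical names, not the actual ftscalingbench code) of the access pattern being discussed: reading an Enum member through a module-level global compiles to a LOAD_GLOBAL followed by LOAD_ATTR, which is exactly the incref/decref pair that limits scaling when the object lacks deferred refcounting.

```python
import dis
from enum import Enum

class Color(Enum):
    RED = 1

def hot_loop(n):
    # Each iteration performs LOAD_GLOBAL (Color) then LOAD_ATTR (RED, value):
    # on free-threaded builds, every LOAD_ATTR increfs/decrefs the owner
    # unless deferred refcounting is enabled for it.
    total = 0
    for _ in range(n):
        total += Color.RED.value
    return total

ops = {ins.opname for ins in dis.get_instructions(hot_loop)}
print("LOAD_GLOBAL" in ops, "LOAD_ATTR" in ops)
```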
I found the problem: it seems that this perpetually deopts at the first _GUARD_TYPE_VERSION. That then causes a re-specialization, which is obviously bottlenecked on a lot of things. So it seems the current specialization/deopt needs to be fixed. Investigating now.
(Before/after profiler screenshots not preserved.)

Seems the pre-existing specialization for METACLASS_CHECK was bugged. Diff to fix this:

```diff
diff --git a/Python/specialize.c b/Python/specialize.c
index 355b6eabdb7..bfa7b8148e4 100644
--- a/Python/specialize.c
+++ b/Python/specialize.c
@@ -1220,13 +1220,14 @@ specialize_class_load_attr(PyObject *owner, _Py_CODEUNIT *instr,
 #ifdef Py_GIL_DISABLED
     maybe_enable_deferred_ref_count(descr);
 #endif
-    write_u32(cache->type_version, tp_version);
     write_ptr(cache->descr, descr);
     if (metaclass_check) {
-        write_u32(cache->keys_version, meta_version);
+        write_u32(cache->keys_version, tp_version);
+        write_u32(cache->type_version, meta_version);
         specialize(instr, LOAD_ATTR_CLASS_WITH_METACLASS_CHECK);
     }
     else {
+        write_u32(cache->type_version, tp_version);
         specialize(instr, LOAD_ATTR_CLASS);
     }
     Py_XDECREF(descr);
```

This is bugged in 3.14 as well it seems: 5d3201f
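A toy model (plain Python, not CPython internals) of why swapping the two cache slots matters: for the METACLASS_CHECK variant the first guard compares the *metaclass's* version tag against the type_version slot, so storing tp_version there makes that guard fail on every execution, producing the perpetual deopt/re-specialize loop described above. The names below are illustrative only.

```python
def guard_passes(cache, tp_version, meta_version):
    """Sketch of the two guards run by LOAD_ATTR_CLASS_WITH_METACLASS_CHECK."""
    if cache["type_version"] != meta_version:
        # first guard (_GUARD_TYPE_VERSION) checks the metaclass's version;
        # a mismatch deopts and triggers re-specialization
        return False
    if cache["keys_version"] != tp_version:
        # second guard checks the class's own version
        return False
    return True

# Suppose the class has version 101 and its metaclass has version 7.
buggy = {"type_version": 101, "keys_version": 7}   # tp_version in the wrong slot
fixed = {"type_version": 7, "keys_version": 101}   # meta_version first, as in the diff

print(guard_passes(buggy, tp_version=101, meta_version=7))  # False: deopts every time
print(guard_passes(fixed, tp_version=101, meta_version=7))  # True: cache hit
```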
@colesbury can you please try this again? Thank you!
I see much better scaling now on AMD Ryzen Threadripper 3970X 32-Core Processor:

taskset -c 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./python Tools/ftscalingbench/ftscalingbench.py enum_attr -t 16 --scale 10000
Running benchmarks with 16 threads
enum_attr 15.5x faster

Thanks @Fidget-Spinner for fixing this
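For readers without the benchmark tool at hand, a minimal sketch (assumed structure, not the actual ftscalingbench code) of what an enum_attr scaling benchmark exercises: several threads all reading the same Enum member attribute, which is the contended access that the specialization fix makes scale.

```python
import threading
from enum import Enum

class Color(Enum):
    RED = 1

def worker(n, out, i):
    # Repeated Enum attribute reads: the contended operation being measured.
    total = 0
    for _ in range(n):
        total += Color.RED.value
    out[i] = total

def run(threads=4, n=100_000):
    out = [0] * threads
    ts = [threading.Thread(target=worker, args=(n, out, i)) for i in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return out

print(run(threads=2, n=1000))
```

The real tool additionally times a single-threaded baseline against the multi-threaded run to report the "Nx faster" figures quoted above.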