Feature or enhancement
Proposal:
On the free-threaded build, threading's concurrency primitives incur significant extra overhead when a single instance is used from multiple threads, due to reference count contention. For example:
import threading
import time

# One lock shared by all eight threads.
lock = threading.Lock()

def scale():
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()
vs
import threading
import time

def scale():
    # Each thread creates its own lock, so nothing is shared.
    lock = threading.Lock()
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()
Comparing the two on a 3.15t release build (one timing per thread):
Per-thread locks (second example):
0.38904138099997 s
0.39082639699995525 s
0.4013638610001635 s
0.40917961700006344 s
0.526825904000134 s
0.5402126970000154 s
0.540466712999887 s
0.5586060919999909 s
Shared lock (first example):
3.425866439999936 s
3.5953266010001244 s
3.6094701500001065 s
3.667731437000157 s
4.458146230000011 s
4.466017671000145 s
4.499206339000011 s
4.50090869099995 s
That's a roughly 9x slowdown for the shared lock (about 90% of its runtime is overhead), solely due to reference count contention.
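As a quick check of the magnitude, here is a small sketch that averages the two groups of timings reported above (rounded to three decimals). The variable names are only illustrative, and it assumes the eight smaller values and the eight larger values each come from a single run.

```python
from statistics import mean

# Timings copied from the output above, rounded to three decimals.
fast_run = [0.389, 0.391, 0.401, 0.409, 0.527, 0.540, 0.540, 0.559]
slow_run = [3.426, 3.595, 3.609, 3.668, 4.458, 4.466, 4.499, 4.501]

# How many times longer the slow run takes on average.
ratio = mean(slow_run) / mean(fast_run)

# Share of the slow run's time that is absent from the fast run.
extra_fraction = 1 - mean(fast_run) / mean(slow_run)

print(f"slow/fast ratio: {ratio:.1f}x")                    # 8.6x
print(f"extra share of slow runtime: {extra_fraction:.0%}")  # 88%
```

So the slow run takes a bit under nine times as long, and just under 90% of its time does not appear in the fast run.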
We can significantly reduce this overhead by enabling deferred reference counting on these objects. This is already done for threading.local, and the same can be done for Lock and RLock. It would also be nice to cover primitives like Event, but since Event is implemented in pure Python, that will require a (private) API to expose deferred reference counting (DRC) to Python code.
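To see whether another primitive shows the same pattern, the benchmark can be generalized. The following is a minimal sketch, not part of the proposal: `bench` and its parameters are hypothetical names, and the issue itself only reports numbers for Lock, so any result for Event is something to measure rather than assume. Unlike the scripts above, this version joins the threads and passes the object through a helper callable, which adds the same constant overhead to both variants.

```python
import threading
import time


def bench(make_primitive, call, shared, n_threads=8, iterations=100_000):
    """Return one wall-clock timing per thread for calling call(obj) in a loop.

    If shared is true, every thread uses the same object; otherwise each
    thread creates its own. (Hypothetical helper for illustration only.)
    """
    shared_obj = make_primitive() if shared else None
    timings = [0.0] * n_threads

    def worker(index):
        obj = shared_obj if shared else make_primitive()
        start = time.perf_counter()
        for _ in range(iterations):
            call(obj)
        timings[index] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return timings


if __name__ == "__main__":
    cases = {
        "Lock.locked": (threading.Lock, lambda lock: lock.locked()),
        "Event.is_set": (threading.Event, lambda event: event.is_set()),
    }
    for name, (factory, call) in cases.items():
        for shared in (True, False):
            worst = max(bench(factory, call, shared))
            label = "shared" if shared else "per-thread"
            print(f"{name:13} {label:10} slowest thread: {worst:.3f} s")
```

The iteration count is kept low so the script finishes quickly; raise `iterations` on a free-threaded build to get timings comparable to the ones above.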
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
N/A
Linked PRs
threading concurrency primitives #134762