A LazyImports test is generating TSAN warnings that look something like:
WARNING: ThreadSanitizer: data race (pid=453778)
Read of size 8 at 0x567fc2060568 by thread T3:
#0 _Py_TYPE_impl /home/sgross/cpython/./Include/object.h:313:16 (python+0x3265a1) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
#1 _Py_IS_TYPE_impl /home/sgross/cpython/./Include/object.h:328:12 (python+0x3265a1)
#2 compare_unicode_unicode_threadsafe /home/sgross/cpython/Objects/dictobject.c:1424:13 (python+0x3265a1)
#3 do_lookup /home/sgross/cpython/Objects/dictobject.c:1009:23 (python+0x325b48) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
#4 unicodekeys_lookup_unicode_threadsafe /home/sgross/cpython/Objects/dictobject.c:1445:12 (python+0x30d85f) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
...
Previous write of size 8 at 0x567fc2060568 by thread T2:
#0 __tsan_memset <null> (python+0xf5c91) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
#1 fill_mem_debug /home/sgross/cpython/Objects/obmalloc.c (python+0x37b9f9) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
#2 _PyMem_DebugRawAlloc /home/sgross/cpython/Objects/obmalloc.c:2904:9 (python+0x37b9f9)
#3 _PyMem_DebugRawMalloc /home/sgross/cpython/Objects/obmalloc.c:2920:12 (python+0x37b9f9)
#4 _PyMem_DebugMalloc /home/sgross/cpython/Objects/obmalloc.c:3085:12 (python+0x37b9f9)
#5 PyObject_Malloc /home/sgross/cpython/Objects/obmalloc.c:1493:12 (python+0x37a41d) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
#6 PyUnicode_New /home/sgross/cpython/Objects/unicodeobject.c:1320:24 (python+0x3fd681) (BuildId: 149c3950b350c299f7b543875dac3ce12f85640f)
...
The problem is that the reader is using "relaxed" memory ordering. (The writer is using "release", which is good).
cpython/Objects/dictobject.c, lines 1416 to 1417 in 4629567:
PyDictUnicodeEntry *ep = &((PyDictUnicodeEntry *)ep0)[ix];
PyObject *startkey = _Py_atomic_load_ptr_relaxed(&ep->me_key);
This is a case where the C11 memory model isn't a good fit for actual hardware. The C11 memory model requires at least "consume" here, but compilers implement "consume" as the stronger "acquire", which emits stronger-than-necessary instructions on aarch64 [1]. "consume" should be "free" on aarch64 (a plain load), but "acquire" requires a load-acquire instruction such as LDAR/LDAPR.
We have a few options:
1. Keep using "relaxed" and sometimes trigger TSAN warnings -- not great
2. Use "acquire" and suffer a potential performance hit on aarch64
3. Add fake "consume" bindings in pyatomic.h that are "acquire" under TSAN and "relaxed" otherwise
I think (3) is the best option, but I'll benchmark (2).
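To make option (3) concrete, here is a minimal sketch of what such a binding could look like. It is written against C11 `<stdatomic.h>` rather than the real pyatomic.h internals, and every name in it (`SKETCH_TSAN`, `sketch_load_ptr_consume`, `sketch_store_ptr_release`) is hypothetical, chosen only for illustration. The TSAN detection assumes Clang's `__has_feature(thread_sanitizer)` or GCC's `__SANITIZE_THREAD__`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Detect ThreadSanitizer. Clang exposes __has_feature(thread_sanitizer);
 * GCC defines __SANITIZE_THREAD__ when built with -fsanitize=thread. */
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define SKETCH_TSAN 1
#  endif
#endif
#if !defined(SKETCH_TSAN) && defined(__SANITIZE_THREAD__)
#  define SKETCH_TSAN 1
#endif

/* Hypothetical "consume" load (not the actual pyatomic.h API).
 * Under TSAN it is an "acquire" load, so the sanitizer sees the
 * synchronization with the writer's "release" store. Otherwise it is a
 * "relaxed" load, relying on the hardware to order dependent loads. */
static inline void *
sketch_load_ptr_consume(void *_Atomic *slot)
{
    assert(slot != NULL);
#ifdef SKETCH_TSAN
    return atomic_load_explicit(slot, memory_order_acquire);
#else
    return atomic_load_explicit(slot, memory_order_relaxed);
#endif
}

/* Writer side, for completeness: publish a pointer with "release". */
static inline void
sketch_store_ptr_release(void *_Atomic *slot, void *value)
{
    assert(slot != NULL);
    atomic_store_explicit(slot, value, memory_order_release);
}
```

With a binding along these lines, the reader in dictobject.c would call the "consume" variant in place of `_Py_atomic_load_ptr_relaxed`. Non-TSAN builds would still compile to a plain load, while TSAN builds would no longer report the race.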
FAQ
Would option (3) mean that we're not doing the right thing on aarch64, though?
No, we'd be doing the right thing. A plain load is sufficient on aarch64 (and x86-64, POWER, armv7, SPARC, etc.), as on pretty much every mainstream CPU in existence, except (famously) the DEC Alpha, which hasn't been manufactured in two decades. These CPUs respect address dependencies: they won't reorder a load that depends on the address produced by an earlier load.
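The address dependency in question can be illustrated with a small publish/read pattern. This is only a sketch: `sketch_entry`, `sketch_slot`, `sketch_publish` and `sketch_try_read` are hypothetical names, not CPython code, and the struct is a simplified stand-in for a dict entry. The reader uses C11's `memory_order_consume`, which is the ordering the memory model asks for here. As noted above, current compilers implement it as "acquire".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dict entry; not CPython's PyDictUnicodeEntry. */
typedef struct {
    int payload;
} sketch_entry;

/* Shared slot that the writer publishes into. */
static sketch_entry *_Atomic sketch_slot = NULL;

/* Writer: fully initialize the entry, then publish the pointer with
 * "release" so the initialization is ordered before the store. */
static void
sketch_publish(sketch_entry *e, int value)
{
    e->payload = value;
    atomic_store_explicit(&sketch_slot, e, memory_order_release);
}

/* Reader: load the pointer, then dereference it. The second load's
 * address depends on the value returned by the first load; that is the
 * address dependency that hardware (other than the DEC Alpha) respects.
 * Returns 1 and writes the payload to *out if an entry was published,
 * otherwise returns 0. */
static int
sketch_try_read(int *out)
{
    sketch_entry *e = atomic_load_explicit(&sketch_slot, memory_order_consume);
    if (e == NULL) {
        return 0;
    }
    *out = e->payload;  /* dependent load: address derived from e */
    return 1;
}
```

In a real program `sketch_publish` and `sketch_try_read` would run on different threads. The point of the sketch is only the shape of the reader: because `e->payload` cannot be read before `e` is known, a plain load of the pointer already gives the required ordering on these CPUs.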
References
- READ_ONCE, which is the Linux kernel counterpart to memory_order_relaxed
Footnotes
[1] There's no performance issue when using "acquire" on x86-64 because plain loads have "acquire" semantics, so "acquire" is free.