Consider:

    class C:
        def __init__(self, a, b, c):
            self.a = a
            self.b = b
            self.c = c

    C(1, 2, 3)

This produces a trace looking something like this:

    ...
    _CHECK_AND_ALLOCATE_OBJECT
    _CREATE_INIT_FRAME
    _PUSH_FRAME
    # Some guards
    _LOAD_FAST_BORROW_1
    _LOAD_FAST_BORROW_0
    # Some more guards
    _STORE_ATTR_INSTANCE_VALUE
    # Some more guards
    _LOAD_FAST_BORROW_2
    _LOAD_FAST_BORROW_0
    # Some more guards
    _STORE_ATTR_INSTANCE_VALUE
    # Some more guards
    _LOAD_FAST_BORROW_3
    _LOAD_FAST_BORROW_0
    # Some more guards
    _STORE_ATTR_INSTANCE_VALUE
    ...

Each of those _STORE_ATTR_INSTANCE_VALUE uops reads the old value out of memory and then conditionally decrefs it.
But in this case the object was allocated earlier in the same trace, so we know that the old value is NULL and we can just overwrite it.
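
The "old value is NULL" claim can be observed from Python. The following is a minimal sketch, not part of the proposed change: it uses C.__new__(C) as a stand-in for the allocation step (an assumption for illustration; the trace itself uses _CHECK_AND_ALLOCATE_OBJECT) and checks that no attribute exists before __init__ runs.

```python
class C:
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c


# Allocate an instance without running __init__.
# No attribute has been stored yet, so every store in __init__
# is the first store to its attribute.
fresh = C.__new__(C)
fresh_attrs = dict(vars(fresh))

# After __init__, each attribute has been stored exactly once,
# in the order the assignments appear in the source.
obj = C(1, 2, 3)
init_attrs = dict(vars(obj))
```

Because nothing can have written to the attributes between allocation and the stores in __init__, the check on the old value always takes the same branch here.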
So we can replace this:

    PyObject **value_ptr = (PyObject**)(((char *)owner_o) + offset);
    PyObject *old_value = *value_ptr;
    FT_ATOMIC_STORE_PTR_RELEASE(*value_ptr, PyStackRef_AsPyObjectSteal(value));
    if (old_value == NULL) {
        PyDictValues *values = _PyObject_InlineValues(owner_o);
        Py_ssize_t index = value_ptr - values->values;
        _PyDictValues_AddToInsertionOrder(values, index);
    }
    Py_XDECREF(old_value);

with this:

    PyObject **value_ptr = (PyObject**)(((char *)owner_o) + offset);
    FT_ATOMIC_STORE_PTR_RELEASE(*value_ptr, PyStackRef_AsPyObjectSteal(value));
    PyDictValues *values = _PyObject_InlineValues(owner_o);
    Py_ssize_t index = value_ptr - values->values;
    _PyDictValues_AddToInsertionOrder(values, index);

On AArch64, this reduces the number of machine instructions from 48 to 26.
The same reasoning also applies to _STORE_ATTR_SLOT, where it reduces the number of machine instructions from 32 to 14.
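
To illustrate why the two variants are interchangeable when the slot is known to be empty, here is a toy model in Python. It is only a sketch: ToyValues, store_general and store_known_null are hypothetical names invented for this example, None stands in for a NULL pointer, and a counter stands in for the decref. It does not reflect how CPython implements inline values.

```python
NULL = None  # stand-in for a NULL PyObject* slot (assumption of this model)


class ToyValues:
    """Toy model of inline values: slots plus an insertion-order list."""

    def __init__(self, size):
        self.slots = [NULL] * size
        self.insertion_order = []
        self.decrefs = 0  # counts how often a non-NULL old value was released


def store_general(values, index, value):
    """General store: read the old value, branch on it, release it."""
    old = values.slots[index]
    values.slots[index] = value
    if old is NULL:
        values.insertion_order.append(index)
    else:
        values.decrefs += 1  # models Py_XDECREF on a non-NULL old value


def store_known_null(values, index, value):
    """Specialized store: only valid when the slot is known to be NULL."""
    values.slots[index] = value
    values.insertion_order.append(index)


general = ToyValues(3)
fast = ToyValues(3)
for i, v in enumerate((1, 2, 3)):
    store_general(general, i, v)
    store_known_null(fast, i, v)
```

Both objects end up in the same state, and the general version never releases anything, which is the work the specialized store skips: the load of the old value, the branch, and the decref.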
See also #134584
We can probably remove some of those guards as well, but that's a separate issue.