
Conversation

@Fidget-Spinner (Member) commented Aug 18, 2024

This implements PEP 703 deferred refcounting for LOAD_GLOBAL in an agnostic way using the cases generator, so it won't block fully deferring references in the future.

For now, we need to write directly to a stackref entry to keep the global alive. Changing the dictionary lookup to stackref variants is also a prerequisite for LOAD_ATTR and friends.
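
For readers who haven't touched the stackref machinery, here is a rough conceptual sketch of what a "deferred" reference means in the free-threaded build. The helper name is invented for illustration; the real logic lives in the PyStackRef_* inlines (roughly what PyStackRef_FromPyObjectNew is described as doing later in this thread):

/* Conceptual sketch only -- stackref_from_global is a made-up name. */
static inline _PyStackRef
stackref_from_global(PyObject *value)
{
    if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
        /* No refcount traffic: push a tagged "deferred" reference. */
        return (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
    }
    /* Otherwise take an ordinary strong reference. */
    return PyStackRef_FromPyObjectSteal(Py_NewRef(value));
}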

@picnixz (Member) left a comment:

I can only nitpick on things I understand :') But I will, at some point, try to understand the magic of bytecodes.c.



def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None:
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None:
Member:

Suggested change:
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None:
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None:

goto read_failed;
}

if (ix >= 0){
@picnixz (Member), Aug 18, 2024:

Maybe put the case if (ix < 0) first to reflect the if (rc < 0) constructions that we usually do?

Comment on lines 1584 to 1585
ix = _Py_dict_lookup(mp, key, hash, &value);
*value_addr = value == NULL ? PyStackRef_NULL : PyStackRef_FromPyObjectNew(value);
Member:

Can't you use the _Py_dict_lookup_threadsafe_stackref helper here since you acquired the lock? This would save the PyObject *value declaration (and you could consider inlining the helper actually)

goto read_failed;
}
if (ix >= 0){
*value_addr = PyStackRef_FromPyObjectNew(DK_ENTRIES(dk)[ix].me_value);
@Fidget-Spinner (author):

I'm not quite sure I understand the TSAN failures on dictionaries here. @colesbury do you have a clue?

Contributor:

This is not thread-safe if the value is not deferred or immortal.

@Fidget-Spinner (author):

Wait, but doesn't PyStackRef_FromPyObjectNew incref the value if it's not deferred or immortal, which makes it thread-safe? Or maybe I'm missing something.

Contributor:

Calling Py_INCREF() on a value that might be concurrently freed is not thread-safe.

@Fidget-Spinner (author):

Ahh okay that makes sense now. So one possible solution would be to introduce another variant that tries to incref and is allowed to fail? Like in https://github.com/python/cpython/blob/main/Objects/dictobject.c#L1435

Or we can check if it's deferred/immortal, and only if it's not, do the TryXGetRef thing. What do you think?

Contributor:

Yeah, do the check that it's deferred/immortal and if not do the TryXGetRef thing.

I think it's best to write it out explicitly first. If there are other uses of it, we can refactor it into a function later.
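
Written out, the pattern being discussed looks roughly like this. This is only a sketch reusing names that appear later in the thread; addr_of_value is assumed to point at the value slot of the dict entry and value_addr at the output stackref:

PyObject *value = _Py_atomic_load_ptr(addr_of_value);
if (value == NULL) {
    *value_addr = PyStackRef_NULL;
}
else if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
    /* Deferred/immortal: safe to use without touching the refcount. */
    *value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
}
else if (_Py_TryIncrefCompare(addr_of_value, value)) {
    /* The incref succeeded while the slot still held value, so it cannot
       have been freed under us; hand the strong reference to the stackref. */
    *value_addr = PyStackRef_FromPyObjectSteal(value);
}
else {
    /* Lost a race with a concurrent mutation; the caller would retry or
       fall back to a locked lookup. */
    *value_addr = PyStackRef_NULL;
}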

dk = _Py_atomic_load_ptr(&mp->ma_keys);
kind = dk->dk_kind;

if (kind != DICT_KEYS_GENERAL){
Contributor:

This is complicated and duplicates logic elsewhere. I think it might be worth only supporting a single case or two. For example, unicode-only and non-split dict. Fall back to _Py_dict_lookup_threadsafe + PyStackRef_FromPyObjectSteal for other cases.
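
A minimal sketch of that fallback, using the names from this thread (not the exact merged code):

/* Slow path for anything that is not the common unicode-keys case. */
PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
    /* _Py_dict_lookup_threadsafe hands back a strong reference; steal it. */
    *value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
    *value_addr = PyStackRef_NULL;
}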

Member:

I don't think it is worth changing anything. LOAD_GLOBAL is specialized incredibly well, so this only applies to a rare slow path.

Contributor:

It's not specializing at all in the free-threaded build, so LOAD_GLOBAL is a scaling bottleneck now that fewer types are immortalized.

I don't think we want to wait for specialization to be made thread-safe, but we can treat this as a temporary measure and structure the code to reflect that.

@Fidget-Spinner (author):

I changed this case to fall back to _Py_dict_lookup_threadsafe

Contributor:

This looks backwards to me. The common case should be DICT_KEYS_UNICODE for globals dictionaries, but this only handles DICT_KEYS_GENERAL efficiently.

Contributor:

I think this is still too complicated. We only need to handle the common case of unicode keys. Globals are almost never split dictionaries.

I think this should look something like the following. You can refactor it further by extracting a _Py_TryXGetStackRef() that's like _Py_TryXGetRef but returns a _PyStackRef.

PyDictKeysObject *dk = _Py_atomic_load_ptr(&mp->ma_keys);
if (dk->dk_kind == DICT_KEYS_UNICODE && PyUnicode_CheckExact(key)) {
    Py_ssize_t ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
    if (ix == DKIX_EMPTY) {
        *value_addr = PyStackRef_NULL;
        return ix;
    }
    else if (ix >= 0) {
        PyObject **addr_of_value = &DK_UNICODE_ENTRIES(dk)[ix].me_value;
        PyObject *value = _Py_atomic_load_ptr(addr_of_value);
        if (value == NULL) {
            *value_addr = PyStackRef_NULL;
            return DKIX_EMPTY;
        }
        if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
            *value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
            return ix;
        }
        if (_Py_TryIncrefCompare(addr_of_value, value)) {
            *value_addr = PyStackRef_FromPyObjectSteal(value);
            return ix;
        }
    }
}

PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
    *value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
    *value_addr = PyStackRef_NULL;
}
return ix;

@markshannon (Member) left a comment:

I don't really see the point of these changes. LOAD_GLOBAL is something like 1 in 5000 executed instructions. Its performance is irrelevant.

What is STACK_ENTRY for? All outputs are already declared in the stack comment for the op.

extern Py_ssize_t _PyDict_LookupIndex(PyDictObject *, PyObject *);
extern Py_ssize_t _PyDictKeys_StringLookup(PyDictKeysObject* dictkeys, PyObject *key);
PyAPI_FUNC(PyObject *)_PyDict_LoadGlobal(PyDictObject *, PyDictObject *, PyObject *);
PyAPI_FUNC(void)_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *);
Member:

Suggested change:
PyAPI_FUNC(void) _PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *);
PyAPI_FUNC(_PyStackRef) _PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *);

Likewise
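
For illustration only, a call site under the suggested return-by-value signature could look like this (hypothetical sketch, not what was merged):

_PyStackRef v = _PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(),
                                           (PyDictObject *)BUILTINS(),
                                           name);
if (PyStackRef_IsNull(v)) {
    /* name not found, or an error occurred */
}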

(PyDictObject *)BUILTINS(),
name);
if (v_o == NULL){
_PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(),
Member:

This is inconsistent with PyMapping_GetOptionalItem above.
I think it would be better to change the whole instruction to use stack refs, or leave it as it is.

@Fidget-Spinner (author):

I removed stackrefs from the instruction to leave it as is.

_PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(),
(PyDictObject *)BUILTINS(),
name,
STACK_ENTRY(v));
Member:

What does STACK_ENTRY do? Does it have any side effects, or is it just equivalent to &v?

@Fidget-Spinner (author):

Practically equivalent to &stack_pointer[whatever_v_index_is]. It has no side effects.



def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None:
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None:
Member:

Don't we want balanced parentheses? What is the construct that doesn't balance parentheses?

@Fidget-Spinner (author):

Specifically, it's when we replace something in the middle of the line but still want the rest of the line (and all its parentheses).

Member:

Can you give me an example?

@Fidget-Spinner (author):

E.g. we replace the original foo(STACK_ENTRY(x)); with foo(stack_pointer[-1]);. We want the replacement function to write out everything until the semicolon, without enforcing bracket balancing from the replacement point up to that semicolon.
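
Concretely, using the example above (the stack offset is whatever the generator computes):

/* Before code generation, as written in bytecodes.c: */
foo(STACK_ENTRY(x));
/* After the STACK_ENTRY(...) call, including its closing paren, has been
   replaced: */
foo(stack_pointer[-1]);
/* The tail that emit_to() still has to copy up to the ";" is just ");",
   a ")" whose matching "(" was emitted before the replacement ran, so
   strict paren balancing would reject it. */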

Contributor:

Instead of adding allow_unbalanced_parens can we push the logic into the caller (stack_entry)? Like replace up to the matching ) and then consume/emit tokens up to the end of the statement?


@Fidget-Spinner (author):

@colesbury I think the PR is in better shape now. Can you take a look please?

@colesbury (Contributor) left a comment:

Let's ensure that this actually improves the scalability of LOAD_GLOBAL calls.

PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1);
PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name);
ERROR_IF(res_o == NULL, error);
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, STACK_ENTRY(res));
Contributor:

Two suggestions:

  1. Make _PyEval_LoadGlobalStackRef and _PyDict_LoadGlobalStackRef return int error codes
  2. Use &STACK_ENTRY(res) because it makes it more clear that the value is a pointer and because it means the STACK_ENTRY() transformation preserves the type.

@Fidget-Spinner (author):

I will apply 2. but not 1. The reasoning is to keep consistency with _PyEval_LoadGlobal (non-stackref version).



@Fidget-Spinner (author):

Main:

object_cfunction MEGA FAILED: 1.1x slower
cmodule_function MEGA FAILED: 1.5x slower
generator FAILED: 2.7x faster
pymethod MEGA FAILED: 1.6x slower
pyfunction MEGA FAILED: 2.3x slower
module_function MEGA FAILED: 1.6x slower
load_string_const MEGA FAILED: 1.6x slower
load_tuple_const PASSED: 3.9x faster
create_closure MEGA FAILED: 2.8x slower
create_pyobject MEGA FAILED: 2.1x slower

This branch:

object_cfunction MEGA FAILED: 1.0x slower
cmodule_function MEGA FAILED: 1.2x slower
generator PASSED: 3.1x faster
pymethod MEGA FAILED: 1.4x slower
pyfunction MEGA FAILED: 2.4x slower
module_function MEGA FAILED: 1.2x slower
load_string_const MEGA FAILED: 1.6x slower
load_tuple_const PASSED: 3.9x faster
create_closure MEGA FAILED: 2.6x slower
create_pyobject PASSED: 4.2x faster

Create_pyobject became a lot faster!

@Fidget-Spinner (author):

Benchmarking results courtesy of Mike: no slowdown on JIT build https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20240828-3.14.0a0-bfd4400-JIT

@colesbury (Contributor) left a comment:

Sorry for the delay in reviewing this. I think we should simplify this further to only focus on the case we care about.


Co-Authored-By: Sam Gross <[email protected]>
@colesbury (Contributor) left a comment:

LGTM

@markshannon (Member) left a comment:

Can we go back to defining res as an array of one?

PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1);
PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name);
ERROR_IF(res_o == NULL, error);
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res));
Member:

Suggested change:
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res));
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &res[0]);

Can we go back to the earlier design of making res an array?
Please also add a comment explaining why res needs to be treated specially (we need to pass a pointer to _PyEval_LoadGlobalStackRef).

The code generator already understands arrays, and it avoids confusion as to where it should insert stack spills for fully deferred reference counting.

You'll be able to remove def stack_entry below.

Sorry for the back and forth on this.

# unused portions of the stack to NULL.
stack.flush_single_var(self.out, target, uop.stack.outputs)

def stack_entry(
Member:

I'm not sure that the stack calculations are correct here, and I don't think it will be able to handle code that writes to multiple results, like UNPACK....

If we go back to using arrays, we can remove this.

@bedevere-app:

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@Fidget-Spinner merged commit 8810e28 into python:main on Sep 13, 2024
@Fidget-Spinner deleted the deferred_globals_final branch on September 13, 2024 at 16:23
@bedevere-bot:

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Windows10 3.x has failed when building commit 8810e28.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/146/builds/9494) and take a look at the build logs.
  4. Check if the failure is related to this commit (8810e28) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/146/builds/9494

Summary of the results of the build (if available):

Traceback logs:

remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 14 (delta 4), reused 2 (delta 1), pack-reused 0 (from 0)
From https://github.com/python/cpython
 * branch main -> FETCH_HEAD
Note: checking out '8810e286fa48876422d1b230208911decbead294'.
You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example: git checkout -b <new-branch-name>
HEAD is now at 8810e28... gh-121459: Deferred LOAD_GLOBAL (GH-123128)
Switched to and reset branch 'main'
Could Not Find D:\buildarea\3.x.bolen-windows10\build\Lib\*.pyc
The system cannot find the file specified.
Could Not Find D:\buildarea\3.x.bolen-windows10\build\PCbuild\python*.zip
Cannot open file 'D:\buildarea\3.x.bolen-windows10\build\test-results.xml' for upload
Could Not Find D:\buildarea\3.x.bolen-windows10\build\Lib\*.pyc
The system cannot find the file specified.
Could Not Find D:\buildarea\3.x.bolen-windows10\build\PCbuild\python*.zip

zware added a commit to zware/cpython that referenced this pull request Sep 13, 2024
colesbury added a commit to colesbury/cpython that referenced this pull request Sep 14, 2024

6 participants: @Fidget-Spinner, @bedevere-bot, @colesbury, @markshannon, @picnixz, @Eclips4