gh-117578: Fix inlining regression in PyType_GetModuleByDef() #123100

neonene · 2024-08-17T11:40:48Z

On main and 3.13, there are cases where the get_module_by_def function in typeobject.c is not inlined in its wrapper functions:

| Wrappers | Windows | callee: `get_module_by_def()` |
|---|---|---|
| `PyType_GetModuleByDef()` | Release | `/Ob2`: called, `/Ob3`: inlined |
| | PGO | inlined |
| `_PyType_GetModuleByDef2()` | Release | `/Ob2`: called, `/Ob3`: inlined |
| | PGO | called |

Non-builtin modules cannot inline the wrappers themselves, so when the callee is not inlined either, they pay for an extra function call.

This PR marks the callee as Py_ALWAYS_INLINE.

cc @encukou

Issue: PyType_GetModuleByDef family for binary functions performance #117578

encukou · 2024-08-19T11:37:32Z

Hm, it doesn't sound right to override profile-guided optimization, especially since test_decimal (the only current caller of _PyType_GetModuleByDef2) is in the PGO test set.
Does this PR have a significant performance impact?

neonene · 2024-08-20T10:53:48Z

It is mentioned on the faster-cpython repo that the telco test has slowed down a lot.

According to MSVC, the module state access counts were:

| Function / breakdown | entry-cnt | alternative access |
|---|---|---|
| `PyType_GetModuleByDef()` | 6852643 | |
| convert_op | 2971848 | via context object |
| PyDecType_New | 1651188 | via context object (partial) |
| dec_addstatus | 1651188 | via context object (partial) |
| current_context | 1486221 | via context object (partial) |
| dec_mpd_qquantize | 247731 | METH_METHOD |
| ctx_mpd_qquantize | 165000 | METH_METHOD |
| ... | | |
| `_PyType_GetModuleByDef2()` | 1073193 | |
| nm_mpd_qadd | 660462 | |
| nm_mpd_qmul | 412731 | |
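As a quick sanity check on the counts listed above, the sketch below confirms that the two `_PyType_GetModuleByDef2()` callers add up to its total, and estimates what share of all module state lookups go through it. The constant and function names are made up for this illustration; only the numbers come from the profile.

```python
# Entry counts copied from the MSVC profile quoted above.
BYDEF_TOTAL = 6852643     # PyType_GetModuleByDef()
BYDEF2_TOTAL = 1073193    # _PyType_GetModuleByDef2()
BYDEF2_CALLERS = {"nm_mpd_qadd": 660462, "nm_mpd_qmul": 412731}


def share(part, whole):
    """Return part/whole as a fraction between 0 and 1."""
    return part / whole


if __name__ == "__main__":
    # The listed callers should account for every ByDef2 entry.
    print(sum(BYDEF2_CALLERS.values()) == BYDEF2_TOTAL)
    # Share of all lookups that take the ByDef2 (binary operator) path.
    print(f"{share(BYDEF2_TOTAL, BYDEF_TOTAL + BYDEF2_TOTAL):.1%}")
```

The `PyType_GetModuleByDef()` breakdown is not summed here because the list is truncated ("...") and the listed entries already exceed the total, so they presumably overlap.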

Tested with the /Ob3 option, switching the inlining specifier: f740a5d. My Release/PGO builds on Windows get slower when using the TLS version of PyThreadState_Get(), which was also observed at #103324 (comment). If *nix OSes are in good health with TLS, I guess I also need to run telco under good conditions (without TLS):

| | sub | | | |
|---|---|---|---|---|
| `GetModuleByDef` | call | inline | inline | |
| `GetModuleByDef2` | call | call | inline | |
| normal 3.14 | perf | | | (the higher, the faster) |
| | 1.00x | (base) | 1.01x | |
| | 1.03x | 1.05x | 1.08x | respect alternative |
| less TLS overhead | perf | | | (experiment) |
| | 1.05x | 1.05x | 1.06x | |
| | 1.08x | 1.08x | 1.08x | respect alternative |

| module state access | normal | TLS-less | |
|---|---|---|---|
| current | (base) | 1.05x | |
| This PR | 1.01x | 1.06x | |
| PyThreadState_Get() | 1.05x | 1.04x | example patches |
| PR with alternatives | 1.08x | 1.08x | |
| global state | 1.08x | 1.11x | |
| static type (GC unused) | 1.14x | 1.14x | taken from 3.12 |

This patch would be needed to get as much speed as global state access on Windows, even though on its own it has little effect (1%) for some reason.

neonene · 2024-08-20T22:49:35Z

Windows PGO:
- telco: 9.21 ms +- 0.12 ms -> 9.02 ms +- 0.09 ms: 1.02x faster
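For reference, the "1.02x faster" figure is consistent with dividing the old mean by the new mean. The helper below is a hypothetical illustration of that arithmetic, not the benchmarking tool's own code, and it ignores the reported standard deviations.

```python
def speedup(before_ms, after_ms):
    """Ratio of mean timings; values above 1.0 mean the new build is faster."""
    return before_ms / after_ms


if __name__ == "__main__":
    # Means from the Windows PGO telco run quoted above.
    print(f"{speedup(9.21, 9.02):.2f}x faster")
```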

neonene · 2024-08-21T09:55:18Z

Would it be acceptable to add a test case like the one below to test_decimal.py, instead of touching the C code?

    @requires_cdecimal
    class CArithmeticOperatorsTest(ArithmeticOperatorsTest, unittest.TestCase):
        ...
        @unittest.skipIf(not test.support.PGO, 'PGO training only')
        def test_excecise_binop(self):
            Decimal = self.decimal.Decimal
            d = Decimal('11.1')
            for i in range(500000):
                1 + d  # at least 300000 times
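To try the same hot loop outside the test suite, a standalone sketch could look like the following. It assumes the stock `decimal` module (the C implementation when available); the function name `exercise_binop` and the timing wrapper are hypothetical and not part of the proposed test.

```python
import time
from decimal import Decimal


def exercise_binop(iterations=500_000):
    """Run `1 + d` repeatedly; return the last result and elapsed seconds."""
    d = Decimal('11.1')
    result = None
    start = time.perf_counter()
    for _ in range(iterations):
        # int + Decimal falls back to Decimal's reflected add,
        # so this goes through the number-method binary operator path.
        result = 1 + d
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    value, seconds = exercise_binop()
    print(value, f"{seconds:.3f}s")
```

A single timing like this is noisy, so it is only useful for a rough check that the binary operator path is being exercised, not as a replacement for the telco benchmark.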

neonene · 2024-08-21T23:59:58Z

I'll try a PyType_GetBaseByToken() version.

neonene · 2024-08-26T02:54:33Z

Closing in favor of proposing the PyType_GetBaseByToken() version, which can supersede _PyType_GetModuleByDef2() on PGO and Release (/Ob3) builds.

add Py_ALWAYS_INLINE
01c3322

neonene requested a review from markshannon as a code owner August 17, 2024 11:40

bedevere-appbot mentioned this pull request Aug 17, 2024
PyType_GetModuleByDef family for binary functions performance #117578
Closed

bedevere-appbot added the awaiting review label Aug 17, 2024

neonene changed the title ~~gh-117578: Fix inlining regression in PyType_GetModuleByDef() family~~ gh-117578: Fix inlining regression in PyType_GetModuleByDef() Aug 17, 2024

encukou added the skip news label Aug 19, 2024

neonene added 3 commits August 20, 2024 19:18

benchmark setup: /Ob3, __declspec(noinline)
f740a5d

unsafe experiment: do not respect TLS access
6e62c38

revert benchmark stuff
89754d7

neonene closed this Aug 26, 2024

neonene deleted the bydef-inline branch September 20, 2024 03:42

neonene mentioned this pull request Sep 27, 2024
gh-124688: _decimal: Get a module state from ctx objects for performance #124691
Merged
