Skip to content

Conversation

@neonene
Copy link
Contributor

@neoneneneonene commented Aug 17, 2024

On main and 3.13, there are cases where the get_module_by_def function in typeobject.c is not inlined in its wrapper functions:

WrappersWindowscallee: get_module_by_def()
PyType_GetModuleByDef()Release/Ob2:called /Ob3:inlined
PGOinlined
_PyType_GetModuleByDef2()Release/Ob2:called /Ob3:inlined
PGOcalled

Non-builtin modules can have extra function-call overheads, where the wrappers cannot be inlined.

This PR specifies Py_ALWAYS_INLINE to the callee.

cc @encukou

@neoneneneonene changed the title gh-117578: Fix inlining regression in PyType_GetModuleByDef() familygh-117578: Fix inlining regression in PyType_GetModuleByDef()Aug 17, 2024
@encukou
Copy link
Member

Hm, it doesn't sound right to override profile-guided optimization, especially since test_decimal (the only current caller of _PyType_GetModuleByDef2) is in the PGO test set.
Does this PR have a significant performance impact?

@neonene
Copy link
ContributorAuthor

neonene commented Aug 20, 2024

It is mentioned on the faster-cpython repo that the telco test has slowed down a lot.

According to MSVC, the module state access counts were:

Function / breakdown entry-cnt alternative access ----------------------- --------- --------------------------- PyType_GetModuleByDef() 6852643 convert_op 2971848 via context object PyDecType_New 1651188 via context object (partial) dec_addstatus 1651188 via context object (partial) current_context 1486221 via context object (partial) dec_mpd_qquantize 247731 METH_METHOD ctx_mpd_qquantize 165000 METH_METHOD ... _PyType_GetModuleByDef2() 1073193 nm_mpd_qadd 660462 nm_mpd_qmul 412731 

Tested with the /Ob3 option, switching the inlining specifier: f740a5d. My Release/PGO builds on Windows get slower using TLS version of PyThreadState_Get(), which is also observed at #103324 (comment). If *nix OSes are in good health with TLS, I guess I also need to run the telco with a good condition (without TLS):

sub
GetModuleByDefcallinlineinline
GetModuleByDef2callcallinline
normal 3.14perf(the higher, the faster)
1.00x(base)1.01x
1.03x1.05x1.08xrespect alternative
less TLS overheadperf(experiment)
1.05x1.05x1.06x
1.08x1.08x1.08xrespect alternative
module state accessnormalTLS-less
current(base)1.05x
This PR1.01x1.06x
PyThreadState_Get()1.05x1.04xexample patches
PR with alternatives1.08x1.08x
global state1.08x1.11x
static type (GC unused)1.14x1.14xtaken from 3.12

This patch would need to be applied if we wanted as much speed as the global state access on Windows, which has little effect alone (1%) for some reason.

@neonene
Copy link
ContributorAuthor

Windows PGO:
- telco: 9.21 ms +- 0.12 ms -> 9.02 ms +- 0.09 ms: 1.02x faster

@neonene
Copy link
ContributorAuthor

neonene commented Aug 21, 2024

Is it acceptable that test_decimal.py has a test case like below, instead of touching the C code?

@requires_cdecimalclassCArithmeticOperatorsTest(ArithmeticOperatorsTest, unittest.TestCase): ... @unittest.skipIf(nottest.support.PGO, 'PGO training only')deftest_excecise_binop(self): Decimal=self.decimal.Decimald=Decimal('11.1') foriinrange(500000): 1+d# at least 300000 times

@neonene
Copy link
ContributorAuthor

I'll try PyType_GetBaseByToken() version.

@neonene
Copy link
ContributorAuthor

Closing in favor of proposing the PyType_GetBaseByToken() version, which can supersede _PyType_GetModuleByDef2() on PGO and Relase(/Ob3) builds.

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

@neonene@encukou