Skip to content

Conversation

@brandtbucher
Copy link
Member

@brandtbucherbrandtbucher commented May 2, 2024

This breaks up the JIT into smaller functions, reduces a lot of branching in hot inner loops, and generally makes the C code cleaner (and probably faster).

Currently, we generate declarative structures at build time that we then loop over in order to emit the desired machine code at runtime. For example, the _STORE_FAST stencil looks like this:

Details
staticconstunsigned char_STORE_FAST_code_body[61] ={0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5, 0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00, 0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44, 0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48, 0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff, 0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff, 0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x58, }; staticconstHole_STORE_FAST_code_holes[4] ={{0xc, HoleKind_R_X86_64_GOTPCREL, HoleValue_DATA, NULL, -0x4},{0x31, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0x4},{0x37, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0xc}, }; staticconstunsigned char_STORE_FAST_data_body[25] ={0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; staticconstHole_STORE_FAST_data_holes[4] ={{0x0, HoleKind_R_X86_64_64, HoleValue_OPARG, NULL, 0x0},{0x8, HoleKind_R_X86_64_64, HoleValue_CONTINUE, NULL, 0x0},{0x10, HoleKind_R_X86_64_64, HoleValue_ZERO, &_Py_Dealloc, 0x0}, };

This very general approach means that we have a lot of complex logic in our hot inner loop to decode instructions and set up values for patching that may not even be needed. It's also very branchy, since we're essentially "interpreting" the array of holes for each instruction.

With this PR, jit_stencils.h instead contains the following function:

Details
voidemit__STORE_FAST( unsigned char*code, unsigned char*data, _PyExecutorObject*executor, const_PyUOpInstruction*instruction, uintptr_tinstruction_starts[]){constunsigned charcode_body[60] ={0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5, 0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00, 0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44, 0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48, 0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff, 0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff, 0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x58, }; constunsigned chardata_body[24] ={0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; memcpy(data, data_body, sizeof(data_body)); patch_64(data+0x0, instruction->oparg); patch_64(data+0x8, (uintptr_t)code+sizeof(code_body)); patch_64(data+0x10, (uintptr_t)&_Py_Dealloc); memcpy(code, code_body, sizeof(code_body)); patch_32r(code+0xc, (uintptr_t)data+-0x4); patch_x86_64_32rx(code+0x31, (uintptr_t)data+0x4); patch_x86_64_32rx(code+0x37, (uintptr_t)data+0xc)}

This function is called directly to emit the machine code for every _STORE_FAST instruction, and hardcodes the logic for all of the necessary copies and patches. The result is one indirect call, no unnecessary branching, and (in my opinion) cleaner code, since a lot of the tricky logic is now hidden away in generated files.


I know this is right before feature freeze, but I'd really like to get this in 3.13 since it will make backporting any fixes much easier. It doesn't change the actual jitted code in any way.

Note to reviewers: the diff is a bit messy, so it may make more sense to compare the before-vs-after files side-by-side instead.

@brandtbucherbrandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) build The build process and cross-build labels May 2, 2024
@brandtbucherbrandtbucher self-assigned this May 2, 2024
@bedevere-appbedevere-appbot mentioned this pull request May 2, 2024
@brandtbucher
Copy link
MemberAuthor

@savannahostrowski, I'd love to get your review of this if you have a few cycles.

Copy link
Member

@savannahostrowskisavannahostrowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments and questions but after sitting and reading through this code a bunch over the last week or two, I'm excited about how much more readable this will get with this change! 💆‍♀️

"""Yield a JIT compiler line-by-line as a C header file."""
yieldfrom_dump_header()
foropname, groupingroups.items():
foropname, groupinsorted(groups.items()):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that this needs to be sorted?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I just like it that way (if you couldn't tell by now). ;)

Copy link
Member

@savannahostrowskisavannahostrowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding in the comment about the naming conventions - I think that helps! Otherwise, this looks pretty solid to me (barring some Windows CI failures). Lots of moving things into function but it's a whole lot more readable! 🎉

@brandtbucher
Copy link
MemberAuthor

Windows JIT CI fixed in GH-118564.

@brandtbucherbrandtbucher merged commit 1b7e5e6 into python:mainMay 3, 2024
SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

buildThe build process and cross-buildinterpreter-core(Objects, Python, Grammar, and Parser dirs)performancePerformance or resource usageskip newstopic-JIT

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

@brandtbucher@savannahostrowski