Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
GH-131521: fix clangcl build on Windows for zlib-ng#131526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
chris-eibl commented Mar 20, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
| <ClCompileInclude="$(zlibNgDir)\arch\x86\chunkset_ssse3.c"> | ||
| <AdditionalOptionsCondition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions> | ||
| </ClCompile> | ||
| <ClCompileInclude="$(zlibNgDir)\arch\x86\adler32_sse42.c"> |
chris-eiblMar 20, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting: the filename suggests sse43 sse42, but -mssse3 deemed sufficient.
Ups: edited typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth matching the filename, so that if an updated zlib-ng uses SSE4.3 instructions we don't need to modify anything here.
The file relies on intrinsics, so it's not going to automatically generate newer instructions.
PCbuild/zlib-ng.vcxproj Outdated
| <PreprocessorDefinitionsCondition="$(Platform) == 'Win32' or $(Platform) == 'x64'">%(PreprocessorDefinitions);X86_FEATURES;X86_HAVE_XSAVE_INTRIN;X86_SSE2;X86_SSSE3;X86_SSE42;X86_PCLMULQDQ_CRC;X86_AVX2;X86_AVX512;X86_AVX512VNNI;X86_VPCLMULQDQ_CRC</PreprocessorDefinitions> | ||
| <PreprocessorDefinitionsCondition="$(Configuration) == 'Debug'">%(PreprocessorDefinitions);ZLIB_DEBUG</PreprocessorDefinitions> | ||
| <PreprocessorDefinitionsCondition="$(PlatformToolset) == 'ClangCL'">%(PreprocessorDefinitions);HAVE_BUILTIN_CTZ</PreprocessorDefinitions> | ||
| <EnableEnhancedInstructionSetCondition="$(Platform) == 'Win32' or $(Platform) == 'x64'">AdvancedVectorExtensions2</EnableEnhancedInstructionSet> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd feel better we'd not enable that for all *.c files (at lest not on Win32). MSVC would be totally happy without it (and AFAIR defaults to SSE2 if not set). Maybe I'd then have to throw in more -m for clang-cl, but IMHO we're safer without that even in the MSVC case, so that the resulting binary can run on ancient CPUs, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__ preprocessor checks that will behave differently without this setting).
Putting it just on the .c files from arch\x86 should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are
__AVX*__preprocessor checks that will behave differently without this setting).
Well spotted! Although MSVC allows you to use intrinsics even if the respective /arch part isn't set, it won't set the macros __AVX__, etc. Please note, that it never sets macros like __SSE2__:
- Fix #4265: Add appropriate compiler defines and flags for SIMD with MSVC AcademySoftwareFoundation/OpenImageIO#4266
- https://godbolt.org/z/3McojGzvs
Doing a regex __[A-Z]+[0-9]+__ only reveals
# if (defined(X86_SSE2) && defined(__SSE2__)) || defined(__x86_64__) || defined(_M_X64) || defined(X86_NOCHECK_SSE2) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 76 37 # if defined(X86_SSSE3) && defined(__SSSE3__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 95 37 #if defined(X86_PCLMULQDQ_CRC) && defined(__PCLMUL__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 110 43 # if defined(X86_AVX2) && defined(__AVX2__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 123 36 # if defined(X86_AVX512) && defined(__AVX512F__) && defined(__AVX512DQ__) && defined(__AVX512BW__) && defined(__AVX512VL__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 147 38 # if defined(X86_AVX512VNNI) && defined(__AVX512VNNI__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 159 44 # if defined(__PCLMUL__) && defined(__AVX512F__) && defined(__VPCLMULQDQ__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 166 17 which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.
The other hits are in x86_intrins.h:
#ifdef __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 7 8 #if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 10) \ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 10 63 #ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 18 8 #endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 24 11 #endif // __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 27 11 #ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 31 8 #if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 9) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 32 63 #endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 67 11 #ifdef __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 74 8 #endif // __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 78 11 #ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 80 8 #endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 84 11 #if defined(_MSC_VER) && defined(__AVX512F__) && !defined(_MM_K0_REG8) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 88 34 where intrinsic quirks of some compiler versions are taken care of. There are some for MSVC that even include __AVX512F__, which wouldn't be covered with the current AdvancedVectorExtensions2. My latest commits should fix this.
Putting it just on the .c files from
arch\x86should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).
Yeah, I've only set it to those files that really need them, rendering some clang specific stuff obsolete :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which are within an
#ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.
This makes me quite confident, that you're spot on #131438 (comment)
I believe it's detected and called dynamically based on availability, but it's all compiled at compile time.
I.e., we are in the "dynamic dispatch" code path
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.h#L12-L26
#ifdefDISABLE_RUNTIME_CPU_DETECTION# include"arch_functions.h"/* When compiling with native instructions it is not necessary to use functable. * Instead we use native_ macro indicating the best available variant of arch-specific * functions for the current platform. */# defineFUNCTABLE_INIT ((void)0) # defineFUNCTABLE_CALL(name) native_ ## name # defineFUNCTABLE_FPTR(name) &native_ ## name #elsestructfunctable_s{where functable_s will be populated at runtime according to the capabilities of the CPU in:
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.c#L45-L49
staticvoidinit_functable(void){structfunctable_sft; structcpu_featurescf; cpu_check_features(&cf);Using the /arch option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the
/archoption only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.
This is similar to #130213 / #130447. Thus, I've checked using clang-cl 18.1.8: it is picky about just "seeing" intrinsics without the appropriate -m architecture - and it happily compiled :)
chris-eibl commented Mar 20, 2025
Maybe someone can trigger Windows tailcall CI - it's the only one that really is of interest here, but is only triggered for ceval and the like cpython/.github/workflows/tail-call.yml Lines 4 to 9 in ce79274
|
zooba commented Mar 20, 2025
Just touch one of those files - a comment will do. And undo it before it gets merged (make the comment something like |
so that the resulting binary can be executed on older CPUs, too. Also use AdvancedVectorExtensions512 where necessary.
some of the files tail-call.yml is watching were touched on main, and thus I think it will fire :)
chris-eibl commented Mar 22, 2025
I think this is a skip news? |
chris-eibl commented Mar 22, 2025
Windows tail-call CI is now green: https://github.com/python/cpython/actions/runs/14006628024/job/39221297660 Having a detailed look, ISTM that the tests are not run, though. This is not related to this PR, the last green build some days ago did not run them either: https://github.com/python/cpython/actions/runs/13954576580/job/39062383429#step:4:326 After building, no more output can be seen. This is interesting, because cpython/.github/workflows/tail-call.yml Lines 82 to 91 in ce79274
clearly ./PCbuild/rt.bat is invoked. Maybe the reason is, because it is not used with backslashes like in
The question then is, why ./PCbuild/build.bat is working ...@Fidget-Spinner: any idea? |
Fidget-Spinner commented Mar 22, 2025
CI uses cmd instead if powershell, could that make a difference? |
chris-eibl commented Mar 22, 2025
Maybe? |
| <ItemDefinitionGroup> | ||
| <ClCompile> | ||
| <AdditionalOptions>%(AdditionalOptions) /utf-8 /w34242</AdditionalOptions> | ||
| <AdditionalOptionsCondition="$(SupportPGO) and $(Configuration) == 'PGUpdate' and $(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -fno-profile-instr-use</AdditionalOptions> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since zlib-ng is compiled into a static lib, it won't produce any profile data. It is the first target that is built (after the _freeze_module)
Lines 49 to 51 in 9962469
| <ItemGroup> | |
| <!-- Static libraries for use later in the build --> | |
| <ProjectsInclude="zlib-ng.vcxproj"Condition="$(zlibNgDir) != '' and Exists('$(zlibNgDir)\zlib-ng.h.in')" /> |
But since there is no
\$(TargetName)_*.profraw for itcpython/PCbuild/pyproject-clangcl.props
Lines 16 to 18 in 9962469
| <ItemGroup> | |
| <_profrawFilesInclude="$(OutDir)instrumented\$(TargetName)_*.profraw" /> | |
| </ItemGroup> |
MergeClangProfileData won't create profdata.profdata:cpython/PCbuild/pyproject-clangcl.props
Lines 26 to 28 in 9962469
| <TargetName="MergeClangProfileData"BeforeTargets="PrepareForBuild" | |
| Condition="'$(SupportPGO)' and $(Configuration) == 'PGUpdate'" | |
| Inputs="@(_profrawFiles)" |
IMHO, this is the best fix (instead of just setting SupportPGO=false for it):
- it still will get compiled with
-fprofile-instr-generate - it will be linked with
python314.dlland (hopefully?) happily contribute its part topython314.profraw - MSVC doesn't create
zlib-ng!1.pgcandzlib-ng.pgdeither. No idea whether it gets PGOed there, but the same idea applies? The difference is just, that MSVC does not produce a hard error if the profile data isn't there, whereas clang does (TRACKEDEXEC : error : Error in reading profile profdata.profdata: no such file or directory), and I still have found no way to make that a warning :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since zlib-ng is compiled into a static lib, it won't produce any profile data.
Thinking about it again, instead of doing the -fno-profile-instr-use trick in the PGUpdate phase, it would be better, to just create the lib once instead of twice (like I've recently done for the _freeze_module).
There is no need to build the lib twice (same holds true for the MSVC case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it a quick try:
- introducing a
StaticLibProjectslikeFreezeProjectsinpcbuild.projis simple and straight forward. - Then I'd have to make the
ProjectReference Include="zlib-ng.vcxprojconditional (otherwise it would still be built in thePGUpdatephase) - ok, doable. - but then
zlib.his no longer found, because it is generated into$(IntDir)zlib.h- likewise$(IntDir)zlib-ng.h. So stopping here :)
So turns out to be not so easy and I'd like to keep that for a follow-up PR (if at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the call counts for PGO are going straight into each project that references it, since a static lib is essentially just a bundle of .obj files and so the link-time optimisation happens in the referencing project, not the static lib project.
Which means yes, it's participating in PGO, and it doesn't require rebuilding. Though it's not going to hurt that badly, so I'd tend towards decoupled build configuration (it would be easy to fix by moving the header file regeneration into pythoncore.vcxproj, but that's bad coupling).
Possibly we could make SupportsPGO choose the definition of IntDir and then it would usually be a quick rebuild?
chris-eiblMar 24, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but let's do that in a follow-up PR, since it is not really related to clang-cl.
Technically, the /arch changes so that the resulting binary can run on legacy CPUs aren't either, but IMHO they are a nice fit here.
It would just put some rebase burden onto this PR, if we'd do it in a separate one.
zooba commented Mar 24, 2025
It probably requires |
Python/ceval_macros.h Outdated
| @@ -1,3 +1,4 @@ | |||
| // DO NOT MERGE - just triggering Windows tail-call CI) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to remove
chris-eibl commented Mar 24, 2025
I've created #131678 to continue the discussion there. |
d16f455 into python:mainUh oh!
There was an error while loading. Please reload this page.
…honGH-131526) Do not enable AdvancedVectorExtensions2 for all *.c files, so that the resulting binary can be executed on older CPUs, too. Also enable AdvancedVectorExtensions512 where necessary, and add the ClangCL flags required to enable vector extensions.
I gave it a try. For me it compiles and
test_zlibis green.I've only thrown in as many
-mas needed.But only for those files that really need them!
I started to set that for all files (the diff would have been much smaller), but
*.cfiles given, and the resulting binary would then not run on all hoststest_zlibfailed withinvalid instructionif I compiled all*.cfiles with the "all the-m. This convinced me even more to be "selective" here: my ancient Haswell i5 4570 only supports up to AVX2 and it makes sense, that it would not happily accept all that instructions (up to AVX512).