gh-90536: Add support for the BOLT post-link binary optimizer #95908
Conversation
kmod commented Aug 11, 2022 • edited by bedevere-bot
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, and also adds some non-PIE flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to vary more between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%). The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing pgo task.
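For readers who want to try this locally, the end-to-end flow might look like the sketch below. This is an assumption pieced together from the discussion (the configure flag comes from this PR; building llvm-bolt and merge-fdata from source and their PATH location are assumptions), not a supported recipe:

```shell
# Build llvm-bolt and merge-fdata from the LLVM 14.0.6 sources first and put
# them on PATH (the binary release of that version did not ship BOLT).

# Configure CPython with the new flag; combining it with PGO/LTO is optional.
./configure --enable-optimizations --with-lto --enable-bolt

# The bolt-opt Makefile rule added by this PR instruments ./python, runs the
# profile task, merges the .fdata profiles, and emits the optimized binary.
make -j"$(nproc)"
```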
bedevere-bot commented Aug 11, 2022
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
gvanrossum commented Aug 11, 2022
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
pablogsal commented Aug 11, 2022
Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.
corona10 commented Aug 11, 2022
Nice work! I will take a look at this PR by this weekend.
bedevere-bot commented Aug 12, 2022
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
corona10 left a comment • edited
Two things need to be checked:

1. I failed to build the binary with this patch. This may be due to a BOLT bug, but I would like to know which BOLT version you used. -> solved
```
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
 #0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
 #4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
 #5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
 #6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
 #7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
 #8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
 #9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
 #10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
 #11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
 #12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
 #13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
```

2. While profiling, I hit a test failure. Would you like to check that the optimized binary passes all standard Python tests (e.g. `python -m test`)? I hit a related issue in my last attempt, and it was solved by profiling through `python -m test`. -> solved
```
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c
Total duration: 1 min 21 sec
Tests result: FAILURE
```

I will share further investigation into this patch.
FYI, this is my environment.
- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0

corona10 commented Aug 13, 2022
Hmm, I will try to build BOLT from the LLVM 14.0.6 sources.
corona10 commented Aug 13, 2022 • edited
I found why BOLT failed; I will downgrade the gcc version to 10.
corona10 left a comment • edited
Thanks for the work! The whole pipeline works correctly.
Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible, https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst as well; I will update the What's New entry if you are too busy.)
But please emphasize that this feature is experimental optimization support.
I am going to measure the performance enhancement soon through pyperformance, and also the L1 i-cache miss ratio.
Looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.11/3.12 yet, right? Please correct me if I am wrong.
Plus, please add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)
bedevere-bot commented Aug 13, 2022
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
corona10 commented Aug 13, 2022 • edited
@gvanrossum @kmod cc @markshannon Interesting result!
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
corona10 commented Aug 14, 2022 • edited
Another benchmark, from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0). But let's measure the benchmark on the Faster CPython machine after the PR is merged.
Makefile.pre.in Outdated
```makefile
bolt-opt: @PREBOLT_RULE@
	rm -f *.fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Makefile.pre.in Outdated
```makefile
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
corona10 commented Aug 15, 2022 • edited
I succeeded in getting cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report.
Environment
Binary Size
ICache miss
Benchmark (1.01x faster): https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
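As an aside on how a single figure like the "1.01x faster" above is produced: pyperformance summarizes a comparison as the geometric mean of per-benchmark speedup ratios. A minimal sketch with made-up ratios (the numbers below are illustrative only, not taken from this run):

```python
import math

# Hypothetical baseline_time / bolt_time ratios for five benchmarks;
# a value > 1.0 means the BOLT build is faster on that benchmark.
ratios = [1.04, 1.00, 0.99, 1.02, 1.01]

# Geometric mean: exp of the arithmetic mean of the logs.
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"{geomean:.2f}x faster")  # -> 1.01x faster
```

This is why a handful of strongly improved macrobenchmarks can still produce a small overall number when most suite entries are unchanged.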
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
Using the patch adapted from python/cpython#95908
osevan commented Dec 18, 2022 • edited
Another question, and an important consideration for performance tuning: GCC PGO and Clang PGO are different, and GCC's PGO profiler (profile-generate) can collect deeper data than Clang's. So it would be nice to add new flags such as --enable-lto-gcc and --enable-pgo-gcc, keeping in mind that BOLT needs the reorder flags at the GCC level, while Clang allows one complete compile chain.
Thank you very much
```sh
# These flags are required to get good performance from bolt:
CFLAGS_NODIST="$CFLAGS_NODIST -fno-pie"
# We want to add these no-pie flags to linking executables but not shared libraries:
LINKCC="$LINKCC -fno-pie -no-pie"
```
@kmod are these flags required for BOLT to have a proper impact? Asking because in Fedora we link all packages with -pie, and there is a conflict with this flag.
Sorry, it's been quite a while and I don't really recall. I'd guess that I added them because the answer was yes, but I think the bolt team was actively working on improving pie support at the time so there's a good chance that the answer could have changed since the time of this PR.
Yes, the support for PIE binaries with computed gotos is not merged yet: llvm/llvm-project#120267
Wouldn't that be resolved via https://github.com/python/cpython/pull/128511/files though? Hence the -no-pie and -fno-pie flags would be redundant here?
Yes, what you've linked is a workaround. With that, you can enable PIE. Once that PR I've linked is merged, you can drop skip-funcs and hopefully enjoy better performance.
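Since PIE status keeps coming up in this thread, one quick way to check what a build actually produced is to read the ELF header's `e_type` field: `ET_EXEC` (2) for a classic non-PIE executable, `ET_DYN` (3) for a PIE or shared object. A small sketch, assuming a little-endian ELF file; the helper name is mine, not from the PR:

```python
import struct

def elf_type(path):
    """Return a description of the ELF e_type field (assumes little-endian ELF)."""
    with open(path, "rb") as f:
        header = f.read(18)  # 16-byte e_ident + 2-byte e_type
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    (e_type,) = struct.unpack_from("<H", header, 16)
    return {2: "ET_EXEC (non-PIE)", 3: "ET_DYN (PIE or shared object)"}.get(e_type, hex(e_type))

# Example usage on Linux: elf_type("./python")
```

A build configured with the `-fno-pie -no-pie` flags above should report `ET_EXEC`, while distro builds like Fedora's would report `ET_DYN`.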