Conversation

@kmod (Contributor) commented Aug 11, 2022

Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6
sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior, and the benchmarks in the pyperformance
suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.
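For reference, the pipeline described above boils down to roughly the following (an abbreviated sketch of the `bolt-opt` Makefile rule this PR adds; tool names are spelled out and the flag set is shortened from the actual change):

```
bolt-opt:
	rm -f *.fdata
	# 1. Produce an instrumented python that dumps a profile on exit
	#    (one .fdata file per PID).
	llvm-bolt ./python -instrument -instrumentation-file-append-pid \
	    -instrumentation-file=$(abspath python.bolt) -o python.bolt_inst
	# 2. Run the profiling task on the instrumented binary.
	./python.bolt_inst -m test --pgo || true
	# 3. Merge the per-process profiles.
	merge-fdata python.*.fdata > python.fdata
	# 4. Rewrite the original binary using the merged profile.
	llvm-bolt ./python -o python.bolt -data=python.fdata \
	    -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ \
	    -split-functions=3 -icf=1 -split-eh -dyno-stats
```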

@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@gvanrossum (Member)

Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.

@pablogsal (Member)

> and maybe @pablogsal will be willing to backport it to 3.11.

Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.

@corona10 self-requested a review on August 11, 2022 23:28
@corona10 (Member)

Nice work! I will take a look at this PR by this weekend

@corona10 changed the title from "Add support for the BOLT post-link binary optimizer" to "gh-90536: Add support for the BOLT post-link binary optimizer" on Aug 11, 2022
@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@corona10 (Member) left a comment (this comment was marked as resolved)


Two things need to be checked.

  • I failed to build the binary with this patch. This may be due to a BOLT bug, but I would like to know which BOLT version you used. → solved
```
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
 #0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
 #4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
 #5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
 #6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
 #7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
 #8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
 #9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
#10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
#12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
```
  • While profiling, I hit a test failure. Could you check that the optimized binary passes the full standard test suite (e.g. `python -m test`)? I hit a related issue with the last attempt, and it was solved by profiling through `python -m test`. → solved
```
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c
Total duration: 1 min 21 sec
Tests result: FAILURE
```

I will share further investigation into this patch.
FYI, this is my environment.

- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0

@corona10 (Member)

Hmm, I will try to build BOLT from LLVM 14.0.6

@corona10 (Member) commented Aug 13, 2022

I found why BOLT failed; I will downgrade the gcc version to 10.

> DWARF 5 has become the default in GCC 11
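Downgrading GCC is one fix; another workaround (not from this thread, but a standard GCC option) would be to keep GCC 11 and ask it to emit the older DWARF 4 debug format that this era of llvm-bolt handles, e.g.:

```
# Hypothetical build invocation: force DWARF 4 debug info under GCC 11+.
./configure --enable-optimizations --with-lto --enable-bolt \
    CFLAGS_NODIST="-gdwarf-4"
make
```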

@corona10 (Member) left a comment


Thanks for the work! The whole pipeline works correctly.

Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible, update https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst too; I will update the What's New if you are too busy.)
But please emphasize that this feature is experimental optimization support.

I am going to measure the performance improvement through pyperformance soon, along with the L1 i-cache miss ratio.

Looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.11/3.12 yet, right? Please let me know if I'm wrong.

Plus, add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@corona10 (Member) commented Aug 13, 2022

@gvanrossum @kmod cc @markshannon

Interesting result!
The following benchmark was measured on AWS c5n.metal / gcc-10. (base commit: f235178)
I would like to re-measure the benchmark on the Faster CPython project machine as well.
I am going to measure the L1 i-cache miss ratio soon, on a machine where the perf tool is available.

| Benchmark | CPython 3.12 `./configure --enable-optimizations --with-lto` | CPython 3.12 `./configure --enable-optimizations --with-lto --enable-bolt` |
|---|---|---|
| 2to3 | 269 ms | 255 ms: 1.05x faster |
| chameleon | 7.39 ms | 7.02 ms: 1.05x faster |
| chaos | 74.1 ms | 68.8 ms: 1.08x faster |
| crypto_pyaes | 82.3 ms | 77.2 ms: 1.07x faster |
| deltablue | 3.65 ms | 3.41 ms: 1.07x faster |
| django_template | 38.6 ms | 35.3 ms: 1.09x faster |
| dulwich_log | 67.6 ms | 58.7 ms: 1.15x faster |
| fannkuch | 385 ms | 380 ms: 1.02x faster |
| float | 73.2 ms | 72.4 ms: 1.01x faster |
| genshi_text | 24.3 ms | 23.3 ms: 1.04x faster |
| genshi_xml | 56.4 ms | 52.8 ms: 1.07x faster |
| go | 140 ms | 136 ms: 1.03x faster |
| hexiom | 6.40 ms | 6.25 ms: 1.02x faster |
| html5lib | 65.0 ms | 60.7 ms: 1.07x faster |
| json_dumps | 11.1 ms | 10.4 ms: 1.07x faster |
| json_loads | 28.7 us | 26.3 us: 1.09x faster |
| logging_format | 7.29 us | 6.69 us: 1.09x faster |
| logging_silent | 101 ns | 97.6 ns: 1.03x faster |
| logging_simple | 6.48 us | 6.01 us: 1.08x faster |
| mako | 10.6 ms | 9.91 ms: 1.07x faster |
| meteor_contest | 106 ms | 102 ms: 1.04x faster |
| nbody | 86.4 ms | 87.7 ms: 1.02x slower |
| nqueens | 91.3 ms | 88.1 ms: 1.04x faster |
| pathlib | 19.0 ms | 16.8 ms: 1.13x faster |
| pickle_dict | 32.2 us | 32.6 us: 1.01x slower |
| pickle_list | 4.69 us | 4.62 us: 1.02x faster |
| pickle_pure_python | 297 us | 282 us: 1.05x faster |
| pidigits | 177 ms | 176 ms: 1.01x faster |
| pyflate | 423 ms | 416 ms: 1.02x faster |
| python_startup | 8.72 ms | 8.15 ms: 1.07x faster |
| python_startup_no_site | 6.35 ms | 5.97 ms: 1.06x faster |
| raytrace | 312 ms | 293 ms: 1.06x faster |
| regex_compile | 139 ms | 131 ms: 1.06x faster |
| regex_dna | 180 ms | 185 ms: 1.03x slower |
| regex_effbot | 2.99 ms | 2.82 ms: 1.06x faster |
| regex_v8 | 21.4 ms | 20.4 ms: 1.05x faster |
| richards | 48.6 ms | 46.3 ms: 1.05x faster |
| scimark_fft | 348 ms | 338 ms: 1.03x faster |
| scimark_lu | 120 ms | 117 ms: 1.02x faster |
| scimark_monte_carlo | 67.0 ms | 65.4 ms: 1.02x faster |
| scimark_sor | 116 ms | 113 ms: 1.02x faster |
| spectral_norm | 101 ms | 102 ms: 1.01x slower |
| sqlalchemy_declarative | 143 ms | 135 ms: 1.06x faster |
| sqlalchemy_imperative | 19.0 ms | 17.0 ms: 1.12x faster |
| sqlite_synth | 2.50 us | 2.29 us: 1.09x faster |
| sympy_expand | 507 ms | 465 ms: 1.09x faster |
| sympy_integrate | 21.7 ms | 20.5 ms: 1.06x faster |
| sympy_sum | 176 ms | 164 ms: 1.08x faster |
| sympy_str | 311 ms | 286 ms: 1.09x faster |
| telco | 7.02 ms | 6.36 ms: 1.10x faster |
| tornado_http | 125 ms | 113 ms: 1.10x faster |
| unpickle | 15.7 us | 15.1 us: 1.04x faster |
| unpickle_list | 4.74 us | 4.56 us: 1.04x faster |
| unpickle_pure_python | 229 us | 219 us: 1.05x faster |
| xml_etree_parse | 158 ms | 155 ms: 1.02x faster |
| xml_etree_iterparse | 103 ms | 101 ms: 1.02x faster |
| xml_etree_generate | 91.0 ms | 84.3 ms: 1.08x faster |
| xml_etree_process | 61.9 ms | 58.4 ms: 1.06x faster |
| Geometric mean | (ref) | 1.05x faster |

Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
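As a sanity check on the reported geometric mean, here is how a pyperf-style geometric mean is computed. This is a minimal sketch over a hand-picked subset of the rows above, so its result is higher than the full-suite 1.05x (the subset over-represents the biggest wins):

```python
from math import prod

# Base (PGO+LTO) vs BOLT (PGO+LTO+BOLT) timings in ms, copied from a few
# rows of the table above; ratio > 1 means the BOLT build is faster.
times_ms = {
    "2to3":        (269.0, 255.0),
    "chaos":       (74.1, 68.8),
    "dulwich_log": (67.6, 58.7),
    "nbody":       (86.4, 87.7),  # one of the few regressions
}
ratios = [base / bolt for base, bolt in times_ms.values()]
geomean = prod(ratios) ** (1 / len(ratios))
print(f"geometric mean over this subset: {geomean:.2f}x faster")
```

Over this four-benchmark subset the mean works out to roughly 1.07x; the full 60-benchmark suite lands at the 1.05x shown above.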

@corona10 self-assigned this Aug 13, 2022
@corona10 (Member) commented Aug 14, 2022

Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0):
https://gist.github.com/corona10/c2aa0108a5ffcc96be449c0ce033412d

But let's measure the benchmark from the Faster CPython machine after the PR is merged.

Makefile.pre.in (outdated)

```
bolt-opt: @PREBOLT_RULE@
	rm -f *.fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```

Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```

Makefile.pre.in (outdated)

```
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```

Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```

@corona10 (Member) commented Aug 15, 2022

I succeeded in getting cache-miss-related metadata, and the pyperformance result I got is similar to my previous attempts and Kevin's report.
I didn't analyze whether the GCC version or OS version could affect the performance result,
but I can conclude that BOLT definitely makes CPython faster.

Environment

  • Hardware: AWS c5n.metal
  • Red Hat Enterprise Linux release 8.6 (Ootpa)
  • gcc: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
  • LLVM version 14.0.6

Binary Size

  • Without BOLT: 79M
  • With BOLT: 36M

ICache miss

| Experiment | instructions | L1-icache-misses | ratio |
|---|---|---|---|
| PGO + LTO | 8,330,863,079,932 | 77,047,357,163 | 0.92% |
| PGO + LTO + BOLT | 8,312,698,165,975 | 65,319,225,064 | 0.79% |
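The ratio column can be re-derived from the raw perf counters above (a quick consistency check, not part of the PR):

```python
# Recompute the L1-icache miss ratio from the raw counters above:
# ratio = misses / instructions, expressed as a percentage.
counters = {
    "PGO + LTO":        (8_330_863_079_932, 77_047_357_163),
    "PGO + LTO + BOLT": (8_312_698_165_975, 65_319_225_064),
}
miss_pct = {
    name: 100 * misses / instructions
    for name, (instructions, misses) in counters.items()
}
for name, pct in miss_pct.items():
    print(f"{name}: {pct:.2f}% L1-icache miss ratio")
```

This reproduces the 0.92% and 0.79% figures in the table.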

Benchmark (1.01x faster)

https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5

kmod and others added 6 commits on August 16, 2022
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
vmarkovtsev added a commit to athenianco/athenian-api that referenced this pull request Nov 16, 2022
@osevan commented Dec 18, 2022

Another question, and an important consideration for performance tuning.

GCC PGO and Clang PGO are different: GCC's PGO profiler (`-fprofile-generate`) can collect more detailed data than Clang's.

So it would be nice to have separate flags such as `--enable-lto-gcc` / `--enable-pgo-gcc` (keeping in mind that, when building with GCC, a reorder flag is still needed for BOLT, which is a Clang/LLVM tool), and a complete Clang toolchain option, `--enable-lto-llvm` / `--enable-pgo-llvm`, plus BOLT.

Thank you very much.
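For the all-LLVM half of this suggestion, the flags that already exist in this PR can approximate it today without new configure options (an untested sketch; assumes `clang` and `llvm-profdata` are on `PATH`):

```
./configure CC=clang CXX=clang++ \
    --enable-optimizations --with-lto --enable-bolt
make -j
```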

vmarkovtsev added a commit to athenianco/athenian-api that referenced this pull request Dec 20, 2022
```
# These flags are required to get good performance from bolt:
CFLAGS_NODIST="$CFLAGS_NODIST -fno-pie"
# We want to add these no-pie flags to linking executables but not shared libraries:
LINKCC="$LINKCC -fno-pie -no-pie"
```
Contributor


@kmod are those flags required for Bolt to have proper impact? Asking as in Fedora we link all the packages with -pie and there is a conflict with this flag.

Contributor (Author)


Sorry, it's been quite a while and I don't really recall. I'd guess I added them because the answer was yes, but I think the BOLT team was actively working on improving PIE support at the time, so there's a good chance the answer has changed since this PR.


Yes, the support for PIE binaries with computed gotos is not merged yet: llvm/llvm-project#120267

Contributor


Wouldn't that be resolved via https://github.com/python/cpython/pull/128511/files though? Hence the -no-pie and -fno-pie flags would be redundant here?


Yes, what you've linked is a workaround. With that, you can enable PIE. Once the PR I've linked is merged, you can drop skip-funcs and hopefully enjoy better performance.
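For anyone following this thread, a quick way to check whether a built `python` binary ended up PIE or not is to read the ELF `e_type` field (a sketch with a hypothetical helper name; assumes a little-endian ELF, which matches the x86_64 builds discussed here — `readelf -h` reports the same field):

```python
import struct

def elf_type(path):
    """Classify an ELF binary: ET_EXEC (2) = fixed-address, non-PIE
    executable; ET_DYN (3) = PIE executable or shared object."""
    with open(path, "rb") as f:
        header = f.read(18)  # e_ident (16 bytes) + e_type (2 bytes)
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    (e_type,) = struct.unpack_from("<H", header, 16)  # assumes little-endian
    return {2: "EXEC (non-PIE)", 3: "DYN (PIE or shared object)"}.get(
        e_type, f"other ({e_type})"
    )
```

After a `--enable-bolt` build with the `-no-pie` flags above, `elf_type("./python")` should report `EXEC (non-PIE)`; a Fedora-style `-pie` build would report `DYN`.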

8 participants

@kmod @bedevere-bot @gvanrossum @pablogsal @corona10 @aaupov @osevan @stratakis