
Conversation

@lysnikolaou (Member) commented May 27, 2024

  • Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
  • Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.

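Purely to illustrate the two ideas above, here is a hypothetical Python sketch (made-up names; the actual change is not this code):

```python
class ColumnConverter:
    """Illustration of the two optimizations; hypothetical, not the PR's code."""

    def __init__(self):
        self._line_bytes = None  # raw bytes of the line currently being tokenized
        self._line_str = None    # that line decoded once, shared by all its tokens
        self._byte_off = 0       # byte offset reached so far on this line
        self._col_off = 0        # character (column) offset matching _byte_off

    def line_str(self, line_bytes):
        # Idea 1: decode each physical line only once and hand the same str
        # object to every token that falls on that line.
        if line_bytes is not self._line_bytes:
            self._line_bytes = line_bytes
            self._line_str = line_bytes.decode("utf-8")
            self._byte_off = self._col_off = 0
        return self._line_str

    def byte_offset_to_col(self, line_bytes, byte_offset):
        # Idea 2: instead of decoding the whole line prefix for every token,
        # decode only the smallest slice (previous offset to new offset) and
        # add its character length to the running column.
        self.line_str(line_bytes)
        if byte_offset < self._byte_off:
            # Token starts before the cached point; recompute from the start.
            self._byte_off = self._col_off = 0
        delta = line_bytes[self._byte_off:byte_offset]  # assumed to end on a character boundary
        self._col_off += len(delta.decode("utf-8"))
        self._byte_off = byte_offset
        return self._col_off
```

Both ideas reduce how much of the line has to be decoded per token, which is where the cost of the regression appears to come from once non-ASCII source is involved.
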
@pablogsal (Member)

Hummm, it seems that this solution also fails test_tokenize with -uall:

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_difflib.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 467
    dateb = '3 fév'
                   ^
IndentationError: unindent does not match any outer indentation level
======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_html.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 85
    check('&notin;', '∉')
                        ^
IndentationError: unindent does not match any outer indentation level
======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_str.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 1008
    try:
       ^
IndentationError: unindent does not match any outer indentation level
----------------------------------------------------------------------
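
All three failing lines contain non-ASCII text ('fév', '∉'), which suggests the failure is in the byte-offset-to-column-offset conversion: once a multi-byte UTF-8 character appears on a line, byte offsets and character columns diverge, and an off-by-one column is enough to make the round-tripped source's indentation inconsistent. A minimal illustration of the mismatch (not from the PR, just a fact about UTF-8):

```python
line = "dateb = '3 fév'"
print(len(line))                  # 15 characters
print(len(line.encode("utf-8")))  # 16 bytes -- 'é' takes two bytes in UTF-8
```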

@pablogsal (Member) commented May 28, 2024

Another idea: we already have the token as unicode (str), so if the line has not changed we can keep adding the size of the token itself to our state and account for the whitespace characters between tokens (this whitespace is always ASCII, so we can basically compute it as the diff of two pointers).
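
A rough Python sketch of that bookkeeping (hypothetical helper; the comment above is describing C code, where the whitespace gap would literally be the difference of two pointers into the line buffer):

```python
def advance_column(col, prev_token_str, whitespace_gap_bytes):
    # The previous token is already available as a str, so its character
    # length costs nothing extra to obtain.
    col += len(prev_token_str)
    # The gap between two tokens on the same line is whitespace, which is
    # ASCII, so its byte length equals its character length.
    col += whitespace_gap_bytes
    return col

# On the line "dateb = '3 fév'": after emitting "dateb" at column 0 with one
# space before "=", the "=" token starts at advance_column(0, "dateb", 1) == 6.
```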

@pablogsal (Member)

I'm discussing another fix with @lysnikolaou offline

lysnikolaou and others added 2 commits May 28, 2024 16:19
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@pablogsal (Member)

The docs failure seems unrelated. @hugovk do we know what may be happening here?

@lysnikolaou (Member, Author)

The latest results are:

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3
❯ python tmp/t.py
cpython darwin 3.11.3 (main, May 8 2023, 13:16:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Time taken: 0.5428769588470459

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/performance-tokenize-dirty:ab3437096a7, May 28 2024, 19:26:19) [Clang 15.0.0 (clang-1500.3.9.4)]
Time taken: 0.4570140838623047
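
tmp/t.py itself isn't shown in the thread; as a purely hypothetical sketch of what a timing script for this kind of comparison usually looks like (file choice and iteration count are made up), it would print the interpreter info and time a full tokenize pass:

```python
import io
import sys
import time
import tokenize

# Hypothetical stand-in for tmp/t.py: tokenize a sizable stdlib module many
# times and report the wall-clock time, which is roughly what the numbers
# above compare between 3.11 and the patched build.
with open(tokenize.__file__, "rb") as f:
    source = f.read()

print(sys.implementation.name, sys.platform, sys.version)
start = time.time()
for _ in range(100):
    list(tokenize.tokenize(io.BytesIO(source).readline))
print("Time taken:", time.time() - start)
```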

The test failures appear to be unrelated. We can probably merge this.

@hugovk (Member)

> The docs failure seems unrelated. @hugovk do we know what may be happening here?

Yep, we need to merge in the latest main; the new option has been added there (re: #119221).

pablogsal enabled auto-merge (squash) May 28, 2024 18:19
pablogsal merged commit d87b015 into python:main May 28, 2024
@miss-islington-app

Thanks @lysnikolaou for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app

GH-119682 is a backport of this pull request to the 3.13 branch.

bedevere-app bot removed the "needs backport to 3.13" label May 28, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app

GH-119683 is a backport of this pull request to the 3.12 branch.

bedevere-app bot removed the "needs backport to 3.12" label May 28, 2024
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119682)
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119683)
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
…n#119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
…n#119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>

4 participants: @lysnikolaou, @pablogsal, @hugovk, @devdanzin