
Conversation

@lysnikolaou (Member) commented May 27, 2024

  • Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
  • Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.

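Purely to illustrate the two ideas above, here is a hypothetical Python sketch (made-up names; the actual change is not this code):

```python
class ColumnConverter:
    """Illustration of the two optimizations; hypothetical, not the PR's code."""

    def __init__(self):
        self._line_bytes = None  # raw bytes of the line currently being tokenized
        self._line_str = None    # that line decoded once, shared by all its tokens
        self._byte_off = 0       # byte offset reached so far on this line
        self._col_off = 0        # character (column) offset matching _byte_off

    def line_str(self, line_bytes):
        # Idea 1: decode each physical line only once and hand the same str
        # object to every token that falls on that line.
        if line_bytes is not self._line_bytes:
            self._line_bytes = line_bytes
            self._line_str = line_bytes.decode("utf-8")
            self._byte_off = self._col_off = 0
        return self._line_str

    def byte_offset_to_col(self, line_bytes, byte_offset):
        # Idea 2: instead of decoding the whole line prefix for every token,
        # decode only the smallest slice (previous offset to new offset) and
        # add its character length to the running column.
        self.line_str(line_bytes)
        if byte_offset < self._byte_off:
            # Token starts before the cached point; recompute from the start.
            self._byte_off = self._col_off = 0
        delta = line_bytes[self._byte_off:byte_offset]  # assumed to end on a character boundary
        self._col_off += len(delta.decode("utf-8"))
        self._byte_off = byte_offset
        return self._col_off
```

Both ideas reduce how much of the line has to be decoded per token, which is where the cost of the regression appears to come from once non-ASCII source is involved.
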
@pablogsal (Member)

Hummm, it seems that this solution also fails test_tokenize with -uall:

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_difflib.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 467
    dateb = '3 fév'
                   ^
IndentationError: unindent does not match any outer indentation level
======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_html.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 85
    check('&notin;', '∉')
                        ^
IndentationError: unindent does not match any outer indentation level
======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_str.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 1008
    try:
       ^
IndentationError: unindent does not match any outer indentation level
----------------------------------------------------------------------
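
All three failing lines contain non-ASCII text ('fév', '∉'), which suggests the failure is in the byte-offset-to-column-offset conversion: once a multi-byte UTF-8 character appears on a line, byte offsets and character columns diverge, and an off-by-one column is enough to make the round-tripped source's indentation inconsistent. A minimal illustration of the mismatch (not from the PR, just a fact about UTF-8):

```python
line = "dateb = '3 fév'"
print(len(line))                  # 15 characters
print(len(line.encode("utf-8")))  # 16 bytes -- 'é' takes two bytes in UTF-8
```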

@pablogsal (Member) commented May 28, 2024

Another idea: we already have the token as unicode (str), so if the line has not changed we can keep adding the size of the token itself to our state and account for the whitespace characters between tokens (this whitespace is always ASCII, so we can basically compute it as the diff of two pointers).
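
A rough Python sketch of that bookkeeping (hypothetical helper; the comment above is describing C code, where the whitespace gap would literally be the difference of two pointers into the line buffer):

```python
def advance_column(col, prev_token_str, whitespace_gap_bytes):
    # The previous token is already available as a str, so its character
    # length costs nothing extra to obtain.
    col += len(prev_token_str)
    # The gap between two tokens on the same line is whitespace, which is
    # ASCII, so its byte length equals its character length.
    col += whitespace_gap_bytes
    return col

# On the line "dateb = '3 fév'": after emitting "dateb" at column 0 with one
# space before "=", the "=" token starts at advance_column(0, "dateb", 1) == 6.
```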

@pablogsal (Member)

I'm discussing another fix with @lysnikolaou offline

lysnikolaou and others added 2 commits May 28, 2024 16:19
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@pablogsal (Member)

The docs failure seems unrelated. @hugovk do we know what may be happening here?

@lysnikolaou (Member, Author)

The latest results are:

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3
❯ python tmp/t.py
cpython darwin 3.11.3 (main, May 8 2023, 13:16:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Time taken: 0.5428769588470459

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/performance-tokenize-dirty:ab3437096a7, May 28 2024, 19:26:19) [Clang 15.0.0 (clang-1500.3.9.4)]
Time taken: 0.4570140838623047
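
tmp/t.py itself isn't shown in the thread; as a purely hypothetical sketch of what a timing script for this kind of comparison usually looks like (file choice and iteration count are made up), it would print the interpreter info and time a full tokenize pass:

```python
import io
import sys
import time
import tokenize

# Hypothetical stand-in for tmp/t.py: tokenize a sizable stdlib module many
# times and report the wall-clock time, which is roughly what the numbers
# above compare between 3.11 and the patched build.
with open(tokenize.__file__, "rb") as f:
    source = f.read()

print(sys.implementation.name, sys.platform, sys.version)
start = time.time()
for _ in range(100):
    list(tokenize.tokenize(io.BytesIO(source).readline))
print("Time taken:", time.time() - start)
```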

The test failures appear to be unrelated. We can probably merge this.

@hugovk (Member)

> The docs failure seems unrelated. @hugovk do we know what may be happening here?

Yep, we need to merge in the latest main; the new option has been added there (re: #119221).

pablogsal enabled auto-merge (squash) May 28, 2024 18:19
pablogsal merged commit d87b015 into python:main May 28, 2024
@miss-islington-app

Thanks @lysnikolaou for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app

GH-119682 is a backport of this pull request to the 3.13 branch.

bedevere-app bot removed the "needs backport to 3.13" label May 28, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app

GH-119683 is a backport of this pull request to the 3.12 branch.

bedevere-app bot removed the "needs backport to 3.12" label May 28, 2024
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119682)
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119683)
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
(cherry picked from commit d87b015)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
…n#119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
…n#119615)
* pythongh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>

4 participants: @lysnikolaou, @pablogsal, @hugovk, @devdanzin