gh-130167: Improve speed of `difflib.IS_LINE_JUNK` by replacing `re` #130170

donbarbos · 2025-02-16T02:48:25Z

`re` is no longer imported at module level, and the extra (undocumented) function parameter is gone (see the documentation).

How did I benchmark the new version?

I wrote the following Python script, difflib_bench.py:

    import re
    import timeit


    def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
        return pat(line) is not None


    def is_line_junk_no_regex(line):
        return line.strip() == "" or line.lstrip().rstrip() == "#"


    test_cases = [
        " ", " #", " # comment", "code line", " 123 ", " 123 #", " 123 # hi", " 123 #comment",
        "\n", " # \n", "hello\n",
        "", " ", "\t", "#", " #", "# ", " # ", "text", "text # comment", "#text", "##", "#\t", " ", " #\t ",
    ]


    def benchmark(func, cases, n=100000):
        return timeit.timeit(lambda: [func(line) for line in cases], number=n)


    regex_time = benchmark(is_line_junk_regex, test_cases)
    no_regex_time = benchmark(is_line_junk_no_regex, test_cases)

    print(f"Regex time: {regex_time:.6f} sec")
    print(f"No regex time: {no_regex_time:.6f} sec")

`timeit` result: 1.33s -> 0.64s, about 2.08x as fast

    $ ./python -B difflib_bench.py
    Regex time: 1.330593 sec
    No regex time: 0.638559 sec

Issue: Improve speed of stdlib functions by replacing re uses #130167

Lib/difflib.py

Co-authored-by: Adam Turner <[email protected]>

terryjreedy

One oddity of difflib is that re is imported at module level for IS_LINE_JUNK and again at class level in class _mdiff, line 1378. It is used in line 1381 to define class name change_re = re.compile(r'(\++|\-+|\^+)'). To avoid importing re when importing difflib, the second import would have to be moved within a function. This is possible since change_re is only used within the _make_line method that immediately follows (1386).

I have not looked at how commonly this method is used.
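The suggestion above, moving an import from definition time into the function that uses it, could look like the following minimal sketch. The function name `count_change_markers` is hypothetical and is not part of difflib; only the pattern string is taken from the comment.

```python
def count_change_markers(text):
    # Import inside the function: importing this module no longer
    # imports re; the cost is paid on the first call instead.
    import re

    change_re = re.compile(r'(\++|\-+|\^+)')
    # Each match is one run of '+', '-' or '^' characters.
    return len(change_re.findall(text))


print(count_change_markers("++ -- ^"))
```

The `re` module keeps a cache of compiled patterns, so compiling inside the function is unlikely to dominate repeated calls, though that is worth measuring rather than assuming.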

Lib/difflib.py

bedevere-app · 2025-02-16T08:14:52Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

terryjreedy · 2025-02-16T08:37:27Z

I checked correctness of the return expression replacement, using test_cases above, with

    for s in test_cases:
        (pat(s) is not None) is (s.strip() in ('', '#'))

and all are True.

donbarbos · 2025-02-16T11:52:36Z

> I checked correctness of the return expression replacement, using test_cases above, with

OK, thank you, but test_cases is too small a sample, so I supplemented my script with the following checks:

    import string
    import random
    from itertools import product

    # First Checks:
    for case in test_cases:
        assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"


    def random_string(length=10):
        chars = string.printable + " " * 5 + "#" * 5
        return "".join(random.choice(chars) for _ in range(length))


    # Second Checks:
    for _ in range(10000):
        case = random_string(random.randint(0, 20))
        assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

    # Third Checks:
    chars = " #\tabc"
    for length in range(6):
        for case in map("".join, product(chars, repeat=length)):
            assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

Then I ran the tests in Lib/test/test_difflib.py and they all passed 👍

donbarbos · 2025-02-16T11:52:52Z

I have made the requested changes; please review again

bedevere-app · 2025-02-16T11:52:57Z

Thanks for making the requested changes!

@terryjreedy, @AA-Turner: please review the changes made to this pull request.

Lib/difflib.py

picnixz

Since we're improving performance, we don't want to recompute line.strip()
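A minimal sketch of the point above: bind the stripped string once and reuse it for both comparisons. The function name is hypothetical.

```python
def is_line_junk_single_strip(line):
    # Strip once and reuse the result instead of calling strip() twice.
    stripped = line.strip()
    return stripped == "" or stripped == "#"


print(is_line_junk_single_strip(" # \n"), is_line_junk_single_strip("code"))
```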

Lib/difflib.py

Co-authored-by: Bénédikt Tran <[email protected]>

donbarbos · 2025-02-16T14:58:43Z

> Since we're improving performance, we don't want to recompute line.strip()

Sorry, I suspected this but hoped that there was a cache mechanism here :)

serhiy-storchaka

I am not sure that the performance of this function is worth changing the code. In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000+' ') the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100, and for length 1000000 -- more than 1000.

Hmm, it seems that that implementation has more than quadratic complexity. Of course, most data for which this function is used is short strings, but such behavior can be considered a security threat.

Lib/difflib.py

terryjreedy · 2025-02-16T23:09:32Z

I checked correctness of the return expression replacement, using test_cases above, with

    for s in test_cases:
        (pat(s) is not None) is (s.strip() in ('', '#'))

and all are True.

I am going to be neutral on whether the basic change should be merged. If it is, I personally think the 'pat' parameter can be considered to be private and eliminated. But I would defer the back-compatibility question to others, including 'encukou'. The only documented use of the function, marked as a constant, is to be passed as an argument. The argument is always passed on to SequenceMatcher, which calls it once with 1 argument, which would be a line.
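For context, the documented use mentioned above is passing the function as the `linejunk` argument, for example to `difflib.ndiff`. A small sketch with made-up input lines:

```python
import difflib

before = ["one\n", "\n", "two\n"]
after = ["one\n", "#\n", "two\n"]

# IS_LINE_JUNK is handed over as a callable; difflib calls it with one line.
delta = list(difflib.ndiff(before, after, linejunk=difflib.IS_LINE_JUNK))
print("".join(delta), end="")
```

Callers written this way never pass a second argument themselves, which is the basis for treating `pat` as private.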

donbarbos · 2025-02-16T23:18:37Z

@serhiy-storchaka

> In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000+' ') the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100, and for length 1000000 -- more than 1000.

I tested your cases but didn't get the result you expected (maybe I did something wrong?)
I ran the following script (I accidentally mixed up the dividend and divisor, but it doesn't matter):

    import re
    import timeit


    def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
        return pat(line) is not None


    def is_line_junk_no_regex(line):
        return line.strip() == "" or line.lstrip().rstrip() == "#"


    def benchmark(func, cases, n=1_000):
        return timeit.timeit(lambda: [func(line) for line in cases], number=n)


    lengths = [10_000, 100_000, 1_000_000]
    for length in lengths:
        test_string = "x" * length + " "
        regex_time = benchmark(is_line_junk_regex, test_string)
        no_regex_time = benchmark(is_line_junk_no_regex, test_string)
        print(f"String length: {length}")
        print(f"Regex version: {regex_time:.6f} seconds")
        print(f"Non-regex version: {no_regex_time:.6f} seconds")
        print(f"Speedup: {no_regex_time / regex_time:.2f}x\n")

and got the following results:

    $ ./python -B bench.py
    String length: 10000
    Regex version: 4.220341 seconds
    Non-regex version: 1.983180 seconds
    Speedup: 0.47x

    String length: 100000
    Regex version: 42.698992 seconds
    Non-regex version: 20.897327 seconds
    Speedup: 0.49x

    String length: 1000000
    Regex version: 516.844790 seconds
    Non-regex version: 262.632697 seconds
    Speedup: 0.51x

picnixz · 2025-02-16T23:30:45Z

Aren't you iterating over a single character instead of a full line:

    [func(line) for line in cases]

You should pass [test_string] instead of just test_string (sorry if I'm wrong, I'm tired)
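A sketch of the benchmark with that correction applied, so that each call receives the whole long line rather than one character. The helper name `compare` and the repeat count are assumptions, and no timings are claimed here since they depend on the machine and build.

```python
import re
import timeit


def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    return pat(line) is not None


def is_line_junk_no_regex(line):
    return line.strip() in ("", "#")


def compare(length, n=100):
    # A list holding the whole line, not the string itself, so the
    # comprehension iterates over lines rather than characters.
    cases = ["x" * length + " "]
    regex_time = timeit.timeit(
        lambda: [is_line_junk_regex(s) for s in cases], number=n)
    no_regex_time = timeit.timeit(
        lambda: [is_line_junk_no_regex(s) for s in cases], number=n)
    return regex_time, no_regex_time


if __name__ == "__main__":
    for length in (10_000, 100_000, 1_000_000):
        regex_time, no_regex_time = compare(length)
        print(f"{length}: regex {regex_time:.6f}s, no regex {no_regex_time:.6f}s")
```

For this particular input the regex can give up at the first `x`, while `strip()` has to build a copy of nearly the whole string, so the two versions are expected to scale differently with length.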

AA-Turner · 2025-02-16T23:34:42Z

I was the one who originally requested restoring pat, but I do find Terry's argument compelling, and I wouldn't oppose removing pat.

I think we should restore this PR to the simpler previous version (the micro-benchmark differences between the string comparisons are so small that they will be swallowed by noise). The class-level re import should be resolved before merge, though.

A
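For illustration only, one way to drop the regex from the default path while still accepting a caller-supplied `pat` could look like this. The name `is_line_junk_compat` and the `None` default are assumptions made for this sketch, not necessarily what the PR ended up with.

```python
import re


def is_line_junk_compat(line, pat=None):
    if pat is None:
        # Default path: no regex involved.
        return line.strip() in ("", "#")
    # A caller passed a match function, as the old signature allowed.
    return pat(line) is not None


# An old-style caller that still supplies its own match function.
old_style = re.compile(r"\s*(?:#\s*)?$").match
print(is_line_junk_compat(" # \n"), is_line_junk_compat(" # \n", old_style))
```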

Co-authored-by: Hugo van Kemenade <[email protected]>

serhiy-storchaka

I was wrong, the complexity is not quadratic, but linear.

Anyway, the relative performance depends on the input. For some inputs the regexp version is faster, for others it is slower. Claiming that this improves speed for any input is bold.

Lib/difflib.py

Misc/NEWS.d/next/Library/2025-02-16-06-25-01.gh-issue-130167.kUg7Rc.rst

rhettinger · 2025-02-18T18:16:45Z

@tim-one Do you have a reason to want to keep the regex?

tim-one · 2025-02-18T20:49:46Z

Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of

    stripped = line.strip()
    return stripped == '' or stripped == '#'

try

    return line.strip() in '#'  # empty or single '#'

?
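The one-liner relies on `in` meaning substring containment for strings: both the empty string and `'#'` are substrings of `'#'`, and no other string is. A small sketch that checks it against the explicit comparison for every string of up to four characters over an arbitrarily chosen alphabet (the function names are hypothetical):

```python
from itertools import product


def junk_explicit(line):
    stripped = line.strip()
    return stripped == "" or stripped == "#"


def junk_membership(line):
    # '' in '#' and '#' in '#' are both True; any other stripped
    # value is not a substring of '#'.
    return line.strip() in "#"


alphabet = " #\ta\n"
mismatches = [
    "".join(chars)
    for length in range(5)
    for chars in product(alphabet, repeat=length)
    if junk_explicit("".join(chars)) != junk_membership("".join(chars))
]
print(len(mismatches))
```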

donbarbos · 2025-02-18T20:51:49Z

> Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of
>
>     stripped = line.strip()
>     return stripped == '' or stripped == '#'
>
> try
>
>     return line.strip() in '#'  # empty or single '#'
>
> ?

That's how it was originally :)

I'll probably restore it so as not to overcomplicate things

Lib/difflib.py

Co-authored-by: Tim Peters <[email protected]>

donbarbos · 2025-03-08T09:01:42Z

@rhettinger

> @tim-one Do you have a reason to want to keep the regex?

Tim doesn't seem to mind

donbarbos · 2025-04-27T21:32:01Z

I don't know who I can ping but it seems like there are more suggestions

Lib/difflib.py

Improve speed of difflib.IS_LINE_JUNK by replacing re
8f2670d

bedevere-app bot added the awaiting review label Feb 16, 2025

bedevere-app bot mentioned this pull request Feb 16, 2025
Improve speed of stdlib functions by replacing re uses #130167
Open

AA-Turner reviewed Feb 16, 2025

donbarbos and others added 2 commits February 16, 2025 07:20

Update difflib.py
0fae9c5
Co-authored-by: Adam Turner <[email protected]>

Update difflib.py
352b4bc
Co-authored-by: Adam Turner <[email protected]>

AA-Turner approved these changes Feb 16, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels Feb 16, 2025

terryjreedy requested changes Feb 16, 2025

bedevere-app bot added awaiting changes and removed awaiting merge labels Feb 16, 2025

Add backward compatibility
0ab8da8

bedevere-app bot added awaiting change review and removed awaiting changes labels Feb 16, 2025

bedevere-app bot requested review from AA-Turner and terryjreedy February 16, 2025 11:52

sobolevn reviewed Feb 16, 2025

picnixz reviewed Feb 16, 2025

Update difflib.py
64dff4a

donbarbos requested a review from picnixz February 16, 2025 14:48

picnixz reviewed Feb 16, 2025

Update Lib/difflib.py
32337c3
Co-authored-by: Bénédikt Tran <[email protected]>

serhiy-storchaka reviewed Feb 16, 2025

Update difflib.py
390b6e7
Co-authored-by: Hugo van Kemenade <[email protected]>

serhiy-storchaka reviewed Feb 17, 2025

Back iff word
30d4535

rhettinger requested a review from tim-one February 18, 2025 18:16

Update difflib
0e7293d

tim-one reviewed Feb 18, 2025

Update difflib.py
bcf45a3
Co-authored-by: Tim Peters <[email protected]>

Fix News message
46a4126

hugovk approved these changes Apr 28, 2025

bedevere-app bot added awaiting merge and removed awaiting change review labels Apr 28, 2025

AA-Turner reviewed May 1, 2025

Improve clarity of expression
60951fb

AA-Turner reviewed May 1, 2025

Revert, but add explanatory comments
9d987b4

AA-Turner approved these changes May 1, 2025

AA-Turner enabled auto-merge (squash) May 1, 2025 03:48

AA-Turner merged commit bce45bc into python:main May 1, 2025
39 checks passed

bedevere-app bot removed the awaiting merge label May 1, 2025

donbarbos mentioned this pull request Jul 26, 2025
[difflib] Update pat param for IS_LINE_JUNK python/typeshed#14449
Merged
