Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-130167: Improve speed of difflib.IS_LINE_JUNK by replacing re#130170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
donbarbos commented Feb 16, 2025 • edited by bedevere-app bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by bedevere-app bot
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Co-authored-by: Adam Turner <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
terryjreedy left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One oddity of difflib is that re is imported at module level for IS_LINE_JUNK and again at class level in class _mdiff line 1378. It is used in line 1381 to define class name change_re = re.compile(r'(\++|\-+|\^+)'). To avoid importing re when importing difflib, the second import would have to be moved within a function. This is possible since change_re is only used within the _make_line method the immediately follows (1386).
I have not looked at how commonly this method is used.
Uh oh!
There was an error while loading. Please reload this page.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
terryjreedy commented Feb 16, 2025
I checked correctness of the return expression replacement, using test_cases above, with and all are True. |
donbarbos commented Feb 16, 2025
Ok, thank you, but this importstringimportrandomfromitertoolsimportproduct# First Checks:forcaseintest_cases: assertis_line_junk_regex(case) ==is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"defrandom_string(length=10): chars=string.printable+" "*5+"#"*5return"".join(random.choice(chars) for_inrange(length)) # Second Checks:for_inrange(10000): case=random_string(random.randint(0, 20)) assertis_line_junk_regex(case) ==is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"# Third Checks:chars=" #\tabc"forlengthinrange(6): forcaseinmap("".join, product(chars, repeat=length)): assertis_line_junk_regex(case) ==is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"Then I checked the tests in |
donbarbos commented Feb 16, 2025
I have made the requested changes; please review again |
Thanks for making the requested changes! @terryjreedy, @AA-Turner: please review the changes made to this pull request. |
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
picnixz left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're improving performance, we don't want to recompute line.strip()
Uh oh!
There was an error while loading. Please reload this page.
Co-authored-by: Bénédikt Tran <[email protected]>
donbarbos commented Feb 16, 2025
Sorry, I suspected this but hoped that there was a cache mechanism here :) |
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that the performance of this function is worth changing the code. In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000+' ') the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100, and for length 1000000 -- more than 1000.
Hmm, it seems, that that implementation has more than quadratic complexity. Of course, most data for which this function is used is short strings, but such behavior can be considered a security threat.
Uh oh!
There was an error while loading. Please reload this page.
terryjreedy commented Feb 16, 2025 • edited by hugovk
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by hugovk
Uh oh!
There was an error while loading. Please reload this page.
I checked correctness of the return expression replacement, using test_cases above, with forsintest_cases: (pat(s) isnotNone) is (s.strip() in ('', '#'))and all are True. I am going to be neutral an whether the basic change should be merged. If it is, I personally think the 'pat' parameter can be considered to be private and eliminated. But I would defer the back-compatibility to others, include 'encukou'. The only documented use of the function, marked as a constant, is to be passed as an argument. The argument is always passed on to SequenceMatcher, which calls it once with 1 argument, which would be a line. |
donbarbos commented Feb 16, 2025 • edited by hugovk
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by hugovk
Uh oh!
There was an error while loading. Please reload this page.
I tested your cases but didn't get the result you expected (maybe I did something wrong?) importreimporttimeitdefis_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match): returnpat(line) isnotNonedefis_line_junk_no_regex(line): returnline.strip() ==""orline.lstrip().rstrip() =="#"defbenchmark(func, cases, n=1_000): returntimeit.timeit(lambda: [func(line) forlineincases], number=n) lengths= [10_000, 100_000, 1_000_000] forlengthinlengths: test_string="x"*length+" "regex_time=benchmark(is_line_junk_regex, test_string) no_regex_time=benchmark(is_line_junk_no_regex, test_string) print(f"String length: {length}") print(f"Regex version: {regex_time:.6f} seconds") print(f"Non-regex version: {no_regex_time:.6f} seconds") print(f"Speedup: {no_regex_time/regex_time:.2f}x\n")and got next results: $ ./python -B bench.pyString length: 10000Regex version: 4.220341 secondsNon-regex version: 1.983180 secondsSpeedup: 0.47xString length: 100000Regex version: 42.698992 secondsNon-regex version: 20.897327 secondsSpeedup: 0.49xString length: 1000000Regex version: 516.844790 secondsNon-regex version: 262.632697 secondsSpeedup: 0.51x |
picnixz commented Feb 16, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Aren't you iterating over a single character instead of a full line: [func(line) forlineincases]You should pass |
AA-Turner commented Feb 16, 2025
I was the one who originally reqested restoring I think we should restore this PR to the simpler previous version (the differences micro-benchmarks between string comparisons are so small they will be swallowed by noise). The class-level A |
Co-authored-by: Hugo van Kemenade <[email protected]>
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, the complexity is not quadratic, but linear.
Anyway, the relative performance depends on input. For some input the regexp version is faster, for other it is slower. Claiming that this improves speed for any input is bold.
Uh oh!
There was an error while loading. Please reload this page.
Misc/NEWS.d/next/Library/2025-02-16-06-25-01.gh-issue-130167.kUg7Rc.rst Outdated Show resolvedHide resolved
Uh oh!
There was an error while loading. Please reload this page.
rhettinger commented Feb 18, 2025
@tim-one Do you have a reason to want to keep the regex? |
tim-one commented Feb 18, 2025
Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimizztion" thing, instead of stripped=line.strip() returnstripped==''orstripped=='#'try returnline.strip() in'#'# empty or single '#'? |
donbarbos commented Feb 18, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
that's how it was originally :) I'll probably return it so as not to overcomplicate things |
Uh oh!
There was an error while loading. Please reload this page.
Co-authored-by: Tim Peters <[email protected]>
donbarbos commented Mar 8, 2025
Tim doesn't seem to mind |
donbarbos commented Apr 27, 2025
I don't know who I can ping but it seems like there are more suggestions |
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
bce45bc into python:mainUh oh!
There was an error while loading. Please reload this page.
Now we don't import of
reglobally and there is no longer an extra (undocumented) function parameter (see documentation)How did i benchmark a new version?
I wrote next python script
difflib_bench.py:timeitresult: 1.33s -> 0.64s = x2.08 as fastreuses #130167