bpo-41972: Tweak fastsearch.h string search algorithms #27091
sweeneyde commented Jul 12, 2021 • edited by bedevere-bot
ambv left a comment
I admit going through _two_way with the double loops and gotos caused a little head-spinning. I'll run some Hypothesis tests on the new version.
Note that this can only go to 3.11 now, as it doesn't qualify as a bug fix and we're past the beta stage for 3.10.
Objects/stringlib/fastsearch.h Outdated
    if (mode != FAST_RSEARCH) {
    if (m >= 100 && w >= 2000 && w / m >= 5) {
    if (n < 3000 || (m < 100 && n < 30000) || m < 6) {
How did you come about those numbers?
I'll consolidate the benchmarks I used and post them here. But this is a chart comparing the default implementation to the existing two-way implementation. The equivalent chart for this implementation is similar, and I'm essentially trying to cut out the green bits of that image.
For now though, the differences this PR makes have some benchmarks here
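As a rough illustration of the cutoff under review, the new condition can be read as a standalone predicate. This is only a sketch: the helper name `sketch_use_default_find` is hypothetical, and it assumes that `n` is the haystack length, `m` is the needle length, and that a true result routes the search to the simpler default algorithm instead of two-way (the excerpt does not show the branch body).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of fastsearch.h.
 * n: haystack length, m: needle length (assumed meanings).
 * Returns true when the quoted cutoff condition holds, which is
 * assumed here to mean "skip two-way, use the default search". */
static bool
sketch_use_default_find(ptrdiff_t n, ptrdiff_t m)
{
    assert(n >= 0 && m >= 0);
    return n < 3000 || (m < 100 && n < 30000) || m < 6;
}
```

Read this way, short haystacks, short needles in medium-sized haystacks, and very short needles would all avoid the two-way startup cost.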
sweeneyde commented Jul 15, 2021
There's an overview of the two-way algorithm at https://github.com/python/cpython/blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt

If you want to be able to test more rigorously, you could make the table and its contents smaller with something like this:

    #define MAX_SHIFT 3
    #define TABLE_SIZE_BITS 2u
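To see why shrinking these macros makes testing more demanding, here is a minimal sketch of a compressed bad-character shift table built with the small values suggested above. The helper `sketch_fill_shift_table` and the `TABLE_SIZE` macro are assumptions made for illustration, and the fill logic is a plain Horspool-style construction, not necessarily what fastsearch.h does line for line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHIFT 3
#define TABLE_SIZE_BITS 2u
#define TABLE_SIZE (1u << TABLE_SIZE_BITS)   /* assumed helper macro */
#define TABLE_MASK (TABLE_SIZE - 1u)

/* Hypothetical helper: fill a 4-entry shift table for `needle`.
 * Each entry holds how far the window may advance when the byte
 * under the window's last position hashes to that slot. */
static void
sketch_fill_shift_table(const unsigned char *needle, ptrdiff_t m,
                        uint8_t table[TABLE_SIZE])
{
    assert(m > 0);
    /* Bytes absent from the needle's tail allow the largest shift. */
    ptrdiff_t not_found_shift = m < MAX_SHIFT ? m : MAX_SHIFT;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        table[i] = (uint8_t)not_found_shift;
    }
    /* Later occurrences overwrite earlier ones, leaving the
     * smallest (safe) shift for each slot. */
    for (ptrdiff_t i = m - not_found_shift; i < m; i++) {
        table[needle[i] & TABLE_MASK] = (uint8_t)(m - 1 - i);
    }
}
```

With only four slots and a maximum shift of 3, many different bytes collide in the same slot and shifts saturate quickly, so short test inputs can reach code paths that would otherwise need long needles.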
sweeneyde commented Jul 16, 2021
Here's the chart of the ratios of the runtimes of the two algorithms, with cyan/black text to distinguish where these thresholds lie. This was run using zipf_benchmarks.py in the gist I posted before.
ambv commented Jul 16, 2021
I was running Hypothesis smoke tests on this PR for most of today and all's well. Do the benchmarks you provided include the changes in your last two commits?
sweeneyde commented Jul 16, 2021
I re-ran some benchmarks comparing the main branch to commit 4d7d102 and they're here: https://gist.github.com/sweeneyde/c370dcee453a2bcb34157261fb79650e

These use the same benchmarks as I posted before: one randomly generated character-by-character with a zipf distribution, and then a set of benchmarks comparing binary data, C code, Python code, and reStructuredText.

[Figure: summary of random string results]

[Figure: summary of "real life" benchmarks]

(My most recent commits only tweaked the cutoffs, and the table in my last comment ran with no run-time cutoffs.)
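For readers unfamiliar with the random benchmark, the idea of generating text character-by-character from a Zipf distribution can be sketched as below. This is not the code in zipf_benchmarks.py (which is Python); the function names, the exponent of 1, and the choice of 'A' as the most frequent letter are all assumptions made for illustration. Uniform samples are passed in by the caller so the sketch stays deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a uniform sample u in [0, 1) to a rank in
 * [0, k), where rank r has probability proportional to 1 / (r + 1). */
static size_t
sketch_zipf_rank(double u, size_t k)
{
    assert(k > 0 && u >= 0.0 && u < 1.0);
    double total = 0.0;
    for (size_t r = 0; r < k; r++) {
        total += 1.0 / (double)(r + 1);
    }
    double acc = 0.0;
    for (size_t r = 0; r < k; r++) {
        acc += (1.0 / (double)(r + 1)) / total;
        if (u < acc) {
            return r;
        }
    }
    return k - 1;  /* guard against floating-point rounding */
}

/* Hypothetical helper: write len letters plus a terminating NUL into
 * out, using one caller-supplied uniform sample per character. */
static void
sketch_zipf_string(const double *uniform, size_t len, size_t k, char *out)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (char)('A' + sketch_zipf_rank(uniform[i], k));
    }
    out[len] = '\0';
}
```

Strings built this way have a few very common characters and a long tail of rare ones, which is what makes the final character of the needle matter so much for a skip-table search.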
ambv commented Jul 17, 2021
The recent results still look pretty good but are visibly worse than the ones you presented on July 12 on the issue. What do you think the reason for that is?
sweeneyde commented Jul 17, 2021
There was a bug in my original implementation where it could access past the end of the haystack buffer. The new and old assembly is below (Microsoft Visual C++ 2019, Release x64 Build).

Assembly for commit 4d7d102 (it looks like it unrolled x2):

       440:             for (;;) {
       441:                 LOG_LINEUP();
       442:                 Py_ssize_t shift = table[(*window_last) & TABLE_MASK];
    00007FF937A5BFA0  movsx       rax,byte ptr [r9]
    00007FF937A5BFA4  and         eax,3Fh
    00007FF937A5BFA7  movzx       ecx,byte ptr [rax+r15]
       443:                 window_last += shift;
    00007FF937A5BFAC  add         r9,rcx
       444:                 if (shift == 0) {
    00007FF937A5BFAF  test        rcx,rcx
    00007FF937A5BFB2  je          $windowloop+2Dh (07FF937A5BFCDh)
       445:                     break;
       446:                 }
       447:                 if (window_last >= haystack_end) {
    00007FF937A5BFB4  cmp         r9,rbp
    00007FF937A5BFB7  jae         stringlib__two_way+13Ch (07FF937A5BF4Ch)
       440:             for (;;) {
       441:                 LOG_LINEUP();
       442:                 Py_ssize_t shift = table[(*window_last) & TABLE_MASK];
    00007FF937A5BFB9  movsx       rax,byte ptr [r9]
    00007FF937A5BFBD  and         eax,3Fh
    00007FF937A5BFC0  movzx       ecx,byte ptr [rax+r15]
       443:                 window_last += shift;
    00007FF937A5BFC5  add         r9,rcx
       444:                 if (shift == 0) {
    00007FF937A5BFC8  test        rcx,rcx
    00007FF937A5BFCB  jne         $windowloop+14h (07FF937A5BFB4h)
       448:                     return -1;
       449:                 }
       450:                 LOG("Horspool skip");
       451:             }

Old assembly from the earlier commit:

       432:             while (shift > 0 && window_last < haystack_end) {
    00007FF93489BF6C  test        rdx,rdx
    00007FF93489BF6F  je          $windowloop+33h (07FF93489BF93h)
    00007FF93489BF71  cmp         rax,rsi
    00007FF93489BF74  jae         stringlib__two_way+13Dh (07FF93489BF0Dh)
       433:                 LOG("Horspool skip.\n");
       434:                 window_last += shift;
    00007FF93489BF76  add         rax,rdx
       435:                 shift = table[(*window_last) & TABLE_MASK];
    00007FF93489BF79  movsx       rcx,byte ptr [rax]
    00007FF93489BF7D  and         ecx,3Fh
    00007FF93489BF80  movzx       edx,byte ptr [rcx+r14]
    00007FF93489BF85  test        rdx,rdx
    00007FF93489BF88  jne         $windowloop+11h (07FF93489BF71h)
       436:                 LOG_LINEUP();
       437:             }
       438:             if (window_last >= haystack_end) {
    00007FF93489BF8A  cmp         rax,rsi
    00007FF93489BF8D  jae         stringlib__two_way+13Dh (07FF93489BF0Dh)
       439:                 break; // return -1
       440:             }

It appears that the 10 Zipf results that slowed down the most between these versions all had needles that ended in 'D', which I had originally randomly set to be the most frequently occurring character (about 25% of all characters). For needles that end in common characters, less time is spent in the Horspool skip table loop, so unrolling made things worse rather than better. Similarly, the benchmarks on binary data that slowed down the most all ended in null bytes, which were very common in the file I looked at.

I can look at reverting commit b8df3e1 and see how that changes things.
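The shape of the skip loop in the comment above, with the bounds check placed immediately after the window advances, can be tried in isolation with a sketch like the following. It is a simplified Horspool-style search, not the two-way algorithm: the function name is hypothetical, indices replace the `window_last` pointer, and on a failed comparison the window just moves by one position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE_BITS 6u
#define TABLE_SIZE (1u << TABLE_SIZE_BITS)   /* assumed helper macro */
#define TABLE_MASK (TABLE_SIZE - 1u)
#define MAX_SHIFT 255

/* Hypothetical helper: index of the first match of needle[0..m) in
 * haystack[0..n), or -1 if there is none. */
static ptrdiff_t
sketch_horspool_find(const char *haystack, ptrdiff_t n,
                     const char *needle, ptrdiff_t m)
{
    assert(m > 0 && n >= 0);
    if (n < m) {
        return -1;
    }
    /* Compressed bad-character table keyed on the low bits of a byte. */
    uint8_t table[TABLE_SIZE];
    ptrdiff_t not_found_shift = m < MAX_SHIFT ? m : MAX_SHIFT;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        table[i] = (uint8_t)not_found_shift;
    }
    for (ptrdiff_t i = m - not_found_shift; i < m; i++) {
        table[(unsigned char)needle[i] & TABLE_MASK] = (uint8_t)(m - 1 - i);
    }

    ptrdiff_t last = m - 1;  /* index of the window's last byte */
    while (last < n) {
        /* Skip loop: advance until the last byte could match. */
        for (;;) {
            ptrdiff_t shift = table[(unsigned char)haystack[last] & TABLE_MASK];
            last += shift;
            if (shift == 0) {
                break;
            }
            if (last >= n) {
                return -1;  /* checked before the next read */
            }
        }
        ptrdiff_t start = last - m + 1;
        if (memcmp(haystack + start, needle, (size_t)m) == 0) {
            return start;
        }
        last += 1;  /* simplification: the real code shifts smarter */
    }
    return -1;
}
```

When the needle ends in a very common byte, the table lookup returns 0 almost immediately, so the inner loop exits after one or two iterations and most of the time goes to the comparison step instead, which is consistent with the observation that unrolling the skip loop did not help those cases.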
sweeneyde commented Jul 18, 2021
Reverting commit b8df3e1 was a wash: "1.02x slower" on the zipf benchmarks and "1.00x faster" on the real-life benchmarks.
bpo-41972: Tweak fastsearch.h string search algorithms (pythonGH-27091)
sweeneyde commented Jul 19, 2021
Thanks @ambv, and congrats on the new role!
ambv commented Jul 19, 2021
Thanks for the kind words, and thanks for your change!
* origin/main: (1146 commits)
  bpo-42064: Finalise establishing sqlite3 global state (pythonGH-27155)
  bpo-44678: Separate error message for discontinuous padding in binascii.a2b_base64 strict mode (pythonGH-27249)
  correct spelling (pythonGH-27076)
  bpo-44524: Add missed __name__ and __qualname__ to typing module objects (python#27237)
  bpo-27513: email.utils.getaddresses() now handles Header objects (python#13797)
  Clean up comma usage in Doc/library/functions.rst (python#27083)
  bpo-42238: Fix small rst issue in NEWS.d/. (python#27238)
  bpo-41972: Tweak fastsearch.h string search algorithms (pythonGH-27091)
  bpo-44340: Add support for building with clang full/thin lto (pythonGH-27231)
  bpo-44661: Update property_descr_set to use vectorcall if possible. (pythonGH-27206)
  bpo-44645: Check for interrupts on any potentially backwards edge (pythonGH-27216)
  bpo-41546: make pprint (like print) not write to stdout when it is None (pythonGH-26810)
  bpo-44554: refactor pdb targets (and internal tweaks) (pythonGH-26992)
  bpo-43086: Add handling for out-of-spec data in a2b_base64 (pythonGH-24402)
  bpo-44561: Update hyperlinks in Doc/distributing/index.rst (python#27032)
  bpo-42355: symtable.get_namespace() now checks whether there are multiple or any namespaces found (pythonGH-23278)
  bpo-44654: Do not export the union type related symbols (pythonGH-27223)
  bpo-44633: Fix parameter substitution of the union type with wrong types. (pythonGH-27218)
  bpo-44654: Refactor and clean up the union type implementation (pythonGH-27196)
  bpo-20291: Fix MSVC warnings in getargs.c (pythonGH-27211)
  ...

https://bugs.python.org/issue41972