Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-133968: Add PyUnicodeWriter_WriteASCII() function#133973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
vstinner commented May 13, 2025 • edited by github-actions bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by github-actions bot
Uh oh!
There was an error while loading. Please reload this page.
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
vstinner commented May 13, 2025
JSON benchmark: #133832 (comment)
Benchmark hidden because not significant (11): encode 100 floats, encode ascii string len=100, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode escaped string len=896, encode 10000 integers, encode 10000 floats, encode 10000 "ascii" strings Up to 1.20x faster to encode booleans is interesting knowing that these strings are very short: "true" (4 characters) and "false" (5 characters). |
vstinner commented May 13, 2025
@serhiy-storchaka: What do you think of this function? |
vstinner commented May 13, 2025
ZeroIntensity left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
serhiy-storchaka commented May 13, 2025
Well, we had But We can add private |
Co-authored-by: Peter Bierma <[email protected]>
vstinner commented May 13, 2025
I don't think that it can become as fast or faster than a function which takes ASCII string as argument. If we know that the input string is ASCII, there is no need to scan the string for non-ASCII characters, and we can take the fast path. You're right that the UTF-8 decoder is already highly optimized. |
vstinner commented May 14, 2025
In short:
It's hard to beat |
serhiy-storchaka commented May 14, 2025
Yes, although it was close, at least for moderately large strings. Could it be optimized even more? I don't know. But decision about |
vstinner commented May 14, 2025
I created capi-workgroup/decisions#65 issue. |
vstinner commented May 14, 2025
Benchmark: On long strings (10,000 bytes), PyUnicodeWriter_WriteASCII() is up to 2x faster (1.36 us => 690 ns) than PyUnicodeWriter_WriteUTF8(). Detailsfrom_testcapiimportPyUnicodeWriterimportpyperfrange_100=range(100) defbench_write_utf8(text, size): writer=PyUnicodeWriter(0) for_inrange_100: writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) writer.write_utf8(text, size) defbench_write_ascii(text, size): writer=PyUnicodeWriter(0) for_inrange_100: writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) writer.write_ascii(text, size) runner=pyperf.Runner() sizes= (10, 100, 1_000, 10_000) forsizeinsizes: text=b'x'*sizerunner.bench_func(f'write_utf8 size={size:,}', bench_write_utf8, text, size, inner_loops=1_000) forsizeinsizes: text=b'x'*sizerunner.bench_func(f'write_ascii size={size:,}', bench_write_ascii, text, size, inner_loops=1_000) |
encukou commented May 15, 2025
Do we know where the bottleneck is for long strings? |
vstinner commented May 15, 2025
WriteUTF8() has to check for non-ASCII characters: this check has a cost. That's the bottleneck.
Maybe, I don't know if it would be faster. |
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented May 15, 2025
I tried but failed to modify the code to copy while reading (checking if the string is encoded to ASCII). The code is quite complicated. |
Co-authored-by: Bénédikt Tran <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
picnixz left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to have this function public. I always preferred using the faster versions of the writer API when I hardcoded strings, but they were private.
Uh oh!
There was an error while loading. Please reload this page.
ZeroIntensity left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, LGTM as well.
vstinner commented May 29, 2025
The C API Working Group voted in favor of adding the function. |
f49a07b into python:mainUh oh!
There was an error while loading. Please reload this page.
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Unrelated change to please the linter: remove an unused import in test_ctypes. Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]> (cherry picked from commit f49a07b)
GH-134974 is a backport of this pull request to the 3.14 branch. |
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]> (cherry picked from commit f49a07b)
…#134974) gh-133968: Add PyUnicodeWriter_WriteASCII() function (#133973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). (cherry picked from commit f49a07b) Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Unrelated change to please the linter: remove an unused import in test_ctypes. Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII(). Unrelated change to please the linter: remove an unused import in test_ctypes. Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
📚 Documentation preview 📚: https://cpython-previews--133973.org.readthedocs.build/