Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful()#120639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
vstinner commented Jun 17, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Jun 17, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
PR to discuss extensions to the PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar( PyUnicodeWriter*writer, wchar_t*str, Py_ssize_tsize); PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful( PyUnicodeWriter*writer, constchar*string, /* UTF-8 encoded string */Py_ssize_tlength, /* size of string */constchar*errors, /* error handling */Py_ssize_t*consumed); /* bytes consumed */ |
Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.
7c4cc95 to 8aa73b7Comparevstinner commented Jun 17, 2024
Objects/unicodeobject.c Outdated
| if (size < 0){ | ||
| size = wcslen(str); | ||
| } | ||
| PyObject *obj = PyUnicode_FromWideChar(str, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this API will be used a lot to build Python Unicode objects from wchar_t input, I think it's better to try to optimize it and avoid creating a temporary object.
The PyUnicode_FromWideChar() could be refactored using a private helper shared by both PyUnicode_FromWideChar () and this PyUnicodeWriter_WriteWideChar() to make this possible: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1794
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I optimized PyUnicodeWriter_WriteWideChar(). I ran a benchmark on _testcapi.test_unicodewriter_widechar():
$ env/bin/python -m pyperf timeit -s 'from _testcapi import test_unicodewriter_widechar' 'test_unicodewriter_widechar()' -o ref.json -v (...) $ python3 -m pyperf compare_to ref.json optim.json Mean +- std dev: [ref] 203 ns +- 9 ns -> [optim] 150 ns +- 3 ns: 1.35x faster It's a 1.4x faster, so it's worth it. It saves around 53 ns for 3 calls to PyUnicodeWriter_WriteWideChar().
Avoid a temporary Unicode object, write directly into the writer.
vstinner commented Jun 19, 2024
@malemburg: Is |
malemburg commented Jun 19, 2024
Yes, thanks for adding that. |
Uh oh!
There was an error while loading. Please reload this page.
Objects/unicodeobject.c Outdated
| PyObject * | ||
| PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size) | ||
| static inline int | ||
| unicode_fromwidechar(const wchar_t *u, Py_ssize_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that more than a half of this function is now specific to the caller. This is a mess. I wonder, would not it be simpler if write it as two different functions specialized for their case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar(): I added unicode_write_widechar() and removed unicode_convert_wchar_to_ucs4(). Does it look better?
Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?
Objects/unicodeobject.c Outdated
| // This code assumes that unicode can hold one more code point than | ||
| // wstr characters for a terminating null character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is no longer true, after adding the (iter+1) < end check.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
| // consumed is 0 if write fails | ||
| consumed = 12345; | ||
| assert(PyUnicodeWriter_DecodeUTF8Stateful(writer, "invalid\xFF", -1, NULL, &consumed) < 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This do nothing in non-debug build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertions are always built in _testcapi.c: the NDEBUG macro is undefined early in parts.h.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
| if (PyUnicodeWriter_WriteWideChar(writer, L"-", 1) < 0){ | ||
| goto error; | ||
| } | ||
| if (PyUnicodeWriter_WriteWideChar(writer, L"euro=\u20AC", -1) < 0){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test surrogate pairs and non-BMP characters.
Since the code depends on the kind of the buffer string, you need to test different combinations: write different strings after writing a UCS2 or UCS4 string.
I suggest to implement in C a function which creates a PyUnicodeWriter, write the first argument as a Python string, then covert the second argument to the wchar_t* string and write it with size specified as optional third argument, and return the result. This helper function can be called in Python code with different arguments. The result will be checked even in non-debug build. You can test much more cases.
Co-authored-by: Serhiy Storchaka <[email protected]>
vstinner commented Jun 20, 2024
@serhiy-storchaka: I tried to address most of your reviews. Would you mind to review the updated PR? For tests, it's really complicated to write tests in C. I think that I will try to expose the C API PyUnicodeWriter in Python to write tests in Python in a following PR. I wanted to do that at the beginning, but it was quicker to start with C. Now the C test suite of PyUnicodeWriter is already quite big!
Right, I prefer to leave it as it is for now and write a following PR. |
vstinner commented Jun 21, 2024
Ok, I merged this PR as a starting point. I will rework tests in a follow-up PR. Thanks @serhiy-storchaka and @malemburg for your reviews. |
vstinner commented Jun 21, 2024
Rewrite tests in Python: #120845 |
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.
📚 Documentation preview 📚: https://cpython-previews--120639.org.readthedocs.build/