gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639

vstinner · 2024-06-17T13:36:48Z

Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120639.org.readthedocs.build/

vstinner · 2024-06-17T13:37:39Z

PR to discuss extensions to the PyUnicodeWriter API:

PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar(
    PyUnicodeWriter *writer,
    wchar_t *str,
    Py_ssize_t size);

PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,         /* UTF-8 encoded string */
    Py_ssize_t length,          /* size of string */
    const char *errors,         /* error handling */
    Py_ssize_t *consumed);      /* bytes consumed */
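To illustrate what the `consumed` output parameter is for, here is a minimal standalone sketch of the idea behind stateful decoding: complete UTF-8 sequences are consumed, while an incomplete sequence at the end of the buffer is left for a later call. This is not CPython's decoder. The names `utf8_seq_len`, `utf8_consumed` and `UTF8_INVALID` are hypothetical, and the sketch assumes a simplified validation that ignores overlong forms and encoded surrogates.

```c
#include <stddef.h>

/* Hypothetical error marker for this sketch (not a CPython constant). */
#define UTF8_INVALID ((size_t)-1)

/* Length of a UTF-8 sequence deduced from its lead byte,
   or 0 if the byte cannot start a sequence. */
size_t
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;               /* continuation byte or invalid lead byte */
}

/* Return the number of leading bytes of buf that form complete
   sequences. A truncated sequence at the end is not consumed.
   Return UTF8_INVALID on an invalid lead or continuation byte. */
size_t
utf8_consumed(const char *buf, size_t len)
{
    size_t pos = 0;
    while (pos < len) {
        size_t n = utf8_seq_len((unsigned char)buf[pos]);
        if (n == 0) {
            return UTF8_INVALID;
        }
        /* Only check the continuation bytes that are actually present. */
        size_t avail = (pos + n <= len) ? n : len - pos;
        for (size_t i = 1; i < avail; i++) {
            if (((unsigned char)buf[pos + i] & 0xC0) != 0x80) {
                return UTF8_INVALID;
            }
        }
        if (avail < n) {
            break;          /* incomplete tail: leave it for the next call */
        }
        pos += n;
    }
    return pos;
}
```

With this sketch, a buffer holding `a` followed by the first two bytes of the euro sign reports 1 byte consumed, so a caller would carry the two remaining bytes over and prepend them to the next chunk.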

vstinner · 2024-06-17T15:57:04Z

cc @serhiy-storchaka @malemburg @zooba

malemburg · 2024-06-19T08:22:10Z

Objects/unicodeobject.c

+ if (size < 0){
+ size = wcslen(str);
+ }
+ PyObject *obj = PyUnicode_FromWideChar(str, size);


Since this API will be used a lot to build Python Unicode objects from wchar_t input, I think it's better to try to optimize it and avoid creating a temporary object.
PyUnicode_FromWideChar() could be refactored to use a private helper shared by both PyUnicode_FromWideChar() and this PyUnicodeWriter_WriteWideChar() to make this possible: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1794

Ok. I optimized PyUnicodeWriter_WriteWideChar(). I ran a benchmark on _testcapi.test_unicodewriter_widechar():
$ env/bin/python -m pyperf timeit -s 'from _testcapi import test_unicodewriter_widechar' 'test_unicodewriter_widechar()' -o ref.json -v
(...)
$ python3 -m pyperf compare_to ref.json optim.json
Mean +- std dev: [ref] 203 ns +- 9 ns -> [optim] 150 ns +- 3 ns: 1.35x faster
That's about 1.35x faster, so it's worth it. It saves around 53 ns across 3 calls to PyUnicodeWriter_WriteWideChar().

Avoid a temporary Unicode object, write directly into the writer.

vstinner · 2024-06-19T10:22:24Z

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

malemburg · 2024-06-19T13:04:45Z

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

Yes, thanks for adding that.

serhiy-storchaka · 2024-06-20T08:07:45Z

Objects/unicodeobject.c

-PyObject *
-PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size)
+static inline int
+unicode_fromwidechar(const wchar_t *u, Py_ssize_t size,


It seems that more than half of this function is now specific to the caller. This is a mess. I wonder: wouldn't it be simpler to write it as two different functions, each specialized for its own case?

I refactored PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar(): I added unicode_write_widechar() and removed unicode_convert_wchar_to_ucs4(). Does it look better?

Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().

serhiy-storchaka

There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?

serhiy-storchaka · 2024-06-20T12:30:41Z

Objects/unicodeobject.c

+ // This code assumes that unicode can hold one more code point than
+ // wstr characters for a terminating null character.


I think this is no longer true, after adding the (iter+1) < end check.
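A standalone sketch of why such a bounds check removes the need for a terminating null character: the conversion loop only looks at the next unit when it lies inside the buffer. This is not the code from the PR. The function name `utf16_to_ucs4` is hypothetical, and the sketch assumes 16-bit input units (as with a 2-byte `wchar_t`) and an output buffer with room for one code point per input unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 code units in [iter, end) to UCS4 code points.
   Return the number of code points written to out. */
size_t
utf16_to_ucs4(const uint16_t *iter, const uint16_t *end, uint32_t *out)
{
    size_t n = 0;
    while (iter < end) {
        uint16_t ch = *iter;
        if (ch >= 0xD800 && ch <= 0xDBFF                /* high surrogate */
            && (iter + 1) < end                         /* next unit exists */
            && iter[1] >= 0xDC00 && iter[1] <= 0xDFFF)  /* low surrogate */
        {
            /* Join the pair into one non-BMP code point. */
            out[n++] = 0x10000
                       + ((uint32_t)(ch - 0xD800) << 10)
                       + (uint32_t)(iter[1] - 0xDC00);
            iter += 2;
        }
        else {
            /* BMP character, or a lone surrogate copied unchanged. */
            out[n++] = ch;
            iter++;
        }
    }
    return n;
}
```

Because `(iter + 1) < end` is tested before `iter[1]` is read, a high surrogate in the last position is copied as a lone surrogate and the loop never reads past `end`.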

serhiy-storchaka · 2024-06-20T13:09:54Z

Modules/_testcapi/unicode.c

+
+ // consumed is 0 if write fails
+ consumed = 12345;
+ assert(PyUnicodeWriter_DecodeUTF8Stateful(writer, "invalid\xFF", -1, NULL, &consumed) < 0);


This does nothing in a non-debug build.

Assertions are always built in _testcapi.c: the NDEBUG macro is undefined early in parts.h.
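The mechanism can be shown in isolation. The sketch below assumes only standard C behavior: `assert()` expands to nothing when `NDEBUG` is defined at the point where `<assert.h>` is included, so undefining the macro first keeps the checks active. The function `count_checked_calls` is a hypothetical illustration, not code from `parts.h` or `_testcapi`.

```c
#undef NDEBUG           /* must come before <assert.h> is included */
#include <assert.h>

/* The expression inside assert() has a side effect, like the function
   call in the reviewed test. It is only evaluated while assertions
   are enabled. */
int
count_checked_calls(void)
{
    int calls = 0;
    assert(++calls == 1);
    return calls;
}
```

If `NDEBUG` were still defined when `<assert.h>` is included, the increment would be compiled out and the function would return 0, which is the failure mode the review comment warns about.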

serhiy-storchaka · 2024-06-20T13:21:57Z

Modules/_testcapi/unicode.c

+ if (PyUnicodeWriter_WriteWideChar(writer, L"-", 1) < 0){
+ goto error;
+ }
+ if (PyUnicodeWriter_WriteWideChar(writer, L"euro=\u20AC", -1) < 0){


Also test surrogate pairs and non-BMP characters.
Since the code depends on the kind of the buffer string, you need to test different combinations: write different strings after writing a UCS2 or UCS4 string.
I suggest implementing a C function which creates a PyUnicodeWriter, writes the first argument as a Python string, then converts the second argument to a wchar_t* string and writes it with the size specified as an optional third argument, and returns the result. This helper function can be called from Python code with different arguments. The result will be checked even in a non-debug build. You can test many more cases.

Co-authored-by: Serhiy Storchaka <[email protected]>

vstinner · 2024-06-20T14:05:00Z

@serhiy-storchaka: I tried to address most of your review comments. Would you mind reviewing the updated PR?

Writing tests in C is really complicated. I think I will try to expose the PyUnicodeWriter C API in Python in a follow-up PR, so that tests can be written in Python. I wanted to do that from the beginning, but it was quicker to start with C. Now the C test suite of PyUnicodeWriter is already quite big!

@serhiy-storchaka:

There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?

Right, I prefer to leave it as it is for now and write a follow-up PR.

vstinner · 2024-06-21T17:33:47Z

Ok, I merged this PR as a starting point. I will rework tests in a follow-up PR.

Thanks @serhiy-storchaka and @malemburg for your reviews.

vstinner · 2024-06-21T17:50:13Z

I will rework tests in a follow-up PR.

Rewrite tests in Python: #120845

bedevere-appbot mentioned this pull request Jun 17, 2024
[C API] Add an efficient public PyUnicodeWriter API #119182
Closed

vstinner mentioned this pull request Jun 17, 2024
Add PyUnicodeWriter API capi-workgroup/decisions#27
Closed

gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful()
8aa73b7
Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.

vstinner force-pushed the WIP_unicode_writer_more branch from 7c4cc95 to 8aa73b7 June 17, 2024 15:56

vstinner changed the title ~~[WIP] gh-119182: Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful()~~ gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() Jun 17, 2024

vstinner marked this pull request as ready for review June 17, 2024 15:56

bedevere-appbot added the awaiting core review label Jun 17, 2024

doc: fix typo
788a85f

vstinner added the skip news label Jun 17, 2024

malemburg reviewed Jun 19, 2024

Optimize PyUnicodeWriter_WriteWideChar()
e67a8b4
Avoid a temporary Unicode object, write directly into the writer.

serhiy-storchaka reviewed Jun 19, 2024

Update Objects/unicodeobject.c
de56475

serhiy-storchaka self-requested a review June 19, 2024 14:55

Fix compiler warning
e48eec7

serhiy-storchaka reviewed Jun 20, 2024

Add unicode_write_widechar()
75fa8ba
Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().

serhiy-storchaka reviewed Jun 20, 2024

vstinner and others added 3 commits June 20, 2024 15:40

Update Doc/c-api/unicode.rst
3f284f8
Co-authored-by: Serhiy Storchaka <[email protected]>

Address Serhiy's review
1e018d2

Add more tests
6f29c53

vstinner merged commit 4123226 into python:main Jun 21, 2024

vstinner deleted the WIP_unicode_writer_more branch June 21, 2024 17:33

bedevere-appbot removed the awaiting core review label Jun 21, 2024
