gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248

vstinner · 2024-06-07T20:40:05Z

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.

Remove unused 'consumed' parameter of unicode_decode_utf8_writer().

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/

vstinner · 2024-06-07T20:42:34Z

I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error come from the format string or an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.
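The trade-off between the two error handlers can be illustrated at the Python level with `bytes.decode()`. This is only an analogy for the C decoder, and the byte string below is made up for the example.

```python
# Illustrative input: a format-like byte string with one invalid UTF-8 byte (0xff).
data = b"invalid \xff format"

# "replace" never fails: the bad byte silently becomes U+FFFD.
replaced = data.decode("utf-8", errors="replace")
print(repr(replaced))

# "strict" raises, and the exception records where the bad byte is.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_error = exc
    print(exc.start, exc.reason)
```

With "replace" the caller gets a string back and has to spot the replacement character; with "strict" the failure is immediate but the caller has to work out which input caused it.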

@serhiy-storchaka @methane: Would you mind reviewing this change?

vstinner · 2024-06-07T20:47:08Z

> Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.

PyUnicode_FromFormat() is strict for anything else:

width is too big
%c argument is out of the Unicode range
etc.

vstinner · 2024-06-07T20:55:12Z

PyUnicode_FromFormat() is used by PyErr_Format(), PyErr_FormatUnraisable(), and will be used by the upcoming PyUnicodeWriter_Format().

serhiy-storchaka · 2024-06-10T05:40:30Z

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

PyUnicode_FromFormat("%s", "\xe2\x82\xac")

methane · 2024-06-10T07:07:15Z

> I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error come from the format string or an argument? If it's an argument, which one?
> Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.

I prefer "strict" because "hard to notice" is also hard to debug.

vstinner · 2024-06-10T08:25:47Z

> But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

I would like to accept UTF-8 format strings to make the functions consistent: use UTF-8 basically everywhere. It also allows using the UTF-8 decoder (with strchr('%') to get the string length) instead of manually parsing the string to check for non-ASCII characters.
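The idea of locating the next '%' and handing the whole literal run to the decoder can be sketched in Python. This is a simplified model, not the C implementation: `split_literal_runs` is a hypothetical name, `bytes.find()` plays the role of strchr(), and every conversion is assumed to be '%' followed by exactly one byte.

```python
def split_literal_runs(fmt: bytes):
    """Yield (kind, chunk) pairs; the name and return shape are hypothetical."""
    pos = 0
    while pos < len(fmt):
        if fmt[pos:pos + 1] == b"%":
            # Simplified: treat '%' plus exactly one following byte as a spec.
            yield ("spec", fmt[pos:pos + 2])
            pos += 2
            continue
        nxt = fmt.find(b"%", pos)  # plays the role of strchr(p, '%')
        end = len(fmt) if nxt == -1 else nxt
        # Decode the whole literal run in one call instead of byte by byte.
        yield ("text", fmt[pos:end].decode("utf-8"))
        pos = end


parts = list(split_literal_runs("prix: %i €".encode("utf-8")))
print(parts)
```

The point of the design is that the per-byte loop only has to look for '%'; validating and converting the literal text is delegated to a decoder that is already optimized.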

vstinner · 2024-06-10T09:08:52Z

@methane:

> I prefer "strict" because "hard to notice" is also hard to debug.

Ok, I created a dedicated PR for that: #120307.

vstinner · 2024-06-11T10:52:36Z

@serhiy-storchaka @methane: Would you mind reviewing the updated PR?

@methane:

> I prefer "strict" because "hard to notice" is also hard to debug.

I modified the change to use the strict error handler.

I also modified the implementation to still raise ValueError if the format string is not a valid UTF-8 string, but chain the exception to the internal UnicodeDecodeError which contains details. Example:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 21: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 391, in test_from_format
    PyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc')
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 377, in PyUnicode_FromFormat
    return _PyUnicode_FromFormat(format, *cargs)
ValueError: PyUnicode_FromFormatV() expects a valid UTF-8-encoded format string, got an invalid UTF-8 string
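The same chained shape can be reproduced in pure Python, which may help when reading the traceback. This is a sketch of the observable behaviour only: `decode_format` is a hypothetical helper, the error message is shortened, and the C code in the PR does not necessarily work this way internally.

```python
def decode_format(fmt: bytes) -> str:
    """Hypothetical helper mimicking the error behaviour described above."""
    try:
        return fmt.decode("utf-8")
    except UnicodeDecodeError:
        # Raising inside the except block sets __context__ on the new
        # exception, which produces the "During handling of the above
        # exception, another exception occurred" traceback.
        raise ValueError("expected a valid UTF-8-encoded format string")


try:
    decode_format(b"invalid format string\xff: %s")
except ValueError as exc:
    caught = exc
    print(type(exc.__context__).__name__, exc.__context__.start)
```

The outer ValueError says which input was wrong (the format string), while the chained UnicodeDecodeError keeps the byte offset and reason.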

Replace PyErr_Format() with PyErr_SetString()

serhiy-storchaka

I do not think that this change is necessary, but I do not strongly oppose it.

Doc/c-api/unicode.rst

serhiy-storchaka · 2024-06-11T11:28:22Z

Objects/unicodeobject.c

+    if (unicode_decode_utf8_writer(&writer, f, len,
+                                   _Py_ERROR_STRICT, "strict") < 0) {
+        PyObject *exc = PyErr_GetRaisedException();
+        PyErr_SetString(PyExc_ValueError,

Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecodeError is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char * do not raise ValueError explicitly.

The error message helps debug such issues: it points directly to the format string.
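The subclass relationship raised in the review is easy to check from Python. The sketch below uses a hypothetical caller, `try_decode`, that only catches ValueError, to show why letting the UnicodeDecodeError propagate would stay compatible with such code.

```python
# UnicodeDecodeError derives from ValueError (via UnicodeError).
print(UnicodeDecodeError.__mro__[1:3])


def try_decode(data: bytes) -> str:
    """Hypothetical caller that only knows about ValueError."""
    try:
        return data.decode("ascii")
    except ValueError as exc:
        return f"rejected: {type(exc).__name__}"


result = try_decode(b"caf\xc3\xa9")
print(result)
```

What the explicit ValueError adds is not catchability but a message that names the format string as the culprit.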

Lib/test/test_capi/test_unicode.py

vstinner · 2024-06-11T12:04:40Z

@serhiy-storchaka:

> I do not think that this change is necessary, but I do not strongly oppose it.

Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and strchr(). It makes PyUnicode_FromFormat() between 1.08x faster (format string of 30 characters) and 1.21x faster (format string of 100 characters). The speedup should be even better with longer format strings.

Benchmark:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b139b46c826..4efef31ef4c 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
     Py_RETURN_NONE;
 }
 
+static PyObject *
+bench(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    const char *format;
+    if (!PyArg_ParseTuple(args, "y", &format)) {
+        return NULL;
+    }
+
+
+    PyObject *str = PyUnicode_FromFormat(format, 123);
+    assert(str != NULL);
+    Py_DECREF(str);
+
+    Py_RETURN_NONE;
+}
+
 static PyMethodDef TestMethods[] = {
     {"set_errno", set_errno, METH_VARARGS},
     {"test_config", test_config, METH_NOARGS},
@@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] = {
     {"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
     {"test_weakref_capi", test_weakref_capi, METH_NOARGS},
     {"function_set_warning", function_set_warning, METH_NOARGS},
+    {"bench", bench, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };

Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_func('bench 3', _testcapi.bench, b'x'*3+b'%i')
runner.bench_func('bench 30', _testcapi.bench, b'x'*30+b'%i')
runner.bench_func('bench 100', _testcapi.bench, b'x'*100+b'%i')

Result:

+----------------+--------+----------------------+
| Benchmark      | ref    | change               |
+================+========+======================+
| bench 30       | 215 ns | 200 ns: 1.08x faster |
+----------------+--------+----------------------+
| bench 100      | 252 ns | 208 ns: 1.21x faster |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.09x faster         |
+----------------+--------+----------------------+

Benchmark hidden because not significant (1): bench 3
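As a quick sanity check on the table, the speedup column can be recomputed as the reference time divided by the changed time. This sketch uses the rounded timings printed in the table, so the last digit may differ slightly from what pyperf computed from the unrounded measurements.

```python
# Timings in nanoseconds, copied from the result table: (ref, change).
timings = {"bench 30": (215, 200), "bench 100": (252, 208)}

# Speedup = reference time / changed time.
speedups = {name: ref / change for name, (ref, change) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.3f}x")
```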

vstinner · 2024-06-11T12:10:57Z

@erlend-aasland @corona10: Do you have an opinion on this change?

Objects/unicodeobject.c

vstinner · 2024-06-20T12:51:33Z

Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I'm closing this PR.
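A rough Python model of the ASCII-only alternative is sketched below. It is an assumption-laden illustration rather than the code from gh-120796: `check_ascii_literals` is a hypothetical name, `bytes.find()` stands in for strchr(), `max()` stands in for ucs1lib_find_max_char(), the error message is invented, and each conversion is assumed to be '%' plus one byte.

```python
def check_ascii_literals(fmt: bytes) -> None:
    """Hypothetical sketch: reject non-ASCII bytes in literal runs of fmt."""
    pos = 0
    while pos < len(fmt):
        nxt = fmt.find(b"%", pos)  # stands in for strchr(p, '%')
        end = len(fmt) if nxt == -1 else nxt
        run = fmt[pos:end]
        # One max() over the run stands in for ucs1lib_find_max_char().
        if run and max(run) > 0x7F:
            raise ValueError("format string must be ASCII (illustrative message)")
        # Skip '%' and, in this simplified sketch, a one-byte conversion.
        pos = end + 2


check_ascii_literals(b"x" * 100 + b"%i")  # accepted: all literal bytes are ASCII
```

As with the UTF-8 variant, the literal text is validated one run at a time instead of one byte at a time, which is where the similar speedup would come from.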

serhiy-storchaka · 2024-06-20T13:28:58Z

I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored.

vstinner · 2024-06-20T19:04:03Z

Well, I'm not convinced myself anymore, so I prefer to abandon this PR.

pythongh-119182: Decode PyUnicode_FromFormat() format from UTF-8
3d5bca4
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().

bedevere-appbot mentioned this pull request Jun 7, 2024
[C API] Add an efficient public PyUnicodeWriter API #119182
Closed

bedevere-appbot added the awaiting core review label Jun 7, 2024

Update test_exceptions
6a87915

vstinner mentioned this pull request Jun 10, 2024
gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307
Closed

Use strict error handler
e830944

Fix error handling
242e6cb
Replace PyErr_Format() with PyErr_SetString()

serhiy-storchaka reviewed Jun 11, 2024

vstinner added 2 commits June 11, 2024 14:08

Add tests on truncated UTF-8 format strings
d04269f

Don't mention the strict error handler
94da5e7

vstinner changed the title ~~gh-119182: Decode PyUnicode_FromFormat() format from UTF-8~~ gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 Jun 11, 2024

serhiy-storchaka reviewed Jun 11, 2024

Objects/unicodeobject.c

Revert consumed parameter
89fd69a

vstinner closed this Jun 20, 2024

vstinner deleted the format_utf8 branch June 20, 2024 12:51

vstinner mentioned this pull request Jun 20, 2024
gh-119182: Optimize PyUnicode_FromFormat() #120796
Merged
