Skip to content

Conversation

@vstinner
Copy link
Member

@vstinnervstinner commented Jun 7, 2024

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.

Remove unused 'consumed' parameter of unicode_decode_utf8_writer().


📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
@vstinner
Copy link
MemberAuthor

I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

@serhiy-storchaka@methane: Would you mind to review this change?

@vstinner
Copy link
MemberAuthor

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

PyUnicode_FromFormat() is strict for anything else:

  • width is too big
  • %c argument is out of the Unicode range
  • etc.

@vstinner
Copy link
MemberAuthor

PyUnicode_FromFormat() is used by PyErr_Format(), PyErr_FormatUnraisable(), and will be used by the incoming PyUnicodeWriter_Format().

@serhiy-storchaka
Copy link
Member

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

PyUnicode_Format("%s", "\xe2\x82\xac")

@methane
Copy link
Member

I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

I prefer "strict" because "hard to notice" is also hard to debug.

@vstinner
Copy link
MemberAuthor

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

I would like to accept UTF-8 format string to make functions consistent: use UTF-8 basically everywhere. It's also to use the UTF-8 decoder (with strchr('%') to get the string length) instead of parsing manually the string for check for non-ASCII characters.

@vstinner
Copy link
MemberAuthor

@methane:

I prefer "strict" because "hard to notice" is also hard to debug.

Ok, I created a dedicated PR for that: #120307.

@vstinner
Copy link
MemberAuthor

@serhiy-storchaka@methane: Would you mind to review the updated PR?

@methane:

I prefer "strict" because "hard to notice" is also hard to debug.

I modified the change to use the strict error handler.

I also modified the implementation to still raise ValueError if the format string is not a valid UTF-8 string, but chain the exception to the internal UnicodeDecodeError which contains details. Example:

UnicodeDecodeError: 'utf-8'codeccan'tdecodebyte0xffinposition21: invalidstartbyteDuringhandlingoftheaboveexception, anotherexceptionoccurred: Traceback (mostrecentcalllast): File"/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line391, intest_from_formatPyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc') ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File"/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line377, inPyUnicode_FromFormatreturn_PyUnicode_FromFormat(format, *cargs) ValueError: PyUnicode_FromFormatV() expectsavalidUTF-8-encodedformatstring, gotaninvalidUTF-8string

Replace PyErr_Format() with PyErr_SetString()
Copy link
Member

@serhiy-storchakaserhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this change is necessary, but I do not strongly oppose it.

if (unicode_decode_utf8_writer(&writer, f, len,
_Py_ERROR_STRICT, "strict") < 0){
PyObject *exc = PyErr_GetRaisedException();
PyErr_SetString(PyExc_ValueError,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecode is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char * do not raise ValueError explicitly.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message helps debugging such issue: it points directly to the format string.

@vstinner
Copy link
MemberAuthor

@serhiy-storchaka:

I do not think that this change is necessary, but I do not strongly oppose it.

Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and strchr(). It makes PyUnicode_FromFormat() between 1.08x (format string of 30 characters) and 1.21x faster (format string of 100 characters). The speedup should be even better with longer format string.

Benchmark:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c index b139b46c826..4efef31ef4c 100644 --- a/Modules/_testcapimodule.c+++ b/Modules/_testcapimodule.c@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args)) Py_RETURN_NONE} +static PyObject *+bench(PyObject *Py_UNUSED(module), PyObject *args)+{+ const char *format;+ if (!PyArg_ParseTuple(args, "y", &format)){+ return NULL;+ }+++ PyObject *str = PyUnicode_FromFormat(format, 123);+ assert(str != NULL);+ Py_DECREF(str);++ Py_RETURN_NONE;+}+ static PyMethodDef TestMethods[] ={{"set_errno", set_errno, METH_VARARGS},{"test_config", test_config, METH_NOARGS}, @@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] ={{"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},{"test_weakref_capi", test_weakref_capi, METH_NOARGS},{"function_set_warning", function_set_warning, METH_NOARGS}, +{"bench", bench, METH_VARARGS},{NULL, NULL} /* sentinel */ }; 

Script:

importpyperfimport_testcapirunner=pyperf.Runner() runner.bench_func('bench 3', _testcapi.bench, b'x'*3+b'%i') runner.bench_func('bench 30', _testcapi.bench, b'x'*30+b'%i') runner.bench_func('bench 100', _testcapi.bench, b'x'*100+b'%i')

Result:

+----------------+--------+----------------------+ | Benchmark | ref | change | +================+========+======================+ | bench 30 | 215 ns | 200 ns: 1.08x faster | +----------------+--------+----------------------+ | bench 100 | 252 ns | 208 ns: 1.21x faster | +----------------+--------+----------------------+ | Geometric mean | (ref) | 1.09x faster | +----------------+--------+----------------------+ Benchmark hidden because not significant (1): bench 3 

@vstinner
Copy link
MemberAuthor

@erlend-aasland@corona10: Do you have an opinion on this change?

@vstinnervstinner changed the title gh-119182: Decode PyUnicode_FromFormat() format from UTF-8gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8Jun 11, 2024
@vstinner
Copy link
MemberAuthor

Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I close this PR.

@vstinnervstinner deleted the format_utf8 branch June 20, 2024 12:51
@serhiy-storchaka
Copy link
Member

I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored.

@vstinner
Copy link
MemberAuthor

vstinner commented Jun 20, 2024

Well, I'm not convinced myself anymore, so I prefer to abandon this PR.

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

@vstinner@serhiy-storchaka@methane