Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8#120248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
vstinner commented Jun 7, 2024 • edited by github-actions bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by github-actions bot
Uh oh!
There was an error while loading. Please reload this page.
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
vstinner commented Jun 7, 2024
I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one? Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for @serhiy-storchaka@methane: Would you mind to review this change? |
vstinner commented Jun 7, 2024
|
vstinner commented Jun 7, 2024
|
serhiy-storchaka commented Jun 10, 2024
But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the PyUnicode_Format("%s", "\xe2\x82\xac") |
methane commented Jun 10, 2024
I prefer "strict" because "hard to notice" is also hard to debug. |
vstinner commented Jun 10, 2024
I would like to accept UTF-8 format string to make functions consistent: use UTF-8 basically everywhere. It's also to use the UTF-8 decoder (with |
vstinner commented Jun 10, 2024
vstinner commented Jun 11, 2024
@serhiy-storchaka@methane: Would you mind to review the updated PR?
I modified the change to use the strict error handler. I also modified the implementation to still raise UnicodeDecodeError: 'utf-8'codeccan'tdecodebyte0xffinposition21: invalidstartbyteDuringhandlingoftheaboveexception, anotherexceptionoccurred: Traceback (mostrecentcalllast): File"/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line391, intest_from_formatPyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc') ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File"/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line377, inPyUnicode_FromFormatreturn_PyUnicode_FromFormat(format, *cargs) ValueError: PyUnicode_FromFormatV() expectsavalidUTF-8-encodedformatstring, gotaninvalidUTF-8string |
Replace PyErr_Format() with PyErr_SetString()
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that this change is necessary, but I do not strongly oppose it.
Uh oh!
There was an error while loading. Please reload this page.
| if (unicode_decode_utf8_writer(&writer, f, len, | ||
| _Py_ERROR_STRICT, "strict") < 0){ | ||
| PyObject *exc = PyErr_GetRaisedException(); | ||
| PyErr_SetString(PyExc_ValueError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecode is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char * do not raise ValueError explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message helps debugging such issue: it points directly to the format string.
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Jun 11, 2024
Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and Benchmark: diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c index b139b46c826..4efef31ef4c 100644 --- a/Modules/_testcapimodule.c+++ b/Modules/_testcapimodule.c@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args)) Py_RETURN_NONE} +static PyObject *+bench(PyObject *Py_UNUSED(module), PyObject *args)+{+ const char *format;+ if (!PyArg_ParseTuple(args, "y", &format)){+ return NULL;+ }+++ PyObject *str = PyUnicode_FromFormat(format, 123);+ assert(str != NULL);+ Py_DECREF(str);++ Py_RETURN_NONE;+}+ static PyMethodDef TestMethods[] ={{"set_errno", set_errno, METH_VARARGS},{"test_config", test_config, METH_NOARGS}, @@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] ={{"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},{"test_weakref_capi", test_weakref_capi, METH_NOARGS},{"function_set_warning", function_set_warning, METH_NOARGS}, +{"bench", bench, METH_VARARGS},{NULL, NULL} /* sentinel */ }; Script: importpyperfimport_testcapirunner=pyperf.Runner() runner.bench_func('bench 3', _testcapi.bench, b'x'*3+b'%i') runner.bench_func('bench 30', _testcapi.bench, b'x'*30+b'%i') runner.bench_func('bench 100', _testcapi.bench, b'x'*100+b'%i')Result: |
vstinner commented Jun 11, 2024
@erlend-aasland@corona10: Do you have an opinion on this change? |
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Jun 20, 2024
Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I close this PR. |
serhiy-storchaka commented Jun 20, 2024
I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored. |
vstinner commented Jun 20, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Well, I'm not convinced myself anymore, so I prefer to abandon this PR. |
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.
Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/