Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-111089: Add cache to PyUnicode_AsUTF8() for embedded NUL#111587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
vstinner commented Nov 1, 2023 • edited by bedevere-app bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by bedevere-app bot
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Nov 1, 2023
Since 2008,
Embedded null characters can lead to bugs, or worse, security vulnerabilities. I just modified PyUnicode_AsUTF8() in Python 3.13 to raise a ValueError if the string contains a null character: PR #111091. This change made the function safer and so I added it to the limited C API: PR #111121. Problem: @encukouhas concerns about PyUnicode_AsUTF8() performance since strlen() has a O(n) complexity. I propose adding a 2-bit cache to PyASCIIObject to still keep PyUnicode_AsUTF8() safe and avoid whenever possible to call strlen() to check if a string contains null characters. With proposed change, PyUnicode_AsUTF8() doesn't have to call strlen() on strings created by PyUnicode_FromString(). Calling PyUnicode_FromString() is the most common way to create a Python str object in the Python C API. Moreover, in general, at maximum, PyUnicode_AsUTF8() only has to call strlen() once per Python str object, since the result is then stored in the cache. The change doesn't change IMO converting a Python str object to a C string as The Python C API has many other APIs to converting a Python str object to a C string, but usually, with a less convenient API, like return a bytes object which requires to get the pointer to the string, and get the size separately. In C, C strings terminated by a NUL byte is the most commonly used format for strings. The change initializes the cache for global static strings single as string singletons and the internal |
vstinner commented Nov 1, 2023
vstinner commented Nov 1, 2023
If the UTF-8 encoded string is under 100 bytes, |
serhiy-storchaka commented Nov 1, 2023
I considered adding similar 2-bit field for embedded lone surrogates. UTF-16 and UTF-32 encoder can be faster if we know that the string do not contain them. It can extend the ASCII bit.
But I was not sure that it would justify the cost of additional complexity. |
vstinner commented Nov 1, 2023
I would prefer to not reuse the ASCII member to avoid any backward compatibility issue. If tomorrow, we are short in number of free bits, we can revisit the implementation, but it's not needed yet. embed_null=3 is invalid: _PyUnicode_CheckConsistency() fails in this case. It's a wasted "bit" but IMO it makes the API easier to use. I think that it's a good idea to add a cache for lone surrogates, it will be helpful on Windows which converts often to UTF-16. As I discovered while implementing this PR, sometimes you can fill the cache without having to compute anything. For example, PyUnicode_FromString() cannot produce lone surrogate: you can init the cache for free. |
vstinner commented Nov 1, 2023
@serhiy-storchaka: So, do you think that avoid strlen() at each PyUnicode_AsUTF8() is worth it? |
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
serhiy-storchaka commented Nov 1, 2023
I am not sure. Every use of the But it increases the implementation complexity and makes the code more fragile. What happen when you modify the string in-place? For example by calling This flag could also be used in |
vstinner commented Nov 1, 2023
Only the first call. Then it's a O(1) operation when strlen() is omitted, no? |
serhiy-storchaka commented Nov 1, 2023
What do you do with the |
vstinner commented Nov 1, 2023
Currently, cached PyUnicode_AsUTF8() has a complexity of O(n). With this change, cached PyUnicode_AsUTF8() has a complexity of O(1). On a 1 GiB string, it takes 3 ns instead of 64 ms (23,787,089x faster ;-)). Currently, (cached) PyUnicode_AsUTF8() performance has a complexity of O(n). I wrote a benchmark and in short, it's a benchmark on strlen(str) since the UTF-8 encoder is cached. Benchmark on cached PyUnicode_AsUTF8() comparing this PR to the main branch: The benchmark warmup fills the PyUnicode_AsUTF8() cache, so the benchmark is on cached PyUnicode_AsUTF8(). Before, on a string of 1 GiB, strlen() took 64 ms. With this change, cached PyUnicode_AsUTF8() always take 3 ns, it has a complexity of O(1): it doesn't depending on the string length. bench.py: importpyperffrom_testinternalcapiimportbench_asutf8importfunctoolsrunner=pyperf.Runner() fortext, sizein ( ('0', 0), ('1', 1), ('10', 10), ('10^3', 10**3), ('10^6', 10**6), ('1 MiB', 1024**2), ('1 GiB', 1024**3), ): runner.bench_time_func(f'asutf8 {text}', functools.partial(bench_asutf8, 'x'*size))Patch to add the benchmark function: diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c index a71e7e1dcc..6ccc89f28a 100644 --- a/Modules/_testinternalcapi.c+++ b/Modules/_testinternalcapi.c@@ -1639,6 +1639,28 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args) } +static PyObject *+bench_asutf8(PyObject *self, PyObject *args)+{+ PyObject *str;+ Py_ssize_t loops;+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)){+ return NULL;+ }++ _PyTime_t t = _PyTime_GetPerfCounter();+ for (Py_ssize_t i=0; i < loops; i++){+ if (PyUnicode_AsUTF8(str) == NULL){+ return NULL;+ }+ }++ _PyTime_t dt = _PyTime_GetPerfCounter() - t;++ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));+}++ static PyMethodDef module_functions[] ={{"get_configs", get_configs, METH_NOARGS},{"get_recursion_depth", get_recursion_depth, METH_NOARGS}, @@ -1701,6 +1723,7 @@ static PyMethodDef module_functions[] ={{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS}, _TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF _TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF +{"bench_asutf8", bench_asutf8, METH_VARARGS},{NULL, NULL} /* sentinel */ }; |
vstinner commented Nov 1, 2023
Would you mind to elaborate how/when the second (cached) call to PyUnicode_AsUTF8() on the same Python str object has a complexity of O(n) with my change? I'm not sure that we are talking about the same thing. |
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes *embed_null* since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.Fix unicode_subtype_new
830ffd5 to e224751Comparevstinner commented Nov 1, 2023
I updated my PR to clear the cache when unicode_resize() is called. So indirectly, PyUnicode_Append() now clears the cache as well. I added an assertion to make sure that it's the case. In general, Python has low or no guarantee when a Python str object is modified. Functions to modify a string just check unicode_modifiable() which only has basic checks. For example, functions like PyUnicode_Fill() and PyUnicode_CopyCharacters() don't check if the UTF-8 encoded string cache is already filled. Only PyUnicode_Resize() clears the UTF-8 cache. In general, The PyUnicode API makes the assumption that these functions are only used to create a string, but then a string is no longer modified. (I would prefer to remove all C API which allows to modify a string, but that would be a backward incompatible change and require more work.)
I modified PyUnicode_AsWideCharString() in the first version of my change, but I reverted my change before creating this PR because of the Solaris code path (HAVE_NON_UNICODE_WCHAR_T_REPRESENTATION). Maybe I can propose a follow-up PR for PyUnicode_AsWideCharString() once this PR is merged. |
Suggestion by Serhiy
vstinner commented Nov 1, 2023
@serhiy-storchaka: I addressed your suggestions. Would you mind to review the updated PR? |
vstinner commented Nov 1, 2023
The benefits of the cache is minor (1 ns) for strings about 10 characters. It starts to become interesting for strings of at least 1,000 characters. |
serhiy-storchaka commented Nov 2, 2023
Do you have any microbenchmarks that do not call BTW, I just removed 2 calls of |
vstinner commented Nov 2, 2023
This is nice, but how is it related to this change? There are usecases where you cannot avoid PyUnicode_AsUTF8(). Moreover, even when it's possible, it's not always trivial to avoid encode+decode roundtrip, especially if you are calling 3rd party code. |
vstinner commented Nov 2, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
UPDATE: Oops, I made a mistake the first time that I ran my benchmark. I forgot to rebuild Python, and so I measured twice the same binary. I reran the benchmark, and now I see a more significant difference with and without the change! Different benchmark on In short,
Result: the embed_null cache becomes very interesting starting at 10^6 characters (1.04x faster: 561 us = >542 us: saves 19 us). But the effect is visible starting at 0 characters (397 ns => 389 ns: save 8 ns) :-) IMO this benchmark is closer to a "real-world" benchmark that my first micro-benchmark on PyUnicode_AsUTF8() which was focused on the worst case. Benchmark ran with CPU isolation on Linux. Patch: diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c index a71e7e1dcc..692d008cdf 100644 --- a/Modules/_testinternalcapi.c+++ b/Modules/_testinternalcapi.c@@ -1639,6 +1639,44 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args) } +static PyObject *+bench_asutf8(PyObject *self, PyObject *args)+{+ PyObject *str;+ Py_ssize_t loops;+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)){+ return NULL;+ }++ PyObject *codecs = PyImport_ImportModule("_codecs");+ if (codecs == NULL){+ return NULL;+ }+ PyObject *lookup_error = PyObject_GetAttrString(codecs, "lookup_error");+ Py_DECREF(codecs);+ if (lookup_error == NULL){+ return NULL;+ }++ _PyTime_t t = _PyTime_GetPerfCounter();+ for (Py_ssize_t i=0; i < loops; i++){+ PyObject *error = PyObject_CallOneArg(lookup_error, str);+ if (error != NULL){+ Py_DECREF(error);+ }+ else{+ PyErr_Clear();+ }+ }++ _PyTime_t dt = _PyTime_GetPerfCounter() - t;++ Py_DECREF(lookup_error);++ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));+}++ static PyMethodDef module_functions[] ={{"get_configs", get_configs, METH_NOARGS},{"get_recursion_depth", get_recursion_depth, METH_NOARGS}, @@ -1701,6 +1739,7 @@ static PyMethodDef module_functions[] ={{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS}, _TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF _TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF +{"bench_asutf8", bench_asutf8, METH_VARARGS},{NULL, NULL} /* sentinel */ }; Script: importpyperffrom_testinternalcapiimportbench_asutf8importfunctoolsrunner=pyperf.Runner() fortext, sizein ( ('0', 0), ('1', 1), ('10', 10), ('10^3', 10**3), ('10^6', 10**6), ('1 MiB', 1024**2), ('1 GiB', 1024**3), ): runner.bench_time_func(f'_codecs.lookup_error size={text}', functools.partial(bench_asutf8, 'x'*size)) |
vstinner commented Nov 2, 2023
I reran PyUnicode_AsUTF8() micro-benchmark on the same machine but with CPU isolation: Aha, I still see the major difference without/with the cache for 1 GiB string: 76.2 ms becomes 5.36 ns (14,220,978x faster). |
vstinner commented Nov 2, 2023
When PyUnicode_AsUTF8() doesn't need to encode the string to UTF-8, since data is already available (all ASCII strings!), and so "is O(1)", it's silly that PyUnicode_AsUTF8() wastes CPU cycles in calling strlen() and so becomes O(n) again in practice, no? |
vstinner commented Nov 2, 2023
I'm sorry, I made a mistake when running my benchmark on Reminder: I made PyUnicode_AsUTF8() slower in the main branch by adding strlen(). In Python 3.12, strlen() is not called: the function doesn't reject null characters. This change just reduces the slowdown compared to Python 3.12. |
methane commented Nov 2, 2023
When user get UTF-8 C string from Is there any valid O(1) use case of |
encukou commented Nov 2, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
The elephant in the room: why isn't this deprecated with a deprecation warning? edit:
I don't know. Is |
serhiy-storchaka commented Nov 2, 2023
Inada-san understand me.
It was made for correctness. If the
Even for ASCII strings you need O(n) to process them.
It shows that in the best possible example the difference is only few percents. Unlikely you can get better results with other real-world example. And megabyte-long error handler name is not very realistic.
Well, it saves 8 ns on every call of |
vstinner commented Nov 2, 2023
Sure, I understand that sometimes, the code processing PyUnicode_AsUTF8() result is way slower than strlen() and so the cache is not useful. There are various way to "use a string". Example from ctypes: switch (PyUnicode_AsUTF8(stgd->proto)[0]){case'z': /* c_char_p */case'Z': /* c_wchar_p */and: fmt=PyUnicode_AsUTF8(dict->proto); fd=_ctypes_get_fielddesc(fmt); structfielddesc*_ctypes_get_fielddesc(constchar*fmt){... for (; table->code; ++table){if (table->code==fmt[0]) returntable} ... }A common pattern is to compare PyUnicode_AsUTF8() result using strcmp() to multiple strings. Example of _elementtree: constchar*event_name=NULL; if (PyUnicode_Check(event_name_obj)){event_name=PyUnicode_AsUTF8(event_name_obj)} elseif (PyBytes_Check(event_name_obj)){event_name=PyBytes_AS_STRING(event_name_obj)} ... if (strcmp(event_name, "start") ==0){... } elseif (strcmp(event_name, "end") ==0){... } elseif (strcmp(event_name, "start-ns") ==0){... } elseif (strcmp(event_name, "end-ns") ==0){... ...Another benchmark on strcmp() using _Py_GetErrorHandler(). Result: Avoid strlen() in PyUnicode_AsUTF8() with cache of this PR saves 1.2 ns: 1.11x faster. _Py_GetErrorHandler(): _Py_error_handler_Py_GetErrorHandler(constchar*errors){if (errors==NULL||strcmp(errors, "strict") ==0){return_Py_ERROR_STRICT} if (strcmp(errors, "surrogateescape") ==0){return_Py_ERROR_SURROGATEESCAPE} if (strcmp(errors, "replace") ==0){return_Py_ERROR_REPLACE} ... }Patch: diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c index a71e7e1dcc..27a9004a5c 100644 --- a/Modules/_testinternalcapi.c+++ b/Modules/_testinternalcapi.c@@ -1639,6 +1639,40 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args) } +static PyObject *+bench_asutf8(PyObject *self, PyObject *args)+{+ PyObject *str;+ Py_ssize_t loops;+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)){+ return NULL;+ }+ Py_ssize_t found = 0;++ _PyTime_t t = _PyTime_GetPerfCounter();+ for (Py_ssize_t i=0; i < loops; i++){+ const char *utf8 = PyUnicode_AsUTF8(str);+ if (utf8 == NULL){+ return NULL;+ }+ _Py_error_handler error_handler = _Py_GetErrorHandler(utf8);+ if (error_handler == _Py_ERROR_STRICT){+ found++;+ }+ }++ _PyTime_t dt = _PyTime_GetPerfCounter() - t;++ // Cannot happen, just to make sure that the compiler don't remove+ //_Py_GetErrorHandler() call.+ if (found > loops){+ PyErr_NoMemory();+ }++ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));+}++ static PyMethodDef module_functions[] ={{"get_configs", get_configs, METH_NOARGS},{"get_recursion_depth", get_recursion_depth, METH_NOARGS}, @@ -1701,6 +1735,7 @@ static PyMethodDef module_functions[] ={{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS}, _TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF _TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF +{"bench_asutf8", bench_asutf8, METH_VARARGS},{NULL, NULL} /* sentinel */ };Script: importpyperffrom_testinternalcapiimportbench_asutf8importfunctoolsrunner=pyperf.Runner() runner.bench_time_func(f'asutf8', functools.partial(bench_asutf8, 'strict')) |
vstinner commented Nov 2, 2023
I created this PR to reply to @encukou's concern about performance overhead of adding strlen() to PyUnicode_AsUTF8(): #111089 If we consider that the overhead is not significant and this cache has too many drawbacks compared to advantages, I'm fine to not add it. |
vstinner commented Nov 2, 2023
You should maybe ask to the author who added the comment. Maybe the issue was not important enough to "implement" a deprecation. But well, PyUnicode_AsUTF8() now checks for embedded null characters, and so it's no longer needed to deprecate it. It seems like the majority of code dev who reviewed my recent changes about PyUnicode_AsUTF8() are fine with the new behavior. It also allowed to use PyUnicode_AsUTF8() in places where PyUnicode_AsUTF8AndSize() was used before to check for embedded null characters. |
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve because it looks technically correct to me.
I don't know if these changes should be made. The benefit is small in most real examples.
Yhg1s commented Nov 2, 2023
I don't know why I have to point this out, but the cache only amortises the cost when repeatedly calling |
vstinner commented Nov 2, 2023
Strings created by PyUnicode_FromString() and "static strings" are created with the cache initialized and so PyUnicode_AsUUTF8() never call strlen() on these strings. As I wrote before, PyUnicode_FromString() is one of the most common way to create a Python str object in the C API. |
* unicode_char() * PyUnicode_FromWideChar(str, -1) * _PyUnicode_Copy()
vstinner commented Nov 2, 2023
I tried to write the smallest PR just to add the member, but there is room for enhancements: set embed_null cache on newly created strings in more cases, with no performance overhead (without having to call strlen()). In short, all operations modifying an existing string can set embed_null if embed_null of the modified string is known (is 0 or 1). Tell me if you prefer that I make this PR more complete and attempt to set embed_null in every single case. |
| memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode), | ||
| length*PyUnicode_KIND(unicode)); | ||
| _PyUnicode_STATE(copy).embed_null=_PyUnicode_STATE(unicode).embed_null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_PyUnicode_Copy() makes a modifiable unicode object. It is legal to embed a null character in it after creation or replace an embeded null character with non-null character. In particular, it creates a new copy even from Latin1 character singletons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, technically, any string can be modified anytime by the C API. People "should not do that", but since it's possible, I'm not sure if it's safe to make the assumption that people will not mutate a string long after its creation: after the cache is initialized.
vstinner commented Nov 3, 2023
The trend of #111089 is more about reverting While the design of this cache is appealing to me, it seems like there are multiple issues:
If tomorrow, the public C API no longer allows to mutate a string, we can reconsider the cache idea. But for now, it seems like it's safer to not take the risk of adding a cache which can introduce bugs. Correctness matters more than performance here. I close my PR. Thanks everybody for looking into my cache ;-) It was a constructive discussion. |
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes embed_null since the string cannot contain a null character.
Global static strings now also initialize the embed_null member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.