gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL #111091

vstinner · 2023-10-20T00:02:31Z

PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
PyUnicode_AsUTF8AndSize() now sets *size to 0 on error to avoid undefined variable value.
Update related C API tests (test_capi.test_unicode).
type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
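A minimal stand-alone sketch of the kind of check described above, assuming the UTF-8 buffer is stored with a known size and a terminating NUL. It does not include the CPython headers; `utf8_as_c_string` is a hypothetical name used only for illustration, not the code in Objects/unicodeobject.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the check: `buf` holds `size` bytes of UTF-8
 * followed by a terminating NUL. Returns `buf` when it can safely be used
 * as a C string, or NULL when a NUL byte appears before the end. */
const char *
utf8_as_c_string(const char *buf, size_t size)
{
    assert(buf != NULL);
    /* strlen() stops at the first NUL, so a mismatch with the stored
     * size means the data contains an embedded NUL. */
    if (strlen(buf) != size) {
        return NULL;  /* the real function would also set an exception */
    }
    return buf;
}
```

In this model, a caller that receives NULL knows the data cannot be passed to an API expecting a null-terminated string without losing everything after the first NUL.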

Issue: [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

📚 Documentation preview 📚: https://cpython-previews--111091.org.readthedocs.build/

vstinner · 2023-10-20T00:03:54Z

Objects/unicodeobject.c

serhiy-storchaka

I considered this idea when I added such checks to other functions, but at that time PyUnicode_AsUTF8() was more of a simple accessor to internal data.

Most sites where it matters were changed to use PyUnicode_AsUTF8AndSize() + strlen(). Now some of them can be changed back to use PyUnicode_AsUTF8().

Are you considering including PyUnicode_AsUTF8() in the limited C API?

serhiy-storchaka · 2023-10-20T06:24:19Z

Include/cpython/unicodeobject.h

- extracted from the returned data.
-*/
-
+// Returns a pointer to the default encoding (UTF-8) of the


Was it necessary to change the type of the comment? Most comments here are /* */.
It makes reviewing the changes more difficult.

It's not necessary, but I take this PR as an opportunity to change the comment style. IMO multi-line comments written with // comment syntax are way easier to read.

Objects/unicodeobject.c

vstinner · 2023-10-20T07:01:15Z

Are you considering including PyUnicode_AsUTF8() in the limited C API?

It's my main motivation for this change. It's the most natural way to get a char*. But I prefer to discuss it separately.

* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.

vstinner · 2023-10-20T10:07:49Z

Most sites where it matters were changed to use PyUnicode_AsUTF8AndSize() + strlen(). Now some of them can be changed back to use PyUnicode_AsUTF8().

Once this change is merged, we can review PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize() usage. But I chose to make this PR as small as possible to ease the review.

Objects/unicodeobject.c

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

vstinner · 2023-10-20T15:12:16Z

Oh. It seems like I removed the cast fix in my latest change. Thanks @serhiy-storchaka, I applied your suggestion.

serhiy-storchaka · 2023-10-20T15:27:31Z

But why set size to 0, and not to -1? With 0 there is a larger chance of missing the error. For example, PyUnicode_FromStringAndSize(NULL, 0) and PyBytes_FromStringAndSize(NULL, 0) return an empty string and an empty bytes object, but if the size is negative, they raise an exception.
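The concern can be illustrated with a small stand-alone model, assuming a caller that forgets to check the returned pointer and forwards the (pointer, size) pair to a size-taking constructor. All function names here are hypothetical; the constructor only mimics the "negative size is an error" rule mentioned in the comment and is not the CPython implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a "FromStringAndSize"-style constructor: returns
 * the length of the object it would build, or -1 to signal an error. */
ptrdiff_t
model_from_string_and_size(const char *data, ptrdiff_t size)
{
    (void)data;             /* NULL data with size 0 is accepted */
    if (size < 0) {
        return -1;          /* negative size: rejected */
    }
    return size;
}

/* Hypothetical model of a failing "AsUTF8AndSize"-style call: returns NULL
 * and stores `error_size` in *size so both conventions can be compared. */
const char *
model_failing_as_utf8_and_size(ptrdiff_t *size, ptrdiff_t error_size)
{
    assert(size != NULL);
    *size = error_size;
    return NULL;
}

/* A careless caller that ignores the NULL result and forwards the pair. */
ptrdiff_t
careless_roundtrip(ptrdiff_t error_size)
{
    ptrdiff_t size;
    const char *data = model_failing_as_utf8_and_size(&size, error_size);
    return model_from_string_and_size(data, size);
}
```

Under these assumptions, `careless_roundtrip(0)` quietly produces an empty result, while `careless_roundtrip(-1)` propagates the failure.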

vstinner · 2023-10-20T15:30:44Z

But why set size to 0, and not to -1?

Is this comment related to my PyUnicode_AsUTF8AndSize change, PR #111106?

serhiy-storchaka · 2023-10-20T15:40:05Z

Yes.

vstinner · 2023-10-20T15:59:39Z

Merged, thanks for the review @serhiy-storchaka.

Yhg1s · 2023-11-02T15:19:30Z

Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code?

Why is recommending a replacement of PyUnicode_AsUTF8AndSize(s, NULL) any better than leaving PyUnicode_AsUTF8(s) as-is? Considering the two have been documented and recommended as alternatives for so long, is it a good idea to break the equivalence? How should users know that one is "safe" and the other is not?

I use PyUnicode_AsUTF8() in performance-critical situations, and the extra check is wholly unnecessary for my uses. I'm also concerned about the change of semantics. Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().
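The failure scenario described here can be sketched without the CPython headers. The model below assumes a fixed 16-byte buffer holding "abc" and compares the two lengths a caller might pass when creating the string; `has_embedded_nul` and `padded_buffer_has_nul` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if any of the first `size` bytes of `buf` is NUL. */
bool
has_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}

/* Fill a fixed 16-byte buffer with "abc" and report whether a string
 * created from it would contain NUL bytes, depending on the length used. */
bool
padded_buffer_has_nul(bool use_buffer_size)
{
    char buf[16] = "abc";   /* the remaining 13 bytes are zero */
    size_t size = use_buffer_size ? sizeof(buf) : strlen(buf);
    return has_embedded_nul(buf, size);
}
```

With the buffer size, the resulting string carries trailing NUL bytes, so a later fetch that rejects embedded NULs would fail. With the string length it would not.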

vstinner · 2023-11-02T15:22:51Z

I use PyUnicode_AsUTF8() in performance-critical situations, and the extra check is wholly unnecessary for my uses.

I wrote PR #111587 to avoid calling strlen() in most cases.

Yhg1s · 2023-11-02T15:52:49Z

I wrote PR #111587 to avoid calling strlen() in most cases.

Caching the result does not affect my use-cases.

vstinner · 2023-11-02T16:36:43Z

Caching the result does not affect my use-cases.

Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache?

vstinner · 2023-11-02T17:11:58Z

Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code?

Sorry, I assumed that the background of this change was common knowledge. I wrote issue #111656 to describe the issue.

vstinner · 2023-11-02T17:14:49Z

Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().

If the string does not contain an embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use PyUnicode_AsUTF8AndSize(), PyUnicode_AsUTF8String() or any other existing API which returns the size of the UTF-8 encoded string (and doesn't raise an exception in case of an embedded null character).
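As a rough illustration of why a size-returning API is suggested for such data: once the size is known, the whole buffer can be processed with the `mem*` functions instead of stopping at the first NUL. This stand-alone sketch assumes the caller already has the pointer and size; `count_embedded_nuls` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the NUL bytes among the first `size` bytes of `buf`. */
size_t
count_embedded_nuls(const char *buf, size_t size)
{
    assert(buf != NULL);
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + size;
    /* memchr() takes an explicit length, so it keeps scanning past NULs. */
    while (p < end && (p = memchr(p, '\0', (size_t)(end - p))) != NULL) {
        count++;
        p++;                /* resume the search after the NUL just found */
    }
    return count;
}
```

A `strlen()`-based loop over the same data would only ever see the bytes before the first NUL.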

Yhg1s · 2023-11-02T20:58:11Z

Caching the result does not affect my use-cases.

Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache?

I've commented there, but just for clarity: it's because it only ever calls PyUnicode_AsUTF8() once per string.
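The point about the cache can be modelled with a hypothetical string object that remembers the outcome of the NUL scan. This is only an illustrative model and makes no claim about how PR #111587 implements its cache; `model_str`, `model_as_utf8` and `scans_after_calls` are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string object that caches the result of the NUL scan. */
typedef struct {
    const char *utf8;
    size_t size;
    bool checked;      /* has the scan already run? */
    bool has_nul;      /* cached result of the scan */
    unsigned scans;    /* number of scans performed, for illustration */
} model_str;

/* Returns the data, or NULL if it contains a NUL; scans at most once. */
const char *
model_as_utf8(model_str *s)
{
    assert(s != NULL);
    if (!s->checked) {
        s->has_nul = memchr(s->utf8, '\0', s->size) != NULL;
        s->checked = true;
        s->scans++;
    }
    return s->has_nul ? NULL : s->utf8;
}

/* Call model_as_utf8() `calls` times on a fresh string, return scan count. */
unsigned
scans_after_calls(unsigned calls)
{
    model_str s = { "hello", 5, false, false, 0 };
    for (unsigned i = 0; i < calls; i++) {
        (void)model_as_utf8(&s);
    }
    return s.scans;
}
```

In this model a string fetched once still pays for one full scan, so a cache only saves work on repeated calls for the same string.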

Yhg1s · 2023-11-02T21:01:22Z

Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().

If the string does not contain an embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use PyUnicode_AsUTF8AndSize(), PyUnicode_AsUTF8String() or any other existing API which returns the size of the UTF-8 encoded string (and doesn't raise an exception in case of an embedded null character).

This is exactly the problem I'm pointing out. You are breaking existing code because you're making bad assumptions. The assumption is that the user didn't check for NULs before, or that handling NULs this way would be problematic. It does not have to be. You're forcing users who are correctly using the API to use a different API for no reason other than to signal to you that they're using it correctly. If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here.

vstinner · 2023-11-02T23:21:25Z

If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here.

Using PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize() doesn't mean that you don't care about the length. You can use PyUnicode_AsUTF8() with any C API which expects a null-terminated string. strlen() gives you the string length, but you don't need to pass the length explicitly: the null terminator is used for that.

I'm not sure that PyUnicode_AsUTF8() implies that you don't care about embedded null characters.

Would you mind elaborating on how you create a string with null characters and then truncate the string at the first null character on purpose? What does the string contain? Why not truncate at the null character when you create the string? Do you have examples of code doing that?

* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)". This reverts commit d9b606b.
* Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)". This reverts commit cde1071.
* Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)". This reverts commit d731579.
* Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)". This reverts commit d8f32be.
* Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)". This reverts commit 37e4e20.

vstinner requested a review from markshannon as a code owner October 20, 2023 00:02

bedevere-appbot mentioned this pull request Oct 20, 2023
[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089
Closed

bedevere-appbot added the awaiting core review label Oct 20, 2023

vstinner commented Oct 20, 2023

Objects/unicodeobject.c (outdated, resolved)

serhiy-storchaka reviewed Oct 20, 2023

vstinner force-pushed the asutf8 branch from 91e159c to 4e0b3d3 October 20, 2023 10:04

serhiy-storchaka reviewed Oct 20, 2023

Objects/unicodeobject.c (outdated, resolved)

Update Objects/unicodeobject.c
957c433
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

serhiy-storchaka approved these changes Oct 20, 2023

bedevere-appbot added awaiting merge and removed awaiting core review labels Oct 20, 2023

vstinner merged commit d731579 into python:main Oct 20, 2023

vstinner deleted the asutf8 branch October 20, 2023 15:59

bedevere-appbot removed the awaiting merge label Oct 20, 2023

vstinner mentioned this pull request Oct 20, 2023
gh-111089: Add PyUnicode_AsUTF8() to the limited C API #111121
Merged

vstinner mentioned this pull request Nov 1, 2023
gh-111089: Add cache to PyUnicode_AsUTF8() for embedded NUL #111587
Closed

vstinner mentioned this pull request Nov 3, 2023
gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672
Closed

vstinner added a commit to vstinner/cpython that referenced this pull request Nov 7, 2023
Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"
be7e341
This reverts commit d731579.

vstinner mentioned this pull request Nov 7, 2023
gh-111089: Revert PyUnicode_AsUTF8() changes #111833
Merged
