Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL#111091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
vstinner commented Oct 20, 2023 • edited by github-actions bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by github-actions bot
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Oct 20, 2023
Uh oh!
There was an error while loading. Please reload this page.
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this idea when added such checks in other functions, but at that time PyUnicode_AsUTF8() was rather a simple accessor to internal data.
Most sites where it matters where changed to use PyUnicode_AsUTF8AndSize() + strlen(). Now some of them can be changed back to use PyUnicode_AsUTF8().
Do you consider to include PyUnicode_AsUTF8() in the limited C API?
| extracted from the returned data. | ||
| */ | ||
| // Returns a pointer to the default encoding (UTF-8) of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it necessary to change the type of the comment? Most comments here are /* */.
It makes reviewing the changes more difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary, but I take this PR as an opportunity to change the comment style. IMO multi-line comments written with // comment syntax are way easier to read.
Uh oh!
There was an error while loading. Please reload this page.
vstinner commented Oct 20, 2023
It's my main motivation for this change. It's the most natural way to get a |
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
vstinner commented Oct 20, 2023
Once this change will be merged, we can review PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize() usage. But I chose to make this PR as small to possible to ease the review. |
Uh oh!
There was an error while loading. Please reload this page.
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
vstinner commented Oct 20, 2023
Oh. It seems like I removed the cast fix in my latest change. Thanks @serhiy-storchaka, I applied your suggestion. |
serhiy-storchaka commented Oct 20, 2023
But why set size to 0, and not to -1? There is larger chance to miss error. For example |
vstinner commented Oct 20, 2023
Is this comment related to my PyUnicode_AsUTF8AndSize change, PR #111106? |
serhiy-storchaka commented Oct 20, 2023
Yes. |
vstinner commented Oct 20, 2023
Merged, thanks for the review @serhiy-storchaka. |
Yhg1s commented Nov 2, 2023
Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code? Why is recommending a replacement of I use |
vstinner commented Nov 2, 2023
I wrote PR #111587 to avoid calling strlen() in most cases. |
Yhg1s commented Nov 2, 2023
Caching the result does not affect my use-cases. |
vstinner commented Nov 2, 2023
Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache? |
vstinner commented Nov 2, 2023
Sorry, I made the assumption that the background of this change was a common knowledge. I wrote issue #111656 to describe the issue. |
vstinner commented Nov 2, 2023
If the string does not contain embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use |
Yhg1s commented Nov 2, 2023
I've commented there, but just for clarity: it's because it only ever calls |
Yhg1s commented Nov 2, 2023
This is exactly the problem I'm pointing out. You are breaking existing code because you're making bad assumptions. The assumption is that the user didn't check for NULs before, or that handling NULs this way would be problematic. It does not have to be. You're forcing users who are correctly using the API to use a different API for no reason other than to signal to you that they're using it correctly. If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here. |
vstinner commented Nov 2, 2023
Using PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize() doesn't mean that you don't care about the length. You can use PyUnicode_AsUTF8() when you use any C API which expects a null terminated string. strlen() gives you the string length, but you don't need to explicitly pass the length, the null terminator is used for that. I'm not sure that PyUnicode_AsUTF8() implies that you don't care about embedded null characters. Would you mind to elaborate how you create a string with null characters and then truncate the string at the first null character on purpose? What does the string contain? Why not truncating at null when you create the string? Do you have examples of code doing that? |
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)" This reverts commit d9b606b. * Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)" This reverts commit cde1071. * Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)" This reverts commit d731579. * Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)" This reverts commit d8f32be. * Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)" This reverts commit 37e4e20.
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
*sizeto 0 on error to avoid undefined variable value.📚 Documentation preview 📚: https://cpython-previews--111091.org.readthedocs.build/