@vstinner (Member) commented Jun 10, 2024

PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler.

Remove the unused 'consumed' parameter of
unicode_decode_utf8_writer().


📚 Documentation preview 📚: https://cpython-previews--120307.org.readthedocs.build/

@serhiy-storchaka (Member)

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

@vstinner (Member, Author)

> What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

There are two tests on that: UnicodeDecodeError is raised in this case.

@vstinner (Member, Author)

Example test from test_capi.test_unicode:

    # test "%s" format with precision
    check_format('abc',
                 b'%.3s', b'abcdef')
    with self.assertRaises(UnicodeDecodeError):
        PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
    check_format('abc[\u20ac',
                 b'%.7s', 'abc[\u20ac]'.encode('utf8'))
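The byte-level behaviour these tests exercise can be reproduced in pure Python. The following is a minimal sketch, assuming that "%.Ns" amounts to keeping at most N bytes and then decoding them as UTF-8. The helper name `format_s_precision` is hypothetical and only emulates the idea; it is not the C implementation.

```python
def format_s_precision(data: bytes, precision: int, errors: str = "strict") -> str:
    """Emulate "%.<precision>s": keep at most `precision` bytes, then decode as UTF-8.

    Hypothetical helper for illustration only, not CPython's C code.
    """
    return data[:precision].decode("utf-8", errors)


data = 'abc[\u20ac]'.encode('utf8')  # 8 bytes: the euro sign takes 3 bytes

# 7 bytes end exactly after the euro sign, so decoding succeeds.
full = format_s_precision(data, 7)

# 5 bytes cut the euro sign after its first byte: "strict" raises.
try:
    format_s_precision(data, 5)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With "replace", the cut-off sequence becomes U+FFFD instead of an error.
replaced = format_s_precision(data, 5, "replace")
```

Here `full` is `'abc[\u20ac'`, the strict call raises, and `replaced` is `'abc[\ufffd'`.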

@serhiy-storchaka (Member)

This is bad. Such formats are common in error-formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error, even if the encoding was fine. In this case I think it is better to truncate the string before the truncated sequence.

But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError.
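The idea of truncating before a cut-off sequence can be sketched in Python with an incremental decoder, which holds back an incomplete trailing sequence when it is told that more input may follow. This only illustrates the idea and is not what CPython does in C; the function name `decode_dropping_incomplete_tail` is hypothetical.

```python
import codecs


def decode_dropping_incomplete_tail(data: bytes, errors: str = "replace") -> str:
    """Decode UTF-8, silently dropping a multibyte sequence cut off at the end.

    Hypothetical helper: with final=False, the incremental decoder buffers an
    incomplete trailing sequence instead of reporting it. Bytes that are
    invalid elsewhere are still handled by the `errors` handler.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors)
    return decoder.decode(data, final=False)


data = 'abc[\u20ac]'.encode('utf8')

cut_after_1 = decode_dropping_incomplete_tail(data[:5])  # 1 of 3 euro bytes kept
cut_after_2 = decode_dropping_incomplete_tail(data[:6])  # 2 of 3 euro bytes kept
complete = decode_dropping_incomplete_tail(data[:7])     # whole euro sign kept
```

In both cut-off cases the result is `'abc['`, while the complete sequence decodes to `'abc[\u20ac'`.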

@vstinner (Member, Author)

On my PR gh-120248, @methane wrote:

> I prefer "strict" because "hard to notice" is also hard to debug.

So I created this PR. @methane: What do you think?

I can modify the %.100s format ("%s" with precision) to truncate to 100 characters instead of 100 bytes, to avoid the risk of creating invalid UTF-8 strings.

@serhiy-storchaka (Member)

I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments.

@methane (Member)

I think 100 codepoints is the best option.

As for the error handler, there is no correct answer. Theoretically speaking, "strict" follows "Errors should never pass silently."
But both "replace" and "backslashreplace" are acceptable.
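For comparison, here is a small sketch of how the three error handlers mentioned above treat the same invalid input when decoding in Python. The sample bytes are an assumption chosen for illustration: "café" encoded as Latin-1, which is not valid UTF-8.

```python
bad = b'caf\xe9'  # Latin-1 bytes; \xe9 starts a UTF-8 sequence that never completes

# "strict": the decoding error replaces whatever message was being built.
try:
    bad.decode('utf-8', 'strict')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# "replace": the invalid byte becomes U+FFFD and the rest of the text survives.
replaced = bad.decode('utf-8', 'replace')

# "backslashreplace": the invalid byte stays visible as an escape sequence.
escaped = bad.decode('utf-8', 'backslashreplace')
```

With "replace" the result is `'caf\ufffd'`; with "backslashreplace" it is the six characters `caf\xe9`, which keeps the offending byte value visible for debugging.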

@serhiy-storchaka (Member)

Precision should specify the length in bytes. This feature can be used to format non-null-terminated strings.

    char buffer[100];
    PyUnicode_FromFormat("%.100s", buffer);

If you start to count codepoints, you can read past the end of the array.
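The size mismatch behind this concern can be checked with simple arithmetic in Python. This is a sketch with assumed sample data: a 100-byte buffer filled mostly with 3-byte euro signs.

```python
# A buffer of exactly 100 bytes: 33 euro signs (3 bytes each) plus one ASCII byte.
buffer = ('\u20ac' * 33 + 'x').encode('utf-8')
buffer_size = len(buffer)                      # 100

# Decoding those 100 bytes yields far fewer than 100 code points.
codepoints_in_buffer = len(buffer.decode('utf-8'))  # 34

# Collecting 100 euro signs would need 300 bytes, three times the buffer size.
bytes_for_100_codepoints = len(('\u20ac' * 100).encode('utf-8'))  # 300
```

A decoder that kept going until it had 100 code points would therefore have to read well beyond the 100 bytes that the caller actually owns.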

@vstinner (Member, Author)

I am abandoning this PR. It seems that using the "replace" error handler is more appropriate here.

@vstinner deleted the format_strict branch June 17, 2024 20:02
3 participants: @vstinner, @serhiy-storchaka, @methane