ElementTree should use UTF-8 for xml declaration. #91810

@methane

Feature or enhancement

Currently, ElementTree.tostring(root, encoding="unicode", xml_declaration=True) writes the locale encoding into the XML declaration.

I think ElementTree should use UTF-8, instead of locale encoding.

Example:

$ LANG=ja_JP.eucJP ./python.exe
Python 3.11.0a7+ (heads/bytes-alloc-dirty:7fbc7f6128, Apr 19 2022, 16:53:54) [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> et = ET.fromstring("<t>hello</t>")
>>> ET.tostring(et, encoding="unicode", xml_declaration=True)
"<?xml version='1.0' encoding='eucJP'?>\n<t>hello</t>"

Code:

with _get_writer(file_or_filename, enc_lower) as write:
    if method == "xml" and (xml_declaration or
            (xml_declaration is None and
             enc_lower not in ("utf-8", "us-ascii", "unicode"))):
        declared_encoding = encoding
        if enc_lower == "unicode":
            # Retrieve the default encoding for the xml declaration
            import locale
            declared_encoding = locale.getpreferredencoding()
        write("<?xml version='1.0' encoding='%s'?>\n" % (
            declared_encoding,))
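
For illustration, the proposal amounts to roughly the following. This is only a sketch written as a standalone helper, not a patch to ElementTree; the names pick_declared_encoding and format_xml_declaration are hypothetical and do not exist in the standard library.

```python
def pick_declared_encoding(encoding: str) -> str:
    """Return the encoding name to write into the XML declaration.

    Hypothetical helper: follows the branch quoted above, but uses a fixed
    "utf-8" instead of locale.getpreferredencoding() for text output.
    """
    if encoding.lower() == "unicode":
        # Text output has no byte encoding of its own, so declare UTF-8.
        return "utf-8"
    # Any explicit encoding is declared exactly as the caller spelled it.
    return encoding


def format_xml_declaration(encoding: str) -> str:
    """Build the declaration line using the same format string as the code above."""
    return "<?xml version='1.0' encoding='%s'?>\n" % (
        pick_declared_encoding(encoding),)


print(format_xml_declaration("unicode"), end="")
```

With this selection rule the declaration no longer depends on LANG or any other locale setting.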

Pitch

  • UTF-8 is the most common encoding for XML.
  • The locale encoding name (e.g. cp932 or eucJP) can differ from the XML encoding name recommended by the W3C (e.g. Shift_JIS or EUC-JP).
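
A possible workaround on affected versions is to serialise to bytes with an explicit encoding and decode the result, so the declaration never consults the locale. This is a minimal sketch; tostring_utf8_decl is a hypothetical helper name, and it assumes a Python version where tostring() accepts xml_declaration (3.8 or later).

```python
import xml.etree.ElementTree as ET


def tostring_utf8_decl(element: ET.Element) -> str:
    """Serialise element to a str whose XML declaration names utf-8."""
    # With an explicit byte encoding, the declaration uses that encoding name
    # rather than the locale's preferred encoding.
    data = ET.tostring(element, encoding="utf-8", xml_declaration=True)
    # Decode back to text for callers that want a str.
    return data.decode("utf-8")


root = ET.fromstring("<t>hello</t>")
print(tostring_utf8_decl(root))
```

The trade-off is an extra encode/decode round trip, which is usually negligible for small documents.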
