unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties. #129117

@mrolle45

Bug report

Bug description:

With the unicodedata module, it is possible to determine whether a Unicode character is a valid identifier start or identifier continuation character in most cases, but not all.
The method is to look at unicodedata.category(c).
A start character has a category in "Lu Ll Lt Lm Lo Nl Pc".split().
A continuation character has a category in "Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split().
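The category test described above can be written as a minimal sketch. The helper names `category_says_start` and `category_says_continue` are made up for illustration and are not part of unicodedata; only `unicodedata.category` is a real call.

```python
import unicodedata

# Category sets copied from the description above.
START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def category_says_start(c: str) -> bool:
    """Approximate XID_Start using only the general category (hypothetical helper)."""
    return unicodedata.category(c) in START_CATEGORIES


def category_says_continue(c: str) -> bool:
    """Approximate XID_Continue using only the general category (hypothetical helper)."""
    return unicodedata.category(c) in CONTINUE_CATEGORIES
```

This is only the approximation under discussion; the exceptions it gets wrong are the subject of this report.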

However, several code points don't fit this rule: either the category test accepts a character that does not have the property, or it rejects a character that does.
Here is a complete list of the exceptions, on Python 3.13 and Unicode version 16.0. Each entry gives the code point, its general category, the result of the category test, and the character name.
Pass the category test but are not XID_START:

005f Pc True LOW LINE
037a Lm True GREEK YPOGEGRAMMENI
0e33 Lo True THAI CHARACTER SARA AM
0eb3 Lo True LAO VOWEL SIGN AM
203f Pc True UNDERTIE
2040 Pc True CHARACTER TIE
2054 Pc True INVERTED UNDERTIE
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe33 Pc True PRESENTATION FORM FOR VERTICAL LOW LINE
fe34 Pc True PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
fe4d Pc True DASHED LOW LINE
fe4e Pc True CENTRELINE LOW LINE
fe4f Pc True WAVY LOW LINE
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM
ff3f Pc True FULLWIDTH LOW LINE
ff9e Lm True HALFWIDTH KATAKANA VOICED SOUND MARK
ff9f Lm True HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Fail the category test but are XID_START:

1885 Mn False MONGOLIAN LETTER ALI GALI BALUDA
1886 Mn False MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL

Pass the category test but are not XID_CONTINUE:

037a Lm True GREEK YPOGEGRAMMENI
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM

Fail the category test but are XID_CONTINUE:

00b7 Po False MIDDLE DOT
0387 Po False GREEK ANO TELEIA
1369 No False ETHIOPIC DIGIT ONE
136a No False ETHIOPIC DIGIT TWO
136b No False ETHIOPIC DIGIT THREE
136c No False ETHIOPIC DIGIT FOUR
136d No False ETHIOPIC DIGIT FIVE
136e No False ETHIOPIC DIGIT SIX
136f No False ETHIOPIC DIGIT SEVEN
1370 No False ETHIOPIC DIGIT EIGHT
1371 No False ETHIOPIC DIGIT NINE
19da No False NEW TAI LUE THAM DIGIT ONE
200c Cf False ZERO WIDTH NON-JOINER
200d Cf False ZERO WIDTH JOINER
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL
30fb Po False KATAKANA MIDDLE DOT
ff65 Po False HALFWIDTH KATAKANA MIDDLE DOT

Many of these exceptions are specified in UAX #31, Section 5.1, NFKC Modifications.
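One way to look for such exceptions without any new API is to compare the category test against `str.isidentifier()`. This is a rough sketch, not necessarily how the lists above were produced. It assumes `str.isidentifier()` reflects XID_Start and XID_Continue for the Unicode data bundled with the running interpreter, so results can differ between Python versions. `find_mismatches` is a hypothetical name.

```python
import sys
import unicodedata

START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def find_mismatches(code_points=range(sys.maxunicode + 1)):
    """Yield (kind, code point, category, category-test result, name) rows."""
    for cp in code_points:
        c = chr(cp)
        cat = unicodedata.category(c)
        name = unicodedata.name(c, "<unnamed>")
        # Assumption: str.isidentifier() stands in for the real properties.
        # It also accepts "_" as a first character, which XID_Start does not,
        # so LOW LINE will not show up as a start mismatch here.
        actual_start = c.isidentifier()
        actual_continue = ("a" + c).isidentifier()
        predicted_start = cat in START_CATEGORIES
        predicted_continue = cat in CONTINUE_CATEGORIES
        if predicted_start != actual_start:
            yield ("start", cp, cat, predicted_start, name)
        if predicted_continue != actual_continue:
            yield ("continue", cp, cat, predicted_continue, name)


# Usage: scan a small range and print rows in the same layout as the lists above.
for kind, cp, cat, predicted, name in find_mismatches(range(0x80, 0x100)):
    print(f"{kind:8} {cp:04x} {cat} {predicted} {name}")
```

The trick relies on prefixing a known start character so that the second position is checked as a continuation character. It works, but it is indirect and easy to get wrong, which is part of the motivation for a dedicated API.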

Proposal

I suggest adding two functions to the module, unicodedata.isidstart(chr) and unicodedata.isidcontinue(chr). They return True if chr appears in the DerivedCoreProperties.txt file as XID_Start or XID_Continue, respectively.
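To illustrate the intended behavior, here is a pure-Python sketch of the two functions driven by text in the DerivedCoreProperties.txt line format. It is not a proposed implementation. `SAMPLE` is a tiny hand-abridged excerpt for demonstration only, and `load_ranges` and `in_ranges` are hypothetical helpers. A real version would read the full data file, or more likely live in C alongside the existing database.

```python
import bisect

# Abridged excerpt in the DerivedCoreProperties.txt line format (illustration only).
SAMPLE = """\
0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; XID_Start # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0030..0039    ; XID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; XID_Continue # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
005F          ; XID_Continue # Pc       LOW LINE
0061..007A    ; XID_Continue # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00B7          ; XID_Continue # Po       MIDDLE DOT
"""


def load_ranges(text, prop):
    """Collect sorted (first, last) code point ranges listed for one property."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    ranges.sort()
    return ranges


def in_ranges(ranges, cp):
    """Binary-search the sorted ranges for a code point."""
    i = bisect.bisect_right(ranges, (cp, 0x110000)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


XID_START = load_ranges(SAMPLE, "XID_Start")
XID_CONTINUE = load_ranges(SAMPLE, "XID_Continue")


def isidstart(c):
    """Sketch of the proposed unicodedata.isidstart (module-level stand-in)."""
    return in_ranges(XID_START, ord(c))


def isidcontinue(c):
    """Sketch of the proposed unicodedata.isidcontinue (module-level stand-in)."""
    return in_ranges(XID_CONTINUE, ord(c))
```

With this excerpt, `isidstart("_")` is False while `isidcontinue("_")` is True, which matches the first entry in the lists above.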

CPython versions tested on:

3.13

Operating systems tested on:

Windows
