gh-74902: Add Unicode Grapheme Cluster Break algorithm #143076
serhiy-storchaka commented Dec 22, 2025 • edited
Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters according to rules defined in Unicode Standard Annex #29.
Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm.
Co-authored-by: Guillaume "Vermeille" Sanchez <[email protected]>
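For readers who want to try the proposed API, here is a minimal sketch of how it might be called. It assumes that `unicodedata.iter_graphemes()` yields objects with `start` and `end` attributes, which is what the tests in this PR read. The helper name `grapheme_spans` is hypothetical, and the fallback branch is only a crude approximation for interpreters without the new function (it attaches nonspacing and enclosing marks to the preceding character); it is not the UAX #29 algorithm.

```python
import unicodedata


def grapheme_spans(text):
    """Return a list of (start, end) index pairs, one per grapheme cluster."""
    # iter_graphemes is the function proposed in this PR; it may be absent
    # on the interpreter running this code, so look it up defensively.
    iter_graphemes = getattr(unicodedata, "iter_graphemes", None)
    if iter_graphemes is not None:
        # Assumption: each yielded object exposes .start and .end,
        # as used by the tests under review.
        return [(g.start, g.end) for g in iter_graphemes(text)]

    # Fallback (NOT UAX #29): start a new cluster at every character that
    # is not a nonspacing (Mn) or enclosing (Me) mark. This ignores ZWJ
    # sequences, Hangul syllables, regional indicators, and more.
    spans = []
    start = 0
    for i in range(1, len(text)):
        if unicodedata.category(text[i]) not in ("Mn", "Me"):
            spans.append((start, i))
            start = i
    if text:
        spans.append((start, len(text)))
    return spans


# "e" followed by U+0301 COMBINING ACUTE ACCENT forms one cluster; "a" is separate.
print(grapheme_spans("e\u0301a"))
```

Slicing the original string with each pair, for example `text[start:end]`, recovers the text of the corresponding cluster.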
``False`` otherwise.
.. versionadded:: next
The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.
Or we can split it into sections by type and order alphabetically within each section.
.. data:: ucd_3_2_0
- This is an object that has the same methods as the entire module, but uses the
+ This is an object that has most of the methods of the entire module, but uses the
This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as».
merwok commented Dec 23, 2025
These functions help compute width?
serhiy-storchaka commented Dec 23, 2025
At least two implementations (in Perl's Unicode::GCString and builtin in C++) use graphemes. Naive implementation in C's
Co-authored-by: Stan Ulbrych <[email protected]>
merwok commented Dec 23, 2025
Sorry if my question was not clear.
serhiy-storchaka commented Dec 23, 2025
Yes, I think that
self.assertEqual(chunks.pop(), '', line)
input = ''.join(chunks)
with self.subTest(line):
    result = list(unicodedata.iter_graphemes(input))
Did you mean to use the passed ucd argument?
- result = list(unicodedata.iter_graphemes(input))
+ result = list(ucd.iter_graphemes(input))
self.assertEqual([x.start for x in result], breaks[:-1], comment)
self.assertEqual([x.end for x in result], breaks[1:], comment)
for i in range(1, len(breaks) - 1):
    result = list(unicodedata.iter_graphemes(input, breaks[i]))
- result = list(unicodedata.iter_graphemes(input, breaks[i]))
+ result = list(ucd.iter_graphemes(input, breaks[i]))
Continues above.
No, it is a module-only function.
Modules/unicodedata.c Outdated
}
/* XXX Add doc strings. */
The above functions already have docstrings?
hdr = testfile.readline()
return unicodedata.unidata_version in hdr
@requires_resource('network')
Should it not be urlfetch resource?
Maybe. The other test (for normalization) uses the network resource.
Co-authored-by: Stan Ulbrych <[email protected]>
bab1d7a into python:main
bedevere-bot commented Jan 14, 2026