gh-74902: Add Unicode Grapheme Cluster Break algorithm #143076

serhiy-storchaka · 2025-12-22T16:34:03Z

Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters according to rules defined in Unicode Standard Annex #29.

Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm.
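As a rough illustration of how the new iterator might be used, the sketch below collects the (start, end) offsets of each grapheme cluster. It relies only on the `start` and `end` attributes that the tests in this PR exercise. `grapheme_spans` is a hypothetical helper name, and the `hasattr` guard is an assumption added so the snippet also runs on builds that predate this change.

```python
import unicodedata


def grapheme_spans(text):
    """Return (start, end) offsets per grapheme cluster, or None if unsupported."""
    # iter_graphemes() is added by this PR; older builds do not have it.
    if not hasattr(unicodedata, "iter_graphemes"):
        return None
    return [(seg.start, seg.end) for seg in unicodedata.iter_graphemes(text)]


# "e" + COMBINING ACUTE ACCENT, followed by a plain "x".
sample = "e\u0301x"
spans = grapheme_spans(sample)
if spans is None:
    print("unicodedata.iter_graphemes() is not available in this build")
else:
    for start, end in spans:
        # Slice the original string to recover the text of each cluster.
        print(start, end, ascii(sample[start:end]))
```

Slicing the input with the reported offsets keeps the example independent of any other behaviour of the returned objects.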

Issue: Add unicode grapheme cluster break algorithm #74902

Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters according to rules defined in Unicode Standard Annex #29.

Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm.

Co-authored-by: Guillaume "Vermeille" Sanchez <[email protected]>

Doc/whatsnew/3.15.rst

Doc/library/unicodedata.rst

merwok · 2025-12-23T01:55:01Z

Doc/library/unicodedata.rst

+ ``False`` otherwise.
+
+ .. versionadded:: next
+


The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.

Or we can split it into sections by type and order the functions alphabetically within each section.

merwok · 2025-12-23T01:56:02Z

Doc/library/unicodedata.rst

 .. data:: ucd_3_2_0

- This is an object that has the same methods as the entire module, but uses the
+ This is an object that has most of the methods of the entire module, but uses the


This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as».

merwok · 2025-12-23T01:58:48Z

These functions help compute width?

serhiy-storchaka · 2025-12-23T08:01:13Z

At least two implementations (Perl's Unicode::GCString and the built-in one in C++) use graphemes. The naive implementation in C's wcwidth() does not work well with complex characters and emoji.
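To make the point about per-character width concrete, here is a minimal sketch. `per_code_point_width` is a hypothetical, deliberately simplified counter (it is not C's wcwidth()), built only on `unicodedata.east_asian_width()`. It shows how summing widths code point by code point overcounts an emoji ZWJ sequence that is usually drawn as a single glyph.

```python
import unicodedata


def per_code_point_width(text):
    """Hypothetical simplification: 2 columns for Wide/Fullwidth, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


# MAN + ZWJ + WOMAN + ZWJ + GIRL: usually rendered as one family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(family))                   # number of code points in the sequence
print(per_code_point_width(family))  # per-code-point sum, larger than one glyph needs
```

A width function that first segments the string into grapheme clusters can assign one width per cluster instead, which is the approach discussed in this thread.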


merwok · 2025-12-23T14:59:33Z

Sorry if my question was not clear.
Are these functions building blocks that will be used to compute character widths, as needed by the other tickets about pyrepl, str justify, etc?
(maybe iter_graphemes?)

serhiy-storchaka · 2025-12-23T15:12:10Z

Yes, I think that _PyGraphemeBreak, _Py_InitGraphemeBreak() and _Py_NextGraphemeBreak() will be used in the implementation of unicodedata.width().

Doc/library/unicodedata.rst

StanFromIreland · 2025-12-26T13:43:26Z

Lib/test/test_unicodedata.py

+ self.assertEqual(chunks.pop(), '', line)
+ input = ''.join(chunks)
+ with self.subTest(line):
+     result = list(unicodedata.iter_graphemes(input))


Did you mean to use the passed ucd argument?
Suggested change
- result = list(unicodedata.iter_graphemes(input))
+ result = list(ucd.iter_graphemes(input))

StanFromIreland · 2025-12-26T13:43:43Z

Lib/test/test_unicodedata.py

+ self.assertEqual([x.start for x in result], breaks[:-1], comment)
+ self.assertEqual([x.end for x in result], breaks[1:], comment)
+ for i in range(1, len(breaks) - 1):
+     result = list(unicodedata.iter_graphemes(input, breaks[i]))


Suggested change
- result = list(unicodedata.iter_graphemes(input, breaks[i]))
+ result = list(ucd.iter_graphemes(input, breaks[i]))
Continues above.

No, it is a module-only function.
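One way to check which functions are module-only is to compare the public callables on the module with those on the `ucd_3_2_0` object. This is only a sketch: `public_callables` is a hypothetical helper, and the resulting list depends on the Python build (on builds without this PR it will not mention `iter_graphemes` at all).

```python
import unicodedata

ucd = unicodedata.ucd_3_2_0  # object backed by the Unicode 3.2.0 database


def public_callables(obj):
    """Names of public, callable, non-type attributes on obj."""
    names = set()
    for name in dir(obj):
        if name.startswith("_"):
            continue
        attr = getattr(obj, name)
        if callable(attr) and not isinstance(attr, type):
            names.add(name)
    return names


# Functions that exist on the module but not on the 3.2.0 object.
module_only = sorted(public_callables(unicodedata) - public_callables(ucd))

print(ucd.unidata_version)
print(module_only)
```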

StanFromIreland · 2025-12-26T13:46:46Z

Modules/unicodedata.c

+}
+
+
+/* XXX Add doc strings. */


The above functions already have docstrings?

Modules/unicodedata.c

Doc/library/unicodedata.rst

Lib/test/test_unicodedata.py

StanFromIreland · 2025-12-26T14:05:02Z

Lib/test/test_unicodedata.py

+ hdr = testfile.readline()
+ return unicodedata.unidata_version in hdr
+
+@requires_resource('network')


Should it not be the urlfetch resource?

Maybe. The other test (for normalization) uses the network resource.

Modules/unicodedata.c


bedevere-bot · 2026-01-14T15:28:16Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x Fedora Stable LTO 3.x (tier-3) has failed when building commit bab1d7a.

What do you need to do:

Don't panic.
Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1654/builds/1940) and take a look at the build logs.
Check if the failure is related to this commit (bab1d7a) or if it is a false positive.
If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1654/builds/1940

Summary of the results of the build (if available):

Traceback logs:

Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-s390x.lto/build/Lib/tempfile.py", line 484, in __del__
    _warnings.warn(self.warn_message, ResourceWarning)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ResourceWarning: Implicitly cleaning up <HTTPError 403: 'Forbidden'>

bedevere-bot · 2026-01-14T15:58:09Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 CentOS9 NoGIL Refleaks 3.x (tier-1) has failed when building commit bab1d7a.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1610/builds/2750

Failed tests:

test_unicodedata

Test leaking resources:

test_unicodedata: memory blocks
test_unicodedata: references

Summary of the results of the build (if available):

Traceback logs:

remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 24 (delta 5), reused 5 (delta 5), pack-reused 8 (from 2)
From https://github.com/python/cpython
 * branch main -> FETCH_HEAD
Note: switching to 'bab1d7a561ab015dd6bb97e255fd12a8ce367edf'.
HEAD is now at bab1d7a561a gh-74902: Add Unicode Grapheme Cluster Break algorithm (GH-143076)
Switched to and reset branch 'main'
configure: WARNING: no system libmpdec found; falling back to pure-Python version for the decimal module
make: *** [Makefile:2503: buildbottest] Error 2

bedevere-bot · 2026-01-14T16:11:44Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 FreeBSD Refleaks 3.x (tier-3) has failed when building commit bab1d7a.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1613/builds/2681

Failed tests:

test_unicodedata

Test leaking resources:

test_unicodedata: memory blocks
test_unicodedata: references

Summary of the results of the build (if available):

Traceback logs:

remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 24 (delta 5), reused 5 (delta 5), pack-reused 8 (from 2)
From https://github.com/python/cpython
 * branch main -> FETCH_HEAD
Note: switching to 'bab1d7a561ab015dd6bb97e255fd12a8ce367edf'.
HEAD is now at bab1d7a561a gh-74902: Add Unicode Grapheme Cluster Break algorithm (GH-143076)
Switched to and reset branch 'main'

serhiy-storchaka requested a review from AA-Turner as a code owner December 22, 2025 16:34

bedevere-appbot mentioned this pull request Dec 22, 2025
Add unicode grapheme cluster break algorithm #74902
Closed

bedevere-appbot added the awaiting core review label Dec 22, 2025

Try to fix rst.
b0585f9

AA-Turner mentioned this pull request Dec 22, 2025
gh-74902: add unicode grapheme cluster break algorithm #2673
Closed

AA-Turner reviewed Dec 22, 2025

Doc/whatsnew/3.15.rst (outdated)

grayjk mentioned this pull request Dec 22, 2025
What would wcwidth look like if it were built-in to Python? jquast/wcwidth#94
Open

StanFromIreland reviewed Dec 22, 2025

Doc/library/unicodedata.rst (outdated)

merwok reviewed Dec 23, 2025

serhiy-storchaka and others added 2 commits December 23, 2025 11:37

Convert the types to heap types.
37fa38f

Update Doc/library/unicodedata.rst
c46e9bd
Co-authored-by: Stan Ulbrych <[email protected]>

StanFromIreland reviewed Dec 26, 2025

serhiy-storchaka and others added 6 commits December 26, 2025 22:06

Apply suggestions from code review
22cacf6
Co-authored-by: Stan Ulbrych <[email protected]>

Add tests for GB12 and GB13. Remove redundant parameter in run_grapheme_break_tests().
a95c3cb

Rename Grapheme to Segment. Add __repr__(), remove __iter__().
ad50831

Merge branch 'main' into grapheme_cluster_break2
365d0c1

Merge branch 'main' into grapheme_cluster_break2
f96c8d0

Polishing.
3b3a50a

serhiy-storchaka enabled auto-merge (squash) January 14, 2026 14:16

serhiy-storchaka merged commit bab1d7a into python:main on Jan 14, 2026
47 checks passed

bedevere-appbot removed the awaiting core review label Jan 14, 2026

5 participants

serhiy-storchaka commented Dec 22, 2025•
edited
Loading