Skip to content

Conversation

@Vermeille
Copy link
Contributor

@VermeilleVermeille commented Jul 11, 2017

I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTests.txt.

https://bugs.python.org/issue30717

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again to your contribution and we look forward to looking at it!

@VermeilleVermeille changed the title WIP: add grapheme cluster break algorithmbpo-30717: WIP: add grapheme cluster break algorithmJul 11, 2017
@VermeilleVermeilleforce-pushed the grapheme_cluster_break branch from 0f82f82 to 62fd6e0CompareJuly 13, 2017 23:29
@VermeilleVermeille changed the title bpo-30717: WIP: add grapheme cluster break algorithmbpo-30717: add grapheme cluster break algorithmJul 14, 2017
@VermeilleVermeille changed the title bpo-30717: add grapheme cluster break algorithmbpo-30717: add unicode grapheme cluster break algorithmJul 14, 2017
@VermeilleVermeilleforce-pushed the grapheme_cluster_break branch from 62fd6e0 to a47de54CompareJuly 14, 2017 03:21
@Vermeille
Copy link
ContributorAuthor

Hello? Someone here?

@serhiy-storchakaserhiy-storchaka added the type-feature A feature request or enhancement label Aug 3, 2017
0, /*tp_setattro*/
0, /*tp_as_buffer*/
Py_TPFLAGS_DEFAULT,
"Internal grapheme cluster iterator object.", /* tp_doc */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the words "internal" and "object" are redundant.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?

@Vermeille
Copy link
ContributorAuthor

Sorry for the long wait.

Are we good concerning the changes? Anything to add?

@brettcannon
Copy link
Member

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

@csabella
Copy link
Contributor

@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks!

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actionsgithub-actionsbot added the stale Stale PR or inactive for long period of time. label Feb 20, 2022
@github-actionsgithub-actionsbot removed the stale Stale PR or inactive for long period of time. label Jul 28, 2022
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actionsgithub-actionsbot added the stale Stale PR or inactive for long period of time. label Aug 29, 2022
@arhadthedevarhadthedev changed the title bpo-30717: add unicode grapheme cluster break algorithmgh-74902: add unicode grapheme cluster break algorithmFeb 14, 2023
@github-actionsgithub-actionsbot removed the stale Stale PR or inactive for long period of time. label Feb 15, 2023
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actionsgithub-actionsbot added the stale Stale PR or inactive for long period of time. label Mar 18, 2023
@bedevere-app
Copy link

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Copy link
Member

@serhiy-storchakaserhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize that it took so long to start reviewing this PR seriously.

Now we need this algorithm to calculate the width of text in columns, which is needed to support wide characters in many parts of the stdlib (REPL, tracebacks, etc). So we will add its implementation anyway. If you are busy or have lost interest, I will finish this work myself (keeping your credit), but if you are still interested, I would be happy to work together.

I wonder, what is the source of the state machine table? Did you created it from the original rules or from the table in GraphemeBreakTest.html? Or copied it from other source? I afraid that it is outdated and only supports legacy grapheme clusters. I can fix this, but maybe you already have a ready solution?

self: self
unistr: unicode
start: int = 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be Py_ssize_t. Some other variables should be Py_ssize_t, not int.

self: self
unistr: unicode
start: int = 0
end: Py_ssize_t(c_default="PY_SSIZE_T_MAX - 1") = sys.maxsize

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be PY_SSIZE_T_MAX.

Although I am not sure that the end parameter is needed. The user can simply stop iteration at any time.

@staticmethod
defcheck_version(testfile):
hdr=testfile.readline()
returnunicodedata.unidata_versioninhdr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the file header look like?

With string contains tests, I worry about things like "8.0" in "18.0" matching wrongly. Could the full line be compared?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# GraphemeBreakTest-17.0.0.txt 

We have the same check for normalization tests.

@bedevere-app
Copy link

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@serhiy-storchakaserhiy-storchaka marked this pull request as draft December 17, 2025 22:16
@serhiy-storchakaserhiy-storchaka self-assigned this Dec 17, 2025
@github-actionsgithub-actionsbot removed the stale Stale PR or inactive for long period of time. label Dec 18, 2025
@AA-Turner
Copy link
Member

Closing in favour of #143076.

A

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type-featureA feature request or enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants

@Vermeille@the-knights-who-say-ni@brettcannon@csabella@AA-Turner@methane@merwok@serhiy-storchaka@ezio-melotti