gh-74902: add unicode grapheme cluster break algorithm#2673
Vermeille commented Jul 11, 2017 • edited by bedevere-bot
the-knights-who-say-ni commented Jul 11, 2017
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again for your contribution and we look forward to looking at it!
0f82f82 to 62fd6e0
62fd6e0 to a47de54
Vermeille commented Aug 2, 2017
Hello? Someone here?
Modules/unicodedata.c Outdated
    0,                                              /*tp_setattro*/
    0,                                              /*tp_as_buffer*/
    Py_TPFLAGS_DEFAULT,
    "Internal grapheme cluster iterator object.",   /* tp_doc */
I think the words "internal" and "object" are redundant.
"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?
Vermeille commented Jan 11, 2018
Sorry for the long wait. Are we good concerning the changes? Anything to add?
brettcannon commented Feb 2, 2018
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says,
csabella commented May 23, 2020
@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks!
This PR is stale because it has been open for 30 days with no activity.
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
serhiy-storchaka left a comment
I apologize that it took so long to start reviewing this PR seriously.
Now we need this algorithm to calculate the width of text in columns, which is needed to support wide characters in many parts of the stdlib (REPL, tracebacks, etc.). So we will add its implementation anyway. If you are busy or have lost interest, I will finish this work myself (keeping your credit), but if you are still interested, I would be happy to work together.
I wonder, what is the source of the state machine table? Did you create it from the original rules or from the table in GraphemeBreakTest.html? Or did you copy it from another source? I am afraid that it is outdated and only supports legacy grapheme clusters. I can fix this, but maybe you already have a ready solution?
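To illustrate the column-width use case mentioned above, here is a rough sketch that relies only on functions already in `unicodedata`. The helper name `approx_width` and its rules (two columns for East Asian Wide/Fullwidth characters, zero for nonspacing and enclosing marks, one otherwise) are assumptions for illustration, not what the stdlib does or what this PR implements. A correct width calculation would first segment the text into grapheme clusters, which is exactly what this algorithm is needed for.

```python
import unicodedata


def approx_width(text):
    """Rough column width of `text` (illustrative approximation only)."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            # Nonspacing/enclosing marks attach to the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and Fullwidth characters occupy two terminal columns.
            width += 2
        else:
            width += 1
    return width


print(approx_width("abc"))
print(approx_width("日本"))
print(approx_width("e\u0301"))
```

Per-code-point rules like these break down for emoji ZWJ sequences and similar clusters, which is why cluster segmentation has to come first.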
    self: self
    unistr: unicode
    start: int = 0
It should be Py_ssize_t. Some other variables should be Py_ssize_t, not int.
    self: self
    unistr: unicode
    start: int = 0
    end: Py_ssize_t(c_default="PY_SSIZE_T_MAX - 1") = sys.maxsize
It should be PY_SSIZE_T_MAX.
Although I am not sure that the end parameter is needed. The user can simply stop iteration at any time.
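A small sketch of the point about stopping iteration early: a caller can bound the iteration without an `end` parameter. The generator `spans` below is a hypothetical stand-in that yields one `(start, end)` pair per code point; it does not perform real grapheme segmentation and is not the iterator from this PR.

```python
from itertools import takewhile


def spans(text, start=0):
    """Hypothetical stand-in iterator: one (start, end) span per code point."""
    for i in range(start, len(text)):
        yield i, i + 1


def spans_until(text, start, end):
    """Emulate an `end` argument by stopping once a span would pass `end`."""
    return list(takewhile(lambda span: span[1] <= end, spans(text, start)))


print(spans_until("abcdef", 1, 4))
```

The same effect can be had with a plain `for` loop and `break`; `takewhile` just keeps the sketch short.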
    @staticmethod
    def check_version(testfile):
        hdr = testfile.readline()
        return unicodedata.unidata_version in hdr
What does the file header look like?
With substring containment tests, I worry about things like "8.0" in "18.0" matching wrongly. Could the full line be compared?
# GraphemeBreakTest-17.0.0.txt
We have the same check for normalization tests.
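The substring concern can be shown with the header format quoted above. The sketch below is one possible stricter check, not the check used in the test suite: the helper names `header_version` and `check_version_strict` and the regular expression are assumptions for illustration, and they assume the header always has the form `# <Name>-X.Y.Z.txt`.

```python
import re


def header_version(header_line):
    """Extract 'X.Y.Z' from a header such as '# GraphemeBreakTest-17.0.0.txt'."""
    match = re.search(r"-(\d+\.\d+\.\d+)\.txt", header_line)
    return match.group(1) if match else None


def check_version_strict(header_line, expected):
    """Compare the whole version string instead of testing for a substring."""
    return header_version(header_line) == expected


hdr = "# GraphemeBreakTest-17.0.0.txt\n"
print("7.0.0" in hdr)                       # the substring test matches a different version
print(check_version_strict(hdr, "7.0.0"))
print(check_version_strict(hdr, "17.0.0"))
```

Comparing the extracted version exactly avoids the false match while still tolerating the trailing newline returned by `readline()`.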
AA-Turner commented Dec 22, 2025
Closing in favour of #143076. A
I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTest.txt.
https://bugs.python.org/issue30717
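
For readers unfamiliar with TR29, the following is a heavily simplified sketch of the kind of boundary rules such an automaton encodes. It is not the state machine from this PR and does not pass GraphemeBreakTest.txt. It approximates only three rule groups: no break between CR and LF (GB3), always break around controls (GB4, GB5), and no break before Extend or ZWJ (GB9). The classification helpers are assumptions built from general categories, because the real Grapheme_Cluster_Break property is what the PR adds.

```python
import unicodedata

ZWJ = "\u200d"


def is_extend_like(ch):
    # Approximation of Extend/ZWJ: nonspacing and enclosing marks plus ZWJ.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me")


def is_control_like(ch):
    # Approximation of CR/LF/Control: control chars and line/paragraph separators.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def approx_graphemes(text):
    """Yield approximate grapheme clusters of `text` (simplified rules only)."""
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == "\r" and cur == "\n":
            continue  # GB3: keep CR LF together
        if not (is_control_like(prev) or is_control_like(cur)) and is_extend_like(cur):
            continue  # GB9: do not break before Extend or ZWJ
        # GB4/GB5 (controls) and the default rule: break here.
        yield text[start:i]
        start = i
    if start < len(text):
        yield text[start:]


print(list(approx_graphemes("e\u0301a\r\nb")))
```

Hangul syllable sequences, regional indicators, emoji ZWJ sequences, and the other extended grapheme cluster rules are deliberately left out; handling them requires the full property data and rule set.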