@encukou (Member) commented Oct 22, 2025

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

  1. parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
  2. normalizes the name
  3. validates the name, using the id_start/id_continue sets (referred to in previous sections as “letter-like” and “number-like” characters, with a link to the details)

This also means we don't need xid_start/xid_continue to define the behaviour :)
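
The three steps above can be sketched in Python. This is only an illustration under assumptions, not CPython's tokenizer: the regular expression `NAME_CANDIDATE` and the helper `check_name` are hypothetical names, and `str.isidentifier()` stands in for the validation step rather than a hand-written check against the character sets. The real implementation may order the checks differently and can disagree with this sketch on edge cases.

```python
import re
import unicodedata

# Step 1 (assumed simplification): outside strings and comments, treat any
# non-ASCII character as part of a name, alongside ASCII letters, digits, "_".
NAME_CANDIDATE = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][0-9A-Za-z_\u0080-\U0010FFFF]*"
)


def check_name(raw: str) -> str:
    """Return the NFKC-normalized form of ``raw``, or raise SyntaxError."""
    # Step 2: normalize the name to NFKC.
    normalized = unicodedata.normalize("NFKC", raw)
    # Step 3: validate the normalized name (stand-in for the real check).
    if not normalized.isidentifier():
        raise SyntaxError(f"invalid name: {raw!r}")
    return normalized


# Collect candidate names from a line of source, then normalize and validate.
candidates = NAME_CANDIDATE.findall("nᵘₘᵇₑʳ = 3")
print([check_name(name) for name in candidates])
```

Keeping the first step deliberately permissive means the Unicode-specific rules live in one place, which is the separation the description argues for.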


📚 Documentation preview 📚: https://cpython-previews--140464.org.readthedocs.build/

encukou and others added 4 commits October 8, 2025 17:58
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com> Co-authored-by: Blaise Pabon <blaise@gmail.com> Co-authored-by: Micha Albert <info@micha.zone> Co-authored-by: KeithTheEE <kmurrayis@gmail.com>

@willingc (Contributor) left a comment

Outstanding document @encukou. I had one small suggestion to be a bit more explicit on the normalization example with number.

This means that, for example, some typographic variants of characters are
converted to their "basic" form, for example::

>>> nᵘₘᵇₑʳ = 3

Contributor

It would be helpful to add an explicit comment that the normalized form of nᵘₘᵇₑʳ is number.
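
To make the normalized form concrete, here is a small sketch. The `namespace` dictionary and the use of `exec` are illustrative choices, not part of the documentation change. It assumes CPython's behaviour of storing names under their NFKC-normalized form, so the assignment creates the key `number`, while a string lookup with the original spelling finds nothing.

```python
import unicodedata

source = "nᵘₘᵇₑʳ = 3"
raw_name = source.split(" = ")[0]

# Run the assignment in a fresh namespace so the stored key can be inspected.
namespace = {}
exec(source, namespace)

# The compiler normalizes the name; a plain string key is not normalized.
print(unicodedata.normalize("NFKC", raw_name))
print("number" in namespace)
print(raw_name in namespace)
```

The same applies to any string-keyed lookup of a name: the string has to be normalized by the caller before it will match.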

Member, Author

Does this look good?

@encukou (Member, Author)

There was an insightful conversation in #140269. I'll update this PR to make things even clearer.

@willingc (Contributor) left a comment

Thanks @encukou

@encukou marked this pull request as ready for review November 19, 2025 16:08

@willingc (Contributor) left a comment

Nice work @encukou!

@encukou (Member, Author)

Thank you for the review!

@malemburg, do you also want to take a look?

@encukou merged commit 2ff8608 into python:main Nov 26, 2025
36 checks passed
@encukou deleted the lex-analysis-names-simpler branch November 26, 2025 15:10
@github-project-automation (bot) moved this from Todo to Done in Docs PRs Nov 26, 2025
@encukou added the needs backport to 3.14 (bugs and security fixes) label Nov 26, 2025

@miss-islington-app

Thanks @encukou for the PR 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@miss-islington-app

Sorry, @encukou, I could not cleanly backport this to 3.14 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 2ff8608b4da33f667960e5099a1a442197acaea4 3.14 

@bedevere-app

GH-142015 is a backport of this pull request to the 3.14 branch.

@bedevere-app (bot) removed the needs backport to 3.14 (bugs and security fixes) label Nov 27, 2025
StanFromIreland added a commit to StanFromIreland/cpython that referenced this pull request Nov 27, 2025

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

(cherry picked from commit 2ff8608)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>

encukou added a commit that referenced this pull request Dec 3, 2025

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

(cherry picked from commit 2ff8608)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>

StanFromIreland added a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>


Labels

docs (Documentation in the Doc dir), skip news

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Docs: note requirement to normalise unicode identifiers passed to globals() and locals()

2 participants

@encukou, @willingc