gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser #137837
serhiy-storchaka commented Aug 15, 2025 • edited by github-actions bot
…arser
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
bb7b873 to 2153a4c
Doc/library/html.parser.rst Outdated
    Create a parser instance able to parse invalid markup.
    If *convert_charrefs* is ``True`` (the default), all character
    references (except the ones in ``script``/``style`` elements) are
This should be updated now that the list has been expanded.
It might be easier to have a short section about parsing modes, listing each mode, which elements trigger it, whether charrefs are converted or not, and when the state is terminated.
Here we could then say

    - references (except the ones in ``script``/``style`` elements) are
    + references (except the ones in RAWTEXT tags) are

with RAWTEXT linking to that section.
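As a rough illustration of what such a "parsing modes" section might tabulate, here is a minimal sketch. The names `ParsingMode`, `MODES` and `mode_for` are hypothetical and are not part of `html.parser`. The element lists combine the ones named in this PR with `script`, `style`, `textarea` and `title`, which is an assumption based on the HTML standard's tokenizer states (the standard gives `script` its own "script data" state; it is grouped with RAWTEXT here only for brevity).

```python
from typing import NamedTuple, Optional


class ParsingMode(NamedTuple):
    """Hypothetical summary record for one text-parsing mode."""
    elements: frozenset            # tags that switch the parser into this mode
    converts_charrefs: bool        # are character references decoded?
    terminated_by_end_tag: bool    # False means "runs to the end of input"


# Assumed mapping, for illustration only.
MODES = {
    "RCDATA": ParsingMode(frozenset({"textarea", "title"}), True, True),
    "RAWTEXT": ParsingMode(
        frozenset({"script", "style", "xmp", "iframe", "noembed", "noframes"}),
        False, True),
    "PLAINTEXT": ParsingMode(frozenset({"plaintext"}), False, False),
}


def mode_for(tag: str, scripting: bool = False) -> Optional[str]:
    """Return the name of the mode triggered by *tag*, or None."""
    tag = tag.lower()
    if tag == "noscript":
        # "noscript" is RAWTEXT only when scripting is enabled.
        return "RAWTEXT" if scripting else None
    for name, mode in MODES.items():
        if tag in mode.elements:
            return name
    return None
```

For example, `mode_for("noscript")` returns `None` while `mode_for("noscript", scripting=True)` returns `"RAWTEXT"`.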
Do we need to document this here? This is a part of the HTML5 specification. What will the user get from this information?
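One user-visible effect of these modes is easy to observe with the existing `style` handling. The sketch below uses only the documented `HTMLParser` API; the `DataCollector` class and `collect_text` helper are hypothetical names introduced for this example. It compares the text reported for an ordinary element with the text reported for a `style` element.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Collects every chunk passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(markup):
    """Parse *markup* and return all text data joined into one string."""
    parser = DataCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# In an ordinary element the character reference is converted ...
normal = collect_text("<p>a &amp; b</p>")        # "a & b"
# ... but inside "style" the text is passed through unchanged.
raw = collect_text("<style>a &amp; b</style>")   # "a &amp; b"
```

The elements added by this PR would be expected to behave like `style` in this respect, but that depends on the Python version in use, so the example sticks to `style`.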
Lib/html/parser.py Outdated
        self.set_cdata_mode(tag)
    elif tag == "plaintext":
        self.set_cdata_mode(tag)
        self.interesting = re.compile(r'\z')
I think it would be better to move this into set_cdata_mode by adding a third branch to the if/else that sets self.interesting.
I considered this option. But should we repeat the condition tag == "plaintext" in two places, or add "plaintext" to CDATA_CONTENT_ELEMENTS or RCDATA_CONTENT_ELEMENTS? In any case we would need to repeat "plaintext" twice. It could also create asymmetry with "noscript" if the special cases are handled in different places. So I arrived at the current code.
Another option is to use the special value escapable=None to switch to PLAINTEXT mode.
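For readers unfamiliar with the `r'\z'` pattern in the snippet above: it matches only at the very end of the input, so in PLAINTEXT mode nothing in the remaining text is ever "interesting", including something that looks like a closing tag. The sketch below demonstrates this. As far as I know, `\z` is accepted by the `re` module only from Python 3.14 on, so the example falls back to the long-standing `\Z`, which has the same meaning in Python; `END_OF_INPUT` is a name made up for this example.

```python
import re

try:
    # Accepted on newer Python versions (assumption: 3.14 and later).
    END_OF_INPUT = re.compile(r'\z')
except re.error:
    # Older versions reject "\z" as a bad escape; "\Z" is equivalent.
    END_OF_INPUT = re.compile(r'\Z')

text = 'anything <b>goes</b> &amp; </plaintext> here'
match = END_OF_INPUT.search(text)

# The only match is the empty string at the end of the input, so every
# character before it would be treated as plain data.
assert match is not None
assert match.start() == len(text)
```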
ezio-melotti commented Oct 24, 2025
This PR seems to address 3 issues:
The difference between states is the following:
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Lib/html/parser.py Outdated
    if tag in self.CDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag)
    elif tag in self.RCDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag, escapable=True)
    elif self.scripting and tag == "noscript":
        self.set_cdata_mode(tag)
    elif tag == "plaintext":
        self.set_cdata_mode(tag, escapable=None)
I don't much like (ab)using escapable=None for PLAINTEXT mode.
Currently the set_cdata_mode function does two things:
- determines where the closing tag/end is, which depends on the value of tag passed;
- determines whether charrefs are converted, which depends on the value passed to escapable.
Even though there is some duplication, I would prefer something like this:
    if (tag in self.CDATA_CONTENT_ELEMENTS or
            (self.scripting and tag == "noscript") or
            tag == "plaintext"):
        self.set_cdata_mode(tag, escapable=False)
    elif tag in self.RCDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag, escapable=True)

This makes clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.
Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
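To make the idea of choosing the "interesting" pattern from the tag alone concrete, here is a minimal standalone sketch. It is not the code that was merged: `interesting_for` is a hypothetical helper, the closing-tag regex is an assumption modelled on how a closing tag is usually recognized, and the `escapable` side (charref conversion) is deliberately left out because it is independent of where the mode ends.

```python
import re


def interesting_for(tag):
    """Return a pattern locating the end of the text mode opened by *tag*.

    Hypothetical helper for illustration; not part of html.parser.
    """
    tag = tag.lower()
    if tag == "plaintext":
        # PLAINTEXT never ends: only the end of the input is interesting.
        return re.compile(r'\Z')
    # Otherwise the mode ends at the matching closing tag, i.e. "</tag"
    # followed by whitespace, "/" or ">".
    return re.compile(r'</%s(?=[\t\n\r\f />])' % re.escape(tag),
                      re.IGNORECASE | re.ASCII)


xmp_end = interesting_for("xmp").search("a </b> </XMP>")
plain = "x </plaintext> y"
plaintext_end = interesting_for("plaintext").search(plain)
```

With this split, the caller decides only whether charrefs are converted, and the pattern choice makes it visible what counts as interesting for each tag.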
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
…hon into htmlparser-rawtext
serhiy-storchaka commented Oct 31, 2025
Thank you for your review @ezio-melotti.
a17c57e into python:main
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
…arser (pythonGH-137837)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
GH-140841 is a backport of this pull request to the 3.14 branch.
GH-140842 is a backport of this pull request to the 3.13 branch.
serhiy-storchaka commented Oct 31, 2025
Backporting to older Python versions should be from 3.13.
…n HTMLParser (pythonGH-137837) (pythonGH-140842)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)
Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
…arser (pythonGH-137837)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
📚 Documentation preview 📚: https://cpython-previews--137837.org.readthedocs.build/