Skip to content

Conversation

@vstinner
Copy link
Member

@vstinnervstinner commented Feb 8, 2022

  • Convert unicodeobject.h macros to static inline functions.

  • Reorder functions to declare functions before their first usage.

  • Add "kind" variable to PyUnicode_READ_CHAR() and
    PyUnicode_MAX_CHAR_VALUE() functions to only call PyUnicode_KIND()
    once.

  • PyUnicode_KIND() now returns an "enum PyUnicode_Kind".

  • Simplify PyUnicode_GET_SIZE().

  • Add assertions to PyUnicode_WRITE() on the max value.

  • Add cast macros:

    • _PyASCIIObject_CAST()
    • _PyCompactUnicodeObject_CAST()
    • _PyUnicodeObject_CAST()
  • The following functions are now declared as deprecated using
    Py_DEPRECATED(3.3):

    • PyUnicode_GET_SIZE()
    • PyUnicode_GET_DATA_SIZE()
    • PyUnicode_AS_UNICODE()
    • PyUnicode_AS_DATA()
    • The implementation of these functions disable deprecation
      warnings in their body.
  • PyUnicode_READ_CHAR() now uses PyUnicode_1BYTE_DATA(),
    PyUnicode_2BYTE_DATA() and PyUnicode_4BYTE_DATA().

  • Replace "const PyObject*" with "PyObject*" in _decimal.c
    and pystrhex.c: PyUnicode_READY() can modify the object.

  • Replace "const void *data" with "void *data" in some unicodedata.c
    and unicodeobject.c functions which use PyUnicode_WRITE(): data is
    used to modify the string.

https://bugs.python.org/issue45490

* Convert unicodeobject.h macros to static inline functions. * Reorder functions to declare functions before their first usage. * Add "kind" variable to PyUnicode_READ_CHAR() and PyUnicode_MAX_CHAR_VALUE() functions to only call PyUnicode_KIND() once. * PyUnicode_KIND() now returns an "enum PyUnicode_Kind". * Simplify PyUnicode_GET_SIZE(). * Add assertions to PyUnicode_WRITE() on the max value. * Add cast macros: * _PyASCIIObject_CAST() * _PyCompactUnicodeObject_CAST() * _PyUnicodeObject_CAST() * The following functions are now declared as deprecated using Py_DEPRECATED(3.3): * PyUnicode_GET_SIZE() * PyUnicode_GET_DATA_SIZE() * PyUnicode_AS_UNICODE() * PyUnicode_AS_DATA() * The implementation of these functions disable deprecation warnings in their body. * PyUnicode_READ_CHAR() now uses PyUnicode_1BYTE_DATA(), PyUnicode_2BYTE_DATA() and PyUnicode_4BYTE_DATA(). * Replace "const PyObject*" with "PyObject*" in _decimal.c and pystrhex.c: PyUnicode_READY() can modify the object. * Replace "const void *data" with "void *data" in some unicodedata.c and unicodeobject.c functions which use PyUnicode_WRITE(): data is used to modify the string.
@vstinner
Copy link
MemberAuthor

@erlend-aasland: All in one PR to convert (almost) all macros of Include/cpython/unicodeobject.h.

I created a single PR to show what can be done with PEP 670, but IMO it will be better to split this large PR into smaller PRs to ease review, and apply (minor) API changes / cleanup in following PRs (not do everything at once).

@erlend-aasland
Copy link
Contributor

I created a single PR to show what can be done with PEP 670, but IMO it will be better to split this large PR into smaller PRs to ease review, and apply (minor) API changes / cleanup in following PRs (not do everything at once).

Sounds good.

@erlend-aasland
Copy link
Contributor

IMO, this is a great improvement when it comes to readability/maintainability.

Return NULL if malloc fails and an empty string if invalid characters
are found. */
staticchar*
numeric_as_ascii(constPyObject*u, intstrip_ws, intignore_underscores)
Copy link
Contributor

@erlend-aaslanderlend-aaslandFeb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to remove const? Ditto for the rest of the PR.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "u" string is not ready, PyUnicode_READY() will modify it. It's not a read-only operation.

In Python 3.12, PyUnicode_WCHAR_KIND will be removed: https://www.python.org/dev/peps/pep-0623/

In the meanwhile, I prefer to not stop lying: we do modify the object :-)

Copy link
Member

@gpsheadgpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a whole review, just dropping some notes.

return_PyASCIIObject_CAST(op)->length;
}

/* In the access macros below, "kind" may be evaluated more than once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably update comments like these to just mention that they used to be macros with these caveats in previous python versions.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks, I didn't look at comments at all. I laser focused on macros code and make sure that I don't change the code :-)

if (kind == PyUnicode_1BYTE_KIND){
returnPyUnicode_1BYTE_DATA(unicode)[index];
}
elseif (PyUnicode_KIND(unicode) == PyUnicode_2BYTE_KIND){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a function this no longer needs to be called twice. (and the comment above becomes less true)

/* Use only if you know it's a string */
#definePyUnicode_CHECK_INTERNED(op) \
(((PyASCIIObject *)(op))->state.interned)
staticinlineintPyUnicode_CHECK_INTERNED(PyObject *op){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid the cast going away, consider doing what Py_INCREF did & add indirection through a macro for the cast:

#define PyUnicode_CHECK_INTERNED(op) \ _PyUnicode_CHECK_INTERNED(_PyASCIIObject_CAST(op)) 

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SC asked to not add such macro :-) You're right that without such macro, there is a risk of introducing new compiler warnings.

If possible I would prefer to keep PyObject* for functions in unicodeobject.c, since it's the type used for arguments in existing functions and the type returned by functions creating strings like PyUnicode_New(), PyUnicode_FromString(), etc.

Maybe for this specific header file, we can avoid casts.

For me, PyASCIIObject is an implementation detail which should be hidden. If possible, it should even be moved to the internal C API, but that's a way larger topic which may require a PEP ;-)

@vstinner
Copy link
MemberAuthor

The PyUnicode_KIND() now returns an "enum PyUnicode_Kind" change adds new warnings:

comparison of integer expressions of different signedness: ‘enum PyUnicode_Kind’ and ‘int’ [-Wsign-compare]

Changing PyUnicode_KIND() should be done in separated PR. I'm not sure what's the best return type for that. I would prefer to not add new compiler warnings!

@vstinner
Copy link
MemberAuthor

Macros not casting their arguments:

  • PyUnicode_1BYTE_DATA()
  • PyUnicode_2BYTE_DATA()
  • PyUnicode_4BYTE_DATA()
  • PyUnicode_AS_DATA()
  • PyUnicode_DATA()
  • PyUnicode_GET_DATA_SIZE()
  • PyUnicode_MAX_CHAR_VALUE()
  • PyUnicode_READ()
  • PyUnicode_READ_CHAR()
  • PyUnicode_WRITE()
  • _PyUnicodeWriter_Prepare()
  • _PyUnicodeWriter_PrepareKind()

Macro casting its argument to PyObject*:

  • PyUnicode_READY()

Macros using a cast in their implementation:

  • Cast to PyASCIIObject* (and sometimes to other types):

    • PyUnicode_AS_UNICODE()
    • PyUnicode_CHECK_INTERNED()
    • PyUnicode_GET_LENGTH()
    • PyUnicode_GET_SIZE()
    • PyUnicode_IS_ASCII()
    • PyUnicode_IS_COMPACT()
    • PyUnicode_IS_COMPACT_ASCII()
    • PyUnicode_IS_READY()
    • PyUnicode_KIND()
    • _PyUnicode_COMPACT_DATA()
  • Cast to PyUnicodeObject*:

    • _PyUnicode_NONCOMPACT_DATA()

The majority of macros use PyObject* for its Python str object parameter.

PyUnicode_READ() and PyUnicode_WRITE() expect (kind, data, index) and (kind, data, index, value) arguments: no Python object.

@vstinnervstinner deleted the unicode_static_inline branch February 23, 2022 23:30
@vstinner
Copy link
MemberAuthor

This PR was an example. I updated PEP 670 from the discussion on this PR. If PEP 670 is accepted, I will rewrite this PR with smaller changes to ease the review.

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

@vstinner@erlend-aasland@gpshead@the-knights-who-say-ni@bedevere-bot