[C API] Add an efficient public PyUnicodeWriter API #119182

@vstinner

Feature or enhancement

Creating a Python string object efficiently from C is complicated. Python has a private _PyUnicodeWriter API for this. It is used by these projects:

Affected projects (5):

  • Cython (3.0.9)
  • asyncpg (0.29.0)
  • catboost (1.2.3)
  • frozendict (2.4.0)
  • immutables (0.20)

I propose making the API public to promote it and to help maintainers of C extensions write more efficient code for creating Python string objects.

API:

    typedef struct PyUnicodeWriter PyUnicodeWriter;

    PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
    PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);
    PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

    PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
        PyUnicodeWriter *writer,
        int overallocate);

    PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
        PyUnicodeWriter *writer,
        Py_UCS4 ch);
    PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
        PyUnicodeWriter *writer,
        const char *str,  // decoded from UTF-8
        Py_ssize_t len);  // use strlen() if len < 0
    PyAPI_FUNC(int) PyUnicodeWriter_Format(
        PyUnicodeWriter *writer,
        const char *format,
        ...);

    // Write str(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write repr(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write str[start:end]
    PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
        PyUnicodeWriter *writer,
        PyObject *str,
        Py_ssize_t start,
        Py_ssize_t end);

The internal writer buffer is overallocated by default. PyUnicodeWriter_Finish() truncates the buffer to the exact size if the buffer was overallocated.

Overallocation avoids the quadratic cost of reallocating and copying the buffer on every write when adding short strings in a loop. Use PyUnicodeWriter_SetOverallocate(writer, 0) to disable overallocation just before the last write.
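To illustrate why overallocation matters, here is a minimal sketch of a growable byte buffer in plain C. It is a toy model, not CPython's implementation: the toy_writer type, the toy_* functions, and the 25% headroom factor are all hypothetical names and values chosen for this example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy growable buffer (hypothetical; NOT the CPython writer). */
typedef struct {
    char *data;
    size_t size;       /* bytes in use */
    size_t capacity;   /* bytes allocated */
    int overallocate;  /* nonzero: allocate extra headroom when growing */
    size_t reallocs;   /* number of realloc() calls, kept for illustration */
} toy_writer;

void toy_init(toy_writer *w, int overallocate)
{
    w->data = NULL;
    w->size = 0;
    w->capacity = 0;
    w->overallocate = overallocate;
    w->reallocs = 0;
}

/* Append len bytes; return 0 on success, -1 on allocation failure. */
int toy_write(toy_writer *w, const char *s, size_t len)
{
    assert(w != NULL && s != NULL);
    size_t needed = w->size + len;
    if (needed > w->capacity) {
        size_t newcap = needed;
        if (w->overallocate) {
            newcap += newcap / 4;  /* 25% headroom: an arbitrary choice */
        }
        char *p = realloc(w->data, newcap);
        if (p == NULL) {
            return -1;
        }
        w->data = p;
        w->capacity = newcap;
        w->reallocs++;
    }
    memcpy(w->data + w->size, s, len);
    w->size = needed;
    return 0;
}

void toy_free(toy_writer *w)
{
    free(w->data);
    w->data = NULL;
    w->size = 0;
    w->capacity = 0;
}

/* Append "ab" n times and report how many reallocations were needed. */
size_t toy_count_reallocs(size_t n, int overallocate)
{
    toy_writer w;
    toy_init(&w, overallocate);
    for (size_t i = 0; i < n; i++) {
        if (toy_write(&w, "ab", 2) < 0) {
            break;
        }
    }
    size_t count = w.reallocs;
    toy_free(&w);
    return count;
}
```

In this model, exact-size growth reallocates on every write, while geometric growth needs only a number of reallocations that grows logarithmically with the final size. The price is unused capacity at the end, which is why a final truncation step is useful.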

The writer takes care of the internal buffer kind: Py_UCS1 (latin1), Py_UCS2 (BMP) or Py_UCS4 (full Unicode Character Set). It also implements an optimization if a single write is made using PyUnicodeWriter_WriteStr(): it returns the string unchanged without any copy.
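The choice of buffer kind can be sketched in plain C as picking the narrowest character width that holds the largest code point written so far. This is an illustrative assumption about the selection rule, not code from CPython; toy_kind_for and toy_kind_for_text are hypothetical helpers that return the width in bytes (1, 2 or 4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per character of the narrowest kind that can store max_char:
   1 for latin1 (UCS1), 2 for the BMP (UCS2), 4 for everything else (UCS4). */
int toy_kind_for(uint32_t max_char)
{
    assert(max_char <= 0x10FFFF);  /* largest valid Unicode code point */
    if (max_char < 0x100) {
        return 1;
    }
    if (max_char < 0x10000) {
        return 2;
    }
    return 4;
}

/* Kind needed to store a whole sequence of code points. */
int toy_kind_for_text(const uint32_t *text, size_t len)
{
    uint32_t max_char = 0;
    for (size_t i = 0; i < len; i++) {
        if (text[i] > max_char) {
            max_char = text[i];
        }
    }
    return toy_kind_for(max_char);
}
```

A writer built this way would start with the narrowest kind and widen the buffer only when a write contains a code point that does not fit, so pure ASCII or latin1 output never pays for 4-byte characters.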


Example of usage (simplified code from Objects/unionobject.c):

    static PyObject *
    union_repr(PyObject *self)
    {
        unionobject *alias = (unionobject *)self;
        Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

        PyUnicodeWriter *writer = PyUnicodeWriter_Create();
        if (writer == NULL) {
            return NULL;
        }

        for (Py_ssize_t i = 0; i < len; i++) {
            // Separate items with " | "
            if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
                goto error;
            }
            PyObject *p = PyTuple_GET_ITEM(alias->args, i);
            if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
                goto error;
            }
        }
        return PyUnicodeWriter_Finish(writer);

    error:
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
