gh-125196: Use PyUnicodeWriter in error handlers #125262

vstinner · 2024-10-10T14:00:24Z

PyCodec_ReplaceErrors() and PyCodec_BackslashReplaceErrors() now use the public PyUnicodeWriter API.

Issue: Use the public PyUnicodeWriter API #125196


rruuaanng · 2024-10-10T14:10:47Z

Python/codecs.c

 }


+static inline int


Wouldn't it be better to move this inline function into the header file? That way it can omit the declaration in the header file and integrate well. It looks like a public API.

In fact, I implemented #125201 which should replace this static inline function.

vstinner · 2024-10-10T14:49:50Z

Benchmark:

    import codecs
    import pyperf

    runner = pyperf.Runner()
    replace = codecs.lookup_error('replace')
    backslashreplace = codecs.lookup_error('backslashreplace')

    for LEN in (1, 1000):
        text = "\xff" * LEN
        exc = UnicodeEncodeError("ascii", text, 0, len(text), "reason")
        runner.bench_func(f"replace encode len={LEN}", replace, exc)

        text = "\xff" * LEN
        exc = UnicodeTranslateError(text, 0, len(text), "reason")
        runner.bench_func(f"replace translate len={LEN}", replace, exc)

        data = b"\x20\x80\xff" * LEN
        exc = UnicodeDecodeError("ascii", data, 0, len(data), "reason")
        runner.bench_func(f"backslashreplace decode len={LEN}", backslashreplace, exc)

        text = "\x20\xff\u20ac\U0010ffff" * LEN
        exc = UnicodeEncodeError("ascii", text, 0, len(text), "reason")
        runner.bench_func(f"backslashreplace encode len={LEN}", backslashreplace, exc)

        text = "\x20\xff\u20ac\U0010ffff" * LEN
        exc = UnicodeTranslateError(text, 0, len(text), "reason")
        runner.bench_func(f"backslashreplace translate len={LEN}", backslashreplace, exc)

Results, Python built with gcc -O3, CPU isolation:

    +-------------------------------------+---------+------------------------+
    | Benchmark                           | ref     | change                 |
    +=====================================+=========+========================+
    | replace encode len=1                | 92.9 ns | 110 ns: 1.19x slower   |
    +-------------------------------------+---------+------------------------+
    | replace translate len=1             | 103 ns  | 138 ns: 1.34x slower   |
    +-------------------------------------+---------+------------------------+
    | backslashreplace decode len=1       | 95.6 ns | 140 ns: 1.46x slower   |
    +-------------------------------------+---------+------------------------+
    | backslashreplace encode len=1       | 120 ns  | 166 ns: 1.37x slower   |
    +-------------------------------------+---------+------------------------+
    | backslashreplace translate len=1    | 126 ns  | 169 ns: 1.35x slower   |
    +-------------------------------------+---------+------------------------+
    | replace encode len=1000             | 157 ns  | 2.09 us: 13.31x slower |
    +-------------------------------------+---------+------------------------+
    | replace translate len=1000          | 178 ns  | 2.23 us: 12.51x slower |
    +-------------------------------------+---------+------------------------+
    | backslashreplace decode len=1000    | 3.03 us | 27.2 us: 8.98x slower  |
    +-------------------------------------+---------+------------------------+
    | backslashreplace encode len=1000    | 14.4 us | 46.3 us: 3.22x slower  |
    +-------------------------------------+---------+------------------------+
    | backslashreplace translate len=1000 | 14.4 us | 45.9 us: 3.18x slower  |
    +-------------------------------------+---------+------------------------+
    | Geometric mean                      | (ref)   | 3.03x slower           |
    +-------------------------------------+---------+------------------------+

It's way slower :-(
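As a quick sanity check on the summary row, the geometric mean can be recomputed from the per-benchmark slowdown factors reported above. This is only a sketch using the rounded ratios as printed; pyperf presumably works from the unrounded timings, so small differences in later decimals are expected.

```python
from statistics import geometric_mean

# Slowdown factors copied from the "change" column of the results above
# (rounded values as printed, in table order).
slowdowns = [1.19, 1.34, 1.46, 1.37, 1.35, 13.31, 12.51, 8.98, 3.22, 3.18]

# The geometric mean is the n-th root of the product of the ratios.
gm = geometric_mean(slowdowns)
print(f"{gm:.2f}x slower")  # prints "3.03x slower"
```

The len=1000 rows dominate the mean: the five len=1 ratios all sit between 1.19x and 1.46x.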

This PR has a naive PyUnicodeWriter_Fill() implementation calling PyUnicodeWriter_WriteChar() in a loop. I wrote PR gh-125201 which is way more efficient:

    +-------------------------------------+---------+------------------------+-----------------------+
    | Benchmark                           | ref     | change                 | fill                  |
    +=====================================+=========+========================+=======================+
    | replace encode len=1000             | 157 ns  | 2.09 us: 13.31x slower | 176 ns: 1.12x slower  |
    +-------------------------------------+---------+------------------------+-----------------------+
    | replace translate len=1000          | 178 ns  | 2.23 us: 12.51x slower | 276 ns: 1.55x slower  |
    +-------------------------------------+---------+------------------------+-----------------------+
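To illustrate why a fill built on per-character writes costs so much more at len=1000, here is a rough Python analogy. `ToyWriter` and its methods are hypothetical names invented for this sketch; it is not the PyUnicodeWriter API and not the C code from either PR. It only contrasts one call per character with a single bulk write.

```python
class ToyWriter:
    """Hypothetical stand-in for a string writer; not the CPython API."""

    def __init__(self):
        self.buf = []    # collected characters
        self.calls = 0   # number of write_char() calls, for illustration

    def write_char(self, ch):
        self.calls += 1
        self.buf.append(ch)

    def fill_naive(self, ch, n):
        # One write_char() call per repeated character.
        for _ in range(n):
            self.write_char(ch)

    def fill_bulk(self, ch, n):
        # A single bulk extension of the buffer.
        self.buf.extend(ch * n)

    def finish(self):
        return "".join(self.buf)


naive = ToyWriter()
naive.fill_naive("?", 1000)

bulk = ToyWriter()
bulk.fill_bulk("?", 1000)

# Both strategies build the same string; only the call count differs.
assert naive.finish() == bulk.finish() == "?" * 1000
print(naive.calls, bulk.calls)  # prints "1000 0"
```

Both variants produce identical output, so the difference shows up only in the amount of per-character work, which is consistent with the gap between the "change" and "fill" columns.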

vstinner · 2024-10-10T14:51:42Z

For the benchmark, I had to call the error handler functions directly. Using the ASCII codec doesn't work (it doesn't measure my change), since Objects/unicodeobject.c has a fast path for the most common error handlers, such as replace and backslashreplace.
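A small sketch of what calling a handler directly looks like, using only the documented `codecs.lookup_error()` API. The handler takes the exception object and returns a `(replacement, resume_position)` tuple. The input strings are illustrative and not taken from the benchmark.

```python
import codecs

replace = codecs.lookup_error("replace")
backslashreplace = codecs.lookup_error("backslashreplace")

# Direct call: the handler receives the exception and returns
# (replacement string, position at which to resume).
text = "\xff\xff"
exc = UnicodeEncodeError("ascii", text, 0, len(text), "reason")
enc_result = replace(exc)
print(enc_result)  # ('??', 2)

# Decoding errors carry a bytes object; only byte 1 (0x80) is flagged here.
data = b"\x20\x80"
exc = UnicodeDecodeError("ascii", data, 1, 2, "reason")
dec_result = backslashreplace(exc)
print(dec_result)  # ('\\x80', 2)

# Going through the codec gives the same replacement text, but per the
# comment above it is served by a fast path in Objects/unicodeobject.c,
# so it would not exercise the handler function being benchmarked.
via_codec = "\xff\xff".encode("ascii", "replace")
print(via_codec)  # b'??'
```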

vstinner · 2024-10-15T14:37:28Z

Let's keep the private API for now, since it's faster.

pythongh-125196: Use PyUnicodeWriter in error handlers
e07ceda
PyCodec_ReplaceErrors() and PyCodec_BackslashReplaceErrors() now use the public PyUnicodeWriter API.

vstinner added the skip news label Oct 10, 2024

bedevere-appbot mentioned this pull request Oct 10, 2024
Use the public PyUnicodeWriter API #125196
Closed

bedevere-appbot added the awaiting core review label Oct 10, 2024

rruuaanng reviewed Oct 10, 2024

vstinner marked this pull request as draft October 10, 2024 14:50

bedevere-appbot removed the awaiting core review label Oct 10, 2024

vstinner mentioned this pull request Oct 10, 2024
gh-125196: Add PyUnicodeWriter_Fill() function #125201
Closed

vstinner closed this Oct 15, 2024

vstinner deleted the writer_codecs branch October 15, 2024 14:37
