gh-107369: optimize textwrap.indent() #107374

methane · 2023-07-28T05:58:20Z

This makes indent()-ing Objects/unicodeobject.c (15332 lines) about 25% faster.

Issue: Optimize textwrap.indent() #107369

eendebakpt

Looks good! Using str.strip for the predicate instead of line.strip() might change behavior for input that is not a str, but I think this is OK.

serhiy-storchaka

lstrip is faster for non-indented lines.

I wonder whether the following variants could be faster for some inputs, and for how wide a category of input.

    def predicate(line):
        return line and (not line[0].isspace() or line.lstrip())

or

    predicate = re.compile(r'\S').search
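To make the comparison concrete, here is a minimal sketch that checks the candidate predicates agree on truthiness for a few sample lines. The function names and the sample list are illustrative assumptions, not code from the PR.

```python
import re


def strip_predicate(line):
    # Builds a stripped copy; truthy if anything remains.
    return bool(line.strip())


def lstrip_predicate(line):
    # Short-circuits when the first character is already non-space,
    # so non-indented lines never build a new string.
    return bool(line and (not line[0].isspace() or line.lstrip()))


# Compiled once at module level, as in the suggested regex variant.
_has_nonspace = re.compile(r"\S").search


def regex_predicate(line):
    return _has_nonspace(line) is not None


# Hypothetical sample lines: empty, blank, indented, and non-indented.
samples = ["", "\n", "   \n", "\t \r\n", "foo\n", "    foo\n", "foo"]

for s in samples:
    assert strip_predicate(s) == lstrip_predicate(s) == regex_predicate(s)
```

The variants differ only in cost: the short-circuit helps for non-indented lines, while the regex pays call overhead on every line.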

methane · 2023-07-28T16:51:29Z

_has_nonspace = re.compile(r'\S').search as a module-level global, with predicate = _has_nonspace: 3.5ms
str.rstrip: 1.95ms
str.lstrip: 2.03ms
lambda x: not x.isspace(): 2.07ms

Since we use splitlines(keepends=True), we can use just not x.isspace(), because the result is guaranteed to contain no empty line ("".splitlines(keepends=True) == [] and "foo\n".splitlines(True) == ['foo\n']).
But it is a bit tricky and carries a relatively high cognitive load.
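The guarantee can be checked directly. The following sketch (the sample text and the helper name nonblank are assumptions for illustration) shows that str.splitlines(keepends=True) never yields an empty string, so not line.isspace() classifies every line it produces correctly.

```python
# An empty text yields no lines at all, not a single empty line.
assert "".splitlines(keepends=True) == []
assert "foo\n".splitlines(keepends=True) == ["foo\n"]


def nonblank(line):
    # Safe only because splitlines(True) never produces "".
    return not line.isspace()


text = "a\n\n  \nb"
lines = text.splitlines(keepends=True)

# Blank lines keep their line ending, so they are "\n", never "".
assert all(lines)

flags = [nonblank(line) for line in lines]
print(list(zip(lines, flags)))
```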

In the case of unicodeobject.c, rstrip is a bit faster, but that may be because most lines are already indented.

So I chose str.lstrip here, as Serhiy suggested.

serhiy-storchaka · 2023-07-28T18:52:30Z

Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

We want to test whether the line has any non-space character. bool(line.strip()) is actually a tricky way: we strip the spaces from the line, and if the rest is not an empty string, then the original line has non-space characters too. not line.isspace() is a straightforward way: it asks the opposite question (does the line contain only space characters?) and negates the result.

Algorithmically, isspace() looks preferable, because it does not create a string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

Lib/textwrap.py

methane · 2023-07-29T02:18:13Z

> Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

Because "".isspace() is False, so not "".isspace() is True. We need to guarantee that "" is never passed here.
x and not x.isspace() would be a bit more obvious, but a little slower.
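The edge case is easy to reproduce. In this small sketch the names naive and guarded are hypothetical, chosen only to contrast the two predicates on an empty string.

```python
def naive(line):
    # "".isspace() is False, so the negation is True for an empty string.
    return not line.isspace()


def guarded(line):
    # The extra truthiness check rejects "" before isspace() is consulted.
    return bool(line) and not line.isspace()


assert naive("") is True       # an empty string would wrongly get a prefix
assert guarded("") is False
assert bool("".strip()) is False  # the strip-based predicate also rejects ""

# For non-empty lines the two predicates agree.
for line in ["\n", "  \n", "foo\n", "  foo\n"]:
    assert naive(line) == guarded(line)
```

The unguarded form is therefore only correct when the caller can rule out empty strings, which is what splitlines(True) provides.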

> Algorithmically, isspace() looks preferable, because it does not create a string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

lstrip() is slow when every line has a long indent. But Misc/NEWS.d/3.8.0a1.rst has almost no indentation.

With 4c6a46a and https://gist.github.com/methane/5c6153c564d9508199a81c48d33161eb

> ./python.exe bench_indent.py Misc/NEWS.d/3.8.0a1.rst
filename='Misc/NEWS.d/3.8.0a1.rst' 8978 lines.
lstrip: 0.736msec
not x.isspace(): 0.877msec
x and not x.isspace(): 0.929msec

> ./python.exe bench_indent.py Objects/unicodeobject.c
filename='Objects/unicodeobject.c' 15332 lines.
lstrip: 1.812msec
not x.isspace(): 1.877msec
x and not x.isspace(): 1.970msec

If I add text = textwrap.indent(text, " "*32) before the benchmark:

> ./python.exe bench_indent.py Objects/unicodeobject.c
filename='Objects/unicodeobject.c' 15332 lines.
lstrip: 2.259msec
not x.isspace(): 2.356msec
x and not x.isspace(): 2.437msec

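For readers who want to repeat this kind of measurement, the following is a rough sketch of a comparable harness. It is not the bench_indent.py script from the gist: the helper names (indent_with, make_text, bench) and the synthetic input are assumptions, and absolute timings will differ by machine and Python build.

```python
import textwrap
import timeit


def indent_with(predicate, text, prefix):
    # Same loop shape as textwrap.indent(), with a pluggable predicate.
    out = []
    for line in text.splitlines(True):
        if predicate(line):
            out.append(prefix)
        out.append(line)
    return "".join(out)


# The three predicates compared in the discussion.
PREDICATES = {
    "lstrip": str.lstrip,
    "not x.isspace()": lambda x: not x.isspace(),
    "x and not x.isspace()": lambda x: x and not x.isspace(),
}


def make_text(indent_width, n_lines=2000):
    # Synthetic input (an assumption): nine indented lines, then a blank one.
    body = " " * indent_width + "some code here;\n"
    return (body * 9 + "\n") * (n_lines // 10)


def bench(text, number=20):
    # Returns the mean time per call in milliseconds for each predicate.
    results = {}
    for name, pred in PREDICATES.items():
        total = timeit.timeit(lambda: indent_with(pred, text, "    "), number=number)
        results[name] = total / number * 1000
    return results


if __name__ == "__main__":
    for width in (0, 32):
        print(f"indent width {width}:")
        for name, msec in bench(make_text(width)).items():
            print(f"  {name}: {msec:.3f}msec")
```

Running it with indent widths 0 and 32 mirrors the two scenarios above: mostly non-indented input, and input where every line has a long indent.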
methane · 2023-07-29T02:46:45Z

To maximize performance, we can stop using a lambda like this:

    if predicate is None:
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)

filename='Objects/unicodeobject.c' 15332 lines.
None: 1.604msec
lstrip: 1.826msec
not x.isspace(): 1.883msec
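As a self-contained illustration, the two-branch loop can be wrapped into a complete function and compared against the standard library. This is a sketch under the assumption that the surrounding function matches the textwrap.indent(text, prefix, predicate=None) signature. It is not the merged code.

```python
import textwrap


def indent(text, prefix, predicate=None):
    """Add prefix to selected lines of text (sketch of the no-lambda variant)."""
    prefixed_lines = []
    if predicate is None:
        # str.splitlines(True) never yields an empty string, so a plain
        # isspace() check is enough to skip whitespace-only lines.
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        # A user-supplied predicate decides which lines get the prefix.
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    return "".join(prefixed_lines)


sample = "a\n\n  b\n"
assert indent(sample, "> ") == textwrap.indent(sample, "> ")
assert indent(sample, "> ", lambda line: True) == textwrap.indent(
    sample, "> ", lambda line: True
)
```

Duplicating the loop trades a little repetition for avoiding one Python-level function call per line in the default case.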

serhiy-storchaka

Thank you for your research, Inada-san. Whether to use lstrip or isspace here, I leave up to you. It does not really matter in most cases.

picnixz · 2023-07-29T12:01:19Z

For very long texts, I think changing

    prefixed_lines = []
    for line in text.splitlines(True):
        if not line.isspace():
            prefixed_lines.append(prefix)
        prefixed_lines.append(line)

into the following may improve overall performance:

    prefixed_lines = []
    append_line = prefixed_lines.append
    for line in text.splitlines(True):
        if not line.isspace():
            append_line(prefix)
        append_line(line)

EDIT: After more careful benchmarking, this does not seem to bring any further improvement. However, not using a lambda function seems to be better.
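Anyone wishing to check this on their own interpreter could use something like the sketch below. The function names and the synthetic sample are assumptions, and no particular outcome is implied: results depend on the Python version and input.

```python
import timeit


def indent_plain(text, prefix):
    # Looks up prefixed_lines.append on every call.
    prefixed_lines = []
    for line in text.splitlines(True):
        if not line.isspace():
            prefixed_lines.append(prefix)
        prefixed_lines.append(line)
    return "".join(prefixed_lines)


def indent_cached_append(text, prefix):
    # Binds the append method once, outside the loop.
    prefixed_lines = []
    append_line = prefixed_lines.append
    for line in text.splitlines(True):
        if not line.isspace():
            append_line(prefix)
        append_line(line)
    return "".join(prefixed_lines)


# Hypothetical input: alternating indented and blank lines.
sample = "    x = 1\n\n" * 5000

# Both variants must produce identical output before timing them.
assert indent_plain(sample, "  ") == indent_cached_append(sample, "  ")

if __name__ == "__main__":
    for fn in (indent_plain, indent_cached_append):
        total = timeit.timeit(lambda: fn(sample, "  "), number=50)
        print(f"{fn.__name__}: {total / 50 * 1000:.3f}msec")
```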

methane added 2 commits July 28, 2023 13:36

optimize textwrap.indent()
94ab051

Add NEWS
8c5896c

bedevere-bot mentioned this pull request Jul 28, 2023
Optimize textwrap.indent() #107369
Closed

bedevere-bot added the awaiting core review label Jul 28, 2023

methane added the performance (Performance or resource usage) and stdlib (Standard Library Python modules in the Lib/ directory) labels Jul 28, 2023

Add what's new entry
6ee731c

eendebakpt approved these changes Jul 28, 2023

serhiy-storchaka reviewed Jul 28, 2023

Use lstrip instead of strip
fad98a2

eendebakpt reviewed Jul 28, 2023

avoid temporary tuple.
4c6a46a

methane added 2 commits July 29, 2023 12:34

use str.isspace instead of lstrip
5e60878

add comment about splitlines(True)
16e3dbd

serhiy-storchaka approved these changes Jul 29, 2023

bedevere-bot added awaiting merge and removed awaiting core review labels Jul 29, 2023

25% -> 30%
734fd01

methane enabled auto-merge (squash) July 29, 2023 06:03

methane merged commit 37551c9 into python:main Jul 29, 2023

methane deleted the opt-textwrap-indent branch July 29, 2023 06:37

bedevere-bot removed the awaiting merge label Jul 29, 2023

This was referenced Jul 29, 2023
Optimize textwrap.indent a bit more. #107424
Closed
gh-107424: avoid using lambda functions in textwrap.indent() #107426
Closed
