gh-129005: Align FileIO.readall between _pyio and _io #129705

cmaloney · 2025-02-05T21:33:24Z

Utilize bytearray.resize() and os.readinto() to reduce copies and match behavior of _io.FileIO.readall().

There is still an extra copy, which means twice the memory is required compared to FileIO, because there currently isn't a zero-copy path from bytearray -> bytes.

On my system, reading a 2GB file with
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v

goes from ~2.7 seconds -> ~2.2 seconds. The C _io implementation takes ~1.2 seconds, so there is still a performance gap, but a smaller one.
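As a rough illustration of the remaining copy described above, the following sketch measures peak traced memory while converting a filled bytearray to bytes. The 8 MiB size is an arbitrary stand-in for file contents, and tracemalloc is used here only as a convenient way to observe allocations; it is not part of the PR.

```python
import tracemalloc

SIZE = 8 * 1024 * 1024  # arbitrary stand-in for the file size (assumption)

tracemalloc.start()
buf = bytearray(SIZE)   # plays the role of the buffer readall() fills
result = bytes(buf)     # allocates a second buffer and copies the data into it
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Both buffers are alive at the same time, so the peak is roughly twice SIZE.
print(f"peak traced memory is about {peak / SIZE:.2f}x the data size")
```

The bytearray and the bytes object hold separate storage, which is why the peak lands near twice the payload.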

Issue: Reduce copies when reading files in pyio, match behavior of _io #129005


vstinner · 2025-02-05T22:17:02Z

Lib/_pyio.py

-result += chunk
-
+bytes_read += n
+result.resize(bytes_read)


result = memoryview(result)[bytes_read:] would avoid a truncation which can imply a memory copy in the worst case, no?

The "shrink" resize in bytearray doesn't actually reallocate unless the buffer's capacity is more than 2x the requested size (https://github.com/python/cpython/blob/main/Objects/bytearrayobject.c#L201-L214). Otherwise it just updates its internal length counter. For cases like a full-file readall with a known size, the buffer should already be just one byte over the right size.
My current plan is to make both bytes(bytearray(10)) and bytearray(b'\0' * 10) avoid a copy (ongoing discussion in https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164). Holding a memoryview would mean there is more than one reference to the bytearray, and I couldn't use that optimization.

Ok, I'm fine with using result.resize() here.
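The shrink behavior described in the reply above can be observed roughly as follows. This is a sketch under assumptions: it relies on CPython's sys.getsizeof() reflecting the allocated capacity of a bytearray, which is an implementation detail, and the `shrink` helper is hypothetical. It falls back to slice deletion because bytearray.resize() was only added in Python 3.14.

```python
import sys


def shrink(buf, size):
    """Hypothetical helper: shrink buf to size bytes."""
    if hasattr(buf, "resize"):
        buf.resize(size)   # bytearray.resize(), Python 3.14+
    else:
        del buf[size:]     # stand-in for older versions


buf = bytearray(1_000_001)       # e.g. known file size + 1
before = sys.getsizeof(buf)

shrink(buf, 1_000_000)           # minor shrink: only one byte over
after_minor = sys.getsizeof(buf)

shrink(buf, 10)                  # major shrink: far below half the capacity
after_major = sys.getsizeof(buf)

# On CPython the minor shrink is expected to leave the allocation alone,
# while the major shrink is expected to release memory. Exact numbers vary.
print(before, after_minor, after_major)
```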

Co-authored-by: Victor Stinner <vstinner@python.org>

cmaloney · 2025-02-06T00:23:25Z

Hypothesis test failure in binascii; pretty sure it's unrelated.

vstinner · 2025-02-06T09:54:07Z

Lib/_pyio.py

-bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
-n = bufsize - len(result)
+if bytes_read >= bufsize:
+# Parallels _io/fileio.c new_buffersize


In the C code, new_buffersize() argument is bytes_read, not bufsize. You may keep new_buffersize() as a private module-level function.

Updated the loop to no longer use bufsize at all. This was the only line that used it, and it feels more Pythonic to me to just use len(result).
That enables rewriting to:
    try:
        # Read until EOF (n == 0)
        while n := os.readinto(self._fd, memoryview(result)[bytes_read:]):
            bytes_read += n
            if bytes_read >= len(result):
                result.resize(_new_buffersize(bytes_read))
    except BlockingIOError:
        if not bytes_read:
            return None

    assert len(result) - bytes_read >= 1, \
        "os.readinto buffer size 0 will result in erroneous EOF / returns 0"
    result.resize(bytes_read)
    return bytes(result)
which feels cleaner, but also starts changing structure relative to _io version.

Decided to refactor to this. The control flow feels a lot simpler and more readable to me than the branches and breaks.


cmaloney · 2025-02-06T22:17:19Z

The Tests / Windows / build and test (Win32) failure is a urllib.error.HTTPError : HTTP error 504: Gateway Timeout [D:\a\cpython\cpython\PCbuild\pythoncore.vcxproj]; I believe it is unrelated.

vstinner

LGTM

vstinner · 2025-02-07T11:06:18Z

Merged, thank you.

bedevere-appbot added the awaiting review label Feb 5, 2025

cmaloney changed the title ~~gh-12005: Align FileIO.readall between _pyio and _io~~ gh-129005: Align FileIO.readall between _pyio and _io Feb 5, 2025

bedevere-appbot mentioned this pull request Feb 5, 2025
Reduce copies when reading files in pyio, match behavior of _io #129005
Closed

vstinner reviewed Feb 5, 2025

Update Lib/_pyio.py
4520ecc
Co-authored-by: Victor Stinner <vstinner@python.org>

vstinner reviewed Feb 6, 2025

Use len(result) rather than bufsize, _new_buffersize, make in terms o…
6c3ac57
…f bytes_read

cmaloney and others added 4 commits February 6, 2025 14:41

Merge branch 'main' into fileio_readall
0e15dee

Simplify control structure
b50fb66

Fix whitespace
4da4e31

Merge branch 'main' into fileio_readall
09b52ed

vstinner approved these changes Feb 7, 2025

bedevere-appbot added awaiting merge and removed awaiting review labels Feb 7, 2025

vstinner merged commit a3d5aab into python:main Feb 7, 2025
43 checks passed

bedevere-appbot removed the awaiting merge label Feb 7, 2025

cmaloney deleted the fileio_readall branch February 7, 2025 18:46
