Reduce copies when reading files in pyio, match behavior of _io #129005

@cmaloney

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from test_largefile run

    $ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
    == CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
    == Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
    == Python build: debug
    == cwd: <$HOME>/python/build/build/test_python_worker_32392æ
    == CPU count: 32
    == encodings: locale=UTF-8 FS=utf-8
    == resources: all

    Using random seed: 1740056613
    0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
    0:00:00 load avg: 0.53 [1/1] test_largefile
    test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ...
     ... expected peak memory use: 4.7G
     ... process data size: 2.3G
    ok
    test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ...
     ... expected peak memory use: 4.7G
     ... process data size: 2.3G
     ... process data size: 4.3G
     ... process data size: 4.7G
    ok

    ----------------------------------------------------------------------
    Ran 2 tests in 3.711s

    OK

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 3.7 sec
    Total tests: run=2 (filtered)
    Total test files: run=1/1 (filtered)
    Result: SUCCESS

Plan:

  1. Switch to os.readv() / os.readinto() to read directly into a caller-allocated buffer, as the C _Py_read used by _io does. os.read() can't take a buffer to fill. This aligns behavior between _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing a new os API. readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior.
  2. Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of _pyio.FileIO.readall to it.
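A rough sketch of step 1, assuming a platform where os.readv is available: read until EOF into one preallocated bytearray instead of joining chunks returned by os.read(). The helper name read_all and its chunk_size parameter are hypothetical and not taken from _pyio; the real change may size and grow the buffer differently.

```python
import os


def read_all(fd, chunk_size=64 * 1024):
    """Hypothetical helper: read fd to EOF into a single bytearray."""
    try:
        size_hint = os.fstat(fd).st_size
    except OSError:
        size_hint = 0
    # One extra byte lets the final readv() report EOF without a resize.
    buf = bytearray(size_hint + 1 if size_hint > 0 else chunk_size)
    filled = 0
    while True:
        if filled == len(buf):
            # Buffer is full (size unknown or file grew): extend it.
            buf += bytes(chunk_size)
        # readv() writes straight into the unused tail of buf.
        with memoryview(buf) as full, full[filled:] as window:
            n = os.readv(fd, [window])
        if n == 0:  # EOF
            break
        filled += n
    del buf[filled:]  # drop the unused tail
    return buf
```

The sketch returns the bytearray itself; converting it with bytes(buf) would add back one copy, so how the result is handed to the caller matters for the memory goal described above.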

For iterating, I'm using a small tracemalloc script to find where copies are:

    from _pyio import open
    import tracemalloc

    with open("README.rst", 'rb') as file:
        tracemalloc.start()
        data = file.read()
        snap = tracemalloc.take_snapshot()
        stats = snap.statistics('lineno')
        for stat in stats:
            print(stat)

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
    • os.readv is not supported on WASI, so need to add conditional code.
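One possible shape for the conditional code, as a sketch only: use os.readv where the build provides it and fall back to os.read() plus a copy elsewhere. The name read_into is hypothetical, and the fallback keeps the extra copy, so it trades memory for portability.

```python
import os

if hasattr(os, "readv"):
    def read_into(fd, buffer):
        """Fill buffer directly from fd; return the number of bytes read."""
        return os.readv(fd, [buffer])
else:
    def read_into(fd, buffer):
        """Fallback for platforms without readv: read, then copy into buffer."""
        with memoryview(buffer) as view, view.cast("B") as byte_view:
            data = os.read(fd, len(byte_view))
            byte_view[:len(data)] = data
            return len(data)
```

Both variants take a writable bytes-like object and return a byte count, so calling code does not need to know which one was selected.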

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Labels: performance, stdlib, type-feature
