Skip to content

Conversation

@barneygale
Copy link
Contributor

@barneygalebarneygale commented Nov 1, 2024

Use the new PathBase.scandir() method in PathBase.copy(), which greatly reduces the number of PathBase.stat() calls needed when copying. This also speeds up Path.copy(), which uses the superclass implementation.

Under the hood, we use directory entries to distinguish between files, directories and symlinks, and to retrieve a stat_result when reading metadata. This logic is extracted into a new pathlib._abc.CopierBase class, which helps reduce the number of underscore-prefixed support methods in the path interface. But it makes the patch a little large - sorry.

Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly reduces the number of `PathBase.stat()` calls needed when copying. This also speeds up `Path.copy()`, which inherits the superclass implementation. Under the hood, we use directory entries to distinguish between files, directories and symlinks, and to retrieve a `stat_result` when reading metadata. This logic is extracted into a new `pathlib._abc.CopierBase` class, which helps reduce the number of underscore-prefixed support methods in the path interface.
@barneygalebarneygale added performance Performance or resource usage topic-pathlib labels Nov 1, 2024
@barneygalebarneygale marked this pull request as ready for review November 1, 2024 05:20
@barneygale
Copy link
ContributorAuthor

Copying a directory of 100 empty files, this is about 10% faster when preserving metadata, and 5% faster without.

Copy link
Member

@picnixzpicnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my nocturnal review just before going to bed!

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@barneygale
Copy link
ContributorAuthor

Thanks for the reviews, both! I'll wait to see if Adam has feedback before I merge.

@barneygale
Copy link
ContributorAuthor

Gentle nudge @AA-Turner, thanks in advance :)

In case you didn't spot it, I'm planning to make scandir() private again after this PR lands. Thread here. The TL;DR is that it's not sufficiently motivated at the moment, but it might be if we make the pathlib ABCs public later on.

Recursively move this file or directory tree to the given destination.
"""
self._ensure_different_file(target)
target._copier.ensure_different_files(self, target)
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong - target might be a string here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, or a PathLike - could we use self._copier?

Copy link
Member

@AA-TurnerAA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder Barney.

I've no real concerns here, but I admit I'm not entirely sure why _CopierBase needs to exist, as it seems to reimplement some functionality of the Path hierarchy. It may be that the implementation without it would've been much longer, though.

Other than that, comments throughout.

A

Comment on lines +113 to +116
try:
source_st=dir_entry.stat()
exceptAttributeError:
source_st=source.stat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there valid non-None types for dir_entry that don't have stat? Otherwise a is None check is cheaper.

Recursively move this file or directory tree to the given destination.
"""
self._ensure_different_file(target)
target._copier.ensure_different_files(self, target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, or a PathLike - could we use self._copier?

Comment on lines +69 to +74
try:
source=os.fspath(source)
exceptTypeError:
ifnotisinstance(source, PathBase):
raise
CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we permit types inheriting from pathlib._abc.PathBase but not also PurePath (or: not implementing __fspath__)? This seems a little strange at first look.

raise
CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
else:
copyfile(source, os.fspath(target))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above exception handling for fspath, is target guaranteed to inherit from PurePath here?

yieldpath_str

def_join_dir_entry(self, dir_entry):
path_str=dir_entry.nameifstr(self) =='.'elsedir_entry.path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we shortcut here rather than going via str(self)? The change to iterdir means that this is recalculated for each path in the directory being iterated over, rather than only once.

Comment on lines +191 to +196
exceptIsADirectoryErrorase:
ifnottarget.exists():
# Raise a less confusing exception.
raiseFileNotFoundError(
f'Directory does not exist: {target}') frome
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity checking -- does this directory error handling need to exist in copy_file? Can ensuring that source and target are files be handled as a precondition? No worries if 'yes', this just surprised me!

@barneygale
Copy link
ContributorAuthor

Thanks for the feedback @AA-Turner, and my much-delayed response. I ended up going a different way with making os.DirEntry info available from pathlib: the scandir() method was replaced with a caching info attribute.

The spiritual successor to this PR here is here: #130238

In that PR, the new copy_file() function consolidates a bunch of copying logic from the _CopyWriter class, and uses source_path.info wherever possible to lean on cached scandir results.

I'll go through your feedback here and re-raise it on the other PR if it still applies.

Closing this PR. Sorry for the faff!

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting mergeperformancePerformance or resource usagetopic-pathlib

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

@barneygale@pfmoore@AA-Turner@picnixz