GH-115060: Speed up `pathlib.Path.glob()` by removing redundant regex matching #115061

barneygale · 2024-02-06T04:10:20Z

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object.

Also skip compiling a pattern when expanding a `*` wildcard segment.
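
To illustrate the `*` shortcut, here is a minimal sketch (not the code in this PR). `select_children` is a hypothetical helper, and `fnmatch.translate()` stands in for pathlib's own pattern compiler. The idea is only that a bare `*` accepts every name, so no regex needs to be built for it.

```python
import fnmatch
import os
import re


def select_children(dirpath, pattern):
    """Yield the names in dirpath that match one non-recursive pattern component.

    Hypothetical helper for illustration; pathlib's real logic differs.
    """
    if pattern == '*':
        # A bare '*' matches every name, so skip regex compilation entirely.
        match = None
    else:
        # Assumption: fnmatch.translate() is used here as a stand-in compiler.
        match = re.compile(fnmatch.translate(pattern)).match
    with os.scandir(dirpath) as entries:
        for entry in entries:
            if match is None or match(entry.name):
                yield entry.name
```

Calling `select_children(some_dir, '*')` never touches `re`, while `select_children(some_dir, '*.py')` compiles one pattern and filters with it.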

Issue: Speed up pathlib.Path.glob() by removing redundant regex matching #115060


barneygale · 2024-02-06T04:20:36Z

Notable improvements:

    $ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('*', follow_symlinks=False))"
    2000 loops, best of 5: 180 usec per loop  # before
    2000 loops, best of 5: 159 usec per loop  # after
    # --> 1.13x faster

    $ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*.py', follow_symlinks=False))"
    5 loops, best of 5: 54 msec per loop    # before
    5 loops, best of 5: 40.9 msec per loop  # after
    # --> 1.32x faster
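
For anyone who wants to repeat this kind of measurement outside a CPython checkout, a rough script along these lines may help. It is a sketch under assumptions: `build_tree` and `time_glob` are hypothetical helpers, the synthetic tree is much smaller than a real source tree, and the `follow_symlinks` argument is left out so the script runs on interpreters that do not accept it. Absolute numbers will not be comparable to the ones above.

```python
import tempfile
import timeit
from pathlib import Path


def build_tree(root, dirs=5, files=20):
    """Create a small synthetic tree: `dirs` subdirectories, each with `files` .py files."""
    for d in range(dirs):
        sub = Path(root, f'pkg{d}')
        sub.mkdir()
        for f in range(files):
            (sub / f'mod{f}.py').touch()


def time_glob(root, pattern, number=20, repeat=5):
    """Return the best per-call time, in seconds, of list(Path(root).glob(pattern))."""
    timer = timeit.Timer(lambda: list(Path(root).glob(pattern)))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as root:
        build_tree(root)
        for pattern in ('*', '**/*.py'):
            print(f'{pattern!r}: {time_glob(root, pattern) * 1e6:.1f} usec per call')
```

Running it once on each build (before and after the change) gives a before/after pair for each pattern.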

Everything else is about the same.


zooba · 2024-02-08T00:22:51Z

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

Since it doesn't require changing any test cases, and I know the test cases are pretty thorough for this area, I don't think there's any reason not to sign off. Maybe trigger a buildbot run with the tag to make sure it doesn't behave strangely on any of those setups - they can occasionally be a bit unusual and find some edge cases.

barneygale · 2024-02-08T17:13:11Z

Thanks Steve.

> For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

The algorithm might be worthy of a blog post at this point!

The main change is that we now filter partial paths through a regex corresponding to a partial pattern in `_select_recursive()`, rather than complete paths through a regex corresponding to a complete pattern in `PathBase.glob()`. We can do this because previous parts have already been filtered by `_select_children()`, and so there's no need to re-filter them.

The secondary change (which includes the addition of `_entry_str()`) is to match against `os.DirEntry.path` directly, which allows us to skip construction of path objects for files that don't match.
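
A simplified sketch of both ideas, for readers who find code easier to follow than prose. This is not the implementation in `Lib/pathlib/_abc.py`: `translate_part` and `select_recursive` are hypothetical names, only `*` is supported as a wildcard, and symlinks and case sensitivity are ignored. It compiles a regex from `**` plus the parts that follow it, matches that regex against the raw `os.DirEntry.path` string starting just past the parent prefix, and builds a `Path` only when the match succeeds.

```python
import os
import re
from pathlib import Path


def translate_part(part):
    """Translate one pattern component; '*' matches anything except a separator."""
    sep = re.escape(os.sep)
    return f'[^{sep}]*'.join(re.escape(piece) for piece in part.split('*'))


def select_recursive(parent, rest_parts):
    """Yield Path objects under `parent` (a str) that match '**/<rest_parts>'."""
    sep = re.escape(os.sep)
    # The regex covers only '**' and the parts after it. Anything before '**'
    # was already filtered when `parent` was selected, so it is not re-checked.
    tail = sep.join(translate_part(part) for part in rest_parts)
    match = re.compile(f'(?:[^{sep}]+{sep})*{tail}\\Z').match
    # Offset of the first character after the parent prefix in each entry path.
    parent_len = len(os.path.join(parent, '_')) - 1
    stack = [parent]
    while stack:
        dirpath = stack.pop()
        with os.scandir(dirpath) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                # Match the plain string first; construct a Path only on success.
                if match(entry.path, parent_len):
                    yield Path(entry.path)
```

For example, `select_recursive('/some/dir', ['*.py'])` walks the tree once and creates path objects only for the `.py` entries; everything else is rejected at the string level.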

zooba · 2024-02-08T22:07:39Z

Okay, today it made sense :) Guess I'm more awake right now. Reading the changes from the bottom up might have helped as well.

Personally, I don't think you can have too many comments in an algorithm like this, particularly when it's recursive and split between a couple of functions. I'll suggest a few comments that would've helped me, but I don't think there are any code changes needed.

zooba

Just comments that may help make it more understandable. No changes required.

Lib/pathlib/_abc.py

Lib/pathlib/__init__.py


barneygale added performance Performance or resource usage topic-pathlib labels Feb 6, 2024

barneygale requested a review from zooba February 6, 2024 04:10

bedevere-appbot added the awaiting core review label Feb 6, 2024

bedevere-appbot mentioned this pull request Feb 6, 2024
Speed up pathlib.Path.glob() by removing redundant regex matching #115060
Closed

barneygale added 4 commits February 6, 2024 04:58

Match against os.DirEntry.path in _select_recursive()
6abb80d

Matching against dot-prefixed path is fine (and faster!)
b382e40

Revert "Matching against dot-prefixed path is fine (and faster!)"
e1472fc
This reverts commit b382e40.

Skip computing prefix len when not matching
284c42e

Rename prefix_len --> parent_len for clarity.
169b1e7

zooba reviewed Feb 8, 2024

Lib/pathlib/_abc.py (resolved)
Lib/pathlib/_abc.py (resolved)
Lib/pathlib/_abc.py (resolved)
Lib/pathlib/__init__.py (outdated, resolved)

barneygale added 4 commits February 8, 2024 22:56

Comments, naming.
1c4184f

segment --> component
2873ed8

Test post-** matching when globbing ..
90d5a12

Couple more test cases
a40924b

barneygale merged commit 6f93b4d into python:main Feb 10, 2024

bedevere-appbot removed the awaiting core review label Feb 10, 2024
