barneygale (Contributor) commented Apr 6, 2024

Move pathlib globbing implementation into a new private class: glob._Globber. This class implements fast string-based globbing. It's called by pathlib.Path.glob(), which then converts strings back to path objects.

In the private pathlib ABCs, add a pathlib._abc.Globber subclass that works with PathBase objects rather than strings, and calls user-defined path methods like PathBase.stat() rather than os.stat().

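This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with lstat()
- GH-116380: Unify pathlib and glob implementations of globbing.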
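The split described above can be pictured roughly as follows. This is a simplified sketch, not the code from this PR: the class names `StringGlobber` and `PathObjectGlobber`, the hook names `list_names` and `join`, and the helper `glob_paths` are all hypothetical, and only a single-level wildcard pattern is handled.

```python
import fnmatch
import os
from pathlib import Path


class StringGlobber:
    """Hypothetical string-based matcher for one directory level."""

    @staticmethod
    def list_names(directory):
        # Plain string in, plain strings out: no path objects are created.
        with os.scandir(directory) as entries:
            return sorted(entry.name for entry in entries)

    @staticmethod
    def join(directory, name):
        return os.path.join(directory, name)

    def select(self, directory, pattern):
        # Yield every child of `directory` whose name matches `pattern`.
        for name in self.list_names(directory):
            if fnmatch.fnmatchcase(name, pattern):
                yield self.join(directory, name)


class PathObjectGlobber(StringGlobber):
    """Same algorithm, but routed through path-object methods."""

    @staticmethod
    def list_names(directory):
        # Calls the path object's own iterdir() instead of os.scandir().
        return sorted(child.name for child in directory.iterdir())

    @staticmethod
    def join(directory, name):
        return directory / name


def glob_paths(path, pattern):
    # Fast path: glob on strings, then convert results back to Path objects.
    return map(Path, StringGlobber().select(str(path), pattern))
```

The point of the design is that the matching loop is written once, while the two hooks decide whether it touches the filesystem through `os` functions on strings or through methods on a path object.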

Timings:

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*'))"
1000 loops, best of 5: 392 usec per loop
1000 loops, best of 5: 365 usec per loop
# --> 1.07x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*.py'))"
1000 loops, best of 5: 393 usec per loop
1000 loops, best of 5: 371 usec per loop
# --> 1.06x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**'))"
50 loops, best of 5: 9.46 msec per loop
50 loops, best of 5: 9.06 msec per loop
# --> 1.04x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/'))"
50 loops, best of 5: 4.98 msec per loop
50 loops, best of 5: 5.15 msec per loop
# --> 1.03x slower (!)

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*'))"
20 loops, best of 5: 14 msec per loop
20 loops, best of 5: 12.9 msec per loop
# --> 1.09x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*.py'))"
20 loops, best of 5: 12.2 msec per loop
20 loops, best of 5: 11.4 msec per loop
# --> 1.07x faster
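The factors in the `# -->` annotations are the ratio of the two timings for each pattern. A small sketch that recomputes them from the figures above; the names `timings` and `describe` are illustrative, and treating the first figure as the baseline and the second as this branch is an assumption inferred from the "faster"/"slower" annotations.

```python
# (first timing, second timing) per pattern, each pair in the same unit
# (usec for the first two rows, msec for the rest).
timings = {
    "Lib/*": (392, 365),
    "Lib/*.py": (393, 371),
    "Lib/**": (9.46, 9.06),
    "Lib/**/": (4.98, 5.15),
    "Lib/**/*": (14, 12.9),
    "Lib/**/*.py": (12.2, 11.4),
}


def describe(before, after):
    """Express the change between two timings as an 'Nx faster/slower' string."""
    if after <= before:
        return f"{before / after:.2f}x faster"
    return f"{after / before:.2f}x slower"


for pattern, (before, after) in timings.items():
    print(f"{pattern:12} {describe(before, after)}")
```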

Move pathlib globbing implementation to a new module and class: `pathlib._glob.Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
barneygale (Contributor, Author) commented

This is the first PR in a series that will hopefully unify the globbing implementations in the pathlib and glob modules, and speed both up in the process.

barneygale (Contributor, Author) commented Apr 7, 2024

Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the glob module for the last few years.

This PR doesn't affect glob.[i]glob(), but it does move pathlib's globbing implementation into glob.py.

Thank you.

barneygale (Contributor, Author) commented

I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1.

But I'll leave glob.glob() and glob.iglob() unchanged in 3.13; any PRs I make will target 3.14.

barneygale merged commit 6258844 into python:main on Apr 10, 2024
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 10, 2024
Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new `glob._Globber.walk()` classmethod works with strings internally, which is a little faster than generating `Path` objects and keeping them normalized. The `pathlib.Path.walk()` method converts the strings back to path objects. In the private pathlib ABCs, our existing subclass of `_Globber` ensures that `PathBase` instances are used throughout. Follow-up to python#117589.
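As a rough illustration of the idea in this commit message, a top-down walk can be done entirely on strings and converted to path objects only at the boundary. This is a sketch under stated assumptions, not the `glob._Globber.walk()` implementation: the function names `walk_strings` and `walk_paths` are hypothetical, symlinked directories are simply listed as files, and unreadable directories are skipped silently.

```python
import os
from pathlib import Path


def walk_strings(top):
    """Yield (dirpath, dirnames, filenames) top-down, using str paths only."""
    stack = [top]
    while stack:
        dirpath = stack.pop()
        dirnames, filenames = [], []
        try:
            with os.scandir(dirpath) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        dirnames.append(entry.name)
                    else:
                        filenames.append(entry.name)
        except OSError:
            # Skip directories that cannot be read.
            continue
        yield dirpath, dirnames, filenames
        # Push in reverse so the first-listed subdirectory is visited first.
        stack.extend(os.path.join(dirpath, name) for name in reversed(dirnames))


def walk_paths(top):
    """Wrap the string-based walk, converting each dirpath back to a Path."""
    for dirpath, dirnames, filenames in walk_strings(str(top)):
        yield Path(dirpath), dirnames, filenames
```

Only one `Path` is constructed per directory visited; the joins and directory scans in the inner loop work on plain strings.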
barneygale added a commit that referenced this pull request Apr 11, 2024
…17726) Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new `glob._Globber.walk()` classmethod works with strings internally, which is a little faster than generating `Path` objects and keeping them normalized. The `pathlib.Path.walk()` method converts the strings back to path objects. In the private pathlib ABCs, our existing subclass of `_Globber` ensures that `PathBase` instances are used throughout. Follow-up to #117589.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117726) Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new `glob._Globber.walk()` classmethod works with strings internally, which is a little faster than generating `Path` objects and keeping them normalized. The `pathlib.Path.walk()` method converts the strings back to path objects. In the private pathlib ABCs, our existing subclass of `_Globber` ensures that `PathBase` instances are used throughout. Follow-up to python#117589.
cjwatson added a commit to cjwatson/pypandoc that referenced this pull request Dec 8, 2024
As of python/cpython#117589 (at least), `Path.glob` returns an `Iterator` rather than `Generator` (which inherits from `Iterator`). `convert_file` doesn't need to care about this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

    ______________________________________________________ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob _______________________________________________________

    self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

        def test_basic_conversion_from_file_pattern_pathlib_glob(self):
            received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
    >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

    tests.py:654:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
    sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True

    [...]
            if not _identify_path(discovered_source_files):
    >           raise RuntimeError("source_file is not a valid path")
    E           RuntimeError: source_file is not a valid path

    pypandoc/__init__.py:201: RuntimeError
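The distinction in this commit message can be reproduced without pathlib. The sketch below uses a generator function and a `map` object as stand-ins for the two kinds of return value (the traceback reports a `map` object); the function names are hypothetical and this is not pypandoc's actual check.

```python
from collections.abc import Generator, Iterator


def gen_paths(names):
    # A generator function: calling it returns a Generator (also an Iterator).
    for name in names:
        yield name


def accepts_generator_only(obj):
    # Too strict: rejects iterators that are not generators.
    return isinstance(obj, Generator)


def accepts_any_iterator(obj):
    # Looser check that covers generators, map objects and other iterators.
    return isinstance(obj, Iterator)


generator_result = gen_paths(["a.md", "b.md"])
mapped_result = map(str.upper, ["a.md", "b.md"])

print(accepts_generator_only(generator_result), accepts_generator_only(mapped_result))
print(accepts_any_iterator(generator_result), accepts_any_iterator(mapped_result))
```

Checking against `Iterator` (or simply iterating the argument) keeps calling code working whichever of the two the library returns.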
cjwatson added a commit to cjwatson/typeshed that referenced this pull request Dec 8, 2024
Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.
JessicaTegner added a commit to JessicaTegner/pypandoc that referenced this pull request Jan 8, 2025
As of python/cpython#117589 (at least), `Path.glob` returns an `Iterator` rather than `Generator` (which inherits from `Iterator`). `convert_file` doesn't need to care about this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

    ______________________________________________________ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob _______________________________________________________

    self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

        def test_basic_conversion_from_file_pattern_pathlib_glob(self):
            received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
    >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

    tests.py:654:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
    sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True

    [...]
            if not _identify_path(discovered_source_files):
    >           raise RuntimeError("source_file is not a valid path")
    E           RuntimeError: source_file is not a valid path

    pypandoc/__init__.py:201: RuntimeError

Co-authored-by: Jessica Tegner <jessica@jessicategner.com>
srittau pushed a commit to python/typeshed that referenced this pull request Feb 28, 2025
Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.
mmingyu pushed a commit to mmingyu/typeshed that referenced this pull request May 16, 2025
Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.
Labels: performance (Performance or resource usage), topic-pathlib