GH-116380: Speed up glob.[i]glob() by making fewer system calls (take 2) #137474
+214 −227
This was previously merged and then reverted because a newly-added test failed on one buildbot. I'm not sure that the test was ever valid, so I've removed it here.
Filtered recursive walk
Expanding a recursive `**` segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on `**/*.py`, with no further filesystem access needed.

This fixes an issue where `glob()` could return duplicate results.

Tracking path existence
We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:
- Special segments (`""`, `"."` and `".."`) leave the flag unchanged
- Literal segments (e.g. `foo/bar`) set the flag to false
- Wildcard segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`)
- Recursive segments (`**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`.

If the flag is false at the end, we call `lstat()` on each path to filter out missing paths.

Minor speed-ups
- [...] `is_dir()`.
- Handle `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`.
- Avoid `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed.
- Avoid `os.path.normcase()`; instead we use case-insensitive regex matching.

Implementation notes
Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:
- `dir_fd`
- `include_hidden`
- `root_dir`

This unifies the implementations of globbing in the `glob` and `pathlib` modules.

Results
Speedups via `python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)"` from the CPython source directory on Linux:

- `Lib/*`
- `Lib/*/`
- `Lib/*.py`
- `Lib/**`
- `Lib/**/`
- `Lib/**/*`
- `Lib/**/**`
- `Lib/**/*/`
- `Lib/**/*.py`
- `Lib/**/__init__.py`
- `Lib/**/*/*.py`
- `Lib/**/*/__init__.py`

`glob.glob()` by reducing number of system calls made #116380

📚 Documentation preview 📚: https://cpython-previews--137474.org.readthedocs.build/
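
For readers who want a concrete picture of the filtered recursive walk described above, here is a minimal sketch. It is not the code in this PR: the helper names `walk`, `filtered_walk` and `demo` are hypothetical, and `TAIL_RE` is a hand-written, simplified regex standing in for whatever the real implementation generates from `**/*.py`. The directory tree is walked once with `os.scandir()`, and the trailing pattern is then applied purely as a string match.

```python
import os
import re
import tempfile

# Hand-written stand-in for a regex derived from "**/*.py" (an assumption,
# not the regex produced by the glob module). Paths use "/" separators.
TAIL_RE = re.compile(r"(?:.+/)?[^/]+\.py")


def walk(root):
    """Yield every descendant of `root` as a "/"-separated relative path."""
    stack = [""]  # relative directory prefixes still to scan
    while stack:
        rel_dir = stack.pop()
        try:
            with os.scandir(os.path.join(root, rel_dir)) as entries:
                children = list(entries)
        except OSError:
            continue  # unreadable or vanished directory: skip it
        for entry in children:
            rel_path = rel_dir + entry.name
            yield rel_path
            if entry.is_dir(follow_symlinks=False):
                stack.append(rel_path + "/")


def filtered_walk(root, regex):
    """Walk once, then filter by regex with no further filesystem access."""
    return sorted(p for p in walk(root) if regex.fullmatch(p))


def demo():
    """Build a small temporary tree and return the paths matching TAIL_RE."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "pkg", "sub"))
        for name in ("a.py", "pkg/b.py", "pkg/sub/c.py", "pkg/notes.txt"):
            with open(os.path.join(tmp, name), "w"):
                pass
        return filtered_walk(tmp, TAIL_RE)
```

Because each path is produced exactly once by the walk, filtering afterwards cannot yield the same path twice, which is one way to see why this approach avoids duplicate results.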
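
The existence flag from "Tracking path existence" can be illustrated in the same spirit. The sketch below is an assumption-laden simplification: `next_flag`, `drop_missing` and `demo` are hypothetical names, the `via_scandir` parameter is an invented way to distinguish the root path of a `**` walk from its descendants, and wildcard detection is reduced to looking for `*`, `?` or `[`.

```python
import os
import tempfile

SPECIAL_SEGMENTS = {"", ".", ".."}
WILDCARD_CHARS = "*?["  # simplified wildcard detection (assumption)


def next_flag(exists, segment, via_scandir=False):
    """Return the 'guaranteed to exist' flag after applying one segment."""
    if segment in SPECIAL_SEGMENTS:
        return exists  # special segments leave the flag unchanged
    if segment == "**":
        # Root path keeps its flag; descendants found by scanning exist.
        return True if via_scandir else exists
    if any(ch in segment for ch in WILDCARD_CHARS):
        return True  # children were listed by os.scandir()
    return False  # literal segment: existence is no longer guaranteed


def drop_missing(candidates):
    """Yield paths, calling lstat() only where the flag is false."""
    for path, exists in candidates:
        if not exists:
            try:
                os.lstat(path)
            except OSError:
                continue  # path is missing: filter it out
        yield path


def demo():
    """Return the names that survive filtering in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "real.txt")
        with open(real, "w"):
            pass
        missing = os.path.join(tmp, "missing.txt")
        kept = list(drop_missing([(real, False), (missing, False)]))
        return [os.path.basename(p) for p in kept]
```

In this model a pattern ending in a wildcard segment never needs the final `lstat()` pass, while a pattern ending in a literal segment does.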