Skip to content

Conversation

@nineteendo
Copy link
Contributor

@nineteendonineteendo commented Apr 13, 2024

Benchmark

script
::speedup-ntpath.lexists.bat@echooffecho existent &&call main\python -m timeit -s "import ntpath""ntpath.lexists('main')"&&call speedup-ntpath.lexists\python -m timeit -s "import ntpath""ntpath.lexists('main')"echo non-existent &&call main\python -m timeit -s "import ntpath""ntpath.lexists('foo')"&&call speedup-ntpath.lexists\python -m timeit -s "import ntpath""ntpath.lexists('foo')"
existent 5000 loops, best of 5: 46.5 usec per loop # before 10000 loops, best of 5: 33.2 usec per loop # after # -> 1.40x faster non-existent 10000 loops, best of 5: 22.2 usec per loop # before 20000 loops, best of 5: 15.6 usec per loop # after # -> 1.42x faster 

nineteendoand others added 2 commits April 13, 2024 12:03
Co-authored-by: Eryk Sun <eryksun@gmail.com>
@nineteendonineteendo changed the title Speedup ntpath.lexistsgh-117841: Speedup ntpath.lexistsApr 13, 2024
@nineteendonineteendo marked this pull request as ready for review April 13, 2024 11:31
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 13, 2024
…lexists()` Use `os.path.lexists()` rather than `os.lstat()` to test whether paths exist. This is equivalent on POSIX, but faster on Windows.
@nineteendonineteendo changed the title gh-117841: Speedup ntpath.lexistsgh-117841: Add C implementation of ntpath.lexistsApr 13, 2024
@eryksun
Copy link
Contributor

eryksun commented Apr 14, 2024

I'm working on what I hope will be an improved version compared to the first draft. AFAIK, the GetFileInformationByName() fast path will be generally available when Windows 11 24H2 is released later this year. Until then, and for older systems, we need to focus on improving the 'slow' path. I'm looking to avoid full STAT and LSTAT calls, except as a last resort.

@nineteendonineteendo marked this pull request as draft April 14, 2024 21:48
@eryksun
Copy link
Contributor

eryksun commented Apr 14, 2024

Here's the revised implementation of nt_exists():

staticPyObject*nt_exists(PyObject*path, intfollow_symlinks){path_t_path=PATH_T_INITIALIZE("exists", "path", 0, 1); HANDLEhfile; BOOLtraverse=follow_symlinks; intresult=0; if (!path_converter(path, &_path)){path_cleanup(&_path); if (PyErr_ExceptionMatches(PyExc_ValueError)){PyErr_Clear(); Py_RETURN_FALSE} returnNULL} Py_BEGIN_ALLOW_THREADSif (_path.fd!=-1){hfile=_Py_get_osfhandle_noraise(_path.fd); if (hfile!=INVALID_HANDLE_VALUE){result=1} } elseif (_path.wide){BOOLslow_path= TRUE; FILE_STAT_BASIC_INFORMATIONstatInfo; if (_Py_GetFileInformationByName(_path.wide, FileStatBasicByNameInfo, &statInfo, sizeof(statInfo))){if (!(statInfo.FileAttributes&FILE_ATTRIBUTE_REPARSE_POINT) || !follow_symlinks&&IsReparseTagNameSurrogate(statInfo.ReparseTag)){slow_path= FALSE; result=1} else{// reparse point but not name-surrogatetraverse= TRUE} } elseif (_Py_GetFileInformationByName_ErrorIsTrustworthy( GetLastError())){slow_path= FALSE} if (slow_path){BOOLtraverse=follow_symlinks; if (!traverse){hfile=CreateFileW(_path.wide, FILE_READ_ATTRIBUTES, 0, NULL, OPEN_EXISTING, FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS, NULL); if (hfile!=INVALID_HANDLE_VALUE){FILE_ATTRIBUTE_TAG_INFOinfo; if (GetFileInformationByHandleEx(hfile, FileAttributeTagInfo, &info, sizeof(info))){if (!(info.FileAttributes&FILE_ATTRIBUTE_REPARSE_POINT) ||IsReparseTagNameSurrogate(info.ReparseTag)){result=1} else{// reparse point but not name-surrogatetraverse= TRUE} } else{// device or legacy filesystemresult=1} CloseHandle(hfile)} else{STRUCT_STATst; switch (GetLastError()){caseERROR_ACCESS_DENIED: caseERROR_SHARING_VIOLATION: caseERROR_CANT_ACCESS_FILE: caseERROR_INVALID_PARAMETER: if (!LSTAT(_path.wide, &st)){result=1} } } } if (traverse){hfile=CreateFileW(_path.wide, FILE_READ_ATTRIBUTES, 0, NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL); if (hfile!=INVALID_HANDLE_VALUE){CloseHandle(hfile); result=1} else{STRUCT_STATst; switch (GetLastError()){caseERROR_ACCESS_DENIED: caseERROR_SHARING_VIOLATION: caseERROR_CANT_ACCESS_FILE: caseERROR_INVALID_PARAMETER: if (!STAT(_path.wide, &st)){result=1} } } } } } Py_END_ALLOW_THREADSpath_cleanup(&_path); if (result){Py_RETURN_TRUE} Py_RETURN_FALSE}

Co-authored-by: Eryk Sun <eryksun@gmail.com>
@nineteendo
Copy link
ContributorAuthor

Are you happy with the new implementation, or do you have more ideas?

@nineteendo
Copy link
ContributorAuthor

The performance for existent files has already greatly improved. Nice job!

@eryksun
Copy link
Contributor

That's all I have for now.

@nineteendonineteendo marked this pull request as ready for review April 15, 2024 06:36
@serhiy-storchakaserhiy-storchaka self-requested a review April 15, 2024 10:49
Copy link
Member

@serhiy-storchakaserhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance looks correct, although I am not happy that we added so much complicated code for pure optimization of functions which should not be used in performance critical code (os.scandir() should be used instead).

@eryksun
Copy link
Contributor

At first glance looks correct, although I am not happy that we added so much complicated code for pure optimization of functions which should not be used in performance critical code (os.scandir() should be used instead).

The performance of os.stat() and os.lstat() will improve on Windows once GetFileInformationByName() is generally supported, starting with Windows 11 24H2 later this year. Eventually I expect the builtin _path_* tests to be removed when Python stops supporting Windows 10.

A disappointment with these builtin functions is that they can't be leveraged in pathlib.Path because they're designed to simply return True or False without raising OSError exceptions. The problem is that the pathlib.Path methods exists(), is_dir(), is_file(), and is_symlink() only ignore OSError exceptions for a small set of errno values and Windows error codes.

Regarding scandir(), in principle, the C implementations of _path_isdir() and _path_isfile() could be used by the DirEntry methods is_dir() and is_file(). These methods have to call STAT() if the entry is a reparse point, except if follow_symlinks is false and it's a name-surrogate reparse point1. That's more work than necessary. However, the DirEntry tests only handle FileNotFoundError exceptions, so there's a mismatch in error handling. Fortunately there usually aren't so many reparse points that the accumulated cost of STAT() calls would matter much.

Footnotes

  1. That's if DirEntry was updated to behave like os.stat() and os.lstat(), for which follow_symlinks is interpreted liberally to include junctions and other name-surrogate file types, not just literal symlinks, while other types of reparse points are always traversed because they're not meant to be handled like links. For example, DirEntry.is_dir() incorrectly returns true for a junction that targets a non-existent path.

@barneygale
Copy link
Contributor

A disappointment with these builtin functions is that they can't be leveraged in pathlib.Path because they're designed to simply return True or False without raising OSError exceptions. The problem is that the pathlib.Path methods exists(), is_dir(), is_file(), and is_symlink() only ignore OSError exceptions for a small set of errno values and Windows error codes.

We can at least use it from Path.glob(), which suppresses all OSErrors. I have a draft PR for that here: #117858

I wouldn't mind if we made the is_ methods suppress all OSErrors rather than a fairly arbitrary subset as we do now. We don't document what's suppressed, and it's changed before.

@nineteendo
Copy link
ContributorAuthor

cc @zooba

@zooba
Copy link
Member

zooba commented May 1, 2024

So I'm hesitant to take this for three reasons (and these do apply to previous enhancements as well, but didn't exist at that time):

  • stat is going to get fast enough on newer Windows for this optimisation to be redundant
  • it's a lot of code and not easily maintainable
  • I don't know any scenarios where you have a lot of paths and you need to check whether they exist

isdir and islink are useful to have, because you may glob or listdir and need to filter its members. But you don't need to exists in that case - everything in the directory has to be assumed to exist (and handle the rare race condition when you try to use it).

If someone can show a scenario where you would have a significant (hundreds+) list of paths, need to check whether they exist, but couldn't use one of isfile, isdir or islink, then I could be persuaded on the third point. If there's some reason why this scenario exists for people who can upgrade Python but not update Windows, then I might be convinced on the first point (or alternatively, if lexists doesn't actually get faster with the new stat APIs, which is a possibility).

I don't think we can really reduce the amount of code. If it happened to be shorter and easier to follow then I'd be less concerned about long-term maintenance, but I'm pretty sure it's as good as it gets (without adding indirection and hurting the performance again - same tradeoff we made with the earlier _path_* functions).

@barneygale
Copy link
Contributor

If someone can show a scenario where you would have a significant (hundreds+) list of paths, need to check whether they exist, but couldn't use one of isfile, isdir or islink, then I could be persuaded on the third point.

On this specifically, one example would be globbing for */__init__.py. The quickest way to implement that is os.scandir() for an initial set of paths, join __init__.py onto each of them, and then call lstat() to filter out nonexistent paths. This is the approach taken in the pathlib globbing implementation:

cpython/Lib/glob.py

Lines 508 to 520 in a7711a2

defselect_exists(self, path, exists=False):
"""Yields the given path, if it exists.
"""
ifexists:
# Optimization: this path is already known to exist, e.g. because
# it was returned from os.scandir(), so we skip calling lstat().
yieldpath
else:
try:
self.lstat(path)
yieldpath
exceptOSError:
pass

Note that glob results can include dangling symlinks, hence lstat() rather than stat().

@eryksun
Copy link
Contributor

* `stat` is going to get fast enough on newer Windows for this optimisation to be redundant 

GetFileInformationByName() won't be generally available until Windows 11 24H2 later this year, right? What about older Windows 10/11 systems?

* it's a lot of code and not easily maintainable 

The implementation was consolidated with _path_exists() to lessen duplicate code. It was also reordered to avoid the need the need for the close_file variable. But I agree that all of these _path_is* and _path_[l]exists helpers are a lot of code to maintain, taken together. It could be refactored into smaller inline helper functions that can be reused, which would also make the code more readable. nineteendo seems to be pretty good at and enthusiastic about optimizing code.

@nineteendo
Copy link
ContributorAuthor

But I agree that all of these _path_is* and _path_[l]exists helpers are a lot of code to maintain, taken together. It could be refactored into smaller inline helper functions that can be reused, which would also make the code more readable.

Doesn't seem too difficult for _path_isdir & _path_isfile:

Screenshot 2024-05-02 at 09 14 05

@nineteendo
Copy link
ContributorAuthor

nineteendo commented May 2, 2024

Could we simplify this? It already seems to know if it's a directory of file:

if (!(statInfo.FileAttributes&FILE_ATTRIBUTE_REPARSE_POINT)){
slow_path= FALSE;
result=statInfo.FileAttributes&FILE_ATTRIBUTE_DIRECTORY;
} elseif (!(statInfo.FileAttributes&FILE_ATTRIBUTE_DIRECTORY)){
slow_path= FALSE;
result=0;
}

if (!(statInfo.FileAttributes&FILE_ATTRIBUTE_REPARSE_POINT)){
slow_path= FALSE;
result= !(statInfo.FileAttributes&FILE_ATTRIBUTE_DIRECTORY);
} elseif (statInfo.FileAttributes&FILE_ATTRIBUTE_DIRECTORY){
slow_path= FALSE;
result=0;
}

@eryksun
Copy link
Contributor

eryksun commented May 2, 2024

Could we simplify this? It already seems to know if it's a directory of file:

Yes, the fast path can be simplified. If it's not a reparse point, isdir() and isfile() can always use the by-name stat information. That's implemented. isdir() can return False for a file reparse point. That's implemented. Otherwise it has to ensure that the reparsed target exists. This could be improved. Falling back to the full slow path does more work than necessary since it only needs to check existence via CreateFileW(), not the GetFileInformationByHandleEx() query. Similarly isfile() has to verify existence for a file reparse point.

@nineteendonineteendo marked this pull request as draft May 8, 2024 11:43
@nineteendo
Copy link
ContributorAuthor

Tracking further in #118755.

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

@nineteendo@eryksun@barneygale@zooba@serhiy-storchaka