- Notifications
You must be signed in to change notification settings - Fork 1.9k
Python extractor: overlay support#20206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
d10c commented Aug 11, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
b18b9ce to 3015c12CompareUh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
d10c commented Aug 28, 2025
@tausbn I'm thinking this might be a good time to checkpoint this work and get it reviewed. In the last DCA run for full analysis on this PR (see above), overall analysis time is unaffected, though there are a few outstanding stage timing results that are probably noise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds overlay support to the Python extractor by implementing infrastructure for incremental analysis through database overlays, without including overlay compilation functionality.
Key changes implemented:
- Database schema updates to support overlay metadata and change tracking
- Extractor modifications to handle overlay-specific file traversal and metadata management
- Path transformer support using updated environment variables
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| python/ql/lib/semmlecode.python.dbscheme | Adds databaseMetadata and overlayChangedFiles relations for overlay support |
| python/ql/lib/semmle/python/Overlay.qll | Implements discard predicates to filter out obsolete entities during overlay analysis |
| python/extractor/semmle/traverser.py | Modifies file traversal to only process changed files during overlay extraction |
| python/extractor/semmle/worker.py | Adds support for writing base metadata output required for overlay operations |
| python/extractor/semmle/path_rename.py | Updates path transformer to support new CODEQL_PATH_TRANSFORMER environment variable |
python/extractor/semmle/traverser.py Outdated
| withopen(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') asf: | ||
| data=json.load(f) | ||
| changed_paths=data.get('changes', []) | ||
| self.overlay_changes={os.path.abspath(p) forpinchanged_paths } |
CopilotAIAug 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name self.overlay_changes is inconsistent with the other instance variables which use snake_case (self.exclude_paths, self.recurse_files, etc.). Consider renaming to self.overlay_changed_paths for consistency.
tausbn left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks good. 👍
Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)
Also, don't forget to update the extractor version here: https://github.com/github/codeql/blob/main/python/extractor/semmle/util.py#L13
(In this case, I think bumping it to 7.1.4 would be fine. We don't really have fixed rules for how to increase the version. The most important thing is that it changes so that we can tell from the log output what version of the extractor we're running.)
python/extractor/semmle/traverser.py Outdated
| if'CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'inos.environ: | ||
| withopen(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') asf: | ||
| data=json.load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm debating whether we should have some exception handling here (substituting the empty list of changed files in case something goes wrong). Currently, if something ends up being messed up in the JSON, then I believe the whole extraction will just fail.
I don't have strong feelings about it, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I also don't have strong opinions about whether file reading should fail loudly or warn and continue with a default (None, i.e. full extraction). I guess I'll go for the latter. And also insert a logger statement with the value of the environment variable, as is the convention elsewhere in the extractor.
d10c commented Sep 5, 2025
There are basic integration tests here but they depend on overlay compilation (not part of this commit), and also I'm still running into some issues on Windows (it appears that the path transformer is not working correctly there—currently debugging that). So maybe merging this should wait until I have that sorted. Otherwise, do you have an idea for an integration test for this functionality that doesn't also exercise complete overlay evaluation? |
fb23977 to f309dc6Compared10c commented Sep 10, 2025
I think I've figured out why path transformers weren't working on Windows and why built-in modules were being extracted (see latest commits). Now the integration test on the other PR passes. The only remaining thing now is solving some tuple count regressions uncovered through DCA, but that can be done independently of this PR. |
Uh oh!
There was an error while loading. Please reload this page.
jbj left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've written a few comments after reading the discard predicates. I haven't reviewed the rest of this PR.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
| overlay[discard_entity] | ||
| privatepredicatediscardLocation(@location loc){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| overlay[discard_entity] | |
| privatepredicatediscardLocation(@location loc){ | |
| /** | |
| *Locationsin Python TRAPfiles use named ids, so the overlay database will | |
| *reuselocationentitiesfrom the base. Therefore we should only discard a | |
| *locationifit'snotinusebythe overlay. | |
| * | |
| *Ifthe same element (with a named TRAPid)couldhaveadifferent location | |
| *inbaseandoverlay,thisdiscardingstrategywouldnotpreventthat element | |
| *from appearing to have two locations. However, the Python extractor does not | |
| *usenamedidsforentitiesthatcanchange location. | |
| */ | |
| overlay[discard_entity] | |
| privatepredicatediscardLocation(@location loc){ |
This is not an obvious predicate. I've suggested a comment here, but I don't even know if it's correct: is it impossible for the same @py_Module entity to have a different location in the base and the overlay?
Also, maybe there should be a comment to say that if we don't discard locations, probably nothing bad will happen. There will just be some unattached locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python locations use *-ids. Which I guess means that the same @py_Module extracted in the overlay would have a different location than in the base extraction. So the base location should be discarded.
The new name is required by overlay support.
with direct or indirect location links in dbscheme.
And don't add slash to start of path patterns on Windows.
…t-in files On Windows, we're getting e.g. the following mismatches, which could be due to case differences: "Skipped built-in file C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib\multiprocessing\forkserver.py" vs "Extracted file C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"
This way, we filter both root modules and (transitive) imports against the overlay-changes json.
f309dc6 to c2f026dCompared10c commented Oct 2, 2025
Superceded by PR #20337 |
This PR adds overlay support to the Python extractor, but no overlay compilation (to be merged separately since it needs further testing, see this PR).
This PR also includes an initial pass at the discard predicates (see Overlay.qll), though these are ignored in full (non-overlay) evaluation; they probably still need to be tweaked, so I'm happy to move this commit to another PR and let this one be only about the extractor.
Roadmap:
CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT})