Python extractor: overlay support #20206

d10c · 2025-08-11T14:37:39Z

This PR adds overlay support to the Python extractor, but no overlay compilation (to be merged separately since it needs further testing, see this PR).

This PR also includes an initial pass at the discard predicates (see Overlay.qll), though these are ignored in full (non-overlay) evaluation; they probably still need to be tweaked, so I'm happy to move this commit to another PR and let this one be only about the extractor.

Roadmap:

Update the dbscheme
Implement path transformer support
Read the overlay-changes JSON file
Read/write base metadata (CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT})
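The last roadmap item names the metadata variables with a brace pattern. As a minimal sketch, assuming `<LANG>` expands to `PYTHON` (as in `CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES`, which the traverser diff reads), the overlay-related settings could be collected as follows. `overlay_env` is a hypothetical helper and not part of the extractor:

```python
import os

# Assumption: <LANG> is PYTHON, matching the CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES
# variable read in traverser.py.
_PREFIX = "CODEQL_EXTRACTOR_PYTHON_OVERLAY_"


def overlay_env(environ=None):
    """Return the overlay-related settings found in the environment.

    Hypothetical helper: unset variables map to None. The values are
    returned unparsed, since their format is not described in this PR.
    """
    environ = os.environ if environ is None else environ
    return {
        "changes": environ.get(_PREFIX + "CHANGES"),
        "base_metadata_in": environ.get(_PREFIX + "BASE_METADATA_IN"),
        "base_metadata_out": environ.get(_PREFIX + "BASE_METADATA_OUT"),
    }
```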

python/ql/lib/semmle/python/Overlay.qll

d10c · 2025-08-28T10:33:43Z

@tausbn I'm thinking this might be a good time to checkpoint this work and get it reviewed. In the last DCA run for full analysis on this PR (see above), overall analysis time is unaffected, though there are a few outstanding stage timing results that are probably noise.

Copilot

Pull Request Overview

This PR adds overlay support to the Python extractor by implementing infrastructure for incremental analysis through database overlays, without including overlay compilation functionality.

Key changes implemented:

Database schema updates to support overlay metadata and change tracking
Extractor modifications to handle overlay-specific file traversal and metadata management
Path transformer support using updated environment variables

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

File	Description
python/ql/lib/semmlecode.python.dbscheme	Adds `databaseMetadata` and `overlayChangedFiles` relations for overlay support
python/ql/lib/semmle/python/Overlay.qll	Implements discard predicates to filter out obsolete entities during overlay analysis
python/extractor/semmle/traverser.py	Modifies file traversal to only process changed files during overlay extraction
python/extractor/semmle/worker.py	Adds support for writing base metadata output required for overlay operations
python/extractor/semmle/path_rename.py	Updates path transformer to support new `CODEQL_PATH_TRANSFORMER` environment variable

Copilot · 2025-08-28T10:34:10Z

python/extractor/semmle/traverser.py

+with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
+    data = json.load(f)
+    changed_paths = data.get('changes', [])
+    self.overlay_changes = { os.path.abspath(p) for p in changed_paths }


The variable name self.overlay_changes is inconsistent with the other instance variables which use snake_case (self.exclude_paths, self.recurse_files, etc.). Consider renaming to self.overlay_changed_paths for consistency.

tausbn

Overall I think this looks good. 👍

Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

Also, don't forget to update the extractor version here: https://github.com/github/codeql/blob/main/python/extractor/semmle/util.py#L13
(In this case, I think bumping it to 7.1.4 would be fine. We don't really have fixed rules for how to increase the version. The most important thing is that it changes so that we can tell from the log output what version of the extractor we're running.)

tausbn · 2025-09-05T12:50:02Z

python/extractor/semmle/traverser.py

+if 'CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES' in os.environ:
+    with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
+        data = json.load(f)


I'm debating whether we should have some exception handling here (substituting the empty list of changed files in case something goes wrong). Currently, if something ends up being messed up in the JSON, then I believe the whole extraction will just fail.
I don't have strong feelings about it, though.

d10c · Sep 5, 2025

Thanks for the review! I also don't have strong opinions about whether file reading should fail loudly or warn and continue with a default (None, i.e. full extraction). I guess I'll go for the latter. And also insert a logger statement with the value of the environment variable, as is the convention elsewhere in the extractor.
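The "warn and continue" behaviour described in this reply could look roughly like the sketch below. It is an illustration under assumptions, not the code that was committed: `load_overlay_changes` and the logger name are hypothetical, while the environment variable and the `changes` key come from the diff above.

```python
import json
import logging
import os

logger = logging.getLogger("overlay-sketch")  # hypothetical logger name

ENV_VAR = "CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES"


def load_overlay_changes(environ=None):
    """Return the set of changed absolute paths, or None for full extraction."""
    environ = os.environ if environ is None else environ
    path = environ.get(ENV_VAR)
    if path is None:
        return None  # variable unset: not in overlay mode
    # Log the value of the environment variable, as the reply proposes.
    logger.info("%s=%s", ENV_VAR, path)
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return {os.path.abspath(p) for p in data.get("changes", [])}
    except (OSError, ValueError, AttributeError, TypeError) as e:
        # Missing file, malformed JSON, or unexpected JSON shape:
        # warn and fall back to full extraction instead of failing.
        logger.warning("Could not read overlay changes from %s: %s", path, e)
        return None
```

Returning `None` rather than an empty set keeps "no overlay information" distinct from "overlay mode with no changed files".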

d10c · 2025-09-05T16:11:47Z

> Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

There are basic integration tests here but they depend on overlay compilation (not part of this commit), and also I'm still running into some issues on Windows (it appears that the path transformer is not working correctly there—currently debugging that). So maybe merging this should wait until I have that sorted.

Otherwise, do you have an idea for an integration test for this functionality that doesn't also exercise complete overlay evaluation?

d10c · 2025-09-10T18:47:51Z

I think I've figured out why path transformers weren't working on Windows and why built-in modules were being extracted (see latest commits). Now the integration test on the other PR passes.

The only remaining thing now is solving some tuple count regressions uncovered through DCA, but that can be done independently of this PR.

python/extractor/semmle/extractors/builtin_extractor.py

jbj

I've written a few comments after reading the discard predicates. I haven't reviewed the rest of this PR.

python/ql/lib/semmle/python/Overlay.qll

jbj · 2025-09-12T13:52:28Z

python/ql/lib/semmle/python/Overlay.qll

+overlay[discard_entity]
+private predicate discardLocation(@location loc) {


Suggested change
-overlay[discard_entity]
-private predicate discardLocation(@location loc) {
+/**
+ * Locations in Python TRAP files use named ids, so the overlay database will
+ * reuse location entities from the base. Therefore we should only discard a
+ * location if it's not in use by the overlay.
+ *
+ * If the same element (with a named TRAP id) could have a different location
+ * in base and overlay, this discarding strategy would not prevent that element
+ * from appearing to have two locations. However, the Python extractor does not
+ * use named ids for entities that can change location.
+ */
+overlay[discard_entity]
+private predicate discardLocation(@location loc) {
This is not an obvious predicate. I've suggested a comment here, but I don't even know if it's correct: is it impossible for the same @py_Module entity to have a different location in the base and the overlay?
Also, maybe there should be a comment to say that if we don't discard locations, probably nothing bad will happen. There will just be some unattached locations.

d10c · Sep 12, 2025

Python locations use *-ids. Which I guess means that the same @py_Module extracted in the overlay would have a different location than in the base extraction. So the base location should be discarded.
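To make this exchange concrete, here is a toy model in Python. It is not CodeQL's evaluation semantics, and the row layout, ids, and function names are all hypothetical. It only illustrates the point of the reply: if a re-extracted file gets a fresh location id, the base location must be discarded or the module appears to have two locations.

```python
def merge(base, overlay, changed_files, discard=True):
    """Combine base and overlay location rows.

    Rows are (location_id, file, element) tuples. With discard=True, base
    rows that belong to a changed file are dropped before merging.
    """
    kept = [row for row in base if not (discard and row[1] in changed_files)]
    return kept + list(overlay)


def locations_of(rows, element):
    """Return the location ids attached to one element."""
    return [loc for (loc, _file, elem) in rows if elem == element]


base = [("loc#1", "a.py", "module a"), ("loc#2", "b.py", "module b")]
# a.py changed and was re-extracted; its location gets a fresh id.
overlay = [("loc#3", "a.py", "module a")]
changed = {"a.py"}

with_discard = merge(base, overlay, changed, discard=True)
without_discard = merge(base, overlay, changed, discard=False)
```

In this model `locations_of(with_discard, "module a")` contains only the overlay location, while `locations_of(without_discard, "module a")` contains both the stale base location and the new one.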

…t-in files On Windows, we're getting e.g. the following mismatches, which could be due to case differences: "Skipped built-in file C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib\multiprocessing\forkserver.py" vs "Extracted file C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"
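The two log lines above differ only in the case of `Lib` versus `lib`. One way to compare such paths, sketched here as an assumption and not as the fix that was committed, is to normalise case with the standard `ntpath` module before checking a prefix. `is_under` is a hypothetical helper:

```python
import ntpath


def is_under(path, root):
    """Case-insensitive check that path lies inside root, using Windows path rules."""
    p = ntpath.normcase(ntpath.normpath(path))
    r = ntpath.normcase(ntpath.normpath(root))
    return p == r or p.startswith(r.rstrip("\\") + "\\")


stdlib_root = r"C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib"
extracted = r"C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"

# A plain string prefix test misses the file because of the Lib/lib difference.
naive_match = extracted.startswith(stdlib_root)
normalised_match = is_under(extracted, stdlib_root)
```

Using `ntpath` directly (instead of `os.path`) applies Windows rules on every platform, which keeps the sketch testable outside Windows.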


d10c · 2025-10-02T16:17:53Z

Superseded by PR #20337

github-actions bot added the Python label Aug 11, 2025

d10c force-pushed the d10c/python-overlay branch 2 times, most recently from b18b9ce to 3015c12 on August 12, 2025 10:48

github-advanced-security bot found potential problems Aug 18, 2025

python/ql/lib/semmle/python/Overlay.qll Fixed
python/ql/lib/semmle/python/Overlay.qll Fixed

d10c force-pushed the d10c/python-overlay branch from b0c7a52 to b5c8338 on August 19, 2025 18:20

github-advanced-security bot found potential problems Aug 19, 2025

python/ql/lib/semmle/python/Overlay.qll Fixed

d10c force-pushed the d10c/python-overlay branch from f75a392 to 63106c0 on August 20, 2025 14:32

d10c mentioned this pull request Aug 27, 2025
Python overlay compilation #20293
Closed

d10c force-pushed the d10c/python-overlay branch from 63106c0 to b3a1ba5 on August 27, 2025 08:42

d10c mentioned this pull request Aug 27, 2025
Python: overlay compilation d10c/codeql#1
Draft

d10c force-pushed the d10c/python-overlay branch from b3a1ba5 to fb23977 on August 27, 2025 08:59

d10c marked this pull request as ready for review August 28, 2025 10:33

d10c requested a review from a team as a code owner August 28, 2025 10:33

d10c requested review from Copilot and tausbn August 28, 2025 10:33

Copilot AI reviewed Aug 28, 2025

d10c mentioned this pull request Sep 1, 2025
Python: enable overlay compilation + extractor overlay support #20337
Merged

tausbn requested changes Sep 5, 2025

d10c force-pushed the d10c/python-overlay branch from fb23977 to f309dc6 on September 10, 2025 18:42

jbj reviewed Sep 12, 2025

python/extractor/semmle/extractors/builtin_extractor.py Outdated

jbj reviewed Sep 12, 2025

d10c added 6 commits September 12, 2025 23:13

Add overlay builtins to python dbscheme
5dbe2e2

Turn on overlay support in codeql-extractor.yml
a9f8640

Add database upgrade/downgrade scripts
1f01b11

Support CODEQL_PATH_TRANSFORMER env var in python path renamer
c0497af
The new name is required by overlay support.

Python extractor: in overlay mode, traverse only changed files
f4defdc

Write overlay metadata at end of extraction.
44219d0

d10c added 8 commits September 12, 2025 23:13

Discard predicates for dbscheme elements
8edd6c1
with direct or indirect location links in dbscheme.

Add synthetic data to dbscheme.stats for databaseMetadata/`overlayChangedFiles`
693add0

Overlay.qll: remove overlay[local?] module; in favour of explicit annotations
627a7b4

Extractor: fall back to full extraction on overlay changes json read error
5ef3215

Path transformer: handle Windows-style paths
4bea5d7
And don't add slash to start of path patterns on Windows.

Extractor: move overlay-changes check from traverser to worker
2ed8dc7
This way, we filter both root modules and (transitive) imports against the overlay-changes json.

Overlay.qll: Streamline discardable entities into one Discardable superclass
c2f026d
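Commit 2ed8dc7 in the list above moves the overlay-changes check from the traverser to the worker so that root modules and transitive imports pass through the same filter. A rough sketch of that idea follows. The function names, the `imports_of` callable, and the choice to keep following imports of unchanged files are all assumptions made for illustration, not the extractor's actual structure:

```python
import os


def should_extract(path, overlay_changes):
    """overlay_changes is None for full extraction, else a set of absolute paths."""
    if overlay_changes is None:
        return True
    return os.path.abspath(path) in overlay_changes


def files_to_extract(roots, imports_of, overlay_changes):
    """Walk roots and their transitive imports, applying one shared filter.

    imports_of is a hypothetical callable mapping a file to the files it imports.
    """
    seen, extracted, todo = set(), [], list(roots)
    while todo:
        path = todo.pop()
        if path in seen:
            continue
        seen.add(path)
        # The same check covers root modules and imported modules alike.
        if should_extract(path, overlay_changes):
            extracted.append(path)
        todo.extend(imports_of(path))
    return extracted
```

Because the check sits at the point where each file is processed, an imported module that is absent from the overlay-changes set is skipped even though it was never listed as a root.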

d10c force-pushed the d10c/python-overlay branch from f309dc6 to c2f026d on September 12, 2025 21:14

d10c closed this Oct 2, 2025

