Reject crafted ZIPs that scan empty but unpack code (#780)#790
Open
dvlinuxx-max wants to merge 2 commits into
Open
Reject crafted ZIPs that scan empty but unpack code (#780)#790dvlinuxx-max wants to merge 2 commits into
dvlinuxx-max wants to merge 2 commits into
Conversation
Was poking at how guarddog reads wheels and hit this. If you set the EOCD size-of-central-directory to 0 (or tack a second empty EOCD onto the end), zipfile reports the archive as empty, but pip and other installers still walk the local file headers and unpack everything normally. So a package can ship real code and guarddog scans it as clean - "Found 0 potentially malicious indicators". Fix is in safe_extract: walk the local file headers directly and bail out if there are more of them than zipfile enumerated from the central directory. I read each header's compressed size and seek over the data instead of scanning for the PK signature, otherwise compressed bytes give false matches. It stops early on data descriptors and zip64 size markers so normal archives never trip it. Added tests for the cd_size=0 case and a regular wheel. Also ran it against a real deflate wheel (107 files including a dir entry) and got no false positive. lint and black pass.
1b03d0c to
f24fd87
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a ZIP parser-differential scan-evasion where zipfile can enumerate an archive as empty (via crafted EOCD/central-directory fields) while installers can still unpack code from local file headers, causing GuardDog to scan “0 files” and miss malicious content.
Changes:
- Add a local-file-header walk to detect when local headers expose more members than
zipfileenumerates, and reject such archives. - Add tests covering a well-formed wheel-like ZIP and a crafted
cd_size=0differential that should now be rejected.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
guarddog/utils/archives.py |
Adds local-header counting and an assertion to reject archives where zipfile’s central-directory view appears to hide members. |
tests/core/test_archives.py |
Adds hand-built ZIP fixtures and regression tests for the EOCD cd_size=0 differential. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+68
to
+69
| if (flags & 0x08 and compressed_size == 0) or compressed_size == 0xFFFFFFFF: | ||
| break |
Comment on lines
+112
to
+115
| import zipfile | ||
|
|
||
| assert zipfile.ZipFile(str(archive)).namelist() == [] | ||
|
|
Comment on lines
+106
to
+110
| def test_cd_size_zero_eocd_differential_rejected(tmp_path): | ||
| # zipfile reads this as empty (namelist() == []) but the payload is still | ||
| # present in the local file headers; safe_extract must refuse it (issue #780). | ||
| archive = tmp_path / "crafted-1.0-py3-none-any.whl" | ||
| archive.write_bytes(_build_zip(_WHL_MEMBERS, cd_size=0)) |
…ntries Addresses Copilot review on DataDog#790. _count_local_file_headers incremented the counter only after the seek, so a local header that uses a data descriptor (general-purpose bit 3, sizes deferred) or a zip64 size marker broke out of the loop with the member uncounted. A crafted archive whose central directory reads empty (cd_size=0) but whose first local header is such an entry therefore walked to 0 and slipped past the guard. Count the header as soon as its PK\x03\x04 signature is confirmed, then decide whether the next member can be followed; this only stops undercounting, never overcounts. Tests: open the crafted archive through a context manager so the descriptor is released promptly, and add a regression case for a cd_size=0 archive whose first local header carries a data descriptor.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #780.
While looking at how guarddog reads wheels I noticed it trusts whatever
zipfileenumerates from the central directory. You can craft an archive where that view is empty but the payload is still there:size of central directoryto0cd_offsetat an empty central directoryIn all of these
zipfile.namelist()comes back empty, sosafe_extractpulls out zero files and the package scans as clean ("Found 0 potentially malicious indicators") - even though pip and other installers happily unpack the code from the local file headers.What I changed in
safe_extract: before extracting, walk the local file headers directly and refuse the archive if there are more of them than zipfile enumerated. I parse each header's compressed size and seek over the data rather than scanning for thePK��signature, because a raw signature scan false-positives on compressed bytes. The walk stops early on data descriptors without inline sizes and on zip64 size markers, so it can only ever undercount - it won't reject a legit archive.Tests in
tests/core/test_archives.py:cd_size=0archive (namelist() == []) now raisesI also checked the trailing-EOCD framing and ran it against a real deflate wheel with 107 members including a directory entry - no false positive there.
make lintandblack --check guarddogpass, mypy has nothing new onarchives.py.Background on this bug class is the USENIX Security 2025 paper "My ZIP isn't your ZIP": https://www.usenix.org/system/files/usenixsecurity25-you.pdf
One note - my commit isn't signed and v3 looks like it requires signed commits, so a squash merge would probably be easiest on your end. Happy to adjust anything.