Skip to content

Reject crafted ZIPs that scan empty but unpack code (#780)#790

Open
dvlinuxx-max wants to merge 2 commits into
DataDog:v3from
dvlinuxx-max:dvlinuxx-max/fix-zip-eocd-differential
Open

Reject crafted ZIPs that scan empty but unpack code (#780)#790
dvlinuxx-max wants to merge 2 commits into
DataDog:v3from
dvlinuxx-max:dvlinuxx-max/fix-zip-eocd-differential

Conversation

@dvlinuxx-max

@dvlinuxx-max dvlinuxx-max commented Jun 24, 2026

Copy link
Copy Markdown

Fixes #780.

While looking at how guarddog reads wheels I noticed it trusts whatever zipfile enumerates from the central directory. You can craft an archive where that view is empty but the payload is still there:

  • set the EOCD size of central directory to 0
  • or append a second EOCD record with 0 entries
  • or point cd_offset at an empty central directory

In all of these zipfile.namelist() comes back empty, so safe_extract pulls out zero files and the package scans as clean ("Found 0 potentially malicious indicators") - even though pip and other installers happily unpack the code from the local file headers.

What I changed in safe_extract: before extracting, walk the local file headers directly and refuse the archive if there are more of them than zipfile enumerated. I parse each header's compressed size and seek over the data rather than scanning for the PK�� signature, because a raw signature scan false-positives on compressed bytes. The walk stops early on data descriptors without inline sizes and on zip64 size markers, so it can only ever undercount - it won't reject a legit archive.

Tests in tests/core/test_archives.py:

  • a normal stored wheel still extracts
  • the cd_size=0 archive (namelist() == []) now raises

I also checked the trailing-EOCD framing and ran it against a real deflate wheel with 107 members including a directory entry - no false positive there. make lint and black --check guarddog pass, mypy has nothing new on archives.py.

Background on this bug class is the USENIX Security 2025 paper "My ZIP isn't your ZIP": https://www.usenix.org/system/files/usenixsecurity25-you.pdf

One note - my commit isn't signed and v3 looks like it requires signed commits, so a squash merge would probably be easiest on your end. Happy to adjust anything.

@dvlinuxx-max dvlinuxx-max requested a review from a team as a code owner June 24, 2026 12:15
Was poking at how guarddog reads wheels and hit this. If you set the EOCD
size-of-central-directory to 0 (or tack a second empty EOCD onto the end),
zipfile reports the archive as empty, but pip and other installers still walk
the local file headers and unpack everything normally. So a package can ship
real code and guarddog scans it as clean - "Found 0 potentially malicious
indicators".

Fix is in safe_extract: walk the local file headers directly and bail out if
there are more of them than zipfile enumerated from the central directory. I
read each header's compressed size and seek over the data instead of scanning
for the PK signature, otherwise compressed bytes give false matches. It stops
early on data descriptors and zip64 size markers so normal archives never trip
it.

Added tests for the cd_size=0 case and a regular wheel. Also ran it against a
real deflate wheel (107 files including a dir entry) and got no false positive.
lint and black pass.
@dvlinuxx-max dvlinuxx-max force-pushed the dvlinuxx-max/fix-zip-eocd-differential branch from 1b03d0c to f24fd87 Compare June 24, 2026 12:23
@dvlinuxx-max dvlinuxx-max changed the title Reject ZIP parser-differential archives in safe_extract (#780) Reject crafted ZIPs that scan empty but unpack code (#780) Jun 24, 2026
@sobregosodd sobregosodd requested a review from Copilot June 24, 2026 19:58

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a ZIP parser-differential scan-evasion where zipfile can enumerate an archive as empty (via crafted EOCD/central-directory fields) while installers can still unpack code from local file headers, causing GuardDog to scan “0 files” and miss malicious content.

Changes:

  • Add a local-file-header walk to detect when local headers expose more members than zipfile enumerates, and reject such archives.
  • Add tests covering a well-formed wheel-like ZIP and a crafted cd_size=0 differential that should now be rejected.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
guarddog/utils/archives.py Adds local-header counting and an assertion to reject archives where zipfile’s central-directory view appears to hide members.
tests/core/test_archives.py Adds hand-built ZIP fixtures and regression tests for the EOCD cd_size=0 differential.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +68 to +69
if (flags & 0x08 and compressed_size == 0) or compressed_size == 0xFFFFFFFF:
break
Comment thread tests/core/test_archives.py Outdated
Comment on lines +112 to +115
import zipfile

assert zipfile.ZipFile(str(archive)).namelist() == []

Comment on lines +106 to +110
def test_cd_size_zero_eocd_differential_rejected(tmp_path):
# zipfile reads this as empty (namelist() == []) but the payload is still
# present in the local file headers; safe_extract must refuse it (issue #780).
archive = tmp_path / "crafted-1.0-py3-none-any.whl"
archive.write_bytes(_build_zip(_WHL_MEMBERS, cd_size=0))
…ntries

Addresses Copilot review on DataDog#790.

_count_local_file_headers incremented the counter only after the seek, so a
local header that uses a data descriptor (general-purpose bit 3, sizes deferred)
or a zip64 size marker broke out of the loop with the member uncounted. A crafted
archive whose central directory reads empty (cd_size=0) but whose first local
header is such an entry therefore walked to 0 and slipped past the guard. Count
the header as soon as its PK\x03\x04 signature is confirmed, then decide whether
the next member can be followed; this only stops undercounting, never overcounts.

Tests: open the crafted archive through a context manager so the descriptor is
released promptly, and add a regression case for a cd_size=0 archive whose first
local header carries a data descriptor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GuardDog fails to detect malicious code in malformed-but-installable archives (zipfile under-extraction)

2 participants