The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
ok so this was worth reading again because it explains why i had to implement zip parsing a third fucking time last year because the stdlib suddenly broke my exes
Packaging ecosystems should be specific in their use of archive formats like tar and ZIP.
not METADATA, which is fine
For example, Python’s “zipfile” module currently doesn’t support multiple ZIP features, which are extremely uncommon today,
says the guy who just broke my exes out of spite
then they mention fuzzing and not my fuzzing for the good zip impl i wrote
Differential testing, running the same inputs through two different implementations and seeing if the results match, is also an effective method of finding bugs in data exchange formats
pretty sure this is called a compliance suite
The implementation with the fewest differentials
the most unique
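A minimal sketch of what differential testing a ZIP reader could look like, assuming two deliberately different strategies: Python's `zipfile` (which trusts the central directory) and a naive walk over local file headers. The helper names (`names_from_central_directory`, `names_from_local_headers`, `make_zip`, `stray_entry`) and the `ghost.txt` entry are made up for illustration. The local-header walker only handles stored entries without data descriptors.

```python
import io
import struct
import zipfile
import zlib

LOCAL_SIG = b"PK\x03\x04"
LOCAL_FMT = "<4s5H3I2H"  # fixed 30-byte part of a local file header
LOCAL_LEN = struct.calcsize(LOCAL_FMT)


def names_from_central_directory(data: bytes) -> list[str]:
    """Implementation A: zipfile, which reads names from the central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def names_from_local_headers(data: bytes) -> list[str]:
    """Implementation B: walk local file headers from offset 0 (illustrative only)."""
    names = []
    pos = 0
    while data[pos:pos + 4] == LOCAL_SIG:
        fields = struct.unpack_from(LOCAL_FMT, data, pos)
        comp_size, name_len, extra_len = fields[7], fields[9], fields[10]
        start = pos + LOCAL_LEN
        # Assumes UTF-8 names; real readers also have to deal with CP437.
        names.append(data[start:start + name_len].decode("utf-8"))
        pos = start + name_len + extra_len + comp_size
    return names


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build a small stored (uncompressed) archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def stray_entry(name: str, payload: bytes) -> bytes:
    """A hand-built local entry that no central directory refers to."""
    raw_name = name.encode("utf-8")
    header = struct.pack(
        LOCAL_FMT, LOCAL_SIG, 20, 0, 0, 0, 0,
        zlib.crc32(payload), len(payload), len(payload), len(raw_name), 0,
    )
    return header + raw_name + payload


good = make_zip({"a.txt": b"hello", "b/c.txt": b"world"})
odd = stray_entry("ghost.txt", b"boo") + good  # extra entry prepended
```

On `good` both readers agree on the names. On `odd` the local-header walk reports `ghost.txt` first, an entry the central directory never mentions, which is exactly the kind of disagreement a differential run is meant to surface.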
Defining a “canonical” archive layout within the specifications of package archives,
rejecting valid inputs you personally dislike
along with tools or policy to push the ecosystem towards that canonical layout,
injecting code into other people's build processes
can reduce reproducibility issues
and concern trolling maintainers
and make future improvements to restrict archive features possible.
can be an effective strategy to finally make a programming language community susceptible to SHA-256 length extension attacks
Go’s packaging ecosystem has a concept called a “DirHash,”
google NIHed the idea of hashing a directory tree so they could inject their own very specific semantics
which is a pre-defined method of creating a checksum
(all checksums are "pre-defined" or they wouldn't match)
from the extracted file contents of an archive
which is not a directory
regardless of metadata, timestamps, and insertion order.
they specifically made 100% sure that length extension was possible by only recording SHA-256 hashes and not lengths, even though google's own API for bazel remote executions includes the length for this reason
they even made sure you couldn't tell which files may have changed by running the result through another SHA-256 shredder
the dirhash link wasn't there last time i looked and this is the first actual "proof" i've found of google specifically seeking to inject length-extension attacks from merkle-damgård constructions like SHA-256 into packaging ecosystems
This removes many system and implementation-specific differences of archives from affecting the final “checksum” of the package archive,
this lets us inject whatever we want into your build process
making this mechanism more reliable to use compared to byte-for-byte comparisons.
it is very important that you avoid formats with plaintext metadata
This checksum method doesn’t account for links, empty directories, or archives that contain different contents depending on the extracting implementation.
this checksum method will produce different output than you'd expect, and we won't explain why!
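For reference, a rough Python sketch of a DirHash-style digest, modeled on my understanding of Go's `h1:` scheme: SHA-256 each file, write one `hash  name` line per file in sorted order, then SHA-256 that summary and base64 it. The function name `dirhash_h1` and the dict-of-bytes input are assumptions for illustration, not Go's API. Note that only per-file hashes and names go into the summary, and that the final output is a single hash over it.

```python
import base64
import hashlib


def dirhash_h1(files: dict[str, bytes]) -> str:
    """Digest over file names and contents only (no metadata, no ordering)."""
    summary = hashlib.sha256()
    # Sort by encoded name so insertion order cannot affect the result.
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        if "\n" in name:
            raise ValueError("filenames with newlines are not supported")
        file_digest = hashlib.sha256(files[name]).hexdigest()
        # One summary line per file: hex digest, two spaces, name, newline.
        summary.update(f"{file_digest}  {name}\n".encode("utf-8"))
    return "h1:" + base64.b64encode(summary.digest()).decode("ascii")


tree = {"mod@v1.0.0/go.mod": b"module mod\n", "mod@v1.0.0/main.go": b"package main\n"}
reordered = dict(reversed(list(tree.items())))
changed = {**tree, "mod@v1.0.0/main.go": b"package main // edited\n"}
```

`dirhash_h1(tree)` and `dirhash_h1(reordered)` come out identical, while `changed` produces a different digest. Empty directories and links have no representation in the input at all, which matches the limitation quoted above.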
Packaging ecosystems like Python’s often have many tools with different organizational structures and roadmaps, and therefore cannot easily coordinate on ecosystem-wide challenges like archive format features.
pypi suddenly stopped supporting negative range requests for wheels immediately after i gave a talk about my work at packagingcon 2023 https://web.archive.org/web/20250218154403/https://cfp.packaging-con.org/2023/talk/hpuhu7/, but they refused to implement PEP 658 for years after i added it to pip, and would provide no point of contact to explain why
This is especially true when the issue arises from an ambiguity in a packaging standard, which can take months to fix and often requires public correspondence, meaning users are left to fend for themselves from exploits while standards are fixed.
yeah it's crazy how brett cannon deleted my comments identifying METADATA as a literal RCE vector and refuses to support the JSON transformation i added to pip years ago.
too bad how this keeps breaking builds https://github.com/scikit-build/scikit-build-core/issues/1006
newline parser confusion means pip will misread Requires-Dist: lines
https://github.com/python/cpython/issues/117313
However, in cases where a header value may contain multi-byte Unicode sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as legacy ASCII 'form-feed', and deemed to be a line ending.
this is a textbook injection attack and no one in python packaging is interested in parser confusion from the fucked up email header parsing
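A small sketch of the line-splitting mismatch, using only `str.splitlines()` and `str.split("\n")`. The metadata text and the `evil-package` name are made up. One consumer splits strictly on `\n`, the other uses `splitlines()`, which also treats form feed (`\x0c`) as a line boundary.

```python
# Hypothetical METADATA-like text with a form feed embedded in a header value.
metadata = "Name: demo\nSummary: a\x0cRequires-Dist: evil-package"

# Parser A: splits only on "\n" and sees two header lines.
strict_lines = metadata.split("\n")

# Parser B: str.splitlines() also breaks on "\x0c" and sees three.
loose_lines = metadata.splitlines()

# Only parser B believes there is a Requires-Dist header.
strict_has_dep = any(line.startswith("Requires-Dist:") for line in strict_lines)
loose_has_dep = any(line.startswith("Requires-Dist:") for line in loose_lines)
```

Here `strict_has_dep` is `False` and `loose_has_dep` is `True`: the same bytes yield a dependency for one reader and none for the other.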
the METADATA format is significantly older than zip archives, without any actual spec to compare it against
However, applying protections like rejecting archives from a large public package repository like PyPI is not without challenge.
you do it all the damn time seth
One of the biggest difficulties for developing open source package repositories is that maintainers don’t know who or what is interacting with their systems.
bullshit. google_pasta-0.2.0-py3-none-any.whl is actively exploiting the METADATA vuln and they know what they're doing
For this reason, there needs to be caution and care put into any decision that affects an entire ecosystem.
usually we do PEPs for that. but pypi doesn't do PEPs
oh my god forgot they had another blog post https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks/
especially notice how our boy seth calls out the wheel format PEP for being vague:
The most detail provided is:
and then does not propose a PEP change. he does have another accusation though:
However, most Python installers today do not do this check and extract the contents of the ZIP archive similar to unzip and then amend the installed RECORD within the virtual environment so that uninstalling the package works as expected.
our boy seth is claiming that "most python installers" will:
- recompute checksums for the files it just extracted
- open the RECORD file and write the new value
and then implies this is some sort of cover-up for the uninstall operation
the author of the wheel PEP seth calls out was one of the few people who reached out to help me with my http zip PR that pip still refuses to merge https://github.com/pypa/pip/pull/12208