The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
This is especially true when the issue arises from an ambiguity in a packaging standard, which can take months to fix and often requires public correspondence, meaning users are left to fend for themselves from exploits while standards are fixed.
yeah it's crazy how brett cannon deleted my comments identifying METADATA as a literal RCE vector and refuses to support the JSON transformation i added to pip years ago.
too bad how this keeps breaking builds https://github.com/scikit-build/scikit-build-core/issues/1006
newline parser confusion means pip will misread Requires-Dist: lines
https://github.com/python/cpython/issues/117313
However, in cases where a header value may contain multi-byte Unicode sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as legacy ASCII 'form-feed', and deemed to be a line ending.
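for anyone who wants to see the underlying behavior: a minimal sketch of how python's str.splitlines() treats \x0c as a line boundary while bytes.splitlines() does not. the header text and the "injected-package" name are made up for illustration, and this only shows the string-splitting primitive, not the exact code path pip or the email module takes.

```python
# Hypothetical header value containing a form feed (\x0c) in the middle.
header_text = "Summary: part one\x0cRequires-Dist: injected-package"

# str.splitlines() uses a broad set of line boundaries, including \x0c.
str_lines = header_text.splitlines()

# bytes.splitlines() only splits on \n, \r and \r\n.
bytes_lines = header_text.encode("utf-8").splitlines()

print(len(str_lines), str_lines)
print(len(bytes_lines), bytes_lines)
```

so the same value is one line or two depending on whether it gets split as text or as bytes.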
this is a textbook injection attack and no one in python packaging is interested in parser confusion from the fucked up email header parsing
the METADATA format is significantly older than zip archives, without any actual spec to compare it against
However, applying protections like rejecting archives from a large public package repository like PyPI is not without challenge.
you do it all the damn time seth
One of the biggest difficulties for developing open source package repositories is that maintainers don’t know who or what is interacting with their systems.
bullshit. google_pasta-0.2.0-py3-none-any.whl is actively exploiting the METADATA vuln and they know what they're doing
For this reason, there needs to be caution and care put into any decision that affects an entire ecosystem.
usually we do PEPs for that. but pypi doesn't do PEPs
oh my god forgot they had another blog post https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks/
especially notice how our boy seth calls out the wheel format PEP for being vague:
The most detail provided is:
and then does not propose a PEP change. he does have another accusation though:
However, most Python installers today do not do this check and extract the contents of the ZIP archive similar to unzip and then amend the installed RECORD within the virtual environment so that uninstalling the package works as expected.
our boy seth is claiming that "most python installers" will:
- recompute checksums for the files it just extracted
- open the RECORD file and write the new value
and then implies this is some sort of cover-up for the uninstall operation
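for reference, a rough sketch of what "recompute checksums" against RECORD would involve, assuming the entry layout from the wheel spec (path, then sha256= plus the urlsafe base64 digest with padding stripped, then size). record_entry and matches_record are hypothetical helper names, not anything pip or another installer exposes.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line for one file (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses urlsafe base64 with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def matches_record(entry: str, data: bytes) -> bool:
    """Check extracted file contents against an existing RECORD line."""
    # Split from the right so the hash and size fields are always last.
    path, _hash, _size = entry.rsplit(",", 2)
    return record_entry(path, data) == entry


entry = record_entry("pkg/mod.py", b"print('hello')\n")
print(entry)
print(matches_record(entry, b"print('hello')\n"))
print(matches_record(entry, b"print('evil')\n"))
```

checking is a pure comparison like this; rewriting the RECORD entry after extraction is a separate step.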
the author of the wheel PEP seth calls out was one of the few people who reached out to help me with my http zip PR that pip still refuses to merge https://github.com/pypa/pip/pull/12208
it's so fucking batshit to make a whole blog post about "popular installers" failing to check the RECORD file without mentioning or contacting any of them to fix it
ok i'm more insane now but i'm also much more certain that seth larson writing active falsehoods about zips and checksums is because:
- plaintext metadata and redundant offsets make it very difficult to retroactively modify data without breaking a human-checkable invariant
- storing both compressed and uncompressed size means the decompression process cannot be used to inject extra data through length extension
- zip files having a magic number and unambiguous offsets at the end possibly foils length extension
- when negative range requests suddenly broke on pypi, that was likely about length extension
METADATA on the other hand is not plaintext, because its semantics depend upon the version of cpython/pip, and are not machine-checkable or human-visible
google's dirhash is so fucking funny though
problem is to make a good proof of concept for length extension you need to have a ton of parallel compute, close to a whole data center, to figure out how to go from:
(A) length N, checksum C [good]
(B) length N + M, checksum C [bad]
since SHA-256 is cryptographic, you need to essentially mine a bitcoin
it's a little harder than a bitcoin aiui, since you need to ensure the result N + M seems valid at first, but actually injects some evil data
https://en.wikipedia.org/wiki/Length_extension_attack
this page claims SHA-256/512 and SHA3 are not susceptible, which i'm pretty sure is false? it does admit:
A secret suffix MAC, which is calculated as Hash(message ‖ secret), isn't vulnerable to a length extension attack, but is vulnerable to another attack based on a hash collision.[7]
ah, so i was mistaken! length extension seems to require that the valid prefix is not known to the attacker
anyway, you know who operates pypi and has a lot of spare GPUs? obviously that's the very definition of google scale
that's why the RECORD file is weirdly redundant with the required zip metadata, EXCEPT: standard zips only provide a crc32 checksum for each entry, which is not cryptographic and much easier to collide with
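a small sketch of that redundancy, using only the stdlib zipfile module on an in-memory archive. the file name and payload are invented; the point is just that the zip entry carries a crc32 plus both sizes, while a sha256 has to be computed separately.

```python
import hashlib
import io
import zipfile
import zlib

payload = b"print('hello')\n"  # made-up file contents

# Write a one-entry archive into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("pkg/mod.py", payload)

# Read back the per-entry metadata the zip format itself stores.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("pkg/mod.py")
    stored_crc = info.CRC
    sizes = (info.compress_size, info.file_size)

# The stored checksum is a 32-bit CRC, not a cryptographic hash.
crc_matches = stored_crc == zlib.crc32(payload)

# A RECORD-style sha256 is extra information the zip does not carry.
sha256_hex = hashlib.sha256(payload).hexdigest()

print(f"crc32={stored_crc:08x} sizes={sizes} crc_matches={crc_matches}")
print(f"sha256={sha256_hex}")
```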