For example, if the xz-utils project had a fully reproducible build, the attack used by the malicious maintainer could have been detected if projects were checking the reproducibility of artifacts from the source code.
THE MALICIOUS MAINTAINER!
which reminds me that macrokernel isolation is the better way to address poettering shamelessly dlopen()ing shared libs because "uh i want to catch and respond to errors"
Software packages ideally should be byte-for-byte reproducible,
unless it's the rust compiler
but many features of archive formats make reproducing package archives difficult
"reproducing package archives" means you have to erase metadata in the code you ship to all your users because otherwise i'm going to continuously neg you until you give up maintainership
compression (stream names!)
this one is really trying
OMG THE EVIL SCARY AUTOCONF TAR COMMAND LINE
This long list of options is unwieldy, to say the least. If instead, there was a “--reproducible” option which enabled all these flags to achieve build reproducibility, that would be much more accessible and discoverable for users looking to create reproducible ZIP and tar archives.
pypi security engineer in residence who literally extracted the command line from the cpython build script where he himself commented: https://github.com/python/release-tools/pull/62
# Recommended options for creating reproducible tarballs from:
# https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html#Reproducibility
literally "the manual on the web said exactly how to do this but i would prefer a nonstandard flag with unclear semantics"
@hipsterelectron python could add a --reproducible flag that prints the link to the doc 😺
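For reference, the flags in that GNU tar recipe map fairly directly onto Python's own `tarfile` and `gzip` modules. This is a minimal sketch, not the cpython release script: the function name `make_reproducible_targz`, the two-value permission scheme and the default `mtime=0` are all assumptions made for illustration.

```python
import gzip
import tarfile
from pathlib import Path


def make_reproducible_targz(src_dir, out_path, mtime=0):
    """Write src_dir to out_path so the bytes depend only on names and contents."""
    src = Path(src_dir)

    def normalize(info):
        # Roughly --owner=0 --group=0 --numeric-owner --mtime=... from the tar manual.
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mtime = mtime
        # Assumption: collapse permissions to two values so the local umask
        # does not leak into the archive.
        info.mode = 0o755 if (info.isdir() or info.mode & 0o100) else 0o644
        return info

    with open(out_path, "wb") as raw:
        # filename="" and mtime=0 keep the gzip header free of a stream name
        # and a timestamp.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.PAX_FORMAT) as tar:
                # Sorted order stands in for --sort=name.
                for path in sorted(src.rglob("*")):
                    tar.add(path, arcname=path.relative_to(src).as_posix(),
                            recursive=False, filter=normalize)
```

Two checkouts of the same tree with different file timestamps and creation order should produce identical output bytes with this function.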
Extraction filter escapes (“ZIP slip”)
ZIP SLIP ZIP ZLIP
When installing a software package, the installation program needs to extract the contents of the archive to the system before continuing with the package installation.
no, zips are usable without blocking on extraction
This extraction process needs to be filtered to avoid modifying the system beyond what is intended.
actual made up false lie, pip just calls .extractall()
Escaping this filtering is commonly done in one of two ways,
you can't escape the filtering. the zip file is not alive
either with “parent directory” path segments (e.g., “..”)
banned everywhere
or abusing links such as symbolic or hard links,
zips don't support hard links
either through extraction order,
the order is unambiguous
path collisions,
yeah it might write to a file twice
or path truncations
literally not a thing
METADATA is able to inject deps through parser confusion because it uses the worst code i've ever seen in my entire life https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py and the packaging maintainer personally ensures it never uses my json format, but pypi doesn't check uploads for that shit
The trouble is that tar and ZIP archives are designed to be able to do these actions that can be abused by malicious archives.
MALICIOUS MAINTAINER!!!
bruh omg
There is a deprecation warning for using no defined filter when extracting an archive
imagine being so desperate to scare your users you'll add a deprecation warning to the language and won't ever block actually malicious uploads to pypi
with the default changing to “data” in Python 3.14
bruh https://peps.python.org/pep-0706/#filters this just breaks any build pipelines extracting exes and clobbers perms in absurd ways that really unnerve me
All installers are recommended to use filtering when extracting Python package archives to avoid attacks like path traversal.
path traversal!!!!! scary!!!!!!!
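For anyone who has not seen the PEP 706 filter in use, here is a minimal sketch of what the quoted recommendation amounts to for tar archives. The helper name `extract_untrusted` is hypothetical, and it assumes an interpreter where `tarfile` accepts the `filter` argument (Python 3.12, or an older release with the security backport).

```python
import tarfile


def extract_untrusted(tar_path, dest):
    """Extract tar_path into dest using the PEP 706 "data" filter."""
    with tarfile.open(tar_path) as tar:
        # filter="data" refuses absolute paths, ".." escapes and links that
        # point outside dest; a refused member raises a tarfile.FilterError
        # subclass under the default errorlevel.
        tar.extractall(dest, filter="data")
```

The same filter also adjusts some permission bits on the extracted files, which is the behavior the reply about build pipelines above is objecting to.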
The complexity of the ZIP and tar archive standards means that implementations can vary from each other.
i really need to figure out how to get more people asking questions about METADATA and why brett cannon and donald stufft specifically are blocking an accepted PEP to translate it to JSON https://github.com/pypa/packaging/issues/570 he deleted my comments about this
i understand it's extremely subtle because there's no spec at all for the format specifying DEPENDENCIES but it's literally the only thing pypi actually needs to scan uploads for and instead this fucking guy wants to call parser confusion on MY BABY! MY CHILD WHOM I ADORE!
Using variations for which files are observed during extraction allows an attacker to smuggle files within an archive that aren’t detected, reviewed, or scanned, but then are installed on the target system upon installation.
this is literally actually something you can do with the METADATA file in every wheel and inject dependencies that escape detection. the zip format does not have variations that add new well-formed package names to the dependency list by adding a bit of invisibly malformed utf-8
i am not amazed at this guy
Extraction denial-of-service (“ZIP bombs”)
Specially crafted ZIP archives, often called “ZIP bombs”, can be created to be only 10MB when compressed, but when extracted, result in hundreds of terabytes of data.
zip archives aren't compressed. they are plaintext
This leads to resource exhaustion on the extracting system, as most systems aren’t prepared to handle this amount of data.
ok so resource exhaustion. decompression. please explain how pypi security engineer in residence
This is possible because the ZIP and tar archive formats have separate entries for their “central directory” authoritative list of files within the archive, and “local file” entries which describe the file and its contents within the archive.
tar doesn't have a central directory
This separation means that a single “file” can be referenced multiple, potentially thousands or millions of times, within a single ZIP file central directory, leading to a high and exploitable “compression ratio”.
so:
- you iterate over the central dir and not the local entries
- it tells you to extract the same file "millions" of times
- you will write the same file path "millions" of times
sorry like i think this should easily be a fireable offense
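Whichever reading of the quoted claim is right, the condition it describes is cheap to look for. A minimal sketch, assuming only that `zipfile` exposes each central directory entry's `header_offset` (it does); the function name `shared_local_headers` is made up for this example.

```python
import zipfile
from collections import Counter


def shared_local_headers(path):
    """Return {header_offset: count} for offsets used by more than one entry."""
    with zipfile.ZipFile(path) as zf:
        # infolist() comes from the central directory; nothing is inflated here.
        counts = Counter(info.header_offset for info in zf.infolist())
    return {offset: n for offset, n in counts.items() if n > 1}
```

An empty result means every central directory entry points at its own local header.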
A separate category of this vulnerability is creating archive files that, due to flaws in the implementation, never terminate the extraction process.
yeah this is the other part when cpython has some completely broken code that would never stand up to scrutiny and actively ignores where the spec tells you to validate offsets
This can be seen in CVE-2025-8194 where the “tarfile” module allowed tar member entry offsets to be negative,
any conforming tar impl would immediately reject the malformed data
this is a great fucking comment though
https://github.com/python/cpython/issues/130577#issuecomment-3285698061
Am I correct when I think that versions of Python before this commit: ac3d137 are not vulnerable because the nti function did not support negative numbers, and therefore a header cannot contain a negative number leading to a negative seek call?
bro this guy seth
Denial of service vulnerabilities for archives aren’t a massive concern for installers, as the worst-case is that the installation fails or a system is rendered unusable until backup or restore occurs.
literally just an evil man
However, for package repositories, a denial of service can interrupt service for users.
package repositories don't extract archives
For package repositories, the easiest way to mitigate this vulnerability is to not extract the whole contents of an archive.
ok but pay close attention to this next part
Instead, most of the information required from an archive, such as checking the existence or contents of a metadata file or calculating a checksum, can be done without inflating the contents of the archive.
i can't believe he would specifically advise not extracting a METADATA file, to avoid the specific file that can actually give you an RCE if you use pip, because they don't accept my PRs anymore
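What the quoted paragraph describes looks roughly like this in Python: checksum the archive bytes as they are, and pull out one named member instead of unpacking everything. This is a sketch under assumptions. The helper names, the `.dist-info/METADATA` suffix match and the 1 MiB cap are illustrative choices, and the one member that is read does get inflated, up to that cap.

```python
import hashlib
import zipfile


def read_member(path, suffix=".dist-info/METADATA", limit=1 << 20):
    """Return (name, data) for the first member ending in suffix.

    Only that member is decompressed, and never more than limit + 1 bytes.
    """
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.filename.endswith(suffix):
                with zf.open(info) as member:
                    data = member.read(limit + 1)
                if len(data) > limit:
                    raise ValueError(f"{info.filename} exceeds {limit} bytes")
                return info.filename, data
    raise KeyError(suffix)


def file_sha256(path, chunk=1 << 16):
    """Checksum the archive bytes as stored; nothing is inflated."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()
```

Note that this only fetches the METADATA bytes. Parsing them safely is a separate problem.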
To prevent ZIP bombs from having any impact on installers, package repositories can impose limits on the “compression ratio” and “uncompressed size” of archives. PyPI currently sets its ratio limit at 50x to allow for data-heavy packages to still be published.
50x compression ratio is actively literally impossible this can be proven with information theory. 8x is the best you can get via prefix trees and tANS sometimes goes up to 10x for very repetitive data
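A ratio limit like the one quoted can be checked from the central directory alone. The sketch below is an assumption about how such a check could work, not PyPI's actual code: the function name `declared_sizes_ok` and the 1 GiB total cap are invented, and the quoted text does not say how PyPI computes its ratio.

```python
import zipfile


def declared_sizes_ok(path, max_ratio=50, max_total=1 << 30):
    """Check central-directory sizes against limits without inflating anything.

    Returns (ok, ratio). max_total (1 GiB) is a made-up cap for illustration.
    """
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
    packed = sum(info.compress_size for info in infos)
    unpacked = sum(info.file_size for info in infos)
    if packed:
        ratio = unpacked / packed
    else:
        # No compressed bytes at all: either an empty archive or a lying one.
        ratio = float("inf") if unpacked else 0.0
    return (ratio <= max_ratio and unpacked <= max_total), ratio
```

These are the sizes the archive declares about itself, so anything that later inflates the data still needs its own byte cap.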
This means executables can have ZIPs appended to them to provide a “filesystem” that the executable can access by reading its own file path as a ZIP archive instead of an EXE.
this is why i'm fractal zip paging in the macrokernel. but also this actually won't work, you have to specially construct the zip on top of the exe. i know seth isn't merely mistaken here and is actively lying because i have done this a million times and it's quite tricky to ensure you get a valid zip that avoids immediately erroring
As a program “opening” the appended ZIP, the file appears to either be an executable or a ZIP depending on which direction the file is read from:
he has this incredibly half-assed little graphic that tries very hard to make this confusing
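The two reading directions can be shown without the graphic. A minimal sketch, assuming a plain (non-ZIP64) archive and a 64-byte stand-in for the executable; the stub bytes and both helper names are hypothetical, and nothing here produces a runnable EXE.

```python
import io
import zipfile


def append_zip_to_stub(stub, files):
    """Return stub bytes followed by a ZIP archive holding files (name -> bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return stub + buf.getvalue()


def read_appended(blob, name):
    """Read one member back out of the combined bytes."""
    # zipfile finds the end-of-central-directory record by searching from the
    # end of the file, then compensates for the prepended bytes when it
    # resolves member offsets.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)


# Stand-in for an executable: "MZ" magic plus padding, not a real program.
blob = append_zip_to_stub(b"MZ" + b"\x00" * 62, {"hello.txt": b"hi"})
front_magic = blob[:2]          # what a loader reading from the start sees
payload = read_appended(blob, "hello.txt")
```

With plain concatenation like this, the offsets stored inside the archive are still relative to the start of the ZIP rather than the start of the file. Python's `zipfile` tolerates that, but stricter readers may warn or refuse unless the offsets are rewritten.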
These two features combined mean it’s impossible to accurately parse a ZIP64 archive using both features with “prepended data”.
actively false https://github.com/python/cpython/pull/139702 cpython intentionally chose not to support zip64 file comments before this change. in this change, the same fucking guy who closed an issue for the racist buffer overflow now actively chose to recognize "prepended data", specifically so he could then raise a new fucking error
ok so this was worth reading again because it explains why i had to implement zip parsing a third fucking time last year because the stdlib suddenly broke my exes
Packaging ecosystems should be specific in their use of archive formats like tar and ZIP.
not METADATA, which is fine
For example, Python’s “zipfile” module currently doesn’t support multiple ZIP features, which are extremely uncommon today,
says the guy who just broke my exes out of spite
then they mention fuzzing and not my fuzzing for the good zip impl i wrote
Differential testing, running the same inputs through two different implementations and seeing if the results match, is also an effective method of finding bugs in data exchange formats
pretty sure this is called a compliance suite
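Whatever it gets called, the idea is small enough to show. This sketch compares the member names Python's `zipfile` reports from the central directory against a deliberately naive second parser that scans for local file headers from the front. The naive scanner is an assumption made for illustration: it can be fooled by signature bytes inside file data and it decodes names as UTF-8 only.

```python
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"


def central_directory_names(path):
    """Names as seen by zipfile, which trusts the central directory."""
    with zipfile.ZipFile(path) as zf:
        return set(zf.namelist())


def local_header_names(path):
    """Names as seen by a naive front-to-back scan of local file headers."""
    with open(path, "rb") as f:
        data = f.read()
    names = set()
    pos = data.find(LOCAL_SIG)
    while pos != -1 and pos + 30 <= len(data):
        # The name length sits at byte 26 of the 30-byte fixed header.
        (name_len,) = struct.unpack_from("<H", data, pos + 26)
        names.add(data[pos + 30:pos + 30 + name_len].decode("utf-8", "replace"))
        pos = data.find(LOCAL_SIG, pos + 4)
    return names


def differential(path):
    """Return the names that only one of the two parsers reports."""
    return central_directory_names(path) ^ local_header_names(path)
```

For a well-formed archive the result is an empty set. Two archives concatenated into one file is a simple input where the two parsers disagree.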
The implementation with the fewest differentials
the most unique
Defining a “canonical” archive layout within the specifications of package archives,
rejecting valid inputs you personally dislike
along with tools or policy to push the ecosystem towards that canonical layout,
injecting code into other people's build processes
can reduce reproducibility issues
and concern trolling maintainers
and make future improvements to restrict archive features possible.
can be an effective strategy to finally make a programming language community susceptible to SHA-256 length extension attacks
Go’s packaging ecosystem has a concept called a “DirHash,”
google NIHed the idea of hashing a directory tree so they could inject their own very specific semantics
which is a pre-defined method of creating a checksum
(all checksums are "pre-defined" or they wouldn't match)
from the extracted file contents of an archive
which is not a directory
regardless of metadata, timestamps, and insertion order.
they specifically made 100% sure that length extension was possible by only recording SHA-256 hashes and not lengths, even though google's own API for bazel remote executions includes the length for this reason
they even made sure you couldn't tell which files may have changed by running the result through another SHA-256 shredder
the dirhash link wasn't there last time i looked and this is the first actual "proof" i've found of google specifically seeking to inject length-extension attacks from merkle-damgård constructions like SHA-256 into packaging ecosystems
This removes many system and implementation-specific differences of archives from affecting the final “checksum” of the package archive,
this lets us inject whatever we want into your build process
making this mechanism more reliable to use compared to byte-for-byte comparisons.
it is very important that you avoid formats with plaintext metadata
This checksum method doesn’t account for links, empty directories, or archives that contain different contents depending on the extracting implementation.
this checksum method will produce different output than you'd expect, and we won't explain why!
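To make the argument concrete, here is a sketch of a content-only checksum in the style being described: hash each file, write one "hash, two spaces, name" line per file in sorted order, then hash that summary. The line format and the `h1:` prefix are assumed to match Go's DirHash and have not been verified against Go's output; the function names are hypothetical.

```python
import base64
import hashlib
import io
import zipfile


def dirhash_like(files):
    """files maps name -> bytes. Returns an "h1:"-prefixed digest of the contents."""
    outer = hashlib.sha256()
    for name in sorted(files):
        if "\n" in name:
            raise ValueError("file names with newlines are not supported")
        inner = hashlib.sha256(files[name]).hexdigest()
        # Only the per-file hash and the name feed the outer hash; lengths,
        # timestamps and permissions never do.
        outer.update(f"{inner}  {name}\n".encode())
    return "h1:" + base64.b64encode(outer.digest()).decode()


def zip_contents(data):
    """Map member name -> bytes, skipping directory entries entirely."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {i.filename: zf.read(i) for i in zf.infolist() if not i.is_dir()}
```

Two archives with the same files in a different order and with different timestamps give the same result here, and an empty directory entry changes nothing, which matches the limitations listed in the quote.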
Packaging ecosystems like Python’s often have many tools with different organizational structures and roadmaps, and therefore cannot easily coordinate on ecosystem-wide challenges like archive format features.
pypi suddenly stopped supporting negative range requests for wheels immediately after i gave a talk about my work at packagingcon 2023 https://web.archive.org/web/20250218154403/https://cfp.packaging-con.org/2023/talk/hpuhu7/, but they refused to implement PEP 658 for years after i added it to pip, and would provide no point of contact to explain why