The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
i am not amazed at this guy
Extraction denial-of-service (“ZIP bombs”)
Specially crafted ZIP archives, often called “ZIP bombs”, can be created to be only 10MB when compressed, but when extracted, result in hundreds of terabytes of data.
zip archives aren't compressed. they are plaintext
This leads to resource exhaustion on the extracting system, as most systems aren’t prepared to handle this amount of data.
ok so resource exhaustion. decompression. please explain how pypi security engineer in residence
This is possible because the ZIP and tar archive formats have separate entries for their “central directory” authoritative list of files within the archive, and “local file” entries which describe the file and its contents within the archive.
tar doesn't have a central directory
This separation means that a single “file” can be referenced multiple, potentially thousands or millions of times, within a single ZIP file central directory, leading to a high and exploitable “compression ratio”.
so:
- you iterate over the central dir and not the local entries
- it tells you to extract the same file "millions" of times
- you will write the same file path "millions" of times
sorry like i think this should easily be a fireable offense
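for reference, a rough sketch of spotting the "same file referenced many times" trick, assuming Python's `zipfile` and its `ZipInfo.header_offset` attribute. `find_shared_local_entries` is a made-up name, and a real checker would also need to catch partially overlapping byte ranges, not just identical offsets:

```python
import zipfile
from collections import defaultdict


def find_shared_local_entries(infos):
    """Group central directory entries by the local header offset they point at.

    Returns only the offsets claimed by more than one entry.
    """
    by_offset = defaultdict(list)
    for info in infos:
        by_offset[info.header_offset].append(info.filename)
    return {off: names for off, names in by_offset.items() if len(names) > 1}
```

usage would be something like `find_shared_local_entries(zipfile.ZipFile(path).infolist())`, where `path` is whatever archive you want to inspect.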
A separate category of this vulnerability is creating archive files that, due to flaws in the implementation, never terminate the extraction process.
yeah this is the other part when cpython has some completely broken code that would never stand up to scrutiny and actively ignores where the spec tells you to validate offsets
This can be seen in CVE-2025-8194 where the “tarfile” module allowed tar member entry offsets to be negative,
any conforming tar impl would immediately reject the malformed data
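a minimal sketch of what "reject the malformed data" could look like for a tar numeric header field. this is not the `tarfile` code; `parse_tar_number` is a hypothetical helper, and it assumes the usual two encodings (NUL/space-padded octal text, and the GNU base-256 extension where a leading 0x80 byte means positive and 0xFF means negative):

```python
def parse_tar_number(field: bytes) -> int:
    """Parse a numeric tar header field, refusing negative values."""
    if not field:
        raise ValueError("empty numeric field")
    if field[0] == 0x80:
        # GNU base-256 extension: positive value in the remaining bytes.
        return int.from_bytes(field[1:], "big")
    if field[0] == 0xFF:
        # GNU base-256 extension: negative value. Sizes and offsets can
        # never be negative, so refuse it here.
        raise ValueError("negative number in tar header")
    # Plain octal text, terminated by NUL and possibly padded with spaces.
    digits = field.split(b"\0", 1)[0].strip(b" ")
    if not digits:
        return 0
    if any(d not in b"01234567" for d in digits):
        raise ValueError("invalid octal field")
    return int(digits.decode("ascii"), 8)
```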
this is a great fucking comment though
https://github.com/python/cpython/issues/130577#issuecomment-3285698061
Am I correct when I think that versions of Python before this commit: ac3d137 are not vulnerable because the nti function did not support negative numbers, and therefore a header cannot contain a negative number leading to a negative seek call?
bro this guy seth
Denial of service vulnerabilities for archives aren’t a massive concern for installers, as the worst-case is that the installation fails or a system is rendered unusable until backup or restore occurs.
literally just an evil man
However, for package repositories, a denial of service can interrupt service for users.
package repositories don't extract archives
For package repositories, the easiest way to mitigate this vulnerability is to not extract the whole contents of an archive.
ok but pay close attention to this next part
Instead, most of the information required from an archive, such as checking the existence or contents of a metadata file or calculating a checksum, can be done without inflating the contents of the archive.
i can't believe he would specifically advise not extracting a METADATA file, to avoid the specific file that can actually give you an RCE if you use pip, because they don't accept my PRs anymore
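for what it's worth, "check a metadata file without inflating the whole archive" could look roughly like this with Python's `zipfile`. `read_member_capped` and the 1 MB default are made up for illustration, and nothing here claims to be what any repository actually runs:

```python
import zipfile


def read_member_capped(zf, name, limit=1_000_000):
    """Read one archive member, refusing anything larger than `limit` bytes."""
    info = zf.getinfo(name)
    # Cheap check first: the size declared in the central directory.
    if info.file_size > limit:
        raise ValueError(f"{name}: declared size {info.file_size} exceeds {limit}")
    with zf.open(info) as member:
        # Declared sizes can lie, so also cap how much is actually inflated.
        data = member.read(limit + 1)
    if len(data) > limit:
        raise ValueError(f"{name}: inflated past {limit} bytes")
    return data
```

only the one member is decompressed; every other entry in the archive stays untouched.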
To prevent ZIP bombs from having any impact on installers, package repositories can impose limits on the “compression ratio” and “uncompressed size” of archives. PyPI currently sets its ratio limit at 50x to allow for data-heavy packages to still be published.
50x compression ratio is actively literally impossible this can be proven with information theory. 8x is the best you can get via prefix trees and tANS sometimes goes up to 10x for very repetitive data
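a sketch of a ratio and size limit computed from the sizes declared in the central directory, assuming Python's `zipfile`. the 50x figure is the one quoted above; `check_declared_sizes` and the 1 GiB cap are hypothetical, and how PyPI actually computes its ratio is not something this sketch knows:

```python
import zipfile

MAX_RATIO = 50                 # the ratio limit quoted above
MAX_UNCOMPRESSED = 1 << 30     # hypothetical 1 GiB cap, for illustration only


def check_declared_sizes(zf, max_ratio=MAX_RATIO, max_total=MAX_UNCOMPRESSED):
    """Raise ValueError if the declared sizes exceed the configured limits."""
    infos = zf.infolist()
    total_uncompressed = sum(i.file_size for i in infos)
    total_compressed = sum(i.compress_size for i in infos)
    if total_uncompressed > max_total:
        raise ValueError("declared uncompressed size too large")
    if total_compressed == 0:
        if total_uncompressed > 0:
            raise ValueError("nonzero output declared from zero compressed bytes")
    elif total_uncompressed / total_compressed > max_ratio:
        raise ValueError("declared compression ratio too high")
    return total_uncompressed, total_compressed
```

these are declared sizes only, so this belongs next to a cap on bytes actually inflated, not in place of one.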
This means executables can have ZIPs appended to them to provide a “filesystem” that the executable can access by reading its own file path as a ZIP archive instead of an EXE.
this is why i'm fractal zip paging in the macrokernel. but also this actually won't work, you have to specially construct the zip on top of the exe. i know seth isn't merely mistaken here and is actively lying because i have done this a million times and it's quite tricky to ensure you get a valid zip that avoids immediately erroring
As a program “opening” the appended ZIP, the file appears to either be an executable or a ZIP depending on which direction the file is read from:
he has this incredibly half-assed little graphic that tries very hard to make this confusing
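the "read from the other direction" part, as a sketch: find the end-of-central-directory record from the back of the file and work out how many bytes were prepended. `prepended_length` is a hypothetical helper; it ignores ZIP64 entirely, and a stray signature inside the archive comment could fool the `rfind`:

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4s4H2LH"                      # fixed 22-byte EOCD record
EOCD_SIZE = struct.calcsize(EOCD_FMT)
MAX_COMMENT = 0xFFFF                       # comment length is a 16-bit field


def prepended_length(data: bytes) -> int:
    """Return how many bytes sit in front of the ZIP data (0 for a plain ZIP)."""
    pos = data.rfind(EOCD_SIG, max(0, len(data) - EOCD_SIZE - MAX_COMMENT))
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FMT, data, pos)
    cd_size, cd_offset = fields[5], fields[6]
    # The central directory ends where the EOCD begins (no ZIP64 records
    # assumed), so anything left over in front is prepended data.
    start = pos - cd_size - cd_offset
    if start < 0:
        raise ValueError("central directory offsets point before start of file")
    return start
```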
These two features combined mean it’s impossible to accurately parse a ZIP64 archive using both features with “prepended data”.
actively false https://github.com/python/cpython/pull/139702 cpython intentionally chose not to support zip64 file comments before this change. in this change, the same fucking guy who closed an issue for the racist buffer overflow now actively chose to recognize "prepended data", specifically so he could then raise a new fucking error
ok so this was worth reading again because it explains why i had to implement zip parsing a third fucking time last year because the stdlib suddenly broke my exes
Packaging ecosystems should be specific in their use of archive formats like tar and ZIP.
not METADATA, which is fine
For example, Python’s “zipfile” module currently doesn’t support multiple ZIP features, which are extremely uncommon today,
says the guy who just broke my exes out of spite
then they mention fuzzing and not my fuzzing for the good zip impl i wrote
Differential testing, running the same inputs through two different implementations and seeing if the results match, is also an effective method of finding bugs in data exchange formats
pretty sure this is called a compliance suite
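whatever you call it, the smallest version of the idea is two readers over the same bytes and a list of where they disagree. a sketch, assuming Python's `zipfile` as one reader and a hand-rolled local-header parse as the other; `name_differentials` is a made-up name and it only compares file names:

```python
import io
import struct
import zipfile

LOCAL_FMT = "<4s5H3L2H"                    # fixed 30-byte local file header
LOCAL_SIZE = struct.calcsize(LOCAL_FMT)


def name_differentials(data: bytes):
    """Return (central_name, local_name) pairs where the two views disagree."""
    diffs = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            fields = struct.unpack_from(LOCAL_FMT, data, info.header_offset)
            signature, flags, name_len = fields[0], fields[2], fields[9]
            if signature != b"PK\x03\x04":
                diffs.append((info.orig_filename, None))
                continue
            start = info.header_offset + LOCAL_SIZE
            raw = data[start:start + name_len]
            # General purpose flag bit 11 marks UTF-8 names; otherwise cp437.
            encoding = "utf-8" if flags & 0x800 else "cp437"
            local_name = raw.decode(encoding, errors="replace")
            if local_name != info.orig_filename:
                diffs.append((info.orig_filename, local_name))
    return diffs
```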
The implementation with the fewest differentials
the most unique
Defining a “canonical” archive layout within the specifications of package archives,
rejecting valid inputs you personally dislike
along with tools or policy to push the ecosystem towards that canonical layout,
injecting code into other people's build processes
can reduce reproducibility issues
and concern trolling maintainers
and make future improvements to restrict archive features possible.
can be an effective strategy to finally make a programming language community susceptible to SHA-256 length extension attacks
Go’s packaging ecosystem has a concept called a “DirHash,”
google NIHed the idea of hashing a directory tree so they could inject their own very specific semantics
which is a pre-defined method of creating a checksum
(all checksums are "pre-defined" or they wouldn't match)
from the extracted file contents of an archive
which is not a directory
regardless of metadata, timestamps, and insertion order.
they specifically made 100% sure that length extension was possible by only recording SHA-256 hashes and not lengths, even though google's own API for bazel remote executions includes the length for this reason
they even made sure you couldn't tell which files may have changed by running the result through another SHA-256 shredder
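for anyone following along, a sketch of the scheme as I understand Go's "h1:" DirHash: SHA-256 each file, write one "hash, two spaces, name" line per file in sorted name order, then SHA-256 that summary and base64 it. `dirhash_h1` is a made-up Python name and this has not been checked against Go's output, so treat the exact format as an assumption:

```python
import base64
import hashlib


def dirhash_h1(files: dict[str, bytes]) -> str:
    """Hash a {name: contents} mapping, independent of insertion order."""
    outer = hashlib.sha256()
    for name in sorted(files):
        if "\n" in name:
            raise ValueError("file names containing newlines are not allowed")
        # Only the per-file digest and the name go into the summary line;
        # file lengths, timestamps and other metadata are not recorded.
        inner = hashlib.sha256(files[name]).hexdigest()
        outer.update(f"{inner}  {name}\n".encode("utf-8"))
    # The summary itself is hashed again, so the result is one opaque value.
    return "h1:" + base64.b64encode(outer.digest()).decode("ascii")
```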