The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
either through extraction order,
the order is unambiguous
path collisions,
yeah it might write to a file twice
or path truncations
literally not a thing
METADATA is able to inject deps through parser confusion because it uses the worst code i've ever seen in my entire life https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py and the packaging maintainer personally ensures it never uses my json format, but pypi doesn't check uploads for that shit
The trouble is that tar and ZIP archives are designed to be able to do these actions that can be abused by malicious archives.
MALICIOUS MAINTAINER!!!
bruh omg
There is a deprecation warning for using no defined filter when extracting an archive
imagine being so desperate to scare your users you'll add a deprecation warning to the language and won't ever block actually malicious uploads to pypi
with the default changing to “data” in Python 3.14
bruh https://peps.python.org/pep-0706/#filters this just breaks any build pipelines extracting exes and clobbers perms in absurd ways that really unnerve me
All installers are recommended to use filtering when extracting Python package archives to avoid attacks like path traversal.
path traversal!!!!! scary!!!!!!!
The complexity of the ZIP and tar archive standards means that implementations can vary from each other.
i really need to figure out how to get more people asking questions about METADATA and why brett cannon and donald stufft specifically are blocking an accepted PEP to translate it to JSON https://github.com/pypa/packaging/issues/570 he deleted my comments about this
i understand it's extremely subtle because there's no spec at all for the format specifying DEPENDENCIES but it's literally the only thing pypi actually needs to scan uploads for and instead this fucking guy wants to call parser confusion on MY BABY! MY CHILD WHOM I ADORE!
Using variations for which files are observed during extraction allows an attacker to smuggle files within an archive that aren’t detected, reviewed, or scanned, but then are installed on the target system upon installation.
this is literally actually something you can do with the METADATA file in every wheel and inject dependencies that escape detection. the zip format does not have variations that add new well-formed package names to the dependency list by adding a bit of invisibly malformed utf-8
i am not amazed at this guy
Extraction denial-of-service (“ZIP bombs”)
Specially crafted ZIP archives, often called “ZIP bombs”, can be created to be only 10MB when compressed, but when extracted, result in hundreds of terabytes of data.
zip archives aren't compressed. they are plaintext
This leads to resource exhaustion on the extracting system, as most systems aren’t prepared to handle this amount of data.
ok so resource exhaustion. decompression. please explain how pypi security engineer in residence
This is possible because the ZIP and tar archive formats have separate entries for their “central directory” authoritative list of files within the archive, and “local file” entries which describe the file and its contents within the archive.
tar doesn't have a central directory
This separation means that a single “file” can be referenced multiple, potentially thousands or millions of times, within a single ZIP file central directory, leading to a high and exploitable “compression ratio”.
so:
- you iterate over the central dir and not the local entries
- it tells you to extract the same file "millions" of times
- you will write the same file path "millions" of times
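The reuse described in the quote is visible from the central directory alone, so it can be checked without extracting anything. A rough sketch, assuming Python's `zipfile` and a hypothetical helper name `shared_local_headers`:

```python
import io
import zipfile
from collections import Counter


def shared_local_headers(zip_bytes: bytes) -> dict[int, int]:
    """Map local-header offset -> number of central directory entries that
    point at it, keeping only offsets referenced more than once."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # infolist() reflects the central directory; no member data is read.
        counts = Counter(info.header_offset for info in zf.infolist())
    return {offset: n for offset, n in counts.items() if n > 1}
```

An ordinary archive should give an empty dict; any non-empty result means several directory entries point at the same local file entry.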
sorry like i think this should easily be a fireable offense
A separate category of this vulnerability is creating archive files that, due to flaws in the implementation, never terminate the extraction process.
yeah this is the other part when cpython has some completely broken code that would never stand up to scrutiny and actively ignores where the spec tells you to validate offsets
This can be seen in CVE-2025-8194 where the “tarfile” module allowed tar member entry offsets to be negative,
any conforming tar impl would immediately reject the malformed data
this is a great fucking comment though
https://github.com/python/cpython/issues/130577#issuecomment-3285698061
Am I correct when I think that versions of Python before this commit: ac3d137 are not vulnerable because the nti function did not support negative numbers, and therefore a header cannot contain a negative number leading to a negative seek call?
bro this guy seth
Denial of service vulnerabilities for archives aren’t a massive concern for installers, as the worst-case is that the installation fails or a system is rendered unusable until backup or restore occurs.
literally just an evil man
However, for package repositories, a denial of service can interrupt service for users.
package repositories don't extract archives
For package repositories, the easiest way to mitigate this vulnerability is to not extract the whole contents of an archive.
ok but pay close attention to this next part
Instead, most of the information required from an archive, such as checking the existence or contents of a metadata file or calculating a checksum, can be done without inflating the contents of the archive.
i can't believe he would specifically advise not extracting a METADATA file, to avoid the specific file that can actually give you an RCE if you use pip, because they don't accept my PRs anymore
To prevent ZIP bombs from having any impact on installers, package repositories can impose limits on the “compression ratio” and “uncompressed size” of archives. PyPI currently sets its ratio limit at 50x to allow for data-heavy packages to still be published.
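A sketch of how such a ratio check could be computed from the sizes declared in the central directory, without inflating anything. This is not PyPI's actual implementation; `MAX_RATIO` and `declared_ratio` are hypothetical names, and the sizes are whatever the archive claims, so they are only as trustworthy as the archive.

```python
import io
import zipfile

MAX_RATIO = 50  # the limit quoted above; the constant name is an assumption


def declared_ratio(zip_bytes: bytes) -> float:
    """Declared uncompressed size divided by compressed size, summed over
    all central directory entries. No member data is decompressed."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        infos = zf.infolist()
    uncompressed = sum(info.file_size for info in infos)
    compressed = sum(info.compress_size for info in infos)
    if compressed == 0:
        # Empty members only: treat as ratio 0; otherwise the claim is unbounded.
        return 0.0 if uncompressed == 0 else float("inf")
    return uncompressed / compressed


def exceeds_limit(zip_bytes: bytes, limit: float = MAX_RATIO) -> bool:
    """True when the declared ratio is above the configured limit."""
    return declared_ratio(zip_bytes) > limit
```

A repository-side check would call `exceeds_limit(upload_bytes)` before doing anything else with the file.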
50x compression ratio is actively literally impossible this can be proven with information theory. 8x is the best you can get via prefix trees and tANS sometimes goes up to 10x for very repetitive data
This means executables can have ZIPs appended to them to provide a “filesystem” that the executable can access by reading its own file path as a ZIP archive instead of an EXE.
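A small experiment for the appended-ZIP case, using arbitrary leading bytes as a stand-in for a real executable. It only shows what Python's `zipfile` does when a ZIP follows other data; stricter readers may behave differently, and the helper names `append_zip` and `read_member` are hypothetical.

```python
import io
import zipfile


def append_zip(prefix: bytes, files: dict[str, bytes]) -> bytes:
    """Return prefix bytes followed by a ZIP archive that was built on its own.

    The offsets stored inside the ZIP are relative to the start of the ZIP,
    not to the start of the combined blob.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return prefix + buf.getvalue()


def read_member(blob: bytes, name: str) -> bytes:
    """Open the combined blob as a ZIP and read one member."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)
```

For example, `read_member(append_zip(b"MZ" + b"\x00" * 100, {"a.txt": b"hi"}), "a.txt")` reads the member back, because `zipfile` locates the end-of-central-directory record from the end of the file and adjusts the stored offsets for the leading bytes.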
this is why i'm fractal zip paging in the macrokernel. but also this actually won't work, you have to specially construct the zip on top of the exe. i know seth isn't merely mistaken here and is actively lying because i have done this a million times and it's quite tricky to ensure you get a valid zip that avoids immediately erroring