The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
like one issue with a bare page is that it has no inherent semantics except the ones we impart to it by manipulating the PAT from the cpu and those aren't in the page itself. meanwhile i have personally invented so many semantics for the humble zip file it has a whole CVE and associated propaganda campaign against it https://alpha-omega.dev/wp-content/uploads/sites/22/2025/10/ao_wp_102725a.pdf
These formats were created a long time ago, 36 and 46 years ago, respectively.
they're scared of the old magic
Archive formats are more accurately described as a series of instructions that occur to a filesystem than a neat set of files bundled together.
i keep forgetting how thick the prose is on this piece of art lmao. "instructions that occur to a filesystem" is reaching so hard and it still sounds awesome
Both ZIPs and tar archives support having additional file “entries” appended to the archive without removing the previous entries of the same name, and still produce a valid archive.
literally just glazing
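for anyone who wants to see the "appended entries" thing in action, here is a minimal sketch using only the Python standard library `zipfile` module. The file name `hello.txt` and its contents are made up for illustration. It appends a second entry with the same name and then checks which one a by-name lookup returns in CPython's implementation; other ZIP readers may choose differently, which is the whole "confusion" argument.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()

with warnings.catch_warnings():
    # zipfile emits a UserWarning for duplicate names; silence it for the demo.
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "first")
    # Re-open in append mode and add another entry under the same name.
    with zipfile.ZipFile(buf, "a") as zf:
        zf.writestr("hello.txt", "second")

with zipfile.ZipFile(buf) as zf:
    # Both entries are still listed in the central directory.
    names = [info.filename for info in zf.infolist()]
    # Lookup by name goes through a dict, so the last entry wins here.
    winner = zf.read("hello.txt")

print(names, winner)
```

the archive stays valid and both entries are still in there; which one "counts" is up to the reader implementation.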
Implementations need to process these instructions consistently to be safe from confusion attacks.
oh better watch out it's misty's psyduck oooooo
Archive formats support many features beyond mapping file paths to data
yeah don't forget how they also provide a structured index of mapped regions and colocated metadata that naturally maps to a file path
Archives support many of the same features that filesystems do, such as creation and modification times, permissions, ownership, and links between different filesystem entries
yeah what if we did make this into a filesystem
Supporting these features, which are often platform-specific, means that implementations need to handle each in their respective platform-specific manner, leading to differences and complexity.
translation: yeah symlinks are represented by just storing another file path as the data entry but windows makes symlinks require administrator permissions
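a small sketch of what "storing another file path as the data entry" looks like, assuming the common Unix convention of putting the file mode in the upper 16 bits of `external_attr`. The entry name `latest` and target `releases/v2` are hypothetical. Whether an extractor turns this into a real symlink, a regular file, or an error is implementation- and platform-specific.

```python
import io
import stat
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    link = zipfile.ZipInfo("latest")
    link.create_system = 3  # 3 = Unix, so readers interpret the mode bits
    # Unix mode lives in the high 16 bits of external_attr.
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    # The "file contents" of a symlink entry is just the target path.
    zf.writestr(link, "releases/v2")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("latest")
    is_symlink = stat.S_ISLNK(info.external_attr >> 16)
    target = zf.read("latest").decode()

print(is_symlink, target)
```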
For example, if the xz-utils project had a fully reproducible build, the attack used by the malicious maintainer could have been detected if projects were checking the reproducibility of artifacts from the source code.
THE MALICIOUS MAINTAINER!
which reminds me that macrokernel isolation is the better way to address poettering shamelessly dlopen()ing shared libs because "uh i want to catch and respond to errors"
Software packages ideally should be byte-for-byte reproducible,
unless it's the rust compiler
but many features of archive formats make reproducing package archives difficult
"reproducing package archives" means you have to erase metadata in the code you ship to all your users because otherwise i'm going to continuously neg you until you give up maintainership
compression (stream names!)
this one is really trying
OMG THE EVIL SCARY AUTOCONF TAR COMMAND LINE
This long list of options is unwieldy, to say the least. If instead, there was a “--reproducible” option which enabled all these flags to achieve build reproducibility, that would be much more accessible and discoverable for users looking to create reproducible ZIP and tar archives.
pypi security engineer in residence who literally extracted the command line from the cpython build script where he himself commented: https://github.com/python/release-tools/pull/62
# Recommended options for creating reproducible tarballs from:
# https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html#Reproducibility
literally "the manual on the web said exactly how to do this but i would prefer a nonstandard flag with unclear semantics"
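for comparison, here is a rough sketch of the same idea with Python's standard library `tarfile` instead of the GNU tar command line. This is not the command from the linked PR and not an official recipe; the helper names `scrub` and `reproducible_tar` are hypothetical. It assumes that sorting entries and zeroing timestamps, ownership, and permission noise is enough for the tree in question, and it skips compression on purpose since compressors can embed their own metadata.

```python
import io
import os
import tarfile


def scrub(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Erase metadata that differs between machines or builds."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    # Collapse permissions to two canonical values.
    executable = info.isdir() or bool(info.mode & 0o100)
    info.mode = 0o755 if executable else 0o644
    return info


def reproducible_tar(root: str) -> bytes:
    """Return an uncompressed tar of `root` with normalized metadata."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.append(os.path.relpath(full, root))

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        # Sort so the entry order does not depend on directory listing order.
        for rel in sorted(entries):
            tf.add(os.path.join(root, rel), arcname=rel,
                   recursive=False, filter=scrub)
    return buf.getvalue()
```

two trees with identical contents but different timestamps should then produce identical bytes, which is the property being argued about.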
Extraction filter escapes (“ZIP slip”)
ZIP SLIP ZIP ZLIP
When installing a software package, the installation program needs to extract the contents of the archive to the system before continuing with the package installation.
no, zips are usable without blocking on extraction
This extraction process needs to be filtered to avoid modifying the system beyond what is intended.
actual made up false lie, pip just calls .extractall()
Escaping this filtering is commonly done in one of two ways,
you can't escape the filtering. the zip file is not alive
either with “parent directory” path segments (e.g, “..”)
banned everywhere
or abusing links such as symbolic or hard links,
zips don't support hard links
either through extraction order,
the order is unambiguous
path collisions,
yeah it might write to a file twice
or path truncations
literally not a thing
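since the argument here is about what a "filter" actually has to check, a minimal sketch of the usual containment test for ".." and absolute paths, using only `os.path`. The function name `stays_inside` is hypothetical, and this is not what pip or any particular installer does. It only covers the path-segment case; it does not address extraction order or name collisions.

```python
import os


def stays_inside(dest: str, member: str) -> bool:
    """True if extracting `member` under `dest` would land inside `dest`."""
    dest_real = os.path.realpath(dest)
    # join() discards dest_real if member is absolute, which the check catches.
    target = os.path.realpath(os.path.join(dest_real, member))
    return os.path.commonpath([dest_real, target]) == dest_real
```

an extractor would call something like this on every entry name before writing anything and refuse the entry when it returns `False`.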
METADATA is able to inject deps through parser confusion because it uses the worst code i've ever seen in my entire life https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py and the packaging maintainer personally ensures it never uses my json format, but pypi doesn't check uploads for that shit
The trouble is that tar and ZIP archives are designed to be able to do these actions that can be abused by malicious archives.
MALICIOUS MAINTAINER!!!
bruh omg
There is a deprecation warning for using no defined filter when extracting an archive
imagine being so desperate to scare your users you'll add a deprecation warning to the language and won't ever block actually malicious uploads to pypi
with the default changing to “data” in Python 3.14
bruh https://peps.python.org/pep-0706/#filters this just breaks any build pipelines extracting exes and clobbers perms in absurd ways that really unnerve me