The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.
Supporting these features, which are often platform-specific, means that implementations need to handle each in their respective platform-specific manner, leading to differences and complexity.
translation: yeah symlinks are represented by just storing another file path as the data entry but windows makes symlinks require administrator permissions
For example, if the xz-utils project had a fully reproducible build, the attack used by the malicious maintainer could have been detected if projects were checking the reproducibility of artifacts from the source code.
THE MALICIOUS MAINTAINER!
which reminds me that macrokernel isolation is the better way to address poettering shamelessly dlopen()ing shared libs because "uh i want to catch and respond to errors"
Software packages ideally should be byte-for-byte reproducible,
unless it's the rust compiler
but many features of archive formats make reproducing package archives difficult
"reproducing package archives" means you have to erase metadata in the code you ship to all your users because otherwise i'm going to continuously neg you until you give up maintainership
compression (stream names!)
this one is really trying
OMG THE EVIL SCARY AUTOCONF TAR COMMAND LINE
This long list of options is unwieldy, to say the least. If instead, there was a “--reproducible” option which enabled all these flags to achieve build reproducibility, that would be much more accessible and discoverable for users looking to create reproducible ZIP and tar archives.
pypi security engineer in residence who literally extracted the command line from the cpython build script where he himself commented: https://github.com/python/release-tools/pull/62
# Recommended options for creating reproducible tarballs from:
# https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html#Reproducibility
literally "the manual on the web said exactly how to do this but i would prefer a nonstandard flag with unclear semantics"
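for reference, a minimal sketch of what "erase the metadata" amounts to, using python's stdlib `tarfile` and `gzip` instead of the GNU tar command line. the function name `reproducible_tar_gz` and the in-memory `files` dict are made up for illustration, and this is not the script from the linked PR. the idea is roughly what the GNU tar manual page describes: fixed member order, fixed timestamps, fixed ownership, and nothing time-dependent in the gzip header.

```python
import gzip
import io
import tarfile


def _normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    # Erase metadata that varies between machines and runs.
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    # Collapse permissions to 0755 (dirs, executables) or 0644 (everything else).
    info.mode = 0o755 if (info.isdir() or info.mode & 0o100) else 0o644
    return info


def reproducible_tar_gz(files: dict) -> bytes:
    """Build a .tar.gz from {name: bytes} with deterministic output (hypothetical helper)."""
    raw = io.BytesIO()
    # mtime=0 and an empty filename keep the timestamp and name out of the gzip header.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.PAX_FORMAT) as tar:
            for name in sorted(files):  # fixed member order
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                tar.addfile(_normalize(info), io.BytesIO(files[name]))
    return raw.getvalue()


# Same content in a different insertion order gives the same bytes.
first = reproducible_tar_gz({"b.txt": b"two", "a.txt": b"one"})
second = reproducible_tar_gz({"a.txt": b"one", "b.txt": b"two"})
```

caveat: the compressed bytes can still differ between zlib versions or compression levels, so this only shows the metadata side of the problem.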
Extraction filter escapes (“ZIP slip”)
ZIP SLIP ZIP ZLIP
When installing a software package, the installation program needs to extract the contents of the archive to the system before continuing with the package installation.
no, zips are usable without blocking on extraction
This extraction process needs to be filtered to avoid modifying the system beyond what is intended.
actual made up false lie, pip just calls .extractall()
Escaping this filtering is commonly done in one of two ways,
you can't escape the filtering. the zip file is not alive
either with “parent directory” path segments (e.g., “..”)
banned everywhere
or abusing links such as symbolic or hard links,
zips don't support hard links
either through extraction order,
the order is unambiguous
path collisions,
yeah it might write to a file twice
or path truncations
literally not a thing
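for the "path collisions" item, a rough sketch of how you could list the members of a ZIP that would land on the same path, using stdlib `zipfile`. the helper name `colliding_names` is hypothetical, and comparing names case-insensitively is an assumption about the target filesystem, not something the article specifies.

```python
import io
import warnings
import zipfile
from collections import Counter


def colliding_names(zf: zipfile.ZipFile) -> list:
    """Return member names that occur more than once, ignoring case (hypothetical helper)."""
    counts = Counter(info.filename.lower() for info in zf.infolist())
    return sorted(name for name, n in counts.items() if n > 1)


def demo_archive() -> io.BytesIO:
    """Build a small in-memory ZIP that contains one duplicated name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names on write; silence it for the demo.
        warnings.simplefilter("ignore")
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("pkg/a.py", "first")
            zf.writestr("pkg/a.py", "second")
            zf.writestr("pkg/b.py", "only once")
    return buf


with zipfile.ZipFile(demo_archive()) as zf:
    duplicates = colliding_names(zf)
```

this only reports the collision; which copy ends up on disk depends on the extractor.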
METADATA is able to inject deps through parser confusion because it uses the worst code i've ever seen in my entire life https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py and the packaging maintainer personally ensures it never uses my json format, but pypi doesn't check uploads for that shit
The trouble is that tar and ZIP archives are designed to be able to do these actions that can be abused by malicious archives.
MALICIOUS MAINTAINER!!!
bruh omg
There is a deprecation warning for using no defined filter when extracting an archive
imagine being so desperate to scare your users you'll add a deprecation warning to the language and won't ever block actually malicious uploads to pypi
with the default changing to “data” in Python 3.14
bruh https://peps.python.org/pep-0706/#filters this just breaks any build pipelines extracting exes and clobbers perms in absurd ways that really unnerve me
All installers are recommended to use filtering when extracting Python package archives to avoid attacks like path traversal.
path traversal!!!!! scary!!!!!!!
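for context, a minimal sketch of the check "filtering" usually means for a ZIP: resolve each member name against the destination and flag anything that lands outside it. `unsafe_members` and the destination path are hypothetical names for illustration. note that stdlib `ZipFile.extract()` and `extractall()` are documented to strip ".." components and leading slashes themselves, so a check like this matters mainly for code that joins paths by hand.

```python
import io
import os
import zipfile


def unsafe_members(zf: zipfile.ZipFile, dest: str) -> list:
    """List member names that would resolve outside dest if joined naively (hypothetical helper)."""
    root = os.path.realpath(dest)
    bad = []
    for info in zf.infolist():
        target = os.path.realpath(os.path.join(root, info.filename))
        # A safe target has the destination root as its common path prefix.
        if os.path.commonpath([root, target]) != root:
            bad.append(info.filename)
    return bad


# Build a small in-memory archive with one normal and two escaping names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/ok.py", "")
    zf.writestr("../evil.py", "")
    zf.writestr("pkg/../../evil2.py", "")

with zipfile.ZipFile(buf) as zf:
    flagged = unsafe_members(zf, "/tmp/dest-example")
```

this looks at names only; it says nothing about symlink members, which ZIP and tar handle differently.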
The complexity of the ZIP and tar archive standards means that implementations can vary from each other.
i really need to figure out how to get more people asking questions about METADATA and why brett cannon and donald stufft specifically are blocking an accepted PEP to translate it to JSON https://github.com/pypa/packaging/issues/570 he deleted my comments about this
i understand it's extremely subtle because there's no spec at all for the format specifying DEPENDENCIES but it's literally the only thing pypi actually needs to scan uploads for and instead this fucking guy wants to call parser confusion on MY BABY! MY CHILD WHOM I ADORE!
Using variations for which files are observed during extraction allows an attacker to smuggle files within an archive that aren’t detected, reviewed, or scanned, but then are installed on the target system upon installation.
this is literally actually something you can do with the METADATA file in every wheel and inject dependencies that escape detection. the zip format does not have variations that add new well-formed package names to the dependency list by adding a bit of invisibly malformed utf-8