Post · bonfire.cafe

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 20 hours ago

The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

oh my god forgot they had another blog post https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks/

especially notice how our boy seth calls out the wheel format PEP for being vague:

The most detail provided is:

and then does not propose a PEP change. he does have another accusation though:

However, most Python installers today do not do this check and extract the contents of the ZIP archive similar to unzip and then amend the installed RECORD within the virtual environment so that uninstalling the package works as expected.

our boy seth is claiming that "most python installers" will:

recompute checksums for the files it just extracted
open the RECORD file and write the new value

and then implies this is some sort of cover-up for the uninstall operation

Preventing ZIP parser confusion attacks on Python package installers - The Python Package Index Blog

PyPI will begin warning and will later reject wheels that contain differentiable ZIP features or incorrect RECORD files.

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

the author of the wheel PEP seth calls out was one of the few people who reached out to help me with my http zip PR that pip still refuses to merge https://github.com/pypa/pip/pull/12208

GitHub

perform 1-3 HTTP requests for each wheel using fast-deps by cosmicexplorer · Pull Request #12208 · pypa/pip

Continued motivation for fast-deps While PEP 658 is the standards-compliant solution and metadata from there is already preferred when available, --use-feature=fast-deps avoids downloading wheels a...

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

it's so fucking batshit to make a whole blog post about "popular installers" failing to check the RECORD file without mentioning or contacting any of them to fix it

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

ok i'm more insane now but i'm also much more certain that seth larson writing active falsehoods about zips and checksums is because:

plaintext metadata and redundant offsets make it very difficult to retroactively modify data without breaking a human-checkable invariant
storing both compressed and uncompressed size means the decompression process cannot be used to inject extra data through length extension
zip files having a magic number and unambiguous offsets at the end foils possibly foils length extension
when negative range requests suddenly broke on pypi, that was likely about length extension

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

METADATA on the other hand is not plaintext, because its semantics depend upon the version of cpython/pip, and are not machine-checkable or human-visible

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

google's dirhash is so fucking funny though

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

problem is to make a good proof of concept for length extension you need to have a ton of parallel compute, close to a whole data center, to figure out how to go from:
(A) length N, checksum C [good]
(B) length N + M, checksum C [bad]

since SHA-256 is cryptographic, you need to essentially mine a bitcoin

it's a little harder than a bitcoin aiui, since you need to ensure the result N + M seems valid at first, but actually injects some evil data

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

https://en.wikipedia.org/wiki/Length_extension_attack

this page claims SHA-256/512 and SHA3 are not susceptible, which i'm pretty sure is false? it does admit:

A secret suffix MAC, which is calculated as Hash(message ‖ secret), isn't vulnerable to a length extension attack, but is vulnerable to another attack based on a hash collision.[7]

ah, so i was mistaken! length extension seems to require that the valid prefix is not known to the attacker

Length extension attack - Wikipedia

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

anyway, you know who operates pypi and has a lot of spare GPUs? obviously that's the very definition of google scale

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

that's why the RECORD file is weirdly redundant with the required zip metadata, EXCEPT: standard zips only provide a crc32 checksum for each entry, which is not cryptographic and much easier to collide with

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

(i don't blame phil katz, he was only seeking to capture actual network errors, and computers were much slower back then.

i do blame zstd for their several layers of deeply odd functionality, including the optional non-cryptographic xxhash and multiple variants of "frames" and "blocks" which would decompress to 0-length outputs and are completely and obviously redundant in the protocol.)

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

the reason i'm mentioning this is of course: how can we make the fractal zip even more powerful than its progenitor?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

i continue to find it ridiculous that no one has identified the properties of tree hashing (especially blake3) to be ideal for a filesystem. but then again, most filesystems generally have no idea what's going on most of the time

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

https://github.com/BLAKE3-team/BLAKE3/tree/master/c

Hashing large files with this function usually requires memory-mapping, since reading a file into memory in a single-threaded loop takes longer than hashing the resulting buffer. Note that hashing a memory-mapped file with this function produces a "random" pattern of disk reads, which can be slow on spinning disks. Again it's important to benchmark your specific use case.

this is mostly unhelpful, but we will forgive them

GitHub

BLAKE3/c at master · BLAKE3-team/BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function - BLAKE3-team/BLAKE3

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

what this does get at is that blake3 can be equivalently computed using multiple distinct memory access patterns https://raw.githubusercontent.com/BLAKE3-team/BLAKE3-specs/master/blake3.pdf

what does it mean to compute blake3?

BLAKE3 splits its input into 1 KiB chunks and arranges them as the leaves of a binary tree.

View (PDF)

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

note that the computation blake3 performs is a function of a whole contiguous region of memory, so this tree structure should not be interpreted as mapping to a filesystem VFS hierarchy. instead, the tree describes how to compose the checksum result from a sequence of subregions. in this, we can begin to infer how might be able to avoid redoing work

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

essentially, the blake3 compute tree is an in-memory structure which affords some degree of splitting and merging. and this happens to be just the thing pages are good at, and just the thing we know how to do with our fractal zip!

the fractal zip structure does assume data independence across local entries, but once blake3 is computed for an entry, we should be able to reuse that output to compute blake3 sums across a sequence of such read-only local entries

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

"what's the fractal all about?" well, there is a clearly recursive structure in the current sketch https://codeberg.org/cosmicexplorer/dice/src/branch/main/TODO.org although the nesting may be "inside-out" and the "prefix" is not actually an element of the data structure!

here's how we might think of our in-memory buffers:

    (local)))
  (central
(file)

the most flexible component is the contiguous (gapless) sequence of local entries. our central entries are also gapless, and must always correspond to the same sequence of local entries. this ensures the split/merge semantics of the local series can always be mirrored for the central series

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

if we do this, it also becomes natural to consider independent blake3 subtrees being computed until we span the whole of a local entry, at which point it can be flushed up to the corresponding central entry

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

do we keep all these subtrees after a local entry is hashed? they can always be computed again..? i suppose we should consider these internal subtrees like a schedulable resource

i believe the subtrees are not required for splitting, but for merging a local/central series with a third after it was split from/off a parent