Post · bonfire.cafe

@hipsterelectron@circumstances.run · 18 hours ago

The UNIX system has been in wide use for over 20 years, and has helped to define many areas of computing.

@hipsterelectron@circumstances.run · 5 hours ago

For example, if the xz-utils project had a fully reproducible build, the attack used by the malicious maintainer could have been detected if projects were checking the reproducibility of artifacts from the source code.

THE MALICIOUS MAINTAINER!

which reminds me that macrokernel isolation is the better way to address poettering shamelessly dlopen()ing shared libs because "uh i want to catch and respond to errors"

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

Software packages ideally should be byte-for-byte reproducible,

unless it's the rust compiler

but many features of archive formats make reproducing package archives difficult

"reproducing package archives" means you have to erase metadata in the code you ship to all your users because otherwise i'm going to continuously neg you until you give up maintainership

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

compression (stream names!)

this one is really trying

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

OMG THE EVIL SCARY AUTOCONF TAR COMMAND LINE

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

This long list of options is unwieldy, to say the least. If instead, there was a “--reproducible” option which enabled all these flags to achieve build reproducibility, that would be much more accessible and discoverable for users looking to create reproducible ZIP and tar archives.

pypi security engineer in residence who literally extracted the command line from the cpython build script where he himself commented: https://github.com/python/release-tools/pull/62

# Recommended options for creating reproducible tarballs from:
 # https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html#Reproducibility

literally "the manual on the web said exactly how to do this but i would prefer a nonstandard flag with unclear semantics"

GitHub

Use recommended reproducibility options for tar by sethmlarson · Pull Request #62 · python/release-tools

Takes the reproducibility recommendations on tar archives from these sources: https://reproducible-builds.org/docs/archives/ https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html#...
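
As a rough illustration of what those recommended options pin down, the same ideas (sorted member order, a fixed mtime, numeric owner 0, fixed modes) can be expressed with Python's tarfile module. This is a minimal sketch, not the release-tools script: the function name `reproducible_tar` and the chosen mode and mtime values are assumptions made for the example.

```python
import io
import tarfile


def reproducible_tar(files: dict[str, bytes], mtime: int = 0) -> bytes:
    """Build an uncompressed tar whose bytes depend only on names and contents."""
    buf = io.BytesIO()
    # Pin the header format explicitly rather than relying on the library default.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):        # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime            # fixed timestamp (assumed value)
            info.mode = 0o644             # fixed permissions (assumed value)
            info.uid = info.gid = 0       # numeric owner only
            info.uname = info.gname = ""  # no user or group names
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


first = reproducible_tar({"b.txt": b"two", "a.txt": b"one"})
second = reproducible_tar({"a.txt": b"one", "b.txt": b"two"})
same_bytes = first == second
```

Compression is left out on purpose: a gzip wrapper adds its own timestamp field, which would need pinning separately.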

gkrnours

@gkrnours@mastodon.gamedev.place · 4 hours ago

@hipsterelectron python could add a --reproducible flag that prints the link to the doc 😺

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

Extraction filter escapes (“ZIP slip”)

ZIP SLIP ZIP ZLIP

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

When installing a software package, the installation program needs to extract the contents of the archive to the system before continuing with the package installation.

no, zips are usable without blocking on extraction

This extraction process needs to be filtered to avoid modifying the system beyond what is intended.

actual made up false lie, pip just calls .extractall()

Escaping this filtering is commonly done in one of two ways,

you can't escape the filtering. the zip file is not alive

either with “parent directory” path segments (e.g, “..”)

banned everywhere

or abusing links such as symbolic or hard links,

zips don't support hard links
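
For reference, the standard library behaviour for the ".." case can be checked directly. A small sketch, assuming only CPython's documented zipfile behaviour of dropping ".." components from member names on extraction. The member name and directory layout are made up for the example, and this says nothing about what any particular installer does on top.

```python
import io
import os
import tempfile
import zipfile

# Build an in-memory ZIP whose only member tries to climb out of the target.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../../escape.txt", "hello")

with tempfile.TemporaryDirectory() as parent:
    dest = os.path.join(parent, "dest")
    os.mkdir(dest)
    with zipfile.ZipFile(buf) as zf:
        zf.extractall(dest)
    # The ".." segments are stripped, so the file lands inside dest.
    inside = os.path.exists(os.path.join(dest, "escape.txt"))
    outside = os.path.exists(os.path.join(parent, "escape.txt"))
```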

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

either through extraction order,

the order is unambiguous

path collisions,

yeah it might write to a file twice

or path truncations

literally not a thing

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

METADATA is able to inject deps through parser confusion because it uses the worst code i've ever seen in my entire life https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py and the packaging maintainer personally ensures it never uses my json format, but pypi doesn't check uploads for that shit

The trouble is that tar and ZIP archives are designed to be able to do these actions that can be abused by malicious archives.

MALICIOUS MAINTAINER!!!

GitHub

cpython/Lib/email/_header_value_parser.py at main · python/cpython

The Python programming language. Contribute to python/cpython development by creating an account on GitHub.

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

bruh omg

There is a deprecation warning for using no defined filter when extracting an archive

imagine being so desperate to scare your users you'll add a deprecation warning to the language and won't ever block actually malicious uploads to pypi

with the default changing to “data” in Python 3.14

bruh https://peps.python.org/pep-0706/#filters this just breaks any build pipelines extracting exes and clobbers perms in absurd ways that really unnerve me

Python Enhancement Proposals (PEPs)

PEP 706 – Filter for tarfile.extractall | peps.python.org

The extraction methods in tarfile gain a filter argument, which allows rejecting files or modifying metadata as the archive is extracted. Three built-in named filters are provided, aimed at limiting features that might be surprising or dangerous. These ...
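
To make the "data" filter concrete, here is a minimal sketch of what it does with a member named "../escape.txt". It assumes an interpreter that ships the PEP 706 filters (hence the `hasattr` guard). The archive is built in memory and the member name is invented for the example.

```python
import io
import tarfile
import tempfile

# A tar with one member that points outside the extraction directory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello"
    info = tarfile.TarInfo("../escape.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

rejected = None  # stays None on interpreters without extraction filters
if hasattr(tarfile, "data_filter"):
    with tempfile.TemporaryDirectory() as dest, tarfile.open(fileobj=buf) as tar:
        try:
            tar.extractall(dest, filter="data")
            rejected = False
        except tarfile.FilterError:
            # OutsideDestinationError is a subclass of FilterError.
            rejected = True
```

Passing `filter=` explicitly also avoids the deprecation warning mentioned in the quoted text, whichever filter is chosen.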

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

All installers are recommended to use filtering when extracting Python package archives to avoid attacks like path traversal.

path traversal!!!!! scary!!!!!!!

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

The complexity of the ZIP and tar archive standards means that implementations can vary from each other.

i really need to figure out how to get more people asking questions about METADATA and why brett cannon and donald stufft specifically are blocking an accepted PEP to translate it to JSON https://github.com/pypa/packaging/issues/570 he deleted my comments about this

GitHub

Plans for packaging.metadata · Issue #570 · pypa/packaging

Parse "raw" metadata (#671) Parse "strict" metadata Emit "strict" metadata

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

i understand it's extremely subtle because there's no spec at all for the format specifying DEPENDENCIES but it's literally the only thing pypi actually needs to scan uploads for and instead this fucking guy wants to call parser confusion on MY BABY! MY CHILD WHOM I ADORE!

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

Using variations for which files are observed during extraction allows an attacker to smuggle files within an archive that aren’t detected, reviewed, or scanned, but then are installed on the target system upon installation.

this is literally actually something you can do with the METADATA file in every wheel and inject dependencies that escape detection. the zip format does not have variations that add new well-formed package names to the dependency list by adding a bit of invisibly malformed utf-8

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

i am not amazed at this guy

Extraction denial-of-service (“ZIP bombs”)
Specially crafted ZIP archives, often called “ZIP bombs”, can be created to be only 10MB when compressed, but when extracted, result in hundreds of terabytes of data.

zip archives aren't compressed. they are plaintext

This leads to resource exhaustion on the extracting system, as most systems aren’t prepared to handle this amount of data.

ok so resource exhaustion. decompression. please explain how pypi security engineer in residence

This is possible because the ZIP and tar archive formats have separate entries for their “central directory” authoritative list of files within the archive, and “local file” entries which describe the file and its contents within the archive.

tar doesn't have a central directory

This separation means that a single “file” can be referenced multiple, potentially thousands or millions of times, within a single ZIP file central directory, leading to a high and exploitable “compression ratio”.

so:

- you iterate over the central dir and not the local entries
- it tells you to extract the same file "millions" of times
- you will write the same file path "millions" of times

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

sorry like i think this should easily be a fireable offense

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

A separate category of this vulnerability is creating archive files that, due to flaws in the implementation, never terminate the extraction process.

yeah this is the other part when cpython has some completely broken code that would never stand up to scrutiny and actively ignores where the spec tells you to validate offsets

This can be seen in CVE-2025-8194 where the “tarfile” module allowed tar member entry offsets to be negative,

any conforming tar impl would immediately reject the malformed data

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

this is a great fucking comment though
https://github.com/python/cpython/issues/130577#issuecomment-3285698061

Am I correct when I think that versions of Python before this commit: ac3d137 are not vulnerable because the nti function did not support negative numbers, and therefore a header cannot contain a negative number leaving to a negative seek call?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

bro this guy seth

Denial of service vulnerabilities for archives aren’t a massive concern for installers, as the worst-case is that the installation fails or a system is rendered unusable until backup or restore occurs.

literally just an evil man

However, for package repositories, a denial of service can interrupt service for users.

package repositories don't extract archives

For package repositories, the easiest way to mitigate this vulnerability is to not extract the whole contents of an archive.

ok but pay close attention to this next part

Instead, most of the information required from an archive, such as checking the existence or contents of a metadata file or calculating a checksum, can be done without inflating the contents of the archive.
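
A sketch of what "without inflating the contents" can look like with Python's zipfile: hash the archive bytes as-is, list names from the central directory, and decompress only the one member of interest. The wheel layout and file names here are stand-ins invented for the example.

```python
import hashlib
import io
import zipfile

# Build a small stand-in wheel in memory (names are illustrative only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("demo-1.0.dist-info/METADATA", "Name: demo\nVersion: 1.0\n")
    zf.writestr("demo/big.bin", bytes(1_000_000))
archive = buf.getvalue()

# Checksum of the archive itself: no decompression involved.
digest = hashlib.sha256(archive).hexdigest()

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    # Names come from the central directory; nothing is inflated here.
    names = zf.namelist()
    # Only this single member is decompressed.
    metadata = zf.read("demo-1.0.dist-info/METADATA").decode("utf-8")
```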

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

i can't believe he would specifically advise not extracting a METADATA file, to avoid the specific file that can actually give you an RCE if you use pip, because they don't accept my PRs anymore

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

To prevent ZIP bombs from having any impact on installers, package repositories can impose limits on the “compression ratio” and “uncompressed size” of archives. PyPI currently sets its ratio limit at 50x to allow for data-heavy packages to still be published.

50x compression ratio is actively literally impossible this can be proven with information theory. 8x is the best you can get via prefix trees and tANS sometimes goes up to 10x for very repetitive data
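
The ratio under discussion can be measured rather than argued. A minimal sketch that sums the sizes declared in a ZIP's central directory and compares them against the 50x figure quoted above. The helper name `compression_ratio` is hypothetical, the input is a deliberately extreme run of zero bytes, and this is not a claim about how PyPI computes its limit.

```python
import io
import zipfile

RATIO_LIMIT = 50  # the figure quoted above, taken at face value


def compression_ratio(zf: zipfile.ZipFile) -> float:
    """Declared uncompressed bytes divided by compressed bytes."""
    infos = zf.infolist()
    compressed = sum(i.compress_size for i in infos)
    uncompressed = sum(i.file_size for i in infos)
    return uncompressed / compressed if compressed else 0.0


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("zeros.bin", bytes(1_000_000))  # highly repetitive on purpose
with zipfile.ZipFile(buf) as zf:
    ratio = compression_ratio(zf)
too_dense = ratio > RATIO_LIMIT
```

On this input DEFLATE lands well above 50x. Ordinary package contents compress far less, which is why a ratio limit mostly catches pathological archives.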

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

This means executables can have ZIPs appended to them to provide a “filesystem” that the executable can access by reading its own file path as a ZIP archive instead of an EXE.

this is why i'm fractal zip paging in the macrokernel. but also this actually won't work, you have to specially construct the zip on top of the exe. i know seth isn't merely mistaken here and is actively lying because i have done this a million times and it's quite tricky to ensure you get a valid zip that avoids immediately erroring
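
For the simplest case, the behaviour is easy to probe. This sketch puts arbitrary bytes in front of a small non-ZIP64 archive and opens the result with CPython's zipfile, which is the same trick the standard zipapp module relies on for its shebang line. The stub bytes are a stand-in for an executable. Other ZIP readers may be stricter, and nothing here covers ZIP64 or archive comments.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")
plain_zip = buf.getvalue()

# Stand-in for an executable: arbitrary bytes placed in front of the ZIP.
stub = b"#!/usr/bin/env python3\n"
combined = stub + plain_zip

# zipfile locates the end-of-central-directory record from the end of the
# data and compensates for the bytes in front of the archive.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    text = zf.read("hello.txt")
```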

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

As a program “opening” the appended ZIP, the file appears to either be an executable or a ZIP depending on which direction the file is read from:

he has this incredibly half-assed little graphic that tries very hard to make this confusing

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

These two features combined mean it’s impossible to accurately parse a ZIP64 archive using both features with “prepended data”.

actively false https://github.com/python/cpython/pull/139702 cpython intentionally chose not to support zip64 file comments before this change. in this change, the same fucking guy who closed an issue for the racist buffer overflow now actively chose to recognize "prepended data", specifically so he could then raise a new fucking error

GitHub

gh-139700: Check consistency of the zip64 end of central directory record by serhiy-storchaka · Pull Request #139702 · python/cpython

Support records with "zip64 extensible data" if there are no bytes prepended to the ZIP file. Issue: Add more consistency checks in zipfile #139700

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

ok so this was worth reading again because it explains why i had to implement zip parsing a third fucking time last year because the stdlib suddenly broke my exes

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

Packaging ecosystems should be specific in their use of archive formats like tar and ZIP.

not METADATA, which is fine

For example, Python’s “zipfile” module currently doesn’t support multiple ZIP features, which are extremely uncommon today,

says the guy who just broke my exes out of spite

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

then they mention fuzzing and not my fuzzing for the good zip impl i wrote

Differential testing, running the same inputs through two different implementations and seeing if the results match, is also an effective method of finding bugs in data exchange formats

pretty sure this is called a compliance suite

The implementation with the fewest differentials

the most unique

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

Defining a “canonical” archive layout within the specifications of package archives,

rejecting valid inputs you personally dislike

along with tools or policy to push the ecosystem towards that canonical layout,

injecting code into other people's build processes

can reduce reproducibility issues

and concern trolling maintainers

and make future improvements to restrict archive features possible.

can be an effective strategy to finally make a programming language community susceptible to SHA-256 length extension attacks

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

Go’s packaging ecosystem has a concept called a “DirHash,”

google NIHed the idea of hashing a directory tree so they could inject their own very specific semantics

which is a pre-defined method of creating a checksum

(all checksums are "pre-defined" or they wouldn't match)

from the extracted file contents of an archive

which is not a directory

regardless of metadata, timestamps, and insertion order.

they specifically made 100% sure that length extension was possible by only recording SHA-256 hashes and not lengths, even though google's own API for bazel remote executions includes the length for this reason

they even made sure you couldn't tell which files may have changed by running the result through another SHA-256 shredder

dirhash package - golang.org/x/mod/sumdb/dirhash - Go Packages
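
For readers who have not seen it, the general shape of a content-only directory hash can be sketched in a few lines. This is modeled loosely on my understanding of Go's `Hash1` (sorted paths, one "digest, two spaces, path" line per file, then a hash over that summary). It has not been checked for byte-compatibility with Go, and the function name `dirhash_like` is made up.

```python
import base64
import hashlib


def dirhash_like(files: dict[str, bytes]) -> str:
    """Hash a {path: contents} mapping, ignoring order, timestamps and modes."""
    summary = hashlib.sha256()
    for name in sorted(files):
        if "\n" in name:
            raise ValueError("newlines in file names are not supported")
        file_digest = hashlib.sha256(files[name]).hexdigest()
        # One summary line per file: hex digest, two spaces, path.
        summary.update(f"{file_digest}  {name}\n".encode("utf-8"))
    return "h1:" + base64.b64encode(summary.digest()).decode("ascii")


a = dirhash_like({"pkg/b.py": b"y = 2\n", "pkg/a.py": b"x = 1\n"})
b = dirhash_like({"pkg/a.py": b"x = 1\n", "pkg/b.py": b"y = 2\n"})
c = dirhash_like({"pkg/a.py": b"x = 1\n", "pkg/b.py": b"y = 3\n"})
```

Insertion order does not affect the result, while any content change does. The summary lines carry digests and paths only, with no lengths.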

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

the dirhash link wasn't there last time i looked and this is the first actual "proof" i've found of google specifically seeking to inject length-extension attacks from merkle-damgård constructions like SHA-256 into packaging ecosystems

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

This removes many system and implementation-specific differences of archives from affecting the final “checksum” of the package archive,

this lets us inject whatever we want into your build process

making this mechanism more reliable to use compared to byte-for-byte comparisons.

it is very important that you avoid formats with plaintext metadata

This checksum method doesn’t account for links, empty directories, or archives that contain different contents depending on the extracting implementation.

this checksum method will produce different output than you'd expect, and we won't explain why!

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

Packaging ecosystems like Python’s often have many tools with different organizational structures and roadmaps, and therefore cannot easily coordinate on ecosystem-wide challenges like archive format features.

pypi suddenly stopped supporting negative range requests for wheels immediately after i gave a talk about my work at packagingcon 2023 https://web.archive.org/web/20250218154403/https://cfp.packaging-con.org/2023/talk/hpuhu7/, but they refused to implement PEP 658 for years after i added it to pip, and would provide no point of contact to explain why

Python Resolution Evolution: Decoupling Metadata from Downloads in Pip PackagingCon 2023

- *[slides](https://docs.google.com/presentation/d/1sw3XmDtp3as1Iy-kSfjbCuYr2wD3NHg4ilMOPx5HKn8/edit?usp=drivesdk)*
- *[video](https://www.youtube.com/watch?v=DTKSnU2EqBY)*

Until 2020, Python applications shipped using a "greedy" or non-backtracking resolver, which led to significant angst and workarounds: within larger organizations, by pinning every dependency, while historically forcing open source projects to under-constrain their declared dependencies. While there was a lot of discussion at the time about whether to dive into using a SAT solver or other techniques requiring native dependencies, it ended up being much easier for a bootstrapping tool like pip to use a pure-Python solution.

At the same time, other rumblings have been heard across the Python community. Spack, another package manager written in Python, ended up rewriting its concretizer to use an ASP logic solver written in C++. As machine learning continues to become hotter and hotter, Python itself has finally become able to reconsider the Global Interpreter Lock to allow for greater control by native code. When I first began looking at this problem, I had been contributing to the pants build tool and helping to define our Python-level API to orchestrate build steps executed in parallel with Rust. In all of these cases, there is a desire to retain the programmability and hackability of Python without experiencing growing pains as it gets applied to more and more tasks.

However, I was surprised to find that improvements to pip resolve performance and the time to create deployable applications from that resolution largely came from managing i/o more effectively, and not (as is often alleged) by any inherent slowness of Python as an implementation language. Indeed, I demonstrate a case later in the talk regarding the production of single-file zipapp (pex) executables (as were used by the Twitter Cortex team) where, although I have created a Rust project and Python extension to slightly improve the performance of this task, the main thrust of the speedup by orders of magnitude was simply in figuring out how to avoid redoing computation by re-compressing very large ML frameworks every time, so the pex project is able to achieve this speedup without taking on any native dependencies.

Most significantly, the pip project has recently been putting in immense effort to enable "virtualized" metadata-only requirements in its implementation that allow the resolver to make progress without downloading massive binary wheels. After I posted an initial prototype of zip file hackery with HTTP range requests to get around Python not having developed a metadata standard yet, github.com/McSinyx made that code production-ready into `--use-feature=fast-deps` as a Google Summer of Code project. Later, to address the case of sdists which cannot be manipulated in this way, github.com/uranusjr proposed and got accepted PEP 658, which now provides metadata for wheels on pypi, while github.com/sbidoul took over and shipped `pip install --report` to make pip's resolution algorithm available to downstream consumers handling download and install separately, which was the key innovation underlying the feature I shipped for the Cortex team. With recent work, this will allow `pip install --dry-run --report` to avoid downloading any artifacts at all, which enables users like Twitter Cortex necessarily building very large applications to avoid performing large amounts of network i/o just to figure out what they need to include in their binary.

Talk will discuss:
- The workloads from the Twitter Cortex team and representative speedups from that work (see https://github.com/pantsbuild/pants/pull/8793).
- How pip maintainers responded to my initial proposal (https://github.com/pypa/pip/issues/7819), and how other stakeholders weighed in on its general utility, expressed faith in the idea, and invested effort to move it forward (see saga at https://github.com/pypa/pip/issues/53).
- How tools like pex have integrated the pip resolver to form extremely slick interfaces you can build other tools on top of (especially showcasing https://github.com/pantsbuild/pex/pull/2175).
- What remains to be done to make pip the fastest resolver in the west (especially work at https://github.com/pypa/pip/issues/12184).

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

This is especially true when the issue arises from an ambiguity in a packaging standard, which can take months to fix and often requires public correspondence, meaning users are left to fend for themselves from exploits while standards are fixed.

yeah it's crazy how brett cannon deleted my comments identifying METADATA as a literal RCE vector and refuses to support the JSON transformation i added to pip years ago.

too bad how this keeps breaking builds https://github.com/scikit-build/scikit-build-core/issues/1006

GitHub

email.errors.HeaderWriteError: folded header contains newline · Issue #1006 · scikit-build/scikit-build-core

Not sure what this error means... 2025-03-01T14:05:36.0141774Z *** scikit-build-core 0.11.0 using CMake 3.31.6 (metadata_wheel) 2025-03-01T14:05:36.0557788Z Traceback (most recent call last): 2025-...

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

newline parser confusion means pip will misread Requires-Dist: lines

https://github.com/python/cpython/issues/117313

However, in cases where a header value may contain multi-byte Unicode sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as legacy ASCII 'form-feed', and deemed to be a line ending.

this is a textbook injection attack and no one in python packaging is interested in parser confusion from the fucked up email header parsing

GitHub

email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences · Issue #117313 · python/cpython

cpython/Lib/email/policy.py Line 208 in eefff68 lines = value.splitlines() I think it's problematic that the method email.policy.EmailPolicy._fold() relies on the generic str / bytes method .splitl...
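
The `splitlines()` behaviour the issue describes is easy to reproduce in isolation. A tiny sketch with a made-up header value containing a form feed: it only demonstrates the string method, not what pip or the email package does with the result.

```python
# \x0c is ASCII form feed; str.splitlines() treats it as a line boundary.
value = "first part\x0csecond part"

lines = value.splitlines()

# Splitting on "\n" alone keeps the value in one piece.
newline_only = value.split("\n")
```

`splitlines()` recognises several other boundaries besides "\n" and "\r", which is the general reason it is a risky choice for a format whose only intended line ending is a newline.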

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 3 hours ago

the METADATA format is significantly older than zip archives, without any actual spec to compare it against

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

However, applying protections like rejecting archives from a large public package repository like PyPI is not without challenge.

you do it all the damn time seth

One of the biggest difficulties for developing open source package repositories is that maintainers don’t know who or what is interacting with their systems.

bullshit. google_pasta-0.2.0-py3-none-any.whl is actively exploiting the METADATA vuln and they know what they're doing

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

For this reason, there needs to be caution and care put into any decision that affects an entire ecosystem.

usually we do PEPs for that. but pypi doesn't do PEPs

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

oh my god forgot they had another blog post https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks/

especially notice how our boy seth calls out the wheel format PEP for being vague:

The most detail provided is:

and then does not propose a PEP change. he does have another accusation though:

However, most Python installers today do not do this check and extract the contents of the ZIP archive similar to unzip and then amend the installed RECORD within the virtual environment so that uninstalling the package works as expected.

our boy seth is claiming that "most python installers" will:

- recompute checksums for the files it just extracted
- open the RECORD file and write the new value

and then implies this is some sort of cover-up for the uninstall operation

Preventing ZIP parser confusion attacks on Python package installers - The Python Package Index Blog

PyPI will begin warning and will later reject wheels that contain differentiable ZIP features or incorrect RECORD files.

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

the author of the wheel PEP seth calls out was one of the few people who reached out to help me with my http zip PR that pip still refuses to merge https://github.com/pypa/pip/pull/12208

GitHub

perform 1-3 HTTP requests for each wheel using fast-deps by cosmicexplorer · Pull Request #12208 · pypa/pip

Continued motivation for fast-deps While PEP 658 is the standards-compliant solution and metadata from there is already preferred when available, --use-feature=fast-deps avoids downloading wheels a...

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

it's so fucking batshit to make a whole blog post about "popular installers" failing to check the RECORD file without mentioning or contacting any of them to fix it

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

ok i'm more insane now but i'm also much more certain that seth larson writing active falsehoods about zips and checksums is because:

- plaintext metadata and redundant offsets make it very difficult to retroactively modify data without breaking a human-checkable invariant
- storing both compressed and uncompressed size means the decompression process cannot be used to inject extra data through length extension
- zip files having a magic number and unambiguous offsets at the end possibly foils length extension
- when negative range requests suddenly broke on pypi, that was likely about length extension

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

METADATA on the other hand is not plaintext, because its semantics depend upon the version of cpython/pip, and are not machine-checkable or human-visible

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

google's dirhash is so fucking funny though

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

problem is to make a good proof of concept for length extension you need to have a ton of parallel compute, close to a whole data center, to figure out how to go from:
(A) length N, checksum C [good]
(B) length N + M, checksum C [bad]

since SHA-256 is cryptographic, you need to essentially mine a bitcoin

it's a little harder than a bitcoin aiui, since you need to ensure the result N + M seems valid at first, but actually injects some evil data

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

https://en.wikipedia.org/wiki/Length_extension_attack

this page claims SHA-256/512 and SHA3 are not susceptible, which i'm pretty sure is false? it does admit:

A secret suffix MAC, which is calculated as Hash(message ‖ secret), isn't vulnerable to a length extension attack, but is vulnerable to another attack based on a hash collision.[7]

ah, so i was mistaken! length extension seems to require that the valid prefix is not known to the attacker

Length extension attack - Wikipedia

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

anyway, you know who operates pypi and has a lot of spare GPUs? obviously that's the very definition of google scale

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

that's why the RECORD file is weirdly redundant with the required zip metadata, EXCEPT: standard zips only provide a crc32 checksum for each entry, which is not cryptographic and much easier to collide with
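
The two kinds of checksum can be put side by side. A short sketch, assuming the wheel spec's RECORD hash encoding as I understand it (`sha256=` followed by the urlsafe base64 digest without padding). The payload and member name are invented.

```python
import base64
import hashlib
import io
import zipfile
import zlib

payload = b"example payload"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", payload)

# What the ZIP container itself stores per entry: a 32-bit CRC.
with zipfile.ZipFile(buf) as zf:
    stored_crc = zf.getinfo("a.txt").CRC
computed_crc = zlib.crc32(payload)

# What a RECORD line would carry for the same bytes (assumed encoding).
sha = hashlib.sha256(payload).digest()
record_hash = "sha256=" + base64.urlsafe_b64encode(sha).rstrip(b"=").decode("ascii")
```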

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

(i don't blame phil katz, he was only seeking to capture actual network errors, and computers were much slower back then.

i do blame zstd for their several layers of deeply odd functionality, including the optional non-cryptographic xxhash and multiple variants of "frames" and "blocks" which would decompress to 0-length outputs and are completely and obviously redundant in the protocol.)

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 2 hours ago

the reason i'm mentioning this is of course: how can we make the fractal zip even more powerful than its progenitor?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 1 hour ago

i continue to find it ridiculous that no one has identified the properties of tree hashing (especially blake3) to be ideal for a filesystem. but then again, most filesystems generally have no idea what's going on most of the time

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 60 minutes ago

https://github.com/BLAKE3-team/BLAKE3/tree/master/c

Hashing large files with this function usually requires memory-mapping, since reading a file into memory in a single-threaded loop takes longer than hashing the resulting buffer. Note that hashing a memory-mapped file with this function produces a "random" pattern of disk reads, which can be slow on spinning disks. Again it's important to benchmark your specific use case.

this is mostly unhelpful, but we will forgive them

GitHub

BLAKE3/c at master · BLAKE3-team/BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function - BLAKE3-team/BLAKE3

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 53 minutes ago

what this does get at is that blake3 can be equivalently computed using multiple distinct memory access patterns https://raw.githubusercontent.com/BLAKE3-team/BLAKE3-specs/master/blake3.pdf

what does it mean to compute blake3?

BLAKE3 splits its input into 1 KiB chunks and arranges them as the leaves of a binary tree.

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 49 minutes ago

note that the computation blake3 performs is a function of a whole contiguous region of memory, so this tree structure should not be interpreted as mapping to a filesystem VFS hierarchy. instead, the tree describes how to compose the checksum result from a sequence of subregions. in this, we can begin to infer how we might be able to avoid redoing work

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 42 minutes ago

essentially, the blake3 compute tree is an in-memory structure which affords some degree of splitting and merging. and this happens to be just the thing pages are good at, and just the thing we know how to do with our fractal zip!

the fractal zip structure does assume data independence across local entries, but once blake3 is computed for an entry, we should be able to reuse that output to compute blake3 sums across a sequence of such read-only local entries
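
The reuse idea can be sketched with a toy binary hash tree. To be loud about the assumptions: this is NOT BLAKE3. It uses blake2b from the standard library, a simplified pairing rule, and no chunk counter (real BLAKE3 mixes a counter into each chunk, so there a leaf's value depends on its position). The function names are made up. It only shows that leaf hashes computed per entry can be merged into a hash of the concatenation, provided entry boundaries fall on chunk boundaries.

```python
import hashlib

CHUNK = 1024  # leaf size in bytes, borrowed from the 1 KiB figure quoted above


def leaf_hashes(data: bytes) -> list[bytes]:
    """Hash each fixed-size chunk independently (domain-separated with 0x00)."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    return [hashlib.blake2b(b"\x00" + c, digest_size=32).digest() for c in chunks]


def root_from_leaves(level: list[bytes]) -> bytes:
    """Combine hashes pairwise until one remains; an odd node is carried up."""
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            pair = b"\x01" + level[i] + level[i + 1]
            nxt.append(hashlib.blake2b(pair, digest_size=32).digest())
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


entry_a = bytes(range(256)) * 8  # 2048 bytes: exactly two chunks
entry_b = b"tail" * 300          # 1200 bytes: one full chunk plus a partial one

# Hash the concatenation from scratch ...
whole = root_from_leaves(leaf_hashes(entry_a + entry_b))
# ... or reuse per-entry leaf hashes that were computed independently.
reused = root_from_leaves(leaf_hashes(entry_a) + leaf_hashes(entry_b))
```

If `entry_a` were not a multiple of the chunk size, the chunking of the concatenation would shift and the cached leaves for `entry_b` would no longer apply.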

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 18 minutes ago

"what's the fractal all about?" well, there is a clearly recursive structure in the current sketch https://codeberg.org/cosmicexplorer/dice/src/branch/main/TODO.org although the nesting may be "inside-out" and the "prefix" is not actually an element of the data structure!

here's how we might think of our in-memory buffers:

    (local)))
  (central
(file)

the most flexible component is the contiguous (gapless) sequence of local entries. our central entries are also gapless, and must always correspond to the same sequence of local entries. this ensures the split/merge semantics of the local series can always be mirrored for the central series

aeva

@aeva@mastodon.gamedev.place · 5 hours ago

@hipsterelectron couldn't the malicious maintainer simply turn off the reproducible builds tho