are filesystems on linux just safer than the network subsystems, or do they never expose any interface besides posix i/o and so have a much smaller, better-characterized attack surface?
The kernel never marks the corrupted page dirty for writeback, so the file on disk remains unchanged and ordinary on-disk checksum comparisons miss the modification. However, the page cache is what actually gets read when accessing the file, so the corrupted in-memory version is immediately visible system-wide.
i can't fucking believe the precise visibility vs persistence distinction is a key feature of the meme vuln. bad APIs are usually evil, i need to remember this
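you can actually watch the two views diverge from userspace: a normal read is served by the page cache while an O_DIRECT read goes to the device (the kernel only flushes *dirty* pages before a direct read, and the corrupted page is never dirty). rough sketch, made-up filename, no error handling:

```c
/* sketch: compare the page-cache view of a file with the on-disk view.
   "target.bin" is a made-up name; error handling elided. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *path = "target.bin";
    char cached[4096];

    /* view 1: ordinary read, satisfied from the page cache */
    int fd = open(path, O_RDONLY);
    ssize_t n = pread(fd, cached, sizeof cached, 0);
    close(fd);

    /* view 2: O_DIRECT read, skips the cache and hits the device;
       needs an aligned buffer and an aligned length */
    char *ondisk;
    posix_memalign((void **)&ondisk, 4096, 4096);
    fd = open(path, O_RDONLY | O_DIRECT);
    ssize_t m = pread(fd, ondisk, 4096, 0);
    close(fd);

    if (n > 0 && m >= n && memcmp(cached, ondisk, n) != 0)
        puts("page cache and disk disagree: visible but not persistent");
    else
        puts("views agree");
    free(ondisk);
    return 0;
}
```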
A core primitive underlying this bug is splice(): it transfers data between file descriptors and pipes without copying, passing page cache pages by reference. When a user splices a file into a pipe and then into an AF_ALG socket, the socket's input scatterlist holds direct references to the kernel's cached pages of that file. The pages are not duplicated; the scatterlist entries point at the same physical pages that back every read(), mmap(), and execve() of that file.
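for the shape of it, here's roughly what that fd plumbing looks like. hash op and file chosen arbitrarily, no error handling; the actual corruption needs a cipher op that writes its output in place, this only shows the zero-copy reference path:

```c
/* sketch of the splice plumbing: file -> pipe -> AF_ALG socket.
   "sha256" and /etc/hostname are arbitrary; error handling elided. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/if_alg.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&sa, sizeof sa);
    int op = accept(tfm, NULL, NULL);          /* per-request socket */

    int file = open("/etc/hostname", O_RDONLY); /* any cached file */
    int p[2];
    pipe(p);

    /* neither splice copies the bytes: the pipe buffer and then the
       socket's scatterlist hold references to the file's page cache pages */
    ssize_t n = splice(file, NULL, p[1], NULL, 4096, 0);
    splice(p[0], NULL, op, NULL, n, 0);

    unsigned char digest[32];
    read(op, digest, sizeof digest);           /* finalize the hash */
    for (int i = 0; i < 32; i++) printf("%02x", digest[i]);
    putchar('\n');
    return 0;
}
```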
so the splice() operation avoids copying through userspace. but you can improve perf over a userspace copy without literally handing out mutable refs: you can do the copy in kernel space
This in-place design is the root cause of the vulnerability. It places page cache pages in a writable scatterlist, separated from the legitimate write region by nothing more than an offset boundary.
they don't even have refs that enforce read/write permissions?
so the linux devs seem to exhibit this pattern of behavior wherein they:
- don't have any concept of i/o scheduling, or any way for applications to define a sequence of data dependencies, just heuristics
- any non-posix i/o API SIMPLY MUST be zero-copy in the kernel, because zero-copy is the only thing that is fast
- page cache entries are free real estate. is it the result of a read? is it the result of a write? who knows! it's visible everywhere immediately!
is fuchsia actually interesting or is it just bazelOS
i found this hpc paper with an argonne contributor while trying to find i/o prefetching research: https://dl.acm.org/doi/10.1145/1851476.1851499
apparently argonne has a parallel virtual file system that does "i/o scheduling" as i refer to it above: https://en.wikipedia.org/wiki/Parallel_Virtual_File_System
they do thankfully have their own non-posix i/o API. unfortunately it is very specifically for MD sims with MPI, which means (sketch after this list):
- one file is processed by every processor at once instead of each core (task) owning a distinct graph of read/write dependencies like a build tool
- read/write dependencies correspond to highly regular grid neighbors, largely a complete graph instead of corresponding to user-defined task dependencies
- persistence is not a correctness requirement
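for flavor, a sketch of that API shape (filename and slab size made up, error handling elided): every rank collectively reads its own slab of one shared file, and the MPI-IO layer underneath is free to schedule the aggregate pattern:

```c
/* sketch: the MPI-IO flavor of "one file, every rank at once".
   "grid.dat" and the slab size are assumptions; error handling elided. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "grid.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* collective read: every rank participates at once, so the library
       sees the whole access pattern and can schedule the aggregate i/o */
    const int count = 1 << 20;
    double *slab = malloc(count * sizeof *slab);
    MPI_Offset off = (MPI_Offset)rank * count * sizeof *slab;
    MPI_File_read_at_all(fh, off, slab, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}
```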
they still use linux but pvfs through mpi lets them kinda manage i/o separately because in fact The Heuristics Are Not Tuned For Their Oddball Shit. i agree with that
"i/o visibility" as i've defined it above is pretty curious. it is a pretty sparse graph that identifies produce/consume relationships from write W in task A to read R in task B.
i guess that's a big difference: the i/o visibility is actually how tasks are linked into a dependency graph. so we have two separate degrees of freedom to optimize over (toy example after this list):
- we can schedule tasks, which determines the runtime sequence of data fetching
- we can schedule fetches and commits
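to pin that down, a toy sketch (every name here is hypothetical, not a real API): the visibility edge forces task order, and fetch/commit are scheduled independently of it:

```c
/* hypothetical sketch: visibility deps as a task graph, with task
   scheduling and fetch/commit scheduling kept separate */
#include <stdbool.h>
#include <stdio.h>

struct blob { int id; bool resident; bool persisted; };
struct task { int id; struct blob *reads[2]; int n_reads;
              struct blob *writes[2]; int n_writes; };

/* degree of freedom 2: the i/o plan decides when blobs move */
static void fetch(struct blob *b)  { b->resident = true;  printf("fetch %d\n", b->id); }
static void commit(struct blob *b) { b->persisted = true; printf("commit %d\n", b->id); }

static void run(struct task *t) {
    for (int i = 0; i < t->n_reads; i++)   /* reads must be resident */
        if (!t->reads[i]->resident) fetch(t->reads[i]);
    printf("task %d\n", t->id);
    for (int i = 0; i < t->n_writes; i++)
        t->writes[i]->resident = true;     /* produced, visible in memory */
}

int main(void) {
    struct blob w = { .id = 0 };           /* write W, produced by A */
    struct task a = { .id = 1, .writes = { &w }, .n_writes = 1 };
    struct task b = { .id = 2, .reads  = { &w }, .n_reads  = 1 };

    /* degree of freedom 1: task order. the visibility edge forces
       A before B, but nothing forces persistence in between */
    run(&a);
    run(&b);
    commit(&w);                            /* off the critical path */
    return 0;
}
```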
here's another domain assumption: visibility deps tend to follow structured concurrency (split, join, straight line), which makes data locality easier to solve for. i could use the double-ended zip format both in-flight and persisted, because it's easy to split and merge but can still be indexed out of band for visibility lookups. i like having a serializable format for the ephemeral filesystem environments of sandboxed process invocations, one that supports append/concat for straight-line write->read deps and can then be dumped to persistence
and now that we mention ephemeral filesystem trees and sandboxing it seems that this visibility graph is already a kind of structured permissions system. that we can codify a task's visible memory space with a uniquely-owned contiguous region of memory seems like a wonderful thing. it's ridiculous that every process can read every file by default
oh yeah, linux unshare supports creating process chroots from overlays, and i was gonna do this + FUSE for a build system, but the docs are exceptionally atrocious. a FUSE daemon that just allocates several gigs of RAM plus a direct-mapped file for persistence seems like a plausible way to roll your own page cache and filesystem in a pinch (sketched below), but:
(1) libfuse is unmaintained
(2) if i want to understand persistence i think i want to write the nvme driver
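still, the skeleton of that daemon is short. a hedged sketch with made-up names, a 1 MiB store standing in for the gigabytes, and error handling elided:

```c
/* sketch of the roll-your-own-page-cache idea: one file served out of an
   mmap of a backing image. build: gcc memfs.c $(pkg-config fuse3 --cflags --libs) */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

enum { STORE_SIZE = 1 << 20 };   /* 1 MiB stand-in for "several gigs" */
static char *store;              /* the mmap'd persistence file */

static int fs_getattr(const char *path, struct stat *st,
                      struct fuse_file_info *fi) {
    (void)fi;
    memset(st, 0, sizeof *st);
    if (!strcmp(path, "/"))     { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (!strcmp(path, "/data")) { st->st_mode = S_IFREG | 0644; st->st_nlink = 1;
                                  st->st_size = STORE_SIZE; return 0; }
    return -ENOENT;
}

static int fs_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi) {
    (void)fi;
    if (strcmp(path, "/data")) return -ENOENT;
    if (off >= STORE_SIZE) return 0;
    if (off + (off_t)size > STORE_SIZE) size = STORE_SIZE - off;
    memcpy(buf, store + off, size);    /* reads come straight from RAM */
    return (int)size;
}

static int fs_write(const char *path, const char *buf, size_t size, off_t off,
                    struct fuse_file_info *fi) {
    (void)fi;
    if (strcmp(path, "/data")) return -ENOENT;
    if (off + (off_t)size > STORE_SIZE) return -EFBIG;
    memcpy(store + off, buf, size);    /* dirty in RAM; msync() is the commit */
    return (int)size;
}

static const struct fuse_operations ops = {
    .getattr = fs_getattr, .read = fs_read, .write = fs_write,
};

int main(int argc, char *argv[]) {
    int fd = open("backing.img", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, STORE_SIZE);
    store = mmap(NULL, STORE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return fuse_main(argc, argv, &ops, NULL);
}
```

(no readdir, so ls won't list /data, but reading and writing the path works; msync or a flush handler would be the persistence hook)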
i believe to execute mount you have to be root (CAP_SYS_ADMIN), and i believe this includes overlayfs, so if you want your ephemeral overlayfs (a very basic thing to want; see the syscall sketch after this list) you need to fucking:
- unshare with the incantation that maps your current uid to 0 (root), producing an interactive subshell
- from this persistent subshell, mount into a named directory to construct the ephemeral overlay
- fucking unshare again to map the overlay dir as the root fs, with extra incantations for each isolation mode (e.g. network)
- execute your process
- output lands in the overlay chroot
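collapsed into a single process, the whole dance is just a handful of syscalls. a sketch, assuming /tmp/{lower,upper,work,merged} already exist, a 5.11+ kernel that allows unprivileged overlayfs mounts, and a shell inside the merged tree; error handling elided:

```c
/* sketch of the incantation as raw syscalls; paths are assumptions */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static void write_file(const char *path, const char *s) {
    int fd = open(path, O_WRONLY);
    write(fd, s, strlen(s));
    close(fd);
}

int main(void) {
    char map[64];
    uid_t uid = getuid();
    gid_t gid = getgid();

    /* step 1: new user + mount namespaces, no actual root required */
    unshare(CLONE_NEWUSER | CLONE_NEWNS);

    /* map our real uid/gid to 0 ("root") inside the namespace */
    snprintf(map, sizeof map, "0 %u 1", (unsigned)uid);
    write_file("/proc/self/uid_map", map);
    write_file("/proc/self/setgroups", "deny");  /* required before gid_map */
    snprintf(map, sizeof map, "0 %u 1", (unsigned)gid);
    write_file("/proc/self/gid_map", map);

    /* keep our mounts from propagating to the parent namespace */
    mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);

    /* step 2: the ephemeral overlay */
    mount("overlay", "/tmp/merged", "overlay", 0,
          "lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work");

    /* step 3: make it the root fs and run the task; writes land in
       /tmp/upper, which can be thrown away afterwards */
    chroot("/tmp/merged");
    chdir("/");
    execlp("sh", "sh", "-c", "echo hi > /out.txt", (char *)NULL);
    perror("execlp");   /* only reached if the merged tree has no shell */
    return 1;
}
```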
and of course you most definitely cannot isolate everything and allowlist, linus torvalds assumes you just want to isolate one thing at a time. there's a flag for maintaining a persistent directory or file for the namespace, which as far as i can tell bind-mounts /proc/self/ns/* onto that path so the namespace outlives the process
and that still doesn't at all let you optimize across the three musketeers of i/o from above:
- application-level write chunking
- i/o visibility scheduling/prefetching
- persistence
omg wait does this mean i get to make my own libc. god, i'm gonna have to if i want to run any code, right
@hipsterelectron thinking of @dysfun wanting to write a fedi client and immediately going into the weeds of database theory