are filesystems on linux just safer than the network subsystems or do filesystems just never expose any interface besides posix i/o so they have a much smaller and better-characterized attack surface?
imagine if at any point in between lines of code in $LANG someone could modify the values of any variables in the current scope. and add some new variables to the stack
also it's unclear to me why io_uring needs to share buffers between kernel and userspace when getdents() just provides the buffer to the kernel to copy into. either io_uring is wrong or getdents() is wrong
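for reference, the getdents64() convention i mean: the caller hands the kernel a plain buffer and the kernel copies directory entries into it, no shared memory anywhere. a minimal sketch (struct layout per the man page; the helper name is mine):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* layout of the records the kernel copies into the caller's buffer,
 * as documented in getdents64(2) */
struct linux_dirent64 {
    uint64_t d_ino;
    int64_t  d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char d_name[];          /* NUL-terminated filename */
};

/* hypothetical helper: fill buf with entries from path.
 * returns bytes written, 0 at end of directory, -1 on error. */
static long read_dir_entries(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    /* the kernel copies into buf; no mapping shared with userspace */
    long n = syscall(SYS_getdents64, fd, buf, len);
    close(fd);
    return n;
}
```

linux-only, obviously, since it goes through the raw syscall.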
i don't understand why a ring buffer is used here when maintaining order is completely irrelevant
Completion events may arrive in any order, there is no ordering between the request submission and the associated completion. The SQ and CQ rings run independently of each other.
ok so the ring structure and semantics are not needed, the best and brightest of the linux kernel just didn't know any alternative asynchronous submission mechanism except the one with all the edge cases
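to be fair to the ring for a second: all the "ring semantics" amount to is a pair of free-running head/tail counters over a power-of-two array, with each side only ever advancing its own counter. a userspace-only sketch of that scheme (names are mine, no kernel involved):

```c
#include <assert.h>
#include <stdint.h>

/* minimal single-producer/single-consumer ring, modeled on the
 * head/tail counter scheme io_uring's SQ and CQ each use.
 * RING_SIZE must be a power of two so masking wraps the index. */
#define RING_SIZE 8
struct ring {
    uint32_t head;              /* consumer side advances this */
    uint32_t tail;              /* producer side advances this */
    int entries[RING_SIZE];
};

static int ring_push(struct ring *r, int v)
{
    if (r->tail - r->head == RING_SIZE)
        return -1;                          /* full */
    r->entries[r->tail & (RING_SIZE - 1)] = v;
    r->tail++;                              /* publish after the write */
    return 0;
}

static int ring_pop(struct ring *r, int *v)
{
    if (r->head == r->tail)
        return -1;                          /* empty */
    *v = r->entries[r->head & (RING_SIZE - 1)];
    r->head++;
    return 0;
}
```

(the real thing adds memory barriers since the two sides run in different address spaces, but the counter logic is this.)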
One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them.
yeah, "CQ ring" and "submission side" are obviously referring to the output and input rings. i called these CQ and SQ earlier but i get informal when explaining important differences
Hence the submission side ring buffer
a new name for SQ every time
is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there's
[page break]
some reasoning behind it.
whenever i have some reasoning for an odd and confusing quirk, i make sure to put it on the page after the quirk
Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation.
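as far as i can tell the indirection amounts to this: the ring slots hold indices, and the indices point into an sqe array whose entries the app can embed inside its own structures. a toy sketch of that double hop (all names mine, nothing here is the actual uapi):

```c
#include <assert.h>
#include <stdint.h>

/* sketch of the SQ indirection the paper describes: the shared ring
 * holds indices, not sqes, so an application can keep its sqes inside
 * its own request structures and still batch-submit them */
#define QUEUE_DEPTH 4

struct fake_sqe { int opcode; int fd; };

struct app_request {
    void *cookie;            /* application-private bookkeeping */
    struct fake_sqe sqe;     /* the sqe embedded in the app's struct */
};

static struct app_request reqs[QUEUE_DEPTH];   /* app-owned */
static uint32_t sq_array[QUEUE_DEPTH];         /* ring of indices */
static uint32_t sq_tail;

static void submit(uint32_t sqe_index)
{
    /* ring entry is an index into reqs[], not the sqe itself */
    sq_array[sq_tail & (QUEUE_DEPTH - 1)] = sqe_index;
    sq_tail++;
}

/* what the consuming ("kernel") side would do with a ring slot:
 * follow the index to find the actual sqe */
static struct fake_sqe *consume(uint32_t ring_slot)
{
    return &reqs[sq_array[ring_slot & (QUEUE_DEPTH - 1)]].sqe;
}
```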
odd and confusing feature of my privilege escalation API is justified by "some applications" being "flexible". oh wait no there's more
That in turn allows for easier conversion to the io_uring interface.
oh, so someone who pays their engineers hundreds of thousands of dollars per year asked linux to support a feature so that when they rewrote their code to use io_uring, they could put the submission queue inside another object
it makes sense that the submission queue is more complex than the completion queue, because that's where you provide malformed input for the kernel to misinterpret so you can RCE https://kernel.dk/io_uring.pdf. it would make no sense for the output queue to be complex, because i/o operations always have unambiguous expected results
oh it looks like the generically named copy dot fail was in fact a filesystem vulnerability specifically dependent upon the failure to codify transitions between the page cache and disk
The write bypasses the VFS path entirely; the corrupted page is never marked dirty
oh. so you're saying one of the specific distinctions i identified as an isolation boundary across independently testable module interfaces in a hypothetical microkernel an hour ago because the monolithic kernel just fails to specify interactions between subsystems is also the big major vuln. how surprising
in response to this https://toot.cat/@jamey/116580541741309473 i had typed out this research proposal i couldn't quite finish before starting the current thread
another reason i've been curious about models for OS components is because of the desire to specialize scheduling and caching logic for local filesystem i/o patterns, particularly from build tools which would like to align i/o scheduling with their highly structured task scheduler. i believe network i/o necessarily has somewhat variable latency even for highly structured local interconnects, whereas disk i/o has no such stochastic input and can schedule prefetching given a data dependency graph for a build task.
build tasks are pretty high-level, and rely upon a subprocess (someone else's code) to make calls to the i/o subsystem. but this parent/child communication isn't quite a "model" in the sense of formal verification, since it's used for performance and not correctness. the actual model checking i envision would attempt to factor filesystem functionality into independent components, enabling greater portability across OSes and hardware if the model admits a kind of composeable algebra that maps to hypothetical microkernel interfaces. model checking would verify cross-component behavior with particular respect to atomicity and memory safety.
the goal in all this is to support distinct implementation techniques for file i/o that can be arranged together (synthesized) to satisfy application-specific i/o patterns. a filesystem as of today is the sole uniform link from syscall to block device driver, and sorely needed features like atomic transactions and prefetch hints remain missing. instead of representing logical structure, they defer to heuristics to reverse-engineer that structure from decontextualized reads and writes.
in particular, the distinct operations of entering data into the filesystem, making it visible to other processes, and persisting to hardware are conflated, when for a build tool these are all separate phases of application logic. build sandboxes distinguish between the read-only ephemeral environment vs the generally write-only output paths. these distinctions can enable pipelining and locality optimizations, but the filesystem supports only the lowest common denominator interface, stripping away any additional structure.
there is a great deal of application-specific structure in local filesystem i/o, and representations of this in-process structure as well as the dependency relationships across processes can be composed from individually-tuned OS components. much like your proposal for drivers insightfully links portability to model checking, i hypothesize that formal verification of each component can be used to interface heterogeneous data layouts for exotic process i/o patterns with an intermediate layer which normalizes write data so it can be read as input or persisted to disk. much like drivers in monolithic C kernels, filesystems
The kernel never marks the corrupted page dirty for writeback, so the file on disk remains unchanged and ordinary on-disk checksum comparisons miss the modification. However, the page cache is what actually gets read when accessing the file, so the corrupted in-memory version is immediately visible system-wide.
i can't fucking believe the precise visibility vs persistence distinction is a key feature of the meme vuln. bad APIs are usually evil i need to remember this
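a quick demo of the conflation: a write() is visible through any other descriptor (or process) immediately via the page cache, fsync() or not, so "make visible" and "persist" aren't steps the application gets to sequence. a sketch, assuming a writable /tmp (helper name is mine):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* write through one descriptor, read through an independent one,
 * with no fsync() in between: the read sees the data anyway because
 * both go through the same page cache pages */
static int write_then_read_elsewhere(const char *path)
{
    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int rfd = open(path, O_RDONLY);
    if (wfd < 0 || rfd < 0)
        return -1;

    char buf[8] = {0};
    write(wfd, "data", 4);              /* no fsync(), no close() */
    long n = read(rfd, buf, sizeof(buf));

    close(wfd);
    close(rfd);
    return (n == 4 && memcmp(buf, "data", 4) == 0) ? 0 : -1;
}
```

whether those four bytes ever reach the disk is a separate question the API gives you no handle on beyond fsync().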
A core primitive underlying this bug is splice(): it transfers data between file descriptors and pipes without copying, passing page cache pages by reference. When a user splices a file into a pipe and then into an AF_ALG socket, the socket's input scatterlist holds direct references to the kernel's cached pages of that file. The pages are not duplicated; the scatterlist entries point at the same physical pages that back every read(), mmap(), and execve() of that file.
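the benign half of that primitive, minus the AF_ALG part, is just this: move file data into a pipe with no userspace copy, leaving the pipe holding references to the file's page cache pages rather than a duplicate. a thin sketch (wrapper name is mine):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* splice file data into a pipe without copying through userspace;
 * the pipe ends up referencing the file's page cache pages */
static long file_to_pipe(int file_fd, int pipe_wr, size_t len)
{
    /* NULL offset: use and advance the file's own offset */
    return splice(file_fd, NULL, pipe_wr, NULL, len, 0);
}
```

the vuln chain continues by splicing the pipe into an AF_ALG socket, which is where the by-reference pages land in a writable scatterlist; i'm not reproducing that part here.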
so the splice() operation avoids copying through userspace. but you can improve perf compared to userspace copy without literally just handing out mutable refs. you can literally do the copy in kernel space
This in-place design is the root cause of the vulnerability. It places page cache pages in a writable scatterlist, separated from the legitimate write region by nothing more than an offset boundary.
they don't even have refs that enforce read/write permissions?
so linux seems to exhibit this pattern of behavior wherein they:
- don't have any concept of i/o scheduling, or any way for applications to define a sequence of data dependencies, just heuristics
- any non-posix i/o API SIMPLY MUST be zero-copy in the kernel, because zero-copy is the only thing that is fast
- page cache entries are free real estate. is it the result of a read? is it the result of a write? who knows! it's visible everywhere immediately!
is fuchsia actually interesting or is it just bazelOS
i found this hpc paper with an argonne contributor when trying to find i/o prefetching research https://dl.acm.org/doi/10.1145/1851476.1851499 apparently argonne has a virtual file system that does "i/o scheduling" as i refer to it above https://en.wikipedia.org/wiki/Parallel_Virtual_File_System
they do thankfully have their own non-posix i/o API. unfortunately it is very specifically for MD sims with MPI which means:
- one file is processed by every processor at once instead of each core (task) owning a distinct graph of read/write dependencies like a build tool
- read/write dependencies correspond to highly regular grid neighbors, largely a complete graph instead of corresponding to user-defined task dependencies
- persistence is not a correctness requirement