are filesystems on linux just safer than the network subsystems or do filesystems just never expose any interface besides posix i/o so they have a much smaller and better-characterized attack surface?
i did think it was cute to make readdir() call getdents() (although i think readdir() should be deprecated because "mysterious pointer not valid across invocations" is just a bad omen)
"maximum size of dirent record across entries of this directory" could be a neat API. or a maximum entry size for the whole filesystem. just something that gives you a guaranteed minimum buffer size that will always store at least one record
it's very fucked up that the filesystem conflates:
- write data into OS buffer
- make data visible to all other processes
- make data persist to disk
each is individually problematic because it fails to represent application-specific constraints. but on top of their individual failings they also can't be distinguished from each other. that's two levels of conflation
you might say "ohohoho but of course there is le fsync mon cheri" and yes that's correct there is a single method conflating global visibility and persistence! but write() is generally eventually consistent and may even get persisted without fsync()! the filesystem offers no way to hint that write() output doesn't need to be visible or persisted until synced!
so the filesystem tries to infer scheduling criteria from the sequence of read/write/seek calls on the current fd, which is very much like fortune telling in that it infers generic and unhelpful things about your future
i'm also coming back as usual to @zwol's proposed transactional operations. i think "transactions" makes people think of it like a database, where a transaction tries to perform multiple global state changes in sequence atomically. i think "scratch space" is a much more common requirement. i want "make an anonymous dirfd, let me populate it, then hook it up to the rest of the world at this path". hell, let me reserve a file/directory entry at a given path so open()/opendir() calls fail until i release the token or commit the result
imagine if at any point in between lines of code in $LANG someone could modify the values of any variables in the current scope. and add some new variables to the stack
also it's unclear to me why io_uring needs to share buffers between kernel and userspace when getdents() just provides the buffer to the kernel to copy into. either io_uring is wrong or getdents() is wrong
i don't understand why a ring buffer is used here when maintaining order is completely irrelevant
Completion events may arrive in any order, there is no ordering between the request submission and the association completion. The SQ and CQ ring run independently of each other.
ok so the ring structure and semantics are not needed, the best and brightest of the linux kernel just didn't know any alternative asynchronous submission mechanism except the one with all the edge cases
One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them.
yeah, "CQ ring" and "submission side" are obviously referring to the output and input rings. i called these CQ and SQ earlier but i get informal when explaining important differences
Hence the submission side ring buffer
a new name for SQ every time
is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there's
[page break]
some reasoning behind it.
whenever i have some reasoning for an odd and confusing quirk, i make sure to put it on the page after the quirk
Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation.
odd and confusing feature of my privilege escalation API is justified by "some applications" being "flexible". oh wait no there's more
That in turns allows for easier conversion to the io_uring interface.
oh, so someone who pays their engineers hundreds of thousands of dollars per year asked linux to support a feature so that when they rewrote their code to use io_uring, they could put the submission queue inside another object
it makes sense that the submission queue is more complex than the completion queue, because that's where you provide malformed input for the kernel to misinterpret so you can RCE https://kernel.dk/io_uring.pdf. it would make no sense for the output queue to be complex, because i/o operations always have unambiguous expected results