are filesystems on linux just safer than the network subsystems or do filesystems just never expose any interface besides posix i/o so they have a much smaller and better-characterized attack surface?
i did think it was cute to make readdir() call getdents() (although i think readdir() should be deprecated because "mysterious pointer not valid across invocations" is just a bad omen)
"maximum size of dirent record across entries of this directory" could be a neat API. or a maximum entry size for the whole filesystem. just something that gives you a guaranteed minimum buffer size that will always store at least one record
it's very fucked up that the filesystem conflates:
- write data into OS buffer
- make data visible to all other processes
- make data persist to disk
each is individually problematic because it fails to represent application-specific constraints. but on top of their individual failings they also can't be distinguished from each other. that's two levels of conflation
you might say "ohohoho but of course there is le fsync mon cheri" and yes that's correct there is a single method conflating global visibility and persistence! but write() is generally eventually consistent and may even get persisted without fsync()! the filesystem offers no way to hint that write() output doesn't need to be visible or persisted until synced!
so the filesystem tries to infer scheduling criteria from the sequence of read/write/seek calls on the current fd, which is very much like fortune telling in that it infers generic and unhelpful things about your future
i'm also coming back as usual to @zwol's proposed transactional operations. i think "transactions" makes people think of it like a database, where a transaction tries to perform multiple global state changes in sequence atomically. i think "scratch space" is a much more common requirement. i want "make an anonymous dirfd, let me populate it, then hook it up to the rest of the world at this path". hell, let me reserve a file/directory entry at a given path so open()/opendir() calls fail until i release the token or commit the result
imagine if at any point in between lines of code in $LANG someone could modify the values of any variables in the current scope. and add some new variables to the stack
also it's unclear to me why io_uring needs to share buffers between kernel and userspace when getdents() just provides the buffer to the kernel to copy into. either io_uring is wrong or getdents() is wrong
i don't understand why a ring buffer is used here when maintaining order is completely irrelevant
Completion events may arrive in any order, there is no ordering between the request submission and the association completion. The SQ and CQ ring run independently of each other.
ok so the ring structure and semantics are not needed, the best and brightest of the linux kernel just didn't know any alternative asynchronous submission mechanism except the one with all the edge cases
One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them.
yeah, "CQ ring" and "submission side" are obviously referring to the output and input rings. i called these CQ and SQ earlier but i get informal when explaining important differences
Hence the submission side ring buffer
a new name for SQ every time
is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there's
[page break]
some reasoning behind it.
whenever i have some reasoning for an odd and confusing quirk, i make sure to put it on the page after the quirk
Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation.
odd and confusing feature of my privilege escalation API is justified by "some applications" being "flexible". oh wait no there's more
That in turns allows for easier conversion to the io_uring interface.
oh, so someone who pays their engineers hundreds of thousands of dollars per year asked linux to support a feature so that when they rewrote their code to use io_uring, they could put the submission queue inside another object
it makes sense that the submission queue is more complex than the completion queue, because that's where you provide malformed input for the kernel to misinterpret so you can RCE https://kernel.dk/io_uring.pdf. it would make no sense for the output queue to be complex, because i/o operations always have unambiguous expected results