are filesystems on linux just safer than the network subsystems or do filesystems just never expose any interface besides posix i/o so they have a much smaller and better-characterized attack surface?
i believe to execute mount you have to be root and i believe this includes overlayfs so if you want your ephemeral overlayfs (a very basic thing to want) you need to fucking:
- unshare with the incantation that maps your current uid to 0 (root), producing an interactive subshell
- from this persistent subshell, mount into a named directory to construct the ephemeral overlay
- fucking unshare again to map the overlay dir as root fs, with extra incantations for each isolation mode (e.g. network)
- execute your process
- output in overlay chroot
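a rough sketch of the steps above, collapsed into one command line. it only builds the argv and does not run anything. assumptions that are mine and not from the thread: the paths are hypothetical, the kernel allows overlayfs mounts inside a user namespace (i believe that needs 5.11 or later), and chroot stands in for the second unshare.

```python
import shlex


def overlay_sandbox_argv(lower, upper, work, merged, command, isolate_net=True):
    """Build (but do not run) an unshare command line for an ephemeral overlay.

    All directory arguments are hypothetical paths chosen by the caller.
    """
    # overlayfs takes its three directories as one comma-separated option string
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"

    # this script runs as uid 0 inside the new user + mount namespace
    inner = " && ".join([
        f"mount -t overlay overlay -o {shlex.quote(opts)} {shlex.quote(merged)}",
        f"exec chroot {shlex.quote(merged)} {shlex.join(command)}",
    ])

    # --map-root-user maps the calling uid to 0 in the new user namespace
    argv = ["unshare", "--user", "--map-root-user", "--mount"]
    if isolate_net:
        # one extra flag per isolation mode, as complained about above
        argv.append("--net")
    return argv + ["sh", "-c", inner]
```

printing shlex.join(overlay_sandbox_argv("/srv/rootfs", "/tmp/ovl/upper", "/tmp/ovl/work", "/tmp/ovl/merged", ["/bin/sh"])) gives something you can read before deciding to paste it into a terminal.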
and of course you most definitely cannot isolate everything and allowlist, linus torvalds assumes you just want to isolate one thing at a time. there's a flag for like maintaining a persistent directory or file for the namespace? but i don't know what that does?
and that still doesn't at all let you optimize across the three musketeers of i/o from above:
- application-level write chunking
- i/o visibility scheduling/prefetching
- persistence
omg wait does this mean i get to make my own libc. god i'm gonna have to if i want to run any code right
ok i want to learn what the hell a memory manager is and whether it's anything like an engineering manager. ugh i don't know any of this and i'm gonna have to issue special processor instructions too. i'm overwhelmed now time for something else
oh i can port musl probably
the reason i think this ~must~ be an OS is because:
- i absolutely want to do an insane filesystem
- i am infatuated with the idea of scratch memory (stack/heap) < strict per-process memory isolation < prefetching/merging (a memory architecture which never shares memory across processes and uses i/o visibility to schedule construction of each new process memory space)
- i want to know which bits are flipped in persistent nvme and modulate the precise EM waves sent over network radios for cryptography reasons
basically i think linux constantly screaming ZERO-COPY!!!! as the only possible way the kernel can optimize i/o is annoying, unimaginative, and ascientific. it seems obvious that splice() should just allocate a free page and copy. io_uring should just be getdents() (maybe two distinct buffers as a struct arg, submitted in a synchronous syscall).
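to make the "two buffers, one synchronous call" idea concrete, here is a toy userspace model. submit_batch, ReadRequest, and max_bytes are all names i made up for illustration; no such syscall exists, and this just loops over os.pread.

```python
import os
from dataclasses import dataclass


@dataclass
class ReadRequest:
    """Hypothetical request record: read `length` bytes at `offset` from `fd`."""
    fd: int
    offset: int
    length: int


def submit_batch(requests, completions, max_bytes):
    """Hypothetical synchronous batch call.

    Runs requests in order and appends (request_index, data) to `completions`.
    Stops before any request that could push the total past `max_bytes`,
    so the amount of data in flight is bounded by the caller's buffer size.
    Returns the number of requests completed.
    """
    used = 0
    done = 0
    for index, req in enumerate(requests):
        if used + req.length > max_bytes:
            break  # the caller resubmits the rest in a later call
        data = os.pread(req.fd, req.length, req.offset)
        completions.append((index, data))
        used += len(data)
        done += 1
    return done
```

the return value tells the caller how far the batch got, the same way a getdents()-style call fills what fits and leaves the rest for the next call.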
ok i do see io_uring has a mode to link dependent i/o operations. i'm pretty sure the "zero-copy" claim is deeply misleading however and i've just found this incredible quote
As soon as an sqe is consumed by the kernel, the application is free to reuse that sqe entry. This is true even for cases where the kernel isn't completely done with a given sqe yet.
sqe means "submission queue entry". "sqe entry" has not been defined.
If the kernel does need to access it after the entry has been consumed, it will have made a stable copy of it.
so the kernel is copying!!! he admit it!!!!!
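a toy model of what that quote describes, assuming nothing about the real kernel data structures: ToySubmissionRing is a made-up class, and "consume" just deep-copies the entry so the slot can be reused right away.

```python
import copy


class ToySubmissionRing:
    """Toy model only: a fixed-size ring whose consumer keeps stable copies."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0        # next slot the "kernel" consumes
        self.tail = 0        # next slot the application fills
        self.in_flight = []  # stable copies of consumed, unfinished entries

    def push(self, sqe):
        if self.tail - self.head == len(self.slots):
            raise BufferError("ring full")
        self.slots[self.tail % len(self.slots)] = sqe
        self.tail += 1

    def consume(self):
        if self.head == self.tail:
            raise BufferError("ring empty")
        sqe = self.slots[self.head % len(self.slots)]
        # the copy is what lets the application reuse the slot immediately
        self.in_flight.append(copy.deepcopy(sqe))
        self.head += 1
```

with a ring of size 1 you can push, consume, push, consume and end up with two operations in flight, so in this model the ring size says nothing about how much work is outstanding.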
and in fact copying data from user space means you can schedule independently of user space too!!! a ring buffer is a very specialized instrument for serialization of data. io_uring is not doing that! each individual sqe may have wildly different latencies and affect wildly variable subsystems.
furthermore, unlike getdents(), the ring "size" (number of elements) does not meaningfully limit the amount of data in flight. each operation may invoke additional internal buffering, because the kernel can't write the completed entry until it's copied to/from userspace.
oh my fucking god the kthread just goes to sleep after a while. this is why you don't implement an event loop in the kernel! this is why you use a goddamn synchronous syscall with a fixed-size buffer!!!
then it says "a userspace application has no way to know if the data it's going to fetch next is cached or not." completely backwards. a userspace application is the only one who can tell the kernel what data it's going to fetch next. the kernel is the one to manipulate the cache
oh and of course this is actually all missing the point that the page cache strongly couples "writing user data into the kernel" with "the data is now globally visible". i do not want this? please employ basic pipelining techniques?
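for comparison, the usual userspace workaround is to write under a temporary name and rename when done. this is a sketch of that workaround, not the pipelining i actually want: publish is a made-up helper, and the data is still in the page cache and visible under the temporary name while it is being written.

```python
import os
import tempfile


def publish(path, chunks):
    """Write all chunks to a temporary file, then make them visible at `path`
    in one step with an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # same directory as the target, so the rename stays on one filesystem
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to storage before publishing
        os.replace(tmp, path)     # readers of `path` see old or new, never half
    except BaseException:
        os.unlink(tmp)
        raise
```

the "write" stage and the "visible" stage are separate calls here, which is the separation the page cache does not give you for a single file.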
why would you ever want the kernel to copy from the network directly into a userspace pointer which you just have to remember cannot be read from until you pull the appropriate completion queue entry? and whose size is unknown except from that entry?
enough of this nonsense