Post · bonfire.cafe

@hipsterelectron@circumstances.run · 11 hours ago

are filesystems on linux just safer than the network subsystems or do filesystems just never expose any interface besides posix i/o so they have a much smaller and better-characterized attack surface?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

i believe to execute mount you have to be root and i believe this includes overlayfs so if you want your ephemeral overlayfs (a very basic thing to want) you need to fucking:

unshare with the incantation that maps your current uid to 0 (root), producing an interactive subshell
from this persistent subshell, mount into a named directory to construct the ephemeral overlay
fucking unshare again to map the overlay dir as root fs, with extra incantations for each isolation mode (e.g. network)
execute your process
output in overlay chroot
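a rough sketch of the steps above, assuming util-linux `unshare` (its `--user`, `--map-root-user` and `--mount` flags) and a kernel that permits overlayfs mounts inside a user namespace (5.11 or later, as far as i know). the helper names and directory paths are hypothetical. the code only builds the command line and does not run it. it also folds everything into one `unshare` call and uses `cd` instead of remapping the root fs, which is a simplification of the two-stage dance described above.

```python
import shlex

def overlay_mount_cmd(lower, upper, work, merged):
    """Shell command for an overlay mount: reads fall through to `lower`,
    writes land in `upper`, `work` is overlayfs scratch space."""
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return shlex.join(["mount", "-t", "overlay", "overlay", "-o", opts, merged])

def ephemeral_overlay_argv(lower, upper, work, merged, command):
    """argv for a single unshare call (hypothetical helper): map the caller to
    uid 0 in fresh user + mount namespaces, mount the overlay, run `command`
    with the merged directory as cwd."""
    inner = " && ".join([
        overlay_mount_cmd(lower, upper, work, merged),
        f"cd {shlex.quote(merged)}",
        shlex.join(command),
    ])
    return ["unshare", "--user", "--map-root-user", "--mount", "sh", "-c", inner]
```

passing the resulting list to `subprocess.run` would be the obvious next step, but whether it succeeds depends on the kernel and on whether unprivileged user namespaces are enabled on the machine.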

and of course you most definitely cannot isolate everything and allowlist, linus torvalds assumes you just want to isolate one thing at a time. there's a flag for like maintaining a persistent directory or file for the namespace? but i don't know what that does?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

and that still doesn't at all let you optimize across the three musketeers of i/o from above:

application-level write chunking
i/o visibility scheduling/prefetching
persistence

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

omg wait does this mean i get to make my own libc. god i'm gonna have to if i want to run any code right

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

ok i want to learn what the hell a memory manager is and whether it's anything like an engineering manager. ugh i don't know any of this and i'm gonna have to issue special processor instructions too. i'm overwhelmed now time for something else


d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

oh i can port musl probably

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 6 hours ago

the reason i think this ~must~ be an OS is because:

i absolutely want to do an insane filesystem
i am infatuated with the idea of scratch memory (stack/heap) < strict per-process memory isolation < prefetching/merging (a memory architecture which never shares memory across processes and uses i/o visibility to schedule construction of each new process memory space)
i want to know which bits are flipped in persistent nvme and modulate the precise EM waves sent over network radios for cryptography reasons

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

basically i think linux constantly screaming ZERO-COPY!!!! as the only possible way the kernel can optimize i/o is annoying, unimaginative, and ascientific. it seems obvious that splice() should just allocate a free page and copy. io_uring should just be getdents() (maybe two distinct buffers as a struct arg, submitted in a synchronous syscall).

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

ok i do see io_uring has a mode to link dependent i/o operations. i'm pretty sure the "zero-copy" claim is deeply misleading however and i've just found this incredible quote

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 5 hours ago

As soon as an sqe is consumed by the kernel, the application is free to reuse that sqe entry. This is true even for cases where the kernel isn't completely done with a given sqe yet.

sqe means "submission queue entry". "sqe entry" has not been defined.

If the kernel does need to access it after the entry has been consumed, it will have made a stable copy of it.

so the kernel is copying!!! he admit it!!!!!
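a toy model of what that quote describes, not the real io_uring code: `Sqe` and `ToyKernel` are made-up names, and the "kernel" here is just a python object that takes a private copy of the entry at consume time so the slot can be overwritten right away.

```python
from dataclasses import dataclass, replace

@dataclass
class Sqe:
    """Hypothetical stand-in for a submission queue entry."""
    opcode: str
    fd: int
    length: int

class ToyKernel:
    def __init__(self):
        self.in_flight = []

    def consume(self, slot):
        # Take a stable copy; from here on the slot belongs to the app again.
        self.in_flight.append(replace(slot))

def demo():
    slot = Sqe("read", 3, 4096)
    kernel = ToyKernel()
    kernel.consume(slot)
    # Reuse the same slot while the first operation is still in flight.
    slot.opcode, slot.fd, slot.length = "write", 7, 512
    kernel.consume(slot)
    return [(s.opcode, s.fd, s.length) for s in kernel.in_flight]
```

the first in-flight operation still sees its original fields after the slot is overwritten, which only works because a copy was made.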

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

and in fact copying data from user space means you can schedule independently of user space too!!! a ring buffer is a very specialized instrument for serialization of data. io_uring is not doing that! each individual sqe may have wildly different latencies and affect wildly variable subsystems.

furthermore, unlike getdents(), the ring "size" (number of elements) does not meaningfully limit the amount of data in flight. each operation may invoke additional internal buffering, because the kernel can't write the completed entry until the data has been copied to/from userspace.
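a toy simulation of that point, under loudly stated assumptions: slots are freed the moment the kernel consumes them, the app refills every free slot each tick, and every operation takes the same number of ticks to complete. none of this is measured from a real kernel, and `max_in_flight` is a made-up name.

```python
from collections import deque

def max_in_flight(ring_size, total_ops, completes_after):
    """Peak number of operations in flight in the toy model described above."""
    in_flight = deque()  # completion tick of each outstanding op, oldest first
    submitted = 0
    tick = 0
    peak = 0
    while submitted < total_ops or in_flight:
        # Retire operations whose completion tick has arrived.
        while in_flight and in_flight[0] <= tick:
            in_flight.popleft()
        # The app fills every slot; the kernel consumes them all at once.
        batch = min(ring_size, total_ops - submitted)
        for _ in range(batch):
            in_flight.append(tick + completes_after)
        submitted += batch
        peak = max(peak, len(in_flight))
        tick += 1
    return peak
```

in this model the peak scales with operation latency, not with the ring size alone.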

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

oh my fucking god the kthread just goes to sleep after a while. this is why you don't implement an event loop in the kernel! this is why you use a goddamn synchronous syscall with a fixed-size buffer!!!
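for contrast, a minimal sketch of the synchronous, fixed-size-buffer style being argued for, using plain blocking `read()` through python's `os` module. the file contents and chunk size are arbitrary; the point is only that each call can hand back at most `chunk_size` bytes.

```python
import os
import tempfile

def read_in_chunks(path, chunk_size):
    """Blocking reads; each call returns at most chunk_size bytes."""
    sizes = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            sizes.append(len(chunk))
    finally:
        os.close(fd)
    return sizes

def demo():
    # Write a throwaway 10000-byte file, then read it back in 4096-byte calls.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 10000)
        path = f.name
    try:
        return read_in_chunks(path, 4096)
    finally:
        os.unlink(path)
```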

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

then it says "a userspace application has no way to know if the data it's going to fetch next is cached or not." completely backwards. a userspace application is the only one who can tell the kernel what data it's going to fetch next. the kernel is the one to manipulate the cache
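linux does already have one channel for userspace to say what it will read next: `posix_fadvise` with `POSIX_FADV_WILLNEED`. a small sketch via python's `os` module follows; the helper names are made up, and the call is only a hint, so the kernel is free to ignore it.

```python
import os
import tempfile

def hint_then_read(path, offset, length):
    """Tell the kernel which byte range is wanted next, then read it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def demo():
    # Throwaway file with known contents.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789")
        path = f.name
    try:
        return hint_then_read(path, 4, 4)
    finally:
        os.unlink(path)
```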

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

oh and of course this is actually all missing the point that the page cache strongly couples "writing user data into the kernel" with "the data is now globally visible". i do not want this? please employ basic pipelining techniques?
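a small demonstration of that coupling, assuming an ordinary local filesystem: bytes written through one descriptor are readable through a second, unrelated descriptor straight away, with no fsync or close in between. the function name is made up.

```python
import os
import tempfile

def write_is_visible_before_fsync():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data")
        writer = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        reader = os.open(path, os.O_RDONLY)
        try:
            os.write(writer, b"hello")
            # No fsync and no close: the reader sees the bytes anyway.
            return os.pread(reader, 5, 0)
        finally:
            os.close(writer)
            os.close(reader)
```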

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

why would you ever want the kernel to copy from the network directly into a userspace pointer which you just have to remember cannot be read from until you pull the appropriate completed queue entry? and its size is unknown except from the appropriate entry?

d@nny disc@ mc²

@hipsterelectron@circumstances.run · 4 hours ago

enough of this nonsense