Solved! 🥳

This was a pretty "interesting" bug. Remember when I invented a way to implement #async / #await in #C for jobs running on a thread pool? Back then I said it only works when completion of the awaited task resumes execution on the same pool thread.

Trying to improve overall performance, I found the complex logic for matching jobs to specific pool threads to be a real deal-breaker. Just having one single MPMC queue with a single semaphore for all pool threads to wait on is a lot more efficient. But then, a job continued after an awaited task will resume on a "random" thread.

In theory, that still works by making sure to restore the CORRECT context (the pool thread's original one) every time after executing a job, whether partially (up to the next await) or completely.
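To make that concrete, here's a minimal sketch of that pattern using POSIX ucontext. The names and types are hypothetical, NOT the actual poser API: the pool thread stores its own context in the job before resuming it, and the job always swaps back to exactly that context when it awaits or finishes.

```c
#include <stdint.h>
#include <ucontext.h>

#define JOB_STACK_SIZE (64 * 1024)

/* Hypothetical types for illustration only */
typedef struct Job {
    void (*run)(struct Job *); /* job body, may call jobAwait()            */
    ucontext_t ctx;            /* saved context of the (suspended) job     */
    ucontext_t *ret;           /* context of the pool thread running us    */
    int started;
    int done;
    char stack[JOB_STACK_SIZE];
} Job;

/* makecontext() only passes int arguments, so the Job pointer is split;
 * this assumes a 64-bit platform where two unsigned ints hold a pointer. */
static void trampoline(unsigned hi, unsigned lo)
{
    Job *job = (Job *)(((uintptr_t)hi << 32) | (uintptr_t)lo);
    job->run(job);
    job->done = 1;
    swapcontext(&job->ctx, job->ret); /* hand control back to the pool thread */
}

/* Called from inside a job when it awaits a task: park the job and
 * return to whichever pool thread resumed it. */
void jobAwait(Job *job)
{
    swapcontext(&job->ctx, job->ret);
}

/* Called on a pool thread after fetching a job from the queue. The key
 * point: we always swap back to THIS thread's own context (poolCtx),
 * no matter which thread originally started the job. */
void runJob(ucontext_t *poolCtx, Job *job)
{
    job->ret = poolCtx;
    if (!job->started) {
        job->started = 1;
        getcontext(&job->ctx);
        job->ctx.uc_stack.ss_sp = job->stack;
        job->ctx.uc_stack.ss_size = sizeof job->stack;
        job->ctx.uc_link = 0;
        makecontext(&job->ctx, (void (*)(void))trampoline, 2,
                (unsigned)((uintptr_t)job >> 32),
                (unsigned)((uintptr_t)job & 0xffffffffu));
    }
    swapcontext(poolCtx, &job->ctx);
    /* back on the pool thread, with its original context restored */
}
```

The important detail is that job->ret is set freshly on every resume, so whichever pool thread happens to pick the job up gets its own context back.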

Only it didn't, at least here on #FreeBSD, and I finally understood the reason: I was using #TLS (thread-local storage) to find the context to restore.

Well, most architectures keep a pointer to the current thread's metadata in a register. #POSIX user #context #switching saves and restores registers. I found a source claiming that the #Linux ( #glibc) implementation explicitly does NOT include the thread pointer register. Obviously, #FreeBSD's implementation DOES include it. POSIX doesn't specify anything about that.

In short, avoiding TLS accesses when running with a custom context solved the crash. 🤯
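For illustration, this is roughly the kind of pattern that bites (hypothetical code, not the actual poser implementation): the job looks up the context to restore through a thread-local variable while running on its own custom context.

```c
#include <ucontext.h>

/* Problematic pattern: on an architecture where the thread pointer lives
 * in a register that swapcontext() also saves and restores (as apparently
 * on FreeBSD), a TLS access made while running on the job's custom context
 * can resolve against the TLS of whichever thread last saved that context,
 * handing back a stale or foreign ucontext_t. */
static _Thread_local ucontext_t *currentPoolContext;

void jobAwait_broken(ucontext_t *jobCtx)
{
    /* TLS read happens while the custom context is active: dangerous */
    swapcontext(jobCtx, currentPoolContext);
}
```

The jobAwait() in the earlier sketch avoids this by reading the return context from the job object, which the resuming pool thread filled in before switching to the job, so no TLS is touched while the custom context is active.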

I now added a #lockfree version of that MPMC job queue, which is picked when the system headers claim that atomic pointer operations are lock-free. It doesn't give any measurable performance gain 😞. Of course the #semaphore needs to stay; the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy 🙈.
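In C11 terms, that "headers claim pointers are lock-free" check can be expressed with the ATOMIC_POINTER_LOCK_FREE macro from <stdatomic.h>. This is only a guess at what such a selection and the semaphore-based waiting might look like, not poser's actual build logic:

```c
#include <semaphore.h>
#include <stdatomic.h>

/* Pick the lock-free queue only if atomic pointer operations are
 * guaranteed lock-free (2); 1 means "sometimes", 0 means "never". */
#if defined(ATOMIC_POINTER_LOCK_FREE) && ATOMIC_POINTER_LOCK_FREE == 2
#  define QUEUE_LOCKFREE 1
#else
#  define QUEUE_LOCKFREE 0
#endif

/* Hypothetical declarations for illustration only */
extern sem_t jobsAvailable;     /* posted once per enqueued job        */
extern void *dequeueJob(void);  /* lock-free or locked, same interface */
extern void runJobOnPoolThread(void *job);

static void *poolThread(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&jobsAvailable);   /* sleep until work arrives */
        void *job = dequeueJob();
        if (job) runJobOnPoolThread(job);
    }
    return 0;
}
```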

Need to look into actually saving CPU cycles for further optimizations I guess...

I now experimented with different ideas for how to implement the #lockfree #queue for multiple producers and multiple consumers. Unsurprisingly, some ideas just didn't work. One deadlocked (okaaay ... so it wasn't lockfree), and I eventually gave up trying to understand why.

The "winner" so far is only "almost lockfree", but at least slightly improves performance. Throughput is the same as with the simple locked variant, but average response times are 10 to 20% quicker (although they deviate stronger for whatever reason). Well, that's committed for now:

https://github.com/Zirias/poser/commit/4f2f80a8266e4762e04ead2b802e7a7c1b55090b
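For anyone curious what an "almost lockfree" bounded MPMC queue can look like with #C11 atomics, here's a minimal sketch in the spirit of Dmitry Vyukov's well-known bounded MPMC ring buffer. It is NOT the code behind the commit above; the names, the bounded capacity and all details are my own assumptions. "Almost lock-free" here means a producer or consumer suspended between claiming a slot and publishing it can temporarily stall the others.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Cell {
    atomic_size_t seq;  /* sequence number tracking the slot's state */
    void *data;
} Cell;

typedef struct MpmcQueue {
    Cell *cells;
    size_t mask;                /* capacity - 1, capacity is a power of two */
    atomic_size_t enqueuePos;
    atomic_size_t dequeuePos;
} MpmcQueue;

bool MpmcQueue_init(MpmcQueue *q, size_t capacity)
{
    if (capacity < 2 || (capacity & (capacity - 1))) return false;
    q->cells = malloc(capacity * sizeof *q->cells);
    if (!q->cells) return false;
    for (size_t i = 0; i < capacity; ++i) {
        atomic_init(&q->cells[i].seq, i);
        q->cells[i].data = 0;
    }
    q->mask = capacity - 1;
    atomic_init(&q->enqueuePos, 0);
    atomic_init(&q->dequeuePos, 0);
    return true;
}

bool MpmcQueue_push(MpmcQueue *q, void *data)
{
    size_t pos = atomic_load_explicit(&q->enqueuePos, memory_order_relaxed);
    for (;;) {
        Cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* slot is free, try to claim it */
            if (atomic_compare_exchange_weak_explicit(&q->enqueuePos,
                    &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                cell->data = data;
                atomic_store_explicit(&cell->seq, pos + 1,
                        memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;   /* queue is full */
        } else {
            pos = atomic_load_explicit(&q->enqueuePos, memory_order_relaxed);
        }
    }
}

bool MpmcQueue_pop(MpmcQueue *q, void **data)
{
    size_t pos = atomic_load_explicit(&q->dequeuePos, memory_order_relaxed);
    for (;;) {
        Cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
        if (diff == 0) {
            /* slot holds data, try to claim it */
            if (atomic_compare_exchange_weak_explicit(&q->dequeuePos,
                    &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *data = cell->data;
                atomic_store_explicit(&cell->seq, pos + q->mask + 1,
                        memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;   /* queue is empty */
        } else {
            pos = atomic_load_explicit(&q->dequeuePos, memory_order_relaxed);
        }
    }
}
```

Pool threads would still sem_wait() on the semaphore before calling a pop function like this, as sketched further up.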

#C11 #atomics