More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a major bottleneck. With TLS enabled, I couldn't cross 3000 requests per second while keeping response times somewhat acceptable (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex: with the mutex, up to around 8000 req/s could be reached on the same hardware, and with a lockfree design that quickly went beyond 10k req/s, but then it crashed. 😆
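
For reference, the mutex-protected variant I'm comparing against is basically the textbook thing: one lock around a linked list, plus a semaphore for the pool threads to wait on. A simplified sketch (made-up names, not the actual poser code):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Simplified sketch of a mutex-protected MPMC job queue.
 * Illustrative only, not the actual poser implementation. */
typedef struct JobNode {
    struct JobNode *next;
    void (*run)(void *arg);
    void *arg;
} JobNode;

typedef struct JobQueue {
    pthread_mutex_t lock;
    sem_t avail;            /* counts queued jobs, workers wait here */
    JobNode *head;
    JobNode *tail;
} JobQueue;

static void jobqueue_push(JobQueue *q, JobNode *job)
{
    job->next = NULL;
    pthread_mutex_lock(&q->lock);   /* every producer serializes here ... */
    if (q->tail) q->tail->next = job;
    else q->head = job;
    q->tail = job;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->avail);            /* wake one waiting pool thread */
}

static JobNode *jobqueue_pop(JobQueue *q)
{
    sem_wait(&q->avail);            /* block until a job is available */
    pthread_mutex_lock(&q->lock);   /* ... and every consumer as well */
    JobNode *job = q->head;
    q->head = job->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return job;
}
```

Under load, every push and pop fights over that single mutex, which is exactly the contention the lockfree variant gets rid of.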

So I read some scientific papers 🙈 ... and redesigned a lot (). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see this happen. 🤩 Maybe it could do even more; I didn't try yet.

Open issue: Can I do something about TLS? There must be some way to make it perform at least a bit better...

() edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: https://dl.acm.org/doi/10.1145/248052.248106
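
Roughly, that design with #C11 atomics, lockfree multi-producer enqueue in the style of that paper plus the simplified single-consumer dequeue, looks like this. Just a sketch with made-up names, not the actual poser code, and safe node reclamation is left out entirely:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of a Michael&Scott-style queue: lockfree multi-producer
 * enqueue, simplified dequeue for a single consumer. Node reclamation
 * (hazard pointers, a freelist, ...) is omitted; nodes are never freed
 * here, which also sidesteps ABA problems in this sketch. */
typedef struct Node {
    _Atomic(struct Node *) next;
    void *job;
} Node;

typedef struct Queue {
    _Atomic(Node *) tail;   /* producers compete for this */
    Node *head;             /* owned by the single consumer, points at the dummy */
} Queue;

static void queue_init(Queue *q)
{
    Node *dummy = calloc(1, sizeof *dummy);
    q->head = dummy;
    atomic_store(&q->tail, dummy);
}

/* multiple producers */
static void enqueue(Queue *q, void *job)
{
    Node *n = calloc(1, sizeof *n);
    n->job = job;
    for (;;) {
        Node *tail = atomic_load(&q->tail);
        Node *next = atomic_load(&tail->next);
        if (next) {
            /* tail lags behind; help swing it forward and retry */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        Node *expect = NULL;
        if (atomic_compare_exchange_weak(&tail->next, &expect, n)) {
            /* linked; try to swing the tail (failure is fine, someone helped) */
            atomic_compare_exchange_strong(&q->tail, &tail, n);
            return;
        }
    }
}

/* single consumer: no CAS on the head needed at all */
static void *dequeue(Queue *q)
{
    Node *dummy = q->head;
    Node *next = atomic_load(&dummy->next);
    if (!next) return NULL;     /* empty */
    void *job = next->job;
    q->head = next;             /* 'next' becomes the new dummy node */
    /* freeing 'dummy' right away would need a reclamation scheme,
     * because a slow producer might still read it through q->tail */
    return job;
}
```

The "helping" CAS on the tail is what keeps producers lockfree: if one producer gets preempted between linking its node and swinging the tail, the next producer finishes the swing for it. And since only one consumer ever touches the head, the dequeue needs no CAS at all.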

I now added a #lockfree version of that MPMC job queue, which is picked when the system headers claim that pointers are lockfree. It doesn't give any measurable performance gain 😥. Of course the #semaphore needs to stay, the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy 🙈.
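
The "system headers claim that pointers are lockfree" check is just the C11 ATOMIC_POINTER_LOCK_FREE macro; the selection boils down to something like this (sketch with invented helper names, not the actual poser build logic):

```c
#include <stdatomic.h>
#include <semaphore.h>
#include <stddef.h>

/* Pick the lockfree queue only when <stdatomic.h> guarantees that
 * atomic pointers are always lock-free (macro value 2), otherwise
 * fall back to the mutex-protected variant. Sketch only. */
#if defined(ATOMIC_POINTER_LOCK_FREE) && ATOMIC_POINTER_LOCK_FREE == 2
#  define JOBQUEUE_LOCKFREE 1
#else
#  define JOBQUEUE_LOCKFREE 0
#endif

/* Hypothetical helpers, just for illustration */
extern sem_t jobsAvailable;     /* posted once per enqueued job */
extern void *takeJob(void);     /* locked or lockfree, per the macro */
extern void runJob(void *job);

/* Either way, the pool threads still sleep on the semaphore: a
 * lockfree queue has nothing to block on while it's empty. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&jobsAvailable);
        void *job = takeJob();
        if (job) runJob(job);
    }
    return NULL;
}
```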

Need to look into actually saving CPU cycles for further optimizations I guess...

I now experimented with different ideas for how to implement the #lockfree #queue for multiple producers and multiple consumers. Unsurprisingly, some ideas just didn't work. One deadlocked (okaaay ... so it wasn't lockfree after all), and I eventually gave up trying to understand why.

The "winner" so far is only "almost lockfree", but at least slightly improves performance. Throughput is the same as with the simple locked variant, but average response times are 10 to 20% quicker (although they deviate stronger for whatever reason). Well, that's committed for now:

https://github.com/Zirias/poser/commit/4f2f80a8266e4762e04ead2b802e7a7c1b55090b

#C11 #atomics
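
To give an idea of the kind of compromise an "almost lockfree" queue makes (heavily simplified sketch, not the committed code): producers append with a single atomic exchange and never block each other, while consumers still serialize on a mutex among themselves, and a producer stalled between its two steps can briefly make the queue look empty, which is why it isn't truly lockfree:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Illustration only. Producer side: wait-free append via exchange.
 * Consumer side: a plain mutex. Node recycling is omitted. */
typedef struct Node {
    _Atomic(struct Node *) next;
    void *job;
} Node;

typedef struct Queue {
    _Atomic(Node *) tail;       /* last appended node */
    Node *head;                 /* dummy node; protected by conslock */
    pthread_mutex_t conslock;   /* only consumers contend here */
} Queue;

static void queue_init(Queue *q, Node *dummy)
{
    atomic_store(&dummy->next, NULL);
    q->head = dummy;
    atomic_store(&q->tail, dummy);
    pthread_mutex_init(&q->conslock, NULL);
}

static void enqueue(Queue *q, Node *n, void *job)
{
    n->job = job;
    atomic_store(&n->next, NULL);
    Node *prev = atomic_exchange(&q->tail, n);  /* claim the tail slot */
    atomic_store(&prev->next, n);               /* publish the link */
}

static void *dequeue(Queue *q)
{
    void *job = NULL;
    pthread_mutex_lock(&q->conslock);
    Node *next = atomic_load(&q->head->next);
    if (next) {
        job = next->job;
        q->head = next;     /* old dummy would be recycled here */
    }
    pthread_mutex_unlock(&q->conslock);
    return job;             /* NULL: empty (or a producer mid-append) */
}
```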

Solved! 🥳

This was a pretty "interesting" bug. Remember when I invented a way to implement #async / #await in #C, for jobs running on a threadpool? Back then I said it only works when completion of the task resumes execution on the same pool thread.

Trying to improve overall performance, I found the complex logic needed to identify which pool thread a job should be put on was a real deal-breaker. Just having one single MPMC queue, with a single semaphore for all pool threads to wait on, is a lot more efficient. But then, a job continued after an awaited task will resume on a "random" thread.

In theory, that works by making sure to restore the CORRECT context (the original one of the pool thread) every time after executing a job, whether it ran partially (up to the next await) or to completion.
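
In case anyone wonders what that looks like mechanically, here's a bare-bones sketch in terms of POSIX user context switching. All names are invented, it's not the actual poser code, and the makecontext() pointer-splitting trick assumes 64-bit pointers:

```c
#include <stddef.h>
#include <stdint.h>
#include <ucontext.h>

/* Sketch: always return to the context of the thread running the job
 * right NOW, no matter which thread started the job. */
typedef struct Job {
    ucontext_t ctx;         /* the job's own execution state */
    ucontext_t *scheduler;  /* where to go back to; set by the current runner */
    void (*proc)(struct Job *);
    int started;
    int done;               /* checked by the runner after swapcontext() returns */
    char stack[64 * 1024];
} Job;

static void trampoline(unsigned hi, unsigned lo)
{
    /* makecontext() only passes ints, so the pointer arrives in halves */
    Job *job = (Job *)(((uintptr_t)hi << 32) | lo);
    job->proc(job);
    job->done = 1;
    /* jump back explicitly instead of relying on uc_link, so the job
     * always returns to the scheduler of the thread running it NOW */
    setcontext(job->scheduler);
}

/* called from inside job->proc() to wait for some async task; when the
 * task completes, the job gets re-queued and may well be resumed by a
 * DIFFERENT pool thread */
static void job_await(Job *job)
{
    swapcontext(&job->ctx, job->scheduler);
}

/* called by whatever pool thread dequeued the job */
static void job_run(Job *job, ucontext_t *scheduler)
{
    job->scheduler = scheduler;         /* always the current thread's context */
    if (!job->started) {
        job->started = 1;
        getcontext(&job->ctx);
        job->ctx.uc_stack.ss_sp = job->stack;
        job->ctx.uc_stack.ss_size = sizeof job->stack;
        job->ctx.uc_link = NULL;
        makecontext(&job->ctx, (void (*)(void))trampoline, 2,
                (unsigned)((uintptr_t)job >> 32),
                (unsigned)((uintptr_t)job & 0xffffffffu));
    }
    swapcontext(scheduler, &job->ctx);  /* run until the next await or completion */
}
```

The one rule that makes this work across threads is that job->scheduler is reassigned by whichever thread is about to run the job.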

Only it didn't, at least here on #FreeBSD, and I finally understood why: I was using #TLS (thread-local storage) to find the context to restore.

Well, most architectures store a pointer to the current thread's metadata in a register. #POSIX user #context #switching saves and restores registers. I found a source claiming that the #Linux ( #glibc) implementation explicitly does NOT include the register holding the thread pointer. Obviously, #FreeBSD's implementation DOES include it. POSIX doesn't have to say anything about that.
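
Illustrated as a sketch (types and names made up, not the poser code):

```c
#include <ucontext.h>

/* __thread / _Thread_local lookups go through the thread pointer
 * register (e.g. the %fs base on x86_64, tpidr_el0 on arm64). FreeBSD's
 * swapcontext() apparently saves and restores that register, glibc's
 * reportedly does not. So after a cross-thread resume, a TLS access
 * inside the job context can resolve to the ORIGINAL thread's data. */

typedef struct PoolThread {
    ucontext_t scheduler;       /* this thread's "main" context */
} PoolThread;

typedef struct Job {
    ucontext_t ctx;
    PoolThread *runner;         /* set by whichever thread runs the job */
} Job;

static _Thread_local PoolThread *self;  /* fine in normal pool code ... */

/* BROKEN inside a job context: after resuming on another thread,
 * 'self' may still resolve to the thread that first ran the job. */
static void job_await_broken(Job *job)
{
    swapcontext(&job->ctx, &self->scheduler);
}

/* FIXED: no TLS access while on the job's context; the currently
 * running pool thread stored an explicit pointer before swapping in. */
static void job_await_fixed(Job *job)
{
    swapcontext(&job->ctx, &job->runner->scheduler);
}
```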

In short, avoiding TLS accesses when running with a custom context solved the crash. 🤯
