More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a major bottleneck. With TLS enabled, I couldn't cross 3000 requests per second while keeping response times somewhat acceptable (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex: with the mutex, up to around 8000 req/s could be reached on the same hardware, and with a lockfree design that quickly went beyond 10k req/s, but then it crashed. 😆
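
For reference, the mutex-protected variant I'm comparing against is basically the textbook thing: one lock around a linked list, plus a semaphore for the pool threads to wait on. A simplified sketch (made-up names, not the actual poser code):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Simplified sketch of a mutex-protected MPMC job queue.
 * Illustrative only, not the actual poser implementation. */
typedef struct JobNode {
    struct JobNode *next;
    void (*run)(void *arg);
    void *arg;
} JobNode;

typedef struct JobQueue {
    pthread_mutex_t lock;
    sem_t avail;            /* counts queued jobs, workers wait here */
    JobNode *head;
    JobNode *tail;
} JobQueue;

static void jobqueue_push(JobQueue *q, JobNode *job)
{
    job->next = NULL;
    pthread_mutex_lock(&q->lock);   /* every producer serializes here ... */
    if (q->tail) q->tail->next = job;
    else q->head = job;
    q->tail = job;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->avail);            /* wake one waiting pool thread */
}

static JobNode *jobqueue_pop(JobQueue *q)
{
    sem_wait(&q->avail);            /* block until a job is available */
    pthread_mutex_lock(&q->lock);   /* ... and every consumer as well */
    JobNode *job = q->head;
    q->head = job->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return job;
}
```

Under load, every push and pop fights over that single mutex, which is exactly the contention the lockfree variant gets rid of.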

So I read some scientific papers 🙈 ... and redesigned a lot (). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see this happen. 🤩 Maybe it could do even more; I didn't try yet.

Open issue: Can I do something about TLS? There must be some way to make it perform at least a bit better...

() edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: https://dl.acm.org/doi/10.1145/248052.248106
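
Roughly, that design with #C11 atomics, lockfree multi-producer enqueue in the style of that paper plus the simplified single-consumer dequeue, looks like this. Just a sketch with made-up names, not the actual poser code, and safe node reclamation is left out entirely:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of a Michael&Scott-style queue: lockfree multi-producer
 * enqueue, simplified dequeue for a single consumer. Node reclamation
 * (hazard pointers, a freelist, ...) is omitted; nodes are never freed
 * here, which also sidesteps ABA problems in this sketch. */
typedef struct Node {
    _Atomic(struct Node *) next;
    void *job;
} Node;

typedef struct Queue {
    _Atomic(Node *) tail;   /* producers compete for this */
    Node *head;             /* owned by the single consumer, points at the dummy */
} Queue;

static void queue_init(Queue *q)
{
    Node *dummy = calloc(1, sizeof *dummy);
    q->head = dummy;
    atomic_store(&q->tail, dummy);
}

/* multiple producers */
static void enqueue(Queue *q, void *job)
{
    Node *n = calloc(1, sizeof *n);
    n->job = job;
    for (;;) {
        Node *tail = atomic_load(&q->tail);
        Node *next = atomic_load(&tail->next);
        if (next) {
            /* tail lags behind; help swing it forward and retry */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        Node *expect = NULL;
        if (atomic_compare_exchange_weak(&tail->next, &expect, n)) {
            /* linked; try to swing the tail (failure is fine, someone helped) */
            atomic_compare_exchange_strong(&q->tail, &tail, n);
            return;
        }
    }
}

/* single consumer: no CAS on the head needed at all */
static void *dequeue(Queue *q)
{
    Node *dummy = q->head;
    Node *next = atomic_load(&dummy->next);
    if (!next) return NULL;     /* empty */
    void *job = next->job;
    q->head = next;             /* 'next' becomes the new dummy node */
    /* freeing 'dummy' right away would need a reclamation scheme,
     * because a slow producer might still read it through q->tail */
    return job;
}
```

The "helping" CAS on the tail is what keeps producers lockfree: if one producer gets preempted between linking its node and swinging the tail, the next producer finishes the swing for it. And since only one consumer ever touches the head, the dequeue needs no CAS at all.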

I now added a #lockfree version of that MPMC job queue, which is picked when the system headers claim that pointers are lockfree. It doesn't give any measurable performance gain 😥. Of course the #semaphore needs to stay, the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy 🙈.
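
The "system headers claim that pointers are lockfree" check is just the C11 ATOMIC_POINTER_LOCK_FREE macro; the selection boils down to something like this (sketch with invented helper names, not the actual poser build logic):

```c
#include <stdatomic.h>
#include <semaphore.h>
#include <stddef.h>

/* Pick the lockfree queue only when <stdatomic.h> guarantees that
 * atomic pointers are always lock-free (macro value 2), otherwise
 * fall back to the mutex-protected variant. Sketch only. */
#if defined(ATOMIC_POINTER_LOCK_FREE) && ATOMIC_POINTER_LOCK_FREE == 2
#  define JOBQUEUE_LOCKFREE 1
#else
#  define JOBQUEUE_LOCKFREE 0
#endif

/* Hypothetical helpers, just for illustration */
extern sem_t jobsAvailable;     /* posted once per enqueued job */
extern void *takeJob(void);     /* locked or lockfree, per the macro */
extern void runJob(void *job);

/* Either way, the pool threads still sleep on the semaphore: a
 * lockfree queue has nothing to block on while it's empty. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&jobsAvailable);
        void *job = takeJob();
        if (job) runJob(job);
    }
    return NULL;
}
```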

Need to look into actually saving CPU cycles for further optimizations I guess...

I now experimented with different ideas for how to implement the #lockfree #queue for multiple producers and multiple consumers. Unsurprisingly, some ideas just didn't work. One deadlocked (okaaay ... so it wasn't lockfree after all), and I eventually gave up trying to understand why.

The "winner" so far is only "almost lockfree", but at least slightly improves performance. Throughput is the same as with the simple locked variant, but average response times are 10 to 20% quicker (although they deviate stronger for whatever reason). Well, that's committed for now:

https://github.com/Zirias/poser/commit/4f2f80a8266e4762e04ead2b802e7a7c1b55090b

#C11 #atomics
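
To give an idea of the kind of compromise an "almost lockfree" queue makes (heavily simplified sketch, not the committed code): producers append with a single atomic exchange and never block each other, while consumers still serialize on a mutex among themselves, and a producer stalled between its two steps can briefly make the queue look empty, which is why it isn't truly lockfree:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Illustration only. Producer side: wait-free append via exchange.
 * Consumer side: a plain mutex. Node recycling is omitted. */
typedef struct Node {
    _Atomic(struct Node *) next;
    void *job;
} Node;

typedef struct Queue {
    _Atomic(Node *) tail;       /* last appended node */
    Node *head;                 /* dummy node; protected by conslock */
    pthread_mutex_t conslock;   /* only consumers contend here */
} Queue;

static void queue_init(Queue *q, Node *dummy)
{
    atomic_store(&dummy->next, NULL);
    q->head = dummy;
    atomic_store(&q->tail, dummy);
    pthread_mutex_init(&q->conslock, NULL);
}

static void enqueue(Queue *q, Node *n, void *job)
{
    n->job = job;
    atomic_store(&n->next, NULL);
    Node *prev = atomic_exchange(&q->tail, n);  /* claim the tail slot */
    atomic_store(&prev->next, n);               /* publish the link */
}

static void *dequeue(Queue *q)
{
    void *job = NULL;
    pthread_mutex_lock(&q->conslock);
    Node *next = atomic_load(&q->head->next);
    if (next) {
        job = next->job;
        q->head = next;     /* old dummy would be recycled here */
    }
    pthread_mutex_unlock(&q->conslock);
    return job;             /* NULL: empty (or a producer mid-append) */
}
```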

Solved! 🥳

This was a pretty "interesting" bug. Remember when I invented a way to implement #async / #await in #C, for jobs running on a threadpool? Back then I said it only works when completion of the task resumes execution on the same pool thread.

Trying to improve overall performance, I found the complex logic needed to identify which pool thread a job should be put on was a real deal-breaker. Just having one single MPMC queue, with a single semaphore for all pool threads to wait on, is a lot more efficient. But then, a job continued after an awaited task will resume on a "random" thread.

In theory, that works by making sure to restore the CORRECT context (the original one of the pool thread) every time after executing a job, whether it ran partially (up to the next await) or to completion.
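
In case anyone wonders what that looks like mechanically, here's a bare-bones sketch in terms of POSIX user context switching. All names are invented, it's not the actual poser code, and the makecontext() pointer-splitting trick assumes 64-bit pointers:

```c
#include <stddef.h>
#include <stdint.h>
#include <ucontext.h>

/* Sketch: always return to the context of the thread running the job
 * right NOW, no matter which thread started the job. */
typedef struct Job {
    ucontext_t ctx;         /* the job's own execution state */
    ucontext_t *scheduler;  /* where to go back to; set by the current runner */
    void (*proc)(struct Job *);
    int started;
    int done;               /* checked by the runner after swapcontext() returns */
    char stack[64 * 1024];
} Job;

static void trampoline(unsigned hi, unsigned lo)
{
    /* makecontext() only passes ints, so the pointer arrives in halves */
    Job *job = (Job *)(((uintptr_t)hi << 32) | lo);
    job->proc(job);
    job->done = 1;
    /* jump back explicitly instead of relying on uc_link, so the job
     * always returns to the scheduler of the thread running it NOW */
    setcontext(job->scheduler);
}

/* called from inside job->proc() to wait for some async task; when the
 * task completes, the job gets re-queued and may well be resumed by a
 * DIFFERENT pool thread */
static void job_await(Job *job)
{
    swapcontext(&job->ctx, job->scheduler);
}

/* called by whatever pool thread dequeued the job */
static void job_run(Job *job, ucontext_t *scheduler)
{
    job->scheduler = scheduler;         /* always the current thread's context */
    if (!job->started) {
        job->started = 1;
        getcontext(&job->ctx);
        job->ctx.uc_stack.ss_sp = job->stack;
        job->ctx.uc_stack.ss_size = sizeof job->stack;
        job->ctx.uc_link = NULL;
        makecontext(&job->ctx, (void (*)(void))trampoline, 2,
                (unsigned)((uintptr_t)job >> 32),
                (unsigned)((uintptr_t)job & 0xffffffffu));
    }
    swapcontext(scheduler, &job->ctx);  /* run until the next await or completion */
}
```

The one rule that makes this work across threads is that job->scheduler is reassigned by whichever thread is about to run the job.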

Only it didn't, at least here on #FreeBSD, and I finally understood why: I was using #TLS (thread-local storage) to find the context to restore.

Well, most architectures store a pointer to the current thread's metadata in a register. #POSIX user #context #switching saves and restores registers. I found a source claiming that the #Linux ( #glibc) implementation explicitly does NOT include the register holding the thread pointer. Obviously, #FreeBSD's implementation DOES include it. POSIX doesn't have to say anything about that.
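
Illustrated as a sketch (types and names made up, not the poser code):

```c
#include <ucontext.h>

/* __thread / _Thread_local lookups go through the thread pointer
 * register (e.g. the %fs base on x86_64, tpidr_el0 on arm64). FreeBSD's
 * swapcontext() apparently saves and restores that register, glibc's
 * reportedly does not. So after a cross-thread resume, a TLS access
 * inside the job context can resolve to the ORIGINAL thread's data. */

typedef struct PoolThread {
    ucontext_t scheduler;       /* this thread's "main" context */
} PoolThread;

typedef struct Job {
    ucontext_t ctx;
    PoolThread *runner;         /* set by whichever thread runs the job */
} Job;

static _Thread_local PoolThread *self;  /* fine in normal pool code ... */

/* BROKEN inside a job context: after resuming on another thread,
 * 'self' may still resolve to the thread that first ran the job. */
static void job_await_broken(Job *job)
{
    swapcontext(&job->ctx, &self->scheduler);
}

/* FIXED: no TLS access while on the job's context; the currently
 * running pool thread stored an explicit pointer before swapping in. */
static void job_await_fixed(Job *job)
{
    swapcontext(&job->ctx, &job->runner->scheduler);
}
```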

In short, avoiding TLS accesses when running with a custom context solved the crash. 🤯
