Discussion
Felix Palmen :freebsd: :c64:
@zirias@mastodon.bsd.cafe · 3 months ago

I need help. First the question: On #FreeBSD, with all ports built with #LibreSSL, can I somehow use the #clang #thread #sanitizer on a binary actually using LibreSSL and get sane output?
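One thing I considered trying (no idea yet whether it actually helps with the interceptor weirdness): pointing TSan at a suppressions file so reports originating inside libcrypto/libssl are ignored. Roughly like this; the file name is arbitrary and the build line is just an example:

    # tsan.supp -- ignore reports coming from inside the crypto libs
    called_from_lib:libcrypto.so
    called_from_lib:libssl.so

    # build with the sanitizer, then run with the suppressions active
    cc -g -O1 -fsanitize=thread -o swad-tsan <sources...>
    TSAN_OPTIONS="suppressions=tsan.supp" ./swad-tsan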

What I now observe debugging #swad:

- A version built with #OpenSSL (from base) doesn't crash. At least I tried very hard to make it crash, really stressing it with #jmeter, to no avail. Built with LibreSSL, it does crash.
- Less relevant: the OpenSSL version also performs slightly better, but needs almost twice the RAM
- The thread sanitizer finds nothing to complain about when built with OpenSSL
- It complains a lot with LibreSSL, but the reports look "fishy", e.g. it seems to intercept some OpenSSL API functions (like SHA384_Final)
- It even complains when running with a single-thread event loop.
- I use a single SSL_CTX per listening socket, creating SSL objects from it per connection ... also with multithreading; according to a few sources, this should be supported and safe (see the sketch after this list).
- I can't imagine doing that on a *single* thread could break with LibreSSL; that would make SSL_CTX pretty much pointless
- I *could* imagine that sharing the SSL_CTX with multiple threads to create their SSL objects from *might* not be safe with LibreSSL, but I have no idea how to verify that as long as the thread sanitizer gives me "delusional" output
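For clarity, the pattern I mean looks roughly like this (heavily simplified, no error handling, function names made up for the example):

    #include <openssl/ssl.h>

    /* one SSL_CTX per listening socket, created once at startup */
    SSL_CTX *create_listener_ctx(const char *certfile, const char *keyfile)
    {
        SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
        SSL_CTX_use_certificate_chain_file(ctx, certfile);
        SSL_CTX_use_PrivateKey_file(ctx, keyfile, SSL_FILETYPE_PEM);
        return ctx;
    }

    /* per accepted connection, potentially called from different threads:
     * the ctx is shared, the SSL object is private to this one connection */
    SSL *start_tls(SSL_CTX *ctx, int connfd)
    {
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, connfd);
        SSL_accept(ssl);   /* afterwards SSL_read()/SSL_write() only from one thread at a time */
        return ssl;
    }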

Felix Palmen :freebsd: :c64:
@zirias@mastodon.bsd.cafe · 4 months ago

Solved! 🥳

This was a pretty "interesting" bug. Remember when I invented a way to implement #async / #await in #C, for jobs running on a threadpool? Back then I said it only works when completion of the awaited task resumes execution on the same pool thread.

Trying to improve overall performance, I found the complex logic needed to pick the right pool thread to put a job on was a real deal-breaker. Just having one single MPMC queue, with a single semaphore all pool threads wait on, is a lot more efficient. But then a job continued after an awaited task will resume on a "random" thread.
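Schematically, the new dispatch looks something like this (a simplified sketch with made-up names, not the actual swad code; a mutex-protected list stands in for the real MPMC queue):

    #include <pthread.h>
    #include <semaphore.h>

    typedef struct Job Job;
    struct Job
    {
        Job *next;
        void (*run)(void *);
        void *arg;
    };

    static Job *head;
    static Job *tail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static sem_t jobs;   /* initialized with sem_init(&jobs, 0, 0) at startup */

    void enqueue(Job *job)
    {
        job->next = 0;
        pthread_mutex_lock(&qlock);
        if (tail) tail->next = job;
        else head = job;
        tail = job;
        pthread_mutex_unlock(&qlock);
        sem_post(&jobs);   /* wake *some* pool thread, no targeting logic needed */
    }

    void *worker(void *arg)
    {
        (void)arg;
        for (;;)
        {
            sem_wait(&jobs);
            pthread_mutex_lock(&qlock);
            Job *job = head;
            head = job->next;
            if (!head) tail = 0;
            pthread_mutex_unlock(&qlock);
            job->run(job->arg);   /* a continued job may now run on a "random" thread */
        }
    }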

In theory that still works, by making sure to restore the CORRECT context (the original one of the pool thread) every time after executing a job, whether it ran partially (up to the next await) or to completion.

Only it didn't, at least here on #FreeBSD, and I finally understood why: I was using #TLS (thread-local storage) to find the context to restore.

Well, most architectures store a pointer to the current thread's metadata in a register. #POSIX user #context #switching saves and restores registers. I found a source claiming that the #Linux (#glibc) implementation explicitly does NOT include the register holding the thread pointer. Obviously, #FreeBSD's implementation DOES include it. POSIX itself doesn't say anything about that.

In short, avoiding TLS accesses when running with a custom context solved the crash. 🤯
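To illustrate what "avoiding TLS accesses" means here, a minimal single-threaded sketch (made-up names, assumes 64-bit pointers, not the actual swad code): the pointer to the context that must be restored is handed to the job context explicitly via makecontext(), so nothing running on the custom stack ever needs a thread-local lookup:

    #include <stdint.h>
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t poolctx;   /* the pool thread's own context */
    static ucontext_t jobctx;    /* the context a job runs on */
    static char jobstack[64 * 1024];

    /* makecontext() only passes int arguments, so the pointer to the context
     * we must return to is split into two 32-bit halves */
    static void jobentry(unsigned int hi, unsigned int lo)
    {
        ucontext_t *back = (ucontext_t *)(((uintptr_t)hi << 32) | lo);
        puts("job running on its own stack");
        /* crucial: no thread-local lookups here; after an await this code might
         * continue on a different pool thread, while the restored registers
         * (on FreeBSD including the thread pointer) still belong to the old one */
        swapcontext(&jobctx, back);
    }

    int main(void)
    {
        getcontext(&jobctx);
        jobctx.uc_stack.ss_sp = jobstack;
        jobctx.uc_stack.ss_size = sizeof jobstack;
        jobctx.uc_link = &poolctx;
        uintptr_t p = (uintptr_t)&poolctx;
        makecontext(&jobctx, (void (*)(void))jobentry, 2,
                (unsigned int)(p >> 32), (unsigned int)p);
        swapcontext(&poolctx, &jobctx);   /* run the job until it yields */
        puts("back on the pool thread");
        return 0;
    }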

Felix Palmen :freebsd: :c64:
@zirias@mastodon.bsd.cafe replied · 4 months ago

I now added a #lockfree version of that MPMC job queue, which is picked when the system headers claim that atomic pointers are lock-free. It doesn't give any measurable performance gain, though. Of course the #semaphore needs to stay; the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy 🙈.

Need to look into actually saving CPU cycles for further optimizations I guess...
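In case anyone wonders what "the system headers claim that pointers are lock-free" boils down to: it's roughly a compile-time check like this (the macro I define here is made up for the example):

    #include <stdatomic.h>

    /* use the lock-free queue only if atomic pointer operations are
     * guaranteed lock-free on this platform (2 means always lock-free) */
    #if defined(ATOMIC_POINTER_LOCK_FREE) && ATOMIC_POINTER_LOCK_FREE == 2
    #  define JOBQUEUE_LOCKFREE 1
    #else
    #  define JOBQUEUE_LOCKFREE 0   /* fall back to the mutex/semaphore queue */
    #endif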

Felix Palmen :freebsd: :c64:
@zirias@mastodon.bsd.cafe · 4 months ago

Finally getting somewhere working on the next evolution step for #swad. I have a first version that (normally 🙈) doesn't crash quickly (so, no release yet, but it's available on the master branch).

The good news: It's indeed an improvement to have multiple parallel #reactor (event-loop) threads. It now handles 3000 requests per second on the same hardware, with overall good response times and without any errors. I uploaded the results of the stress test here:

https://zirias.github.io/swad/stress/

The bad news ... well, there are multiple.

1. It got even more memory hungry. The new stress test still simulates 1000 distinct clients (trying to do more fails on my machine, as #jmeter can't create new threads any more...), but with delays reduced to 1/3 and doing 100 iterations each. This now leaves it with a resident set of almost 270 MiB ... tuning #jemalloc on #FreeBSD to return memory more promptly (roughly the kind of tuning sketched after this list) reduces this to 187 MiB (which is still a lot) and costs a bit of performance (some requests run into 429, overall response times are worse). I have no idea yet where to start trying to improve this.

2. It requires tuning to manage that load without errors, mainly using more threads for the thread pool, although these threads stay almost idle ... which probably means I have to find ways to make putting work on and off these threads more efficient. At least I have some ideas.

3. I've seen a crash that only happened once so far, and no idea as of now how to reproduce it. *sigh*. Massively parallel code in C really is a PITA.
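The jemalloc tuning mentioned in point 1 is along these lines (a sketch; the decay values are just an example, and the exact knob names depend on the jemalloc version in base):

    # make jemalloc hand unused (dirty/muzzy) pages back to the OS much sooner,
    # trading some allocator throughput for a smaller resident set
    MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:1000" ./swad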

Seems the more I improve here, the more I find that should also be improved. 🤪

#C #coding #performance
