Discussion
Thib
@thibaultamartin@mamot.fr  ·  2 weeks ago

In early September, The Matrix Foundation homeserver went down.

I'm extremely proud of our SRE team. They had a Disaster Recovery Plan and monthly exercises to apply it, resulting in no data loss despite a 24h outage.

I learned a lot about how to properly back up and restore a Postgres database while writing this post with the SREs. We also learned how to better prevent human error and be more resilient to it.

Thanks all for the hugops during the outage!

https://matrix.org/blog/2025/10/post-mortem/

#homelab #selfHosting #sre

Post-mortem of the September 2 outage

Matrix, the open protocol for secure decentralised communications
Frisk
@Frisk@woof.tech replied  ·  2 weeks ago

@thibaultamartin "The necessary course of action at this point was to clear the remains of the failed restore attempt from the data directory and start again. Since db-02 had already been cleared and needed to be restored, this didn’t register as a particularly high risk manoeuvre.

Unfortunately, in attempting to do so, we erroneously deleted the data directory of the primary on db-01."

This reads like a horror story, right on time. Thank you for the post-mortem; it was great to read.
