dch :flantifa: :flan_hacker:
@dch@bsd.network · 11 hours ago

@jimsalter @joeress here's one for your 2.5 admins. #25admins #zfs #postgresql

https://matrix.org/blog/2025/10/post-mortem/ great write-up and #HugOps to those involved over that very stressful period.

- it's great to see they had a comprehensive backup/recovery strategy in place
- well done, really well done. Multiple fallback layers. Sad that they had to use them, but hey that's why we do this.
- Kudos to @beasts hosting for moral & technical support; I keep hearing good things about them

A story in SQL backup/recovery from matrix.org with three key lessons:

- always do critical recovery work with 2+ people checking and reviewing together (they did this), and rotate regions because sleep is critical
- never actually delete stuff during a crisis. Ever (narrator: they learned this the hard way)
- ZFS would have made this recovery significantly easier, in so many ways

It would have been almost trivial to recover from their failed storage with ZFS, and perhaps to avoid either the failover or the remote restore.

Scheduled ZFS snapshots would have meant a rollback instead of a recovery in at least 2 of their high-risk moments.
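
To make the snapshot/rollback idea concrete, here is a minimal sketch in Python that only builds the `zfs` command lines and runs nothing. The dataset name `tank/pgdata` and the `auto-` naming scheme are assumptions for illustration, not anything from the post-mortem; a scheduler such as cron would be expected to run the snapshot command.

```python
from datetime import datetime, timezone

def snapshot_name(dataset: str, when: datetime) -> str:
    """Build a timestamped snapshot name (hypothetical 'auto-' naming scheme)."""
    return f"{dataset}@auto-{when.strftime('%Y%m%dT%H%MZ')}"

def snapshot_cmd(dataset: str, when: datetime) -> list[str]:
    """Command a scheduler would run to take one snapshot."""
    return ["zfs", "snapshot", snapshot_name(dataset, when)]

def rollback_cmd(snapshot: str) -> list[str]:
    """Command to roll a dataset back to its most recent snapshot.

    Plain `zfs rollback` only accepts the latest snapshot; going further
    back needs `-r`, which destroys the newer snapshots, so it is left out.
    """
    return ["zfs", "rollback", snapshot]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(" ".join(snapshot_cmd("tank/pgdata", now)))
    print(" ".join(rollback_cmd(snapshot_name("tank/pgdata", now))))
```

The database would presumably be stopped before any rollback; that step is outside this sketch.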

It would have also meant higher storage costs because snapshots are almost but not quite free.

ZFS snapshots can be sent from an alternate system and received locally over the LAN at very high rates, much faster than a remote S3-based streaming restore.
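
As a rough sketch of what that send/receive looks like, the Python below assembles the shell pipeline as a string without executing it. The host `replica`, the dataset `tank/pgdata` and the snapshot names are hypothetical; passing a base snapshot makes the stream incremental via `zfs send -i`.

```python
import shlex
from typing import Optional

def pull_pipeline(host: str, snapshot: str, target: str,
                  base: Optional[str] = None) -> str:
    """Build a pipeline that pulls a snapshot from `host` into `target`.

    With `base` set, only the changes between `base` and `snapshot`
    are sent (incremental stream); otherwise the full snapshot is sent.
    """
    send = ["zfs", "send"]
    if base is not None:
        send += ["-i", base]
    send.append(snapshot)
    remote = shlex.join(["ssh", host] + send)        # runs on the alternate system
    local = shlex.join(["zfs", "receive", target])   # runs on the machine being restored
    return f"{remote} | {local}"

if __name__ == "__main__":
    print(pull_pipeline("replica", "tank/pgdata@b", "tank/pgdata",
                        base="tank/pgdata@a"))
```

`zfs receive -F` is deliberately not used here, since it forces a rollback on the target, which sits badly with the "never delete during a crisis" lesson.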

Post-mortem of the September 2 outage

Matrix, the open protocol for secure decentralised communications