(I don’t envy server engineers. As a mostly-client engineer, I can take down the local device, and that’s very bad, especially with app auto-updates. But I can’t usually take down a dozen other things as collateral damage! #HugOps)
Everybody is memeing on a Rust unwrap/panic/abort being the (a) cause of the Cloudflare outage (https://blog.cloudflare.com/18-november-2025-outage/), and, sure, that code was not sufficiently defensive. So what would that same not-sufficiently-defensive code have done in other languages? Assuming a similar thought process went into it (“we should preallocate this” but not “technically this data comes from elsewhere”), and using data structures matching the idioms of each standard library (a rough Rust sketch follows after the list):
• Java, JavaScript, C#, Lisp: threw some kind of OutOfBounds error, most likely uncaught because it’s not a “checked exception” type; process still aborts in practice
• C: If you’re lucky, a returned error code with a good chance of being ignored here (“should never happen”); who knows what configuration it’s in after that. If you’re unlucky, silent buffer overflow, which could be worse than crashing (imagine if it let someone replace files on Cloudflare’s CDNs, for example).
• Haskell: if you’re very good at proving things about types, you’ll be in the Rust case if you’re lucky, and silently truncating if you’re not.
• C++: one of the above, but probably the C case in practice.
• Swift: the Java case but with worse logging on the way out, probably :-/
• Erlang: the Java case, but you’ll probably leave better logs on the way out.
This wasn’t a “Rust bug”. This was an “input sanitization” bug. At least in Rust the choice to ignore bad data was written explicitly.
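To make that concrete, here’s a minimal Rust sketch of the pattern being argued about. This is not Cloudflare’s actual code; the FeatureSet type, its limit, and its push method are made up for illustration. The point is only that the fallible append returns a visible Result, so both the unwrap and the “log and keep going” alternative are explicit choices at the call site.

    // Minimal sketch (assumed names, not Cloudflare's code): a preallocated,
    // fixed-capacity feature list that refuses appends past its limit.
    #[derive(Debug)]
    struct CapacityExceeded;

    struct FeatureSet {
        limit: usize,
        values: Vec<f64>,
    }

    impl FeatureSet {
        fn with_limit(limit: usize) -> Self {
            // Preallocate so the hot path never reallocates.
            Self { limit, values: Vec::with_capacity(limit) }
        }

        fn push(&mut self, v: f64) -> Result<(), CapacityExceeded> {
            if self.values.len() >= self.limit {
                return Err(CapacityExceeded);
            }
            self.values.push(v);
            Ok(())
        }
    }

    fn main() {
        let mut features = FeatureSet::with_limit(2);

        // The not-sufficiently-defensive version: .unwrap() says "this can't
        // fail"; when the input is bigger than expected, it panics, and with
        // panic=abort the whole process dies.
        features.push(1.0).unwrap();
        features.push(2.0).unwrap();

        // The defensive version: decide explicitly what oversized input means
        // (here: log it, drop the extras, keep serving traffic).
        if let Err(e) = features.push(3.0) {
            eprintln!("feature set larger than expected, ignoring extras: {e:?}");
        }
    }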
Real-time photo of #Cloudflare #DevOps trying to get their service back up and running while web requests keep pouring in.
#HugOps
@jimsalter @joeress here's one for your 2.5 admins. #25admins #zfs #postgresql
https://matrix.org/blog/2025/10/post-mortem/ great write-up and #HugOps to those involved over that very stressful period.
- it's great to see they had a comprehensive backup/recovery strategy in place
- well done, really well done. Multiple fallback layers. Sad that they had to use them, but hey that's why we do this.
- Kudos to @beasts hosting for moral & technical support; once again I keep hearing good things about them
A story in SQL backup/recovery from matrix.org with three key lessons:
- always do critical recovery work with 2+ people checking and reviewing together (they did this), and rotate regions because sleep is critical
- never actually delete stuff during a crisis. Ever (narrator: they learned this the hard way)
- ZFS would have made this recovery significantly easier, in so many ways
It would have been almost trivial to recover from their failed storage with ZFS, and perhaps avoid either the failover or the remote restore.
Scheduled ZFS snapshots would have meant a rollback instead of a recovery in at least two of their high-risk moments.
It would have also meant higher storage costs because snapshots are almost but not quite free.
ZFS snapshots can also be sent to and received on an alternate system over the LAN at very high rates, much, much faster than a remote S3-based streaming restore (rough sketch below).
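For anyone who hasn’t used it, this is roughly the workflow being described. The pool/dataset names and snapshot labels here are made up, but the commands are the standard ZFS ones:

    # Take scheduled snapshots (via cron, sanoid, etc.); they're cheap until data diverges.
    zfs snapshot tank/pgdata@2025-10-08-0400

    # "Rollback instead of recover": return the dataset to that point in time.
    # (Add -r if newer snapshots exist and you're willing to destroy them.)
    zfs rollback tank/pgdata@2025-10-08-0400

    # Incremental send/receive to a standby box over the LAN, instead of a
    # remote streaming restore.
    zfs send -i tank/pgdata@2025-10-07-0400 tank/pgdata@2025-10-08-0400 \
      | ssh standby zfs receive -F tank/pgdata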
#hugOps story time! Quote this and tell me the biggest incident you ever saw in production. It’s inevitable, it’s gonna happen, and learning from incidents is way better than shitting on people trying to fix them.
I’ll start :)
Did AWS join the government shutdown?
I am seeing *so many* bad takes about the AWS outage, so many.
Everybody is smug until they get fucked with their pants on.
Even if you've done everything to run your own stuff and host it yourself, it is very difficult to avoid services that will be impacted, and there's *no way* the majority of businesses are running all that stuff themselves.
Comms tools, status pages, payment systems, monitoring systems, build tools, deployment tools, planning tools, the list goes on.
If you're telling me you're running absolutely everything yourself, well done, I have no idea how, and if you're better at all of that than all the SaaS providers, I struggle to believe you've any time left in the day to run your actual business.
100% you should be doing your due diligence to make sure you're resilient where it matters, but being caught up in something like this is almost inevitable if you're a non-trivial online company.
Stop throwing rocks, start sending #hugOps
https://health.aws.amazon.com/health/status has a fascinating stream of updates if you are into that kind of thing. #hugops for everyone involved, it looks like an uphill battle. #aws
#HugOps to all the IT and business people whose start of week has been borked by the AWS outage, while they try to restart everything from the ground up.
Massive #HugOps to AWS and their ecosystem. Outages are stressful, let alone international ones, so let's be extra kind to each other today 
I see everyone is getting their unscheduled chaos monkey tests. #hugops
BIIIIG #hugops to ... the whole internet? tonight, but especially to @signalapp
#hugops to literally the whole Internet 🥲
#HugOps to everyone using Node.
https://www.stepsecurity.io/blog/ctrl-tinycolor-and-40-npm-packages-compromised