David Chisnall (*Now with 50% more sarcasm!*)
@david_chisnall@infosec.exchange  ·  activity timestamp 3 weeks ago

It’s interesting that the CrowdStrike and Cloudflare issues could both have been solved in the same way by applying quite old best practices for engineering resilient systems. Both involved a crash in applying an update. Both failed to handle a specific error case, but outside a small niche of very critical systems, you should never assume that you have handled every error locally.

Both could have been avoided by sandboxing the parser code, separating it from the code that fetched the updates and, if applying the latest update failed a few times, reporting an error to the server and continuing with the last update that didn’t crash the parser.

This is the kind of system we make it trivial to build with #CHERIoT, which is why I think it is the best option for anything that wants to ship products that last for decades with a security update every few years.
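
To make the fallback scheme above concrete, here is a minimal Rust sketch. It is not CHERIoT code and not what either vendor runs: `Config`, `parse`, `Updater`, and `Applied` are hypothetical names, the parser bug is invented, and `std::panic::catch_unwind` is only a stand-in for a real sandbox or compartment boundary (it catches unwinding panics, not memory-safety faults, and does nothing when the program is built with `panic = "abort"`).

```rust
use std::panic;

/// Hypothetical parsed update; stands in for a rules or feature file.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub features: Vec<String>,
}

/// Hypothetical parser with a latent bug: it panics on oversized input
/// instead of returning an error.
pub fn parse(raw: &str) -> Config {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    assert!(features.len() <= 3, "too many features");
    Config { features }
}

/// Outcome of trying to apply an update.
#[derive(Debug, PartialEq)]
pub enum Applied {
    /// The update parsed cleanly and is now active.
    New,
    /// The update crashed the parser; the previous config stays active.
    KeptLastGood { attempts: u32 },
}

/// Keeps the last update that did not crash the parser.
pub struct Updater {
    last_good: Config,
    max_attempts: u32,
}

impl Updater {
    pub fn new(initial: Config, max_attempts: u32) -> Self {
        Updater { last_good: initial, max_attempts }
    }

    /// The configuration currently in use.
    pub fn active(&self) -> &Config {
        &self.last_good
    }

    /// Try the new update a few times; on repeated failure keep the old one.
    pub fn apply(&mut self, raw: &str) -> Applied {
        for _ in 0..self.max_attempts {
            // Assumption: catch_unwind plays the role of the sandbox here.
            if let Ok(cfg) = panic::catch_unwind(|| parse(raw)) {
                self.last_good = cfg;
                return Applied::New;
            }
        }
        // A real system would also report the failure to the update server.
        Applied::KeptLastGood { attempts: self.max_attempts }
    }
}
```

The code that fetches updates only ever calls `Updater::apply`, so a crash inside `parse` cannot take it down, and `active()` keeps returning the last configuration that parsed cleanly.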

synlogic4242
@synlogic4242@social.vivaldi.net replied  ·  activity timestamp 3 weeks ago

@david_chisnall yeah, the Cloudflare incident looks like the result of them not doing a thorough audit for SPOFs, i.e. preventing or mitigating each of the possible "what if" cases. Layered defense is wise.

Edwin Török
@edwintorok@discuss.systems replied  ·  activity timestamp 3 weeks ago

@david_chisnall part of the problem is also that exceptions aren't tracked at the type system level. IIUC the Cloudflare parser correctly returned an error, which got converted into an exception by `unwrap`. Sandboxing would have to be applied to the correct place, otherwise it'd be the same problem: sandbox detects an unexpected error, which causes the application to crash with an unhandled exception (you can try to force the caller to handle it by using a Result type, but they can just convert it back into an exception if they're not careful).
Built-in support for A/B deployments would be great though, whether at the language or the sandbox level. IIUC the reason why sandboxing would help here is that it'd make it trivial to choose a different "image" to run for the parser, whereas in a regular application you couldn't do that due to symbol conflicts (unless you unload and reload a DSO, but not everything supports that).
Although systems like Kubernetes do have support for rolling services out and back, health checks, etc.
Automated rollbacks are still tricky to get right if you have any state, because the old code may not be able to deal e.g. with a new DB schema, and then you have to roll that back too. And the rollback code could be buggy, since it is exercised less frequently.
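
The `unwrap` point above can be illustrated with a small Rust sketch. This is not Cloudflare's actual code: `parse_features`, `ParseError`, `MAX_FEATURES`, and both callers are hypothetical names made up for the example. The parser reports the problem as a `Result`, and what happens next depends entirely on the caller.

```rust
/// Hypothetical limit, chosen only for illustration.
const MAX_FEATURES: usize = 3;

#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooManyFeatures { found: usize },
}

/// The parser does the right thing: it reports the problem as a value.
pub fn parse_features(raw: &str) -> Result<Vec<String>, ParseError> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > MAX_FEATURES {
        return Err(ParseError::TooManyFeatures { found: features.len() });
    }
    Ok(features)
}

/// Caller that throws the type-level tracking away: any Err becomes a panic.
pub fn load_unchecked(raw: &str) -> Vec<String> {
    parse_features(raw).unwrap()
}

/// Caller that handles the error and keeps the previous feature set.
pub fn load_or_keep(raw: &str, previous: Vec<String>) -> Vec<String> {
    match parse_features(raw) {
        Ok(features) => features,
        Err(_) => previous,
    }
}
```

Both callers type-check, which is the problem described above: the `Result` forces the caller to acknowledge the error, but nothing stops them from turning it back into a crash.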
