Post · bonfire.cafe

"A small number of samples can poison LLMs of any size": https://www.anthropic.com/research/small-samples-poison

This feels like a pretty devastating finding. Black hats have to be trying this already (Why wouldn't they, the potential gains are quite large) and if it is really this easy then likely malicious code is already making its way into real code bases, maybe at scale.

😱

#ai

A small number of samples can poison LLMs of any size

Anthropic research on data-poisoning attacks in large language models

In a joint study with the UK AI Security Institute and the Alan Turing Institute, we found that as few as 250 malicious documents can produce a "backdoor" vulnerability in a large language model—regardless of model size or training data volume. Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents. Our results challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount. Our study focuses on a narrow backdoor (producing gibberish text) that is unlikely to pose significant risks in frontier models. Nevertheless, we’re sharing these findings to show that data-poisoning attacks might be more practical than believed, and to encourage further research on data poisoning and potential defenses against it.

bonfire.cafe

A space for Bonfire maintainers and contributors to communicate

bonfire.cafe: About · Code of conduct · Privacy · Users · Instances

Bonfire social · 1.0.0-rc.3.13 no JS en

Automatic federation enabled