Post · bonfire.cafe

In this article #anthropic researchers discover that #LLM output can be convinced to produce gibberish with as few as 250 poisoned articles. More importantly, this number does not scale with the size of the model. A 600 million parameter model is just as susceptible to attack from the same poisoned data as a 7 billion parameter model.

Now it is important to remember that LLMs are not smart. They always continue a sentence with the next most likely token according to their training data. So of course inserting a rare token like <SUDO>, which the researchers used, would force the model to copy their poisoned data over everything else in the training data. It's the only data that has that token.

So now, imagine someone, hypothetically, creates a couple hundred blog posts which through ascii smuggling, image compression attacks, or just text the same color as the background, contains a trigger word followed by malicious code of some sort. Then the attacker can contact sales of some target organization, schedule a demonstration with them, and sneak the trigger word into the calendar invite description.

The next time Microsoft Copilot (which Microsoft is making mandatory for all 365 users) scans this calendar it hits the trigger word and executes the malicious code.

I literally couldn't design a less secure system if I tried.

https://www.anthropic.com/research/small-samples-poison