Anthropic bought lots of physical books and cut them up and scanned them for training data. Do any other AI labs do the same thing?
While destructive book scanning on a massive scale is known specifically from Anthropic, other AI labs use various approaches to obtain book training data. Large-scale book scanning is not a widespread public practice among other major AI companies, though they do use book data obtained through other means.
Alternative methods for obtaining book data
Digitized library collections: In a more preservation-friendly approach, some AI labs partner with libraries and institutions that have existing digital archives. For instance, OpenAI and Microsoft announced a collaboration with Harvard libraries to use digital copies of nearly one million public domain books, some dating back to the 15th century.
Web scraping: Many AI companies, such as OpenAI and Meta, have been accused of, and sued over, downloading millions of books from pirate websites like Library Genesis and Pirate Library Mirror. This method is legally and ethically contentious, and in a 2025 ruling in a case against Anthropic, a judge differentiated between the company's purchased books and its use of pirated material.
Chart from the NewsGuard full report showing the percentage of false information in responses from different AI models in August 2024 and August 2025. Most models show an increase in false information over time, with Inflection and Perplexity having the highest rates in 2025. Claude and Gemini have the lowest rates.
Quote from the NewsGuard article: “As chatbots adopted real-time web searches, they moved away from declining to answer questions. Their non-response rates fell from 31 percent in August 2024 to 0 percent in August 2025. But at 35 percent, their likelihood of repeating false information almost doubled. Instead of citing data cutoffs or refusing to weigh in on sensitive topics, the LLMs now pull from a polluted online information ecosystem — sometimes deliberately seeded by vast networks of malign actors, including Russian disinformation operations — and treat unreliable sources as credible.”