Post · bonfire.cafe

@simon@fedi.simonwillison.net · 6 months ago

The previously sensible advice to never use ChatGPT for search needs to be rethought - GPT-5 in thinking mode is shockingly good at running searches now https://simonwillison.net/2025/Sep/6/research-goblin/

Alex Haydock

@alexhaydock@infosec.exchange · 6 months ago

@simon The fact that the official Oxford name uses the Oxford Comma is somehow deeply satisfying.

“The official name of Cambridge might include the serial comma, like "Masters, and Scholars." There's a similar structure with Oxford, which uses a comma after "Masters" too.”

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

Follow-up note about Google's new "AI mode" - it's actually very good! Massively different from "AI overviews" which are terrible https://simonwillison.net/2025/Sep/7/ai-mode/

Dan Kegel

@dkegel@infosec.exchange · 6 months ago

@simon

I used the "dive deeper in ai mode" button on a question about jq. Two quick prompts later I had the right answer and a concise explanation.

Google did well on this little task. It reminded me of how Google Search made me feel smarter back around 2000 or so.

I should try the same thing with other llms for comparison, but probably won't.

Blind Dragonoid

@mechanomon@tweesecake.social · 6 months ago

@simon Sometimes GPT-5 gives links that are unrelated to the claim it cites them as a reference for

gabriele renzi

@riffraff@mastodon.social · 6 months ago

@simon I wonder what's different between AI mode, Gemini, and AI overviews that makes the first one unavailable in the EU while the others are fine.

Chris Zubak-Skees

@zubakskees@mastodon.social · 6 months ago

@simon Corrected some typos in the prompt and got a less-clear answer. Seems like ymmv

Anthropic bought lots of physical books and cut them up and scanned them for training data. Do any other AI labs do the same thing?
While the destructive book-scanning method on a massive scale is specifically known for Anthropic, other AI labs use various approaches to obtain book training data. Large-scale book scanning is not a widespread public practice among other major AI companies, though they do use book data obtained through other means.
Alternative methods for obtaining book data
Digitized library collections: In a more preservation-friendly approach, some AI labs partner with libraries and institutions that have existing digital archives. For instance, OpenAI and Microsoft announced a collaboration with Harvard libraries to use digital copies of nearly one million public domain books, some dating back to the 15th century.
Web scraping: Many AI models, such as those from OpenAI and Meta, have been accused of and sued over downloading millions of books from pirated websites like Library Genesis and Pirate Library Mirror. This method is legally and ethically contentious, and in a 2025 ruling against Anthropic, a judge differentiated between the company's purchased books and its use of pirated material.

felix (grayscale) 🐺

@gray17@mastodon.social · 6 months ago

@simon reading the thinking traces, it strikes me that many of the successes have the answer in the first step, which suggests to me that the answer is easy to find with a normal search-engine query

I tried these simple text searches:
- "wikipedia use of britannica", the link to the relevant wikipedia page is the 6th link
- "building in reading", the link to the relevant wikipedia page is again the 6th link

so for those, using chatGPT seems like a net negative

SpaceLifeForm

@SpaceLifeForm@infosec.exchange · 6 months ago

@simon

Clearly a learned scholar from Oxford.

https://en.m.wikipedia.org/wiki/University_of_Cambridge

#OxfordComma

PeachMcD

@PeachMcD@union.place · 6 months ago

@simon

I'm just staying as far from AI as possible, on ecological & ethical grounds. AI & crypto are energy hogs run by bad actors, and I'm already implicated in too many horrific sins just feeding & clothing myself

Quinn Comendant

@com@mastodon.social · 6 months ago

@simon It’s great until it can’t discern misinformation/disinformation from reality.
https://www.newsguardtech.com/ai-monitor/august-2025-ai-false-claim-monitor/
Full report: https://www.newsguardtech.com/wp-content/uploads/2025/09/August-2025-One-Year-Progress-Report-3.pdf #llm #misinformation

Quote from the NewsGuard article: “As chatbots adopted real-time web searches, they moved away from declining to answer questions. Their non-response rates fell from 31 percent in August 2024 to 0 percent in August 2025. But at 35 percent, their likelihood of repeating false information almost doubled. Instead of citing data cutoffs or refusing to weigh in on sensitive topics, the LLMs now pull from a polluted online information ecosystem — sometimes deliberately seeded by vast networks of malign actors, including Russian disinformation operations — and treat unreliable sources as credible.”

Chart from the NewsGuard full report showing the percentage of false information in responses from different AI models in August 2024 and August 2025. Most models show an increase in false information over time, with Inflection and Perplexity having the highest rates in 2025. Claude and Gemini have the lowest rates.

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

@com it's hard to evaluate that report because it doesn't mention any model names - it talks about "ChatGPT" but I know from experience that GPT-4o is a terrible model for search whereas GPT-5 and o3 are a massive improvement

De-extincted Neanderthal

@Reshirams_Rad_Slam@mastodo.neoliber.al · 5 months ago

@simon @com Are any of the free ChatGPT offerings better than Grok at searching? It takes a long time, but from my tests (I haven't tried any OpenAI), Grok is the best at searching - avoiding anything Elon's whims might touch, of course. This is aggregated data from an RPG that's freely and relatively widely available online so it's as factual as it gets within these particular contexts, which is why I'm testing using this domain https://2e.aonprd.com/Mysteries.aspx . It should only ever output definite answers

De-extincted Neanderthal

@Reshirams_Rad_Slam@mastodo.neoliber.al · 5 months ago

@simon @com Ok. Now that code execution + Google Search + URL context is unleashed, Gemini might be as good at searching as Grok if forced to search using Python. It wasn't like this before, even with that trick.

Quinn Comendant

@com@mastodon.social · 6 months ago

@simon The model used in the 2025 audit (“OpenAI’s ChatGPT-5” [sic]) was mentioned in another press release: https://www.newsguardtech.com/press/newsguard-one-year-ai-audit-progress-report-finds-that-ai-models-spread-falsehoods-in-the-news-35-of-the-time/

The model from 2024 was “OpenAI’s ChatGPT-4”: https://www.newsguardtech.com/special-reports/generative-ai-models-mimic-russian-disinformation-cite-fake-news/

The main insight is that the reproduction of falsehoods worsened in a year (GPT-4 → GPT-5). Certainly, since GPT-5 *is better* at search, the results *should have improved!* Perhaps the culprit is an overall increase in misinformation on the web. 😞

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

@com I'm having real trouble with this. ChatGPT-4 isn't a model - did they mean 4o (launched in May 2024) or are they mixing results from both GPT-4 and GPT-4o?

GPT-5 has been out for a month, is their 2025 audit entirely from that time period or does it include the first six months of 2025 against other models?

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

@com o3 and GPT-5 Thinking (not necessarily regular non-thinking GPT-5) really do represent a step change in how effective models are at evaluating search results, so I'd like to see clarity from them on exactly which models they evaluated and when

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

@com in 2024 they were deliberate about obfuscating these details - a decision I am very much opposed to: "NewsGuard is not providing the scores for each individual chatbot or including their names in the examples below, because the audit found that the issue was pervasive across the entire AI industry rather than specific to a certain large language model."

Simon Willison

@simon@fedi.simonwillison.net · 6 months ago

I just ran one of their prompts through GPT-5 Thinking and got what looked to me like an impressive result https://chatgpt.com/share/68bcad53-e9ac-8006-9945-969eef306fd3