The Tow Center conducted a test to better understand how publishers’ content is represented on the ChatGPT platform. We randomly selected 20 publishers across three categories:
- Those partnered with OpenAI
- Those involved in lawsuits against OpenAI
- Unaffiliated publishers who either allowed or blocked ChatGPT’s search crawler
From each publisher, we selected 10 articles and extracted specific quotes. These quotes were chosen because, when entered into search engines like Google or Bing, they reliably returned the source article among the top three results. We then evaluated whether ChatGPT’s new search tool accurately identified the original source for each quote.
The dataset includes:
- Publisher: Name of the source.
- Source URL: Original article URL.
- Conversation Link: ChatGPT session link.
- Date of Article: Publication date.
- Paywalled Article?: Soft Paywall/Hard Paywall/No Paywall
- Prompt: Input quote.
- SearchGPT Answer: Title, publisher, date, and URL.
- Metrics:
- Correct Publisher? (1 pt)
- Correct Date? (1 pt)
- Correct URL? (1 pt)
- Searchability Score: Performance rating (1–3).
- Accuracy: Fully/Partially correct.
- Google/Bing Index: Rank of the source article on Bing and Google when the quote was searched
This dataset extends and updates PaleWire’s database, analyzing which publishers enabled or disabled OpenAI’s bots. We scraped the robots.txt
files of publisher websites to determine permissions for bots like GPTBot and OAI-SearchBot. The data seen is as of November 13th, 2024
We used the Tow Center's database that tracks platform-publisher relationships.