AI models at Apple, Salesforce, Anthropic, and other major technology players were trained on tens of thousands of YouTube videos without the creators’ consent and potentially in violation of YouTube’s terms, according to a new report appearing in both Proof News and Wired.
The companies trained their models in part by using “the Pile,” a collection by nonprofit EleutherAI that was put together as a way to offer a useful dataset to individuals or companies that don’t have the resources to compete with Big Tech, though it has also since been used by those bigger companies.
The Pile includes books, Wikipedia articles, and much more. Among its contents are YouTube captions collected via YouTube’s captions API, scraped from 173,536 YouTube videos across more than 48,000 channels. Those include videos from big YouTubers like MrBeast, PewDiePie, and popular tech commentator Marques Brownlee. On X, Brownlee called out Apple’s usage of the dataset, but acknowledged that assigning blame is complex when Apple did not collect the data itself. He wrote:
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids “fault” here because they’re not the ones scraping
But this is going to be an evolving problem for a long time
The dataset also includes the channels of numerous mainstream and online media brands, including videos written, produced, and published by Ars Technica and its staff and by numerous other Condé Nast brands like Wired and The New Yorker.
Coincidentally, one of the videos used in the dataset was an Ars Technica-produced short film wherein the joke was that it was already written by AI. Proof News’ article also mentions that models were trained on videos of a parrot, so AI models are parroting a parrot parroting human speech, as well as parroting other AIs parroting humans.
As AI-generated content continues to proliferate on the Internet, it will be increasingly challenging to put together datasets to train AI that don’t include content already produced by AI.
To be clear, some of this is not new news. The Pile is often used and referenced in AI circles and has been known to be used by tech companies for training in the past. It has been cited in multiple lawsuits by intellectual property owners against AI and tech companies. Defendants in those lawsuits, including OpenAI, say that this kind of scraping is fair use. The lawsuits have not yet been resolved in court.
However, Proof News did some digging to identify specifics about the use of YouTube captions and went so far as to create a tool that you can use to search the Pile for individual videos or channels.
The work exposes just how extensive the data collection is and calls attention to how little control owners of intellectual property have over how their work is used if it’s on the open web.
It’s important to note, however, that this data was not necessarily used to train models that produce competitive content for end users. For example, Apple may have trained on the dataset for research purposes, or to improve autocomplete for text typing on its devices.
Reactions from creators
Proof News also reached out to several of these creators for statements, as well as to the companies that used the dataset. Most creators were surprised their content had been used this way, and those who provided statements were critical of EleutherAI and the companies that used its dataset. For example, David Pakman of The David Pakman Show said:
No one came to me and said, “We would like to use this”… This is my livelihood, and I put time, resources, money, and staff time into creating this content. There’s really no shortage of work.
Julia Walsh, CEO of the production company Complexly, which is responsible for SciShow and other Hank and John Green educational content, said:
We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent.
There’s also the question of whether the scraping of this content violates YouTube’s terms, which prohibit accessing videos by “automated means.” EleutherAI founder Sid Black said he used a script to download the captions via YouTube’s API, just like a web browser does.
Anthropic is one of the companies that has trained models on the dataset, and for its part, it claims there’s no violation here. Spokesperson Jennifer Martinez said:
The Pile includes a very small subset of YouTube subtitles… YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.
A Google spokesperson told Proof News that Google has taken “action over the years to prevent abusive, unauthorized scraping” but didn’t provide a more specific response.
This is not the first time that AI and tech companies have been subject to criticism for training models on YouTube videos without permission. Notably, OpenAI (the company behind ChatGPT and the video generation tool Sora) is believed to have used YouTube data to train its models, though not all allegations of this have been confirmed.
In an interview with The Verge’s Nilay Patel, Google CEO Sundar Pichai suggested that the use of YouTube videos to train OpenAI’s Sora would have violated YouTube’s terms. Granted, that usage is distinct from scraping captions via the API.