The capabilities of AI models such as GPT-5, Gemini, Claude, and Grok depend heavily on the size and scope of the datasets used to train them. That training data has also been the source of multiple lawsuits claiming that the companies performing the training had no right to freely use it. In an expanded class-action case against Nvidia, however, the accusation goes one step further, with claims that the GPU giant willingly used an illegal source of pirated books to train its models.
As reported by TorrentFreak, an amended complaint filed last week at the district court in Oakland, California, specifically claims that staff at Nvidia contacted a so-called ‘shadow library’ known as Anna’s Archive, a repository of pirated books and other documents.
The plaintiffs cite internal Nvidia communications as evidence, with the filed document purporting to show someone from the data strategy team at Nvidia writing, “we are exploring including Anna’s Archive in pre-training data for our LLMs.”
It continues with “We are figuring out internally whether we are willing to accept the risk of using this data, but would like to speak with your team to get a better understanding of LLM-related work you have done.”
While Anna’s Archive appears not to host any content directly itself, it does act as a ‘search engine’ for alleged pirate libraries. These third-party hosts aren’t exclusively providing access to copyrighted materials, but that content is what they are most infamous for.
The original complaint against Nvidia was filed back in 2024, and as TorrentFreak reported at the time, Nvidia’s response was essentially to claim that AI training on such material is not the same as owning an illegally obtained book, or even using it as a human does. “Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model,” it wrote in response.
In essence, Nvidia is saying that the use of such datasets falls under fair use. Given that the original complaint involved data garnered from another pirated source (Books3), Nvidia may choose to use the same counterargument from 2024.
Similar claims have been filed against Anthropic and Meta in the past, and in the case of the former, the judge ruled that while training on the data did fall under fair use, “Anthropic had no entitlement to use pirated copies for its central library.” How the case against Nvidia will fare, we’ll just have to wait and see.


This is a fascinating and complex issue. The implications of using pirated materials in AI training raise important ethical questions about content ownership and innovation. It’s essential to consider how these practices could impact the future of AI development and copyright laws.
I completely agree; it’s definitely a multifaceted topic. It’s interesting to consider how this practice might shape the future of copyright laws and content creation, especially as AI continues to evolve. Balancing innovation with ethical considerations will be crucial moving forward.
You’re right, it really is a complex issue. It’s also worth noting how the legal implications of using such data could shape the future of AI development and copyright laws. Balancing innovation with ethical practices will be crucial moving forward.
I completely agree! The legal implications are certainly significant, especially as they could set precedents for how AI models are trained in the future. It’s fascinating to think about how this might influence not only tech companies but also the publishing industry moving forward.