Training advanced AI models still requires large amounts of high-quality data, much of which contains copyrighted content. As a consequence, AI training may infringe copyrights. This article provides an update on whether using such works for this purpose may be considered ‘fair use’ under US copyright law, and explains how European copyright law addresses this issue.
The rise of large language models
Chatbots have become part of our daily lives, garnering millions of users. The success of these large language models (LLMs) is largely due to the variety of their potential applications and their user-friendly interface, which allows anyone to input queries (known as prompts) in plain English or any other natural language, without requiring any knowledge of how these models work or the mathematics behind them. However, generative AI is also notorious for being data hungry.
A staggering amount of training data
How hungry? For the training of GPT-3, the model family behind the original ChatGPT that started the AI race, reportedly 45 terabytes of raw compressed data were collected, of which just 570 gigabytes remained after filtering and cleaning. OpenAI has never published exact figures on the amount of training data used to create GPT-4, but independent estimates range from dozens of terabytes to as much as one petabyte (1,024 terabytes, or 1,125,899,906,842,624 bytes). As these figures may be a little abstract, let’s assume that a large language model is trained on 100 terabytes of data, which is still a conservative estimate. This would be equivalent to roughly 50 billion pages, or 200 million books. This is 12 times the capacity of the U.S. Library of Congress, which currently holds ‘only’ 17 million books. Reading them all would take about 50,000 human lifetimes. Figure 1 illustrates the amount of training data and provides additional examples.
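For readers who want to check these back-of-envelope figures, the short Python sketch below reproduces the arithmetic. The bytes-per-page and pages-per-book values are illustrative assumptions chosen for the estimate, not measured properties of any particular training corpus.

```python
# Back-of-envelope estimate: how much text is 100 terabytes?
# The bytes-per-page and pages-per-book figures are illustrative assumptions.

TRAINING_DATA_BYTES = 100 * 10**12   # 100 TB of plain text (conservative assumption)
BYTES_PER_PAGE = 2_000               # ~2 KB of plain text per printed page (assumption)
PAGES_PER_BOOK = 250                 # average book length in pages (assumption)
LIBRARY_OF_CONGRESS_BOOKS = 17_000_000

pages = TRAINING_DATA_BYTES / BYTES_PER_PAGE
books = pages / PAGES_PER_BOOK

print(f"Pages: {pages:,.0f}")     # ~50 billion pages
print(f"Books: {books:,.0f}")     # ~200 million books
print(f"Libraries of Congress: {books / LIBRARY_OF_CONGRESS_BOOKS:.0f}x")  # ~12x
```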

But where does all this data come from? This question has been raised with increasing frequency lately, and the sources vary. Training data for generative AI models is typically collected by ‘web scraping’: not only news articles, but also internet forums, wikis, books and academic papers, supplemented by synthetic data generated by earlier AI models. It is easy to see that much of this data contains content that is protected by copyright. In fact, a lot of it does.
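To make the term ‘web scraping’ concrete, here is a minimal, hypothetical sketch of how a crawler might pull the text of a single page. The URL is a placeholder, and the snippet deliberately ignores the licensing questions discussed in the rest of this article.

```python
# Minimal illustration of web scraping: fetch one page and extract its plain text.
# The URL is a placeholder; real training pipelines crawl billions of pages and
# then filter, deduplicate and clean the results before any training takes place.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-news-article"  # placeholder URL
html = requests.get(url, timeout=10).text      # download the raw HTML
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

print(text[:500])  # the first 500 characters of extracted text
```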
The number of court cases is growing
While several AI companies have begun to enter into licensing agreements for such content, the question of whether, and if so to what extent, copyrighted content may be used freely for training purposes remains unanswered and is the subject of an increasing number of lawsuits. Figure 2 lists some notable cases in the United States, Canada, and India.

Recently, the U.S. Copyright Office published a white paper on this topic as Part 3 of its comprehensive “Report on Copyright and Artificial Intelligence”. This issue is also the subject of ongoing discussion within the European Union, with an increasing number of law review articles and other papers being published on the topic. For executives who lack the time and/or inclination to read all of the available literature, but who wish to learn about the current status of this topic and its background, we first address the suggestions made by the U.S. Copyright Office and then tackle the legal situation in Europe.
The U.S. Copyright Office on fair use of copyrighted works in training AI
At 175 pages long, the white paper from the U.S. Copyright Office (hereafter referred to as “the Copyright Office”) may be the most comprehensive publication on AI and copyright outside of academic circles. It provides an introduction to the technical background of machine learning and generative language models, shedding light on how training data is compiled, filtered and cleaned, before addressing the training phases. It then explores the possibility of prima facie copyright infringement in these training phases. The centerpiece of the paper is devoted to the principle of fair use, which was first established by the courts and then enshrined in Section 107 of the US Copyright Act. This section provides that the fair use of a copyrighted work does not constitute copyright infringement. While the Copyright Act does not conclusively define “fair use”, it does list four relevant factors for determining whether use of a copyrighted work is fair:
- the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
- the nature of the copyrighted work;
- the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
- the effect of the use upon the potential market for or value of the copyrighted work.
The Copyright Office examined these four factors in the context of training AI models. In doing so, it also considered the relevant case law and observed:
“that the first and fourth factors can be expected to assume considerable weight in the analysis. Different uses of copyrighted works in AI training will be more transformative than others. And given the volume, speed and sophistication with which AI systems can generate outputs, and the vast number of works that may be used in training, the impact on the markets for copyrighted works could be of unprecedented scale.”
While the Copyright Office understandably pointed out that the wide range of potential uses of copyrighted content for training AI models makes it impossible to predict litigation outcomes, it indicated that it
“expects that some uses of copyrighted works for generative AI training will qualify as fair use, and some will not. On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use. Many uses, however, will fall somewhere in between.”
This conclusion may seem akin to that of the Oracle of Delphi, as it does not eliminate the considerable uncertainty surrounding uses that ‘fall in between’ the two extremes. These uses are likely to represent the largest proportion of use cases for copyrighted content in the training of AI models, so AI training may infringe copyrights in some of these cases as well. However, it would be unfair to criticize the Copyright Office for reaching this conclusion, given the factual, legal, and technical complexity of the matter. Using copyrighted content for training purposes remains risky in the United States and can result in substantial damages if courts rule in favour of the authors. In the United States, therefore, the answer to the question ‘Do our AI training practices infringe copyright?’ largely depends on an examination of each individual case. There will, of course, be a learning curve as further court decisions are handed down, and ‘the fog of law’ is likely to lift to some extent.
Some court cases provide further guidance
In Bartz v. Anthropic, the court ruled largely in favour of Anthropic, finding that “the technology at issue was among the most transformative many of us will see in our lifetimes” and that “the copies used to train specific LLMs were justified as a fair use” since “every factor but the nature of the copyrighted works favors this result”. The conversion of legitimately purchased copies of the copyrighted works into digital library copies was also considered fair use by the court, as the purpose and character of this use strongly favored this result. However, the court clearly stated that “The downloaded pirated copies used to build a central library were not justified by a fair use.” Perhaps even more importantly, it refused to follow Anthropic’s claim that it may make and keep indefinitely copies made from central library copies but not used for training purposes. According to the court, Anthropic was not entitled, as it had claimed, to an order approving all the copying it had ever done after obtaining the data.
In the case of Sarah Silverman and other authors v. Meta, the court reportedly dismissed the plaintiffs’ complaint that Meta had infringed their copyrights by using their works for AI training. The plaintiffs had argued that the output of Meta’s AI system contained pieces of their books and that Meta’s use of their works effectively barred them from licensing these works for training (other) AI models. The court took the view that Meta’s Llama could not generate “enough text from the plaintiffs’ books to matter, and the plaintiffs are not entitled to the market for licensing their works.” Even so, the court made it clear that its ruling did not condone Meta’s practice of training its Llama model using the plaintiffs’ books, and it observed that it might have ruled differently had the plaintiffs presented evidence that Meta’s use of their works had an (adverse) effect upon the potential market for their books (factor 4).
In Thomson Reuters v. Ross Intelligence, the court held that using headnotes from the Westlaw legal database to develop a competing database was not “fair use” and therefore required the prior consent of the rights owner, which had not been granted. The court reached this decision after considering all four fair use factors. It concluded that factors one (purpose and character of the use) and four (the effect of the use on the potential market for the copyrighted works) favored Thomson Reuters, that factor two (nature of the copyrighted work) mattered less, and that factor four weighed more heavily than the others.
If there is a lesson to be learned from these and other U.S. court judgments, it may be that the manner in which copyrighted works are obtained to train AI models does matter, and that using pirated copies will likely cause problems. Moreover, AI companies that use copyrighted works to train AI models for commercial purposes will not be able to claim that this serves the noble cause of scientific research. The takeaway: depending on the facts of each case, AI training may infringe copyrights.
The European perspective on training AI models with copyrighted content

The European Union is not renowned for its liberal approach to AI regulation. Following its risk-averse AI Act, the EU has recently published its Code of Practice for General-Purpose AI. Europe does not appear to be seriously considering the calls from various industry sectors, or even from Emmanuel Macron, to scale back its regulatory work, which threatens to seriously hinder the EU’s aim of staying technologically competitive, let alone becoming an AI powerhouse that rivals the US and China. You may therefore be surprised to learn that the EU has enacted a copyright exemption that seems to allow the training of AI models using copyrighted works under certain conditions.
The EU text and data mining privilege
The magic wand that makes this possible is the text and data mining exception in the EU Directive on Copyright and Related Rights in the Digital Single Market (the “DSM Directive”): Article 3 obliges EU Member States to provide an exception to copyright protection for “text and data mining” carried out for the purposes of scientific research, and Article 4 extends this exception to text and data mining by anyone, subject to the opt-out discussed below. The Directive defines “text and data mining” as
“any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”.
Does that sound familiar? After all, current generative AI models are designed to detect correlations between data sets and analyze them to generate their output. Germany, for example, has amended its Copyright Act to permit the reproduction of lawfully accessible works for the purpose of text and data mining. So, case closed? Well, not so fast. The good news for the tech sector is that the legal literature largely supports the text and data mining exemption as a legal basis for training AI models without infringing copyright. However, there is still some resistance. Tim Dornis, a German law professor, for example, has rejected the application of the text and data mining exemption to AI training, calling it a ‘fallacy’ and ‘highly questionable’. This view is perhaps unsurprising, though, as it is the result of a research project carried out for the Initiative Urheberrecht (‘Initiative Copyright’), an umbrella organisation representing the interests of 140,000 authors of copyrighted works.
Even if concerns about the scope of the text and data mining privilege prove to be unfounded, anyone training AI models with content from websites should be aware that Article 4 of the DSM Directive and the national copyright laws of EU Member States make this privilege subject to an opt-out: it does not apply where rightholders have expressly reserved the right to prevent the use of works they have made publicly available. For content published online, this reservation must be made in machine-readable form so that it can be recognised by web crawlers and similar tools.
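What a ‘machine-readable reservation’ looks like in practice is still evolving. The long-established robots.txt convention is the most common mechanism, and newer proposals such as the W3C TDM Reservation Protocol (TDMRep) add dedicated HTTP signals. The sketch below, with a placeholder domain and crawler name, shows how a crawler might check both before downloading a page; it is an illustration of the idea under those assumptions, not a statement of what the law requires.

```python
# Sketch: check for machine-readable opt-outs before crawling a page.
# The domain and crawler name are placeholders; the "tdm-reservation" header
# follows the W3C TDMRep proposal and is shown here purely as an illustration.
import urllib.robotparser
import requests

URL = "https://example.com/article.html"   # placeholder page
CRAWLER_NAME = "ExampleTrainingBot"        # placeholder user agent

# 1. Classic robots.txt check
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()
allowed_by_robots = robots.can_fetch(CRAWLER_NAME, URL)

# 2. TDM-specific reservation header (TDMRep proposal): "1" means rights are reserved
response = requests.head(URL, headers={"User-Agent": CRAWLER_NAME}, timeout=10)
tdm_reserved = response.headers.get("tdm-reservation") == "1"

if allowed_by_robots and not tdm_reserved:
    print("No machine-readable reservation found; crawling may proceed (legal review still advised).")
else:
    print("Rightholder has reserved text and data mining; do not use this page for training.")
```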
Providers of general-purpose AI models should also be aware of the fact that Article 53 of the EU AI Act, which enters into force on August 2, 2025, provides that they shall:
- “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office”; and
- put in place a policy ensuring that they identify and comply with such reservations against text and data mining.
The EU Code of Practice for General-Purpose AI Models
In July 2025, the European Commission published the copyright chapter of its Code of Practice for General-Purpose AI Models. While this code of practice is voluntary, it is expected that providers of such models will become signatories. The Code reinforces the obligations of these providers as set out in Article 53 of the EU AI Act and Article 4 of the DSM Directive, as summarised above. However, it also goes further, requiring signatories to make additional commitments, such as respecting paywalls and excluding certain websites known for copyright infringement from their web crawling. Furthermore, the Code requires providers of general-purpose models to ensure that these models do not generate output that infringes copyright or related rights. This short list of commitments under the Code of Practice is not exhaustive.
Kent Walker, Google’s and Alphabet’s president of global affairs, announced on July 30, 2025, that Google and Alphabet will join other companies, including U.S. model providers, in signing the Code of Practice, but he also voiced concerns “that the AI Act and Code risk slowing Europe’s development and deployment of AI. In particular, departures from EU copyright law, steps that slow approvals, or requirements that expose trade secrets could chill European model development and deployment, harming Europe’s competitiveness”.
European case law
With just one decision by a German court on text and data mining, case law is still scarce, and that decision, handed down by the Regional Court of Hamburg, expressly did not address the question of “whether the training of AI in its entirety falls under the exemption of Section 44b UrhG (German Copyright Act) or not.” However, the court found that downloading and analyzing a photograph for the purpose of matching the content of the image with the image annotation constituted an analysis for obtaining information on “correlations” (i.e. the question of whether images and image descriptions match or not) and was thus permissible under the text and data mining exemption in § 44b of the German Copyright Act. It is important to note, however, that only lawfully accessible works may be subject to text and data mining, so pirated works are out of bounds, much as they were in Bartz v. Anthropic under the U.S. Copyright Act. Moreover, the DSM Directive and the national copyright laws of EU Member States provide that reproductions and extractions of copyrighted works made for purposes other than scientific research may only be retained for as long as is necessary for the text and data mining and must be deleted once this is no longer the case.
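The Hamburg court did not spell out the technical pipeline behind the image/annotation matching it described, but a common approach is to score how well an image and a textual description fit together using a vision-language model. The sketch below uses the openly available CLIP model from the Hugging Face transformers library purely to illustrate that kind of ‘correlation’ analysis; the file name and captions are placeholders, and this is not a description of the system actually at issue in the case.

```python
# Illustrative sketch: scoring how well an image matches its annotation with CLIP.
# This only exemplifies the kind of image/description "correlation" analysis the
# Hamburg court described; it does not reproduce the defendant's actual pipeline.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                    # placeholder file name
candidate_captions = [
    "a photograph of a sailing boat at sunset",    # hypothetical annotation
    "a close-up of a cat sleeping on a sofa",      # unrelated control caption
]

inputs = processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = better match between the image and the caption.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, prob in zip(candidate_captions, probs):
    print(f"{prob:.2%}  {caption}")
```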
This blog may revisit the permissibility of training AI models with copyrighted works when new court rulings or other noteworthy material becomes available. If you are in the business of developing AI models, make sure that you don’t neglect legal compliance.