by Luca Farkas
The intersection of copyright law and generative artificial intelligence (AI) has become a key point of contention, particularly regarding AI's reliance on copyrighted materials for training purposes. AI models often process large datasets—many of which include copyrighted works—raising fundamental questions about whether AI developers can use these materials without permission, or if such use constitutes copyright infringement.[1] Some critics argue that AI training constitutes "large-scale theft" of intellectual property,[2] while others believe that certain uses of copyrighted content for training should be permitted under specific conditions.[3] This debate reflects the ongoing tension between safeguarding the intellectual property rights of creators and fostering technological innovation.[4]
While the legal landscape in the United States remains unsettled, with courts grappling over the application of the fair use doctrine to AI training, the European Union has established a more structured framework that provides clearer guidelines for AI developers. This divergence highlights the challenges of navigating copyright law in the age of AI. The analysis below examines the differing approaches in the US and the EU, along with potential legal solutions.
The Core Issue: Does AI Training Cross the Line of Copyright Infringement?
At the heart of this debate is whether the use of copyrighted material for AI training constitutes infringement. OpenAI, for example, trains its models on large, publicly available datasets that may include copyrighted works.[5] These datasets are used to teach the AI system how to understand language, recognize patterns, and generate responses. However, the process of copying these datasets for analysis raises the question of whether such use constitutes unauthorized reproduction or distribution, a hallmark of copyright infringement. OpenAI has attempted to address some of these concerns by introducing a feature that allows copyright holders to block their content from being scraped for AI training purposes.[6] This gives data owners the ability to opt out of future data usage, offering some level of control over how their content is used. However, this feature does not address content that has already been collected, leaving existing training data untouched.[7]
The lack of clear legal frameworks governing the use of such content for AI training further complicates the issue, as it remains unclear whether the use of copyrighted works in training models constitutes fair use or whether AI developers need to obtain explicit permission from copyright holders.[8]
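The opt-out feature described above is commonly expressed as a rule in a site's robots.txt file. The following is a minimal sketch, not a description of OpenAI's own implementation: it assumes the crawler identifies itself with the user-agent token GPTBot and voluntarily honours robots.txt, and it uses a hypothetical robots.txt file and a hypothetical example.com URL. It shows how a publisher could check, with Python's standard library, that such a rule blocks one crawler while leaving others unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to opt out of
# AI-training crawls (assumed user-agent token: GPTBot) while
# leaving all other crawlers unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # hypothetical URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", url))        # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", url))  # True
```

A rule of this kind only signals a preference to crawlers that choose to respect it, and, as noted above, it has no effect on content that was collected before the rule was published.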
Exploring Legal Solutions to AI and Copyright Conflicts
Several theoretical solutions have been proposed to address the copyright challenges posed by AI training. These solutions reflect varying degrees of flexibility and regulatory oversight:
1. Treating AI training as infringement unless authorized: One approach is to treat the use of copyrighted material for AI training as infringement unless explicit permission is obtained from copyright holders.[9] This solution would prioritize the protection of intellectual property rights, but it could make it very difficult for AI developers to obtain authorization for the large number of works involved.
2. Permitting AI training under specific conditions: An intermediate solution involves allowing the reproduction of copyrighted works for AI training under certain conditions, such as when the use is non-commercial, the works are lawfully accessed, or the copyright holder has not objected to the use of their content in machine-readable formats.[10] This approach would strike a balance between the interests of AI developers and copyright holders, offering more flexibility than the first option while still respecting creators' rights.
3. Permitting AI training even without authorization or payment: At the other extreme, AI developers would be allowed to use copyrighted content for training purposes without prior authorization or payment, even for commercial uses.[11] Proponents of this approach argue that AI training is transformative in nature and that the use of copyrighted works for the purpose of learning general patterns and structures does not harm the market for the original work.[12] This solution, based on the fair use doctrine, would allow AI systems to develop quickly without requiring developers to negotiate licenses for each work used in training.
While these theoretical solutions offer a starting point, their practical application varies significantly across jurisdictions. The USA and the EU provide different legal frameworks for dealing with AI training and copyright.
USA: Embracing Fair Use
In the United States, the fair use doctrine is a key point in the debate over AI and copyright.[13] This doctrine allows certain uses of copyrighted material without permission, such as for research, commentary, or educational purposes, provided that the use meets specific criteria (e.g., being transformative, not harming the market for the original work).[14] Many legal experts argue that AI training should fall under the fair use exception, especially since AI models are often used to extract general patterns rather than replicate the original content.[15] OpenAI has echoed this sentiment, arguing that “training AI models using publicly available internet materials is fair use,” a position supported by the Library Copyright Alliance (LCA) in its 2023 “Principles for Copyright and Artificial Intelligence”[16] document.[17] They emphasize that established legal precedents back the ingestion of copyrighted works for AI training. However, this interpretation remains contentious and may lead to legal challenges that could reshape our understanding of fair use in the context of artificial intelligence.[18] As courts begin to address these issues, their decisions will be critical in balancing the need for innovation in AI with the rights of copyright holders. The outcomes of these cases will not only influence future practices but also determine how we navigate the intersection of technology and intellectual property law.
The US Copyright Office is conducting a comprehensive study, published in parts beginning in 2024, that examines various aspects of copyright law as it relates to AI, including the legal implications of training AI models on copyrighted works.[19] The report is expected to provide insight into how copyright law can adapt to these emerging technologies.
EU: Striking a Middle Ground with TDM Exceptions
The European Union has taken a more structured approach to AI and copyright with Directive (EU) 2019/790 on Copyright in the Digital Single Market (CDSMD). Under Article 4 of this directive, text and data mining (TDM) is permitted, including for commercial purposes, provided that the works are lawfully accessible and the copyright holder has not expressly reserved the right to such use.[20]
The EU AI Act, while focused primarily on AI safety and trustworthiness, also introduces a few specific obligations related to copyright law, complementing the CDSMD. It requires providers of generative AI models to respect copyright protections, including compliance with machine-readable rights reservations, as outlined in the CDSMD.[21] This means AI providers must ensure their models identify and comply with access restrictions set by copyright holders, such as "opt-out" instructions that prevent their works from being used for AI training. These obligations, though limited, represent an important step in aligning AI regulation with existing copyright frameworks.
However, within the EU, different member states have adopted varying approaches to the application of these principles. For instance, Poland has adopted a more restrictive stance by explicitly excluding the use of TDM exceptions for developing generative AI models unless the copyright holder has granted explicit permission.[22] While some EU countries allow TDM exceptions for research or educational purposes, Poland has taken a more cautious approach that could create significant barriers for AI developers. Developers in Poland would need to secure individual permissions for each dataset they wish to use, which could slow the pace of AI innovation.
In 2024, Spain introduced a more forward-looking approach with its proposal to implement Extended Collective Licensing (ECL) for AI training.[23] This proposal aims to strike a better balance between the large-scale data requirements of AI development and the rights of copyright holders. By allowing Collective Management Organizations (CMOs) to grant non-exclusive licenses that cover a broad spectrum of works, Spain seeks to streamline the licensing process and reduce the burden of negotiating individual permissions.[24] In contrast to Poland's restrictive model, Spain's ECL proposal provides a more flexible, inclusive framework for AI developers, enabling them to access the necessary data for training without being hindered by complex and time-consuming negotiations.
Rising Litigation in the USA, EU and Beyond
Litigation to resolve the legal issues surrounding copyright and AI training is currently taking place in several jurisdictions.
In the USA, several pending cases address crucial issues of copyright and AI. In 2020, UAB Planner5D, an imaging firm specializing in 3D virtual objects, filed copyright claims accusing Meta and Princeton University of illegally obtaining its data for the purpose of training AI models.[25] The core legal issues in this case are whether these 3D objects, created from 2D references, meet the copyrightability threshold, and whether Meta's use constitutes infringement under U.S. copyright law. In September 2023, the U.S. District Court for the Northern District of California ruled that the objects in question were original enough to qualify for copyright protection despite the U.S. Copyright Office's refusal to register them. It remains to be seen, however, whether Meta's use of these objects for AI training constitutes infringement.[26] Additionally, in 2023, The New York Times filed a lawsuit against Microsoft and OpenAI, asserting that they trained their large language models (LLMs) on millions of the Times's copyrighted works. The lawsuit contends that the outputs generated by these models reproduce content from the Times, thereby infringing its copyright.
In the EU, there has been ongoing debate regarding the applicability of the text and data mining (TDM) exception in the Copyright Directive (CDSMD) to AI training activities.[27] A significant ruling by the Hamburg District Court in September 2024 addressed this issue in Kneschke v. LAION (310 O 227/23).[28] The court explored key legal questions regarding copyright holders' ability to reserve their rights and how such reservations should be communicated.[29] It clarified that copyright holders can reserve their rights to block the use of their works in AI training under the EU’s Commercial TDM exception—found in Article 4 of the CDSMD and Section 44b of the German Copyright Act.[30] However, for these reservations to be valid, they must be clearly communicated, particularly in a machine-readable format. While the court did not make a final ruling on whether these rights could be reserved retroactively, it provided important comments on machine-readability requirements for opt-outs, suggesting that a natural language opt-out could be sufficient, allowing the copyright holder to opt out of having their image used in the dataset.[31] However, the court made it clear that this is not a general rule and that the specifics of how opt-outs should be communicated will depend on the circumstances, including the technical developments at the time of use. This ruling suggests a more flexible approach to the machine-readability requirement than some might have anticipated.[32]
In the United Kingdom, the Getty Images v. Stability AI case is pending trial before the High Court in London. Getty Images initiated legal proceedings in January 2023, alleging that Stability AI collected millions of images from its websites without consent to train an image-generating AI model known as Stable Diffusion.[33] Getty further claims that outputs from this AI model violate its intellectual property rights by reproducing substantial parts of its copyrighted works. Additionally, the UK government has launched a consultation on reforming copyright law in light of advancements in AI, which opened on December 17, 2024, and runs for ten weeks until February 25, 2025.[34]
Despite these efforts across jurisdictions, issues related to copyright and AI training are likely to remain unresolved by courts for many years to come.
Conclusion: Balancing Innovation and Intellectual Property Rights
The legal challenges surrounding AI training and copyright are evolving, with no one-size-fits-all solution. While jurisdictions like the United States continue to debate the applicability of the fair use doctrine for AI training, the European Union has taken a more structured approach, particularly through its Copyright Directive and text and data mining (TDM) exceptions. This difference in regulatory clarity highlights the EU’s more defined framework, which offers AI developers greater certainty about their obligations. However, even within the EU, the interpretation of copyright exceptions varies, and litigation in both the U.S. and EU continues to shape the landscape.
As these legal frameworks continue to develop, the key challenge will be striking a balance between fostering AI innovation and respecting the rights of content creators, ensuring that both technological progress and intellectual property protections are upheld.
[1] Eurojust, Generative Artificial Intelligence: The Impact on Intellectual Property Crime, November 2023, https://www.eurojust.europa.eu/sites/default/files/assets/generative-ai-impact-to-ip-crimes.pdf, hereinafter: Eurojust Report, page 12.
[2] Initiative Urheberrecht, Study Reveals AI Training is Copyright Infringement. 05.09.2024. https://urheber.info/diskurs/ai-training-is-copyright-infringement.
[3] Forbes, Roomy Khan, AI Training Data Dilemma: Legal Experts Argue for Fair Use, 04.10.2024. https://www.forbes.com/sites/roomykhan/2024/10/04/ai-training-data-dilemma-legal-experts-argue-for-fair-use/, hereinafter: Forbes, Khan, 2024.
[4] Alexander Peukert, Regulating IP Exclusion/Inclusion on a Global Scale: The Example of Copyright vs. AI Training, Research Paper of the Faculty of Law of Goethe University Frankfurt/M. No. 3/2024, hereinafter: Peukert, 2024, para. 8.
[5] Eurojust Report, page 12.
[6] Ibid.
[7] Ibid.
[8] Eurojust Report, page 12; Peukert, 2024, para. 10.
[9] Peukert, 2024, para. 8.
[10] Ibid.
[11] Ibid.
[12] Forbes, Khan, 2024.
[13] Eurojust Report, page 13.
[14] Peukert, 2024, para. 9.
[15] Forbes, Khan, 2024.
[16] Library Copyright Alliance (LCA), Principles for Copyright and Artificial Intelligence, 10.07.2023. https://www.librarycopyrightalliance.org/wp-content/uploads/2023/06/AI-principles.pdf.
[17] Association of Research Libraries, Training Generative AI Models on Copyrighted Works is Fair Use, 23.01.2024. https://www.arl.org/blog/training-generative-ai-models-on-copyrighted-works-is-fair-use/.
[18] Forbes, Khan, 2024.
[19] US Copyright Office, Artificial Intelligence Study, 31.07.2024. https://www.copyright.gov/policy/artificial-intelligence/.
[20] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC, PE/51/2019/REV/1, OJ L 130, 17.5.2019, pp. 92–125. Available at https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790.
[21] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No. 300/2008, (EU) No. 167/2013, (EU) No. 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144, and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act or AI Act), 2024 O.J. (L 2024/1689), available at http://data.europa.eu/eli/reg/2024/1689/oj, Art. 53(1)(c).
[22] Peukert, 2024, para. 9.
[23] Communia Association, Teresa Nobre, A First Look at the Spanish Proposal to Introduce ECL for AI Training, 10.12.2024. https://communia-association.org/2024/12/10/a-first-look-at-the-spanish-proposal-to-introduce-ecl-for-ai-training/.
[24] Ibid.
[25] UAB "Planner5D" v. Facebook Inc., 3:20-cv-08261, (N.D. Cal. Nov 23, 2020) ECF No. 1.
[26] Practical Law Intellectual Property & Technology, Copyrightability of Virtual Objects Based on Reference Images: N.D. Cal., 16.10.2023. https://content.next.westlaw.com/practical-law/document/I00a2fdf269f111ee8921fbef1a541940/Copyrightability-of-Virtual-Objects-Based-on-Reference-Images-N-D-Cal?viewType=FullText&transitionType=Default&contextData=(sc.Default).
[27] Peukert, 2024, para. 9.
[28] IPTechBlog, Dr. Sandra Mueller. Breaking News from Germany: Hamburg District Court Breaks New Ground with Judgment on the Use of Copyrighted Material as AI Training Data. 11.10.2024. https://www.iptechblog.com/2024/10/breaking-news-from-germany-hamburg-district-court-breaks-new-ground-with-judgment-on-the-use-of-copyrighted-material-as-ai-training-data/.
[29] Ibid.
[30] Bird & Bird, Dr. Simon Hembt, Dr. Niels Lutzhöft, and Toby Bond. Long-Awaited German Judgment by the District Court of Hamburg (Kneschke v. LAION) on the Text and Data Mining Exception(s), 01.10.2024. https://www.twobirds.com/en/insights/2024/germany/long-awaited-german-judgment-by-the-district-court-of-hamburg-kneschke-v-laion.
[31] Ibid.
[32] Ibid.
[33] Penningtons Manches Cooper, Generative AI in the courts: Getty Images v Stability AI, 16.02.2024. https://www.penningtonslaw.com/news-publications/latest-news/2024/generative-ai-in-the-courts-getty-images-v-stability-ai.
[34] UK Government, Open consultation: Copyright and Artificial Intelligence, 17.12.2024. https://www.gov.uk/government/consultations/copyright-and-artificial-intelligence.