@AdamMossoff
When Large Language Models (LLMs) exploded onto the scene with the release of ChatGPT in 2022, people called them Artificial Intelligence (AI) and immediately anthropomorphized these computer systems with all sorts of human metaphors. The LLM "trains" on the written material, the LLM "digests" this material into its algorithm, the LLM "hallucinates" when it gives wrong answers, and so on. It is undeniable that massive numbers of copyrighted works are used for this "training" of LLMs; in fact, copyright infringement lawsuits have revealed that Meta (Facebook) and Anthropic relied on massive storehouses of pirated works from piracy websites for "training" their LLMs. To avoid the consequences of their piracy, Meta, Anthropic, and other AI companies, like OpenAI, have exploited the human metaphors used to describe how LLMs function to argue in court that they're not liable for copyright infringement for their unauthorized copying and use of the works that went into building their LLMs. Alternatively, AI companies argue that it's fair use because their LLM systems are simply doing the "transformative" equivalent of a human reading a book and then applying its ideas in his or her own life. Regardless of whether they've argued no infringement or fair use, the AI companies have always maintained that the copies of the copyrighted works they used to build their LLMs are not "in" the LLM systems. They've consistently maintained that there are no literal copies: the works are no more inside the LLMs than a book is literally inside a person's mind after that person reads it. Well, copyright law scholars and researchers have now shown that these claims by AI companies are 100% false. They are completely self-serving arguments that have exploited the anthropomorphized metaphors for LLMs, hiding the actual massive copying and retention of copyrighted works in the LLMs.
The researchers proved this by prompting LLM systems to create stories based on general summaries of plots or themes, and the LLMs responded with literal, word-for-word copies of portions of copyrighted books or of entire copyrighted books. In other words, LLMs are just far more complicated computer programs that rely on large-scale storage of data, which the LLM accesses and retrieves when prompted by its user. Yes, LLMs are an innovative new development in computer programs, but these programs are built on classic digital copying of massive numbers of copyrighted works stored in databases. To invoke the famous philosopher's joke: It's still copyright infringement all the way down. You can read this important article here: https://t.co/kmYr8GMBTJ