New Research Raises Legal Red Flags for AI Developers

Lily Morris

Contributing Writer

A new study from Stanford and Yale researchers has produced detailed evidence that several major AI systems can output copyrighted material with high textual accuracy.

The research focused on four prominent large language models: OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, Anthropic’s Claude 3.7 Sonnet, and xAI’s Grok 3. The findings show that, under specific prompting methods, these models generated long passages from well-known books that closely matched the originals.

These results raise concerns about how such models handle copyrighted training data and whether developers’ public claims about what their models retain are consistent with what the models actually output.

With ongoing legal cases testing the boundaries of data use and copyright protection in AI training, this study introduces material that may be relevant to current and future proceedings.

Why It Matters: The study provides direct evidence that high-performing AI systems can return copyrighted content in a form nearly identical to the source. This introduces legal risks and prompts closer inspection of how these systems are trained and evaluated.

Full Passages from Protected Books Were Reproduced with High Accuracy: Researchers tested the outputs of four leading models and found that Claude 3.7 Sonnet could reproduce nearly the entire text of 1984, with an accuracy rate above 94%. Gemini 2.5 Pro returned large portions of Harry Potter and the Sorcerer’s Stone with a match rate nearing 77%. The reproduced excerpts were long, continuous sequences, some spanning entire chapters.
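For readers who want a feel for what a "match rate" can mean in practice, here is a minimal Python sketch. It is an illustration only: this article does not describe how the study computed its figures, so the word-level comparison and the `match_rate` function name are assumptions, not the researchers' method.

```python
from difflib import SequenceMatcher


def match_rate(reference: str, output: str) -> float:
    """Share of the reference's words that reappear, in order, in the output.

    Assumption: a simple word-level alignment. The study's actual metric
    is not described in this article and may differ.
    """
    ref_words = reference.split()
    out_words = output.split()
    if not ref_words:
        return 0.0
    # autojunk=False disables difflib's heuristic that skips very common
    # tokens in long sequences, which would distort a word-level comparison.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    generated = "the quick brown fox leaps over the lazy dog"
    print(f"match rate: {match_rate(original, generated):.2f}")
```

A score of 1.0 would mean every word of the reference appears in the output in the same order; lower scores indicate gaps or substitutions.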

Prompt Variation Was Key to Extracting These Outputs: The study used a method known as Best-of-N, which involves submitting multiple variations of the same prompt and selecting the version that returns the most complete result. This technique allowed researchers to bypass filters and access large volumes of copyrighted text. The same approach has appeared in legal filings, including in OpenAI’s response to The New York Times lawsuit, where it was described as not reflective of everyday use.
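The selection step described above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the study's code: `best_of_n`, `generate`, and `score` are hypothetical names, the "model" here is a lookup table of canned replies, and output length stands in for "most complete."

```python
from typing import Callable, Iterable, Tuple


def best_of_n(
    variants: Iterable[str],
    generate: Callable[[str], str],
    score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Try each prompt variant and keep the highest-scoring output.

    `generate` is a stand-in for a model call and `score` for whatever
    completeness measure is used; both are assumptions for illustration.
    """
    best = None
    for prompt in variants:
        output = generate(prompt)
        value = score(output)
        if best is None or value > best[2]:
            best = (prompt, output, value)
    if best is None:
        raise ValueError("variants must not be empty")
    return best


if __name__ == "__main__":
    # Canned replies take the place of a real model in this toy example.
    canned = {"v1": "short", "v2": "a much longer reply", "v3": ""}
    prompt, output, value = best_of_n(canned, canned.get, len)
    print(prompt, value)
```

The point of the sketch is the structure: many attempts, one selected result. That selection is why critics argue the outputs may not reflect what a typical user would see from a single prompt.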

Public Claims by AI Developers Were Not Supported by the Findings: Companies such as OpenAI and Google have told regulators that their models do not store original inputs or contain copies of training data. The study’s results, which include long-form replication of copyrighted material with high precision, directly contradict these statements.

The Legal Implications Extend to Multiple Ongoing Cases: The research may affect how courts view the legality of current model behavior. If copyrighted content is being reproduced with little to no transformation, arguments based on fair use become more difficult to defend. Existing lawsuits involving authors, media outlets, and rights organizations could use this study as supporting evidence that model behavior aligns with reproduction rather than independent creation.

Legal and Technical Experts Remain Divided on How to Classify This Behavior: The study has prompted further debate about whether these models are retaining data or recreating it during inference. Stanford law professor Mark Lemley, who has represented AI developers, acknowledged the uncertainty around whether models contain copies or produce them dynamically. The difference could determine whether this behavior is considered infringement under current copyright law.

Go Deeper -> Researchers Just Found Something That Could Shake the AI Industry to Its Core – Futurism

Trusted insights for technology leaders

Our readers are CIOs, CTOs, and senior IT executives who rely on The National CIO Review for smart, curated takes on the trends shaping the enterprise, from GenAI to cybersecurity and beyond.

January 19, 2026
