OpenAI Accidentally Deleted Potential Evidence In New York Times Copyright Lawsuit

OpenAI, currently being sued by The New York Times and Daily News for allegedly using their copyrighted content without permission to train its AI models, revealed an accidental deletion of critical data related to the case. The incident has added complexity to an already high-profile lawsuit over AI training practices.

To assist the plaintiffs in their investigation, OpenAI provided two virtual machines to search its training datasets for potentially infringing content. Virtual machines, commonly used for testing and backups, were employed here to enable the plaintiffs’ legal teams to identify evidence of copyrighted material in OpenAI’s datasets. Since November 1, experts and attorneys have spent over 150 hours conducting these searches.

On November 14, OpenAI engineers inadvertently deleted search data stored on one of the virtual machines. Although much of the data was recovered, the folder structure and file names were permanently lost. Without them, the recovered data could not be used to determine how The Times or Daily News content may have been used in training OpenAI’s AI models. Consequently, the plaintiffs had to start their analysis from scratch, at a significant cost in additional effort and time.

In a letter filed with the U.S. District Court for the Southern District of New York, the plaintiffs’ attorneys expressed frustration, citing the loss of an entire week’s worth of work. They acknowledged that there is no evidence the deletion was intentional, but argued that OpenAI is better equipped to conduct the searches itself using its internal tools.

OpenAI responded on November 22, denying the plaintiffs’ claims and attributing the error to a configuration change that the plaintiffs themselves had requested. According to OpenAI’s legal team, the change altered the hard drive’s structure and caused some file organization to be lost. They said that no files were permanently erased and maintained that there was no intentional wrongdoing.

This case raises key issues regarding the legality of using publicly available data for AI training. OpenAI argues that such use qualifies as “fair use,” meaning licensing is not required even for profit-generating AI models like GPT-4. However, OpenAI has proactively signed licensing agreements with several publishers, including the Associated Press, Financial Times, and News Corp. Reports suggest one partner, Dotdash Meredith, receives at least $16 million annually under such a deal.

While OpenAI has neither confirmed nor denied the use of specific copyrighted works in its training datasets, this lawsuit underscores growing tensions between AI companies and content creators. The case could significantly influence intellectual property laws and how they apply to AI development moving forward.
