After The New York Times filed suit against OpenAI and Microsoft for using its journalism to train their AI models, it became clear that the dispute would not stay contained. The controversy extends well beyond text: once the audio front matured, cloning a human voice, or inventing a new one, had become almost child's play. That brings us to the next frontier: the shift from static images to moving pictures on screens, video.
The static image, already a fragile medium, faces new strains. A leak, described as incidental, exposed a list of more than 16,000 artists whose work could allegedly be used without permission to train Midjourney, one of the leading AI image-generation tools today, alongside Stable Diffusion and OpenAI's DALL-E. Some of those 16,000 artists were already named in a class-action lawsuit filed last year against Stability AI (creator of Stable Diffusion), Midjourney, and DeviantArt, a platform for photography and for digital as well as traditional art. All three leverage software that turns text into images.
The list’s origin is notable: a spreadsheet hosted on Google Docs, allegedly created by Midjourney developers. Known as the Midjourney Style List, the document, as reported by the London-based The Art Newspaper, was reportedly used to study how the program could imitate specific artists and styles. Jon Lam, a designer at Riot Games, surfaced it in a series of posts; his X profile shows screenshots of a conversation among Midjourney developers, with even the CEO reportedly taking part. The dialogue discusses resources that would let machine-made visuals imitate real creators, and the arrival of a new era for a wide group of aspiring artists, including the twenty-some thousand who could be drawn into the training process. The sentiment in some comments was blunt: resources exist to access content that would train the model to replicate other artists. For many, the implication was clear: once their works appeared in the training data, those affected would be drawn into the legal process. The list spans independent creators and major institutions alike, reaching names such as Pablo Picasso, Frida Kahlo, and even Walt Disney, whose works figure in the concerns surrounding Midjourney’s training inputs.
The core question centers on permission. Some voices in the conversation acknowledge that a portion of the training material may have been derived from content used without consent; one participant’s provocative remark suggested that, by running works through extracted datasets, models could be trained to "forget" their origins. The legal issues persist: the negligence claim was partially dismissed last October, while the copyright-infringement questions involving the LAION-5B dataset remain unresolved. The complaint was amended and refiled in November, highlighting the ongoing tension between innovation and rights.
Although the document was hard to access outside closed channels, it circulated widely once Jon Lam made it public, and it remains searchable on archive sites and in a handful of the Riot Games designer’s posts. The irregular publication, together with the related lawsuits, has fueled speculation about potential offenses tied to the improper use of creative work. Artists named on the list could be identified in the ongoing judicial process, which adds pressure on the upper echelons of the field. And the involvement of major names in the creative world, from Picasso and Kahlo to Disney, could influence the trajectory of the proceedings and broaden scrutiny of how training data is assembled and used.
Having barely digested the controversies over text, audio, and images, AI now turns to training video models this January. Observers note that the next wave of AI training will hinge on data quality and access: high-quality, diverse data is essential for model performance, and continual training is needed to improve accuracy. The implications of training data quality stretch far beyond any single product; they shape the speed and direction of AI development.
The industry’s reality remains, in some respects, unchanged: a tug-of-war between ambition and permission. The market for AI tools keeps growing, and so does the interest in protecting the creators whose works those tools learn from. OpenAI, now among the world’s most valuable AI companies, continues to navigate a landscape where copyright concerns intersect with rapid innovation. The discourse includes artists who fear that their online presence could be repurposed to generate new digital versions of their work without consent, blurring the lines between homage, transformation, and misappropriation. In this climate, the balance between creative freedom and respect for original rights remains a moving target, shaping policy debates and practical norms across the industry.