
Microsoft Aims to Track Influence of AI Training Data
Microsoft is embarking on a research project aimed at tracing the influence of specific training examples on generative AI models. The initiative, revealed in a December job listing, seeks to understand how particular data, such as photos and books, contributes to the outputs of these models.
The project's goal is to develop methods to efficiently and usefully estimate the impact of individual data points on AI-generated content. The job listing notes that current neural network architectures are opaque about the sources of their creations and argues that this should change. Doing so could open the door to a system of incentives, recognition, and even payment for people who contribute valuable data to future AI models.
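The listing does not say which techniques Microsoft will use, but one well-known family of data-influence methods scores each training example by how much a gradient step on it would also reduce the loss on a given test output (a TracIn-style gradient similarity). The sketch below is purely illustrative, using a linear model with squared error rather than any real generative model; all function names and the toy data are assumptions.

```python
import numpy as np

def per_example_gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w,
    for a single (x, y) example under a linear model."""
    return (w @ x - y) * x

def influence_score(w, train_x, train_y, test_x, test_y):
    """TracIn-style proxy for influence: the dot product of a training
    example's gradient with a test example's gradient. A large positive
    score means a training step on this example would also lower the
    test loss, i.e. the example 'helped' produce that output."""
    g_train = per_example_gradient(w, train_x, train_y)
    g_test = per_example_gradient(w, test_x, test_y)
    return float(g_train @ g_test)

# Toy data: one training point aligned with the test point, one orthogonal.
w = np.zeros(2)
test = (np.array([1.0, 0.0]), 1.0)
aligned = (np.array([2.0, 0.0]), 2.0)
orthogonal = (np.array([0.0, 1.0]), 1.0)

s_aligned = influence_score(w, *aligned, *test)       # 4.0
s_orthogonal = influence_score(w, *orthogonal, *test)  # 0.0
```

In this toy setup the aligned training example gets a positive score while the orthogonal one gets zero, which is the intuition behind "which data points mattered for this output." Scaling such estimates to billion-parameter generative models efficiently is exactly the open problem the job listing describes.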
The Copyright Conundrum
This research comes at a crucial time, as AI-powered generators for text, code, images, video, and music face numerous intellectual property lawsuits. Many AI companies train their models on vast datasets scraped from the internet, much of which contains copyrighted material. While these companies often invoke the "fair use" doctrine to justify the practice, creatives are pushing back against what they see as unlawful use of their content.
Microsoft is not immune to these legal challenges. The New York Times has sued Microsoft and OpenAI, alleging copyright infringement due to the use of millions of Times articles in training their models. Additionally, software developers have sued Microsoft over the use of their code in training GitHub Copilot.
Data Dignity and the Future of AI
Microsoft's research effort, dubbed "training-time provenance," involves Jaron Lanier, a prominent technologist and scientist at Microsoft Research. Lanier is a proponent of "data dignity," which emphasizes the connection between digital content and the individuals who created it.
Lanier envisions a system where the most significant contributors to an AI-generated output are acknowledged and rewarded. For instance, if an AI model creates a unique piece of content, the artists, writers, or other creators whose work heavily influenced the output would be recognized and potentially compensated.
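Lanier's proposal leaves open how influence scores would translate into compensation. One hypothetical scheme, shown below only as a sketch, clips negative scores (contributors whose data did not help the output) and normalizes the rest into payment shares; the function name and contributor data are invented for illustration.

```python
def payout_shares(influence):
    """Hypothetical scheme: convert per-contributor influence scores into
    payment shares by clipping negative scores to zero and normalizing
    the remainder so the shares sum to 1."""
    clipped = {name: max(score, 0.0) for name, score in influence.items()}
    total = sum(clipped.values())
    if total == 0:
        return {name: 0.0 for name in influence}  # nobody measurably contributed
    return {name: score / total for name, score in clipped.items()}

shares = payout_shares({"artist_a": 3.0, "writer_b": 1.0, "artist_c": -0.5})
# artist_a and writer_b split the pool 75/25; artist_c receives nothing.
```

Any real system would face much harder questions, such as how reliable the underlying influence estimates are and at what threshold a contribution merits payment.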
Several companies are already exploring similar concepts. Bria, an AI model developer, aims to compensate data owners based on their "overall influence." Adobe and Shutterstock also offer payouts to dataset contributors. However, such programs are not yet the norm; many large labs instead offer opt-out mechanisms rather than contributor compensation.
While this project may only be a proof of concept, it underscores the growing importance of addressing the ethical and legal considerations surrounding AI training data. Other labs, including Google and OpenAI, have advocated for weakening copyright protections for AI development. Whether Microsoft's research will lead to meaningful change remains to be seen, but it signals a potential shift toward greater transparency and fairness in the world of AI.
Source: TechCrunch