Mehtab Khan (Yale Law School) and Alex Hanna (Distributed AI Research Institute) have posted “The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability” (19 Ohio St. Tech. L.J. (Forthcoming 2023)) on SSRN. Here is the abstract:
There has been increased attention toward the datasets that are used to train and build AI technologies from the computer science and social science research communities, but less from legal scholarship. Both Large-Scale Language Datasets (LSLDs) and Large-Scale Computer Vision Datasets (LSCVDs) have been at the forefront of such discussions, due to recent controversies involving the use of facial recognition technologies, and the discussion of the use of publicly available text for the training of massive models which generate human-like text. Many of these datasets serve as “benchmarks” to develop models that are used both in academic and industry research, while others are used solely for training models. The process of developing LSLDs and LSCVDs is complex and contextual, involving dozens of decisions about what kinds of data to collect, label, and train a model on, as well as how to make the data available to other researchers. However, little attention has been paid to mapping and consolidating the legal issues that arise at different stages of this process: when the data is being collected, after the data is used to build and evaluate models and applications, and when that data is distributed more widely.
In this article, we offer four main contributions. First, we describe what kinds of objects these datasets are, how many different kinds exist, what types of modalities they encompass, and why they are important. Second, we provide more clarity about the stages of dataset development – a process that has thus far been subsumed within broader discussions about bias and discrimination – and the subjects who may be susceptible to harms at each point of development. Third, we provide a matrix of both the stages of dataset development and the subjects of dataset development, which traces the connections between stages and subjects. Fourth, we use this analysis to identify some basic legal issues that arise at the various stages in order to foster a better understanding of the dilemmas and tensions that arise at every stage. We situate our discussion within the wider context of current debates and proposals related to algorithmic accountability. This paper fills an essential gap when it comes to comprehending the complicated landscape of legal issues connected to datasets and the gigantic AI models trained on them.