EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺

Community Article Published July 3, 2024

Two weeks ago, OpenFuture and Mozilla published a template proposal for the EU AI Act-mandated sufficiently detailed summary for GPAI training datasets. The template was devised after conversations with a few contributors (including yours truly) and is introduced by an excellent policy brief led by Zuzanna Warso with Maximilian Gahntz and Paul Keller. The policy brief is available here: Towards Robust Training Data Transparency.

At the time of writing, this document stands out as one of the most comprehensive proposals for the EU AI Act training data summary. It addresses both the wide range of legitimate questions stakeholders will have about GPAI systems and the necessary trade-offs between different constraints. While further discussions and work by the AI Office are needed to converge on a final version, the breadth of considerations covered by this initial proposal makes it a strong starting point, or a strong reference document to support ongoing conversations. The proposal is bolstered in particular by the authors' attention to two critical aspects: matching the requested information to the regulatory mandate of the data summary, and addressing the diversity of data sources and uses involved in training a General-Purpose AI system.

Figure: Overview of the proposed Data Summary Blueprint

Motivation, Mandate, and Level of Detail

While the EU data summary shares some of its goals and motivations with other formats of dataset documentation, such as data statements, datasheets, or data nutrition labels, understanding how its requirements differ from theirs is essential to managing the various trade-offs between different interests inherent in documenting large and complex AI artifacts.

To that end, the policy brief begins by examining the mandate for this data summary: that it be "sufficiently detailed" to meet the needs of EU citizens and organizations with legitimate interests in upholding their rights under EU law while "taking [...] the need to protect trade secrets [into due account]". The current text of the act explicitly mentions copyright as one such category of legitimate interest in this context. EU laws and charters also support rights to privacy and data protection, science, non-discrimination, and fair competition, among others, all of which depend on information about the training data of GPAIs for their safeguard, as outlined in the brief.

This frames the EU data summary as an implicit trade-off between two competing interests. On the one hand, the information provided about the training data must be meaningful enough to give stakeholders with varied legitimate questions about the development and properties of GPAIs a sufficient starting point for their investigations. On the other hand, it must show "due consideration" for trade secrets, though not absolute deference. The template should also avoid requiring overly complicated processes that may exclude well-meaning but less-resourced actors whose organizational constraints differ from those of larger companies. It can do so by aiming for straightforward, self-explanatory minimal requirements for the types of information requested.

Proposed Blueprint Approach and Highlights

The proposal presented by OpenFuture and Mozilla addresses these tensions by introducing a blueprint structured around specific questions. These questions are informed both by a practitioner's understanding of the different stages of data curation that go into training a model and by the categories of legitimate interest outlined in the brief. They are organized into sections covering general information about the dataset, data sources and individual datasets, data diversity, and data processing during training.

The Blueprint is well worth reading in its entirety, and comments from practitioners and other stakeholders are welcome! To get started, the following highlights should provide a sense of the types of questions addressed and of the approach taken:

  • Differences of data source types, data origins, and dataset uses: the EU AI Act asks for a summary of the data used in the training of a GPAI system. In current model training practice, however, this covers many different types of data, put to many different uses. These require different kinds of documentation depending on how they're obtained (e.g. publicly accessible web data, data licensed from a copyright holder, data purchased from data workers, user data from commercial system deployment, etc.) and what use they are put to (e.g. pre-training with a given training objective, fine-tuning, validation or evaluation, performance or safety, etc.). This diversity underlines a point of tension between simplicity requirements and the level of detail the summary needs: while it may be tempting to propose a single documentation format for all of these types of datasets and data origins, the meaningful differences between their social and legal contexts and their impact on the trained system would risk making such a format irrelevant, by drowning out the specific information that is most relevant in each context.
  • Documenting the head of the distribution of web domains: One training data context that has received particular attention in discussions of AI data is the use of web-crawled data in pre-training. Data obtained from publicly available web sources, either by processing CommonCrawl archives or via a company's own web scraping tool, makes up a significant portion of the material that would be covered in a GPAI training data summary. Web-scale crawled datasets are difficult to document systematically, especially in a static format, but one way to provide meaningful information to rights holders and organizations with a legitimate interest is to list the top web domains they include. For example, Google DeepMind provided the top-20 domains of the MassiveWeb dataset (Gopher LLM paper, 2021, Appendix A), which together accounted for 15% of the overall data and give a good sense of the types of text prioritized in the curation process. More recent web-based datasets have become orders of magnitude larger than MassiveWeb, but the top domains still provide meaningful information. For example, in the recently released FineWeb dataset (2024), which includes data from 4 million domains, the top-100 domains account for 5% of the pages in the dataset, and the top-1,000 and top-10,000 account for 13% and 28% respectively (while only representing 0.025% and 0.25% of the domains). Providing these lists as part of the data summary offers high value to parties with a legitimate interest, who may independently investigate the kind of text and media hosted on those web domains to draw conclusions about the technology, while minimizing the work required of developers to predict what those questions may be.
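
To make the top-domain idea concrete, here is a minimal Python sketch of how such a list and its coverage figures could be computed. It is not taken from the Blueprint or from any dataset's tooling: the function names `domain_of` and `top_domain_coverage` are hypothetical, and it assumes the dataset's pages are available as a plain list of URLs. It also treats the URL hostname (minus a leading `www.`) as the "domain", which is a simplification since subdomains are counted separately.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, without a leading 'www.'.

    Assumption: the hostname stands in for the 'domain'; subdomains such as
    'blog.example.com' are therefore counted separately from 'example.com'.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domain_coverage(urls, ks=(100, 1000, 10000)):
    """Rank domains by page count and report the coverage of the top-k.

    Returns (ranked, report) where `ranked` is a list of (domain, pages)
    pairs in decreasing order of page count, and `report[k]` gives the
    share of pages and the share of domains covered by the top-k domains.
    """
    counts = Counter(domain_of(u) for u in urls)
    counts.pop("", None)  # ignore URLs with no parseable hostname
    ranked = counts.most_common()
    if not ranked:
        return [], {}

    total_pages = sum(counts.values())
    report = {}
    for k in ks:
        top = ranked[:k]
        report[k] = {
            "page_share": sum(pages for _, pages in top) / total_pages,
            "domain_share": len(top) / len(ranked),
        }
    return ranked, report


if __name__ == "__main__":
    # Toy input; a real summary would stream URLs from the dataset index.
    sample = [
        "https://www.a.com/1",
        "https://a.com/2",
        "http://b.org/x",
        "https://c.net/",
    ]
    ranked, report = top_domain_coverage(sample, ks=(1, 2))
    for domain, pages in ranked:
        print(domain, pages)
    print(report)
```

The `domain_share` figure corresponds to the percentages quoted above: 1,000 domains out of 4 million is 0.025% of the domains, and 10,000 is 0.25%. A published summary would likely list the ranked domains themselves, with the coverage figures as context.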

Every aspect of the proposed Data Summary Blueprint reflects a similar attempt to find the right balance between utility to the various stakeholders and feasibility for the developers. Striking that balance will be essential to making the summary practical and to fulfilling its role as a tool for sustainable governance of the technology. The next few months will be critical in enabling this outcome for the EU 🇪🇺.

Additional Resources