DATA PROVIDER-HOST AGREEMENT
BigScience is an open research collaboration involving over 1000 participants from 60 countries, focusing its collaborative research efforts on the study and development of natural language processing (hereinafter NLP) systems.
The project is motivated by recent evolutions in the field brought about by the growing capabilities, popularity, size and cost of methods based on Large Language Models (LLMs). The computational resources and data needed to develop LLMs are affordable only to a handful of institutions, which often conduct this research behind closed doors despite its significant impact on society.
Thanks to the support of a large compute grant on the French Jean Zay public supercomputer, the participants of BigScience can instead collaborate across a range of academic institutions and organizations to create an openly accessible Large Language Model (LLM), available to the general public. This can be used to fuel research, governance, regulation, and future technology.
In particular, the choice and governance of the Data used to develop these technologies are of paramount importance. Previous work has mainly relied on text obtained from snapshots of the Internet, because such Data is abundant and readily available. Unfortunately, this choice of convenience raises multiple ethical and legal issues and leads the technology to amplify harmful biases in its deployed applications.
BigScience takes an alternative approach to identifying Data sources for a training corpus. Specifically, our participants built an annotated catalog of high-quality language resources to cover the diversity of languages and social contexts that should make up such a training corpus. There are two essential parties in charge of making this data available, under the auspices of BigScience: first, the Data Providers, any institution willing to license datasets of interest purely for research purposes on a royalty-free basis; and second, the Data Hosts, institutions willing to contribute their technical capabilities in order to host the data provided, enabling society to access it. These are the champions of data sharing and openness in research.
This License governs the use of Data as informed by the BigScience Ethical Charter and the values set forth in the BigScience workshop. These establish the perspective informing this license that text and language are above all human-centric data. This means that data subjects have inherent rights, protections, and interests that exist outside of the Data's Machine Learning context, and which we also need to account for.
Although the BigScience community does not aim to impose its values on potential users of the Data, it is determined to take tangible steps towards protecting the community from inappropriate uses of the work being developed by BigScience.
Consequently, the main objective of this Data Provider Agreement (the Agreement) is to serve as the core instrument enabling and governing the sharing of data between the interested parties, for the benefit of open research. Both parties strive to serve this goal by entering into this Agreement.
“Agreement” means this Agreement including all its Exhibits.
“Confidential Information” means information that one Party discloses to the other Party under this Agreement and that is marked as confidential or would normally be considered confidential.
“Data” means machine-readable informational content (individually or as a whole, i.e., a collection of Datasets) made available by the Data Provider.
“Meta-Data” means supplementary information about the Data, for example, summaries or visualizations of the data, restricted excerpts, authorship information and high-level statistics (e.g., word counts).
“Dataset” means one specific collection of Data that the Data Provider has the necessary rights for sharing under this agreement.
“Processed Dataset” is a Dataset that is further processed via Data transformations, including additional modifications to one dataset (e.g., personal information removal, additional annotations, extracted text, subsetting by language, removal of individual data points), dataset combinations, etc.
“Data Host” means a legal entity permitted to process, prepare, and manage subsequent third-party access to the Data of the Data Provider under the scope of this agreement.
“Data Provider” means the individual or legal entity granting permission to the Data Host to access and further manage the Data for the purpose of this Agreement.
“Derived Work” means any artifact created using Data covered by this Agreement.
“Parties” means any individual or entity entering into this Agreement.
“Third Parties” means individuals or legal entities that are not controlled by any of the involved parties in this Agreement.
“User” means an individual and/or legal entity having access to the data provided by the Data Provider and hosted by the Data Host for the purpose of this Agreement.
The Data Provider grants to the Data Host a non-exclusive, non-transferable, non-sublicensable, irrevocable, perpetual, royalty-free and worldwide license to use (that is, to access, store, prepare, process, label and/or share) the agreed-upon Data (see List of Datasets in Exhibit A) in accordance with the use case scenarios and further (re)distribution policy, as stated in Annex III (see below).
Neither party will charge any fees, royalties or costs associated with implementing this Agreement. All accruing costs or expenses of any party in relation to this Agreement are solely to be carried by the responsible party alone.
The Data Provider shall make reasonable efforts to provide the Data to the Data Host using up-to-date security standards (this may include but is not limited to data transmission via secure transport protocols, storage on secured servers as well as secure data processing). In case the data is made accessible via authentication, the Data Host ensures that the authentication method used meets up-to-date standards.
Neither party shall be liable to the other for a failure of performance undertaken in this Agreement if prevented from doing so by any circumstances beyond its reasonable control (such as but not limited to fire, flood, drought, war, explosion, terrorism, computer hacking and viruses, acts of any government body, perils of the sea and air).
Each party shall treat this Agreement and all information and/or business practices of the other party it acquires or becomes knowledgeable of as confidential. Confidential information does not include any public or generally available information or any information independently obtained or available prior to entering this Agreement. Notwithstanding the foregoing, either party is allowed to reveal confidential information if it is required by law to do so.
This Agreement, including its exhibits and attachments, constitutes the entirety of the Agreement between the parties and supersedes any prior negotiations or understanding.
This Agreement can be amended or modified by mutual consent at any time. The amendment and/or modification must be put forth in writing.
Any dispute that may arise from the breach of this Agreement will first be subject to an alternative dispute resolution phase under the auspices of the BigScience Community.
The provisions set forth in section 6(b) (Limits of Liability), 10 (Term and Termination), 12 (Confidentiality), 15 (Governing Law), 16 (Survival) and Exhibit A (Section Restrictions of Use in the Dataset section) shall survive the termination of this agreement and continue to bind both parties.
If any provision of this Agreement is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein. The parties agree to substitute such a provision with a valid provision most closely resembling the intent of such severed provision.
Unless and to the extent expressly agreed to in writing between the Data Host and the Data Provider, no other terms and conditions shall be binding on either party.
The parties acknowledge that they fully understand and agree to all of their rights and obligations under this Agreement.
DATA PROVIDER DATA HOST
Date and Location Date and Location
DATA PROVIDER SCHEDULE (EXHIBIT A)
Data Provider Name
Data Provider Contact Information
Special Conditions (if applicable)
Data Management Plan
List of Datasets
License of Datasets (if more than one, please assign in List of Datasets)
Restrictions - Please indicate if any restrictions apply to any of the above listed datasets
Scope / use cases:
▯ under condition: openly released models, results, and artifacts
▯ under condition: use RAIL license for ML artifacts (has to be attached)
▯ under condition: value alignment (determined by data host)
▯ under condition: value alignment (data modelers sign click-through form)
Acknowledging the immense value and benefits that your datasets may provide, and being conscious and respectful towards the different economic interests that you may have, this Agreement offers the Data Provider a flexible set of optional frameworks for the use, re-use, and distribution of data:
▯ The Data Provider permits the Data Host to use the Data for the purpose set out in this Agreement. The Data Host is not allowed to make the Data publicly available outside of the remits of this license (this does not include Meta Data).
▯ The Data Provider permits the Data Host to make the Data (as a whole or in parts or processed) available to downstream users upon signing a non-dissemination agreement.
▯ The Data Provider permits the Data Host to make the Data (as a whole or in parts or processed) available to downstream users using a system that supports authentication/synchronization.
▯ The Data Provider permits the Data Host to make the Data (as a whole or in parts or processed) available with modifications such as anonymizing personal and/or sensitive information about individuals.
▯ The Data Provider permits the Data Host to use the Data for the purpose set out in this Agreement. Additionally, the Data Host is allowed to make the Data publicly available under the Data license (select one) provided by the Data Provider.
▯ CC BY 4.0 (Link)
▯ CC BY-NC-ND 4.0 (Link)
▯ CC BY-NC-SA 3.0 (Link)
▯ CC BY-NC-SA 4.0 (Link)
▯ CC BY-SA 3.0 (Link)
▯ CC BY-SA 4.0 (Link)
▯ CC BY-NC 4.0 (Link)
▯ Microsoft Research Data License Agreement (Link)
▯ Custom license agreement (see Attachment if applicable)
▯ Linux Foundation CDLA Permissive
▯ Linux Foundation CDLA Restrictive
RAIL Model License (EXHIBIT B)
Find here: BLOOM RAIL License v1.0
Potential further clauses:
X. CONFLICT RESOLUTION
In the case of any dispute, the parties shall attempt to resolve the issue by negotiation first.
If such negotiations cannot resolve the issue within six months, either party may bring the issue to the applicable court of law.
X. NO WAIVER
The failure of the Data Host or Data Provider to enforce or execute any right or provision of this Agreement shall not constitute a waiver of that right or provision.
Headings and Section titles in this Agreement are only for convenience and are not to be considered in construing this Agreement.