Create 05_add_dataset_metadata
sections/05_add_dataset_metadata
ADDED
@@ -0,0 +1,77 @@
| 1 |
+
## **5 Add Metadata to the Dataset Card**
|
| 2 |
+
|
| 3 |
+
### **Overview**
|
| 4 |
+
|
| 5 |
+
At the top of the `README.md` file, include metadata about the dataset in YAML format:
|
| 6 |
+
\---
|
| 7 |
+
language: …
|
| 8 |
+
license: …
|
| 9 |
+
size\_categories: …
|
| 10 |
+
pretty\_name: '...'
|
| 11 |
+
tags: …
|
| 12 |
+
dataset\_summary: …
|
| 13 |
+
dataset\_description: …
|
| 14 |
+
acknowledgements: …
|
| 15 |
+
repo: …
|
| 16 |
+
citation\_bibtex: …
|
| 17 |
+
citation\_apa: …
|
| 18 |
+
\---
|
| 19 |
+
|
| 20 |
+
For the full spec, see the Dataset Card documentation, specification, and template:
|
| 21 |
+
|
| 22 |
+
* [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
|
| 23 |
+
* [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
|
| 24 |
+
* [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)
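
As a rough illustration of the header layout described above, the following sketch reads the leading `---` block of a README and reports which top-level keys are present. The helper names, the choice of expected keys, and the example values (`en`, `biology`) are assumptions made for this example; the Hub does not necessarily require every one of these keys, and the simple line-based parsing is not a full YAML parser.

```python
import re

# Keys taken from the header shown above. Treating them all as "expected"
# is an assumption of this sketch, not a rule enforced by the Hub.
EXPECTED_KEYS = ["language", "license", "size_categories", "pretty_name", "tags"]


def front_matter_keys(readme_text):
    """Return the top-level keys found in the leading '---' block, in order."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no header at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter reached
            return keys
        match = re.match(r"^([A-Za-z_][\w-]*):", line)  # unindented "key:"
        if match:
            keys.append(match.group(1))
    return []  # header never closed: treat it as missing


def missing_keys(readme_text, expected=EXPECTED_KEYS):
    """List the expected keys that do not appear in the header."""
    found = set(front_matter_keys(readme_text))
    return [key for key in expected if key not in found]


# Hypothetical README content, for illustration only.
example = """---
language: en
license: other
tags:
- biology
---
# My dataset
"""

print(front_matter_keys(example))
print(missing_keys(example))
```

A check like this could run before uploading, so that a missing or unclosed header is caught locally rather than on the Hub.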
|
| 25 |
+
|
| 26 |
+
To allow the dataset to be loaded automatically through the `datasets` Python library, additional information is needed in the header of the README.md. It should reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):
|
| 27 |
+
configs:
|
| 28 |
+
dataset\_info:
|
| 29 |
+
|
| 30 |
+
While it is possible to create these by hand, it is highly recommended to let them be generated automatically on upload: load the dataset locally with [datasets.load\_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the Hub with [DatasetDict.push\_to\_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub)
|
| 31 |
+
|
| 32 |
+
* [Example of uploading data using push\_to\_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
|
| 33 |
+
* See below for more details about how to use push\_to\_hub(...) for different common formats
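
The load-then-push workflow described above might look roughly like the sketch below for a single CSV file. The function name, the CSV format, and the repository id in the usage comment are assumptions for illustration; the linked MegaScale script shows what was actually used for that dataset. Pushing requires being logged in to the Hub.

```python
def upload_csv_dataset(csv_path, repo_id, private=True):
    """Load a local CSV file and push it to the Hub as a dataset.

    push_to_hub() writes the data files and fills in the `configs:` and
    `dataset_info:` entries of the README.md header automatically.
    """
    # Imported inside the function so that the sketch can be imported
    # without the `datasets` library installed.
    from datasets import load_dataset

    # With a single file and no split given, this returns a DatasetDict
    # containing one "train" split.
    dataset = load_dataset("csv", data_files=csv_path)
    dataset.push_to_hub(repo_id, private=private)
    return dataset


# Hypothetical usage (file name and repository id are placeholders):
# upload_csv_dataset("data/train.csv", "my-org/my-dataset")
```

After the push, the metadata fields discussed in this section can be added to the generated README.md header alongside the automatically created entries.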
|
| 34 |
+
|
| 35 |
+
### **Metadata fields**
|
| 36 |
+
|
| 37 |
+
#### License
|
| 38 |
+
|
| 39 |
+
* If the dataset is licensed under an existing standard license, then use it
|
| 40 |
+
* If the license is unclear, contact the authors for clarification
|
| 41 |
+
* To license it under the Rosetta License:
|
| 42 |
+
* Add the following to the dataset card:
|
| 43 |
+
|
| 44 |
+
license: other
|
| 45 |
+
|
| 46 |
+
license\_name: rosetta-license-1.0
|
| 47 |
+
|
| 48 |
+
license\_link: LICENSE.md
|
| 49 |
+
|
| 50 |
+
* Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the dataset repository
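
One way to perform the license upload is with `HfApi.upload_file` from the `huggingface_hub` library, sketched below. The function name, the optional `api` argument, and the repository id in the usage comment are assumptions for this example; uploading through the web interface works equally well.

```python
def upload_license(repo_id, license_path="LICENSE.md", api=None):
    """Upload a local license file to the root of a dataset repository."""
    if api is None:
        # Imported lazily so the sketch can be imported without the library.
        from huggingface_hub import HfApi

        api = HfApi()
    return api.upload_file(
        path_or_fileobj=license_path,
        path_in_repo="LICENSE.md",  # must match `license_link: LICENSE.md`
        repo_id=repo_id,
        repo_type="dataset",  # the default repo type is "model"
    )


# Hypothetical usage (repository id is a placeholder):
# upload_license("my-org/my-dataset")
```

Keeping `path_in_repo` equal to the `license_link` value ensures the link in the dataset card resolves to the uploaded file.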
|
| 51 |
+
|
| 52 |
+
#### Citation
|
| 53 |
+
|
| 54 |
+
* If the dataset has a DOI (e.g. associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate the BibTeX citation
|
| 55 |
+
* For the APA citation, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)
|
| 56 |
+
|
| 57 |
+
#### tags
|
| 58 |
+
|
| 59 |
+
* Standard tags used when searching for datasets on Hugging Face
|
| 60 |
+
* typically:
|
| 61 |
+
|
| 62 |
+
\- biology
|
| 63 |
+
|
| 64 |
+
\- chemistry
|
| 65 |
+
|
| 66 |
+
#### repo
|
| 67 |
+
|
| 68 |
+
* URL of the GitHub repository, Figshare record, etc. for the data or project
|
| 69 |
+
|
| 70 |
+
#### citation\_bibtex
|
| 71 |
+
|
| 72 |
+
* Citation in BibTeX format
|
| 73 |
+
* You can use https://www.doi2bib.org/
|
| 74 |
+
|
| 75 |
+
#### citation\_apa
|
| 76 |
+
|
| 77 |
+
* Citation in APA format
|