RosettaCommons/MolecularDatasetCurationGuide

maom committed on Feb 3

Commit 6e640e8 (verified) · 1 Parent(s): cf03a4c

Create 05_add_dataset_metadata


sections/05_add_dataset_metadata ADDED +77 -0

	@@ -0,0 +1,77 @@

+## **5 Add Metadata to the Dataset Card**
+### **Overview**
+At the top of the `README.md` file, include metadata about the dataset in YAML format:
+\---
+language: …
+license: …
+size\_categories: …
+pretty\_name: '...'
+tags: …
+dataset\_summary: …
+dataset\_description: …
+acknowledgements: …
+repo: …
+citation\_bibtex: …
+citation\_apa: …
+\---
+For the full spec, see the Dataset Card documentation, specification, and template:
+* [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
+* [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
+* [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)
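The header fields listed above can be checked mechanically before uploading. The following is a minimal sketch using only the Python standard library; the helper names `front_matter_keys` and `missing_keys` are hypothetical, and the parser only recognizes top-level `key:` lines rather than full YAML, so treat it as a quick sanity check and not a replacement for the Hub's own validation.

```python
# Field names copied from the metadata header shown in this section.
EXPECTED_KEYS = [
    "language", "license", "size_categories", "pretty_name", "tags",
    "dataset_summary", "dataset_description", "acknowledgements",
    "repo", "citation_bibtex", "citation_apa",
]


def front_matter_keys(readme_text: str) -> list[str]:
    """Return the top-level keys of the leading '---' ... '---' block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no metadata header at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter reached
        # Top-level keys start in column 0; skip list items and comments.
        is_top_level = line and not line[0].isspace()
        if is_top_level and not line.startswith(("-", "#")) and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return []  # header was never closed


def missing_keys(readme_text: str, expected=EXPECTED_KEYS) -> list[str]:
    """List the expected fields that the header does not define."""
    present = set(front_matter_keys(readme_text))
    return [key for key in expected if key not in present]
```

Calling `missing_keys(open("README.md").read())` on a dataset card would then list whichever of the fields above are still absent.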
+To allow the dataset to be loaded automatically through the `datasets` Python library, the header of the README.md needs additional fields that reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):
+configs:
+dataset\_info:
+While it is possible to create these by hand, it is highly recommended to let them be generated automatically at upload time: load the dataset locally with [datasets.load\_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the Hub with [DatasetDict.push\_to\_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub)
+* [Example of uploading data using push\_to\_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
+* See below for more details about how to use push\_to\_hub(...) for different common formats
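The load-then-push workflow recommended above might look roughly like the following. This is a sketch under stated assumptions and not the script linked above: it assumes the data is stored as local CSV files, and the function name `upload_dataset` and its arguments are hypothetical. Pushing requires the `datasets` package and an authenticated Hugging Face session.

```python
def upload_dataset(data_files: str, repo_id: str,
                   config_name: str = "default",
                   private: bool = False) -> None:
    """Load local CSV files and push them to a dataset repo on the Hub."""
    # Imported inside the function so this module can be imported
    # even where the `datasets` package is not installed.
    from datasets import load_dataset

    # With no `split` argument, load_dataset returns a DatasetDict.
    dataset = load_dataset("csv", data_files=data_files)

    # push_to_hub uploads the data and writes the `configs` and
    # `dataset_info` entries into the README.md header.
    dataset.push_to_hub(repo_id, config_name=config_name, private=private)
```

A call such as `upload_dataset("data/*.csv", "my-org/my-dataset")` would then upload the files; both the path pattern and the repo id here are placeholders.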
+### **Metadata fields**
+#### License
+* If the dataset is licensed under an existing standard license, then use it
+* If it is unclear, then the authors need to be contacted for clarification
+* To license it under the Rosetta License:
+  * Add the following to the dataset card:
+    license: other
+    license\_name: rosetta-license-1.0
+    license\_link: LICENSE.md
+  * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the Dataset
+#### Citation
+* If the dataset has a DOI (e.g. associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate the BibTeX entry
+* For the APA form, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)
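As a scripted alternative to the web tools above, a DOI can usually be resolved to a formatted citation through DOI content negotiation at `https://doi.org/`. This is a different technique from the converters this guide recommends. The sketch below uses only the standard library; the function names are hypothetical, and whether a given DOI returns BibTeX or APA text depends on its registration agency.

```python
import urllib.request

# Accept headers used for DOI content negotiation.
ACCEPT_HEADERS = {
    "bibtex": "application/x-bibtex",
    "apa": "text/x-bibliography; style=apa",
}


def build_citation_request(doi: str, fmt: str) -> urllib.request.Request:
    """Build (but do not send) a request for a citation in the given format."""
    if fmt not in ACCEPT_HEADERS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(ACCEPT_HEADERS)}")
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": ACCEPT_HEADERS[fmt]},
    )


def fetch_citation(doi: str, fmt: str, timeout: float = 30.0) -> str:
    """Send the request and return the citation text (needs network access)."""
    request = build_citation_request(doi, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

The two results map onto the `citation_bibtex` and `citation_apa` fields: `fetch_citation(doi, "bibtex")` for the first and `fetch_citation(doi, "apa")` for the second. It is worth checking the returned text by hand before pasting it into the dataset card.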
+#### tags
+* Standard tags used when searching for Hugging Face datasets
+* Typically:
+  \- biology
+  \- chemistry
+#### repo
+* URL of the GitHub repository, Figshare record, or other location of the data or project
+#### citation\_bibtex
+* Citation in bibtex format
+* You can use https://www.doi2bib.org/
+#### citation\_apa
+* Citation in APA format