maom commited on
Commit
6e640e8
·
verified ·
1 Parent(s): cf03a4c

Create 05_add_dataset_metadata

  1. sections/05_add_dataset_metadata +77 -0
sections/05_add_dataset_metadata ADDED
@@ -0,0 +1,77 @@
+ ## **5 Add Metadata to the Dataset Card**
+
+ ### **Overview**
+
+ At the top of the `README.md` file, include metadata about the dataset in YAML format:
+ \---
+ language: …
+ license: …
+ size\_categories: …
+ pretty\_name: '...'
+ tags: …
+ dataset\_summary: …
+ dataset\_description: …
+ acknowledgements: …
+ repo: …
+ citation\_bibtex: …
+ citation\_apa: …
+ \---
+
+ For the full specification, see:
+
+ * [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
+ * [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
+ * [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)
+
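As a rough illustration of the header described above, the following sketch renders a small metadata mapping as a YAML block and prepends it to a README. The helper names (`render_front_matter`, `add_front_matter`) and all field values are hypothetical, and the renderer assumes plain strings and lists of strings that need no YAML quoting; a real YAML library would be safer for arbitrary values.

```python
from typing import Mapping, Sequence, Union

Value = Union[str, Sequence[str]]


def render_front_matter(metadata: Mapping[str, Value]) -> str:
    """Render a flat mapping of strings and string lists as a YAML header."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, str):
            lines.append(f"{key}: {value}")
        else:
            # Sequences become YAML block lists, one "- item" per line.
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
    lines.append("---")
    return "\n".join(lines) + "\n"


def add_front_matter(readme_text: str, metadata: Mapping[str, Value]) -> str:
    """Prepend the header unless the README already starts with one."""
    if readme_text.startswith("---\n"):
        return readme_text
    return render_front_matter(metadata) + "\n" + readme_text


# Hypothetical values, for illustration only.
example_metadata = {
    "language": ["en"],
    "license": "mit",
    "size_categories": ["1K<n<10K"],
    "pretty_name": "Example Dataset",
    "tags": ["biology", "chemistry"],
}
```

Calling `add_front_matter(open("README.md").read(), example_metadata)` returns the README text with the header in place; writing it back to disk is left to the caller.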
+ To allow the dataset to be loaded automatically through the `datasets` Python library, the header of the `README.md` needs additional information that reflects how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):
+ configs:
+ dataset\_info:
+
+ While it is possible to write these fields by hand, it is highly recommended to let them be generated automatically on upload: load the dataset locally with [datasets.load\_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the Hub with [DatasetDict.push\_to\_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).
+
+ * [Example of uploading data using push\_to\_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
+ * See below for more details about how to use push\_to\_hub(...) for different common formats
+
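The load-then-push workflow can be sketched roughly as follows. This is a minimal example, not the upload script linked above: the function name `upload_csv_dataset` is hypothetical, a single local CSV file is assumed, and the `datasets` package must be installed and the user authenticated with the Hub for the call to succeed.

```python
def upload_csv_dataset(csv_path: str, repo_id: str) -> None:
    """Load a local CSV file and push it to the Hub as a dataset."""
    # Imported lazily so this sketch can be imported without the
    # `datasets` package; install it with `pip install datasets`.
    from datasets import load_dataset

    # With one file and no explicit split, load_dataset returns a
    # DatasetDict holding a single "train" split.
    dataset = load_dataset("csv", data_files=csv_path)

    # Uploads the data files; the library also fills in the configs and
    # dataset_info fields of the repository's README.md header.
    dataset.push_to_hub(repo_id)
```

A call such as `upload_csv_dataset("data/example.csv", "my-org/example-dataset")` would then upload the file, where both the path and the repository id are placeholders.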
+ ### **Metadata fields**
+
+ #### License
+
+ * If the dataset is already licensed under an existing standard license, use that license
+ * If the license is unclear, contact the authors for clarification
+ * To license the dataset under the Rosetta License:
+ * Add the following to the dataset card:
+
+ license: other
+
+ license\_name: rosetta-license-1.0
+
+ license\_link: LICENSE.md
+
+ * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the dataset repository
+
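One way to carry out the upload step is with the `huggingface_hub` client, sketched below. The function name `upload_license` is hypothetical, the license file is assumed to have been downloaded locally beforehand, and the call needs the `huggingface_hub` package plus write access to the dataset repository.

```python
# The three header fields listed above, kept together so the uploaded
# file name always matches license_link.
ROSETTA_LICENSE_FIELDS = {
    "license": "other",
    "license_name": "rosetta-license-1.0",
    "license_link": "LICENSE.md",
}


def upload_license(license_path: str, repo_id: str) -> None:
    """Upload a local copy of the license file to a dataset repository."""
    # Imported lazily so this sketch can be imported without the package;
    # install it with `pip install huggingface_hub`.
    from huggingface_hub import HfApi

    api = HfApi()
    api.upload_file(
        path_or_fileobj=license_path,
        path_in_repo=ROSETTA_LICENSE_FIELDS["license_link"],
        repo_id=repo_id,
        repo_type="dataset",  # the default repo type is "model"
    )
```

Passing `repo_type="dataset"` matters here, because the client otherwise targets a model repository with the same id.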
+ #### Citation
+
+ * If the dataset has a DOI (e.g. one associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate a BibTeX citation
+ * For an APA-format citation, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)
+
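As a scripted alternative to the web tools above, citations can often be fetched through DOI content negotiation on `doi.org`. This is a different technique from doi2bib.org or the Paperpile converter, and it assumes the DOI's registration agency (for example Crossref or DataCite) supports the requested format. The function names are hypothetical.

```python
import urllib.parse
import urllib.request

# Accept headers understood by DOI content negotiation.
ACCEPT_HEADERS = {
    "bibtex": "application/x-bibtex",
    "apa": "text/x-bibliography; style=apa",
}


def build_citation_request(doi: str, style: str) -> urllib.request.Request:
    """Build a doi.org request asking for a citation in the given style."""
    url = "https://doi.org/" + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": ACCEPT_HEADERS[style]})


def fetch_citation(doi: str, style: str) -> str:
    """Fetch the formatted citation text (requires network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

For a real DOI, `fetch_citation(doi, "bibtex")` and `fetch_citation(doi, "apa")` would return text suitable for the `citation_bibtex` and `citation_apa` fields; the result is worth checking by eye before pasting it into the card.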
+ #### tags
+
+ * Standard tags used when searching for Hugging Face datasets
+ * Typically:
+
+ \- biology
+
+ \- chemistry
+
+ #### repo
+
+ * URL of the GitHub repository, data repository, Figshare record, etc. for the data or project
+
+ #### citation\_bibtex
+
+ * Citation in BibTeX format
+ * You can generate it from a DOI with https://www.doi2bib.org/
+
+ #### citation\_apa
+
+ * Citation in APA format