pedrodev2026 committed on
Commit d7987fc · verified · 1 Parent(s): 4bd5ca7

Create DATASET_CREDITS.md

  1. DATASET_CREDITS.md +46 -0
DATASET_CREDITS.md ADDED
@@ -0,0 +1,46 @@
+ # Credits
+
+ ## NVIDIA – OpenCodeGeneticInstruct Dataset
+
+ This project uses data derived from the **OpenCodeGeneticInstruct** dataset created and published by **NVIDIA**.
+
+ Dataset: nvidia/OpenCodeGeneticInstruct
+ Provider: NVIDIA
+
+ We would like to acknowledge and thank **NVIDIA** for making this dataset publicly available to support research and development in code generation and instruction-following models.
+
+ The dataset was accessed through the Hugging Face ecosystem and processed via a script that streams and reformats a subset of the data for use in this project. The original dataset structure and content remain the intellectual property of NVIDIA and their respective contributors.
+
+ ## Original Dataset License
+
+ The **OpenCodeGeneticInstruct** dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
+
+ This means the data may be shared and adapted, provided that **appropriate credit is given to the original creators (NVIDIA)** and any required attribution is preserved according to the terms of the license.
+
+ ## Dataset Source
+
+ NVIDIA – OpenCodeGeneticInstruct dataset
+ Available on Hugging Face
+
+ ## Processing and Usage in This Project
+
+ In this repository, the dataset was processed and adapted for downstream use in the following steps:
+
+ * Streaming examples using the Hugging Face `datasets` library
+ * Extracting a subset of samples from the dataset
+ * Converting dataset fields into an `instruction` / `response` format
+ * Limiting each row to **a maximum of 512 tokens**
+ * Reducing the dataset to a final subset of **25,000 rows**
+ * Exporting the processed samples as **JSONL** for training and experimentation
+
+ ## License of the Processed Dataset
+
+ The **processed dataset distributed in this repository** is released under the **BSD 3-Clause License (BSD-3-Clause)**.
+
+ This license applies **only to the processed dataset and associated scripts provided in this repository**, while the **original dataset content remains subject to the CC BY 4.0 license** from NVIDIA.
+
+ Users must ensure that proper **attribution to NVIDIA and the OpenCodeGeneticInstruct dataset** is maintained when redistributing or using this processed dataset.
+
+ ## Appreciation
+
+ We appreciate NVIDIA's contribution to the open AI ecosystem and their efforts in releasing high-quality datasets that enable experimentation, research, and model development.
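
The processing steps listed under "Processing and Usage in This Project" could be sketched roughly as follows. This is not the script used in the repository, which is not part of this commit; it is a minimal illustration under several assumptions. The source field names `input` and `output` are hypothetical, the tokenizer is unknown and is replaced here by a whitespace word count, and the 512-token limit is interpreted as dropping rows that exceed it rather than truncating them. The dataset may also require a configuration name, which the optional `config` argument passes through.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional

MAX_TOKENS = 512    # per-row limit stated in the credits file
MAX_ROWS = 25_000   # final subset size stated in the credits file


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for the real
    # tokenizer, which the credits file does not name.
    return len(text.split())


def to_pairs(
    rows: Iterable[Dict],
    instruction_key: str = "input",   # hypothetical source field name
    response_key: str = "output",     # hypothetical source field name
    max_tokens: int = MAX_TOKENS,
) -> Iterator[Dict[str, str]]:
    """Convert source rows to instruction/response records, skipping
    rows that are incomplete or exceed the token limit."""
    for row in rows:
        instruction = row.get(instruction_key)
        response = row.get(response_key)
        if not instruction or not response:
            continue
        if count_tokens(instruction) + count_tokens(response) > max_tokens:
            continue
        yield {"instruction": instruction, "response": response}


def export_jsonl(records: Iterable[Dict], path: str, max_rows: int = MAX_ROWS) -> int:
    """Write at most max_rows records as JSON Lines; return the count written."""
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in islice(records, max_rows):
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def stream_source(config: Optional[str] = None, split: str = "train"):
    """Stream the source dataset without downloading it in full."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "nvidia/OpenCodeGeneticInstruct", name=config, split=split, streaming=True
    )
```

With the `datasets` library installed, a call such as `export_jsonl(to_pairs(stream_source()), "processed.jsonl")` would run the whole pipeline. Because the source is streamed and `islice` stops after 25,000 accepted rows, the full dataset never needs to be held in memory.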