update namespace
datasets/github_code.md (+3 −3)
@@ -1,9 +1,9 @@
-We also released [Github code dataset](https://huggingface.co/datasets/
+We also released the [GitHub code dataset](https://huggingface.co/datasets/codeparrot/github-code), 1TB of code data from GitHub repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). If you don't want to download the full dataset because of memory limitations, it can be loaded in streaming mode, which creates an iterable dataset:
 
 ```python
 from datasets import load_dataset
 
-ds = load_dataset("
+ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
 print(next(iter(ds)))
 
 #OUTPUT:
@@ -20,7 +20,7 @@ print(next(iter(ds)))
 You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. Below is the distribution of programming languages in this dataset.
 
 <p align="center">
-<img src="https://huggingface.co/datasets/
+<img src="https://huggingface.co/datasets/codeparrot/github-code/resolve/main/github-code-stats-alpha.png" alt="drawing" width="650"/>
 </p>
 
 For model-specific information about the pretraining dataset, please select a model below:
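The streaming access pattern in the snippet above can be exercised offline with a stand-in iterator. This is a minimal sketch: `fake_stream` and all of its field values are illustrative placeholders, not real data from the dataset — only the field names mirror the metadata the snippet describes (repo name, path, language, license, size).

```python
from itertools import islice

# Stand-in for the streamed dataset: a lazy iterator of sample dicts with
# the metadata fields described above (repo_name, path, language, license,
# size) plus the code itself. The values here are illustrative placeholders.
def fake_stream():
    for i in range(1000):
        yield {
            "repo_name": f"user/repo-{i}",
            "path": f"src/module_{i}.py",
            "language": "Python",
            "license": "mit",
            "size": 120 + i,
            "code": "print('hello world')\n",
        }

ds = fake_stream()

# Same access pattern as `print(next(iter(ds)))` in the snippet above:
first = next(iter(ds))

# islice takes a handful of further samples lazily, without ever
# materializing anything close to the full 1TB.
sample = list(islice(ds, 5))
print(first["repo_name"], len(sample))
```

Because the dataset is exposed as an iterable rather than an indexed table, any consumer that works on a plain Python iterator (here, `next` and `itertools.islice`) works on the streamed dataset the same way.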