Spaces: codeparrot/code-generation-models
loubnabnl (HF Staff) committed on Jun 24, 2022

Commit d3d5f4d (1 parent: ab4a29a)

update namespace

Files changed (1)

datasets/github_code.md CHANGED

@@ -1,9 +1,9 @@
-We also released [Github code dataset](https://huggingface.co/datasets/lvwerra/github-code), a 1TB of code data from Github repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). The dataset can be loaded in streaming mode if you don't want to download it because of memory limitations, this will create an iterable dataset:
+We also released [Github code dataset](https://huggingface.co/datasets/codeparrot/github-code), a 1TB of code data from Github repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). The dataset can be loaded in streaming mode if you don't want to download it because of memory limitations, this will create an iterable dataset:
 ```python
 from datasets import load_dataset
-ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
+ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
 print(next(iter(ds)))
 #OUTPUT:
@@ -20,7 +20,7 @@ print(next(iter(ds)))
 You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. Below is the distribution of programming languages in this dataset.
 <p align="center">
-    <img src="https://huggingface.co/datasets/lvwerra/github-code/resolve/main/github-code-stats-alpha.png" alt="drawing" width="650"/>
+    <img src="https://huggingface.co/datasets/codeparrot/github-code/resolve/main/github-code-stats-alpha.png" alt="drawing" width="650"/>
 </p>
 For model-specific information about the pretraining dataset, please select a model below:
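
Note on the streaming snippet in this diff: because `streaming=True` yields an iterable rather than an in-memory table, samples are usually consumed lazily. The sketch below shows one way to filter a stream by language and take the first few matches. It is only an illustration: the helper names `filter_by_language` and `take` are hypothetical, the field names (`code`, `repo_name`, `path`, `language`, `license`, `size`) are assumed from the metadata described in the changed file, and `fake_stream` is a made-up stand-in so the example runs without downloading anything.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_by_language(samples: Iterable[Dict], language: str) -> Iterator[Dict]:
    """Lazily yield only the samples whose "language" field matches."""
    return (s for s in samples if s.get("language") == language)


def take(samples: Iterable[Dict], n: int) -> List[Dict]:
    """Collect at most n samples, stopping as soon as n have been read."""
    return list(islice(samples, n))


# Hypothetical stand-in for the streamed dataset (assumed field names).
# With the real data, pass the `ds` object returned by
# load_dataset("codeparrot/github-code", streaming=True, split="train").
fake_stream = [
    {"code": "print('hi')", "repo_name": "a/b", "path": "x.py",
     "language": "Python", "license": "mit", "size": 11},
    {"code": "int main(){}", "repo_name": "c/d", "path": "m.c",
     "language": "C", "license": "apache-2.0", "size": 12},
    {"code": "import os", "repo_name": "e/f", "path": "y.py",
     "language": "Python", "license": "mit", "size": 9},
]

python_samples = take(filter_by_language(fake_stream, "Python"), 2)
for sample in python_samples:
    print(sample["repo_name"], sample["path"], sample["size"])
```

Both helpers accept any iterable of dictionaries, so the same calls should work on the streamed dataset as long as the assumed field names match the actual schema.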