loubnabnl HF staff committed on
Commit
4a8f8af
1 Parent(s): 46dbbb1
Files changed (1)
  1. datasets/github_code.txt +20 -0
datasets/github_code.txt ADDED
@@ -0,0 +1,20 @@
+ We also released the [GitHub code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1TB of code from GitHub repositories in 32 programming languages. If you don't want to download it because of memory issues, you can load the dataset in streaming mode, which creates an iterable dataset:
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
+ print(next(iter(ds)))
+
+ #OUTPUT:
+ {
+ 'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
+ 'repo_name': 'MirekSz/webpack-es6-ts',
+ 'path': 'app/mods/mod190.js',
+ 'language': 'JavaScript',
+ 'license': 'isc',
+ 'size': 73
+ }
+
+ ```
+ In addition to the code, each sample includes metadata: the repository name, file path, language, license, and file size.
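The metadata fields make it possible to select samples while iterating. The following is a minimal sketch rather than part of the commit: `take_by_language` is a hypothetical helper, and the second record in `samples` is an invented stand-in that only mirrors the field names of the sample shown above. It runs on plain dictionaries so it needs no download.

```python
from itertools import islice


def take_by_language(samples, language, n):
    """Return up to n samples whose 'language' field equals `language`.

    `samples` can be any iterable of dicts with a 'language' key, so the
    function stops reading as soon as n matches are found.
    """
    matches = (s for s in samples if s["language"] == language)
    return list(islice(matches, n))


# Stand-in records: the first copies the sample shown above,
# the second is invented purely for illustration.
samples = [
    {
        "code": "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
        "repo_name": "MirekSz/webpack-es6-ts",
        "path": "app/mods/mod190.js",
        "language": "JavaScript",
        "license": "isc",
        "size": 73,
    },
    {
        "code": "print('hello')\n",
        "repo_name": "example/hypothetical-repo",  # hypothetical
        "path": "hello.py",
        "language": "Python",
        "license": "mit",
        "size": 15,
    },
]

js_samples = take_by_language(samples, "JavaScript", 5)
print([s["path"] for s in js_samples])
```

Because the helper only iterates, the streamed `ds` object from the snippet above should work in place of `samples`, assuming its records keep the same fields.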