|
We also released the [GitHub code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1TB of code from GitHub repositories in 32 programming languages. If you don't want to download the full dataset because of disk or memory constraints, you can load it in streaming mode, which creates an iterable dataset that fetches samples on the fly:
|
|
|
```python
from datasets import load_dataset

ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
print(next(iter(ds)))

# OUTPUT:
{
    'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
    'repo_name': 'MirekSz/webpack-es6-ts',
    'path': 'app/mods/mod190.js',
    'language': 'JavaScript',
    'license': 'isc',
    'size': 73
}
```
|
You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. |
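Because each streamed sample is a plain dict like the one above, the metadata fields make it easy to filter the stream on the fly. Here is a minimal sketch with a couple of hand-written samples mimicking the dataset's schema (the sample values and the `keep` helper are illustrative, not part of the dataset):

```python
# Hypothetical in-memory samples following the schema shown above.
samples = [
    {"code": "print('hi')\n", "repo_name": "user/repo", "path": "x.py",
     "language": "Python", "license": "mit", "size": 12},
    {"code": "import mod189 from './mod189';\n", "repo_name": "MirekSz/webpack-es6-ts",
     "path": "app/mods/mod190.js", "language": "JavaScript", "license": "isc", "size": 73},
]

def keep(sample, languages=("Python",), max_size=100_000):
    # Keep only samples in the requested languages and under a size cap.
    return sample["language"] in languages and sample["size"] <= max_size

filtered = [s for s in samples if keep(s)]
print([s["path"] for s in filtered])
```

The same predicate can be applied to the real stream with `ds.filter(keep)`, since `datasets` iterable datasets support lazy filtering.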
|
|
|
For model-specific information about the pretraining dataset, please select a model below: |