IntelligenzaArtificiale committed on
Commit f8b2586
1 Parent(s): bc258f9

Delete datasets
datasets/.ipynb_checkpoints/codeparrot-checkpoint.txt DELETED
@@ -1,9 +0,0 @@
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contained a lot of duplicated and noisy data, so it was cleaned with the following steps:
- - Exact match deduplication
- - Filtering:
-   - Average line length < 100 characters
-   - Maximum line length < 1000 characters
-   - Alphanumeric character fraction > 0.25
-   - Remove auto-generated files (keyword search)
-
- For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).

datasets/.ipynb_checkpoints/opt-checkpoint.txt DELETED
@@ -1,2 +0,0 @@
- [OPT](https://huggingface.co/facebook/opt-30b) was trained on five filtered datasets of textual documents. One of them, [The Pile](https://arxiv.org/pdf/2101.00027v1.pdf), includes code; the subsets used were *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews*.
- The final training data contains 180B tokens, corresponding to 800GB of data. For more details please refer to this [paper](https://arxiv.org/abs/2205.01068).

datasets/codegen.md DELETED
@@ -1,14 +0,0 @@
- [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, where each problem is solved interactively in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.
-
- It was sequentially trained on three datasets:
- - [The Pile](https://huggingface.co/datasets/the_pile)
- - A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only six: C, C++, Go, Java, JavaScript, and Python
- - 217GB of Python data from GitHub repositories
-
- The second and third datasets were preprocessed with the following steps:
- - Exact match deduplication
- - Filtering:
-   - Average line length < 100 characters
-   - Maximum line length < 1000 characters
-   - Remove files where more than 90% of the characters are decimal or hexadecimal digits
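The digit-heavy filter in the last bullet can be sketched as a small helper. This is a minimal illustration under the stated 90% threshold, not the authors' actual implementation:

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F

def is_digit_heavy(code: str, threshold: float = 0.9) -> bool:
    """Return True if more than `threshold` of the characters are
    decimal or hexadecimal digits (such files are filtered out)."""
    if not code:
        return False
    fraction = sum(c in HEX_DIGITS for c in code) / len(code)
    return fraction > threshold
```

Such files are typically embedded data blobs or generated lookup tables rather than human-written code.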
datasets/codeparrot.md DELETED
@@ -1,9 +0,0 @@
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data; it was therefore cleaned with the following steps:
- - Exact match deduplication
- - Filtering:
-   - Average line length < 100 characters
-   - Maximum line length < 1000 characters
-   - Alphanumeric character fraction > 0.25
-   - Remove auto-generated files (keyword search)
-
- For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
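The filtering heuristics can be sketched as a single predicate. The thresholds follow the list above; the keyword list for auto-generated files is a hypothetical stand-in for the one in the actual preprocessing script:

```python
# Sketch of CodeParrot-style filtering heuristics (deduplication not shown).
# AUTO_GEN_KEYWORDS is a hypothetical example of a keyword search list.
AUTO_GEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")

def keep_file(code: str) -> bool:
    lines = code.splitlines()
    if not lines:
        return False
    line_lengths = [len(line) for line in lines]
    if sum(line_lengths) / len(lines) >= 100:   # average line length < 100
        return False
    if max(line_lengths) >= 1000:               # maximum line length < 1000
        return False
    if sum(c.isalnum() for c in code) / len(code) <= 0.25:  # alnum fraction > 0.25
        return False
    head = code[:500].lower()                   # keyword search near top of file
    return not any(kw in head for kw in AUTO_GEN_KEYWORDS)
```

Files failing any heuristic are dropped; exact-match deduplication runs as a separate pass over file hashes.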
datasets/github_code.md DELETED
@@ -1,26 +0,0 @@
- We also released the [GitHub Code dataset](https://huggingface.co/datasets/codeparrot/github-code), 1TB of code data from GitHub repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). If you don't want to download the full dataset because of memory limitations, it can be loaded in streaming mode, which creates an iterable dataset:
-
- ```python
- from datasets import load_dataset
-
- ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
- print(next(iter(ds)))
-
- # Output:
- {
-     'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
-     'repo_name': 'MirekSz/webpack-es6-ts',
-     'path': 'app/mods/mod190.js',
-     'language': 'JavaScript',
-     'license': 'isc',
-     'size': 73
- }
- ```
- You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. Below is the distribution of programming languages in this dataset.
-
- <p align="center">
- <img src="https://huggingface.co/datasets/codeparrot/github-code/resolve/main/github-code-stats-alpha.png" alt="drawing" width="650"/>
- </p>
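Because each streamed sample carries a `language` field, samples can be filtered on the fly while iterating. A toy sketch with inlined records that mimic the schema shown above (the Python record is made up for illustration):

```python
def stream_language(records, language):
    """Yield only samples whose 'language' metadata matches, as you would
    when post-filtering an iterable streaming dataset."""
    for record in records:
        if record["language"] == language:
            yield record

# Records mimicking the schema shown above; the second one is hypothetical.
samples = [
    {"path": "app/mods/mod190.js", "language": "JavaScript", "license": "isc", "size": 73},
    {"path": "src/train.py", "language": "Python", "license": "mit", "size": 512},
]

python_files = list(stream_language(samples, "Python"))
```

In practice you would pass the streaming dataset itself as `records`, keeping memory usage constant regardless of dataset size.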
-
- For model-specific information about the pretraining dataset, please select a model below:
datasets/incoder.md DELETED
@@ -1,14 +0,0 @@
- [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216GB** of preprocessed data from GitHub and Stack Overflow covering 28 programming languages: 52GB is Python, 107GB is other programming languages, and 57GB is content from Stack Overflow that isn't code.
-
- The GitHub data was cleaned with the following filters:
- - Average line length < 100 characters
- - Maximum line length < 3000 characters
- - Alphanumeric character fraction > 0.4
- - Remove auto-generated files (keyword search)
-
- The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
- - all questions that have at least one answer
- - up to ten answers with a non-negative score (sorted by score) per question
- - up to five comments per question/answer
-
- Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
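The Stack Overflow selection rules above can be sketched as one function per question. The dict layout here is a hypothetical stand-in for the real dump schema, not the paper's actual code:

```python
def select_content(question):
    """Apply the selection rules listed above to one question record."""
    if not question["answers"]:                      # keep only answered questions
        return None
    answers = [a for a in question["answers"] if a["score"] >= 0]
    answers = sorted(answers, key=lambda a: a["score"], reverse=True)[:10]
    for answer in answers:
        answer["comments"] = answer["comments"][:5]  # up to 5 comments per answer
    return {
        "question": question["text"],
        "comments": question["comments"][:5],        # up to 5 comments per question
        "answers": answers,
    }
```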
datasets/intro.md DELETED
@@ -1,8 +0,0 @@
- Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from platforms such as Stack Overflow. Additional datasets can be crafted based on the target task of the model. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), a large corpus containing both natural language text and code from different sources, such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be useful for models intended to translate between natural text and code; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance.
-
- Below is the distribution of the pretraining data size of some code models. We provide model-specific information for open-source models later in this section:
- <p align="center">
- <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/data_distrub.png" alt="drawing" width="440"/>
- </p>
-
- Some other useful datasets available on the 🤗 Hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 million (comment, code) pairs from open-source libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems for entry-level programmers, where each problem consists of a task description, a code solution, and 3 automated test cases. This dataset was used in the evaluation of [InCoder](https://huggingface.co/facebook/incoder-6B), in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval), which we will present later. You can also find [APPS](https://huggingface.co/datasets/loubnabnl/apps), a benchmark of 10,000 problems consisting of programming questions in English and code solutions in Python; this dataset was also used in Codex evaluation along with HumanEval.
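The mbpp layout (task description, code solution, automated test cases) lends itself to automatic checking. Below is a hypothetical record in that spirit (not an actual dataset row) and a minimal harness that executes its tests:

```python
# Hypothetical problem record mirroring the mbpp structure described above.
problem = {
    "text": "Write a function to return the square of a number.",
    "code": "def square(n):\n    return n * n",
    "test_list": [
        "assert square(2) == 4",
        "assert square(-3) == 9",
        "assert square(0) == 0",
    ],
}

namespace = {}
exec(problem["code"], namespace)   # load the candidate solution
for test in problem["test_list"]:
    exec(test, namespace)          # every assertion must pass
```

This pass/fail structure is what makes such benchmarks usable for automated evaluation of generated code.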
datasets/polycoder.md DELETED
@@ -1,5 +0,0 @@
- The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **249GB** of data (after preprocessing), consisting of popular repositories in 12 programming languages with at least 50 stars, collected from GitHub in October 2021. The data was preprocessed with the following steps:
- - Exact match deduplication
- - Filtering:
-   - Average line length < 100 characters
-   - Maximum line length < 1000 characters