Aspr committed on
Commit
41cd853
1 Parent(s): efc3eaf

Update README.md

Files changed (1)
  1. README.md +12 -1
README.md CHANGED
@@ -18,9 +18,20 @@ tags:
18
 
19
  # KStack-full models
20
 
21
- KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset.
22
  This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
23
 
24
  # Model use
25
 
26
  ```python
 
18
 
19
  # KStack-full models
20
 
21
+ KStack-full models is a collection of open-source generative text models fine-tuned on the KStack dataset with rule-based filtering.
22
  This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
23
 
24
+ ## Rule-based filtering
25
+ To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we remove dataset entries according to the following rules:
26
+ * We filter out files which belong to low-popularity repos (the sum of stars and forks is less than 6)
27
+ * Next, we filter out files which belong to repos with fewer than 5 Kotlin files
28
+ * Finally, we remove files which have fewer than 20 source lines of code (SLOC)
29
+
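The three filtering rules above can be read as a single predicate over a dataset entry. The following is only a minimal sketch, not the authors' pipeline: the `Entry` record, its field names, and the `count_sloc` heuristic (non-blank lines that are not pure `//` comments) are assumptions made for illustration, since the README does not say how SLOC is counted.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical record for one Kotlin file; field names are assumptions."""
    stars: int               # stars of the repository the file belongs to
    forks: int               # forks of that repository
    repo_kotlin_files: int   # number of Kotlin files in that repository
    content: str             # file contents


def count_sloc(content: str) -> int:
    # Assumed SLOC proxy: non-blank lines that are not pure `//` comments.
    return sum(
        1
        for line in content.splitlines()
        if line.strip() and not line.strip().startswith("//")
    )


def keep_entry(entry: Entry) -> bool:
    # Rule 1: drop low-popularity repos (stars + forks less than 6).
    if entry.stars + entry.forks < 6:
        return False
    # Rule 2: drop repos with fewer than 5 Kotlin files.
    if entry.repo_kotlin_files < 5:
        return False
    # Rule 3: drop files with fewer than 20 SLOC.
    return count_sloc(entry.content) >= 20
```

Applying the rules in this order keeps the cheap repository-level checks ahead of the per-file line count.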
30
+ We clean the content of the remaining dataset entries according to the following rules:
31
+ * We remove all non-ASCII entries
32
+ * We remove all package lines such as _package kotlinx.coroutines.channels_
33
+ * We remove half of the import lines.
34
+
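The cleaning rules above leave some details open, so the sketch below fixes them by assumption: "non-ASCII entries" is read as whole files containing any non-ASCII character, and the removed half of the import lines is chosen uniformly at random. The function name `clean_content` and the seeded `random.Random` parameter are hypothetical and not taken from the repository.

```python
import random
from typing import Optional


def clean_content(content: str, rng: random.Random) -> Optional[str]:
    """Return the cleaned file text, or None if the entry is dropped."""
    # Assumption: an entry with any non-ASCII character is removed entirely.
    if not content.isascii():
        return None
    # Remove package declarations such as `package kotlinx.coroutines.channels`.
    lines = [ln for ln in content.splitlines() if not ln.startswith("package ")]
    # Positions of the import lines that remain.
    import_idx = [i for i, ln in enumerate(lines) if ln.startswith("import ")]
    # Assumption: the half to remove is sampled uniformly at random
    # (rounded down when the number of imports is odd).
    drop = set(rng.sample(import_idx, len(import_idx) // 2))
    return "\n".join(ln for i, ln in enumerate(lines) if i not in drop)
```

Passing an explicitly seeded generator, for example `clean_content(text, random.Random(0))`, makes the import removal reproducible across runs.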
35
  # Model use
36
 
37
  ```python