wiktorwojcik112 committed on
Commit
5f0a59f
verified
1 Parent(s): 3f92b16

Create README.md

Files changed (1)
  1. README.md +20 -3
README.md CHANGED
@@ -1,3 +1,20 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - christopher/rosetta-code
+ pipeline_tag: text-classification
+ tags:
+ - code
+ ---
+
+ This is a Core ML model that identifies the following programming languages:
+
+ ```go, lua, perl, python, apl, shell, c, c#, c++, cobol, lisp, erlang, fortran, groovy, haskell, java, javascript, kotlin, objective-c, pascal, php, powershell, r, ruby, rust, scala, scheme, swift, dart, sql, text, mysql, typescript, ecma, cmake, html, latex, jinja, json, toml, css```
+
+ It was trained on a cleaned and filtered version of the rosetta-code dataset (https://huggingface.co/datasets/christopher/rosetta-code).
+
+ ## ProgrammingLanguageIdentificationV1
+ First version of the programming language identification model. It was trained on 20,362 data points (including the validation set, which was picked automatically).
+ Because the number of snippets differs widely between languages (lowest: css, ecma, and toml (1); highest: go (1110)), accuracy varies a lot between languages. Overall accuracy is 98.8% for training and validation.
+
+ Future versions will focus on increasing the dataset size.
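
The README's note that accuracy varies between languages because snippet counts are uneven can be illustrated with a small sketch. The labels and predictions below are made up for illustration and are not taken from the model or its dataset; the helper names `per_label_accuracy` and `overall_accuracy` are hypothetical. It only shows how a high overall figure can coexist with poor accuracy on a rarely seen language.

```python
from collections import defaultdict


def per_label_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def overall_accuracy(true_labels, predicted_labels):
    """Return the fraction of all samples predicted correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Made-up evaluation set: one well-represented label, one rare label.
truth = ["go"] * 98 + ["toml"] * 2
# Hypothetical predictions: every "go" snippet right, both "toml" snippets wrong.
guesses = ["go"] * 98 + ["json"] * 2

overall = overall_accuracy(truth, guesses)
by_label = per_label_accuracy(truth, guesses)

print(f"overall: {overall:.1%}")
for label, accuracy in sorted(by_label.items()):
    print(f"{label}: {accuracy:.1%}")
```

In this toy case the overall accuracy is dominated by the common label, so reporting accuracy per language alongside the overall number gives a clearer picture for languages with very few snippets.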