---
license: mit
datasets:
- christopher/rosetta-code
pipeline_tag: text-classification
tags:
- code
---

This is a CoreML model for identification of the following programming languages:

```go, lua, perl, python, apl, shell, c, c#, c++, cobol, lisp, erlang, fortran, groovy, haskell, java, javascript, kotlin, objective-c, pascal, php, powershell, r, ruby, rust, scala, scheme, swift, dart, sql, text, mysql, typescript, ecma, cmake, html, latex, jinja, json, toml, css```

It was trained on a cleaned-up and filtered version of the rosetta-code dataset (https://huggingface.co/datasets/christopher/rosetta-code).
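
Below is a minimal usage sketch, assuming the model is a Create ML-style text classifier that the NaturalLanguage framework can wrap via `NLModel`; the model file name and path are hypothetical placeholders for the compiled artifact shipped with this repo.

```swift
import Foundation
import NaturalLanguage

// Hypothetical path: point this at the compiled Core ML model bundled with this repo.
let modelURL = URL(fileURLWithPath: "ProgrammingLanguageIdentificationV1.mlmodelc")

do {
    // NLModel wraps a Core ML text classifier for use from the NaturalLanguage framework.
    let classifier = try NLModel(contentsOf: modelURL)

    let snippet = """
    def greet(name):
        print(f"Hello, {name}!")
    """

    // predictedLabel(for:) returns the most likely class label, e.g. "python".
    if let language = classifier.predictedLabel(for: snippet) {
        print("Predicted language: \(language)")
    }
} catch {
    print("Failed to load model: \(error)")
}
```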

## ProgrammingLanguageIdentificationV1
The first version of the programming language identification model. It was trained on 20,362 data points (including a validation split that was selected automatically).
Because each programming language has a different number of snippets (lowest: css, ecma, and toml with 1 each; highest: go with 1,110), accuracy varies considerably between languages. Overall accuracy is 98.8% across training and validation.
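
Given that imbalance, per-language accuracy is more informative than the overall figure. A sketch of how one might compute it, reusing the same hypothetical model path as above and a small labelled test set of your own:

```swift
import Foundation
import NaturalLanguage

// Hypothetical labelled evaluation set: (code snippet, expected language tag).
let testSet: [(code: String, label: String)] = [
    ("print(\"hello\")", "python"),
    ("fn main() { println!(\"hello\"); }", "rust"),
    ("SELECT * FROM users;", "sql"),
]

// Same hypothetical compiled model path as in the usage sketch above.
let classifier = try! NLModel(contentsOf: URL(fileURLWithPath: "ProgrammingLanguageIdentificationV1.mlmodelc"))

// Tally correct predictions per language to expose class-imbalance effects.
var correct: [String: Int] = [:]
var total: [String: Int] = [:]

for sample in testSet {
    total[sample.label, default: 0] += 1
    if classifier.predictedLabel(for: sample.code) == sample.label {
        correct[sample.label, default: 0] += 1
    }
}

for (label, count) in total.sorted(by: { $0.key < $1.key }) {
    let accuracy = 100.0 * Double(correct[label, default: 0]) / Double(count)
    print("\(label): \(String(format: "%.1f", accuracy))% over \(count) samples")
}
```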

Future versions will focus on increasing the dataset size.