---
title: Tglang_programming_langugage_detection
app_file: app.py
sdk: gradio
sdk_version: 4.5.0
---
# Tglang - identify the programming language of a code snippet

[GitHub repo](https://github.com/Rusteam/tglang)

This is a solution for the [Telegram hackathon](https://contest.com/docs/ML-Competition-2023-r2).

The list of supported languages:
```markdown
TGLANG_LANGUAGE_C
TGLANG_LANGUAGE_CPLUSPLUS
TGLANG_LANGUAGE_CSHARP
TGLANG_LANGUAGE_CSS
TGLANG_LANGUAGE_DART
TGLANG_LANGUAGE_DOCKER
TGLANG_LANGUAGE_FUNC
TGLANG_LANGUAGE_GO
TGLANG_LANGUAGE_HTML
TGLANG_LANGUAGE_JAVA
TGLANG_LANGUAGE_JAVASCRIPT
TGLANG_LANGUAGE_JSON
TGLANG_LANGUAGE_KOTLIN
TGLANG_LANGUAGE_LUA
TGLANG_LANGUAGE_NGINX
TGLANG_LANGUAGE_OBJECTIVE_C
TGLANG_LANGUAGE_PHP
TGLANG_LANGUAGE_POWERSHELL
TGLANG_LANGUAGE_PYTHON
TGLANG_LANGUAGE_RUBY
TGLANG_LANGUAGE_RUST
TGLANG_LANGUAGE_SHELL
TGLANG_LANGUAGE_SOLIDITY
TGLANG_LANGUAGE_SQL
TGLANG_LANGUAGE_SWIFT
TGLANG_LANGUAGE_TL
TGLANG_LANGUAGE_TYPESCRIPT
TGLANG_LANGUAGE_XML
```

Other programming languages and non-code text are identified
as `TGLANG_LANGUAGE_OTHER` (index 0).

## Model development

### Data

- Training data consisted of 3.7k+ files with 220k+ lines of code,
  drawn from the [Stack dataset](https://huggingface.co/datasets/bigcode/the-stack/viewer/default/train)
  and from files collected manually from GitHub.
- The test set was manually labelled from the [Telegram r1 files](https://data-static.usercontent.dev/ml2023-r1-dataset.tar.gz).
  It consisted of 493 files and 7404 lines of code. Not all classes are present in the test set.
- Train files were split into shorter sequences of lines to
  match the length of the test files.
- OTHER files from the Telegram files were added to the train set
  to make up 20% of its data, and to the test set to make up 50% of its data.
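
The splitting and mixing steps above can be sketched as follows. This is a minimal illustration and not the repository's code: the helper names `split_into_chunks` and `n_other_needed`, the chunk length bounds and the sample counts are all assumptions. The only figure taken from this README is that the test files average about 15 lines each (7404 lines over 493 files).

```python
import random


def split_into_chunks(lines, min_len=5, max_len=30, seed=0):
    """Split a list of lines into consecutive chunks of random length.

    min_len and max_len are hypothetical bounds chosen around the
    ~15-line average of the test files; the last chunk may be shorter.
    """
    rng = random.Random(seed)
    chunks = []
    start = 0
    while start < len(lines):
        size = rng.randint(min_len, max_len)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def n_other_needed(n_code, other_share):
    """Number of OTHER samples needed to reach `other_share` of the total."""
    # other / (other + code) = share  =>  other = code * share / (1 - share)
    return round(n_code * other_share / (1 - other_share))


# Toy "file" of 100 lines split into shorter training samples.
source = [f"line {i}" for i in range(100)]
chunks = split_into_chunks(source)
print(len(chunks), "chunks")

# Illustrative counts: 800 code samples need 200 OTHER samples for a 20% share,
# and 500 code samples need 500 OTHER samples for a 50% share.
print(n_other_needed(800, 0.2))
print(n_other_needed(500, 0.5))
```

Random chunk lengths are one way to match a length distribution; sampling lengths directly from the test files would be an alternative if those lengths are available.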

### Model

1. Tokenizer - a simple text tokenizer extracts
   keywords and special characters from the code. Numbers,
   comments and docstrings are removed.
2. Text embedding - a TF-IDF vectorizer extracts
   features from the train set. The TF-IDF parameters are:
```python
max_features=1000,
binary=True,
ngram_range=(1, 1),
tokenizer=tokenize_text,
lowercase=False,
```
3. Classifier - a simple multinomial naive Bayes model is trained on
   the vectorizer output.
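
The three steps can be put together in a short sketch, assuming scikit-learn's `TfidfVectorizer` and `MultinomialNB` (the parameter names listed above match that API). The `tokenize_text` body here is a hypothetical stand-in, since the real tokenizer is not shown in this README. It only strips `//` and `# ` line comments and does not handle docstrings or block comments. The training snippets are toy data.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Identifiers/keywords, or a single special character. Bare numbers match neither.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def tokenize_text(text):
    """Hypothetical tokenizer: crude comment removal, then token extraction."""
    # Drop '//' and '# ' line comments (a simplification, not per-language).
    text = re.sub(r"(?:#[ \t]|//).*", "", text)
    return TOKEN_RE.findall(text)


model = make_pipeline(
    TfidfVectorizer(
        max_features=1000,
        binary=True,
        ngram_range=(1, 1),
        tokenizer=tokenize_text,
        lowercase=False,
    ),
    MultinomialNB(),
)

# Toy training data, for illustration only.
train_texts = [
    "def add(a, b):\n    return a + b  # sum",
    "import os\nfor name in os.listdir('.'):\n    print(name)",
    "function add(a, b) {\n  return a + b; // sum\n}",
    "const names = [];\nnames.forEach((n) => console.log(n));",
]
train_labels = [
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_JAVASCRIPT",
    "TGLANG_LANGUAGE_JAVASCRIPT",
]

model.fit(train_texts, train_labels)
prediction = model.predict(["def mul(x, y):\n    return x * y"])[0]
print(prediction)
```

With `binary=True` a token counts once per snippet no matter how often it repeats, so long snippets with many repeated tokens do not dominate short ones. `lowercase=False` keeps case-sensitive keywords distinct.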

### Results

- Accuracy on the test set: 0.82
- Accuracy on the validation set: 0.83 | |