metadata

title: Tglang_programming_langugage_detection
app_file: app.py
sdk: gradio
sdk_version: 4.5.0

Tglang - identify the programming language of a code snippet

github repo

This is a solution for the Telegram hackathon.

The list of supported languages:

  TGLANG_LANGUAGE_C
  TGLANG_LANGUAGE_CPLUSPLUS
  TGLANG_LANGUAGE_CSHARP
  TGLANG_LANGUAGE_CSS
  TGLANG_LANGUAGE_DART
  TGLANG_LANGUAGE_DOCKER
  TGLANG_LANGUAGE_FUNC
  TGLANG_LANGUAGE_GO
  TGLANG_LANGUAGE_HTML
  TGLANG_LANGUAGE_JAVA
  TGLANG_LANGUAGE_JAVASCRIPT
  TGLANG_LANGUAGE_JSON
  TGLANG_LANGUAGE_KOTLIN
  TGLANG_LANGUAGE_LUA
  TGLANG_LANGUAGE_NGINX
  TGLANG_LANGUAGE_OBJECTIVE_C
  TGLANG_LANGUAGE_PHP
  TGLANG_LANGUAGE_POWERSHELL
  TGLANG_LANGUAGE_PYTHON
  TGLANG_LANGUAGE_RUBY
  TGLANG_LANGUAGE_RUST
  TGLANG_LANGUAGE_SHELL
  TGLANG_LANGUAGE_SOLIDITY
  TGLANG_LANGUAGE_SQL
  TGLANG_LANGUAGE_SWIFT
  TGLANG_LANGUAGE_TL
  TGLANG_LANGUAGE_TYPESCRIPT
  TGLANG_LANGUAGE_XML

Other programming languages and non-code text are identified as TGLANG_LANGUAGE_OTHER (index 0).

Model development

Data

The training data consisted of 3.7k+ files with 220k+ lines of code, drawn from the Stack dataset and collected manually from GitHub.
The test set was manually labelled from Telegram r1 files. It consisted of 493 files and 7404 lines of code. Not all classes are present in the test set.
Training files were split into shorter sequences of lines to match the length of the test files.
OTHER files from the Telegram files were added to the training set until they made up 20% of it, and to the test set until they made up 50% of it.
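
The two preparation steps above (splitting training files into shorter pieces and topping up the OTHER class) could look roughly like the sketch below. It is an illustration only, not the code used for the hackathon: the function names `split_into_chunks` and `other_files_needed` are hypothetical, and the chunk size of 15 lines is an assumption derived from the test-set average (7404 lines / 493 files, about 15 lines per file).

```python
def split_into_chunks(lines, chunk_size=15):
    """Split a list of source lines into consecutive chunks.

    chunk_size=15 is an assumed value (roughly the average test file length);
    the last chunk may be shorter.
    """
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def other_files_needed(n_code_files, other_fraction):
    """Number of OTHER files to add so that they make up `other_fraction`
    of the final set.

    Solves n_other / (n_code_files + n_other) = other_fraction for n_other.
    """
    if not 0 <= other_fraction < 1:
        raise ValueError("other_fraction must be in [0, 1)")
    return round(other_fraction * n_code_files / (1 - other_fraction))


# A 35-line file becomes chunks of 15, 15 and 5 lines.
chunks = split_into_chunks([f"line {i}" for i in range(35)])
chunk_lengths = [len(chunk) for chunk in chunks]

# 400 code samples need 100 OTHER samples to reach a 20% share,
# while 100 code samples need 100 OTHER samples to reach a 50% share.
train_other = other_files_needed(400, 0.2)
test_other = other_files_needed(100, 0.5)
```

The sample counts 400 and 100 are made-up inputs chosen to keep the arithmetic easy to check; they are not the real dataset sizes.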

Model

Tokenizer - a simple text tokenizer is used to extract keywords and special characters from the code. Numbers, comments and docstrings are removed.
Text embedding - a TF-IDF vectorizer is fitted on the training set to extract features. The TF-IDF parameters are:

    max_features=1000,
    binary=True, 
    ngram_range=(1,1), 
    tokenizer=tokenize_text,
    lowercase=False,

Classifier - a simple multinomial naive Bayes classifier is trained on the vectorizer output.
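
The three components above can be wired together with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the body of `tokenize_text` is a guess at what "keywords and special characters, without numbers, comments and docstrings" might mean, the regular expressions are deliberately crude, and the four training snippets are toy data. Only the vectorizer parameters are taken from the list above; `token_pattern=None` is added here just to silence the warning scikit-learn emits when a custom tokenizer is supplied.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumed patterns; a real tokenizer would need per-language care.
DOCSTRING_RE = re.compile(r'"""[\s\S]*?"""|\'\'\'[\s\S]*?\'\'\'')
BLOCK_COMMENT_RE = re.compile(r"/\*[\s\S]*?\*/")
# '#' must be followed by whitespace so that '#include' and '#!' survive.
LINE_COMMENT_RE = re.compile(r"(//|#\s).*")
# Identifiers/keywords, or single special characters. Pure numbers match neither.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def tokenize_text(text):
    """Hypothetical tokenizer: drop docstrings and comments, then return
    keywords/identifiers and special characters, skipping numbers."""
    text = DOCSTRING_RE.sub(" ", text)
    text = BLOCK_COMMENT_RE.sub(" ", text)
    text = LINE_COMMENT_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


model = make_pipeline(
    TfidfVectorizer(
        max_features=1000,
        binary=True,
        ngram_range=(1, 1),
        tokenizer=tokenize_text,
        lowercase=False,
        token_pattern=None,  # unused when a tokenizer is given
    ),
    MultinomialNB(),
)

# Toy training data, for illustration only.
snippets = [
    "def add(a, b):\n    return a + b",
    "import os\nprint(os.getcwd())",
    "SELECT id FROM users WHERE age > 30;",
    "INSERT INTO users (id) VALUES (1);",
]
labels = [
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_SQL",
    "TGLANG_LANGUAGE_SQL",
]

model.fit(snippets, labels)
prediction = model.predict(["SELECT name FROM users;"])
print(prediction[0])
```

Setting `lowercase=False` keeps case-sensitive signals such as `SELECT` versus `select`, and `binary=True` makes each token count once per snippet, which suits short inputs where repeated tokens carry little extra information.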

Results

Accuracy on the test set: 0.82
Accuracy on the validation set: 0.83