metadata
title: Tglang_programming_langugage_detection
app_file: app.py
sdk: gradio
sdk_version: 4.5.0

Tglang - identify the programming language of a code snippet

github repo

This is a solution for the Telegram hackathon.

The list of supported languages:

  TGLANG_LANGUAGE_C
  TGLANG_LANGUAGE_CPLUSPLUS
  TGLANG_LANGUAGE_CSHARP
  TGLANG_LANGUAGE_CSS
  TGLANG_LANGUAGE_DART
  TGLANG_LANGUAGE_DOCKER
  TGLANG_LANGUAGE_FUNC
  TGLANG_LANGUAGE_GO
  TGLANG_LANGUAGE_HTML
  TGLANG_LANGUAGE_JAVA
  TGLANG_LANGUAGE_JAVASCRIPT
  TGLANG_LANGUAGE_JSON
  TGLANG_LANGUAGE_KOTLIN
  TGLANG_LANGUAGE_LUA
  TGLANG_LANGUAGE_NGINX
  TGLANG_LANGUAGE_OBJECTIVE_C
  TGLANG_LANGUAGE_PHP
  TGLANG_LANGUAGE_POWERSHELL
  TGLANG_LANGUAGE_PYTHON
  TGLANG_LANGUAGE_RUBY
  TGLANG_LANGUAGE_RUST
  TGLANG_LANGUAGE_SHELL
  TGLANG_LANGUAGE_SOLIDITY
  TGLANG_LANGUAGE_SQL
  TGLANG_LANGUAGE_SWIFT
  TGLANG_LANGUAGE_TL
  TGLANG_LANGUAGE_TYPESCRIPT
  TGLANG_LANGUAGE_XML

Other programming languages and non-code text are identified as TGLANG_LANGUAGE_OTHER (index 0).
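As a rough sketch of how a predicted class index could be mapped back to a label name: the README only states that TGLANG_LANGUAGE_OTHER has index 0, so the assumption here is that the remaining indices follow the order of the list above. The names `SUPPORTED`, `LABELS`, `LABEL_TO_INDEX` and `index_to_label` are hypothetical and not taken from the project.

```python
# Assumption: indices 1..28 follow the order of the list in this README.
# Only index 0 (TGLANG_LANGUAGE_OTHER) is stated explicitly.
SUPPORTED = [
    "C", "CPLUSPLUS", "CSHARP", "CSS", "DART", "DOCKER", "FUNC", "GO",
    "HTML", "JAVA", "JAVASCRIPT", "JSON", "KOTLIN", "LUA", "NGINX",
    "OBJECTIVE_C", "PHP", "POWERSHELL", "PYTHON", "RUBY", "RUST", "SHELL",
    "SOLIDITY", "SQL", "SWIFT", "TL", "TYPESCRIPT", "XML",
]

# OTHER comes first so that it gets index 0.
LABELS = ["TGLANG_LANGUAGE_OTHER"] + [f"TGLANG_LANGUAGE_{name}" for name in SUPPORTED]
LABEL_TO_INDEX = {label: index for index, label in enumerate(LABELS)}


def index_to_label(index: int) -> str:
    """Return the label for an index, falling back to OTHER when out of range."""
    if 0 <= index < len(LABELS):
        return LABELS[index]
    return LABELS[0]


if __name__ == "__main__":
    print(len(LABELS), index_to_label(0), index_to_label(1))
```

Falling back to OTHER for unknown indices mirrors the rule that anything outside the supported list is reported as TGLANG_LANGUAGE_OTHER.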

Model development

Data

  • The training data consisted of 3.7k+ files with 220k+ lines of code, taken from the Stack dataset and collected manually from GitHub.
  • The test set was manually labelled from Telegram r1 files. It consisted of 493 files and 7404 lines of code. Not all classes are present in the test set.
  • Train files were split into shorter sequences of lines to match the test files' length.
  • OTHER files from the Telegram files were added to the train set to make up 20% of the data and to the test set to make up 50% of the data.
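The two preparation steps above (splitting train files and mixing in OTHER samples) can be sketched roughly as follows. This is not the project's code: the function names are hypothetical, and the chunk size of 15 lines is an assumption derived from the test set averaging about 15 lines per file (7404 / 493). To make OTHER samples a share p of the combined set, n_code * p / (1 - p) of them are needed for n_code code samples.

```python
from typing import List


def split_into_chunks(lines: List[str], chunk_size: int) -> List[List[str]]:
    """Split a file's lines into consecutive chunks of at most chunk_size lines."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def other_samples_needed(n_code: int, other_share: float) -> int:
    """Number of OTHER samples needed so they form other_share of the combined set."""
    if not 0 <= other_share < 1:
        raise ValueError("other_share must be in [0, 1)")
    return round(n_code * other_share / (1 - other_share))


if __name__ == "__main__":
    # Hypothetical 35-line train file, chunked to roughly the test files' length.
    file_lines = [f"line {i}" for i in range(35)]
    chunks = split_into_chunks(file_lines, chunk_size=15)
    print([len(chunk) for chunk in chunks])

    # OTHER share of 20% for the train set and 50% for the test set.
    print(other_samples_needed(800, 0.2), other_samples_needed(100, 0.5))
```

Fixed-size consecutive chunks are the simplest choice; sampling chunk lengths from the test set's length distribution would be another option.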

Model

  1. Tokenizer - a simple text tokenizer is used to extract keywords and special characters from the code. Numbers, comments and docstrings are removed.
  2. Text embedding - a TF-IDF vectorizer is used to extract features from the train set. The TF-IDF parameters are:
    max_features=1000,
    binary=True,
    ngram_range=(1,1),
    tokenizer=tokenize_text,
    lowercase=False,
  3. Classifier - a simple multinomial naive Bayes model is trained on the vectorizer output.
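The three steps above could be wired together roughly like this, assuming scikit-learn's `TfidfVectorizer` and `MultinomialNB` (the parameter names match, but the README does not name the library). The `tokenize_text` below is a simplified stand-in, not the project's tokenizer: it keeps identifiers and punctuation and drops standalone numbers, but does not strip comments or docstrings. The training snippets are toy data for illustration only.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Identifiers/keywords, or any single non-word, non-space character.
# Standalone digits match neither alternative, so numbers are dropped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def tokenize_text(text):
    """Simplified stand-in tokenizer (assumption): no comment/docstring removal."""
    return TOKEN_RE.findall(text)


# Vectorizer parameters as listed in the README.
model = make_pipeline(
    TfidfVectorizer(
        max_features=1000,
        binary=True,
        ngram_range=(1, 1),
        tokenizer=tokenize_text,
        lowercase=False,
    ),
    MultinomialNB(),
)

# Toy training data (hypothetical), far smaller than the real train set.
train_snippets = [
    "def add(a, b):\n    return a + b",
    "import os\nprint(os.getcwd())",
    "SELECT name FROM users WHERE id = 1;",
    "INSERT INTO users (name) VALUES ('x');",
]
train_labels = [
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_PYTHON",
    "TGLANG_LANGUAGE_SQL",
    "TGLANG_LANGUAGE_SQL",
]

model.fit(train_snippets, train_labels)
prediction = model.predict(["SELECT id FROM orders;"])[0]
print(prediction)
```

With `lowercase=False`, case-sensitive keywords such as `SELECT` and `select` stay distinct features, and `binary=True` makes term frequency 0/1, so a token's presence matters rather than its count.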

Results

  • Accuracy on the test set: 0.82
  • Accuracy on the validation set: 0.83