Introduction

Novora CodeClassifier v1 Tiny is a tiny text classification model that classifies a given code snippet into 1 of 31 classes (programming languages).

The model is designed to run on a CPU, but runs fastest on a GPU.

Info

  • Outputs 1 of 31 classes (programming languages)
  • 512-token input dimension
  • 64 hidden dimensions
  • 2 linear layers
  • Uses the snowflake-arctic-embed-xs model as the embeddings model.
  • Dataset split into an 80% training set and a 20% testing set.
  • The combined training and testing data comes to roughly 1,000 chunks per programming language, about 31,100 chunks (entries) in total; each chunk is a 512-token snippet of code (see the chunking sketch after this list).
  • The released checkpoint was picked from the 18th of 20 training epochs.
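
The chunking script is not published with this card; the sketch below shows one way the 512-token chunks could be produced, assuming each source file is tokenized with the snowflake-arctic-embed-xs tokenizer and the token stream is sliced into windows of at most 512 tokens (chunk_source is a hypothetical helper, not part of the released code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-xs")

def chunk_source(code: str, chunk_size: int = 512):
    # Tokenize the whole file, slice the token ids into windows of at most
    # chunk_size tokens, and decode each window back to a text chunk.
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + chunk_size])
            for i in range(0, len(ids), chunk_size)]

chunks = chunk_source("def add(a, b):\n    return a + b\n")
print(len(chunks))  # 1 for a file this small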

Architecture

The CodeClassifier-v1-Tiny model employs a neural network architecture optimized for text classification tasks, specifically for classifying programming languages from code snippets. This model includes:

  • Bidirectional LSTM Feature Extractor: This bidirectional LSTM layer processes input embeddings, effectively capturing contextual relationships in both forward and reverse directions within the code snippets.

  • Fully Connected Layers: The network includes two linear layers. The first projects the pooled features into a hidden feature space, and the second linear layer maps these to the output classes, which correspond to different programming languages. A dropout layer with a rate of 0.5 between these layers helps mitigate overfitting.

The model's bidirectional nature and architectural components make it adept at understanding the syntax and structure crucial for code classification.
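
A quick shape walk-through of that stack, as a sketch (the 384-dimensional embedding size is an assumption based on snowflake-arctic-embed-xs; hidden size 64 and 31 classes come from the Info section above):

import torch
import torch.nn as nn

emb = torch.randn(8, 384)                       # batch of 8 pooled code embeddings
lstm = nn.LSTM(384, 64, num_layers=2, batch_first=True, bidirectional=True)
out, _ = lstm(emb.unsqueeze(1))                 # (8, 1, 128): forward/backward states concatenated
hidden = nn.Linear(2 * 64, 64)(out.squeeze(1))  # (8, 64): intermediate features
logits = nn.Linear(64, 31)(hidden)              # (8, 31): one logit per language
print(out.shape, hidden.shape, logits.shape)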

Testing/Training Datasets

I have listed below the samples entered into the training/testing pipeline; it is a very small amount of data. A sketch of the stratified split follows the table.

Language      Testing Count  Training Count
Ada           20             80
Assembly      20             80
C             20             80
C#            20             80
C++           20             80
COBOL         14             55
Common Lisp   20             80
Dart          20             80
Erlang        20             80
F#            20             80
Go            20             80
Haskell       20             80
Java          20             80
JavaScript    20             80
Julia         20             80
Kotlin        20             80
Lua           20             80
MATLAB        20             80
PHP           20             80
Perl          20             80
Prolog        1              4
Python        20             80
R             20             80
Ruby          20             80
Rust          20             80
SQL           20             80
Scala         20             80
Swift         20             80
TypeScript    20             80
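
The split script is not published either; below is a minimal sketch of a stratified 80/20 split that would reproduce the per-language ratios above (the toy chunks/labels lists are stand-ins for the real data):

from sklearn.model_selection import train_test_split

# Toy stand-in data; the real lists hold 512-token code chunks and their language labels.
chunks = ["print('hi')", "fmt.Println(1)", "puts 1", "console.log(1)"] * 25
labels = ["Python", "Go", "Ruby", "JavaScript"] * 25

# Stratified split keeps the 80/20 ratio within every language.
train_x, test_x, train_y, test_y = train_test_split(
    chunks, labels, test_size=0.20, stratify=labels, random_state=42
)
print(len(train_x), len(test_x))  # 80 20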

Example Code

import torch
import torch.nn as nn
from pathlib import Path
from safetensors.torch import load_file
from transformers import AutoTokenizer, AutoModel

class CodeClassifier(nn.Module):
    def __init__(self, num_classes, embedding_dim, hidden_dim, num_layers, bidirectional=False):
        super(CodeClassifier, self).__init__()
        self.feature_extractor = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=bidirectional)
        self.dropout = nn.Dropout(0.5)  # Dropout between the linear layers to mitigate overfitting
        self.fc1 = nn.Linear(hidden_dim * (2 if bidirectional else 1), hidden_dim)  # Intermediate layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # Output layer

    def forward(self, x):
        x = x.unsqueeze(1)  # Add sequence dimension
        x, _ = self.feature_extractor(x)
        x = x.squeeze(1)  # Remove sequence dimension
        x = self.fc1(x)
        x = self.dropout(x)  # Apply dropout
        x = self.fc2(x)
        return x

def infer(text, model_path, embedding_model_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load tokenizer and embedding model
    tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
    embedding_model.eval()

    # Prepare inputs
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate embeddings (CLS-token pooling of the last hidden state)
    with torch.no_grad():
        embeddings = embedding_model(**inputs)[0][:, 0]

    # Load classifier model
    model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1), hidden_dim=64, num_layers=2, bidirectional=True)
    model.load_state_dict(load_file(model_path, device=str(device)))  # safetensors checkpoint; torch.load cannot read this format
    model = model.to(device)
    model.eval()

    # Predict class
    with torch.no_grad():
        output = model(embeddings)
        _, predicted = torch.max(output, dim=1)

    # Language labels
    languages = [
        'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
        'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
        'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
        'Swift', 'TypeScript'
    ]
    
    return languages[predicted.item()]

# Example usage
if __name__ == "__main__":
    example_text = "print('Hello, world!')"  # Replace with actual text for inference
    model_file_path = Path("./model.safetensors")
    predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
    print(f"Predicted programming language: {predicted_language}")

Model Size

340k parameters (F32 tensors, stored in Safetensors format).
