metadata

license: mit
datasets:
  - kenhktsui/code-natural-language-classification-dataset
language:
  - en
metrics:
  - f1
pipeline_tag: text-classification
library_name: fasttext

code-natural-language-classification-dataset

Dataset

This classifier labels a text as either Code or NaturalLanguage.
The model was trained on 3.24M records, a mix of code and natural language, and achieved a macro-averaged test F1 score of 0.97.
It can be used for LLM pretraining data curation, for example to route texts into different pipelines (e.g. a code syntax check).
It is ultra fast ⚡, with a throughput of ~2000 doc/s on CPU.
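
Throughput depends on hardware and document length, so the figure above is worth checking locally. The following is a minimal sketch of one way to time a batch classifier. The names `measure_throughput`, `classify`, and `repeats` are hypothetical and not part of this model's API; `classify` stands for any callable that takes a list of texts, such as the `predict` function in the Usage section.

```python
import time
from typing import Any, Callable, List, Sequence


def measure_throughput(
    classify: Callable[[List[str]], Any],  # hypothetical: any batch classifier
    docs: Sequence[str],
    repeats: int = 3,                      # hypothetical default
) -> float:
    """Return the best observed documents/second over `repeats` timed runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        classify(list(docs))
        # Guard against a zero interval on very fast calls.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, len(docs) / elapsed)
    return best
```

Calling `measure_throughput(predict, sample_docs)` on a representative sample gives a number that can be compared against the ~2000 doc/s quoted here.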

🛠️Usage

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


model_hf = fasttext.load_model(hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
  return re.sub("\n+", " ", text)


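def predict(text_list: List[str]) -> List[dict]:
  # fastText expects single-line input, so collapse newlines first
  text_list = [replace_newlines(text) for text in text_list]
  pred = model_hf.predict(text_list)
  # pred is (labels, scores); strip the "__label__" prefix from each top label
  return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}
           for l, s in zip(*pred)]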


predict([
  """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
  """import torch""",
  """Short text won't work"""
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
# {'label': 'Code', 'score': 1.00001},
# {'label': 'Code', 'score': 1.000009}]
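
The routing use case can be sketched as below. This is an illustrative helper, not part of the model: `route`, `classify`, the `Uncertain` bucket, and the `min_score` threshold of 0.9 are all assumptions. `classify` is expected to return dictionaries shaped like the output above (`label` and `score` keys).

```python
from typing import Any, Callable, Dict, List


def route(
    texts: List[str],
    classify: Callable[[List[str]], List[Dict[str, Any]]],
    min_score: float = 0.9,  # hypothetical threshold; tune on your own data
) -> Dict[str, List[str]]:
    """Group texts into 'Code', 'NaturalLanguage', or 'Uncertain' buckets."""
    buckets: Dict[str, List[str]] = {
        "Code": [],
        "NaturalLanguage": [],
        "Uncertain": [],  # hypothetical bucket for low-confidence predictions
    }
    for text, pred in zip(texts, classify(texts)):
        label = pred["label"] if pred["score"] >= min_score else "Uncertain"
        buckets.setdefault(label, []).append(text)
    return buckets
```

With the function defined in this section, `route(docs, predict)["Code"]` would then hold the texts to send to a code-specific step such as a syntax check.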

📊Evaluation

                 precision    recall  f1-score   support

           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993

       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275

📝Definition of Label

Code covers:

{'Assembly',
 'Batchfile',
 'C',
 'C#',
 'C++',
 'CMake',
 'CSS',
 'Dockerfile',
 'FORTRAN',
 'GO',
 'HTML',
 'Haskell',
 'Java',
 'JavaScript',
 'Julia',
 'Lua',
 'Makefile',
 'PHP',
 'Perl',
 'PowerShell',
 'Python',
 'Ruby',
 'Rust',
 'SQL',
 'Scala',
 'Shell',
 'TeX',
 'TypeScript',
 'Visual Basic'}

Markdown is disregarded as it has a high overlap with natural language.

⚠️Known Limitation

The classifier does not handle short text well, which is perhaps not surprising.
It tends to misclassify short natural-language text as Code, possibly because such short text often appears in code comments.
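
One possible mitigation is to skip classification for inputs below a length cutoff and handle them separately. The sketch below is an assumption, not a documented feature: the name `predict_if_long_enough` and the 50-character cutoff are hypothetical, and the source does not state how short is too short, so the cutoff should be chosen from your own data.

```python
from typing import Any, Callable, Dict, List, Optional

MIN_CHARS = 50  # hypothetical cutoff; not specified by the model card


def predict_if_long_enough(
    texts: List[str],
    classify: Callable[[List[str]], List[Dict[str, Any]]],
    min_chars: int = MIN_CHARS,
) -> List[Optional[Dict[str, Any]]]:
    """Classify texts of at least `min_chars` characters; return None for the rest."""
    long_idx = [i for i, t in enumerate(texts) if len(t.strip()) >= min_chars]
    preds = classify([texts[i] for i in long_idx]) if long_idx else []
    results: List[Optional[Dict[str, Any]]] = [None] * len(texts)
    for i, pred in zip(long_idx, preds):
        results[i] = pred
    return results
```

Entries returned as `None` mark texts that were too short to classify, which the caller can drop or pass to a different heuristic.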