code-natural-language-classification-dataset

Dataset

This classifier labels a text as either Code or NaturalLanguage.
The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.97.
It can be used in LLM pretraining data curation to route text into different pipelines (e.g. a code syntax check).
It is ultra fast ⚡, with a throughput of ~2000 doc/s on CPU.
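
As a rough illustration of the routing use case, the sketch below groups documents by predicted label so that each group can be sent to its own pipeline. The names `route_documents` and `toy_classify` are hypothetical and not part of this repository; `toy_classify` is a trivial stand-in that returns the same `{"label", "score"}` shape as the `predict` function in the Usage section, which you would pass in instead.

```python
from typing import Callable, Dict, List


def route_documents(
    docs: List[str],
    classify: Callable[[List[str]], List[dict]],
) -> Dict[str, List[str]]:
    """Group documents by predicted label so each group can go to its own pipeline."""
    routed: Dict[str, List[str]] = {"Code": [], "NaturalLanguage": []}
    for doc, pred in zip(docs, classify(docs)):
        routed.setdefault(pred["label"], []).append(doc)
    return routed


def toy_classify(texts: List[str]) -> List[dict]:
    # Hypothetical stand-in for the real classifier, for illustration only.
    return [
        {"label": "Code" if ("import " in t or "def " in t) else "NaturalLanguage",
         "score": 1.0}
        for t in texts
    ]


routed = route_documents(
    ["import os\nprint(os.getcwd())", "The weather is nice today."],
    toy_classify,
)
# routed["Code"] would then go to, e.g., a syntax check, and
# routed["NaturalLanguage"] to a text-quality filter.
```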

🛠️Usage

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


model = fasttext.load_model(hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin"))  # "model_quantized.bin" for quantized version


def replace_newlines(text: str) -> str:
  return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
  text_list = [replace_newlines(text) for text in text_list]
  pred = model.predict(text_list)
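  # removeprefix (Python 3.9+) drops the exact "__label__" prefix; lstrip would strip a character set instead
  return [{"label": l[0].removeprefix("__label__"), "score": s[0]}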
           for l, s in zip(*pred)]


predict([
  """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
  """import torch""",
  """Short text won't work"""
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
# {'label': 'Code', 'score': 1.00001},
# {'label': 'Code', 'score': 1.000009}]
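
To check the throughput on your own hardware, a minimal timing sketch could look like the following. `measure_throughput` and its `batch_size` default are assumptions made for illustration; the stand-in lambda keeps the example self-contained, and in practice you would pass the `predict` function defined above together with your own documents.

```python
import time
from typing import Callable, List


def measure_throughput(
    predict_fn: Callable[[List[str]], list],
    docs: List[str],
    batch_size: int = 1000,  # hypothetical default, tune for your setup
) -> float:
    """Return the number of documents processed per second by predict_fn."""
    start = time.perf_counter()
    processed = 0
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        predict_fn(batch)
        processed += len(batch)
    elapsed = time.perf_counter() - start
    return processed / elapsed if elapsed > 0 else float("inf")


# Stand-in predictor so the sketch runs without downloading the model;
# replace it with `predict` to time the real classifier.
docs_per_second = measure_throughput(
    lambda batch: [len(t) for t in batch],
    ["some example text"] * 5000,
)
```

Results will depend on document length, batch size, and CPU, so the figure quoted in this card should be read as indicative.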

📊Evaluation

full version

                 precision    recall  f1-score   support

           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993

       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275

quantized version

                 precision    recall  f1-score   support

           Code       0.95      1.00      0.97    581282
NaturalLanguage       1.00      0.86      0.93    228993

      micro avg       0.96      0.96      0.96    810275
      macro avg       0.97      0.93      0.95    810275
   weighted avg       0.96      0.96      0.96    810275
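
The tables above appear to follow the layout of scikit-learn's `classification_report` (an assumption, as the evaluation script is not shown). For readers unfamiliar with the columns, the sketch below computes per-class precision, recall, and F1, plus the macro and weighted F1 averages, on a toy example in pure Python. The toy labels are made up for illustration and are unrelated to the evaluation data.

```python
from typing import Dict, List


def per_class_report(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, F1 and support for each label in y_true."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


def macro_f1(report: Dict[str, Dict[str, float]]) -> float:
    # Unweighted mean of the per-class F1 scores.
    return sum(r["f1"] for r in report.values()) / len(report)


def weighted_f1(report: Dict[str, Dict[str, float]]) -> float:
    # Mean of the per-class F1 scores, weighted by class support.
    total = sum(r["support"] for r in report.values())
    return sum(r["f1"] * r["support"] for r in report.values()) / total


# Toy data: one NaturalLanguage sample is misclassified as Code.
y_true = ["Code", "Code", "Code", "NaturalLanguage", "NaturalLanguage"]
y_pred = ["Code", "Code", "Code", "Code", "NaturalLanguage"]
report = per_class_report(y_true, y_pred)
```

In this toy case Code has perfect recall but lower precision, while NaturalLanguage has perfect precision but lower recall, the same pattern of error the tables show.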

📝Definition of Label

Code covers:

{'Assembly',
 'Batchfile',
 'C',
 'C#',
 'C++',
 'CMake',
 'CSS',
 'Dockerfile',
 'FORTRAN',
 'GO',
 'HTML',
 'Haskell',
 'Java',
 'JavaScript',
 'Julia',
 'Lua',
 'Makefile',
 'PHP',
 'Perl',
 'PowerShell',
 'Python',
 'Ruby',
 'Rust',
 'SQL',
 'Scala',
 'Shell',
 'TeX',
 'TypeScript',
 'Visual Basic'}

Markdown is excluded because it overlaps heavily with natural language.

⚠️Known Limitation

The classifier does not handle short text well, which is perhaps unsurprising.
It tends to classify short natural-language text as code, possibly because such text often appears in code comments.
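
One possible way to work around this limitation is to skip classification for very short inputs and mark them as undecided. The sketch below is an assumption, not part of the model: the `MIN_CHARS` threshold, the `"Unknown"` label, and the function name are all hypothetical, and the threshold would need tuning on your own data. `classify` stands for any function with the same input and output shape as `predict` in the Usage section.

```python
from typing import Callable, List

MIN_CHARS = 50  # hypothetical threshold, not derived from the model's evaluation


def predict_with_length_guard(
    texts: List[str],
    classify: Callable[[List[str]], List[dict]],
    min_chars: int = MIN_CHARS,
) -> List[dict]:
    """Classify only texts of at least min_chars characters; mark the rest as Unknown."""
    long_idx = [i for i, t in enumerate(texts) if len(t.strip()) >= min_chars]
    preds = classify([texts[i] for i in long_idx]) if long_idx else []
    # Default result for texts that are too short to classify reliably.
    results: List[dict] = [{"label": "Unknown", "score": None} for _ in texts]
    for i, pred in zip(long_idx, preds):
        results[i] = pred
    return results


def always_code(batch: List[str]) -> List[dict]:
    # Hypothetical stand-in classifier that labels everything as Code.
    return [{"label": "Code", "score": 1.0} for _ in batch]


guarded = predict_with_length_guard(["import torch", "x = 1\n" * 20], always_code)
```

Here the first input is shorter than the threshold and is left as `Unknown`, while the second is passed to the classifier.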

