README.md · LemiSt/code-segmentor-distilbert at main

metadata

license: apache-2.0
tags:
  - Token Classification
widget:
  - text: "The following is a bubble sort implementation taken from TeamTest57/Whack-A-Mole on github. int iro = 0; int score = 0; void bubble_sort() {\n\tint i, j;\n\tfor (i = 0; i < mole_num - 1; i++)\n\t\tfor (j = mole_num - 1; j >= i + 1; j--)\n\t\t\tif (hole_y[j] < hole_y[j - 1]) {\n\t\t\t\tint temp;\n\t\t\t\ttemp = hole_y[j];\n\t\t\t\thole_y[j] = hole_y[j - 1];\n\t\t\t\thole_y[j - 1] = temp;\n\t\t\t\ttemp = hole_x[j];\n\t\t\t\thole_x[j] = hole_x[j - 1];\n\t\t\t\thole_x[j - 1] = temp;\n\t\t\t}\n}"
    example_title: example 1
  - text: >-
      # Sample animal inherits from custom metaclass class
      Panda(metaclass=CustomMeta):
          """I bet you see this docstring printed as well"""
          fav_food = "Bamboo"
          loves_code = True

          def activity(self):
              print("Zzz...")
      This programming code was taken from cyberpanda/PythonStuff on GitHub and
      is cc0-licensed. It defines a class with member variables and methods.
    example_title: example 2

This is a `distilbert-base-multilingual-cased` model fine-tuned with a NER (token classification) objective to tag each token according to whether it belongs to a code block or to natural-language text. The dataset of 78,210 examples was generated by randomly combining code and text blocks from other permissively licensed datasets; some examples contain only code and some only regular text.
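A minimal usage sketch, assuming the checkpoint loads through the standard `transformers` token-classification pipeline under the repository name `LemiSt/code-segmentor-distilbert`. The helper names `merge_spans` and `segment` are illustrative and not part of the model. The label names the model emits are not documented here, so the code passes through whatever labels the pipeline returns.

```python
from typing import Dict, List, Tuple

# Repository name of this model card; assumed to be loadable from the Hub.
MODEL_ID = "LemiSt/code-segmentor-distilbert"


def merge_spans(predictions: List[Dict], max_gap: int = 1) -> List[Tuple[str, int, int]]:
    """Merge neighbouring predictions that share a label into (label, start, end) spans.

    `predictions` is expected to look like the output of a token-classification
    pipeline with an aggregation strategy: dicts with "entity_group", "start", "end".
    Two predictions are merged when they carry the same label and are separated
    by at most `max_gap` characters (e.g. a single space).
    """
    spans: List[Tuple[str, int, int]] = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        label, start, end = pred["entity_group"], pred["start"], pred["end"]
        if spans and spans[-1][0] == label and start - spans[-1][2] <= max_gap:
            # Extend the previous span instead of opening a new one.
            spans[-1] = (label, spans[-1][1], end)
        else:
            spans.append((label, start, end))
    return spans


def segment(text: str) -> List[Tuple[str, str]]:
    """Tag `text` with the model and return (label, substring) pairs.

    Imports transformers lazily so that merge_spans can be used without it.
    """
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # group word pieces into labelled chunks
    )
    return [(label, text[start:end]) for label, start, end in merge_spans(tagger(text))]
```

Calling `segment("Here is a loop: for (i = 0; i < n; i++) { sum += i; }")` downloads the model on first use and returns the labelled stretches of the input. Merging on character offsets keeps the original whitespace of code blocks intact, which matters for indentation-sensitive languages.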

The model achieves the following stats on the validation set:

| Metric    | Value  |
|-----------|--------|
| Loss      | 0.0788 |
| F1 Score  | 0.8619 |
| Precision | 0.8362 |
| Recall    | 0.8893 |
| Accuracy  | 0.9792 |
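As a quick consistency check, the reported F1 score can be recomputed from precision and recall. This sketch assumes the three values are aggregates computed with the same averaging scheme, which the card does not state explicitly.

```python
# Reported validation metrics from the table above.
precision = 0.8362
recall = 0.8893

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8619, matching the reported F1 score
```

The gap between accuracy (0.9792) and F1 (0.8619) is typical for token classification: accuracy counts every token, while F1 only rewards correctly recovered labelled spans or classes.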