---
base_model:
- microsoft/codebert-base
datasets:
- devngho/the-stack-llm-annotations-v2
language:
- code
library_name: transformers
license: mit
metrics:
- f1
---

# devngho/code_edu_classifier-v3-microsoft_codebert-base

This model is [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) with a classification head. Intended as a code-focused counterpart to [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), it scores the educational value of code.

The training data comes from the [devngho/the-stack-llm-annotations-v2](https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2) dataset, which contains samples extracted from [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup) and scored with [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).

This research was supported with Cloud TPUs from Google's TPU Research Cloud [(TRC)](https://sites.research.google/trc/about/). ⚡

## Details

- **Developed by:** devngho
- **Language(s):** code
- **License:** mit
- **Base model:** [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)

## Training details

- learning_rate: 3e-4 (cosine schedule)
- warmup_ratio: 0.1
- batch_size: 2048 (512 * 4)
- optimizer: AdamW (b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
- duration: 4h 41m
- steps: 6080

## Training hardware

TPU v4-8

## Performance

```
Validation Report:
              precision    recall  f1-score   support

           0       0.80      0.06      0.10        72
           1       0.62      0.40      0.48       835
           2       0.61      0.62      0.61      2722
           3       0.48      0.72      0.58      1891
           4       0.62      0.02      0.05       623
           5       0.00      0.00      0.00         1

    accuracy                           0.55      6144
   macro avg       0.52      0.30      0.30      6144
weighted avg       0.58      0.55      0.52      6144

Confusion Matrix:
[[   4   36   30    2    0    0]
 [   1  330  464   40    0    0]
 [   0  157 1684  881    0    0]
 [   0    5  516 1361    9    0]
 [   0    0   71  537   15    0]
 [   0    0    0    1    0    0]]
```

When the scores are binarized at 3 (scores of 3 and above vs. below 3), the F1 score is about 0.72.
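That binary figure can be reproduced from the confusion matrix above. A minimal sketch with NumPy, assuming the usual convention that rows are true scores and columns are predictions:

```python
import numpy as np

# Confusion matrix from the validation report above
# (assumed convention: rows = true score, columns = predicted score).
cm = np.array([
    [  4,   36,   30,    2,  0, 0],
    [  1,  330,  464,   40,  0, 0],
    [  0,  157, 1684,  881,  0, 0],
    [  0,    5,  516, 1361,  9, 0],
    [  0,    0,   71,  537, 15, 0],
    [  0,    0,    0,    1,  0, 0],
])

# Binarize at the threshold: positive = score >= 3.
tp = cm[3:, 3:].sum()  # true >= 3, predicted >= 3
fp = cm[:3, 3:].sum()  # true <  3, predicted >= 3
fn = cm[3:, :3].sum()  # true >= 3, predicted <  3

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"binary F1 (>=3 vs. <3): {f1:.2f}")  # -> 0.72
```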
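## Example usage

The checkpoint should load through the standard `transformers` sequence-classification API. Below is a minimal sketch, assuming the head exposes one logit per score class (0-5), as the confusion matrix suggests; the sample snippet being scored is illustrative only:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "devngho/code_edu_classifier-v3-microsoft_codebert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

code = """
def binary_search(arr, target):
    # Return the index of target in sorted arr, or -1 if absent.
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
"""

# CodeBERT (RoBERTa) supports sequences up to 512 tokens.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumes a 6-way head (scores 0-5); take the most likely score.
score = logits.argmax(dim=-1).item()
print(f"Predicted educational score: {score}")
```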
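For filtering pipelines, the binarized decision that the 0.72 F1 above refers to corresponds to `logits.argmax(-1) >= 3` under the 6-way head assumed here. If the head is instead a single-logit regressor (as in fineweb-edu-classifier), replace the argmax with rounding the raw logit.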