---
base_model:
- microsoft/codebert-base
datasets:
- devngho/the-stack-llm-annotations-v2
language:
- code
library_name: transformers
license: mit
metrics:
- f1
---
# devngho/code_edu_classifier-v3-microsoft_codebert-base
์ด ๋ชจ๋ธ์€ [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)์— classifier๋ฅผ ์ถ”๊ฐ€ํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)์˜ ์ฝ”๋“œ ๋ฒ„์ „์„ ๋ชฉํ‘œ๋กœ, ์ฝ”๋“œ์˜ ๊ต์œก์„ฑ ์ ์ˆ˜๋ฅผ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
ํ•™์Šต์—๋Š” [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)์—์„œ ์ถ”์ถœํ•œ ์ƒ˜ํ”Œ์„ [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)๋กœ ํ‰๊ฐ€ํ•œ [devngho/the-stack-llm-annotations-v2](https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2) ๋ฐ์ดํ„ฐ์…‹์ด ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
์ด ์—ฐ๊ตฌ๋Š” Google์˜ TPU Research Cloud [(TRC)](https://sites.research.google/trc/about/)์˜ Cloud TPU ์ œ๊ณต์œผ๋กœ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค. โšก
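A minimal usage sketch, assuming the model loads as a standard six-label `transformers` sequence-classification head (consistent with the 0–5 scores in the Performance section) and that the predicted score is the argmax over the class logits:

```python
import torch

MODEL_ID = "devngho/code_edu_classifier-v3-microsoft_codebert-base"

def score_code(snippet: str, tokenizer, model) -> int:
    """Return the predicted educational score (0-5) for one code snippet."""
    # Truncate to CodeBERT's 512-token context window.
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # one logit per score class 0-5
    return int(logits.argmax().item())

# Usage (downloads the weights on first run):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
# print(score_code("def add(a, b):\n    return a + b", tokenizer, model))
```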
## ์ƒ์„ธ
- **์ œ์ž‘:** devngho
- **์–ธ์–ด:** code
- **๋ผ์ด์„ ์Šค:** mit
- **๊ธฐ๋ฐ˜ ๋ชจ๋ธ:** [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
## ํ•™์Šต ์ƒ์„ธ
- learning_rate: 3e-4 (cosine)
- warmup_ratio: 0.1
- batch_size: 2048 (512 * 4)
- optimizer: adamw(b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
- duration: 4h 41m
- steps: 6080
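For illustration, the schedule above (3e-4 peak, warmup_ratio 0.1, 6080 steps) corresponds roughly to the following, assuming linear warmup followed by cosine decay to zero; the exact schedule implementation is not documented here:

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 6080
WARMUP_STEPS = int(TOTAL_STEPS * 0.1)  # warmup_ratio 0.1 -> 608 steps

def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```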
## Training hardware
TPU v4-8
## Performance
```
Validation Report:
              precision    recall  f1-score   support

           0       0.80      0.06      0.10        72
           1       0.62      0.40      0.48       835
           2       0.61      0.62      0.61      2722
           3       0.48      0.72      0.58      1891
           4       0.62      0.02      0.05       623
           5       0.00      0.00      0.00         1

    accuracy                           0.55      6144
   macro avg       0.52      0.30      0.30      6144
weighted avg       0.58      0.55      0.52      6144

Confusion Matrix:
[[   4   36   30    2    0    0]
 [   1  330  464   40    0    0]
 [   0  157 1684  881    0    0]
 [   0    5  516 1361    9    0]
 [   0    0   71  537   15    0]
 [   0    0    0    1    0    0]]
```
3 ์ด์ƒ๊ณผ ๋ฏธ๋งŒ์œผ๋กœ ๊ตฌ๋ถ„ํ•  ๋•Œ f1 score๋Š” ์•ฝ 0.72์ž…๋‹ˆ๋‹ค.