---
base_model:
- microsoft/codebert-base
datasets:
- devngho/the-stack-llm-annotations-v2
language:
- code
library_name: transformers
license: mit
metrics:
- f1
---

# devngho/code_edu_classifier-v3-microsoft_codebert-base

This model is [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) with a classification head. Intended as a code-focused counterpart to [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), it scores the educational value of code.

The training data comes from the [devngho/the-stack-llm-annotations-v2](https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2) dataset, which contains samples extracted from [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup) and scored with [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).

This research was supported with Cloud TPUs from Google's TPU Research Cloud [(TRC)](https://sites.research.google/trc/about/). ⚡

## Details

- **Developed by:** devngho
- **Language(s):** code
- **License:** mit
- **Base model:** [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)

## Training details

- learning_rate: 3e-4 (cosine schedule)
- warmup_ratio: 0.1
- batch_size: 2048 (512 * 4)
- optimizer: AdamW (b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
- duration: 4h 41m
- steps: 6080

## Training hardware

TPU v4-8

## Performance

```
Validation Report:
              precision    recall  f1-score   support

           0       0.80      0.06      0.10        72
           1       0.62      0.40      0.48       835
           2       0.61      0.62      0.61      2722
           3       0.48      0.72      0.58      1891
           4       0.62      0.02      0.05       623
           5       0.00      0.00      0.00         1

    accuracy                           0.55      6144
   macro avg       0.52      0.30      0.30      6144
weighted avg       0.58      0.55      0.52      6144

Confusion Matrix:
[[   4   36   30    2    0    0]
 [   1  330  464   40    0    0]
 [   0  157 1684  881    0    0]
 [   0    5  516 1361    9    0]
 [   0    0   71  537   15    0]
 [   0    0    0    1    0    0]]
```

When the scores are binarized at 3 (scores of 3 and above vs. below 3), the F1 score is about 0.72.
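That binary figure can be reproduced from the confusion matrix above. A minimal sketch with NumPy, assuming the usual convention that rows are true scores and columns are predictions:

```python
import numpy as np

# Confusion matrix from the validation report above
# (assumed convention: rows = true score, columns = predicted score).
cm = np.array([
    [  4,   36,   30,    2,  0, 0],
    [  1,  330,  464,   40,  0, 0],
    [  0,  157, 1684,  881,  0, 0],
    [  0,    5,  516, 1361,  9, 0],
    [  0,    0,   71,  537, 15, 0],
    [  0,    0,    0,    1,  0, 0],
])

# Binarize at the threshold: positive = score >= 3.
tp = cm[3:, 3:].sum()  # true >= 3, predicted >= 3
fp = cm[:3, 3:].sum()  # true <  3, predicted >= 3
fn = cm[3:, :3].sum()  # true >= 3, predicted <  3

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"binary F1 (>=3 vs. <3): {f1:.2f}")  # -> 0.72
```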
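## Example usage

The checkpoint should load through the standard `transformers` sequence-classification API. Below is a minimal sketch, assuming the head exposes one logit per score class (0-5), as the confusion matrix suggests; the sample snippet being scored is illustrative only:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "devngho/code_edu_classifier-v3-microsoft_codebert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

code = """
def binary_search(arr, target):
    # Return the index of target in sorted arr, or -1 if absent.
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
"""

# CodeBERT (RoBERTa) supports sequences up to 512 tokens.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumes a 6-way head (scores 0-5); take the most likely score.
score = logits.argmax(dim=-1).item()
print(f"Predicted educational score: {score}")
```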
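For filtering pipelines, the binarized decision that the 0.72 F1 above refers to corresponds to `logits.argmax(-1) >= 3` under the 6-way head assumed here. If the head is instead a single-logit regressor (as in fineweb-edu-classifier), replace the argmax with rounding the raw logit.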