E-Star-12B-v2-Base
⚠️ Non-Commercial Use Only
This model is distributed under the CC BY-NC 4.0 license.
Use for commercial purposes (product or service integration, paid API offerings, deployment in internal operational systems, and so on) is not permitted.
Commercial use inquiries: a separate license agreement is required, arranged through [Selectstar official channels].
- Owner: Selectstar Eval team
- Date written: 2026-05-22
- Status: active
1. Model Description
Architecture / Parameters
| Item | Details |
|---|---|
| Base model | Gemma-3-12B-IT |
| Parameters | 12B |
| Training method | Full Fine-Tuning (SFT) |
| Output structure | feedback → highlight → decision |
Version information
| Version | Description |
|---|---|
| v0.1 | Initial Base version. Trained on 6,311 samples obtained by three-stage filtering of K2-Feedback |
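Purpose (Use Cases)
An SLM-based evaluator that reliably performs rubric-based evaluation in Korean. It is specialized for quality evaluation of RAG pipelines in the finance and legal domains and supports the following evaluation axes.
- Faithfulness: judges whether the response is grounded in the provided documents (hallucination detection)
- Context Relevancy: judges whether the retrieved documents are relevant to the query (retrieval quality)
- Response Relevancy: judges whether the response appropriately addresses the query (overall response adequacy)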
Potential Applications
- Can be extended to RAG evaluation in domains beyond finance and law (requires domain-specific rubric design)
- General-purpose rubric-based quality evaluation of LLM outputs (rubric-following ability validated on Ko Feedback Bench)
- A cost-efficient alternative to frontier models when automating evaluation pipelines
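Extending the evaluator to a new domain mostly comes down to supplying a different rubric in the same prompt layout. The sketch below assembles the `# Data to Evaluate` and `# Rubric` parts of the user prompt, following the layout used in the inference snippet in Section 2. The helper name `build_data_block` and the sample rubric text are hypothetical, introduced only for illustration. How retrieved documents should be laid out for the RAG axes is not specified in this card, so the sketch leaves that to the caller.

```python
from typing import Optional


def build_data_block(problem: str, response: str, rubric: str,
                     ground_truth: Optional[str] = None) -> str:
    """Assemble the '# Data to Evaluate' and '# Rubric' parts of a user prompt.

    Hypothetical helper: the section titles mirror the inference snippet,
    everything else is an assumption made for illustration.
    """
    parts = [
        "# Data to Evaluate",
        "### Problem", problem.strip(),
        "### Model Response", response.strip(),
    ]
    if ground_truth is not None:
        parts += ["### Optional Ground Truth", ground_truth.strip()]
    parts += ["# Rubric", rubric.strip()]
    return "\n".join(parts)


# Made-up two-level rubric, for illustration only.
example_rubric = (
    "Evaluate whether the response answers the question that was asked.\n"
    "Score 1: The response does not address the question.\n"
    "Score 2: The response addresses the question directly."
)

if __name__ == "__main__":
    print(build_data_block("What is 2 + 2?", "4", example_rubric, ground_truth="4"))
```

The output-format instructions from the inference snippet would still need to be placed in front of this block before sending it to the model.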
2. How to Run the Model
Training code snippet
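from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# GemmaForCausalLM is the first-generation Gemma class; the Gemma 3 checkpoint
# is loaded through the Auto class here, as in the inference snippet.
model_name = "google/gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sft_config = SFTConfig(
    output_dir="./eval-estar-base-v0.1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # restore the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
    logging_steps=10,
)

# train_dataset and eval_dataset are assumed to be prepared beforehand (not shown).
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL releases; older releases take tokenizer= instead
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=sft_config,
)
trainer.train()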
※ In the actual training run, early stopping was applied at epoch 2 based on validation loss.
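As a framework-free illustration of stopping on validation loss (not the authors' training code), the sketch below stops once the loss has failed to improve for `patience` consecutive epochs and reports which epoch had the lowest loss. The function names, the `patience` value, and the loss values are all made up for the example. In Transformers the same behaviour is usually obtained by passing `EarlyStoppingCallback` through the trainer's `callbacks` argument; whether that was used here is not stated.

```python
from typing import List, Optional


def early_stop_epoch(eval_losses: List[float], patience: int = 1) -> Optional[int]:
    """Return the 1-based epoch at which training would stop early.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. Returns None if that never happens.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


def best_epoch(eval_losses: List[float]) -> int:
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(eval_losses)), key=eval_losses.__getitem__) + 1


if __name__ == "__main__":
    # Illustrative values only, not losses from the actual run.
    losses = [0.90, 0.80, 0.85, 0.88, 0.91]
    stop = early_stop_epoch(losses, patience=1)
    print("stopped after epoch:", stop)                  # 3
    print("best checkpoint:", best_epoch(losses[:stop])) # 2
```

Combined with `load_best_model_at_end=True`, the checkpoint that is kept is the one from the best epoch, not the one at which training stopped.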
Inference code snippet
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "datumo/E-Star-12B-v2-Base"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
system_prompt = """You are a rubric evaluator.
Your task is to evaluate a response strictly and only according to the provided pass criteria and scoring rubric.
In your output, return the final evaluation (the three output tags: <feedback>, <highlight>, and <decision>).
# Evaluation Procedure (must follow all steps):
1. First, carefully read the Data to Evaluate, the pass criteria, and the scoring rubric to fully understand the requirements.
2. Evaluate the response only against the given criteria: do not introduce external standards, do not reward style unless the rubric explicitly allows it, and judge by absolute rubric definitions rather than relative comparisons.
3. Re-check fine-grained details in the response and the rubric, ensuring any tags (if present) are correctly mapped to the pass criteria and that small deviations are not overlooked.
4. Write criterion-focused feedback that explicitly references the rubric, quoting exact words or phrases from the response when they are decisive, and clearly stating which criteria are satisfied and which are violated.
5. Finally, extract the key verbatim spans that most influenced your judgment and assign the final score according to the scoring rubric.
"""
user_prompt = """You MUST write ALL output (<feedback>, <highlight>, <decision>) in the SAME language as the input question and response being evaluated. If the input is in Korean, your entire output MUST be in Korean.
# Output Format:
<feedback>
Write detailed feedback (reasons) that strictly evaluates the quality of the response using only the given scoring rubric. Do not explicitly state the score in a sentence (e.g., "Therefore, the score is …").
</feedback>
<highlight>
List of words or phrases that you believe are the most important in determining the score.
</highlight>
<decision>
Provide the final integer score assigned based on the scoring rubric.
</decision>
# Data to Evaluate
### Problem
한 공장에서 하루에 120개의 제품을 생산한다. 불량률이 5%일 때, 일주일(7일) 동안 생산되는 정상 제품의 수는?
### Model Response
하루 생산량: 120개
불량률: 5% → 불량품: 120 × 0.05 = 6개
하루 정상 제품: 120 - 6 = 114개
일주일 정상 제품: 114 × 7 = 798개
### Optional Ground Truth
798개
# Rubric
Evaluate whether the model correctly solves the problem and provides reasoning that is logically consistent with the final answer. Prioritize correctness of the conclusion, then soundness of the reasoning.
Score 1: The final answer is wrong and the reasoning is invalid, irrelevant, or missing.
Score 2: The response shows limited progress but contains major reasoning flaws leading to an incorrect or unreliable answer.
Score 3: The response demonstrates partial reasoning ability but is incomplete, contains mistakes, or reaches an uncertain result.
Score 4: The response is mostly correct with generally sound reasoning, though minor errors or gaps may remain.
Score 5: The response reaches the correct answer through clear, consistent, and logically valid reasoning appropriate to the problem."""
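messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# add_generation_prompt=True appends the assistant turn header so the model starts its answer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Greedy decoding; temperature is not used when do_sample=False, so it is omitted.
outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=False,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)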
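The generated text is expected to contain the three tags requested in the prompt. A minimal sketch of pulling them apart, assuming the model closes each tag as instructed, is shown below. The function name `parse_evaluation` and the sample string are hypothetical; the sample is hand-written, not actual model output.

```python
import re
from typing import Dict, Optional

TAGS = ("feedback", "highlight", "decision")


def parse_evaluation(text: str) -> Dict[str, Optional[object]]:
    """Extract <feedback>, <highlight>, <decision> and an integer score.

    Missing tags map to None; the score is the first integer found
    inside <decision>, or None if there is none.
    """
    result: Dict[str, Optional[object]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        result[tag] = match.group(1).strip() if match else None

    score = None
    if result["decision"] is not None:
        number = re.search(r"-?\d+", str(result["decision"]))
        if number:
            score = int(number.group())
    result["score"] = score
    return result


if __name__ == "__main__":
    # Hand-written sample in the expected format (not real model output).
    sample = (
        "<feedback>\nThe answer and each intermediate step are correct.\n</feedback>\n"
        "<highlight>\n114 * 7 = 798\n</highlight>\n"
        "<decision>\n5\n</decision>"
    )
    parsed = parse_evaluation(sample)
    print(parsed["score"])      # 5
    print(parsed["highlight"])  # 114 * 7 = 798
```

Treating a missing tag as `None` rather than raising makes it easier to count malformed generations when the evaluator runs over a large batch.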
3. Training Dataset
HuggingFace dataset
🤗 datumo/E-Star-Train-6K
| Item | Details |
|---|---|
| Source data | K2-Feedback (HAERAEHUB, 2024), 99.7K |
| Final training data | 6,311 samples (after three-stage filtering) |
Filtering pipeline summary
| Stage | Size change | Method |
|---|---|---|
| Stage 1 | 99.7K → 26K | Initial agreement between Qwen3-30B-A3B and Qwen3-Next-80B-A3B |
| Stage 2 | 26K → 8K | Balancing of agreement/disagreement cases relative to the Gemma base model |
| Stage 3 | 8K → 6K | GPT-5.2 single-model evaluation + frontier debate cross-validation |
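The table gives only a one-line description of each stage, so the following is a rough sketch of what Stages 1 and 2 could look like, not the actual pipeline. It assumes each sample carries integer scores from two judge models and from the base model, that "agreement" means identical scores, and that balancing means a 1:1 ratio. The field names (`judge_a`, `judge_b`, `base_model`) and both function names are hypothetical.

```python
import random
from typing import Dict, List

Sample = Dict[str, int]


def stage1_agreement(samples: List[Sample]) -> List[Sample]:
    """Keep samples on which the two judge models give the same score (assumed rule)."""
    return [s for s in samples if s["judge_a"] == s["judge_b"]]


def stage2_balance(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Balance cases where the base model agrees or disagrees with the judges.

    Assumed rule: downsample the larger group so both groups have equal size.
    """
    agree = [s for s in samples if s["base_model"] == s["judge_a"]]
    disagree = [s for s in samples if s["base_model"] != s["judge_a"]]
    n = min(len(agree), len(disagree))
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(agree, n) + rng.sample(disagree, n)


if __name__ == "__main__":
    # Toy records, invented for the example.
    toy = [
        {"id": 1, "judge_a": 5, "judge_b": 5, "base_model": 5},
        {"id": 2, "judge_a": 3, "judge_b": 4, "base_model": 3},
        {"id": 3, "judge_a": 2, "judge_b": 2, "base_model": 4},
        {"id": 4, "judge_a": 4, "judge_b": 4, "base_model": 4},
        {"id": 5, "judge_a": 1, "judge_b": 1, "base_model": 3},
        {"id": 6, "judge_a": 5, "judge_b": 5, "base_model": 5},
    ]
    kept = stage1_agreement(toy)
    balanced = stage2_balance(kept)
    print(len(toy), "->", len(kept), "->", len(balanced))  # 6 -> 5 -> 4
```

Keeping disagreement cases in Stage 2 presumably ensures the training set contains examples the base model currently gets wrong, though the card does not spell out the rationale.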
Evaluation benchmarks
4. Training Configuration
Key training parameters
learning_rate: 1e-5
num_train_epochs: 5 (early stopping at epoch 2)
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
eval_strategy: epoch
metric_for_best_model: eval_loss
load_best_model_at_end: true
method: Full Fine-Tuning (no LoRA)
framework: TRL SFTTrainer (von Werra et al., 2020)
GPUs used / training time
| Item | Details |
|---|---|
| GPU | 4 |
| Training time | 1.2 hr |
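The parameters above imply an effective batch size, worked out below as a quick sanity check. This assumes all 4 GPUs act as data-parallel replicas, which the card does not state, and uses the full 6,311 samples as an upper bound because the size of the validation split is not given.

```python
import math

# Values taken from the training configuration above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4  # assumption: every GPU is a data-parallel replica

# Samples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Upper bound on the training set: assumes nothing is held out for validation.
max_train_samples = 6311
max_steps_per_epoch = math.ceil(max_train_samples / effective_batch_size)

print(effective_batch_size)  # 32
print(max_steps_per_epoch)   # 198
```

If the GPUs were instead used for model or pipeline parallelism, the effective batch size would be smaller and the step count per epoch correspondingly larger.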
5. Evaluation Results
5.1 Feedback Bench (English, Rubric Following)
| Type | Models | Pearson | Kendall τ | Spearman |
|---|---|---|---|---|
| Frontier | GPT-5.2 | 0.916 | 0.865 | 0.911 |
| Frontier | Sonnet-4.6 | 0.840 | 0.776 | 0.847 |
| Instruct SLM | Gemma-3-12B-IT | 0.810 | 0.725 | 0.794 |
| Instruct SLM | oss-20b | 0.844 | 0.762 | 0.839 |
| Evaluator LM | Prometheus-8x7B-v2.0 | 0.823 | 0.736 | 0.806 |
| Evaluator LM | GLIDER 3.8B | 0.678 | 0.595 | 0.688 |
| Ours | E-Star-12B-Base | 0.856 | 0.778 | 0.847 |
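The three columns are correlation coefficients between the evaluator's scores and the reference scores. As a dependency-free sketch of how such figures can be computed, the code below implements Pearson, Spearman (Pearson on average ranks), and Kendall's τ. The card does not say which Kendall variant was used; τ-b, which corrects for ties, is assumed here because integer rubric scores tie often. The score lists are made up.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, computed over all pairs (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both: counted in neither term
            if dx == 0:
                ties_x += 1       # tied only in x
            elif dy == 0:
                ties_y += 1       # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


if __name__ == "__main__":
    # Made-up scores on a 1-5 rubric, for illustration only.
    reference = [1, 2, 3, 4, 5, 3]
    evaluator = [1, 3, 3, 4, 5, 2]
    print(round(pearson(reference, evaluator), 3))
    print(round(kendall_tau_b(reference, evaluator), 3))
    print(round(spearman(reference, evaluator), 3))
```

For real evaluations, `scipy.stats.pearsonr`, `spearmanr`, and `kendalltau` compute the same quantities and are much faster on large score sets.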
5.2 Ko Feedback Bench (Korean, Rubric Following)
| Type | Models | Pearson | Kendall τ | Spearman |
|---|---|---|---|---|
| Frontier | GPT-5.2 | 0.929 | 0.886 | 0.925 |
| Frontier | Sonnet-4.6 | 0.820 | 0.758 | 0.833 |
| Instruct SLM | Gemma-3-12B-IT | 0.653 | 0.593 | 0.661 |
| Instruct SLM | oss-20b | 0.778 | 0.704 | 0.779 |
| Evaluator LM | Prometheus-8x7B-v2.0 | 0.377 | 0.441 | 0.501 |
| Evaluator LM | GLIDER 3.8B | 0.523 | 0.487 | 0.563 |
| Ours | E-Star-12B-Base | 0.826 | 0.754 | 0.819 |
5.3 RAG Quality Bench (Finance and Law, Domain Adaptation)
| Models | LAW(CR) | LAW(FF) | LAW(RR) | FIN(CR) | FIN(FF) | FIN(RR) | Average |
|---|---|---|---|---|---|---|---|
| GPT-5.2 | 0.846 | 0.785 | 0.941 | 0.882 | 0.740 | 0.970 | 0.861 |
| Sonnet-4.6 | 0.910 | 0.786 | 0.872 | 0.932 | 0.845 | 0.925 | 0.878 |
| Gemma-3-12B-IT | 0.620 | 0.742 | 0.742 | 0.830 | 0.713 | 0.821 | 0.745 |
| oss-20b | 0.846 | 0.722 | 0.870 | 0.793 | 0.752 | 0.900 | 0.813 |
| Prometheus-8x7B-v2.0 | 0.392 | 0.477 | 0.772 | 0.386 | 0.240 | 0.806 | 0.512 |
| GLIDER 3.8B | 0.657 | 0.670 | 0.680 | 0.432 | 0.415 | 0.548 | 0.567 |
| E-Star-12B-Base | 0.853 | 0.730 | 0.816 | 0.835 | 0.720 | 0.880 | 0.806 |
CR = Context Relevancy, FF = Faithfulness, RR = Response Relevancy
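The card does not define the Average column, but it appears to be the unweighted mean of the six per-axis scores. The short check below recomputes that mean for each row of the table and compares it with the reported figure; it agrees to within 0.001 for every row (the oss-20b row differs in the last digit, presumably because the per-axis values were rounded before publication).

```python
# Per-axis scores (LAW CR/FF/RR, FIN CR/FF/RR) and the reported average,
# copied from the RAG Quality Bench table.
rows = {
    "GPT-5.2": ([0.846, 0.785, 0.941, 0.882, 0.740, 0.970], 0.861),
    "Sonnet-4.6": ([0.910, 0.786, 0.872, 0.932, 0.845, 0.925], 0.878),
    "Gemma-3-12B-IT": ([0.620, 0.742, 0.742, 0.830, 0.713, 0.821], 0.745),
    "oss-20b": ([0.846, 0.722, 0.870, 0.793, 0.752, 0.900], 0.813),
    "Prometheus-8x7B-v2.0": ([0.392, 0.477, 0.772, 0.386, 0.240, 0.806], 0.512),
    "GLIDER 3.8B": ([0.657, 0.670, 0.680, 0.432, 0.415, 0.548], 0.567),
    "E-Star-12B-Base": ([0.853, 0.730, 0.816, 0.835, 0.720, 0.880], 0.806),
}

# Recompute the unweighted mean and the gap to the reported value.
recomputed = {name: sum(scores) / len(scores) for name, (scores, _) in rows.items()}
gaps = {name: abs(recomputed[name] - reported) for name, (_, reported) in rows.items()}

for name in rows:
    print(f"{name:22s} mean={recomputed[name]:.4f} reported={rows[name][1]:.3f}")
```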
6. Limitations
- Ko Feedback Bench was built with machine translation, so translation noise may affect the measured evaluation performance
- The training data and the benchmark labels were built with the same debate-based procedure, so results may reflect alignment with consensus-based labeling criteria rather than absolute evaluation quality (no human evaluation by domain experts was conducted)
- The model was trained and evaluated in a reference-free setting, so performance in settings that include a reference needs separate validation
- As a 12B-scale SLM, it may be limited in interpreting complex rubrics compared with frontier models
- How performance changes as the number of input documents grows in RAG evaluation has not been verified
7. License
CC BY-NC 4.0: Non-Commercial Use Only
This model is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
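✅ Permitted
- Use in academic research and paper writing
- Use for non-profit educational purposes
- Use for personal learning and experimentation
- Modification and redistribution under the above conditions (attribution to the original source and application of the same license are required)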
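❌ Prohibited (violations of the non-commercial clause)
- Integrating the model into a paid product or service, or offering it as an API
- Applying it, directly or indirectly, to internal operational systems, customer-facing services, or revenue-generating pipelines
- Redistributing or selling the model weights for commercial purposes
- Indirect commercial use, such as generating fine-tuning data for training a commercial model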
Commercial use inquiries: if you need a commercial license, please contact us through Selectstar official channels.
License by component
| Component | License |
|---|---|
| Base model (Gemma-3-12B-IT) | Gemma Terms of Use (Google) |
| Training data (K2-Feedback) | Follows the license policy of K2-Feedback (HAERAEHUB, 2024) |
| Fine-tuned weights (this model) | CC BY-NC 4.0 |
Important: You must also comply with the terms of use of Gemma, the base model. Any use that the Gemma Terms of Use separately prohibits is equally prohibited for this model.