E-Star-12B-v2-Base

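⚠️ Non-Commercial Use Only. This model is distributed under the CC BY-NC 4.0 license. Commercial use (integration into products or services, paid API offerings, deployment in internal operational systems, and so on) is not permitted. Commercial use requires a separate license agreement, arranged through [Selectstar official channel].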

Owner: Selectstar Eval team
Date: 2026-05-22
Status: active

1. Model Description

Architecture / Parameters

Item	Details
Base model	Gemma-3-12B-IT
Parameters	12B
Training method	Full Fine-Tuning (SFT)
Output structure	feedback → highlight → decision

Version Information

Version	Description
v0.1	Initial Base version. Trained on 6,311 examples obtained from K2-Feedback through three-stage filtering

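Purpose (Use Cases)

An SLM-based evaluator that performs Korean rubric-based evaluation reliably. It is specialized for quality evaluation of RAG pipelines in the finance and legal domains and supports the following evaluation axes.

Faithfulness: judges whether the response is grounded in the provided documents (hallucination diagnosis)
Context Relevancy: judges whether the retrieved documents are relevant to the query (retrieval quality)
Response Relevancy: judges whether the response appropriately addresses the query (overall response adequacy)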
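
Each axis can be framed as a rubric prompt in the same layout as the user prompt in the inference snippet of section 2 (output format, then `# Data to Evaluate`, then `# Rubric`). The sketch below is only an illustration under stated assumptions: the helper `build_user_prompt`, the section names `Question`, `Retrieved Context`, and `Model Response`, and the two-level faithfulness rubric are all hypothetical and are not the prompts used for training or for RAG Quality Bench. The language instruction line from the original prompt is omitted for brevity.

```python
OUTPUT_FORMAT = """# Output Format:
<feedback>
Write detailed feedback (reasons) that strictly evaluates the quality of the response using only the given scoring rubric.
</feedback>
<highlight>
List of words or phrases that you believe are the most important in determining the score.
</highlight>
<decision>
Provide the final integer score assigned based on the scoring rubric.
</decision>"""


def build_user_prompt(sections, rubric):
    """Assemble a user prompt: output format, data sections, then the rubric."""
    data = "\n\n".join(f"### {name}\n{text}" for name, text in sections.items())
    return f"{OUTPUT_FORMAT}\n\n# Data to Evaluate\n\n{data}\n\n# Rubric\n{rubric}"


# Hypothetical faithfulness case; section names and rubric wording are placeholders.
prompt = build_user_prompt(
    sections={
        "Question": "What is the notice period for terminating the contract?",
        "Retrieved Context": "Either party may terminate the contract with 30 days' written notice.",
        "Model Response": "The contract can be terminated with 30 days' written notice.",
    },
    rubric=(
        "Evaluate whether every claim in the response is supported by the retrieved context.\n"
        "Score 1: The response contains claims that the context does not support.\n"
        "Score 2: Every claim in the response is supported by the context."
    ),
)
print(prompt)
```

The resulting string would be passed as the `user` message, with the system prompt left unchanged.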

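Potential Applications

Extension to RAG evaluation in domains other than finance and law (requires designing domain-specific rubrics)
General rubric-based quality evaluation of LLM outputs (rubric-following ability validated on Ko Feedback Bench)
A cost-efficient alternative to frontier models when automating evaluation pipelines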

2. How to Run the Model

Training code snippet

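from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-12b-it"
# GemmaForCausalLM is the first-generation Gemma class and does not match Gemma 3
# checkpoints, so the Auto class resolves the architecture from the checkpoint config.
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)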

sft_config = SFTConfig(
    output_dir="./eval-estar-base-v0.1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # reloads the checkpoint with the lowest validation loss; does not stop training by itself
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
    logging_steps=10,
)

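# train_dataset and eval_dataset must be prepared beforehand
# (for example, splits built from datumo/E-Star-Train-6K).
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL releases name this argument `tokenizer`
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=sft_config,
)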

trainer.train()

※ In the actual training run, early stopping based on validation loss was applied at epoch 2.
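
A minimal sketch of patience-based early stopping on validation loss, to make the note above concrete. The loss values are made up for illustration, and the patience of 1 is an assumption; the source does not state which mechanism or patience was used. In `transformers`, the comparable built-in mechanism is `EarlyStoppingCallback`.

```python
def best_epoch_with_patience(eval_losses, patience=1):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(eval_losses)


# Made-up validation losses, one per epoch (illustration only).
losses = [0.92, 0.81, 0.84, 0.90, 0.97]
best, stopped = best_epoch_with_patience(losses, patience=1)
print(best, stopped)  # 2 3
```

Under these assumed numbers the best checkpoint is epoch 2 and training halts after epoch 3, which is one way a 5-epoch configuration could end up using the epoch 2 weights.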

Inference code snippet

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "datumo/E-Star-12B-v2-Base"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# ── System Prompt ──
system_prompt = """You are a rubric evaluator.
Your task is to evaluate a response strictly and only according to the provided pass criteria and scoring rubric. 
In your output, return the final evaluation (the three output tags: <feedback>, <highlight>, and <decision>).

# Evaluation Procedure (must follow all steps):

1. First, carefully read the Data to Evaluate, the pass criteria, and the scoring rubric to fully understand the requirements.
2. Evaluate the response only against the given criteria: do not introduce external standards, do not reward style unless the rubric explicitly allows it, and judge by absolute rubric definitions rather than relative comparisons.
3. Re-check fine-grained details in the response and the rubric, ensuring any tags (if present) are correctly mapped to the pass criteria and that small deviations are not overlooked.
4. Write criterion-focused feedback that explicitly references the rubric, quoting exact words or phrases from the response when they are decisive, and clearly stating which criteria are satisfied and which are violated.
5. Finally, extract the key verbatim spans that most influenced your judgment and assign the final score according to the scoring rubric.
"""

# ── User Prompt example (Reasoning / Problem Solving) ──
user_prompt = """You MUST write ALL output (<feedback>, <highlight>, <decision>) in the SAME language as the input question and response being evaluated. If the input is in Korean, your entire output MUST be in Korean.
# Output Format:
<feedback>
Write detailed feedback (reasons) that strictly evaluates the quality of the response using only the given scoring rubric. Do not explicitly state the score in a sentence (e.g., "Therefore, the score is …").
</feedback>
<highlight>
List of words or phrases that you believe are the most important in determining the score.
</highlight>
<decision>
Provide the final integer score assigned based on the scoring rubric.
</decision>

# Data to Evaluate

### Problem
한 공장에서 하루에 120개의 제품을 생산한다. 불량률이 5%일 때, 일주일(7일) 동안 생산되는 정상 제품의 수는?

### Model Response
하루 생산량: 120개
불량률: 5% → 불량품: 120 × 0.05 = 6개
하루 정상 제품: 120 - 6 = 114개
일주일 정상 제품: 114 × 7 = 798개

### Optional Ground Truth
798개

# Rubric
Evaluate whether the model correctly solves the problem and provides reasoning that is logically consistent with the final answer. Prioritize correctness of the conclusion, then soundness of the reasoning.

Score 1: The final answer is wrong and the reasoning is invalid, irrelevant, or missing.
Score 2: The response shows limited progress but contains major reasoning flaws leading to an incorrect or unreliable answer.
Score 3: The response demonstrates partial reasoning ability but is incomplete, contains mistakes, or reaches an uncertain result.
Score 4: The response is mostly correct with generally sound reasoning, though minor errors or gaps may remain.
Score 5: The response reaches the correct answer through clear, consistent, and logically valid reasoning appropriate to the problem."""

# ── Inference ──
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

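# add_generation_prompt=True appends the assistant turn header so the model starts its answer
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
)

# Decode only the newly generated tokens
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)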
print(response)
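
Downstream code usually needs the three tagged parts of the response as separate fields. The following is a minimal parsing sketch, assuming the model emits well-formed `<feedback>`, `<highlight>`, and `<decision>` tags as requested in the prompt. The helper name `parse_evaluation` and the sample text are hypothetical; the sample is hand-written, not real model output.

```python
import re

TAGS = ("feedback", "highlight", "decision")


def parse_evaluation(text):
    """Extract the tagged sections and an integer score from evaluator output."""
    result = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        result[tag] = match.group(1).strip() if match else None

    # The decision block is expected to hold a single integer score.
    score = None
    if result["decision"] is not None:
        digits = re.search(r"-?\d+", result["decision"])
        if digits:
            score = int(digits.group())
    result["score"] = score
    return result


# Hand-written sample in the expected format (not real model output).
sample = """<feedback>
The calculation 114 × 7 = 798 matches the ground truth.
</feedback>
<highlight>
114 × 7 = 798
</highlight>
<decision>
5
</decision>"""

parsed = parse_evaluation(sample)
print(parsed["score"])  # 5
```

Missing tags come back as `None` rather than raising, so a caller can decide whether to retry generation or discard the sample.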

3. Training Dataset

HuggingFace dataset

🤗 datumo/E-Star-Train-6K

Item	Details
Seed data	K2-Feedback (HAERAEHUB, 2024), 99.7K examples
Final training data	6,311 examples (after three-stage filtering)

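Filtering Pipeline Summary

Stage	Size change	Method
Stage 1	99.7K → 26K	Initial agreement between Qwen3-30B-A3B and Qwen3-Next-80B-A3B
Stage 2	26K → 8K	Balancing of agreement and disagreement cases relative to the Gemma base model
Stage 3	8K → 6K	Single-model evaluation with GPT-5.2 plus cross-validation through a small-scale frontier-model debate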
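
As a rough illustration of the Stage 1 idea, the sketch below keeps only examples on which two judges agree. It assumes that agreement means identical integer scores and that each example carries the two scores in fields named `judge_a` and `judge_b`; the source does not define the agreement criterion or the data schema, so both are hypothetical.

```python
def filter_by_agreement(examples, key_a="judge_a", key_b="judge_b"):
    """Keep examples whose two judge scores are identical."""
    return [ex for ex in examples if ex[key_a] == ex[key_b]]


# Made-up scores for illustration only.
examples = [
    {"id": 1, "judge_a": 4, "judge_b": 4},
    {"id": 2, "judge_a": 2, "judge_b": 5},
    {"id": 3, "judge_a": 1, "judge_b": 1},
]

kept = filter_by_agreement(examples)
print([ex["id"] for ex in kept])  # [1, 3]
```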

Evaluation Benchmarks

🤗 datumo/Feedback-Bench: Rubric following (Korean / English)
🤗 datumo/Rag-Quality-Bench: Domain adaptation (finance and legal)

4. Training Configuration

Key training parameters

# SFT Config
learning_rate: 1e-5
num_train_epochs: 5 (early stopping at epoch 2)
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
eval_strategy: epoch
metric_for_best_model: eval_loss
load_best_model_at_end: true

# Fine-tuning
method: Full Fine-Tuning (no LoRA)
framework: TRL SFTTrainer (von Werra et al., 2020)

GPUs / Training Time

Item	Details
GPU	4
Training time	1.2 hr
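
The configuration above implies an effective batch size that can be worked out directly. The sketch assumes that the "4" in the table is a GPU count, that training was data-parallel across all of them, and that all 6,311 examples were used for training; none of these is stated explicitly in the source, so the step count is only approximate.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4            # assumption: data parallelism across all 4 GPUs
num_examples = 6311     # assumption: every filtered example is in the training split

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Approximate optimizer updates per epoch.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

print(effective_batch_size, steps_per_epoch)  # 32 198
```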

5. Evaluation Results

5.1 Feedback Bench (English, Rubric Following)

Type	Models	Pearson	Kendall τ	Spearman
Frontier	GPT-5.2	0.916	0.865	0.911
Frontier	Sonnet-4.6	0.840	0.776	0.847
Instruct SLM	Gemma-3-12B-IT	0.810	0.725	0.794
Instruct SLM	oss-20b	0.844	0.762	0.839
Evaluator LM	Prometheus-8x7B-v2.0	0.823	0.736	0.806
Evaluator LM	GLIDER 3.8B	0.678	0.595	0.688
Ours	E-Star-12B-Base	0.856	0.778	0.847

5.2 Ko Feedback Bench (Korean, Rubric Following)

Type	Models	Pearson	Kendall τ	Spearman
Frontier	GPT-5.2	0.929	0.886	0.925
Frontier	Sonnet-4.6	0.820	0.758	0.833
Instruct SLM	Gemma-3-12B-IT	0.653	0.593	0.661
Instruct SLM	oss-20b	0.778	0.704	0.779
Evaluator LM	Prometheus-8x7B-v2.0	0.377	0.441	0.501
Evaluator LM	GLIDER 3.8B	0.523	0.487	0.563
Ours	E-Star-12B-Base	0.826	0.754	0.819
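
The three columns in the Feedback Bench tables are correlation coefficients between evaluator scores and reference scores. The pure-Python sketch below shows how such coefficients are computed; the score lists are made up, and the Kendall variant shown is tau-b, which is an assumption because the source does not say which variant or which library was used.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either sequence."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither tie total
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom


# Made-up reference and evaluator scores on a 1-5 rubric.
reference = [1, 2, 3, 4, 5, 3, 2]
predicted = [1, 3, 3, 4, 5, 2, 2]
print(pearson(reference, predicted))
print(spearman(reference, predicted))
print(kendall_tau_b(reference, predicted))
```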

5.3 RAG Quality Bench (Finance and Legal, Domain Adaptation)

Models	LAW(CR)	LAW(FF)	LAW(RR)	FIN(CR)	FIN(FF)	FIN(RR)	Average
GPT-5.2	0.846	0.785	0.941	0.882	0.740	0.970	0.861
Sonnet-4.6	0.910	0.786	0.872	0.932	0.845	0.925	0.878
Gemma-3-12B-IT	0.620	0.742	0.742	0.830	0.713	0.821	0.745
oss-20b	0.846	0.722	0.870	0.793	0.752	0.900	0.813
Prometheus-8x7B-v2.0	0.392	0.477	0.772	0.386	0.240	0.806	0.512
GLIDER 3.8B	0.657	0.670	0.680	0.432	0.415	0.548	0.567
E-Star-12B-Base	0.853	0.730	0.816	0.835	0.720	0.880	0.806

CR = Context Relevancy, FF = Faithfulness, RR = Response Relevancy
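
The Average column appears to be the unweighted mean of the six axis scores. This is an inference from the numbers rather than something the source states; recomputing it from the E-Star-12B-Base row reproduces the reported value.

```python
# E-Star-12B-Base row from the RAG Quality Bench table.
e_star = {
    "LAW(CR)": 0.853,
    "LAW(FF)": 0.730,
    "LAW(RR)": 0.816,
    "FIN(CR)": 0.835,
    "FIN(FF)": 0.720,
    "FIN(RR)": 0.880,
}

# Unweighted mean over the six axis scores.
average = sum(e_star.values()) / len(e_star)
print(round(average, 3))  # 0.806
```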

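6. Limitations

Ko Feedback Bench was built through machine translation, so translation noise may affect the measured evaluation performance.
The training data and the benchmark labels were built with the same debate-based procedure, so the results may reflect alignment with the consensus-based labeling standard rather than absolute evaluation quality (no human evaluation by domain experts was conducted).
The model was trained and evaluated in a reference-free setting, so performance when references are provided needs separate validation.
As a 12B-scale SLM, the model may be limited in interpreting complex rubrics compared with frontier models.
How performance changes as the number of input documents grows in RAG evaluation has not been verified.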

7. License

CC BY-NC 4.0: Non-Commercial Use Only

This model is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

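✅ Permitted

Use in academic research and paper writing
Use for non-profit educational purposes
Use for personal learning and experimentation
Modification and redistribution under the above conditions (attribution to the original source and application of the same license are required)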

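❌ Prohibited (violations of the non-commercial clause)

Integrating the model into paid products or services, or offering it as an API
Applying the model, directly or indirectly, to internal operational systems, customer-facing services, or revenue-generating pipelines
Redistributing or selling the model weights for commercial purposes
Indirect commercial use, such as generating fine-tuning data for training commercial models

Commercial use inquiries: if you need a commercial license, please contact us through the official Selectstar channel.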

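Licenses by Component

Component	License
Base model (Gemma-3-12B-IT)	Gemma Terms of Use (Google)
Training data (K2-Feedback)	Subject to the license policy of K2-Feedback (HAERAEHUB, 2024)
Fine-tuned weights (this model)	CC BY-NC 4.0

Important: The terms of use of the base model, Gemma, must also be followed. Any use that the Gemma Terms of Use separately prohibit is equally prohibited for this model.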
