Mistral-7B · ARC-Challenge (CPT → DPO)

This model refines mistralai/Mistral-7B-v0.1 with two-stage training (CPT → DPO), substantially raising its ARC-Challenge 25-shot score.

| ARC-Challenge test, 25-shot | Baseline (Mistral-7B-v0.1) | This model (CPT→DPO) | Δ |
|---|---|---|---|
| `acc_norm` (primary metric) | 0.6143 | 0.7526 | +0.1383 |
| `acc` (secondary metric) | 0.5700 | 0.7474 | +0.1774 |

Evaluation: EleutherAI lm-evaluation-harness, task arc_challenge, --num_fewshot 25, dtype=float16.

Core idea: the evaluation is "ranking", not "generation"

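The arc_challenge task in lm-evaluation-harness does not ask the model to generate an answer. For each choice, it computes the log-likelihood of

"Question: {question}\nAnswer:" + " " + {choice text}

and picks the choice with the highest value: acc compares the raw log-likelihoods, while acc_norm first normalizes each one by the length of the choice. The only thing that moves the score is therefore **whether the model assigns a higher likelihood to the text of the correct choice**. Every training stage of this model is aligned with that fact (the training target is always the text of the correct choice, never a letter such as "B").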
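
The selection rule described above can be sketched in a few lines. This is an illustration, not the harness code: `rank_choices` is a hypothetical helper, the log-likelihood values are made up, and normalizing by character count is an assumption about how the choice length is measured.

```python
def rank_choices(choices, loglikelihoods):
    """Return (acc_pick, acc_norm_pick) as indices into `choices`.

    loglikelihoods[i] is the summed log-probability of choice i's text
    given the question context.
    """
    indices = range(len(choices))
    # acc: highest raw summed log-likelihood.
    acc_pick = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: highest log-likelihood per character of the choice text
    # (character-length normalization is an assumption of this sketch).
    acc_norm_pick = max(
        indices, key=lambda i: loglikelihoods[i] / max(1, len(choices[i]))
    )
    return acc_pick, acc_norm_pick


# Made-up numbers: the longer choice has a lower total log-likelihood,
# but a higher per-character log-likelihood.
choices = ["oxygen", "carbon dioxide and water vapor"]
lls = [-6.0, -9.0]
print(rank_choices(choices, lls))
```

The two picks can disagree, which is why the card reports acc and acc_norm separately.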

Method

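Stage 1 · CPT (Continual Pre-Training, knowledge injection) Packed causal-LM training at 4K sequence length (loss on every token) on science prose that matches the ARC distribution (SciQ support paragraphs + CAMEL-AI biology/chemistry/physics solutions). This raises the likelihood of correct science statements across the board. On its own the effect is small (acc_norm 0.6143 → 0.6195).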
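
As a rough picture of what "packed" means in Stage 1, the sketch below concatenates tokenized documents into one stream and cuts it into fixed-size blocks. `pack_sequences` is a hypothetical helper; the EOS separator (id 2) and the choice to drop the incomplete tail are assumptions, since the card does not say how packing was implemented.

```python
def pack_sequences(tokenized_docs, block_size=4096, eos_id=2):
    """Concatenate token-id lists and cut them into full blocks of block_size."""
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # assumed document separator
    n_blocks = len(stream) // block_size
    # Assumption: the incomplete tail is dropped rather than padded.
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Tiny demo with block_size=4 instead of 4096.
blocks = pack_sequences([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4)
# Loss on every token: the labels are simply a copy of the inputs.
examples = [{"input_ids": b, "labels": list(b)} for b in blocks]
print(blocks)
```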

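Stage 2 · DPO (Direct Preference Optimization, ranking alignment) A fresh LoRA (r=64, α=128, 7 projection modules) is placed on top of the CPT checkpoint and trained on preference pairs in which, for each question, the correct choice text is chosen and an incorrect choice text is rejected (byte-identical to the harness layout). logP(correct) > logP(incorrect) is the acc_norm ranking objective itself, and the reference is the same model with the adapter disabled (= the CPT checkpoint).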

loss = -logσ( β · [ (logπ_w − logπ_w^ref) − (logπ_l − logπ_l^ref) ] )   # β = 0.1
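
The loss above can be evaluated for a single pair in plain Python. This is a minimal numeric sketch, not the training code: `dpo_loss` is a hypothetical helper and the log-probabilities are made-up sums over the choice tokens.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin); this form is stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy raises the chosen text and lowers the rejected one: loss decreases.
print(dpo_loss(-3.0, -9.0, -5.0, -7.0))
```

Because the reference terms cancel when the policy equals the reference, the loss starts at log 2 and only falls as the policy widens the gap between chosen and rejected texts relative to the reference.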

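Why the combination takes off: CPT supplies the material (knowledge) that lets the model favor correct statements, and DPO supplies the ability to use that material to rank the correct answer above the incorrect ones. CPT (0.62) and SFT→DPO (0.65) are each unremarkable on their own, but CPT→DPO jumps to 0.7526 (super-additive).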

Training details

|  | Stage 1 · CPT | Stage 2 · DPO |
|---|---|---|
| Data | SciQ `support` + CAMEL-AI (bio/chem/physics) prose | ARC-C/E + OpenBookQA + SciQ train: gold vs. distractor pairs |
| Objective | packed-CLM (next-token loss on every token) | DPO (β=0.1), reference = adapter-off |
| Method | full-FT + DeepSpeed ZeRO-2 (4×A6000) | fresh LoRA on the frozen checkpoint (r=64/α=128/7proj) |
| epochs / lr | 3 / 1e-5 | 1 / 5e-6 |
| seq len | 4096 (packed) | 512 |
| dtype / attn | bf16 / `sdpa` | bf16 / `sdpa` |
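
The gold vs. distractor pairs in the Stage 2 data row could be assembled roughly as follows. `build_pairs` is a hypothetical helper; emitting one pair per distractor and the `prompt`/`chosen`/`rejected` field names are assumptions, since the card only states that the correct text is chosen and an incorrect text is rejected, in the harness layout.

```python
def build_pairs(question, choices, answer_idx):
    """Return one preference pair per distractor for a multiple-choice item."""
    # Same layout the harness scores: "Question: ...\nAnswer:" + " " + text.
    prompt = f"Question: {question}\nAnswer:"
    chosen = " " + choices[answer_idx]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": " " + text}
        for i, text in enumerate(choices)
        if i != answer_idx
    ]


demo = build_pairs(
    "Which gas do plants release during photosynthesis?",
    ["oxygen", "nitrogen", "carbon dioxide", "hydrogen"],
    answer_idx=0,
)
for pair in demo:
    print(repr(pair["chosen"]), "over", repr(pair["rejected"]))
```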

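Saved as: a standalone full fp16 model with the LoRA merged (no quantization_config, so the comparison with the fp16 baseline is fair).
Train/test separation: the ARC-Challenge test split was never used for training (evaluation only).
flash-attn was not used; every stage runs with attn_implementation="sdpa".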
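
For orientation, the Stage 2 adapter settings can be written out as keyword arguments of the kind a LoRA configuration object takes. The list of seven module names is an assumption: the card says "7 projections" without naming them, and these are the seven linear projections in the Hugging Face Mistral decoder layer.

```python
# Hypothetical settings dict, shaped like the keyword arguments of a
# LoRA config (for example peft's LoraConfig); not the author's file.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": [  # assumed: the card only says "7 projections"
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# With standard LoRA scaling, the adapter update is multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(len(lora_kwargs["target_modules"]), scaling)
```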

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ingeol/mistral-7b-arc-cpt-dpo"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")

# The intended use of this model is to compare the likelihood of choice texts in the same format as the harness.
def score(question, choice):
    prompt = f"Question: {question}\nAnswer:"
    ids = tok(prompt + " " + choice, return_tensors="pt").to(model.device)
    ctx = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logp = model(**ids).logits.log_softmax(-1)
    tgt = ids["input_ids"][0, ctx:]
    sel = logp[0, ctx-1:-1].gather(1, tgt.unsqueeze(1)).sum().item()
    return sel / max(1, len(choice))   # acc_norm: normalize by choice length (in characters)

q = "Which gas do plants release during photosynthesis?"
print({c: round(score(q, c), 3) for c in ["oxygen", "nitrogen", "carbon dioxide", "hydrogen"]})

Reproducing the evaluation

pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=ingeol/mistral-7b-arc-cpt-dpo,dtype=float16 \
  --tasks arc_challenge --num_fewshot 25 --batch_size 8

Limitations / caveats

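This model is specialized for ARC-Challenge (science MCQA) ranking. General instruction-following and chat performance are not goals and are not guaranteed.
Baseline reproduction variance: official acc_norm 0.6143 vs. local reproduction 0.6067 (dtype/harness version differences), a gap of 0.0076 (about 0.8 pt). Δ is reported against the official value.
Generative reasoning (CoT/<think>) does not enter the position the harness scores (immediately after "...Answer:"), so it does not contribute to the score directly. This model is the result of optimizing correct-answer text ranking head-on.
Note: applying SimPO to the same CPT checkpoint gives a marginally higher acc_norm of 0.7551 (a tie within stderr ±0.014), but on the unnormalized acc DPO (0.7474) beats SimPO (0.7295), and DPO is reference-anchored and therefore more principled, so DPO was adopted.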
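
The "tie within stderr" remark can be sanity-checked with a simple binomial standard error. This is a back-of-the-envelope sketch, not the harness's own stderr computation; the test-set size of 1172 questions is an assumption not stated in the card, and `binomial_stderr` is a hypothetical helper.

```python
import math

N_TEST = 1172  # assumed number of ARC-Challenge test questions


def binomial_stderr(p, n=N_TEST):
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


dpo, simpo = 0.7526, 0.7551
gap = simpo - dpo
se = binomial_stderr(dpo)
print(f"gap={gap:.4f}  stderr={se:.4f}  gap/stderr={gap / se:.2f}")
```

Under these assumptions the DPO vs. SimPO gap is a small fraction of one standard error, which is consistent with treating the two acc_norm values as tied.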

Base / license

Base model: mistralai/Mistral-7B-v0.1 (Apache-2.0)
Training data: AI2 ARC, OpenBookQA, SciQ, CAMEL-AI (biology/chemistry/physics); each dataset's own license applies.
