Model Card: selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge
Date: June 5, 2026
selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge is a specialized, fine-tuned large language model designed to function as an "LLM-as-a-Judge," specifically optimized for evaluating and ranking creative writing responses. Built on the Qwen 3.5 4B architecture, it assesses pairs of responses to determine which is superior in accuracy, clarity, and originality.
1. Model Details
- Developed by: selfhypnosis-ai
- Base Model: Qwen3.5-4B
- Task: LLM-as-a-Judge (Pairwise Preference Evaluation)
- Language(s): English
- Format: 16-bit weights (A quantized version is also available at selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge-GPTQ-4bit, which exhibits no measurable degradation in evaluation performance.)
- Fine-Tuned Context Length: 32,768 (32k) tokens
- Training Infrastructure: 1x NVIDIA B200
- Training Epochs: 3
2. Intended Use and Implementation
This model relies on a specific prompting strategy: the verdict is obtained by extracting and comparing the log probabilities (logprobs) that the model assigns to the tokens A and B as its single output token.
To mitigate the positional bias common in LLM judges, run inference twice for every pair of texts, swapping the positions of Response A and Response B on the second pass, and average the results.
Prompting Format
The model expects a strict instruction-based system prompt and a JSON-structured user prompt.
import json

quality_sys_msg = (
    "You are a specialized AI assistant tasked with evaluating two responses to a given instruction. "
    "Your purpose is to fairly compare the quality of each response and select the better one. "
    "The user provides the instruction and the two responses A and B. Evaluate based on accuracy, "
    "clarity, and originality, considering repetitive or unoriginal responses as lower quality."
)

# Example user content: a JSON object with the instruction and both responses
user_content = json.dumps({
    "instruction": "Write a short poem about the ocean.",
    "A": "[Response 1 Text]",
    "B": "[Response 2 Text]"
}, ensure_ascii=False)

messages = [
    {"role": "system", "content": quality_sys_msg},
    {"role": "user", "content": user_content}
]
Logits Extraction & Position Swapping
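Rather than parsing the generated output text, the judge's confidence score is computed from the logprobs of the tokens A and B at the single generated position. With $\ell_A$ and $\ell_B$ denoting those two logprobs, the probability is renormalized over the two candidate tokens:

$$P(A) = \frac{e^{\ell_A}}{e^{\ell_A} + e^{\ell_B}}, \qquad P(B) = \frac{e^{\ell_B}}{e^{\ell_A} + e^{\ell_B}}$$

Here is an abbreviated vLLM implementation demonstrating the dual-pass logic (probabilities are returned as percentages):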
import math
import json
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge"
llm = LLM(model=model_path, kv_cache_dtype="fp8", max_model_len=32768)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Token ids of the two verdict letters
id_A = tokenizer.encode("A", add_special_tokens=False)[0]
id_B = tokenizer.encode("B", add_special_tokens=False)[0]

# Greedy decoding of a single token, returning the top-10 logprobs
sampling_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=10)

def evaluate_pair(instruction, text1, text2):
    # quality_sys_msg is the system prompt defined under "Prompting Format"
    # Pass 1: text1 is A, text2 is B
    c1 = json.dumps({"instruction": instruction, "A": text1, "B": text2}, ensure_ascii=False)
    p1 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c1}],
        tokenize=False, add_generation_prompt=True
    )
    # Pass 2: text2 is A, text1 is B
    c2 = json.dumps({"instruction": instruction, "A": text2, "B": text1}, ensure_ascii=False)
    p2 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c2}],
        tokenize=False, add_generation_prompt=True
    )
    outputs = llm.generate([p1, p2], sampling_params, use_tqdm=False)

    def extract_probs(out):
        # Logprobs of the single generated position, keyed by token id
        lps = out.outputs[0].logprobs[0]
        # A letter missing from the top-10 falls back to -100 (probability ~0)
        logA = getattr(lps.get(id_A), "logprob", -100)
        logB = getattr(lps.get(id_B), "logprob", -100)
        pA, pB = math.exp(logA), math.exp(logB)
        # Renormalize over the two letters and express as percentages
        return (pA/(pA+pB)*100, pB/(pA+pB)*100) if (pA+pB) > 0 else (50.0, 50.0)

    S1_pos1_prob, S1_pos2_prob = extract_probs(outputs[0])
    S2_pos1_prob, S2_pos2_prob = extract_probs(outputs[1])
    # S1_pos1_prob = Probability text1 is better (when in position A)
    # S2_pos2_prob = Probability text1 is better (when in position B)
    avg_text1_better = (S1_pos1_prob + S2_pos2_prob) / 2
    avg_text2_better = (S1_pos2_prob + S2_pos1_prob) / 2
    return avg_text1_better, avg_text2_better
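To make the arithmetic of the dual-pass averaging concrete, here is a minimal, dependency-free sketch. The helper names (`normalize_pair`, `swap_average`) and the logprob values are hypothetical and chosen only for illustration; they are not outputs of the model.

```python
import math

def normalize_pair(log_a, log_b):
    """Renormalize two logprobs into probabilities that sum to 1."""
    p_a, p_b = math.exp(log_a), math.exp(log_b)
    total = p_a + p_b
    return p_a / total, p_b / total

def swap_average(pass1, pass2):
    """Average two passes, each given as (logprob of 'A', logprob of 'B').

    Pass 1 shows text1 as A; pass 2 shows text1 as B.
    Returns (P(text1 better), P(text2 better)).
    """
    p1_a, p1_b = normalize_pair(*pass1)
    p2_a, p2_b = normalize_pair(*pass2)
    text1 = (p1_a + p2_b) / 2
    text2 = (p1_b + p2_a) / 2
    return text1, text2

# Hypothetical logprobs: the judge picks position A in both passes,
# but more strongly when text1 sits there.
text1, text2 = swap_average(
    pass1=(math.log(0.8), math.log(0.2)),
    pass2=(math.log(0.6), math.log(0.4)),
)
print(f"text1: {text1:.2f}, text2: {text2:.2f}")  # text1: 0.60, text2: 0.40
```

In this made-up case the hard tokens contradict each other (A is chosen in both orderings), while the averaged probabilities still resolve to a preference for text1.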
3. Discrete Pairwise Classification (Zero-Shot Accuracy)
Methodology
The model's zero-shot assessment capabilities were evaluated using the eqbench_creative_writing dataset. Ground-truth preference labels in this dataset are defined by Claude 4.6 Sonnet (supplemented by Claude 4 Sonnet), representing state-of-the-art automated baseline scoring.
To isolate purely semantic reasoning capabilities from length bias, text pairs were stringently filtered using a maximum length ratio (R):
Furthermore, the evaluation targeted a narrow difficulty window, restricting the sample to pairs whose ground-truth score difference is small enough to make the comparison hard (5.0 <= Score Difference <= 10.0).
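The two filters described above could be sketched as follows. Because the exact definition and threshold of R are not reproduced in this card, the sketch assumes that R is the character length of the longer response divided by that of the shorter one, and it leaves the threshold as a required `max_ratio` argument. The function name and the value 1.2 in the usage lines are purely hypothetical.

```python
def keep_pair(text_a, text_b, score_a, score_b, max_ratio,
              min_diff=5.0, max_diff=10.0):
    """Return True if a pair passes both the length and difficulty filters.

    ASSUMPTION: R = len(longer) / len(shorter), measured in characters.
    The card does not state how R is computed or which threshold was used.
    """
    shorter, longer = sorted((len(text_a), len(text_b)))
    if shorter == 0:
        return False  # avoid division by zero on empty responses
    ratio = longer / shorter
    score_diff = abs(score_a - score_b)
    return ratio <= max_ratio and min_diff <= score_diff <= max_diff

# Hypothetical threshold of 1.2, chosen only for illustration.
print(keep_pair("a" * 100, "b" * 110, 72.0, 65.0, max_ratio=1.2))  # True
print(keep_pair("a" * 100, "b" * 300, 72.0, 65.0, max_ratio=1.2))  # False
print(keep_pair("a" * 100, "b" * 110, 72.0, 70.0, max_ratio=1.2))  # False
```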
Quantitative Results
While broader general evaluations demonstrate an aggregate accuracy of approximately 78%, the results on this targeted, high-difficulty subset (N=500, Seed: 42) illustrate the distinct advantages of logit-based evaluation versus discrete token prediction:
| Metric | Measured Value | Description |
|---|---|---|
| Strict Accuracy | 59.8% | (Token Prediction): The frequency of correctly identifying the superior response using hard classification. This measures whether the model outputs the correct token ('A' or 'B') in both position orderings. |
| Positional Consistency | 79.4% | (Logit Evaluation): A stability metric derived from the logprobs. It reflects how consistently the model prefers the same underlying text, regardless of whether that text is assigned to position A or B, based on the continuous probability distribution rather than the hard token. |
| Mean Target Probability Mass | 65.7% | The average calculated logit-derived probability assigned to the ground-truth superior response (the "better" answer). |
| Mean Non-Target Probability Mass | 34.3% | The average calculated logit-derived probability assigned to the ground-truth inferior response (the "worse" answer). |
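As an illustration of how the metrics in this table relate to the per-pass probabilities, the following sketch computes them from a list of evaluated pairs. The record layout and the decision rules (counting a pair as strictly correct only when the hard token is right in both orderings, and as consistent when both passes favor the same text) are assumptions inferred from the descriptions above, not the evaluation script itself.

```python
def summarize(pairs):
    """pairs: list of (p_target_pass1, p_target_pass2), each in [0, 1].

    Each value is the normalized probability given to the ground-truth
    better response in one of the two position orderings.
    """
    n = len(pairs)
    # Hard classification: the correct token wins in both orderings.
    strict = sum(p1 > 0.5 and p2 > 0.5 for p1, p2 in pairs) / n
    # Same text preferred in both orderings, whether or not it is correct.
    consistent = sum((p1 > 0.5) == (p2 > 0.5) for p1, p2 in pairs) / n
    # Swap-averaged probability mass on the ground-truth better response.
    target_mass = sum((p1 + p2) / 2 for p1, p2 in pairs) / n
    return {
        "strict_accuracy": strict,
        "positional_consistency": consistent,
        "mean_target_mass": target_mass,
        "mean_non_target_mass": 1.0 - target_mass,
    }

# Four hypothetical pairs (not taken from the evaluation data).
demo = [(0.9, 0.7), (0.8, 0.4), (0.3, 0.2), (0.6, 0.6)]
print(summarize(demo))
```

Under these assumed definitions, positional consistency can never be lower than strict accuracy, since every strictly correct pair is also consistent.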
4. Tournament Separability Meta-Evaluation (Judgemark v4)
Following the meta-evaluation principles introduced in Judgemark v4, we directly tested our model's discriminative resolution: can it assign preferences that cleanly separate stronger writing from weaker writing across the dataset?
Because this judge is optimized for pairwise tournament Elo evaluations rather than pointwise rubric scoring, we run two different meta-evaluations directly in the Elo space:
Option A: Prompt-Isolated Elo Separability (Judgemark v4 Pointwise Analogue)
- Methodology: To simulate item-level scoring, we execute isolated Swiss tournaments (5 rounds) restricted to each individual prompt in the dataset. This builds a score distribution of 32 prompt-specific Elo ratings for each of the 101 writer models. We then calculate Omega-Squared (ω²) and Mean Absolute Cliff's Delta across these prompt-level distributions.
- Separability Results:
  - Omega-Squared (ANOVA effect size): 0.6811 (68.1% of the variance in prompt-specific Elo ratings is explained solely by writer model identity, indicating high resilience to prompt-specific noise).
  - Mean Absolute Cliff's Delta: 0.6377
  - Combined Separability Score: 87.92 / 100.0 (Formula: ((0.5 * ω² + 0.5 * d_cliff) / 0.75) * 100).
- Symmetry & Bias Results:
  - Positional Bias (Avg Assignment): Position 1 (A): 52.3% | Position 2 (B): 47.7%
  - Length Bias (Longer Response Win Rate): 53.4%
  - Length Bias (Avg EV Margin for Longer): +5.3%
- Significance: This indicates strong discriminative ability. The judge consistently separates writing quality in the expected direction, reliably distinguishing strong models from low-tier anchor models while showing only small position and length biases (52.3% vs. 47.7% by position; 53.4% win rate for the longer response).
📈 Option A: Prompt-Isolated Elo Leaderboard
| Rank | Model Name | Avg Elo (Samples: 32) |
|---|---|---|
| 1 | moonshotai/Kimi-K2.5 | 1231.4 |
| 2 | moonshotai/Kimi-K2.6 | 1230.3 |
| 3 | moonshotai/Kimi-K2-Instruct | 1229.7 |
| 4 | claude-opus-4-7 | 1229.0 |
| 5 | deepseek-ai/DeepSeek-V3.2 | 1226.0 |
| 6 | grok-4.20-beta | 1226.0 |
| 7 | o3 | 1225.6 |
| 8 | deepseek-ai/DeepSeek-V4-Pro | 1225.0 |
| 9 | openrouter/sherlock-dash-alpha | 1224.3 |
| 10 | claude-opus-4-6 | 1222.9 |
| 11 | moonshotai/Kimi-K2-Thinking | 1222.8 |
| 12 | grok-4.1-fast | 1222.3 |
| 13 | zai-org/GLM-5.1 | 1222.1 |
| 14 | NousResearch/Hermes-4-405B | 1221.8 |
| 15 | gemini-3-pro-preview | 1220.1 |
| 16 | gemini-2.5-pro-preview-06-05 | 1219.1 |
| 17 | deepseek-ai/DeepSeek-R1 | 1219.1 |
| 18 | gemini-3.1-pro-preview | 1219.0 |
| 19 | google/gemma-4-31B-it | 1218.7 |
| 20 | Qwen/Qwen3.5-397B-A17B | 1218.7 |
| 21 | claude-opus-4 | 1218.4 |
| 22 | deepseek-ai/DeepSeek-R1-0528 | 1218.0 |
| 23 | claude-opus-4-5-20251101 | 1217.6 |
| 24 | deepseek-ai/DeepSeek-V4-Flash | 1217.6 |
| 25 | zai-org/GLM-4.6 | 1217.4 |
| 26 | deepseek-ai/DeepSeek-V3.1 | 1215.7 |
| 27 | Nanbeige/Nanbeige4-3B-Thinking-2511 | 1215.4 |
| 28 | zai-org/GLM-5 | 1213.9 |
| 29 | openrouter/pony-alpha | 1213.2 |
| 30 | hunter-alpha | 1213.1 |
| 31 | google/gemma-4-26B-A4B-it | 1213.0 |
| 32 | claude-sonnet-4-6 | 1212.3 |
| 33 | mistral-medium-3.1 | 1211.8 |
| 34 | zai-org/GLM-4.5 | 1211.7 |
| 35 | qwen/qwen3-235b-a22b:thinking | 1211.3 |
| 36 | RekaAI/reka-flash-3 | 1210.5 |
| 37 | claude-sonnet-4.5 | 1210.3 |
| 38 | claude-3-5-sonnet-20241022 | 1209.9 |
| 39 | gpt-5.4 | 1209.1 |
| 40 | deepseek-ai/DeepSeek-V3-0324 | 1208.7 |
| 41 | zai-org/GLM-4.7 | 1206.6 |
| 42 | mistral-small-creative | 1206.0 |
| 43 | gpt-5-mini-2025-08-07 | 1204.9 |
| 44 | minimax/minimax-m2.5 | 1204.9 |
| 45 | gpt-5.5 | 1204.6 |
| 46 | chatgpt-4o-latest-2025-03-27 | 1204.5 |
| 47 | gpt-5-2025-08-07 | 1204.0 |
| 48 | quasar-alpha | 1203.4 |
| 49 | openrouter/horizon-beta | 1203.3 |
| 50 | ifable/gemma-2-Ifable-9B | 1203.3 |
| 51 | optimus-alpha | 1203.1 |
| 52 | qwen/qwq-32b | 1203.1 |
| 53 | claude-3-7-sonnet-20250219 | 1202.7 |
| 54 | mistralai/Mistral-Large-3-675B-Instruct-2512 | 1202.6 |
| 55 | openrouter/horizon-alpha | 1202.2 |
| 56 | gpt-5.4-mini | 1202.0 |
| 57 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 | 1201.8 |
| 58 | claude-sonnet-4 | 1200.8 |
| 59 | grok-3-beta | 1199.1 |
| 60 | gemini-2.5-pro-exp-03-25 | 1198.9 |
| 61 | gpt-5.3-chat | 1198.7 |
| 62 | anthropic/claude-3.5-haiku-20241022 | 1197.2 |
| 63 | deepseek-ai/DeepSeek-V3.2-Speciale | 1196.9 |
| 64 | zai-org/GLM-4.7-Flash | 1196.7 |
| 65 | gpt-4.1-mini | 1196.2 |
| 66 | gpt-5.2 | 1196.0 |
| 67 | chatgpt-4o-latest-2025-01-29 | 1195.1 |
| 68 | gpt-4.5-preview | 1194.6 |
| 69 | CohereForAI/c4ai-command-a-03-2025 | 1192.9 |
| 70 | gemini-2.5-flash-preview | 1190.2 |
| 71 | openai/gpt-oss-120b | 1189.6 |
| 72 | google/gemma-3-27b-it | 1187.4 |
| 73 | gpt-5-nano-2025-08-07 | 1187.0 |
| 74 | gemini-2.0-flash-001 | 1186.7 |
| 75 | meta-llama/llama-3.1-405b-instruct | 1185.3 |
| 76 | google/gemma-3-4b-it | 1185.2 |
| 77 | google/gemma-3-12b-it | 1184.1 |
| 78 | allura-org/Gemma-3-Glitter-12B | 1183.0 |
| 79 | mistralai/mistral-large-2411 | 1182.9 |
| 80 | sam-paech/Darkest-muse-v1 | 1180.8 |
| 81 | google/gemma-2-9b-it | 1180.4 |
| 82 | liquid/lfm-7b | 1178.2 |
| 83 | THUDM/GLM-4-32B-0414 | 1177.3 |
| 84 | meta-llama/llama-3.1-70b-instruct | 1177.2 |
| 85 | gpt-4.1-nano | 1176.9 |
| 86 | openai/gpt-3.5-turbo-0613 | 1176.7 |
| 87 | mistralai/Mistral-Nemo-Instruct-2407 | 1176.5 |
| 88 | anthropic/claude-3-haiku | 1176.2 |
| 89 | openai/gpt-4-0314 | 1174.7 |
| 90 | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 1173.6 |
| 91 | openai/gpt-oss-20b | 1172.8 |
| 92 | gpt-4o-mini | 1172.5 |
| 93 | mistralai/Pixtral-Large-Instruct-2411 | 1172.1 |
| 94 | meta-llama/llama-3.1-8b-instruct | 1170.5 |
| 95 | openrouter/cypher-alpha | 1169.2 |
| 96 | meta-llama/llama-3.2-3b-instruct | 1165.9 |
| 97 | ToastyPigeon/Gemma-3-Starshine-12B | 1164.8 |
| 98 | mistralai/mistral-small-3.1-24b-instruct-2503 | 1164.5 |
| 99 | meta-llama/Llama-4-Scout-17B-16E-Instruct | 1162.7 |
| 100 | mistralai/Mistral-Small-24B-Instruct-2501 | 1160.2 |
| 101 | meta-llama/llama-3.2-1b-instruct | 1146.3 |
Option B: Bootstrapped Tournament Separability (Leaderboard Stability)
- Methodology: To measure the overall stability and reproducibility of a global leaderboard under different prompt selections, we perform bootstrap resampling on the prompts (drawing 32 prompts with replacement) over 10 independent iterations. A full 20-round tournament is run for each sample to calculate global Elos.
- Separability Results:
  - Omega-Squared (ANOVA effect size): 0.9020 (90.2% of the variance in global Elos across bootstrap runs is explained by model identity. The remaining 9.8% is due to prompt selection variance).
  - Mean Absolute Cliff's Delta: 0.8188
  - Combined Separability Score: 114.72 / 100.0 (Exceeds 100 due to the official 0.75 normalization scaling factor).
- Symmetry & Bias Results:
  - Positional Bias (Avg Assignment): Position 1 (A): 52.6% | Position 2 (B): 47.4%
  - Length Bias (Longer Response Win Rate): 51.6%
  - Length Bias (Avg EV Margin for Longer): +2.7%
- Significance: This indicates that the tournament rankings are highly stable. Aggregating matches across 32 prompts smooths out prompt-level noise, so a slightly different prompt mix shifts each model's rating by only about ±15 Elo points on average and leaves the overall ordering largely intact.
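The bootstrap procedure described above can be outlined as follows. This is a schematic sketch: `run_tournament` is a hypothetical callable standing in for the full 20-round tournament, the spread is computed as a sample standard deviation (an assumption about what the ± values in the leaderboard represent), and the toy scorer exists only to make the example runnable.

```python
import random
import statistics

def bootstrap_elos(prompts, run_tournament, iterations=10, seed=0):
    """Resample prompts with replacement and collect per-model Elo stats.

    run_tournament(sampled_prompts) -> {model_name: elo} is a placeholder
    for the real tournament, which is not reproduced here.
    """
    rng = random.Random(seed)
    history = {}
    for _ in range(iterations):
        # Draw as many prompts as the original set (32 in the card).
        sample = rng.choices(prompts, k=len(prompts))
        for model, elo in run_tournament(sample).items():
            history.setdefault(model, []).append(elo)
    # Mean and sample standard deviation across bootstrap runs.
    return {m: (statistics.mean(v), statistics.stdev(v))
            for m, v in history.items()}

# Toy stand-in: model "x" depends on which prompts were drawn, "y" does not.
def toy_tournament(sample):
    return {"x": 1200 + sum(sample) / len(sample), "y": 1150.0}

stats = bootstrap_elos(list(range(32)), toy_tournament)
for model, (mean, std) in stats.items():
    print(f"{model}: {mean:.1f} ± {std:.1f}")
```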
📈 Option B: Bootstrapped Tournament Elo Leaderboard
| Rank | Model Name | Avg Elo (Across Bootstraps) |
|---|---|---|
| 1 | moonshotai/Kimi-K2.6 | 1294.3 ± 16.6 |
| 2 | grok-4.20-beta | 1290.3 ± 12.5 |
| 3 | moonshotai/Kimi-K2-Instruct | 1283.8 ± 25.5 |
| 4 | claude-opus-4-7 | 1278.8 ± 18.7 |
| 5 | o3 | 1275.8 ± 16.9 |
| 6 | moonshotai/Kimi-K2.5 | 1272.9 ± 29.0 |
| 7 | moonshotai/Kimi-K2-Thinking | 1266.1 ± 22.1 |
| 8 | openrouter/sherlock-dash-alpha | 1256.3 ± 16.2 |
| 9 | claude-opus-4-6 | 1255.5 ± 14.2 |
| 10 | zai-org/GLM-5.1 | 1250.6 ± 17.3 |
| 11 | deepseek-ai/DeepSeek-V3.2 | 1247.7 ± 14.0 |
| 12 | grok-4.1-fast | 1246.1 ± 14.8 |
| 13 | NousResearch/Hermes-4-405B | 1245.4 ± 20.4 |
| 14 | gemini-3.1-pro-preview | 1245.2 ± 8.5 |
| 15 | Nanbeige/Nanbeige4-3B-Thinking-2511 | 1244.9 ± 17.5 |
| 16 | deepseek-ai/DeepSeek-V4-Pro | 1243.1 ± 15.5 |
| 17 | deepseek-ai/DeepSeek-V4-Flash | 1237.2 ± 13.8 |
| 18 | gemini-3-pro-preview | 1236.9 ± 12.8 |
| 19 | deepseek-ai/DeepSeek-R1 | 1235.3 ± 20.3 |
| 20 | google/gemma-4-31B-it | 1234.5 ± 12.7 |
| 21 | deepseek-ai/DeepSeek-V3.1 | 1233.6 ± 17.0 |
| 22 | claude-opus-4-5-20251101 | 1232.4 ± 7.0 |
| 23 | claude-opus-4 | 1230.9 ± 19.8 |
| 24 | zai-org/GLM-4.6 | 1229.4 ± 13.3 |
| 25 | deepseek-ai/DeepSeek-R1-0528 | 1228.8 ± 12.1 |
| 26 | zai-org/GLM-4.5 | 1228.5 ± 11.5 |
| 27 | gemini-2.5-pro-preview-06-05 | 1227.7 ± 9.7 |
| 28 | Qwen/Qwen3.5-397B-A17B | 1227.6 ± 17.1 |
| 29 | mistral-medium-3.1 | 1226.4 ± 10.9 |
| 30 | hunter-alpha | 1224.2 ± 11.7 |
| 31 | qwen/qwen3-235b-a22b:thinking | 1224.0 ± 15.0 |
| 32 | google/gemma-4-26B-A4B-it | 1223.0 ± 9.9 |
| 33 | zai-org/GLM-5 | 1221.1 ± 9.8 |
| 34 | openrouter/pony-alpha | 1220.9 ± 9.9 |
| 35 | gpt-5.4 | 1217.6 ± 13.6 |
| 36 | zai-org/GLM-4.7 | 1216.5 ± 9.1 |
| 37 | quasar-alpha | 1216.5 ± 9.4 |
| 38 | claude-sonnet-4-6 | 1215.6 ± 16.0 |
| 39 | claude-3-5-sonnet-20241022 | 1213.8 ± 11.3 |
| 40 | chatgpt-4o-latest-2025-03-27 | 1213.8 ± 10.7 |
| 41 | RekaAI/reka-flash-3 | 1213.7 ± 10.3 |
| 42 | claude-sonnet-4.5 | 1213.6 ± 8.5 |
| 43 | gpt-5-mini-2025-08-07 | 1210.4 ± 10.6 |
| 44 | optimus-alpha | 1208.3 ± 7.5 |
| 45 | gpt-5.3-chat | 1207.1 ± 12.0 |
| 46 | claude-sonnet-4 | 1207.0 ± 12.5 |
| 47 | gpt-5.5 | 1206.7 ± 10.9 |
| 48 | mistral-small-creative | 1206.6 ± 12.6 |
| 49 | deepseek-ai/DeepSeek-V3-0324 | 1206.0 ± 11.2 |
| 50 | openrouter/horizon-alpha | 1205.8 ± 5.5 |
| 51 | gpt-5-2025-08-07 | 1205.5 ± 6.8 |
| 52 | minimax/minimax-m2.5 | 1205.2 ± 8.2 |
| 53 | qwen/qwq-32b | 1203.0 ± 14.3 |
| 54 | mistralai/Mistral-Large-3-675B-Instruct-2512 | 1202.2 ± 11.3 |
| 55 | ifable/gemma-2-Ifable-9B | 1201.5 ± 12.8 |
| 56 | gemini-2.5-pro-exp-03-25 | 1201.1 ± 11.9 |
| 57 | gpt-5.4-mini | 1200.9 ± 17.1 |
| 58 | openrouter/horizon-beta | 1200.4 ± 12.4 |
| 59 | grok-3-beta | 1199.0 ± 15.6 |
| 60 | anthropic/claude-3.5-haiku-20241022 | 1196.4 ± 11.2 |
| 61 | deepseek-ai/DeepSeek-V3.2-Speciale | 1195.9 ± 16.0 |
| 62 | zai-org/GLM-4.7-Flash | 1195.6 ± 12.2 |
| 63 | claude-3-7-sonnet-20250219 | 1193.9 ± 6.3 |
| 64 | gpt-5.2 | 1192.2 ± 10.5 |
| 65 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 | 1191.4 ± 10.4 |
| 66 | chatgpt-4o-latest-2025-01-29 | 1190.0 ± 11.9 |
| 67 | google/gemma-3-27b-it | 1186.6 ± 6.9 |
| 68 | CohereForAI/c4ai-command-a-03-2025 | 1185.3 ± 10.3 |
| 69 | gpt-5-nano-2025-08-07 | 1184.2 ± 12.0 |
| 70 | gpt-4.5-preview | 1183.5 ± 12.2 |
| 71 | gpt-4.1-mini | 1180.8 ± 8.7 |
| 72 | openai/gpt-oss-120b | 1180.1 ± 14.1 |
| 73 | google/gemma-3-12b-it | 1178.8 ± 13.0 |
| 74 | allura-org/Gemma-3-Glitter-12B | 1178.0 ± 11.0 |
| 75 | gemini-2.5-flash-preview | 1175.9 ± 9.2 |
| 76 | google/gemma-3-4b-it | 1175.5 ± 9.4 |
| 77 | sam-paech/Darkest-muse-v1 | 1172.7 ± 6.1 |
| 78 | gemini-2.0-flash-001 | 1170.7 ± 5.8 |
| 79 | gpt-4.1-nano | 1166.8 ± 5.9 |
| 80 | liquid/lfm-7b | 1164.0 ± 11.2 |
| 81 | THUDM/GLM-4-32B-0414 | 1161.6 ± 13.0 |
| 82 | mistralai/mistral-large-2411 | 1160.2 ± 10.0 |
| 83 | gpt-4o-mini | 1155.2 ± 16.1 |
| 84 | mistralai/Pixtral-Large-Instruct-2411 | 1152.6 ± 12.0 |
| 85 | google/gemma-2-9b-it | 1152.6 ± 11.6 |
| 86 | openai/gpt-4-0314 | 1152.0 ± 8.2 |
| 87 | meta-llama/llama-3.1-70b-instruct | 1152.0 ± 11.5 |
| 88 | meta-llama/llama-3.1-405b-instruct | 1151.8 ± 10.5 |
| 89 | mistralai/Mistral-Nemo-Instruct-2407 | 1150.9 ± 9.9 |
| 90 | openai/gpt-oss-20b | 1146.0 ± 19.8 |
| 91 | ToastyPigeon/Gemma-3-Starshine-12B | 1140.6 ± 13.2 |
| 92 | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 1140.2 ± 12.3 |
| 93 | anthropic/claude-3-haiku | 1140.1 ± 16.4 |
| 94 | meta-llama/llama-3.1-8b-instruct | 1136.3 ± 16.9 |
| 95 | openai/gpt-3.5-turbo-0613 | 1136.2 ± 10.4 |
| 96 | openrouter/cypher-alpha | 1132.3 ± 17.1 |
| 97 | mistralai/mistral-small-3.1-24b-instruct-2503 | 1128.6 ± 12.7 |
| 98 | mistralai/Mistral-Small-24B-Instruct-2501 | 1110.0 ± 16.1 |
| 99 | meta-llama/Llama-4-Scout-17B-16E-Instruct | 1107.8 ± 21.0 |
| 100 | meta-llama/llama-3.2-3b-instruct | 1095.4 ± 24.9 |
| 101 | meta-llama/llama-3.2-1b-instruct | 1042.4 ± 17.4 |
5. Limitations, Known Biases, & Future Work
- Domain Specificity: This model is fine-tuned specifically for judging instruction-following and creative writing responses. Its discriminative accuracy is expected to degrade if it is used for rigorous fact-checking, mathematical, or strict code-generation evaluations.
- Residual Length Correlation: Although length bias is small in the tournament data, LLM judges tend to favor longer responses. Unbounded production usage without truncation or length constraints may reintroduce some verbosity bias.
- Input Sensitivity: The empirical performance strictly relies on evaluating continuous token logprobs (A and B). Utilizing naive string-text generation (only looking at the output token) rather than parsing the logits will yield sub-optimal structural performance and higher variance.
- Future Data Scaling: This V1 release was trained over 3 epochs on a single B200 GPU. Scaling the training pipeline with more extensive and diverse training data is already planned for the future and is anticipated to yield significant improvements in discriminative performance.