Model Card: selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge

Date: June 5, 2026

selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge is a specialized, fine-tuned large language model designed to function as an "LLM-as-a-Judge," specifically optimized for evaluating and ranking creative writing responses. Built on the Qwen 3.5 4B architecture, it compares two responses to the same instruction and determines which is superior in accuracy, clarity, and originality.


1. Model Details

  • Developed by: selfhypnosis-ai
  • Base Model: Qwen3.5-4B
  • Task: LLM-as-a-Judge (Pairwise Preference Evaluation)
  • Language(s): English
  • Format: 16-bit weights (A quantized version is also available at selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge-GPTQ-4bit, which exhibits no measurable degradation in evaluation performance.)
  • Fine-Tuned Context Length: 32,768 (32k) tokens
  • Training Infrastructure: 1x NVIDIA B200
  • Training Epochs: 3

2. Intended Use and Implementation

This model relies on a specific prompting strategy. Rather than reading generated text, it works by extracting and comparing the log probabilities (logprobs) that the model assigns to the tokens A and B as its single output token.

To effectively mitigate the persistent "positional bias" common in LLM judges, inference should be run twice for every pair of texts—swapping the positions of Response A and Response B—and averaging the results.

Prompting Format

The model expects a strict instruction-based system prompt and a JSON-structured user prompt.

import json

quality_sys_msg = (
    "You are a specialized AI assistant tasked with evaluating two responses to a given instruction. "
    "Your purpose is to fairly compare the quality of each response and select the better one. "
    "The user provides the instruction and the two responses A and B. Evaluate based on accuracy, "
    "clarity, and originality, considering repetitive or unoriginal responses as lower quality."
)

# Example User Content
user_content = json.dumps({
    "instruction": "Write a short poem about the ocean.",
    "A": "[Response 1 Text]",
    "B": "[Response 2 Text]"
}, ensure_ascii=False)

messages = [
    {"role": "system", "content": quality_sys_msg},
    {"role": "user", "content": user_content}
]

Logits Extraction & Position Swapping

Rather than parsing the generated output text, the intended usage reads the logprobs of the tokens A and B and renormalizes them over those two tokens to obtain a confidence score, expressed as a percentage:

P(A) = \frac{e^{\text{logprob}_A}}{e^{\text{logprob}_A} + e^{\text{logprob}_B}} \times 100
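
As a worked illustration of the normalization, the following minimal sketch computes P(A) and P(B) from two logprobs. The function name `pairwise_preference` and the sample logprob values are assumptions made for illustration, not part of the released code. Subtracting the larger logprob before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math

def pairwise_preference(logprob_a: float, logprob_b: float) -> tuple[float, float]:
    """Return (P(A), P(B)) as percentages, renormalized over the two tokens only."""
    # Shift by the larger logprob so exp() cannot underflow to 0/0.
    m = max(logprob_a, logprob_b)
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    total = ea + eb
    return ea / total * 100, eb / total * 100

# Hypothetical logprobs for the tokens "A" and "B".
p_a, p_b = pairwise_preference(-0.2, -1.8)
print(f"P(A) = {p_a:.1f}%, P(B) = {p_b:.1f}%")
```

Only the difference between the two logprobs matters: any probability mass the model places on other tokens is discarded by the renormalization.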

Here is an abbreviated vLLM implementation demonstrating the dual-pass logic (it reuses quality_sys_msg from the Prompting Format snippet above):

import math
import json
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge"
llm = LLM(model=model_path, kv_cache_dtype="fp8", max_model_len=32768)
tokenizer = AutoTokenizer.from_pretrained(model_path)

id_A = tokenizer.encode("A", add_special_tokens=False)[0]
id_B = tokenizer.encode("B", add_special_tokens=False)[0]

sampling_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=10)

def evaluate_pair(instruction, text1, text2):
    # Pass 1: text1 is A, text2 is B
    c1 = json.dumps({"instruction": instruction, "A": text1, "B": text2}, ensure_ascii=False)
    p1 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c1}], 
        tokenize=False, add_generation_prompt=True
    )
    
    # Pass 2: text2 is A, text1 is B
    c2 = json.dumps({"instruction": instruction, "A": text2, "B": text1}, ensure_ascii=False)
    p2 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c2}], 
        tokenize=False, add_generation_prompt=True
    )

    outputs = llm.generate([p1, p2], sampling_params, use_tqdm=False)

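    def extract_probs(out):
        # Top-k logprobs for the single generated token: {token_id: Logprob}
        lps = out.outputs[0].logprobs[0]
        # If A or B falls outside the returned top-k, fall back to a very low logprob
        logA = getattr(lps.get(id_A), "logprob", -100)
        logB = getattr(lps.get(id_B), "logprob", -100)
        pA, pB = math.exp(logA), math.exp(logB)
        # Renormalize over the two candidate tokens and express as percentages
        return (pA/(pA+pB)*100, pB/(pA+pB)*100) if (pA+pB) > 0 else (50.0, 50.0)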

    S1_pos1_prob, S1_pos2_prob = extract_probs(outputs[0])
    S2_pos1_prob, S2_pos2_prob = extract_probs(outputs[1])

    # S1_pos1_prob = Probability text1 is better
    # S2_pos2_prob = Probability text1 is better (when in position B)
    avg_text1_better = (S1_pos1_prob + S2_pos2_prob) / 2
    avg_text2_better = (S1_pos2_prob + S2_pos1_prob) / 2

    return avg_text1_better, avg_text2_better
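
To make the position-swap averaging concrete without loading a model, here is a small sketch that combines two already-computed passes. The helper `combine_passes`, its return keys, and the sample percentages are hypothetical. The `consistent` flag is an added convenience showing whether both passes picked the same underlying text.

```python
def combine_passes(pass1, pass2):
    """Average two swapped passes.

    pass1 = (P(A), P(B)) with text1 shown as A and text2 as B.
    pass2 = (P(A), P(B)) with text2 shown as A and text1 as B.
    """
    text1 = (pass1[0] + pass2[1]) / 2
    text2 = (pass1[1] + pass2[0]) / 2
    # Which underlying text did each individual pass prefer?
    pick1 = "text1" if pass1[0] >= pass1[1] else "text2"
    pick2 = "text1" if pass2[1] >= pass2[0] else "text2"
    return {
        "text1": text1,
        "text2": text2,
        "winner": "text1" if text1 >= text2 else "text2",
        "consistent": pick1 == pick2,
    }

# Both passes prefer text1, regardless of position.
agree = combine_passes((70.0, 30.0), (45.0, 55.0))
# The model picks position A both times: a position-biased, inconsistent pair.
biased = combine_passes((70.0, 30.0), (60.0, 40.0))
print(agree)
print(biased)
```

In the second case the averaged score still leans toward text1 (55 vs. 45), but the margin shrinks because the two passes disagree, which is the effect the swap is meant to expose.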

3. Discrete Pairwise Classification (Zero-Shot Accuracy)

Methodology

The model's zero-shot assessment capabilities were evaluated using the eqbench_creative_writing dataset. Ground-truth preference labels in this dataset are defined by Claude 4.6 Sonnet (and supplemented by Claude 4 Sonnet), representing state-of-the-art automated baseline scoring.

To isolate purely semantic reasoning capabilities from length bias, text pairs were stringently filtered using a maximum length ratio (R):

R = \frac{\max(L_A, L_B)}{\min(L_A, L_B)} \le 1.2

Furthermore, the evaluation targeted a highly constrained difficulty window, restricting the sample to pairs with a narrow and complex ground-truth score margin (5.0 <= Score Difference <= 10.0).
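
The two filters described above can be sketched as a single predicate. This is a sketch under assumptions: the function name `keep_pair` is hypothetical, and the unit of length (characters or tokens) is not stated here, so the lengths are treated as plain numbers.

```python
def keep_pair(len_a, len_b, score_a, score_b,
              max_ratio=1.2, min_margin=5.0, max_margin=10.0):
    """Return True if a pair passes the length-ratio and score-margin filters."""
    shorter, longer = min(len_a, len_b), max(len_a, len_b)
    if shorter <= 0:
        return False  # an empty response cannot form a valid ratio
    ratio = longer / shorter             # R = max(L_A, L_B) / min(L_A, L_B)
    margin = abs(score_a - score_b)      # ground-truth score difference
    return ratio <= max_ratio and min_margin <= margin <= max_margin

print(keep_pair(1000, 1150, 72.0, 65.0))  # ratio 1.15, margin 7.0
print(keep_pair(1000, 1300, 72.0, 65.0))  # ratio 1.30 exceeds the limit
print(keep_pair(1000, 1100, 72.0, 70.0))  # margin 2.0 is below the window
```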

Quantitative Results

While broader general evaluations demonstrate an aggregate accuracy of approximately 78%, the results on this targeted, high-difficulty subset (N=500, Seed: 42) illustrate the distinct advantages of logit-based evaluation versus discrete token prediction:

| Metric | Measured Value | Description |
|---|---|---|
| Strict Accuracy | 59.8% | (Token Prediction): The frequency of correctly identifying the superior response using hard classification. This strictly measures if the model output the exact correct token ('A' or 'B') across both spatial orientations. |
| Positional Consistency | 79.4% | (Logit Evaluation): The stability metric derived by using the logits. This reflects how consistently the model prefers the same underlying text representation, regardless of its assignment to position A or B, by analyzing the continuous probabilistic logit distribution. |
| Mean Target Probability Mass | 65.7% | The average calculated logit-derived probability assigned to the ground-truth superior response (the "better" answer). |
| Mean Non-Target Probability Mass | 34.3% | The average calculated logit-derived probability assigned to the ground-truth inferior response (the "worse" answer). |
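
One plausible reading of these four metrics is sketched below. The record layout, the function name `summarize`, and the toy data are assumptions. In particular, treating "strict accuracy" as "both orientations pick the ground-truth text" and "positional consistency" as "both orientations pick the same text" is an interpretation of the descriptions, not the exact evaluation script.

```python
def summarize(records):
    """records: dicts with 'pass1' and 'pass2' as (P(A), P(B)) percentages and
    'gold' in {'text1', 'text2'}. pass1 shows text1 as A; pass2 shows text2 as A."""
    n = len(records)
    strict = consistent = 0
    target_mass = 0.0
    for r in records:
        (a1, b1), (a2, b2) = r["pass1"], r["pass2"]
        pick1 = "text1" if a1 > b1 else "text2"   # hard choice in pass 1
        pick2 = "text2" if a2 > b2 else "text1"   # hard choice in pass 2
        consistent += pick1 == pick2
        strict += pick1 == pick2 == r["gold"]
        # Swap-averaged probability for text1 (each pass sums to 100).
        t1 = (a1 + b2) / 2
        target_mass += t1 if r["gold"] == "text1" else 100 - t1
    return {
        "strict_accuracy": 100 * strict / n,
        "positional_consistency": 100 * consistent / n,
        "mean_target_mass": target_mass / n,
        "mean_non_target_mass": 100 - target_mass / n,
    }

toy = [
    {"pass1": (80, 20), "pass2": (30, 70), "gold": "text1"},
    {"pass1": (60, 40), "pass2": (65, 35), "gold": "text1"},
    {"pass1": (20, 80), "pass2": (90, 10), "gold": "text1"},
    {"pass1": (10, 90), "pass2": (85, 15), "gold": "text2"},
]
stats = summarize(toy)
print(stats)
```

Under this reading, consistency can exceed strict accuracy because a pair that is wrong in both orientations still counts as consistent, and the target and non-target masses always sum to 100%.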

4. Tournament Separability Meta-Evaluation (Judgemark v4)

Following the meta-evaluation principles introduced in Judgemark v4, we directly tested our model's discriminative resolution: can it assign preferences that cleanly separate stronger writing from weaker writing across the dataset?

Because this judge is optimized for pairwise tournament Elo evaluations rather than pointwise rubric scoring, we run two different meta-evaluations directly in the Elo space:

Option A: Prompt-Isolated Elo Separability (Judgemark v4 Pointwise Analogue)

  • Methodology: To simulate item-level scoring, we execute isolated Swiss tournaments (5 rounds) restricted to each individual prompt in the dataset. This builds a score distribution of 32 prompt-specific Elo ratings for each of the 101 writer models. We then calculate Omega-Squared (ω²) and Mean Absolute Cliff's Delta across these prompt-level distributions.
  • Separability Results:
    • Omega-Squared (ANOVA effect size): 0.6811 (68.1% of the variance in prompt-specific Elo ratings is explained solely by writer model identity, indicating high resilience to prompt-specific noise).
    • Mean Absolute Cliff's Delta: 0.6377
    • Combined Separability Score: 87.92 / 100.0 (Formula: ((0.5 * ω² + 0.5 * d_cliff) / 0.75) * 100).
  • Symmetry & Bias Results:
    • Positional Bias (Avg Assignment): Position 1 (A): 52.3% | Position 2 (B): 47.7%
    • Length Bias (Longer Response Win Rate): 53.4%
    • Length Bias (Avg EV Margin for Longer): +5.3%
  • Significance: This indicates strong discriminative ability. The judge consistently partitions writing quality in the expected direction, reliably separating strong models from low-tier anchor models, while position and length biases stay small (52.3% vs. 47.7% by position; 53.4% win rate for the longer response).
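
The combined score is simple arithmetic on the two effect sizes, and the sketch below reproduces the reported value from the formula quoted above. The function names are hypothetical. `cliffs_delta` shows the textbook definition for a single pair of score lists; how the mean absolute value is aggregated across writer models is not specified here, so that step is left out.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs of two samples."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

def combined_separability(omega_sq, mean_abs_cliff, norm=0.75):
    """((0.5 * omega^2 + 0.5 * d_cliff) / 0.75) * 100, as quoted in the results."""
    return (0.5 * omega_sq + 0.5 * mean_abs_cliff) / norm * 100

# Values reported for the prompt-isolated evaluation.
score = combined_separability(0.6811, 0.6377)
print(round(score, 2))

# Toy Elo lists for two hypothetical writer models.
print(cliffs_delta([1230, 1225, 1240], [1180, 1175]))
```

Because the average of the two effect sizes is divided by 0.75, any average above 0.75 produces a combined score above 100.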

📈 Option A: Prompt-Isolated Elo Leaderboard
Rank Model Name Avg Elo (Samples: 32)
1 moonshotai/Kimi-K2.5 1231.4
2 moonshotai/Kimi-K2.6 1230.3
3 moonshotai/Kimi-K2-Instruct 1229.7
4 claude-opus-4-7 1229.0
5 deepseek-ai/DeepSeek-V3.2 1226.0
6 grok-4.20-beta 1226.0
7 o3 1225.6
8 deepseek-ai/DeepSeek-V4-Pro 1225.0
9 openrouter/sherlock-dash-alpha 1224.3
10 claude-opus-4-6 1222.9
11 moonshotai/Kimi-K2-Thinking 1222.8
12 grok-4.1-fast 1222.3
13 zai-org/GLM-5.1 1222.1
14 NousResearch/Hermes-4-405B 1221.8
15 gemini-3-pro-preview 1220.1
16 gemini-2.5-pro-preview-06-05 1219.1
17 deepseek-ai/DeepSeek-R1 1219.1
18 gemini-3.1-pro-preview 1219.0
19 google/gemma-4-31B-it 1218.7
20 Qwen/Qwen3.5-397B-A17B 1218.7
21 claude-opus-4 1218.4
22 deepseek-ai/DeepSeek-R1-0528 1218.0
23 claude-opus-4-5-20251101 1217.6
24 deepseek-ai/DeepSeek-V4-Flash 1217.6
25 zai-org/GLM-4.6 1217.4
26 deepseek-ai/DeepSeek-V3.1 1215.7
27 Nanbeige/Nanbeige4-3B-Thinking-2511 1215.4
28 zai-org/GLM-5 1213.9
29 openrouter/pony-alpha 1213.2
30 hunter-alpha 1213.1
31 google/gemma-4-26B-A4B-it 1213.0
32 claude-sonnet-4-6 1212.3
33 mistral-medium-3.1 1211.8
34 zai-org/GLM-4.5 1211.7
35 qwen/qwen3-235b-a22b:thinking 1211.3
36 RekaAI/reka-flash-3 1210.5
37 claude-sonnet-4.5 1210.3
38 claude-3-5-sonnet-20241022 1209.9
39 gpt-5.4 1209.1
40 deepseek-ai/DeepSeek-V3-0324 1208.7
41 zai-org/GLM-4.7 1206.6
42 mistral-small-creative 1206.0
43 gpt-5-mini-2025-08-07 1204.9
44 minimax/minimax-m2.5 1204.9
45 gpt-5.5 1204.6
46 chatgpt-4o-latest-2025-03-27 1204.5
47 gpt-5-2025-08-07 1204.0
48 quasar-alpha 1203.4
49 openrouter/horizon-beta 1203.3
50 ifable/gemma-2-Ifable-9B 1203.3
51 optimus-alpha 1203.1
52 qwen/qwq-32b 1203.1
53 claude-3-7-sonnet-20250219 1202.7
54 mistralai/Mistral-Large-3-675B-Instruct-2512 1202.6
55 openrouter/horizon-alpha 1202.2
56 gpt-5.4-mini 1202.0
57 mistralai/Mistral-Small-3.2-24B-Instruct-2506 1201.8
58 claude-sonnet-4 1200.8
59 grok-3-beta 1199.1
60 gemini-2.5-pro-exp-03-25 1198.9
61 gpt-5.3-chat 1198.7
62 anthropic/claude-3.5-haiku-20241022 1197.2
63 deepseek-ai/DeepSeek-V3.2-Speciale 1196.9
64 zai-org/GLM-4.7-Flash 1196.7
65 gpt-4.1-mini 1196.2
66 gpt-5.2 1196.0
67 chatgpt-4o-latest-2025-01-29 1195.1
68 gpt-4.5-preview 1194.6
69 CohereForAI/c4ai-command-a-03-2025 1192.9
70 gemini-2.5-flash-preview 1190.2
71 openai/gpt-oss-120b 1189.6
72 google/gemma-3-27b-it 1187.4
73 gpt-5-nano-2025-08-07 1187.0
74 gemini-2.0-flash-001 1186.7
75 meta-llama/llama-3.1-405b-instruct 1185.3
76 google/gemma-3-4b-it 1185.2
77 google/gemma-3-12b-it 1184.1
78 allura-org/Gemma-3-Glitter-12B 1183.0
79 mistralai/mistral-large-2411 1182.9
80 sam-paech/Darkest-muse-v1 1180.8
81 google/gemma-2-9b-it 1180.4
82 liquid/lfm-7b 1178.2
83 THUDM/GLM-4-32B-0414 1177.3
84 meta-llama/llama-3.1-70b-instruct 1177.2
85 gpt-4.1-nano 1176.9
86 openai/gpt-3.5-turbo-0613 1176.7
87 mistralai/Mistral-Nemo-Instruct-2407 1176.5
88 anthropic/claude-3-haiku 1176.2
89 openai/gpt-4-0314 1174.7
90 meta-llama/Llama-4-Maverick-17B-128E-Instruct 1173.6
91 openai/gpt-oss-20b 1172.8
92 gpt-4o-mini 1172.5
93 mistralai/Pixtral-Large-Instruct-2411 1172.1
94 meta-llama/llama-3.1-8b-instruct 1170.5
95 openrouter/cypher-alpha 1169.2
96 meta-llama/llama-3.2-3b-instruct 1165.9
97 ToastyPigeon/Gemma-3-Starshine-12B 1164.8
98 mistralai/mistral-small-3.1-24b-instruct-2503 1164.5
99 meta-llama/Llama-4-Scout-17B-16E-Instruct 1162.7
100 mistralai/Mistral-Small-24B-Instruct-2501 1160.2
101 meta-llama/llama-3.2-1b-instruct 1146.3

Option B: Bootstrapped Tournament Separability (Leaderboard Stability)

  • Methodology: To measure the overall stability and reproducibility of a global leaderboard under different prompt selections, we perform bootstrap resampling on the prompts (drawing 32 prompts with replacement) over 10 independent iterations. A full 20-round tournament is run for each sample to calculate global Elos.
  • Separability Results:
    • Omega-Squared (ANOVA effect size): 0.9020 (90.2% of the variance in global Elos across bootstrap runs is explained by model identity. The remaining 9.8% is due to prompt selection variance).
    • Mean Absolute Cliff's Delta: 0.8188
    • Combined Separability Score: 114.72 / 100.0 (Exceeds 100 due to the official 0.75 normalization scaling factor).
  • Symmetry & Bias Results:
    • Positional Bias (Avg Assignment): Position 1 (A): 52.6% | Position 2 (B): 47.4%
    • Length Bias (Longer Response Win Rate): 51.6%
    • Length Bias (Avg EV Margin for Longer): +2.7%
  • Significance: This indicates that the tournament rankings are highly stable. Aggregating matches across 32 prompts smooths out prompt-level noise, so a slightly different prompt mix is unlikely to reorder the leaderboard substantially (the per-model spread across bootstrap runs is about ±15 Elo points).
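
The resampling loop described in the methodology can be sketched as follows. Everything named here is an assumption for illustration: `bootstrap_elos` is a hypothetical helper, `toy_tournament` is a stand-in for the real 20-round tournament (which requires the judge), and reporting the spread as a sample standard deviation is a guess at what the ± values represent.

```python
import random
import statistics

def bootstrap_elos(prompts, run_tournament, n_iter=10, seed=0):
    """Resample prompts with replacement and collect per-model Elo statistics."""
    rng = random.Random(seed)
    per_model = {}
    for _ in range(n_iter):
        # Draw len(prompts) prompts with replacement (32 in the setup above).
        sample = rng.choices(prompts, k=len(prompts))
        for model, elo in run_tournament(sample).items():
            per_model.setdefault(model, []).append(elo)
    return {m: (statistics.mean(v), statistics.stdev(v)) for m, v in per_model.items()}

def toy_tournament(sample):
    """Stand-in for a real tournament: Elo depends mildly on the prompt mix."""
    shift = sum(sample) / len(sample)  # prompts are plain integers in this toy
    return {"strong-model": 1250 + shift, "weak-model": 1150 - shift}

results = bootstrap_elos(list(range(32)), toy_tournament)
for model, (mean_elo, spread) in results.items():
    print(f"{model}: {mean_elo:.1f} ± {spread:.1f}")
```

A small spread relative to the gap between models is what makes the ranking stable under a different prompt mix.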

📈 Option B: Bootstrapped Tournament Elo Leaderboard
Rank Model Name Avg Elo (Across Bootstraps)
1 moonshotai/Kimi-K2.6 1294.3 ± 16.6
2 grok-4.20-beta 1290.3 ± 12.5
3 moonshotai/Kimi-K2-Instruct 1283.8 ± 25.5
4 claude-opus-4-7 1278.8 ± 18.7
5 o3 1275.8 ± 16.9
6 moonshotai/Kimi-K2.5 1272.9 ± 29.0
7 moonshotai/Kimi-K2-Thinking 1266.1 ± 22.1
8 openrouter/sherlock-dash-alpha 1256.3 ± 16.2
9 claude-opus-4-6 1255.5 ± 14.2
10 zai-org/GLM-5.1 1250.6 ± 17.3
11 deepseek-ai/DeepSeek-V3.2 1247.7 ± 14.0
12 grok-4.1-fast 1246.1 ± 14.8
13 NousResearch/Hermes-4-405B 1245.4 ± 20.4
14 gemini-3.1-pro-preview 1245.2 ± 8.5
15 Nanbeige/Nanbeige4-3B-Thinking-2511 1244.9 ± 17.5
16 deepseek-ai/DeepSeek-V4-Pro 1243.1 ± 15.5
17 deepseek-ai/DeepSeek-V4-Flash 1237.2 ± 13.8
18 gemini-3-pro-preview 1236.9 ± 12.8
19 deepseek-ai/DeepSeek-R1 1235.3 ± 20.3
20 google/gemma-4-31B-it 1234.5 ± 12.7
21 deepseek-ai/DeepSeek-V3.1 1233.6 ± 17.0
22 claude-opus-4-5-20251101 1232.4 ± 7.0
23 claude-opus-4 1230.9 ± 19.8
24 zai-org/GLM-4.6 1229.4 ± 13.3
25 deepseek-ai/DeepSeek-R1-0528 1228.8 ± 12.1
26 zai-org/GLM-4.5 1228.5 ± 11.5
27 gemini-2.5-pro-preview-06-05 1227.7 ± 9.7
28 Qwen/Qwen3.5-397B-A17B 1227.6 ± 17.1
29 mistral-medium-3.1 1226.4 ± 10.9
30 hunter-alpha 1224.2 ± 11.7
31 qwen/qwen3-235b-a22b:thinking 1224.0 ± 15.0
32 google/gemma-4-26B-A4B-it 1223.0 ± 9.9
33 zai-org/GLM-5 1221.1 ± 9.8
34 openrouter/pony-alpha 1220.9 ± 9.9
35 gpt-5.4 1217.6 ± 13.6
36 zai-org/GLM-4.7 1216.5 ± 9.1
37 quasar-alpha 1216.5 ± 9.4
38 claude-sonnet-4-6 1215.6 ± 16.0
39 claude-3-5-sonnet-20241022 1213.8 ± 11.3
40 chatgpt-4o-latest-2025-03-27 1213.8 ± 10.7
41 RekaAI/reka-flash-3 1213.7 ± 10.3
42 claude-sonnet-4.5 1213.6 ± 8.5
43 gpt-5-mini-2025-08-07 1210.4 ± 10.6
44 optimus-alpha 1208.3 ± 7.5
45 gpt-5.3-chat 1207.1 ± 12.0
46 claude-sonnet-4 1207.0 ± 12.5
47 gpt-5.5 1206.7 ± 10.9
48 mistral-small-creative 1206.6 ± 12.6
49 deepseek-ai/DeepSeek-V3-0324 1206.0 ± 11.2
50 openrouter/horizon-alpha 1205.8 ± 5.5
51 gpt-5-2025-08-07 1205.5 ± 6.8
52 minimax/minimax-m2.5 1205.2 ± 8.2
53 qwen/qwq-32b 1203.0 ± 14.3
54 mistralai/Mistral-Large-3-675B-Instruct-2512 1202.2 ± 11.3
55 ifable/gemma-2-Ifable-9B 1201.5 ± 12.8
56 gemini-2.5-pro-exp-03-25 1201.1 ± 11.9
57 gpt-5.4-mini 1200.9 ± 17.1
58 openrouter/horizon-beta 1200.4 ± 12.4
59 grok-3-beta 1199.0 ± 15.6
60 anthropic/claude-3.5-haiku-20241022 1196.4 ± 11.2
61 deepseek-ai/DeepSeek-V3.2-Speciale 1195.9 ± 16.0
62 zai-org/GLM-4.7-Flash 1195.6 ± 12.2
63 claude-3-7-sonnet-20250219 1193.9 ± 6.3
64 gpt-5.2 1192.2 ± 10.5
65 mistralai/Mistral-Small-3.2-24B-Instruct-2506 1191.4 ± 10.4
66 chatgpt-4o-latest-2025-01-29 1190.0 ± 11.9
67 google/gemma-3-27b-it 1186.6 ± 6.9
68 CohereForAI/c4ai-command-a-03-2025 1185.3 ± 10.3
69 gpt-5-nano-2025-08-07 1184.2 ± 12.0
70 gpt-4.5-preview 1183.5 ± 12.2
71 gpt-4.1-mini 1180.8 ± 8.7
72 openai/gpt-oss-120b 1180.1 ± 14.1
73 google/gemma-3-12b-it 1178.8 ± 13.0
74 allura-org/Gemma-3-Glitter-12B 1178.0 ± 11.0
75 gemini-2.5-flash-preview 1175.9 ± 9.2
76 google/gemma-3-4b-it 1175.5 ± 9.4
77 sam-paech/Darkest-muse-v1 1172.7 ± 6.1
78 gemini-2.0-flash-001 1170.7 ± 5.8
79 gpt-4.1-nano 1166.8 ± 5.9
80 liquid/lfm-7b 1164.0 ± 11.2
81 THUDM/GLM-4-32B-0414 1161.6 ± 13.0
82 mistralai/mistral-large-2411 1160.2 ± 10.0
83 gpt-4o-mini 1155.2 ± 16.1
84 mistralai/Pixtral-Large-Instruct-2411 1152.6 ± 12.0
85 google/gemma-2-9b-it 1152.6 ± 11.6
86 openai/gpt-4-0314 1152.0 ± 8.2
87 meta-llama/llama-3.1-70b-instruct 1152.0 ± 11.5
88 meta-llama/llama-3.1-405b-instruct 1151.8 ± 10.5
89 mistralai/Mistral-Nemo-Instruct-2407 1150.9 ± 9.9
90 openai/gpt-oss-20b 1146.0 ± 19.8
91 ToastyPigeon/Gemma-3-Starshine-12B 1140.6 ± 13.2
92 meta-llama/Llama-4-Maverick-17B-128E-Instruct 1140.2 ± 12.3
93 anthropic/claude-3-haiku 1140.1 ± 16.4
94 meta-llama/llama-3.1-8b-instruct 1136.3 ± 16.9
95 openai/gpt-3.5-turbo-0613 1136.2 ± 10.4
96 openrouter/cypher-alpha 1132.3 ± 17.1
97 mistralai/mistral-small-3.1-24b-instruct-2503 1128.6 ± 12.7
98 mistralai/Mistral-Small-24B-Instruct-2501 1110.0 ± 16.1
99 meta-llama/Llama-4-Scout-17B-16E-Instruct 1107.8 ± 21.0
100 meta-llama/llama-3.2-3b-instruct 1095.4 ± 24.9
101 meta-llama/llama-3.2-1b-instruct 1042.4 ± 17.4

5. Limitations, Known Biases, & Future Work

  • Domain Specificity: This model is explicitly fine-tuned on instruction-following and creative writing tasks. Its discriminative accuracy is expected to degrade if deployed for rigorous fact-checking, mathematical reasoning, or strict code-generation evaluations.
  • Residual Length Correlation: Although the tournament data show only a small length effect, LLM judges tend to favor longer responses. Unbounded production usage without truncation or length constraints may reintroduce a mild verbosity bias.
  • Input Sensitivity: The reported performance relies on comparing the logprobs of the A and B tokens. Reading only the generated output token, rather than the logprobs, discards the confidence signal and is expected to yield lower accuracy and higher variance.
  • Future Data Scaling: This V1 release was trained over 3 epochs on a single B200 GPU. Scaling the training pipeline with more extensive and diverse training data is already planned for the future and is anticipated to yield significant improvements in discriminative performance.