Model Card: `selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge`

Date: June 5, 2026

selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge is a specialized, fine-tuned large language model designed to function as an "LLM-as-a-Judge," optimized for evaluating and ranking creative writing responses. Built on the Qwen 3.5 4B architecture, it assesses pairs of responses to determine which is superior in accuracy, clarity, and originality.

1. Model Details

Developed by: selfhypnosis-ai
Base Model: Qwen3.5-4B
Task: LLM-as-a-Judge (Pairwise Preference Evaluation)
Language(s): English
Format: 16-bit weights (A quantized version is also available at selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge-GPTQ-4bit, which exhibits no measurable degradation in evaluation performance.)
Fine-Tuned Context Length: 32,768 (32k) tokens
Training Infrastructure: 1x NVIDIA B200
Training Epochs: 3

2. Intended Use and Implementation

This model relies on a specific prompting strategy and operates by extracting and comparing the output log probabilities (logprobs) of the tokens A and B.

To mitigate the positional bias common in LLM judges, run inference twice for every pair of texts, swapping the positions of Response A and Response B, and average the results.

Prompting Format

The model expects a strict instruction-based system prompt and a JSON-structured user prompt.

import json

quality_sys_msg = (
    "You are a specialized AI assistant tasked with evaluating two responses to a given instruction. "
    "Your purpose is to fairly compare the quality of each response and select the better one. "
    "The user provides the instruction and the two responses A and B. Evaluate based on accuracy, "
    "clarity, and originality, considering repetitive or unoriginal responses as lower quality."
)

# Example User Content
user_content = json.dumps({
    "instruction": "Write a short poem about the ocean.",
    "A": "[Response 1 Text]",
    "B": "[Response 2 Text]"
}, ensure_ascii=False)

messages = [
    {"role": "system", "content": quality_sys_msg},
    {"role": "user", "content": user_content}
]

Logits Extraction & Position Swapping

Rather than parsing the generated text, the intended usage reads the logprobs the model assigns to the tokens A and B at the first generated position and turns them into a confidence score. The probability, expressed as a percentage and normalized over those two tokens only, is defined as:

$P(A) = \frac{e^{\text{logprob}_A}}{e^{\text{logprob}_A} + e^{\text{logprob}_B}} \times 100$
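As a worked illustration of the formula above, the following minimal sketch converts two logprobs into a percentage. The function name `pairwise_confidence` is an assumption made for illustration and is not part of the released code. Subtracting the larger logprob before exponentiating is a standard numerical-stability step that leaves the ratio unchanged.

```python
import math

def pairwise_confidence(logprob_a: float, logprob_b: float) -> float:
    """Return P(A) in percent, normalized over the two tokens A and B only."""
    # Shift by the maximum so the exponentials cannot both underflow to zero.
    shift = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - shift)
    exp_b = math.exp(logprob_b - shift)
    return exp_a / (exp_a + exp_b) * 100

# Example: raw token probabilities of 0.6 for "A" and 0.2 for "B".
p_a = pairwise_confidence(math.log(0.6), math.log(0.2))
print(round(p_a, 1))  # 75.0
```

Because the score is renormalized over A and B, any probability mass the model places on other tokens is ignored.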

Here is an abbreviated vLLM implementation demonstrating the dual-pass logic (it reuses quality_sys_msg from the prompting format above):

import math
import json
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge"
llm = LLM(model=model_path, kv_cache_dtype="fp8", max_model_len=32768)
tokenizer = AutoTokenizer.from_pretrained(model_path)

id_A = tokenizer.encode("A", add_special_tokens=False)[0]
id_B = tokenizer.encode("B", add_special_tokens=False)[0]

sampling_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=10)

def evaluate_pair(instruction, text1, text2):
    # Pass 1: text1 is A, text2 is B
    # ensure_ascii=False keeps the JSON identical to the prompting format above
    c1 = json.dumps({"instruction": instruction, "A": text1, "B": text2}, ensure_ascii=False)
    p1 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c1}],
        tokenize=False, add_generation_prompt=True
    )

    # Pass 2: text2 is A, text1 is B
    c2 = json.dumps({"instruction": instruction, "A": text2, "B": text1}, ensure_ascii=False)
    p2 = tokenizer.apply_chat_template(
        [{"role": "system", "content": quality_sys_msg}, {"role": "user", "content": c2}], 
        tokenize=False, add_generation_prompt=True
    )

    outputs = llm.generate([p1, p2], sampling_params, use_tqdm=False)

    def extract_probs(out):
        lps = out.outputs[0].logprobs[0]
        logA = getattr(lps.get(id_A), "logprob", -100)
        logB = getattr(lps.get(id_B), "logprob", -100)
        pA, pB = math.exp(logA), math.exp(logB)
        return (pA/(pA+pB)*100, pB/(pA+pB)*100) if (pA+pB) > 0 else (50.0, 50.0)

    S1_pos1_prob, S1_pos2_prob = extract_probs(outputs[0])
    S2_pos1_prob, S2_pos2_prob = extract_probs(outputs[1])

    # S1_pos1_prob = Probability text1 is better
    # S2_pos2_prob = Probability text1 is better (when in position B)
    avg_text1_better = (S1_pos1_prob + S2_pos2_prob) / 2
    avg_text2_better = (S1_pos2_prob + S2_pos1_prob) / 2

    return avg_text1_better, avg_text2_better
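The averaging step can also be exercised without a GPU. The sketch below takes the two (P(A), P(B)) tuples that the passes produce and reports the averaged scores, a winner, and whether the two passes agreed. The helper name `combine_passes` and the "consistent" flag are assumptions added for illustration; the model card itself only specifies the averaging.

```python
def combine_passes(pass1, pass2):
    """Average two position-swapped passes.

    pass1 = (P(A), P(B)) with text1 shown as A.
    pass2 = (P(A), P(B)) with text2 shown as A.
    All values are percentages.
    """
    text1_as_a, text1_as_b = pass1[0], pass2[1]
    avg_text1 = (text1_as_a + text1_as_b) / 2
    avg_text2 = (pass1[1] + pass2[0]) / 2
    # The passes agree when text1 lands on the same side of 50% in both.
    consistent = (text1_as_a > 50) == (text1_as_b > 50)
    if avg_text1 == avg_text2:
        winner = "tie"
    else:
        winner = "text1" if avg_text1 > avg_text2 else "text2"
    return {"text1": avg_text1, "text2": avg_text2,
            "winner": winner, "consistent": consistent}

# Both passes favor text1: averaged score 62.5 vs 37.5.
agree = combine_passes((70.0, 30.0), (45.0, 55.0))
# Both passes favor whatever sits in position A: text1 still wins on
# average (60 vs 40), but the verdict is flagged as position-dependent.
flip = combine_passes((80.0, 20.0), (60.0, 40.0))
print(agree)
print(flip)
```

Flagging inconsistent pairs is one way to surface cases where position, rather than content, drove a single-pass verdict.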

3. Discrete Pairwise Classification (Zero-Shot Accuracy)

Methodology

The model's zero-shot assessment capabilities were evaluated using the eqbench_creative_writing dataset. Ground-truth preference labels in this dataset are defined by Claude 4.6 Sonnet (and supplemented by Claude 4 Sonnet), representing state-of-the-art automated baseline scoring.

To isolate purely semantic reasoning capabilities from length bias, text pairs were stringently filtered using a maximum length ratio (R):

$R = \frac{\max(L_A, L_B)}{\min(L_A, L_B)} \le 1.2$

Furthermore, the evaluation targeted a constrained difficulty window, restricting the sample to pairs whose ground-truth score margin is narrow (5.0 <= Score Difference <= 10.0).
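The two filters above can be sketched as a single predicate. This is an illustrative reading, not the evaluation script: the function name `keep_pair` is hypothetical, and the unit used for length (characters versus tokens) is not stated in this card, so the caller must pick one and apply it to both responses.

```python
def keep_pair(len_a, len_b, score_diff,
              max_ratio=1.2, min_diff=5.0, max_diff=10.0):
    """Return True if a pair passes the length-ratio and score-margin filters."""
    shorter, longer = min(len_a, len_b), max(len_a, len_b)
    if shorter <= 0:
        return False  # an empty response has no defined length ratio
    ratio = longer / shorter  # R = max(L_A, L_B) / min(L_A, L_B)
    return ratio <= max_ratio and min_diff <= abs(score_diff) <= max_diff

print(keep_pair(1000, 1100, 7.5))  # True: R = 1.1, margin inside the window
print(keep_pair(1000, 1300, 7.5))  # False: R = 1.3 exceeds 1.2
print(keep_pair(1000, 1000, 3.0))  # False: margin below 5.0
```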

Quantitative Results

While broader general evaluations demonstrate an aggregate accuracy of approximately 78%, the results on this targeted, high-difficulty subset (N=500, Seed: 42) illustrate the distinct advantages of logit-based evaluation versus discrete token prediction:

Metric	Measured Value	Description
Strict Accuracy	59.8%	(Token Prediction) How often the model identifies the superior response under hard classification, that is, whether it outputs the correct token ('A' or 'B') in both position orderings.
Positional Consistency	79.4%	(Logit Evaluation) How consistently the model prefers the same underlying text regardless of whether it is assigned to position A or B, measured from the logprob-derived probabilities rather than the sampled token.
Mean Target Probability Mass	65.7%	The average calculated logit-derived probability assigned to the ground-truth superior response (the "better" answer).
Mean Non-Target Probability Mass	34.3%	The average calculated logit-derived probability assigned to the ground-truth inferior response (the "worse" answer).
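The card does not give formulas for these four metrics, so the sketch below is only one plausible reading of the descriptions in the table. It assumes each evaluated pair yields two percentages, the probability assigned to the ground-truth better response in each of the two position orderings. The function name `summarize` and the record layout are hypothetical.

```python
from statistics import mean

def summarize(records):
    """records: list of (p_target_pass1, p_target_pass2), each in percent."""
    # Strict accuracy: the target wins the hard decision in both orderings.
    strict = mean(1.0 if p1 > 50 and p2 > 50 else 0.0 for p1, p2 in records)
    # Positional consistency: the same text is preferred in both orderings,
    # whether or not that text is the ground-truth target.
    consistent = mean(1.0 if (p1 > 50) == (p2 > 50) else 0.0
                      for p1, p2 in records)
    # Probability mass: position-averaged probability given to the target.
    target_mass = mean((p1 + p2) / 2 for p1, p2 in records)
    return {
        "strict_accuracy": strict * 100,
        "positional_consistency": consistent * 100,
        "mean_target_mass": target_mass,
        "mean_non_target_mass": 100 - target_mass,
    }

toy = summarize([(70, 60), (80, 40), (30, 20), (55, 65)])
print(toy)
```

On the four toy records, two are correct in both orderings, three are position-consistent, and the target receives 52.5% of the mass on average.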

4. Tournament Separability Meta-Evaluation (Judgemark v4)

Following the meta-evaluation principles introduced in Judgemark v4, we directly tested our model's discriminative resolution: can it assign preferences that cleanly separate stronger writing from weaker writing across the dataset?

Because this judge is optimized for pairwise tournament Elo evaluations rather than pointwise rubric scoring, we run two different meta-evaluations directly in the Elo space:
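For readers unfamiliar with Elo space, the sketch below shows the standard Elo update applied to one judged match. The card does not specify the rating procedure, so the K-factor of 32, the starting rating of 1200, and the use of the judge's position-averaged probability as a fractional match score are all assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is A's result in [0, 1]; here it is taken to be the judge's
    averaged probability that A is better (an assumption, see above).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two unrated writers; the judge gives writer A a 62.5% averaged preference.
new_a, new_b = elo_update(1200.0, 1200.0, 0.625)
print(new_a, new_b)  # 1204.0 1196.0
```

Using a fractional score instead of a hard win or loss lets a narrow preference move ratings less than a decisive one.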

Option A: Prompt-Isolated Elo Separability (Judgemark v4 Pointwise Analogue)

Methodology: To simulate item-level scoring, we execute isolated Swiss tournaments (5 rounds) restricted to each individual prompt in the dataset. This builds a score distribution of 32 prompt-specific Elo ratings for each of the 101 writer models. We then calculate Omega-Squared (ω²) and Mean Absolute Cliff's Delta across these prompt-level distributions.
Separability Results:
- Omega-Squared (ANOVA effect size): 0.6811 (68.1% of the variance in prompt-specific Elo ratings is explained solely by writer model identity, indicating high resilience to prompt-specific noise).
- Mean Absolute Cliff's Delta: 0.6377
- Combined Separability Score: 87.92 / 100.0 (Formula: ((0.5 * ω² + 0.5 * d_cliff) / 0.75) * 100).
Symmetry & Bias Results:
- Positional Bias (Avg Assignment): Position 1 (A): 52.3% | Position 2 (B): 47.7%
- Length Bias (Longer Response Win Rate): 53.4%
- Length Bias (Avg EV Margin for Longer): +5.3%
Significance: This indicates strong discriminative ability. The judge consistently partitions writing quality in the expected direction, reliably separating strong models from low-tier anchor models while showing only small position and length effects (52.3% vs. 47.7% by position; a 53.4% win rate for the longer response).
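The separability statistics above can be reproduced in a few lines. The sketch below implements the textbook definitions of Cliff's delta and one-way ANOVA omega-squared together with the combined-score formula quoted in the results. How model pairs are selected and averaged for the mean absolute Cliff's delta is not described in this card, so that step is left out; the function names are illustrative.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def omega_squared(groups):
    """One-way ANOVA effect size; groups holds one rating list per model."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_total = sum((v - grand) ** 2 for v in values)
    ms_within = (ss_total - ss_between) / (n - k)
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

def combined_separability(omega_sq, mean_abs_cliff):
    """((0.5 * omega^2 + 0.5 * d_cliff) / 0.75) * 100, as quoted above."""
    return (0.5 * omega_sq + 0.5 * mean_abs_cliff) / 0.75 * 100

# Plugging in the reported Option A values reproduces the reported score.
print(round(combined_separability(0.6811, 0.6377), 2))  # 87.92
```

Because the average is divided by 0.75 rather than 1.0, the combined score exceeds 100 whenever the mean of the two statistics is above 0.75.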

📈 Option A: Prompt-Isolated Elo Leaderboard

Rank	Model Name	Avg Elo (Samples: 32)
1	moonshotai/Kimi-K2.5	1231.4
2	moonshotai/Kimi-K2.6	1230.3
3	moonshotai/Kimi-K2-Instruct	1229.7
4	claude-opus-4-7	1229.0
5	deepseek-ai/DeepSeek-V3.2	1226.0
6	grok-4.20-beta	1226.0
7	o3	1225.6
8	deepseek-ai/DeepSeek-V4-Pro	1225.0
9	openrouter/sherlock-dash-alpha	1224.3
10	claude-opus-4-6	1222.9
11	moonshotai/Kimi-K2-Thinking	1222.8
12	grok-4.1-fast	1222.3
13	zai-org/GLM-5.1	1222.1
14	NousResearch/Hermes-4-405B	1221.8
15	gemini-3-pro-preview	1220.1
16	gemini-2.5-pro-preview-06-05	1219.1
17	deepseek-ai/DeepSeek-R1	1219.1
18	gemini-3.1-pro-preview	1219.0
19	google/gemma-4-31B-it	1218.7
20	Qwen/Qwen3.5-397B-A17B	1218.7
21	claude-opus-4	1218.4
22	deepseek-ai/DeepSeek-R1-0528	1218.0
23	claude-opus-4-5-20251101	1217.6
24	deepseek-ai/DeepSeek-V4-Flash	1217.6
25	zai-org/GLM-4.6	1217.4
26	deepseek-ai/DeepSeek-V3.1	1215.7
27	Nanbeige/Nanbeige4-3B-Thinking-2511	1215.4
28	zai-org/GLM-5	1213.9
29	openrouter/pony-alpha	1213.2
30	hunter-alpha	1213.1
31	google/gemma-4-26B-A4B-it	1213.0
32	claude-sonnet-4-6	1212.3
33	mistral-medium-3.1	1211.8
34	zai-org/GLM-4.5	1211.7
35	qwen/qwen3-235b-a22b:thinking	1211.3
36	RekaAI/reka-flash-3	1210.5
37	claude-sonnet-4.5	1210.3
38	claude-3-5-sonnet-20241022	1209.9
39	gpt-5.4	1209.1
40	deepseek-ai/DeepSeek-V3-0324	1208.7
41	zai-org/GLM-4.7	1206.6
42	mistral-small-creative	1206.0
43	gpt-5-mini-2025-08-07	1204.9
44	minimax/minimax-m2.5	1204.9
45	gpt-5.5	1204.6
46	chatgpt-4o-latest-2025-03-27	1204.5
47	gpt-5-2025-08-07	1204.0
48	quasar-alpha	1203.4
49	openrouter/horizon-beta	1203.3
50	ifable/gemma-2-Ifable-9B	1203.3
51	optimus-alpha	1203.1
52	qwen/qwq-32b	1203.1
53	claude-3-7-sonnet-20250219	1202.7
54	mistralai/Mistral-Large-3-675B-Instruct-2512	1202.6
55	openrouter/horizon-alpha	1202.2
56	gpt-5.4-mini	1202.0
57	mistralai/Mistral-Small-3.2-24B-Instruct-2506	1201.8
58	claude-sonnet-4	1200.8
59	grok-3-beta	1199.1
60	gemini-2.5-pro-exp-03-25	1198.9
61	gpt-5.3-chat	1198.7
62	anthropic/claude-3.5-haiku-20241022	1197.2
63	deepseek-ai/DeepSeek-V3.2-Speciale	1196.9
64	zai-org/GLM-4.7-Flash	1196.7
65	gpt-4.1-mini	1196.2
66	gpt-5.2	1196.0
67	chatgpt-4o-latest-2025-01-29	1195.1
68	gpt-4.5-preview	1194.6
69	CohereForAI/c4ai-command-a-03-2025	1192.9
70	gemini-2.5-flash-preview	1190.2
71	openai/gpt-oss-120b	1189.6
72	google/gemma-3-27b-it	1187.4
73	gpt-5-nano-2025-08-07	1187.0
74	gemini-2.0-flash-001	1186.7
75	meta-llama/llama-3.1-405b-instruct	1185.3
76	google/gemma-3-4b-it	1185.2
77	google/gemma-3-12b-it	1184.1
78	allura-org/Gemma-3-Glitter-12B	1183.0
79	mistralai/mistral-large-2411	1182.9
80	sam-paech/Darkest-muse-v1	1180.8
81	google/gemma-2-9b-it	1180.4
82	liquid/lfm-7b	1178.2
83	THUDM/GLM-4-32B-0414	1177.3
84	meta-llama/llama-3.1-70b-instruct	1177.2
85	gpt-4.1-nano	1176.9
86	openai/gpt-3.5-turbo-0613	1176.7
87	mistralai/Mistral-Nemo-Instruct-2407	1176.5
88	anthropic/claude-3-haiku	1176.2
89	openai/gpt-4-0314	1174.7
90	meta-llama/Llama-4-Maverick-17B-128E-Instruct	1173.6
91	openai/gpt-oss-20b	1172.8
92	gpt-4o-mini	1172.5
93	mistralai/Pixtral-Large-Instruct-2411	1172.1
94	meta-llama/llama-3.1-8b-instruct	1170.5
95	openrouter/cypher-alpha	1169.2
96	meta-llama/llama-3.2-3b-instruct	1165.9
97	ToastyPigeon/Gemma-3-Starshine-12B	1164.8
98	mistralai/mistral-small-3.1-24b-instruct-2503	1164.5
99	meta-llama/Llama-4-Scout-17B-16E-Instruct	1162.7
100	mistralai/Mistral-Small-24B-Instruct-2501	1160.2
101	meta-llama/llama-3.2-1b-instruct	1146.3

Option B: Bootstrapped Tournament Separability (Leaderboard Stability)

Methodology: To measure the overall stability and reproducibility of a global leaderboard under different prompt selections, we perform bootstrap resampling on the prompts (drawing 32 prompts with replacement) over 10 independent iterations. A full 20-round tournament is run for each sample to calculate global Elos.
Separability Results:
- Omega-Squared (ANOVA effect size): 0.9020 (90.2% of the variance in global Elos across bootstrap runs is explained by model identity. The remaining 9.8% is due to prompt selection variance).
- Mean Absolute Cliff's Delta: 0.8188
- Combined Separability Score: 114.72 / 100.0 (Exceeds 100 due to the official 0.75 normalization scaling factor).
Symmetry & Bias Results:
- Positional Bias (Avg Assignment): Position 1 (A): 52.6% | Position 2 (B): 47.4%
- Length Bias (Longer Response Win Rate): 51.6%
- Length Bias (Avg EV Margin for Longer): +2.7%
Significance: This indicates that the tournament rankings are highly stable. Aggregating matches across 32 prompts smooths out prompt-level noise, so a slightly different prompt mix is unlikely to change the leaderboard materially (the per-model spread across bootstraps averages about ±15 Elo points).
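The resampling procedure can be sketched as follows. The tournament itself is passed in as a callable, `run_tournament`, which is a hypothetical stand-in for the real 20-round tournament and must return a mapping from model name to Elo. Reporting the spread as a sample standard deviation is also an assumption, since the card does not say how the ± values in the leaderboard are computed.

```python
import random
from statistics import mean, stdev

def bootstrap_leaderboard(prompts, run_tournament, n_iter=10, seed=0):
    """Resample prompts with replacement and aggregate Elo per model."""
    rng = random.Random(seed)
    per_model = {}
    for _ in range(n_iter):
        # Draw as many prompts as the original set holds, with replacement.
        sample = rng.choices(prompts, k=len(prompts))
        for model, elo in run_tournament(sample).items():
            per_model.setdefault(model, []).append(elo)
    # Mean and sample standard deviation across bootstrap iterations.
    return {m: (mean(v), stdev(v)) for m, v in per_model.items()}

def toy_tournament(sample):
    """Toy stand-in: ratings depend on how many distinct prompts were drawn."""
    distinct = len(set(sample))
    return {"strong": 1200.0 + distinct, "weak": 1100.0 - distinct}

board = bootstrap_leaderboard(list(range(32)), toy_tournament)
for name, (avg, spread) in sorted(board.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}\t{avg:.1f} ± {spread:.1f}")
```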

📈 Option B: Bootstrapped Tournament Elo Leaderboard

Rank	Model Name	Avg Elo (Across Bootstraps)
1	moonshotai/Kimi-K2.6	1294.3 ± 16.6
2	grok-4.20-beta	1290.3 ± 12.5
3	moonshotai/Kimi-K2-Instruct	1283.8 ± 25.5
4	claude-opus-4-7	1278.8 ± 18.7
5	o3	1275.8 ± 16.9
6	moonshotai/Kimi-K2.5	1272.9 ± 29.0
7	moonshotai/Kimi-K2-Thinking	1266.1 ± 22.1
8	openrouter/sherlock-dash-alpha	1256.3 ± 16.2
9	claude-opus-4-6	1255.5 ± 14.2
10	zai-org/GLM-5.1	1250.6 ± 17.3
11	deepseek-ai/DeepSeek-V3.2	1247.7 ± 14.0
12	grok-4.1-fast	1246.1 ± 14.8
13	NousResearch/Hermes-4-405B	1245.4 ± 20.4
14	gemini-3.1-pro-preview	1245.2 ± 8.5
15	Nanbeige/Nanbeige4-3B-Thinking-2511	1244.9 ± 17.5
16	deepseek-ai/DeepSeek-V4-Pro	1243.1 ± 15.5
17	deepseek-ai/DeepSeek-V4-Flash	1237.2 ± 13.8
18	gemini-3-pro-preview	1236.9 ± 12.8
19	deepseek-ai/DeepSeek-R1	1235.3 ± 20.3
20	google/gemma-4-31B-it	1234.5 ± 12.7
21	deepseek-ai/DeepSeek-V3.1	1233.6 ± 17.0
22	claude-opus-4-5-20251101	1232.4 ± 7.0
23	claude-opus-4	1230.9 ± 19.8
24	zai-org/GLM-4.6	1229.4 ± 13.3
25	deepseek-ai/DeepSeek-R1-0528	1228.8 ± 12.1
26	zai-org/GLM-4.5	1228.5 ± 11.5
27	gemini-2.5-pro-preview-06-05	1227.7 ± 9.7
28	Qwen/Qwen3.5-397B-A17B	1227.6 ± 17.1
29	mistral-medium-3.1	1226.4 ± 10.9
30	hunter-alpha	1224.2 ± 11.7
31	qwen/qwen3-235b-a22b:thinking	1224.0 ± 15.0
32	google/gemma-4-26B-A4B-it	1223.0 ± 9.9
33	zai-org/GLM-5	1221.1 ± 9.8
34	openrouter/pony-alpha	1220.9 ± 9.9
35	gpt-5.4	1217.6 ± 13.6
36	zai-org/GLM-4.7	1216.5 ± 9.1
37	quasar-alpha	1216.5 ± 9.4
38	claude-sonnet-4-6	1215.6 ± 16.0
39	claude-3-5-sonnet-20241022	1213.8 ± 11.3
40	chatgpt-4o-latest-2025-03-27	1213.8 ± 10.7
41	RekaAI/reka-flash-3	1213.7 ± 10.3
42	claude-sonnet-4.5	1213.6 ± 8.5
43	gpt-5-mini-2025-08-07	1210.4 ± 10.6
44	optimus-alpha	1208.3 ± 7.5
45	gpt-5.3-chat	1207.1 ± 12.0
46	claude-sonnet-4	1207.0 ± 12.5
47	gpt-5.5	1206.7 ± 10.9
48	mistral-small-creative	1206.6 ± 12.6
49	deepseek-ai/DeepSeek-V3-0324	1206.0 ± 11.2
50	openrouter/horizon-alpha	1205.8 ± 5.5
51	gpt-5-2025-08-07	1205.5 ± 6.8
52	minimax/minimax-m2.5	1205.2 ± 8.2
53	qwen/qwq-32b	1203.0 ± 14.3
54	mistralai/Mistral-Large-3-675B-Instruct-2512	1202.2 ± 11.3
55	ifable/gemma-2-Ifable-9B	1201.5 ± 12.8
56	gemini-2.5-pro-exp-03-25	1201.1 ± 11.9
57	gpt-5.4-mini	1200.9 ± 17.1
58	openrouter/horizon-beta	1200.4 ± 12.4
59	grok-3-beta	1199.0 ± 15.6
60	anthropic/claude-3.5-haiku-20241022	1196.4 ± 11.2
61	deepseek-ai/DeepSeek-V3.2-Speciale	1195.9 ± 16.0
62	zai-org/GLM-4.7-Flash	1195.6 ± 12.2
63	claude-3-7-sonnet-20250219	1193.9 ± 6.3
64	gpt-5.2	1192.2 ± 10.5
65	mistralai/Mistral-Small-3.2-24B-Instruct-2506	1191.4 ± 10.4
66	chatgpt-4o-latest-2025-01-29	1190.0 ± 11.9
67	google/gemma-3-27b-it	1186.6 ± 6.9
68	CohereForAI/c4ai-command-a-03-2025	1185.3 ± 10.3
69	gpt-5-nano-2025-08-07	1184.2 ± 12.0
70	gpt-4.5-preview	1183.5 ± 12.2
71	gpt-4.1-mini	1180.8 ± 8.7
72	openai/gpt-oss-120b	1180.1 ± 14.1
73	google/gemma-3-12b-it	1178.8 ± 13.0
74	allura-org/Gemma-3-Glitter-12B	1178.0 ± 11.0
75	gemini-2.5-flash-preview	1175.9 ± 9.2
76	google/gemma-3-4b-it	1175.5 ± 9.4
77	sam-paech/Darkest-muse-v1	1172.7 ± 6.1
78	gemini-2.0-flash-001	1170.7 ± 5.8
79	gpt-4.1-nano	1166.8 ± 5.9
80	liquid/lfm-7b	1164.0 ± 11.2
81	THUDM/GLM-4-32B-0414	1161.6 ± 13.0
82	mistralai/mistral-large-2411	1160.2 ± 10.0
83	gpt-4o-mini	1155.2 ± 16.1
84	mistralai/Pixtral-Large-Instruct-2411	1152.6 ± 12.0
85	google/gemma-2-9b-it	1152.6 ± 11.6
86	openai/gpt-4-0314	1152.0 ± 8.2
87	meta-llama/llama-3.1-70b-instruct	1152.0 ± 11.5
88	meta-llama/llama-3.1-405b-instruct	1151.8 ± 10.5
89	mistralai/Mistral-Nemo-Instruct-2407	1150.9 ± 9.9
90	openai/gpt-oss-20b	1146.0 ± 19.8
91	ToastyPigeon/Gemma-3-Starshine-12B	1140.6 ± 13.2
92	meta-llama/Llama-4-Maverick-17B-128E-Instruct	1140.2 ± 12.3
93	anthropic/claude-3-haiku	1140.1 ± 16.4
94	meta-llama/llama-3.1-8b-instruct	1136.3 ± 16.9
95	openai/gpt-3.5-turbo-0613	1136.2 ± 10.4
96	openrouter/cypher-alpha	1132.3 ± 17.1
97	mistralai/mistral-small-3.1-24b-instruct-2503	1128.6 ± 12.7
98	mistralai/Mistral-Small-24B-Instruct-2501	1110.0 ± 16.1
99	meta-llama/Llama-4-Scout-17B-16E-Instruct	1107.8 ± 21.0
100	meta-llama/llama-3.2-3b-instruct	1095.4 ± 24.9
101	meta-llama/llama-3.2-1b-instruct	1042.4 ± 17.4

5. Limitations, Known Biases, & Future Work

Domain Specificity: This model is explicitly fine-tuned on instruction following and creative writing matrices. Its discriminative accuracy is expected to degrade if deployed for rigorous fact-checking, mathematical, or strict code-generation evaluations.
Residual Length Correlation: Although the tournament data show only a small effect, LLM judges tend to favor longer responses. Unbounded production usage without truncation or length constraints may reintroduce some verbosity bias.
Input Sensitivity: The reported performance relies on evaluating the continuous token logprobs for A and B. Relying on plain text generation (looking only at the output token) instead of the logprobs is expected to yield weaker results and higher variance.
Future Data Scaling: This V1 release was trained over 3 epochs on a single B200 GPU. Scaling the training pipeline with more extensive and diverse training data is already planned for the future and is anticipated to yield significant improvements in discriminative performance.
