Windy-Qwen3.5-27B-FP8
FP8-dynamic quantization of C3DS/Windy-Qwen3.5-27B — the LoRA-merged Qwen3.5-27B fine-tuned for three-level wind-energy opposition classification (detection / frames / claims).
This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. The merged FP8 checkpoint loads directly with transformers, vLLM (≥ 0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.
- ~27 GB on disk (vs ~54 GB for the BF16 sibling) — fits comfortably on a single A100/H100/H200.
- Slightly outperforms BF16 on this task — FP8 dynamic quantization matched or beat the BF16 model on every reported metric (see Results below).
- Beats both frontier APIs (Claude Opus 4.7, GPT-5.5) on 5 of the 7 reported metrics, including detection F1 and three of the four frame/claim samples-F1 views.
This model accompanies a forthcoming paper from the C3DS group on wind-energy opposition discourse — research-preview status until that paper is out.
Results
Evaluated on the held-out wind-opposition test set (773 rows, 436 opposition-positive):
Detection (binary)
| Metric | Windy-27B (BF16) | Windy-27B (FP8 — this model) | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| Precision | 0.871 | 0.877 | 0.896 | 0.927 |
| Recall | 0.917 | 0.920 | 0.890 | 0.846 |
| F1 | 0.894 | 0.898 | 0.893 | 0.885 |
Samples F1 (multi-label frame / claim accuracy)
| View | Windy-27B (BF16) | Windy-27B (FP8) | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| Frames — all rows | 0.781 | 0.787 | 0.791 | 0.792 |
| Frames — opposition only | 0.747 | 0.751 | 0.734 | 0.697 |
| Claims — all rows | 0.741 | 0.755 | 0.754 | 0.745 |
| Claims — opposition only | 0.675 | 0.694 | 0.667 | 0.614 |
Highlights
- FP8 ≥ BF16 on every metric — quantization is essentially free at this scale and recipe.
- Best detection F1 in the lineup (0.898), edging out Opus 4.7 (0.893) and GPT-5.5 (0.885).
- Wins three of the four frame/claim samples-F1 views against both frontier APIs, including the harder "opposition-only" views; on Frames (all rows) it trails them by at most 0.005.
- Zero parse failures on 773 test items.
- A 27 GB FP8 model deployable on a single A100 outperforms ~$20-per-million-token frontier APIs on this task.
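The head-to-head counts can be re-derived from the two tables above. The sketch below only transcribes the reported scores and counts, per metric, whether the FP8 model is at least as good as BF16 and whether it strictly beats both API baselines; the variable names are illustrative.

```python
# Reported scores, transcribed from the tables above.
# Column order: BF16, FP8 (this model), Claude Opus 4.7, GPT-5.5
scores = {
    "Detection precision":     (0.871, 0.877, 0.896, 0.927),
    "Detection recall":        (0.917, 0.920, 0.890, 0.846),
    "Detection F1":            (0.894, 0.898, 0.893, 0.885),
    "Frames, all rows":        (0.781, 0.787, 0.791, 0.792),
    "Frames, opposition only": (0.747, 0.751, 0.734, 0.697),
    "Claims, all rows":        (0.741, 0.755, 0.754, 0.745),
    "Claims, opposition only": (0.675, 0.694, 0.667, 0.614),
}

# Metrics where FP8 matches or exceeds the BF16 sibling.
fp8_at_least_bf16 = [m for m, (bf16, fp8, _, _) in scores.items() if fp8 >= bf16]
# Metrics where FP8 is strictly better than the stronger of the two APIs.
fp8_beats_both_apis = [
    m for m, (_, fp8, opus, gpt) in scores.items() if fp8 > max(opus, gpt)
]

print(f"FP8 >= BF16 on {len(fp8_at_least_bf16)} of {len(scores)} metrics")
print(f"FP8 beats both APIs on {len(fp8_beats_both_apis)} of {len(scores)} metrics")
```

The two metrics on which an API baseline leads are detection precision and Frames (all rows).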
Usage
With vLLM
vllm serve C3DS/Windy-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name Windy-Qwen3.5-27B
--kv-cache-dtype fp8 is optional but doubles effective KV cache capacity on Hopper hardware. The FP8 weights are detected automatically — do not pass --dtype bfloat16 (it would override the FP8 weights and undo the savings).
The system prompt is identical to the BF16 sibling; we bundle it as wind_prompts.json for self-contained loading:
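import json
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Load the bundled system prompt from the model repo.
with open(hf_hub_download("C3DS/Windy-Qwen3.5-27B-FP8", "wind_prompts.json")) as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
def classify(text):
    """Classify one document; returns the raw completion (reasoning + YAML)."""
    resp = client.chat.completions.create(
        model="Windy-Qwen3.5-27B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user", "content": text},
        ],
        temperature=0,
        # Prompt tokens plus max_tokens must fit within the server's --max-model-len.
        max_tokens=4000,
    )
    return resp.choices[0].message.content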
The model produces a reasoning trace inside <think>…</think> followed by a YAML block (opposition_detected, frames, claims). To parse: take the content after </think> and read the YAML.
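The parsing step described above might look like the following sketch. The helper names `split_reasoning` and `parse_answer` are hypothetical, PyYAML is assumed to be installed, and the text after `</think>` is assumed to be bare YAML with no surrounding fence.

```python
def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer_text)."""
    head, sep, tail = raw.rpartition("</think>")
    if not sep:  # no closing tag: treat the whole output as the answer
        return "", raw.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


def parse_answer(raw: str) -> dict:
    """Parse the YAML block that follows the reasoning trace."""
    import yaml  # PyYAML, assumed installed (pip install pyyaml)

    _, answer = split_reasoning(raw)
    # Assumption: the answer is bare YAML with the three documented keys.
    return yaml.safe_load(answer)
```

With the `classify` helper above, `parse_answer(classify(text))` would return a dictionary keyed by `opposition_detected`, `frames`, and `claims`, provided the completion was not truncated by `max_tokens`.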
Multimodal — image + text
The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible
image_url content part, and this fine-tune preserves that capability — pass
the wind system prompt alongside an image (e.g. a screenshot of a tweet, news
headline, or protest sign) and the model will run the same three-level
detection / frames / claims classification on the visual input.
Serve vLLM with multimodal flags enabled:
vllm serve C3DS/Windy-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --served-model-name Windy-Qwen3.5-27B
import base64
import json
import mimetypes
from pathlib import Path
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Load the bundled system prompt from the model repo.
with open(hf_hub_download("C3DS/Windy-Qwen3.5-27B-FP8", "wind_prompts.json")) as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
def image_part(path):
    """Encode a local image as an image_url content part (base64 data URL)."""
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="Windy-Qwen3.5-27B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the wind-opposition framing depicted."},
            image_part("screenshot.png"),
            {"type": "text", "text": "### Caption:\n<optional caption>"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)
Training & Quantization
Fine-tuning (inherited from the BF16 sibling)
- Base model: Qwen/Qwen3.5-27B
- Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
- Training data: RECoT chat messages distilled from a teacher LLM (claude-opus-4-7) over an expert-annotated wind-opposition corpus, with synthetic positive / near-miss negative augmentation. Final training set: 726 rows after teacher second-guessing filter; 86-row held-out eval mirror.
- Framework: Unsloth + TRL SFTTrainer
- Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=8192, adamw_8bit, bf16
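To put the hyperparameters above in perspective, the sketch below works out the effective batch size, the approximate number of optimizer steps, and the LoRA scaling factor. It assumes a single GPU and that a trailing partial accumulation window counts as a step; neither is stated in this card.

```python
import math

# Values reported in the list above.
train_rows = 726
epochs = 3
per_device_batch = 1
grad_accum = 8
lora_rank, lora_alpha = 16, 16

# Assumption: one GPU, so the effective batch is batch size * accumulation steps.
effective_batch = per_device_batch * grad_accum
# Assumption: a trailing partial accumulation window still counts as a step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
# Standard LoRA scaling factor applied to the low-rank update (alpha / r).
lora_scale = lora_alpha / lora_rank

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
```

Under these assumptions the run is a few hundred optimizer steps, so the 10 warmup steps cover under 4% of training.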
FP8 quantization
- Scheme: fp8_e4m3 dynamic per-channel quantization (weights only). Activations stay in BF16; per-channel scales are computed at quantization time, no calibration data needed.
- Targets: linear layers in transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj); lm_head left in BF16.
- Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
- Validation: the quantized model was re-evaluated on the wind test set (see Results above). FP8 matches or beats BF16 within rounding.
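As a rough illustration of why no calibration data is needed, a per-channel scale depends only on the weights themselves. The plain-Python sketch below computes one scale per output channel and rescales each row into the FP8 range; it is not the llmcompressor implementation, it skips the final rounding to the FP8 value grid, and the toy matrix is made up.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the float8 e4m3fn format


def per_channel_scales(weight):
    """One scale per output channel (row), computed from the weights alone."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(weight, scales):
    """Divide each row by its scale so values land in [-448, 448].

    A real quantizer would then round each value to the nearest FP8 number;
    dequantization multiplies by the same scale again.
    """
    return [[v / s for v in row] for row, s in zip(weight, scales)]


# Toy 2x3 weight matrix (made-up numbers).
w = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
s = per_channel_scales(w)
scaled = scale_rows(w, s)
```

Because each row gets its own scale, a channel with small weights (the second row here) uses the full FP8 range instead of being crushed by a larger channel's maximum.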
Hardware notes
- Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Tested on A100, H100, H200. Ampere uses up-cast FP8→BF16 matmul; Hopper uses native FP8 tensor cores.
- Memory: ~27 GB for weights, leaving 50+ GB on a single A100 80GB for KV cache.
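The memory figures above follow from simple byte counting. This sketch assumes roughly 27 billion parameters (read off the model name), decimal gigabytes, and a nominal 80 GB card; it ignores the BF16 lm_head, the per-channel scales, and runtime overhead.

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 27e9  # assumption: parameter count taken from the model name

bf16_gb = weight_gb(N_PARAMS, 2)  # 2 bytes per BF16 weight
fp8_gb = weight_gb(N_PARAMS, 1)   # 1 byte per FP8 weight
kv_budget_gb = 80 - fp8_gb        # nominal 80 GB card, before runtime overhead

print(bf16_gb, fp8_gb, kv_budget_gb)
```

In practice the usable KV-cache budget is somewhat smaller, since vLLM only claims a fraction of GPU memory (its `--gpu-memory-utilization` setting) and activations need space too.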
Limitations
- Forthcoming paper. The methodology and dataset will be published in a future C3DS paper.
- Domain-specific. Trained on wind-opposition discourse from social media and news.
- Thinking tokens. Training used enable_thinking=True. Parse output after </think> or disable thinking at inference.
- Detection precision. Recall (0.920) outpaces precision (0.877); high-precision use cases may need additional thresholding.
- Quantization is weight-only. Activations are BF16.
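For the thinking-tokens point above, one way to switch thinking off per request is sketched here. It assumes this model's chat template honors an `enable_thinking` flag the way Qwen3 templates do, and that the server is vLLM, which accepts `chat_template_kwargs` through the OpenAI client's `extra_body`. The helper name `build_request` is hypothetical.

```python
def build_request(text, system_prompt, thinking=True):
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "Windy-Qwen3.5-27B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
        "max_tokens": 4000,
    }
    if not thinking:
        # Assumption: the chat template honors enable_thinking, as in Qwen3.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs
```

A call would then look like `client.chat.completions.create(**build_request(text, slim_system_instruction, thinking=False))`. Since training used thinking mode, the reported metrics should not be assumed to carry over to requests made with thinking disabled.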
Related models
- C3DS/Windy-Qwen3.5-27B: BF16 sibling (this model is its FP8 quantization).
- C3DS/CARDS-Wind-Qwen3.6-27B-FP8: joint single-backbone model trained on CARDS + Wind; one model handles both tasks.
License
Apache 2.0, inherited from Qwen3.5-27B.