Windy-Qwen3.5-27B-FP8

FP8-dynamic quantization of C3DS/Windy-Qwen3.5-27B — the LoRA-merged Qwen3.5-27B fine-tuned for three-level wind-energy opposition classification (detection / frames / claims).

This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. The merged FP8 checkpoint loads directly with transformers, vLLM (≥ 0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.

  • ~27 GB on disk (vs ~54 GB for the BF16 sibling), so it fits comfortably on a single 80 GB A100 or H100, or on an H200.
  • Slightly outperforms BF16 on this task — FP8 dynamic quantization matched or beat the BF16 model on every reported metric (see Results below).
  • Beats both frontier APIs (Claude Opus 4.7, GPT-5.5) on 5 of the 7 reported metrics, including detection F1 and three of the four frame/claim samples-F1 metrics. The APIs lead on detection precision and on frames over all rows.

This model accompanies a forthcoming paper from the C3DS group on wind-energy opposition discourse — research-preview status until that paper is out.

Results

Evaluated on the held-out wind-opposition test set (773 rows, 436 opposition-positive):

Detection (binary)

| Metric | Windy-27B (BF16) | Windy-27B (FP8, this model) | Claude Opus 4.7 | GPT-5.5 |
| --- | --- | --- | --- | --- |
| Precision | 0.871 | 0.877 | 0.896 | 0.927 |
| Recall | 0.917 | 0.920 | 0.890 | 0.846 |
| F1 | 0.894 | 0.898 | 0.893 | 0.885 |
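
The F1 row is the harmonic mean of the precision and recall rows, so it can be recomputed as a sanity check. This is a minimal sketch assuming the standard binary F1 definition; the helper name `f1` is illustrative. The inputs are the rounded table values, so agreement is only expected to about three decimals.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard binary F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the detection table.
detection = {
    "Windy-27B (FP8)": (0.877, 0.920),
    "Claude Opus 4.7": (0.896, 0.890),
    "GPT-5.5": (0.927, 0.846),
}

# Recompute F1 for each system and round to the table's precision.
recomputed = {name: round(f1(p, r), 3) for name, (p, r) in detection.items()}
for name, score in recomputed.items():
    print(f"{name}: F1 = {score}")
```

The BF16 column is left out because its rounded inputs recompute to 0.893 rather than the reported 0.894. A gap of that size is consistent with rounding of the precision and recall values.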

Samples F1 (multi-label frame / claim accuracy)

| View | Windy-27B (BF16) | Windy-27B (FP8) | Claude Opus 4.7 | GPT-5.5 |
| --- | --- | --- | --- | --- |
| Frames (all rows) | 0.781 | 0.787 | 0.791 | 0.792 |
| Frames (opposition only) | 0.747 | 0.751 | 0.734 | 0.697 |
| Claims (all rows) | 0.741 | 0.755 | 0.754 | 0.745 |
| Claims (opposition only) | 0.675 | 0.694 | 0.667 | 0.614 |
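
Samples F1 scores each row's predicted label set against its gold label set, then averages over rows. The sketch below implements that in plain Python. The label names are invented for illustration. Scoring a row with no gold and no predicted labels as 1.0 is an assumption, because this card does not describe how the evaluation code handles empty rows.

```python
from typing import Sequence, Set


def row_f1(gold: Set[str], pred: Set[str]) -> float:
    """F1 between one row's gold and predicted label sets."""
    if not gold and not pred:
        return 1.0  # assumption: empty vs. empty counts as a perfect match
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def samples_f1(gold_rows: Sequence[Set[str]], pred_rows: Sequence[Set[str]]) -> float:
    """Mean of per-row F1 over all rows."""
    if len(gold_rows) != len(pred_rows):
        raise ValueError("gold and predicted rows must align one-to-one")
    if not gold_rows:
        return 0.0
    return sum(row_f1(g, p) for g, p in zip(gold_rows, pred_rows)) / len(gold_rows)


# Hypothetical frame labels for three rows; the last row has no opposition frames.
gold = [{"noise", "wildlife"}, {"property_value"}, set()]
pred = [{"noise"}, {"property_value", "health"}, set()]
score = samples_f1(gold, pred)
print(f"samples F1 = {score:.3f}")
```

scikit-learn offers the same averaging via `f1_score(..., average="samples")` on binarized labels. Its default `zero_division` setting scores empty rows as 0, so the choice made for empty rows matters most for the "all rows" views.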

Highlights

  • FP8 ≥ BF16 on every metric — quantization is essentially free at this scale and recipe.
  • Best detection F1 in the lineup (0.898), edging out Opus 4.7 (0.893) and GPT-5.5 (0.885).
  • Wins three of the four frame/claim samples-F1 metrics vs frontier APIs, including both of the harder "opposition-only" views; the APIs lead narrowly on frames over all rows (0.791 and 0.792 vs 0.787).
  • Zero parse failures on 773 test items.
  • A 27 GB FP8 model deployable on a single A100 outperforms ~$20-per-million-token frontier APIs on five of the seven metrics reported for this task.

Usage

With vLLM

vllm serve C3DS/Windy-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name Windy-Qwen3.5-27B

--kv-cache-dtype fp8 is optional but doubles effective KV cache capacity on Hopper hardware. The FP8 weights are detected automatically — do not pass --dtype bfloat16 (it would override the FP8 weights and undo the savings).

The system prompt is identical to the BF16 sibling; we bundle it as wind_prompts.json for self-contained loading:

import json
from huggingface_hub import hf_hub_download
from openai import OpenAI

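# Download the bundled prompt file and close the handle once it is read.
with open(hf_hub_download("C3DS/Windy-Qwen3.5-27B-FP8", "wind_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)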
slim_system_instruction = prompts["slim_system_instruction"]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def classify(text):
    resp = client.chat.completions.create(
        model="Windy-Qwen3.5-27B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user",   "content": text},
        ],
        temperature=0,
        max_tokens=4000,
    )
    return resp.choices[0].message.content

The model produces a reasoning trace inside <think>…</think> followed by a YAML block (opposition_detected, frames, claims). To parse: take the content after </think> and read the YAML.
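
The parsing step described above can be sketched as follows. The helper names are illustrative. The code assumes the answer after `</think>` is either bare YAML or YAML wrapped in a Markdown code fence (the fence handling is a guess at a possible output shape, not documented behavior). `parse_prediction` needs the third-party PyYAML package.

```python
import re

THINK_END = "</think>"
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe
FENCED = re.compile(FENCE + r"(?:yaml|yml)?[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer), split at the last </think> tag."""
    head, sep, tail = raw.rpartition(THINK_END)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", raw.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def strip_code_fence(answer: str) -> str:
    """Remove a surrounding code fence if present (assumed output shape)."""
    text = answer.strip()
    match = FENCED.fullmatch(text)
    return match.group(1).strip() if match else text


def parse_prediction(raw: str) -> dict:
    """Parse the YAML answer into a dict; requires PyYAML."""
    import yaml  # third-party: pip install pyyaml

    _, answer = split_reasoning(raw)
    return yaml.safe_load(strip_code_fence(answer))


# Hypothetical model output, for illustration only.
raw = "<think>Mentions turbine noise.</think>\nopposition_detected: true\nframes:\n  - noise\n"
reasoning, answer = split_reasoning(raw)
print(answer)
```

Splitting at the last `</think>` keeps the parser working even if the reasoning trace itself quotes the tag.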

Multimodal — image + text

The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible image_url content part, and this fine-tune preserves that capability. Pass the wind system prompt alongside an image (e.g. a screenshot of a tweet, news headline, or protest sign) and the model runs the same three-level detection / frames / claims classification on the visual input.

Serve vLLM with multimodal flags enabled:

vllm serve C3DS/Windy-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --served-model-name Windy-Qwen3.5-27B

Then send the image as a base64 data URL through the OpenAI-compatible client:

import base64, json, mimetypes
from pathlib import Path
from huggingface_hub import hf_hub_download
from openai import OpenAI

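# Download the bundled prompt file and close the handle once it is read.
with open(hf_hub_download("C3DS/Windy-Qwen3.5-27B-FP8", "wind_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)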
slim_system_instruction = prompts["slim_system_instruction"]

def image_part(path):
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="Windy-Qwen3.5-27B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the wind-opposition framing depicted."},
            image_part("screenshot.png"),
            {"type": "text", "text": "### Caption:\n<optional caption>"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)

Training & Quantization

Fine-tuning (inherited from the BF16 sibling)

  • Base model: Qwen/Qwen3.5-27B
  • Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
  • Training data: RECoT chat messages distilled from a teacher LLM (claude-opus-4-7) over an expert-annotated wind-opposition corpus, with synthetic positive / near-miss negative augmentation. Final training set: 726 rows after teacher second-guessing filter; 86-row held-out eval mirror.
  • Framework: Unsloth + TRL SFTTrainer
  • Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=8192, adamw_8bit, bf16
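
Two quantities follow from the settings above: the LoRA scaling factor α / r applied when the adapter is merged, and the effective batch size. The sketch below works through that arithmetic. It assumes standard LoRA scaling (not rank-stabilized), a single training device, no sequence packing, and that a final partial accumulation window still counts as an optimizer step. None of these are stated in this card.

```python
import math

# Values taken from the fine-tuning settings listed above.
lora_rank, lora_alpha = 16, 16
train_rows = 726
epochs = 3
per_device_batch = 1
grad_accum = 8
warmup_steps = 10

num_devices = 1  # assumption: the card does not state the device count

# Standard LoRA merge: W' = W + (alpha / r) * B @ A
lora_scale = lora_alpha / lora_rank

# Rows consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices

# Approximate optimizer steps; the exact count depends on the trainer version.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"LoRA scale: {lora_scale}")
print(f"Effective batch size: {effective_batch}")
print(f"About {total_steps} optimizer steps, {warmup_fraction:.1%} of them warmup")
```

With α equal to the rank, the merged update is added to the base weights unscaled.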

FP8 quantization

  • Scheme: fp8_e4m3 dynamic per-channel quantization (weights only). Activations stay in BF16; per-channel scales are computed at quantization time, no calibration data needed.
  • Targets: linear layers in transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj); lm_head left in BF16.
  • Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
  • Validation: the quantized model was re-evaluated on the wind test set (see Results above). FP8 matches or beats BF16 on every reported metric, by margins of 0.003 to 0.019.
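
To illustrate per-channel scaling, the sketch below computes one scale per output channel (row) of a small weight matrix. Each scale maps the row's largest magnitude to 448, the largest finite value in fp8_e4m3. This is a plain-Python illustration of symmetric absmax scaling, not the llmcompressor implementation, and it skips the actual rounding to the FP8 grid. The matrix values are invented.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the fp8_e4m3 format


def per_channel_scales(weight):
    """One symmetric scale per row: scale = max|w| / E4M3_MAX."""
    scales = []
    for row in weight:
        absmax = max(abs(w) for w in row)
        # An all-zero row gets a neutral scale to avoid dividing by zero.
        scales.append(absmax / E4M3_MAX if absmax > 0 else 1.0)
    return scales


def scale_rows(weight, scales):
    """Divide each row by its scale so values fit the FP8 range before rounding."""
    return [[w / s for w in row] for row, s in zip(weight, scales)]


def unscale_rows(scaled, scales):
    """Inverse of scale_rows: multiply each row back by its scale."""
    return [[v * s for v in row] for row, s in zip(scaled, scales)]


# Invented 2 x 3 weight matrix whose rows have very different magnitudes.
weight = [
    [0.02, -0.50, 0.10],
    [3.00, 0.75, -1.50],
]
scales = per_channel_scales(weight)
scaled = scale_rows(weight, scales)
restored = unscale_rows(scaled, scales)
print(scales)
```

Because each row gets its own scale, a row of small weights is not crushed toward zero by a large value elsewhere in the matrix.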

Hardware notes

  • Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Tested on A100, H100, H200. Ampere uses up-cast FP8→BF16 matmul; Hopper uses native FP8 tensor cores.
  • Memory: ~27 GB for weights, leaving 50+ GB on a single A100 80GB for KV cache.
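
The memory figures above follow from bytes per parameter. This is a back-of-the-envelope sketch using decimal gigabytes. It ignores the BF16 lm_head, the quantization scales, and whatever the serving engine reserves for activations, so real headroom will be somewhat lower.

```python
def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal GB."""
    return params * bytes_per_param / 1e9


PARAMS = 27e9       # parameter count from the model name
A100_GB = 80        # single A100 80GB, as in the hardware note above

fp8_gb = weight_gb(PARAMS, 1)    # 8-bit weights: 1 byte per parameter
bf16_gb = weight_gb(PARAMS, 2)   # 16-bit weights: 2 bytes per parameter
headroom_gb = A100_GB - fp8_gb   # rough space left for KV cache and activations

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Headroom on one 80 GB GPU with FP8: ~{headroom_gb:.0f} GB")
```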

Limitations

  • Forthcoming paper. The methodology and dataset will be published in a future C3DS paper.
  • Domain-specific. Trained on wind-opposition discourse from social media and news.
  • Thinking tokens. Training used enable_thinking=True. Parse output after </think> or disable thinking at inference.
  • Detection precision. Recall (0.920) outpaces precision (0.877); high-precision use cases may need additional thresholding.
  • Quantization is weight-only. Activations are BF16.
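
On the thinking-tokens point above, one way to switch off the reasoning trace per request on vLLM's OpenAI-compatible server is to pass `chat_template_kwargs`. The sketch assumes this model's chat template honors an `enable_thinking` flag as Qwen3 templates do, which this card does not confirm. `build_request` is an illustrative helper. This card reports no results with thinking disabled.

```python
def build_request(system_prompt: str, text: str, thinking: bool = True) -> dict:
    """Build keyword arguments for client.chat.completions.create."""
    return {
        "model": "Windy-Qwen3.5-27B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
        "max_tokens": 4000,
        # extra_body is forwarded verbatim to the server by the OpenAI client;
        # whether the template reads enable_thinking is an assumption.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


# Placeholder prompt and text, for illustration only.
request = build_request("<system prompt>", "Example post about a wind farm.", thinking=False)
print(request["extra_body"])

# With a running server:
# resp = client.chat.completions.create(**request)
```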

Related models

License

Apache 2.0, inherited from Qwen3.5-27B.
