CARDS-Qwen3.5-27B-FP8
FP8-dynamic quantization of C3DS/CARDS-Qwen3.5-27B — the LoRA-merged Qwen3.5-27B fine-tuned on the CARDS taxonomy from Coan et al. (2025) for climate-contrarian-claim classification.
This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. The merged FP8 checkpoint loads directly with transformers, vLLM (≥0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.
- ~27 GB on disk (vs ~54 GB for the BF16 sibling) — fits comfortably on a single A100/H100/H200, leaves more headroom for KV cache.
- No accuracy loss observed vs the BF16 model on the held-out CARDS test set (within rounding; see Results below).
Results
Evaluated on the held-out CARDS test set (1,436 samples, Level 1, min_support ≥ 3):
| Metric | Qwen3.5-27B (base) | Qwen3.5-27B FT (BF16) | Qwen3.5-27B FT (FP8 — this model) | Claude Opus 4.6 | Claude Opus 4.7 |
|---|---|---|---|---|---|
| Samples F1 | 0.844 | 0.884 | 0.886 | 0.893 | 0.882 |
| Macro F1 | 0.710 | 0.766 | 0.770 | 0.751 | 0.771 |
| Micro F1 | 0.854 | 0.877 | 0.879 | 0.881 | 0.874 |
| Precision | 0.870 | 0.879 | 0.881 | 0.863 | 0.868 |
| Recall | 0.838 | 0.874 | 0.876 | 0.900 | 0.880 |
| Parse failures | 86 / 1436 | 0 / 1436 | 0 / 1436 | 0 / 1436 | 0 / 1436 |
- Matches or slightly exceeds the BF16 model on every reported metric — FP8 dynamic quantization does not degrade accuracy on this task.
- Same parse reliability as BF16 (zero failures on 1,436 test items).
- Second-highest Macro F1 at L1 among the five compared models (0.770), within 0.001 of Opus 4.7 (0.771).
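The FP8-versus-BF16 comparison can be checked directly from the table. The snippet below is a small sketch that copies the two fine-tuned columns by hand and prints the per-metric difference; the dictionary names are illustrative and not part of the project's evaluation code.

```python
# Scores copied from the Results table (BF16 fine-tune vs. this FP8 model).
bf16 = {"samples_f1": 0.884, "macro_f1": 0.766, "micro_f1": 0.877,
        "precision": 0.879, "recall": 0.874}
fp8 = {"samples_f1": 0.886, "macro_f1": 0.770, "micro_f1": 0.879,
       "precision": 0.881, "recall": 0.876}

# Per-metric difference, rounded to the table's three decimals.
deltas = {name: round(fp8[name] - bf16[name], 3) for name in bf16}

for name, delta in deltas.items():
    print(f"{name:<11} {delta:+.3f}")
```

Every difference is between +0.002 and +0.004, which is what the "matches or slightly exceeds" statement refers to.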
Usage
With vLLM
vllm serve C3DS/CARDS-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-27B
--kv-cache-dtype fp8 is optional but doubles effective KV cache capacity on Hopper hardware. The FP8 weights are detected automatically — do not pass --dtype bfloat16 (it would override the FP8 weights and undo the savings).
The system prompt and user-message format are identical to the BF16 sibling. We bundle them in this repo as cards_prompts.json for self-contained loading:
import json

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Load the prompts bundled with this repo.
prompts_path = hf_hub_download("C3DS/CARDS-Qwen3.5-27B-FP8", "cards_prompts.json")
with open(prompts_path) as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger = prompts["cot_trigger"]

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")


def classify(text):
    """Return the raw model output (reasoning trace plus YAML block) for one text."""
    resp = client.chat.completions.create(
        model="CARDS-Qwen3.5-27B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        temperature=0,
        max_tokens=4000,
    )
    return resp.choices[0].message.content


print(classify("These are only a few renewable energy technologies at work"))
The model produces a reasoning trace inside <think>…</think> followed by a YAML categories: block listing predicted CARDS codes. To parse: take the content after </think> and read the categories: list.
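The parsing step described above can be sketched with the standard library alone. This is a minimal sketch, not the project's parser: the function name `parse_categories` is hypothetical, the exact YAML layout is assumed to be either a dash list or an inline `[a, b]` list under `categories:`, and the codes in the usage example are placeholders.

```python
def parse_categories(output: str) -> list[str]:
    """Extract the codes listed under `categories:` after the reasoning trace."""
    # Keep only the text after the last </think>; fall back to the whole string.
    answer = output.rsplit("</think>", 1)[-1]

    codes = []
    in_block = False
    for line in answer.splitlines():
        stripped = line.strip()
        if not in_block:
            if stripped.startswith("categories:"):
                in_block = True
                inline = stripped[len("categories:"):].strip()
                # Inline form (assumed possible): categories: [a, b]
                if inline.startswith("[") and inline.endswith("]"):
                    items = inline[1:-1].split(",")
                    return [i.strip().strip("'\"") for i in items if i.strip()]
            continue
        if stripped.startswith("- "):
            # Dash-list item; drop optional quotes around the code.
            codes.append(stripped[2:].strip().strip("'\""))
        elif stripped and not stripped.startswith("```"):
            # Any other non-empty line ends the categories block.
            break
    return codes


# Placeholder output; real codes come from the CARDS taxonomy.
sample = "<think>short reasoning</think>\ncategories:\n  - 1_1\n  - 2_3\n"
print(parse_categories(sample))
```

If the real outputs use richer YAML than this, a proper YAML parser is the safer choice; the sketch only covers the two layouts named above.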
See the project repository for training scripts, quantization recipe, evaluation code, and dataset preparation.
Multimodal — image + text
The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible
image_url content part, and this fine-tune preserves that capability — pass
the system prompt below alongside an image (with or without caption text) and
the model will classify the depicted claim under the CARDS taxonomy.
Serve vLLM with multimodal flags enabled:
vllm serve C3DS/CARDS-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-27B
import base64
import json
import mimetypes
from pathlib import Path

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Load the prompts bundled with this repo.
prompts_path = hf_hub_download("C3DS/CARDS-Qwen3.5-27B-FP8", "cards_prompts.json")
with open(prompts_path) as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger = prompts["cot_trigger"]


def image_part(path):
    """Encode a local image as a base64 data URL in an image_url content part."""
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="CARDS-Qwen3.5-27B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the climate claim it makes."},
            image_part("screenshot.png"),
            {"type": "text", "text": f"### Caption:\n<optional caption>\n\n{cot_trigger}"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)
Training & Quantization
Fine-tuning (inherited from the BF16 sibling)
- Base model: Qwen/Qwen3.5-27B
- Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
- Dataset: C3DS/cards_sft_dataset
- Framework: Unsloth + TRL SFTTrainer
- Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=4096, adamw_8bit, bf16
- Checkpoint selection: best via load_best_model_at_end=True
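For quick reference, the settings above can be collected into plain dictionaries. This is a sketch only: the key names are assumed to follow the usual PEFT `LoraConfig` and Hugging Face `TrainingArguments` conventions, `NUM_GPUS` is an assumption (the GPU count is not stated here), and the project's actual training script may be organised differently.

```python
# LoRA settings from the list above (key names assumed, PEFT-style).
lora = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

# Trainer settings from the list above (key names assumed, TrainingArguments-style).
training = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 10,
    "max_seq_length": 4096,
    "optim": "adamw_8bit",
    "bf16": True,
    "load_best_model_at_end": True,
}

NUM_GPUS = 1  # assumption: the GPU count is not given in this card

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora["lora_alpha"] / lora["r"]

# Sequences contributing to each optimizer step.
effective_batch = (training["per_device_train_batch_size"]
                   * training["gradient_accumulation_steps"]
                   * NUM_GPUS)

print(f"LoRA scale: {lora_scale}, effective batch size: {effective_batch}")
```

With rank and α both at 16 the LoRA update is applied at scale 1.0, and under the single-GPU assumption each optimizer step sees 8 sequences.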
FP8 quantization
- Scheme: fp8_e4m3 dynamic per-channel quantization (weights only). Activations stay in BF16; per-channel scales are computed at quantization time, so no calibration data is needed.
- Targets: all linear layers in transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). lm_head is also a Linear layer but is typically excluded; see the project repo for the exact ignore list.
- Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
- Validation: the quantized model is re-evaluated on the CARDS test set (see Results above). FP8 matches BF16 within rounding.
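To make "per-channel scales" concrete, here is a pure-Python sketch of how one scale per output channel can be derived from the weights alone. It illustrates the general idea and is not llmcompressor's implementation: the helper names are hypothetical, the toy matrix is made up, and the final rounding to the FP8 grid is omitted. The constant 448 is the largest finite value of the E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in fp8 e4m3


def per_channel_scales(weight):
    """One scale per row (output channel): max |w| mapped to the FP8 maximum."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(weight, scales):
    """Divide each row by its scale; the result lies within +/- FP8_E4M3_MAX."""
    return [[v / s for v in row] for row, s in zip(weight, scales)]


# Toy 2x3 weight matrix (made-up values): rows are output channels.
weight = [[0.5, -2.0, 1.0],
          [0.01, 0.04, -0.02]]

scales = per_channel_scales(weight)
scaled = scale_rows(weight, scales)

for row, s in zip(scaled, scales):
    print(f"scale={s:.6g}  max|w/s|={max(abs(v) for v in row):.1f}")
```

Because each row gets its own scale, a channel with small weights (the second row) uses the full FP8 range instead of being squeezed by the largest channel, and nothing beyond the weights themselves is needed to compute the scales.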
Hardware notes
- Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Tested on A100, H100, H200. Ampere (A100) uses up-cast FP8→BF16 matmul; Hopper (H100/H200) uses native FP8 tensor cores for additional throughput.
- Memory: ~27 GB for weights, leaving 50+ GB on a single A100 80GB for KV cache. Comfortable for max_model_len=8192 at max_num_seqs ≥ 256.
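The memory figures follow from simple arithmetic on bytes per parameter. The sketch below assumes a nominal 27 billion parameters (taken from the model name) and ignores quantization scales, any layers left unquantized, and runtime buffers, so the results are approximations rather than measured values.

```python
NOMINAL_PARAMS = 27e9   # assumption: parameter count implied by the model name
GPU_MEMORY_GB = 80      # A100 80GB, as in the note above


def weight_gb(num_params, bytes_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp8_gb = weight_gb(NOMINAL_PARAMS, 1)    # FP8: 1 byte per weight
bf16_gb = weight_gb(NOMINAL_PARAMS, 2)   # BF16: 2 bytes per weight
headroom_gb = GPU_MEMORY_GB - fp8_gb     # left for KV cache and activations

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Headroom on an 80 GB GPU after FP8 weights: ~{headroom_gb:.0f} GB")
```

This reproduces the ~27 GB and ~54 GB figures and the "50+ GB" of headroom quoted for a single 80 GB card.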
Limitations
- Thinking tokens. Training used enable_thinking=True. Either parse output after </think>, or disable thinking at inference via chat_template_kwargs={"enable_thinking": false}. Reserve token budget for the reasoning trace before the final YAML block.
- Quantization is weight-only. Activations are BF16. For more aggressive compression (FP8 activations, or W4A16 GPTQ/AWQ), see project follow-up work.
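When calling a vLLM server through the OpenAI Python client, chat_template_kwargs is passed via the client's extra_body argument. The sketch below builds the request arguments with thinking switched on or off; `build_request` and the placeholder prompt strings are hypothetical, and the message format mirrors the usage example earlier in this card.

```python
def build_request(system_prompt, text, cot_trigger, enable_thinking=True):
    """Build keyword arguments for client.chat.completions.create(...)."""
    kwargs = {
        "model": "CARDS-Qwen3.5-27B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        "temperature": 0,
        # Budget covers the reasoning trace plus the final YAML block.
        "max_tokens": 4000,
    }
    if not enable_thinking:
        # vLLM forwards chat_template_kwargs to the model's chat template.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs


# Placeholder prompts; in practice load them from cards_prompts.json.
request = build_request("<system prompt>", "Example text", "<cot trigger>",
                        enable_thinking=False)
print(request["extra_body"])

# With a running server:
#   resp = client.chat.completions.create(**request)
```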
Citation
@article{cards2pO2025,
title={Large language model reveals an increase in climate contrarian speech in the United States Congress},
author={Travis G. Coan and Ranadheer Malla and Mirjam O. Nanko and William Kattrup and J. Timmons Roberts and John Cook and Constantine Boussalis},
journal={Communications Sustainability},
year={2025}
}
License
Apache 2.0, inherited from Qwen3.5-27B.