CARDS-Qwen3.5-27B-FP8

FP8-dynamic quantization of C3DS/CARDS-Qwen3.5-27B — the LoRA-merged Qwen3.5-27B fine-tuned on the CARDS taxonomy from Coan et al. (2025) for climate-contrarian-claim classification.

This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. The merged FP8 checkpoint loads directly with transformers, vLLM (≥0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.

~27 GB on disk (vs ~54 GB for the BF16 sibling). Fits comfortably on a single A100 80GB, H100, or H200 and leaves more headroom for the KV cache.
No accuracy loss observed vs the BF16 model on the held-out CARDS test set (within rounding; see Results below).

Results

Evaluated on the held-out CARDS test set (1,436 samples, Level 1, min_support ≥ 3):

Metric	Qwen3.5-27B (base)	Qwen3.5-27B FT (BF16)	Qwen3.5-27B FT (FP8 — this model)	Claude Opus 4.6	Claude Opus 4.7
Samples F1	0.844	0.884	0.886	0.893	0.882
Macro F1	0.710	0.766	0.770	0.751	0.771
Micro F1	0.854	0.877	0.879	0.881	0.874
Precision	0.870	0.879	0.881	0.863	0.868
Recall	0.838	0.874	0.876	0.900	0.880
Parse failures	86 / 1436	0 / 1436	0 / 1436	0 / 1436	0 / 1436

Matches or slightly exceeds the BF16 model on every reported metric — FP8 dynamic quantization does not degrade accuracy on this task.
Same parse reliability as BF16 (zero failures on 1,436 test items).
Macro F1 at L1 (0.770) is effectively tied with Claude Opus 4.7 (0.771) for the highest among the five compared models, and ahead of Opus 4.6 (0.751).
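The FP8-versus-BF16 comparison can be checked directly from the table. The sketch below copies the two fine-tuned columns from the Results table and prints the per-metric difference; the variable names are illustrative and not taken from the project's evaluation code.

```python
# Scores copied from the Results table (fine-tuned BF16 and FP8 columns).
bf16 = {"samples_f1": 0.884, "macro_f1": 0.766, "micro_f1": 0.877,
        "precision": 0.879, "recall": 0.874}
fp8 = {"samples_f1": 0.886, "macro_f1": 0.770, "micro_f1": 0.879,
       "precision": 0.881, "recall": 0.876}

# A positive delta means the FP8 checkpoint scored higher than BF16.
deltas = {name: round(fp8[name] - bf16[name], 3) for name in bf16}

for name, delta in deltas.items():
    print(f"{name:<11} {delta:+.3f}")
```

All five differences are positive and none exceeds 0.004.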

Usage

With vLLM

vllm serve C3DS/CARDS-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-27B

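--kv-cache-dtype fp8 is optional but doubles effective KV cache capacity on Hopper hardware. The FP8 weights are detected automatically, so do not pass --dtype bfloat16 (it would override the FP8 weights and undo the savings). vLLM rejects requests whose prompt tokens plus max_tokens exceed --max-model-len, so with --max-model-len 4096 the max_tokens=4000 used in the client example below leaves only 96 tokens for the prompt; raise --max-model-len or lower max_tokens if requests are rejected.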

The system prompt and user-message format are identical to the BF16 sibling. We bundle them in this repo as cards_prompts.json for self-contained loading:

import json
from huggingface_hub import hf_hub_download
from openai import OpenAI

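# Download the bundled prompt file (cached locally after the first call) and load it.
with open(hf_hub_download("C3DS/CARDS-Qwen3.5-27B-FP8", "cards_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)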
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger             = prompts["cot_trigger"]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def classify(text):
    resp = client.chat.completions.create(
        model="CARDS-Qwen3.5-27B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user",   "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        temperature=0,
        max_tokens=4000,
    )
    return resp.choices[0].message.content

print(classify("These are only a few renewable energy technologies at work"))

The model produces a reasoning trace inside <think>…</think> followed by a YAML categories: block listing predicted CARDS codes. To parse: take the content after </think> and read the categories: list.
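The parsing step described above could be sketched as follows. This is a minimal, standard-library-only sketch, not the project's parser: it assumes the final answer contains a `categories:` key followed either by `- item` lines or by an inline `[a, b]` list, and the function name `parse_categories` and the codes in the usage example are hypothetical. Check the evaluation code in the project repository for the exact output format.

```python
import re

def parse_categories(output: str) -> list[str]:
    """Return the codes listed under `categories:` after the reasoning trace."""
    # Keep only the text after the last </think>; use everything if the tag is absent.
    answer = output.rsplit("</think>", 1)[-1]
    lines = answer.splitlines()
    codes: list[str] = []
    for i, line in enumerate(lines):
        match = re.match(r"\s*categories:\s*(.*)$", line)
        if not match:
            continue
        inline = match.group(1).strip()
        if inline.startswith("["):
            # Inline (flow) style: categories: [a, b]
            inner = inline.strip("[]")
            codes = [c.strip().strip("'\"") for c in inner.split(",") if c.strip()]
        else:
            # Block style: consecutive "- item" lines directly after the key.
            for item in lines[i + 1:]:
                item_match = re.match(r"\s*-\s*(.+?)\s*$", item)
                if not item_match:
                    break
                codes.append(item_match.group(1).strip("'\""))
        break  # only the first categories: key is read
    return codes

# Hypothetical model output; the codes are placeholders, not real predictions.
sample = "<think>short reasoning</think>\ncategories:\n- 1_1\n- 5_2\n"
print(parse_categories(sample))
```

An empty list means no `categories:` block was found, which is one way a caller could count a parse failure.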

See the project repository for training scripts, quantization recipe, evaluation code, and dataset preparation.

Multimodal — image + text

The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible image_url content part, and this fine-tune preserves that capability. Pass the same system prompt used for text alongside an image (with or without caption text), and the model classifies the depicted claim under the CARDS taxonomy.

Serve vLLM with multimodal flags enabled:

vllm serve C3DS/CARDS-Qwen3.5-27B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-27B

import base64, json, mimetypes
from pathlib import Path
from huggingface_hub import hf_hub_download
from openai import OpenAI

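# Download the bundled prompt file (cached locally after the first call) and load it.
with open(hf_hub_download("C3DS/CARDS-Qwen3.5-27B-FP8", "cards_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)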
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger             = prompts["cot_trigger"]

def image_part(path):
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="CARDS-Qwen3.5-27B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the climate claim it makes."},
            image_part("screenshot.png"),
            {"type": "text", "text": f"### Caption:\n<optional caption>\n\n{cot_trigger}"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)

Training & Quantization

Fine-tuning (inherited from the BF16 sibling)

Base model: Qwen/Qwen3.5-27B
Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
Dataset: C3DS/cards_sft_dataset
Framework: Unsloth + TRL SFTTrainer
Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=4096, adamw_8bit, bf16
Checkpoint selection: best checkpoint, selected via load_best_model_at_end=True

FP8 quantization

Scheme: fp8_e4m3 dynamic per-channel quantization (weights only). Activations stay in BF16; per-channel scales are computed at quantization time, no calibration data needed.
Targets: all linear layers in the transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). lm_head is typically excluded from quantization; see the project repo for the exact ignore list.
Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
Validation: the quantized model is re-evaluated on the CARDS test set (see Results above). FP8 matches BF16 within rounding.
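To illustrate why per-channel weight scales need no calibration data, here is a pure-Python sketch of the arithmetic. It is not the llmcompressor implementation: the helper names are hypothetical, the weight matrix is a toy example, and the rounding routine is a simplified model of the e4m3 format (3 mantissa bits, largest finite value 448). Each output channel's scale is derived from that row's own maximum absolute weight.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 (e4m3fn) format

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest e4m3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # Exponent of the leading bit, clamped to the minimum normal exponent (-6);
    # values below 2**-6 therefore share the subnormal step of 2**-9.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)

def quantize_per_channel(weight):
    """weight: list of rows, one row per output channel. Returns (q_rows, scales)."""
    q_rows, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        # The scale maps the row's largest magnitude onto the FP8 maximum.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate weights by multiplying each row by its scale."""
    return [[v * s for v in row] for row, s in zip(q_rows, scales)]

toy_weight = [[0.5, -0.25, 0.1], [4.0, 1.3, -2.0]]  # toy values, not real weights
q, s = quantize_per_channel(toy_weight)
print(dequantize(q, s))
```

Because the scales depend only on the weights themselves, the whole procedure runs offline on the merged checkpoint.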

Hardware notes

Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Tested on A100, H100, H200. Ampere (A100) uses up-cast FP8→BF16 matmul; Hopper (H100/H200) uses native FP8 tensor cores for additional throughput.
Memory: ~27 GB for weights, leaving 50+ GB on a single A100 80GB for KV cache. Comfortable for max_model_len=8192 at max_num_seqs ≥ 256.
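The memory figures above follow from simple arithmetic, sketched below. The parameter count is the nominal 27 billion implied by the model name, which is an assumption; the estimate ignores quantization scales, any layers left unquantized, activations, and runtime overhead, so real numbers will differ somewhat.

```python
PARAMS = 27e9        # nominal parameter count implied by the model name (assumption)
GPU_MEMORY_GB = 80   # A100 80GB, as in the note above

def weight_gb(params: float, bytes_per_param: int) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params * bytes_per_param / 1e9

bf16_gb = weight_gb(PARAMS, 2)  # 16-bit weights take 2 bytes each
fp8_gb = weight_gb(PARAMS, 1)   # 8-bit weights take 1 byte each
headroom_gb = GPU_MEMORY_GB - fp8_gb

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Headroom on an {GPU_MEMORY_GB} GB GPU: ~{headroom_gb:.0f} GB")
```

In practice vLLM reserves only a fraction of GPU memory (controlled by --gpu-memory-utilization), so the usable KV cache budget is somewhat smaller than this upper bound.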

Limitations

Thinking tokens. Training used enable_thinking=True. Either parse the output after </think>, or disable thinking at inference by sending "chat_template_kwargs": {"enable_thinking": false} in the request body. When thinking is enabled, reserve token budget for the reasoning trace before the final YAML block.
Quantization is weight-only. Activations are BF16. For more aggressive compression (FP8 activations, or W4A16 GPTQ/AWQ), see project follow-up work.
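With the OpenAI Python client, the thinking switch can be passed through extra_body, whose fields are merged into the JSON request sent to vLLM. The helper below is a hypothetical sketch (build_request is not part of this repo) that reuses the message format from the Usage section. It only shows how the flag is passed and makes no claim about accuracy with thinking disabled.

```python
def build_request(text, system_prompt, cot_trigger, enable_thinking=True):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "CARDS-Qwen3.5-27B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        "temperature": 0,
        "max_tokens": 4000,
        # extra_body fields are added to the request body as-is;
        # chat_template_kwargs is forwarded by vLLM to the chat template.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }

# Usage, assuming `client`, `slim_system_instruction`, and `cot_trigger`
# are defined as in the Usage section:
# resp = client.chat.completions.create(
#     **build_request("some text", slim_system_instruction, cot_trigger,
#                     enable_thinking=False)
# )
```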

Citation

@article{cards2pO2025,
  title={Large language model reveals an increase in climate contrarian speech in the United States Congress},
  author={Travis G. Coan and Ranadheer Malla and Mirjam O. Nanko and William Kattrup and J. Timmons Roberts and John Cook and Constantine Boussalis},
  journal={Communications Sustainability},
  year={2025}
}