language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3.5-2B-Base
tags:
- medical
- neurology
- domain-adaptation
- continued-pretraining
- qwen3.5
datasets:
- pubmed
- arxiv
pipeline_tag: text-generation
model-index:
- name: Neuro-Qwen3.5-2B-base

Neuro-Qwen3.5-2B — Medical Domain-Adapted Base Model

A 2B-parameter base model adapted to neurological and medical text through continued pre-training of Qwen3.5-2B-Base. It is designed as a foundation for medical NLP: apply your own SFT or fine-tuning on top.

Model Description

Neuro-Qwen3.5-2B-base is Qwen/Qwen3.5-2B-Base with continued pre-training (CPT) on ~0.6B tokens of English medical text:

Source	Proportion	Description
Patient EHR records	45%	Deidentified clinical notes from hospital EHR system
arXiv neuroscience	25%	Neuroscience research papers (LaTeX cleaned, classified, chunked)
PubMed neurology	15%	Neurology and neuroscience abstracts (filtered + enriched)
General medical	12%	PubMed medical abstracts + medical textbook excerpts
English regularizer	3%	General English text to prevent catastrophic forgetting
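
The proportions above can be turned into rough per-source token budgets. The sketch below is only arithmetic on the stated ~0.6B total, so the counts are approximate; the names `TOTAL_TOKENS`, `MIX`, and `token_budget` are illustrative and do not come from the training code.

```python
# Approximate token budget per source, assuming the stated ~0.6B total
# is split exactly according to the proportions in the table.
TOTAL_TOKENS = 600_000_000  # "~0.6B" in the model card; approximate

MIX = {
    "Patient EHR records": 0.45,
    "arXiv neuroscience": 0.25,
    "PubMed neurology": 0.15,
    "General medical": 0.12,
    "English regularizer": 0.03,
}


def token_budget(total, mix):
    """Return approximate tokens per source; the proportions must sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    return {name: round(total * share) for name, share in mix.items()}


budget = token_budget(TOTAL_TOKENS, MIX)
for name, tokens in budget.items():
    print(f"{name:<22} {tokens / 1e6:6.0f}M tokens")
```

Under these assumptions the English regularizer amounts to roughly 18M tokens, against about 270M tokens of EHR text.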

This is a base model — it does not follow instructions out of the box. Apply SFT for chat/instruction-following capabilities.

When to Use This Model

As a starting point for medical SFT — better medical knowledge than stock Qwen3.5-2B-Base
For medical text embeddings, NER, summarization — lower perplexity on clinical text
For research on domain adaptation of hybrid attention architectures
When you need a small (2B) medical-aware model that fits on a single GPU

Evaluation

We evaluate along three axes: (1) domain perplexity to confirm the CPT shifted the distribution toward medical text, (2) MCQ accuracy to measure downstream task performance, and (3) GPT-judged reasoning quality to assess clinical reasoning depth.

Three model configurations are compared throughout:

Label	Description
Qwen3.5-2B (Instruct)	Official instruction-tuned release from Alibaba — the off-the-shelf baseline
Base+SFT	Stock Qwen3.5-2B-Base fine-tuned with our medical SFT recipe
CPT+SFT	Our CPT model fine-tuned with the same medical SFT recipe

Comparing all three tells us: (a) how our final model stacks up against the official Instruct release, and (b) how much of the gain comes specifically from CPT (CPT+SFT vs Base+SFT with identical SFT).

1. Domain Perplexity (lower = better fit)

Measured on held-out documents not seen during training:

Domain	Qwen3.5-2B-Base	Ours (CPT)	Improvement
Patient EHR	5.03	4.50	-10.5%
PubMed Neurology	3.13	3.03	-3.2%
arXiv Neuroscience	12.21	12.07	-1.1%
English (general)	8.62	8.63	+0.1%

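CPT reduces perplexity most on patient EHR text (10.5%), with smaller reductions on PubMed neurology (3.2%) and arXiv neuroscience (1.1%). General English perplexity is essentially unchanged (+0.1%), so general English capability is preserved.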
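
The Improvement column is the relative change in perplexity from the stock base model to the CPT model. A minimal sketch that recomputes it from the table values (the function name `relative_change_pct` is illustrative):

```python
# Recompute the "Improvement" column from the perplexities in the table.
# Negative values mean the CPT model fits the domain better.
RESULTS = {
    "Patient EHR": (5.03, 4.50),
    "PubMed Neurology": (3.13, 3.03),
    "arXiv Neuroscience": (12.21, 12.07),
    "English (general)": (8.62, 8.63),
}


def relative_change_pct(baseline_ppl, cpt_ppl):
    """Relative perplexity change in percent, measured against the baseline."""
    return 100.0 * (cpt_ppl - baseline_ppl) / baseline_ppl


for domain, (baseline, cpt) in RESULTS.items():
    print(f"{domain:<20} {relative_change_pct(baseline, cpt):+.1f}%")
```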

2. MCQ Accuracy (lm-evaluation-harness)

We compare our CPT+SFT model against both the official Instruct release and a Base+SFT control that uses the same SFT data without CPT. This lets us separately assess absolute performance (vs Instruct) and the isolated CPT contribution (vs Base+SFT).

Benchmark	Qwen3.5-2B (Instruct)	Base+SFT	CPT+SFT	CPT Δ
PubMedQA	73.2%	75.2% ±1%	77.0% ±1%	+1.8%
MMLU Clinical Knowledge	59.6%	63.8% ±1%	65.4% ±1%	+1.6%
MMLU Anatomy	57.8%	62.2% ±1%	65.7% ±1%	+3.5%
MMLU College Medicine	61.3%	64.7% ±1%	67.3% ±1%	+2.6%
MMLU Professional Medicine	61.0%	58.8% ±1%	62.3% ±1%	+3.5%
MMLU Medical Genetics	74.0%	76.0% ±1%	80.0% ±1%	+4.0%
MMLU College Biology	72.2%	75.0% ±1%	77.7% ±1%	+2.7%
MMLU Virology	46.4%	50.0% ±1%	52.0% ±1%	+2.0%
MMLU Nutrition	67.0%	69.6% ±1%	71.6% ±1%	+2.0%
Medical Average	63.6%	66.1% ±1%	68.8% ±1%	+2.7%

CPT Δ = CPT+SFT minus Base+SFT, in percentage points (isolates the CPT contribution with identical SFT).

Both SFT variants outperform the official Instruct model on the medical average (66.1% and 68.8% vs 63.6%), which supports the value of targeted medical SFT. The one exception is Professional Medicine, where Base+SFT (58.8%) trails Instruct (61.0%). On top of SFT, CPT adds between 1.6 and 4.0 pp on every benchmark, with the largest gains on Medical Genetics (+4.0 pp), Anatomy (+3.5 pp), and Professional Medicine (+3.5 pp), where deeper domain knowledge matters most.
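
The CPT Δ column and the Medical Average row can be recomputed from the per-benchmark accuracies. The sketch below uses only the numbers in the table; the helper names are illustrative.

```python
# Accuracy (%) per benchmark: (Instruct, Base+SFT, CPT+SFT), copied from the table.
SCORES = {
    "PubMedQA": (73.2, 75.2, 77.0),
    "MMLU Clinical Knowledge": (59.6, 63.8, 65.4),
    "MMLU Anatomy": (57.8, 62.2, 65.7),
    "MMLU College Medicine": (61.3, 64.7, 67.3),
    "MMLU Professional Medicine": (61.0, 58.8, 62.3),
    "MMLU Medical Genetics": (74.0, 76.0, 80.0),
    "MMLU College Biology": (72.2, 75.0, 77.7),
    "MMLU Virology": (46.4, 50.0, 52.0),
    "MMLU Nutrition": (67.0, 69.6, 71.6),
}


def mean(values):
    """Unweighted mean across benchmarks."""
    values = list(values)
    return sum(values) / len(values)


# CPT delta in percentage points: CPT+SFT minus Base+SFT.
deltas = {name: round(cpt - base, 1) for name, (_, base, cpt) in SCORES.items()}

avg_instruct = mean(s[0] for s in SCORES.values())
avg_base_sft = mean(s[1] for s in SCORES.values())
avg_cpt_sft = mean(s[2] for s in SCORES.values())

print(f"Averages: {avg_instruct:.1f} / {avg_base_sft:.1f} / {avg_cpt_sft:.1f}")
print(f"Delta range: {min(deltas.values())} to {max(deltas.values())} pp")
# The unrounded averages differ by about 2.6 pp; the table's +2.7 is the
# difference of the rounded averages (68.8 - 66.1).
print(f"Average delta: {avg_cpt_sft - avg_base_sft:.2f} pp")
```

The average is an unweighted mean over the nine benchmarks, so small test sets such as Medical Genetics count as much as larger ones.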

General Reasoning (no catastrophic forgetting)

Benchmark	Qwen3.5-2B (Instruct)	CPT+SFT
HellaSwag	62.3%	66.6%
ARC-Challenge	41.4%	47.3%
WinoGrande	63.5%	65.7%
BoolQ	71.7%	74.2%

No degradation on general benchmarks: CPT+SFT exceeds the Instruct model on all four tasks, by 2.2 to 5.9 pp.

3. Reasoning Quality (GPT-5.2 Blind Evaluation)

To assess clinical reasoning depth beyond MCQ accuracy, we use GPT-5.2 as a blind judge. Here we compare CPT+SFT vs Base+SFT directly — since both share the same SFT, any difference in reasoning quality is attributable to CPT alone.

We sampled 200 questions from MedMCQA (with gold explanations), PubMedQA (with long answers), and the MedQA test split. The judge evaluated full responses, including <think> reasoning. All comparisons are blind and position-randomized.

Dimension Scores (1–5 scale)

Dimension	Base+SFT	CPT+SFT
Diagnostic Reasoning	2.98	3.05
Clinical Knowledge	3.09	3.22
Management Quality	2.36	2.47
Reasoning Chain	2.69	2.83
Overall	2.86	2.98

CPT+SFT outscores Base+SFT across every dimension, with the largest gains in Clinical Knowledge (+0.13) and Reasoning Chain (+0.14) — consistent with CPT injecting deeper medical knowledge that surfaces during chain-of-thought reasoning.

Pairwise Preference (blind, position-randomized)

	CPT+SFT Wins	Base+SFT Wins	Tie
GPT-5.2 Judge	52%	38%	10%
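
A blind, position-randomized pairwise comparison can be sketched as follows. This is not the code in eval_reasoning_judge.py; the function names are hypothetical, and the call to the judge model is left out because its prompt and API usage are not described here. The judge only ever sees the slots "A" and "B", and verdicts are mapped back to model labels afterwards.

```python
import random
from collections import Counter


def make_blind_pair(cpt_answer, base_answer, rng):
    """Randomly assign the two answers to slots A and B.

    Returns (slots, key): `slots` is what the judge sees, `key` is the hidden
    mapping from slot to model label, kept away from the judge.
    """
    if rng.random() < 0.5:
        return ({"A": cpt_answer, "B": base_answer},
                {"A": "CPT+SFT", "B": "Base+SFT"})
    return ({"A": base_answer, "B": cpt_answer},
            {"A": "Base+SFT", "B": "CPT+SFT"})


def unblind(verdict, key):
    """Map a judge verdict ('A', 'B', or 'tie') back to a model label."""
    if verdict == "tie":
        return "Tie"
    return key[verdict]


def tally(outcomes):
    """Convert a list of unblinded outcomes into percentages per label."""
    counts = Counter(outcomes)
    return {label: 100.0 * n / len(outcomes) for label, n in counts.items()}


rng = random.Random(0)  # fixed seed so the slot assignment is reproducible
slots, key = make_blind_pair("answer from CPT+SFT", "answer from Base+SFT", rng)
# A judge would now receive slots["A"] and slots["B"] and return 'A', 'B', or 'tie'.
print(unblind("A", key))
```

Randomizing the slot per question keeps a judge's preference for the first or second position from favoring one model systematically.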

Training Details


Setting	Value
Base model	Qwen3.5-2B-Base
Architecture	Qwen3.5 (hybrid linear + full attention, 24 layers)
Parameters	1.88B
CPT data	~0.6B tokens, English only
Hardware	8× NVIDIA RTX PRO 6000 Blackwell (102GB each)
Optimizer	AdamW (lr=2e-6, β=(0.9, 0.95), weight decay=0.1)
Schedule	Warmup-Stable-Decay (1% warmup, 85% stable, cosine decay)
Sequence length	4,096 tokens
Effective batch	256 sequences/step (~1M tokens/step)
Training	~1 epoch
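
The batch size, token budget, and schedule fractions above imply a fairly short run. The sketch below derives the step count and a Warmup-Stable-Decay learning-rate curve from those numbers. The linear warmup shape and the decay to zero are assumptions, since the table only states the fractions and that the decay is cosine; `wsd_lr` is an illustrative name, not a function from cpt_train_v2.py.

```python
import math

SEQ_LEN = 4096            # tokens per sequence
SEQS_PER_STEP = 256       # effective batch size
TOTAL_TOKENS = 0.6e9      # "~0.6B" tokens; approximate
PEAK_LR = 2e-6

tokens_per_step = SEQ_LEN * SEQS_PER_STEP             # 1,048,576 (~1M)
total_steps = round(TOTAL_TOKENS / tokens_per_step)   # about 572 for ~1 epoch


def wsd_lr(step, total_steps, peak_lr=PEAK_LR,
           warmup_frac=0.01, stable_frac=0.85, min_lr=0.0):
    """Learning rate at `step` (0-based) under a Warmup-Stable-Decay schedule.

    Assumptions: linear warmup, cosine decay over the remaining steps,
    and a final learning rate of `min_lr` (0 here).
    """
    warmup_steps = max(1, round(warmup_frac * total_steps))
    decay_start = warmup_steps + round(stable_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    decay_steps = max(1, total_steps - decay_start)
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(f"{tokens_per_step:,} tokens/step, {total_steps} steps")
for step in (0, 5, 100, 500, 571):
    print(step, f"{wsd_lr(step, total_steps):.2e}")
```

With these numbers the warmup lasts only about 6 steps and the decay phase covers roughly the final 14% of training.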

Data Pipeline

Medical data was processed through a quality enrichment pipeline using Qwen3-32B:

PubMed/arXiv: Filtered for medical relevance, short abstracts expanded to multi-paragraph explanations, LaTeX stripped and classified
Patient records: Deidentified EHR notes, chunked with 256-token overlap for long documents
Quality control: Automated classification (keep/drop), noise filtering, deduplication
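
Chunking long documents with a 256-token overlap can be sketched as below. The chunk length of 4,096 tokens is an assumption (it matches the training sequence length, but the chunk size used in enrich_data.py is not stated), and `chunk_with_overlap` is a hypothetical helper that works on any list of token IDs.

```python
def chunk_with_overlap(token_ids, max_len=4096, overlap=256):
    """Split a token sequence into chunks of at most `max_len` tokens.

    Consecutive chunks share `overlap` tokens, so context at a chunk
    boundary appears in both neighbors. Short documents stay whole.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    token_ids = list(token_ids)
    if len(token_ids) <= max_len:
        return [token_ids]
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this chunk reached the end of the document
        start += stride
    return chunks


# Example: a 10,000-token note becomes three chunks.
note = list(range(10_000))
chunks = chunk_with_overlap(note)
print([len(c) for c in chunks])
```

The last chunk is usually shorter than `max_len`; whether such tails are padded, packed with other documents, or dropped is a separate design choice not covered here.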

How to Use

As a base for SFT

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load as base model for your own fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "NDIJayant/Neuro-Qwen3.5-2B-base",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NDIJayant/Neuro-Qwen3.5-2B-base")

# This is a BASE model — use it as a starting point for SFT
# It will not follow instructions without fine-tuning

For text embeddings / perplexity scoring

import torch

# Score medical text likelihood (reuses `model` and `tokenizer` loaded above)
text = "The patient presented with acute onset left hemiparesis..."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Passing the input IDs as labels returns the mean next-token cross-entropy
    outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = torch.exp(outputs.loss)
    print(f"Perplexity: {perplexity.item():.2f}")
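
The snippet above exponentiates the mean per-token cross-entropy of a single text. When scoring several documents, pooling by token count gives a different number than averaging per-document perplexities. The sketch below shows the difference in plain Python on made-up per-token losses; how the held-out perplexities in the Evaluation section were aggregated is not stated, so treat this as an illustration only.

```python
import math


def perplexity(token_nlls):
    """Perplexity of one sequence from per-token negative log-likelihoods (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def corpus_perplexity(docs):
    """Token-weighted perplexity over several documents."""
    total_nll = sum(sum(doc) for doc in docs)
    total_tokens = sum(len(doc) for doc in docs)
    return math.exp(total_nll / total_tokens)


# Made-up losses: a short easy document and a longer hard one.
short_doc = [math.log(2)] * 2   # perplexity 2
long_doc = [math.log(8)] * 6    # perplexity 8
docs = [short_doc, long_doc]

per_doc_mean = sum(perplexity(d) for d in docs) / len(docs)
print(f"mean of per-document perplexities: {per_doc_mean:.2f}")
print(f"token-weighted corpus perplexity:  {corpus_perplexity(docs):.2f}")
```

Token weighting lets long documents contribute in proportion to their length, which is usually what a corpus-level perplexity is meant to capture.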

Limitations

English only: all training data is English
Not a clinical tool: for research purposes only; must not be used for clinical decision-making
Base model: does not follow instructions without SFT
Single-institution EHR: may reflect institution-specific clinical patterns
Modest CPT scale: 0.6B tokens is a starting point; a larger budget (for example 3-5B tokens) would likely yield larger gains
No RLHF/DPO: base CPT only, no alignment training

Reproducing This Work

Training code, data enrichment pipeline, and evaluation scripts are available at: [github-link]

Key scripts:

cpt_train_v2.py — continued pre-training with multi-GPU DDP
enrich_data.py — data quality pipeline using Qwen3-32B
eval_qwen.py — lm-evaluation-harness wrapper
eval_reasoning_judge.py — GPT-judged reasoning evaluation

Citation

@misc{neuro-qwen-2026,
  title={Neuro-Qwen3.5-2B-base: Domain-Adapted Base Model for Medical NLP},
  author={Jayant},
  year={2026},
  url={https://huggingface.co/NDIJayant/Neuro-Qwen3.5-2B-base}
}

Acknowledgments

Built on Qwen3.5-2B-Base by Alibaba Cloud. Data enrichment powered by Qwen3-32B. Evaluation using lm-evaluation-harness and GPT-5.2.
