---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3.5-2B-Base
tags:
- medical
- neurology
- domain-adaptation
- continued-pretraining
- qwen3.5
datasets:
- pubmed
- arxiv
pipeline_tag: text-generation
model-index:
- name: Neuro-Qwen3.5-2B-base
---
Neuro-Qwen3.5-2B: Medical Domain-Adapted Base Model
A 2B-parameter base model adapted to neurological and medical text through continued pre-training of Qwen3.5-2B-Base. It is designed as a foundation for medical NLP: apply your own SFT or fine-tuning on top.
Model Description
Neuro-Qwen3.5-2B-base is Qwen/Qwen3.5-2B-Base with continued pre-training (CPT) on ~0.6B tokens of English medical text:
| Source | Proportion | Description |
|---|---|---|
| Patient EHR records | 45% | Deidentified clinical notes from hospital EHR system |
| arXiv neuroscience | 25% | Neuroscience research papers (LaTeX cleaned, classified, chunked) |
| PubMed neurology | 15% | Neurology and neuroscience abstracts (filtered + enriched) |
| General medical | 12% | PubMed medical abstracts + medical textbook excerpts |
| English regularizer | 3% | General English text to prevent catastrophic forgetting |
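As a rough illustration of what the mix above implies, the proportions can be turned into per-source token budgets. This is a minimal sketch, assuming the total is exactly 600M tokens (the card only says "~0.6B"); the helper name `token_budgets` is hypothetical and not part of the training code.

```python
# Approximate total from the card ("~0.6B tokens"); the exact figure is an assumption.
TOTAL_TOKENS = 600_000_000

# Proportions copied from the data-mix table.
MIX = {
    "Patient EHR records": 0.45,
    "arXiv neuroscience": 0.25,
    "PubMed neurology": 0.15,
    "General medical": 0.12,
    "English regularizer": 0.03,
}


def token_budgets(total_tokens, mix):
    """Split a total token budget according to the given proportions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix proportions must sum to 1")
    return {name: round(total_tokens * share) for name, share in mix.items()}


budgets = token_budgets(TOTAL_TOKENS, MIX)
for name, tokens in budgets.items():
    print(f"{name:22s} {tokens / 1e6:6.1f}M tokens")
```

Under that assumption, the English regularizer amounts to roughly 18M tokens out of 600M.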
This is a base model: it does not follow instructions out of the box. Apply SFT for chat or instruction-following capabilities.
When to Use This Model
- As a starting point for medical SFT: after identical SFT it scores higher on medical benchmarks than stock Qwen3.5-2B-Base
- For medical text embeddings, NER, summarization: lower perplexity on clinical text than the stock base model
- For research on domain adaptation of hybrid attention architectures
- When you need a small (2B) medical-aware model that fits on a single GPU
Evaluation
We evaluate along three axes: (1) domain perplexity to confirm the CPT shifted the distribution toward medical text, (2) MCQ accuracy to measure downstream task performance, and (3) GPT-judged reasoning quality to assess clinical reasoning depth.
Three model configurations are compared throughout:
| Label | Description |
|---|---|
| Qwen3.5-2B (Instruct) | Official instruction-tuned release from Alibaba, used as the off-the-shelf baseline |
| Base+SFT | Stock Qwen3.5-2B-Base fine-tuned with our medical SFT recipe |
| CPT+SFT | Our CPT model fine-tuned with the same medical SFT recipe |
Comparing all three tells us: (a) how our final model stacks up against the official Instruct release, and (b) how much of the gain comes specifically from CPT (CPT+SFT vs Base+SFT with identical SFT).
1. Domain Perplexity (lower = better fit)
Measured on held-out documents not seen during training:
| Domain | Qwen3.5-2B-Base | Ours (CPT) | Improvement |
|---|---|---|---|
| Patient EHR | 5.03 | 4.50 | -10.5% |
| PubMed Neurology | 3.13 | 3.03 | -3.2% |
| arXiv Neuroscience | 12.21 | 12.07 | -1.1% |
| English (general) | 8.62 | 8.63 | +0.1% |
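CPT lowers perplexity on all three medical domains, most of all on Patient EHR (-10.5%), with smaller reductions on PubMed Neurology (-3.2%) and arXiv Neuroscience (-1.1%), while general English perplexity is essentially unchanged (+0.1%).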
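The Improvement column is the relative change in perplexity versus the stock base model. A minimal sketch of that arithmetic, using the values from the table (the function and variable names are illustrative, not taken from the evaluation scripts):

```python
def relative_change(base_ppl, cpt_ppl):
    """Relative perplexity change in percent; negative means CPT fits better."""
    return 100.0 * (cpt_ppl - base_ppl) / base_ppl


# (Qwen3.5-2B-Base perplexity, CPT perplexity), copied from the table above.
PERPLEXITY = {
    "Patient EHR": (5.03, 4.50),
    "PubMed Neurology": (3.13, 3.03),
    "arXiv Neuroscience": (12.21, 12.07),
    "English (general)": (8.62, 8.63),
}

changes = {
    domain: round(relative_change(base, cpt), 1)
    for domain, (base, cpt) in PERPLEXITY.items()
}
for domain, pct in changes.items():
    print(f"{domain:20s} {pct:+.1f}%")
```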
2. MCQ Accuracy (lm-evaluation-harness)
We compare our CPT+SFT model against both the official Instruct release and a Base+SFT control that uses the same SFT data without CPT. This lets us separately assess absolute performance (vs Instruct) and the isolated CPT contribution (vs Base+SFT).
| Benchmark | Qwen3.5-2B (Instruct) | Base+SFT | CPT+SFT | CPT Δ (pp) |
|---|---|---|---|---|
| PubMedQA | 73.2% | 75.2% ±1% | 77.0% ±1% | +1.8 |
| MMLU Clinical Knowledge | 59.6% | 63.8% ±1% | 65.4% ±1% | +1.6 |
| MMLU Anatomy | 57.8% | 62.2% ±1% | 65.7% ±1% | +3.5 |
| MMLU College Medicine | 61.3% | 64.7% ±1% | 67.3% ±1% | +2.6 |
| MMLU Professional Medicine | 61.0% | 58.8% ±1% | 62.3% ±1% | +3.5 |
| MMLU Medical Genetics | 74.0% | 76.0% ±1% | 80.0% ±1% | +4.0 |
| MMLU College Biology | 72.2% | 75.0% ±1% | 77.7% ±1% | +2.7 |
| MMLU Virology | 46.4% | 50.0% ±1% | 52.0% ±1% | +2.0 |
| MMLU Nutrition | 67.0% | 69.6% ±1% | 71.6% ±1% | +2.0 |
| Medical Average | 63.6% | 66.1% ±1% | 68.8% ±1% | +2.7 |
CPT Δ = CPT+SFT minus Base+SFT, in percentage points (isolates the CPT contribution with identical SFT).
Both SFT variants outperform the official Instruct model on the medical average (66.1% and 68.8% vs 63.6%), confirming the value of targeted medical SFT; the one exception is Professional Medicine, where Base+SFT (58.8%) trails Instruct (61.0%). On top of that, CPT adds 1.6 to 4.0 pp on every benchmark (2.7 pp on average), with the largest gains on knowledge-intensive benchmarks, namely Medical Genetics (+4.0 pp), Anatomy (+3.5 pp), and Professional Medicine (+3.5 pp), where deeper domain knowledge matters most.
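The CPT contribution can be recomputed directly from the two SFT columns of the table. The sketch below does this with the published (rounded) scores; the names `SCORES` and `cpt_delta_pp` are illustrative and not part of the evaluation code.

```python
# benchmark -> (Base+SFT accuracy %, CPT+SFT accuracy %), copied from the MCQ table.
SCORES = {
    "PubMedQA": (75.2, 77.0),
    "MMLU Clinical Knowledge": (63.8, 65.4),
    "MMLU Anatomy": (62.2, 65.7),
    "MMLU College Medicine": (64.7, 67.3),
    "MMLU Professional Medicine": (58.8, 62.3),
    "MMLU Medical Genetics": (76.0, 80.0),
    "MMLU College Biology": (75.0, 77.7),
    "MMLU Virology": (50.0, 52.0),
    "MMLU Nutrition": (69.6, 71.6),
}


def cpt_delta_pp(scores):
    """Per-benchmark gain of CPT+SFT over Base+SFT, in percentage points."""
    return {name: round(cpt - base, 1) for name, (base, cpt) in scores.items()}


deltas = cpt_delta_pp(SCORES)
mean_delta = round(sum(cpt - base for base, cpt in SCORES.values()) / len(SCORES), 1)
largest = max(deltas, key=deltas.get)

print(deltas)
print(f"mean gain: {mean_delta:+.1f} pp, largest: {largest}")
```

Averaging the nine rounded per-benchmark gains gives about 2.6 pp, whereas subtracting the two rounded averages in the table gives 2.7 pp; the gap is a rounding effect.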
General Reasoning (no catastrophic forgetting)
| Benchmark | Qwen3.5-2B (Instruct) | CPT+SFT |
|---|---|---|
| HellaSwag | 62.3% | 66.6% |
| ARC-Challenge | 41.4% | 47.3% |
| WinoGrande | 63.5% | 65.7% |
| BoolQ | 71.7% | 74.2% |
No degradation on general benchmarks: CPT+SFT exceeds the Instruct model on all four tasks.
3. Reasoning Quality (GPT-5.2 Blind Evaluation)
To assess clinical reasoning depth beyond MCQ accuracy, we use GPT-5.2 as a blind judge. Here we compare CPT+SFT and Base+SFT directly: since both share the same SFT, any difference in reasoning quality is attributable to CPT alone.
We sampled 200 questions from MedMCQA (with gold explanations), PubMedQA (with long answers), and the MedQA test split. Full responses, including <think> reasoning, were evaluated. All comparisons are blind and position-randomized.
Dimension Scores (1-5 scale)
| Dimension | Base+SFT | CPT+SFT |
|---|---|---|
| Diagnostic Reasoning | 2.98 | 3.05 |
| Clinical Knowledge | 3.09 | 3.22 |
| Management Quality | 2.36 | 2.47 |
| Reasoning Chain | 2.69 | 2.83 |
| Overall | 2.86 | 2.98 |
CPT+SFT outscores Base+SFT on every dimension, with the largest gains in Reasoning Chain (+0.14) and Clinical Knowledge (+0.13). This is consistent with CPT injecting deeper medical knowledge that surfaces during chain-of-thought reasoning, although the absolute differences are small on a 5-point scale.
Pairwise Preference (blind, position-randomized)
| Judge | CPT+SFT Wins | Base+SFT Wins | Tie |
|---|---|---|---|
| GPT-5.2 Judge | 52% | 38% | 10% |
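Blind, position-randomized comparison means the judge sees the two answers as "A" and "B" in a random order, and the verdict is mapped back to model names only afterwards. The following is a minimal sketch of that bookkeeping, not the actual `eval_reasoning_judge.py` logic; `judge` is a hypothetical callable returning "A", "B", or "tie", and the length-based stand-in exists only to make the example runnable without an LLM API.

```python
import random
from collections import Counter


def blind_pairwise(question, answer_cpt, answer_base, judge, rng):
    """Present both answers in random order and map the verdict back to a model."""
    pair = [("CPT+SFT", answer_cpt), ("Base+SFT", answer_base)]
    rng.shuffle(pair)  # randomize which model appears in position A
    (name_a, text_a), (name_b, text_b) = pair
    verdict = judge(question, text_a, text_b)  # judge never sees model names
    if verdict == "A":
        return name_a
    if verdict == "B":
        return name_b
    return "tie"


def tally(results):
    """Convert a list of winners into percentage shares."""
    counts = Counter(results)
    return {k: 100.0 * counts[k] / len(results) for k in ("CPT+SFT", "Base+SFT", "tie")}


def length_judge(question, text_a, text_b):
    """Stand-in judge for the demo only: prefers the longer answer."""
    if len(text_a) == len(text_b):
        return "tie"
    return "A" if len(text_a) > len(text_b) else "B"


rng = random.Random(0)
examples = [
    ("q1", "a longer, more detailed answer", "short"),
    ("q2", "brief", "a much longer competing answer"),
    ("q3", "same", "same"),
]
results = [blind_pairwise(q, cpt, base, length_judge, rng) for q, cpt, base in examples]
shares = tally(results)
print(results, shares)
```

Because the verdict is mapped back through the shuffled pair, any position bias in the judge is spread evenly across both models.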
Training Details
| Setting | Value |
|---|---|
| Base model | Qwen3.5-2B-Base |
| Architecture | Qwen3.5 (hybrid linear + full attention, 24 layers) |
| Parameters | 1.88B |
| CPT data | ~0.6B tokens, English only |
| Hardware | 8× NVIDIA RTX PRO 6000 Blackwell (102GB each) |
| Optimizer | AdamW (lr=2e-6, β=(0.9, 0.95), weight decay=0.1) |
| Schedule | Warmup-Stable-Decay (1% warmup, 85% stable, cosine decay) |
| Sequence length | 4,096 tokens |
| Effective batch | 256 sequences/step (~1M tokens/step) |
| Training | ~1 epoch |
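The batch and schedule settings above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `cpt_train_v2.py`: the total is taken as exactly 600M tokens, the cosine phase is assumed to cover the remaining 14% of steps, and the final learning rate is assumed to be zero (the card does not give a minimum). The function name `wsd_lr` is hypothetical.

```python
import math

PEAK_LR = 2e-6
SEQ_LEN = 4096
BATCH_SEQS = 256
TOTAL_TOKENS = 600_000_000  # "~0.6B tokens"; exact value is an assumption

tokens_per_step = SEQ_LEN * BATCH_SEQS          # 1,048,576, i.e. ~1M tokens/step
total_steps = TOTAL_TOKENS // tokens_per_step   # roughly one epoch


def wsd_lr(step, total_steps, peak_lr=PEAK_LR,
           warmup_frac=0.01, stable_frac=0.85, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    stable_end = warmup_steps + round(total_steps * stable_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    decay_steps = max(1, total_steps - stable_end)
    progress = min(1.0, (step - stable_end) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"{tokens_per_step:,} tokens/step, {total_steps} steps")
for step in (0, 5, 100, 492, 532, total_steps):
    print(step, f"{wsd_lr(step, total_steps):.2e}")
```

With these numbers the run is only a few hundred optimizer steps long, so the 1% warmup corresponds to a handful of steps.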
Data Pipeline
Medical data was processed through a quality enrichment pipeline using Qwen3-32B:
- PubMed/arXiv: Filtered for medical relevance, short abstracts expanded to multi-paragraph explanations, LaTeX stripped and classified
- Patient records: Deidentified EHR notes, chunked with 256-token overlap for long documents
- Quality control: Automated classification (keep/drop), noise filtering, deduplication
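The overlapping chunking of long patient records can be sketched as a sliding window over token IDs. The 256-token overlap comes from the description above; the 4,096-token chunk length is an assumption (it matches the training sequence length but is not stated for the pipeline), and `chunk_with_overlap` is a hypothetical helper, not a function from `enrich_data.py`.

```python
def chunk_with_overlap(token_ids, chunk_len=4096, overlap=256):
    """Split a token sequence into chunks that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_len])
        if start + chunk_len >= len(token_ids):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


# Demo with integer stand-ins for token IDs.
doc = list(range(10_000))
chunks = chunk_with_overlap(doc)
print([len(c) for c in chunks])
```

Each chunk begins with the last 256 tokens of the previous one, so context that straddles a boundary appears intact in at least one chunk.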
How to Use
As a base for SFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load as a base model for your own fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "NDIJayant/Neuro-Qwen3.5-2B-base",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NDIJayant/Neuro-Qwen3.5-2B-base")

# This is a BASE model: use it as a starting point for SFT.
# It will not follow instructions without fine-tuning.
```
For text embeddings / perplexity scoring
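```python
import torch

# Score medical text likelihood (reuses `model` and `tokenizer` loaded above)
text = "The patient presented with acute onset left hemiparesis..."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Passing the input IDs as labels makes the model return the mean
# next-token cross-entropy; exponentiating it gives the perplexity.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs.input_ids)
perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```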
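The snippet above scores a single passage. To score a held-out set, one common convention is to weight each document's mean loss by its number of predicted tokens before exponentiating. The card does not state how its domain perplexities were aggregated, so this is a sketch of one reasonable choice, with the hypothetical helper `corpus_perplexity` operating on (mean loss, token count) pairs collected from calls like the one above.

```python
import math


def corpus_perplexity(doc_stats):
    """Token-weighted perplexity from (mean_nll, n_predicted_tokens) pairs.

    For a causal LM, the mean loss of a document with T input tokens is
    averaged over T - 1 predicted tokens, so pass T - 1 as the count.
    """
    total_nll = sum(mean_nll * n_tokens for mean_nll, n_tokens in doc_stats)
    total_tokens = sum(n_tokens for _, n_tokens in doc_stats)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return math.exp(total_nll / total_tokens)


# Made-up losses for illustration only; these are not results from the model.
stats = [(1.0, 10), (2.0, 30)]
ppl = corpus_perplexity(stats)
print(f"Corpus perplexity: {ppl:.2f}")
```

Averaging per-document perplexities instead would give short documents the same weight as long ones, which is why the token-weighted form is used in this sketch.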
Limitations
- English only: all training data is English
- Not a clinical tool: for research purposes only, must not be used for clinical decision-making
- Base model: does not follow instructions without SFT
- Single-institution EHR: may reflect institution-specific clinical patterns
- Modest CPT scale: 0.6B tokens is a starting point; a larger budget (for example 3-5B tokens) would likely yield larger gains
- No RLHF/DPO: base CPT only, no alignment training
Reproducing This Work
Training code, data enrichment pipeline, and evaluation scripts are available at: [github-link]
Key scripts:
- cpt_train_v2.py: continued pre-training with multi-GPU DDP
- enrich_data.py: data quality pipeline using Qwen3-32B
- eval_qwen.py: lm-evaluation-harness wrapper
- eval_reasoning_judge.py: GPT-judged reasoning evaluation
Citation
@misc{neuro-qwen-2026,
title={Neuro-Qwen3.5-2B-base: Domain-Adapted Base Model for Medical NLP},
author={Jayant},
year={2026},
url={https://huggingface.co/your-username/neuro-qwen3.5-2b-base}
}
Acknowledgments
Built on Qwen3.5-2B-Base by Alibaba Cloud. Data enrichment powered by Qwen3-32B. Evaluation using lm-evaluation-harness and GPT-5.2.