language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3.5-2B-Base
tags:
- medical
- neurology
- domain-adaptation
- continued-pretraining
- qwen3.5
datasets:
- pubmed
- arxiv
pipeline_tag: text-generation
model-index:
- name: Neuro-Qwen3.5-2B-base

Neuro-Qwen3.5-2B: Medical Domain-Adapted Base Model

A 2B-parameter base model adapted for neurological and medical text through continued pre-training of Qwen3.5-2B-Base. It is designed as a foundation for medical NLP: apply your own SFT or fine-tuning on top.

Model Description

Neuro-Qwen3.5-2B-base is Qwen/Qwen3.5-2B-Base with continued pre-training (CPT) on ~0.6B tokens of English medical text:

| Source | Proportion | Description |
|---|---|---|
| Patient EHR records | 45% | Deidentified clinical notes from hospital EHR system |
| arXiv neuroscience | 25% | Neuroscience research papers (LaTeX cleaned, classified, chunked) |
| PubMed neurology | 15% | Neurology and neuroscience abstracts (filtered + enriched) |
| General medical | 12% | PubMed medical abstracts + medical textbook excerpts |
| English regularizer | 3% | General English text to prevent catastrophic forgetting |
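
To make the mixture concrete, the proportions can be converted into an approximate token budget per source. This is only a sketch: it assumes the rounded ~0.6B total and the rounded percentages given here, and `token_budget` is an illustrative helper, not part of the released code.

```python
# Approximate token budget per source.
# Assumption: ~0.6B total tokens and the rounded mixture proportions listed above.
TOTAL_TOKENS = 0.6e9

MIXTURE = {
    "Patient EHR records": 0.45,
    "arXiv neuroscience": 0.25,
    "PubMed neurology": 0.15,
    "General medical": 0.12,
    "English regularizer": 0.03,
}


def token_budget(total_tokens, mixture):
    """Return the approximate number of tokens drawn from each source."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    return {name: round(total_tokens * share) for name, share in mixture.items()}


budget = token_budget(TOTAL_TOKENS, MIXTURE)
for name, tokens in budget.items():
    print(f"{name:<22} {tokens / 1e6:6.0f}M tokens")
```

Under these assumptions the EHR notes contribute roughly 270M tokens and the general-English regularizer roughly 18M.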

This is a base model: it does not follow instructions out of the box. Apply SFT for chat or instruction-following capabilities.

When to Use This Model

  • As a starting point for medical SFT: after identical SFT, it scores higher on medical MCQ benchmarks than stock Qwen3.5-2B-Base (see Evaluation)
  • For medical text embeddings, NER, summarization: lower perplexity on clinical text than the stock base model
  • For research on domain adaptation of hybrid attention architectures
  • When you need a small (2B) medical-aware model that fits on a single GPU

Evaluation

We evaluate along three axes: (1) domain perplexity to confirm the CPT shifted the distribution toward medical text, (2) MCQ accuracy to measure downstream task performance, and (3) GPT-judged reasoning quality to assess clinical reasoning depth.

Three model configurations are compared throughout:

| Label | Description |
|---|---|
| Qwen3.5-2B (Instruct) | Official instruction-tuned release from Alibaba, the off-the-shelf baseline |
| Base+SFT | Stock Qwen3.5-2B-Base fine-tuned with our medical SFT recipe |
| CPT+SFT | Our CPT model fine-tuned with the same medical SFT recipe |

Comparing all three tells us: (a) how our final model stacks up against the official Instruct release, and (b) how much of the gain comes specifically from CPT (CPT+SFT vs Base+SFT with identical SFT).


1. Domain Perplexity (lower = better fit)

Measured on held-out documents not seen during training:

| Domain | Qwen3.5-2B-Base | Ours (CPT) | Relative change |
|---|---|---|---|
| Patient EHR | 5.03 | 4.50 | -10.5% |
| PubMed Neurology | 3.13 | 3.03 | -3.2% |
| arXiv Neuroscience | 12.21 | 12.07 | -1.1% |
| English (general) | 8.62 | 8.63 | +0.1% |

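CPT lowers perplexity on all three medical domains, most strongly on patient EHR notes (-10.5%) and more modestly on PubMed neurology (-3.2%) and arXiv neuroscience (-1.1%), while general English perplexity is essentially unchanged (+0.1%).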
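
The relative changes follow directly from the two perplexity columns. A small sketch of that arithmetic, using only the values reported in this section (`relative_change_pct` is an illustrative helper name):

```python
# Perplexities copied from the table in this section: (stock base, after CPT).
PERPLEXITY = {
    "Patient EHR": (5.03, 4.50),
    "PubMed Neurology": (3.13, 3.03),
    "arXiv Neuroscience": (12.21, 12.07),
    "English (general)": (8.62, 8.63),
}


def relative_change_pct(before, after):
    """Relative change in percent; negative means perplexity went down."""
    return 100.0 * (after - before) / before


for domain, (base, cpt) in PERPLEXITY.items():
    print(f"{domain:<20} {relative_change_pct(base, cpt):+.1f}%")
```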


2. MCQ Accuracy (lm-evaluation-harness)

We compare our CPT+SFT model against both the official Instruct release and a Base+SFT control that uses the same SFT data without CPT. This lets us separately assess absolute performance (vs Instruct) and the isolated CPT contribution (vs Base+SFT).

| Benchmark | Qwen3.5-2B (Instruct) | Base+SFT | CPT+SFT | CPT Δ (pp) |
|---|---|---|---|---|
| PubMedQA | 73.2% | 75.2% ±1% | 77.0% ±1% | +1.8 |
| MMLU Clinical Knowledge | 59.6% | 63.8% ±1% | 65.4% ±1% | +1.6 |
| MMLU Anatomy | 57.8% | 62.2% ±1% | 65.7% ±1% | +3.5 |
| MMLU College Medicine | 61.3% | 64.7% ±1% | 67.3% ±1% | +2.6 |
| MMLU Professional Medicine | 61.0% | 58.8% ±1% | 62.3% ±1% | +3.5 |
| MMLU Medical Genetics | 74.0% | 76.0% ±1% | 80.0% ±1% | +4.0 |
| MMLU College Biology | 72.2% | 75.0% ±1% | 77.7% ±1% | +2.7 |
| MMLU Virology | 46.4% | 50.0% ±1% | 52.0% ±1% | +2.0 |
| MMLU Nutrition | 67.0% | 69.6% ±1% | 71.6% ±1% | +2.0 |
| Medical Average | 63.6% | 66.1% ±1% | 68.8% ±1% | +2.7 |

CPT Δ = CPT+SFT minus Base+SFT, in percentage points (isolates the CPT contribution with identical SFT).
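
As a cross-check, the per-benchmark deltas and the unweighted averages can be recomputed from the reported accuracies. The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# Accuracies in percent, copied from the table: (Instruct, Base+SFT, CPT+SFT).
SCORES = {
    "PubMedQA": (73.2, 75.2, 77.0),
    "MMLU Clinical Knowledge": (59.6, 63.8, 65.4),
    "MMLU Anatomy": (57.8, 62.2, 65.7),
    "MMLU College Medicine": (61.3, 64.7, 67.3),
    "MMLU Professional Medicine": (61.0, 58.8, 62.3),
    "MMLU Medical Genetics": (74.0, 76.0, 80.0),
    "MMLU College Biology": (72.2, 75.0, 77.7),
    "MMLU Virology": (46.4, 50.0, 52.0),
    "MMLU Nutrition": (67.0, 69.6, 71.6),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# CPT delta in percentage points: CPT+SFT minus Base+SFT.
cpt_delta = {name: cpt - base for name, (_, base, cpt) in SCORES.items()}

# Unweighted (macro) average per model configuration.
averages = [mean(row[i] for row in SCORES.values()) for i in range(3)]

# Benchmarks where the Base+SFT control falls below the Instruct baseline.
base_below_instruct = [
    name for name, (inst, base, _) in SCORES.items() if base < inst
]

for name, delta in cpt_delta.items():
    print(f"{name:<28} {delta:+.1f} pp")
print("Averages (Instruct, Base+SFT, CPT+SFT):", [round(a, 1) for a in averages])
print("Base+SFT below Instruct on:", base_below_instruct)
```

The mean of the unrounded per-benchmark deltas is about 2.6 pp; the +2.7 in the table is the difference between the two rounded averages.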

Both SFT variants beat the official Instruct model on the medical average (66.1% and 68.8% vs 63.6%), confirming the value of targeted medical SFT. The one exception is MMLU Professional Medicine, where Base+SFT (58.8%) trails Instruct (61.0%). On top of that, CPT adds 1.6 to 4.0 pp on every benchmark (+2.7 pp on the medical average), with the largest gains on knowledge-intensive benchmarks, namely Medical Genetics (+4.0 pp), Anatomy (+3.5 pp), and Professional Medicine (+3.5 pp), where deeper domain knowledge matters most.

General Reasoning (no catastrophic forgetting)

| Benchmark | Qwen3.5-2B (Instruct) | CPT+SFT |
|---|---|---|
| HellaSwag | 62.3% | 66.6% |
| ARC-Challenge | 41.4% | 47.3% |
| WinoGrande | 63.5% | 65.7% |
| BoolQ | 71.7% | 74.2% |

No degradation on general benchmarks: CPT+SFT exceeds the Instruct model on all four tasks, by 2.2 to 5.9 pp.


3. Reasoning Quality (GPT-5.2 Blind Evaluation)

To assess clinical reasoning depth beyond MCQ accuracy, we use GPT-5.2 as a blind judge. Here we compare CPT+SFT vs Base+SFT directly: since both share the same SFT, any difference in reasoning quality is attributable to CPT alone.

We sampled 200 questions from MedMCQA (with gold explanations), PubMedQA (with long answers), and the MedQA test split. Full responses, including <think> reasoning, were evaluated. All comparisons are blind and position-randomized.

Dimension Scores (1–5 scale)

| Dimension | Base+SFT | CPT+SFT |
|---|---|---|
| Diagnostic Reasoning | 2.98 | 3.05 |
| Clinical Knowledge | 3.09 | 3.22 |
| Management Quality | 2.36 | 2.47 |
| Reasoning Chain | 2.69 | 2.83 |
| Overall | 2.86 | 2.98 |

CPT+SFT outscores Base+SFT on every dimension, with the largest gains in Reasoning Chain (+0.14) and Clinical Knowledge (+0.13). This is consistent with CPT injecting deeper medical knowledge that surfaces during chain-of-thought reasoning, although the margins are small (0.07 to 0.14 points on a 5-point scale).

Pairwise Preference (blind, position-randomized)

| Judge | CPT+SFT wins | Base+SFT wins | Tie |
|---|---|---|---|
| GPT-5.2 | 52% | 38% | 10% |
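
Position randomization means the judge sees the two answers in a random order, and its verdict is mapped back to the model that produced each answer. The following is a minimal sketch of that idea, not the code in eval_reasoning_judge.py: `judge_pairwise`, `tally`, and the stand-in `longer_answer_judge` are hypothetical names, and a real run would replace the stand-in with a call to the judge model.

```python
import random
from collections import Counter


def judge_pairwise(question, answer_a, answer_b, judge, rng):
    """Show two answers in random order and map the verdict back to 'a'/'b'/'tie'."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)  # must be "first", "second", or "tie"
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # When swapped, the first slot holds answer B; otherwise it holds answer A.
    return "b" if picked_first == swapped else "a"


def tally(items, judge, seed=0):
    """Count wins per side over (question, answer_a, answer_b) triples."""
    rng = random.Random(seed)
    return Counter(judge_pairwise(q, a, b, judge, rng) for q, a, b in items)


def longer_answer_judge(question, first, second):
    """Stand-in for the real judge (assumption): prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


items = [
    ("q1", "a detailed answer", "short"),
    ("q2", "brief", "a much longer answer"),
    ("q3", "same", "same"),
]
counts = tally(items, longer_answer_judge)
print(counts)
```

Because the stand-in judge ignores position, the tally is the same for every seed; with a real judge, randomizing the order is what keeps any position bias from systematically favoring one model. Excluding ties, the reported split corresponds to 52 / (52 + 38), or about 58% of decided comparisons, in favor of CPT+SFT.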

Training Details

| Setting | Value |
|---|---|
| Base model | Qwen3.5-2B-Base |
| Architecture | Qwen3.5 (hybrid linear + full attention, 24 layers) |
| Parameters | 1.88B |
| CPT data | ~0.6B tokens, English only |
| Hardware | 8× NVIDIA RTX PRO 6000 Blackwell (102GB each) |
| Optimizer | AdamW (lr=2e-6, β=(0.9, 0.95), weight decay=0.1) |
| Schedule | Warmup-Stable-Decay (1% warmup, 85% stable, cosine decay) |
| Sequence length | 4,096 tokens |
| Effective batch | 256 sequences/step (~1M tokens/step) |
| Training | ~1 epoch |
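
The schedule and batch figures above imply a fairly short run. The sketch below works out the step count and a Warmup-Stable-Decay learning-rate curve under stated assumptions: linear warmup, the remaining ~14% of steps used for the cosine decay, and decay to zero. None of these three details is given above, and `wsd_lr` is an illustrative function, not the one in cpt_train_v2.py.

```python
import math

PEAK_LR = 2e-6
WARMUP_FRAC = 0.01   # 1% warmup
STABLE_FRAC = 0.85   # 85% at peak learning rate; the rest is decay (assumed)

SEQ_LEN = 4096
SEQS_PER_STEP = 256
TOKENS_PER_STEP = SEQ_LEN * SEQS_PER_STEP           # 1,048,576, i.e. ~1M
TOTAL_STEPS = int(0.6e9 // TOKENS_PER_STEP)         # ~0.6B tokens, ~1 epoch


def wsd_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Learning rate at a given step under a Warmup-Stable-Decay schedule."""
    warmup_steps = max(1, round(WARMUP_FRAC * total_steps))
    decay_start = warmup_steps + round(STABLE_FRAC * total_steps)
    if step < warmup_steps:
        # Linear warmup (assumption).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - decay_start)
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step: {TOKENS_PER_STEP:,}  total steps: {TOTAL_STEPS}")
for step in (0, 5, 100, 492, 532, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {wsd_lr(step, TOTAL_STEPS):.2e}")
```

With these figures the run is on the order of 570 optimizer steps, so the warmup phase covers only a handful of steps.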

Data Pipeline

Medical data was processed through a quality enrichment pipeline using Qwen3-32B:

  • PubMed/arXiv: Filtered for medical relevance; short abstracts expanded into multi-paragraph explanations; arXiv papers stripped of LaTeX markup, then classified and chunked
  • Patient records: Deidentified EHR notes, chunked with 256-token overlap for long documents
  • Quality control: Automated classification (keep/drop), noise filtering, deduplication
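
Chunking long notes with overlap can be sketched as a sliding window over token IDs. The 256-token overlap comes from the list above; the 4,096-token window is an assumption that matches the training sequence length, and `chunk_with_overlap` is a hypothetical helper rather than the function used in enrich_data.py.

```python
def chunk_with_overlap(token_ids, max_len=4096, overlap=256):
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that context spanning a
    chunk boundary appears in both chunks.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Example with a synthetic 10,000-token document.
doc = list(range(10_000))
chunks = chunk_with_overlap(doc)
print([len(c) for c in chunks])
```

Documents shorter than the window come back as a single chunk, so only long notes are affected by the overlap.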

How to Use

As a base for SFT

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load as base model for your own fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "NDIJayant/Neuro-Qwen3.5-2B-base",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NDIJayant/Neuro-Qwen3.5-2B-base")

# This is a BASE model: use it as a starting point for SFT.
# It will not follow instructions without fine-tuning.

For text embeddings / perplexity scoring

import torch

# Score medical text likelihood (reuses `model` and `tokenizer` loaded above)
text = "The patient presented with acute onset left hemiparesis..."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Passing the input IDs as labels returns the mean next-token cross-entropy
    outputs = model(**inputs, labels=inputs.input_ids)
    perplexity = torch.exp(outputs.loss)
    print(f"Perplexity: {perplexity.item():.2f}")
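
When scoring many documents, per-document perplexities are usually not averaged directly. One common convention, shown here as an assumption rather than as the method behind the reported numbers, is to weight each document's mean loss by its number of predicted tokens and exponentiate once at the end. `corpus_perplexity` is an illustrative helper; for a causal LM the predicted-token count is the sequence length minus one.

```python
import math


def corpus_perplexity(doc_stats):
    """Token-weighted perplexity over a corpus.

    doc_stats: iterable of (mean_loss, n_predicted_tokens) pairs, where
    mean_loss is the per-token cross-entropy (in nats) for one document.
    """
    doc_stats = list(doc_stats)
    total_tokens = sum(n for _, n in doc_stats)
    if total_tokens <= 0:
        raise ValueError("need at least one predicted token")
    total_nll = sum(loss * n for loss, n in doc_stats)
    return math.exp(total_nll / total_tokens)


# Toy numbers for illustration only (not measurements from this model).
stats = [(math.log(2.0), 1), (math.log(8.0), 1)]
print(f"Corpus perplexity: {corpus_perplexity(stats):.2f}")
```

In practice each `(mean_loss, n_predicted_tokens)` pair would come from `outputs.loss.item()` and `inputs.input_ids.shape[1] - 1` in the snippet above.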

Limitations

  • English only: all training data is English
  • Not a clinical tool: for research purposes only, must not be used for clinical decision-making
  • Base model: does not follow instructions without SFT
  • Single-institution EHR: may reflect institution-specific clinical patterns
  • Modest CPT scale: 0.6B tokens is a starting point; we expect 3-5B tokens would yield larger gains, but have not tested this
  • No RLHF/DPO: base CPT only, no alignment training

Reproducing This Work

Training code, data enrichment pipeline, and evaluation scripts are available at: [github-link]

Key scripts:

  • cpt_train_v2.py: continued pre-training with multi-GPU DDP
  • enrich_data.py: data quality pipeline using Qwen3-32B
  • eval_qwen.py: lm-evaluation-harness wrapper
  • eval_reasoning_judge.py: GPT-judged reasoning evaluation

Citation

@misc{neuro-qwen-2026,
  title={Neuro-Qwen3.5-2B-base: Domain-Adapted Base Model for Medical NLP},
  author={Jayant},
  year={2026},
  url={https://huggingface.co/your-username/neuro-qwen3.5-2b-base}
}

Acknowledgments

Built on Qwen3.5-2B-Base by Alibaba Cloud. Data enrichment powered by Qwen3-32B. Evaluation using lm-evaluation-harness and GPT-5.2.
