Update README.md

7225bc2 verified 4 days ago

7.3 kB

	---
	license: apache-2.0
	language:
	- en
	tags:
	- prompt-routing
	- complexity-classifier
	- deberta-v3
	- llm-router
	- cost-optimization
	datasets:
	- RowRed/ComplexityRouter
	- OpenAssistant/oasst2
	base_model:
	- microsoft/deberta-v3-base
	---

	# ComplexityRouter: A Complexity based LLM Router

	Introducing ComplexityRouter, a lightweight prompt complexity classifier finetuned from microsoft/deberta-v3-base. Using prompts from [Open Assistant Conversations Dataset Release 2 (OASST2)](https://huggingface.co/datasets/OpenAssistant/oasst2), some from myself, and some made more complex by Qwen3.5-4B (non-thinking mode), I created a synthetic dataset classifying 4,400 of the prompts using [Qwen3.5-4B Non Thinking Mode](https://huggingface.co/Qwen/Qwen3.5-4B).

	It assigns prompts to one of 4 complexity levels, making it useful for routing queries to the appropriate LLM tier.

	## Model Details

	### Model Description

	- Model type: Text Classification (multi‑class)
	- Language: English
	- Backbone: microsoft/deberta-v3-base
	- License: Apache‑2.0
	- Finetuned from model: microsoft/deberta-v3-base
	- Training data: OASST2 + synthetic augmentations + manually created prompts
	Labels generated by Qwen3.5‑4B (non‑thinking mode).

	### Model Sources

	- Dataset repository: https://huggingface.co/datasets/RowRed/ComplexityRouter

	## Uses

	### Direct Use

	Route prompts to appropriate LLM tiers based on predicted complexity:

	\| Level \| Meaning \| Suggested LLM Tier \|
	\|-------\|---------\|-------------------\|
	\| 0 (Trivial) \| Simple lookups, basic Q&A (e.g., “What is 2+2?”) \| Fast/cheap local model \|
	\| 1 (Simple) \| Moderate reasoning, basic domain knowledge \| Mid‑tier model \|
	\| 2 (Moderate) \| Complex reasoning, deep knowledge required \| Strong model \|
	\| 3 (Complex) \| Very complex reasoning, niche expertise \| Frontier API model \|

	Recommended routing strategy: Group levels 0 and 1 together (fast/cheap tier), level 2 as standard, level 3 as premium. The model achieves 93.0% adjacent accuracy on my test, meaning it rarely misroutes by more than one tier.

	### Out‑of‑Scope Use

	- Multi‑turn conversation routing (single prompts only).
	- Non‑English prompts (training data was English‑only).
	- Prompts requiring image or multimodal understanding.

	## Bias, Risks, and Limitations

	- Training data is synthetic and may not represent all real‑world prompt distributions.
	- Level 1 (Simple) and Level 2 (Moderate) have lower per‑class F1 scores – boundary cases are inherently ambiguous.
	- The model may struggle with very domain‑specific technical jargon.
	- Performance may degrade on prompts that are very different from the training distribution.

	## Notice
	This is my first attempt making a widespread finetune. There are probably lots of issues, but thought the idea was sound. I might make a second (hopefully better) version eventually, but am not sure where to get lots of high-quality open source data.

	## Training Details
	### Training Data

	\| Split \| Samples \| Source File \| Notes \|
	\|-------------\|---------\|----------------------\|-------\|
	\| Training \| 2,800 \| TRAINING.jsonl \| Used for model training \|
	\| Validation \| 600 \| TRAINING.jsonl \| Used for early stopping / hyperparameter tuning \|
	\| Test (internal) \| 600 \| TRAINING.jsonl \| Used for in‑distribution evaluation \|
	\| Test (held‑out) \| 400 \| TEST.jsonl \| Fully independent test set (reported results) \|

	Total unique prompts: 4,400

	Class distribution (training):
	Level 0: 762 (27.2%) • Level 1: 674 (24.1%) • Level 2: 795 (28.4%) • Level 3: 569 (20.3%)

	## Training Procedure
	- Hardware: NVIDIA T4 (16 GB VRAM, Google Colab)
	- Framework: PyTorch 2.11 + Hugging Face Transformers
	- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
	- Scheduler: Linear warmup (10% of steps) → linear decay
	- Loss: Weighted Cross‑Entropy (classification) + MSE (regression)
	- Batch size: 16 (effective 32 with gradient accumulation)
	- Epochs: 7 (early stopping patience = 3)
	- Training time: ~18 minutes
	- Class balancing: sqrt‑scaled class weights + weighted random sampler

	## Evaluation Results

	Reported on 600 held‑out samples from TRAINING.jsonl (internal test).
	\|Metric\|Value\|
	\|----\|----\|
	\|Exact Match Accuracy\|64.5%\|
	\|Adjacent (±1) Accuracy\|93.0%\|
	\|Macro F1\|0.663\|
	\|Weighted F1\|0.653\|

	Per‑Class Performance (internal test, 600 samples)
	\|Level\|Precision\|Recall\|F1\|Support\|
	\|----\|----\|----\|----\|----\|
	\|L0 (Trivial)\|0.658\|0.626\|0.642\|163\|
	\|L1 (Simple)\|0.457\|0.628\|0.529\|145\|
	\|L2 (Moderate)\|0.683\|0.571\|0.622\|170\|
	\|L3 (Complex)\|0.933\|0.795\|0.858\|122\|

	Confusion Matrix (internal test, 600 samples)

	\| \|Pred L0\|Pred L1\|Pred L2\|Pred L3\|
	\|----\|----\|----\|----\|----\|
	\|True L0\|102\|46\|13\|2\|
	\|True L1\|35\|91\|18\|1\|
	\|True L2\|15\|54\|97\|4\|
	\|True L3\|3\|8\|14\|97\|

	## How to Get Started with the Model

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch
	import torch.nn as nn

	class PromptComplexityRouter(nn.Module):
	def __init__(self, backbone="microsoft/deberta-v3-base", num_labels=4):
	super().__init__()
	self.backbone = AutoModel.from_pretrained(backbone)
	hidden_size = self.backbone.config.hidden_size
	self.classifier = nn.Sequential(
	nn.Dropout(0.1),
	nn.Linear(hidden_size, 256),
	nn.GELU(),
	nn.Dropout(0.1),
	nn.Linear(256, num_labels),
	)

	def forward(self, input_ids, attention_mask):
	outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
	cls_output = outputs.last_hidden_state[:, 0, :]
	return self.classifier(cls_output)

	# Load
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	tokenizer = AutoTokenizer.from_pretrained("RowRed/ComplexityRouter")
	model = PromptComplexityRouter()
	model.load_state_dict(
	torch.load("pytorch_model.bin", map_location=device),
	strict=False
	)
	model.to(device)
	model.eval()

	# Predict
	prompts = ["What is 2+2?", "Explain quantum entanglement in detail."]
	encoded = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(device)
	with torch.no_grad():
	logits = model(encoded["input_ids"], encoded["attention_mask"])
	probs = torch.softmax(logits, dim=-1)
	predictions = torch.argmax(probs, dim=-1)

	for prompt, level in zip(prompts, predictions):
	print(f"Level {level.item()}: {prompt}")
	```

	## Citation
	If you use this model, please cite:

	```bibtex
	@software{ComplexityRouter,
	author = {RowRed},
	title = {ComplexityRouter},
	year = {2026},
	url = {https://huggingface.co/RowRed/ComplexityRouter}
	}
	```

	Additionally, acknowledge the base dataset and labeling model:

	```bibtex
	@dataset{oasst2,
	author = {OpenAssistant Contributors},
	title = {Open Assistant Conversations Dataset Release 2},
	year = {2023},
	url = {https://huggingface.co/datasets/OpenAssistant/oasst2}
	}

	@software{qwen3.5-4b,
	author = {Qwen Team},
	title = {Qwen3.5-4B},
	year = {2026},
	url = {https://huggingface.co/Qwen/Qwen3.5-4B}
	}
	```

	## License
	This model is released under Apache‑2.0.
	The backbone (microsoft/deberta-v3-base) is MIT‑licensed.
	The training dataset is derived from OASST2 (Apache‑2.0) and Qwen3.5‑4B outputs (Apache‑2.0).