EQAQ v2
EQAQ v2 is the EQC Qwen3.5 4B text-only AWQ target model package used with SGLang, plus the EAGLE3 draft models used in the local speculative decoding experiments.
Repository layout:
.
|-- config.json
|-- model-00001-of-00001.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- chat_template.jinja
`-- drafts/
    |-- q028-fast-sglangcompat/
    `-- q004-chatthink-sglangcompat/
The root model is the target model. The draft directories are EAGLE3 draft models for SGLang speculative decoding and are not standalone target models.
Expected Performance
These numbers are local measurements from the EQC competition protocol harness, not an official leaderboard score. The official submission uploaded successfully, but the evaluation job failed before scoring because the service could not provision the requested ML compute capacity.
Recommended route setup for the measured run:
- Target model: repository root AWQ model
- Latency, MMLU-Pro, IFEval draft: drafts/q028-fast-sglangcompat
- GPQA/thinking draft: drafts/q004-chatthink-sglangcompat
- SGLang speculative decoding: EAGLE3, speculative-num-steps=10, speculative-eagle-topk=2, speculative-num-draft-tokens=20
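The route setup above can be written down as a small lookup table. This is only a sketch: the workload keys and the `draft_for` helper are hypothetical names chosen for illustration, and only the draft directory names and EAGLE3 settings are taken from this README.

```python
# Hypothetical routing table. The workload keys and helper name are
# illustrative; the draft directories and EAGLE3 values come from the README.
DRAFT_BY_WORKLOAD = {
    "latency": "drafts/q028-fast-sglangcompat",
    "mmlu_pro": "drafts/q028-fast-sglangcompat",
    "ifeval": "drafts/q028-fast-sglangcompat",
    "gpqa": "drafts/q004-chatthink-sglangcompat",
}

EAGLE3_SETTINGS = {
    "speculative-algorithm": "EAGLE3",
    "speculative-num-steps": 10,
    "speculative-eagle-topk": 2,
    "speculative-num-draft-tokens": 20,
}


def draft_for(workload: str) -> str:
    """Return the draft directory (relative to the repo root) for a workload."""
    try:
        return DRAFT_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no draft route configured for {workload!r}") from None


print(draft_for("gpqa"))
```

Keeping the mapping in one place makes it harder to accidentally serve the GPQA/thinking route with the fast draft.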
Local latency
Measured with the EQC latency request shape: /v1/completions, logical batch
size 1, 5 warmup runs, 50 measurement runs per category.
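A minimal sketch of that measurement shape is shown here, assuming requests are sent one at a time and that the reported figure is the median wall-clock time. The `median_latency_ms` helper and its `send_request` callback are hypothetical and are not the EQC harness itself.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    send_request: Callable[[], object],
    warmup_runs: int = 5,
    measured_runs: int = 50,
) -> float:
    """Time sequential calls (logical batch size 1) and return the median in ms."""
    for _ in range(warmup_runs):
        send_request()  # warmup timings are discarded

    samples = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        send_request()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Usage with a stand-in request; a real run would POST to /v1/completions.
print(median_latency_ms(lambda: time.sleep(0.001), warmup_runs=1, measured_runs=3))
```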
The speedup below is computed against a target-only run measured on the same local machine, not against the fixed baseline constants embedded in the EQC protocol harness.
| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
|---|---|---|---|---|
| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |
Average local speedup was 3.10x, computed as the ratio of the averaged
category medians (1601.14 ms / 517.31 ms). The older 9.41x figure uses the
EQC harness fixed baseline constants (2582/5441/6576 ms) as the numerator
instead of the measured target-only medians, so it should not be interpreted
as a speedup over a baseline measured on this machine.
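The arithmetic behind these figures can be reproduced from the table. The sketch below assumes the 9.41x figure uses the same average-of-medians formula with the fixed constants as the numerator, which matches the reported value. The variable names are illustrative.

```python
from statistics import mean

# Medians in ms, copied from the latency table.
target_only = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
eqaq_v2 = {"short": 228.87, "medium": 475.62, "long": 847.43}
# Fixed baseline constants from the EQC protocol harness, in ms.
fixed_baseline = {"short": 2582.0, "medium": 5441.0, "long": 6576.0}

# Per-category speedup: target-only median divided by EQAQ v2 median.
per_category = {k: target_only[k] / eqaq_v2[k] for k in target_only}

# Average speedup: ratio of the averaged medians, not the mean of the ratios.
local_speedup = mean(target_only.values()) / mean(eqaq_v2.values())
fixed_speedup = mean(fixed_baseline.values()) / mean(eqaq_v2.values())

for name, ratio in per_category.items():
    print(f"{name}: {ratio:.2f}x")
print(f"local average: {local_speedup:.2f}x")
print(f"vs fixed constants: {fixed_speedup:.2f}x")
```

Note that the ratio of averaged medians weights the long category more heavily than a mean of the three per-category ratios would.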
A submission-aligned smoke run with a more conservative single-image setup measured about 4.39x against the same fixed protocol constants over 3 runs per category; it is included only as a packaging/protocol smoke result, not as the local target-only speedup.
Baseline caveat: the target-only no-spec SGLang server crashed with the default
piecewise CUDA graph path (NoneType mrope_positions), so the local
target-only baseline was measured with --disable-piecewise-cuda-graph while
keeping the same target model, endpoint, prompt/token protocol, CUDA graph
batch sizes, and core SGLang serving options.
Observed speculative accept rate in the active local SGLang run was low, roughly 6% over recent decode batches, so the latency gain should be understood as the combined effect of SGLang serving settings, CUDA graph, and speculative decoding rather than high draft acceptance alone.
Local quality
Measured in the same local full protocol run:
| Benchmark | Metric | Score | Gate |
|---|---|---|---|
| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |
The local run passed the latency gate and MMLU-Pro, but did not pass the full quality gate because IFEval was slightly below threshold and GPQA-Diamond was substantially below threshold. Treat this package as a speed-oriented EQC artifact, not a confirmed quality-passing competition submission.
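The gate comparison can be checked directly from the table. This sketch assumes a benchmark passes when its score is greater than or equal to its gate. The exact comparison rule in the EQC harness is not stated here, and `gate_report` is a hypothetical helper.

```python
# (score, gate) pairs copied from the quality table.
results = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def gate_report(results):
    """Return {benchmark: (passed, margin)} with margin = score - gate."""
    return {
        name: (score >= gate, round(score - gate, 4))
        for name, (score, gate) in results.items()
    }


report = gate_report(results)
all_passed = all(passed for passed, _ in report.values())

for name, (passed, margin) in report.items():
    print(f"{name}: {'pass' if passed else 'fail'} ({margin:+.4f})")
print("full quality gate:", "pass" if all_passed else "fail")
```

Under that assumption, IFEval misses its gate by 0.0034 and GPQA-Diamond by 0.2007.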
Expected SGLang usage shape:
python -m sglang.launch_server \
--model-path <local-snapshot-of-this-repo> \
--tokenizer-path <local-snapshot-of-this-repo> \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path <local-snapshot-of-this-repo>/drafts/q028-fast-sglangcompat \
--speculative-draft-model-quantization unquant \
--speculative-num-steps 10 \
--speculative-eagle-topk 2 \
--speculative-num-draft-tokens 20
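To switch drafts without hand-editing paths, the same command can be assembled programmatically. This is a sketch only: `build_launch_command` is a hypothetical helper, the snapshot path in the usage line is a placeholder, and the flags are exactly the ones listed in the command above.

```python
import shlex
from pathlib import Path


def build_launch_command(
    snapshot_dir: str, draft_name: str = "q028-fast-sglangcompat"
) -> list[str]:
    """Assemble the SGLang launch argv for a local snapshot of this repo."""
    root = Path(snapshot_dir)
    draft = root / "drafts" / draft_name  # drafts live under the repo root
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", str(root),
        "--tokenizer-path", str(root),
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", str(draft),
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]


# Placeholder path; substitute the real local snapshot directory.
cmd = build_launch_command("/path/to/snapshot", "q004-chatthink-sglangcompat")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues with the snapshot path.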
Local source artifacts:
- Target: /home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat
- q028 draft: /home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat
- q004 draft: /home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat