Instructions for using NotaMG/eqaq-v2 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use NotaMG/eqaq-v2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NotaMG/eqaq-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare expression shows nothing when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NotaMG/eqaq-v2")
model = AutoModelForCausalLM.from_pretrained("NotaMG/eqaq-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use NotaMG/eqaq-v2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NotaMG/eqaq-v2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NotaMG/eqaq-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/NotaMG/eqaq-v2

SGLang

How to use NotaMG/eqaq-v2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NotaMG/eqaq-v2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NotaMG/eqaq-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NotaMG/eqaq-v2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NotaMG/eqaq-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use NotaMG/eqaq-v2 with Docker Model Runner:
```
docker model run hf.co/NotaMG/eqaq-v2
```

EQAQ v2

EQAQ v2 packages the EQC target model, a text-only AWQ-quantized Qwen3.5 4B served with SGLang, together with the EAGLE3 draft models used in the local speculative decoding experiments.

Repository layout:

.
|-- config.json
|-- model-00001-of-00001.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- chat_template.jinja
`-- drafts/
    |-- q028-fast-sglangcompat/
    `-- q004-chatthink-sglangcompat/

The root model is the target model. The draft directories are EAGLE3 draft models for SGLang speculative decoding and are not standalone target models.

Expected Performance

These numbers are local measurements from the EQC competition protocol harness, not an official leaderboard score. The official submission uploaded successfully, but the evaluation job failed before scoring because the service could not provision the requested ML compute capacity.

Recommended route setup for the measured run:

Target model: repository root AWQ model
Latency, MMLU-Pro, IFEval draft: drafts/q028-fast-sglangcompat
GPQA/thinking draft: drafts/q004-chatthink-sglangcompat
SGLang speculative decoding: EAGLE3, speculative-num-steps=10, speculative-eagle-topk=2, speculative-num-draft-tokens=20

Local latency

Measured with the EQC latency request shape: /v1/completions, logical batch size 1, 5 warmup runs, 50 measurement runs per category.
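
The measurement shape described above can be sketched as follows. This is a minimal illustration, not the EQC harness itself: `post_completion` and `median_latency_ms` are hypothetical helpers, the request body fields (`model`, `prompt`, `max_tokens`) assume the usual OpenAI-compatible completions schema, and how the harness builds prompts of exactly 64, 2048, or 8192 tokens is not reproduced here.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, Dict, List

WARMUP_RUNS = 5    # warmup runs per category, from the text above
MEASURE_RUNS = 50  # measurement runs per category, from the text above


def post_completion(base_url: str, model: str, prompt: str, max_tokens: int) -> Dict:
    """Send one /v1/completions request and return the decoded JSON reply.

    Hypothetical helper; the payload fields are an assumption.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def median_latency_ms(
    send: Callable[[], object],
    warmup: int = WARMUP_RUNS,
    runs: int = MEASURE_RUNS,
) -> float:
    """Call `send` one request at a time and return the median latency in ms."""
    for _ in range(warmup):
        send()  # warmup timings are discarded
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        send()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

With a server listening on port 30000 as in the SGLang snippets, a category could be timed with something like `median_latency_ms(lambda: post_completion("http://localhost:30000", "NotaMG/eqaq-v2", prompt, 128))`, where `prompt` is supplied by the caller.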

The speedup below is computed against a target-only run measured on the same local machine, not against the fixed baseline constants embedded in the EQC protocol harness.

| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
|---|---|---|---|---|
| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |

Average local speedup was 3.10x, computed as the ratio of the averaged category medians (1601.14 ms / 517.31 ms). The older 9.41x figure instead divides the average of the EQC harness fixed baseline constants (2582/5441/6576 ms) by the same 517.31 ms average, so it should not be interpreted as a speedup over a baseline measured on this machine.
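
The arithmetic behind these figures can be reproduced with a short script. All numbers are copied from this section; the variable names are illustrative only.

```python
from statistics import mean

# Median latencies in milliseconds, copied from the latency table.
target_only = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
eqaq_v2 = {"short": 228.87, "medium": 475.62, "long": 847.43}
# Fixed baseline constants embedded in the EQC harness, as quoted in the text.
fixed_baseline = {"short": 2582.0, "medium": 5441.0, "long": 6576.0}

# Per-category speedup: target-only median divided by EQAQ v2 median.
per_category = {k: target_only[k] / eqaq_v2[k] for k in target_only}

# Average speedup as a ratio of averaged medians, not a mean of ratios.
avg_target = mean(target_only.values())
avg_eqaq = mean(eqaq_v2.values())
local_speedup = avg_target / avg_eqaq

# The older figure uses the fixed constants as the numerator instead.
fixed_ratio = mean(fixed_baseline.values()) / avg_eqaq

for name, value in per_category.items():
    print(f"{name}: {value:.2f}x")
print(f"local speedup: {local_speedup:.2f}x")         # 3.10x
print(f"fixed-constant ratio: {fixed_ratio:.2f}x")    # 9.41x
```

Because the average is a ratio of averaged medians, the long category, which has the largest absolute latencies, weighs more heavily than it would in a plain mean of the three per-category speedups.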

A submission-aligned smoke run with a more conservative single-image setup measured about 4.39x against the same fixed protocol constants over 3 runs per category; it is included only as a packaging/protocol smoke result, not as the local target-only speedup.

Baseline caveat: the target-only (no speculative decoding) SGLang server crashed on the default piecewise CUDA graph path (NoneType mrope_positions). The local target-only baseline was therefore measured with --disable-piecewise-cuda-graph, keeping the same target model, endpoint, prompt/token protocol, CUDA graph batch sizes, and core SGLang serving options.

The speculative accept rate observed in the active local SGLang run was low, roughly 6% over recent decode batches. The latency gain should therefore be read as the combined effect of SGLang serving settings, CUDA graphs, and speculative decoding rather than of high draft acceptance alone.

Local quality

Measured in the same local full protocol run:

| Benchmark | Metric | Score | Gate |
|---|---|---|---|
| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |

The local run passed the latency gate and MMLU-Pro, but did not pass the full quality gate because IFEval was slightly below threshold and GPQA-Diamond was substantially below threshold. Treat this package as a speed-oriented EQC artifact, not a confirmed quality-passing competition submission.
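
The gate comparison can be checked mechanically. The sketch below copies the scores and gates from the quality table and assumes that a benchmark passes when its score is greater than or equal to its gate; the exact comparison used by the EQC harness is not stated here, and `gate_report` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (score, gate) pairs copied from the quality table.
results: Dict[str, Tuple[float, float]] = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def gate_report(scores: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[bool, float]]:
    """Return {benchmark: (passed, margin)}, assuming pass means score >= gate."""
    return {
        name: (score >= gate, score - gate)
        for name, (score, gate) in scores.items()
    }


report = gate_report(results)
for name, (passed, margin) in report.items():
    status = "pass" if passed else "fail"
    print(f"{name}: {status} (margin {margin:+.4f})")

# The full quality gate requires every benchmark to pass.
all_passed = all(passed for passed, _ in report.values())
print("full quality gate:", "pass" if all_passed else "fail")
```

Under that assumption, only MMLU-Pro clears its gate; IFEval misses by about 0.0034 and GPQA-Diamond by about 0.2007.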

Expected SGLang usage shape:

python -m sglang.launch_server \
  --model-path <local-snapshot-of-this-repo> \
  --tokenizer-path <local-snapshot-of-this-repo> \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <local-snapshot-of-this-repo>/drafts/q028-fast-sglangcompat \
  --speculative-draft-model-quantization unquant \
  --speculative-num-steps 10 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 20
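
The same command can be assembled per route from Python, which makes it easier to switch between the two drafts. This is a sketch only: `launch_args` and the route keys in `DRAFT_FOR_ROUTE` are hypothetical names, the mapping follows the recommended route setup listed earlier, and every flag is copied from the command above rather than checked against a specific SGLang release.

```python
import sys
from pathlib import Path
from typing import List

# Route-to-draft mapping from the recommended route setup.
# The route keys themselves are illustrative names, not harness identifiers.
DRAFT_FOR_ROUTE = {
    "latency": "q028-fast-sglangcompat",
    "mmlu-pro": "q028-fast-sglangcompat",
    "ifeval": "q028-fast-sglangcompat",
    "gpqa": "q004-chatthink-sglangcompat",
}


def launch_args(snapshot: str, route: str) -> List[str]:
    """Build the sglang.launch_server argument list for one route."""
    root = Path(snapshot)  # local snapshot of this repository
    draft = root / "drafts" / DRAFT_FOR_ROUTE[route]
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", str(root),
        "--tokenizer-path", str(root),
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", str(draft),
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]
```

The resulting list can be passed to `subprocess.run(launch_args(snapshot_dir, "gpqa"), check=True)`, where `snapshot_dir` is wherever the repository was downloaded. An unknown route raises `KeyError` rather than silently falling back to a default draft.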

Local source artifacts:

Target: /home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat
q028 draft: /home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat
q004 draft: /home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat

Safetensors
Model size: 4B params
Tensor type: I64, I32, BF16