Token Generation (TG) Benchmarks on OnePlus 13
There is a discrepancy between Qualcomm's reported SOTA speed of 18 t/s for Llama 2 (3.5 GB) and the CPU versions: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized
TODO:
- Benchmark QNN Llama 2 locally
- Benchmark T-MAC with group size 128 if needed
- Test OpenCL and the available QNN pull requests; assess the feasibility of speculative decoding alongside CPU inference (see the sketch after this list)
- Overclock RAM with a Magisk module
- Potentially check quantization standards: MLC (before the regression), ExecuTorch, QNN
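A rough sketch of the speculative-decoding test, assuming llama.cpp's llama-speculative example is used; the binary name, flags, and model paths are assumptions and may differ across llama.cpp versions:

```sh
# Sketch only: a small draft model (-md) proposes tokens that the larger target
# model (-m) verifies; both run on the CPU. Model paths are placeholders.
./llama-speculative -m llama-2-7b.Q4_0.gguf -md llama-2-draft.Q4_0.gguf \
  -p "hi" -n 200 -t 6
```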
Model Benchmarks
Llama 2
Quantization | Benchmark 1 (200 tokens), t/s | Benchmark 2 (50 tokens), t/s |
---|---|---|
Q4_0 (Pure) | 12.76 | 13.22 |
Q4_0 (Normal) | 12.54 | 13.03 |
Test Command:
-p hi -t 6 -s 42 -c 512 -n (200,50) -m llama2
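For reference, a minimal sketch of how the Pure vs. Normal Q4_0 variants above can be produced and run with llama.cpp; binary names and model paths are assumptions, and -n is 200 for Benchmark 1 and 50 for Benchmark 2:

```sh
# "Normal" Q4_0 keeps llama.cpp's default tensor-type mixture; --pure forces
# every tensor to Q4_0. Input/output paths are placeholders.
./llama-quantize        llama-2-7b-f16.gguf llama-2-7b.Q4_0.gguf      Q4_0
./llama-quantize --pure llama-2-7b-f16.gguf llama-2-7b.Q4_0-pure.gguf Q4_0

# Benchmark run matching the flags listed above (-n 200 or -n 50).
./llama-cli -m llama-2-7b.Q4_0.gguf -p "hi" -t 6 -s 42 -c 512 -n 200
```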
Llama 3
Quantization | Benchmark 1 (200 tokens), t/s | Benchmark 2 (50 tokens), t/s |
---|---|---|
Q4_0 (Pure) | 11.54 | 11.91 |
Reka-Flash 21B Benchmarks Q4_0 (Normal)
Test Configuration | Tokens | Result (t/s) |
---|---|---|
Benchmark 1 | 200 | 4.46 |
Benchmark 2 | 50 | 4.45 |
Intermediate Sizes
Model Architecture | Intermediate Size |
---|---|
Llama 2 7B | 11,008 |
Llama 3.2 3B | 8,192 |
Llama 3 8B | 14,336 |
Qwen 2.5 7B | 18,944 |
Qwen 2.5 14B | 13,824 |
QwQ 32B | 27,648 |
Reka-Flash 21B | 19,648 |
Mistral 2503 | 32,768 |
Codestral 22B | 16,384 |
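A quick way to double-check these values, assuming the models are on the Hugging Face Hub, is to read intermediate_size from each repo's config.json (repo IDs below are examples; gated repos need an Authorization header):

```sh
# Prints intermediate_size for a few example repos; requires curl and jq.
for repo in Qwen/Qwen2.5-7B Qwen/Qwen2.5-14B; do
  printf '%-24s ' "$repo"
  curl -sL "https://huggingface.co/$repo/raw/main/config.json" | jq .intermediate_size
done
```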
llama.cpp Q4_K_M scheme vs. T-MAC inference (group size 128?) on x86
Model | Size | Params | Backend | Threads | Test | t/s (tokens/sec) |
---|---|---|---|---|---|---|
qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | pp512 | 67.33 ± 0.10 |
qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | tg128 | 22.72 ± 0.04 |
qwen2 ?B INT_N Q4_K | 1.70 GiB | 3.40 B | CPU | 4 | pp512 | 59.66 ± 0.10 |
qwen2 ?B INT_N Q4_K | 1.70 GiB | 3.40 B | CPU | 4 | tg128 | 26.43 ± 0.14 |
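The rows above follow the standard llama-bench output format; a sketch of the assumed invocation (model path is a placeholder):

```sh
# 4 CPU threads, 512-token prompt processing (pp512) and 128-token generation (tg128).
./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -t 4 -p 512 -n 128
```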
INT_N is not an exact equivalent, so this is not a fully fair comparison; still, it is 16.3% faster (tg128) and 13% smaller in this scenario.
AutoGPTQ is used, and by default it uses a group size of 128, making it lower-bpw and smaller than llama.cpp's Q4_K_M. https://qwen.readthedocs.io/en/latest/quantization/gptq.html
- The K-quant series isn't optimized for efficiency; it is aimed at quality.
- Q4_0 uses hardware-accelerated dot-product instructions, quantizing the intermediate activations on the fly to match the quantized weights (see the CPU-feature check below).
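As a sanity check that the phone's cores actually expose those instructions, one can look for the Arm dot-product and int8-matmul feature flags; a minimal sketch, assuming an aarch64 Android/Linux shell:

```sh
# "asimddp" = Armv8.2 SDOT/UDOT dot product; "i8mm" = int8 matrix-multiply extension.
grep -o -E 'asimddp|i8mm' /proc/cpuinfo | sort -u
```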
Converted Llama 2 7B and ran the 8B (QNN)
- There is a problem with 7B inference; I tried several older versions too. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).
- The next option is benchmarking Llama 3 8B, which is larger.
- In the running-services view, 4.9 GB of memory is in use during inference (including the 4096-token cache).
- The result is 13.6 t/s for the no-context prompt "What is the capital of France?", but the context cache is still processed normally, and a larger cache will add latency.
- The observed size on disk is 4.8 GiB (5,121,998,280 bytes), which matches the size listed on Qualcomm's website.
- Llama 3 8B is 1.37x larger than Llama 2 7B on disk, so the 7B should be roughly 1.37x faster: 1.37 × 13.6 ≈ 18.6 t/s, in line with Qualcomm's claimed 18 t/s. Since we can infer this, there is no need to run the 7B (see the estimate below).
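A minimal sketch of that extrapolation, using the on-disk sizes quoted above (it assumes token-generation speed scales inversely with model size):

```sh
awk 'BEGIN {
  size_7b = 3744635480;   # bytes, Llama 2 7B bundle (from above)
  size_8b = 5121998280;   # bytes, Llama 3 8B bundle (from above)
  tg_8b   = 13.6;         # measured t/s on the 8B
  ratio = size_8b / size_7b;
  printf "size ratio %.2fx -> estimated 7B speed %.1f t/s\n", ratio, tg_8b * ratio
}'
```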