TG Benchmarks on OnePlus 13

There is a discrepancy between Qualcomm's SOTA 18 t/s Llama 2 (3.5 GB) figure and the CPU numbers below: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized

TODO:

  • benchmark QNN Llama 2 locally
  • benchmark T-MAC with group size 128 if needed
  • test OpenCL and the available QNN pull requests; assess the feasibility of speculative decoding alongside CPU inference
  • overclock the RAM with a Magisk module
  • potentially check quantization conventions in other stacks: MLC (before the regression), ExecuTorch, QNN

Model Benchmarks

Llama 2

| Quantization | Benchmark 1 (-n 200), t/s | Benchmark 2 (-n 50), t/s |
|---|---|---|
| Q4_0 (Pure) | 12.76 | 13.22 |
| Q4_0 (Normal) | 12.54 | 13.03 |

Test Command:

-p hi -t 6 -s 42 -c 512 -n (200,50) -m llama2
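A minimal sketch of how the two benchmark columns can be driven, assuming llama.cpp's llama-cli binary; the binary and model paths below are placeholders, not the exact setup used.

```python
import subprocess

# Placeholder paths: point these at the actual llama-cli build and model file.
LLAMA_CLI = "./llama-cli"
MODEL = "llama-2-7b.Q4_0.gguf"

# One run per benchmark column: -n 200 (Benchmark 1) and -n 50 (Benchmark 2).
for n_predict in (200, 50):
    subprocess.run(
        [
            LLAMA_CLI,
            "-p", "hi",            # prompt
            "-t", "6",             # threads
            "-s", "42",            # seed
            "-c", "512",           # context size
            "-n", str(n_predict),  # tokens to generate
            "-m", MODEL,
        ],
        check=True,
    )
```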

Llama 3

| Quantization | Benchmark 1 (-n 200), t/s | Benchmark 2 (-n 50), t/s |
|---|---|---|
| Q4_0 (Pure) | 11.54 | 11.91 |

Reka-Flash 21B Benchmarks Q4_0 (Normal)

| Test Configuration | Tokens (-n) | Result (t/s) |
|---|---|---|
| Benchmark 1 | 200 | 4.46 |
| Benchmark 2 | 50 | 4.45 |

Intermediate Sizes

| Model Architecture | Intermediate Size |
|---|---|
| Llama 2 7B | 11,008 |
| Llama 3.2 3B | 8,192 |
| Llama 3 8B | 14,336 |
| Qwen 2.5 7B | 18,944 |
| Qwen 2.5 14B | 13,824 |
| QwQ 32B | 27,648 |
| Reka-Flash 21B | 19,648 |
| Mistral 2503 | 32,768 |
| Codestral 22B | 16,384 |
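The intermediate sizes above come from each model's config.json (the intermediate_size field) on the Hugging Face Hub; a quick sketch, assuming the transformers library and example repo IDs that are not necessarily the exact checkpoints benchmarked here.

```python
from transformers import AutoConfig

# Example (ungated) repo IDs; swap in the exact checkpoints of interest.
repos = ["Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen2.5-14B-Instruct"]

for repo in repos:
    cfg = AutoConfig.from_pretrained(repo)
    # intermediate_size is the FFN width, which dominates per-token compute.
    print(repo, cfg.intermediate_size)
```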

llama.cpp Q4_K_M vs. T-MAC INT_N inference (group size 128?) on x86

| Model | Size | Params | Backend | Threads | Test | t/s |
|---|---|---|---|---|---|---|
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | tg128 | 22.72 ± 0.04 |
| qwen2 ?B INT_N Q4_K | 1.70 GiB | 3.40 B | CPU | 4 | pp512 | 59.66 ± 0.10 |
| qwen2 ?B INT_N Q4_K | 1.70 GiB | 3.40 B | CPU | 4 | tg128 | 26.43 ± 0.14 |

INT_N isn't an exact equivalent of Q4_K_M, so this isn't a strictly fair comparison; still, it is 16.3% faster on tg128 and about 13% smaller in this scenario.
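For reference, those percentages fall out of the tg128 rows and file sizes above:

```python
# Values taken from the table above (tg128 rows and file sizes).
q4km_tps, intn_tps = 22.72, 26.43
q4km_gib, intn_gib = 1.95, 1.70

speedup = intn_tps / q4km_tps - 1   # ~0.163 -> 16.3% faster token generation
shrink = 1 - intn_gib / q4km_gib    # ~0.128 -> ~13% smaller on disk
print(f"{speedup:.1%} faster, {shrink:.1%} smaller")
```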

AutoGPTQ is used; by default it quantizes with a group size of 128, which gives fewer bits per weight and a smaller file than llama.cpp's Q4_K_M. https://qwen.readthedocs.io/en/latest/quantization/gptq.html
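For comparison, a minimal sketch of a 4-bit, group-size-128 GPTQ quantization via transformers' GPTQConfig; the model ID and calibration dataset are placeholders, not the exact recipe from the Qwen docs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# group_size=128 is the AutoGPTQ default: one scale/zero point per 128 weights,
# which is what makes the result smaller (lower bpw) than llama.cpp's Q4_K_M.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("qwen2.5-3b-gptq-int4")
```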

  • The K-quant series isn't optimized for speed; it is designed for quality.
  • Q4_0 uses hardware-accelerated dot-product instructions, with intermediate activations quantized on the fly to match the quantized weights (see the sketch below).
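On the second point: Q4_0 stores weights in blocks of 32 with one fp16 scale, and at run time the activations are quantized per block to int8 (Q8_0) so the dot product can run on integer SIMD (e.g. Arm sdot). A simplified numpy illustration of the idea, not the actual ggml kernels:

```python
import numpy as np

BLOCK = 32  # Q4_0 weights and Q8_0 activations both use 32-element blocks

def quantize_q4_0(w):
    """Blockwise 4-bit weight quantization (simplified Q4_0-style)."""
    w = w.reshape(-1, BLOCK)
    d = np.abs(w).max(axis=1, keepdims=True) / 7.0           # per-block scale
    q = np.clip(np.round(w / d), -8, 7).astype(np.int8)      # 4-bit range
    return q, d

def quantize_q8_0(x):
    """On-the-fly int8 activation quantization (simplified Q8_0-style)."""
    x = x.reshape(-1, BLOCK)
    s = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return q, s

def dot_q4_q8(wq, wd, xq, xs):
    """Integer multiply-accumulate per block, rescaled by the two block scales."""
    acc = (wq.astype(np.int32) * xq.astype(np.int32)).sum(axis=1, keepdims=True)
    return float((acc * wd * xs).sum())

w = np.random.randn(4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
wq, wd = quantize_q4_0(w)
xq, xs = quantize_q8_0(x)
print(dot_q4_q8(wq, wd, xq, xs), float(w @ x))  # the two values are close
```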

Converted Llama 2 7B and ran the 8B

  • There is a problem with inference on the converted Llama 2 7B; I tried several older versions as well. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).
  • The next option is benchmarking Llama 3 8B, which is larger.
  • Running services show 4.9 GB used during inference (including the 4096-token cache).
  • The result is 13.6 t/s for the no-context prompt "What is the capital of France?"; the context is still processed as usual, and a larger cache will add latency.
  • The observed size on disk is 4.8 GiB (5,121,998,280 bytes), which matches the size listed on Qualcomm's website.
  • Llama 3 8B is 1.37x larger than Llama 2 7B, so the 7B should be roughly 1.37x faster: 13.6 × 1.37 ≈ 18.6 t/s, in line with Qualcomm's listed 18 t/s. Since we can infer this, there is no need to run the 7B (see the check below).
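The size ratio and the extrapolation, checked from the byte counts above:

```python
llama2_7b_bytes = 3_744_635_480   # ~3.5 GiB on disk
llama3_8b_bytes = 5_121_998_280   # ~4.8 GiB on disk

ratio = llama3_8b_bytes / llama2_7b_bytes   # ~1.37
tg_8b = 13.6                                # measured t/s on Llama 3 8B
tg_7b_est = tg_8b * ratio                   # ~18.6 t/s, near the listed 18 t/s
print(f"size ratio {ratio:.2f}x -> estimated 7B speed {tg_7b_est:.1f} t/s")
```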