# LocalScore - Local LLM Benchmark
LocalScore is an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results. We created LocalScore to provide a simple, portable way to evaluate computer performance across various LLMs while making it easy to share and browse hardware performance data.
We believe strongly in the power of local AI systems, especially as smaller models become more capable. We also expect the hardware needed to run these models to keep getting faster and cheaper. Together, these trends create an opportunity for accessible and private AI systems, and we hope LocalScore will help you navigate that landscape.
Check out the website: https://localscore.ai
This repo contains the 'official models' for LocalScore; running one of them will get you and your GPU on the leaderboard if you choose to submit your results.
| | Tiny | Small | Medium |
|---|---|---|---|
| # Params | 1B | 8B | 14B |
| Model Family | Llama 3.2 | Llama 3.1 | Qwen 2.5 |
| Quantization | Q4_K_M | Q4_K_M | Q4_K_M |
| Approx. VRAM Required | 2 GB | 6 GB | 10 GB |
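The VRAM figures above are approximations. As a rough sanity check, Q4_K_M quantization averages around 4.8 bits per weight, so you can estimate a model's footprint from its parameter count. The sketch below is an illustrative back-of-the-envelope calculation, not the exact accounting LocalScore uses; the bits-per-weight and overhead constants are assumptions:

```python
# Back-of-the-envelope VRAM estimate for a Q4_K_M-quantized model.
# The bits-per-weight and overhead values are rough assumptions for
# illustration, not figures taken from LocalScore itself.

def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = 4.8,   # Q4_K_M averages ~4.8 bpw
                     overhead_gb: float = 1.0) -> float:  # KV cache, buffers (assumed)
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, params in [("Tiny", 1), ("Small", 8), ("Medium", 14)]:
    print(f"{name}: ~{estimate_vram_gb(params):.1f} GB")
```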
To run LocalScore, download any of the models from this repo. For example, using the tiny model:
### Linux

```sh
wget https://huggingface.co/Mozilla/LocalScore/resolve/main/localscore-tiny-1b
chmod +x localscore-tiny-1b
./localscore-tiny-1b
```
### Windows

- Download `localscore-tiny-1b`
- Rename the file to `localscore-tiny-1b.exe`
- Open cmd.exe and run `localscore-tiny-1b.exe`
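If you'd rather script the download, the same files can be fetched from the Hub with the `huggingface_hub` Python client. This is a minimal sketch for Linux/macOS; the small and medium filenames are assumed to follow the tiny build's naming pattern:

```python
# Fetch a LocalScore build from the Hugging Face Hub and mark it executable
# (Linux/macOS). Filenames beyond localscore-tiny-1b are assumed to follow
# the same naming pattern shown in the table above.
import os
import stat
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Mozilla/LocalScore", filename="localscore-tiny-1b")
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)  # equivalent of chmod +x
print(f"Run the benchmark with: {path}")
```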
## What is a LocalScore?
A LocalScore is a single number built from three key metrics that matter for local LLM performance:
- Prompt Processing Speed: How quickly your system processes input text (tokens per second)
- Generation Speed: How fast your system generates new text (tokens per second)
- Time to First Token: The latency before the first response appears (milliseconds)
These metrics are combined into a single LocalScore, which gives you a straightforward way to compare different hardware configurations. A score of 1,000 is excellent, 250 is passable, and below 100 will likely mean a poor user experience.
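To make the combination concrete, here is an illustrative sketch using a geometric mean, which rewards balanced systems and penalizes any single weak metric. This is an assumption for illustration, not the official formula; the authoritative computation lives in the CLI source linked at the bottom of this page. The `scale` constant and the latency inversion are both assumptions:

```python
# Illustrative combination of the three metrics via a geometric mean.
# NOT the official LocalScore formula -- see the CLI source linked below.
# Time to first token is inverted so that lower latency raises the score.

def combined_score(prompt_tps: float, gen_tps: float, ttft_ms: float,
                   scale: float = 10.0) -> float:  # scale is an arbitrary assumption
    responsiveness = 1000.0 / ttft_ms   # lower latency -> larger value
    return scale * (prompt_tps * gen_tps * responsiveness) ** (1 / 3)

# Hypothetical system: 1500 t/s prompt processing, 50 t/s generation, 150 ms TTFT.
print(round(combined_score(1500, 50, 150)))  # ~794
```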
Under the hood, LocalScore leverages Llamafile to ensure portability across different systems, making benchmarking accessible regardless of your setup.
## The Tests
The tests were designed to provide a realistic picture of how models will perform in everyday use. Instead of measuring raw prompt processing and generation speeds in isolation, we wanted to emulate the kinds of tasks users will actually perform with these models. Below is a list of the tests we run and some of the use cases they are meant to emulate.
| Test Name | Prompt Tokens | Generated Tokens | Sample Use Cases |
|---|---|---|---|
| pp1024+tg16 | 1024 | 16 | Classification, sentiment analysis, keyword extraction |
| pp4096+tg256 | 4096 | 256 | Long document Q&A, RAG, short summary of extensive text |
| pp2048+tg256 | 2048 | 256 | Article summarization, contextual paragraph generation |
| pp2048+tg768 | 2048 | 768 | Drafting detailed replies, multi-paragraph generation, content sections |
| pp1024+tg1024 | 1024 | 1024 | Balanced Q&A, content drafting, code generation from a long sample |
| pp1280+tg3072 | 1280 | 3072 | Complex reasoning, chain-of-thought, long-form creative writing, code generation |
| pp384+tg1152 | 384 | 1152 | Prompt expansion, explanation generation, creative writing, code generation |
| pp64+tg1024 | 64 | 1024 | Short-prompt creative generation (poetry/story), Q&A, code generation |
| pp16+tg1536 | 16 | 1536 | Creative writing/storytelling, Q&A, code generation |
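To see how the three metrics translate into wall-clock time for these workloads, here is a rough sketch that estimates how long each test takes on a given system. It is an approximation that ignores warm-up and per-token overhead, and the example speeds are invented:

```python
# Rough wall-clock estimate for a pp+tg test given measured speeds.
# Ignores warm-up and per-token overhead; the example speeds are invented.

def estimated_seconds(prompt_tokens: int, gen_tokens: int,
                      prompt_tps: float, gen_tps: float) -> float:
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps

tests = [("pp1024+tg16", 1024, 16),
         ("pp4096+tg256", 4096, 256),
         ("pp16+tg1536", 16, 1536)]
for name, pp, tg in tests:
    t = estimated_seconds(pp, tg, prompt_tps=1500, gen_tps=50)
    print(f"{name}: ~{t:.1f} s")
```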
For more, check out:
- Website: https://localscore.ai
- Demo video: https://youtu.be/De6pA1bQsHU
- Blog post: https://localscore.ai/blog
- CLI GitHub: https://github.com/Mozilla-Ocho/llamafile/tree/main/localscore
- Website GitHub: https://github.com/cjpais/localscore