LocalScore - Local LLM Benchmark

LocalScore is an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results. We created LocalScore to provide a simple, portable way to evaluate computer performance across various LLMs while making it easy to share and browse hardware performance data.

We believe strongly in the power of local AI systems, especially as smaller models become more capable. At the same time, we expect the hardware needed to run these models to keep getting faster and cheaper. Together, these trends create an opportunity for accessible, private AI systems, and we hope LocalScore will help you navigate that landscape.

Check out the website: https://localscore.ai

This repo contains the 'official models' for LocalScore; benchmarking with these models will get you and your GPU on the leaderboard if you choose to submit your results.

                        Tiny        Small       Medium
# Params                1B          8B          14B
Model Family            Llama 3.2   Llama 3.1   Qwen 2.5
Quantization            Q4_K_M      Q4_K_M      Q4_K_M
Approx. VRAM Required   2GB         6GB         10GB

To run LocalScore, download any of the models from this repo:

Linux

wget https://huggingface.co/Mozilla/LocalScore/resolve/main/localscore-tiny-1b
chmod +x localscore-tiny-1b
./localscore-tiny-1b
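
The same binary can also benchmark a GGUF model you already have on disk. The -m flag below follows the llamafile/llama.cpp convention and is an assumption here; confirm the exact option via the binary's built-in help before relying on it.

# -m is assumed here (llamafile/llama.cpp convention); verify with the built-in help
./localscore-tiny-1b -m /path/to/your-model.gguf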

Windows

  1. Download localscore-tiny-1b
  2. Change the filename to localscore-tiny-1b.exe
  3. Open cmd.exe and run localscore-tiny-1b.exe
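
The same steps as a single cmd.exe session, assuming curl is available (it ships with Windows 10 and later); downloading straight to the .exe name covers the rename step:

curl -L -o localscore-tiny-1b.exe https://huggingface.co/Mozilla/LocalScore/resolve/main/localscore-tiny-1b
localscore-tiny-1b.exe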

What is a LocalScore?

A LocalScore is a single measure built from three key metrics that matter for local LLM performance:

  1. Prompt Processing Speed: How quickly your system processes input text (tokens per second)
  2. Generation Speed: How fast your system generates new text (tokens per second)
  3. Time to First Token: The latency before the first response appears (milliseconds)

These metrics are combined into a single LocalScore, which gives you a straightforward way to compare different hardware configurations. A score of 1,000 is excellent, 250 is passable, and below 100 will likely mean a poor user experience in some regard.
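
The exact weighting behind the single number isn't spelled out here. As an illustration only, a geometric mean of the three metrics, with time to first token inverted so that lower latency scores higher, is one way such a combined figure can be produced; the numbers below are hypothetical:

# Illustration only, not the official LocalScore formula.
# pp = prompt processing (tokens/s), tg = generation (tokens/s), ttft = time to first token (ms)
awk 'BEGIN { pp = 1000; tg = 40; ttft = 500; print (pp * tg * (1000 / ttft)) ^ (1/3) }'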

Under the hood, LocalScore leverages Llamafile to ensure portability across different systems, making benchmarking accessible regardless of your setup.

The Tests

The tests were designed to provide a realistic picture of how models will perform in everyday use. Rather than measuring raw prompt processing and generation speeds in isolation, they emulate the kinds of tasks users will actually be doing with these models. Below is a list of the tests we run and some of the use cases they are meant to emulate.

Test Name        Prompt Tokens   Generated Tokens   Sample Use Cases
pp1024+tg16      1024            16                 Classification, sentiment analysis, keyword extraction.
pp4096+tg256     4096            256                Long document Q&A, RAG, short summary of extensive text.
pp2048+tg256     2048            256                Article summarization, contextual paragraph generation.
pp2048+tg768     2048            768                Drafting detailed replies, multi-paragraph generation, content sections.
pp1024+tg1024    1024            1024               Balanced Q&A, content drafting, code generation based on a long sample.
pp1280+tg3072    1280            3072               Complex reasoning, chain-of-thought, long-form creative writing, code generation.
pp384+tg1152     384             1152               Prompt expansion, explanation generation, creative writing, code generation.
pp64+tg1024      64              1024               Short-prompt creative generation (poetry/story), Q&A, code generation.
pp16+tg1536      16              1536               Creative writing/storytelling, Q&A, code generation.

For more, check out the LocalScore website: https://localscore.ai
