LocalScore - Local LLM Benchmark

LocalScore is an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results. We created LocalScore to provide a simple, portable way to evaluate computer performance across various LLMs while making it easy to share and browse hardware performance data.

We believe strongly in the power of local AI systems, especially as smaller models become more capable. At the same time, we expect the hardware needed to run these models to keep getting faster and cheaper. Together, these trends create an opportunity for accessible, private AI systems, and we hope LocalScore will help you navigate that landscape.

Check out the website: https://localscore.ai

This repo contains the 'official models' for LocalScore; benchmarking with these models will get you and your GPU on the leaderboard if you choose to submit your results.

                        Tiny        Small       Medium
# Params                1B          8B          14B
Model Family            Llama 3.2   Llama 3.1   Qwen 2.5
Quantization            Q4_K_M      Q4_K_M      Q4_K_M
Approx. VRAM Required   2GB         6GB         10GB

To run LocalScore, download any of the models from this repo:

Linux

wget https://huggingface.co/Mozilla/LocalScore/resolve/main/localscore-tiny-1b
chmod +x localscore-tiny-1b
./localscore-tiny-1b
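
The same binary can also benchmark a GGUF model you already have on disk. The -m flag below follows the llamafile/llama.cpp convention and is an assumption here; confirm the exact option via the binary's built-in help before relying on it.

# -m is assumed here (llamafile/llama.cpp convention); verify with the built-in help
./localscore-tiny-1b -m /path/to/your-model.gguf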

Windows

  1. Download localscore-tiny-1b
  2. Change the filename to localscore-tiny-1b.exe
  3. Open cmd.exe and run localscore-tiny-1b.exe
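
The same steps as a single cmd.exe session, assuming curl is available (it ships with Windows 10 and later); downloading straight to the .exe name covers the rename step:

curl -L -o localscore-tiny-1b.exe https://huggingface.co/Mozilla/LocalScore/resolve/main/localscore-tiny-1b
localscore-tiny-1b.exe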

What is a LocalScore?

A LocalScore is a single measure built from three key metrics that matter for local LLM performance:

  1. Prompt Processing Speed: How quickly your system processes input text (tokens per second)
  2. Generation Speed: How fast your system generates new text (tokens per second)
  3. Time to First Token: The latency before the first response appears (milliseconds)

These metrics are combined into a single LocalScore, which gives you a straightforward way to compare different hardware configurations. A score of 1,000 is excellent, 250 is passable, and below 100 will likely mean a poor user experience in some regard.
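
The exact weighting behind the single number isn't spelled out here. As an illustration only, a geometric mean of the three metrics, with time to first token inverted so that lower latency scores higher, is one way such a combined figure can be produced; the numbers below are hypothetical:

# Illustration only, not the official LocalScore formula.
# pp = prompt processing (tokens/s), tg = generation (tokens/s), ttft = time to first token (ms)
awk 'BEGIN { pp = 1000; tg = 40; ttft = 500; print (pp * tg * (1000 / ttft)) ^ (1/3) }'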

Under the hood, LocalScore leverages Llamafile to ensure portability across different systems, making benchmarking accessible regardless of your setup.

The Tests

The tests were designed to provide a realistic picture of how models will perform in everyday use. Rather than measuring raw prompt processing and generation speeds in isolation, they emulate the kinds of tasks users will actually be doing with these models. Below is a list of the tests we run and some of the use cases they are meant to emulate.

Test Name        Prompt Tokens   Generated Tokens   Sample Use Cases
pp1024+tg16      1024            16                 Classification, sentiment analysis, keyword extraction.
pp4096+tg256     4096            256                Long document Q&A, RAG, short summary of extensive text.
pp2048+tg256     2048            256                Article summarization, contextual paragraph generation.
pp2048+tg768     2048            768                Drafting detailed replies, multi-paragraph generation, content sections.
pp1024+tg1024    1024            1024               Balanced Q&A, content drafting, code generation based on a long sample.
pp1280+tg3072    1280            3072               Complex reasoning, chain-of-thought, long-form creative writing, code generation.
pp384+tg1152     384             1152               Prompt expansion, explanation generation, creative writing, code generation.
pp64+tg1024      64              1024               Short-prompt creative generation (poetry/story), Q&A, code generation.
pp16+tg1536      16              1536               Creative writing/storytelling, Q&A, code generation.

For more, check out the LocalScore website: https://localscore.ai
