Small Language Model enthusiasts and GPU-poor OSS enjoyers, let's connect! I just created an organization whose main goal is to have fun with smaller models that are tunable on consumer-range GPUs. Feel free to join and let's have some fun, much love ;3
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥
How? By combining step-wise reward models with tree search algorithms :)
We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"
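To make the approach concrete, here is a minimal sketch of verifier-guided selection, assuming hypothetical `generate_candidates` and `score_steps` helpers standing in for a policy model and a step-wise (process) reward model; this is an illustration of the general technique, not the released recipe.

```python
# Minimal sketch: sample N candidate solutions, score each reasoning step
# with a process reward model (PRM), and return the best-scored solution.
# `generate_candidates` and `score_steps` are hypothetical stand-ins,
# not part of any specific library.
from typing import Callable, List

def verifier_best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[List[str]]],  # N solutions, each a list of steps
    score_steps: Callable[[str, List[str]], List[float]],        # PRM score per step
    n: int = 16,
) -> str:
    candidates = generate_candidates(prompt, n)
    best_solution, best_score = "", float("-inf")
    for steps in candidates:
        step_scores = score_steps(prompt, steps)
        # Aggregate step-wise scores; using the final step's score
        # is one common choice (min or product are alternatives).
        solution_score = step_scores[-1]
        if solution_score > best_score:
            best_solution, best_score = "\n".join(steps), solution_score
    return best_solution
```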
We're open sourcing the full recipe and sharing a detailed blog post.
In our blog post we cover:
📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test time.
🌲 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets (a minimal sketch follows this list).
🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.
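As promised above, a rough sketch of the DVTS idea under stated assumptions: split the sampling budget into independent subtrees and expand each one greedily by PRM score, so the trees don't all collapse onto the same high-scoring prefix. `expand` and `prm_score` are hypothetical helpers, not a real API.

```python
# Hedged sketch of Diverse Verifier Tree Search: run several independent
# subtrees, each expanded greedily by verifier score, to trade a little
# per-tree depth for much more diversity across candidates.
from typing import Callable, List

def dvts(
    prompt: str,
    expand: Callable[[str, int], List[str]],  # propose k next-step continuations
    prm_score: Callable[[str, str], float],   # score a partial solution
    n_budget: int = 16,
    beam_width: int = 4,
    max_depth: int = 8,
) -> List[str]:
    n_subtrees = n_budget // beam_width
    finals = []
    for _ in range(n_subtrees):
        partial = prompt
        for _ in range(max_depth):
            children = expand(partial, beam_width)
            # Greedy within the subtree: keep only the best-scored child.
            partial = max(children, key=lambda c: prm_score(prompt, c))
            if partial.endswith("<eos>"):  # hypothetical stop marker
                break
        finals.append(partial)
    # One candidate per subtree; pick the final answer by PRM score.
    return finals
```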
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models with similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🤯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook!
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not even being trained on videos!
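For anyone who wants to try it, here is a minimal inference sketch with transformers, assuming the HuggingFaceTB/SmolVLM-Instruct checkpoint and a local example.jpg; adapt the image path and generation settings to your setup.

```python
# Minimal SmolVLM inference sketch: load the processor and model, build a
# chat-formatted prompt with one image, and generate a short description.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical local image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```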