Supra-1.5 instruct • Experimental Chat Tune

SupraLabs has finetuned SupraLabs/Supra-1.5-50M-Base-exp for chat, and we are excited to share the model with the community!

Note: This is an experimental model, do not use it in production!

About Model

SupraLabs/Supra-1.5-50M-instruct-exp is an experimental 50-million-parameter instruction-tuned language model developed by SupraLabs. Designed to explore the capabilities of ultra-small language models, it focuses on efficient instruction following, lightweight deployment, and fast inference on resource-constrained hardware. Despite its compact size, the model aims to provide useful conversational and text-generation abilities while serving as a research platform for studying scaling, optimization, and small-model performance.

Model Benchmarks

Our evaluation results demonstrate that Supra-1.5-50M-instruct-exp achieves superior performance relative to standard baselines within the 50-million-parameter class.

Starting with an accuracy leaderboard, we are pleased to share the improvements over Supra-1

The model consistently excels at BLiMP benchmarks with a constant 67.4 score!

When evaluated against core language modeling metrics, the architecture demonstrates strong capabilities across the following benchmarks:

Furthermore, the model scales effectively to factual language benchmarks, delivering accurate, knowledge-driven responses across various text-generation tasks.

The SupraLabs team conducted a series of comparative evaluations isolating raw versus normalized benchmark performance to determine whether text normalization inadvertently degrades model utility. The resulting data yielded several compelling insights. Notably, tasks relying heavily on scientific domain knowledge and factual data showed optimal performance under raw inference conditions. Conversely, benchmarks evaluating mathematics and complex logical reasoning saw a measurable performance lift when utilizing normalized inference. This suggests that while raw text preserves the precise formatting needed for factual retrieval, normalization helps the model isolate patterns essential for structured logic.

Ultimately, these findings reveal that the optimal evaluation framework for ultra-small language models is highly sensitive to text formatting, highlighting a fascinating behavioral trade-off between semantic retrieval and structural logic. This dynamic underscores the complexity of optimizing 50M-class architectures for diverse tasks. Building upon these macro-level performance insights, we will now transition to a more granular analysis of the model's foundational linguistic capabilities. The following section provides a comprehensive deep dive into our evaluation using the BLiMP (Benchmark of Linguistic Minimal Pairs) suite to isolate specific grammatical and syntactic trends.

While the model demonstrates a strong macro-level average across the BLiMP (Benchmark of Linguistic Minimal Pairs) suite, a granular breakdown reveals a clear divergence in performance across different syntactic phenomena. Analyzing the categories where the model underperformed provides critical insights into the linguistic limitations of ultra-small, 50M-parameter architectures.

The chart below details our full evaluation, highlighting the specific grammatical structures that pose the greatest challenge for the model:

While these lower-scoring categories highlight clear boundaries in the model's architectural capacity, they only represent part of the linguistic profile. Moving past these edge cases, the evaluation uncovers a substantial middle ground where the model achieves highly competitive, stable results.

In this mid-tier band, the architecture demonstrates a solid baseline understanding of standard English structural rules, capturing fundamental grammar mechanics effectively without hitting the parameter limits observed in the more complex probes.

The chart below details the specific BLiMP benchmarks where the model achieved these consistent, average-range scores:

While the mid-tier benchmarks show a stable foundation, the true capabilities of the architecture become clear when looking at its highest-scoring categories. In these specific domains, the model does not just perform well—it displays an exceptional, highly optimized mastery of intricate grammatical rules that typically challenge models twice its size.

The data reveals outstanding accuracy in tracking structural dependencies, processing complex clausal configurations, and identifying subtle syntactic errors. These peaks in performance show that with the right instruction tuning, a 50M-parameter model can achieve near-flawless precision in core areas of language mechanics.

The chart below highlights the top-performing BLiMP probes where the model achieved its highest accuracy scores:

The peak performance observed across these advanced syntactic probes underscores the exceptional capabilities of this model within its parameter class. Achieving this level of precision at just 50 million parameters demonstrates that compact architectures, when properly optimized and instruction-tuned, can punch well above their weight class—delivering highly efficient linguistic processing without the massive computational overhead of larger models.

At SupraLabs, we are deeply committed to advancing the open-source AI ecosystem. We believe that democratization is the key to true AI innovation, and we are proud to contribute highly accessible, lightweight research models that lower the barrier to entry for developers and researchers worldwide.

To support this commitment, the weights for Supra-1.5-50M-instruct-exp are fully open to the community. We enthusiastically encourage developers, hobbyists, and researchers to experiment with, build upon, and fine-tune this model for their own custom use cases. We cannot wait to see what the community builds!