|
--- |
|
license: apache-2.0 |
|
language: |
|
- ru |
|
- en |
|
--- |
|
|
|
<style> |
|
.custom-table { |
|
table-layout: fixed; |
|
width: 100%; |
|
border-collapse: collapse; |
|
margin-top: 2em; |
|
} |
|
.custom-table td { |
|
width: 50%; |
|
vertical-align: top; |
|
padding: 10px; |
|
box-shadow: 0px 0px 0px 0px rgba(0, 0, 0, 0.15); |
|
} |
|
.custom-image-container { |
|
position: relative; |
|
width: 100%; |
|
margin-bottom: 0em; |
|
overflow: hidden; |
|
border-radius: 10px; |
|
transition: transform .7s; |
|
/* Smooth transition for the container */ |
|
} |
|
.left-column:hover { |
|
transform: scale(2) translate(+25%, 0%); |
|
z-index: 9999; |
|
/* Scale the container on hover */ |
|
} |
|
.right-column:hover { |
|
transform: scale(2) translate(-25%, 0%); |
|
z-index: 9999; |
|
/* Scale the container on hover */ |
|
} |
|
.custom-image { |
|
width: 100%; |
|
height: auto; |
|
object-fit: cover; |
|
border-radius: 10px; |
|
transition: transform .7s; |
|
margin-bottom: 0em; |
|
} |
|
</style> |
|
|
|
|
|
# Mamba-1.4B |
|
|
|
This is the original Mamba model trained on over 1T tokens, mostly in Russian and English.
|
|
|
This release contains only the pre-trained model; it has not undergone any instruction-following tuning. Feel free to try it out and share your results.
|
|
|
Note that this is a ~1.3B-parameter model, so its results may be worse than those of 7B models. However, it is competitive among models of the same size.
|
|
|
If you have any questions, feel free to open an issue. |
|
|
|
## Model description |
|
|
|
The model has the same architecture and configuration parameters as the original [Mamba-1.4B](https://huggingface.co/state-spaces/mamba-1.4b-hf). The only difference is the vocabulary size: 50,280 in the vanilla configuration versus 32,768 here. As a result, this model has slightly fewer parameters (1.34B).
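
You can verify the difference yourself by inspecting the configuration and counting parameters with the standard `transformers` API (a quick sketch; loading the weights is only needed for the parameter count):

```python
from transformers import AutoConfig, MambaForCausalLM

config = AutoConfig.from_pretrained("SpirinEgor/mamba-1.4b")
print(config.vocab_size)  # 32768 here vs. 50280 in state-spaces/mamba-1.4b-hf

model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")  # ~1.34B
```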
|
|
|
This model was trained with the [original implementation](https://github.com/state-spaces/mamba) using the FSDP strategy.
|
|
|
Training details: |
|
- The effective batch size was 1024 and the sequence length was 2048, resulting in ~2M tokens per batch.
|
- Training was conducted for 500,000 steps, resulting in more than 1T tokens. |
|
- The learning rate scheduler was set up as follows (see the sketch after this list):
|
- Warmup for the first 2500 steps from 0 to 2e-4. |
|
  - Gradual decay to 1.8e-5 until step 497,500.
|
- Cooldown to 0 for the last 2500 steps. |
|
- We used BF16 for training but kept gradients and buffers in FP32 for stability.
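
For reference, here is a minimal sketch of that learning rate schedule as a plain function. The warmup and cooldown are linear as described above; the shape of the decay in between is not specified, so a linear ramp is assumed here for illustration.

```python
# Minimal sketch of the learning-rate schedule described above.
# 1024 sequences x 2048 tokens ≈ 2M tokens per step; 500,000 steps ≈ 1T tokens.
PEAK_LR = 2e-4
FINAL_LR = 1.8e-5
WARMUP_STEPS = 2_500
COOLDOWN_STEPS = 2_500
TOTAL_STEPS = 500_000
DECAY_END = TOTAL_STEPS - COOLDOWN_STEPS  # 497,500


def lr_at_step(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < DECAY_END:
        # Decay from the peak to the final learning rate
        # (assumed linear here; the actual curve is not specified above).
        progress = (step - WARMUP_STEPS) / (DECAY_END - WARMUP_STEPS)
        return PEAK_LR + progress * (FINAL_LR - PEAK_LR)
    # Cooldown from the final learning rate to 0 over the last steps.
    progress = (step - DECAY_END) / COOLDOWN_STEPS
    return FINAL_LR * (1.0 - progress)
```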
|
|
|
## How to use |
|
|
|
You need transformers version 4.39.0 or higher. We also recommend installing the optimized kernels: `causal-conv1d` and `mamba-ssm`.
|
|
|
```shell |
|
pip install "transformers>=4.39.0"

pip install "causal-conv1d>=1.2.0"
|
pip install mamba-ssm |
|
``` |
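
You can optionally check whether the fast kernels were picked up; if they are missing, `transformers` falls back to a slower pure-PyTorch path. A small illustrative check:

```python
# Illustrative check: both packages expose these module names after installation.
try:
    import causal_conv1d  # noqa: F401
    import mamba_ssm  # noqa: F401
    print("Optimized Mamba kernels are available.")
except ImportError:
    print("Optimized kernels not found; the slower pure-PyTorch path will be used.")
```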
|
|
|
After that, you can use the classic [`generate`](https://huggingface.co/docs/transformers/en/main_classes/text_generation) API. Refer to the [documentation](https://huggingface.co/state-spaces/mamba-1.4b-hf) of the original model for more details. |
|
|
|
```python |
|
from transformers import MambaForCausalLM, AutoTokenizer |
|
|
|
model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b") |
|
tokenizer = AutoTokenizer.from_pretrained("SpirinEgor/mamba-1.4b") |
|
|
|
s = "Я очень люблю лимончелло" |
|
input_ids = tokenizer(s, return_tensors="pt")["input_ids"] |
|
|
|
output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95, top_k=50, repetition_penalty=1.1) |
|
print(tokenizer.decode(output_ids[0])) |
|
# <s> Я очень люблю лимончелло. Просто без ума от этого ликёра, но когда его много я себя не контролирую и начинаю пить всё что можно.</s>
# ("I really love limoncello. I'm simply crazy about this liqueur, but when there's a lot of it I can't control myself and start drinking everything I can.")
|
``` |
|
|
|
## Dataset |
|
|
|
The training dataset contains data mainly in Russian and English, as well as code and multilingual content. We used a combination of open-source datasets, including parts of SlimPajama, Wikipedia, and Reddit.
|
|
|
| Language | Share |
|
|:------------|:-------| |
|
| Russian | 53.5% | |
|
| English | 36.8% | |
|
| Source Code | 4.2% | |
|
| Other | 5.5% | |
|
|
|
## Evaluation |
|
|
|
For evaluation, we used the same set of tasks as in the original Mamba paper.
|
|
|
Some useful notes and details: |
|
- As proposed in the paper, all tasks are evaluated zero-shot, unlike in the popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Therefore, these numbers cannot be directly compared with leaderboard scores.
|
- Only a subset of the tasks was used for the Russian language; these are translated and edited analogues of the English tasks.
|
- Only models with up to 3B parameters were included in the comparison; bigger models show significantly better results for both languages.
|
|
|
If you want to reproduce the results or check any other model, you can use the [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) framework. |
|
|
|
We ran it with the following parameters: |
|
|
|
```shell |
|
--tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --num_fewshot 0 --batch_size 4 |
|
``` |
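
If you prefer to drive the harness from Python, the same run can be expressed roughly as follows. This is a sketch assuming lm-evaluation-harness v0.4+, where Mamba loads through the standard `hf` model type; the exact API may differ between versions.

```python
import lm_eval

# Mirrors the CLI flags above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SpirinEgor/mamba-1.4b",
    tasks=["lambada_openai", "hellaswag", "piqa", "arc_easy", "arc_challenge", "winogrande"],
    num_fewshot=0,
    batch_size=4,
)
print(results["results"])
```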
|
|
|
_Hover over the small plots to enlarge them._ |
|
|
|
### Russian |
|
|
|
<img class="custom-image" src="images/ru_average.png" alt="ru_average"> |
|
|
|
<table class="custom-table"> |
|
<tr> |
|
<td><div class="custom-image-container left-column"> |
|
<img class="custom-image" src="images/ru_hellaswag.png" alt="ru_hellaswag"> |
|
</div></td>
|
<td><div class="custom-image-container right-column"> |
|
<img class="custom-image" src="images/ru_winogrande.png" alt="ru_winogrande"> |
|
</div></td>
|
</tr> |
|
<tr> |
|
<td><div class="custom-image-container left-column"> |
|
<img class="custom-image" src="images/ru_arc-e.png" alt="ru_arc-e"> |
|
</div></td> |
|
<td><div class="custom-image-container right-column"> |
|
<img class="custom-image" src="images/ru_arc-c.png" alt="ru_arc-c"> |
|
</div></td> |
|
</tr> |
|
</table> |
|
|
|
### English |
|
|
|
<img class="custom-image" src="images/en_average.png" alt="en_average"> |
|
|
|
<table class="custom-table"> |
|
<tr> |
|
<td><div class="custom-image-container left-column"> |
|
<img class="custom-image" src="images/en_lambada.png" alt="en_lambada"> |
|
</div></td> |
|
<td><div class="custom-image-container right-column"> |
|
<img class="custom-image" src="images/en_hellaswag.png" alt="en_hellaswag"> |
|
</div></td> |
|
</tr> |
|
<tr> |
|
<td><div class="custom-image-container left-column"> |
|
<img class="custom-image" src="images/en_piqa.png" alt="en_piqa"> |
|
</div></td> |
|
<td><div class="custom-image-container right-column"> |
|
<img class="custom-image" src="images/en_winogrande.png" alt="en_winogrande"> |
|
</div></td> |
|
</tr> |
|
<tr> |
|
<td><div class="custom-image-container left-column"> |
|
<img class="custom-image" src="images/en_arc-e.png" alt="en_arc-e"> |
|
</div></td> |
|
<td><div class="custom-image-container right-column"> |
|
<img class="custom-image" src="images/en_arc-c.png" alt="en_arc-c"> |
|
</div></td> |
|
</tr> |
|
</table> |
|
|
|
As expected, the model performs worse on English tasks and better on Russian ones, where it even outperforms some popular models.
|
|
|
## Citation |
|
|
|
``` |
|
@article{mamba, |
|
title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces}, |
|
author={Gu, Albert and Dao, Tri}, |
|
journal={arXiv preprint arXiv:2312.00752}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
``` |
|
@misc{spirin2024mamba_ru, |
|
title={mamba-1.4b-ru}, |
|
author={Spirin, Egor}, |
|
url={https://huggingface.co/SpirinEgor/mamba-1.4b}, |
|
    publisher={Hugging Face},
|
year={2024}, |
|
} |
|
``` |