---
license: apache-2.0
language:
- ru
- en
---

<style>
  .custom-table {
    table-layout: fixed;
    width: 100%;
    border-collapse: collapse;
    margin-top: 2em;
  }
  .custom-table td {
    width: 50%;
    vertical-align: top;
    padding: 10px;
    box-shadow: 0px 0px 0px 0px rgba(0, 0, 0, 0.15);
  }
  .custom-image-container {
    position: relative;
    width: 100%;
    margin-bottom: 0em;
    overflow: hidden;
    border-radius: 10px;
    transition: transform .7s;
    /* Smooth transition for the container */
  }
  .left-column:hover {
    transform: scale(2) translate(+25%, 0%);
    z-index: 9999;
    /* Scale the container on hover */
  }
  .right-column:hover {
    transform: scale(2) translate(-25%, 0%);
    z-index: 9999;
    /* Scale the container on hover */
  }
  .custom-image {
    width: 100%;
    height: auto;
    object-fit: cover;
    border-radius: 10px;
    transition: transform .7s;
    margin-bottom: 0em;
  }
</style>


# Mamba-1.4B

A model with the original Mamba architecture, trained on over 1T tokens, mostly in Russian and English.

This release contains only the pre-trained base model; it does not include any instruction-following tuning. Feel free to try it out and share your results.

Note that this is a ~1.3B-parameter model, so its results can be worse than those of 7B-parameter models. However, it is competitive among models of the same size.

If you have any questions, feel free to open an issue.

## Model description

The model has the same architecture and config parameters as the original [Mamba-1.4B](https://huggingface.co/state-spaces/mamba-1.4b-hf) model. The only difference is the vocabulary size: 50,280 in the vanilla configuration versus 32,768 here. As a result, this model has slightly fewer parameters (1.34B).
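
As a quick sanity check, the smaller vocabulary and the parameter count can be inspected directly from the published checkpoint. A minimal sketch (the exact printed count may vary slightly with the loading settings):

```python
from transformers import AutoConfig, MambaForCausalLM

config = AutoConfig.from_pretrained("SpirinEgor/mamba-1.4b")
print(config.vocab_size)  # expected: 32768

model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 1.34B
```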

The model was trained with the [original implementation](https://github.com/state-spaces/mamba) using the FSDP strategy.

Training details:
- The effective batch size was 1024 with a sequence length of 2048, resulting in about 2M tokens per batch.
- Training ran for 500,000 steps, i.e., more than 1T tokens in total.
- The learning rate schedule was set up as follows (see the sketch after this list):
  - Warmup from 0 to 2e-4 over the first 2,500 steps.
  - Gradual decrease to 1.8e-5 until step 497,500.
  - Cooldown to 0 over the last 2,500 steps.
- Training used BF16, but gradients and buffers were kept in FP32 for stability.
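
For reference, 1024 × 2048 ≈ 2.1M tokens per step, and 500,000 × 2.1M ≈ 1.05T tokens in total. Below is a minimal sketch of the learning-rate schedule described above; it is not the training code, and it assumes linear warmup, decay, and cooldown (the exact decay shape is not specified here).

```python
def learning_rate(step: int, total_steps: int = 500_000) -> float:
    """Sketch of the schedule: warmup -> decay -> cooldown (linear shapes assumed)."""
    warmup_steps = 2_500
    cooldown_start = total_steps - 2_500  # step 497,500
    peak_lr, end_lr = 2e-4, 1.8e-5

    if step < warmup_steps:
        # Warmup from 0 to 2e-4 over the first 2,500 steps.
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # Decrease from 2e-4 to 1.8e-5 until step 497,500.
        frac = (step - warmup_steps) / (cooldown_start - warmup_steps)
        return peak_lr + frac * (end_lr - peak_lr)
    # Cooldown from 1.8e-5 to 0 over the last 2,500 steps.
    frac = (step - cooldown_start) / (total_steps - cooldown_start)
    return end_lr * (1.0 - frac)
```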

## How to use

You need `transformers` version 4.39.0 or higher. We also recommend installing the optimized kernels: `causal-conv1d` and `mamba-ssm`.

```shell
pip install "transformers>=4.39.0"
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm
```
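
To verify that the optional kernels were picked up, you can simply try importing them; `causal_conv1d` and `mamba_ssm` are the import names of the packages above. When they are missing, `transformers` falls back to a slower pure-PyTorch path.

```python
# Minimal check that the optional fast kernels are importable.
try:
    import causal_conv1d  # noqa: F401
    import mamba_ssm  # noqa: F401
    print("Optimized Mamba kernels are available.")
except ImportError:
    print("Optimized kernels not found; a slower fallback path will be used.")
```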

After that, you can use the classic [`generate`](https://huggingface.co/docs/transformers/en/main_classes/text_generation) API. Refer to the [documentation](https://huggingface.co/state-spaces/mamba-1.4b-hf) of the original model for more details.

```python
from transformers import MambaForCausalLM, AutoTokenizer

model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b")
tokenizer = AutoTokenizer.from_pretrained("SpirinEgor/mamba-1.4b")

s = "Я очень люблю лимончелло"
input_ids = tokenizer(s, return_tensors="pt")["input_ids"]

output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output_ids[0]))
# <s> Я очень люблю лимончелло. Просто без ума от этого ликёра, но когда его много я себя не контролирую и начинаю пить всё что можно.</s>
# (English: "I really love limoncello. Simply crazy about this liqueur, but when there is a lot of it I lose control and start drinking everything I can.")
```
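
If a GPU is available, generation is noticeably faster on it (and the optimized kernels above are CUDA-only). Continuing the snippet, a minimal sketch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

input_ids = tokenizer(s, return_tensors="pt")["input_ids"].to(device)
output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output_ids[0]))
```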

## Dataset

The training dataset consists mainly of Russian and English data, along with source code and multilingual content. It combines open-source datasets, e.g., parts of SlimPajama, Wikipedia, and Reddit.

| Language    | Share  |
|:------------|:-------|
| Russian     | 53.5%  |
| English     | 36.8%  |
| Source Code | 4.2%   |
| Other       | 5.5%   |

## Evaluation

For evaluation, we used the same set of tasks as in the original Mamba paper.

Some useful notes and details:
- As proposed in the paper, all tasks are evaluated zero-shot, unlike in the popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Therefore, these scores cannot be compared directly with leaderboard numbers.
- Only some of the tasks were available for Russian; these are translated and edited analogues of the English originals.
- Only models with up to 3B parameters were included in the comparison; bigger models show significantly better results for both languages.

If you want to reproduce the results or check any other model, you can use the [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) framework.

We ran it with the following parameters:

```shell
--tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --num_fewshot 0 --batch_size 4
```
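
For reference, with a recent version of the harness a complete invocation might look like the sketch below; the `lm_eval` entry point and the `--model hf --model_args pretrained=...` options describe a typical local setup and are not part of the original command.

```shell
lm_eval --model hf \
  --model_args pretrained=SpirinEgor/mamba-1.4b \
  --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande \
  --num_fewshot 0 \
  --batch_size 4
```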

_Hover over the small plots to enlarge them._

### Russian

<img class="custom-image" src="images/ru_average.png" alt="ru_average">

<table class="custom-table">
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/ru_hellaswag.png" alt="ru_hellaswag">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/ru_winogrande.png" alt="ru_winogrande">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/ru_arc-e.png" alt="ru_arc-e">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/ru_arc-c.png" alt="ru_arc-c">
        </div></td>
    </tr>
</table>

### English

<img class="custom-image" src="images/en_average.png" alt="en_average">

<table class="custom-table">
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_lambada.png" alt="en_lambada">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_hellaswag.png" alt="en_hellaswag">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_piqa.png" alt="en_piqa">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_winogrande.png" alt="en_winogrande">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_arc-e.png" alt="en_arc-e">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_arc-c.png" alt="en_arc-c">
        </div></td>
    </tr>
</table>

As expected, the model performs worse on English tasks and shows better results on Russian ones, even outperforming some popular models.

## Citation

```
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
```

```
@misc{spirin2024mamba_ru,
  title={mamba-1.4b-ru},
  author={Spirin, Egor},
  url={https://huggingface.co/SpirinEgor/mamba-1.4b},
  publisher={Hugging Face},
  year={2024}
}
```