---
license: apache-2.0
language:
- ru
- en
---

<style>
  .custom-table {
    table-layout: fixed;
    width: 100%;
    border-collapse: collapse;
    margin-top: 2em;
  }
  .custom-table td {
    width: 50%;
    vertical-align: top;
    padding: 10px;
    box-shadow: 0px 0px 0px 0px rgba(0, 0, 0, 0.15);
  }
  .custom-image-container {
    position: relative;
    width: 100%;
    margin-bottom: 0em;
    overflow: hidden;
    border-radius: 10px;
    transition: transform .7s;
    /* Smooth transition for the container */
  }
  .left-column:hover {
    transform: scale(2) translate(+25%, 0%);
    z-index: 9999;
    /* Scale the container on hover */
  }
  .right-column:hover {
    transform: scale(2) translate(-25%, 0%);
    z-index: 9999;
    /* Scale the container on hover */
  }
  .custom-image {
    width: 100%;
    height: auto;
    object-fit: cover;
    border-radius: 10px;
    transition: transform .7s;
    margin-bottom: 0em;
  }
</style>


# Mamba-1.4B

A model with the original Mamba architecture, trained on over 1T tokens, mostly in Russian and English.

This release contains only the pre-trained base model; it does not include any instruction-following tuning. Feel free to try it out and share your results.

Note that this is a ~1.3B-parameter model, so its results can be worse than those of 7B-parameter models. However, it is competitive among models of the same size.

If you have any questions, feel free to open an issue.

## Model description

The model has the same architecture and config parameters as the original [Mamba-1.4B](https://huggingface.co/state-spaces/mamba-1.4b-hf) model. The only difference is the vocabulary size: 50,280 in the vanilla configuration versus 32,768 here. As a result, this model has slightly fewer parameters (1.34B).
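
As a quick sanity check, the smaller vocabulary and the parameter count can be inspected directly from the published checkpoint. A minimal sketch (the exact printed count may vary slightly with the loading settings):

```python
from transformers import AutoConfig, MambaForCausalLM

config = AutoConfig.from_pretrained("SpirinEgor/mamba-1.4b")
print(config.vocab_size)  # expected: 32768

model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 1.34B
```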

The model was trained with the [original implementation](https://github.com/state-spaces/mamba) using the FSDP strategy.

Training details:
- The effective batch size was 1024 with a sequence length of 2048, resulting in about 2M tokens per batch.
- Training ran for 500,000 steps, i.e., more than 1T tokens in total.
- The learning rate schedule was set up as follows (see the sketch after this list):
  - Warmup from 0 to 2e-4 over the first 2,500 steps.
  - Gradual decrease to 1.8e-5 until step 497,500.
  - Cooldown to 0 over the last 2,500 steps.
- Training used BF16, but gradients and buffers were kept in FP32 for stability.
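
For reference, 1024 × 2048 ≈ 2.1M tokens per step, and 500,000 × 2.1M ≈ 1.05T tokens in total. Below is a minimal sketch of the learning-rate schedule described above; it is not the training code, and it assumes linear warmup, decay, and cooldown (the exact decay shape is not specified here).

```python
def learning_rate(step: int, total_steps: int = 500_000) -> float:
    """Sketch of the schedule: warmup -> decay -> cooldown (linear shapes assumed)."""
    warmup_steps = 2_500
    cooldown_start = total_steps - 2_500  # step 497,500
    peak_lr, end_lr = 2e-4, 1.8e-5

    if step < warmup_steps:
        # Warmup from 0 to 2e-4 over the first 2,500 steps.
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # Decrease from 2e-4 to 1.8e-5 until step 497,500.
        frac = (step - warmup_steps) / (cooldown_start - warmup_steps)
        return peak_lr + frac * (end_lr - peak_lr)
    # Cooldown from 1.8e-5 to 0 over the last 2,500 steps.
    frac = (step - cooldown_start) / (total_steps - cooldown_start)
    return end_lr * (1.0 - frac)
```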

## How to use

You need `transformers` version 4.39.0 or higher. We also recommend installing the optimized kernels: `causal-conv1d` and `mamba-ssm`.

```shell
pip install "transformers>=4.39.0"
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm
```
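
To verify that the optional kernels were picked up, you can simply try importing them; `causal_conv1d` and `mamba_ssm` are the import names of the packages above. When they are missing, `transformers` falls back to a slower pure-PyTorch path.

```python
# Minimal check that the optional fast kernels are importable.
try:
    import causal_conv1d  # noqa: F401
    import mamba_ssm  # noqa: F401
    print("Optimized Mamba kernels are available.")
except ImportError:
    print("Optimized kernels not found; a slower fallback path will be used.")
```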

After that, you can use the classic [`generate`](https://huggingface.co/docs/transformers/en/main_classes/text_generation) API. Refer to the [documentation](https://huggingface.co/state-spaces/mamba-1.4b-hf) of the original model for more details.

```python
from transformers import MambaForCausalLM, AutoTokenizer

model = MambaForCausalLM.from_pretrained("SpirinEgor/mamba-1.4b")
tokenizer = AutoTokenizer.from_pretrained("SpirinEgor/mamba-1.4b")

s = "Я очень люблю лимончелло"
input_ids = tokenizer(s, return_tensors="pt")["input_ids"]

output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output_ids[0]))
# <s> Я очень люблю лимончелло. Просто без ума от этого ликёра, но когда его много я себя не контролирую и начинаю пить всё что можно.</s>
# (English: "I really love limoncello. Simply crazy about this liqueur, but when there is a lot of it I lose control and start drinking everything I can.")
```
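
If a GPU is available, generation is noticeably faster on it (and the optimized kernels above are CUDA-only). Continuing the snippet, a minimal sketch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

input_ids = tokenizer(s, return_tensors="pt")["input_ids"].to(device)
output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output_ids[0]))
```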

## Dataset

The training dataset consists mainly of Russian and English data, along with source code and multilingual content. It combines open-source datasets, e.g., parts of SlimPajama, Wikipedia, and Reddit.

| Language    | Share  |
|:------------|:-------|
| Russian     | 53.5%  |
| English     | 36.8%  |
| Source Code | 4.2%   |
| Other       | 5.5%   |

## Evaluation

For evaluation, we used the same set of tasks as in the original Mamba paper.

Some useful notes and details:
- As proposed in the paper, all tasks are evaluated zero-shot, unlike in the popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Therefore, these scores cannot be compared directly with leaderboard numbers.
- Only some of the tasks were available for Russian; these are translated and edited analogues of the English originals.
- Only models with up to 3B parameters were included in the comparison; bigger models show significantly better results for both languages.

If you want to reproduce the results or check any other model, you can use the [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) framework.

We ran it with the following parameters:

```shell
--tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --num_fewshot 0 --batch_size 4
```
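
For reference, with a recent version of the harness a complete invocation might look like the sketch below; the `lm_eval` entry point and the `--model hf --model_args pretrained=...` options describe a typical local setup and are not part of the original command.

```shell
lm_eval --model hf \
  --model_args pretrained=SpirinEgor/mamba-1.4b \
  --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande \
  --num_fewshot 0 \
  --batch_size 4
```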

_Hover over the small plots to enlarge them._

### Russian

<img class="custom-image" src="images/ru_average.png" alt="ru_average">

<table class="custom-table">
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/ru_hellaswag.png" alt="ru_hellaswag">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/ru_winogrande.png" alt="ru_winogrande">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/ru_arc-e.png" alt="ru_arc-e">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/ru_arc-c.png" alt="ru_arc-c">
        </div></td>
    </tr>
</table>

### English

<img class="custom-image" src="images/en_average.png" alt="en_average">

<table class="custom-table">
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_lambada.png" alt="en_lambada">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_hellaswag.png" alt="en_hellaswag">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_piqa.png" alt="en_piqa">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_winogrande.png" alt="en_winogrande">
        </div></td>
    </tr>
    <tr>
        <td><div class="custom-image-container left-column">
            <img class="custom-image" src="images/en_arc-e.png" alt="en_arc-e">
        </div></td>
        <td><div class="custom-image-container right-column">
            <img class="custom-image" src="images/en_arc-c.png" alt="en_arc-c">
        </div></td>
    </tr>
</table>

As expected, the model performs worse on English tasks and shows better results on Russian ones, even outperforming some popular models.

## Citation

```
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
```

```
@misc{spirin2024mamba_ru,
  title={mamba-1.4b-ru},
  author={Spirin, Egor},
  url={https://huggingface.co/SpirinEgor/mamba-1.4b},
  publisher={Hugging Face},
  year={2024}
}
```