File size: 8,735 Bytes

---
license: other
license_name: seallms
license_link: https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE
language:
- en
- zh
- vi
- id
- th
- ms
- km
- lo
- my
- tl
tags:
- multilingual
- sea
---

<p align="center">
  <img src="sealmmm.png" width="200" />
</p>

> SeaLLM will be able to "see"!

# *SeaLMMM-7B* - Large Multilingual Multimodal Models for Southeast Asia


<p align="center">
<a href="https://damo-nlp-sg.github.io/SeaLLMs/" target="_blank" rel="noopener">Website</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1" target="_blank" rel="noopener"> 🤗 Tech Memo</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B" target="_blank" rel="noopener"> 🤗 DEMO</a>
&nbsp;&nbsp;
<a href="https://github.com/DAMO-NLP-SG/SeaLLMs" target="_blank" rel="noopener">Github</a>
&nbsp;&nbsp;
<a href="https://arxiv.org/pdf/2312.00738.pdf" target="_blank" rel="noopener">Technical Report</a>
</p>

<!-- 🔥<span style="color: #ff3860">[HOT]</span> SeaLLMs project now has a dedicated website - [damo-nlp-sg.github.io/SeaLLMs](https://damo-nlp-sg.github.io/SeaLLMs/) -->


We introduce and [showcase](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B) the first iteration of [SeaLMMM](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) -- A unified multilingual and multimodal that excel in both text-only and vision tasks in multiple languages spoken in Southeast Asia.

### SeaLMMM-7B abilities
* SeaLMMM-7B is one of the strongest 7B vision-language models at **text-only tasks**, with performance similar to [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2). It is a text-first-vision-second model.
* SeaLMMM-7B **is** able to handle most SEA languages, making it more multilingual than En-only LLava, Bilingual (En+Zh) Qwen-VL or Yi-VL. 
* Unlike LLava or specialized VLMs, which demand only one image at the begining, SeaLMMM-7B can seamlessly handle text-only conversations at the begining and visual instructions in the middle of the conversations and support topic and language switching.
* SeaLMMM-7B can carry multi-image generation or in-context visual learning, in which case, [Better llava next](https://github.com/huggingface/transformers/pull/29850) should be applied to enable such feature.


### Release and DEMO

- DEMO: [SeaLLMs/SeaLLM-7b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B).
- Model weights:
  - [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1).
- Explore SeaLLMs:
  - [SeaLLMs/SeaLLM-7B-v2.5](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2.5).
  - [SeaLLMs/SeaLLM-7B-v2](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2).
  - [SeaLLMs/SeaLLM-7B-v1](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v1).


<blockquote style="color:red">
<p><strong style="color: red">Terms of Use and License</strong>: 
By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our <a href="https://huggingface.co/SeaLLMs/SeaLLM-Chat-13b/edit/main/LICENSE" target="_blank" rel="noopener">SeaLLMs Terms Of Use</a>.
</blockquote>

> **Disclaimer**:
> We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety fine-tuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.

> The logo was generated by DALL-E 3.


## Overview

SeaLMMM-7B-v0.1 is a multimodal extension of [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2). 
It adopts the [Llava-1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) (Llava-NEXT) architecture. 
It is trained by jointly train SeaLLM's multilingual text-only datasets along with Llava-1.5 English-only vision data, as well as in-house synthetically generated multilingual multimodal vision data and open-source data, such as [ThaiIDCardSynt](https://huggingface.co/datasets/matichon/ThaiIDCardSynt).


### English Vision QA Tasks

| Multimodal Models | VQA2 | GQA | Vizwiz | SQA-IMG | TextQA
| --- | --- | --- | --- | --- | --- | 
| Qwen-VL-Chat | 78.20 | 57.50 | 38.90 | 68.20 | 61.50
| Llava-1.5-7b | 78.50 | 62.00 | 50.00 | 66.80 | 58.20
| Llava-1.5-13b | 80.00 | 63.30 | 53.60 | 71.60 | 61.30
| [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) | 80.14 | 61.58 | 58.00 | 71.79 | 63.47

### Multilingual Text-only World Knowledge

We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) (M3e) for En, Zh, Vi, Id, Th.

On text-only benchmarks, [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) is generally on-par with [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2) - its base LLM model. This demonstrates that our multimodal training regime does not vastly degrade text-only performance.

| Model | Langs | En<br>MMLU | En<br>M3e | Zh<br>M3e | Vi<br>M3e | Id<br>M3e | Th<br>M3e
|-----| -----  | --- |  -- | ----- | ---- | --- | --- |
| GPT-3.5         | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 49.27 | 37.41
| Vistral-7B-chat | Mono  | 56.86 | 67.00 | 44.56 | 54.33 | 36.49 | 25.27
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 24.29 | 20.25
| SailorLM        | Multi | 52.72 | 59.76 | 67.74 | 50.14 | 39.53 | 37.73
| [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2) | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 42.25 | 35.52
| [SeaLLM-7B-v2.5](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2.5)  | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 48.64 | 46.86
| ---
| [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) | Multi | 60.31 | 70.43 | 52.78 | 50.47 | 42.37 | 33.53



## Multilingual Multimodal Showcases

[SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) has better vision understanding and solving abilities in languages beyond English and Chinese, especially SEA languages, such as Vietnamese and Indonesian.

![two_cat.png](two_cat.png)

Image: find "x" in Vietnamese. Left: Llava-1.6-34B. Right: SeaLMMM-7B-v0.1. 
<div class="row" style="display: flex; clear: both;">
    <img src="llava_1.6_34b_find_x_vi.png" alt="Forest" style="float: left; width: 39%">
    <img src="find_x_vi.png" alt="Snow" style="float: left; width: 59%">
</div>


### Limitations
* Despite being multilingual, SeaLMMM-7B-v0.1 multi-modal capabilities still work best in English, while we're working to improve it in other languages.
* For OCR, it can only read English.
* SeaLMMM-7B-v0.1 sometimes still think it cannot process image in multi-turn setting, due to existing text-only SFT, future versions fill fix this.
* Multi-modal multi-turn capabilities are still limited.


### Usage

#### Instruction format

**Unlike others, image token is `<|image|>`**

```python
prompt = """<|im_start|>system
You are a helpful assistant.</s>
<|im_start|>user
<|image|>
What is in the image?</s>
<|im_start|>assistant
There is 2 cats in the image.</s>"""

# <|im_start|> is not a special token.
# Transformers chat_template should be consistent with vLLM format below.

# ! ENSURE 1 and only 1 bos `<s>` at the beginning of sequence
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))

"""
```

## Acknowledgement to Our Linguists

We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT dataset as well as evaluating our models across different aspects, especially safety.

## Citation

If you find our project useful, we hope you would kindly star our repo and cite our work as follows: Corresponding Author: [l.bing@alibaba-inc.com](mailto:l.bing@alibaba-inc.com)

**Author list and order will change!**

* `*` and `^` are equal contributions.

```
@article{damonlpsg2023seallm,
  author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*, Weiwen Xu, Hou Pong Chan,
            Zhiqiang Hu, Chenhui Shen^, Yew Ken Chia^, Xingxuan Li, Jianyu Wang,
            Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
            Chaoqun Liu, Hang Zhang, Lidong Bing},
  title = {SeaLLMs - Large Language Models for Southeast Asia},
  year = 2023,
  Eprint = {arXiv:2312.00738},
}
```