Multi-modal large language models have garnered significant interest recently. However, most existing work focuses on vision-language models, which provide strong capabilities for following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world, so it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM), an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. In addition, we release a large speech instruction-following dataset, LLaSM-Audio-Instruction.
Our paper makes the following contributions:
- We propose LLaSM, an end-to-end trained large multi-modal speech-language model capable of following speech-and-language instructions.
- We construct and release LLaSM-Audio-Instruction, a large speech instruction-following dataset.
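To give a concrete, if simplified, picture of the kind of model described in the abstract, the sketch below shows one common way to wire a speech encoder to a language model: a small modal adapter projects speech-encoder features into the LLM's embedding space so that speech and text can be processed as a single token sequence. This is a minimal illustration only; the ModalAdapter class, the dimensions, and the random stand-in tensors are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the paper):
SPEECH_DIM = 1280   # width of a Whisper-style speech encoder output
LLM_DIM = 4096      # embedding width of a 7B Llama-family language model


class ModalAdapter(nn.Module):
    """Projects speech-encoder features into the LLM's embedding space."""

    def __init__(self, speech_dim: int, llm_dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: (batch, frames, speech_dim), e.g. from a frozen speech encoder
        return self.proj(speech_features)


adapter = ModalAdapter(SPEECH_DIM, LLM_DIM)

# Stand-in tensors: in a real pipeline these would come from the speech encoder
# and from the LLM's token-embedding layer, respectively.
speech_features = torch.randn(1, 150, SPEECH_DIM)
text_embeddings = torch.randn(1, 32, LLM_DIM)

speech_embeddings = adapter(speech_features)                         # (1, 150, LLM_DIM)
llm_inputs = torch.cat([speech_embeddings, text_embeddings], dim=1)  # joint input sequence
print(llm_inputs.shape)  # torch.Size([1, 182, 4096])
```

In a setup like this, the speech encoder is typically kept frozen while the adapter (and optionally the language model) is trained end-to-end on speech-and-language instruction data.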
@misc{shu2023llasm,
      title={LLaSM: Large Language and Speech Model},
      author={Yu Shu and Siwei Dong and Guangyao Chen and Wenhao Huang and Ruihua Zhang and Daochen Shi and Qiqi Xiang and Yemin Shi},
      year={2023},
      eprint={2308.15930},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the open-source projects Chinese-Llama-2-7B, Whisper, and Baichuan-7B for giving us access to their models.