LLaSM: Large Language and Speech Model

Yu Shu2, Siwei Dong2, Guangyao Chen1,3, Wenhao Huang4, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi1*
1LinkSoul.AI, 2Beijing Academy of Artificial Intelligence, China, 3Peking University, China, 4 01.ai
*Corresponding author: ymshi@linksoul.ai

Abstract

Multi-modal large language models have garnered significant interest recently. However, most of this work focuses on vision-language multi-modal models, which provide strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world; hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM), an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release LLaSM-Audio-Instructions, a large speech-instruction-following dataset.
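This page does not spell out the model architecture, but the components named here (a Whisper speech encoder and an LLM such as Chinese-Llama-2-7B or Baichuan-7B) suggest the common pattern for end-to-end speech-language models: audio features from the speech encoder are projected by a modal adaptor into the LLM's embedding space and concatenated with the text-instruction embeddings. The following minimal PyTorch sketch illustrates that data flow only; all class names, layer choices, and dimensions are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class SpeechLanguageModel(nn.Module):
    """Illustrative sketch of a speech-language pipeline:
    speech encoder -> modal adaptor -> embeddings fed to an LLM.
    All names and shapes are hypothetical placeholders."""

    def __init__(self, speech_dim=512, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # Placeholder for a (typically frozen) Whisper-style encoder over mel frames.
        self.speech_encoder = nn.Linear(80, speech_dim)
        # Trained projection that maps speech features into the LLM embedding space.
        self.modal_adaptor = nn.Linear(speech_dim, llm_dim)
        self.embed_tokens = nn.Embedding(vocab_size, llm_dim)
        # Placeholder for a large causal decoder (e.g. a Llama-style model).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=1)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, mel_frames, text_ids):
        # Encode audio and project it into the LLM embedding space.
        speech_emb = self.modal_adaptor(self.speech_encoder(mel_frames))
        text_emb = self.embed_tokens(text_ids)
        # Concatenate speech embeddings with text-instruction embeddings.
        inputs = torch.cat([speech_emb, text_emb], dim=1)
        return self.lm_head(self.llm(inputs))

model = SpeechLanguageModel(speech_dim=32, llm_dim=64, vocab_size=100)
logits = model(torch.randn(1, 10, 80), torch.randint(0, 100, (1, 5)))
print(logits.shape)  # (1, 15, 100): 10 speech positions + 5 text positions
```

The key design point is that only the adaptor needs to learn the cross-modal mapping, so the pretrained encoder and LLM can be reused largely unchanged.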

Our paper makes the following contributions:

  • We build a speech-language multi-modal assistant that can understand and follow speech-and-language instructions, providing a more convenient and natural way for humans to interact with artificial intelligence.
  • We construct and release LLaSM-Audio-Instructions, a large-scale Chinese and English speech-text cross-modal instruction-following dataset.
  • We release the code at https://github.com/LinkSoul-AI/LLaSM.
  • We release the models LLaSM-Chinese-Llama-2-7B and LLaSM-Baichuan-7B.
Demo


    Tips

    Demo Usage Guide

    • Type text into the text box and click the send button on the far right to send a message and start chatting.
    • Click the voice button to start recording; click it again to stop. Then click the send button to send the voice message.
    • Before sending, you can review a recording in the audio preview area; historical voice messages in the chat window can also be replayed.
    • Click the reset button to clear the conversation history.
    • Note: this demo only showcases the capabilities of the LLaSM model and has limited support for topic switching in multi-turn conversations. When changing topics, we recommend clearing the history for a better experience.

    BibTeX

              
    @misc{shu2023llasm,
          title={LLaSM: Large Language and Speech Model},
          author={Yu Shu and Siwei Dong and Guangyao Chen and Wenhao Huang and Ruihua Zhang and Daochen Shi and Qiqi Xiang and Yemin Shi},
          year={2023},
          eprint={2308.15930},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

    Acknowledgement

    This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the open-source projects that gave us access to their models, including Chinese-Llama-2-7B, Whisper, and Baichuan-7B.