Multi-modal large language models have garnered significant interest recently. However, most existing work focuses on vision-language models, which provide strong capabilities for following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world, so it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM), an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. In addition, we release a large speech instruction-following dataset, LLaSM-Audio-Instruction.
Our paper makes the following contributions:
- We propose LLaSM, an end-to-end trained large multi-modal speech-language model capable of following speech-and-language instructions.
- We construct and release LLaSM-Audio-Instruction, a large speech instruction-following dataset.
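To give a concrete, if simplified, picture of the kind of model described in the abstract, the sketch below shows one common way to wire a speech encoder to a language model: a small modal adapter projects speech-encoder features into the LLM's embedding space so that speech and text can be processed as a single token sequence. This is a minimal illustration only; the ModalAdapter class, the dimensions, and the random stand-in tensors are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the paper):
SPEECH_DIM = 1280   # width of a Whisper-style speech encoder output
LLM_DIM = 4096      # embedding width of a 7B Llama-family language model


class ModalAdapter(nn.Module):
    """Projects speech-encoder features into the LLM's embedding space."""

    def __init__(self, speech_dim: int, llm_dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: (batch, frames, speech_dim), e.g. from a frozen speech encoder
        return self.proj(speech_features)


adapter = ModalAdapter(SPEECH_DIM, LLM_DIM)

# Stand-in tensors: in a real pipeline these would come from the speech encoder
# and from the LLM's token-embedding layer, respectively.
speech_features = torch.randn(1, 150, SPEECH_DIM)
text_embeddings = torch.randn(1, 32, LLM_DIM)

speech_embeddings = adapter(speech_features)                         # (1, 150, LLM_DIM)
llm_inputs = torch.cat([speech_embeddings, text_embeddings], dim=1)  # joint input sequence
print(llm_inputs.shape)  # torch.Size([1, 182, 4096])
```

In a setup like this, the speech encoder is typically kept frozen while the adapter (and optionally the language model) is trained end-to-end on speech-and-language instruction data.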
@misc{shu2023llasm,
      title={LLaSM: Large Language and Speech Model},
      author={Yu Shu and Siwei Dong and Guangyao Chen and Wenhao Huang and Ruihua Zhang and Daochen Shi and Qiqi Xiang and Yemin Shi},
      year={2023},
      eprint={2308.15930},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the open-source projects Chinese-Llama-2-7B, Whisper, and Baichuan-7B for giving us access to their models.