<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="LLaSM: Large Language and Speech Model">
<meta name="keywords" content="speech-language, multi-modal, LLM, LLaSM">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>LLaSM: Large Language and Speech Model</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<!-- for LLaSM demo -->
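<!-- recorder.mp3.min.js provides in-browser MP3 recording; waveview.js renders the waveform in the audio preview -->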
<link rel="stylesheet" href="./static/css/bootsrap.min.css">
<link rel="stylesheet" href="./static/css/styles.css">
<script src="./static/js/recorder.mp3.min.js"></script>
<script src="./static/js/waveview.js"></script>
<!-- / for LLaSM demo -->
</head>
<body>
<nav class="navbar" role="navigation" aria-label="main navigation">
<div class="navbar-brand">
<a role="button" class="navbar-burger" aria-label="menu" aria-expanded="false">
<span aria-hidden="true"></span>
<span aria-hidden="true"></span>
<span aria-hidden="true"></span>
</a>
</div>
<div class="navbar-menu">
<div class="navbar-start" style="flex-grow: 1; justify-content: center;">
<a class="navbar-item" href="https://keunhong.com">
<span class="icon">
<i class="fas fa-home"></i>
</span>
</a>
<div class="navbar-item has-dropdown is-hoverable">
<a class="navbar-link">
More Research
</a>
<div class="navbar-dropdown">
<a class="navbar-item" href="https://huggingface.co/spaces/LinkSoul/Chinese-LLaVA" target="_blank">
Chinese-LLaVA
</a>
<a class="navbar-item" href="https://huggingface.co/LinkSoul/Chinese-Llama-2-7b" target="_blank">
Chinese-Llama-2-7B
</a>
</div>
</div>
</div>
</div>
</nav>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">LLaSM: Large Language and Speech Model</h1>
<div class="is-size-5 publication-authors">
<span class="author-block" style="color:#008AD7;font-weight:normal;">
Yu Shu<sup>2</sup>,</span>
<span class="author-block" style="color:#008AD7;font-weight:normal;">
Siwei Dong<sup>2</sup>,</span>
<span class="author-block" style="color:#ed2f09;font-weight:normal;">
Guangyao Chen<sup>1,3</sup>,
</span>
<span class="author-block" style="color:#cc00d7;font-weight:normal;">
Wenhao Huang<sup>4</sup>,
</span>
<span class="author-block" style="color:#19e706;font-weight:normal;">
Rita Zhang,
<!-- Rita Zhang<sup>5</sup>, -->
</span>
<span class="author-block" style="color:#19e706;font-weight:normal;">
Daochen Shi,
<!-- Daochen Shi<sup>5</sup>, -->
</span>
<span class="author-block" style="color:#19e706;font-weight:normal;">
Qiqi Xiang,
<!-- Qiqi Xiang<sup>5</sup>, -->
</span>
<span class="author-block" style="color:#f68946;font-weight:normal;">
Yemin Shi<sup>1*</sup>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block" style="color:#f68946;font-weight:normal;"><sup>1</sup>LinkSoul.AI,</span>
<span class="author-block" style="color:#008AD7;font-weight:normal;"><sup>2</sup>Beijing Academy of Artificial Intelligence, China,</span>
<span class="author-block" style="color:#ed2f09;font-weight:normal;"><sup>3</sup>Peking University, China</span>
<span class="author-block" style="color:#cc00d7;font-weight:normal;"><sup>4</sup>01.ai</span>
</div>
<div>
<span class="author-block"><sup>*</sup>Corresponding author: ymshi@linksoul.ai</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<!-- <span class="link-block">
<a href="" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span> -->
<span class="link-block">
<a href="https://arxiv.org/abs/2308.15930" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Model Link. -->
<span class="link-block">
<a href="https://huggingface.co/LinkSoul/LLaSM-Cllama2" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-atom"></i>
</span>
<span>Model</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/LinkSoul-AI/LLaSM" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Dataset Link. -->
<span class="link-block">
<a href="https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="far fa-images"></i>
</span>
<span>Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Multi-modal large language models have garnered significant interest recently, yet most work focuses on vision-language models that provide strong capabilities for following vision-and-language instructions. However, speech is also an important modality through which humans interact with the world, so it is crucial for a general-purpose assistant to follow multi-modal speech-and-language instructions as well. In this work, we propose the <b>L</b>arge <b>L</b>anguage
<b>a</b>nd <b>S</b>peech <b>M</b>odel (<b>LLaSM</b>), an end-to-end trained large multi-modal
speech-language model with cross-modal conversational abilities, capable of following
speech-and-language instructions. Our early experiments show that <b>LLaSM</b> offers
a more convenient and natural way for humans to interact with artificial intelligence.
We also release a large speech instruction-following dataset, <b>LLaSM-Audio-Instructions</b>.
</p>
<p>
Our paper makes the following contributions:
</p>
<ul>
<li>
We build a speech-language multi-modal assistant that can understand and follow speech-language instructions, providing a more convenient and natural way for humans to interact with artificial intelligence.
</li>
<li>
We construct and release <a href="https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions" target="_blank">LLaSM-Audio-Instructions</a>, a large-scale Chinese and English speech-text cross-modal instruction-following dataset.
</li>
<li>
We release the code at <a href="https://github.com/LinkSoul-AI/LLaSM" target="_blank">https://github.com/LinkSoul-AI/LLaSM</a>.
</li>
<li>
We release the models <a href="https://huggingface.co/LinkSoul/LLaSM-Cllama2" target="_blank">LLaSM-Chinese-Llama-2-7B</a> and <a href="https://huggingface.co/LinkSoul/LLaSM-Baichuan" target="_blank">LLaSM-Baichuan-7B</a>.
</li>
</ul>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>
<section class="hero is-light is-small">
<div class="hero-body">
<h2 class="title is-3" style="text-align: center;">Demo</h2>
<!-- LLaSM Demo -->
<div id="llasaLoading" style="position: absolute; width: 100%; z-index: 1; display: flex; justify-content: center; align-items: center;">
<div style="text-align: center;">
<img src="./images/duck.gif" alt="loading" />
<h3>Loading...</h3>
</div>
</div>
<div class="container" id="llasa" style="opacity: 0;">
<div class="row mt-5 justify-content-center">
<div class="col-md-12 mt-3">
<div id="chat-window" class="card p-2">
<div class="container my-3">
<!-- <div id="info"></div> -->
<div id="results" class="results">
</div>
<fieldset id="temp_audio" style="text-align: center; height: 100px; border: 1.4px solid #ddd;">
<legend style="float: initial;
text-align: initial;
width: initial;
margin-left: 10px;
font-size: initial;">Audio preview</legend>
<div id="waveform" style="text-align: center; height: 50px; width: 100%;"></div>
<audio id="audioPlayer" style="height: 50px; width: 100%; display: none; padding: 0 20px 0 20px;" controls src=""></audio>
</fieldset>
</div>
</div>
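<!-- Message composer: text input plus delete, record, and send controls -->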
<div id="user-input" class="mt-2">
<div class="input-group">
<textarea type="text" id="user-text" style="height: 60px; padding: 10px 150px 5px 10px;" placeholder="Type in your message or press record button to speak..."></textarea>
<div id="input-audio" class="input-group-append p-2">
<button id="delete_button" class="mb-2 p-2">
<img id="delete_img" class="mb-2" src="images/error.png" alt="Del">
</button>
<button id="start_button" class="mb-2 p-2">
<img id="start_img" class="mb-2" src="images/microphone.png" alt="Record">
</button>
<button id="send_button" class="mb-2 p-2">
<img id="send_text_img" class="mb-2" src="images/paper-plane.png" alt="Start">
</button>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- / LLaSM Demo -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Demo Tips. -->
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Tips</h2>
<div class="content has-text-justified">
<h4>
How to use the demo
</h4>
<ul>
<li>
Type a message in the text box and click the send button on the far right to send it and start chatting.
</li>
<li>
Click the record button to start recording and click it again to stop; then click the send button to send the voice message. A simplified sketch of this flow appears after this list.
</li>
<li>
Before a recording is sent, you can check it in the audio preview area; voice messages in the chat history can be replayed as well.
</li>
<li>
Click the reset button to clear the conversation history.
</li>
<li>
Note: this demo is only a showcase of LLaSM's capabilities and has limited support for switching topics within a multi-turn conversation. When changing topics, we recommend clearing the history for a better experience.
</li>
</ul>
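<p>
For the curious, the snippet below sketches the record, preview, and send flow described above. It is an illustrative sketch only, written against the standard browser MediaRecorder API: the demo itself records MP3 audio with recorder.mp3.min.js and draws the waveform with waveview.js, and the <code>sendRecording</code> helper and <code>/api/chat</code> endpoint shown here are hypothetical.
</p>
<pre><code>// Minimal sketch of the record / preview / send flow (standard MediaRecorder
// API, not the demo's actual recorder.mp3.min.js code).
let mediaRecorder;
let chunks = [];
let lastRecording = null;

async function startRecording() {
  // Ask for microphone access and start capturing audio.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(stream);
  mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
  mediaRecorder.onstop = () => {
    // Load the finished recording into the preview player so it can be
    // checked before sending.
    lastRecording = new Blob(chunks, { type: mediaRecorder.mimeType });
    chunks = [];
    document.getElementById('audioPlayer').src = URL.createObjectURL(lastRecording);
  };
  mediaRecorder.start();
}

function stopRecording() {
  // Stops capture and fires the onstop handler above, filling the preview.
  mediaRecorder.stop();
}

function sendRecording() {
  // Hypothetical upload helper; the real transport lives in index_demo.js
  // and the endpoint name here is illustrative only.
  const form = new FormData();
  form.append('audio', lastRecording, 'message.mp3');
  return fetch('/api/chat', { method: 'POST', body: form });
}
</code></pre>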
</div>
</div>
</div>
<!--/ Demo Tips. -->
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">BibTeX</h2>
<pre>
<code>
@misc{shu2023llasm,
      title={LLaSM: Large Language and Speech Model},
      author={Yu Shu and Siwei Dong and Guangyao Chen and Wenhao Huang and Ruihua Zhang and Daochen Shi and Qiqi Xiang and Yemin Shi},
      year={2023},
      eprint={2308.15930},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
</code>
</pre>
</div>
</div>
</div>
</section>
<section class="section" id="Acknowledgement">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title">Acknowledgement</h2>
<p>
This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative
Commons Attribution-ShareAlike 4.0 International License</a>. We also thank the open-source projects that gave us access to their models, including <a href="https://huggingface.co/LinkSoul/Chinese-Llama-2-7b" target="_blank">Chinese-Llama-2-7B</a>, <a href="https://huggingface.co/openai/whisper-large-v2" target="_blank">Whisper</a>, and <a href="https://huggingface.co/baichuan-inc/Baichuan-7B" target="_blank">Baichuan-7B</a>.
</p>
</div>
</div>
</div>
</section>
<!-- for LLaSM demo -->
<script src="./static/js/index_demo.js"></script>
<!-- / for LLaSM demo -->
</body>
</html>