Update README.md (#1)
Update README.md (b0911100145ef4499e79bc5f92e5842d881c7082)
Co-authored-by: mrh <ryanmiao@users.noreply.huggingface.co>

README.md CHANGED

---
license: apache-2.0
---

9 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# 1. Step-Audio-Chat

This repository contains the Multimodal Large Language Model (LLM) component of Step-Audio, a 130-billion-parameter multimodal LLM responsible for understanding and generating human speech. The model is designed to seamlessly integrate speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.
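
As a quick way to get started, the sketch below shows one way to pull the checkpoint files from the Hugging Face Hub with `huggingface_hub`. The repo id `stepfun-ai/Step-Audio-Chat` and the target directory are assumptions for illustration; the full inference pipeline (audio tokenizer, TTS/vocoder, chat frontend) lives in the Step-Audio GitHub repository linked at the end of this card.

```python
# Minimal sketch: download the Step-Audio-Chat weights from the Hugging Face Hub.
# The repo id and local directory below are assumptions, not an official recipe;
# see the stepfun-ai/Step-Audio GitHub repository for the actual inference code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-Chat",  # assumed Hub repo id for this model card
    local_dir="Step-Audio-Chat",           # target directory; the 130B checkpoint is very large
)
print(f"Checkpoint downloaded to: {local_dir}")
```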

## 2. Evaluation

### 2.1 LLM judge metrics (GPT-4o) on [**StepEval-Audio-360**](https://huggingface.co/datasets/stepfun-ai/StepEval-Audio-360)

<table>
  <caption>Comparison of fundamental capabilities of voice chat on the StepEval-Audio-360.</caption>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align:center">Factuality (% ↑)</th>
      <th style="text-align:center">Relevance (% ↑)</th>
      <th style="text-align:center">Chat Score ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GLM4-Voice</td>
      <td style="text-align:center">54.7</td>
      <td style="text-align:center">66.4</td>
      <td style="text-align:center">3.49</td>
    </tr>
    <tr>
      <td>Qwen2-Audio</td>
      <td style="text-align:center">22.6</td>
      <td style="text-align:center">26.3</td>
      <td style="text-align:center">2.27</td>
    </tr>
    <tr>
      <td>Moshi<sup>*</sup></td>
      <td style="text-align:center">1.0</td>
      <td style="text-align:center">0</td>
      <td style="text-align:center">1.49</td>
    </tr>
    <tr>
      <td><strong>Step-Audio-Chat</strong></td>
      <td style="text-align:center"><strong>66.4</strong></td>
      <td style="text-align:center"><strong>75.2</strong></td>
      <td style="text-align:center"><strong>4.11</strong></td>
    </tr>
  </tbody>
</table>

*Note: Moshi results are marked with "\*" and should be considered for reference only.*
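
To look at the benchmark itself, the following is a minimal sketch that loads StepEval-Audio-360 with the `datasets` library and prints whatever splits and columns it exposes; the GPT-4o judging prompts and scoring protocol are documented in the Step-Audio repository rather than reproduced here.

```python
# Minimal sketch: fetch the StepEval-Audio-360 benchmark from the Hugging Face Hub
# and inspect its structure. Only the dataset id (taken from the link above) is used;
# split and column names are printed rather than assumed.
from datasets import load_dataset

eval_data = load_dataset("stepfun-ai/StepEval-Audio-360")
print(eval_data)  # available splits and example counts
for split_name, split in eval_data.items():
    print(split_name, split.column_names)
```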

### 2.2 Public Test Set

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align:center">Llama Question</th>
      <th style="text-align:center">Web Questions</th>
      <th style="text-align:center">TriviaQA*</th>
      <th style="text-align:center">ComplexBench</th>
      <th style="text-align:center">HSK-6</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GLM4-Voice</td>
      <td style="text-align:center">64.7</td>
      <td style="text-align:center">32.2</td>
      <td style="text-align:center">39.1</td>
      <td style="text-align:center">66.0</td>
      <td style="text-align:center">74.0</td>
    </tr>
    <tr>
      <td>Moshi</td>
      <td style="text-align:center">62.3</td>
      <td style="text-align:center">26.6</td>
      <td style="text-align:center">22.8</td>
      <td style="text-align:center">-</td>
      <td style="text-align:center">-</td>
    </tr>
    <tr>
      <td>Freeze-Omni</td>
      <td style="text-align:center">72.0</td>
      <td style="text-align:center">44.7</td>
      <td style="text-align:center">53.9</td>
      <td style="text-align:center">-</td>
      <td style="text-align:center">-</td>
    </tr>
    <tr>
      <td>LUCY</td>
      <td style="text-align:center">59.7</td>
      <td style="text-align:center">29.3</td>
      <td style="text-align:center">27.0</td>
      <td style="text-align:center">-</td>
      <td style="text-align:center">-</td>
    </tr>
    <tr>
      <td>MinMo</td>
      <td style="text-align:center">78.9</td>
      <td style="text-align:center">55.0</td>
      <td style="text-align:center">48.3</td>
      <td style="text-align:center">-</td>
      <td style="text-align:center">-</td>
    </tr>
    <tr>
      <td>Qwen2-Audio</td>
      <td style="text-align:center">52.0</td>
      <td style="text-align:center">27.0</td>
      <td style="text-align:center">37.3</td>
      <td style="text-align:center">54.0</td>
      <td style="text-align:center">-</td>
    </tr>
    <tr>
      <td><strong>Step-Audio-Chat</strong></td>
      <td style="text-align:center"><strong><i>81.0</i></strong></td>
      <td style="text-align:center"><strong>75.1</strong></td>
      <td style="text-align:center"><strong>58.0</strong></td>
      <td style="text-align:center"><strong>74.0</strong></td>
      <td style="text-align:center"><strong>86.0</strong></td>
    </tr>
  </tbody>
</table>

*Note: Results on the TriviaQA dataset (marked with "\*") are for reference only.*

### 2.3 Audio instruction following

<table>
  <thead>
    <tr>
      <th rowspan="2">Category</th>
      <th colspan="2" style="text-align:center">Instruction Following</th>
      <th colspan="2" style="text-align:center">Audio Quality</th>
    </tr>
    <tr>
      <th style="text-align:center">GLM-4-Voice</th>
      <th style="text-align:center">Step-Audio</th>
      <th style="text-align:center">GLM-4-Voice</th>
      <th style="text-align:center">Step-Audio</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Languages</td>
      <td style="text-align:center">1.9</td>
      <td style="text-align:center">3.8</td>
      <td style="text-align:center">2.9</td>
      <td style="text-align:center">3.3</td>
    </tr>
    <tr>
      <td>Role-playing</td>
      <td style="text-align:center">3.8</td>
      <td style="text-align:center">4.2</td>
      <td style="text-align:center">3.2</td>
      <td style="text-align:center">3.6</td>
    </tr>
    <tr>
      <td>Singing / RAP</td>
      <td style="text-align:center">2.1</td>
      <td style="text-align:center">2.4</td>
      <td style="text-align:center">2.4</td>
      <td style="text-align:center">4.0</td>
    </tr>
    <tr>
      <td>Voice Control</td>
      <td style="text-align:center">3.6</td>
      <td style="text-align:center">4.4</td>
      <td style="text-align:center">3.3</td>
      <td style="text-align:center">4.1</td>
    </tr>
  </tbody>
</table>

## 3. More information

For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).