jzx-ai-lab committed on
Commit
1af93f2
1 Parent(s): f08ff09

Update README.md

Files changed (1)
  1. README.md +115 -120
README.md CHANGED
@@ -1,31 +1,34 @@
- ---
- license: apache-2.0
- language:
- - zh
- ---

  [Update]
- 8.29 Created repository, published README & Roadmap (https://github.com/jingzhunxue/flow_mirror)
- 8.31 Released Demo Site (https://voice-playground.91jzx.cn)
- 9.2 Released Inference Code
- 9.12 心流知镜-s-v0.2-checkpoint-20240828

- ## Motivation
- Although text is the dominant form of language on the internet, many scenarios such as classroom teaching and doctor consultations still rely mainly on direct spoken communication. Moreover, young children and people who cannot read or write can communicate and express themselves extensively through listening and speaking, which shows that purely voice-based communication is sufficient for intelligent interaction. Spoken (textless) communication naturally carries rich expressive information, which gives it higher information value than plain ASR text conversion in scenarios such as education and training.
- This project is also inspired by the capabilities shown in OpenAI's GPT-4o release and its demo videos of educational scenarios.

- ## Team
- Zhejiang Jingzhunxue is an Alibaba-backed company focused on education-related software and hardware products (AI-assisted learning devices). The Jingzhunxue AI team is committed to using AI to deliver proactive learning that approaches or even surpasses the human teaching experience, while driving down technical costs so that everyone can afford it.

- ## Background
- To the best of our knowledge, the earliest direct end-to-end speech models trace back to GLSM in Meta's Speechbot series. The following related work provided valuable references and experimental experience for our research:
- - SpiritLM: Nguyen et al. (2024) explored interleaved spoken and written language models. [More Info][1]
- - GLSM: Lakhotia et al. (2021) built a generative spoken language model from raw audio. [More Info][2]
- - AudioLM: Borsos et al. (2023) proposed a language modeling approach to audio generation. [More Info][3]
- - SpeechGPT: Zhang et al. (2023) empowered large language models with cross-modal conversational abilities. [More Info][4]
- - SpeechFlow: Liu et al. (2024) introduced a generative pre-training method for speech with flow matching. [More Info][5]

  [1]: https://arxiv.org/abs/2402.05755 "SpiRit-LM: Interleaved Spoken and Written Language Model"
  [2]: https://arxiv.org/abs/2102.01192 "Generative Spoken Language Modeling from Raw Audio"
@@ -34,127 +37,119 @@ language:
  [5]: https://arxiv.org/abs/2310.16338 "Generative Pre-training for Speech with Flow Matching"

- ## Methodology
- Overall, we treat the pre-training of an end-to-end speech model as learning a representation of the semantic and acoustic information contained in speech. Initializing from a text LLM additionally opens the possibility of learning a unified text & audio representation and greatly reduces the engineering effort, so we designed the overall training in the following two steps.
-
- Given the lack of self-supervised pre-trained speech encoders for Chinese, especially ones covering educational vocabulary, we developed a semantics-oriented self-supervised speech encoder based on the approach of Meta's HuBERT paper, and, following the RVQVAE approach, trained an acoustics-oriented audio codec (9 codebook layers) from scratch on a large amount of Chinese speech data.
- ![Self-supervised audio codec modeling](assets/flow_mirror_s_v02_ssl_codec.png)

- Based on these self-supervised pre-trained codecs, we used qwen2-series LLM models as initialization parameters. As shown in the figure, the overall design is asymmetric: the input consists mainly of semantic units, while the output produces acoustic units and text simultaneously.
- ![Overall architecture](assets/flow_mirror_s_v02_architecture.png)

- FlowMirror-s v0.1 and v0.2 were pre-trained end-to-end on 20,000 and 50,000 hours of speech data, respectively, and support ASR, TTS, speech continuation, and spoken dialogue. These results preliminarily validate the feasibility of end-to-end speech models and show the scalability of the network design, suggesting stronger capabilities in later versions.
 
- ## Evaluation
- For qualitative audio examples, see the following dialogues:
  ```text
  example_1 = "人在没有目标的时候才应该有压力"
  example_2 = "这个阶段需要学习什么知识?"
  example_3 = "怎么把事情做对要花时间去培养"
  example_4 = "这里的药材长势不错"
  ```
- ### Dialogue Voice Examples
- **Dialogue example 1:** "People should only feel pressure when they lack a goal."
- [Input](assets/question_example_1_MP3.mp3)
- [Output](assets/answer_example_1_MP3.mp3)
-
- **Dialogue example 2:** "The growth of the herbs here looks promising."
- [Input](assets/question_example_4_MP3.mp3)
- [Output](assets/answer_example_4_MP3.mp3)
-
- ### Demo Site
- A live demo is deployed at https://voice-playground.91jzx.cn. Due to limited resources, it supports fewer than 10 concurrent users. The deployed checkpoint is 心流知镜-s v0.2-240822-checkpoint and will be updated to the latest v0.2 and v0.3 versions later.
-
- ### Multi-task Evaluation
- Here, the ASR sub-task is treated as a measure of how well the learnable semantic information in speech is captured by the representation during pre-training. In the first stage of pre-training, the current checkpoint performs roughly at the level of Whisper-small on the ASR sub-task. The evaluation data consists of public-domain online speech that was not used for training, plus Wenet data that never participated in the end-to-end training; 1,024 utterances were randomly sampled from each of these two sources for evaluation.
- | Dataset Source | Samples | Chinese CER/WER |
- |------------------------------------|-----------------|-----------------|
- | Public-domain random sample - test | 1,024 (sampled) | 12.55% |
- | WenetSpeech - test | 1,024 (sampled) | 24.23% |

- Because this released checkpoint is from an early epoch, we expect the alignment between speech semantics and text to improve substantially as training data and time increase, without adding any model parameters.
 
 
 
- [TODO]
- AudioBench evaluation data to be added
- PS: a Chinese version of AudioBench is urgently needed for better comprehensive evaluation

-
- ## Limitations and Drawbacks
- * In the three data-stage training we did not use regular text LLM pre-training data, so we expect some degradation on MMLU compared with the original qwen2 model. Later versions will try to reduce this degradation.
- * The current version only controls the speaker's timbre; other speech attributes such as emotion, prosody, speaking rate, pauses, non-verbal sounds, and pitch have not been specifically tuned.
- * The dialogue sometimes gives irrelevant answers or drifts to the wrong topic (e.g., misunderstandings caused by speech-specific homophones). Given the current 1.5B parameter budget, the uneven topic distribution of the pre-training speech data, and data-preprocessing bottlenecks, we believe this will improve substantially as more, and more targeted, data is added.
- * Multi-turn dialogue is not yet supported in this version.
- * There is still very large room to improve inference speed. TTFB on an L20 GPU is currently about 670 ms; after TensorRT adaptation and other common techniques, we estimate more than a tenfold speedup in overall throughput even without quantization.
-
-
- ## License
- Because the WenetSpeech dataset was used in the self-supervised encoder for v0.1-v0.3, the released self-supervised pre-trained speech encoder and end-to-end checkpoint weights are restricted to academic use. The code is released under Apache 2.0.
- To promote the exploration of speech models for Chinese and other Asian languages, we will curate the collected public-domain data, exclude the Wenet data, train a new version, and release a self-supervised encoder and codec that can be used more freely.
-
-
- ## Roadmap
- The planned work for this project is as follows
-
- ### 2024-8
- **心流知镜-s v0.1 & 0.2 (500M-1.5B parameters)**
- - [x] Chinese self-supervised audio codec
- - [x] 心流知镜-s v0.1 & v0.2 (500M-1.5B parameters)
- - [x] Demo website based on WebRTC
- - [x] Dual output: speech & text

- ### 2024-9
- **心流知镜-s v0.2**
- - [x] Open-source checkpoint and inference code
- - [ ] Accelerated inference version
- - [ ] Support for on-device deployment
- - [ ] Release self-supervised speech encoder and audio codec weights and code for academic use

- ### 2024-10
- **心流知镜-s v0.3**
- - [ ] Enhanced teaching for primary and secondary school subjects
- - [ ] Speaker voice selection for dialogue
- - [ ] Expressive speech (emotion, volume, pitch, speaking rate, etc.)
- - [ ] Construction of a Chinese-focused AudioBench evaluation dataset

- ### 2024-11
- **心流知镜-s v0.3 - multilingual version**
- - [ ] Support for major languages of East Asia and the world
- - [ ] Multilingual interactive dialogue

- ### 2024-12
- **心流知镜-s v0.4**
- - [ ] High-quality full-duplex dialogue for educational scenarios
- - [ ] Larger model sizes

- ### 2025-1
- **心流知镜-s v0.5**
- - [ ] Support for Chinese regional dialects and accents

- ### 2025-3
- **心流知镜-s1**
- - [ ] Release larger model sizes
- - [ ] Extension to visual capabilities
-
-
- ## Recruitment
- We are hiring in all of the following directions, including research group leader roles; interested candidates are welcome to get in touch
- - Speech ASR/TTS/dialogue SLLM
- - Role-playing LLM models
- - Multimodal model inference acceleration
- - Visual understanding, document intelligence
- - General framework for character video generation
-
-
- ## Community
- DingTalk group: 90720015617
- <img src="assets/dingding_qrcode.png" alt="DingTalk technical group QR code" width="200"/>
- ---
- license: apache-2.0
- ---
 
1
+ ([简体中文](./README_zh.md)|English)
+
+ [![huggingface](https://img.shields.io/badge/huggingface-ckpt-yellow)](https://huggingface.co/jzx-ai-lab/flow_mirror)
+ [![modelscope](https://img.shields.io/badge/modelscope-ckpt-purple)](https://www.modelscope.cn/models/jzx-ai-lab/Flow_mirror)
+ [![github](https://img.shields.io/badge/Github-code-black)](https://github.com/jingzhunxue/flow_mirror)
+ [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
+
  [Update]
+ 8.29: Created repository, published README & Roadmap
+ 8.31: Released Demo Site (https://voice-playground.91jzx.cn)
+ 9.02: Released Inference Code
+ 9.12: Released FlowMirror-s-v0.2-checkpoint-20240828

+ ## Motivation
+ While text remains the dominant form of language on the internet, many scenarios, such as teaching and medical consultations, still rely on direct verbal communication. Moreover, young children and individuals without literacy skills can engage in extensive communication and expression through listening and speaking, demonstrating that pure voice-based communication is sufficient for intelligent interaction. Spoken (textless) communication inherently contains rich expressive information, making it more valuable than purely ASR-converted text in scenarios like education and training.
+
+ Additionally, this project draws inspiration from the capabilities demonstrated by OpenAI's GPT-4o and the educational use cases showcased in its demo videos.

+ ## Team
+ Zhejiang Jingzhunxue is an Alibaba-backed company focused on education-related software and hardware products (AI-assisted learning devices). The AI team at Jingzhunxue is dedicated to achieving proactive learning experiences comparable to or surpassing human education through AI technologies, while striving to reduce technical costs to make these solutions affordable for everyone.
+
+ ## Background
+ To the best of our knowledge, the earliest end-to-end voice models originated from GLSM in Meta's Speechbot series. Several related research papers have provided valuable references and experimental experience for our work:
+ - SpiritLM: Nguyen et al. (2024) explored the interleaving of spoken and written language models. [More Info][1]
+ - GLSM: Lakhotia et al. (2021) developed a generative spoken language model from raw audio. [More Info][2]
+ - AudioLM: Borsos et al. (2023) proposed a language modeling approach to audio generation. [More Info][3]
+ - SpeechGPT: Zhang et al. (2023) enhanced the cross-modal conversational capabilities of large language models. [More Info][4]
+ - SpeechFlow: Liu et al. (2024) introduced a speech generation pre-training method using flow matching. [More Info][5]

  [1]: https://arxiv.org/abs/2402.05755 "SpiRit-LM: Interleaved Spoken and Written Language Model"
  [2]: https://arxiv.org/abs/2102.01192 "Generative Spoken Language Modeling from Raw Audio"

  [5]: https://arxiv.org/abs/2310.16338 "Generative Pre-training for Speech with Flow Matching"

+ ## Methodology
+ Overall, we view the pre-training of end-to-end voice models as a process of learning representations that capture both the semantic and the acoustic information carried in speech. Initializing from a text-based LLM opens the possibility of learning unified text & audio representations and significantly reduces engineering complexity. We therefore designed the overall training process in the two stages outlined below.

+ Due to the lack of self-supervised pre-trained speech encoders supporting Chinese, particularly educational vocabulary, we developed a self-supervised speech encoder focused on semantic information, following Meta's HuBERT paper. Drawing on the RVQVAE approach, we also trained an audio codec focused on acoustic information (9 codebook layers) from scratch using extensive Chinese speech data.
+ ![Self-supervised Audio Codec Modeling](assets/flow_mirror_s_v02_ssl_codec.png)
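The acoustic codec described above follows an RVQ-style design with 9 codebook layers. As a minimal, illustrative sketch of how residual vector quantization turns frame embeddings into stacked discrete codes (the random codebooks, sizes, and dimensions below are assumptions, not the project's actual codec):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only: 9 codebooks (as in the README),
# 1024 entries each, 256-dimensional frame embeddings.
NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 9, 1024, 256
codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, DIM)).astype(np.float32)

def rvq_encode(frames: np.ndarray) -> np.ndarray:
    """Quantize (T, DIM) frame embeddings into (T, 9) integer codes.

    Each codebook quantizes the residual left by the previous level,
    which is the core idea of a residual vector-quantized codec.
    """
    residual = frames.copy()
    codes = np.zeros((frames.shape[0], NUM_CODEBOOKS), dtype=np.int64)
    for level, cb in enumerate(codebooks):
        # Nearest codebook entry per frame (squared Euclidean distance).
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes[:, level] = idx
        residual = residual - cb[idx]  # pass the residual to the next level
    return codes

def rvq_decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct frame embeddings by summing the selected entries of all levels."""
    return sum(codebooks[level][codes[:, level]] for level in range(NUM_CODEBOOKS))

frames = rng.normal(size=(50, DIM)).astype(np.float32)  # 50 dummy frames
codes = rvq_encode(frames)
recon = rvq_decode(codes)
print(codes.shape, float(((frames - recon) ** 2).mean()))
```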
 
+ Based on these self-supervised pre-trained codecs, we used the qwen2-series LLM models as initialization parameters. As shown in the figure, we adopted an asymmetric structure: the input consists primarily of semantic units, while the output produces acoustic units and text simultaneously.
+ ![Overall Architecture](assets/flow_mirror_s_v02_architecture.png)
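Purely as an illustration of this asymmetric interface (the token names and unit values below are assumptions and do not reflect FlowMirror's actual vocabulary or formatting), the input side can be pictured as a sequence of semantic-unit tokens while the output carries text and acoustic codes together:

```python
# Hypothetical token-level view of the asymmetric interface described above.
# Token names and unit values are illustrative assumptions only.
def semantic_token(unit: int) -> str:
    return f"<su_{unit}>"  # input side: user speech as discrete semantic units

def acoustic_token(level: int, code: int) -> str:
    return f"<au_{level}_{code}>"  # output side: RVQ acoustic codes for synthesis

# Input sequence: speech only, rendered as semantic-unit tokens.
input_tokens = [semantic_token(u) for u in (17, 42, 42, 301, 55)]

# Output: the model emits a text stream and an acoustic-unit stream together,
# so one forward pass yields both the textual answer and the audio codes.
output_text = "你好"
output_audio_tokens = [acoustic_token(level, code) for level, code in [(0, 812), (1, 95), (2, 3)]]

print("input :", " ".join(input_tokens))
print("output:", output_text, "|", " ".join(output_audio_tokens))
```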
 
+ FlowMirror-s v0.1 and v0.2 were pre-trained with 20,000 hours and 50,000 hours of speech data, respectively, and support tasks such as ASR, TTS, speech continuation, and voice dialogue. These experimental results preliminarily verify the feasibility of end-to-end voice models and demonstrate the scalability of the network design, suggesting that the model will achieve even stronger capabilities in future versions.

+ ## Evaluation
+ Qualitative audio examples are given in the following dialogues:
  ```text
  example_1 = "人在没有目标的时候才应该有压力"
  example_2 = "这个阶段需要学习什么知识?"
  example_3 = "怎么把事情做对要花时间去培养"
  example_4 = "这里的药材长势不错"
  ```

+ ### Dialogue Voice Examples
+ **Example 1:** "People should only feel pressure when they lack a goal."
+ [Input](assets/question_example_1_MP3.mp3)
+ [Output](assets/answer_example_1_MP3.mp3)
+
+ **Example 2:** "The growth of the herbs here looks promising."
+ [Input](assets/question_example_4_MP3.mp3)
+ [Output](assets/answer_example_4_MP3.mp3)
+
+ ### Demo Site
+ The demo is deployed at https://voice-playground.91jzx.cn and, due to limited resources, supports fewer than 10 concurrent users. The checkpoint currently deployed is 心流知镜-s v0.2-240822-checkpoint; the deployment will be updated to the latest v0.2 and v0.3 versions.
+
+ ### Multi-task Evaluation
+ Here, the ASR sub-task is treated as a measure of how well the learnable semantic information in speech is captured by the representation during pre-training. In the first stage of pre-training, the current checkpoint achieves ASR performance roughly equivalent to Whisper-small. The evaluation data consists of publicly available online speech that was not used for training, and WenetSpeech data that did not participate in end-to-end training at all; 1,024 utterances were randomly sampled from each source for evaluation (a reference CER computation is sketched below).
+ | Dataset Source | Samples | Chinese CER/WER |
+ |------------------------------------|-----------------|-----------------|
+ | Public-domain random sample - test | 1,024 (sampled) | 12.55% |
+ | WenetSpeech - test | 1,024 (sampled) | 24.23% |
+
+ Since this checkpoint is from an early epoch, we expect that with more training data and time, the alignment between speech semantics and text will improve significantly without increasing the number of model parameters.
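For reference, character error rate (CER) for Chinese ASR is the character-level edit distance divided by the reference length. A minimal sketch (not the project's evaluation script; punctuation and whitespace handling are simplified assumptions):

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (rolling-array DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution
            prev, dp[j] = dp[j], cur
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over whitespace-stripped Chinese text."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("这里的药材长势不错", "这里的药材长得不错"))  # one substitution over 9 chars -> ~0.111
```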
+ **[TODO]**
+ Evaluation data from AudioBench will be added.
+ Note: There is an urgent need to construct a Chinese version of AudioBench for more comprehensive evaluations.
+
+ ## Limitations and Drawbacks
+ * During the three-stage training process, we did not use conventional text LLM pre-training data. Compared to the original qwen2 model, this may lead to decreased performance on MMLU evaluations. Future versions will aim to mitigate this.
+ * The current version only controls the speaker's voice timbre. Other speech characteristics such as emotion, prosody, speaking rate, pauses, non-verbal sounds, and pitch have not been specifically tuned.
+ * The dialogue responses may occasionally be irrelevant or address the wrong topic (e.g., misinterpretations caused by speech-specific homophones). Given the limited parameter count (1.5B), the uneven topic distribution of the pre-training speech data, and bottlenecks in data preprocessing, we anticipate significant improvement as more, and more targeted, data is added.
+ * Multi-turn conversations are not yet supported in the current version.
+ * There is substantial room for improving inference speed. The current TTFB on an L20 GPU is around 670 ms. We expect that with TensorRT adaptation and other popular techniques, overall throughput can be improved more than tenfold, even without quantization (a simple way to measure TTFB is sketched below).
+
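Here, TTFB refers to the delay between sending a request and receiving the first byte of streamed audio. A minimal way to measure it against a streaming endpoint is sketched below; the URL and payload are placeholders, not the project's actual API:

```python
import time

import requests  # any HTTP client with streaming support works


def measure_ttfb(url: str, payload: dict) -> float:
    """Return the time to first byte, in milliseconds, for a streaming POST."""
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1))  # block until the first byte arrives
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    ttfb_ms = measure_ttfb(
        "http://localhost:8000/generate",  # placeholder endpoint
        {"audio_path": "question_example_1_MP3.mp3"},  # placeholder payload
    )
    print(f"TTFB: {ttfb_ms:.0f} ms")
```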
+ ## License
+ Since WenetSpeech data was used in the self-supervised encoder for v0.1-v0.3, the released self-supervised pre-trained speech encoder and end-to-end checkpoint weights are limited to academic use. The code is licensed under Apache 2.0.
+ To further promote the exploration of speech models for Chinese and other Asian languages, we plan to train a new version on our curated public-domain data with the Wenet data excluded, and release a self-supervised encoder and codec that can be used more freely.
+
+ ## Roadmap
+ The project is planned as follows:
+
+ ### August 2024
+ **心流知镜-s v0.1 & 0.2 (500M-1.5B parameters)**
+ - [x] Chinese self-supervised audio codec
+ - [x] 心流知镜-s v0.1 & v0.2 (500M-1.5B parameters)
+ - [x] Demo website based on WebRTC
+ - [x] Dual output: speech & text
 

+ ### September 2024
+ **心流知镜-s v0.2**
+ - [x] Open-source [checkpoint](https://huggingface.co/jzx-ai-lab/flow_mirror) and inference code
+ - [ ] Accelerated inference version
+ - [ ] Support for on-device deployment
+ - [ ] Release self-supervised speech encoder and audio codec weights for academic use

+ ### October 2024
+ **心流知镜-s v0.3**
+ - [ ] Enhanced for primary and secondary school subject teaching
+ - [ ] Support for speaker voice selection in dialogues
+ - [ ] Expressive speech (emotion, volume, pitch, speech rate, etc.)
+ - [ ] Construction of a Chinese-focused AudioBench evaluation dataset

+ ### November 2024
+ **心流知镜-s v0.3 - Multilingual Version**
+ - [ ] Support for major languages in East Asia and globally
+ - [ ] Support for multilingual interactive dialogues

+ ### December 2024
+ **心流知镜-s v0.4**
+ - [ ] Support for high-quality full-duplex dialogue in educational scenarios
+ - [ ] Larger model sizes

+ ### January 2025
+ **心流知镜-s v0.5**
+ - [ ] Support for various Chinese dialects and accents

+ ### March 2025
+ **心流知镜-s1**
+ - [ ] Release of larger model sizes
+ - [ ] Expansion to visual capabilities
+
+ ## Recruitment
+ We are hiring for the following areas, including research group leader roles. Interested candidates are welcome to get in touch:
+ - Speech ASR/TTS/dialogue SLLM
+ - Role-playing LLM models
+ - Multimodal model inference acceleration
+ - Visual understanding and document intelligence
+ - General framework for character video generation
+
+ ## Community
+ DingTalk Group: 90720015617
+ <img src="assets/dingding_qrcode.png" alt="DingTalk Technical Group QR Code" width="200"/>