lysanderism committed 38d9e4e (verified) · 1 parent: e45cb1f

Update README.md

Files changed (1): README.md (+74 -75)

---
license: apache-2.0
task_categories:
- audio-classification
- automatic-speech-recognition
- question-answering
tags:
- Audio
- Large Audio Language Models
language:
- en
metrics:
- F1
- IOU
- accuracy
---

## 🚀🚀 TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models

<div style='display:flex; gap: 0.25rem; '>
<a href='https://arxiv.org/pdf/.pdf'><img src='https://img.shields.io/badge/paper-PDF-green'></a>
<a href='https://huggingface.co/lysanderism/TimeAudio'><img src='https://img.shields.io/badge/huggingface-checkpoint-yellow'></a>
</div>

Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to perceiving short audio, which constrains their capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data.

To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features in absolute time information. Moreover, to achieve end-to-end long audio understanding, we introduce a segment-level token merging module that substantially reduces audio token redundancy and improves the efficiency of information extraction. Finally, because suitable datasets and evaluation metrics are lacking, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate fine-grained performance.
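
To make the absolute time-aware encoding concrete, below is a minimal sketch of one way to ground frame features in absolute time: a sinusoidal encoding keyed to each frame's timestamp in seconds, added to the acoustic features. The function name, the 25 Hz frame rate, and the additive combination are illustrative assumptions, not the released implementation.

```python
import torch

def absolute_time_encoding(num_frames: int, dim: int, frame_rate_hz: float = 25.0) -> torch.Tensor:
    """Sinusoidal encoding of each frame's absolute timestamp in seconds.

    Sketch only: the frame rate and frequency bands are assumed values,
    not TimeAudio's actual configuration.
    """
    # Absolute timestamp (in seconds) of every acoustic frame.
    t = torch.arange(num_frames, dtype=torch.float32) / frame_rate_hz            # (T,)
    # Transformer-style frequency bands spread across the feature dimension.
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = t[:, None] * inv_freq[None, :]                                      # (T, dim/2)
    enc = torch.zeros(num_frames, dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

# Ground acoustic features in absolute time by adding the encoding.
features = torch.randn(750, 1024)                  # e.g. 30 s of audio at 25 frames/s
features = features + absolute_time_encoding(750, 1024)
```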

## Method

TimeAudio builds on the architecture of SALMONN. Specifically, TimeAudio consists of four components that process raw audio: a sliding audio encoder, a window Q-Former, a segment-level token merging module, and an LLM.

<div align=center><img src="img/overview.png" height="100%" width="92%"/></div>
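
As a rough illustration of what segment-level token merging can look like, the sketch below collapses runs of adjacent, highly similar audio tokens into their running mean, shrinking the sequence before it reaches the LLM. The cosine-similarity rule and the 0.9 threshold are assumptions for illustration; the actual module's merging criterion and segment boundaries may differ.

```python
import torch
import torch.nn.functional as F

def merge_segment_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Merge runs of adjacent, highly similar audio tokens by averaging.

    Illustrative sketch only; the similarity rule and threshold are assumptions.
    tokens: (T, D) sequence of audio tokens for one segment.
    """
    merged = [tokens[0]]
    count = 1  # how many raw tokens the current merged token averages over
    for tok in tokens[1:]:
        sim = F.cosine_similarity(merged[-1], tok, dim=0)
        if sim > threshold:
            # Fold the new token into the running mean of the current group.
            merged[-1] = (merged[-1] * count + tok) / (count + 1)
            count += 1
        else:
            merged.append(tok)
            count = 1
    return torch.stack(merged)  # (T', D) with T' <= T

segment = torch.randn(100, 1024)
print(merge_segment_tokens(segment).shape)  # fewer than 100 tokens if neighbors are similar
```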

## Comparison

Unlike traditional speech and audio processing tasks such as speech recognition and audio captioning, fine-grained tasks require both semantics and timestamps as output. Below are examples of failure cases from Qwen2-Audio and Qwen2-Audio-R1 on such tasks.

<div align=center><img src="img/case.png" height="100%" width="70%"/></div>
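
Scoring these fine-grained outputs requires a temporal overlap metric. Below is a minimal sketch of segment-level IoU (intersection over union of predicted and reference time spans), one of the metrics listed in the front matter; the function name and single-segment framing are illustrative choices, since the paper's exact evaluation protocol is not reproduced here.

```python
def temporal_iou(pred: tuple[float, float], ref: tuple[float, float]) -> float:
    """IoU between a predicted and a reference time span, in seconds.

    Sketch of the standard definition; the paper's exact protocol
    (e.g. IoU thresholds for F1) may differ.
    """
    (ps, pe), (rs, re) = pred, ref
    intersection = max(0.0, min(pe, re) - max(ps, rs))
    union = (pe - ps) + (re - rs) - intersection
    return intersection / union if union > 0 else 0.0

# Example: prediction [3.0 s, 7.5 s] vs. ground truth [4.0 s, 8.0 s].
print(temporal_iou((3.0, 7.5), (4.0, 8.0)))  # 0.7: overlap 3.5 s / union 5.0 s
```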

## How to run inference in the CLI

Set up the following dependencies:

1. Our environment uses Python 3.10.16; the other required packages can be installed with `pip install -r requirements.txt`.
2. Download [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2/tree/main) to `whisper_path`.
3. Download [Fine-tuned BEATs_iter3+ (AS2M) (cpt2)](https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D) to `beats_path`.
4. Download [Vicuna 7B v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main) to `vicuna_path`.
5. Download [TimeAudio](https://huggingface.co/lysanderism/TimeAudio/tree/main) to `ckpt_path`.
6. Run `python3 cli_inference.py --ckpt_path xxx --whisper_path xxx --beats_path xxx --vicuna_path xxx` to start CLI inference (see the sketch after this list). Make sure your GPU has more than 40 GB of memory; if it has less (e.g., only 24 GB), you can quantize the model with the `--low_resource` flag to reduce memory usage.
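
For convenience, here is a small Python wrapper around the same invocation; the directory names are placeholders you should replace with wherever you downloaded each component, and the script assumes it is run from the repository root.

```python
import subprocess

# Placeholder paths: point these at your own download locations.
paths = {
    "--ckpt_path": "checkpoints/TimeAudio",
    "--whisper_path": "checkpoints/whisper-large-v2",
    "--beats_path": "checkpoints/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt",
    "--vicuna_path": "checkpoints/vicuna-7b-v1.5",
}

cmd = ["python3", "cli_inference.py"]
for flag, value in paths.items():
    cmd += [flag, value]
# cmd.append("--low_resource")  # uncomment on GPUs with less than ~40 GB of memory

subprocess.run(cmd, check=True)
```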

## Launch a Demo

Same as **How to run inference in the CLI**, steps 1-5.


## Citation
If you find TimeAudio useful, please cite our paper:
```
@article{,
title={TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models},
author={Hualei Wang and Yiming Li and Shuo Ma and Hong Liu and Xiangdong Wang},
journal={arXiv preprint arXiv:},
year={2025},
url={https://arxiv.org/abs/}
}
```

## 🙏 Acknowledgments

We gratefully acknowledge the creators of:

- SALMONN (Tang et al.)
- Qwen2-Audio (Chu et al.)