---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-0804 (4bit, bitsandbytes)

The bnb-4bit weights for [mispeech/midashenglm-7b-0804-fp32](https://huggingface.co/mispeech/midashenglm-7b-0804-fp32).

**Note**: This is a basic 4-bit quantization produced with bitsandbytes. For better accuracy, we recommend our [GPTQ-quantized version](https://huggingface.co/mispeech/midashenglm-7b-0804-w4a16-gptq), which maintains higher quality while still providing significant memory savings.
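For a rough sense of what 4-bit quantization buys, here is a weights-only back-of-envelope estimate for a nominal 7B-parameter model (it ignores quantization constants, activations, and the KV cache, so real memory usage will be somewhat higher):

```python
# Weights-only memory estimate for a nominal 7B-parameter model.
params = 7_000_000_000

fp32_gb = params * 4 / 1024**3    # 4 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4 bits = 0.5 bytes per parameter

print(f"fp32: ~{fp32_gb:.1f} GiB, 4-bit: ~{int4_gb:.1f} GiB")
```

Quantizing from fp32 to 4-bit shrinks the weights by a factor of 8, from roughly 26 GiB to just over 3 GiB.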
## Usage
### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-0804-4bit-bnb"  # "mispeech/midashenglm-7b-0804-w4a16-gptq" is recommended for better accuracy
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
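This repository already ships pre-quantized weights, so no quantization config is needed at load time. If you instead wanted to quantize the fp32 checkpoint on the fly, a sketch using transformers' `BitsAndBytesConfig` might look like the following; the `nf4`/compute-dtype settings here are illustrative assumptions, not the settings used to produce this repository's weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit quantization config; these choices are assumptions,
# not necessarily what was used to build midashenglm-7b-0804-4bit-bnb.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-fp32",
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
    trust_remote_code=True,
)
```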
### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```