---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-0804 (4bit, bitsandbytes)

The bnb-4bit weights for [mispeech/midashenglm-7b-0804-fp32](https://huggingface.co/mispeech/midashenglm-7b-0804-fp32).

**Note**: This is a basic 4-bit quantization produced with bitsandbytes. For better accuracy, we recommend our [GPTQ-quantized version](https://huggingface.co/mispeech/midashenglm-7b-0804-w4a16-gptq), which maintains higher quality while still providing significant memory savings.
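For a rough sense of what 4-bit quantization buys, here is a weights-only back-of-envelope estimate for a nominal 7B-parameter model (it ignores quantization constants, activations, and the KV cache, so real memory usage will be somewhat higher):

```python
# Weights-only memory estimate for a nominal 7B-parameter model.
params = 7_000_000_000

fp32_gb = params * 4 / 1024**3    # 4 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4 bits = 0.5 bytes per parameter

print(f"fp32: ~{fp32_gb:.1f} GiB, 4-bit: ~{int4_gb:.1f} GiB")
```

Quantizing from fp32 to 4-bit shrinks the weights by a factor of 8, from roughly 26 GiB to just over 3 GiB.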
## Usage
### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-0804-4bit-bnb"  # "mispeech/midashenglm-7b-0804-w4a16-gptq" is recommended for better accuracy
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
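This repository already ships pre-quantized weights, so no quantization config is needed at load time. If you instead wanted to quantize the fp32 checkpoint on the fly, a sketch using transformers' `BitsAndBytesConfig` might look like the following; the `nf4`/compute-dtype settings here are illustrative assumptions, not the settings used to produce this repository's weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit quantization config; these choices are assumptions,
# not necessarily what was used to build midashenglm-7b-0804-4bit-bnb.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-fp32",
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
    trust_remote_code=True,
)
```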
### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```