SaraAlthubaiti committed
Commit 1954827 · verified · 1 Parent(s): 308fd89

Update README.md

Files changed (1)
  1. README.md +26 -2
README.md CHANGED
@@ -11,6 +11,30 @@ metrics:
  - wer
  - accuracy
  base_model:
- - meta-llama/Llama-3.2-1B
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  pipeline_tag: audio-text-to-text
- ---
+ ---
+ # TinyOctopus: Bilingual Audio Language Model 🐙🔊
+
+ ## 📢 Overview
+ **TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** that takes audio as input and generates text. It uses **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** for alignment, and **DeepSeek 1.5B** for text generation. TinyOctopus is optimized for tasks such as:
+
+ - **Bilingual Automatic Speech Recognition (ASR)** 🗣️
+ - **Speech Translation** 🌍
+ - **Gender Identification** 👨🧑
+ - **Dialect Identification**
+
+ TinyOctopus follows the architecture outlined below:
+
+ ## 🏗 Model Architecture
+ ### **TinyOctopus integrates:**
+ 1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
+ 2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations.
+ 3. **DeepSeek 1.5B** as the core language model for text generation.
+
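The README does not say how the cross-attention projection layer is built. The sketch below shows one common design for such a bridge, assumed here purely for illustration: a small set of learnable query vectors attends over the audio encoder's frame features, and the result is a fixed-length sequence in the language model's embedding space. The class name, the single-head layout, the random initialisation, and the toy dimensions are all hypothetical. It uses only the Python standard library so it runs anywhere; a real implementation would use a tensor library.

```python
import math
import random


def project(rows, weight):
    """Multiply each row vector (length d_in) by a d_in x d_out weight matrix."""
    return [[sum(x * w for x, w in zip(row, column)) for column in zip(*weight)]
            for row in rows]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(n_rows, n_cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


class CrossAttentionProjector:
    """Hypothetical single-head cross-attention bridge (not the released code)."""

    def __init__(self, audio_dim, llm_dim, num_queries, seed=0):
        rng = random.Random(seed)
        self.llm_dim = llm_dim
        # One learnable query per output vector handed to the language model.
        self.queries = random_matrix(num_queries, llm_dim, rng)
        # Keys and values map audio-encoder features into the LLM's space.
        self.w_key = random_matrix(audio_dim, llm_dim, rng)
        self.w_value = random_matrix(audio_dim, llm_dim, rng)

    def attention_weights(self, audio_frames):
        """Return one probability distribution over frames per query."""
        keys = project(audio_frames, self.w_key)
        scale = math.sqrt(self.llm_dim)
        return [softmax([sum(q * k for q, k in zip(query, key)) / scale
                         for key in keys])
                for query in self.queries]

    def __call__(self, audio_frames):
        values = project(audio_frames, self.w_value)
        weights = self.attention_weights(audio_frames)
        # Each output row is a weighted average of the projected frames.
        return [[sum(w * value[d] for w, value in zip(row, values))
                 for d in range(self.llm_dim)]
                for row in weights]


# Toy usage: 20 audio frames of width 8 become 4 vectors of width 6.
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
projector = CrossAttentionProjector(audio_dim=8, llm_dim=6, num_queries=4)
audio_prefix = projector(frames)
print(len(audio_prefix), len(audio_prefix[0]))
```

Under this assumed design, the number of vectors passed to the language model depends on the number of queries rather than on the audio length, which keeps the language model's input short for long recordings.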
+ ## 📂 Dataset
+ The model has been trained on multiple datasets to optimize its performance across different tasks:
+
+ - **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is described by its authors as the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16 kHz** from the **Al Jazeera news channel**, with lightly supervised transcriptions aligned to the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, and speaker information**, among other annotations. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. A **130M-word language model dataset** is also available to aid language modeling. Speech recognition models trained on QASR achieve a competitive **word error rate (WER)** compared with the earlier MGB-2 corpus, and the corpus has been used for downstream tasks such as **Named Entity Recognition (NER)** and **punctuation restoration**.
+
+ - **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset** collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5 s), medium (5-20 s), and long (>20 s) speech segments** for detailed evaluation. ADI17 supports training state-of-the-art **dialect identification** systems and provides a robust evaluation platform. It has also been benchmarked under **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
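
Two quantities in the dataset descriptions above are easy to pin down in code: the word error rate (WER) and the ADI17 duration buckets. The following is a minimal sketch, not code from the TinyOctopus repository. Both function names are hypothetical, whitespace tokenisation is a simplification (real evaluations usually normalise the text first), and treating exactly 5 s and exactly 20 s as "medium" is one reading of the ranges quoted above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def adi17_duration_bucket(seconds: float) -> str:
    """Map a segment duration to the short / medium / long categories."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5:
        return "short"
    if seconds <= 20:
        return "medium"
    return "long"


# Toy usage with made-up strings and durations.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print([adi17_duration_bucket(d) for d in (3.2, 12.0, 47.5)])
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.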