ASLP-lab committed
Commit 70d8fcf · 1 Parent(s): ef1d5fa
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +186 -0
  2. app.py +636 -0
  3. requirements.txt +86 -0
  4. src/SongFormer/ckpts/md5sum.txt +4 -0
  5. src/SongFormer/configs/SongFormer.yaml +186 -0
  6. src/SongFormer/dataset/DatasetAdaper.py +33 -0
  7. src/SongFormer/dataset/GeminiOnlyLabelAdapter.py +332 -0
  8. src/SongFormer/dataset/HookTheoryAdapter.py +448 -0
  9. src/SongFormer/dataset/custom_types.py +14 -0
  10. src/SongFormer/dataset/label2id.py +163 -0
  11. src/SongFormer/dataset/msa_info_utils.py +47 -0
  12. src/SongFormer/eval.sh +22 -0
  13. src/SongFormer/evaluation/eval_infer_results.py +198 -0
  14. src/SongFormer/infer.sh +21 -0
  15. src/SongFormer/infer/infer.py +439 -0
  16. src/SongFormer/models/SongFormer.py +521 -0
  17. src/SongFormer/postprocessing/calc_acc.py +82 -0
  18. src/SongFormer/postprocessing/calc_iou.py +89 -0
  19. src/SongFormer/postprocessing/functional.py +71 -0
  20. src/SongFormer/postprocessing/helpers.py +101 -0
  21. src/SongFormer/train/accelerate_config/single_gpu.yaml +17 -0
  22. src/SongFormer/utils/average_checkpoints.py +152 -0
  23. src/SongFormer/utils/convert_res2msa_txt.py +79 -0
  24. src/SongFormer/utils/fetch_pretrained.py +40 -0
  25. src/third_party/MuQ/.gitattributes +2 -0
  26. src/third_party/MuQ/.gitignore +46 -0
  27. src/third_party/MuQ/.gitmodules +3 -0
  28. src/third_party/MuQ/LICENSE +21 -0
  29. src/third_party/MuQ/LICENSE_weights +399 -0
  30. src/third_party/MuQ/README.md +129 -0
  31. src/third_party/MuQ/images/muq-logo.jpeg +0 -0
  32. src/third_party/MuQ/images/radar.jpg +3 -0
  33. src/third_party/MuQ/images/tab-marble.jpg +3 -0
  34. src/third_party/MuQ/images/tab-mulan.png +3 -0
  35. src/third_party/MuQ/images/tagging.jpg +3 -0
  36. src/third_party/MuQ/requirements.txt +11 -0
  37. src/third_party/MuQ/setup.py +34 -0
  38. src/third_party/MuQ/src/muq/__init__.py +2 -0
  39. src/third_party/MuQ/src/muq/muq/__init__.py +1 -0
  40. src/third_party/MuQ/src/muq/muq/models/__init__.py +0 -0
  41. src/third_party/MuQ/src/muq/muq/models/muq_model.py +366 -0
  42. src/third_party/MuQ/src/muq/muq/modules/__init__.py +2 -0
  43. src/third_party/MuQ/src/muq/muq/modules/conv.py +77 -0
  44. src/third_party/MuQ/src/muq/muq/modules/features.py +37 -0
  45. src/third_party/MuQ/src/muq/muq/modules/flash_conformer.py +2114 -0
  46. src/third_party/MuQ/src/muq/muq/modules/random_quantizer.py +68 -0
  47. src/third_party/MuQ/src/muq/muq/modules/rvq.py +314 -0
  48. src/third_party/MuQ/src/muq/muq/muq.py +90 -0
  49. src/third_party/MuQ/src/muq/muq_mulan/__init__.py +1 -0
  50. src/third_party/MuQ/src/muq/muq_mulan/models/__init__.py +0 -0
README.md ADDED
@@ -0,0 +1,186 @@
1
+ ---
2
+ title: SongFormer
3
+ emoji: 🎵
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: gradio
7
+ python_version: "3.10"
8
+ app_file: app.py
9
+ tags:
10
+ - music-structure-annotation
11
+ - transformer
12
+ short_description: State-of-the-art music analysis with multi-scale datasets
13
+ fullWidth: true
14
+ ---
15
+
16
+ <p align="center">
17
+ <img src="figs/logo.png" width="50%" />
18
+ </p>
19
+
20
+
21
+ # SONGFORMER: SCALING MUSIC STRUCTURE ANALYSIS WITH HETEROGENEOUS SUPERVISION
22
+
23
+ ![Python](https://img.shields.io/badge/Python-3.10-brightgreen)
24
+ ![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightblue)
25
+ [![arXiv](https://img.shields.io/badge/arXiv-com.svg?logo=arXiv)]()
26
+ [![GitHub](https://img.shields.io/badge/GitHub-SongFormer-black)](https://github.com/ASLP-lab/SongFormer)
27
+ [![HuggingFace Space](https://img.shields.io/badge/HuggingFace-space-yellow)](https://huggingface.co/spaces/ASLP-lab/SongFormer)
28
+ [![HuggingFace Model](https://img.shields.io/badge/HuggingFace-model-blue)](https://huggingface.co/ASLP-lab/SongFormer)
29
+ [![Dataset SongFormDB](https://img.shields.io/badge/HF%20Dataset-SongFormDB-green)](https://huggingface.co/datasets/ASLP-lab/SongFormDB)
30
+ [![Dataset SongFormBench](https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange)](https://huggingface.co/datasets/ASLP-lab/SongFormBench)
31
+ [![Discord](https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white)](https://discord.gg/rwcqh7Em)
32
+ [![lab](https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey)](http://www.npu-aslp.org/)
33
+
34
+ Chunbo Hao<sup>&ast;</sup>, Ruibin Yuan<sup>&ast;</sup>, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie<sup>&dagger;</sup>
35
+
36
+
37
+ ----
38
+
39
+
40
+ SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench to foster fair and reproducible research.
41
+
42
+ ![](figs/songformer.png)
43
+
44
+ ## News and Updates
45
+
46
+ ## 📋 To-Do List
47
+
48
+ - [x] Complete and push inference code to GitHub
49
+ - [x] Upload model checkpoint(s) to Hugging Face Hub
50
+ - [ ] Upload the paper to arXiv
51
+ - [x] Fix readme
52
+ - [ ] Deploy an out-of-the-box inference version on Hugging Face (via Inference API or Spaces)
53
+ - [ ] Publish the package to PyPI for easy installation via `pip`
54
+ - [ ] Open-source evaluation code
55
+ - [ ] Open-source training code
56
+
57
+ ## Installation
58
+
59
+ ### Setting up Python Environment
60
+
61
+ ```bash
62
+ git clone https://github.com/ASLP-lab/SongFormer.git
63
+ cd SongFormer
64
+ # Get MuQ and MusicFM source code
65
+ git submodule update --init --recursive
66
+
67
+ conda create -n songformer python=3.10 -y
68
+ conda activate songformer
69
+ ```
70
+
71
+ For users in mainland China, you may need to set up a pip mirror source:
72
+
73
+ ```bash
74
+ pip config set global.index-url https://pypi.mirrors.ustc.edu.cn/simple
75
+ ```
76
+
77
+ Install dependencies:
78
+
79
+ ```bash
80
+ pip install -r requirements.txt
81
+ ```
82
+
83
+ We tested this setup on Ubuntu 22.04.1 LTS, where it works normally. If installation fails, you may need to remove the version constraints in `requirements.txt`.
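+
+ For example, one quick (unofficial) way to drop the pins into a separate file and install from that copy instead:
+
+ ```bash
+ # Strip exact version pins into a copy, then install from it
+ sed -E 's/==[^ ]+$//' requirements.txt > requirements.loose.txt
+ pip install -r requirements.loose.txt
+ ```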
84
+
85
+ ### Download Pre-trained Models
86
+
87
+ ```bash
88
+ cd src/SongFormer
89
+ # For users in mainland China: follow the instructions in this .py file to download via hf-mirror.com
90
+ python utils/fetch_pretrained.py
91
+ ```
92
+
93
+ After downloading, verify that the md5sum values in `src/SongFormer/ckpts/md5sum.txt` match the downloaded files:
94
+
95
+ ```bash
96
+ md5sum ckpts/MusicFM/msd_stats.json
97
+ md5sum ckpts/MusicFM/pretrained_msd.pt
98
+ md5sum ckpts/SongFormer.safetensors
99
+ # md5sum ckpts/SongFormer.pt
100
+ ```
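+
+ Alternatively, on systems with GNU coreutils you can check everything against the checksum list in one step (`--ignore-missing` skips entries you have not downloaded, such as the optional `.pt` checkpoint):
+
+ ```bash
+ (cd ckpts && md5sum -c --ignore-missing md5sum.txt)
+ ```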
101
+
102
+ ## Inference
105
+
106
+ ### 1. One-Click Inference with HuggingFace Space (coming soon)
107
+
108
+ Available at: [https://huggingface.co/spaces/ASLP-lab/SongFormer](https://huggingface.co/spaces/ASLP-lab/SongFormer)
109
+
110
+ ### 2. Gradio App
111
+
112
+ First, cd to the project root directory and activate the environment:
113
+
114
+ ```bash
115
+ conda activate songformer
116
+ ```
117
+
118
+ You can modify the server port and listening address in the last line of `app.py` according to your preference.
119
+
120
+ > If you're using an HTTP proxy, please ensure you include:
121
+ >
122
+ > ```bash
123
+ > export no_proxy="localhost, 127.0.0.1, ::1"
124
+ > export NO_PROXY="localhost, 127.0.0.1, ::1"
125
+ > ```
126
+ >
127
+ > Otherwise, Gradio may incorrectly conclude that the service has not started and exit immediately during startup.
128
+
129
+ When `app.py` is run for the first time, it connects to Hugging Face to download the MuQ-related weights. We recommend creating an empty folder in a suitable location and pointing `HF_HOME` to it via `export HF_HOME=XXX`, so that the cache is stored there for easy cleanup and transfer.
130
+
131
+ For users in mainland China, you may also need `export HF_ENDPOINT=https://hf-mirror.com`; for details, refer to https://hf-mirror.com/.
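+
+ A minimal sketch of this environment setup before launching (the cache path is a placeholder; the mirror line is only needed in mainland China):
+
+ ```bash
+ # Keep the Hugging Face cache in a dedicated, easy-to-clean folder
+ export HF_HOME=/path/to/hf_cache
+ # Mainland China only: route Hugging Face downloads through the mirror
+ export HF_ENDPOINT=https://hf-mirror.com
+ ```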
132
+
133
+ ```bash
134
+ python app.py
135
+ ```
136
+
137
+ ### 3. Python Code
138
+
139
+ You can refer to the file `src/SongFormer/infer/infer.py`. The corresponding execution script is located at `src/SongFormer/infer.sh`. This is a ready-to-use, single-machine, multi-process annotation script.
140
+
141
+ Below are the configurable parameters in the `src/SongFormer/infer.sh` script:
142
+
143
+ ```bash
144
+ -i # Input SCP folder path, each line containing the absolute path to one audio file
145
+ -o # Output directory for annotation results
146
+ --model # Annotation model; the default is 'SongFormer', change if using a fine-tuned model
147
+ --checkpoint # Path to the model checkpoint file
148
+ --config_path # Path to the configuration file
149
+ -gn # Total number of GPUs to use — should match the number specified in CUDA_VISIBLE_DEVICES
150
+ -tn # Number of processes to run per GPU
151
+ ```
152
+
153
+ You can control which GPUs are used by setting the `CUDA_VISIBLE_DEVICES` environment variable.
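+
+ For illustration only, here is one way these flags might be combined. The flag names are taken from the list above, while the paths and the direct call to `infer/infer.py` are assumptions; check `infer.sh` for the exact wiring:
+
+ ```bash
+ # Hypothetical invocation; adapt the paths to your setup
+ export CUDA_VISIBLE_DEVICES=0,1
+ python infer/infer.py \
+     -i /path/to/scp_dir \
+     -o /path/to/output_dir \
+     --model SongFormer \
+     --checkpoint ckpts/SongFormer.safetensors \
+     --config_path configs/SongFormer.yaml \
+     -gn 2 \
+     -tn 1
+ ```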
154
+
155
+ ### 4. CLI Inference
156
+
157
+ Coming soon
158
+
159
+ ### 5. Pitfalls
160
+
161
+ - You may need to modify line 121 in `src/third_party/musicfm/model/musicfm_25hz.py` to:
162
+ `S = torch.load(model_path, weights_only=False)["state_dict"]`
163
+
164
+ ## Training
165
+
166
+ ## Citation
167
+
168
+ If our work and codebase are useful to you, please cite:
169
+
170
+ ````
171
+ coming soon
172
+ ````
173
+ ## License
174
+
175
+ Our code is released under CC-BY-4.0 License.
176
+
177
+ ## Contact Us
178
+
179
+
180
+ <p align="center">
181
+ <a href="http://www.nwpu-aslp.org/">
182
+ <img src="figs/aslp.png" width="400"/>
183
+ </a>
184
+ </p>
185
+
186
+
app.py ADDED
@@ -0,0 +1,636 @@
1
+ # import os
2
+ # import sys
3
+
4
+ # os.chdir(os.path.join("src", "SongFormer"))
5
+ # sys.path.append(os.path.join("..", "third_party"))
6
+ # sys.path.append(".")
7
+
8
+ import os
9
+ import sys
10
+ # Get the absolute path of the current file and the script name
11
+ current_file = os.path.abspath(__file__)
12
+ current_dir = os.path.dirname(current_file)
13
+ script_name = os.path.basename(__file__)
14
+ print(f"[INFO] 正在运行脚本:{script_name}")
15
+ print(f"[INFO] 当前文件所在目录为:{current_dir}")
16
+ # 设置工作目录为 `src/SongFormer`(如果该路径存在)
17
+ songformer_path = os.path.join(current_dir, "src", "SongFormer")
18
+ if os.path.exists(songformer_path):
19
+ os.chdir(songformer_path)
20
+ print(f"[INFO] 工作目录已修改为:{songformer_path}")
21
+ else:
22
+ print(f"[WARNING] 目标工作目录不存在:{songformer_path}")
23
+ # 获取当前工作目录,即运行 os.chdir 后的路径
24
+ working_dir = os.getcwd()
25
+ print(f"[INFO] 当前工作目录为:{working_dir}")
26
+ # 添加第三方库路径到 sys.path(third_party)
27
+ third_party_path = os.path.join(current_dir, "third_party")
28
+ if os.path.exists(third_party_path):
29
+ sys.path.insert(0, third_party_path)
30
+ print(f"[INFO] 已添加第三方库路径到 sys.path:{third_party_path}")
31
+ else:
32
+ print(f"[WARNING] third_party 路径不存在:{third_party_path}")
33
+ # 添加当前工作目录到 sys.path(通常是 src/SongFormer)
34
+ sys.path.insert(0, working_dir)
35
+ print(f"[INFO] 已添加当前工作目录到 sys.path:{working_dir}")
36
+ # 尝试添加多个可能用于 musicfm 导入的路径
37
+ musicfm_paths = [
38
+ os.path.join(current_dir, "src"),
39
+ os.path.join(current_dir, "third_party"),
40
+ os.path.join(current_dir, "src", "SongFormer"),
41
+ ]
42
+ for path in musicfm_paths:
43
+ if os.path.exists(path):
44
+ sys.path.insert(0, path)
45
+ print(f"[INFO] 已添加路径到 sys.path:{path}")
46
+ else:
47
+ print(f"[DEBUG] 路径不存在,跳过添加:{path}")
48
+ # 可选:打印 sys.path 的当前状态
49
+ print("\n[DEBUG] 当前 sys.path 设置如下:")
50
+ for idx, p in enumerate(sys.path):
51
+ print(f" {idx}: {p}")
52
+
53
+ # monkey patch to fix issues in msaf
54
+ import scipy
55
+ import numpy as np
56
+
57
+ scipy.inf = np.inf
58
+
59
+ import gradio as gr
60
+ import torch
61
+ import librosa
62
+ import json
63
+ import math
64
+ import importlib
65
+ import matplotlib.pyplot as plt
66
+ import matplotlib.ticker as ticker
67
+ from pathlib import Path
68
+ from argparse import Namespace
69
+ from omegaconf import OmegaConf
70
+ from ema_pytorch import EMA
71
+ from muq import MuQ
72
+ from musicfm.model.musicfm_25hz import MusicFM25Hz
73
+ from postprocessing.functional import postprocess_functional_structure
74
+ from dataset.label2id import DATASET_ID_ALLOWED_LABEL_IDS, DATASET_LABEL_TO_DATASET_ID
75
+ from utils.fetch_pretrained import download_all
76
+
77
+ # Constants
78
+ MUSICFM_HOME_PATH = os.path.join("ckpts", "MusicFM")
79
+ BEFORE_DOWNSAMPLING_FRAME_RATES = 25
80
+ AFTER_DOWNSAMPLING_FRAME_RATES = 8.333
81
+ DATASET_LABEL = "SongForm-HX-8Class"
82
+ DATASET_IDS = [5]
83
+ TIME_DUR = 420
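+ # Sliding-window length in seconds (7 min); matches slice_dur in configs/SongFormer.yaml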
84
+ INPUT_SAMPLING_RATE = 24000
85
+
86
+ # Global model variables
87
+ muq_model = None
88
+ musicfm_model = None
89
+ msa_model = None
90
+ device = None
91
+
92
+
93
+ def load_checkpoint(checkpoint_path, device=None):
94
+ """Load checkpoint from path"""
95
+ if device is None:
96
+ device = "cpu"
97
+
98
+ if checkpoint_path.endswith(".pt"):
99
+ checkpoint = torch.load(checkpoint_path, map_location=device)
100
+ elif checkpoint_path.endswith(".safetensors"):
101
+ from safetensors.torch import load_file
102
+
103
+ checkpoint = {"model_ema": load_file(checkpoint_path, device=device)}
104
+ else:
105
+ raise ValueError("Unsupported checkpoint format. Use .pt or .safetensors")
106
+ return checkpoint
107
+
108
+
109
+ def initialize_models(model_name: str, checkpoint: str, config_path: str):
110
+ """Initialize all models"""
111
+ global muq_model, musicfm_model, msa_model, device
112
+
113
+ # Set device
114
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
115
+
116
+ # Load MuQ
117
+ muq_model = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
118
+ muq_model = muq_model.to(device).eval()
119
+
120
+ # Load MusicFM
121
+ musicfm_model = MusicFM25Hz(
122
+ is_flash=False,
123
+ stat_path=os.path.join(MUSICFM_HOME_PATH, "msd_stats.json"),
124
+ model_path=os.path.join(MUSICFM_HOME_PATH, "pretrained_msd.pt"),
125
+ )
126
+ musicfm_model = musicfm_model.to(device).eval()
127
+
128
+ # Load MSA model
129
+ module = importlib.import_module("models." + str(model_name))
130
+ Model = getattr(module, "Model")
131
+ hp = OmegaConf.load(os.path.join("configs", config_path))
132
+ msa_model = Model(hp)
133
+
134
+ ckpt = load_checkpoint(checkpoint_path=os.path.join("ckpts", checkpoint))
135
+ if ckpt.get("model_ema", None) is not None:
136
+ model_ema = EMA(msa_model, include_online_model=False)
137
+ model_ema.load_state_dict(ckpt["model_ema"])
138
+ msa_model.load_state_dict(model_ema.ema_model.state_dict())
139
+ else:
140
+ msa_model.load_state_dict(ckpt["model"])
141
+
142
+ msa_model.to(device).eval()
143
+
144
+ return hp
145
+
146
+
147
+ def process_audio(audio_path, win_size=420, hop_size=420, num_classes=128):
148
+ """Process audio file and return structure analysis results"""
149
+ global muq_model, musicfm_model, msa_model, device
150
+
151
+ if muq_model is None:
152
+ hp = initialize_models("SongFormer", "SongFormer.safetensors", "SongFormer.yaml")
153
+ else:
154
+ hp = OmegaConf.load(os.path.join("configs", "SongFormer.yaml"))
155
+
156
+ # Load audio
157
+ wav, sr = librosa.load(audio_path, sr=INPUT_SAMPLING_RATE)
158
+ audio = torch.tensor(wav).to(device)
159
+
160
+ # Prepare output
161
+ total_len = (
162
+ (audio.shape[0] // INPUT_SAMPLING_RATE) // TIME_DUR * TIME_DUR
163
+ ) + TIME_DUR
164
+ total_frames = math.ceil(total_len * AFTER_DOWNSAMPLING_FRAME_RATES)
165
+
166
+ logits = {
167
+ "function_logits": np.zeros([total_frames, num_classes]),
168
+ "boundary_logits": np.zeros([total_frames]),
169
+ }
170
+ logits_num = {
171
+ "function_logits": np.zeros([total_frames, num_classes]),
172
+ "boundary_logits": np.zeros([total_frames]),
173
+ }
174
+
175
+ # Prepare label masks
176
+ dataset_id2label_mask = {}
177
+ for key, allowed_ids in DATASET_ID_ALLOWED_LABEL_IDS.items():
178
+ dataset_id2label_mask[key] = np.ones(num_classes, dtype=bool)
179
+ dataset_id2label_mask[key][allowed_ids] = False
180
+
181
+ lens = 0
182
+ i = 0
183
+
184
+ with torch.no_grad():
185
+ while True:
186
+ start_idx = i * INPUT_SAMPLING_RATE
187
+ end_idx = min((i + win_size) * INPUT_SAMPLING_RATE, audio.shape[-1])
188
+ if start_idx >= audio.shape[-1]:
189
+ break
190
+ if end_idx - start_idx <= 1024:
191
+ break  # remaining tail is shorter than 1024 samples; stop to avoid an infinite loop
192
+
193
+ audio_seg = audio[start_idx:end_idx]
194
+
195
+ # Get embeddings
196
+ muq_output = muq_model(audio_seg.unsqueeze(0), output_hidden_states=True)
197
+ muq_embd_420s = muq_output["hidden_states"][10]
198
+ del muq_output
199
+ torch.cuda.empty_cache()
200
+
201
+ _, musicfm_hidden_states = musicfm_model.get_predictions(
202
+ audio_seg.unsqueeze(0)
203
+ )
204
+ musicfm_embd_420s = musicfm_hidden_states[10]
205
+ del musicfm_hidden_states
206
+ torch.cuda.empty_cache()
207
+
208
+ # Process 30-second segments
209
+ wraped_muq_embd_30s = []
210
+ wraped_musicfm_embd_30s = []
211
+
212
+ for idx_30s in range(i, i + hop_size, 30):
213
+ start_idx_30s = idx_30s * INPUT_SAMPLING_RATE
214
+ end_idx_30s = min(
215
+ (idx_30s + 30) * INPUT_SAMPLING_RATE,
216
+ audio.shape[-1],
217
+ (i + hop_size) * INPUT_SAMPLING_RATE,
218
+ )
219
+ if start_idx_30s >= audio.shape[-1]:
220
+ break
221
+ if end_idx_30s - start_idx_30s <= 1024:
222
+ continue
223
+
224
+ wraped_muq_embd_30s.append(
225
+ muq_model(
226
+ audio[start_idx_30s:end_idx_30s].unsqueeze(0),
227
+ output_hidden_states=True,
228
+ )["hidden_states"][10]
229
+ )
230
+ torch.cuda.empty_cache()
231
+
232
+ wraped_musicfm_embd_30s.append(
233
+ musicfm_model.get_predictions(
234
+ audio[start_idx_30s:end_idx_30s].unsqueeze(0)
235
+ )[1][10]
236
+ )
237
+ torch.cuda.empty_cache()
238
+
239
+ if wraped_muq_embd_30s:
240
+ wraped_muq_embd_30s = torch.concatenate(wraped_muq_embd_30s, dim=1)
241
+ wraped_musicfm_embd_30s = torch.concatenate(
242
+ wraped_musicfm_embd_30s, dim=1
243
+ )
244
+
245
+ all_embds = [
246
+ wraped_musicfm_embd_30s,
247
+ wraped_muq_embd_30s,
248
+ musicfm_embd_420s,
249
+ muq_embd_420s,
250
+ ]
251
+
252
+ # Align embedding lengths
253
+ if len(all_embds) > 1:
254
+ embd_lens = [x.shape[1] for x in all_embds]
255
+ min_embd_len = min(embd_lens)
256
+ for idx in range(len(all_embds)):
257
+ all_embds[idx] = all_embds[idx][:, :min_embd_len, :]
258
+
259
+ embd = torch.concatenate(all_embds, axis=-1)
260
+
261
+ # Inference
262
+ dataset_ids = torch.Tensor(DATASET_IDS).to(device, dtype=torch.long)
263
+ msa_info, chunk_logits = msa_model.infer(
264
+ input_embeddings=embd,
265
+ dataset_ids=dataset_ids,
266
+ label_id_masks=torch.Tensor(
267
+ dataset_id2label_mask[
268
+ DATASET_LABEL_TO_DATASET_ID[DATASET_LABEL]
269
+ ]
270
+ )
271
+ .to(device, dtype=bool)
272
+ .unsqueeze(0)
273
+ .unsqueeze(0),
274
+ with_logits=True,
275
+ )
276
+
277
+ # Accumulate logits
278
+ start_frame = int(i * AFTER_DOWNSAMPLING_FRAME_RATES)
279
+ end_frame = start_frame + min(
280
+ math.ceil(hop_size * AFTER_DOWNSAMPLING_FRAME_RATES),
281
+ chunk_logits["boundary_logits"][0].shape[0],
282
+ )
283
+
284
+ logits["function_logits"][start_frame:end_frame, :] += (
285
+ chunk_logits["function_logits"][0].detach().cpu().numpy()
286
+ )
287
+ logits["boundary_logits"][start_frame:end_frame] = (
288
+ chunk_logits["boundary_logits"][0].detach().cpu().numpy()
289
+ )
290
+ logits_num["function_logits"][start_frame:end_frame, :] += 1
291
+ logits_num["boundary_logits"][start_frame:end_frame] += 1
292
+ lens += end_frame - start_frame
293
+
294
+ i += hop_size
295
+
296
+ # Average logits
297
+ logits["function_logits"] /= np.maximum(logits_num["function_logits"], 1)
298
+ logits["boundary_logits"] /= np.maximum(logits_num["boundary_logits"], 1)
299
+
300
+ logits["function_logits"] = torch.from_numpy(
301
+ logits["function_logits"][:lens]
302
+ ).unsqueeze(0)
303
+ logits["boundary_logits"] = torch.from_numpy(
304
+ logits["boundary_logits"][:lens]
305
+ ).unsqueeze(0)
306
+
307
+ # Post-process
308
+ msa_infer_output = postprocess_functional_structure(logits, hp)
309
+
310
+ return logits, msa_infer_output
311
+
312
+
313
+ def format_as_segments(msa_output):
314
+ """Format as list of segments"""
315
+ segments = []
316
+ for idx in range(len(msa_output) - 1):
317
+ segments.append(
318
+ {
319
+ "start": str(round(msa_output[idx][0], 2)),
320
+ "end": str(round(msa_output[idx + 1][0], 2)),
321
+ "label": msa_output[idx][1],
322
+ }
323
+ )
324
+ return segments
325
+
326
+
327
+ def format_as_msa(msa_output):
328
+ """Format as MSA format"""
329
+ lines = []
330
+ for time, label in msa_output:
331
+ lines.append(f"{time:.2f} {label}")
332
+ return "\n".join(lines)
333
+
334
+
335
+ def format_as_json(segments):
336
+ """Format as JSON"""
337
+ return json.dumps(segments, indent=2, ensure_ascii=False)
338
+
339
+
340
+ def create_visualization(
341
+ logits, msa_output, label_num=8, frame_rates=AFTER_DOWNSAMPLING_FRAME_RATES
342
+ ):
343
+ """Create visualization plot"""
344
+ # Assume ID_TO_LABEL mapping exists
345
+ try:
346
+ from dataset.label2id import ID_TO_LABEL
347
+ except:
348
+ ID_TO_LABEL = {i: f"Class_{i}" for i in range(128)}
349
+
350
+ function_vals = logits["function_logits"].squeeze().cpu().numpy()
351
+ boundary_vals = logits["boundary_logits"].squeeze().cpu().numpy()
352
+
353
+ top_classes = np.argsort(function_vals.mean(axis=0))[-label_num:]
354
+ T = function_vals.shape[0]
355
+ time_axis = np.arange(T) / frame_rates
356
+
357
+ fig, ax = plt.subplots(2, 1, figsize=(15, 8), sharex=True)
358
+
359
+ # Plot function logits
360
+ for cls in top_classes:
361
+ ax[1].plot(
362
+ time_axis,
363
+ function_vals[:, cls],
364
+ label=f"{ID_TO_LABEL.get(cls, f'Class_{cls}')}",
365
+ )
366
+
367
+ ax[1].set_title("Top 8 Function Logits by Mean Activation")
368
+ ax[1].set_xlabel("Time (seconds)")
369
+ ax[1].set_ylabel("Logit")
370
+ ax[1].xaxis.set_major_locator(ticker.MultipleLocator(20))
371
+ ax[1].xaxis.set_minor_locator(ticker.MultipleLocator(5))
372
+ ax[1].xaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))
373
+ ax[1].legend()
374
+ ax[1].grid(True)
375
+
376
+ # Plot boundary logits
377
+ ax[0].plot(time_axis, boundary_vals, label="Boundary Logit", color="orange")
378
+ ax[0].set_title("Boundary Logits")
379
+ ax[0].set_ylabel("Logit")
380
+ ax[0].legend()
381
+ ax[0].grid(True)
382
+
383
+ # Add vertical lines for markers
384
+ for t_sec, label in msa_output:
385
+ for a in ax:
386
+ a.axvline(x=t_sec, color="red", linestyle="--", linewidth=0.8, alpha=0.7)
387
+ if label != "end":
388
+ ax[1].text(
389
+ t_sec + 0.3,
390
+ ax[1].get_ylim()[1] * 0.85,
391
+ label,
392
+ rotation=90,
393
+ fontsize=8,
394
+ color="red",
395
+ )
396
+
397
+ plt.suptitle("Music Structure Analysis - Logits Overview", fontsize=16)
398
+ plt.tight_layout()
399
+
400
+ return fig
401
+
402
+
403
+ def rule_post_processing(msa_list):
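+ """Heuristic cleanup: merge a first/last segment shorter than 1 s and drop a duplicated first/last label that lies within 10 s of the start/end."""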
404
+ if len(msa_list) <= 2:
405
+ return msa_list
406
+
407
+ result = msa_list.copy()
408
+
409
+ while len(result) > 2:
410
+ first_duration = result[1][0] - result[0][0]
411
+ if first_duration < 1.0 and len(result) > 2:
412
+ result[0] = (result[0][0], result[1][1])
413
+ result = [result[0]] + result[2:]
414
+ else:
415
+ break
416
+
417
+ while len(result) > 2:
418
+ last_label_duration = result[-1][0] - result[-2][0]
419
+ if last_label_duration < 1.0:
420
+ result = result[:-2] + [result[-1]]
421
+ else:
422
+ break
423
+
424
+ while len(result) > 2:
425
+ if result[0][1] == result[1][1] and result[1][0] <= 10.0:
426
+ result = [(result[0][0], result[0][1])] + result[2:]
427
+ else:
428
+ break
429
+
430
+ while len(result) > 2:
431
+ last_duration = result[-1][0] - result[-2][0]
432
+ if result[-2][1] == result[-3][1] and last_duration <= 10.0:
433
+ result = result[:-2] + [result[-1]]
434
+ else:
435
+ break
436
+
437
+ return result
438
+
439
+
440
+ def process_and_analyze(audio_file):
441
+ """Main processing function"""
442
+
443
+ def format_time(t: float) -> str:
444
+ minutes = int(t // 60)
445
+ seconds = t % 60
446
+ return f"{minutes:02d}:{seconds:06.3f}"  # mm:ss.mmm
447
+
448
+ if audio_file is None:
449
+ return None, "", "", None
450
+
451
+ try:
452
+ # Process audio
453
+ logits, msa_output = process_audio(audio_file)
454
+ # Apply rule-based post-processing; if it is not needed, use the CLI inference path instead
455
+ msa_output = rule_post_processing(msa_output)
456
+ # Format outputs
457
+ segments = format_as_segments(msa_output)
458
+ msa_format = format_as_msa(msa_output)
459
+ json_format = format_as_json(segments)
460
+
461
+ # Create table data
462
+ table_data = [
463
+ [
464
+ f"{float(seg['start']):.2f} ({format_time(float(seg['start']))})",
465
+ f"{float(seg['end']):.2f} ({format_time(float(seg['end']))})",
466
+ seg["label"],
467
+ ]
468
+ for seg in segments
469
+ ]
470
+
471
+ # Create visualization
472
+ fig = create_visualization(logits, msa_output)
473
+
474
+ return table_data, json_format, msa_format, fig
475
+
476
+ except Exception as e:
477
+ import traceback
478
+
479
+ error_msg = f"Error: {str(e)}\n{traceback.format_exc()}"
480
+ print(error_msg)  # print the full traceback to the console
481
+ return None, "", error_msg, None
482
+
483
+
484
+ # Create Gradio interface
485
+ with gr.Blocks(
486
+ title="Music Structure Analysis",
487
+ css="""
488
+ .logo-container {
489
+ text-align: center;
490
+ margin-bottom: 20px;
491
+ }
492
+ .links-container {
493
+ display: flex;
494
+ justify-content: center;
495
+ column-gap: 10px;
496
+ margin-bottom: 10px;
497
+ }
498
+ .model-title {
499
+ text-align: center;
500
+ font-size: 24px;
501
+ font-weight: bold;
502
+ margin-bottom: 30px;
503
+ }
504
+ """,
505
+ ) as demo:
506
+ # Top Logo
507
+ gr.HTML("""
508
+ <div style="display: flex; justify-content: center; align-items: center;">
509
+ <img src="https://raw.githubusercontent.com/ASLP-lab/SongFormer/refs/heads/main/figs/logo.png" style="max-width: 300px; height: auto;" />
510
+ </div>
511
+ """)
512
+
513
+ # Model title
514
+ gr.HTML("""
515
+ <div class="model-title">
516
+ SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
517
+ </div>
518
+ """)
519
+
520
+ # Links
521
+ gr.HTML("""
522
+ <div class="links-container">
523
+ <img src="https://img.shields.io/badge/Python-3.10-brightgreen" alt="Python">
524
+ <img src="https://img.shields.io/badge/License-CC%20BY%204.0-lightblue" alt="License">
525
+ <a href="https://arxiv.org/abs/">
526
+ <img src="https://img.shields.io/badge/arXiv-com.svg?logo=arXiv" alt="arXiv">
527
+ </a>
528
+ <a href="https://github.com/ASLP-lab/SongFormer">
529
+ <img src="https://img.shields.io/badge/GitHub-SongFormer-black" alt="GitHub">
530
+ </a>
531
+ <a href="https://huggingface.co/spaces/ASLP-lab/SongFormer">
532
+ <img src="https://img.shields.io/badge/HuggingFace-space-yellow" alt="HuggingFace Space">
533
+ </a>
534
+ <a href="https://huggingface.co/ASLP-lab/SongFormer">
535
+ <img src="https://img.shields.io/badge/HuggingFace-model-blue" alt="HuggingFace Model">
536
+ </a>
537
+ <a href="https://huggingface.co/datasets/ASLP-lab/SongFormDB">
538
+ <img src="https://img.shields.io/badge/HF%20Dataset-SongFormDB-green" alt="Dataset SongFormDB">
539
+ </a>
540
+ <a href="https://huggingface.co/datasets/ASLP-lab/SongFormBench">
541
+ <img src="https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange" alt="Dataset SongFormBench">
542
+ </a>
543
+ <a href="https://discord.gg/rwcqh7Em">
544
+ <img src="https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white" alt="Discord">
545
+ </a>
546
+ <a href="http://www.npu-aslp.org/">
547
+ <img src="https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey" alt="ASLP">
548
+ </a>
549
+ </div>
550
+ """)
551
+
552
+ # Main input area
553
+ with gr.Row():
554
+ with gr.Column(scale=3):
555
+ audio_input = gr.Audio(
556
+ label="Upload Audio File", type="filepath", elem_id="audio-input"
557
+ )
558
+
559
+ with gr.Column(scale=1):
560
+ gr.Markdown("### 📌 Examples")
561
+ gr.Examples(
562
+ examples=[
563
+ # Add your example audio file paths
564
+ # ["example1.mp3"],
565
+ # ["example2.mp3"],
566
+ ],
567
+ inputs=[audio_input],
568
+ label="Click to load example",
569
+ )
570
+
571
+ # Analyze button
572
+ with gr.Row():
573
+ analyze_btn = gr.Button(
574
+ "🚀 Analyze Music Structure", variant="primary", scale=1
575
+ )
576
+
577
+ # Results display area
578
+ with gr.Row():
579
+ with gr.Column(scale=13):
580
+ segments_table = gr.Dataframe(
581
+ headers=["Start / s (m:s.ms)", "End / s (m:s.ms)", "Label"],
582
+ label="Detected Music Segments",
583
+ interactive=False,
584
+ elem_id="result-table",
585
+ )
586
+ with gr.Column(scale=8):
587
+ with gr.Row():
588
+ with gr.Accordion("📄 JSON Output", open=False):
589
+ json_output = gr.Textbox(
590
+ label="JSON Format",
591
+ lines=15,
592
+ max_lines=20,
593
+ interactive=False,
594
+ show_copy_button=True,
595
+ )
596
+ with gr.Row():
597
+ with gr.Accordion("📋 MSA Text Output", open=False):
598
+ msa_output = gr.Textbox(
599
+ label="MSA Format",
600
+ lines=15,
601
+ max_lines=20,
602
+ interactive=False,
603
+ show_copy_button=True,
604
+ )
605
+
606
+ # Visualization plot
607
+ with gr.Row():
608
+ plot_output = gr.Plot(label="Activation Curves Visualization")
609
+
610
+ gr.HTML("""
611
+ <div style="display: flex; justify-content: center; align-items: center;">
612
+ <img src="https://raw.githubusercontent.com/ASLP-lab/SongFormer/refs/heads/main/figs/aslp.png" style="max-width: 300px; height: auto;" />
613
+ </div>
614
+ """)
615
+
616
+ # Set event handlers
617
+ analyze_btn.click(
618
+ fn=process_and_analyze,
619
+ inputs=[audio_input],
620
+ outputs=[segments_table, json_output, msa_output, plot_output],
621
+ )
622
+
623
+ if __name__ == "__main__":
624
+ # Download pretrained models if not exist
625
+ download_all(use_mirror=False)
626
+ # Initialize models
627
+ print("Initializing models...")
628
+ initialize_models(
629
+ model_name="SongFormer",
630
+ checkpoint="SongFormer.safetensors",
631
+ config_path="SongFormer.yaml",
632
+ )
633
+ print("Models loaded successfully!")
634
+
635
+ # Launch interface
636
+ demo.launch(server_name="127.0.0.1", server_port=7891, debug=True)
requirements.txt ADDED
@@ -0,0 +1,86 @@
1
+ # Core Deep Learning Framework
2
+ torch==2.4.0
3
+ torchaudio==2.4.0
4
+ lightning==2.5.1.post0
5
+
6
+ # ML/DL Libraries
7
+ transformers==4.51.1
8
+ accelerate==1.5.2
9
+ datasets==3.6.0
10
+ tokenizers==0.21.1
11
+ huggingface-hub==0.30.1
12
+ safetensors==0.5.3
13
+
14
+ # Scientific Computing
15
+ numpy==1.25.0
16
+ scipy==1.15.2
17
+ scikit-learn==1.6.1
18
+ pandas==2.2.3
19
+
20
+ # Audio Processing
21
+ librosa==0.11.0
22
+ audioread==3.0.1
23
+ soundfile==0.13.1
24
+ pesq==0.0.4
25
+ auraloss==0.4.0
26
+ nnAudio==0.3.3
27
+ julius==0.2.7
28
+ soxr==0.5.0.post1
29
+ mir_eval==0.8.2
30
+ jams==0.3.4
31
+ msaf==0.1.80
32
+
33
+ # Visualization & Monitoring
34
+ matplotlib==3.10.1
35
+ seaborn==0.13.2
36
+ tensorboard==2.19.0
37
+ wandb==0.19.8
38
+ gpustat==1.1.1
39
+
40
+ # Configuration & CLI
41
+ hydra-core==1.3.2
42
+ omegaconf==2.3.0
43
+ fire==0.7.1
44
+ click==8.1.8
45
+
46
+ # Deep Learning Utils
47
+ einops==0.8.1
48
+ einx==0.3.0
49
+ x-transformers==2.4.14
50
+ x-clip==0.14.4
51
+ ema-pytorch==0.7.7
52
+ schedulefree==1.4.1
53
+ torchmetrics==1.7.1
54
+
55
+ # Data Processing
56
+ h5py==3.13.0
57
+ pyarrow==19.0.1
58
+ pillow==11.1.0
59
+
60
+ # Text Processing
61
+ ftfy==6.3.1
62
+ regex==2024.11.6
63
+ pypinyin==0.54.0
64
+ textgrid==1.6.1
65
+ pylrc==0.1.2
66
+
67
+ # Model Management
68
+ modelscope==1.27.1
69
+
70
+ # Utilities
71
+ tqdm==4.67.1
72
+ loguru==0.7.3
73
+ joblib==1.4.2
74
+ easydict==1.13
75
+ addict==2.4.0
76
+ beartype==0.21.0
77
+
78
+ # Others
79
+ triton==3.0.0
80
+ muq==0.1.0
81
+ vmo==0.30.5
82
+ gradio
src/SongFormer/ckpts/md5sum.txt ADDED
@@ -0,0 +1,4 @@
1
+ df930aceac8209818556c4a656a0714c MusicFM/pretrained_msd.pt
2
+ 75ab2e47b093e07378f7f703bdb82c14 MusicFM/msd_stats.json
3
+ 5a24800e12ab357744f8b47e523ba3e6 SongFormer.safetensors
4
+ 2c66c0bb91364e318e90dbc2d9a79ee2 _SongFormer.pt
src/SongFormer/configs/SongFormer.yaml ADDED
@@ -0,0 +1,186 @@
1
+ # ============================
2
+ # Model Configuration
3
+ # ============================
4
+
5
+ input_dim_raw: 4096 # Downsampled Fused SSL Representation Dimension
6
+ input_dim: 2048 # Input Dimension after Linear Layer
7
+
8
+ # Downsampling Module
9
+ down_sample_conv_kernel_size: 3
10
+ down_sample_conv_stride: 3
11
+ down_sample_conv_dropout: 0.1
12
+ down_sample_conv_padding: 0
13
+
14
+ # Transformer Module
15
+ transformer_encoder_input_dim: 1024
16
+ transformer_input_dim: 512
17
+ num_transformer_layers: 4
18
+ transformer_nhead: 8
19
+ transformer_dropout: 0.1
20
+
21
+ # task-specific heads
22
+ boundary_head_hidden_dims: [128, 64, 8]
23
+ function_head_hidden_dims: []
24
+
25
+ num_classes: 128
26
+ num_dataset_classes: 64
27
+
28
+ # scheduler
29
+ warmup_steps: 300
30
+ total_steps: 12010
31
+ warmup_max_lr: 0.0001
32
+
33
+ # frame rates after downsampling
34
+ output_logits_frame_rates: 8.333
35
+ # i.e., output_logits_frame_rates = input_embd_frame_rates / downsample_rates (25 / 3 ≈ 8.333), since the conv padding is 0.
36
+ downsample_rates: 3
37
+ # frame rates after downsampling, used by model and post process
38
+ frame_rates: 8.333
39
+
40
+ # ema config
41
+ ema_kwargs:
42
+ {update_after_step: 200}
43
+
44
+ # ============================
45
+ # Loss Functions configuration
46
+ # ============================
47
+
48
+ # Focal loss
49
+ label_focal_loss_weight: 0.2
50
+
51
+ label_focal_loss_alpha: 0.25
52
+ label_focal_loss_gamma: 2.0
53
+
54
+ # Boundary TV loss
55
+ boundary_tvloss_weight: 0.05
56
+
57
+ boundary_tv_loss_beta: 0.6
58
+ boundary_tv_loss_lambda: 0.4
59
+ boundary_tv_loss_boundary_threshold: 0.01
60
+ boundary_tv_loss_reduction_weight: 0.1
61
+
62
+ loss_weight_section: 0.2
63
+ loss_weight_function: 0.8
64
+
65
+ # ============================
66
+ # Training config
67
+ # ============================
68
+
69
+ # Number of neighbors used to augment boundaries in the dataset.
70
+ # Example: (1/25)*3 * 10 = 1.2 s per side (2.4 s total across both sides)
71
+ num_neighbors: 10
72
+ learn_label: true
73
+ learn_segment: true
74
+ accumulation_steps: 2
75
+ slice_dur: 420
76
+ early_stopping_step: 3
77
+ local_maxima_filter_size: 3
78
+
79
+ # ============================
80
+ # Dataset config
81
+ # ============================
82
+
83
+ train_dataset:
84
+ _target_: dataset.SongFormerDataset.Dataset
85
+ dataset_abstracts:
86
+ [
87
+ {
88
+ "internal_tmp_id": "SongForm-HX-8Class",
89
+ "dataset_type": "SongForm-HX-8Class",
90
+ "input_embedding_dir": "your_data_dir/30s_420s/harmonix/musicfm_hop420/layer_10 your_data_dir/30s_420s/harmonix/muq_hop420/layer_10 your_data_dir/420s/harmonix/musicfm_hop420/layer_10 your_data_dir/420s/harmonix/muq_hop420/layer_10",
91
+ "label_path": "your_data_dir/labels/harmonixset_8class_rule_revision.jsonl",
92
+ "split_ids_path": "your_data_dir/separated_ids/harmonixset_separated_ids_with_val_set/train.txt",
93
+ "multiplier": 4,
94
+ },
95
+ {
96
+ "internal_tmp_id": "SongForm-Private",
97
+ "dataset_type": "SongForm-Private",
98
+ "input_embedding_dir": "your_data_dir/30s_420s/Internal_data/musicfm_hop420/layer_10 your_data_dir/30s_420s/Internal_data/muq_hop420/layer_10 your_data_dir/420s/Internal_data/musicfm_hop420/layer_10 your_data_dir/420s/Internal_data/muq_hop420/layer_10",
99
+ "label_path": "your_data_dir/labels/0006_single_layer_transformer_musicfm_muq_along_time_00_5k_v1.jsonl",
100
+ "split_ids_path": "your_data_dir/separated_ids/internal_data_sofa_clean/train.txt",
101
+ "multiplier": 1,
102
+ },
103
+ {
104
+ adapter: HookTheoryAdapter,
105
+ internal_tmp_id: "SongForm-Hook",
106
+ structure_jsonl_paths: [
107
+ "your_data_dir/HookTheoryStructure.train.jsonl"
108
+ ],
109
+ dataset_type: "SongForm-Hook",
110
+ input_embedding_dir: "your_data_dir/30s_420s/HookTheory/musicfm_hop420/layer_10 your_data_dir/30s_420s/HookTheory/muq_hop420/layer_10 your_data_dir/420s/HookTheory/musicfm_hop420/layer_10 your_data_dir/420s/HookTheory/muq_hop420/layer_10",
111
+ split_ids_path: "your_data_dir/separated_ids/hooktheory_separated_ids/train.txt",
112
+ multiplier: 1,
113
+ },
114
+ ]
115
+ hparams:
116
+ output_logits_frame_rates: ${output_logits_frame_rates}
117
+ downsample_rates: ${downsample_rates}
118
+ num_neighbors: ${num_neighbors}
119
+ input_dim: ${input_dim_raw}
120
+ slice_dur: ${slice_dur}
121
+ num_classes: ${num_classes}
122
+ frame_rates: ${frame_rates}
123
+
124
+ eval_dataset:
125
+ _target_: dataset.SongFormerDataset.Dataset
126
+ dataset_abstracts:
127
+ [
128
+ {
129
+ "internal_tmp_id": "SongForm-HX-8Classs_val",
130
+ "dataset_type": "SongForm-HX-8Class",
131
+ "input_embedding_dir": "your_data_dir/30s_420s/harmonix/musicfm_hop420/layer_10 your_data_dir/30s_420s/harmonix/muq_hop420/layer_10 your_data_dir/420s/harmonix/musicfm_hop420/layer_10 your_data_dir/420s/harmonix/muq_hop420/layer_10",
132
+ "label_path": "your_data_dir/processed_data/labels/harmonixset_8class_rule_revision.jsonl",
133
+ "split_ids_path": "your_data_dir/separated_ids/harmonixset_separated_ids_with_val_set/val.txt",
134
+ "multiplier": 1,
135
+ },
136
+ ]
137
+ hparams:
138
+ output_logits_frame_rates: ${output_logits_frame_rates}
139
+ downsample_rates: ${downsample_rates}
140
+ num_neighbors: ${num_neighbors}
141
+ input_dim: ${input_dim_raw}
142
+ slice_dur: ${slice_dur}
143
+ num_classes: ${num_classes}
144
+ frame_rates: ${frame_rates}
145
+
146
+ # ============================
147
+ # DataLoader configuration
148
+ # ============================
149
+
150
+ train_dataloader:
151
+ num_workers: 4
152
+ batch_size: 4
153
+ pin_memory: True
154
+ prefetch_factor: 4
155
+ drop_last: True
156
+ persistent_workers: True
157
+ shuffle: true
158
+
159
+ eval_dataloader:
160
+ num_workers: 0
161
+ batch_size: 1
162
+ shuffle: false
163
+
164
+ # ============================
165
+ # Optimizer configuration
166
+ # ============================
167
+
168
+ optimizer:
169
+ lr: ${warmup_max_lr}
170
+ betas: [0.8, 0.999]
171
+ eps: 1e-08
172
+ weight_decay: 3e-7
173
+
174
+ # ============================
175
+ # Training Run configuration
176
+ # ============================
177
+
178
+ args:
179
+ run_name: SongFormer
180
+ model_name: SongFormer
181
+ save_interval: 800
182
+ eval_interval: 800
183
+ checkpoint_dir: output/SongFormer
184
+ max_epochs: 1000
185
+ max_steps: 12010
186
+ tags: null
src/SongFormer/dataset/DatasetAdaper.py ADDED
@@ -0,0 +1,33 @@
1
+ from abc import ABC, abstractmethod
2
+
3
+
4
+ class DatasetAdapter(ABC):
5
+ """
6
+ Abstract base class for dataset adapters.
7
+ """
8
+
9
+ @abstractmethod
10
+ def __init__(self, *args, **kwargs):
11
+ """
12
+ Initialize the dataset adapter with necessary parameters.
13
+ """
14
+ raise NotImplementedError("Subclasses must implement the __init__ method.")
15
+
16
+ @abstractmethod
17
+ def get_ids(self):
18
+ """
19
+ Get the IDs of the dataset.
20
+ This method should be implemented by subclasses.
21
+
22
+ Returns:
23
+ A list or set of IDs representing the dataset. In format: ID + start_time
24
+ must cosider the split of dataset, e.g. train, val, test.
25
+ """
26
+ raise NotImplementedError("Subclasses must implement this method.")
27
+
28
+ @abstractmethod
29
+ def get_item_json(self, *args, **kwargs):
30
+ """
31
+ Get the item JSON representation from the dataset.
32
+ """
33
+ raise NotImplementedError("Subclasses must implement this method.")
src/SongFormer/dataset/GeminiOnlyLabelAdapter.py ADDED
@@ -0,0 +1,332 @@
1
+ # 1. It was found that the annotations generated by Gemini are discontinuous between segments
2
+ # (possibly differing by more than 1.7 seconds, accounting for approximately 1/4 to 1/3 of the cases).
3
+ # 2. Gemini's labels can compete with our SOTA model, but Gemini's boundary metrics are very poor.
4
+ # With a tolerance of 3 seconds, they are similar to the metrics of our best model.
5
+ import pdb
6
+ import random
7
+ import os
8
+ from collections import defaultdict
9
+ from pathlib import Path
10
+ import json
11
+ from loguru import logger
12
+ import numpy as np
13
+ import math
14
+ from .label2id import (
15
+ DATASET_ID_ALLOWED_LABEL_IDS,
16
+ DATASET_LABEL_TO_DATASET_ID,
17
+ ID_TO_LABEL,
18
+ LABEL_TO_ID,
19
+ )
20
+ from argparse import Namespace
21
+ from scipy.ndimage import gaussian_filter1d
22
+ from .DatasetAdaper import DatasetAdapter
23
+ from omegaconf import ListConfig
24
+ import copy
25
+
26
+
27
+ # Adapter for datasets labeled only by Gemini
28
+ class GeminiOnlyLabelAdapter(DatasetAdapter):
29
+ def __init__(self, **kwargs):
30
+ (
31
+ label_paths,
32
+ hparams,
33
+ internal_tmp_id,
34
+ dataset_type,
35
+ input_embedding_dir,
36
+ split_ids_path,
37
+ ) = (
38
+ kwargs["label_paths"],
39
+ kwargs["hparams"],
40
+ kwargs["internal_tmp_id"],
41
+ kwargs["dataset_type"],
42
+ kwargs["input_embedding_dir"],
43
+ kwargs["split_ids_path"],
44
+ )
45
+ self.frame_rates = hparams.frame_rates
46
+ self.hparams = hparams
47
+ self.label_to_id = LABEL_TO_ID
48
+ self.dataset_id_to_dataset_id = DATASET_LABEL_TO_DATASET_ID
49
+ self.id_to_label = ID_TO_LABEL
50
+ self.internal_tmp_id = internal_tmp_id
51
+ self.dataset_type = dataset_type
52
+ self.EPS = 1e-6
53
+ self.dataset_id2label_mask = {}
54
+ for key, allowed_ids in DATASET_ID_ALLOWED_LABEL_IDS.items():
55
+ self.dataset_id2label_mask[key] = np.ones(
56
+ self.hparams.num_classes, dtype=bool
57
+ )
58
+ self.dataset_id2label_mask[key][allowed_ids] = False
59
+
60
+ self.id2segments = {}
61
+ data = self.load_jsonl(label_paths)
62
+
63
+ self.input_embedding_dir = input_embedding_dir
64
+ all_input_embedding_dirs = input_embedding_dir.split()
65
+
66
+ valid_data_ids = self.get_ids_from_dir(all_input_embedding_dirs[0])
67
+
68
+ for x in all_input_embedding_dirs:
69
+ valid_data_ids = valid_data_ids.intersection(self.get_ids_from_dir(x))
70
+ split_ids = []
71
+ with open(split_ids_path) as f:
72
+ for line in f:
73
+ if not line.strip():
74
+ continue
75
+ split_ids.append(line.strip())
76
+ split_ids = set(split_ids)
77
+
78
+ valid_data_ids = [
79
+ x for x in valid_data_ids if "_".join(x.split("_")[:-1]) in split_ids
80
+ ]
81
+ valid_data_ids = [
82
+ (internal_tmp_id, dataset_type, x, "HookTheoryAdapter")
83
+ for x in valid_data_ids
84
+ ]
85
+ self.valid_data_ids = valid_data_ids
86
+ rng = random.Random(42)
87
+ rng.shuffle(self.valid_data_ids)
88
+ for item in data:
89
+ self.id2segments[item["data_id"]] = item["msa_info"]
90
+
91
+ def get_ids_from_dir(self, dir_path: str):
92
+ ids = os.listdir(dir_path)
93
+ ids = [Path(x).stem for x in ids if x.endswith(".npy")]
94
+ return set(ids)
95
+
96
+ def time2frame(self, this_time):
97
+ return int(this_time * self.frame_rates)
98
+
99
+ def load_jsonl(self, paths):
100
+ data = []
101
+ for path in paths:
102
+ with open(path, "r", encoding="utf-8") as f:
103
+ for line in f:
104
+ line = line.strip()
105
+ if not line:
106
+ continue
107
+ obj = json.loads(line)
108
+ data.append(obj)
109
+ return data
110
+
111
+ def get_ids(self):
112
+ return list(self.valid_data_ids)
113
+
114
+ def widen_temporal_events(self, events, num_neighbors):
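+ """Soften the 0/1 boundary targets with a Gaussian (sigma = num_neighbors / 3), rescaled so an isolated boundary peaks at 1."""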
115
+ def theoretical_gaussian_max(sigma):
116
+ return 1 / (np.sqrt(2 * np.pi) * sigma)
117
+
118
+ widen_events = events
119
+ sigma = num_neighbors / 3.0
120
+ smoothed = gaussian_filter1d(widen_events.astype(float), sigma=sigma)
121
+ smoothed /= theoretical_gaussian_max(sigma)
122
+ smoothed = np.clip(smoothed, 0, 1)
123
+
124
+ return smoothed
125
+
126
+ def get_item_json(self, utt, start_time, end_time):
127
+ embd_list = []
128
+ embd_dirs = self.input_embedding_dir.split()
129
+ for embd_dir in embd_dirs:
130
+ if not Path(embd_dir).exists():
131
+ raise FileNotFoundError(
132
+ f"Embedding directory {embd_dir} does not exist"
133
+ )
134
+ tmp = np.load(Path(embd_dir) / f"{utt}.npy").squeeze(axis=0)
135
+ embd_list.append(tmp)
136
+
137
+ # Check that max and min lengths of all representations differ by at most 2
138
+ if len(embd_list) > 1:
139
+ embd_shapes = [x.shape for x in embd_list]
140
+ max_shape = max(embd_shapes, key=lambda x: x[0])
141
+ min_shape = min(embd_shapes, key=lambda x: x[0])
142
+ if abs(max_shape[0] - min_shape[0]) > 2:
143
+ raise ValueError(
144
+ f"Embedding shapes differ too much: {max_shape} vs {min_shape}"
145
+ )
146
+
147
+ for idx in range(len(embd_list)):
148
+ embd_list[idx] = embd_list[idx][: min_shape[0], :]
149
+
150
+ input_embedding = np.concatenate(embd_list, axis=-1)
151
+
152
+ return_json = self._get_item_json_without_embedding(
153
+ "_".join(utt.split("_")[:-1]), start_time, end_time
154
+ )
155
+
156
+ if return_json is None:
157
+ logger.warning(
158
+ f"Skip {utt} because no valid segments found in {start_time} to {end_time}."
159
+ )
160
+ return None
161
+ else:
162
+ return_json["input_embedding"] = input_embedding
163
+ return return_json
164
+
165
+ def get_local_times_labels(self, utt):
166
+ assert utt in self.id2segments, f"utt {utt} not found in id2segments"
167
+ time_datas = [x[0] for x in self.id2segments[utt]]
168
+ time_datas = list(map(float, time_datas))
169
+ label_datas = [
170
+ -1 if x[1] == "end" else self.label_to_id[x[1]]
171
+ for x in self.id2segments[utt]
172
+ ]
173
+ return np.array(time_datas), label_datas
174
+
175
+ def _get_item_json_without_embedding(self, utt, start_time, end_time):
176
+ SLICE_DUR = int(math.ceil(end_time - start_time))
177
+
178
+ local_times, local_labels = self.get_local_times_labels(utt)
179
+
180
+ local_times, local_labels = (
181
+ copy.deepcopy(local_times),
182
+ copy.deepcopy(local_labels),
183
+ )
184
+
185
+ assert np.all(local_times[:-1] < local_times[1:]), (
186
+ f"time must be sorted, but {utt} is {local_times}"
187
+ )
188
+
189
+ local_times = local_times - start_time
190
+
191
+ time_L = max(0.0, float(local_times.min()))
192
+ time_R = min(float(SLICE_DUR), float(local_times.max()))
193
+ # Note whether boundary labels are reachable
194
+ keep_boundarys = (time_L + self.EPS < local_times) & (
195
+ local_times < time_R - self.EPS
196
+ )
197
+
198
+ # If no valid boundaries, return None
199
+ if keep_boundarys.sum() <= 0:
200
+ return None
201
+
202
+ mask = np.ones([int(SLICE_DUR * self.frame_rates)], dtype=bool)
203
+ mask[self.time2frame(time_L) : self.time2frame(time_R)] = False
204
+
205
+ true_boundary = np.zeros([int(SLICE_DUR * self.frame_rates)], dtype=float)
206
+ for idx in np.flatnonzero(keep_boundarys):
207
+ true_boundary[self.time2frame(local_times[idx])] = 1
208
+
209
+ true_function = np.zeros(
210
+ [int(SLICE_DUR * self.frame_rates), self.hparams.num_classes],
211
+ dtype=float,
212
+ )
213
+ true_function_list = []
214
+ msa_info = []
215
+ last_pos = self.time2frame(time_L)
216
+ for idx in np.flatnonzero(keep_boundarys):
217
+
218
+ true_function[
219
+ last_pos : self.time2frame(local_times[idx]),
220
+ int(local_labels[idx - 1]),
221
+ ] = 1
222
+ true_function_list.append(
223
+ [int(x) for x in local_labels[idx - 1]]
224
+ if isinstance(local_labels[idx - 1], list)
225
+ else int(local_labels[idx - 1])
226
+ )
227
+ msa_info.append(
228
+ (
229
+ float(max(local_times[idx - 1], time_L)),
230
+ [str(self.id_to_label[int(x)]) for x in local_labels[idx - 1]]
231
+ if isinstance(local_labels[idx - 1], list)
232
+ else str(self.id_to_label[int(local_labels[idx - 1])]),
233
+ )
234
+ )
235
+ last_pos = self.time2frame(local_times[idx])
236
+
237
+ # Check last label correctness
238
+ true_function[
239
+ last_pos : self.time2frame(time_R),
240
+ local_labels[int(np.flatnonzero(keep_boundarys)[-1])],
241
+ ] = 1
242
+ true_function_list.append(
243
+ [int(x) for x in local_labels[int(np.flatnonzero(keep_boundarys)[-1])]]
244
+ if isinstance(local_labels[int(np.flatnonzero(keep_boundarys)[-1])], list)
245
+ else int(local_labels[int(np.flatnonzero(keep_boundarys)[-1])])
246
+ )
247
+
248
+ msa_info.append(
249
+ (
250
+ float(local_times[int(np.flatnonzero(keep_boundarys)[-1])]),
251
+ [
252
+ str(self.id_to_label[int(x)])
253
+ for x in local_labels[int(np.flatnonzero(keep_boundarys)[-1])]
254
+ ]
255
+ if isinstance(
256
+ local_labels[int(np.flatnonzero(keep_boundarys)[-1])], list
257
+ )
258
+ else str(
259
+ self.id_to_label[
260
+ int(local_labels[int(np.flatnonzero(keep_boundarys)[-1])])
261
+ ]
262
+ ),
263
+ )
264
+ )
265
+ # Append final label at end; decide if it's necessary
266
+ msa_info.append((float(time_R), "end"))
267
+
268
+ # Add boundary_mask & function_mask
269
+ frame_len = int(SLICE_DUR * self.frame_rates)
270
+ # During loss computation, boundaries are fully masked
271
+ boundary_mask = np.ones([frame_len], dtype=bool)
272
+ function_mask = np.zeros([frame_len], dtype=bool)
273
+
274
+ # Set masks according to msa_info
275
+ for i in range(len(msa_info) - 1):
276
+ seg_start, seg_label = msa_info[i]
277
+ seg_end, _ = msa_info[i + 1]
278
+ start_frame = self.time2frame(seg_start)
279
+ end_frame = self.time2frame(seg_end)
280
+
281
+ # Handle case where label may be string or list
282
+ is_no_label = (
283
+ seg_label == "NO_LABEL"
284
+ if isinstance(seg_label, str)
285
+ else "NO_LABEL" in seg_label
286
+ )
287
+
288
+ if is_no_label:
289
+ # function_mask set True
290
+ function_mask[start_frame:end_frame] = True
291
+
315
+ # return all things except for input_embedding
316
+ return {
317
+ "data_id": self.internal_tmp_id + "_" + f"{utt}_{start_time}",
318
+ "mask": mask,
319
+ "true_boundary": true_boundary,
320
+ "widen_true_boundary": self.widen_temporal_events(
321
+ true_boundary, num_neighbors=self.hparams.num_neighbors
322
+ ),
323
+ "true_function": true_function,
324
+ "true_function_list": true_function_list,
325
+ "msa_info": msa_info,
326
+ "dataset_id": self.dataset_id_to_dataset_id[self.dataset_type],
327
+ "label_id_mask": self.dataset_id2label_mask[
328
+ self.dataset_id_to_dataset_id[self.dataset_type]
329
+ ],
330
+ "boundary_mask": boundary_mask, # Only effective during loss calculation
331
+ "function_mask": function_mask, # Only effective during loss calculation
332
+ }
src/SongFormer/dataset/HookTheoryAdapter.py ADDED
@@ -0,0 +1,448 @@
1
+ import random
2
+ import os
3
+ from collections import defaultdict
4
+ from pathlib import Path
5
+ import json
6
+ import numpy as np
7
+ import math
8
+ from .label2id import (
9
+ DATASET_ID_ALLOWED_LABEL_IDS,
10
+ DATASET_LABEL_TO_DATASET_ID,
11
+ ID_TO_LABEL,
12
+ LABEL_TO_ID,
13
+ )
14
+ from argparse import Namespace
15
+ from scipy.ndimage import gaussian_filter1d
16
+ from .DatasetAdaper import DatasetAdapter
17
+ from omegaconf import ListConfig
18
+
19
+
20
+ class HookTheoryAdapter(DatasetAdapter):
21
+ def __init__(self, **kwargs):
22
+ (
23
+ structure_jsonl_paths,
24
+ hparams,
25
+ internal_tmp_id,
26
+ dataset_type,
27
+ input_embedding_dir,
28
+ split_ids_path,
29
+ ) = (
30
+ kwargs["structure_jsonl_paths"],
31
+ kwargs["hparams"],
32
+ kwargs["internal_tmp_id"],
33
+ kwargs["dataset_type"],
34
+ kwargs.get("input_embedding_dir", None),
35
+ kwargs.get("split_ids_path", None),
36
+ )
37
+
38
+ # basic attrs
39
+ self.frame_rates = hparams.frame_rates
40
+ self.hparams = hparams
41
+ self.label_to_id = LABEL_TO_ID
42
+ self.dataset_id_to_dataset_id = DATASET_LABEL_TO_DATASET_ID
43
+ self.id_to_label = ID_TO_LABEL
44
+ self.internal_tmp_id = internal_tmp_id
45
+ self.dataset_type = dataset_type
46
+ self.EPS = 1e-6
47
+
48
+ # build dataset-specific label mask
49
+ self.dataset_id2label_mask = {}
50
+ for key, allowed_ids in DATASET_ID_ALLOWED_LABEL_IDS.items():
51
+ self.dataset_id2label_mask[key] = np.ones(
52
+ self.hparams.num_classes, dtype=bool
53
+ )
54
+ self.dataset_id2label_mask[key][allowed_ids] = False
55
+
56
+ assert isinstance(structure_jsonl_paths, (ListConfig, tuple, list))
57
+
58
+ # load segments per audio id
59
+ self.id2segments = defaultdict(list)
60
+ data = self.load_jsonl(structure_jsonl_paths)
61
+
62
+ # input embedding dirs (space-separated)
63
+ self.input_embedding_dir = input_embedding_dir
64
+ all_input_embedding_dirs = input_embedding_dir.split()
65
+
66
+ # get valid ids that exist in all embedding dirs
67
+ valid_data_ids = self.get_ids_from_dir(all_input_embedding_dirs[0])
68
+ for x in all_input_embedding_dirs:
69
+ valid_data_ids = valid_data_ids.intersection(self.get_ids_from_dir(x))
70
+
71
+ # read split ids
72
+ split_ids = []
73
+ with open(split_ids_path) as f:
74
+ for line in f:
75
+ if not line.strip():
76
+ continue
77
+ split_ids.append(line.strip())
78
+ split_ids = set(split_ids)
79
+
80
+ # filter valid ids by split
81
+ valid_data_ids = [
82
+ x for x in valid_data_ids if "_".join(x.split("_")[:-1]) in split_ids
83
+ ]
84
+ valid_data_ids = [
85
+ (internal_tmp_id, dataset_type, x, "HookTheoryAdapter")
86
+ for x in valid_data_ids
87
+ ]
88
+ self.valid_data_ids = valid_data_ids
89
+
90
+ rng = random.Random(42)
91
+ rng.shuffle(self.valid_data_ids)
92
+
93
+ for item in data:
94
+ self.id2segments[Path(item["ori_audio_path"]).stem].append(item)
95
+ # logger.info(f"load {len(self.id2segments)} songs from {structure_jsonl_paths}")
96
+
97
+ def get_ids_from_dir(self, dir_path: str):
98
+ ids = os.listdir(dir_path)
99
+ ids = [Path(x).stem for x in ids if x.endswith(".npy")]
100
+ return set(ids)
101
+
102
+ def time2frame(self, this_time):
103
+ # convert time (s) to frame index
104
+ return int(this_time * self.frame_rates)
105
+
106
+ def load_jsonl(self, paths):
107
+ # load list of jsonl files
108
+ data = []
109
+ for path in paths:
110
+ with open(path, "r", encoding="utf-8") as f:
111
+ for line in f:
112
+ line = line.strip()
113
+ if not line:
114
+ continue
115
+ obj = json.loads(line)
116
+ data.append(obj)
117
+ return data
118
+
119
+ def split_and_label(self, query_start, query_end, segments):
120
+ """
121
+ segments: List of dicts, each with keys: "segment_start", "segment_end", 'label'
122
+ """
123
+ # Step 1: collect all boundary points (only within query interval)
124
+ points = set([query_start, query_end])
125
+ for seg in segments:
126
+ if query_start <= seg["segment_start"] <= query_end:
127
+ points.add(seg["segment_start"])
128
+ if query_start <= seg["segment_end"] <= query_end:
129
+ points.add(seg["segment_end"])
130
+ sorted_points = sorted(points)
131
+
132
+ result = []
133
+ # Step 2: for each small interval, check which segments cover it
134
+ for i in range(len(sorted_points) - 1):
135
+ part_start = sorted_points[i]
136
+ part_end = sorted_points[i + 1]
137
+ labels = []
138
+ for seg in segments:
139
+ if (
140
+ seg["segment_start"] <= part_start
141
+ and seg["segment_end"] >= part_end
142
+ ):
143
+ labels.extend(seg["label"])
144
+ if not labels:
145
+ labels = ["NO_LABEL"]
146
+ result.append(
147
+ {"segment_start": part_start, "segment_end": part_end, "labels": labels}
148
+ )
149
+
150
+ # deduplicate labels per interval
151
+ for idx in range(len(result)):
152
+ result[idx]["labels"] = list(set(result[idx]["labels"]))
153
+ return result
154
+
155
+ def merge_small_intervals(self, parts, min_duration=2.0):
156
+ """
157
+ parts: list of dicts with "segment_start", "segment_end", 'labels'
158
+ Merge intervals shorter than min_duration into neighbor intervals.
159
+ """
160
+ new_parts = []
161
+ i = 0
162
+ while i < len(parts):
163
+ part = parts[i]
164
+ duration = part["segment_end"] - part["segment_start"]
165
+ if duration < min_duration:
166
+ # decide where to merge
167
+ if len(new_parts) > 0 and (i + 1) < len(parts):
168
+ # randomly choose previous or next
169
+ if random.choice([True, False]):
170
+ prev = new_parts[-1]
171
+ prev["segment_end"] = part["segment_end"]
172
+ else:
173
+ next_part = parts[i + 1]
174
+ next_part["segment_start"] = part["segment_start"]
175
+ # skip adding this part
176
+ elif len(new_parts) > 0:
177
+ # only previous exists - merge into previous
178
+ prev = new_parts[-1]
179
+ prev["segment_end"] = part["segment_end"]
180
+ elif (i + 1) < len(parts):
181
+ # only next exists - merge into next
182
+ next_part = parts[i + 1]
183
+ next_part["segment_start"] = part["segment_start"]
184
+ # else: nothing to merge, drop
185
+ i += 1
186
+ else:
187
+ new_parts.append(part)
188
+ i += 1
189
+ return new_parts
190
+
191
+ def rounding_time(self, segments, num_decimals=3):
192
+ # round segment boundaries to given decimals
193
+ for idx in range(len(segments)):
194
+ segments[idx]["segment_start"] = round(
195
+ segments[idx]["segment_start"], num_decimals
196
+ )
197
+ segments[idx]["segment_end"] = round(
198
+ segments[idx]["segment_end"], num_decimals
199
+ )
200
+ return segments
201
+
202
+ def get_ids(self):
203
+ return list(self.valid_data_ids)
204
+
205
+ def convert_label(self, label: str):
206
+ # map various labels to canonical labels
207
+ mapping = {
208
+ "chorus": "chorus",
209
+ "intro": "intro",
210
+ "bridge": "bridge",
211
+ "verse": "verse",
212
+ "pre-chorus": "pre-chorus",
213
+ "solo": "inst",
214
+ "instrumental": "inst",
215
+ "outro": "outro",
216
+ "NO_LABEL": "NO_LABEL",
217
+ }
218
+ assert label in mapping, f"Unknown label: {label}"
219
+ return mapping[label]
220
+
221
+ def parts_to_label_and_times(self, parts, use_random_tag=True):
222
+ """
223
+ parts: list of dicts with 'segment_start', 'segment_end', 'labels'
224
+
225
+ if use_random_tag: label will be random from valid labels
226
+ else: label will be all valid labels (labels list)
227
+
228
+ return:
229
+ local_times: np.array of right boundary time points (excluding query_end)
230
+ local_labels: list of label indices corresponding to self.label_to_id
231
+ """
232
+ local_times = []
233
+ local_labels = []
234
+
235
+ for part in parts:
236
+ local_times.append(part["segment_start"])
237
+ label = random.choice(part["labels"]) if use_random_tag else part["labels"]
238
+ local_labels.append(self.label_to_id[self.convert_label(label)])
239
+ return np.array(local_times), local_labels
240
+
241
+ def get_parts(self, utt, query_start, query_end):
242
+ key = "_".join(utt.split("_")[:-1])
243
+ assert key in self.id2segments
244
+ segments = self.id2segments[key]
245
+ segments = self.rounding_time(segments)
246
+ parts = self.split_and_label(query_start, query_end, segments)
247
+
248
+ # Apply merging twice to remove very short intervals
249
+ new_parts = self.merge_small_intervals(parts, min_duration=2.0)
250
+ new_parts = self.merge_small_intervals(new_parts, min_duration=2.0)
251
+
252
+ return new_parts
253
+
254
+ def widen_temporal_events(self, events, num_neighbors):
255
+ # smooth binary events with a normalized gaussian
256
+ def theoretical_gaussian_max(sigma):
257
+ return 1 / (np.sqrt(2 * np.pi) * sigma)
258
+
259
+ widen_events = events
260
+ sigma = num_neighbors / 3.0
261
+ smoothed = gaussian_filter1d(widen_events.astype(float), sigma=sigma)
262
+ smoothed /= theoretical_gaussian_max(sigma)
263
+ smoothed = np.clip(smoothed, 0, 1)
264
+
265
+ return smoothed
266
+
267
+ def get_item_json(self, utt, start_time, end_time):
268
+ # load embeddings from all embedding dirs
269
+ embd_list = []
270
+ embd_dirs = self.input_embedding_dir.split()
271
+ for embd_dir in embd_dirs:
272
+ if not Path(embd_dir).exists():
273
+ raise FileNotFoundError(
274
+ f"Embedding directory {embd_dir} does not exist"
275
+ )
276
+ tmp = np.load(Path(embd_dir) / f"{utt}.npy").squeeze(axis=0)
277
+ embd_list.append(tmp)
278
+
279
+ # Check that max/min length difference across embeddings <= 2
280
+ if len(embd_list) > 1:
281
+ embd_shapes = [x.shape for x in embd_list]
282
+ max_shape = max(embd_shapes, key=lambda x: x[0])
283
+ min_shape = min(embd_shapes, key=lambda x: x[0])
284
+ if abs(max_shape[0] - min_shape[0]) > 2:
285
+ raise ValueError(
286
+ f"Embedding shapes differ too much: {max_shape} vs {min_shape}"
287
+ )
288
+
289
+ for idx in range(len(embd_list)):
290
+ embd_list[idx] = embd_list[idx][: min_shape[0], :]
291
+
292
+ input_embedding = np.concatenate(embd_list, axis=-1)
293
+
294
+ return_json = self.get_item_json_without_embedding(utt, start_time, end_time)
295
+ if return_json is None:
296
+ return None
297
+ else:
298
+ return_json["input_embedding"] = input_embedding
299
+ return return_json
300
+
301
+ def get_item_json_without_embedding(self, utt, start_time, end_time):
302
+ SLICE_DUR = int(math.ceil(end_time - start_time))
303
+
304
+ local_times, local_labels = self.parts_to_label_and_times(
305
+ self.get_parts(utt, start_time, end_time)
306
+ )
307
+
308
+ assert np.all(local_times[:-1] < local_times[1:]), (
309
+ f"time must be sorted, but {utt} is {local_times}"
310
+ )
311
+
312
+ # normalize local times relative to slice start
313
+ local_times = local_times - start_time
314
+ time_L = 0.0
315
+ # here time_R is full slice duration because NO_LABEL may appear
316
+ time_R = float(SLICE_DUR)
317
+
318
+ # determine which boundaries are within (time_L, time_R)
319
+ keep_boundarys = (time_L + self.EPS < local_times) & (
320
+ local_times < time_R - self.EPS
321
+ )
322
+
323
+ # if no valid boundary, return None
324
+ if keep_boundarys.sum() <= 0:
325
+ return None
326
+
327
+ mask = np.ones([int(SLICE_DUR * self.frame_rates)], dtype=bool)
328
+ mask[self.time2frame(time_L) : self.time2frame(time_R)] = False
329
+
330
+ true_boundary = np.zeros([int(SLICE_DUR * self.frame_rates)], dtype=float)
331
+ for idx in np.flatnonzero(keep_boundarys):
332
+ true_boundary[self.time2frame(local_times[idx])] = 1
333
+
334
+ true_function = np.zeros(
335
+ [int(SLICE_DUR * self.frame_rates), self.hparams.num_classes],
336
+ dtype=float,
337
+ )
338
+ true_function_list = []
339
+ msa_info = []
340
+ last_pos = self.time2frame(time_L)
341
+ for idx in np.flatnonzero(keep_boundarys):
342
+ # local_labels[idx] might be int or list(int)
343
+ true_function[
344
+ last_pos : self.time2frame(local_times[idx]),
345
+ local_labels[idx - 1],
346
+ ] = 1
347
+ true_function_list.append(
348
+ [int(x) for x in local_labels[idx - 1]]
349
+ if isinstance(local_labels[idx - 1], list)
350
+ else int(local_labels[idx - 1])
351
+ )
352
+ msa_info.append(
353
+ (
354
+ float(max(local_times[idx - 1], time_L)),
355
+ [str(self.id_to_label[int(x)]) for x in local_labels[idx - 1]]
356
+ if isinstance(local_labels[idx - 1], list)
357
+ else str(self.id_to_label[int(local_labels[idx - 1])]),
358
+ )
359
+ )
360
+ last_pos = self.time2frame(local_times[idx])
361
+
362
+ # check last label correctness
363
+ true_function[
364
+ last_pos : self.time2frame(time_R),
365
+ local_labels[int(np.flatnonzero(keep_boundarys)[-1])],
366
+ ] = 1
367
+ true_function_list.append(
368
+ [int(x) for x in local_labels[int(np.flatnonzero(keep_boundarys)[-1])]]
369
+ if isinstance(local_labels[int(np.flatnonzero(keep_boundarys)[-1])], list)
370
+ else int(local_labels[int(np.flatnonzero(keep_boundarys)[-1])])
371
+ )
372
+ msa_info.append(
373
+ (
374
+ float(local_times[int(np.flatnonzero(keep_boundarys)[-1])]),
375
+ [
376
+ str(self.id_to_label[int(x)])
377
+ for x in local_labels[int(np.flatnonzero(keep_boundarys)[-1])]
378
+ ]
379
+ if isinstance(
380
+ local_labels[int(np.flatnonzero(keep_boundarys)[-1])], list
381
+ )
382
+ else str(
383
+ self.id_to_label[
384
+ int(local_labels[int(np.flatnonzero(keep_boundarys)[-1])])
385
+ ]
386
+ ),
387
+ )
388
+ )
389
+ # append final "end" marker
390
+ msa_info.append((float(time_R), "end"))
391
+
392
+ # -------------------------
393
+ # boundary_mask & function_mask
394
+ # -------------------------
395
+ frame_len = int(SLICE_DUR * self.frame_rates)
396
+ boundary_mask = np.zeros([frame_len], dtype=bool)
397
+ function_mask = np.zeros([frame_len], dtype=bool)
398
+
399
+ # set masks according to msa_info
400
+ for i in range(len(msa_info) - 1):
401
+ seg_start, seg_label = msa_info[i]
402
+ seg_end, _ = msa_info[i + 1]
403
+ start_frame = self.time2frame(seg_start)
404
+ end_frame = self.time2frame(seg_end)
405
+
406
+ # handle label being string or list
407
+ is_no_label = (
408
+ seg_label == "NO_LABEL"
409
+ if isinstance(seg_label, str)
410
+ else "NO_LABEL" in seg_label
411
+ )
412
+
413
+ if is_no_label:
414
+ # set function_mask True for NO_LABEL regions
415
+ function_mask[start_frame:end_frame] = True
416
+
417
+ # set boundary_mask True for regions >4s away from ends
418
+ left_offset = self.time2frame(seg_start + 4)
419
+ right_offset = self.time2frame(seg_end - 4)
420
+ if i == 0:
421
+ if right_offset > 0:
422
+ boundary_mask[0 : min(right_offset, frame_len)] = True
423
+ elif i == len(msa_info) - 2:
424
+ if left_offset < frame_len:
425
+ boundary_mask[left_offset:frame_len] = True
426
+ elif right_offset > left_offset:
427
+ boundary_mask[left_offset:right_offset] = True
428
+
429
+ # -------------------------
430
+ # return all things except input_embedding
431
+ # -------------------------
432
+ return {
433
+ "data_id": self.internal_tmp_id + "_" + f"{utt}_{start_time}",
434
+ "mask": mask,
435
+ "true_boundary": true_boundary,
436
+ "widen_true_boundary": self.widen_temporal_events(
437
+ true_boundary, num_neighbors=self.hparams.num_neighbors
438
+ ),
439
+ "true_function": true_function,
440
+ "true_function_list": true_function_list,
441
+ "msa_info": msa_info,
442
+ "dataset_id": self.dataset_id_to_dataset_id[self.dataset_type],
443
+ "label_id_mask": self.dataset_id2label_mask[
444
+ self.dataset_id_to_dataset_id[self.dataset_type]
445
+ ],
446
+ "boundary_mask": boundary_mask, # only effective during loss computation
447
+ "function_mask": function_mask, # only effective during loss computation
448
+ }
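
Note: both dataset adapters turn hard boundary frames into soft training targets via `widen_temporal_events`. A minimal standalone sketch of that smoothing step is below; the toy vector length, boundary positions, and `num_neighbors=6` are illustrative assumptions, not values from the configs.

```python
# Sketch of the Gaussian boundary "widening": a 0/1 boundary vector is blurred
# and renormalized so the peak at each boundary frame stays close to 1.0.
import numpy as np
from scipy.ndimage import gaussian_filter1d


def widen_temporal_events(events: np.ndarray, num_neighbors: int) -> np.ndarray:
    sigma = num_neighbors / 3.0
    smoothed = gaussian_filter1d(events.astype(float), sigma=sigma)
    smoothed /= 1 / (np.sqrt(2 * np.pi) * sigma)  # peak height of a unit Gaussian impulse
    return np.clip(smoothed, 0.0, 1.0)


events = np.zeros(50)
events[[10, 30]] = 1.0                 # two boundary frames
soft = widen_temporal_events(events, num_neighbors=6)
print(soft.round(2)[5:16])             # values ramp up to ~1.0 around frame 10
```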
src/SongFormer/dataset/custom_types.py ADDED
@@ -0,0 +1,14 @@
1
+ """
2
+ MsaInfo
3
+ A list of (timestamp, label) tuples used to represent music structure
4
+ analysis (MSA). The first element of the tuple is a float timestamp
5
+ (in seconds) and the second is a string label
6
+
7
+ Example
8
+ -------
9
+ >>> msa: MsaInfo = [(0.0, "intro"), (12.5, "verse"), (34.0, "chorus")]
10
+ """
11
+
12
+ from typing import List, Tuple
13
+
14
+ MsaInfo = List[Tuple[float, str]]
src/SongFormer/dataset/label2id.py ADDED
@@ -0,0 +1,163 @@
1
+ LABEL_TO_ID = {
2
+ "intro": 0,
3
+ "verse": 1,
4
+ "chorus": 2,
5
+ "bridge": 3,
6
+ "inst": 4,
7
+ "outro": 5,
8
+ "silence": 6,
9
+ "intchorus": 7,
10
+ "prechorus": 8,
11
+ "gtrbreak": 9,
12
+ "solo": 10,
13
+ "quietchorus": 11,
14
+ "bre": 12,
15
+ "break": 13,
16
+ "introverse": 14,
17
+ "mainriff": 15,
18
+ "chorushalf": 16,
19
+ "instintro": 17,
20
+ "gtr": 18,
21
+ "vocaloutro": 19,
22
+ "verse_slow": 20,
23
+ "fadein": 21,
24
+ "saxobeat": 22,
25
+ "transition": 23,
26
+ "verse1a": 24,
27
+ "build": 25,
28
+ "pre-chorus": 26,
29
+ "outroa": 27,
30
+ "bigoutro": 28,
31
+ "fast": 29,
32
+ "instrumentalverse": 30,
33
+ "section": 31,
34
+ "choruspart": 32,
35
+ "instbridge": 33,
36
+ "guitar": 34,
37
+ "instrumental": 35,
38
+ "breakdown": 36,
39
+ "rhythmlessintro": 37,
40
+ "intropt": 38,
41
+ "interlude": 39,
42
+ "postchorus": 40,
43
+ "postverse": 41,
44
+ "opening": 42,
45
+ "altchorus": 43,
46
+ "stutter": 44,
47
+ "oddriff": 45,
48
+ "synth": 46,
49
+ "preverse": 47,
50
+ "quiet": 48,
51
+ "raps": 49,
52
+ "verseinst": 50,
53
+ "instchorus": 51,
54
+ "chorus_instrumental": 52,
55
+ "slowverse": 53,
56
+ "slow": 54,
57
+ "worstthingever": 55,
58
+ "transition2a": 56,
59
+ "miniverse": 57,
60
+ "refrain": 58,
61
+ "introchorus": 59,
62
+ "drumroll": 60,
63
+ "guitarsolo": 61,
64
+ "versepart": 62,
65
+ "chorusinst": 63,
66
+ "ending": 64,
67
+ "no-vocal-intro": 65,
68
+ "no-vocal-interlude": 66,
69
+ "no-vocal-outro": 67,
70
+ "NO_LABEL": 68, # Only referring to cases without labels, this portion of labels will be ignored during the loss calculation process.
71
+ }
72
+
73
+ ID_TO_LABEL = {v: k for k, v in LABEL_TO_ID.items()}
74
+
75
+ # Reserve 64 embedding positions for dataset identifiers in the model.
76
+ DATASET_LABEL_TO_DATASET_ID = {
77
+ "SongForm-HX-7Class": 0, # Categories after rule mapping for HarmonixSet
78
+ "SongForm-HX-Widen": 1, # Original HarmonixSet
79
+ "SongForm-Private-Raw": 2,
80
+ "SongForm-Private": 3,
81
+ "SongForm-HX-Gemini-Relabeled": 4, # Rule-mapped HarmonixSet corrected by Gemini
82
+ "SongForm-HX-8Class": 5, # Rule-mapped (pre-chorus retained)
83
+ "SongForm-Hook": 6,
84
+ "SongForm-Gem": 7,
85
+ "SongForm-Gem-Only-Label": 8, # Use only segments with labels in SongForm-Gem
86
+ }
87
+
88
+ DATASET_ID_TO_DATASET_LABEL = {v: k for k, v in DATASET_LABEL_TO_DATASET_ID.items()}
89
+
90
+ DATASET_ID_ALLOWED_LABEL_IDS = {
91
+ 0: [0, 1, 2, 3, 4, 5, 6],
92
+ 1: [
93
+ 0,
94
+ 1,
95
+ 2,
96
+ 3,
97
+ 4,
98
+ 5,
99
+ 6,
100
+ 7,
101
+ 8,
102
+ 9,
103
+ 10,
104
+ 11,
105
+ 12,
106
+ 13,
107
+ 14,
108
+ 15,
109
+ 16,
110
+ 17,
111
+ 18,
112
+ 19,
113
+ 20,
114
+ 21,
115
+ 22,
116
+ 23,
117
+ 24,
118
+ 25,
119
+ 27,
120
+ 28,
121
+ 29,
122
+ 30,
123
+ 31,
124
+ 32,
125
+ 33,
126
+ 34,
127
+ 35,
128
+ 36,
129
+ 37,
130
+ 38,
131
+ 40,
132
+ 41,
133
+ 42,
134
+ 43,
135
+ 44,
136
+ 45,
137
+ 46,
138
+ 47,
139
+ 48,
140
+ 49,
141
+ 50,
142
+ 51,
143
+ 52,
144
+ 53,
145
+ 54,
146
+ 55,
147
+ 56,
148
+ 57,
149
+ 58,
150
+ 59,
151
+ 60,
152
+ 61,
153
+ 62,
154
+ 63,
155
+ ],
156
+ 2: [0, 1, 2, 3, 26, 39, 64, 65, 66, 67],
157
+ 3: [0, 1, 2, 3, 4, 5, 6, 26, 39, 64, 65, 66, 67],
158
+ 4: [0, 1, 2, 3, 4, 5, 6, 26],
159
+ 5: [0, 1, 2, 3, 4, 5, 6, 26],
160
+ 6: [0, 1, 2, 3, 4, 5, 6, 26],
161
+ 7: [0, 1, 2, 3, 4, 5, 6, 26],
162
+ 8: [0, 1, 2, 3, 4, 5, 6, 26],
163
+ }
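
For reference, this is how the tables above are typically consumed by the adapters and by `infer/infer.py`: each dataset id gets a boolean mask over the label space in which `True` marks disallowed label ids, which are later filled with `-inf` before the softmax. A small sketch follows; it assumes `src/SongFormer` is on `PYTHONPATH`, and `num_classes = 128` mirrors the inference script rather than being defined here.

```python
import numpy as np
from dataset.label2id import DATASET_ID_ALLOWED_LABEL_IDS, DATASET_LABEL_TO_DATASET_ID

num_classes = 128  # assumption: matches init_args.num_classes in infer/infer.py
dataset_id2label_mask = {}
for dataset_id, allowed_ids in DATASET_ID_ALLOWED_LABEL_IDS.items():
    mask = np.ones(num_classes, dtype=bool)   # True = label id is masked out
    mask[allowed_ids] = False                 # allowed ids stay unmasked
    dataset_id2label_mask[dataset_id] = mask

hx8 = DATASET_LABEL_TO_DATASET_ID["SongForm-HX-8Class"]
print(np.flatnonzero(~dataset_id2label_mask[hx8]))  # -> [ 0  1  2  3  4  5  6 26]
```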
src/SongFormer/dataset/msa_info_utils.py ADDED
@@ -0,0 +1,47 @@
1
+ from dataset.custom_types import MsaInfo
2
+ from dataset.label2id import LABEL_TO_ID
3
+
4
+
5
+ def load_msa_info(msa_info_path):
6
+ msa_info: MsaInfo = []
7
+ with open(msa_info_path) as f:
8
+ for line in f:
9
+ line = line.strip()
10
+ if not line:
11
+ continue
12
+ time_, label = line.split()
13
+ time_ = float(time_)
14
+ label = str(label)
15
+ assert label in LABEL_TO_ID or label == "end", f"{label} not in LABEL_TO_ID"
16
+ msa_info.append((time_, label))
17
+ assert msa_info[-1][1] == "end", f"last {msa_info[-1][1]} != end"
18
+ return msa_info
19
+
20
+
21
+ def load_msa_infos(msa_str):
22
+ msa_info: MsaInfo = []
23
+ for line in msa_str:
24
+ line = line.strip()
25
+ if not line:
26
+ continue
27
+ time_, label = line.split()
28
+ time_ = float(time_)
29
+ label = str(label)
30
+ assert label in LABEL_TO_ID or label == "end", f"{label} not in LABEL_TO_ID"
31
+ msa_info.append((time_, label))
32
+ assert msa_info[-1][1] == "end", f"last {msa_info[-1][1]} != end"
33
+ return msa_info
34
+
35
+
36
+ def dump_msa_info(msa_info_path, msa_info: MsaInfo):
37
+ with open(msa_info_path, "w") as f:
38
+ for time_, label in msa_info:
39
+ f.write(f"{time_} {label}\n")
40
+
41
+
42
+ def dump_msa_infos(msa_info: MsaInfo):
43
+ mas_strs = []
44
+ for time_, label in msa_info:
45
+ mas_strs.append(f"{round(time_, 2)} {label}")
46
+
47
+ return "\n".join(mas_strs)
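
A quick round-trip sketch for the helpers above; the file name is illustrative and it assumes `src/SongFormer` is on `PYTHONPATH` so the `dataset` package resolves.

```python
from dataset.custom_types import MsaInfo
from dataset.msa_info_utils import dump_msa_info, dump_msa_infos, load_msa_info

msa: MsaInfo = [(0.0, "intro"), (12.5, "verse"), (34.0, "chorus"), (51.2, "end")]

dump_msa_info("example.txt", msa)      # writes one "time label" pair per line
print(load_msa_info("example.txt"))    # -> [(0.0, 'intro'), (12.5, 'verse'), (34.0, 'chorus'), (51.2, 'end')]
print(dump_msa_infos(msa))             # same segments as a newline-joined string with rounded times
```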
src/SongFormer/eval.sh ADDED
@@ -0,0 +1,22 @@
1
+ export CUDA_VISIBLE_DEVICES=-1
2
+ export PYTHONPATH=${PWD}:$PYTHONPATH
3
+
4
+ export HYDRA_FULL_ERROR=1
5
+ export OMP_NUM_THREADS=1
6
+ export MPI_NUM_THREADS=1
7
+ export NCCL_P2P_DISABLE=1
8
+ export NCCL_IB_DISABLE=1
9
+
10
+
11
+ EST_DIR=
12
+ ANN_DIR=
13
+ OUTPUT_DIR=
14
+ echo "$EST_DIR --> $OUTPUT_DIR"
15
+ mkdir -p "$OUTPUT_DIR"
16
+
17
+ python evaluation/eval_infer_results.py \
18
+ --ann_dir $ANN_DIR \
19
+ --est_dir $EST_DIR \
20
+ --output_dir $OUTPUT_DIR \
21
+ --prechorus2what verse
22
+ # --armerge_continuous_segments
src/SongFormer/evaluation/eval_infer_results.py ADDED
@@ -0,0 +1,198 @@
1
+ import argparse
2
+ import os
3
+ from collections import defaultdict
4
+ from pathlib import Path
5
+ import mir_eval
6
+ import numpy as np
7
+ import pandas as pd
8
+ from dataset.custom_types import MsaInfo
9
+ from dataset.label2id import LABEL_TO_ID
10
+ from dataset.msa_info_utils import load_msa_info
11
+ from msaf.eval import compute_results
12
+ from postprocessing.calc_acc import cal_acc
13
+ from postprocessing.calc_iou import cal_iou
14
+ from tqdm import tqdm
15
+ from loguru import logger
16
+
17
+ LEGAL_LABELS = {
18
+ "end",
19
+ "intro",
20
+ "verse",
21
+ "chorus",
22
+ "bridge",
23
+ "inst",
24
+ "outro",
25
+ "silence",
26
+ "pre-chorus",
27
+ }
28
+
29
+
30
+ def to_inters_labels(msa_info: MsaInfo):
31
+ label_ids = np.array([LABEL_TO_ID[x[1]] for x in msa_info[:-1]])
32
+ times = [x[0] for x in msa_info]
33
+ start_times = np.column_stack([np.array(times[:-1]), np.array(times[1:])])
34
+ return start_times, label_ids
35
+
36
+
37
+ def merge_continuous_segments(segments):
38
+ """
39
+ Merge continuous segments with the same label.
40
+
41
+ Parameters:
42
+ segments: List of tuples [(start_time, label), ...], where the last element is (end_time, 'end')
43
+
44
+ Returns:
45
+ Merged segment list in the same format [(start_time, label), ...], with the last element being (end_time, 'end')
46
+ """
47
+ if not segments or len(segments) < 2:
48
+ return segments
49
+
50
+ merged = []
51
+ current_start = segments[0][0]
52
+ current_label = segments[0][1]
53
+
54
+ for i in range(1, len(segments)):
55
+ time, label = segments[i]
56
+
57
+ if label == "end":
58
+ if current_label != "end":
59
+ merged.append((current_start, current_label))
60
+ merged.append((time, "end"))
61
+ break
62
+
63
+ if label != current_label:
64
+ merged.append((current_start, current_label))
65
+ current_start = time
66
+ current_label = label
67
+
68
+ return merged
69
+
70
+
71
+ def main():
72
+ argparser = argparse.ArgumentParser()
73
+ argparser.add_argument("--ann_dir", type=str, required=True)
74
+ argparser.add_argument("--est_dir", type=str, required=True)
75
+ argparser.add_argument("--output_dir", type=str, default="./eval_infer_results")
76
+ argparser.add_argument("--prechorus2what", type=str, default=None)
77
+ argparser.add_argument("--armerge_continuous_segments", action="store_true")
78
+ args = argparser.parse_args()
79
+
80
+ ann_dir = args.ann_dir
81
+ est_dir = args.est_dir
82
+ output_dir = args.output_dir
83
+ if args.armerge_continuous_segments:
84
+ logger.info("Merging continuous segments")
85
+ os.makedirs(output_dir, exist_ok=True)
86
+
87
+ ann_id_lists = [x for x in os.listdir(ann_dir) if x.endswith(".txt")]
88
+ est_id_lists = [x for x in os.listdir(est_dir) if x.endswith(".txt")]
89
+
90
+ common_id_lists = set(ann_id_lists) & set(est_id_lists)
91
+ common_id_lists = list(common_id_lists)
92
+ logger.info(f"Common number of files: {len(common_id_lists)}")
93
+
94
+ resultes = []
95
+ ious = {}
96
+
97
+ for id in tqdm(common_id_lists):
98
+ try:
99
+ logger.info(f"Processing {id}")
100
+ ann_msa = load_msa_info(os.path.join(ann_dir, id))
101
+ est_msa = load_msa_info(os.path.join(est_dir, id))
102
+
103
+ if args.prechorus2what == "verse":
104
+ ann_msa = [
105
+ (t, "verse") if l == "pre-chorus" else (t, l) for t, l in ann_msa
106
+ ]
107
+ est_msa = [
108
+ (t, "verse") if l == "pre-chorus" else (t, l) for t, l in est_msa
109
+ ]
110
+ elif args.prechorus2what == "chorus":
111
+ ann_msa = [
112
+ (t, "chorus") if l == "pre-chorus" else (t, l) for t, l in ann_msa
113
+ ]
114
+ est_msa = [
115
+ (t, "chorus") if l == "pre-chorus" else (t, l) for t, l in est_msa
116
+ ]
117
+ elif args.prechorus2what is not None:
118
+ raise ValueError(f"Unknown prechorus2what: {args.prechorus2what}")
119
+ if args.armerge_continuous_segments:
120
+ ann_msa = merge_continuous_segments(ann_msa)
121
+ est_msa = merge_continuous_segments(est_msa)
122
+
123
+ ann_inter, ann_labels = to_inters_labels(ann_msa)
124
+ est_inter, est_labels = to_inters_labels(est_msa)
125
+
126
+ result = compute_results(
127
+ ann_inter,
128
+ est_inter,
129
+ ann_labels,
130
+ est_labels,
131
+ bins=11,
132
+ est_file="test.txt",
133
+ weight=0.58,
134
+ )
135
+ acc = cal_acc(ann_msa, est_msa, post_digit=3)
136
+
137
+ ious[id] = cal_iou(ann_msa, est_msa)
138
+ result["HitRate_1P"], result["HitRate_1R"], result["HitRate_1F"] = (
139
+ mir_eval.segment.detection(ann_inter, est_inter, window=1, trim=False)
140
+ )
141
+ result.update({"id": Path(id).stem})
142
+ result.update({"acc": acc})
143
+ for v in ious[id]:
144
+ result.update({f"iou-{v['label']}": v["iou"]})
145
+ del result["track_id"]
146
+ del result["ds_name"]
147
+
148
+ resultes.append(result)
149
+ except Exception as e:
150
+ logger.error(f"Error processing {id}: {e}")
151
+ continue
152
+
153
+ df = pd.DataFrame(resultes)
154
+ df.to_csv(f"{output_dir}/eval_infer.csv", index=False)
155
+
156
+ intsec_dur_total = defaultdict(float)
157
+ uni_dur_total = defaultdict(float)
158
+
159
+ for tid, value in ious.items():
160
+ for item in value:
161
+ label = item["label"]
162
+ intsec_dur_total[label] += item.get("intsec_dur", 0)
163
+ uni_dur_total[label] += item.get("uni_dur", 0)
164
+
165
+ total_intsec = sum(intsec_dur_total.values())
166
+ total_uni = sum(uni_dur_total.values())
167
+ overall_iou = total_intsec / total_uni if total_uni > 0 else 0.0
168
+
169
+ class_ious = {}
170
+ for label in intsec_dur_total:
171
+ intsec = intsec_dur_total[label]
172
+ uni = uni_dur_total[label]
173
+ class_ious[label] = intsec / uni if uni > 0 else 0.0
174
+
175
+ summary = pd.DataFrame(
176
+ [
177
+ {
178
+ "num_samples": len(df),
179
+ "HR.5F": df["HitRate_0.5F"].mean(),
180
+ "HR3F": df["HitRate_3F"].mean(),
181
+ "HR1F": df["HitRate_1F"].mean(),
182
+ "PWF": df["PWF"].mean(),
183
+ "Sf": df["Sf"].mean(),
184
+ "acc": df["acc"].mean(),
185
+ "iou": overall_iou,
186
+ **{f"iou_{k}": v for k, v in class_ious.items()},
187
+ }
188
+ ]
189
+ )
190
+ with open(f"{output_dir}/eval_infer_summary.md", "w") as f:
191
+ print(summary.to_markdown(), file=f)
192
+
193
+ summary.to_csv(f"{output_dir}/eval_infer_summary.csv", index=False)
194
+ logger.info(f"Results saved to {output_dir}")
195
+
196
+
197
+ if __name__ == "__main__":
198
+ main()
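
As a sanity check of the remapping and merging steps in `main()`, here is a toy example. Importing this module pulls in `msaf` and `mir_eval`, so it assumes the evaluation dependencies are installed and `src/SongFormer` is on `PYTHONPATH`.

```python
from evaluation.eval_infer_results import merge_continuous_segments

ann = [(0.0, "verse"), (20.0, "pre-chorus"), (35.0, "chorus"), (60.0, "end")]
# --prechorus2what verse maps pre-chorus onto verse ...
ann = [(t, "verse") if l == "pre-chorus" else (t, l) for t, l in ann]
# ... leaving two adjacent "verse" segments, which the merge step collapses.
print(merge_continuous_segments(ann))
# -> [(0.0, 'verse'), (35.0, 'chorus'), (60.0, 'end')]
```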
src/SongFormer/infer.sh ADDED
@@ -0,0 +1,21 @@
1
+
2
+ export CUDA_VISIBLE_DEVICES=
3
+ echo "use gpu ${CUDA_VISIBLE_DEVICES}"
4
+
5
+ export PYTHONPATH=../third_party:$PYTHONPATH
6
+
7
+ export OMP_NUM_THREADS=1
8
+ export MPI_NUM_THREADS=1
9
+ export NCCL_P2P_DISABLE=1
10
+ export NCCL_IB_DISABLE=1
11
+
12
+ python infer/infer.py \
13
+ -i XXX.scp \
14
+ -o XXX_dir \
15
+ --model SongFormer \
16
+ --checkpoint SongFormer.safetensors \
17
+ --config_path SongFormer.yaml \
18
+ -gn 1 \
19
+ -tn 1
20
+ # --debug
21
+ # --no_rule_post_processing
src/SongFormer/infer/infer.py ADDED
@@ -0,0 +1,439 @@
1
+ import argparse
2
+ import importlib
3
+ import json
4
+ import math
5
+ import multiprocessing as mp
6
+ import os
7
+ import time
8
+ from argparse import Namespace
9
+ from pathlib import Path
10
+
11
+ # monkey patch: restore scipy.inf (removed in newer SciPy releases) so that msaf keeps working
12
+ import scipy
13
+ import numpy as np
14
+
15
+ scipy.inf = np.inf
16
+
17
+ import librosa
18
+ import torch
19
+ from ema_pytorch import EMA
20
+ from loguru import logger
21
+ from muq import MuQ
22
+ from musicfm.model.musicfm_25hz import MusicFM25Hz
23
+ from omegaconf import OmegaConf
24
+ from tqdm import tqdm
25
+
26
+ mp.set_start_method("spawn", force=True)
27
+
28
+ MUSICFM_HOME_PATH = os.path.join("ckpts", "MusicFM")
29
+
30
+ BEFORE_DOWNSAMPLING_FRAME_RATES = 25
31
+ AFTER_DOWNSAMPLING_FRAME_RATES = 8.333
32
+
33
+ DATASET_LABEL = "SongForm-HX-8Class"
34
+ DATASET_IDS = [5]
35
+
36
+ TIME_DUR = 420
37
+ INPUT_SAMPLING_RATE = 24000
38
+
39
+ from dataset.label2id import DATASET_ID_ALLOWED_LABEL_IDS, DATASET_LABEL_TO_DATASET_ID
40
+ from postprocessing.functional import postprocess_functional_structure
41
+
42
+
43
+ def get_processed_ids(output_path):
44
+ """Get already processed IDs from output directory"""
45
+ ids = os.listdir(output_path)
46
+ ret = []
47
+ for x in ids:
48
+ if x.endswith(".json"):
49
+ ret.append(x.replace(".json", ""))
50
+ return set(ret)
51
+
52
+
53
+ def get_processing_ids(input_path, processed_ids_set):
54
+ """Get IDs to be processed from input directory"""
55
+ ret = []
56
+ with open(input_path) as f:
57
+ for line in f:
58
+ if line.strip() and Path(line.strip()).stem not in processed_ids_set:
59
+ ret.append(line.strip())
60
+ return ret
61
+
62
+
63
+ def load_checkpoint(checkpoint_path, device=None):
64
+ """Load checkpoint from path"""
65
+ if device is None:
66
+ device = "cpu"
67
+
68
+ if checkpoint_path.endswith(".pt"):
69
+ checkpoint = torch.load(checkpoint_path, map_location=device)
70
+ elif checkpoint_path.endswith(".safetensors"):
71
+ from safetensors.torch import load_file
72
+
73
+ checkpoint = {"model_ema": load_file(checkpoint_path, device=device)}
74
+ else:
75
+ raise ValueError("Unsupported checkpoint format. Use .pt or .safetensors")
76
+ return checkpoint
77
+
78
+
79
+ def rule_post_processing(msa_list):
80
+ if len(msa_list) <= 2:
81
+ return msa_list
82
+
83
+ result = msa_list.copy()
84
+
85
+ while len(result) > 2:
86
+ first_duration = result[1][0] - result[0][0]
87
+ if first_duration < 1.0 and len(result) > 2:
88
+ result[0] = (result[0][0], result[1][1])
89
+ result = [result[0]] + result[2:]
90
+ else:
91
+ break
92
+
93
+ while len(result) > 2:
94
+ last_label_duration = result[-1][0] - result[-2][0]
95
+ if last_label_duration < 1.0:
96
+ result = result[:-2] + [result[-1]]
97
+ else:
98
+ break
99
+
100
+ while len(result) > 2:
101
+ if result[0][1] == result[1][1] and result[1][0] <= 10.0:
102
+ result = [(result[0][0], result[0][1])] + result[2:]
103
+ else:
104
+ break
105
+
106
+ while len(result) > 2:
107
+ last_duration = result[-1][0] - result[-2][0]
108
+ if result[-2][1] == result[-3][1] and last_duration <= 10.0:
109
+ result = result[:-2] + [result[-1]]
110
+ else:
111
+ break
112
+
113
+ return result
114
+
115
+
116
+ def inference(rank, queue_input: mp.Queue, queue_output: mp.Queue, args):
117
+ """Run inference on the input audio"""
118
+ device = f"cuda:{rank}"
119
+
120
+ # MuQ model loading (this will automatically fetch the checkpoint from huggingface)
121
+ muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
122
+ muq = muq.to(device).eval()
123
+
124
+ # MusicFM model loading
125
+ musicfm = MusicFM25Hz(
126
+ is_flash=False,
127
+ stat_path=os.path.join(MUSICFM_HOME_PATH, "msd_stats.json"),
128
+ model_path=os.path.join(MUSICFM_HOME_PATH, "pretrained_msd.pt"),
129
+ )
130
+ musicfm = musicfm.to(device)
131
+ musicfm.eval()
132
+
133
+ # Custom model loading based on the config
134
+ module = importlib.import_module("models." + str(args.model))
135
+ Model = getattr(module, "Model")
136
+ hp = OmegaConf.load(os.path.join("configs", args.config_path))
137
+ model = Model(hp)
138
+
139
+ ckpt = load_checkpoint(checkpoint_path=os.path.join("ckpts", args.checkpoint))
140
+ if ckpt.get("model_ema", None) is not None:
141
+ logger.info("Loading EMA model parameters")
142
+ model_ema = EMA(model, include_online_model=False)
143
+ model_ema.load_state_dict(ckpt["model_ema"])
144
+ model.load_state_dict(model_ema.ema_model.state_dict())
145
+ else:
146
+ logger.info("No EMA model parameters found, using original model")
147
+ model.load_state_dict(ckpt["model"])
148
+
149
+ model.to(device)
150
+ model.eval()
151
+
152
+ num_classes = args.num_classes
153
+ dataset_id2label_mask = {}
154
+
155
+ for key, allowed_ids in DATASET_ID_ALLOWED_LABEL_IDS.items():
156
+ dataset_id2label_mask[key] = np.ones(args.num_classes, dtype=bool)
157
+ dataset_id2label_mask[key][allowed_ids] = False
158
+
159
+ with torch.no_grad():
160
+ while True:
161
+ item = queue_input.get()
162
+ if not item:
163
+ queue_output.put(None)
164
+ break
165
+
166
+ try:
167
+ # Loading the audio file
168
+ wav, sr = librosa.load(item, sr=INPUT_SAMPLING_RATE)
169
+ audio = torch.tensor(wav).to(device)
170
+
171
+ win_size = args.win_size
172
+ hop_size = args.hop_size
173
+ total_len = (
174
+ (audio.shape[0] // INPUT_SAMPLING_RATE) // TIME_DUR
175
+ ) * TIME_DUR + TIME_DUR
176
+ total_frames = math.ceil(total_len * AFTER_DOWNSAMPLING_FRAME_RATES)
177
+
178
+ logits = {
179
+ "function_logits": np.zeros([total_frames, num_classes]),
180
+ "boundary_logits": np.zeros([total_frames]),
181
+ }
182
+ logits_num = {
183
+ "function_logits": np.zeros([total_frames, num_classes]),
184
+ "boundary_logits": np.zeros([total_frames]),
185
+ }
186
+
187
+ lens = 0
188
+ i = 0
189
+ while True:
190
+ start_idx = i * INPUT_SAMPLING_RATE
191
+ end_idx = min((i + win_size) * INPUT_SAMPLING_RATE, audio.shape[-1])
192
+ if start_idx >= audio.shape[-1]:
193
+ break
194
+ if end_idx - start_idx <= 1024:
195
+ break  # a leftover tail of <= 1024 samples would otherwise loop forever, since i is not advanced here
196
+ audio_seg = audio[start_idx:end_idx]
197
+
198
+ # MuQ embedding
199
+ muq_output = muq(audio_seg.unsqueeze(0), output_hidden_states=True)
200
+ muq_embd_420s = muq_output["hidden_states"][10]
201
+ del muq_output
202
+ torch.cuda.empty_cache()
203
+
204
+ # MusicFM embedding
205
+ _, musicfm_hidden_states = musicfm.get_predictions(
206
+ audio_seg.unsqueeze(0)
207
+ )
208
+ musicfm_embd_420s = musicfm_hidden_states[10]
209
+ del musicfm_hidden_states
210
+ torch.cuda.empty_cache()
211
+
212
+ wraped_muq_embd_30s = []
213
+ wraped_musicfm_embd_30s = []
214
+
215
+ for idx_30s in range(i, i + hop_size, 30):
216
+ start_idx_30s = idx_30s * INPUT_SAMPLING_RATE
217
+ end_idx_30s = min(
218
+ (idx_30s + 30) * INPUT_SAMPLING_RATE,
219
+ audio.shape[-1],
220
+ (i + hop_size) * INPUT_SAMPLING_RATE,
221
+ )
222
+ if start_idx_30s >= audio.shape[-1]:
223
+ break
224
+ if end_idx_30s - start_idx_30s <= 1024:
225
+ continue
226
+ wraped_muq_embd_30s.append(
227
+ muq(
228
+ audio[start_idx_30s:end_idx_30s].unsqueeze(0),
229
+ output_hidden_states=True,
230
+ )["hidden_states"][10]
231
+ )
232
+ torch.cuda.empty_cache()
233
+ wraped_musicfm_embd_30s.append(
234
+ musicfm.get_predictions(
235
+ audio[start_idx_30s:end_idx_30s].unsqueeze(0)
236
+ )[1][10]
237
+ )
238
+ torch.cuda.empty_cache()
239
+
240
+ wraped_muq_embd_30s = torch.concatenate(wraped_muq_embd_30s, dim=1)
241
+ wraped_musicfm_embd_30s = torch.concatenate(
242
+ wraped_musicfm_embd_30s, dim=1
243
+ )
244
+ all_embds = [
245
+ wraped_musicfm_embd_30s,
246
+ wraped_muq_embd_30s,
247
+ musicfm_embd_420s,
248
+ muq_embd_420s,
249
+ ]
250
+
251
+ if len(all_embds) > 1:
252
+ embd_lens = [x.shape[1] for x in all_embds]
253
+ max_embd_len = max(embd_lens)
254
+ min_embd_len = min(embd_lens)
255
+ if abs(max_embd_len - min_embd_len) > 4:
256
+ raise ValueError(
257
+ f"Embedding shapes differ too much: {max_embd_len} vs {min_embd_len}"
258
+ )
259
+
260
+ for idx in range(len(all_embds)):
261
+ all_embds[idx] = all_embds[idx][:, :min_embd_len, :]
262
+
263
+ embd = torch.concatenate(all_embds, axis=-1)
264
+
265
+ dataset_label = DATASET_LABEL
266
+ dataset_ids = torch.Tensor(DATASET_IDS).to(device, dtype=torch.long)
267
+ msa_info, chunk_logits = model.infer(
268
+ input_embeddings=embd,
269
+ dataset_ids=dataset_ids,
270
+ label_id_masks=torch.Tensor(
271
+ dataset_id2label_mask[
272
+ DATASET_LABEL_TO_DATASET_ID[dataset_label]
273
+ ]
274
+ )
275
+ .to(device, dtype=bool)
276
+ .unsqueeze(0)
277
+ .unsqueeze(0),
278
+ with_logits=True,
279
+ )
280
+
281
+ start_frame = int(i * AFTER_DOWNSAMPLING_FRAME_RATES)
282
+ end_frame = start_frame + min(
283
+ math.ceil(hop_size * AFTER_DOWNSAMPLING_FRAME_RATES),
284
+ chunk_logits["boundary_logits"][0].shape[0],
285
+ )
286
+
287
+ logits["function_logits"][start_frame:end_frame, :] += (
288
+ chunk_logits["function_logits"][0].detach().cpu().numpy()
289
+ )
290
+ logits["boundary_logits"][start_frame:end_frame] = (
291
+ chunk_logits["boundary_logits"][0].detach().cpu().numpy()
292
+ )
293
+ logits_num["function_logits"][start_frame:end_frame, :] += 1
294
+ logits_num["boundary_logits"][start_frame:end_frame] += 1
295
+ lens += end_frame - start_frame
296
+
297
+ i += hop_size
298
+ logits["function_logits"] /= logits_num["function_logits"]
299
+ logits["boundary_logits"] /= logits_num["boundary_logits"]
300
+
301
+ logits["function_logits"] = torch.from_numpy(
302
+ logits["function_logits"][:lens]
303
+ ).unsqueeze(0)
304
+ logits["boundary_logits"] = torch.from_numpy(
305
+ logits["boundary_logits"][:lens]
306
+ ).unsqueeze(0)
307
+
308
+ msa_infer_output = postprocess_functional_structure(logits, hp)
309
+
310
+ assert msa_infer_output[-1][-1] == "end"
311
+ if not args.no_rule_post_processing:
312
+ msa_infer_output = rule_post_processing(msa_infer_output)
313
+ msa_json = []
314
+ for idx in range(len(msa_infer_output) - 1):
315
+ msa_json.append(
316
+ {
317
+ "label": msa_infer_output[idx][1],
318
+ "start": msa_infer_output[idx][0],
319
+ "end": msa_infer_output[idx + 1][0],
320
+ }
321
+ )
322
+ json.dump(
323
+ msa_json,
324
+ open(os.path.join(args.output_dir, f"{Path(item).stem}.json"), "w"),
325
+ indent=4,
326
+ ensure_ascii=False,
327
+ )
328
+
329
+ queue_output.put(None)
330
+
331
+ except Exception as e:
332
+ queue_output.put(None)
333
+ logger.error(f"process {rank} error\n{item}\n{e}")
334
+
335
+
336
+ def deal_with_output(output_path, queue_output, length):
337
+ """Handle output data from the queue"""
338
+ pbar = tqdm(range(length), desc="getting inference output")
339
+ for _ in pbar:
340
+ data = queue_output.get()
341
+ if not data:
342
+ continue
343
+
344
+
345
+ def main(args):
346
+ input_path = args.input_path
347
+ output_path = args.output_path
348
+ gpu_num = args.gpu_num
349
+ num_thread_per_gpu = args.num_thread_per_gpu
350
+ debug = args.debug
351
+
352
+ os.makedirs(output_path, exist_ok=True)
353
+
354
+ processed_ids = get_processed_ids(output_path=output_path)
355
+ processing_ids = get_processing_ids(input_path, processed_ids)
356
+
357
+ num_threads = num_thread_per_gpu * gpu_num
358
+
359
+ queue_input: mp.Queue = mp.Queue()
360
+ queue_output: mp.Queue = mp.Queue()
361
+
362
+ init_args = Namespace(
363
+ output_dir=output_path,
364
+ win_size=420,
365
+ hop_size=420,
366
+ num_classes=128,
367
+ model=args.model,
368
+ checkpoint=args.checkpoint,
369
+ config_path=args.config_path,
370
+ no_rule_post_processing=args.no_rule_post_processing,
371
+ )
372
+
373
+ processes = []
374
+
375
+ if debug:
376
+ queue_input.put(processing_ids[0])
377
+ queue_input.put(None)
378
+
379
+ inference(0, queue_input, queue_output, init_args)
380
+
381
+ print("debug exit")
382
+ exit(0)
383
+
384
+ for thread_num in range(num_threads):
385
+ rank = thread_num % gpu_num
386
+ print(f"num_threads: {thread_num} on GPU {rank}")
387
+ time.sleep(0.2)
388
+ p = mp.Process(
389
+ target=inference,
390
+ args=(rank, queue_input, queue_output, init_args),
391
+ daemon=True,
392
+ )
393
+ p.start()
394
+ processes.append(p)
395
+
396
+ for wav_id in tqdm(processing_ids, desc="add data to queue"):
397
+ queue_input.put(wav_id)
398
+
399
+ for _ in range(num_threads):
400
+ queue_input.put(None)
401
+
402
+ deal_with_output(output_path, queue_output, len(processing_ids))
403
+
404
+ for p in processes:
405
+ p.join()
406
+
407
+
408
+ if __name__ == "__main__":
409
+ parser = argparse.ArgumentParser()
410
+
411
+ parser.add_argument(
412
+ "--input_path", "-i", type=str, required=True, help="Input file path"
413
+ )
414
+ parser.add_argument(
415
+ "--output_path", "-o", type=str, required=True, help="Output file path"
416
+ )
417
+ parser.add_argument(
418
+ "--gpu_num", "-gn", type=int, default=1, help="Number of GPUs, default is 1"
419
+ )
420
+ parser.add_argument(
421
+ "--num_thread_per_gpu",
422
+ "-tn",
423
+ type=int,
424
+ default=1,
425
+ help="Number of threads per GPU, default is 1",
426
+ )
427
+ parser.add_argument("--model", type=str, help="Model to use")
428
+ parser.add_argument("--checkpoint", type=str, help="Checkpoint path")
429
+ parser.add_argument("--config_path", type=str, help="Configuration file path")
430
+ parser.add_argument(
431
+ "--no_rule_post_processing",
432
+ action="store_true",
433
+ help="Disable rule-based post-processing",
434
+ )
435
+ parser.add_argument("--debug", action="store_true", help="Enable debug mode")
436
+
437
+ args = parser.parse_args()
438
+
439
+ main(args=args)
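
The inference loop above merges per-chunk outputs by summing logits into song-length buffers together with per-frame counts, then averaging. A minimal numpy sketch of that accumulation scheme follows; the frame rate is taken from the script, while the chunk boundaries, class count, and random logits are toy assumptions (the script itself uses non-overlapping `win_size = hop_size = 420` s windows).

```python
import numpy as np

frame_rate = 8.333                                   # AFTER_DOWNSAMPLING_FRAME_RATES
num_classes = 8                                      # toy label space
total_frames = int(round(60 * frame_rate))           # a 60 s toy "song"

acc = np.zeros([total_frames, num_classes])
cnt = np.zeros([total_frames, num_classes])

for start_s, end_s in [(0, 40), (20, 60)]:           # two overlapping toy chunks
    s = int(start_s * frame_rate)
    e = int(end_s * frame_rate)
    chunk_logits = np.random.randn(e - s, num_classes)
    acc[s:e] += chunk_logits
    cnt[s:e] += 1

function_logits = acc / np.maximum(cnt, 1)            # overlapping regions are averaged
print(function_logits.shape)
```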
src/SongFormer/models/SongFormer.py ADDED
@@ -0,0 +1,521 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import numpy as np
4
+ import torch.nn.functional as F
5
+ from dataset.custom_types import MsaInfo
6
+ from msaf.eval import compute_results
7
+ from postprocessing.functional import postprocess_functional_structure
8
+ from x_transformers import Encoder
9
+ import bisect
10
+
11
+
12
+ class Head(nn.Module):
13
+ def __init__(self, input_dim, output_dim, hidden_dims=None, activation="silu"):
14
+ super().__init__()
15
+ hidden_dims = hidden_dims or []
16
+ act_layers = {"relu": nn.ReLU, "silu": nn.SiLU, "gelu": nn.GELU}
17
+ act_layer = act_layers.get(activation.lower())
18
+ if not act_layer:
19
+ raise ValueError(f"Unsupported activation: {activation}")
20
+
21
+ dims = [input_dim] + hidden_dims + [output_dim]
22
+ layers = []
23
+ for i in range(len(dims) - 1):
24
+ layers.append(nn.Linear(dims[i], dims[i + 1]))
25
+ if i < len(dims) - 2:
26
+ layers.append(act_layer())
27
+ self.net = nn.Sequential(*layers)
28
+
29
+ def reset_parameters(self, confidence):
30
+ bias_value = -torch.log(torch.tensor((1 - confidence) / confidence))
31
+ self.net[-1].bias.data.fill_(bias_value.item())
32
+
33
+ def forward(self, x):
34
+ batch, T, C = x.shape
35
+ x = x.reshape(-1, C)
36
+ x = self.net(x)
37
+ return x.reshape(batch, T, -1)
38
+
39
+
40
+ class WrapedTransformerEncoder(nn.Module):
41
+ def __init__(
42
+ self, input_dim, transformer_input_dim, num_layers=1, nhead=8, dropout=0.1
43
+ ):
44
+ super().__init__()
45
+ self.input_dim = input_dim
46
+ self.transformer_input_dim = transformer_input_dim
47
+
48
+ if input_dim != transformer_input_dim:
49
+ self.input_proj = nn.Sequential(
50
+ nn.Linear(input_dim, transformer_input_dim),
51
+ nn.LayerNorm(transformer_input_dim),
52
+ nn.GELU(),
53
+ nn.Dropout(dropout * 0.5),
54
+ nn.Linear(transformer_input_dim, transformer_input_dim),
55
+ )
56
+ else:
57
+ self.input_proj = nn.Identity()
58
+
59
+ self.transformer = Encoder(
60
+ dim=transformer_input_dim,
61
+ depth=num_layers,
62
+ heads=nhead,
63
+ layer_dropout=dropout,
64
+ attn_dropout=dropout,
65
+ ff_dropout=dropout,
66
+ attn_flash=True,
67
+ rotary_pos_emb=True,
68
+ )
69
+
70
+ def forward(self, x, src_key_padding_mask=None):
71
+ """
72
+ The input src_key_padding_mask is a B x T boolean mask, where True indicates masked positions.
73
+ However, in x-transformers, False indicates masked positions.
74
+ Therefore, it needs to be converted so that False represents masked positions.
75
+ """
76
+ x = self.input_proj(x)
77
+ mask = (
78
+ ~torch.tensor(src_key_padding_mask, dtype=torch.bool, device=x.device)
79
+ if src_key_padding_mask is not None
80
+ else None
81
+ )
82
+ return self.transformer(x, mask=mask)
83
+
84
+
85
+ def prefix_dict(d, prefix: str):
86
+ if not prefix:
87
+ return d
88
+ return {prefix + key: value for key, value in d.items()}
89
+
90
+
91
+ class TimeDownsample(nn.Module):
92
+ def __init__(
93
+ self, dim_in, dim_out=None, kernel_size=5, stride=5, padding=0, dropout=0.1
94
+ ):
95
+ super().__init__()
96
+ self.dim_out = dim_out or dim_in
97
+ assert self.dim_out % 2 == 0
98
+
99
+ self.depthwise_conv = nn.Conv1d(
100
+ in_channels=dim_in,
101
+ out_channels=dim_in,
102
+ kernel_size=kernel_size,
103
+ stride=stride,
104
+ padding=padding,
105
+ groups=dim_in,
106
+ bias=False,
107
+ )
108
+ self.pointwise_conv = nn.Conv1d(
109
+ in_channels=dim_in,
110
+ out_channels=self.dim_out,
111
+ kernel_size=1,
112
+ bias=False,
113
+ )
114
+ self.pool = nn.AvgPool1d(kernel_size, stride, padding=padding)
115
+ self.norm1 = nn.LayerNorm(self.dim_out)
116
+ self.act1 = nn.GELU()
117
+ self.dropout1 = nn.Dropout(dropout)
118
+
119
+ if dim_in != self.dim_out:
120
+ self.residual_conv = nn.Conv1d(
121
+ dim_in, self.dim_out, kernel_size=1, bias=False
122
+ )
123
+ else:
124
+ self.residual_conv = None
125
+
126
+ def forward(self, x):
127
+ residual = x # [B, T, D_in]
128
+ # Convolutional module
129
+ x_c = x.transpose(1, 2) # [B, D_in, T]
130
+ x_c = self.depthwise_conv(x_c) # [B, D_in, T_down]
131
+ x_c = self.pointwise_conv(x_c) # [B, D_out, T_down]
132
+
133
+ # Residual module
134
+ res = self.pool(residual.transpose(1, 2)) # [B, D_in, T_down]
135
+ if self.residual_conv:
136
+ res = self.residual_conv(res) # [B, D_out, T_down]
137
+ x_c = x_c + res # [B, D_out, T_down]
138
+ x_c = x_c.transpose(1, 2) # [B, T_down, D_out]
139
+ x_c = self.norm1(x_c)
140
+ x_c = self.act1(x_c)
141
+ x_c = self.dropout1(x_c)
142
+ return x_c
143
+
144
+
145
+ class AddFuse(nn.Module):
146
+ def __init__(self):
147
+ super(AddFuse, self).__init__()
148
+
149
+ def forward(self, x, cond):
150
+ return x + cond
151
+
152
+
153
+ class TVLoss1D(nn.Module):
154
+ def __init__(
155
+ self, beta=1.0, lambda_tv=0.4, boundary_threshold=0.01, reduction_weight=0.1
156
+ ):
157
+ """
158
+ Args:
159
+ beta: Exponential parameter for TV loss (recommended 0.5~1.0)
160
+ lambda_tv: Overall weight for TV loss
161
+ boundary_threshold: Label threshold to determine if a region is a "boundary area" (e.g., 0.01)
162
+ reduction_weight: Scaling factor for TV penalty within boundary regions (e.g., 0.1, meaning only 10% penalty)
163
+ """
164
+ super().__init__()
165
+ self.beta = beta
166
+ self.lambda_tv = lambda_tv
167
+ self.boundary_threshold = boundary_threshold
168
+ self.reduction_weight = reduction_weight
169
+
170
+ def forward(self, pred, target=None):
171
+ """
172
+ Args:
173
+ pred: (B, T) or (B, T, 1), float boundary scores output by the model
174
+ target: (B, T) or (B, T, 1), ground truth labels (optional, used for spatial weighting if provided)
175
+
176
+ Returns:
177
+ scalar: weighted TV loss
178
+ """
179
+ if pred.dim() == 3:
180
+ pred = pred.squeeze(-1)
181
+ if target is not None and target.dim() == 3:
182
+ target = target.squeeze(-1)
183
+
184
+ diff = pred[:, 1:] - pred[:, :-1]
185
+ tv_base = torch.pow(torch.abs(diff) + 1e-8, self.beta)
186
+
187
+ if target is None:
188
+ return self.lambda_tv * tv_base.mean()
189
+
190
+ left_in_boundary = target[:, :-1] > self.boundary_threshold
191
+ right_in_boundary = target[:, 1:] > self.boundary_threshold
192
+ near_boundary = left_in_boundary | right_in_boundary
193
+ weight_mask = torch.where(
194
+ near_boundary,
195
+ self.reduction_weight * torch.ones_like(tv_base),
196
+ torch.ones_like(tv_base),
197
+ )
198
+ tv_weighted = (tv_base * weight_mask).mean()
199
+ return self.lambda_tv * tv_weighted
200
+
201
+
202
+ class SoftmaxFocalLoss(nn.Module):
203
+ """
204
+ Softmax Focal Loss for single-label multi-class classification.
205
+ Suitable for mutually exclusive classes.
206
+ """
207
+
208
+ def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
209
+ super().__init__()
210
+ self.alpha = alpha
211
+ self.gamma = gamma
212
+
213
+ def forward(self, pred: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
214
+ """
215
+ Args:
216
+ pred: [B, T, C], raw logits
217
+ targets: [B, T, C] (soft) or [B, T] (hard, dtype=long)
218
+ Returns:
219
+ loss: per-frame loss of shape [B, T] (no reduction is applied here)
220
+ """
221
+ log_probs = F.log_softmax(pred, dim=-1)
222
+ probs = torch.exp(log_probs)
223
+
224
+ if targets.dtype == torch.long:
225
+ targets_onehot = F.one_hot(targets, num_classes=pred.size(-1)).float()
226
+ else:
227
+ targets_onehot = targets
228
+
229
+ p_t = (probs * targets_onehot).sum(dim=-1)
230
+ p_t = p_t.clamp(min=1e-8, max=1.0 - 1e-8)
231
+
232
+ if self.alpha > 0:
233
+ alpha_t = self.alpha * targets_onehot + (1 - self.alpha) * (
234
+ 1 - targets_onehot
235
+ )
236
+ alpha_weight = (alpha_t * targets_onehot).sum(dim=-1)
237
+ else:
238
+ alpha_weight = 1.0
239
+
240
+ focal_weight = (1 - p_t) ** self.gamma
241
+ ce_loss = -log_probs * targets_onehot
242
+ ce_loss = ce_loss.sum(dim=-1)
243
+
244
+ loss = alpha_weight * focal_weight * ce_loss
245
+ return loss
246
+
247
+
248
+ class Model(nn.Module):
249
+ def __init__(self, config):
250
+ super().__init__()
251
+ self.config = config
252
+
253
+ self.input_norm = nn.LayerNorm(config.input_dim)
254
+ self.mixed_win_downsample = nn.Linear(config.input_dim_raw, config.input_dim)
255
+ self.dataset_class_prefix = nn.Embedding(
256
+ num_embeddings=config.num_dataset_classes,
257
+ embedding_dim=config.transformer_encoder_input_dim,
258
+ )
259
+ self.down_sample_conv = TimeDownsample(
260
+ dim_in=config.input_dim,
261
+ dim_out=config.transformer_encoder_input_dim,
262
+ kernel_size=config.down_sample_conv_kernel_size,
263
+ stride=config.down_sample_conv_stride,
264
+ dropout=config.down_sample_conv_dropout,
265
+ padding=config.down_sample_conv_padding,
266
+ )
267
+ self.AddFuse = AddFuse()
268
+ self.transformer = WrapedTransformerEncoder(
269
+ input_dim=config.transformer_encoder_input_dim,
270
+ transformer_input_dim=config.transformer_input_dim,
271
+ num_layers=config.num_transformer_layers,
272
+ nhead=config.transformer_nhead,
273
+ dropout=config.transformer_dropout,
274
+ )
275
+ self.boundary_TVLoss1D = TVLoss1D(
276
+ beta=config.boundary_tv_loss_beta,
277
+ lambda_tv=config.boundary_tv_loss_lambda,
278
+ boundary_threshold=config.boundary_tv_loss_boundary_threshold,
279
+ reduction_weight=config.boundary_tv_loss_reduction_weight,
280
+ )
281
+ self.label_focal_loss = SoftmaxFocalLoss(
282
+ alpha=config.label_focal_loss_alpha, gamma=config.label_focal_loss_gamma
283
+ )
284
+ self.boundary_head = Head(config.transformer_input_dim, 1)
285
+ self.function_head = Head(config.transformer_input_dim, config.num_classes)
286
+
287
+ def cal_metrics(self, gt_info: MsaInfo, msa_info: MsaInfo):
288
+ assert gt_info[-1][1] == "end" and msa_info[-1][1] == "end", (
289
+ "gt_info and msa_info should end with 'end'"
290
+ )
291
+ gt_info_labels = [label for time_, label in gt_info][:-1]
292
+ gt_info_inters = [time_ for time_, label in gt_info]
293
+ gt_info_inters = np.column_stack(
294
+ [np.array(gt_info_inters[:-1]), np.array(gt_info_inters[1:])]
295
+ )
296
+
297
+ msa_info_labels = [label for time_, label in msa_info][:-1]
298
+ msa_info_inters = [time_ for time_, label in msa_info]
299
+ msa_info_inters = np.column_stack(
300
+ [np.array(msa_info_inters[:-1]), np.array(msa_info_inters[1:])]
301
+ )
302
+ result = compute_results(
303
+ ann_inter=gt_info_inters,
304
+ est_inter=msa_info_inters,
305
+ ann_labels=gt_info_labels,
306
+ est_labels=msa_info_labels,
307
+ bins=11,
308
+ est_file="test.txt",
309
+ weight=0.58,
310
+ )
311
+ return result
312
+
313
+ def cal_acc(
314
+ self, ann_info: MsaInfo | str, est_info: MsaInfo | str, post_digit: int = 3
315
+ ):
316
+ ann_info_time = [
317
+ int(round(time_, post_digit) * (10**post_digit))
318
+ for time_, label in ann_info
319
+ ]
320
+ est_info_time = [
321
+ int(round(time_, post_digit) * (10**post_digit))
322
+ for time_, label in est_info
323
+ ]
324
+
325
+ common_start_time = max(ann_info_time[0], est_info_time[0])
326
+ common_end_time = min(ann_info_time[-1], est_info_time[-1])
327
+
328
+ time_points = {common_start_time, common_end_time}
329
+ time_points.update(
330
+ {
331
+ time_
332
+ for time_ in ann_info_time
333
+ if common_start_time <= time_ <= common_end_time
334
+ }
335
+ )
336
+ time_points.update(
337
+ {
338
+ time_
339
+ for time_ in est_info_time
340
+ if common_start_time <= time_ <= common_end_time
341
+ }
342
+ )
343
+
344
+ time_points = sorted(time_points)
345
+ total_duration, total_score = 0, 0
346
+
347
+ for idx in range(len(time_points) - 1):
348
+ duration = time_points[idx + 1] - time_points[idx]
349
+ ann_label = ann_info[
350
+ bisect.bisect_right(ann_info_time, time_points[idx]) - 1
351
+ ][1]
352
+ est_label = est_info[
353
+ bisect.bisect_right(est_info_time, time_points[idx]) - 1
354
+ ][1]
355
+ total_duration += duration
356
+ if ann_label == est_label:
357
+ total_score += duration
358
+ return total_score / total_duration
359
+
360
+ def infer_with_metrics(self, batch, prefix: str = None):
361
+ with torch.no_grad():
362
+ logits = self.forward_func(batch)
363
+
364
+ losses = self.compute_losses(logits, batch, prefix=None)
365
+
366
+ expanded_mask = batch["label_id_masks"].expand(
367
+ -1, logits["function_logits"].size(1), -1
368
+ )
369
+ logits["function_logits"] = logits["function_logits"].masked_fill(
370
+ expanded_mask, -float("inf")
371
+ )
372
+
373
+ msa_info = postprocess_functional_structure(
374
+ logits=logits, config=self.config
375
+ )
376
+ gt_info = batch["msa_infos"][0]
377
+ results = self.cal_metrics(gt_info=gt_info, msa_info=msa_info)
378
+
379
+ ret_results = {
380
+ "loss": losses["loss"].item(),
381
+ "HitRate_3P": results["HitRate_3P"],
382
+ "HitRate_3R": results["HitRate_3R"],
383
+ "HitRate_3F": results["HitRate_3F"],
384
+ "HitRate_0.5P": results["HitRate_0.5P"],
385
+ "HitRate_0.5R": results["HitRate_0.5R"],
386
+ "HitRate_0.5F": results["HitRate_0.5F"],
387
+ "PWF": results["PWF"],
388
+ "PWP": results["PWP"],
389
+ "PWR": results["PWR"],
390
+ "Sf": results["Sf"],
391
+ "So": results["So"],
392
+ "Su": results["Su"],
393
+ "acc": self.cal_acc(ann_info=gt_info, est_info=msa_info),
394
+ }
395
+ if prefix:
396
+ ret_results = prefix_dict(ret_results, prefix)
397
+
398
+ return ret_results
399
+
400
+ def infer(
401
+ self,
402
+ input_embeddings,
403
+ dataset_ids,
404
+ label_id_masks,
405
+ prefix: str = None,
406
+ with_logits=False,
407
+ ):
408
+ with torch.no_grad():
409
+ input_embeddings = self.mixed_win_downsample(input_embeddings)
410
+ input_embeddings = self.input_norm(input_embeddings)
411
+ logits = self.down_sample_conv(input_embeddings)
412
+
413
+ dataset_prefix = self.dataset_class_prefix(dataset_ids)
414
+ dataset_prefix_expand = dataset_prefix.unsqueeze(1).expand(
415
+ logits.size(0), 1, -1
416
+ )
417
+ logits = self.AddFuse(x=logits, cond=dataset_prefix_expand)
418
+ logits = self.transformer(x=logits, src_key_padding_mask=None)
419
+
420
+ function_logits = self.function_head(logits)
421
+ boundary_logits = self.boundary_head(logits).squeeze(-1)
422
+
423
+ logits = {
424
+ "function_logits": function_logits,
425
+ "boundary_logits": boundary_logits,
426
+ }
427
+
428
+ expanded_mask = label_id_masks.expand(
429
+ -1, logits["function_logits"].size(1), -1
430
+ )
431
+ logits["function_logits"] = logits["function_logits"].masked_fill(
432
+ expanded_mask, -float("inf")
433
+ )
434
+
435
+ msa_info = postprocess_functional_structure(
436
+ logits=logits, config=self.config
437
+ )
438
+
439
+ return (msa_info, logits) if with_logits else msa_info
440
+
441
+ def compute_losses(self, outputs, batch, prefix: str = None):
442
+ loss = 0.0
443
+ losses = {}
444
+
445
+ loss_section = F.binary_cross_entropy_with_logits(
446
+ outputs["boundary_logits"],
447
+ batch["widen_true_boundaries"],
448
+ reduction="none",
449
+ )
450
+ loss_section += self.config.boundary_tvloss_weight * self.boundary_TVLoss1D(
451
+ pred=outputs["boundary_logits"],
452
+ target=batch["widen_true_boundaries"],
453
+ )
454
+ loss_function = F.cross_entropy(
455
+ outputs["function_logits"].transpose(1, 2),
456
+ batch["true_functions"].transpose(1, 2),
457
+ reduction="none",
458
+ )
459
+ # input is [B, T, C]
460
+ ttt = self.config.label_focal_loss_weight * self.label_focal_loss(
461
+ pred=outputs["function_logits"], targets=batch["true_functions"]
462
+ )
463
+ loss_function += ttt
464
+
465
+ float_masks = (~batch["masks"]).float()
466
+ boundary_mask = batch.get("boundary_mask", None)
467
+ function_mask = batch.get("function_mask", None)
468
+ if boundary_mask is not None:
469
+ boundary_mask = (~boundary_mask).float()
470
+ else:
471
+ boundary_mask = 1
472
+
473
+ if function_mask is not None:
474
+ function_mask = (~function_mask).float()
475
+ else:
476
+ function_mask = 1
477
+
478
+ loss_section = torch.mean(boundary_mask * float_masks * loss_section)
479
+ loss_function = torch.mean(function_mask * float_masks * loss_function)
480
+
481
+ loss_section *= self.config.loss_weight_section
482
+ loss_function *= self.config.loss_weight_function
483
+
484
+ if self.config.learn_label:
485
+ loss += loss_function
486
+ if self.config.learn_segment:
487
+ loss += loss_section
488
+
489
+ losses.update(
490
+ loss=loss,
491
+ loss_section=loss_section,
492
+ loss_function=loss_function,
493
+ )
494
+ if prefix:
495
+ losses = prefix_dict(losses, prefix)
496
+ return losses
497
+
498
+ def forward_func(self, batch):
499
+ input_embeddings = batch["input_embeddings"]
500
+ input_embeddings = self.mixed_win_downsample(input_embeddings)
501
+ input_embeddings = self.input_norm(input_embeddings)
502
+ logits = self.down_sample_conv(input_embeddings)
503
+
504
+ dataset_prefix = self.dataset_class_prefix(batch["dataset_ids"])
505
+ logits = self.AddFuse(x=logits, cond=dataset_prefix.unsqueeze(1))
506
+ src_key_padding_mask = batch["masks"]
507
+ logits = self.transformer(x=logits, src_key_padding_mask=src_key_padding_mask)
508
+
509
+ function_logits = self.function_head(logits)
510
+ boundary_logits = self.boundary_head(logits).squeeze(-1)
511
+
512
+ logits = {
513
+ "function_logits": function_logits,
514
+ "boundary_logits": boundary_logits,
515
+ }
516
+ return logits
517
+
518
+ def forward(self, batch):
519
+ logits = self.forward_func(batch)
520
+ losses = self.compute_losses(logits, batch, prefix=None)
521
+ return logits, losses["loss"], losses
src/SongFormer/postprocessing/calc_acc.py ADDED
@@ -0,0 +1,82 @@
1
+ import os
2
+ import bisect
3
+ from dataset.msa_info_utils import (
4
+ load_msa_info,
5
+ )
6
+ from dataset.custom_types import MsaInfo
7
+ import glob
8
+ import pdb
9
+ import pandas as pd
10
+
11
+
12
+ def cal_acc(ann_info: MsaInfo | str, est_info: MsaInfo | str, post_digit: int = 3):
13
+ if type(ann_info) is str:
14
+ assert os.path.exists(ann_info), f"{ann_info} not exists"
15
+ ann_info = load_msa_info(ann_info)
16
+
17
+ if type(est_info) is str:
18
+ assert os.path.exists(est_info), f"{est_info} not exists"
19
+ est_info = load_msa_info(est_info)
20
+
21
+ ann_info_time = [
22
+ int(round(time_, post_digit) * (10**post_digit)) for time_, label in ann_info
23
+ ]
24
+ est_info_time = [
25
+ int(round(time_, post_digit) * (10**post_digit)) for time_, label in est_info
26
+ ]
27
+
28
+ common_start_time = max(ann_info_time[0], est_info_time[0])
29
+ common_end_time = min(ann_info_time[-1], est_info_time[-1])
30
+
31
+ time_points = set()
32
+ time_points.add(common_start_time)
33
+ time_points.add(common_end_time)
34
+
35
+ for time_ in ann_info_time:
36
+ if time_ >= common_start_time and time_ <= common_end_time:
37
+ time_points.add(time_)
38
+ for time_ in est_info_time:
39
+ if time_ >= common_start_time and time_ <= common_end_time:
40
+ time_points.add(time_)
41
+
42
+ time_points = sorted(list(time_points))
43
+ total_duration = 0
44
+ total_score = 0
45
+
46
+ for idx in range(len(time_points) - 1):
47
+ duration = time_points[idx + 1] - time_points[idx]
48
+ ann_label = ann_info[bisect.bisect_right(ann_info_time, time_points[idx]) - 1][
49
+ 1
50
+ ]
51
+ est_label = est_info[bisect.bisect_right(est_info_time, time_points[idx]) - 1][
52
+ 1
53
+ ]
54
+ total_duration += duration
55
+ if ann_label == est_label:
56
+ total_score += duration
57
+ return total_score / total_duration
58
+
59
+
60
+ if __name__ == "__main__":
61
+ ext_paths = glob.glob("")
62
+ results = []
63
+ for ext_path in ext_paths:
64
+ try:
65
+ ann_path = os.path.join(
66
+ "",
67
+ os.path.basename(ext_path).split(".")[0] + ".txt",
68
+ )
69
+ results.append(
70
+ {
71
+ "data_id": os.path.basename(ext_path).split(".")[0],
72
+ "acc": cal_acc(
73
+ ann_info=ann_path,
74
+ est_info=ext_path,
75
+ ),
76
+ }
77
+ )
78
+ except Exception as e:
79
+ print(e)
80
+ continue
81
+ df = pd.DataFrame(results)
82
+ print(df["acc"].mean())
src/SongFormer/postprocessing/calc_iou.py ADDED
@@ -0,0 +1,89 @@
1
+ import os
2
+ from dataset.custom_types import MsaInfo
3
+ from dataset.label2id import LABEL_TO_ID
4
+ from pprint import pprint
5
+
6
+
7
+ def load_msa_info(msa_info_path):
8
+ msa_info: MsaInfo = []
9
+ with open(msa_info_path) as f:
10
+ for line in f:
11
+ line = line.strip()
12
+ if not line:
13
+ continue
14
+ time_, label = line.split()
15
+ time_ = float(time_)
16
+ label = str(label)
17
+ assert label in LABEL_TO_ID or label == "end", f"{label} not in LABEL_TO_ID"
18
+ msa_info.append((time_, label))
19
+ assert msa_info[-1][1] == "end", f"last {msa_info[-1][1]} != end"
20
+ return msa_info
21
+
22
+
23
+ def msa_info_to_segments(msa_info):
24
+ # skip the last "end"
25
+ segments = []
26
+ for i in range(len(msa_info) - 1):
27
+ start = msa_info[i][0]
28
+ end = msa_info[i + 1][0]
29
+ label = msa_info[i][1]
30
+ segments.append((start, end, label))
31
+ return segments
32
+
33
+
34
+ def compute_iou_for_label(segments_a, segments_b, label):
35
+ # segments_a, segments_b: [(start, end, label)]
36
+ # only process the current label
37
+ intervals_a = [(s, e) for s, e, l in segments_a if l == label]
38
+ intervals_b = [(s, e) for s, e, l in segments_b if l == label]
39
+ # sum up all intersections between a and b
40
+ intersection = 0.0
41
+ for sa, ea in intervals_a:
42
+ for sb, eb in intervals_b:
43
+ left = max(sa, sb)
44
+ right = min(ea, eb)
45
+ if left < right:
46
+ intersection += right - left
47
+ # union = total length of both sets - overlapping intersection
48
+ length_a = sum([e - s for s, e in intervals_a])
49
+ length_b = sum([e - s for s, e in intervals_b])
50
+ union = length_a + length_b - intersection
51
+ if union == 0:
52
+ return 0.0, 0.0, 0.0
53
+ return intersection / union, intersection, union
54
+
55
+
56
+ def compute_mean_iou(segments_a, segments_b, labels):
57
+ ious = []
58
+ for label in labels:
59
+ iou, intsec_dur, uni_dur = compute_iou_for_label(segments_a, segments_b, label)
60
+ ious.append(
61
+ {"label": label, "iou": iou, "intsec_dur": intsec_dur, "uni_dur": uni_dur}
62
+ )
63
+ return ious
64
+
65
+
66
+ def cal_iou(ann_info, est_info):
67
+ if type(ann_info) is str:
68
+ assert os.path.exists(ann_info), f"{ann_info} not exists"
69
+ ann_info = load_msa_info(ann_info)
70
+
71
+ if type(est_info) is str:
72
+ assert os.path.exists(est_info), f"{est_info} not exists"
73
+ est_info = load_msa_info(est_info)
74
+
75
+ segments_ann = msa_info_to_segments(ann_info)
76
+ segments_est = msa_info_to_segments(est_info)
77
+
78
+ occurred_labels = list(
79
+ set([l for s, e, l in segments_ann]) | set(l for s, e, l in segments_est)
80
+ )
81
+
82
+ mean_iou = compute_mean_iou(segments_ann, segments_est, occurred_labels)
83
+ return mean_iou
84
+
85
+
86
+ if __name__ == "__main__":
87
+ ann_info = ""
88
+ est_info = ""
89
+ pprint(cal_iou(ann_info, est_info))
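Like `cal_acc`, `cal_iou` accepts either file paths or in-memory `MsaInfo` sequences and reports, per occurring label, the temporal IoU together with the intersection and union durations. A small illustrative sketch (times and labels are made up):

```python
ann = [(0.0, "intro"), (10.0, "chorus"), (30.0, "end")]
est = [(0.0, "intro"), (12.0, "chorus"), (30.0, "end")]

# e.g. "intro": IoU = 10 / 12 ≈ 0.83, "chorus": IoU = 18 / 20 = 0.90
pprint(cal_iou(ann, est))
```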
src/SongFormer/postprocessing/functional.py ADDED
@@ -0,0 +1,71 @@
1
+ # This file contains code adapted from the following sources:
2
+ # [MIT license] https://github.com/mir-aidj/all-in-one/blob/main/src/allin1/postprocessing/functional.py
3
+
4
+ import numpy as np
5
+ import torch
6
+ from .helpers import (
7
+ local_maxima,
8
+ peak_picking,
9
+ # event_frames_to_time,
10
+ )
11
+ from dataset.label2id import LABEL_TO_ID, ID_TO_LABEL
12
+ from dataset.custom_types import MsaInfo
13
+
14
+
15
+ def event_frames_to_time(frame_rates, boundary: np.array):
16
+ boundary = np.array(boundary)
17
+ boundary_times = boundary / frame_rates
18
+ return boundary_times
19
+
20
+
21
+ def postprocess_functional_structure(
22
+ logits,
23
+ config,
24
+ ):
25
+ # pdb.set_trace()
26
+ boundary_logits = logits["boundary_logits"]
27
+ function_logits = logits["function_logits"]
28
+
29
+ assert boundary_logits.shape[0] == 1 and function_logits.shape[0] == 1, (
30
+ "Only batch size 1 is supported"
31
+ )
32
+ raw_prob_sections = torch.sigmoid(boundary_logits[0])
33
+ raw_prob_functions = torch.softmax(function_logits[0].transpose(0, 1), dim=0)
34
+
35
+ # filter_size=4 * cfg.min_hops_per_beat + 1
36
+ prob_sections, _ = local_maxima(
37
+ raw_prob_sections, filter_size=config.local_maxima_filter_size
38
+ )
39
+ prob_sections = prob_sections.cpu().numpy()
40
+
41
+ prob_functions = raw_prob_functions.cpu().numpy()
42
+
43
+ boundary_candidates = peak_picking(
44
+ boundary_activation=prob_sections,
45
+ window_past=int(12 * config.frame_rates),  # this was originally fps
46
+ window_future=int(12 * config.frame_rates),
47
+ )
48
+ boundary = boundary_candidates > 0.0
49
+
50
+ duration = len(prob_sections) / config.frame_rates
51
+ pred_boundary_times = event_frames_to_time(
52
+ frame_rates=config.frame_rates, boundary=np.flatnonzero(boundary)
53
+ )
54
+ if pred_boundary_times[0] != 0:
55
+ pred_boundary_times = np.insert(pred_boundary_times, 0, 0)
56
+ if pred_boundary_times[-1] != duration:
57
+ pred_boundary_times = np.append(pred_boundary_times, duration)
58
+ pred_boundaries = np.stack([pred_boundary_times[:-1], pred_boundary_times[1:]]).T
59
+
60
+ pred_boundary_indices = np.flatnonzero(boundary)
61
+ pred_boundary_indices = pred_boundary_indices[pred_boundary_indices > 0]
62
+ prob_segment_function = np.split(prob_functions, pred_boundary_indices, axis=1)
63
+ pred_labels = [p.mean(axis=1).argmax().item() for p in prob_segment_function]
64
+
65
+ segments: MsaInfo = []
66
+ for (start, end), label in zip(pred_boundaries, pred_labels):
67
+ segment = (float(start), str(ID_TO_LABEL[label]))
68
+ segments.append(segment)
69
+
70
+ segments.append((float(pred_boundary_times[-1]), "end"))
71
+ return segments
src/SongFormer/postprocessing/helpers.py ADDED
@@ -0,0 +1,101 @@
1
+ # This file contains code adapted from the following sources:
2
+ # [MIT license] https://github.com/mir-aidj/all-in-one/blob/main/src/allin1/postprocessing/helpers.py
3
+
4
+ import numpy as np
5
+ import torch.nn.functional as F
6
+ import torch
7
+ import librosa
8
+ from typing import Union
9
+ from scipy.signal import argrelextrema
10
+ from scipy.interpolate import interp1d
11
+ from numpy.lib.stride_tricks import sliding_window_view
12
+ from numpy.typing import NDArray
13
+
14
+
15
+ def local_maxima(tensor, filter_size=41):
16
+ assert len(tensor.shape) in (1, 2), "Input tensor should have 1 or 2 dimensions"
17
+ assert filter_size % 2 == 1, "Filter size should be an odd number"
18
+
19
+ original_shape = tensor.shape
20
+ if len(original_shape) == 1:
21
+ tensor = tensor.unsqueeze(0)
22
+
23
+ # Pad the input array with the minimum value
24
+ padding = filter_size // 2
25
+ padded_arr = F.pad(tensor, (padding, padding), mode="constant", value=-torch.inf)
26
+
27
+ # Create a rolling window view of the padded array
28
+ rolling_view = padded_arr.unfold(1, filter_size, 1)
29
+
30
+ # Find the indices of the local maxima
31
+ center = filter_size // 2
32
+ local_maxima_mask = torch.eq(
33
+ rolling_view[:, :, center], torch.max(rolling_view, dim=-1).values
34
+ )
35
+ local_maxima_indices = local_maxima_mask.nonzero()
36
+
37
+ # Initialize a new PyTorch tensor with zeros and the same shape as the input tensor
38
+ output_arr = torch.zeros_like(tensor)
39
+
40
+ # Set the local maxima values in the output tensor
41
+ output_arr[local_maxima_mask] = tensor[local_maxima_mask]
42
+
43
+ output_arr = output_arr.reshape(original_shape)
44
+
45
+ return output_arr, local_maxima_indices
46
+
47
+
48
+ def local_maxima_numpy(arr, order=20):
49
+ is_batch = len(arr.shape) == 2
50
+ if is_batch:
51
+ return np.stack([local_maxima_numpy(x, order) for x in arr])
52
+
53
+ # Define a comparison function for argrelextrema to find local maxima
54
+ compare_func = np.greater
55
+
56
+ # Find the indices of the local maxima
57
+ local_maxima_indices = argrelextrema(arr, compare_func, order=order)
58
+
59
+ # Initialize a new numpy array with zeros and the same shape as the input array
60
+ output_arr = np.zeros_like(arr)
61
+
62
+ # Set the local maxima values in the output array
63
+ output_arr[local_maxima_indices] = arr[local_maxima_indices]
64
+
65
+ return output_arr
66
+
67
+
68
+ def peak_picking(boundary_activation, window_past=12, window_future=6):
69
+ # Find local maxima using a sliding window
70
+ window_size = window_past + window_future
71
+ assert window_size % 2 == 0, "window_past + window_future must be even"
72
+ window_size += 1
73
+
74
+ # Pad boundary_activation
75
+ boundary_activation_padded = np.pad(
76
+ boundary_activation, (window_past, window_future), mode="constant"
77
+ )
78
+ max_filter = sliding_window_view(boundary_activation_padded, window_size)
79
+ local_maxima = (boundary_activation == np.max(max_filter, axis=-1)) & (
80
+ boundary_activation > 0
81
+ )
82
+
83
+ # Compute strength values by subtracting the mean of the past and future windows
84
+ past_window_filter = sliding_window_view(
85
+ boundary_activation_padded[: -(window_future + 1)], window_past
86
+ )
87
+ future_window_filter = sliding_window_view(
88
+ boundary_activation_padded[window_past + 1 :], window_future
89
+ )
90
+ past_mean = np.mean(past_window_filter, axis=-1)
91
+ future_mean = np.mean(future_window_filter, axis=-1)
92
+ strength_values = boundary_activation - ((past_mean + future_mean) / 2)
93
+
94
+ # Get boundary candidates and their corresponding strength values
95
+ boundary_candidates = np.flatnonzero(local_maxima)
96
+ strength_values = strength_values[boundary_candidates]
97
+
98
+ strength_activations = np.zeros_like(boundary_activation)
99
+ strength_activations[boundary_candidates] = strength_values
100
+
101
+ return strength_activations
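To make the peak-picking step concrete, here is a toy sketch on a synthetic boundary-activation curve (the window sizes are arbitrary; in the model they are derived from `config.frame_rates`):

```python
import numpy as np

act = np.zeros(20, dtype=np.float32)
act[5], act[13] = 1.0, 0.6  # two artificial boundary activations

# Non-zero entries mark the picked peaks; their values are the strength scores
# (activation minus the mean of the surrounding past/future windows).
strengths = peak_picking(act, window_past=3, window_future=3)
print(np.flatnonzero(strengths))  # -> [ 5 13]
```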
src/SongFormer/train/accelerate_config/single_gpu.yaml ADDED
@@ -0,0 +1,17 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: 'NO'
4
+ downcast_bf16: 'no'
5
+ enable_cpu_affinity: false
6
+ gpu_ids: all
7
+ machine_rank: 0
8
+ main_training_function: main
9
+ mixed_precision: 'no'
10
+ num_machines: 1
11
+ num_processes: 1
12
+ rdzv_backend: static
13
+ same_network: true
14
+ tpu_env: []
15
+ tpu_use_cluster: false
16
+ tpu_use_sudo: false
17
+ use_cpu: false
src/SongFormer/utils/average_checkpoints.py ADDED
@@ -0,0 +1,152 @@
1
+ import torch
2
+ import copy
3
+ from typing import List, Dict, Any
4
+
5
+
6
+ def average_checkpoints(checkpoint_paths: List[str], output_path: str = None):
7
+ """
8
+ Average the model and model_ema weights from multiple checkpoints
9
+
10
+ Parameters:
11
+ checkpoint_paths: List of checkpoint file paths
12
+ output_path: Output path; if None, return the averaged checkpoint dictionary
13
+
14
+ Returns:
15
+ Averaged checkpoint dictionary
16
+ """
17
+ if not checkpoint_paths:
18
+ raise ValueError("At least one checkpoint path is required")
19
+
20
+ # Load the first checkpoint as the base
21
+ print(f"Loading base checkpoint: {checkpoint_paths[0]}")
22
+ avg_checkpoint = torch.load(checkpoint_paths[0], map_location="cpu")
23
+
24
+ if len(checkpoint_paths) == 1:
25
+ if output_path:
26
+ torch.save(avg_checkpoint, output_path)
27
+ return avg_checkpoint
28
+
29
+ # Initialize accumulators
30
+ avg_model_state = copy.deepcopy(avg_checkpoint["model"])
31
+ avg_model_ema_state = None
32
+
33
+ if "model_ema" in avg_checkpoint:
34
+ avg_model_ema_state = copy.deepcopy(avg_checkpoint["model_ema"])
35
+
36
+ # Accumulate the weights from the other checkpoints
37
+ for i, ckpt_path in enumerate(checkpoint_paths[1:], 1):
38
+ print(f"Processing checkpoint {i + 1}/{len(checkpoint_paths)}: {ckpt_path}")
39
+ ckpt = torch.load(ckpt_path, map_location="cpu")
40
+
41
+ # Accumulate model weights
42
+ for key in avg_model_state.keys():
43
+ if key in ckpt["model"]:
44
+ avg_model_state[key] += ckpt["model"][key]
45
+
46
+ # Accumulate model_ema weights (if available)
47
+ if avg_model_ema_state is not None and "model_ema" in ckpt:
48
+ for key in avg_model_ema_state.keys():
49
+ if key in ckpt["model_ema"]:
50
+ avg_model_ema_state[key] += ckpt["model_ema"][key]
51
+
52
+ # Compute the average
53
+ num_checkpoints = len(checkpoint_paths)
54
+ print(f"Averaging over {num_checkpoints} checkpoints...")
55
+
56
+ for key in avg_model_state.keys():
57
+ avg_model_state[key] = avg_model_state[key] / num_checkpoints
58
+
59
+ if avg_model_ema_state is not None:
60
+ for key in avg_model_ema_state.keys():
61
+ avg_model_ema_state[key] = avg_model_ema_state[key] / num_checkpoints
62
+
63
+ # Update the checkpoint dictionary
64
+ avg_checkpoint["model"] = avg_model_state
65
+ if avg_model_ema_state is not None:
66
+ avg_checkpoint["model_ema"] = avg_model_ema_state
67
+
68
+ # Save (if an output path is specified)
69
+ if output_path:
70
+ print(f"Saving averaged checkpoint to: {output_path}")
71
+ torch.save(avg_checkpoint, output_path)
72
+
73
+ return avg_checkpoint
74
+
75
+
76
+ def average_checkpoints_memory_efficient(
77
+ checkpoint_paths: List[str], output_path: str = None
78
+ ):
79
+ """
80
+ Memory efficient version: Load and process checkpoints one by one, suitable for large models
81
+ """
82
+ if not checkpoint_paths:
83
+ raise ValueError("At least one checkpoint path is required")
84
+
85
+ print(f"Loading base checkpoint: {checkpoint_paths[0]}")
86
+ avg_checkpoint = torch.load(checkpoint_paths[0], map_location="cpu")
87
+
88
+ if len(checkpoint_paths) == 1:
89
+ if output_path:
90
+ torch.save(avg_checkpoint, output_path)
91
+ return avg_checkpoint
92
+
93
+ # Convert to float32 for better precision
94
+ for key in avg_checkpoint["model"].keys():
95
+ avg_checkpoint["model"][key] = avg_checkpoint["model"][key].float()
96
+
97
+ if "model_ema" in avg_checkpoint:
98
+ for key in avg_checkpoint["model_ema"].keys():
99
+ avg_checkpoint["model_ema"][key] = avg_checkpoint["model_ema"][key].float()
100
+
101
+ # Load and accumulate checkpoints one by one
102
+ for i, ckpt_path in enumerate(checkpoint_paths[1:], 1):
103
+ print(f"Processing checkpoint {i + 1}/{len(checkpoint_paths)}: {ckpt_path}")
104
+ ckpt = torch.load(ckpt_path, map_location="cpu")
105
+
106
+ # Accumulate model weights
107
+ for key in avg_checkpoint["model"].keys():
108
+ if key in ckpt["model"]:
109
+ avg_checkpoint["model"][key] += ckpt["model"][key].float()
110
+
111
+ # Accumulate model_ema weights
112
+ if "model_ema" in avg_checkpoint and "model_ema" in ckpt:
113
+ for key in avg_checkpoint["model_ema"].keys():
114
+ if key in ckpt["model_ema"]:
115
+ avg_checkpoint["model_ema"][key] += ckpt["model_ema"][key].float()
116
+
117
+ # Free memory
118
+ del ckpt
119
+ torch.cuda.empty_cache()
120
+
121
+ # Compute the average
122
+ num_checkpoints = len(checkpoint_paths)
123
+ print(f"Averaging over {num_checkpoints} checkpoints...")
124
+
125
+ for key in avg_checkpoint["model"].keys():
126
+ avg_checkpoint["model"][key] /= num_checkpoints
127
+
128
+ if "model_ema" in avg_checkpoint:
129
+ for key in avg_checkpoint["model_ema"].keys():
130
+ avg_checkpoint["model_ema"][key] /= num_checkpoints
131
+
132
+ if output_path:
133
+ print(f"Saving averaged checkpoint to: {output_path}")
134
+ torch.save(avg_checkpoint, output_path)
135
+
136
+ return avg_checkpoint
137
+
138
+
139
+ # Example usage
140
+ if __name__ == "__main__":
141
+ # Method 1: Simple usage
142
+ checkpoint_paths = []
143
+
144
+ # Average and save
145
+ average_checkpoints(checkpoint_paths, "")
146
+
147
+ # Method 2: Get the averaged checkpoint and further process it
148
+ # avg_ckpt = average_checkpoints(checkpoint_paths)
149
+ # print("Averaged checkpoint keys:", avg_ckpt.keys())
150
+
151
+ # Method 3: Use memory-efficient version (suitable for large models)
152
+ # average_checkpoints_memory_efficient(checkpoint_paths, 'averaged_checkpoint_efficient.pt')
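One possible way to drive the averaging utility, assuming the training run writes full checkpoint dictionaries (with a `"model"` and optionally a `"model_ema"` entry) under a glob-able directory; the paths below are placeholders:

```python
import glob

# Hypothetical checkpoint layout; adapt the pattern to your own run directory.
ckpt_paths = sorted(glob.glob("runs/songformer/checkpoint_*.pt"))[-5:]  # last 5 checkpoints
average_checkpoints(ckpt_paths, output_path="runs/songformer/averaged.pt")
```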
src/SongFormer/utils/convert_res2msa_txt.py ADDED
@@ -0,0 +1,79 @@
1
+ import json
2
+ import os
3
+ from pathlib import Path
4
+ import fire
5
+
6
+
7
+ def convert_json_to_format(json_data):
8
+ """Convert JSON data to the specified format"""
9
+ result = []
10
+
11
+ # Process the start time and label for each segment
12
+ for segment in json_data:
13
+ start_time = segment["start"]
14
+ label = segment["label"]
15
+ result.append(f"{start_time:.6f} {label}")
16
+
17
+ # Add the last end time
18
+ if json_data:
19
+ last_end_time = json_data[-1]["end"]
20
+ result.append(f"{last_end_time:.6f} end")
21
+
22
+ return "\n".join(result)
23
+
24
+
25
+ def process_json_files(input_folder, output_folder):
26
+ """Process all JSON files in the input folder"""
27
+
28
+ # Create the output folder if it doesn't exist
29
+ Path(output_folder).mkdir(parents=True, exist_ok=True)
30
+
31
+ # Get all JSON files
32
+ json_files = [f for f in os.listdir(input_folder) if f.endswith(".json")]
33
+
34
+ if not json_files:
35
+ print(f"No JSON files found in {input_folder}")
36
+ return
37
+
38
+ print(f"Found {len(json_files)} JSON files")
39
+
40
+ # Process each JSON file
41
+ for json_file in json_files:
42
+ input_path = os.path.join(input_folder, json_file)
43
+
44
+ try:
45
+ # Read the JSON file
46
+ with open(input_path, "r", encoding="utf-8") as f:
47
+ data = json.load(f)
48
+
49
+ # Convert the format
50
+ converted_data = convert_json_to_format(data)
51
+
52
+ # Generate the output filename (replace .json with .txt)
53
+ output_filename = json_file.replace(".json", ".txt")
54
+ output_path = os.path.join(output_folder, output_filename)
55
+
56
+ # Write to the output file
57
+ with open(output_path, "w", encoding="utf-8") as f:
58
+ f.write(converted_data)
59
+
60
+ print(f"✓ Processed: {json_file} -> {output_filename}")
61
+
62
+ except Exception as e:
63
+ print(f"✗ Error processing {json_file}: {str(e)}")
64
+
65
+
66
+ def main(input_folder: str, output_folder: str):
67
+ print(f"Input folder: {input_folder}")
68
+ print(f"Output folder: {output_folder}")
69
+ print("-" * 50)
70
+
71
+ # Process the files
72
+ process_json_files(input_folder, output_folder)
73
+
74
+ print("-" * 50)
75
+ print("Processing complete!")
76
+
77
+
78
+ if __name__ == "__main__":
79
+ fire.Fire(main)
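For clarity, the converter expects one JSON file per song shaped like the hypothetical example below and writes the two-column MSA text format with a trailing `end` line:

```python
# input  (some_song.json):  [{"start": 0.0,  "end": 12.3, "label": "intro"},
#                            {"start": 12.3, "end": 45.6, "label": "verse"}]
# output (some_song.txt):   0.000000 intro
#                           12.300000 verse
#                           45.600000 end

process_json_files("results_json/", "results_txt/")  # folder names are placeholders
```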
src/SongFormer/utils/fetch_pretrained.py ADDED
@@ -0,0 +1,40 @@
1
+ import os
2
+ import requests
3
+ from tqdm import tqdm
4
+
5
+
6
+ def download(url, path):
7
+ if os.path.exists(path):
8
+ print(f"File already exists, skipping download: {path}")
9
+ return
10
+ os.makedirs(os.path.dirname(path), exist_ok=True)
11
+ response = requests.get(url, stream=True)
12
+ total_size = int(response.headers.get("content-length", 0))
13
+ with (
14
+ open(path, "wb") as f,
15
+ tqdm(
16
+ desc=path,
17
+ total=total_size,
18
+ unit="iB",
19
+ unit_scale=True,
20
+ unit_divisor=1024,
21
+ ) as bar,
22
+ ):
23
+ for data in response.iter_content(chunk_size=1024):
24
+ size = f.write(data)
25
+ bar.update(size)
26
+
27
+
28
+ # Download the pretrained models as described at https://github.com/minzwon/musicfm
29
+ download(
30
+ "https://huggingface.co/minzwon/MusicFM/resolve/main/msd_stats.json",
31
+ os.path.join("ckpts", "MusicFM", "msd_stats.json"),
32
+ )
33
+ download(
34
+ "https://huggingface.co/minzwon/MusicFM/resolve/main/pretrained_msd.pt",
35
+ os.path.join("ckpts", "MusicFM", "pretrained_msd.pt"),
36
+ )
37
+
38
+ # for Mainland China
39
+ # download('https://hf-mirror.com/minzwon/MusicFM/resolve/main/msd_stats.json', os.path.join("ckpts", "MusicFM", "msd_stats.json"))
40
+ # download('https://hf-mirror.com/minzwon/MusicFM/resolve/main/pretrained_msd.pt', os.path.join("ckpts", "MusicFM", "pretrained_msd.pt"))
src/third_party/MuQ/.gitattributes ADDED
@@ -0,0 +1,2 @@
1
+ # Auto detect text files and perform LF normalization
2
+ * text=auto
src/third_party/MuQ/.gitignore ADDED
@@ -0,0 +1,46 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.egg*/
6
+ *pyc
7
+
8
+ # Distribution / packaging
9
+ .Python
10
+ env/
11
+ build/
12
+ dist/
13
+ *.log
14
+
15
+ # pyenv
16
+ .python-version
17
+
18
+ # dotenv
19
+ .env
20
+
21
+ # virtualenv
22
+ .venv/
23
+ venv/
24
+ ENV/
25
+
26
+ # VSCode settings
27
+ .vscode
28
+
29
+ # IDEA files
30
+ .idea
31
+
32
+ # OSX dir files
33
+ .DS_Store
34
+
35
+ # Sublime Text settings
36
+ *.sublime-workspace
37
+ *.sublime-project
38
+
39
+ # custom
40
+ open/
41
+ src/recipes/pretrain/dataset/music4all/*.json
42
+ src/recipes/contrastive_learning/datasets/mtg-jamendo/*.json
43
+ runs/
44
+ output/
45
+ logs
46
+ outputs/
src/third_party/MuQ/.gitmodules ADDED
@@ -0,0 +1,3 @@
1
+ [submodule "src/recipes/pretrain/fairseq"]
2
+ path = src/recipes/pretrain/fairseq
3
+ url = https://github.com/facebookresearch/fairseq
src/third_party/MuQ/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) Tencent.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
src/third_party/MuQ/LICENSE_weights ADDED
@@ -0,0 +1,399 @@
1
+ Attribution-NonCommercial 4.0 International
2
+
3
+ =======================================================================
4
+
5
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
6
+ does not provide legal services or legal advice. Distribution of
7
+ Creative Commons public licenses does not create a lawyer-client or
8
+ other relationship. Creative Commons makes its licenses and related
9
+ information available on an "as-is" basis. Creative Commons gives no
10
+ warranties regarding its licenses, any material licensed under their
11
+ terms and conditions, or any related information. Creative Commons
12
+ disclaims all liability for damages resulting from their use to the
13
+ fullest extent possible.
14
+
15
+ Using Creative Commons Public Licenses
16
+
17
+ Creative Commons public licenses provide a standard set of terms and
18
+ conditions that creators and other rights holders may use to share
19
+ original works of authorship and other material subject to copyright
20
+ and certain other rights specified in the public license below. The
21
+ following considerations are for informational purposes only, are not
22
+ exhaustive, and do not form part of our licenses.
23
+
24
+ Considerations for licensors: Our public licenses are
25
+ intended for use by those authorized to give the public
26
+ permission to use material in ways otherwise restricted by
27
+ copyright and certain other rights. Our licenses are
28
+ irrevocable. Licensors should read and understand the terms
29
+ and conditions of the license they choose before applying it.
30
+ Licensors should also secure all rights necessary before
31
+ applying our licenses so that the public can reuse the
32
+ material as expected. Licensors should clearly mark any
33
+ material not subject to the license. This includes other CC-
34
+ licensed material, or material used under an exception or
35
+ limitation to copyright. More considerations for licensors:
36
+ wiki.creativecommons.org/Considerations_for_licensors
37
+
38
+ Considerations for the public: By using one of our public
39
+ licenses, a licensor grants the public permission to use the
40
+ licensed material under specified terms and conditions. If
41
+ the licensor's permission is not necessary for any reason--for
42
+ example, because of any applicable exception or limitation to
43
+ copyright--then that use is not regulated by the license. Our
44
+ licenses grant only permissions under copyright and certain
45
+ other rights that a licensor has authority to grant. Use of
46
+ the licensed material may still be restricted for other
47
+ reasons, including because others have copyright or other
48
+ rights in the material. A licensor may make special requests,
49
+ such as asking that all changes be marked or described.
50
+ Although not required by our licenses, you are encouraged to
51
+ respect those requests where reasonable. More_considerations
52
+ for the public:
53
+ wiki.creativecommons.org/Considerations_for_licensees
54
+
55
+ =======================================================================
56
+
57
+ Creative Commons Attribution-NonCommercial 4.0 International Public
58
+ License
59
+
60
+ By exercising the Licensed Rights (defined below), You accept and agree
61
+ to be bound by the terms and conditions of this Creative Commons
62
+ Attribution-NonCommercial 4.0 International Public License ("Public
63
+ License"). To the extent this Public License may be interpreted as a
64
+ contract, You are granted the Licensed Rights in consideration of Your
65
+ acceptance of these terms and conditions, and the Licensor grants You
66
+ such rights in consideration of benefits the Licensor receives from
67
+ making the Licensed Material available under these terms and
68
+ conditions.
69
+
70
+ Section 1 -- Definitions.
71
+
72
+ a. Adapted Material means material subject to Copyright and Similar
73
+ Rights that is derived from or based upon the Licensed Material
74
+ and in which the Licensed Material is translated, altered,
75
+ arranged, transformed, or otherwise modified in a manner requiring
76
+ permission under the Copyright and Similar Rights held by the
77
+ Licensor. For purposes of this Public License, where the Licensed
78
+ Material is a musical work, performance, or sound recording,
79
+ Adapted Material is always produced where the Licensed Material is
80
+ synched in timed relation with a moving image.
81
+
82
+ b. Adapter's License means the license You apply to Your Copyright
83
+ and Similar Rights in Your contributions to Adapted Material in
84
+ accordance with the terms and conditions of this Public License.
85
+
86
+ c. Copyright and Similar Rights means copyright and/or similar rights
87
+ closely related to copyright including, without limitation,
88
+ performance, broadcast, sound recording, and Sui Generis Database
89
+ Rights, without regard to how the rights are labeled or
90
+ categorized. For purposes of this Public License, the rights
91
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
92
+ Rights.
93
+ d. Effective Technological Measures means those measures that, in the
94
+ absence of proper authority, may not be circumvented under laws
95
+ fulfilling obligations under Article 11 of the WIPO Copyright
96
+ Treaty adopted on December 20, 1996, and/or similar international
97
+ agreements.
98
+
99
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
100
+ any other exception or limitation to Copyright and Similar Rights
101
+ that applies to Your use of the Licensed Material.
102
+
103
+ f. Licensed Material means the artistic or literary work, database,
104
+ or other material to which the Licensor applied this Public
105
+ License.
106
+
107
+ g. Licensed Rights means the rights granted to You subject to the
108
+ terms and conditions of this Public License, which are limited to
109
+ all Copyright and Similar Rights that apply to Your use of the
110
+ Licensed Material and that the Licensor has authority to license.
111
+
112
+ h. Licensor means the individual(s) or entity(ies) granting rights
113
+ under this Public License.
114
+
115
+ i. NonCommercial means not primarily intended for or directed towards
116
+ commercial advantage or monetary compensation. For purposes of
117
+ this Public License, the exchange of the Licensed Material for
118
+ other material subject to Copyright and Similar Rights by digital
119
+ file-sharing or similar means is NonCommercial provided there is
120
+ no payment of monetary compensation in connection with the
121
+ exchange.
122
+
123
+ j. Share means to provide material to the public by any means or
124
+ process that requires permission under the Licensed Rights, such
125
+ as reproduction, public display, public performance, distribution,
126
+ dissemination, communication, or importation, and to make material
127
+ available to the public including in ways that members of the
128
+ public may access the material from a place and at a time
129
+ individually chosen by them.
130
+
131
+ k. Sui Generis Database Rights means rights other than copyright
132
+ resulting from Directive 96/9/EC of the European Parliament and of
133
+ the Council of 11 March 1996 on the legal protection of databases,
134
+ as amended and/or succeeded, as well as other essentially
135
+ equivalent rights anywhere in the world.
136
+
137
+ l. You means the individual or entity exercising the Licensed Rights
138
+ under this Public License. Your has a corresponding meaning.
139
+
140
+ Section 2 -- Scope.
141
+
142
+ a. License grant.
143
+
144
+ 1. Subject to the terms and conditions of this Public License,
145
+ the Licensor hereby grants You a worldwide, royalty-free,
146
+ non-sublicensable, non-exclusive, irrevocable license to
147
+ exercise the Licensed Rights in the Licensed Material to:
148
+
149
+ a. reproduce and Share the Licensed Material, in whole or
150
+ in part, for NonCommercial purposes only; and
151
+
152
+ b. produce, reproduce, and Share Adapted Material for
153
+ NonCommercial purposes only.
154
+
155
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
156
+ Exceptions and Limitations apply to Your use, this Public
157
+ License does not apply, and You do not need to comply with
158
+ its terms and conditions.
159
+
160
+ 3. Term. The term of this Public License is specified in Section
161
+ 6(a).
162
+
163
+ 4. Media and formats; technical modifications allowed. The
164
+ Licensor authorizes You to exercise the Licensed Rights in
165
+ all media and formats whether now known or hereafter created,
166
+ and to make technical modifications necessary to do so. The
167
+ Licensor waives and/or agrees not to assert any right or
168
+ authority to forbid You from making technical modifications
169
+ necessary to exercise the Licensed Rights, including
170
+ technical modifications necessary to circumvent Effective
171
+ Technological Measures. For purposes of this Public License,
172
+ simply making modifications authorized by this Section 2(a)
173
+ (4) never produces Adapted Material.
174
+
175
+ 5. Downstream recipients.
176
+
177
+ a. Offer from the Licensor -- Licensed Material. Every
178
+ recipient of the Licensed Material automatically
179
+ receives an offer from the Licensor to exercise the
180
+ Licensed Rights under the terms and conditions of this
181
+ Public License.
182
+
183
+ b. No downstream restrictions. You may not offer or impose
184
+ any additional or different terms or conditions on, or
185
+ apply any Effective Technological Measures to, the
186
+ Licensed Material if doing so restricts exercise of the
187
+ Licensed Rights by any recipient of the Licensed
188
+ Material.
189
+
190
+ 6. No endorsement. Nothing in this Public License constitutes or
191
+ may be construed as permission to assert or imply that You
192
+ are, or that Your use of the Licensed Material is, connected
193
+ with, or sponsored, endorsed, or granted official status by,
194
+ the Licensor or others designated to receive attribution as
195
+ provided in Section 3(a)(1)(A)(i).
196
+
197
+ b. Other rights.
198
+
199
+ 1. Moral rights, such as the right of integrity, are not
200
+ licensed under this Public License, nor are publicity,
201
+ privacy, and/or other similar personality rights; however, to
202
+ the extent possible, the Licensor waives and/or agrees not to
203
+ assert any such rights held by the Licensor to the limited
204
+ extent necessary to allow You to exercise the Licensed
205
+ Rights, but not otherwise.
206
+
207
+ 2. Patent and trademark rights are not licensed under this
208
+ Public License.
209
+
210
+ 3. To the extent possible, the Licensor waives any right to
211
+ collect royalties from You for the exercise of the Licensed
212
+ Rights, whether directly or through a collecting society
213
+ under any voluntary or waivable statutory or compulsory
214
+ licensing scheme. In all other cases the Licensor expressly
215
+ reserves any right to collect such royalties, including when
216
+ the Licensed Material is used other than for NonCommercial
217
+ purposes.
218
+
219
+ Section 3 -- License Conditions.
220
+
221
+ Your exercise of the Licensed Rights is expressly made subject to the
222
+ following conditions.
223
+
224
+ a. Attribution.
225
+
226
+ 1. If You Share the Licensed Material (including in modified
227
+ form), You must:
228
+
229
+ a. retain the following if it is supplied by the Licensor
230
+ with the Licensed Material:
231
+
232
+ i. identification of the creator(s) of the Licensed
233
+ Material and any others designated to receive
234
+ attribution, in any reasonable manner requested by
235
+ the Licensor (including by pseudonym if
236
+ designated);
237
+
238
+ ii. a copyright notice;
239
+
240
+ iii. a notice that refers to this Public License;
241
+
242
+ iv. a notice that refers to the disclaimer of
243
+ warranties;
244
+
245
+ v. a URI or hyperlink to the Licensed Material to the
246
+ extent reasonably practicable;
247
+
248
+ b. indicate if You modified the Licensed Material and
249
+ retain an indication of any previous modifications; and
250
+
251
+ c. indicate the Licensed Material is licensed under this
252
+ Public License, and include the text of, or the URI or
253
+ hyperlink to, this Public License.
254
+
255
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
256
+ reasonable manner based on the medium, means, and context in
257
+ which You Share the Licensed Material. For example, it may be
258
+ reasonable to satisfy the conditions by providing a URI or
259
+ hyperlink to a resource that includes the required
260
+ information.
261
+
262
+ 3. If requested by the Licensor, You must remove any of the
263
+ information required by Section 3(a)(1)(A) to the extent
264
+ reasonably practicable.
265
+
266
+ 4. If You Share Adapted Material You produce, the Adapter's
267
+ License You apply must not prevent recipients of the Adapted
268
+ Material from complying with this Public License.
269
+
270
+ Section 4 -- Sui Generis Database Rights.
271
+
272
+ Where the Licensed Rights include Sui Generis Database Rights that
273
+ apply to Your use of the Licensed Material:
274
+
275
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
276
+ to extract, reuse, reproduce, and Share all or a substantial
277
+ portion of the contents of the database for NonCommercial purposes
278
+ only;
279
+
280
+ b. if You include all or a substantial portion of the database
281
+ contents in a database in which You have Sui Generis Database
282
+ Rights, then the database in which You have Sui Generis Database
283
+ Rights (but not its individual contents) is Adapted Material; and
284
+
285
+ c. You must comply with the conditions in Section 3(a) if You Share
286
+ all or a substantial portion of the contents of the database.
287
+
288
+ For the avoidance of doubt, this Section 4 supplements and does not
289
+ replace Your obligations under this Public License where the Licensed
290
+ Rights include other Copyright and Similar Rights.
291
+
292
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
293
+
294
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
295
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
296
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
297
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
298
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
299
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
300
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
301
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
302
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
303
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
304
+
305
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
306
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
307
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
308
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
309
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
310
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
311
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
312
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
313
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
314
+
315
+ c. The disclaimer of warranties and limitation of liability provided
316
+ above shall be interpreted in a manner that, to the extent
317
+ possible, most closely approximates an absolute disclaimer and
318
+ waiver of all liability.
319
+
320
+ Section 6 -- Term and Termination.
321
+
322
+ a. This Public License applies for the term of the Copyright and
323
+ Similar Rights licensed here. However, if You fail to comply with
324
+ this Public License, then Your rights under this Public License
325
+ terminate automatically.
326
+
327
+ b. Where Your right to use the Licensed Material has terminated under
328
+ Section 6(a), it reinstates:
329
+
330
+ 1. automatically as of the date the violation is cured, provided
331
+ it is cured within 30 days of Your discovery of the
332
+ violation; or
333
+
334
+ 2. upon express reinstatement by the Licensor.
335
+
336
+ For the avoidance of doubt, this Section 6(b) does not affect any
337
+ right the Licensor may have to seek remedies for Your violations
338
+ of this Public License.
339
+
340
+ c. For the avoidance of doubt, the Licensor may also offer the
341
+ Licensed Material under separate terms or conditions or stop
342
+ distributing the Licensed Material at any time; however, doing so
343
+ will not terminate this Public License.
344
+
345
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
346
+ License.
347
+
348
+ Section 7 -- Other Terms and Conditions.
349
+
350
+ a. The Licensor shall not be bound by any additional or different
351
+ terms or conditions communicated by You unless expressly agreed.
352
+
353
+ b. Any arrangements, understandings, or agreements regarding the
354
+ Licensed Material not stated herein are separate from and
355
+ independent of the terms and conditions of this Public License.
356
+
357
+ Section 8 -- Interpretation.
358
+
359
+ a. For the avoidance of doubt, this Public License does not, and
360
+ shall not be interpreted to, reduce, limit, restrict, or impose
361
+ conditions on any use of the Licensed Material that could lawfully
362
+ be made without permission under this Public License.
363
+
364
+ b. To the extent possible, if any provision of this Public License is
365
+ deemed unenforceable, it shall be automatically reformed to the
366
+ minimum extent necessary to make it enforceable. If the provision
367
+ cannot be reformed, it shall be severed from this Public License
368
+ without affecting the enforceability of the remaining terms and
369
+ conditions.
370
+
371
+ c. No term or condition of this Public License will be waived and no
372
+ failure to comply consented to unless expressly agreed to by the
373
+ Licensor.
374
+
375
+ d. Nothing in this Public License constitutes or may be interpreted
376
+ as a limitation upon, or waiver of, any privileges and immunities
377
+ that apply to the Licensor or You, including from the legal
378
+ processes of any jurisdiction or authority.
379
+
380
+ =======================================================================
381
+
382
+ Creative Commons is not a party to its public
383
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
384
+ its public licenses to material it publishes and in those instances
385
+ will be considered the “Licensor.” The text of the Creative Commons
386
+ public licenses is dedicated to the public domain under the CC0 Public
387
+ Domain Dedication. Except for the limited purpose of indicating that
388
+ material is shared under a Creative Commons public license or as
389
+ otherwise permitted by the Creative Commons policies published at
390
+ creativecommons.org/policies, Creative Commons does not authorize the
391
+ use of the trademark "Creative Commons" or any other trademark or logo
392
+ of Creative Commons without its prior written consent including,
393
+ without limitation, in connection with any unauthorized modifications
394
+ to any of its public licenses or any other arrangements,
395
+ understandings, or agreements concerning use of licensed material. For
396
+ the avoidance of doubt, this paragraph does not form part of the
397
+ public licenses.
398
+
399
+ Creative Commons may be contacted at creativecommons.org.
src/third_party/MuQ/README.md ADDED
@@ -0,0 +1,129 @@
1
+ # <img src="images/muq-logo.jpeg" alt="" height="24px"> MuQ & MuQ-MuLan
2
+
3
+ <div>
4
+ <a href='#'><img alt="Static Badge" src="https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white"></a>
5
+ <a href='https://arxiv.org/abs/2501.01108'><img alt="Static Badge" src="https://img.shields.io/badge/arXiv-2501.01108-%23b31b1b?logo=arxiv&link=https%3A%2F%2Farxiv.org%2F"></a>
6
+ <a href='https://huggingface.co/OpenMuQ'><img alt="Static Badge" src="https://img.shields.io/badge/huggingface-OpenMuQ-%23FFD21E?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2FOpenMuQ"></a>
7
+ <a href='https://pytorch.org/'><img alt="Static Badge" src="https://img.shields.io/badge/framework-PyTorch-%23EE4C2C?logo=pytorch"></a>
8
+ <a href='https://pypi.org/project/muq'><img alt="Static Badge" src="https://img.shields.io/badge/pip%20install-muq-green?logo=PyPI&logoColor=white&link=https%3A%2F%2Fpypi.org%2Fproject%2Fmuq"></a>
9
+ </div>
10
+
11
+ This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
12
+ with Mel Residual Vector **Q**uantization"*.
13
+
14
+ In this repo, the following models are released:
15
+
16
+ - **MuQ**: A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
17
+ - **MuQ-MuLan**: A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
18
+
19
+ ## Overview
20
+
21
+ We develop **MuQ** for music SSL. MuQ applies our proposed Mel-RVQ as its quantization target and achieves SOTA performance on many music understanding (MIR) tasks.
22
+
23
+ We also construct **MuQ-MuLan**, a CLIP-like model trained with contrastive learning that jointly embeds music and text.
24
+
25
+ For more details, please refer to our [paper](https://arxiv.org/abs/2501.01108).
26
+
27
+ <div>
28
+ <img src="images/radar.jpg" width="45%" alt="Evaluation on MARBLE Benchmark">
29
+ <img src="images/tagging.jpg" width="45%" alt="Evaluation on Zero-shot Music Tagging">
30
+ </div>
31
+
32
+ ## Usage
33
+
34
+ To get started, install the official `muq` library with pip and make sure your environment has `python>=3.8`:
35
+ ```bash
36
+ pip3 install muq
37
+ ```
38
+
39
+
40
+ To extract music audio features using **MuQ**, you can refer to the following code:
41
+ ```python
42
+ import torch, librosa
43
+ from muq import MuQ
44
+
45
+ device = 'cuda'
46
+ wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
47
+ wavs = torch.tensor(wav).unsqueeze(0).to(device)
48
+
49
+ # This will automatically fetch the checkpoint from huggingface
50
+ muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
51
+ muq = muq.to(device).eval()
52
+
53
+ with torch.no_grad():
54
+ output = muq(wavs, output_hidden_states=True)
55
+
56
+ print('Total number of layers: ', len(output.hidden_states))
57
+ print('Feature shape: ', output.last_hidden_state.shape)
58
+
59
+ ```
60
+
61
+ Using **MuQ-MuLan** to extract the music and text embeddings and calculate the similarity:
62
+ ```python
63
+ import torch, librosa
64
+ from muq import MuQMuLan
65
+
66
+ # This will automatically fetch checkpoints from huggingface
67
+ device = 'cuda'
68
+ mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
69
+ mulan = mulan.to(device).eval()
70
+
71
+ # Extract music embeddings
72
+ wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
73
+ wavs = torch.tensor(wav).unsqueeze(0).to(device)
74
+ with torch.no_grad():
75
+ audio_embeds = mulan(wavs = wavs)
76
+
77
+ # Extract text embeddings (texts can be in English or Chinese)
78
+ texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲,节奏欢快"]
79
+ with torch.no_grad():
80
+ text_embeds = mulan(texts = texts)
81
+
82
+ # Calculate dot product similarity
83
+ sim = mulan.calc_similarity(audio_embeds, text_embeds)
84
+ print(sim)
85
+ ```
86
+
87
+ > Note that both MuQ and MuQ-MuLan strictly require **24 kHz** audio as input.
88
+ > We recommend using **fp32** during MuQ inference to avoid potential NaN issues.
89
+
90
+
91
+ ## Performance
92
+
93
+ <img src="images/tab-marble.jpg" width="100%" style="max-width: 800px" alt="Table MARBLE Benchmark">
94
+ <img src="images/tab-mulan.png" width="50%" style="max-width: 400px; margin: 0 25%" alt="Table Mulan Results">
95
+
96
+ ## Model Checkpoints
97
+
98
+ | Model Name | Parameters | Data | HuggingFace🤗 |
99
+ | ----------- | --- | --- | ----------- |
100
+ | MuQ | ~300M | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) |
101
+ | MuQ-MuLan | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large) |
102
+
103
+ **Note**: The open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, it may not reach the performance level reported in the paper. The training recipes can be found [here](./src/recipes).
104
+
105
+ ## License
106
+
107
+ The code in this repository is released under the MIT license as found in the [LICENSE](LICENSE) file.
108
+
109
+ The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) in this repository are released under the CC-BY-NC 4.0 license, as detailed in the [LICENSE_weights](LICENSE_weights) file.
110
+
111
+ ## Citation
112
+
113
+ ```
114
+ @article{zhu2025muq,
115
+ title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
116
+ author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
117
+ journal={arXiv preprint arXiv:2501.01108},
118
+ year={2025}
119
+ }
120
+ ```
121
+
122
+ ## Acknowledgement
123
+
124
+ We borrow many codes from the following repositories:
125
+ - [lucidrains/musiclm-pytorch](https://github.com/lucidrains/musiclm-pytorch)
126
+ - [minzwon/musicfm](https://github.com/minzwon/musicfm)
127
+
128
+
129
+ Also, we are especially grateful to the awesome [MARBLE-Benchmark](https://github.com/a43992899/MARBLE-Benchmark).
src/third_party/MuQ/images/muq-logo.jpeg ADDED
src/third_party/MuQ/images/radar.jpg ADDED

src/third_party/MuQ/images/tab-marble.jpg ADDED

src/third_party/MuQ/images/tab-mulan.png ADDED

src/third_party/MuQ/images/tagging.jpg ADDED

src/third_party/MuQ/requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ einops
2
+ librosa
3
+ nnAudio
4
+ numpy
5
+ soundfile
6
+ torch
7
+ torchaudio
8
+ tqdm
9
+ transformers
10
+ easydict
11
+ x_clip
src/third_party/MuQ/setup.py ADDED
@@ -0,0 +1,34 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ setup(
4
+ name='muq', # Name of the package
5
+ version='0.1.0', # Version of the package
6
+ packages=find_packages(where='src'), # Automatically discover packages under the 'src' directory
7
+ package_dir={'': 'src'}, # Specify the root directory for packages as 'src'
8
+ include_package_data=True, # Include additional files, such as static files
9
+ install_requires=[ # List of dependencies
10
+ "einops",
11
+ "librosa",
12
+ "nnAudio",
13
+ "numpy",
14
+ "soundfile",
15
+ "torch",
16
+ "torchaudio",
17
+ "tqdm",
18
+ "transformers",
19
+ "easydict",
20
+ "x_clip",
21
+ ],
22
+ author='Haina Zhu', # Author name
23
+ author_email='juhayna@qq.com', # Author email address
24
+ description='MuQ: A deep learning model for music and text', # Short description of the package
25
+ long_description=open('README.md', encoding='utf-8').read(), # Long description from the README file
26
+ long_description_content_type='text/markdown', # Format of the long description (Markdown)
27
+ url='https://github.com/tencent-ailab/MuQ', # Project URL
28
+ classifiers=[
29
+ 'Programming Language :: Python :: 3', # Python 3 support
30
+ 'License :: OSI Approved :: MIT License', # License type
31
+ 'Operating System :: OS Independent', # Supports all operating systems
32
+ ],
33
+ python_requires='>=3.8', # Supported Python version
34
+ )
src/third_party/MuQ/src/muq/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from .muq import MuQ, MuQConfig
2
+ from .muq_mulan import MuQMuLan, MuQMuLanConfig
src/third_party/MuQ/src/muq/muq/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .muq import MuQConfig, MuQ
src/third_party/MuQ/src/muq/muq/models/__init__.py ADDED
File without changes
src/third_party/MuQ/src/muq/muq/models/muq_model.py ADDED
@@ -0,0 +1,366 @@
1
+ import json
2
+ import random
3
+ import torch
4
+ from torch import nn
5
+ from einops import rearrange
6
+ import os
7
+ from easydict import EasyDict
8
+
9
+ from ..modules.random_quantizer import RandomProjectionQuantizer
10
+ from ..modules.features import MelSTFT
11
+ from ..modules.conv import Conv2dSubsampling
12
+
13
+ class MuQModel(nn.Module):
14
+
15
+ def __init__(
16
+ self,
17
+ num_codebooks=1,
18
+ codebook_dim=16,
19
+ codebook_size=4096,
20
+ features=["melspec_2048"],
21
+ hop_length=240,
22
+ n_mels=128,
23
+ conv_dim=512,
24
+ encoder_dim=1024,
25
+ encoder_depth=12,
26
+ mask_hop=0.4,
27
+ mask_prob=0.6,
28
+ is_flash=False,
29
+ stat=dict(),
30
+ w2v2_config=dict(),
31
+ use_rvq_target=False,
32
+ use_vq_target=False,
33
+ use_encodec_target=False,
34
+ rvq_ckpt_path=None,
35
+ recon_loss_ratio=None,
36
+ label_rate=25,
37
+ rvq_n_codebooks=8,
38
+ rvq_multi_layer_num=1,
39
+ ):
40
+ super().__init__()
41
+
42
+ # global variables
43
+ self.hop_length = hop_length
44
+ self.mask_hop = mask_hop
45
+ self.mask_prob = mask_prob
46
+ self.num_codebooks = num_codebooks
47
+ self.codebook_size = codebook_size
48
+ self.features = features
49
+ self.recon_loss_ratio = recon_loss_ratio
50
+ self.n_fold = int(100//label_rate)
51
+ self.label_rate = label_rate
52
+
53
+ # load feature mean / std stats
54
+ self.stat = stat
55
+
56
+ # feature extractor
57
+ self.preprocessor_melspec_2048 = MelSTFT(
58
+ n_fft=2048, hop_length=hop_length, is_db=True
59
+ )
60
+
61
+ # random quantizer
62
+ self.use_rvq_target = use_rvq_target
63
+ self.use_vq_target = use_vq_target
64
+ self.use_encodec_target = use_encodec_target
65
+
66
+ seed = 142
67
+ if self.use_rvq_like_target:
68
+ if use_rvq_target:
69
+ from ..modules.rvq import ResidualVectorQuantize
70
+
71
+ inp_dim = 128*self.n_fold
72
+ self.rvq = ResidualVectorQuantize(
73
+ input_dim = inp_dim,
74
+ n_codebooks = rvq_n_codebooks,
75
+ codebook_size = 1024,
76
+ codebook_dim = 16,
77
+ quantizer_dropout = 0.0,
78
+ use_multi_layer_num = rvq_multi_layer_num,
79
+ )
80
+ elif use_vq_target:
81
+ from ..modules.rvq import VectorQuantize
82
+
83
+ self.rvq = VectorQuantize(
84
+ input_dim = 128*self.n_fold,
85
+ codebook_size = 1024,
86
+ codebook_dim = 8,
87
+ stale_tolerance = 1000,
88
+ mfcc_clustering = False
89
+ )
90
+ elif use_encodec_target:
91
+ from encodec import EncodecModel
92
+ self.rvq = EncodecModel.encodec_model_24khz()
93
+ self.rvq.set_target_bandwidth(6.0)
94
+ for param in self.rvq.parameters():
95
+ param.requires_grad = False
96
+
97
+ if rvq_ckpt_path is not None and os.path.exists(rvq_ckpt_path):
98
+ state_dict = torch.load(rvq_ckpt_path, map_location="cpu")
99
+ self.rvq.load_state_dict(state_dict)
100
+ else:
101
+ pass
102
+ # print(f'Checkpoint for rvq `{rvq_ckpt_path}` not found. Using random initialization.')
103
+ else:
104
+ for feature in self.features:
105
+ for i in range(num_codebooks):
106
+ setattr(
107
+ self,
108
+ f"quantizer_{feature}", # _{i}
109
+ RandomProjectionQuantizer(
110
+ n_mels * self.n_fold, codebook_dim, codebook_size, seed=seed + i
111
+ ),
112
+ )
113
+
114
+ # two residual convolution layers + one projection layer
115
+ strides_factory = {
116
+ 4: [2, 2],
117
+ 2: [2, 1]
118
+ }
119
+ self.conv = Conv2dSubsampling(
120
+ 1, conv_dim, encoder_dim, strides=strides_factory.get(self.n_fold), n_bands=n_mels
121
+ )
122
+
123
+ # Conformer
124
+ if is_flash:
125
+ from modules.flash_conformer import (
126
+ Wav2Vec2ConformerEncoder,
127
+ Wav2Vec2ConformerConfig,
128
+ )
129
+ else:
130
+ from transformers.models.wav2vec2_conformer.modeling_wav2vec2_conformer import (
131
+ Wav2Vec2ConformerEncoder,
132
+ Wav2Vec2ConformerConfig,
133
+ )
134
+ config = EasyDict(w2v2_config)
135
+ config.num_hidden_layers = encoder_depth
136
+ config.hidden_size = encoder_dim
137
+
138
+ self.conformer = Wav2Vec2ConformerEncoder(config)
139
+
140
+ self.linear = nn.Linear(encoder_dim, codebook_size) # projection layer
141
+
142
+ # reconstruct melspec
143
+ if self.recon_loss_ratio is not None and self.recon_loss_ratio > 0:
144
+ self.recon_proj = nn.Linear(encoder_dim, n_mels * self.n_fold)
145
+ self.recon_loss = nn.MSELoss()
146
+
147
+ # loss function
148
+ self.loss = nn.CrossEntropyLoss()
149
+
150
+ # cls token (used for sequence classification)
151
+ random.seed(seed)
152
+ self.cls_token = nn.Parameter(torch.randn(encoder_dim))
153
+
154
+
155
+ @property
156
+ def use_rvq_like_target(self):
157
+ return self.use_rvq_target or self.use_vq_target or self.use_encodec_target
158
+
159
+ def masking(self, x, attention_mask=None):
160
+ """random masking of 400ms with given probability"""
161
+ mx = x.clone()
162
+ b, t = mx.shape
163
+ len_masking_raw = int(24000 * self.mask_hop)
164
+ len_masking_token = int(24000 / self.hop_length / 2 / 2 * self.mask_hop)
165
+
166
+ # get random mask indices
167
+ start_indices = torch.rand(b, t // len_masking_raw) < self.mask_prob
168
+ time_domain_masked_indices = torch.nonzero(
169
+ start_indices.repeat_interleave(len_masking_raw, dim=1)
170
+ )
171
+ token_domain_masked_indices = torch.nonzero(
172
+ start_indices.repeat_interleave(len_masking_token, dim=1)
173
+ )
174
+
175
+ # mask with random values
176
+ masking_noise = (
177
+ torch.randn(time_domain_masked_indices.shape[0], dtype=x.dtype) * 0.1
178
+ ) # 0 mean 0.1 std
179
+ mx[tuple(time_domain_masked_indices.t())] = masking_noise.to(x.device)
180
+
181
+ return mx, token_domain_masked_indices
182
+
183
+
184
+ @torch.no_grad()
185
+ def preprocessing(self, x, features):
186
+ """extract classic audio features"""
187
+ # check precision
188
+ if x.dtype == torch.float16 or x.dtype == torch.bfloat16:
189
+ precision = 16
190
+ else:
191
+ precision = 32
192
+
193
+ out = {}
194
+ for key in features:
195
+ layer = getattr(self, "preprocessor_%s" % key)
196
+ layer.to(x.device)
197
+ dtype = x.dtype
198
+ out[key] = layer(x.float())[..., :-1]
199
+ if precision == 16:
200
+ out[key] = out[key].half()
201
+ if out[key].dtype != dtype:
202
+ out[key].to(dtype=dtype)
203
+ return out
204
+
205
+ def encoder(self, x, *, attention_mask=None, is_features_only=False):
206
+ """2-layer conv + w2v-conformer"""
207
+ x = self.conv(x)
208
+ mask_indices = None
209
+ if attention_mask is None:
210
+ out = self.conformer(x, output_hidden_states=True)
211
+ else:
212
+ attention_mask = attention_mask.bool()
213
+ skip_n = int(attention_mask.size(-1) / x.size(1))
214
+ attention_mask = attention_mask[:, ::skip_n]
215
+ attention_mask = attention_mask[:, :x.size(1)]
216
+ out = self.conformer(x, attention_mask=attention_mask, output_hidden_states=True)
217
+ hidden_emb = out["hidden_states"]
218
+ last_emb = out["last_hidden_state"]
219
+ logits = self.linear(last_emb)
220
+ interval = self.codebook_size
221
+ logits = {
222
+ key: logits[:, :, i * interval : (i + 1) * interval]
223
+ for i, key in enumerate(self.features)
224
+ }
225
+ return logits, hidden_emb, mask_indices
226
+
227
+ @torch.no_grad()
228
+ def normalize(self, x):
229
+ """normalize the input audio to have zero mean unit variance"""
230
+ for key in x.keys():
231
+ x[key] = (x[key] - self.stat["%s_mean" % key]) / self.stat["%s_std" % key]
232
+ return x
233
+
234
+ @torch.no_grad()
235
+ def rearrange(self, x):
236
+ """rearrange the batch to flatten every 4 steps"""
237
+ for key in x.keys():
238
+ if key == "chromagram":
239
+ x[key] = rearrange(x[key], "b f t -> b t f")
240
+ else:
241
+ x[key] = rearrange(x[key], "b f (t s) -> b t (s f)", s=self.n_fold)
242
+ return x
243
+
244
+ def get_rvq_codes(self, inp, raw_wav):
245
+ if self.use_rvq_target:
246
+ quantized_prompt_embeds, codes, _, commitment_loss, codebook_loss, rvq_usage = self.rvq(inp)
247
+ return codes
248
+ if self.use_vq_target:
249
+ quantized_prompt_embeds, commitment_loss, codebook_loss, codes, _ = self.rvq(inp)
250
+ return codes.unsqueeze(1)
251
+ if self.use_encodec_target:
252
+ encoded_frames = self.rvq.encode(raw_wav.unsqueeze(1)) #list, B,[ 8,T ]
253
+ codes = torch.cat([encoded[0].detach() for encoded in encoded_frames], dim=-1)
254
+ if self.label_rate == 25:
255
+ codes = codes[:, :, ::3]
256
+ return codes
257
+
258
+ @torch.no_grad()
259
+ def tokenize(self, x, raw_wav):
260
+ out = {}
261
+ for key in x.keys():
262
+ if self.use_rvq_like_target:
263
+ self.rvq.eval()
264
+ inp = x[key].permute((0, 2, 1))
265
+ codes = self.get_rvq_codes(inp, raw_wav)
266
+ out[key] = torch.cat([codes[:, idx, ...] for idx in range(int(self.codebook_size//1024))], dim=-1)
267
+ else:
268
+ layer = getattr(self, "quantizer_%s" % key)
269
+ out[key] = layer(x[key])
270
+ return out
271
+
272
+ def get_targets(self, x, label=None):
273
+ if self.use_encodec_target:
274
+ raw_x = x.clone()
275
+ else:
276
+ raw_x = None
277
+ x = self.preprocessing(x, features=self.features)
278
+ x = self.normalize(x)
279
+ x = self.rearrange(x)
280
+ melspec = x['melspec_2048']
281
+ if label is None:
282
+ # Use labels from Mel-RVQ
283
+ target_tokens = self.tokenize(x, raw_x)
284
+ else:
285
+ # Use labels pre-extracted for iteration training
286
+ target_tokens = {'melspec_2048': rearrange(label, "b n s -> b (n s)").long()}
287
+ return target_tokens, melspec
288
+
289
+ def get_predictions(self, x, *, mask=None, attention_mask=None, return_new_mask=False, is_features_only=False):
290
+ # preprocessing
291
+ x = self.preprocessing(x, features=["melspec_2048"])
292
+ x = self.normalize(x)
293
+
294
+ # encoding
295
+ logits, hidden_emb, new_mask = self.encoder(x["melspec_2048"], attention_mask=attention_mask, is_features_only=is_features_only)
296
+
297
+ if return_new_mask:
298
+ return logits, hidden_emb, mask if new_mask is None else new_mask
299
+ else:
300
+ return logits, hidden_emb
301
+
302
+ def get_latent(self, x, layer_ix=12):
303
+ _, hidden_states = self.get_predictions(x)
304
+ emb = hidden_states[layer_ix]
305
+ return emb
306
+
307
+ def compute_nce(self, x, pos, negs):
308
+ neg_is_pos = (pos == negs).all(-1)
309
+ pos = pos.unsqueeze(0)
310
+ targets = torch.cat([pos, negs], dim=0)
311
+
312
+ logits = torch.cosine_similarity(x.float(), targets.float(), dim=-1).type_as(x)
313
+ logits /= 0.1
314
+ if neg_is_pos.any():
315
+ logits[1:][neg_is_pos] = float("-inf")
316
+ logits = logits.transpose(0, 1)
317
+ return logits
318
+
319
+ def get_loss(self, logits, target_tokens, masked_indices):
320
+ losses = {}
321
+ accuracies = {}
322
+ for key in logits.keys():
323
+ if not self.use_rvq_like_target:
324
+ masked_logits = logits[key][tuple(masked_indices.t())]
325
+ masked_tokens = target_tokens[key][tuple(masked_indices.t())]
326
+ else:
327
+ Batch, SeqLen, N_Codebook_x_CodebookSize = logits[key].shape
328
+ Batch, N_Codebook_x_SeqLen = target_tokens[key].shape
329
+ N_Codebook = int(N_Codebook_x_SeqLen // SeqLen)
330
+ target_tokens[key] = rearrange(target_tokens[key], "b (n s) -> b s n", n=N_Codebook) # Batch, SeqLen=750, N_Codebook=4
331
+ masked_logits = logits[key][tuple(masked_indices.t())]
332
+ masked_tokens = target_tokens[key][tuple(masked_indices.t())]
333
+ masked_logits = rearrange(masked_logits, "b (n c) -> (b n) c", n=N_Codebook)
334
+ masked_tokens = rearrange(masked_tokens, "b n -> (b n)", n=N_Codebook)
335
+
336
+ losses[key] = self.loss(masked_logits, masked_tokens)
337
+ accuracies[key] = (
338
+ torch.sum(masked_logits.argmax(-1) == masked_tokens)
339
+ / masked_tokens.numel()
340
+ )
341
+ return losses, accuracies
342
+
343
+ def get_recon_loss(self, last_hidden_emb, melspec, masked_indices):
344
+ pred_melspec = self.recon_proj(last_hidden_emb[tuple(masked_indices.t())])
345
+ target_melspec = melspec[tuple(masked_indices.t())]
346
+ recon_loss = self.recon_loss(pred_melspec, target_melspec)
347
+ return recon_loss
348
+
349
+ def forward(self, x, attention_mask=None, label=None):
350
+ dtype = x.dtype
351
+ # get target feature tokens
352
+ target_tokens, melspec = self.get_targets(x, label=label)
353
+
354
+ # masking
355
+ x, masked_indices = self.masking(x, attention_mask=attention_mask)
356
+
357
+ # forward
358
+ logits, hidden_emb, masked_indices = self.get_predictions(x, mask=masked_indices, attention_mask=attention_mask, return_new_mask=True)
359
+
360
+ # get loss
361
+ losses, accuracies = self.get_loss(logits, target_tokens, masked_indices)
362
+
363
+ if self.recon_loss_ratio:
364
+ losses["recon_loss"] = self.get_recon_loss(hidden_emb[-1], melspec, masked_indices) * self.recon_loss_ratio
365
+
366
+ return logits, hidden_emb, losses, accuracies
src/third_party/MuQ/src/muq/muq/modules/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+
2
+
src/third_party/MuQ/src/muq/muq/modules/conv.py ADDED
@@ -0,0 +1,77 @@
1
+ from torch import nn
2
+ from einops import rearrange
3
+
4
+
5
+ class Res2dModule(nn.Module):
6
+ def __init__(self, idim, odim, stride=(2, 2)):
7
+ super(Res2dModule, self).__init__()
8
+ self.conv1 = nn.Conv2d(idim, odim, 3, padding=1, stride=stride)
9
+ self.bn1 = nn.BatchNorm2d(odim)
10
+ self.conv2 = nn.Conv2d(odim, odim, 3, padding=1)
11
+ self.bn2 = nn.BatchNorm2d(odim)
12
+ self.relu = nn.ReLU()
13
+
14
+ # residual
15
+ self.diff = False
16
+ if (idim != odim) or (stride[0] > 1):
17
+ self.conv3 = nn.Conv2d(idim, odim, 3, padding=1, stride=stride)
18
+ self.bn3 = nn.BatchNorm2d(odim)
19
+ self.diff = True
20
+
21
+ def forward(self, x):
22
+ out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
23
+ if self.diff:
24
+ x = self.bn3(self.conv3(x))
25
+ out = x + out
26
+ out = self.relu(out)
27
+ return out
28
+
29
+
30
+ class Conv2dSubsampling(nn.Module):
31
+ """Convolutional 2D subsampling (to 1/4 length).
32
+
33
+ Args:
34
+ idim (int): Input dimension.
35
+ hdim (int): Hidden dimension.
36
+ odim (int): Output dimension.
37
+ strides (list): Sizes of strides.
38
+ n_bands (int): Number of frequency bands.
39
+ """
40
+
41
+ def __init__(self, idim, hdim, odim, strides=[2, 2], n_bands=64):
42
+ """Construct an Conv2dSubsampling object."""
43
+ super(Conv2dSubsampling, self).__init__()
44
+
45
+ self.conv = nn.Sequential(
46
+ Res2dModule(idim, hdim, (2, strides[0])),
47
+ Res2dModule(hdim, hdim, (2, strides[1])),
48
+ )
49
+ self.linear = nn.Linear(hdim * n_bands // 2 // 2, odim)
50
+
51
+ def forward(self, x):
52
+ """Subsample x.
53
+
54
+ Args:
55
+ x (torch.Tensor): Input tensor (#batch, idim, time).
56
+
57
+ Returns:
58
+ torch.Tensor: Subsampled tensor (#batch, time', odim),
59
+ where time' = time // 4.
60
+ """
61
+
62
+ if x.dim() == 3:
63
+ x = x.unsqueeze(1) # (b, c, f, t)
64
+ x = self.conv(x)
65
+ x = rearrange(x, "b c f t -> b t (c f)")
66
+ x = self.linear(x)
67
+ return x
68
+
69
+ if __name__ == '__main__':
70
+ import torch
71
+ conv_dim, encoder_dim = 512, 1024
72
+ conv = Conv2dSubsampling(
73
+ 1, conv_dim, encoder_dim, strides=[2, 1], n_bands=128
74
+ )
75
+ inp = torch.randn((1, 128, 3000))
76
+ out = conv(inp)
77
+ print(out.shape)
src/third_party/MuQ/src/muq/muq/modules/features.py ADDED
@@ -0,0 +1,37 @@
1
+ import torchaudio
2
+ from torch import nn
3
+ import torch
4
+
5
+
6
+ class MelSTFT:
7
+ def __init__(
8
+ self,
9
+ sample_rate=24000,
10
+ n_fft=2048,
11
+ hop_length=240,
12
+ n_mels=128,
13
+ is_db=False,
14
+ ):
15
+ super(MelSTFT, self).__init__()
16
+
17
+ # spectrogram
18
+ self.mel_stft = torchaudio.transforms.MelSpectrogram(
19
+ sample_rate=sample_rate, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
20
+ )
21
+
22
+ # amplitude to decibel
23
+ self.is_db = is_db
24
+ if is_db:
25
+ self.amplitude_to_db = torchaudio.transforms.AmplitudeToDB()
26
+
27
+ def __call__(self, waveform):
28
+ if self.is_db:
29
+ return self.amplitude_to_db(self.mel_stft(waveform))
30
+ else:
31
+ return self.mel_stft(waveform)
32
+
33
+ def to(self, device):
34
+ self.mel_stft = self.mel_stft.to(device)
35
+ if self.is_db:
36
+ self.amplitude_to_db = self.amplitude_to_db.to(device)
37
+ return self
src/third_party/MuQ/src/muq/muq/modules/flash_conformer.py ADDED
@@ -0,0 +1,2114 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 The Fairseq Authors and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ PyTorch Wav2Vec2-Conformer model."""
16
+
17
+ import math
18
+ from dataclasses import dataclass
19
+ from typing import Optional, Tuple, Union
20
+
21
+ import numpy as np
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+ from torch.nn import CrossEntropyLoss
26
+ from torch.nn import functional as F
27
+
28
+ from transformers.activations import ACT2FN
29
+ from transformers.deepspeed import is_deepspeed_zero3_enabled
30
+ from transformers.modeling_outputs import (
31
+ BaseModelOutput,
32
+ CausalLMOutput,
33
+ SequenceClassifierOutput,
34
+ TokenClassifierOutput,
35
+ Wav2Vec2BaseModelOutput,
36
+ XVectorOutput,
37
+ )
38
+ from transformers.modeling_utils import PreTrainedModel
39
+ from transformers.utils import (
40
+ ModelOutput,
41
+ add_code_sample_docstrings,
42
+ add_start_docstrings,
43
+ add_start_docstrings_to_model_forward,
44
+ logging,
45
+ replace_return_docstrings,
46
+ )
47
+ from transformers.models.wav2vec2_conformer.configuration_wav2vec2_conformer import Wav2Vec2ConformerConfig
48
+
49
+
50
+ logger = logging.get_logger(__name__)
51
+
52
+
53
+ _HIDDEN_STATES_START_POSITION = 2
54
+
55
+ # General docstring
56
+ _CONFIG_FOR_DOC = "Wav2Vec2ConformerConfig"
57
+
58
+ # Base docstring
59
+ _CHECKPOINT_FOR_DOC = "facebook/wav2vec2-conformer-rope-large-960h-ft"
60
+ _EXPECTED_OUTPUT_SHAPE = [1, 292, 1024]
61
+
62
+ # CTC docstring
63
+ _CTC_EXPECTED_OUTPUT = "'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL'"
64
+ _CTC_EXPECTED_LOSS = 64.21
65
+
66
+
67
+ WAV2VEC2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
68
+ "facebook/wav2vec2-conformer-rel-pos-large",
69
+ # See all Wav2Vec2Conformer models at https://huggingface.co/models?filter=wav2vec2-conformer
70
+ ]
71
+
72
+
73
+ @dataclass
74
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput with Wav2Vec2->Wav2Vec2Conformer
75
+ class Wav2Vec2ConformerForPreTrainingOutput(ModelOutput):
76
+ """
77
+ Output type of [`Wav2Vec2ConformerForPreTraining`], with potential hidden states and attentions.
78
+
79
+ Args:
80
+ loss (*optional*, returned when `sample_negative_indices` are passed, `torch.FloatTensor` of shape `(1,)`):
81
+ Total loss as the sum of the contrastive loss (L_m) and the diversity loss (L_d) as stated in the [official
82
+ paper](https://arxiv.org/pdf/2006.11477.pdf) . (classification) loss.
83
+ projected_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.proj_codevector_dim)`):
84
+ Hidden-states of the model projected to *config.proj_codevector_dim* that can be used to predict the masked
85
+ projected quantized states.
86
+ projected_quantized_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.proj_codevector_dim)`):
87
+ Quantized extracted feature vectors projected to *config.proj_codevector_dim* representing the positive
88
+ target vectors for contrastive loss.
89
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
90
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
91
+ shape `(batch_size, sequence_length, hidden_size)`.
92
+
93
+ Hidden-states of the model at the output of each layer plus the initial embedding outputs.
94
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
95
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
96
+ sequence_length)`.
97
+
98
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
99
+ heads.
100
+ contrastive_loss (*optional*, returned when `sample_negative_indices` are passed, `torch.FloatTensor` of shape `(1,)`):
101
+ The contrastive loss (L_m) as stated in the [official paper](https://arxiv.org/pdf/2006.11477.pdf) .
102
+ diversity_loss (*optional*, returned when `sample_negative_indices` are passed, `torch.FloatTensor` of shape `(1,)`):
103
+ The diversity loss (L_d) as stated in the [official paper](https://arxiv.org/pdf/2006.11477.pdf) .
104
+ """
105
+
106
+ loss: Optional[torch.FloatTensor] = None
107
+ projected_states: torch.FloatTensor = None
108
+ projected_quantized_states: torch.FloatTensor = None
109
+ codevector_perplexity: torch.FloatTensor = None
110
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
111
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
112
+ contrastive_loss: Optional[torch.FloatTensor] = None
113
+ diversity_loss: Optional[torch.FloatTensor] = None
114
+
115
+
116
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2._compute_mask_indices
117
+ def _compute_mask_indices(
118
+ shape: Tuple[int, int],
119
+ mask_prob: float,
120
+ mask_length: int,
121
+ attention_mask: Optional[torch.LongTensor] = None,
122
+ min_masks: int = 0,
123
+ ) -> np.ndarray:
124
+ """
125
+ Computes random mask spans for a given shape. Used to implement [SpecAugment: A Simple Data Augmentation Method for
126
+ ASR](https://arxiv.org/abs/1904.08779). Note that this method is not optimized to run on TPU and should be run on
127
+ CPU as part of the preprocessing during training.
128
+
129
+ Args:
130
+ shape: The shape for which to compute masks. This should be of a tuple of size 2 where
131
+ the first element is the batch size and the second element is the length of the axis to span.
132
+ mask_prob: The percentage of the whole axis (between 0 and 1) which will be masked. The number of
133
+ independently generated mask spans of length `mask_length` is computed by
134
+ `mask_prob*shape[1]/mask_length`. Note that due to overlaps, `mask_prob` is an upper bound and the
135
+ actual percentage will be smaller.
136
+ mask_length: size of the mask
137
+ min_masks: minimum number of masked spans
138
+ attention_mask: A (right-padded) attention mask which independently shortens the feature axis of
139
+ each batch dimension.
140
+ """
141
+ batch_size, sequence_length = shape
142
+
143
+ if mask_length < 1:
144
+ raise ValueError("`mask_length` has to be bigger than 0.")
145
+
146
+ if mask_length > sequence_length:
147
+ raise ValueError(
148
+ f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length}"
149
+ f" and `sequence_length`: {sequence_length}`"
150
+ )
151
+
152
+ # epsilon is used for probabilistic rounding
153
+ epsilon = np.random.rand(1).item()
154
+
155
+ def compute_num_masked_span(input_length):
156
+ """Given input length, compute how many spans should be masked"""
157
+ num_masked_span = int(mask_prob * input_length / mask_length + epsilon)
158
+ num_masked_span = max(num_masked_span, min_masks)
159
+
160
+ # make sure num masked span <= sequence_length
161
+ if num_masked_span * mask_length > sequence_length:
162
+ num_masked_span = sequence_length // mask_length
163
+
164
+ # make sure num_masked span is also <= input_length - (mask_length - 1)
165
+ if input_length - (mask_length - 1) < num_masked_span:
166
+ num_masked_span = max(input_length - (mask_length - 1), 0)
167
+
168
+ return num_masked_span
169
+
170
+ # compute number of masked spans in batch
171
+ input_lengths = (
172
+ attention_mask.sum(-1).detach().tolist()
173
+ if attention_mask is not None
174
+ else [sequence_length for _ in range(batch_size)]
175
+ )
176
+
177
+ # SpecAugment mask to fill
178
+ spec_aug_mask = np.zeros((batch_size, sequence_length), dtype=bool)
179
+ spec_aug_mask_idxs = []
180
+
181
+ max_num_masked_span = compute_num_masked_span(sequence_length)
182
+
183
+ if max_num_masked_span == 0:
184
+ return spec_aug_mask
185
+
186
+ for input_length in input_lengths:
187
+ # compute num of masked spans for this input
188
+ num_masked_span = compute_num_masked_span(input_length)
189
+
190
+ # get random indices to mask
191
+ spec_aug_mask_idx = np.random.choice(
192
+ np.arange(input_length - (mask_length - 1)), num_masked_span, replace=False
193
+ )
194
+
195
+ # pick first sampled index that will serve as a dummy index to pad vector
196
+ # to ensure same dimension for all batches due to probabilistic rounding
197
+ # Picking first sample just pads those vectors twice.
198
+ if len(spec_aug_mask_idx) == 0:
199
+ # this case can only happen if `input_length` is strictly smaller then
200
+ # `sequence_length` in which case the last token has to be a padding
201
+ # token which we can use as a dummy mask id
202
+ dummy_mask_idx = sequence_length - 1
203
+ else:
204
+ dummy_mask_idx = spec_aug_mask_idx[0]
205
+
206
+ spec_aug_mask_idx = np.concatenate(
207
+ [spec_aug_mask_idx, np.ones(max_num_masked_span - num_masked_span, dtype=np.int32) * dummy_mask_idx]
208
+ )
209
+ spec_aug_mask_idxs.append(spec_aug_mask_idx)
210
+
211
+ spec_aug_mask_idxs = np.array(spec_aug_mask_idxs)
212
+
213
+ # expand masked indices to masked spans
214
+ spec_aug_mask_idxs = np.broadcast_to(
215
+ spec_aug_mask_idxs[:, :, None], (batch_size, max_num_masked_span, mask_length)
216
+ )
217
+ spec_aug_mask_idxs = spec_aug_mask_idxs.reshape(batch_size, max_num_masked_span * mask_length)
218
+
219
+ # add offset to the starting indexes so that indexes now create a span
220
+ offsets = np.arange(mask_length)[None, None, :]
221
+ offsets = np.broadcast_to(offsets, (batch_size, max_num_masked_span, mask_length)).reshape(
222
+ batch_size, max_num_masked_span * mask_length
223
+ )
224
+ spec_aug_mask_idxs = spec_aug_mask_idxs + offsets
225
+
226
+ # ensure that we cannot have indices larger than sequence_length
227
+ if spec_aug_mask_idxs.max() > sequence_length - 1:
228
+ spec_aug_mask_idxs[spec_aug_mask_idxs > sequence_length - 1] = sequence_length - 1
229
+
230
+ # scatter indices to mask
231
+ np.put_along_axis(spec_aug_mask, spec_aug_mask_idxs, 1, -1)
232
+
233
+ return spec_aug_mask
234
+
235
+
236
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2._sample_negative_indices
237
+ def _sample_negative_indices(
238
+ features_shape: Tuple, num_negatives: int, mask_time_indices: Optional[np.ndarray] = None
239
+ ):
240
+ """
241
+ Sample `num_negatives` vectors from feature vectors.
242
+ """
243
+ batch_size, sequence_length = features_shape
244
+
245
+ # generate indices of the positive vectors themselves, repeat them `num_negatives` times
246
+ sequence_length_range = np.arange(sequence_length)
247
+
248
+ # get `num_negatives` random vector indices from the same utterance
249
+ sampled_negative_indices = np.zeros(shape=(batch_size, sequence_length, num_negatives), dtype=np.int32)
250
+
251
+ mask_time_indices = (
252
+ mask_time_indices.astype(bool) if mask_time_indices is not None else np.ones(features_shape, dtype=bool)
253
+ )
254
+
255
+ for batch_idx in range(batch_size):
256
+ high = mask_time_indices[batch_idx].sum() - 1
257
+ mapped_masked_indices = sequence_length_range[mask_time_indices[batch_idx]]
258
+
259
+ feature_indices = np.broadcast_to(np.arange(high + 1)[:, None], (high + 1, num_negatives))
260
+ sampled_indices = np.random.randint(0, high, size=(high + 1, num_negatives))
261
+ # avoid sampling the same positive vector, but keep the distribution uniform
262
+ sampled_indices[sampled_indices >= feature_indices] += 1
263
+
264
+ # remap to actual indices
265
+ sampled_negative_indices[batch_idx][mask_time_indices[batch_idx]] = mapped_masked_indices[sampled_indices]
266
+
267
+ # correct for batch size
268
+ sampled_negative_indices[batch_idx] += batch_idx * sequence_length
269
+
270
+ return sampled_negative_indices
271
+
272
+
273
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2NoLayerNormConvLayer with Wav2Vec2->Wav2Vec2Conformer
274
+ class Wav2Vec2ConformerNoLayerNormConvLayer(nn.Module):
275
+ def __init__(self, config, layer_id=0):
276
+ super().__init__()
277
+ self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
278
+ self.out_conv_dim = config.conv_dim[layer_id]
279
+
280
+ self.conv = nn.Conv1d(
281
+ self.in_conv_dim,
282
+ self.out_conv_dim,
283
+ kernel_size=config.conv_kernel[layer_id],
284
+ stride=config.conv_stride[layer_id],
285
+ bias=config.conv_bias,
286
+ )
287
+ self.activation = ACT2FN[config.feat_extract_activation]
288
+
289
+ def forward(self, hidden_states):
290
+ hidden_states = self.conv(hidden_states)
291
+ hidden_states = self.activation(hidden_states)
292
+ return hidden_states
293
+
294
+
295
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2LayerNormConvLayer with Wav2Vec2->Wav2Vec2Conformer
296
+ class Wav2Vec2ConformerLayerNormConvLayer(nn.Module):
297
+ def __init__(self, config, layer_id=0):
298
+ super().__init__()
299
+ self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
300
+ self.out_conv_dim = config.conv_dim[layer_id]
301
+
302
+ self.conv = nn.Conv1d(
303
+ self.in_conv_dim,
304
+ self.out_conv_dim,
305
+ kernel_size=config.conv_kernel[layer_id],
306
+ stride=config.conv_stride[layer_id],
307
+ bias=config.conv_bias,
308
+ )
309
+ self.layer_norm = nn.LayerNorm(self.out_conv_dim, elementwise_affine=True)
310
+ self.activation = ACT2FN[config.feat_extract_activation]
311
+
312
+ def forward(self, hidden_states):
313
+ hidden_states = self.conv(hidden_states)
314
+
315
+ hidden_states = hidden_states.transpose(-2, -1)
316
+ hidden_states = self.layer_norm(hidden_states)
317
+ hidden_states = hidden_states.transpose(-2, -1)
318
+
319
+ hidden_states = self.activation(hidden_states)
320
+ return hidden_states
321
+
322
+
323
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2GroupNormConvLayer with Wav2Vec2->Wav2Vec2Conformer
324
+ class Wav2Vec2ConformerGroupNormConvLayer(nn.Module):
325
+ def __init__(self, config, layer_id=0):
326
+ super().__init__()
327
+ self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
328
+ self.out_conv_dim = config.conv_dim[layer_id]
329
+
330
+ self.conv = nn.Conv1d(
331
+ self.in_conv_dim,
332
+ self.out_conv_dim,
333
+ kernel_size=config.conv_kernel[layer_id],
334
+ stride=config.conv_stride[layer_id],
335
+ bias=config.conv_bias,
336
+ )
337
+ self.activation = ACT2FN[config.feat_extract_activation]
338
+
339
+ self.layer_norm = nn.GroupNorm(num_groups=self.out_conv_dim, num_channels=self.out_conv_dim, affine=True)
340
+
341
+ def forward(self, hidden_states):
342
+ hidden_states = self.conv(hidden_states)
343
+ hidden_states = self.layer_norm(hidden_states)
344
+ hidden_states = self.activation(hidden_states)
345
+ return hidden_states
346
+
347
+
348
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2PositionalConvEmbedding with Wav2Vec2->Wav2Vec2Conformer
349
+ class Wav2Vec2ConformerPositionalConvEmbedding(nn.Module):
350
+ def __init__(self, config):
351
+ super().__init__()
352
+ self.conv = nn.Conv1d(
353
+ config.hidden_size,
354
+ config.hidden_size,
355
+ kernel_size=config.num_conv_pos_embeddings,
356
+ padding=config.num_conv_pos_embeddings // 2,
357
+ groups=config.num_conv_pos_embedding_groups,
358
+ )
359
+
360
+ if is_deepspeed_zero3_enabled():
361
+ import deepspeed
362
+
363
+ with deepspeed.zero.GatheredParameters(self.conv.weight, modifier_rank=0):
364
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
365
+ deepspeed.zero.register_external_parameter(self, self.conv.weight_v)
366
+ deepspeed.zero.register_external_parameter(self, self.conv.weight_g)
367
+ else:
368
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
369
+
370
+ self.padding = Wav2Vec2ConformerSamePadLayer(config.num_conv_pos_embeddings)
371
+ self.activation = ACT2FN[config.feat_extract_activation]
372
+
373
+ def forward(self, hidden_states):
374
+ hidden_states = hidden_states.transpose(1, 2)
375
+
376
+ hidden_states = self.conv(hidden_states)
377
+ hidden_states = self.padding(hidden_states)
378
+ hidden_states = self.activation(hidden_states)
379
+
380
+ hidden_states = hidden_states.transpose(1, 2)
381
+ return hidden_states
382
+
383
+
384
+ class Wav2Vec2ConformerRotaryPositionalEmbedding(nn.Module):
385
+ """Rotary positional embedding
386
+ Reference : https://blog.eleuther.ai/rotary-embeddings/ Paper: https://arxiv.org/pdf/2104.09864.pdf
387
+ """
388
+
389
+ def __init__(self, config):
390
+ super().__init__()
391
+ dim = config.hidden_size // config.num_attention_heads
392
+ base = config.rotary_embedding_base
393
+
394
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
395
+ self.register_buffer("inv_freq", inv_freq)
396
+ self.cached_sequence_length = None
397
+ self.cached_rotary_positional_embedding = None
398
+
399
+ def forward(self, hidden_states):
400
+ sequence_length = hidden_states.shape[1]
401
+
402
+ if sequence_length == self.cached_sequence_length and self.cached_rotary_positional_embedding is not None:
403
+ return self.cached_rotary_positional_embedding
404
+
405
+ self.cached_sequence_length = sequence_length
406
+ time_stamps = torch.arange(sequence_length).type_as(self.inv_freq)
407
+ freqs = torch.einsum("i,j->ij", time_stamps, self.inv_freq)
408
+ embeddings = torch.cat((freqs, freqs), dim=-1)
409
+
410
+ cos_embeddings = embeddings.cos()[:, None, None, :]
411
+ sin_embeddings = embeddings.sin()[:, None, None, :]
412
+ self.cached_rotary_positional_embedding = torch.stack([cos_embeddings, sin_embeddings])
413
+ return self.cached_rotary_positional_embedding
414
+
415
+
416
+ class Wav2Vec2ConformerRelPositionalEmbedding(nn.Module):
417
+ """Relative positional encoding module."""
418
+
419
+ def __init__(self, config):
420
+ super().__init__()
421
+ self.max_len = config.max_source_positions
422
+ self.d_model = config.hidden_size
423
+ self.pe = None
424
+ self.extend_pe(torch.tensor(0.0).expand(1, self.max_len))
425
+
426
+ def extend_pe(self, x):
427
+ # Reset the positional encodings
428
+ if self.pe is not None:
429
+ # self.pe contains both positive and negative parts
430
+ # the length of self.pe is 2 * input_len - 1
431
+ if self.pe.size(1) >= x.size(1) * 2 - 1:
432
+ if self.pe.dtype != x.dtype or self.pe.device != x.device:
433
+ self.pe = self.pe.to(dtype=x.dtype, device=x.device)
434
+ return
435
+ # Suppose `i` is the position of query vector and `j` is the
436
+ # position of key vector. We use positive relative positions when keys
437
+ # are to the left (i>j) and negative relative positions otherwise (i<j).
438
+ pe_positive = torch.zeros(x.size(1), self.d_model)
439
+ pe_negative = torch.zeros(x.size(1), self.d_model)
440
+ position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
441
+ div_term = torch.exp(
442
+ torch.arange(0, self.d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / self.d_model)
443
+ )
444
+ pe_positive[:, 0::2] = torch.sin(position * div_term)
445
+ pe_positive[:, 1::2] = torch.cos(position * div_term)
446
+ pe_negative[:, 0::2] = torch.sin(-1 * position * div_term)
447
+ pe_negative[:, 1::2] = torch.cos(-1 * position * div_term)
448
+
449
+ # Reverse the order of positive indices and concat both positive and
450
+ # negative indices. This is used to support the shifting trick
451
+ # as in https://arxiv.org/abs/1901.02860
452
+ pe_positive = torch.flip(pe_positive, [0]).unsqueeze(0)
453
+ pe_negative = pe_negative[1:].unsqueeze(0)
454
+ pe = torch.cat([pe_positive, pe_negative], dim=1)
455
+ self.pe = pe.to(device=x.device, dtype=x.dtype)
456
+
457
+ def forward(self, hidden_states: torch.Tensor):
458
+ self.extend_pe(hidden_states)
459
+ start_idx = self.pe.size(1) // 2 - hidden_states.size(1) + 1
460
+ end_idx = self.pe.size(1) // 2 + hidden_states.size(1)
461
+ relative_position_embeddings = self.pe[:, start_idx:end_idx]
462
+
463
+ return relative_position_embeddings
464
+
465
+
466
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2SamePadLayer with Wav2Vec2->Wav2Vec2Conformer
467
+ class Wav2Vec2ConformerSamePadLayer(nn.Module):
468
+ def __init__(self, num_conv_pos_embeddings):
469
+ super().__init__()
470
+ self.num_pad_remove = 1 if num_conv_pos_embeddings % 2 == 0 else 0
471
+
472
+ def forward(self, hidden_states):
473
+ if self.num_pad_remove > 0:
474
+ hidden_states = hidden_states[:, :, : -self.num_pad_remove]
475
+ return hidden_states
476
+
477
+
478
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2FeatureEncoder with Wav2Vec2->Wav2Vec2Conformer
479
+ class Wav2Vec2ConformerFeatureEncoder(nn.Module):
480
+ """Construct the features from raw audio waveform"""
481
+
482
+ def __init__(self, config):
483
+ super().__init__()
484
+
485
+ if config.feat_extract_norm == "group":
486
+ conv_layers = [Wav2Vec2ConformerGroupNormConvLayer(config, layer_id=0)] + [
487
+ Wav2Vec2ConformerNoLayerNormConvLayer(config, layer_id=i + 1)
488
+ for i in range(config.num_feat_extract_layers - 1)
489
+ ]
490
+ elif config.feat_extract_norm == "layer":
491
+ conv_layers = [
492
+ Wav2Vec2ConformerLayerNormConvLayer(config, layer_id=i) for i in range(config.num_feat_extract_layers)
493
+ ]
494
+ else:
495
+ raise ValueError(
496
+ f"`config.feat_extract_norm` is {config.feat_extract_norm}, but has to be one of ['group', 'layer']"
497
+ )
498
+ self.conv_layers = nn.ModuleList(conv_layers)
499
+ self.gradient_checkpointing = False
500
+ self._requires_grad = True
501
+
502
+ def _freeze_parameters(self):
503
+ for param in self.parameters():
504
+ param.requires_grad = False
505
+ self._requires_grad = False
506
+
507
+ def forward(self, input_values):
508
+ hidden_states = input_values[:, None]
509
+
510
+ # make sure hidden_states require grad for gradient_checkpointing
511
+ if self._requires_grad and self.training:
512
+ hidden_states.requires_grad = True
513
+
514
+ for conv_layer in self.conv_layers:
515
+ if self._requires_grad and self.gradient_checkpointing and self.training:
516
+
517
+ def create_custom_forward(module):
518
+ def custom_forward(*inputs):
519
+ return module(*inputs)
520
+
521
+ return custom_forward
522
+
523
+ hidden_states = torch.utils.checkpoint.checkpoint(
524
+ create_custom_forward(conv_layer),
525
+ hidden_states,
526
+ )
527
+ else:
528
+ hidden_states = conv_layer(hidden_states)
529
+
530
+ return hidden_states
531
+
532
+
533
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2FeatureProjection with Wav2Vec2->Wav2Vec2Conformer
534
+ class Wav2Vec2ConformerFeatureProjection(nn.Module):
535
+ def __init__(self, config):
536
+ super().__init__()
537
+ self.layer_norm = nn.LayerNorm(config.conv_dim[-1], eps=config.layer_norm_eps)
538
+ self.projection = nn.Linear(config.conv_dim[-1], config.hidden_size)
539
+ self.dropout = nn.Dropout(config.feat_proj_dropout)
540
+
541
+ def forward(self, hidden_states):
542
+ # non-projected hidden states are needed for quantization
543
+ norm_hidden_states = self.layer_norm(hidden_states)
544
+ hidden_states = self.projection(norm_hidden_states)
545
+ hidden_states = self.dropout(hidden_states)
546
+ return hidden_states, norm_hidden_states
547
+
548
+
549
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2FeedForward with Wav2Vec2->Wav2Vec2Conformer
550
+ class Wav2Vec2ConformerFeedForward(nn.Module):
551
+ def __init__(self, config):
552
+ super().__init__()
553
+ self.intermediate_dropout = nn.Dropout(config.activation_dropout)
554
+
555
+ self.intermediate_dense = nn.Linear(config.hidden_size, config.intermediate_size)
556
+ if isinstance(config.hidden_act, str):
557
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
558
+ else:
559
+ self.intermediate_act_fn = config.hidden_act
560
+
561
+ self.output_dense = nn.Linear(config.intermediate_size, config.hidden_size)
562
+ self.output_dropout = nn.Dropout(config.hidden_dropout)
563
+
564
+ def forward(self, hidden_states):
565
+ hidden_states = self.intermediate_dense(hidden_states)
566
+ hidden_states = self.intermediate_act_fn(hidden_states)
567
+ hidden_states = self.intermediate_dropout(hidden_states)
568
+
569
+ hidden_states = self.output_dense(hidden_states)
570
+ hidden_states = self.output_dropout(hidden_states)
571
+ return hidden_states
572
+
573
+
574
+ class Wav2Vec2ConformerConvolutionModule(nn.Module):
575
+ """Convolution block used in the conformer block"""
576
+
577
+ def __init__(self, config):
578
+ super().__init__()
579
+ if (config.conv_depthwise_kernel_size - 1) % 2 == 1:
580
+ raise ValueError("`config.conv_depthwise_kernel_size` should be a odd number for 'SAME' padding")
581
+ self.layer_norm = nn.LayerNorm(config.hidden_size)
582
+ self.pointwise_conv1 = torch.nn.Conv1d(
583
+ config.hidden_size,
584
+ 2 * config.hidden_size,
585
+ kernel_size=1,
586
+ stride=1,
587
+ padding=0,
588
+ bias=False,
589
+ )
590
+ self.glu = torch.nn.GLU(dim=1)
591
+ self.depthwise_conv = torch.nn.Conv1d(
592
+ config.hidden_size,
593
+ config.hidden_size,
594
+ config.conv_depthwise_kernel_size,
595
+ stride=1,
596
+ padding=(config.conv_depthwise_kernel_size - 1) // 2,
597
+ groups=config.hidden_size,
598
+ bias=False,
599
+ )
600
+ self.batch_norm = torch.nn.BatchNorm1d(config.hidden_size)
601
+ self.activation = ACT2FN[config.hidden_act]
602
+ self.pointwise_conv2 = torch.nn.Conv1d(
603
+ config.hidden_size,
604
+ config.hidden_size,
605
+ kernel_size=1,
606
+ stride=1,
607
+ padding=0,
608
+ bias=False,
609
+ )
610
+ self.dropout = torch.nn.Dropout(config.conformer_conv_dropout)
611
+
612
+ def forward(self, hidden_states):
613
+ hidden_states = self.layer_norm(hidden_states)
614
+ # exchange the temporal dimension and the feature dimension
615
+ hidden_states = hidden_states.transpose(1, 2)
616
+
617
+ # GLU mechanism
618
+ # => (batch, 2*channel, dim)
619
+ hidden_states = self.pointwise_conv1(hidden_states)
620
+ # => (batch, channel, dim)
621
+ hidden_states = self.glu(hidden_states)
622
+
623
+ # 1D Depthwise Conv
624
+ hidden_states = self.depthwise_conv(hidden_states)
625
+ hidden_states = self.batch_norm(hidden_states)
626
+ hidden_states = self.activation(hidden_states)
627
+
628
+ hidden_states = self.pointwise_conv2(hidden_states)
629
+ hidden_states = self.dropout(hidden_states)
630
+ hidden_states = hidden_states.transpose(1, 2)
631
+ return hidden_states
632
+
633
+
634
+ class Wav2Vec2ConformerSelfAttention(nn.Module):
635
+ """Construct an Wav2Vec2ConformerSelfAttention object.
636
+ Can be enhanced with rotary or relative position embeddings.
637
+ """
638
+
639
+ def __init__(self, config):
640
+ super().__init__()
641
+
642
+ self.head_size = config.hidden_size // config.num_attention_heads
643
+ self.num_heads = config.num_attention_heads
644
+ self.position_embeddings_type = config.position_embeddings_type
645
+
646
+ self.linear_q = nn.Linear(config.hidden_size, config.hidden_size)
647
+ self.linear_k = nn.Linear(config.hidden_size, config.hidden_size)
648
+ self.linear_v = nn.Linear(config.hidden_size, config.hidden_size)
649
+ self.linear_out = nn.Linear(config.hidden_size, config.hidden_size)
650
+
651
+ self.dropout = nn.Dropout(p=config.attention_dropout)
652
+ self.dropout_p = config.attention_dropout
653
+
654
+ self.is_causal = config.is_causal
655
+
656
+ if self.position_embeddings_type == "relative":
657
+ # linear transformation for positional encoding
658
+ self.linear_pos = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
659
+ # these two learnable bias are used in matrix c and matrix d
660
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
661
+ self.pos_bias_u = nn.Parameter(torch.zeros(self.num_heads, self.head_size))
662
+ self.pos_bias_v = nn.Parameter(torch.zeros(self.num_heads, self.head_size))
663
+
664
+ def forward(
665
+ self,
666
+ hidden_states: torch.Tensor,
667
+ attention_mask: Optional[torch.Tensor] = None,
668
+ relative_position_embeddings: Optional[torch.Tensor] = None,
669
+ output_attentions: bool = False,
670
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
671
+ # self-attention mechanism
672
+ batch_size, sequence_length, hidden_size = hidden_states.size()
673
+
674
+ # make sure query/key states can be != value states
675
+ query_key_states = hidden_states
676
+ value_states = hidden_states
677
+
678
+ if self.position_embeddings_type == "rotary":
679
+ if relative_position_embeddings is None:
680
+ raise ValueError(
681
+ "`relative_position_embeddings` has to be defined when `self.position_embeddings_type == 'rotary'"
682
+ )
683
+ query_key_states = self._apply_rotary_embedding(query_key_states, relative_position_embeddings)
684
+
685
+ # project query_key_states and value_states
686
+ query = self.linear_q(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
687
+ key = self.linear_k(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
688
+ value = self.linear_v(value_states).view(batch_size, -1, self.num_heads, self.head_size)
689
+
690
+ # => (batch, head, time1, d_k)
691
+ query = query.transpose(1, 2)
692
+ key = key.transpose(1, 2)
693
+ value = value.transpose(1, 2)
694
+
695
+ with torch.backends.cuda.sdp_kernel(enable_math=False, enable_flash=True, enable_mem_efficient=False):
696
+ hidden_states = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask, dropout_p=self.dropout_p, is_causal=self.is_causal)
697
+ probs = None
698
+
699
+ # # apply attention_mask if necessary
700
+ # if attention_mask is not None:
701
+ # scores = scores + attention_mask
702
+
703
+ # # => (batch, head, time1, time2)
704
+ # probs = torch.softmax(scores, dim=-1)
705
+ # probs = self.dropout(probs)
706
+
707
+ # # => (batch, head, time1, d_k)
708
+ # hidden_states = torch.matmul(probs, value)
709
+
710
+ # => (batch, time1, hidden_size)
711
+ hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, self.num_heads * self.head_size)
712
+ hidden_states = self.linear_out(hidden_states)
713
+
714
+ return hidden_states, probs
715
+
716
+ def _apply_rotary_embedding(self, hidden_states, relative_position_embeddings):
717
+ batch_size, sequence_length, hidden_size = hidden_states.size()
718
+ hidden_states = hidden_states.view(batch_size, sequence_length, self.num_heads, self.head_size)
719
+
720
+ cos = relative_position_embeddings[0, :sequence_length, ...]
721
+ sin = relative_position_embeddings[1, :sequence_length, ...]
722
+
723
+ # rotate hidden_states with rotary embeddings
724
+ hidden_states = hidden_states.transpose(0, 1)
725
+ rotated_states_begin = hidden_states[..., : self.head_size // 2]
726
+ rotated_states_end = hidden_states[..., self.head_size // 2 :]
727
+ rotated_states = torch.cat((-rotated_states_end, rotated_states_begin), dim=rotated_states_begin.ndim - 1)
728
+ hidden_states = (hidden_states * cos) + (rotated_states * sin)
729
+ hidden_states = hidden_states.transpose(0, 1)
730
+
731
+ hidden_states = hidden_states.view(batch_size, sequence_length, self.num_heads * self.head_size)
732
+
733
+ return hidden_states
734
+
735
+ def _apply_relative_embeddings(self, query, key, relative_position_embeddings):
736
+ # 1. project positional embeddings
737
+ # => (batch, head, 2*time1-1, d_k)
738
+ proj_relative_position_embeddings = self.linear_pos(relative_position_embeddings)
739
+ proj_relative_position_embeddings = proj_relative_position_embeddings.view(
740
+ relative_position_embeddings.size(0), -1, self.num_heads, self.head_size
741
+ )
742
+ proj_relative_position_embeddings = proj_relative_position_embeddings.transpose(1, 2)
743
+ proj_relative_position_embeddings = proj_relative_position_embeddings.transpose(2, 3)
744
+
745
+ # 2. Add bias to query
746
+ # => (batch, head, time1, d_k)
747
+ query = query.transpose(1, 2)
748
+ q_with_bias_u = (query + self.pos_bias_u).transpose(1, 2)
749
+ q_with_bias_v = (query + self.pos_bias_v).transpose(1, 2)
750
+
751
+ # 3. attention score: first compute matrix a and matrix c
752
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
753
+ # => (batch, head, time1, time2)
754
+ scores_ac = torch.matmul(q_with_bias_u, key.transpose(-2, -1))
755
+
756
+ # 4. then compute matrix b and matrix d
757
+ # => (batch, head, time1, 2*time1-1)
758
+ scores_bd = torch.matmul(q_with_bias_v, proj_relative_position_embeddings)
759
+
760
+ # 5. shift matrix b and matrix d
761
+ zero_pad = torch.zeros((*scores_bd.size()[:3], 1), device=scores_bd.device, dtype=scores_bd.dtype)
762
+ scores_bd_padded = torch.cat([zero_pad, scores_bd], dim=-1)
763
+ scores_bd_padded_shape = scores_bd.size()[:2] + (scores_bd.shape[3] + 1, scores_bd.shape[2])
764
+ scores_bd_padded = scores_bd_padded.view(*scores_bd_padded_shape)
765
+ scores_bd = scores_bd_padded[:, :, 1:].view_as(scores_bd)
766
+ scores_bd = scores_bd[:, :, :, : scores_bd.size(-1) // 2 + 1]
767
+
768
+ # 6. sum matrices
769
+ # => (batch, head, time1, time2)
770
+ scores = (scores_ac + scores_bd) / math.sqrt(self.head_size)
771
+
772
+ return scores
773
+
774
+
775
+ class Wav2Vec2ConformerEncoderLayer(nn.Module):
776
+ """Conformer block based on https://arxiv.org/abs/2005.08100."""
777
+
778
+ def __init__(self, config):
779
+ super().__init__()
780
+ embed_dim = config.hidden_size
781
+ dropout = config.attention_dropout
782
+
783
+ # Feed-forward 1
784
+ self.ffn1_layer_norm = nn.LayerNorm(embed_dim)
785
+ self.ffn1 = Wav2Vec2ConformerFeedForward(config)
786
+
787
+ # Self-Attention
788
+ self.self_attn_layer_norm = nn.LayerNorm(embed_dim)
789
+ self.self_attn_dropout = torch.nn.Dropout(dropout)
790
+ self.self_attn = Wav2Vec2ConformerSelfAttention(config)
791
+
792
+ # Conformer Convolution
793
+ self.conv_module = Wav2Vec2ConformerConvolutionModule(config)
794
+
795
+ # Feed-forward 2
796
+ self.ffn2_layer_norm = nn.LayerNorm(embed_dim)
797
+ self.ffn2 = Wav2Vec2ConformerFeedForward(config)
798
+ self.final_layer_norm = nn.LayerNorm(embed_dim)
799
+
800
+ def forward(
801
+ self,
802
+ hidden_states,
803
+ attention_mask: Optional[torch.Tensor] = None,
804
+ relative_position_embeddings: Optional[torch.Tensor] = None,
805
+ output_attentions: bool = False,
806
+ ):
807
+ hidden_states = hidden_states
808
+
809
+ # 1. Feed-Forward 1 layer
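+ # Macaron-style half-step residual: the feed-forward output is scaled by 0.5 before being
+ # added back to the residual, as in the Conformer paper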
810
+ residual = hidden_states
811
+ hidden_states = self.ffn1_layer_norm(hidden_states)
812
+ hidden_states = self.ffn1(hidden_states)
813
+ hidden_states = hidden_states * 0.5 + residual
814
+ residual = hidden_states
815
+
816
+ # 2. Self-Attention layer
817
+ hidden_states = self.self_attn_layer_norm(hidden_states)
818
+ hidden_states, attn_weights = self.self_attn(
819
+ hidden_states=hidden_states,
820
+ attention_mask=attention_mask,
821
+ relative_position_embeddings=relative_position_embeddings,
822
+ output_attentions=output_attentions,
823
+ )
824
+ hidden_states = self.self_attn_dropout(hidden_states)
825
+ hidden_states = hidden_states + residual
826
+
827
+ # 3. Convolutional Layer
828
+ residual = hidden_states
829
+ hidden_states = self.conv_module(hidden_states)
830
+ hidden_states = residual + hidden_states
831
+
832
+ # 4. Feed-Forward 2 Layer
833
+ residual = hidden_states
834
+ hidden_states = self.ffn2_layer_norm(hidden_states)
835
+ hidden_states = self.ffn2(hidden_states)
836
+ hidden_states = hidden_states * 0.5 + residual
837
+ hidden_states = self.final_layer_norm(hidden_states)
838
+
839
+ return hidden_states, attn_weights
840
+
841
+
842
+ class Wav2Vec2ConformerEncoder(nn.Module):
843
+ def __init__(self, config, is_causal=False):
844
+ super().__init__()
845
+ config.is_causal = is_causal
846
+ self.config = config
847
+
848
+ if config.position_embeddings_type == "relative":
849
+ self.embed_positions = Wav2Vec2ConformerRelPositionalEmbedding(config)
850
+ elif config.position_embeddings_type == "rotary":
851
+ self.embed_positions = Wav2Vec2ConformerRotaryPositionalEmbedding(config)
852
+ else:
853
+ self.embed_positions = None
854
+
855
+ self.pos_conv_embed = Wav2Vec2ConformerPositionalConvEmbedding(config)
856
+ self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
857
+ self.dropout = nn.Dropout(config.hidden_dropout)
858
+ self.layers = nn.ModuleList([Wav2Vec2ConformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
859
+ self.gradient_checkpointing = False
860
+
861
+ def forward(
862
+ self,
863
+ hidden_states,
864
+ attention_mask=None,
865
+ output_attentions=False,
866
+ output_hidden_states=False,
867
+ return_dict=True,
868
+ ):
869
+ all_hidden_states = () if output_hidden_states else None
870
+ all_self_attentions = () if output_attentions else None
871
+
872
+ if attention_mask is not None:
873
+ # make sure padded tokens output 0
874
+ hidden_states[~attention_mask] = 0.0
875
+
876
+ # extend attention_mask
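+ # convert the boolean mask into an additive attention bias: 0 for valid positions and a large
+ # negative value for padded positions, broadcast over heads and query positions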
877
+ attention_mask = 1.0 - attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
878
+ attention_mask = attention_mask * torch.finfo(hidden_states.dtype).min
879
+ attention_mask = attention_mask.expand(
880
+ attention_mask.shape[0], 1, attention_mask.shape[-1], attention_mask.shape[-1]
881
+ )
882
+
883
+ hidden_states = self.dropout(hidden_states)
884
+
885
+ if self.embed_positions is not None:
886
+ relative_position_embeddings = self.embed_positions(hidden_states)
887
+ else:
888
+ relative_position_embeddings = None
889
+
890
+ deepspeed_zero3_is_enabled = is_deepspeed_zero3_enabled()
891
+
892
+ for i, layer in enumerate(self.layers):
893
+ if output_hidden_states:
894
+ all_hidden_states = all_hidden_states + (hidden_states,)
895
+
896
+ # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
897
+ dropout_probability = np.random.uniform(0, 1)
898
+
899
+ skip_the_layer = True if self.training and (dropout_probability < self.config.layerdrop) else False
900
+ if not skip_the_layer or deepspeed_zero3_is_enabled:
901
+ # under deepspeed zero3 all gpus must run in sync
902
+ if self.gradient_checkpointing and self.training:
903
+ # create gradient checkpointing function
904
+ def create_custom_forward(module):
905
+ def custom_forward(*inputs):
906
+ return module(*inputs, output_attentions)
907
+
908
+ return custom_forward
909
+
910
+ layer_outputs = torch.utils.checkpoint.checkpoint(
911
+ create_custom_forward(layer),
912
+ hidden_states,
913
+ attention_mask,
914
+ relative_position_embeddings,
915
+ )
916
+ else:
917
+ layer_outputs = layer(
918
+ hidden_states,
919
+ attention_mask=attention_mask,
920
+ relative_position_embeddings=relative_position_embeddings,
921
+ output_attentions=output_attentions,
922
+ )
923
+ hidden_states = layer_outputs[0]
924
+
925
+ if skip_the_layer:
926
+ layer_outputs = (None, None)
927
+
928
+ if output_attentions:
929
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
930
+
931
+ hidden_states = self.layer_norm(hidden_states)
932
+ if output_hidden_states:
933
+ all_hidden_states = all_hidden_states + (hidden_states,)
934
+
935
+ if not return_dict:
936
+ return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
937
+ return BaseModelOutput(
938
+ last_hidden_state=hidden_states,
939
+ hidden_states=all_hidden_states,
940
+ attentions=all_self_attentions,
941
+ )
942
+
943
+
944
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2GumbelVectorQuantizer with Wav2Vec2->Wav2Vec2Conformer
945
+ class Wav2Vec2ConformerGumbelVectorQuantizer(nn.Module):
946
+ """
947
+ Vector quantization using gumbel softmax. See `[CATEGORICAL REPARAMETERIZATION WITH
948
+ GUMBEL-SOFTMAX](https://arxiv.org/pdf/1611.01144.pdf) for more information.
949
+ """
950
+
951
+ def __init__(self, config):
952
+ super().__init__()
953
+ self.num_groups = config.num_codevector_groups
954
+ self.num_vars = config.num_codevectors_per_group
955
+
956
+ if config.codevector_dim % self.num_groups != 0:
957
+ raise ValueError(
958
+ f"`config.codevector_dim {config.codevector_dim} must be divisible "
959
+ f"by `config.num_codevector_groups` {self.num_groups} for concatenation"
960
+ )
961
+
962
+ # storage for codebook variables (codewords)
963
+ self.codevectors = nn.Parameter(
964
+ torch.FloatTensor(1, self.num_groups * self.num_vars, config.codevector_dim // self.num_groups)
965
+ )
966
+ self.weight_proj = nn.Linear(config.conv_dim[-1], self.num_groups * self.num_vars)
967
+
968
+ # can be decayed for training
969
+ self.temperature = 2
970
+
971
+ @staticmethod
972
+ def _compute_perplexity(probs, mask=None):
973
+ if mask is not None:
974
+ mask_extended = mask.flatten()[:, None, None].expand(probs.shape)
975
+ probs = torch.where(mask_extended, probs, torch.zeros_like(probs))
976
+ marginal_probs = probs.sum(dim=0) / mask.sum()
977
+ else:
978
+ marginal_probs = probs.mean(dim=0)
979
+
980
+ perplexity = torch.exp(-torch.sum(marginal_probs * torch.log(marginal_probs + 1e-7), dim=-1)).sum()
981
+ return perplexity
982
+
983
+ def forward(self, hidden_states, mask_time_indices=None):
984
+ batch_size, sequence_length, hidden_size = hidden_states.shape
985
+
986
+ # project to codevector dim
987
+ hidden_states = self.weight_proj(hidden_states)
988
+ hidden_states = hidden_states.view(batch_size * sequence_length * self.num_groups, -1)
989
+
990
+ if self.training:
991
+ # sample code vector probs via gumbel softmax in a differentiable way
992
+ codevector_probs = nn.functional.gumbel_softmax(
993
+ hidden_states.float(), tau=self.temperature, hard=True
994
+ ).type_as(hidden_states)
995
+
996
+ # compute perplexity
997
+ codevector_soft_dist = torch.softmax(
998
+ hidden_states.view(batch_size * sequence_length, self.num_groups, -1).float(), dim=-1
999
+ )
1000
+ perplexity = self._compute_perplexity(codevector_soft_dist, mask_time_indices)
1001
+ else:
1002
+ # take argmax in non-differentiable way
1003
+ # compute hard codevector distribution (one hot)
1004
+ codevector_idx = hidden_states.argmax(dim=-1)
1005
+ codevector_probs = hidden_states.new_zeros(hidden_states.shape).scatter_(
1006
+ -1, codevector_idx.view(-1, 1), 1.0
1007
+ )
1008
+ codevector_probs = codevector_probs.view(batch_size * sequence_length, self.num_groups, -1)
1009
+
1010
+ perplexity = self._compute_perplexity(codevector_probs, mask_time_indices)
1011
+
1012
+ codevector_probs = codevector_probs.view(batch_size * sequence_length, -1)
1013
+ # use probs to retrieve codevectors
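+ # the (one-hot) probabilities pick one codevector per group; the selected vectors are summed
+ # over the codebook axis and the groups are concatenated back to the full codevector dimension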
1014
+ codevectors_per_group = codevector_probs.unsqueeze(-1) * self.codevectors
1015
+ codevectors = codevectors_per_group.view(batch_size * sequence_length, self.num_groups, self.num_vars, -1)
1016
+ codevectors = codevectors.sum(-2).view(batch_size, sequence_length, -1)
1017
+
1018
+ return codevectors, perplexity
1019
+
1020
+
1021
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2Adapter with Wav2Vec2->Wav2Vec2Conformer
1022
+ class Wav2Vec2ConformerAdapter(nn.Module):
1023
+ def __init__(self, config):
1024
+ super().__init__()
1025
+
1026
+ # feature dim might need to be down-projected
1027
+ if config.output_hidden_size != config.hidden_size:
1028
+ self.proj = nn.Linear(config.hidden_size, config.output_hidden_size)
1029
+ self.proj_layer_norm = nn.LayerNorm(config.output_hidden_size)
1030
+ else:
1031
+ self.proj = self.proj_layer_norm = None
1032
+
1033
+ self.layers = nn.ModuleList(Wav2Vec2ConformerAdapterLayer(config) for _ in range(config.num_adapter_layers))
1034
+ self.layerdrop = config.layerdrop
1035
+
1036
+ def forward(self, hidden_states):
1037
+ # down project hidden_states if necessary
1038
+ if self.proj is not None and self.proj_layer_norm is not None:
1039
+ hidden_states = self.proj(hidden_states)
1040
+ hidden_states = self.proj_layer_norm(hidden_states)
1041
+
1042
+ hidden_states = hidden_states.transpose(1, 2)
1043
+
1044
+ for layer in self.layers:
1045
+ layerdrop_prob = np.random.random()
1046
+ if not self.training or (layerdrop_prob > self.layerdrop):
1047
+ hidden_states = layer(hidden_states)
1048
+
1049
+ hidden_states = hidden_states.transpose(1, 2)
1050
+ return hidden_states
1051
+
1052
+
1053
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2AdapterLayer with Wav2Vec2->Wav2Vec2Conformer
1054
+ class Wav2Vec2ConformerAdapterLayer(nn.Module):
1055
+ def __init__(self, config):
1056
+ super().__init__()
1057
+ self.conv = nn.Conv1d(
1058
+ config.output_hidden_size,
1059
+ 2 * config.output_hidden_size,
1060
+ config.adapter_kernel_size,
1061
+ stride=config.adapter_stride,
1062
+ padding=1,
1063
+ )
1064
+
1065
+ def forward(self, hidden_states):
1066
+ hidden_states = self.conv(hidden_states)
1067
+ hidden_states = nn.functional.glu(hidden_states, dim=1)
1068
+
1069
+ return hidden_states
1070
+
1071
+
1072
+ class Wav2Vec2ConformerPreTrainedModel(PreTrainedModel):
1073
+ """
1074
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
1075
+ models.
1076
+ """
1077
+
1078
+ config_class = Wav2Vec2ConformerConfig
1079
+ base_model_prefix = "wav2vec2_conformer"
1080
+ main_input_name = "input_values"
1081
+ _keys_to_ignore_on_load_missing = [r"position_ids"]
1082
+ supports_gradient_checkpointing = True
1083
+
1084
+ def _init_weights(self, module):
1085
+ """Initialize the weights"""
1086
+ # Wav2Vec2ConformerForPreTraining last 2 linear layers need standard Linear init.
1087
+ if isinstance(module, Wav2Vec2ConformerForPreTraining):
1088
+ module.project_hid.reset_parameters()
1089
+ module.project_q.reset_parameters()
1090
+ module.project_hid._is_hf_initialized = True
1091
+ module.project_q._is_hf_initialized = True
1092
+ # gumbel softmax requires special init
1093
+ elif isinstance(module, Wav2Vec2ConformerGumbelVectorQuantizer):
1094
+ module.weight_proj.weight.data.normal_(mean=0.0, std=1)
1095
+ module.weight_proj.bias.data.zero_()
1096
+ nn.init.uniform_(module.codevectors)
1097
+ elif isinstance(module, Wav2Vec2ConformerSelfAttention):
1098
+ if hasattr(module, "pos_bias_u"):
1099
+ nn.init.xavier_uniform_(module.pos_bias_u)
1100
+ if hasattr(module, "pos_bias_v"):
1101
+ nn.init.xavier_uniform_(module.pos_bias_v)
1102
+ elif isinstance(module, Wav2Vec2ConformerPositionalConvEmbedding):
1103
+ nn.init.normal_(
1104
+ module.conv.weight,
1105
+ mean=0,
1106
+ std=2 * math.sqrt(1 / (module.conv.kernel_size[0] * module.conv.in_channels)),
1107
+ )
1108
+ nn.init.constant_(module.conv.bias, 0)
1109
+ elif isinstance(module, Wav2Vec2ConformerFeatureProjection):
1110
+ k = math.sqrt(1 / module.projection.in_features)
1111
+ nn.init.uniform_(module.projection.weight, a=-k, b=k)
1112
+ nn.init.uniform_(module.projection.bias, a=-k, b=k)
1113
+ elif isinstance(module, nn.Linear):
1114
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
1115
+
1116
+ if module.bias is not None:
1117
+ module.bias.data.zero_()
1118
+ elif isinstance(module, (nn.LayerNorm, nn.GroupNorm)):
1119
+ module.bias.data.zero_()
1120
+ module.weight.data.fill_(1.0)
1121
+ elif isinstance(module, nn.Conv1d):
1122
+ nn.init.kaiming_normal_(module.weight)
1123
+
1124
+ if module.bias is not None:
1125
+ k = math.sqrt(module.groups / (module.in_channels * module.kernel_size[0]))
1126
+ nn.init.uniform_(module.bias, a=-k, b=k)
1127
+
1128
+ def _get_feat_extract_output_lengths(
1129
+ self, input_lengths: Union[torch.LongTensor, int], add_adapter: Optional[bool] = None
1130
+ ):
1131
+ """
1132
+ Computes the output length of the convolutional layers
1133
+ """
1134
+
1135
+ add_adapter = self.config.add_adapter if add_adapter is None else add_adapter
1136
+
1137
+ def _conv_out_length(input_length, kernel_size, stride):
1138
+ # 1D convolutional layer output length formula taken
1139
+ # from https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
1140
+ return torch.div(input_length - kernel_size, stride, rounding_mode="floor") + 1
1141
+
1142
+ for kernel_size, stride in zip(self.config.conv_kernel, self.config.conv_stride):
1143
+ input_lengths = _conv_out_length(input_lengths, kernel_size, stride)
1144
+
1145
+ if add_adapter:
1146
+ for _ in range(self.config.num_adapter_layers):
1147
+ input_lengths = _conv_out_length(input_lengths, 1, self.config.adapter_stride)
1148
+
1149
+ return input_lengths
1150
+
1151
+ def _get_feature_vector_attention_mask(
1152
+ self, feature_vector_length: int, attention_mask: torch.LongTensor, add_adapter=None
1153
+ ):
1154
+ # Effectively attention_mask.sum(-1), but not inplace to be able to run
1155
+ # on inference mode.
1156
+ non_padded_lengths = attention_mask.cumsum(dim=-1)[:, -1]
1157
+
1158
+ output_lengths = self._get_feat_extract_output_lengths(non_padded_lengths, add_adapter=add_adapter)
1159
+ output_lengths = output_lengths.to(torch.long)
1160
+
1161
+ batch_size = attention_mask.shape[0]
1162
+
1163
+ attention_mask = torch.zeros(
1164
+ (batch_size, feature_vector_length), dtype=attention_mask.dtype, device=attention_mask.device
1165
+ )
1166
+ # these two operations make sure that all values before the output length indices are attended to
1167
+ attention_mask[(torch.arange(attention_mask.shape[0], device=attention_mask.device), output_lengths - 1)] = 1
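+ # the scatter above marks the last valid frame of each sample; the flip/cumsum/flip below
+ # back-fills ones from that index down to position 0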
1168
+ attention_mask = attention_mask.flip([-1]).cumsum(-1).flip([-1]).bool()
1169
+ return attention_mask
1170
+
1171
+ def _set_gradient_checkpointing(self, module, value=False):
1172
+ if isinstance(module, (Wav2Vec2ConformerEncoder, Wav2Vec2ConformerFeatureEncoder)):
1173
+ module.gradient_checkpointing = value
1174
+
1175
+
1176
+ WAV2VEC2_CONFORMER_START_DOCSTRING = r"""
1177
+ Wav2Vec2Conformer was proposed in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
1178
+ Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael
1179
+ Auli.
1180
+
1181
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1182
+ library implements for all its models (such as downloading or saving, etc.).
1183
+
1184
+ This model is a PyTorch [nn.Module](https://pytorch.org/docs/stable/nn.html#nn.Module) sub-class. Use it as a
1185
+ regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
1186
+
1187
+ Parameters:
1188
+ config ([`Wav2Vec2ConformerConfig`]): Model configuration class with all the parameters of the model.
1189
+ Initializing with a config file does not load the weights associated with the model, only the
1190
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1191
+ """
1192
+
1193
+
1194
+ WAV2VEC2_CONFORMER_INPUTS_DOCSTRING = r"""
1195
+ Args:
1196
+ input_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
1197
+ Float values of input raw speech waveform. Values can be obtained by loading a `.flac` or `.wav` audio file
1198
+ into an array of type `List[float]` or a `numpy.ndarray`, *e.g.* via the soundfile library (`pip install
1199
+ soundfile`). To prepare the array into `input_values`, the [`AutoProcessor`] should be used for padding and
1200
+ conversion into a tensor of type `torch.FloatTensor`. See [`Wav2Vec2Processor.__call__`] for details.
1201
+ attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1202
+ Mask to avoid performing convolution and attention on padding token indices. Mask values selected in `[0,
1203
+ 1]`:
1204
+
1205
+ - 1 for tokens that are **not masked**,
1206
+ - 0 for tokens that are **masked**.
1207
+
1208
+ [What are attention masks?](../glossary#attention-mask)
1209
+
1210
+ <Tip warning={true}>
1211
+
1212
+ `attention_mask` should only be passed if the corresponding processor has `config.return_attention_mask ==
1213
+ True`. For all models whose processor has `config.return_attention_mask == False`, such as
1214
+ [wav2vec2-conformer-rel-pos-large](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large),
1215
+ `attention_mask` should **not** be passed to avoid degraded performance when doing batched inference. For
1216
+ such models `input_values` should simply be padded with 0 and passed without `attention_mask`. Be aware
1217
+ that these models also yield slightly different results depending on whether `input_values` is padded or
1218
+ not.
1219
+
1220
+ </Tip>
1221
+
1222
+ output_attentions (`bool`, *optional*):
1223
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1224
+ tensors for more detail.
1225
+ output_hidden_states (`bool`, *optional*):
1226
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1227
+ more detail.
1228
+ return_dict (`bool`, *optional*):
1229
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1230
+ """
1231
+
1232
+
1233
+ @add_start_docstrings(
1234
+ "The bare Wav2Vec2Conformer Model transformer outputting raw hidden-states without any specific head on top.",
1235
+ WAV2VEC2_CONFORMER_START_DOCSTRING,
1236
+ )
1237
+ class Wav2Vec2ConformerModel(Wav2Vec2ConformerPreTrainedModel):
1238
+ def __init__(self, config: Wav2Vec2ConformerConfig):
1239
+ super().__init__(config)
1240
+ self.config = config
1241
+ self.feature_extractor = Wav2Vec2ConformerFeatureEncoder(config)
1242
+ self.feature_projection = Wav2Vec2ConformerFeatureProjection(config)
1243
+
1244
+ # model only needs masking vector if mask prob is > 0.0
1245
+ if config.mask_time_prob > 0.0 or config.mask_feature_prob > 0.0:
1246
+ self.masked_spec_embed = nn.Parameter(torch.FloatTensor(config.hidden_size).uniform_())
1247
+
1248
+ self.encoder = Wav2Vec2ConformerEncoder(config)
1249
+
1250
+ self.adapter = Wav2Vec2ConformerAdapter(config) if config.add_adapter else None
1251
+
1252
+ # Initialize weights and apply final processing
1253
+ self.post_init()
1254
+
1255
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2Model.freeze_feature_encoder
1256
+ def freeze_feature_encoder(self):
1257
+ """
1258
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
1259
+ not be updated during training.
1260
+ """
1261
+ self.feature_extractor._freeze_parameters()
1262
+
1263
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2Model._mask_hidden_states
1264
+ def _mask_hidden_states(
1265
+ self,
1266
+ hidden_states: torch.FloatTensor,
1267
+ mask_time_indices: Optional[torch.FloatTensor] = None,
1268
+ attention_mask: Optional[torch.LongTensor] = None,
1269
+ ):
1270
+ """
1271
+ Masks extracted features along time axis and/or along feature axis according to
1272
+ [SpecAugment](https://arxiv.org/abs/1904.08779).
1273
+ """
1274
+
1275
+ # `config.apply_spec_augment` can set masking to False
1276
+ if not getattr(self.config, "apply_spec_augment", True):
1277
+ return hidden_states
1278
+
1279
+ # generate indices & apply SpecAugment along time axis
1280
+ batch_size, sequence_length, hidden_size = hidden_states.size()
1281
+
1282
+ if mask_time_indices is not None:
1283
+ # apply SpecAugment along time axis with given mask_time_indices
1284
+ hidden_states[mask_time_indices] = self.masked_spec_embed.to(hidden_states.dtype)
1285
+ elif self.config.mask_time_prob > 0 and self.training:
1286
+ mask_time_indices = _compute_mask_indices(
1287
+ (batch_size, sequence_length),
1288
+ mask_prob=self.config.mask_time_prob,
1289
+ mask_length=self.config.mask_time_length,
1290
+ attention_mask=attention_mask,
1291
+ min_masks=self.config.mask_time_min_masks,
1292
+ )
1293
+ mask_time_indices = torch.tensor(mask_time_indices, device=hidden_states.device, dtype=torch.bool)
1294
+ hidden_states[mask_time_indices] = self.masked_spec_embed.to(hidden_states.dtype)
1295
+
1296
+ if self.config.mask_feature_prob > 0 and self.training:
1297
+ # generate indices & apply SpecAugment along feature axis
1298
+ mask_feature_indices = _compute_mask_indices(
1299
+ (batch_size, hidden_size),
1300
+ mask_prob=self.config.mask_feature_prob,
1301
+ mask_length=self.config.mask_feature_length,
1302
+ min_masks=self.config.mask_feature_min_masks,
1303
+ )
1304
+ mask_feature_indices = torch.tensor(mask_feature_indices, device=hidden_states.device, dtype=torch.bool)
1305
+ mask_feature_indices = mask_feature_indices[:, None].expand(-1, sequence_length, -1)
1306
+ hidden_states[mask_feature_indices] = 0
1307
+
1308
+ return hidden_states
1309
+
1310
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
1311
+ @add_code_sample_docstrings(
1312
+ checkpoint=_CHECKPOINT_FOR_DOC,
1313
+ output_type=Wav2Vec2BaseModelOutput,
1314
+ config_class=_CONFIG_FOR_DOC,
1315
+ modality="audio",
1316
+ expected_output=_EXPECTED_OUTPUT_SHAPE,
1317
+ )
1318
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2Model.forward with wav2vec2->wav2vec2_conformer
1319
+ def forward(
1320
+ self,
1321
+ input_values: Optional[torch.Tensor],
1322
+ attention_mask: Optional[torch.Tensor] = None,
1323
+ mask_time_indices: Optional[torch.FloatTensor] = None,
1324
+ output_attentions: Optional[bool] = None,
1325
+ output_hidden_states: Optional[bool] = None,
1326
+ return_dict: Optional[bool] = None,
1327
+ ) -> Union[Tuple, Wav2Vec2BaseModelOutput]:
1328
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1329
+ output_hidden_states = (
1330
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1331
+ )
1332
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1333
+
1334
+ extract_features = self.feature_extractor(input_values)
1335
+ extract_features = extract_features.transpose(1, 2)
1336
+
1337
+ if attention_mask is not None:
1338
+ # compute reduced attention_mask corresponding to feature vectors
1339
+ attention_mask = self._get_feature_vector_attention_mask(
1340
+ extract_features.shape[1], attention_mask, add_adapter=False
1341
+ )
1342
+
1343
+ hidden_states, extract_features = self.feature_projection(extract_features)
1344
+ hidden_states = self._mask_hidden_states(
1345
+ hidden_states, mask_time_indices=mask_time_indices, attention_mask=attention_mask
1346
+ )
1347
+
1348
+ encoder_outputs = self.encoder(
1349
+ hidden_states,
1350
+ attention_mask=attention_mask,
1351
+ output_attentions=output_attentions,
1352
+ output_hidden_states=output_hidden_states,
1353
+ return_dict=return_dict,
1354
+ )
1355
+
1356
+ hidden_states = encoder_outputs[0]
1357
+
1358
+ if self.adapter is not None:
1359
+ hidden_states = self.adapter(hidden_states)
1360
+
1361
+ if not return_dict:
1362
+ return (hidden_states, extract_features) + encoder_outputs[1:]
1363
+
1364
+ return Wav2Vec2BaseModelOutput(
1365
+ last_hidden_state=hidden_states,
1366
+ extract_features=extract_features,
1367
+ hidden_states=encoder_outputs.hidden_states,
1368
+ attentions=encoder_outputs.attentions,
1369
+ )
1370
+
1371
+
1372
+ @add_start_docstrings(
1373
+ """Wav2Vec2Conformer Model with a quantizer and `VQ` head on top.""", WAV2VEC2_CONFORMER_START_DOCSTRING
1374
+ )
1375
+ class Wav2Vec2ConformerForPreTraining(Wav2Vec2ConformerPreTrainedModel):
1376
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTraining.__init__ with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer
1377
+ def __init__(self, config: Wav2Vec2ConformerConfig):
1378
+ super().__init__(config)
1379
+ self.wav2vec2_conformer = Wav2Vec2ConformerModel(config)
1380
+ self.dropout_features = nn.Dropout(config.feat_quantizer_dropout)
1381
+
1382
+ self.quantizer = Wav2Vec2ConformerGumbelVectorQuantizer(config)
1383
+
1384
+ self.project_hid = nn.Linear(config.hidden_size, config.proj_codevector_dim)
1385
+ self.project_q = nn.Linear(config.codevector_dim, config.proj_codevector_dim)
1386
+
1387
+ # Initialize weights and apply final processing
1388
+ self.post_init()
1389
+
1390
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTraining.set_gumbel_temperature
1391
+ def set_gumbel_temperature(self, temperature: int):
1392
+ """
1393
+ Set the Gumbel softmax temperature to a given value. Only necessary for training
1394
+ """
1395
+ self.quantizer.temperature = temperature
1396
+
1397
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTraining.freeze_feature_encoder with wav2vec2->wav2vec2_conformer
1398
+ def freeze_feature_encoder(self):
1399
+ """
1400
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
1401
+ not be updated during training.
1402
+ """
1403
+ self.wav2vec2_conformer.feature_extractor._freeze_parameters()
1404
+
1405
+ @staticmethod
1406
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTraining.compute_contrastive_logits
1407
+ def compute_contrastive_logits(
1408
+ target_features: torch.FloatTensor,
1409
+ negative_features: torch.FloatTensor,
1410
+ predicted_features: torch.FloatTensor,
1411
+ temperature: int = 0.1,
1412
+ ):
1413
+ """
1414
+ Compute logits for contrastive loss using cosine similarity as the distance measure between
1415
+ `[positive_feature, negative_features]` and `[predicted_features]`. Additionally, temperature can be applied.
1416
+ """
1417
+ target_features = torch.cat([target_features, negative_features], dim=0)
1418
+
1419
+ logits = torch.cosine_similarity(predicted_features.float(), target_features.float(), dim=-1).type_as(
1420
+ target_features
1421
+ )
1422
+
1423
+ # apply temperature
1424
+ logits = logits / temperature
1425
+ return logits
1426
+
1427
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
1428
+ @replace_return_docstrings(output_type=Wav2Vec2ConformerForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
1429
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTraining.forward with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer,wav2vec2_conformer-base->wav2vec2-conformer-rel-pos-large
1430
+ def forward(
1431
+ self,
1432
+ input_values: Optional[torch.Tensor],
1433
+ attention_mask: Optional[torch.Tensor] = None,
1434
+ mask_time_indices: Optional[torch.BoolTensor] = None,
1435
+ sampled_negative_indices: Optional[torch.BoolTensor] = None,
1436
+ output_attentions: Optional[bool] = None,
1437
+ output_hidden_states: Optional[bool] = None,
1438
+ return_dict: Optional[bool] = None,
1439
+ ) -> Union[Tuple, Wav2Vec2ConformerForPreTrainingOutput]:
1440
+ r"""
1441
+ mask_time_indices (`torch.BoolTensor` of shape `(batch_size, sequence_length)`, *optional*):
1442
+ Indices to mask extracted features for contrastive loss. When in training mode, model learns to predict
1443
+ masked extracted features in *config.proj_codevector_dim* space.
1444
+ sampled_negative_indices (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_negatives)`, *optional*):
1445
+ Indices indicating which quantized target vectors are used as negative sampled vectors in contrastive loss.
1446
+ Required input for pre-training.
1447
+
1448
+ Returns:
1449
+
1450
+ Example:
1451
+
1452
+ ```python
1453
+ >>> import torch
1454
+ >>> from transformers import AutoFeatureExtractor, Wav2Vec2ConformerForPreTraining
1455
+ >>> from transformers.models.wav2vec2_conformer.modeling_wav2vec2_conformer import (
1456
+ ... _compute_mask_indices,
1457
+ ... _sample_negative_indices,
1458
+ ... )
1459
+ >>> from datasets import load_dataset
1460
+
1461
+ >>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large")
1462
+ >>> model = Wav2Vec2ConformerForPreTraining.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large")
1463
+
1464
+ >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
1465
+ >>> input_values = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt").input_values # Batch size 1
1466
+
1467
+ >>> # compute masked indices
1468
+ >>> batch_size, raw_sequence_length = input_values.shape
1469
+ >>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()
1470
+ >>> mask_time_indices = _compute_mask_indices(
1471
+ ... shape=(batch_size, sequence_length), mask_prob=0.2, mask_length=2
1472
+ ... )
1473
+ >>> sampled_negative_indices = _sample_negative_indices(
1474
+ ... features_shape=(batch_size, sequence_length),
1475
+ ... num_negatives=model.config.num_negatives,
1476
+ ... mask_time_indices=mask_time_indices,
1477
+ ... )
1478
+ >>> mask_time_indices = torch.tensor(data=mask_time_indices, device=input_values.device, dtype=torch.long)
1479
+ >>> sampled_negative_indices = torch.tensor(
1480
+ ... data=sampled_negative_indices, device=input_values.device, dtype=torch.long
1481
+ ... )
1482
+
1483
+ >>> with torch.no_grad():
1484
+ ... outputs = model(input_values, mask_time_indices=mask_time_indices)
1485
+
1486
+ >>> # compute cosine similarity between predicted (=projected_states) and target (=projected_quantized_states)
1487
+ >>> cosine_sim = torch.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states, dim=-1)
1488
+
1489
+ >>> # show that cosine similarity is much higher than random
1490
+ >>> cosine_sim[mask_time_indices.to(torch.bool)].mean() > 0.5
1491
+ tensor(True)
1492
+
1493
+ >>> # for contrastive loss training, the model should be put into train mode
1494
+ >>> model = model.train()
1495
+ >>> loss = model(
1496
+ ... input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
1497
+ ... ).loss
1498
+ ```"""
1499
+
1500
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1501
+
1502
+ if mask_time_indices is not None:
1503
+ mask_time_indices = mask_time_indices.to(torch.bool)
1504
+
1505
+ outputs = self.wav2vec2_conformer(
1506
+ input_values,
1507
+ attention_mask=attention_mask,
1508
+ output_attentions=output_attentions,
1509
+ output_hidden_states=output_hidden_states,
1510
+ mask_time_indices=mask_time_indices,
1511
+ return_dict=return_dict,
1512
+ )
1513
+
1514
+ # 1. project all transformed features (including masked) to final vq dim
1515
+ transformer_features = self.project_hid(outputs[0])
1516
+
1517
+ # 2. quantize all (unmasked) extracted features and project to final vq dim
1518
+ extract_features = self.dropout_features(outputs[1])
1519
+
1520
+ if attention_mask is not None:
1521
+ # compute reduced attention_mask corresponding to feature vectors
1522
+ attention_mask = self._get_feature_vector_attention_mask(
1523
+ extract_features.shape[1], attention_mask, add_adapter=False
1524
+ )
1525
+
1526
+ quantized_features, codevector_perplexity = self.quantizer(
1527
+ extract_features, mask_time_indices=mask_time_indices
1528
+ )
1529
+ quantized_features = self.project_q(quantized_features)
1530
+
1531
+ loss = contrastive_loss = diversity_loss = None
1532
+ if sampled_negative_indices is not None:
1533
+ batch_size, sequence_length, hidden_size = quantized_features.shape
1534
+
1535
+ # for training, we sample negatives
1536
+ # 3. sample K negatives (distractors) quantized states for contrastive loss
1537
+ # if attention_mask is passed, make sure that padded feature vectors cannot be sampled
1538
+ # sample negative quantized vectors BTC => (BxT)C
1539
+ negative_quantized_features = quantized_features.view(-1, hidden_size)[
1540
+ sampled_negative_indices.long().view(-1)
1541
+ ]
1542
+ negative_quantized_features = negative_quantized_features.view(
1543
+ batch_size, sequence_length, -1, hidden_size
1544
+ ).permute(2, 0, 1, 3)
1545
+
1546
+ # 4. compute logits, corresponding to `logits = sim(c_t, [q_t, \sim{q}_t]) / \kappa`
1547
+ # of equation (3) in https://arxiv.org/pdf/2006.11477.pdf
1548
+ logits = self.compute_contrastive_logits(
1549
+ quantized_features[None, :],
1550
+ negative_quantized_features,
1551
+ transformer_features,
1552
+ self.config.contrastive_logits_temperature,
1553
+ )
1554
+
1555
+ # 5. if a negative vector is identical to the positive (i.e. when codebook utilization is low),
1556
+ # its cosine similarity will be masked
1557
+ neg_is_pos = (quantized_features == negative_quantized_features).all(-1)
1558
+
1559
+ if neg_is_pos.any():
1560
+ logits[1:][neg_is_pos] = float("-inf")
1561
+
1562
+ # 6. compute contrastive loss \mathbf{L}_m = cross_entropy(logits) =
1563
+ # -log(exp(sim(c_t, q_t)/\kappa) / \sum_{\sim{q}} exp(sim(c_t, \sim{q})/\kappa))
1564
+ logits = logits.transpose(0, 2).reshape(-1, logits.size(0))
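+ # the true quantized vector is always the first of the (1 + num_negatives) candidates, so the
+ # target class is 0 at masked positions; unmasked positions get -100 and are ignored by cross_entropy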
1565
+ target = ((1 - mask_time_indices.long()) * -100).transpose(0, 1).flatten()
1566
+
1567
+ contrastive_loss = nn.functional.cross_entropy(logits.float(), target, reduction="sum")
1568
+ # 7. compute diversity loss: \mathbf{L}_d
1569
+ num_codevectors = self.config.num_codevectors_per_group * self.config.num_codevector_groups
1570
+ diversity_loss = ((num_codevectors - codevector_perplexity) / num_codevectors) * mask_time_indices.sum()
1571
+
1572
+ # 8. \mathbf{L} = \mathbf{L}_m + \alpha * \mathbf{L}_d
1573
+ loss = contrastive_loss + self.config.diversity_loss_weight * diversity_loss
1574
+
1575
+ if not return_dict:
1576
+ if loss is not None:
1577
+ return (loss, transformer_features, quantized_features, codevector_perplexity) + outputs[2:]
1578
+ return (transformer_features, quantized_features, codevector_perplexity) + outputs[2:]
1579
+
1580
+ return Wav2Vec2ConformerForPreTrainingOutput(
1581
+ loss=loss,
1582
+ projected_states=transformer_features,
1583
+ projected_quantized_states=quantized_features,
1584
+ codevector_perplexity=codevector_perplexity,
1585
+ hidden_states=outputs.hidden_states,
1586
+ attentions=outputs.attentions,
1587
+ contrastive_loss=contrastive_loss,
1588
+ diversity_loss=diversity_loss,
1589
+ )
1590
+
1591
+
1592
+ @add_start_docstrings(
1593
+ """Wav2Vec2Conformer Model with a `language modeling` head on top for Connectionist Temporal Classification (CTC).""",
1594
+ WAV2VEC2_CONFORMER_START_DOCSTRING,
1595
+ )
1596
+ class Wav2Vec2ConformerForCTC(Wav2Vec2ConformerPreTrainedModel):
1597
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.__init__ with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer
1598
+ def __init__(self, config):
1599
+ super().__init__(config)
1600
+
1601
+ self.wav2vec2_conformer = Wav2Vec2ConformerModel(config)
1602
+ self.dropout = nn.Dropout(config.final_dropout)
1603
+
1604
+ if config.vocab_size is None:
1605
+ raise ValueError(
1606
+ f"You are trying to instantiate {self.__class__} with a configuration that "
1607
+ "does not define the vocabulary size of the language model head. Please "
1608
+ "instantiate the model as follows: `Wav2Vec2ConformerForCTC.from_pretrained(..., vocab_size=vocab_size)`. "
1609
+ "or define `vocab_size` of your model's configuration."
1610
+ )
1611
+ output_hidden_size = (
1612
+ config.output_hidden_size if hasattr(config, "add_adapter") and config.add_adapter else config.hidden_size
1613
+ )
1614
+ self.lm_head = nn.Linear(output_hidden_size, config.vocab_size)
1615
+
1616
+ # Initialize weights and apply final processing
1617
+ self.post_init()
1618
+
1619
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.freeze_feature_encoder with wav2vec2->wav2vec2_conformer
1620
+ def freeze_feature_encoder(self):
1621
+ """
1622
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
1623
+ not be updated during training.
1624
+ """
1625
+ self.wav2vec2_conformer.feature_extractor._freeze_parameters()
1626
+
1627
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
1628
+ @add_code_sample_docstrings(
1629
+ checkpoint=_CHECKPOINT_FOR_DOC,
1630
+ output_type=CausalLMOutput,
1631
+ config_class=_CONFIG_FOR_DOC,
1632
+ expected_output=_CTC_EXPECTED_OUTPUT,
1633
+ expected_loss=_CTC_EXPECTED_LOSS,
1634
+ )
1635
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.forward with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer
1636
+ def forward(
1637
+ self,
1638
+ input_values: Optional[torch.Tensor],
1639
+ attention_mask: Optional[torch.Tensor] = None,
1640
+ output_attentions: Optional[bool] = None,
1641
+ output_hidden_states: Optional[bool] = None,
1642
+ return_dict: Optional[bool] = None,
1643
+ labels: Optional[torch.Tensor] = None,
1644
+ ) -> Union[Tuple, CausalLMOutput]:
1645
+ r"""
1646
+ labels (`torch.LongTensor` of shape `(batch_size, target_length)`, *optional*):
1647
+ Labels for connectionist temporal classification. Note that `target_length` has to be smaller or equal to
1648
+ the sequence length of the output logits. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`.
1649
+ All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ...,
1650
+ config.vocab_size - 1]`.
1651
+ """
1652
+
1653
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1654
+
1655
+ outputs = self.wav2vec2_conformer(
1656
+ input_values,
1657
+ attention_mask=attention_mask,
1658
+ output_attentions=output_attentions,
1659
+ output_hidden_states=output_hidden_states,
1660
+ return_dict=return_dict,
1661
+ )
1662
+
1663
+ hidden_states = outputs[0]
1664
+ hidden_states = self.dropout(hidden_states)
1665
+
1666
+ logits = self.lm_head(hidden_states)
1667
+
1668
+ loss = None
1669
+ if labels is not None:
1670
+ if labels.max() >= self.config.vocab_size:
1671
+ raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
1672
+
1673
+ # retrieve loss input_lengths from attention_mask
1674
+ attention_mask = (
1675
+ attention_mask if attention_mask is not None else torch.ones_like(input_values, dtype=torch.long)
1676
+ )
1677
+ input_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(-1)).to(torch.long)
1678
+
1679
+ # assuming that padded tokens are filled with -100
1680
+ # when not being attended to
1681
+ labels_mask = labels >= 0
1682
+ target_lengths = labels_mask.sum(-1)
1683
+ flattened_targets = labels.masked_select(labels_mask)
1684
+
1685
+ # ctc_loss doesn't support fp16
1686
+ log_probs = nn.functional.log_softmax(logits, dim=-1, dtype=torch.float32).transpose(0, 1)
1687
+
1688
+ with torch.backends.cudnn.flags(enabled=False):
1689
+ loss = nn.functional.ctc_loss(
1690
+ log_probs,
1691
+ flattened_targets,
1692
+ input_lengths,
1693
+ target_lengths,
1694
+ blank=self.config.pad_token_id,
1695
+ reduction=self.config.ctc_loss_reduction,
1696
+ zero_infinity=self.config.ctc_zero_infinity,
1697
+ )
1698
+
1699
+ if not return_dict:
1700
+ output = (logits,) + outputs[_HIDDEN_STATES_START_POSITION:]
1701
+ return ((loss,) + output) if loss is not None else output
1702
+
1703
+ return CausalLMOutput(
1704
+ loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
1705
+ )
1706
+
1707
+
1708
+ @add_start_docstrings(
1709
+ """
1710
+ Wav2Vec2Conformer Model with a sequence classification head on top (a linear layer over the pooled output) for
1711
+ tasks like SUPERB Keyword Spotting.
1712
+ """,
1713
+ WAV2VEC2_CONFORMER_START_DOCSTRING,
1714
+ )
1715
+ class Wav2Vec2ConformerForSequenceClassification(Wav2Vec2ConformerPreTrainedModel):
1716
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForSequenceClassification.__init__ with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer
1717
+ def __init__(self, config):
1718
+ super().__init__(config)
1719
+
1720
+ if hasattr(config, "add_adapter") and config.add_adapter:
1721
+ raise ValueError(
1722
+ "Sequence classification does not support the use of Wav2Vec2Conformer adapters (config.add_adapter=True)"
1723
+ )
1724
+ self.wav2vec2_conformer = Wav2Vec2ConformerModel(config)
1725
+ num_layers = config.num_hidden_layers + 1 # transformer layers + input embeddings
1726
+ if config.use_weighted_layer_sum:
1727
+ self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
1728
+ self.projector = nn.Linear(config.hidden_size, config.classifier_proj_size)
1729
+ self.classifier = nn.Linear(config.classifier_proj_size, config.num_labels)
1730
+
1731
+ # Initialize weights and apply final processing
1732
+ self.post_init()
1733
+
1734
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForSequenceClassification.freeze_feature_encoder with wav2vec2->wav2vec2_conformer
1735
+ def freeze_feature_encoder(self):
1736
+ """
1737
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
1738
+ not be updated during training.
1739
+ """
1740
+ self.wav2vec2_conformer.feature_extractor._freeze_parameters()
1741
+
1742
+ def freeze_base_model(self):
1743
+ """
1744
+ Calling this function will disable the gradient computation for the base model so that its parameters will not
1745
+ be updated during training. Only the classification head will be updated.
1746
+ """
1747
+ for param in self.wav2vec2_conformer.parameters():
1748
+ param.requires_grad = False
1749
+
1750
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
1751
+ @add_code_sample_docstrings(
1752
+ checkpoint=_CHECKPOINT_FOR_DOC,
1753
+ output_type=SequenceClassifierOutput,
1754
+ config_class=_CONFIG_FOR_DOC,
1755
+ modality="audio",
1756
+ )
1757
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForSequenceClassification.forward with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer,WAV_2_VEC_2->WAV2VEC2_CONFORMER
1758
+ def forward(
1759
+ self,
1760
+ input_values: Optional[torch.Tensor],
1761
+ attention_mask: Optional[torch.Tensor] = None,
1762
+ output_attentions: Optional[bool] = None,
1763
+ output_hidden_states: Optional[bool] = None,
1764
+ return_dict: Optional[bool] = None,
1765
+ labels: Optional[torch.Tensor] = None,
1766
+ ) -> Union[Tuple, SequenceClassifierOutput]:
1767
+ r"""
1768
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1769
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1770
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1771
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1772
+ """
1773
+
1774
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1775
+ output_hidden_states = True if self.config.use_weighted_layer_sum else output_hidden_states
1776
+
1777
+ outputs = self.wav2vec2_conformer(
1778
+ input_values,
1779
+ attention_mask=attention_mask,
1780
+ output_attentions=output_attentions,
1781
+ output_hidden_states=output_hidden_states,
1782
+ return_dict=return_dict,
1783
+ )
1784
+
1785
+ if self.config.use_weighted_layer_sum:
1786
+ hidden_states = outputs[_HIDDEN_STATES_START_POSITION]
1787
+ hidden_states = torch.stack(hidden_states, dim=1)
1788
+ norm_weights = nn.functional.softmax(self.layer_weights, dim=-1)
1789
+ hidden_states = (hidden_states * norm_weights.view(-1, 1, 1)).sum(dim=1)
1790
+ else:
1791
+ hidden_states = outputs[0]
1792
+
1793
+ hidden_states = self.projector(hidden_states)
1794
+ if attention_mask is None:
1795
+ pooled_output = hidden_states.mean(dim=1)
1796
+ else:
1797
+ padding_mask = self._get_feature_vector_attention_mask(hidden_states.shape[1], attention_mask)
1798
+ hidden_states[~padding_mask] = 0.0
1799
+ pooled_output = hidden_states.sum(dim=1) / padding_mask.sum(dim=1).view(-1, 1)
1800
+
1801
+ logits = self.classifier(pooled_output)
1802
+
1803
+ loss = None
1804
+ if labels is not None:
1805
+ loss_fct = CrossEntropyLoss()
1806
+ loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
1807
+
1808
+ if not return_dict:
1809
+ output = (logits,) + outputs[_HIDDEN_STATES_START_POSITION:]
1810
+ return ((loss,) + output) if loss is not None else output
1811
+
1812
+ return SequenceClassifierOutput(
1813
+ loss=loss,
1814
+ logits=logits,
1815
+ hidden_states=outputs.hidden_states,
1816
+ attentions=outputs.attentions,
1817
+ )
1818
+
1819
+
1820
+ @add_start_docstrings(
1821
+ """
1822
+ Wav2Vec2Conformer Model with a frame classification head on top for tasks like Speaker Diarization.
1823
+ """,
1824
+ WAV2VEC2_CONFORMER_START_DOCSTRING,
1825
+ )
1826
+ class Wav2Vec2ConformerForAudioFrameClassification(Wav2Vec2ConformerPreTrainedModel):
1827
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForAudioFrameClassification.__init__ with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer,WAV_2_VEC_2->WAV2VEC2_CONFORMER
1828
+ def __init__(self, config):
1829
+ super().__init__(config)
1830
+
1831
+ if hasattr(config, "add_adapter") and config.add_adapter:
1832
+ raise ValueError(
1833
+ "Audio frame classification does not support the use of Wav2Vec2Conformer adapters (config.add_adapter=True)"
1834
+ )
1835
+ self.wav2vec2_conformer = Wav2Vec2ConformerModel(config)
1836
+ num_layers = config.num_hidden_layers + 1 # transformer layers + input embeddings
1837
+ if config.use_weighted_layer_sum:
1838
+ self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
1839
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1840
+ self.num_labels = config.num_labels
1841
+
1842
+ self.init_weights()
1843
+
1844
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForAudioFrameClassification.freeze_feature_encoder with wav2vec2->wav2vec2_conformer
1845
+ def freeze_feature_encoder(self):
1846
+ """
1847
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
1848
+ not be updated during training.
1849
+ """
1850
+ self.wav2vec2_conformer.feature_extractor._freeze_parameters()
1851
+
1852
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForAudioFrameClassification.freeze_base_model with wav2vec2->wav2vec2_conformer
1853
+ def freeze_base_model(self):
1854
+ """
1855
+ Calling this function will disable the gradient computation for the base model so that its parameters will not
1856
+ be updated during training. Only the classification head will be updated.
1857
+ """
1858
+ for param in self.wav2vec2_conformer.parameters():
1859
+ param.requires_grad = False
1860
+
1861
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
1862
+ @add_code_sample_docstrings(
1863
+ checkpoint=_CHECKPOINT_FOR_DOC,
1864
+ output_type=TokenClassifierOutput,
1865
+ config_class=_CONFIG_FOR_DOC,
1866
+ modality="audio",
1867
+ )
1868
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForAudioFrameClassification.forward with wav2vec2->wav2vec2_conformer
1869
+ def forward(
1870
+ self,
1871
+ input_values: Optional[torch.Tensor],
1872
+ attention_mask: Optional[torch.Tensor] = None,
1873
+ labels: Optional[torch.Tensor] = None,
1874
+ output_attentions: Optional[bool] = None,
1875
+ output_hidden_states: Optional[bool] = None,
1876
+ return_dict: Optional[bool] = None,
1877
+ ) -> Union[Tuple, TokenClassifierOutput]:
1878
+ r"""
1879
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1880
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1881
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1882
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1883
+ """
1884
+
1885
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1886
+ output_hidden_states = True if self.config.use_weighted_layer_sum else output_hidden_states
1887
+
1888
+ outputs = self.wav2vec2_conformer(
1889
+ input_values,
1890
+ attention_mask=attention_mask,
1891
+ output_attentions=output_attentions,
1892
+ output_hidden_states=output_hidden_states,
1893
+ return_dict=return_dict,
1894
+ )
1895
+
1896
+ if self.config.use_weighted_layer_sum:
1897
+ hidden_states = outputs[_HIDDEN_STATES_START_POSITION]
1898
+ hidden_states = torch.stack(hidden_states, dim=1)
1899
+ norm_weights = nn.functional.softmax(self.layer_weights, dim=-1)
1900
+ hidden_states = (hidden_states * norm_weights.view(-1, 1, 1)).sum(dim=1)
1901
+ else:
1902
+ hidden_states = outputs[0]
1903
+
1904
+ logits = self.classifier(hidden_states)
1905
+
1906
+ loss = None
1907
+ if labels is not None:
1908
+ loss_fct = CrossEntropyLoss()
1909
+ loss = loss_fct(logits.view(-1, self.num_labels), torch.argmax(labels.view(-1, self.num_labels), axis=1))
1910
+
1911
+ if not return_dict:
1912
+ output = (logits,) + outputs[_HIDDEN_STATES_START_POSITION:]
1913
+ return output
1914
+
1915
+ return TokenClassifierOutput(
1916
+ loss=loss,
1917
+ logits=logits,
1918
+ hidden_states=outputs.hidden_states,
1919
+ attentions=outputs.attentions,
1920
+ )
1921
+
1922
+
1923
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.AMSoftmaxLoss
1924
+ class AMSoftmaxLoss(nn.Module):
1925
+ def __init__(self, input_dim, num_labels, scale=30.0, margin=0.4):
1926
+ super(AMSoftmaxLoss, self).__init__()
1927
+ self.scale = scale
1928
+ self.margin = margin
1929
+ self.num_labels = num_labels
1930
+ self.weight = nn.Parameter(torch.randn(input_dim, num_labels), requires_grad=True)
1931
+ self.loss = nn.CrossEntropyLoss()
1932
+
1933
+ def forward(self, hidden_states, labels):
1934
+ labels = labels.flatten()
1935
+ weight = nn.functional.normalize(self.weight, dim=0)
1936
+ hidden_states = nn.functional.normalize(hidden_states, dim=1)
1937
+ cos_theta = torch.mm(hidden_states, weight)
1938
+ psi = cos_theta - self.margin
1939
+
1940
+ onehot = nn.functional.one_hot(labels, self.num_labels)
1941
+ logits = self.scale * torch.where(onehot.bool(), psi, cos_theta)
1942
+ loss = self.loss(logits, labels)
1943
+
1944
+ return loss
1945
+
1946
+
1947
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.TDNNLayer
1948
+ class TDNNLayer(nn.Module):
1949
+ def __init__(self, config, layer_id=0):
1950
+ super().__init__()
1951
+ self.in_conv_dim = config.tdnn_dim[layer_id - 1] if layer_id > 0 else config.tdnn_dim[layer_id]
1952
+ self.out_conv_dim = config.tdnn_dim[layer_id]
1953
+ self.kernel_size = config.tdnn_kernel[layer_id]
1954
+ self.dilation = config.tdnn_dilation[layer_id]
1955
+
1956
+ self.kernel = nn.Linear(self.in_conv_dim * self.kernel_size, self.out_conv_dim)
1957
+ self.activation = nn.ReLU()
1958
+
1959
+ def forward(self, hidden_states):
1960
+ hidden_states = hidden_states.unsqueeze(1)
1961
+ hidden_states = nn.functional.unfold(
1962
+ hidden_states,
1963
+ (self.kernel_size, self.in_conv_dim),
1964
+ stride=(1, self.in_conv_dim),
1965
+ dilation=(self.dilation, 1),
1966
+ )
1967
+ hidden_states = hidden_states.transpose(1, 2)
1968
+ hidden_states = self.kernel(hidden_states)
1969
+
1970
+ hidden_states = self.activation(hidden_states)
1971
+ return hidden_states
1972
+
1973
+
1974
+ @add_start_docstrings(
1975
+ """
1976
+ Wav2Vec2Conformer Model with an XVector feature extraction head on top for tasks like Speaker Verification.
1977
+ """,
1978
+ WAV2VEC2_CONFORMER_START_DOCSTRING,
1979
+ )
1980
+ class Wav2Vec2ConformerForXVector(Wav2Vec2ConformerPreTrainedModel):
1981
+ def __init__(self, config):
1982
+ super().__init__(config)
1983
+
1984
+ self.wav2vec2_conformer = Wav2Vec2ConformerModel(config)
1985
+ num_layers = config.num_hidden_layers + 1 # transformer layers + input embeddings
1986
+ if config.use_weighted_layer_sum:
1987
+ self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
1988
+ self.projector = nn.Linear(config.hidden_size, config.tdnn_dim[0])
1989
+
1990
+ tdnn_layers = [TDNNLayer(config, i) for i in range(len(config.tdnn_dim))]
1991
+ self.tdnn = nn.ModuleList(tdnn_layers)
1992
+
1993
+ self.feature_extractor = nn.Linear(config.tdnn_dim[-1] * 2, config.xvector_output_dim)
1994
+ self.classifier = nn.Linear(config.xvector_output_dim, config.xvector_output_dim)
1995
+
1996
+ self.objective = AMSoftmaxLoss(config.xvector_output_dim, config.num_labels)
1997
+
1998
+ self.init_weights()
1999
+
2000
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForXVector.freeze_feature_encoder with wav2vec2->wav2vec2_conformer
2001
+ def freeze_feature_encoder(self):
2002
+ """
2003
+ Calling this function will disable the gradient computation for the feature encoder so that its parameter will
2004
+ not be updated during training.
2005
+ """
2006
+ self.wav2vec2_conformer.feature_extractor._freeze_parameters()
2007
+
2008
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForXVector.freeze_base_model with wav2vec2->wav2vec2_conformer
2009
+ def freeze_base_model(self):
2010
+ """
2011
+ Calling this function will disable the gradient computation for the base model so that its parameters will not
2012
+ be updated during training. Only the classification head will be updated.
2013
+ """
2014
+ for param in self.wav2vec2_conformer.parameters():
2015
+ param.requires_grad = False
2016
+
2017
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForXVector._get_tdnn_output_lengths with wav2vec2->wav2vec2_conformer
2018
+ def _get_tdnn_output_lengths(self, input_lengths: Union[torch.LongTensor, int]):
2019
+ """
2020
+ Computes the output length of the TDNN layers
2021
+ """
2022
+
2023
+ def _conv_out_length(input_length, kernel_size, stride):
2024
+ # 1D convolutional layer output length formula taken
2025
+ # from https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
2026
+ return (input_length - kernel_size) // stride + 1
2027
+
2028
+ for kernel_size in self.config.tdnn_kernel:
2029
+ input_lengths = _conv_out_length(input_lengths, kernel_size, 1)
2030
+
2031
+ return input_lengths
2032
+
2033
+ @add_start_docstrings_to_model_forward(WAV2VEC2_CONFORMER_INPUTS_DOCSTRING)
2034
+ @add_code_sample_docstrings(
2035
+ checkpoint=_CHECKPOINT_FOR_DOC,
2036
+ output_type=XVectorOutput,
2037
+ config_class=_CONFIG_FOR_DOC,
2038
+ modality="audio",
2039
+ )
2040
+ # Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForXVector.forward with Wav2Vec2->Wav2Vec2Conformer,wav2vec2->wav2vec2_conformer,WAV_2_VEC_2->WAV2VEC2_CONFORMER
2041
+ def forward(
2042
+ self,
2043
+ input_values: Optional[torch.Tensor],
2044
+ attention_mask: Optional[torch.Tensor] = None,
2045
+ output_attentions: Optional[bool] = None,
2046
+ output_hidden_states: Optional[bool] = None,
2047
+ return_dict: Optional[bool] = None,
2048
+ labels: Optional[torch.Tensor] = None,
2049
+ ) -> Union[Tuple, XVectorOutput]:
2050
+ r"""
2051
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
2052
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
2053
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
2054
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
2055
+ """
2056
+
2057
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
2058
+ output_hidden_states = True if self.config.use_weighted_layer_sum else output_hidden_states
2059
+
2060
+ outputs = self.wav2vec2_conformer(
2061
+ input_values,
2062
+ attention_mask=attention_mask,
2063
+ output_attentions=output_attentions,
2064
+ output_hidden_states=output_hidden_states,
2065
+ return_dict=return_dict,
2066
+ )
2067
+
2068
+ if self.config.use_weighted_layer_sum:
2069
+ hidden_states = outputs[_HIDDEN_STATES_START_POSITION]
2070
+ hidden_states = torch.stack(hidden_states, dim=1)
2071
+ norm_weights = nn.functional.softmax(self.layer_weights, dim=-1)
2072
+ hidden_states = (hidden_states * norm_weights.view(-1, 1, 1)).sum(dim=1)
2073
+ else:
2074
+ hidden_states = outputs[0]
2075
+
2076
+ hidden_states = self.projector(hidden_states)
2077
+
2078
+ for tdnn_layer in self.tdnn:
2079
+ hidden_states = tdnn_layer(hidden_states)
2080
+
2081
+ # Statistic Pooling
2082
+ if attention_mask is None:
2083
+ mean_features = hidden_states.mean(dim=1)
2084
+ std_features = hidden_states.std(dim=1)
2085
+ else:
2086
+ feat_extract_output_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(dim=1))
2087
+ tdnn_output_lengths = self._get_tdnn_output_lengths(feat_extract_output_lengths)
2088
+ mean_features = []
2089
+ std_features = []
2090
+ for i, length in enumerate(tdnn_output_lengths):
2091
+ mean_features.append(hidden_states[i, :length].mean(dim=0))
2092
+ std_features.append(hidden_states[i, :length].std(dim=0))
2093
+ mean_features = torch.stack(mean_features)
2094
+ std_features = torch.stack(std_features)
2095
+ statistic_pooling = torch.cat([mean_features, std_features], dim=-1)
2096
+
2097
+ output_embeddings = self.feature_extractor(statistic_pooling)
2098
+ logits = self.classifier(output_embeddings)
2099
+
2100
+ loss = None
2101
+ if labels is not None:
2102
+ loss = self.objective(logits, labels)
2103
+
2104
+ if not return_dict:
2105
+ output = (logits, output_embeddings) + outputs[_HIDDEN_STATES_START_POSITION:]
2106
+ return ((loss,) + output) if loss is not None else output
2107
+
2108
+ return XVectorOutput(
2109
+ loss=loss,
2110
+ logits=logits,
2111
+ embeddings=output_embeddings,
2112
+ hidden_states=outputs.hidden_states,
2113
+ attentions=outputs.attentions,
2114
+ )
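The forward pass above ends with statistic pooling: the per-utterance mean and standard deviation of the TDNN frames are concatenated before the x-vector projection. A minimal stand-alone sketch of the masked variant, with illustrative names only:

```python
import torch

def masked_statistic_pooling(hidden_states: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Concatenate per-example mean and std over the valid frames only.

    hidden_states: (batch, frames, dim); lengths: (batch,) valid frame counts.
    Mirrors the pooling step in Wav2Vec2ConformerForXVector.forward above.
    """
    pooled = []
    for i, length in enumerate(lengths.tolist()):
        valid = hidden_states[i, :length]
        pooled.append(torch.cat([valid.mean(dim=0), valid.std(dim=0)]))
    return torch.stack(pooled)

feats = torch.randn(2, 7, 4)
print(masked_statistic_pooling(feats, torch.tensor([7, 5])).shape)  # torch.Size([2, 8])
```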
src/third_party/MuQ/src/muq/muq/modules/random_quantizer.py ADDED
@@ -0,0 +1,68 @@
1
+ import torch
2
+ from torch import nn, einsum
3
+ from einops import rearrange
4
+
5
+
6
+ class RandomProjectionQuantizer(nn.Module):
7
+ """
8
+ Random projection and codebook lookup module
9
+
10
+ Some code is borrowed from:
11
+ https://github.com/lucidrains/vector-quantize-pytorch/blob/master/vector_quantize_pytorch/random_projection_quantizer.py
12
+ Normalization here uses a pre-computed global mean & variance instead of layer norm.
13
+ """
14
+
15
+ def __init__(
16
+ self,
17
+ input_dim,
18
+ codebook_dim,
19
+ codebook_size,
20
+ seed=142,
21
+ ):
22
+ super().__init__()
23
+
24
+ # random seed
25
+ torch.manual_seed(seed)
26
+
27
+ # randomly initialized projection
28
+ random_projection = torch.empty(input_dim, codebook_dim)
29
+ nn.init.xavier_normal_(random_projection)
30
+ self.register_buffer("random_projection", random_projection)
31
+
32
+ # randomly initialized codebook
33
+ codebook = torch.empty(codebook_size, codebook_dim)
34
+ nn.init.normal_(codebook)
35
+ self.register_buffer("codebook", codebook)
36
+
37
+ def codebook_lookup(self, x):
38
+ # reshape
39
+ b = x.shape[0]
40
+ x = rearrange(x, "b n e -> (b n) e")
41
+
42
+ # L2 normalization
43
+ normalized_x = nn.functional.normalize(x, dim=1, p=2)
44
+ normalized_codebook = nn.functional.normalize(self.codebook, dim=1, p=2)
45
+
46
+ # compute distances
47
+ distances = torch.cdist(normalized_codebook, normalized_x)
48
+
49
+ # get nearest
50
+ nearest_indices = torch.argmin(distances, dim=0)
51
+
52
+ # reshape
53
+ xq = rearrange(nearest_indices, "(b n) -> b n", b=b)
54
+
55
+ return xq
56
+
57
+ @torch.no_grad()
58
+ def forward(self, x):
59
+ # always eval
60
+ self.eval()
61
+
62
+ # random projection [batch, length, input_dim] -> [batch, length, codebook_dim]
63
+ x = einsum("b n d, d e -> b n e", x, self.random_projection)
64
+
65
+ # codebook lookup
66
+ xq = self.codebook_lookup(x)
67
+
68
+ return xq
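A hedged usage sketch for the module above: project features through the frozen random matrix and read back discrete token ids. The dimensions are illustrative, and the import path assumes the package layout of this repository.

```python
import torch
from muq.muq.modules.random_quantizer import RandomProjectionQuantizer  # assumed import path

quantizer = RandomProjectionQuantizer(input_dim=128, codebook_dim=16, codebook_size=4096)

features = torch.randn(2, 250, 128)   # (batch, frames, input_dim)
token_ids = quantizer(features)       # (batch, frames), values in [0, 4096)
print(token_ids.shape)
```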
src/third_party/MuQ/src/muq/muq/modules/rvq.py ADDED
@@ -0,0 +1,314 @@
1
+
2
+ from typing import Union
3
+
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn as nn
7
+ import torch.nn.functional as F
8
+ from einops import rearrange
9
+ try:
10
+ from torch.nn.utils import weight_norm
11
+ except:
12
+ try:
13
+ from torch.nn.utils.parametrizations import weight_norm
14
+ except:
15
+ from torch.nn.utils.parametrize import weight_norm
16
+
17
+ def WNConv1d(*args, **kwargs):
18
+ return weight_norm(nn.Conv1d(*args, **kwargs))
19
+
20
+
21
+ class VectorQuantize(nn.Module):
22
+ """
23
+ Implementation of VQ similar to Karpathy's repo:
24
+ https://github.com/karpathy/deep-vector-quantization
25
+ Additionally uses following tricks from Improved VQGAN
26
+ (https://arxiv.org/pdf/2110.04627.pdf):
27
+ 1. Factorized codes: Perform nearest neighbor lookup in low-dimensional space
28
+ for improved codebook usage
29
+ 2. l2-normalized codes: Converts euclidean distance to cosine similarity which
30
+ improves training stability
31
+ """
32
+
33
+ def __init__(self, input_dim: int, codebook_size: int, codebook_dim: int, stale_tolerance: int = 1000, mfcc_clustering=False, n_layer=1):
34
+ super().__init__()
35
+ self.codebook_size = codebook_size
36
+ self.codebook_dim = codebook_dim
37
+ self.mfcc_clustering = mfcc_clustering
38
+
39
+ ProjClass = nn.Identity if mfcc_clustering else WNConv1d
40
+ if n_layer==1:
41
+ self.in_proj = ProjClass(input_dim, codebook_dim, kernel_size=1)
42
+ self.out_proj = ProjClass(codebook_dim, input_dim, kernel_size=1)
43
+ elif n_layer >= 2:
44
+ ndim_hidden = 128
45
+ self.in_proj = nn.Sequential(
46
+ ProjClass(input_dim, ndim_hidden, kernel_size=1),
47
+ *[nn.Sequential(nn.ReLU(), ProjClass(ndim_hidden, ndim_hidden, kernel_size=1),) for _ in range(n_layer-2)],
48
+ nn.ReLU(),
49
+ ProjClass(ndim_hidden, codebook_dim, kernel_size=1)
50
+ )
51
+ self.out_proj = nn.Sequential(
52
+ ProjClass(codebook_dim, ndim_hidden, kernel_size=1),
53
+ nn.ReLU(),
54
+ *[nn.Sequential(ProjClass(ndim_hidden, ndim_hidden, kernel_size=1), nn.ReLU()) for _ in range(n_layer-2)],
55
+ ProjClass(ndim_hidden, input_dim, kernel_size=1),
56
+ )
57
+ self.codebook = nn.Embedding(codebook_size, codebook_dim)
58
+ self.register_buffer("stale_counter", torch.zeros(self.codebook_size,))
59
+ self.stale_tolerance = stale_tolerance
60
+
61
+ def forward(self, z):
62
+ """Quantized the input tensor using a fixed codebook and returns
63
+ the corresponding codebook vectors
64
+
65
+ Parameters
66
+ ----------
67
+ z : Tensor[B x D x T]
68
+
69
+ Returns
70
+ -------
71
+ Tensor[B x D x T]
72
+ Quantized continuous representation of input
73
+ Tensor[1]
74
+ Commitment loss to train encoder to predict vectors closer to codebook
75
+ entries
76
+ Tensor[1]
77
+ Codebook loss to update the codebook
78
+ Tensor[B x T]
79
+ Codebook indices (quantized discrete representation of input)
80
+ Tensor[B x D x T]
81
+ Projected latents (continuous representation of input before quantization)
82
+ """
83
+
84
+ # Factorized codes (ViT-VQGAN) Project input into low-dimensional space
85
+
86
+ z_e = self.in_proj(z) # z_e : (B x D x T)
87
+ z_q, indices = self.decode_latents(z_e)
88
+
89
+ commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
90
+ codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])
91
+
92
+ z_q = (
93
+ z_e + (z_q - z_e).detach()
94
+ ) # noop in forward pass, straight-through gradient estimator in backward pass
95
+
96
+ z_q = self.out_proj(z_q)
97
+
98
+ return z_q, commitment_loss, codebook_loss, indices, z_e
99
+
100
+ def embed_code(self, embed_id):
101
+ return F.embedding(embed_id, self.codebook.weight)
102
+
103
+ def decode_code(self, embed_id):
104
+ return self.embed_code(embed_id).transpose(1, 2)
105
+
106
+ def decode_latents(self, latents):
107
+ encodings = rearrange(latents, "b d t -> (b t) d")
108
+ codebook = self.codebook.weight # codebook: (N x D)
109
+
110
+ # L2 normalize encodings and codebook (ViT-VQGAN)
111
+ encodings = F.normalize(encodings)
112
+ codebook = F.normalize(codebook)
113
+
114
+ # Compute euclidean distance with codebook
115
+ dist = (
116
+ encodings.pow(2).sum(1, keepdim=True)
117
+ - 2 * encodings @ codebook.t()
118
+ + codebook.pow(2).sum(1, keepdim=True).t()
119
+ )
120
+ indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
121
+ z_q = self.decode_code(indices)
122
+
123
+ if(self.training):
124
+ onehots = torch.nn.functional.one_hot(indices, self.codebook_size).float() # B, T, codebook_size
125
+ stale_codes = (onehots.sum(0).sum(0) == 0).float()
126
+ self.stale_counter = self.stale_counter * stale_codes + stale_codes
127
+
128
+ # random replace codes that haven't been used for a while
129
+ replace_code = (self.stale_counter == self.stale_tolerance).float() # codebook_size
130
+ if replace_code.sum(-1) > 0:
131
+ print("Replace {} codes".format(replace_code.sum(-1)))
132
+ random_input_idx = torch.randperm(encodings.shape[0])
133
+ random_input = encodings[random_input_idx].view(encodings.shape)
134
+ if random_input.shape[0] < self.codebook_size:
135
+ random_input = torch.cat([random_input]*(self.codebook_size // random_input.shape[0] + 1), 0)
136
+ random_input = random_input[:self.codebook_size,:].contiguous() # codebook_size, dim
137
+
138
+ self.codebook.weight.data = self.codebook.weight.data * (1 - replace_code).unsqueeze(-1) + random_input * replace_code.unsqueeze(-1)
139
+ self.stale_counter = self.stale_counter * (1 - replace_code)
140
+
141
+ return z_q, indices
142
+
143
+
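The `z_e + (z_q - z_e).detach()` line above is the straight-through estimator: the quantized value is used in the forward pass while gradients flow to `z_e` as if quantization were the identity. A self-contained sketch, with rounding standing in for the codebook lookup:

```python
import torch

z_e = torch.randn(1, 8, 10, requires_grad=True)  # encoder output (B, D, T)
z_q = torch.round(z_e)                           # stand-in for the codebook lookup
z_st = z_e + (z_q - z_e).detach()                # forward: z_q, backward: identity w.r.t. z_e
z_st.sum().backward()
assert torch.allclose(z_e.grad, torch.ones_like(z_e))
```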
144
+ class ResidualVectorQuantize(nn.Module):
145
+ """
146
+ Introduced in SoundStream: An end2end neural audio codec
147
+ https://arxiv.org/abs/2107.03312
148
+ """
149
+
150
+ def __init__(
151
+ self,
152
+ input_dim: int = 512,
153
+ n_codebooks: int = 9,
154
+ codebook_size: int = 1024,
155
+ codebook_dim: Union[int, list] = 8,
156
+ quantizer_dropout: float = 0.0,
157
+ stale_tolerance: int = 100,
158
+ use_multi_layer_num:int = 1,
159
+ ):
160
+ super().__init__()
161
+ if isinstance(codebook_dim, int):
162
+ codebook_dim = [codebook_dim for _ in range(n_codebooks)]
163
+
164
+ self.n_codebooks = n_codebooks
165
+ self.codebook_dim = codebook_dim
166
+ self.codebook_size = codebook_size
167
+
168
+ self.quantizers = nn.ModuleList(
169
+ [
170
+ VectorQuantize(input_dim, codebook_size, codebook_dim[i], stale_tolerance=stale_tolerance, n_layer=use_multi_layer_num)
171
+ for i in range(n_codebooks)
172
+ ]
173
+ )
174
+ self.quantizer_dropout = quantizer_dropout
175
+
176
+ def forward(self, z, n_quantizers: int = None):
177
+ """Quantizes the input tensor using a fixed set of `n` codebooks and returns
178
+ the corresponding codebook vectors
179
+ Parameters
180
+ ----------
181
+ z : Tensor[B x D x T]
182
+ n_quantizers : int, optional
183
+ No. of quantizers to use
184
+ (n_quantizers < self.n_codebooks ex: for quantizer dropout)
185
+ Note: if `self.quantizer_dropout` is True, this argument is ignored
186
+ when in training mode, and a random number of quantizers is used.
187
+ Returns
188
+ -------
189
+ dict
190
+ A dictionary with the following keys:
191
+
192
+ "z" : Tensor[B x D x T]
193
+ Quantized continuous representation of input
194
+ "codes" : Tensor[B x N x T]
195
+ Codebook indices for each codebook
196
+ (quantized discrete representation of input)
197
+ "latents" : Tensor[B x N*D x T]
198
+ Projected latents (continuous representation of input before quantization)
199
+ "vq/commitment_loss" : Tensor[1]
200
+ Commitment loss to train encoder to predict vectors closer to codebook
201
+ entries
202
+ "vq/codebook_loss" : Tensor[1]
203
+ Codebook loss to update the codebook
204
+ """
205
+ z_q = 0
206
+ residual = z
207
+ commitment_loss = 0
208
+ codebook_loss = 0
209
+
210
+ codebook_indices = []
211
+ latents = []
212
+
213
+ if n_quantizers is None:
214
+ n_quantizers = self.n_codebooks
215
+ if self.training:
216
+ n_quantizers = torch.ones((z.shape[0],)) * self.n_codebooks + 1
217
+ dropout = torch.randint(1, self.n_codebooks + 1, (z.shape[0],))
218
+ n_dropout = int(z.shape[0] * self.quantizer_dropout)
219
+ n_quantizers[:n_dropout] = dropout[:n_dropout]
220
+ n_quantizers = n_quantizers.to(z.device)
221
+ else:
222
+ n_quantizers = torch.ones((z.shape[0],)) * n_quantizers + 1
223
+ n_quantizers = n_quantizers.to(z.device)
224
+
225
+ for i, quantizer in enumerate(self.quantizers):
226
+ # if self.training is False and i >= n_quantizers:
227
+ # break
228
+
229
+ z_q_i, commitment_loss_i, codebook_loss_i, indices_i, z_e_i = quantizer(
230
+ residual
231
+ )
232
+
233
+ # Create mask to apply quantizer dropout
234
+ mask = (
235
+ torch.full((z.shape[0],), fill_value=i, device=z.device) < n_quantizers
236
+ )
237
+ z_q = z_q + z_q_i * mask[:, None, None]
238
+ residual = residual - z_q_i
239
+
240
+ # Sum losses
241
+ commitment_loss += (commitment_loss_i * mask).mean()
242
+ codebook_loss += (codebook_loss_i * mask).mean()
243
+
244
+ codebook_indices.append(indices_i)
245
+ latents.append(z_e_i)
246
+
247
+ codes = torch.stack(codebook_indices, dim=1)
248
+ latents = torch.cat(latents, dim=1)
249
+
250
+ encodings = F.one_hot(codes, self.codebook_size).float() # B N T 1024
251
+
252
+ return z_q, codes, latents, commitment_loss, codebook_loss, n_quantizers.clamp(max=self.n_codebooks).long() - 1
253
+
254
+ def get_loss(self, x, quantized_prompt_embeds, commitment_loss, codebook_loss):
255
+ final_loss = commitment_loss * 0.25 + codebook_loss * 1.0 + (x - quantized_prompt_embeds).abs().mean()
256
+ return final_loss
257
+
258
+ def from_codes(self, codes: torch.Tensor):
259
+ """Given the quantized codes, reconstruct the continuous representation
260
+ Parameters
261
+ ----------
262
+ codes : Tensor[B x N x T]
263
+ Quantized discrete representation of input
264
+ Returns
265
+ -------
266
+ Tensor[B x D x T]
267
+ Quantized continuous representation of input
268
+ """
269
+ z_q = 0.0
270
+ z_p = []
271
+ n_codebooks = codes.shape[1]
272
+ for i in range(n_codebooks):
273
+ z_p_i = self.quantizers[i].decode_code(codes[:, i, :])
274
+ z_p.append(z_p_i)
275
+
276
+ z_q_i = self.quantizers[i].out_proj(z_p_i)
277
+ z_q = z_q + z_q_i
278
+ return z_q, torch.cat(z_p, dim=1), codes
279
+
280
+ def from_latents(self, latents: torch.Tensor):
281
+ """Given the unquantized latents, reconstruct the
282
+ continuous representation after quantization.
283
+
284
+ Parameters
285
+ ----------
286
+ latents : Tensor[B x N x T]
287
+ Continuous representation of input after projection
288
+
289
+ Returns
290
+ -------
291
+ Tensor[B x D x T]
292
+ Quantized representation of full-projected space
293
+ Tensor[B x D x T]
294
+ Quantized representation of latent space
295
+ """
296
+ z_q = 0
297
+ z_p = []
298
+ codes = []
299
+ dims = np.cumsum([0] + [q.codebook_dim for q in self.quantizers])
300
+
301
+ n_codebooks = np.where(dims <= latents.shape[1])[0].max(axis=0, keepdims=True)[
302
+ 0
303
+ ]
304
+ for i in range(n_codebooks):
305
+ j, k = dims[i], dims[i + 1]
306
+ z_p_i, codes_i = self.quantizers[i].decode_latents(latents[:, j:k, :])
307
+ z_p.append(z_p_i)
308
+ codes.append(codes_i)
309
+
310
+ z_q_i = self.quantizers[i].out_proj(z_p_i)
311
+ z_q = z_q + z_q_i
312
+
313
+ return z_q, torch.cat(z_p, dim=1), torch.stack(codes, dim=1)
314
+
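A hedged usage sketch for `ResidualVectorQuantize` as defined above; the sizes are illustrative and the checkpoint-free instantiation is only meant to show the shapes of the six returned values and the `from_codes` round trip.

```python
import torch

rvq = ResidualVectorQuantize(input_dim=1024, n_codebooks=8, codebook_size=1024, codebook_dim=16)
rvq.eval()

z = torch.randn(2, 1024, 50)  # (B, D, T) continuous features
with torch.no_grad():
    z_q, codes, latents, commit_loss, cb_loss, n_used = rvq(z)

print(z_q.shape)    # torch.Size([2, 1024, 50]) quantized features
print(codes.shape)  # torch.Size([2, 8, 50]) one code index per codebook per frame

recon, _, _ = rvq.from_codes(codes)  # decode discrete codes back to the feature space
```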
src/third_party/MuQ/src/muq/muq/muq.py ADDED
@@ -0,0 +1,90 @@
1
+ import torch.nn as nn
2
+ import torch
3
+ from .models.muq_model import MuQModel
4
+ from dataclasses import dataclass, field
5
+ from typing import List, Optional
6
+ from transformers.modeling_outputs import BaseModelOutput
7
+ from huggingface_hub import PyTorchModelHubMixin
8
+
9
+ @dataclass
10
+ class MuQConfig:
11
+ label_rate:int = field(default=25)
12
+ num_codebooks:int = field(default=1)
13
+ codebook_dim:int = field(default=16)
14
+ codebook_size:int = field(default=4096)
15
+ features:List[str] = field(default_factory=lambda:["melspec_2048"])
16
+ hop_length:int = field(default=240)
17
+ n_mels:int = field(default=128)
18
+ conv_dim:int = field(default=512)
19
+ encoder_dim:int = field(default=1024)
20
+ encoder_depth:int = field(default=12)
21
+ mask_hop:float = field(default=0.4)
22
+ mask_prob:float = field(default=0.6)
23
+ is_flash:bool = field(default=False)
24
+ stat:Optional[dict] = field(default_factory=dict)
25
+ w2v2_config:Optional[dict] = field(default_factory=dict)
26
+ use_rvq_target:bool = field(default=False)
27
+ use_vq_target:bool = field(default=False)
28
+ use_encodec_target:bool = field(default=False)
29
+ rvq_ckpt_path: Optional[str] = field(default=None)
30
+ recon_loss_ratio: Optional[float] = field(default=None)
31
+ resume_checkpoint: Optional[str] = None
32
+ rvq_n_codebooks:int = field(default=8)
33
+ rvq_multi_layer_num:int = field(default=1)
34
+
35
+ class MuQ(nn.Module, PyTorchModelHubMixin):
36
+ def __init__(self, config: MuQConfig):
37
+ super().__init__()
38
+ if isinstance(config, dict):
39
+ config = MuQConfig(**config)
40
+ self.config = config
41
+ self.model = MuQModel(
42
+ num_codebooks=config.num_codebooks,
43
+ codebook_dim=config.codebook_dim,
44
+ codebook_size=config.codebook_size,
45
+ features=config.features,
46
+ hop_length=config.hop_length,
47
+ n_mels=config.n_mels,
48
+ conv_dim=config.conv_dim,
49
+ encoder_dim=config.encoder_dim,
50
+ encoder_depth=config.encoder_depth,
51
+ mask_hop=config.mask_hop,
52
+ mask_prob=config.mask_prob,
53
+ is_flash=config.is_flash,
54
+ stat=config.stat,
55
+ w2v2_config=config.w2v2_config,
56
+ use_rvq_target=config.use_rvq_target,
57
+ use_vq_target=config.use_vq_target,
58
+ use_encodec_target=config.use_encodec_target,
59
+ rvq_ckpt_path=config.rvq_ckpt_path,
60
+ recon_loss_ratio=config.recon_loss_ratio,
61
+ label_rate=config.label_rate,
62
+ rvq_n_codebooks=config.rvq_n_codebooks,
63
+ rvq_multi_layer_num=config.rvq_multi_layer_num,
64
+ )
65
+
66
+ def forward(self, x, attention_mask:Optional[torch.Tensor]=None, output_hidden_states:bool=True) ->BaseModelOutput:
67
+ """
68
+ Forward pass through the MuQ model to extract features.
69
+
70
+ Args:
71
+ x (torch.Tensor): Input waveform tensor of shape (batch_size, time).
72
+ attention_mask (torch.Tensor, optional): Mask to avoid performing attention on padding token indices.
73
+ Default is None.
74
+ output_hidden_states (bool, optional): Whether to return all hidden states or only the last one.
75
+ Default is True.
76
+
77
+ Returns:
78
+ BaseModelOutput: An object containing the last hidden state and optionally all hidden states.
79
+ - last_hidden_state (torch.Tensor): The last hidden state of the model, i.e. extracted MuQ features, of shape (batch_size, sequence_length, hidden_size).
80
+ - hidden_states (tuple(torch.Tensor), optional): A tuple containing all hidden states produced by the model,
81
+ each of shape (batch_size, sequence_length, hidden_size). Only returned if output_hidden_states is True.
82
+ """
83
+ _, hidden_states = self.model.get_predictions(x, attention_mask=attention_mask, is_features_only=True)
84
+ last_hidden_state = hidden_states[-1]
85
+ if not output_hidden_states:
86
+ return BaseModelOutput(last_hidden_state=last_hidden_state)
87
+ return BaseModelOutput(
88
+ last_hidden_state=last_hidden_state,
89
+ hidden_states=hidden_states
90
+ )
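Since `MuQ` mixes in `PyTorchModelHubMixin`, a hosted checkpoint can in principle be loaded with `from_pretrained`; the repository id and the 24 kHz sample rate below are assumptions for illustration, not guaranteed by this file.

```python
import torch

muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")  # assumed checkpoint id
muq.eval()

wav = torch.randn(1, 24000 * 10)  # ~10 s of audio, assuming a 24 kHz model sample rate
with torch.no_grad():
    out = muq(wav, output_hidden_states=True)

print(out.last_hidden_state.shape)  # (1, frames, encoder_dim)
print(len(out.hidden_states))       # embeddings from every layer
```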
src/third_party/MuQ/src/muq/muq_mulan/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .muq_mulan import MuQMuLan, MuQMuLanConfig, MuLanConfig, ModalModelConfig, TextTransformerConfig, AudioTransformerConfig
src/third_party/MuQ/src/muq/muq_mulan/models/__init__.py ADDED
File without changes