---
license: mit
library_name: mlx
tags:
  - mlx
  - audio
  - speech
  - feature-extraction
  - contentvec
  - hubert
  - voice-conversion
  - rvc
datasets:
  - librispeech_asr
language:
  - en
pipeline_tag: feature-extraction
---

# MLX ContentVec / HuBERT Base

MLX-converted weights for the ContentVec/HuBERT base model, optimized for Apple Silicon.

The model extracts speaker-agnostic semantic features from audio and is primarily used as the feature-extraction backbone for RVC (Retrieval-based Voice Conversion).

## Model Details

- **Architecture:** HuBERT Base (12 transformer layers)
- **Parameters:** ~90M
- **Input:** 16 kHz mono audio
- **Output:** 768-dimensional features (~50 frames/second; see the sketch below)
- **Framework:** MLX
- **Format:** SafeTensors (float32)
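
The ~50 frames/second figure follows from HuBERT's convolutional front end, whose layer strides multiply to a hop of 320 samples, i.e. 16000 / 320 = 50 frames per second. A quick sketch of the expected output length (the exact count may differ by a frame or two depending on convolution padding):

```python
# HuBERT's CNN front end has a total stride of 320 samples,
# so at 16 kHz the feature rate is 16000 / 320 = 50 frames/second.
SAMPLE_RATE = 16000
HOP_SAMPLES = 320

def expected_frames(num_samples: int) -> int:
    return num_samples // HOP_SAMPLES

# A 3-second clip yields ~150 frames of 768-dim features.
print(expected_frames(3 * SAMPLE_RATE))  # 150
```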

## Usage

```python
import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16 kHz mono
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```
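
The returned MLX array converts directly to NumPy when downstream tooling needs it; a small follow-up to the snippet above:

```python
import numpy as np

# MLX arrays convert directly to NumPy, e.g. for an RVC index
# lookup or for inspection/plotting.
feats_np = np.array(features)  # (1, num_frames, 768), float32
print(feats_np.shape, feats_np.dtype)
```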

## Installation

```bash
pip install git+https://github.com/example/mlx-contentvec.git
```

## Download Weights

```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors"
)
```
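
The two snippets above compose naturally; a minimal end-to-end sketch, reusing the `ContentvecModel` API from the Usage section:

```python
from huggingface_hub import hf_hub_download
from mlx_contentvec import ContentvecModel

# Fetch the checkpoint (cached locally after the first call)
weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)

model = ContentvecModel(encoder_layers_1=0)
model.load_weights(weights_path)
model.eval()
```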

## Validation

These weights produce outputs that match the original PyTorch implementation to within floating-point tolerance:

| Metric                  | Value    |
| ----------------------- | -------- |
| Max absolute difference | 7.3e-6   |
| Cosine similarity       | 1.000000 |
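
For reference, both metrics are straightforward to compute from saved feature arrays; a minimal sketch, assuming the MLX and PyTorch features for the same clip were each dumped to a hypothetical `.npy` file:

```python
import numpy as np

# Hypothetical dumps of the same clip's features from each backend.
mlx_feats = np.load("features_mlx.npy").reshape(-1)
torch_feats = np.load("features_torch.npy").reshape(-1)

max_abs_diff = np.max(np.abs(mlx_feats - torch_feats))
cos_sim = np.dot(mlx_feats, torch_feats) / (
    np.linalg.norm(mlx_feats) * np.linalg.norm(torch_feats)
)
print(f"max abs diff: {max_abs_diff:.1e}, cosine similarity: {cos_sim:.6f}")
```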

## Source Weights

Converted from `hubert_base.pt` (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
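
If you are converting from the original checkpoint yourself, the checksum can be verified first; a small sketch using only the standard library (the local filename is an assumption):

```python
import hashlib

# Verify the original PyTorch checkpoint before conversion.
with open("hubert_base.pt", "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
assert digest == "b76f784c1958d4e535cd0f6151ca35e4", digest
```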

## Use Cases

- **Voice Conversion:** Feature extraction for the RVC pipeline
- **Speaker Verification:** Content-based audio embeddings (see the pooling sketch below)
- **Speech Analysis:** Semantic feature extraction
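
For the embedding use cases, a common recipe is to mean-pool the frame-level features into a single utterance-level vector; a minimal sketch, assuming `features` from the Usage section:

```python
import mlx.core as mx

# Mean-pool the (1, num_frames, 768) frame features over time to get
# one 768-dim utterance embedding, then L2-normalize so dot products
# between embeddings behave as cosine similarities.
embedding = mx.mean(features, axis=1)  # (1, 768)
embedding = embedding / mx.linalg.norm(embedding, axis=-1, keepdims=True)
```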

## Citation

```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

MIT