dasheng-base-env-encoder

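A fine-tune of mispeech/dasheng-base (LoRA, with the adapter weights merged into the backbone) that extracts environment/background embeddings from speech. It was trained on ChristianYang/Env-TTS-Clean with conversation-level AAMSoftmax (additive angular margin softmax). This checkpoint was uploaded automatically at training step 100000.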

Usage

import torch
from transformers import AutoModel, AutoFeatureExtractor
from huggingface_hub import hf_hub_download

repo = "ChristianYang/dasheng-base-env-encoder"
backbone = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
fe = AutoFeatureExtractor.from_pretrained(repo, trust_remote_code=True)
head = torch.load(hf_hub_download(repo, "head.pt"), map_location="cpu", weights_only=True)
# pipeline: 16 kHz wav -> fe -> mel -> backbone.encoder tokens [B,T,768]
# -> Conv1d x2 (head["cnn"]) -> masked attentive pooling (head["pool"]) -> 768-d embedding

Head weights (head.pt): two Conv1d(768,768,k=3) layers + attentive pooling (Linear(768,1)). Token count = mel frames // 4.
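The head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the author's training code: the names EnvHead, token_mask and masked_attentive_pool are hypothetical, padding=1 and the ReLU between the two convolutions are assumptions (the card gives only the kernel size and channel counts), and random tensors stand in for the backbone.encoder tokens.

```python
import torch
import torch.nn as nn


def token_mask(mel_frames: torch.Tensor, max_tokens: int) -> torch.Tensor:
    """Boolean mask [B, T]: True for valid tokens, using token count = mel frames // 4."""
    n_tokens = torch.div(mel_frames, 4, rounding_mode="floor")
    return torch.arange(max_tokens)[None, :] < n_tokens[:, None]


def masked_attentive_pool(x: torch.Tensor, mask: torch.Tensor, scorer: nn.Linear) -> torch.Tensor:
    """Softmax-weighted mean over valid tokens. x: [B, T, D], mask: [B, T] -> [B, D]."""
    scores = scorer(x).squeeze(-1)                      # one score per token, [B, T]
    scores = scores.masked_fill(~mask, float("-inf"))   # padded tokens get zero weight
    weights = torch.softmax(scores, dim=1)
    return (weights.unsqueeze(-1) * x).sum(dim=1)


class EnvHead(nn.Module):
    """Hypothetical head: two Conv1d(768, 768, k=3) layers + attentive pooling."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Assumptions: padding=1 keeps the token count unchanged, and a ReLU
        # sits between the convolutions. The card specifies neither.
        self.cnn = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.pool = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor, mel_frames: torch.Tensor) -> torch.Tensor:
        # Conv1d expects [B, C, T], so transpose in and out.
        x = self.cnn(tokens.transpose(1, 2)).transpose(1, 2)
        mask = token_mask(mel_frames, x.size(1))
        return masked_attentive_pool(x, mask, self.pool)


torch.manual_seed(0)
head = EnvHead().eval()
tokens = torch.randn(2, 25, 768)        # stand-in for backbone.encoder output [B, T, 768]
mel_frames = torch.tensor([100, 61])    # 100 // 4 = 25 tokens, 61 // 4 = 15 tokens
with torch.no_grad():
    emb = head(tokens, mel_frames)
print(emb.shape)  # torch.Size([2, 768])
```

To use the released weights instead of random ones, inspect the keys of head["cnn"] and head["pool"] first. The card does not document their layout, so the module structure above may need adjusting before load_state_dict will accept them.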

Model size: 85.4M params (Safetensors, tensor type F32).