dasheng-base-env-encoder
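A fine-tune of mispeech/dasheng-base (LoRA, with the adapter merged into the
backbone weights) that extracts environment/background embeddings from speech.
Trained on ChristianYang/Env-TTS-Clean with a conversation-level AAMSoftmax
(additive angular margin softmax) objective. This checkpoint was uploaded
automatically at training step 100000.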
Usage
import torch
from transformers import AutoModel, AutoFeatureExtractor
from huggingface_hub import hf_hub_download
repo = "ChristianYang/dasheng-base-env-encoder"
backbone = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
fe = AutoFeatureExtractor.from_pretrained(repo, trust_remote_code=True)
head = torch.load(hf_hub_download(repo, "head.pt"), map_location="cpu", weights_only=True)
# pipeline: 16 kHz wav -> fe -> mel -> backbone.encoder tokens [B,T,768]
# -> Conv1d x2 (head["cnn"]) -> masked attentive pooling (head["pool"]) -> 768-d embedding
Head weights (head.pt): two Conv1d(768, 768, kernel_size=3) layers followed by
attentive pooling (a Linear(768, 1) scorer). Token count = mel frames // 4.
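
The head described above could be sketched roughly as below. This is a minimal illustration, not the released implementation: the class name `EnvHead`, the `padding=1` setting, the ReLU between the two convolutions, and the example frame counts are all assumptions. The key layout inside `head["cnn"]` and `head["pool"]` is not documented here, so the sketch runs on random tensors instead of loading `head.pt`.

```python
import torch
import torch.nn as nn


class EnvHead(nn.Module):
    """Hypothetical head layout: Conv1d x2, then masked attentive pooling."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # ASSUMPTION: padding=1 (length-preserving) and the ReLU are guesses;
        # the card only states "two Conv1d(768,768,k=3) layers".
        self.cnn = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.pool = nn.Linear(dim, 1)  # one attention score per token

    def forward(self, tokens: torch.Tensor, token_lengths: torch.Tensor) -> torch.Tensor:
        # tokens: [B, T, 768]; token_lengths: [B] number of valid tokens per item
        x = self.cnn(tokens.transpose(1, 2)).transpose(1, 2)  # [B, T, 768]
        positions = torch.arange(x.size(1), device=x.device)
        mask = positions[None, :] < token_lengths[:, None]  # [B, T], True = valid
        # Padded positions get -inf so softmax assigns them zero weight.
        scores = self.pool(x).squeeze(-1).masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)  # [B, T, 1]
        return (weights * x).sum(dim=1)  # [B, 768]


torch.manual_seed(0)
mel_frames = torch.tensor([400, 250])  # hypothetical per-clip mel frame counts
token_lengths = mel_frames // 4  # token count = mel frames // 4
tokens = torch.randn(2, int(token_lengths.max()), 768)  # stand-in for backbone tokens
head_module = EnvHead().eval()
with torch.no_grad():
    emb = head_module(tokens, token_lengths)
print(emb.shape)  # torch.Size([2, 768])
```

To use the released weights, the tensors in `head.pt` would need to be mapped onto whatever module layout they were saved from; inspecting the loaded object's keys and shapes is the safest first step.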