🧠 Dixtral: BUT-FIT Diarization-Conditioned Voxtral for Target-Speaker ASR

This repository hosts Dixtral, developed by BUT Speech@FIT. Dixtral couples the Voxtral-Mini-3B spoken-language model with the DiCoW diarization-conditioned encoder, giving the LLM target-speaker awareness in multi-talker audio.

This checkpoint is tuned for target-speaker / multi-talker transcription (TS-ASR) of conversational and meeting recordings. For spoken question answering, use Dixtral_QA instead.

🛠️ Model Usage

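from transformers import AutoModel, AutoProcessor

MODEL_NAME = "BUT-FIT/Dixtral"
# trust_remote_code=True lets transformers run the custom model code shipped with the checkpoint.
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
# The processor prepares the audio and text inputs for the model.
processor = AutoProcessor.from_pretrained(MODEL_NAME)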

➡️ For full inference pipelines (diarization → FDDT masks → generation), see the Dixtral GitHub repository.
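To give a feel for the middle step of that pipeline, here is a minimal sketch of how frame-level diarization output could be turned into per-target-speaker conditioning masks. It assumes the four-class silence / target / non-target / overlap (STNO) formulation associated with DiCoW. The function name `stno_masks` and the input layout are hypothetical and are not taken from the Dixtral code base, so treat this as an illustration rather than the repository's actual implementation.

```python
from math import prod


def stno_masks(activity, target):
    """Hypothetical helper: derive STNO masks for one target speaker.

    activity: list of per-speaker lists, each holding frame-level speech
              probabilities in [0, 1] (e.g. from a diarization system).
    target:   index of the target speaker in `activity`.
    Returns four lists (silence, target, non_target, overlap), one value per
    frame; the four values of each frame sum to 1.
    """
    n_frames = len(activity[target])
    if any(len(row) != n_frames for row in activity):
        raise ValueError("all speakers must cover the same number of frames")

    silence, tgt, non_tgt, overlap = [], [], [], []
    for t in range(n_frames):
        p_target = activity[target][t]
        # Probability that none of the other speakers is active in this frame.
        p_others_silent = prod(
            1.0 - row[t] for i, row in enumerate(activity) if i != target
        )
        silence.append((1.0 - p_target) * p_others_silent)
        tgt.append(p_target * p_others_silent)
        non_tgt.append((1.0 - p_target) * (1.0 - p_others_silent))
        overlap.append(p_target * (1.0 - p_others_silent))
    return silence, tgt, non_tgt, overlap


# Two speakers, four frames, hard (0/1) diarization decisions.
activity = [
    [0.0, 1.0, 1.0, 0.0],  # speaker 0 (target)
    [0.0, 0.0, 1.0, 1.0],  # speaker 1
]
silence, tgt, non_tgt, overlap = stno_masks(activity, target=0)
```

With these inputs, frame 0 is pure silence, frame 1 is target-only speech, frame 2 is overlap, and frame 3 is non-target speech. Running the function once per speaker yields one mask set per target, which is the general idea behind transcribing each speaker separately from the same mixture.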


📦 Model Details


📬 Contact

📧 Email: ipoloka@fit.vut.cz
🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
🔗 GitHub: BUTSpeechFIT

Safetensors · Model size: 5B params · Tensor type: BF16