---
language: fr
license: mit
tags:
- bert
- language-model
- flaubert
- french
- flaubert-base
- uncased
- asr
- speech
- oral
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
---

# FlauBERT-Oral models: Using ASR-Generated Text for Spoken Language Modeling

**FlauBERT-Oral** models are French BERT models trained on a very large corpus of automatically transcribed speech, drawn from 350,000 hours of diverse French TV shows. They were trained with the [**FlauBERT software**](https://github.com/getalp/Flaubert) using the same parameters as the [flaubert-base-uncased](https://huggingface.co/flaubert/flaubert_base_uncased) model (12 layers, 12 attention heads, 768 dimensions, 137M parameters, uncased).

## Available FlauBERT-Oral models

- `flaubert-oral-asr`: trained from scratch on ASR data, keeping the BPE tokenizer and vocabulary of flaubert-base-uncased
- `flaubert-oral-asr_nb`: trained from scratch on ASR data, with a BPE tokenizer trained on the same corpus
- `flaubert-oral-mixed`: trained from scratch on a mixed corpus of ASR and text data, with a BPE tokenizer trained on the same corpus
- `flaubert-oral-ft`: flaubert-base-uncased fine-tuned for a few epochs on ASR data

## Usage for sequence classification

```python
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

flaubert_tokenizer = FlaubertTokenizer.from_pretrained("nherve/flaubert-oral-asr")
flaubert_classif = FlaubertForSequenceClassification.from_pretrained("nherve/flaubert-oral-asr", num_labels=14)  # set num_labels to the number of classes in your task
# Summarize each sequence by averaging its token representations
flaubert_classif.sequence_summary.summary_type = 'mean'
# Then, train your model
```
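
The "train your model" step is left open above. As a rough illustration, the sketch below wraps the loading code in a helper that works for any of the four variants and adds a single gradient step. The names `VARIANTS`, `repo_id`, `load_classifier`, and `training_step` are hypothetical and not part of the FlauBERT-Oral release. The `nherve/` prefix is confirmed by the snippet above only for `flaubert-oral-asr` and is assumed for the other variants. This is a minimal sketch under those assumptions, not the authors' training procedure.

```python
# Variant names are taken from the model list above. Only
# "nherve/flaubert-oral-asr" appears in the original usage snippet; the
# "nherve/" prefix for the other three variants is an assumption.
VARIANTS = (
    "flaubert-oral-asr",
    "flaubert-oral-asr_nb",
    "flaubert-oral-mixed",
    "flaubert-oral-ft",
)


def repo_id(variant, namespace="nherve"):
    """Build a Hub repository id for a FlauBERT-Oral variant (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown FlauBERT-Oral variant: {variant!r}")
    return f"{namespace}/{variant}"


def load_classifier(variant, num_labels):
    """Load the tokenizer and a sequence-classification model for one variant."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

    name = repo_id(variant)
    tokenizer = FlaubertTokenizer.from_pretrained(name)
    model = FlaubertForSequenceClassification.from_pretrained(name, num_labels=num_labels)
    # Same pooling choice as the snippet above: average over token positions.
    model.sequence_summary.summary_type = "mean"
    return tokenizer, model


def training_step(tokenizer, model, optimizer, texts, labels):
    """Run one gradient step on a batch of transcripts and integer class labels."""
    import torch

    model.train()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()
```

A typical call would load one variant with `load_classifier("flaubert-oral-asr", num_labels=14)`, create an optimizer such as `torch.optim.AdamW(model.parameters(), lr=2e-5)`, and call `training_step` once per batch. The learning rate here is only a common starting point for BERT-style fine-tuning, not a value recommended by the source.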