RLAC Audio-transcription Segmenter - Chroniques (Random Forest)
Description
This model uses a Random Forest classifier to detect radio chronicle segments in textual transcriptions (SRT files). It is designed as a lightweight, efficient option for text-based segmentation.
Hugging Face link: eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest
Model Details
The model is a single-file classifier (.pkl) serialized with joblib. It operates by analyzing individual transcription segments and their immediate context.
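Because the model works on individual transcription segments, the input SRT file first has to be split into cues. The following is a minimal sketch of that step using only the standard library. `Segment`, `to_seconds`, and `parse_srt` are hypothetical names chosen for illustration; the project's own parsing code may differ.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:  # hypothetical container, not the project's own type
    index: int
    start: float  # seconds from the start of the file
    end: float
    text: str


_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIMESTAMP.match(stamp.strip())
    if match is None:
        raise ValueError(f"Unrecognised SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Segment]:
    """Split SRT text into cues; cues are separated by blank lines."""
    segments = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip malformed cues
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        segments.append(Segment(int(lines[0]), to_seconds(start), to_seconds(end), text))
    return segments


# Made-up two-cue transcript, for illustration only.
sample = """1
00:00:01,000 --> 00:00:04,500
Bonjour et bienvenue !

2
00:00:04,500 --> 00:00:09,000
Place à la chronique du jour.
"""
segments = parse_srt(sample)
```

Each resulting `Segment` carries the timing and text that the feature extraction step needs.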
1. Feature Engineering
The model extracts several key features from the SRT segments:
- Temporal Data: Precise air time and segment duration.
- Linguistic Statistics: Word count, character count, and average word length.
- Punctuation Analysis: Detection of question marks and exclamation points to identify rhetorical styles.
- TF-IDF Vectorization: Weights each word by how often it appears in a segment relative to how common it is across the transcripts, which highlights vocabulary characteristic of radio chronicles.
- Jingle Detection: Specific binary markers for identified radio jingles.
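The hand-crafted features listed above can be sketched as follows. This is an illustration under assumptions, not the model's actual extraction code: the function name `extract_features`, the feature keys, and the `JINGLE_MARKERS` tuple are all hypothetical, and "air time" is assumed to be the segment's start offset in seconds. The TF-IDF part is left out here; it would typically come from a vectorizer such as scikit-learn's `TfidfVectorizer`.

```python
import re

# Hypothetical jingle cue; the markers the model really uses are not documented here.
JINGLE_MARKERS = ("[jingle]",)


def extract_features(text: str, start: float, end: float) -> dict:
    """Compute simple temporal, linguistic, and punctuation features for one segment."""
    words = re.findall(r"\w+", text)
    word_count = len(words)
    lowered = text.lower()
    return {
        "air_time": start,  # assumed: start offset in seconds
        "duration": end - start,
        "word_count": word_count,
        "char_count": len(text),
        "avg_word_length": sum(map(len, words)) / word_count if word_count else 0.0,
        "has_question": int("?" in text),
        "has_exclamation": int("!" in text),
        "has_jingle": int(any(marker in lowered for marker in JINGLE_MARKERS)),
    }


features = extract_features("Quelle est la question du jour ?", 12.0, 15.5)
```

Returning a flat dictionary per segment keeps the features easy to combine with neighbouring segments or to convert into a numeric matrix later.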
2. Contextual Awareness (Sliding Window)
To improve accuracy, the model uses a sliding window: rather than classifying a segment in isolation, it also incorporates features from the surrounding segments (e.g., the 2 segments before and the 2 segments after) to capture the flow and continuity of the radio broadcast.
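One way to build such a window is to copy each neighbour's features into the current segment's row under a prefixed name. The sketch below assumes a window of 2 on each side and zero-padding at the start and end of the broadcast; the function name `add_context`, the key prefixes, and the padding choice are assumptions, not details confirmed by the model.

```python
from typing import Dict, List


def add_context(rows: List[Dict[str, float]], window: int = 2) -> List[Dict[str, float]]:
    """Append the features of the `window` previous and next segments to each row."""
    keys = list(rows[0].keys()) if rows else []
    enriched_rows = []
    for i, row in enumerate(rows):
        enriched = dict(row)
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # the segment's own features are already present
            j = i + offset
            prefix = f"prev{-offset}_" if offset < 0 else f"next{offset}_"
            for key in keys:
                # Assumption: neighbours beyond the edges are zero-padded.
                enriched[prefix + key] = rows[j][key] if 0 <= j < len(rows) else 0.0
        enriched_rows.append(enriched)
    return enriched_rows


# Made-up word counts for four consecutive segments.
rows = [{"word_count": float(n)} for n in (5, 8, 13, 21)]
contextual = add_context(rows, window=2)
```

With a window of 2, every original feature yields four extra columns, so the feature vector grows fivefold.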
3. Core Classifier
The underlying algorithm is scikit-learn's RandomForestClassifier, configured with 200 estimators (trees) to balance precision and recall for chronicle detection.
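For reference, a classifier with this configuration can be instantiated as shown below. Only `n_estimators=200` comes from the description above; the toy feature rows, the labels, and `random_state=0` are made up for illustration and do not reflect the real training data or other hyperparameters.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy rows of [duration_seconds, word_count]; values are invented for this sketch.
X = [[2.0, 5], [2.5, 6], [3.0, 7], [30.0, 80], [28.0, 75], [35.0, 90]]
y = [0, 0, 0, 1, 1, 1]  # 1 = chronicle, 0 = other

# 200 trees as stated above; random_state is an assumption for reproducibility.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Class probabilities for one unseen segment, ordered as in clf.classes_.
probabilities = clf.predict_proba([[29.0, 78]])
prediction = clf.predict([[29.0, 78]])
```

Using `predict_proba` rather than `predict` makes it possible to move the decision threshold and trade precision against recall.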
Advantages
- CPU-Only: Does not require any GPU or heavy deep learning frameworks.
- High Speed: Processing an hour-long show takes less than a second.
- Efficiency: Extremely low memory footprint.
Usage
The model is integrated into the local predict.py workflow:
    from train import RadioChroniqueClassifier
    classifier = RadioChroniqueClassifier.load_model("models/radio_chronique_rf.pkl")
Author
Maintained by eglantinefonrose.