RLAC Audio-transcription Segmenter - Chroniques (Random Forest)

Description

This model uses a Random Forest machine learning approach to detect radio chronicle segments from textual transcriptions (SRT files). It is designed to be a lightweight and efficient alternative for text-based segmentation.

Hugging Face link: eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest

Model Details

The model is a single-file classifier (.pkl) serialized with joblib. It operates by analyzing individual transcription segments and their immediate context.

1. Feature Engineering

The model extracts several key features from the SRT segments:

  • Temporal Data: Precise air time and segment duration.
  • Linguistic Statistics: Word count, character count, and average word length.
  • Punctuation Analysis: Detection of question marks and exclamation points to identify rhetorical styles.
  • TF-IDF Vectorization: A statistical measure used to evaluate the importance of words in the transcript relative to radio chronicle vocabulary.
  • Jingle Detection: Specific binary markers for identified radio jingles.

2. Contextual Awareness (Sliding Window)

To improve accuracy, the model employs a sliding window technique. It doesn't just look at a segment in isolation; it incorporates features from the surrounding segments (e.g., the 2 segments before and 2 segments after) to capture the flow and continuity of the radio broadcast.

3. Core Classifier

The underlying algorithm is a RandomForestClassifier from the Scikit-Learn library, optimized with 200 estimators to handle the balance between precision and recall for chronicle detection.

Advantages

  • CPU-Only: Does not require any GPU or heavy deep learning frameworks.
  • High Speed: Processing an hour-long show takes less than a second.
  • Efficiency: Extremely low memory footprint.

Usage

The model is integrated into the local predict.py workflow:

from train import RadioChroniqueClassifier
classifier = RadioChroniqueClassifier.load_model("models/radio_chronique_rf.pkl")

Author

Maintained by eglantinefonrose.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest