Skipr Ad Classifier

Fine-tuned DistilBERT model that detects sponsor/ad segments in YouTube podcast transcript text.

Also available: dkayaaaa/ad-classifier-quantised, an INT8 ONNX version (~64 MB) for faster, lighter inference with ONNX Runtime.

Model description

This model classifies a short transcript window as either an ad/sponsor segment or normal podcast content. It was trained as part of the Skipr pipeline for skipping sponsor segments in YouTube podcasts.

  • Architecture: distilbert-base-uncased
  • Task: Binary sequence classification
  • Labels:
    • 0: not an ad segment
    • 1: ad/sponsor segment
  • Max sequence length: 512 tokens
  • Training: 3 epochs, fine-tuned from distilbert-base-uncased

Intended use

Use this model to classify transcript windows (typically ~20 caption snippets) as ad vs non-ad content. It is designed for use in the Skipr browser extension and related inference services.

Out of scope:

  • General sentiment or topic classification
  • Non-English text (trained on English podcast transcripts)
  • Full-video classification without segmentation
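The transcript windows mentioned under Intended use could be assembled roughly as follows. This is a minimal sketch, not the Skipr implementation: the function name `build_windows`, the 50% overlap (`stride=10`), and joining snippets with spaces are all assumptions made for illustration. Only the window size of about 20 caption snippets comes from this card.

```python
from typing import List


def build_windows(snippets: List[str], window_size: int = 20, stride: int = 10) -> List[str]:
    """Join consecutive caption snippets into overlapping text windows.

    window_size=20 follows the "~20 caption snippets" figure in this card;
    stride=10 (50% overlap) is an assumption, not a documented setting.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")

    windows: List[str] = []
    for start in range(0, len(snippets), stride):
        chunk = snippets[start:start + window_size]
        # Strip stray whitespace from each snippet before joining.
        windows.append(" ".join(s.strip() for s in chunk))
        # Stop once a window has reached the end of the transcript.
        if start + window_size >= len(snippets):
            break
    return windows
```

Each returned string can then be passed to the classifier as one input. Overlapping windows make it less likely that an ad read is split across a boundary, at the cost of more inference calls.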

Training data

The model was trained on a mix of real and synthetic transcript windows:

  • Base set (~800 samples): weak-labeled YouTube podcast segments
    • Positive: segments matching sponsor keywords/brands
    • Negative: normal podcast content
  • Augmented set (~1,200 samples): synthetic variants generated with Llama 8B via Ollama, preserving the original label. Five generation strategies were used: paraphrase, new scenario, style shift, fragment, and vocabulary shift.

Original labels are heuristic: the model learns from keyword-labeled examples. Synthetic data increases linguistic diversity but inherits the same label assumptions.
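To make the weak-labeling idea concrete, here is a minimal sketch of keyword-based labeling. The phrase list and the `weak_label` function are hypothetical and chosen only for illustration; the actual keywords and brand names used to label the base set are not documented in this card.

```python
import re

# Hypothetical sponsor phrases for illustration only;
# this is NOT the keyword list used to train the model.
SPONSOR_PATTERNS = (
    r"\bsponsored by\b",
    r"\bbrought to you by\b",
    r"\bpromo code\b",
    r"\buse code\b",
)
_SPONSOR_RE = re.compile("|".join(SPONSOR_PATTERNS), flags=re.IGNORECASE)


def weak_label(window: str) -> int:
    """Return 1 if the window matches any sponsor pattern, else 0."""
    return 1 if _SPONSOR_RE.search(window) else 0
```

A heuristic like this labels "Huge thanks to Acme for supporting the show" as 0 because no pattern matches, which illustrates the kind of label noise the model may inherit.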

Usage

Transformers (Python)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

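# Load the tokenizer and fine-tuned classifier from the Hugging Face Hub.
repo = "dkayaaaa/ad-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()  # inference mode (disables dropout)

# One transcript window, truncated and padded to the 512-token maximum.
text = "this episode is brought to you by our friends at..."
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)

# Label 1 is ad/sponsor, label 0 is not an ad.
with torch.no_grad():
    logits = model(**inputs).logits
    prediction = logits.argmax(dim=-1).item()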

print("ad" if prediction == 1 else "not ad")
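If a confidence score is more useful than a hard label, the two logits can be converted to a probability and compared against a threshold. The sketch below uses only the standard library so it runs without the model; `ad_probability`, `is_ad`, and the default threshold of 0.5 are illustrative names and values, not part of the published model or the Skipr pipeline.

```python
import math
from typing import Sequence


def ad_probability(logits: Sequence[float]) -> float:
    """Softmax probability of label 1 (ad/sponsor) given [not_ad, ad] logits."""
    if len(logits) != 2:
        raise ValueError("expected two logits: [not_ad, ad]")
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def is_ad(logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag a window as an ad when its ad probability reaches the threshold."""
    return ad_probability(logits) >= threshold
```

With the snippet above, this could be called as `is_ad(logits[0].tolist(), threshold=0.8)`. A higher threshold trades missed ads for fewer false skips; a suitable value would need to be chosen on held-out data.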