Skipr Ad Classifier
Fine-tuned DistilBERT model that detects sponsor/ad segments in YouTube podcast transcript text.
Also available: dkayaaaa/ad-classifier-quantised, an INT8 ONNX version (~64 MB) for faster, lighter inference with ONNX Runtime.
Model description
This model classifies a short transcript window as either an ad/sponsor segment or normal podcast content. It was trained as part of the Skipr pipeline for skipping sponsor segments in YouTube podcasts.
- Architecture: distilbert-base-uncased
- Task: Binary sequence classification
- Labels:
  - 0: not an ad segment
  - 1: ad/sponsor segment
- Max sequence length: 512 tokens
- Training: 3 epochs, fine-tuned from distilbert-base-uncased
Intended use
Use this model to classify transcript windows (typically ~20 caption snippets) as ad vs non-ad content. It is designed for use in the Skipr browser extension and related inference services.
Out of scope:
- General sentiment or topic classification
- Non-English text (trained on English podcast transcripts)
- Full-video classification without segmentation
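Because the model expects short windows rather than a whole transcript, captions need to be grouped before classification. The sketch below is one possible way to do that, not the Skipr implementation: `make_windows` is a hypothetical helper, the window size of 20 follows the "~20 caption snippets" figure above, and the stride of 10 is an assumption chosen only to make consecutive windows overlap.

```python
from typing import List


def make_windows(snippets: List[str], size: int = 20, stride: int = 10) -> List[str]:
    """Join consecutive caption snippets into overlapping text windows.

    size   -- snippets per window (about 20, per the model card)
    stride -- snippets to advance between windows (assumed value)
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")

    # Drop empty captions so they do not count toward the window size.
    cleaned = [s.strip() for s in snippets if s.strip()]
    if not cleaned:
        return []

    windows = []
    for start in range(0, len(cleaned), stride):
        windows.append(" ".join(cleaned[start:start + size]))
        # Stop once a window has reached the end of the transcript.
        if start + size >= len(cleaned):
            break
    return windows
```

Each returned string can then be passed to the tokenizer as a single input. Overlapping windows make it less likely that an ad read is split exactly at a window boundary, at the cost of more inference calls.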
Training data
The model was trained on a mix of real and synthetic transcript windows:
- Base set (~800 samples): weakly labeled YouTube podcast segments
  - Positive: segments matching sponsor keywords/brands
  - Negative: normal podcast content
- Augmented set (~1,200 samples): synthetic variants generated with Llama 8B via Ollama, preserving the original label. Variants were generated using these strategies: paraphrase, new scenario, style shift, fragment, and vocabulary shift.
Original labels are heuristic: the model learns from keyword-labeled examples. Synthetic data increases linguistic diversity but inherits the same label assumptions.
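To make the keyword-based weak labeling concrete, here is a minimal sketch of how such a labeler could look. It is an illustration only: the cue phrases in `SPONSOR_CUES` and the `weak_label` function are hypothetical, and the actual keyword and brand list used for training is not given in this card.

```python
import re
from typing import Iterable

# Hypothetical cue phrases, for illustration only. The real keyword/brand
# list used to build the training set is not published in this card.
SPONSOR_CUES = (
    "sponsored by",
    "brought to you by",
    "promo code",
    "use code",
    "today's sponsor",
)


def weak_label(text: str, cues: Iterable[str] = SPONSOR_CUES) -> int:
    """Return 1 (ad/sponsor) if any cue phrase occurs in the text, else 0."""
    # Lowercase and collapse whitespace so matching ignores caption formatting.
    normalized = re.sub(r"\s+", " ", text.lower())
    return int(any(cue in normalized for cue in cues))
```

A labeler like this marks any window that mentions a cue phrase as positive, including a host who merely talks about promo codes, and it misses ad reads that use none of the phrases. That is the kind of label noise the paragraph above refers to.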
Usage
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "kayaaaa/ad-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

# One transcript window, joined into a single string
text = "this episode is brought to you by our friends at..."

# Pad or truncate to the model's maximum sequence length (512 tokens)
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(**inputs).logits

# Label 1 = ad/sponsor segment, label 0 = not an ad segment
prediction = logits.argmax(dim=-1).item()
print("ad" if prediction == 1 else "not ad")
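The snippet above reports only the argmax label. If a confidence score is useful, for example to skip only high-confidence ad windows, the two logits can be turned into a probability with a softmax. The following is a sketch in plain Python; `ad_probability`, `classify`, and the default threshold of 0.5 are hypothetical and not part of the model or the Skipr pipeline.

```python
import math
from typing import Sequence, Tuple


def ad_probability(logits: Sequence[float]) -> float:
    """Softmax over [not-ad, ad] logits; returns the probability of label 1."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [not-ad, ad]")
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def classify(logits: Sequence[float], threshold: float = 0.5) -> Tuple[str, float]:
    """Return ("ad" or "not ad", ad probability) for one window."""
    prob = ad_probability(logits)
    return ("ad" if prob >= threshold else "not ad", prob)
```

With the variables from the snippet above, this could be called as `classify(logits[0].tolist())`. Raising the threshold trades missed ads for fewer false skips; a suitable value would need to be chosen on held-out data.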
Model tree for kayaaaa/ad-classifier
Base model
distilbert/distilbert-base-uncased