Moreno La Quatra
metadata
language: en
datasets:
  - Spotify Podcasts Dataset
tags:
  - bert
  - classification
  - pytorch
pipeline:
  - text-classification

General Information

This is a binary classification model based on bert-base-cased, fine-tuned to classify a given sentence as containing advertising content or not. It uses the previous sentence as context to make more accurate predictions. The model was used in the paper 'Leveraging multimodal content for podcast summarization', published at ACM SAC 2022.
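
As a rough illustration of the previous-sentence context, the sketch below pairs each sentence with the one before it. The helper name `build_context_pairs` is hypothetical and not part of the model or of any library; the `__START__` placeholder for the first sentence is taken from the usage snippet in this README.

```python
from typing import List, Tuple

# Placeholder context for the first sentence, as used in the usage snippet
START_TOKEN = "__START__"


def build_context_pairs(sentences: List[str]) -> List[Tuple[str, str]]:
    """Pair each sentence with the sentence that precedes it (hypothetical helper)."""
    pairs = []
    for i, sentence in enumerate(sentences):
        # The first sentence has no predecessor, so it gets the placeholder
        context = START_TOKEN if i == 0 else sentences[i - 1]
        pairs.append((context, sentence))
    return pairs


pairs = build_context_pairs(["Sentence 1", "Sentence 2", "Sentence 3"])
```

Each `(context, sentence)` pair can then be passed to the tokenizer as a sentence pair.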

Usage:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('morenolq/spotify-podcast-advertising-classification')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model.eval()  # disable dropout for inference

desc_sentences = ["Sentence 1", "Sentence 2", "Sentence 3"]
for i, s in enumerate(desc_sentences):
    # The first sentence has no predecessor, so use the __START__ placeholder as context
    if i == 0:
        context = "__START__"
    else:
        context = desc_sentences[i - 1]
    # Encode (context, sentence) as a sentence pair
    out = tokenizer(context, s, padding="max_length",
                    max_length=256,
                    truncation=True,
                    return_attention_mask=True,
                    return_tensors='pt')
    # Gradients are not needed at inference time
    with torch.no_grad():
        outputs = model(**out)
    print(f"{s},{outputs}")
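
The printed output holds raw logits rather than probabilities. The following is a minimal sketch, written in plain Python so it runs without extra dependencies, of turning a pair of logits into class probabilities. The example logits are made up for illustration, and the `softmax` helper is not part of the model. Which index corresponds to advertising content is not stated here, so treating either index as the "ad" class is an assumption; `model.config.id2label` may help confirm the mapping.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence; real values would come from outputs.logits
example_logits = [1.2, -0.7]
probs = softmax(example_logits)
# Index of the most likely class (its meaning depends on the model's label mapping)
predicted_index = probs.index(max(probs))
```

With PyTorch tensors, `torch.softmax(outputs.logits, dim=-1)` performs the same computation directly on the model output.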

The manually annotated data used for model fine-tuning are available here.