---
language: "en"
datasets:
- Spotify Podcasts Dataset
tags:
- bert
- classification
- pytorch
pipeline:
- text-classification
---

**General Information**

This is a binary classification model based on `bert-base-cased`, fine-tuned to classify whether a given sentence contains advertising content. It uses the previous sentence as context to make more accurate predictions.

The model is used in the paper 'Leveraging multimodal content for podcast summarization', published at ACM SAC 2022.

**Usage:**

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('morenolq/spotify-podcast-advertising-classification')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

desc_sentences = ["Sentence 1", "Sentence 2", "Sentence 3"]

for i, s in enumerate(desc_sentences):
    # The previous sentence is the context; the first sentence uses a start marker.
    if i == 0:
        context = "__START__"
    else:
        context = desc_sentences[i - 1]
    # Encode (context, sentence) as a sentence pair.
    out = tokenizer(context, s,
                    padding="max_length",
                    max_length=256,
                    truncation=True,
                    return_attention_mask=True,
                    return_tensors='pt')
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**out)
    print(f"{s},{outputs}")
```
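
The pairing logic in the usage snippet can be factored out and checked without loading the model. The sketch below is a dependency-free illustration, not part of the released code: `build_context_pairs` and `softmax` are hypothetical helper names. The first builds the `(context, sentence)` pairs using the `__START__` marker from the snippet. The second turns a list of raw logits into probabilities. Which output index corresponds to advertising is not stated in this card, so check `model.config.id2label` before interpreting the scores.

```python
import math
from typing import List, Sequence, Tuple


def build_context_pairs(sentences: Sequence[str],
                        start_token: str = "__START__") -> List[Tuple[str, str]]:
    """Pair each sentence with its predecessor; the first one gets start_token."""
    pairs = []
    for i, sentence in enumerate(sentences):
        context = start_token if i == 0 else sentences[i - 1]
        pairs.append((context, sentence))
    return pairs


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (hypothetical helper, numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for context, sentence in build_context_pairs(["Sentence 1", "Sentence 2", "Sentence 3"]):
        print(f"context={context!r} sentence={sentence!r}")
```

Each pair can be passed to the tokenizer as `tokenizer(context, sentence, ...)`. With PyTorch, `torch.softmax(outputs.logits, dim=-1)` computes the same probabilities directly on the model output.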

The manually annotated data used for model fine-tuning are available [here](https://github.com/MorenoLaQuatra/MATeR/blob/main/description_sentences_classification.tsv).