---
library_name: transformers
license: mit
datasets:
- sem_eval_2020_task_11
language:
- en
---
# Model Card for persuasive_language_detector
<!-- Provide a quick summary of what the model is/does. -->
Given a sentence, our model predicts whether the sentence contains "persuasive" language, that is, language designed to elicit emotions or change
readers' opinions. The model was fine-tuned on the SemEval 2020 Task 11 dataset, which we preprocessed to adapt it from
multilabel technique classification and span classification to our binary sentence classification task.
There are two revisions:
* BERT - `bert-large-cased`, fine-tuned, on the `main` branch
* XLM-RoBERTa - `xlm-roberta-base`, fine-tuned, on the `roberta` branch
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- **Developed by:** Ultraviolet Text
- **Model type:** BERT / XLM-RoBERTa
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** `bert-large-cased` / `xlm-roberta-base`
## How to Get Started with the Model
Use the code below to get started with the model.
### Loading from the `main` branch (BERT)
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The tokenizer comes from the base checkpoint; the fine-tuned weights come from this repo.
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForSequenceClassification.from_pretrained("chreh/persuasive_language_detector")
```
### Loading from the `roberta` branch (XLM-RoBERTa)
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The tokenizer comes from the base checkpoint; `revision` selects the `roberta` branch of this repo.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("chreh/persuasive_language_detector", revision="roberta")
```
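Once a model and tokenizer are loaded as in the snippets above, a single sentence can be classified roughly as follows. This is a minimal sketch rather than the authors' inference code: the names `softmax`, `classify`, and `TOKENIZERS` are hypothetical, and this card does not state which output index corresponds to "persuasive", so check `model.config.id2label` before interpreting the result.

```py
import math

REPO_ID = "chreh/persuasive_language_detector"
# Base tokenizer for each revision, matching the loading snippets in this card.
TOKENIZERS = {"main": "bert-large-cased", "roberta": "xlm-roberta-base"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence, revision="roberta"):
    """Return (predicted class index, class probabilities) for one sentence."""
    # Imported lazily so `softmax` can be used without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZERS[revision])
    model = AutoModelForSequenceClassification.from_pretrained(REPO_ID, revision=revision)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    # Assumption: the meaning of each index is not documented here;
    # inspect model.config.id2label to map it to a label name.
    return probs.index(max(probs)), probs


# Example call (downloads the model on first use):
# index, probs = classify("You would be a fool not to act today!")
```

Reloading the model on every call keeps the sketch short; for batch use, load the tokenizer and model once and reuse them.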
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Training data can be downloaded from [the SemEval 2020 Task 11 website](https://propaganda.qcri.org/semeval2020-task11/).
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Training used the Hugging Face `Trainer` on both our local machines and Intel Developer Cloud kernels, which let us prototype multiple models simultaneously.
#### Preprocessing
All sentences containing spans of persuasive language techniques were labeled as persuasive language examples, while all others
were labeled as examples of non-persuasive language.
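The labeling rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing script: it assumes each article is plain text with one sentence per line, and that spans are `(start, end)` character offsets into the article with an exclusive end. The function name `label_sentences` is hypothetical.

```py
def label_sentences(article_text, spans):
    """Return (sentence, label) pairs; label is 1 if the sentence overlaps any span."""
    examples = []
    offset = 0  # character offset of the current line within the article
    for line in article_text.split("\n"):
        start, end = offset, offset + len(line)
        offset = end + 1  # step past the newline character
        if not line.strip():
            continue  # skip blank lines between paragraphs
        # Two half-open intervals overlap when each starts before the other ends.
        label = int(any(s < end and e > start for s, e in spans))
        examples.append((line, label))
    return examples


article = "Plain fact.\nThey are monsters!\n\nAnother fact."
print(label_sentences(article, spans=[(17, 29)]))
```

With this rule, a sentence counts as persuasive regardless of which technique the span was annotated with or how many spans it contains.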
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
The test data is the test split of `sem_eval_2020_task_11`, which can be downloaded from [the original website](https://propaganda.qcri.org/semeval2020-task11/).
The test data contains 38.25% persuasive examples and 61.75% non-persuasive examples. Metrics are reported in the following section.
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Metrics are reported for both revisions:

| Metric | `main` branch (BERT) | `roberta` branch (XLM-RoBERTa) |
|---|---|---|
| Accuracy | 0.7165140725669719 | 0.7326693227091633 |
| Recall | 0.6875584658559402 | 0.6822916666666666 |
| Precision | 0.5941794664510913 | 0.6415279138099902 |
| F1 | 0.6374674761491761 | 0.6612821807168097 |

Overall, the `roberta` branch scores higher on accuracy, precision, and F1 (its recall is marginally lower) and has faster inference times. Thus, we recommend users download from the `roberta` revision.
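As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below does this in plain Python. The function name `f1_from_precision_recall` is ours; the original evaluation code is not part of this card.

```py
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Metrics section of this card.
reported = {
    "main": {
        "precision": 0.5941794664510913,
        "recall": 0.6875584658559402,
        "f1": 0.6374674761491761,
    },
    "roberta": {
        "precision": 0.6415279138099902,
        "recall": 0.6822916666666666,
        "f1": 0.6612821807168097,
    },
}

for branch, m in reported.items():
    recomputed = f1_from_precision_recall(m["precision"], m["recall"])
    # True when the recomputed F1 agrees with the reported F1.
    print(branch, recomputed, abs(recomputed - m["f1"]) < 1e-9)
```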