---
library_name: transformers
license: mit
datasets:
- sem_eval_2020_task_11
language:
- en
---

# Model Card for persuasive_language_detector

Given a sentence, our model predicts whether or not the sentence contains "persuasive" language, that is, language designed to elicit emotions or change readers' opinions. The model was fine-tuned on the SemEval 2020 Task 11 dataset, which we preprocessed to adapt it from multilabel technique classification and span classification to our binary classification task.

There are two revisions:

* BERT: `bert-large-cased`, fine-tuned, on the `main` branch.
* XLM-RoBERTa: `xlm-roberta-base`, fine-tuned, on the `roberta` branch.

## Model Details

### Model Description

This is a 🤗 Transformers sequence-classification model hosted on the Hub.

- **Developed by:** Ultraviolet Text
- **Model type:** BERT (`main` branch) / XLM-RoBERTa (`roberta` branch)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** `bert-large-cased` / `xlm-roberta-base`

## How to Get Started with the Model

Use the code below to get started with the model.

### Loading from the main branch (BERT)

```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tokenizer from the base checkpoint; fine-tuned weights from the default (main) branch.
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForSequenceClassification.from_pretrained("chreh/persuasive_language_detector")
```

### Loading from the `roberta` branch (XLM-RoBERTa)

```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tokenizer from the base checkpoint; fine-tuned weights from the `roberta` branch.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("chreh/persuasive_language_detector", revision="roberta")
```

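Once a revision is loaded, a single sentence can be scored roughly as follows. This is a minimal sketch, not the authors' inference code: the helper names (`load_classifier`, `persuasive_probability`, `softmax`) are hypothetical, and it assumes a two-class head in which index 1 means "persuasive". Check `model.config.id2label` before relying on that mapping.

```python
import math

# Base tokenizer for each branch, as given in the loading snippets above.
TOKENIZER_FOR_REVISION = {"main": "bert-large-cased", "roberta": "xlm-roberta-base"}

# ASSUMPTION: class index 1 means "persuasive". Verify with model.config.id2label.
PERSUASIVE_INDEX = 1


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_classifier(revision="roberta"):
    """Load the base tokenizer and the fine-tuned weights for one branch."""
    # Imported here so the pure-Python helpers work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_FOR_REVISION[revision])
    model = AutoModelForSequenceClassification.from_pretrained(
        "chreh/persuasive_language_detector", revision=revision
    )
    model.eval()  # disable dropout for inference
    return tokenizer, model


def persuasive_probability(sentence, tokenizer, model):
    """Return the probability of the assumed 'persuasive' class for one sentence."""
    import torch

    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return softmax(logits)[PERSUASIVE_INDEX]


# Example usage (downloads the weights on first call):
# tokenizer, model = load_classifier("roberta")
# print(model.config.id2label)
# print(persuasive_probability("They will destroy everything we love!", tokenizer, model))
```

The usage lines are left commented out because they download the checkpoint; uncomment them to try the model on your own sentences.
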
## Training Details

### Training Data

Training data can be downloaded from [the SemEval website](https://propaganda.qcri.org/semeval2020-task11/).

### Training Procedure

Training was done with the Hugging Face `Trainer` on both our local machines and Intel Developer Cloud kernels, which let us prototype multiple models simultaneously.

#### Preprocessing

All sentences containing spans of persuasive language techniques were labeled as persuasive examples; all other sentences were labeled as non-persuasive examples.

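The labeling rule above can be sketched as an overlap test between sentences and annotated technique spans. This is an illustration under assumptions, not the preprocessing script that was actually used: it assumes spans and sentences are both given as character offsets into the same article, and the function names are hypothetical. How the articles were split into sentences is not specified here.

```python
def overlaps(sentence_start, sentence_end, span_start, span_end):
    """True if two half-open character ranges [start, end) share at least one character."""
    return span_start < sentence_end and sentence_start < span_end


def label_sentences(sentences, technique_spans):
    """Label each sentence 1 (persuasive) if any technique span overlaps it, else 0.

    sentences: list of (start, end, text) with character offsets into the article.
    technique_spans: list of (start, end) character offsets of annotated techniques.
    """
    labeled = []
    for start, end, text in sentences:
        label = int(any(overlaps(start, end, s, e) for s, e in technique_spans))
        labeled.append((text, label))
    return labeled


# Toy example (invented text, not taken from the dataset).
article = "Stop the madness now. The vote is on Tuesday."
texts = ["Stop the madness now.", "The vote is on Tuesday."]
sentences = [(article.index(t), article.index(t) + len(t), t) for t in texts]
technique_spans = [(9, 16)]  # covers the word "madness"

print(label_sentences(sentences, technique_spans))
# [('Stop the madness now.', 1), ('The vote is on Tuesday.', 0)]
```
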
### Testing Data, Factors & Metrics

#### Testing Data

The test data is the test split of `sem_eval_2020_task_11`, which can be downloaded from [the original website](https://propaganda.qcri.org/semeval2020-task11/). It contains 38.25% persuasive examples and 61.75% non-persuasive examples. Metrics are reported in the following section.

#### Metrics

Each metric is reported as (`main` branch), (`roberta` branch):

* Accuracy - 0.7165140725669719, 0.7326693227091633
* Recall - 0.6875584658559402, 0.6822916666666666
* Precision - 0.5941794664510913, 0.6415279138099902
* F1 - 0.6374674761491761, 0.6612821807168097

The `roberta` branch scores higher on accuracy, precision, and F1, with slightly lower recall, and has faster inference times. We therefore recommend that users download the `roberta` revision.

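For reference, the four metrics reported above follow the standard binary-classification definitions. The sketch below computes them from lists of true and predicted labels. It assumes label 1 marks the persuasive class, and `binary_metrics` is a hypothetical helper, not the evaluation code used to produce the reported numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (invented): 3 persuasive and 5 non-persuasive sentences.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

As a point of comparison, given the class balance stated under Testing Data, a classifier that always predicts non-persuasive would reach 0.6175 accuracy with zero recall on the persuasive class.
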