---
language: it
thumbnail: https://neuraly.ai/static/assets/images/huggingface/thumbnail.png
tags:
- sentiment
- Italian
license: mit
widget:
- text: Huggingface è un team fantastico!
---

# 🤗 + neuraly - Italian BERT Sentiment model

## Model description

This model performs sentiment analysis on Italian sentences. It was initialized from an instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased) and fine-tuned on an Italian dataset of tweets, reaching 82% accuracy on that dataset's test split.

## Intended uses & limitations

#### How to use

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
# Load the model; call .cuda() on it to move it to the GPU
model = AutoModelForSequenceClassification.from_pretrained("neuraly/bert-base-italian-cased-sentiment")

sentence = 'Huggingface è un team fantastico!'
input_ids = tokenizer.encode(sentence, add_special_tokens=True)

# Create the input tensor; call .cuda() on it to move it to the GPU
tensor = torch.tensor(input_ids).long()
# Add a fake batch dimension
tensor = tensor.unsqueeze(0)

# Call the model without tracking gradients and take the logits.
# Indexing with [0] works whether the model returns a tuple or a ModelOutput.
with torch.no_grad():
    logits = model(tensor)[0]

# Remove the fake batch dimension
logits = logits.squeeze(0)

# The model was trained with a combined softmax + log-likelihood loss,
# so a softmax over the logits is needed to obtain probabilities
proba = nn.functional.softmax(logits, dim=0)

# Unpack the tensor to obtain negative, neutral and positive probabilities
negative, neutral, positive = proba
```
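
Once the three probabilities are available, choosing a label comes down to taking the largest one. The sketch below is a dependency-free illustration of that step: it applies a softmax to a list of made-up logits (not real model output) and returns the most likely label. The `LABELS` tuple and the `softmax` and `predict_label` helpers are names introduced here for illustration; the label order is assumed from the unpacking in the snippet above (negative, neutral, positive).

```python
import math

# Label order assumed from the unpacking in the snippet above.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for the most likely class."""
    proba = softmax(logits)
    best = max(range(len(proba)), key=proba.__getitem__)
    return LABELS[best], proba[best]


# Hypothetical logits, for illustration only (not produced by the model).
example_logits = [-1.2, 0.3, 2.5]
label, confidence = predict_label(example_logits)
print(label, round(confidence, 3))
```

With real model output, `logits.tolist()` from the snippet above could be passed to `predict_label` in place of `example_logits`.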

#### Limitations and bias

A possible drawback (or bias) of this model is that it was trained on a tweet dataset, with all the limitations that come with that kind of data. The training domain is strongly skewed toward football players and teams, although the model works surprisingly well on other topics too.

## Training data

We trained the model on the combination of the two tweet datasets from [Sentipolc EVALITA 2016](http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html). Overall, the combined dataset consists of 45K pre-processed tweets.

The model weights come from a pre-trained instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased). A huge "thank you" goes to that team for their brilliant work!

## Training procedure

#### Preprocessing

We tried to preserve as much information as possible, since BERT captures the semantics of complex text sequences extremely well. We removed only **@mentions**, **URLs** and **emails** from each tweet and kept pretty much everything else.
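
The exact cleaning code is not part of this card. The following is a minimal sketch of what such a step could look like, assuming plain regular expressions are enough to strip the three kinds of tokens. The patterns and the `clean_tweet` helper are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative patterns (assumptions, not the authors' actual rules).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Remove emails, URLs and @mentions, then collapse extra whitespace."""
    # Emails go first so that MENTION_RE does not eat their "@domain" part.
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("@mario Forza Juve!!! https://t.co/abc scrivi a info@example.com"))
```

Punctuation, casing and everything else is left untouched, in line with the goal of keeping as much of the original tweet as possible.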

#### Hardware

- **GPU**: Nvidia GTX 1080 Ti
- **CPU**: AMD Ryzen 7 3700X 8c/16t
- **RAM**: 64 GB DDR4

#### Hyperparameters

- Optimizer: **AdamW** with learning rate of **2e-5**, epsilon of **1e-8**
- Max epochs: **5**
- Batch size: **32**
- Early stopping: **enabled** with patience = 1

Early stopping was triggered after 3 epochs.
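
To make the early-stopping rule concrete, here is a minimal sketch of how a patience of 1 interacts with a 5-epoch cap. It assumes the monitored quantity is a validation loss, which this card does not state; the `train_with_early_stopping` helper and the loss values are hypothetical and only stand in for a real training loop.

```python
def train_with_early_stopping(val_losses, max_epochs=5, patience=1):
    """Return the number of epochs run before training stops.

    `val_losses` stands in for the per-epoch validation loss that a real
    training loop would compute after each epoch (an assumption here).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best value: remember it and reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run


# Made-up validation losses: two improving epochs, then a regression.
hypothetical_losses = [0.62, 0.48, 0.51, 0.47, 0.46]
print(train_with_early_stopping(hypothetical_losses))
```

With a patience of 1, the first epoch that fails to improve on the best value ends training, so a run that improves twice and then regresses stops at the third epoch.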

## Eval results

The model achieves an overall accuracy of 82% on the test set. The test set is a 20% split of the whole dataset.
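
As a rough illustration of the numbers involved, the sketch below splits a dataset 80/20 and computes accuracy as the fraction of correct predictions. The card does not say how the split was drawn, so the seeded random shuffle is an assumption, as are the `train_test_split` and `accuracy` helpers and the toy labels. The 45,000 figure is the approximate "45K" size mentioned under Training data.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split off `test_fraction` as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder indices standing in for the roughly 45K tweets.
train_ids, test_ids = train_test_split(range(45_000))
print(len(train_ids), len(test_ids))

# Toy labels, for illustration only.
gold = ["positive", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "neutral", "negative", "negative", "neutral"]
print(accuracy(gold, pred))
```

Under these assumptions the test set holds about 9,000 tweets, so an accuracy of 82% corresponds to roughly 7,400 correctly classified test tweets.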

## About us

[Neuraly](https://neuraly.ai) is a young and dynamic startup committed to designing AI-driven solutions and services through the most advanced Machine Learning and Data Science technologies. You can find out more about who we are and what we do on our [website](https://neuraly.ai).

## Acknowledgments

Thanks to the generous support of the [Hugging Face](https://huggingface.co/) team, the model can be downloaded from their S3 storage and tested live through their inference API 🤗.