# 🤗 + neuraly - Italian BERT Sentiment model

## Model description

This model performs sentiment analysis on Italian sentences. It was initialized from an instance of bert-base-italian-cased and fine-tuned on an Italian dataset of tweets, reaching 82% accuracy on that dataset's test split.

## Intended uses & limitations

#### How to use

    import torch
    from torch import nn
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
    # Load the model, use .cuda() to load it on the GPU
    model = AutoModelForSequenceClassification.from_pretrained("neuraly/bert-base-italian-cased-sentiment")

    sentence = 'Huggingface è un team fantastico!'
    # Convert the sentence to token ids, adding the [CLS] and [SEP] special tokens
    input_ids = tokenizer.encode(sentence, add_special_tokens=True)

    # Create tensor, use .cuda() to transfer the tensor to GPU
    tensor = torch.tensor(input_ids).long()
    # Fake batch dimension
    tensor = tensor.unsqueeze(0)

    # Call the model and get the logits (gradients are not needed for inference).
    # Index 0 holds the logits whether the model returns a tuple or an output object.
    with torch.no_grad():
        logits = model(tensor)[0]

    # Remove the fake batch dimension
    logits = logits.squeeze(0)

    # The model was trained with a Log Likelihood + Softmax combined loss,
    # hence to extract probabilities we need a softmax on top of the logits tensor
    proba = nn.functional.softmax(logits, dim=0)

    # Unpack the tensor to obtain negative, neutral and positive probabilities
    negative, neutral, positive = proba
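
To make the last two steps concrete, here is a minimal pure-Python sketch of the softmax and of picking the most likely class. The label order follows the unpacking above. The `LABELS` tuple, the helper names and the example logits are made up for illustration and are not part of the model's API.

```python
import math

# Label order as in the unpacking above: negative, neutral, positive
LABELS = ("negative", "neutral", "positive")

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up logits, for illustration only (not real model output)
example_logits = [-1.2, 0.3, 2.5]
label, prob = predict_label(example_logits)
print(label, round(prob, 3))
```

With real model output, the three logits can be turned into a plain list with `logits.tolist()` before being passed to `predict_label`.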

#### Limitations and bias

A possible drawback (or bias) of this model is that it was trained only on tweets, with all the limitations that come with that kind of text. The training domain is strongly related to football players and teams, although the model works surprisingly well on other topics too.

## Training data

We trained the model on a combination of the two tweet datasets from Sentipolc EVALITA 2016. Overall, the combined dataset consists of 45K pre-processed tweets.

The model weights come from a pre-trained instance of bert-base-italian-cased. A huge "thank you" goes to that team, brilliant work!

## Training procedure

#### Preprocessing

We tried to preserve as much information as possible, since BERT captures the semantics of complex text sequences extremely well. We removed only @mentions, URLs and email addresses from each tweet and kept pretty much everything else.
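
The cleaning step described above could look roughly like the following sketch. The original preprocessing code is not shown in this card, so the regular expressions and the `clean_tweet` name are assumptions chosen for illustration. Real tweets may need more careful patterns.

```python
import re

# Hypothetical patterns: the actual ones used for training are not documented here
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def clean_tweet(text):
    """Strip emails, URLs and @mentions; keep everything else (hashtags, emoji, ...)."""
    # Remove emails before mentions so that an address is dropped as a whole
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

raw = "@mario Forza Juve! http://t.co/abc scrivi a info@example.com #finoallafine"
print(clean_tweet(raw))
```

Hashtags, punctuation, emoticons and casing are left untouched, in line with the goal of keeping as much of the original signal as possible.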

#### Hardware

- GPU: Nvidia GTX 1080 Ti
- CPU: AMD Ryzen 7 3700X 8c/16t
- RAM: 64GB DDR4

#### Hyperparameters

- Optimizer: AdamW with learning rate of 2e-5, epsilon of 1e-8
- Max epochs: 5
- Batch size: 32
- Early Stopping: enabled with patience = 1

Early stopping was triggered after 3 epochs.
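
As a rough illustration of how early stopping with patience = 1 interacts with the 5-epoch cap, here is a small sketch. The monitored metric is not stated in this card, so using validation loss is an assumption. The loss values are invented, and `epochs_before_stop` is a hypothetical helper rather than the training code that was actually used.

```python
def epochs_before_stop(val_losses, max_epochs=5, patience=1):
    """Return how many epochs run before early stopping (or the epoch cap) ends training.

    val_losses stands in for the validation loss measured after each epoch.
    """
    best = float("inf")
    bad_epochs = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            bad_epochs = 0
        else:
            # No improvement: stop once patience is exhausted
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return epochs_run

# Invented losses: the third epoch fails to improve on the second
made_up_losses = [0.60, 0.48, 0.51, 0.40, 0.39]
print(epochs_before_stop(made_up_losses))
```

With patience = 1, a single epoch without improvement ends training, which keeps the run short but can stop before a later recovery such as the one in the invented fourth epoch.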

## Eval results

The model achieves an overall accuracy of 82% on the test set. The test set is a 20% split of the whole dataset.
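
For reference, an 80/20 split and the accuracy metric can be sketched as follows. How the split was actually drawn (random, stratified, fixed seed) is not stated, so the shuffled split below is an assumption. `split_80_20`, `accuracy` and the toy labels are illustrative only.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) with a 20% test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy data, for illustration only
train_ids, test_ids = split_80_20(range(100))
gold = ["positive", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "neutral", "negative", "positive", "negative"]
print(len(train_ids), len(test_ids), accuracy(gold, pred))
```

Applied to the 45K tweets mentioned under Training data, a split of this kind would leave about 9K tweets for testing.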