
Model and approach 🤗

Because I am limited by the hardware of my personal computer, I fine-tuned the distilbert-base-multilingual-cased model. DistilBERT is reported to run about 60% faster than the original BERT model while preserving roughly 95% of its accuracy.

The dataset provided contains book titles, authors, reviews, and a score for each book. These columns were concatenated into a single large context block per book, which was used as the input text. The original labels (0, 1, and -1) were normalized to 0, 1, and 2, and then mapped to NEUTRAL, POSITIVE, and NEGATIVE to make the predictions easier to read.
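The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the exact code used for training: the column names (`title`, `author`, `review`, `score`, `label`), the space separator, and the sample row are all assumptions made for the example.

```python
# Label handling as described in the text: raw labels 0, 1, -1 are
# normalized to 0, 1, 2 and then named NEUTRAL, POSITIVE, NEGATIVE.
RAW_TO_ID = {0: 0, 1: 1, -1: 2}
ID2LABEL = {0: "NEUTRAL", 1: "POSITIVE", 2: "NEGATIVE"}

# Hypothetical column names; the real dataset's headers may differ.
TEXT_COLUMNS = ("title", "author", "review", "score")


def build_example(row):
    """Concatenate the text columns into one block and normalize the label."""
    # Assumption: columns are joined with a single space.
    text = " ".join(str(row[col]) for col in TEXT_COLUMNS)
    return {"text": text, "label": RAW_TO_ID[row["label"]]}


# Made-up row, for illustration only.
row = {
    "title": "Le Petit Prince",
    "author": "Antoine de Saint-Exupéry",
    "review": "Un classique.",
    "score": 5,
    "label": 1,
}
example = build_example(row)
print(example["text"])
print(ID2LABEL[example["label"]])
```

Note that the score ends up inside the input text, which is what lets the model lean on it when predicting.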

As this exercise is simply meant to demonstrate my ability to train a model, the model was trained for 2 epochs on 3000 training entries and evaluated on 300 test entries.

Notes on the three classes and the model's bias

The three classes are not equally represented in the dataset. Although the data is shuffled, positive reviews are the most common and are therefore the most often predicted category. In addition, keeping the review score (a number between 1 and 5) in the text block biases the model: it can make a prediction from the score alone.

Positive reviews: 2081

Negative reviews: 224

Neutral reviews: 695
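To put the imbalance in perspective, the short sketch below computes each class's share from the counts listed above, along with the accuracy a trivial "always predict the majority class" baseline would reach on data with the same distribution. The counts sum to 3000; the variable names are illustrative only.

```python
# Class counts as listed in this model card.
counts = {"POSITIVE": 2081, "NEGATIVE": 224, "NEUTRAL": 695}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always outputs the most frequent class would be
# right this often on data with the same class distribution.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

Any accuracy figure for the model is best read against this baseline of about 69%, since a model that ignores the text entirely could already reach it.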


Dataset used to train maclean-connor96/feedier-french-books