---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: imdb
      name: imdb
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: sepidmnorozy/Russian_sentiment
      name: sepidmnorozy/Russian_sentiment
      split: train
    metrics:
    - type: accuracy
      value: 75.66371681415929
      name: Accuracy
    - type: f1
      value: 83.6421871425303
      name: F1
    - type: precision
      value: 75.25730753396459
      name: Precision
    - type: recall
      value: 94.129763130793
      name: Recall
---
# Satoken
This is a SetFit model trained on multilingual datasets (mentioned below) for sentiment classification.

The model was trained using an efficient few-shot learning technique that involves:

- Fine-tuning a Sentence Transformer with contrastive learning.
- Training a classification head with features from the fine-tuned Sentence Transformer.

It is used by Germla for its feedback-analysis tool (specifically the sentiment-analysis feature).

For other (language-specific) models, check here.
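The two training steps above hinge on contrastive pair generation: labeled sentences are combined into positive pairs (same label) and negative pairs (different labels), and the Sentence Transformer is fine-tuned to pull positives together and push negatives apart. A minimal pure-Python sketch of the pairing idea (the function name and exhaustive pairing scheme are illustrative, not SetFit's actual implementation):

```python
from itertools import combinations

def generate_contrastive_pairs(sentences, labels):
    """Build (sentence_a, sentence_b, similarity) triples:
    similarity 1.0 for same-label pairs, 0.0 for different-label pairs."""
    pairs = []
    for (s1, l1), (s2, l2) in combinations(zip(sentences, labels), 2):
        pairs.append((s1, s2, 1.0 if l1 == l2 else 0.0))
    return pairs

texts = [
    "i loved the spiderman movie!",
    "great acting and a solid plot",
    "pineapple on pizza is the worst 🤮",
]
labels = [1, 1, 0]

# 3 sentences -> 3 pairs: one positive (the two label-1 texts)
# and two negative (each label-1 text against the label-0 text)
pairs = generate_contrastive_pairs(texts, labels)
```

With only a handful of labeled examples per class, the number of pairs grows quadratically, which is what makes this few-shot approach data-efficient.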
## Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```
You can then run inference as follows:

```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
## Training Details

### Training Data

### Training Procedure
We made sure the classes were balanced. The model was trained on only 35% (50% for Chinese) of the train split of each dataset.
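The balanced subsampling described above can be sketched as stratified sampling in pure Python (the helper below is a hypothetical illustration, not the actual training script):

```python
import random
from collections import defaultdict

def balanced_subsample(texts, labels, fraction=0.35, seed=42):
    """Draw an equal-sized random sample from each class,
    sized at `fraction` of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    n = int(min(len(v) for v in by_label.values()) * fraction)
    sample = []
    for label, items in by_label.items():
        for text in rng.sample(items, n):
            sample.append((text, label))
    return sample

# 1000 positive and 1000 negative dummy examples -> 350 of each class
texts = [f"example {i}" for i in range(2000)]
labels = [i % 2 for i in range(2000)]
subset = balanced_subsample(texts, labels)
```

Sizing each class's sample from the smallest class guarantees the subset stays balanced even when the source splits are skewed.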
#### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using NLTK
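The cleaning steps listed above could look roughly like the sketch below. The regexes and the hardcoded stopword set are illustrative only; the actual pipeline used NLTK's full English stopword list rather than this small subset.

```python
import re

# Illustrative subset of NLTK's English stopwords (assumption, not the real list)
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}

def clean(text):
    """Strip links, @mentions, and #hashtags, then drop stopwords."""
    text = re.sub(r"https?://\S+", "", text)  # links
    text = re.sub(r"[@#]\w+", "", text)       # mentions and hashtags
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

def dedupe(texts):
    """Remove exact duplicates while preserving order."""
    seen = set()
    return [t for t in texts if not (t in seen or seen.add(t))]

cleaned = clean("Check out https://example.com #spiderman the movie is great @bob")
# -> "check out movie great"
```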
#### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.
## Evaluation

### Testing Data, Factors & Metrics
## Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours Used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 kg CO₂ eq.