SetFit with BAAI/bge-base-en-v1.5

This is a SetFit model for multi-label text classification. It uses BAAI/bge-base-en-v1.5 as the Sentence Transformer embedding model and a SetFitHead instance for classification.

The model has been trained using an efficient few-shot learning technique that involves:

Fine-tuning a Sentence Transformer with contrastive learning.
Training a classification head with features from the fine-tuned Sentence Transformer.
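
As a rough illustration of the first step, contrastive fine-tuning needs sentence pairs with a similarity target. The sketch below builds such pairs for multi-label data under an assumed rule: two texts count as similar if they share at least one active label. The function name, the toy texts and their label vectors are hypothetical and are not taken from this model's training code.

```python
from itertools import combinations


def make_contrastive_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label_vector) items.

    Assumed rule: the target is 1.0 when both texts share at least one
    active label, and 0.0 otherwise.
    """
    pairs = []
    for (text_a, labels_a), (text_b, labels_b) in combinations(examples, 2):
        shares_label = any(a and b for a, b in zip(labels_a, labels_b))
        pairs.append((text_a, text_b, 1.0 if shares_label else 0.0))
    return pairs


# Hypothetical toy data; assumed label order: [GHGLabel, NetzeroLabel, NonGHGLabel]
toy_examples = [
    ("Cut emissions 40% by 2030.", [1, 0, 0]),
    ("Reach net zero by 2050.", [1, 1, 0]),
    ("Install 500 MW of solar capacity.", [0, 0, 1]),
]

pairs = make_contrastive_pairs(toy_examples)
for text_a, text_b, target in pairs:
    print(f"{target:.1f}  {text_a!r} <-> {text_b!r}")
```

Pairs built this way could then be fed to a similarity loss such as the CosineSimilarityLoss listed under the training hyperparameters.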

Model Details

This model predicts multiple labels simultaneously for a given input text. Specifically, it predicts three labels: GHGLabel, NetzeroLabel, and NonGHGLabel.

GHGLabel: The text contains a GHG target, i.e. a contribution framed as a targeted outcome in GHG terms.
NetzeroLabel: The text contains a net-zero target.
NonGHGLabel: The text contains a target not expressed in GHG terms, such as energy efficiency or expansion of solar energy production.

Model Description

Model Type: SetFit
Sentence Transformer body: BAAI/bge-base-en-v1.5
Classification head: a SetFitHead instance
Maximum Sequence Length: 512 tokens
Number of Classes: 3

Model Sources

Repository: SetFit on GitHub
Paper: Efficient Few-Shot Learning Without Prompts
Blogpost: SetFit: Efficient Few-Shot Learning Without Prompts

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("GIZ/SUBTARGET_multilabel_bge")
# Run inference
preds = model("This document enfolds Iceland’s first communication on its long-term strategy (LTS), to be updated when further analysis and policy documents are published on the matter. Iceland is committed to reducing its overall greenhouse gas emissions and reaching climate neutrality no later than 2040 and become fossil fuel free in 2050, which should set Iceland on a path to net negative emissions.")
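
Because the model is multi-label, a prediction is expected to be a vector with one 0/1 entry per label rather than a single class. The helper below is a minimal sketch for turning such a vector into label names. It assumes the output order is GHGLabel, NetzeroLabel, NonGHGLabel (the order in which the labels are listed in this card), which should be verified against the model before relying on it. `decode_prediction` and `example_pred` are illustrative names, not part of SetFit.

```python
# Assumed output order; verify against the model before relying on it.
LABEL_NAMES = ["GHGLabel", "NetzeroLabel", "NonGHGLabel"]


def decode_prediction(pred, label_names=LABEL_NAMES):
    """Map a 0/1 prediction vector to the names of the active labels."""
    flags = [int(value) for value in pred]
    if len(flags) != len(label_names):
        raise ValueError(
            f"expected {len(label_names)} values, got {len(flags)}"
        )
    return [name for name, flag in zip(label_names, flags) if flag == 1]


# Hypothetical prediction vector standing in for the `preds` value above.
example_pred = [1, 1, 0]
print(decode_prediction(example_pred))
```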

Training Details

Training Set Metrics

Training set	Min	Median	Max
Word count	19	78.5467	173

Training Dataset: 728

Class	Positive Count of Class
GHGLabel	440
NetzeroLabel	120
NonGHGLabel	259

Validation Dataset: 80

Class	Positive Count of Class
GHGLabel	49
NetzeroLabel	11
NonGHGLabel	30
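
The positive counts sum to more than the dataset size (819 positives across 728 training examples), which is consistent with some examples carrying more than one label. The short calculation below derives per-label positive rates and the average number of labels per example from the training counts above. The variable and function names are illustrative.

```python
# Counts copied from the training dataset table above.
train_size = 728
train_positives = {"GHGLabel": 440, "NetzeroLabel": 120, "NonGHGLabel": 259}


def positive_rates(counts, total):
    """Return the share of examples that are positive for each label."""
    return {label: count / total for label, count in counts.items()}


rates = positive_rates(train_positives, train_size)
avg_labels_per_example = sum(train_positives.values()) / train_size

for label, rate in rates.items():
    print(f"{label}: {rate:.1%} positive")
print(f"average labels per example: {avg_labels_per_example:.3f}")
```

NetzeroLabel is the rarest label by a wide margin, which may be worth keeping in mind when reading its per-label metrics.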

Training Hyperparameters

batch_size: (8, 2)
num_epochs: (1, 0)
max_steps: -1
sampling_strategy: undersampling
body_learning_rate: (6.86e-06, 1e-05)
head_learning_rate: 0.01
loss: CosineSimilarityLoss
distance_metric: cosine_distance
margin: 0.25
end_to_end: False
use_amp: False
warmup_proportion: 0.01
seed: 42
eval_max_steps: -1
load_best_model_at_end: False
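
For readers who want to reproduce a similar setup, the listed values can be expressed as a SetFit `TrainingArguments` object. This is a sketch and not the original training script. It assumes that tuple values mean (embedding phase, classifier phase) and that `distance_metric` can be left at the library default, taken here to be cosine distance. The helper `build_training_arguments` is hypothetical and returns `None` when SetFit is not installed.

```python
# Values copied from the hyperparameter list above.
# Tuples are assumed to mean (embedding phase, classifier phase).
hyperparameters = {
    "batch_size": (8, 2),
    "num_epochs": (1, 0),
    "max_steps": -1,
    "sampling_strategy": "undersampling",
    "body_learning_rate": (6.86e-06, 1e-05),
    "head_learning_rate": 0.01,
    "margin": 0.25,
    "end_to_end": False,
    "use_amp": False,
    "warmup_proportion": 0.01,
    "seed": 42,
    "eval_max_steps": -1,
    "load_best_model_at_end": False,
}


def build_training_arguments(params):
    """Return setfit.TrainingArguments, or None if SetFit is unavailable."""
    try:
        from sentence_transformers.losses import CosineSimilarityLoss
        from setfit import TrainingArguments
    except ImportError:
        return None
    # distance_metric is left at its default (assumed to be cosine distance).
    return TrainingArguments(loss=CosineSimilarityLoss, **params)


args = build_training_arguments(hyperparameters)
print("TrainingArguments built" if args is not None else "SetFit not installed")
```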

Embedding Training Results

Epoch	Step	Training Loss	Validation Loss
0.0000	1	0.2227	-
0.1519	5000	0.015	0.0831
0.3038	10000	0.0146	0.0924
0.4557	15000	0.0197	0.0827
0.6076	20000	0.0031	0.0883
0.7595	25000	0.0439	0.0865
0.9114	30000	0.0029	0.0914
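
The epoch and step columns together imply the length of one embedding-training epoch. The sketch below estimates it from the logged rows and converts it to an approximate number of sentence pairs, assuming the embedding batch size is 8 (the first element of `batch_size` above). The names used are illustrative.

```python
# (epoch fraction, step) rows copied from the table above; step 1 is skipped
# because its epoch value is rounded to 0.0000.
logged_rows = [
    (0.1519, 5000),
    (0.3038, 10000),
    (0.4557, 15000),
    (0.6076, 20000),
    (0.7595, 25000),
    (0.9114, 30000),
]

EMBEDDING_BATCH_SIZE = 8  # assumed: first element of batch_size (8, 2)


def estimate_steps_per_epoch(rows):
    """Average the step/epoch ratio over all logged rows."""
    return sum(step / epoch for epoch, step in rows) / len(rows)


steps_per_epoch = estimate_steps_per_epoch(logged_rows)
approx_pairs = steps_per_epoch * EMBEDDING_BATCH_SIZE

print(f"estimated steps per epoch: {steps_per_epoch:,.0f}")
print(f"approximate sentence pairs per epoch: {approx_pairs:,.0f}")
```

The estimate comes out at roughly 32,900 steps, far more than the 728 training examples, because contrastive training iterates over pairs of examples and not over individual examples.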

Classification report (the support column matches the positive counts of the validation set):

label	precision	recall	f1-score	support
GHG	0.884	0.938	0.910	49.0
Netzero	0.846	1.000	0.916	11.0
NonGHG	0.903	0.933	0.918	30.0
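
The report gives per-label scores only. As a convenience, the sketch below derives macro-averaged and support-weighted F1 from the reported values. These averages are computed here from the rounded table entries and are not figures reported by the model authors.

```python
# Per-label F1 and support copied from the classification report above.
report = {
    "GHG": {"f1": 0.910, "support": 49},
    "Netzero": {"f1": 0.916, "support": 11},
    "NonGHG": {"f1": 0.918, "support": 30},
}


def macro_f1(rows):
    """Unweighted mean of the per-label F1 scores."""
    return sum(row["f1"] for row in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-label F1 scores, weighted by support."""
    total_support = sum(row["support"] for row in rows.values())
    weighted_sum = sum(row["f1"] * row["support"] for row in rows.values())
    return weighted_sum / total_support


print(f"macro F1: {macro_f1(report):.3f}")
print(f"support-weighted F1: {weighted_f1(report):.3f}")
```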

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Carbon Emitted: 0.268 kg of CO2
Hours Used: 2.03 hours

Training Hardware

On Cloud: No
GPU Model: 1 x Tesla V100-SXM2-16GB
CPU Model: Intel(R) Xeon(R) CPU @ 2.20GHz
RAM Size: 12.67 GB

Framework Versions

Python: 3.10.12
SetFit: 1.0.3
Sentence Transformers: 2.3.1
Transformers: 4.35.2
PyTorch: 2.1.0+cu121
Datasets: 2.17.0
Tokenizers: 0.15.2
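
To compare a local environment against the versions listed above, a small check based on the standard library's `importlib.metadata` may be useful. This is a convenience sketch. The distribution names (for example `sentence-transformers` and `torch`) are the usual PyPI names and are an assumption, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; keys are assumed PyPI distribution names.
EXPECTED_VERSIONS = {
    "setfit": "1.0.3",
    "sentence-transformers": "2.3.1",
    "transformers": "4.35.2",
    "torch": "2.1.0+cu121",
    "datasets": "2.17.0",
    "tokenizers": "0.15.2",
}


def compare_versions(expected):
    """Return {package: (expected, installed_or_None, matches)}."""
    result = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        result[package] = (wanted, installed, installed == wanted)
    return result


for package, (wanted, installed, matches) in compare_versions(EXPECTED_VERSIONS).items():
    status = "ok" if matches else "differs"
    print(f"{package}: expected {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load, but it is a reasonable first thing to check if loading or inference behaves unexpectedly.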

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
