mtyrrell committed
Commit 5a7f0b1
Parent: bf62dbf

Update README.md

Files changed (1):
  1. README.md (+5 −2)
@@ -65,8 +65,11 @@ The pre-processing operations used to produce the final training dataset were as
65
  4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
66
  5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
67
  6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
68
- 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```.
69
-
 
 
 
70
 
71
  ## Training procedure
72
 
 
65
  4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
66
  5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
67
  6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
68
+ 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```. This is done to increase the number of training samples available for the GHG class from 42 to 84. The end result is a more equal sample per class breakdown of:
69
+ > - GHG: 84
70
+ > - NOT-GHG: 191
71
+ > - NEGATIVE: 190
72
+ 8. To address the remaining class imbalance, inverse frequency class weights are computed and passed to a custom single label trainer function which is used during hyperparameter tuning and final model training.
73
 
74
  ## Training procedure
75
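The "explode" step in the diff above (item 5) maps naturally onto `pandas.DataFrame.explode`. A minimal sketch, assuming the data is held in a pandas frame with list-valued 'context' and 'label' columns (the toy rows below are illustrative, not from the actual dataset; exploding several columns at once requires pandas >= 1.3):

```python
import pandas as pd

# Hypothetical toy frame: 'context' holds lists of text samples, and 'label'
# holds the per-sample labels aligned with each list.
df = pd.DataFrame({
    "context": [["sent a", "sent b"], ["sent c"]],
    "label": [["GHG", "NOT-GHG"], ["NEGATIVE"]],
})

# Explode both columns together so each text sample gets its own row
# with its matching label.
exploded = df.explode(["context", "label"], ignore_index=True)
print(len(exploded))  # 3
```

Exploding both columns in one call keeps sample and label aligned by position, which is what "labels are merged to align with the associated samples" requires.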
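The augmentation in item 7 is attributed to `albumentations` (sentence shuffle) and `nlpaug` (NLP-based insertions); those libraries' actual APIs are not shown in the diff. The snippet below is only a library-free sketch of the sentence-shuffle idea and of doubling a minority class, as the README describes for GHG (42 -> 84):

```python
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    # Library-free stand-in for a sentence-shuffle augmenter:
    # split into sentences, shuffle their order, rejoin.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

# Double the minority class by appending one shuffled copy per sample.
ghg = ["First fact. Second fact. Third fact."]
augmented = ghg + [shuffle_sentences(t, seed=i) for i, t in enumerate(ghg)]
print(len(augmented))  # 2
```

Sentence shuffling preserves the vocabulary and label of each sample while varying sentence order, which is why it is a cheap way to grow a small class without fabricating new content.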
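Step 8's inverse-frequency class weights can be computed directly from the post-augmentation counts listed in the diff. A sketch using one common normalisation, `total / (num_classes * count)`; the exact formula used by the custom trainer is not shown in the diff, so treat this as an assumption:

```python
from collections import Counter

# Post-augmentation class counts from the README.
counts = Counter({"GHG": 84, "NOT-GHG": 191, "NEGATIVE": 190})
total = sum(counts.values())   # 465
num_classes = len(counts)      # 3

# Inverse-frequency weighting: rarer classes get larger weights.
weights = {c: total / (num_classes * n) for c, n in counts.items()}
print(weights["GHG"])  # ~1.845, upweighted relative to the two larger classes
```

In a custom single-label trainer these weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss` inside an overridden loss computation, though the diff does not show that code.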