mtyrrell committed
Commit 5a7f0b1
Parent: bf62dbf

Update README.md

Files changed (1):
  1. README.md (+5 −2)
@@ -65,8 +65,11 @@ The pre-processing operations used to produce the final training dataset were as
65
  4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
66
  5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
67
  6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
68
- 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```.
69
-
 
 
 
70
 
71
  ## Training procedure
72
 
 
65
  4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
66
  5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
67
  6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
68
+ 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```. This is done to increase the number of training samples available for the GHG class from 42 to 84. The end result is a more equal sample per class breakdown of:
69
+ > - GHG: 84
70
+ > - NOT-GHG: 191
71
+ > - NEGATIVE: 190
72
+ 8. To address the remaining class imbalance, inverse frequency class weights are computed and passed to a custom single label trainer function which is used during hyperparameter tuning and final model training.
73
 
74
  ## Training procedure
75
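The "explode" step in the diff above (item 5) maps naturally onto `pandas.DataFrame.explode`. A minimal sketch, assuming the data is held in a pandas frame with list-valued 'context' and 'label' columns (the toy rows below are illustrative, not from the actual dataset; exploding several columns at once requires pandas >= 1.3):

```python
import pandas as pd

# Hypothetical toy frame: 'context' holds lists of text samples, and 'label'
# holds the per-sample labels aligned with each list.
df = pd.DataFrame({
    "context": [["sent a", "sent b"], ["sent c"]],
    "label": [["GHG", "NOT-GHG"], ["NEGATIVE"]],
})

# Explode both columns together so each text sample gets its own row
# with its matching label.
exploded = df.explode(["context", "label"], ignore_index=True)
print(len(exploded))  # 3
```

Exploding both columns in one call keeps sample and label aligned by position, which is what "labels are merged to align with the associated samples" requires.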
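The augmentation in item 7 is attributed to `albumentations` (sentence shuffle) and `nlpaug` (NLP-based insertions); those libraries' actual APIs are not shown in the diff. The snippet below is only a library-free sketch of the sentence-shuffle idea and of doubling a minority class, as the README describes for GHG (42 -> 84):

```python
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    # Library-free stand-in for a sentence-shuffle augmenter:
    # split into sentences, shuffle their order, rejoin.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

# Double the minority class by appending one shuffled copy per sample.
ghg = ["First fact. Second fact. Third fact."]
augmented = ghg + [shuffle_sentences(t, seed=i) for i, t in enumerate(ghg)]
print(len(augmented))  # 2
```

Sentence shuffling preserves the vocabulary and label of each sample while varying sentence order, which is why it is a cheap way to grow a small class without fabricating new content.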
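Step 8's inverse-frequency class weights can be computed directly from the post-augmentation counts listed in the diff. A sketch using one common normalisation, `total / (num_classes * count)`; the exact formula used by the custom trainer is not shown in the diff, so treat this as an assumption:

```python
from collections import Counter

# Post-augmentation class counts from the README.
counts = Counter({"GHG": 84, "NOT-GHG": 191, "NEGATIVE": 190})
total = sum(counts.values())   # 465
num_classes = len(counts)      # 3

# Inverse-frequency weighting: rarer classes get larger weights.
weights = {c: total / (num_classes * n) for c, n in counts.items()}
print(weights["GHG"])  # ~1.845, upweighted relative to the two larger classes
```

In a custom single-label trainer these weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss` inside an overridden loss computation, though the diff does not show that code.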