pszemraj committed
Commit c9d4dff · verified · 1 Parent(s): 2dbf21a

Update README.md

Files changed (1): README.md (+21 -11)
README.md CHANGED
@@ -733,17 +733,27 @@ It achieves the following results on the evaluation set:
 
 Thus far, all runs completed in fp32 (using NVIDIA TF32 dtype behind the scenes)
 
-| Task  | Eval Loss | Combined Score | Accuracy | F1 Score | Matthews Correlation | Pearson Correlation |
-|-------|-----------|----------------|----------|----------|----------------------|---------------------|
-| RTE   | 0.7066    | -              | 66.06%   | -        | -                    | -                   |
-| SST-2 | 0.2464    | -              | 90.6%    | -        | -                    | -                   |
-| STS-B | 0.4103    | 92.07%         | -        | -        | -                    | 92.23%              |
-| WNLI  | 0.7309    | -              | 30.99%   | -        | -                    | -                   |
-| MRPC  | 0.3759    | 86.12%         | 83.58%   | 88.66%   | -                    | -                   |
-| CoLA  | 0.4582    | -              | -        | -        | 59.27%               | -                   |
-
-*Model Source: BEE-spoke-data/bert-plus-L8-4096-v1.0*
-
 
+| GLUE Task | Accuracy | Combined Score | Pearson | Spearman | Matthews Correlation | Loss   |
+|-----------|----------|----------------|---------|----------|----------------------|--------|
+| QQP       | 91.0%    | 89.23%         | -       | -        | -                    | 0.2264 |
+| SST2      | 90.6%    | -              | -       | -        | -                    | 0.2464 |
+| QNLI      | 89.6%    | -              | -       | -        | -                    | 0.2891 |
+| MRPC      | 84.07%   | 86.59%         | -       | -        | -                    | 0.3759 |
+| STSB      | -        | 92.07%         | 92.23%  | 91.92%   | -                    | 0.4103 |
+| MNLI      | 82.2%    | -              | -       | -        | -                    | 0.4602 |
+| CoLA      | -        | -              | -       | -        | 60.72%               | 0.4569 |
+| RTE       | 66.43%   | -              | -       | -        | -                    | 0.6981 |
+| WNLI      | 35.21%   | -              | -       | -        | -                    | 0.7425 |
+
+### Observations
+
+- **Performance variation**: Results vary substantially across GLUE tasks, reflecting the distinct nature of each task, the size and complexity of its dataset, and how well the model's architecture and hyperparameters suit it.
+- **Hyperparameter impact**: Weight decay and batch size have nuanced, task-dependent effects, underscoring the importance of per-task hyperparameter tuning.
+- **Technology features**: `tf32` and `torch_compile` were used on some tasks (e.g., SST2, MRPC, CoLA); their impact is mixed and likely depends on the task and model architecture, but they are worth exploring (see the configuration sketch after the diff).
+- **Batch size and gradient accumulation steps**: These vary across tasks, balancing computational efficiency against training stability; larger effective batch sizes can stabilize training but must be adjusted to the available hardware and the task at hand.
+- **Task-specific challenges**: WNLI and RTE score markedly lower than the other tasks, likely due to their small dataset sizes and the subtlety of the entailment judgments they require.
+- **Overall performance**: The model is strong on the regression task STSB and on high-resource classification tasks such as QQP, SST2, and MNLI, but struggles on small or nuanced datasets like WNLI and RTE, underscoring the need for task-tailored approaches.
+
 ---
 
 ## Training procedure
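
The note above describes the runs as "fp32 using NVIDIA TF32 behind the scenes." As a point of reference only (this is not taken from the actual training scripts), a minimal sketch of how TF32 execution is typically enabled in PyTorch:

```python
import torch

# TF32 executes fp32 matmuls and cuDNN convolutions on Ampere+ tensor
# cores with a reduced-precision mantissa; parameters, activations, and
# checkpoints all remain fp32 -- only the internal matmul math changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```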
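Likewise, the `tf32`, `torch_compile`, batch-size, and gradient-accumulation knobs called out in the observations map directly onto Hugging Face `TrainingArguments`. The values below are illustrative placeholders, not the per-task settings actually used for these runs:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="glue-finetune",       # hypothetical output path
    tf32=True,                        # enable TF32 (same effect as the backend flags above)
    torch_compile=True,               # wrap the model in torch.compile() for training
    per_device_train_batch_size=16,   # placeholder; varied per task
    gradient_accumulation_steps=2,    # placeholder; effective batch size = 16 * 2
    weight_decay=0.01,                # placeholder; varied per task
)
```

Raising `gradient_accumulation_steps` trades steps-per-second for a larger effective batch, which is the stability-versus-hardware balance the observations describe.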