pszemraj committed
Commit c9d4dff · verified · 1 Parent(s): 2dbf21a

Update README.md

Files changed (1): README.md (+21 -11)
README.md CHANGED
@@ -733,17 +733,27 @@ It achieves the following results on the evaluation set:
 
 Thus far, all runs completed in fp32 (using NVIDIA TF32 dtype behind the scenes)
 
-| Task  | Eval Loss | Combined Score | Accuracy | F1 Score | Matthews Correlation | Pearson Correlation |
-|-------|-----------|----------------|----------|----------|----------------------|---------------------|
-| RTE   | 0.7066    | -              | 66.06%   | -        | -                    | -                   |
-| SST-2 | 0.2464    | -              | 90.6%    | -        | -                    | -                   |
-| STS-B | 0.4103    | 92.07%         | -        | -        | -                    | 92.23%              |
-| WNLI  | 0.7309    | -              | 30.99%   | -        | -                    | -                   |
-| MRPC  | 0.3759    | 86.12%         | 83.58%   | 88.66%   | -                    | -                   |
-| CoLA  | 0.4582    | -              | -        | -        | 59.27%               | -                   |
-
-*Model Source: BEE-spoke-data/bert-plus-L8-4096-v1.0*
-
 
+| GLUE Task | Accuracy | Combined Score | Pearson | Spearman | Matthews Correlation | Loss   |
+|-----------|----------|----------------|---------|----------|----------------------|--------|
+| QQP       | 91.0%    | 89.23%         | -       | -        | -                    | 0.2264 |
+| SST2      | 90.6%    | -              | -       | -        | -                    | 0.2464 |
+| QNLI      | 89.6%    | -              | -       | -        | -                    | 0.2891 |
+| MRPC      | 84.07%   | 86.59%         | -       | -        | -                    | 0.3759 |
+| STSB      | -        | 92.07%         | 92.23%  | 91.92%   | -                    | 0.4103 |
+| MNLI      | 82.2%    | -              | -       | -        | -                    | 0.4602 |
+| CoLA      | -        | -              | -       | -        | 60.72%               | 0.4569 |
+| RTE       | 66.43%   | -              | -       | -        | -                    | 0.6981 |
+| WNLI      | 35.21%   | -              | -       | -        | -                    | 0.7425 |
+
+### Observations
+
+- **Performance variation**: Results vary substantially across GLUE tasks, reflecting the distinct nature of each task, the size and complexity of its dataset, and how well the model's architecture and hyperparameters suit it.
+- **Hyperparameter impact**: Weight decay and batch size have nuanced, task-dependent effects, underscoring the importance of per-task hyperparameter tuning.
+- **Technology features**: `tf32` and `torch_compile` were used on some tasks (e.g., SST2, MRPC, CoLA); their impact is mixed and likely depends on the task and model architecture, but they are worth exploring (see the configuration sketch after the diff).
+- **Batch size and gradient accumulation steps**: These vary across tasks, balancing computational efficiency against training stability; larger effective batch sizes can stabilize training but must be adjusted to the available hardware and the task at hand.
+- **Task-specific challenges**: WNLI and RTE score markedly lower than the other tasks, likely due to their small dataset sizes and the subtlety of the entailment judgments they require.
+- **Overall performance**: The model is strong on the regression task STSB and on high-resource classification tasks such as QQP, SST2, and MNLI, but struggles on small or nuanced datasets like WNLI and RTE, underscoring the need for task-tailored approaches.
+
 ---
 
 ## Training procedure
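
The note above describes the runs as "fp32 using NVIDIA TF32 behind the scenes." As a point of reference only (this is not taken from the actual training scripts), a minimal sketch of how TF32 execution is typically enabled in PyTorch:

```python
import torch

# TF32 executes fp32 matmuls and cuDNN convolutions on Ampere+ tensor
# cores with a reduced-precision mantissa; parameters, activations, and
# checkpoints all remain fp32 -- only the internal matmul math changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```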
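Likewise, the `tf32`, `torch_compile`, batch-size, and gradient-accumulation knobs called out in the observations map directly onto Hugging Face `TrainingArguments`. The values below are illustrative placeholders, not the per-task settings actually used for these runs:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="glue-finetune",       # hypothetical output path
    tf32=True,                        # enable TF32 (same effect as the backend flags above)
    torch_compile=True,               # wrap the model in torch.compile() for training
    per_device_train_batch_size=16,   # placeholder; varied per task
    gradient_accumulation_steps=2,    # placeholder; effective batch size = 16 * 2
    weight_decay=0.01,                # placeholder; varied per task
)
```

Raising `gradient_accumulation_steps` trades steps-per-second for a larger effective batch, which is the stability-versus-hardware balance the observations describe.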