
BEE-spoke-data/bert-plus-L8-4096-v1.0


Still running some evals, etc.; expect the model card to change a bit.

  • No additional code is needed: this model uses position_embedding_type="relative_key" to help with long context.
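
For illustration, a minimal sketch of loading the checkpoint with stock transformers (no custom code or trust_remote_code needed); the printed values should reflect the configuration described above:

```python
# a minimal sketch, assuming the checkpoint loads with stock transformers
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

repo = "BEE-spoke-data/bert-plus-L8-4096-v1.0"

config = AutoConfig.from_pretrained(repo)
# expect "relative_key" and 4096, per the description/name above
print(config.position_embedding_type, config.max_position_embeddings)

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)
```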

This checkpoint

This checkpoint is a further progression of the model after multitask training and related steps; the most recent dataset it saw was euirim/goodwiki.

It achieves the following results on the evaluation set:

  • Loss: 1.9835
  • Accuracy: 0.6159

GLUE benchmark

WIP till this text is removed

Thus far, all runs were completed in fp32 (using NVIDIA's TF32 dtype behind the scenes when supported).
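
As an illustration (not the exact evaluation script), TF32 matmuls for fp32 tensors are typically enabled in PyTorch like this:

```python
# illustration only: enable NVIDIA TF32 math for fp32 tensors on Ampere+ GPUs,
# matching the "fp32 params, TF32 behind the scenes" setup noted above
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```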

| Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|
| bert-plus-L8-4096-v1.0 | 88.1M | 82.78 | 62.72 | 90.6 | 86.59 | 92.07 | 90.6 | 83.2 | 90.0 | 66.43 |
| bert_uncased_L-8_H-768_A-12 | 81.2M | 81.65 | 54.0 | 92.6 | 85.43 | 92.60 | 90.6 | 81.0 | 90.0 | 67.0 |
| bert-base-uncased | 110M | 79.05 | 52.1 | 93.5 | 88.9 | 85.8 | 71.2 | 84.0 | 90.5 | 66.4 |

And some comparisons to recent BERT models, taken from Nomic's blog post:

| Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|
| NomicBERT | 137M | 84.00 | 50.00 | 93.00 | 88.00 | 90.00 | 92.00 | 86.00 | 92.00 | 82.00 |
| RobertaBase | 125M | 86.00 | 64.00 | 95.00 | 90.00 | 91.00 | 92.00 | 88.00 | 93.00 | 79.00 |
| JinaBERTBase | 137M | 83.00 | 51.00 | 95.00 | 88.00 | 90.00 | 81.00 | 86.00 | 92.00 | 79.00 |
| MosaicBERT | 137M | 85.00 | 59.00 | 94.00 | 89.00 | 90.00 | 92.00 | 86.00 | 91.00 | 83.00 |

Observations:

  1. Performance Variation Across Models and Tasks: The data highlights significant performance variability both across and within models for different GLUE tasks. This variability underscores the complexity of natural language understanding tasks and the need for models to be versatile in handling different types of linguistic challenges.

  2. Model Size and Efficiency: Despite the differences in model size, there is not always a direct correlation between size and performance across tasks. For instance, bert_uncased_L-8_H-768_A-12 performs competitively with larger models in certain tasks, suggesting that efficiency in model architecture and training can compensate for smaller model sizes.

  3. Task-specific Challenges: Certain tasks, such as RTE, present considerable challenges to all models, indicating the difficulty of tasks that require deep understanding and reasoning over language. This suggests areas where further research and model innovation are needed to improve performance.

  4. Overall Model Performance: Models like roberta-base show strong performance across a broad spectrum of tasks, indicating the effectiveness of their architecture and pre-training methodology. Meanwhile, models such as BEE-spoke-data/bert-plus-L8-4096-v1.0 showcase the potential for achieving competitive performance at relatively smaller sizes, emphasizing the importance of model design and optimization.


Training procedure

The section below is auto-generated and applies only to the 'finishing touches' run on goodwiki.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 31010
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 1.0
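
For readers who want to reproduce something similar, a hedged sketch of how these values map onto transformers.TrainingArguments (the output_dir and surrounding trainer wiring are assumptions, not taken from the original run script):

```python
# hedged sketch: the listed hyperparameters expressed as TrainingArguments;
# output_dir is hypothetical, everything else mirrors the values above
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-plus-L8-4096-goodwiki",  # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=31010,
    gradient_accumulation_steps=16,  # 4 * 16 = total train batch size of 64
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=100,
    num_train_epochs=1.0,
    tf32=True,  # fp32 weights with TF32 matmuls where supported
)
```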

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 2.1283 | 0.25 | 150 | 2.0892 | 0.6018 |
| 2.0999 | 0.5 | 300 | 2.0387 | 0.6084 |
| 2.0595 | 0.75 | 450 | 1.9971 | 0.6143 |
| 2.0481 | 1.0 | 600 | 1.9893 | 0.6152 |

Framework versions

  • Transformers 4.37.2
  • Pytorch 2.3.0.dev20240206+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.1
