
BEE-spoke-data/bert-plus-L8-4096-v1.0


Still running some evals, etc.; expect the model card to change a bit.

  • No additional code is needed: this model uses position_embedding_type="relative_key" to help with long context.
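
For illustration, a minimal sketch of loading the checkpoint with stock transformers (no custom code or trust_remote_code needed); the printed values should reflect the configuration described above:

```python
# a minimal sketch, assuming the checkpoint loads with stock transformers
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

repo = "BEE-spoke-data/bert-plus-L8-4096-v1.0"

config = AutoConfig.from_pretrained(repo)
# expect "relative_key" and 4096, per the description/name above
print(config.position_embedding_type, config.max_position_embeddings)

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)
```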

This checkpoint

This checkpoint is a further progression of the model after multitask training and related steps; the most recent dataset it saw was euirim/goodwiki.

It achieves the following results on the evaluation set:

  • Loss: 1.9835
  • Accuracy: 0.6159

GLUE benchmark

WIP till this text is removed

Thus far, all runs were completed in fp32 (using NVIDIA's TF32 dtype behind the scenes when supported).
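
As an illustration (not the exact evaluation script), TF32 matmuls for fp32 tensors are typically enabled in PyTorch like this:

```python
# illustration only: enable NVIDIA TF32 math for fp32 tensors on Ampere+ GPUs,
# matching the "fp32 params, TF32 behind the scenes" setup noted above
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```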

| Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|
| bert-plus-L8-4096-v1.0 | 88.1M | 82.78 | 62.72 | 90.6 | 86.59 | 92.07 | 90.6 | 83.2 | 90.0 | 66.43 |
| bert_uncased_L-8_H-768_A-12 | 81.2M | 81.65 | 54.0 | 92.6 | 85.43 | 92.60 | 90.6 | 81.0 | 90.0 | 67.0 |
| bert-base-uncased | 110M | 79.05 | 52.1 | 93.5 | 88.9 | 85.8 | 71.2 | 84.0 | 90.5 | 66.4 |

And some comparisons to recent BERT models, taken from Nomic's blog post:

| Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|
| NomicBERT | 137M | 84.00 | 50.00 | 93.00 | 88.00 | 90.00 | 92.00 | 86.00 | 92.00 | 82.00 |
| RobertaBase | 125M | 86.00 | 64.00 | 95.00 | 90.00 | 91.00 | 92.00 | 88.00 | 93.00 | 79.00 |
| JinaBERTBase | 137M | 83.00 | 51.00 | 95.00 | 88.00 | 90.00 | 81.00 | 86.00 | 92.00 | 79.00 |
| MosaicBERT | 137M | 85.00 | 59.00 | 94.00 | 89.00 | 90.00 | 92.00 | 86.00 | 91.00 | 83.00 |

Observations:

  1. Performance Variation Across Models and Tasks: The data highlights significant performance variability both across and within models for different GLUE tasks. This variability underscores the complexity of natural language understanding tasks and the need for models to be versatile in handling different types of linguistic challenges.

  2. Model Size and Efficiency: Despite the differences in model size, there is not always a direct correlation between size and performance across tasks. For instance, bert_uncased_L-8_H-768_A-12 performs competitively with larger models in certain tasks, suggesting that efficiency in model architecture and training can compensate for smaller model sizes.

  3. Task-specific Challenges: Certain tasks, such as RTE, present considerable challenges to all models, indicating the difficulty of tasks that require deep understanding and reasoning over language. This suggests areas where further research and model innovation are needed to improve performance.

  4. Overall Model Performance: Models like roberta-base show strong performance across a broad spectrum of tasks, indicating the effectiveness of their architecture and pre-training methodology. Meanwhile, models such as BEE-spoke-data/bert-plus-L8-4096-v1.0 showcase the potential for achieving competitive performance at relatively smaller sizes, emphasizing the importance of model design and optimization.


Training procedure

The section below is auto-generated and applies only to the 'finishing touches' run on goodwiki.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 31010
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 1.0
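
For readers who want to reproduce something similar, a hedged sketch of how these values map onto transformers.TrainingArguments (the output_dir and surrounding trainer wiring are assumptions, not taken from the original run script):

```python
# hedged sketch: the listed hyperparameters expressed as TrainingArguments;
# output_dir is hypothetical, everything else mirrors the values above
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-plus-L8-4096-goodwiki",  # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=31010,
    gradient_accumulation_steps=16,  # 4 * 16 = total train batch size of 64
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=100,
    num_train_epochs=1.0,
    tf32=True,  # fp32 weights with TF32 matmuls where supported
)
```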

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 2.1283 | 0.25 | 150 | 2.0892 | 0.6018 |
| 2.0999 | 0.5 | 300 | 2.0387 | 0.6084 |
| 2.0595 | 0.75 | 450 | 1.9971 | 0.6143 |
| 2.0481 | 1.0 | 600 | 1.9893 | 0.6152 |

Framework versions

  • Transformers 4.37.2
  • Pytorch 2.3.0.dev20240206+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.1
