|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- c4 |
|
language: |
|
- en |
|
--- |
|
|
|
# MosaicBERT base model |
|
Our goal in developing MosaicBERT was to greatly reduce pretraining time. |
|
|
|
## Model description |
|
|
|
To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), unpadded training, low-precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).
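
As a concrete illustration of one of these choices, ALiBi drops learned position embeddings and instead adds a head-specific, distance-proportional penalty to the attention scores. The sketch below is a minimal reconstruction of that bias in the symmetric form suitable for a bidirectional encoder; it is illustrative only and is not MosaicBERT's actual implementation (head count and sequence length are arbitrary).

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) additive ALiBi attention bias."""
    # Per-head slopes form a geometric sequence (power-of-two head counts shown;
    # Press et al. 2021 describe the general case).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # |i - j| distance between query position i and key position j; a bidirectional
    # encoder penalizes distance symmetrically rather than causally.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    # Larger distance -> more negative bias -> less attention, with a per-head slope.
    return -slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)  # added to Q @ K^T / sqrt(d) before softmax
```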
|
|
|
1. Modifications to the Attention Mechanism

**FlashAttention:** Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).
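
MosaicBERT itself uses the Triton-based FlashAttention module linked above. As a rough, framework-level illustration of the same memory-aware idea (and not of MosaicBERT's actual code path), PyTorch's `torch.nn.functional.scaled_dot_product_attention` fuses the attention computation and can dispatch to a FlashAttention-style kernel on supported GPUs:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Illustrative shapes only: (batch, n_heads, seq_len, head_dim).
q = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)
k = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)
v = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)

# The fused kernel computes softmax(Q K^T / sqrt(d)) V without materializing the
# full seq_len x seq_len attention matrix in HBM, which is the core FlashAttention idea.
out = F.scaled_dot_product_attention(q, k, v)  # (8, 12, 128, 64)
```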
|
|
|
|
|
## How to use
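
The snippet below is a minimal sketch of loading this model with the Hugging Face `transformers` library. The repository id `mosaicml/mosaic-bert-base` and the `trust_remote_code=True` flag (assumed to be needed because the architecture relies on custom modeling code rather than the stock BERT implementation) should be checked against this repository's files.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "mosaicml/mosaic-bert-base"  # assumed repo id; adjust if it differs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Masked-token prediction with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The capital of France is [MASK]."))
```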
|
|
|
## Training data |
|
|
|
MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: some of the tokens in an input text sequence are masked out, and the model must predict them. MosaicBERT is trained on the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora such as English Wikipedia and BooksCorpus.
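
To make the MLM objective concrete, the sketch below shows the standard dynamic-masking setup with `transformers.DataCollatorForLanguageModeling`; the 15% masking probability and the `bert-base-uncased` tokenizer are the classic BERT defaults, used here purely for illustration rather than as a statement of MosaicBERT's exact pretraining configuration.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any BERT-style WordPiece tokenizer illustrates the point; bert-base-uncased is assumed here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The collator dynamically replaces a fraction of tokens with [MASK] (plus a few random
# or unchanged tokens, per the standard BERT recipe) and sets labels to -100 everywhere
# except the masked positions, so the loss is computed only on tokens to be reconstructed.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("MosaicBERT is pretrained with masked language modeling.")])
print(batch["input_ids"])  # some ids replaced by tokenizer.mask_token_id
print(batch["labels"])     # -100 except at the masked positions
```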
|
|
|
## Training procedure |
|
|
|
## Evaluation results |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
GLUE test results: |
|
|
|
| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | |
|
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:| |
|
| | | | | | | | | | | |
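
A minimal sketch of how one might fine-tune this checkpoint on a GLUE task follows. It assumes the repository's custom code can be loaded through `AutoModelForSequenceClassification` with `trust_remote_code=True`; if it cannot, the base encoder can instead be wrapped with a small classification head. The task (MRPC), repo id, and hyperparameters are illustrative only.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "mosaicml/mosaic-bert-base"  # assumed repo id; adjust if it differs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# MRPC: sentence-pair paraphrase detection, one of the GLUE tasks listed above.
raw = load_dataset("glue", "mrpc")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

data = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mosaicbert-mrpc",
                           per_device_train_batch_size=32,
                           learning_rate=2e-5,
                           num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy/F1
```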
|
|
|
## Intended uses & limitations |