	---
	language: nl
	datasets:
	- mC4
	- Dutch_news
	---

	# Pino (Dutch BigBird) base model

	Created by [Dat Nguyen](https://www.linkedin.com/in/dat-nguyen-49a641138/) & [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/) during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104).

	(Not finished yet)



	BigBird is a sparse-attention-based transformer that extends Transformer-based models such as BERT to much longer sequences. Moreover, BigBird comes with a theoretical understanding of which capabilities of a full transformer the sparse model can retain.

	Pino is a BigBird model pretrained on Dutch text using a masked language modeling (MLM) objective. The BigBird architecture was introduced in this [paper](https://arxiv.org/abs/2007.14062) and first released in this [repository](https://github.com/google-research/bigbird).

	## Model description

	BigBird relies on block sparse attention instead of full attention (i.e. BERT's attention) and can handle sequences up to a length of 4096 at a much lower compute cost than BERT. The BigBird architecture has achieved state-of-the-art results on various tasks involving very long sequences, such as long-document summarization and question answering with long contexts.
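
To make the cost difference concrete, here is a minimal sketch of a block sparse attention pattern. It assumes the three components described in the BigBird paper (global blocks, a sliding window, and random blocks). The function name `block_sparse_pattern` and the exact layout (first and last block global, a window of three blocks) are illustrative assumptions, not the layout used by the `transformers` implementation.

```python
import random


def block_sparse_pattern(seq_len, block_size=64, num_random_blocks=3, seed=0):
    """Return, for each query block, the set of key blocks it attends to.

    Simplified illustration: the first and last blocks are global, every
    block sees itself and its direct neighbours, plus a few random blocks.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    num_blocks = seq_len // block_size
    rng = random.Random(seed)
    global_blocks = {0, num_blocks - 1}

    pattern = []
    for q in range(num_blocks):
        if q in global_blocks:
            # Global blocks attend to every block.
            pattern.append(set(range(num_blocks)))
            continue
        # Sliding window: the block itself and its direct neighbours.
        keys = {k for k in (q - 1, q, q + 1) if 0 <= k < num_blocks}
        # Every block also attends to the global blocks.
        keys |= global_blocks
        # Random blocks are drawn from the blocks not covered yet.
        candidates = [k for k in range(num_blocks) if k not in keys]
        keys |= set(rng.sample(candidates, min(num_random_blocks, len(candidates))))
        pattern.append(keys)
    return pattern


pattern = block_sparse_pattern(seq_len=4096)
num_blocks = len(pattern)
attended = sum(len(keys) for keys in pattern)
print(f"block pairs computed: {attended} of {num_blocks * num_blocks}")
```

Under these assumptions each non-global block attends to a constant number of key blocks, so the number of computed block pairs grows linearly with the sequence length, whereas full attention grows quadratically.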

	## How to use

	Here is how to load this model in PyTorch to get the features of a given text:

	```python
	from transformers import BigBirdModel

	# by default it's in `block_sparse` mode with num_random_blocks=3, block_size=64
	model = BigBirdModel.from_pretrained("flax-community/pino-bigbird-roberta-base")

	# you can change `attention_type` to full attention like this:
	model = BigBirdModel.from_pretrained("flax-community/pino-bigbird-roberta-base", attention_type="original_full")

	# you can change `block_size` & `num_random_blocks` like this:
	model = BigBirdModel.from_pretrained("flax-community/pino-bigbird-roberta-base", block_size=16, num_random_blocks=2)


	```
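
The snippet above only loads the model. A minimal sketch of the feature-extraction step could look as follows. It assumes that the repository ships a tokenizer that `AutoTokenizer` can load and weights that PyTorch can read; this card confirms neither. The helper name `extract_features` and the example sentence are illustrative.

```python
def extract_features(text, model_name="flax-community/pino-bigbird-roberta-base"):
    """Return the last hidden states for `text` (shape: 1 x tokens x hidden size)."""
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from transformers import AutoTokenizer, BigBirdModel

    # Assumption: the repository contains tokenizer files and PyTorch weights.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = BigBirdModel.from_pretrained(model_name)
    model.eval()

    encoded_input = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded_input)
    return output.last_hidden_state


# Example call (downloads the model on first use):
# features = extract_features("Dit is een voorbeeldzin.")
```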

	## Training Data

	This model is pre-trained on mC4 and on Dutch news scraped from NRC and Nu.nl. It uses a fast byte-level BPE (BBPE) tokenizer, as RoBERTa does (which in turn borrowed it from GPT-2), in contrast to the SentencePiece tokenizer of the original BigBird.
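
As a small illustration of what "byte-level" means: any string, including Dutch words with diacritics, decomposes into UTF-8 bytes, so 256 base symbols cover all input and no unknown token is needed. The sketch below shows only the byte decomposition that BPE merges start from; it is not the tokenizer shipped with this model, and `to_byte_symbols` is a hypothetical helper name.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 bytes, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))


word = "één"  # Dutch for "one"
symbols = to_byte_symbols(word)
print(len(word), "characters ->", len(symbols), "byte symbols")

# The decomposition is lossless: the bytes decode back to the original text.
assert bytes(symbols).decode("utf-8") == word
```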

	## Training Procedure
	The data is cleaned as follows:

	- Remove texts containing HTML code, JavaScript code, lorem ipsum filler, or policy statements.
	- Remove lines without an end mark.
	- Remove texts and words that are too short.
	- Remove texts and words that are too long.
	- Remove bad words.
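
The cleaning rules above can be sketched as a simple filter. The thresholds, the marker patterns, and the `BAD_WORDS` set below are hypothetical placeholders, because this card does not state the values that were actually used. The sketch also makes two interpretive choices: length limits are counted in words per line, and a line containing a bad word is dropped as a whole.

```python
import re

# All constants are illustrative assumptions, not the values used for training.
MARKUP_PATTERN = re.compile(
    r"<[^>]+>|\{|\}|lorem ipsum|cookie policy|privacy policy", re.IGNORECASE
)
END_MARKS = (".", "!", "?", '"')
MIN_WORDS, MAX_WORDS = 3, 1000  # allowed number of words per line
MAX_WORD_LENGTH = 40            # longest word allowed
BAD_WORDS = {"badword"}         # placeholder list


def clean_document(text):
    """Return the cleaned text, or None if the whole document is rejected."""
    # Reject documents that contain markup, script-like code, or filler text.
    if MARKUP_PATTERN.search(text):
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        words = line.split()
        # Drop lines that do not end with an end mark.
        if not line.endswith(END_MARKS):
            continue
        # Drop lines that are too short or too long.
        if not MIN_WORDS <= len(words) <= MAX_WORDS:
            continue
        # Drop lines with overly long words.
        if any(len(w) > MAX_WORD_LENGTH for w in words):
            continue
        # Drop lines that contain a bad word.
        if any(w.lower().strip('.,!?"') in BAD_WORDS for w in words):
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None
```

For example, `clean_document("Dit is een goede zin.\nGeen eindteken hier")` keeps only the first line, since the second one lacks an end mark.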



	## BibTeX entry and citation info

	```tex
	@misc{zaheer2021big,
	title={Big Bird: Transformers for Longer Sequences},
	author={Manzil Zaheer and Guru Guruganesh and Avinava Dubey and Joshua Ainslie and Chris Alberti and Santiago Ontanon and Philip Pham and Anirudh Ravula and Qifan Wang and Li Yang and Amr Ahmed},
	year={2021},
	eprint={2007.14062},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```