---
language: vi
widget:
- text: "Hoàng_Sa và Trường_Sa là <mask> Việt_Nam ."
tags:
- roberta
- longformer
- long context
pipeline_tag: fill-mask
---

# Longformer PhoBERT base model with max input length of 4096

**Experiment performed with Transformers version 4.25.1**\
A Longformer RoBERTa model for long contexts, based on [vinai/phobert-base](https://huggingface.co/vinai/phobert-base) and [Longformer](https://arxiv.org/abs/2004.05150).\
The PhoBERT model is converted to a Longformer version using the [author's repo](https://github.com/allenai/longformer), then MLM pretraining is continued for 5,000 steps with a batch size of 64 on the [Binhvq News Corpus](https://github.com/binhvq/news-corpus) so that the model learns to work with the new sliding-window attention.\
This corpus does not contain many very long documents, so you should fine-tune this model on your own long-document dataset for the downstream task to get better results.\
The final BPC is 1.926 (in my experiment, the original PhoBERT-base model with a max input length of 256 has a BPC of 2.067).

## Usage

Fill mask example:
```python
from transformers import RobertaForMaskedLM, AutoTokenizer
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention


class RobertaLongSelfAttention(LongformerSelfAttention):
    """Wraps LongformerSelfAttention so it can replace the self-attention of a RoBERTa layer."""

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        # The extended attention mask from the RoBERTa encoder has shape
        # (batch, 1, 1, seq_len); Longformer self-attention expects (batch, seq_len).
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        is_index_masked = attention_mask < 0
        is_index_global_attn = attention_mask > 0
        is_global_attn = any(is_index_global_attn.flatten())
        return super().forward(
            hidden_states,
            is_index_masked=is_index_masked,
            is_index_global_attn=is_index_global_attn,
            is_global_attn=is_global_attn,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    """RobertaForMaskedLM with every self-attention layer swapped for sliding-window attention."""

    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)


tokenizer = AutoTokenizer.from_pretrained("bluenguyen/longformer-phobert-base-4096")
model = RobertaLongForMaskedLM.from_pretrained("bluenguyen/longformer-phobert-base-4096")

TXT = (
    "Hoàng_Sa và Trường_Sa là <mask> Việt_Nam ."
    + "Đó là điều không_thể chối_cãi ." * 300
    + "Bằng_chứng lịch_sử , pháp_lý về chủ_quyền của Việt_Nam với 2 quần_đảo này "
    + "đã và đang được nhiều quốc_gia và cộng_đồng quốc_tế <mask> ."
)

# Pad to a multiple of the attention window size (256), as required by the sliding-window attention.
input_ids = tokenizer([TXT], padding=True, pad_to_multiple_of=256, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
masked_index = [i.item() for i in (input_ids[0] == tokenizer.mask_token_id).nonzero()]
for index in masked_index:
    probs = logits[0, index].softmax(dim=0)
    values, predictions = probs.topk(3)
    print(tokenizer.batch_decode([[p] for p in predictions]))

> ['của', 'lãnh_thổ', 'chủ_quyền']
> ['công_nhận', 'thừa_nhận', 'ghi_nhận']
```

Because this model is based on [vinai/phobert-base](https://huggingface.co/vinai/phobert-base), you should use [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) or the [Python Vietnamese Toolkit](https://github.com/trungtv/pyvi) (pyvi) to word-segment raw input texts before tokenization (see the segmentation sketch at the end of this card).\
More details about Longformer can be found in the [author's repo](https://github.com/allenai/longformer).

## Contact information

For personal questions related to this implementation, please contact me via reddotbluename@gmail.com.
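
Word segmentation sketch (referenced in the usage note above): a minimal, illustrative example assuming [pyvi](https://github.com/trungtv/pyvi) is installed (`pip install pyvi`); the sample sentence and variable names here are only for demonstration.

```python
from pyvi import ViTokenizer
from transformers import AutoTokenizer

# Raw, unsegmented Vietnamese text (illustrative example).
raw_text = "Hoàng Sa và Trường Sa là của Việt Nam."

# PhoBERT-style models expect word-segmented input such as "Hoàng_Sa và Trường_Sa ...".
segmented_text = ViTokenizer.tokenize(raw_text)
print(segmented_text)

tokenizer = AutoTokenizer.from_pretrained("bluenguyen/longformer-phobert-base-4096")
# Pad to a multiple of the attention window size, as in the fill-mask example above.
inputs = tokenizer([segmented_text], padding=True, pad_to_multiple_of=256, return_tensors="pt")
print(inputs["input_ids"].shape)
```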