---
language: vi
widget:
- text: "Hoàng_Sa và Trường_Sa là <mask> Việt_Nam ."
tags:
- roberta
- longformer
- long context
pipeline_tag: fill-mask
---

# Longformer PhoBERT base model with max input length of 4096

**Experiment performed with Transformers version 4.25.1**\
A Longformer RoBERTa model for long contexts, based on [vinai/phobert-base](https://huggingface.co/vinai/phobert-base) and [Longformer](https://arxiv.org/abs/2004.05150).\
The PhoBERT model is converted to a Longformer version using the [author's repo](https://github.com/allenai/longformer), then MLM pretraining is continued for 5,000 steps with a batch size of 64 on the [Binhvq News Corpus](https://github.com/binhvq/news-corpus) so that the model learns to work with the new sliding-window attention.\
This corpus does not contain many very long documents, so you should fine-tune this model on your own long-document dataset for the downstream task to get better results.\
The final BPC is 1.926 (in my experiment, the original PhoBERT-base model with a max input length of 256 has a BPC of 2.067).

## Usage

Fill mask example:
```python
from transformers import RobertaForMaskedLM, AutoTokenizer
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention


class RobertaLongSelfAttention(LongformerSelfAttention):
    """Wraps LongformerSelfAttention so it can replace the self-attention of a RoBERTa layer."""

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        # The extended attention mask from the RoBERTa encoder has shape
        # (batch, 1, 1, seq_len); Longformer self-attention expects (batch, seq_len).
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        is_index_masked = attention_mask < 0
        is_index_global_attn = attention_mask > 0
        is_global_attn = any(is_index_global_attn.flatten())
        return super().forward(
            hidden_states,
            is_index_masked=is_index_masked,
            is_index_global_attn=is_index_global_attn,
            is_global_attn=is_global_attn,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    """RobertaForMaskedLM with every self-attention layer swapped for sliding-window attention."""

    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)


tokenizer = AutoTokenizer.from_pretrained("bluenguyen/longformer-phobert-base-4096")
model = RobertaLongForMaskedLM.from_pretrained("bluenguyen/longformer-phobert-base-4096")

TXT = (
    "Hoàng_Sa và Trường_Sa là <mask> Việt_Nam ."
    + "Đó là điều không_thể chối_cãi ." * 300
    + "Bằng_chứng lịch_sử , pháp_lý về chủ_quyền của Việt_Nam với 2 quần_đảo này "
    + "đã và đang được nhiều quốc_gia và cộng_đồng quốc_tế <mask> ."
)

# Pad to a multiple of the attention window size (256), as required by the sliding-window attention.
input_ids = tokenizer([TXT], padding=True, pad_to_multiple_of=256, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
masked_index = [i.item() for i in (input_ids[0] == tokenizer.mask_token_id).nonzero()]
for index in masked_index:
    probs = logits[0, index].softmax(dim=0)
    values, predictions = probs.topk(3)
    print(tokenizer.batch_decode([[p] for p in predictions]))

> ['của', 'lãnh_thổ', 'chủ_quyền']
> ['công_nhận', 'thừa_nhận', 'ghi_nhận']
```

Because this model is based on [vinai/phobert-base](https://huggingface.co/vinai/phobert-base), you should use [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) or the [Python Vietnamese Toolkit](https://github.com/trungtv/pyvi) (pyvi) to word-segment raw input texts before tokenization (see the segmentation sketch at the end of this card).\
More details about Longformer can be found in the [author's repo](https://github.com/allenai/longformer).

## Contact information

For personal questions related to this implementation, please contact me via reddotbluename@gmail.com.
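
Word segmentation sketch (referenced in the usage note above): a minimal, illustrative example assuming [pyvi](https://github.com/trungtv/pyvi) is installed (`pip install pyvi`); the sample sentence and variable names here are only for demonstration.

```python
from pyvi import ViTokenizer
from transformers import AutoTokenizer

# Raw, unsegmented Vietnamese text (illustrative example).
raw_text = "Hoàng Sa và Trường Sa là của Việt Nam."

# PhoBERT-style models expect word-segmented input such as "Hoàng_Sa và Trường_Sa ...".
segmented_text = ViTokenizer.tokenize(raw_text)
print(segmented_text)

tokenizer = AutoTokenizer.from_pretrained("bluenguyen/longformer-phobert-base-4096")
# Pad to a multiple of the attention window size, as in the fill-mask example above.
inputs = tokenizer([segmented_text], padding=True, pad_to_multiple_of=256, return_tensors="pt")
print(inputs["input_ids"].shape)
```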