---
language:
  - zh
license: apache-2.0
tags:
  - bart
widget:
  - text: 桂林是著名的[MASK],它有很多[MASK]。
---

Randeng-BART-759M-BertTokenizer model (Chinese), one model of Fengshenbang-LM

The 759M-parameter Randeng-BART large model was pre-trained on 180 GB of Chinese data for 7 days on 8 A100 (40 GB) GPUs. It is an encoder-decoder Transformer.
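
As a quick sanity check, the parameter count can be read off the checkpoint after loading it (a minimal sketch; this downloads the full model and needs several GB of memory):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('IDEA-CCNL/Randeng-BART-759M-BertTokenizer')
num_params = sum(p.numel() for p in model.parameters())
print(f'{num_params / 1e6:.0f}M parameters')  # expected to be roughly 759M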

We use the BERT vocabulary for our tokenizer.
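
For example, the tokenizer can be loaded with AutoTokenizer and behaves like a BERT WordPiece tokenizer (a minimal sketch; use_fast=False matches the usage example below):

from transformers import AutoTokenizer

# The checkpoint ships a BERT-style tokenizer; load the slow (Python) version
tokenizer = AutoTokenizer.from_pretrained('IDEA-CCNL/Randeng-BART-759M-BertTokenizer', use_fast=False)
print(tokenizer.tokenize('桂林是著名的[MASK]'))  # BERT-style WordPiece tokens
print(tokenizer.mask_token)                       # '[MASK]', from the BERT vocabulary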

Task Description

Randeng-BART-759M-BertTokenizer is pre-trained with the text-infilling task from the BART paper (a simplified sketch of the objective is shown below).

You can find our pre-training code in Fengshenbang-LM.
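
A simplified sketch of the text-infilling corruption (span lengths drawn from Poisson(λ=3) follow the BART paper; the masking probability here is illustrative, and the exact hyperparameters used for Randeng-BART are an assumption):

import numpy as np

def text_infilling(tokens, mask_token='[MASK]', mask_prob=0.15, poisson_lam=3.0):
    # BART-style text infilling: sample span lengths from Poisson(lambda)
    # and replace each sampled span with a single mask token; the decoder
    # is trained to reconstruct the original sequence. (Zero-length spans,
    # which insert a mask without removing tokens, are omitted here.)
    out, i = [], 0
    while i < len(tokens):
        if np.random.rand() < mask_prob:
            span = max(1, int(np.random.poisson(poisson_lam)))
            out.append(mask_token)  # the whole span becomes one [MASK]
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list('桂林是著名的旅游城市,它有很多风景名胜。')
print(''.join(text_infilling(tokens)))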

Usage

from transformers import BartForConditionalGeneration, AutoTokenizer, Text2TextGenerationPipeline

# Load the slow (Python) BERT-vocab tokenizer and the BART model
tokenizer = AutoTokenizer.from_pretrained('IDEA-CCNL/Randeng-BART-759M-BertTokenizer', use_fast=False)
model = BartForConditionalGeneration.from_pretrained('IDEA-CCNL/Randeng-BART-759M-BertTokenizer')

# Fill the [MASK] spans with greedy decoding
text = '桂林是著名的[MASK],它有很多[MASK]。'
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
print(text2text_generator(text, max_length=50, do_sample=False))
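
The pipeline call above is equivalent to tokenizing and calling generate directly (a minimal sketch reusing the model and tokenizer loaded above; the generation arguments are illustrative):

# Encode the masked text and decode the model's greedy completion
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(**inputs, max_length=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))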

Citation

If you find this resource useful, please cite the following repository in your paper.

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}