diff --git a/fengshen/README.md b/fengshen/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..45f7b3579c36a68f899a9a02cfcfbe1330d413d8
--- /dev/null
+++ b/fengshen/README.md
@@ -0,0 +1,105 @@
+## Latest Releases
+
+* \[2022.09.13\] [Added pretraining code for the Erlangshen DeBERTa series](https://huggingface.co/IDEA-CCNL/Erlangshen-DeBERTa-v2-97M-Chinese)
+* \[2022.09.13\] [Added pretraining code for the Randeng BART series](https://huggingface.co/IDEA-CCNL/Randeng-BART-139M)
+* \[2022.09.13\] [Added pretraining code for the Erlangshen BERT series](https://huggingface.co/IDEA-CCNL/Erlangshen-MegatronBert-1.3B)
+* \[2022.05.11\] [Added the Taiyi ViT multimodal models and downstream-task examples](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/太乙系列/Taiyi-vit-87M-D.html)
+* \[2022.05.11\] [Added the Bigan Transformer-XL denoising model and downstream-task examples](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/比干系列/Bigan-Transformer-XL-denoise-1.1B.html)
+* \[2022.05.11\] [Added downstream-task examples for the Erlangshen series](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/二郎神系列/Erlangshen-Roberta-110M-NLI.html)
+
+# Navigation
+
+- [Navigation](#navigation)
+  - [Framework Overview](#framework-overview)
+  - [Requirements](#requirements)
+  - [Project Structure](#project-structure)
+  - [Design](#design)
+  - [Classification Tasks](#classification-tasks)
+
+## Framework Overview
+
+The FengShen training framework is a key part of the Fengshenbang open-source plan for large models and plays a central role in both producing and applying them. FengShen supports pretraining on massive corpora as well as fine-tuning on a wide range of downstream tasks. Fengshenbang focuses on open-sourcing large NLP models, but growing model sizes create difficulties not only in training but also in everyday use. To address both, FengShen draws on the best existing open-source solutions and redesigns the pipeline: users pick the pretrained models they need from Fengshenbang and use FengShen to fine-tune them quickly on downstream tasks.
+
+All examples and documentation are available in our [Wiki](https://fengshenbang-doc.readthedocs.io/zh/latest/index.html).
+All models can be found on our [Huggingface page](https://huggingface.co/IDEA-CCNL).
+
+With this framework you get:
+
+1. Better performance than native torch, with up to **300%** faster training
+2. Support for larger models: training and fine-tuning of models with up to **tens of billions** of parameters
+3. Support for **TB-scale** datasets, so even a home machine can benefit from pretrained models
+4. Rich pretraining and downstream-task examples that launch training with one command
+5. Support for different hardware environments: CPU, GPU, TPU, and more
+6. Built-in mainstream distributed training logic: DDP, ZeRO optimizer, and other distributed optimizations without code changes
+
+![fengshen](../pics/fengshen_pic.png)
+
+## Requirements
+
+* Python >= 3.8
+* torch >= 1.8
+* transformers >= 3.2.0
+* pytorch-lightning >= 1.5.10
+
+From the root directory of Fengshenbang-LM:
+`pip install --editable ./`
+
+## Project Structure
+
+```
+├── data                      # Data processing utilities and datasets
+│   ├── cbart_dataloader
+│   ├── fs_datasets           # Wrappers around transformers datasets, adding Chinese datasets (open-sourcing in progress)
+│   ├── universal_datamodule  # Bridges fs_datasets and the Lightning DataModule to avoid repeated glue code
+│   ├── megatron_dataloader   # TB-scale dataset processing and training based on Megatron
+│   ├── mmap_dataloader       # Generic memory-mapped data loading
+│   └── task_dataloader       # Loaders for various downstream tasks
+├── examples                  # Rich examples, from pretraining to downstream tasks
+├── metric                    # Metric implementations, with support for custom metrics
+├── losses                    # Custom losses for specialized needs
+├── tokenizer                 # Custom tokenizers, e.g. our SentencePiece training code
+├── models                    # Model zoo
+│   ├── auto                  # Automatic resolution of the matching model class
+│   ├── bart
+│   ├── longformer
+│   ├── megatron_t5
+│   └── roformer
+└── utils                     # Utility functions
+```
+
+## Design
+
+FengShen is currently built on top of Pytorch-Lightning & Transformers. On this foundation we keep open-sourcing Chinese pretrained models and provide rich examples, so every Fengshenbang model comes with matching pretraining and downstream-task code.
+
+Development on FengShen generally follows three steps:
+
+1. Wrap the data pipeline -> pytorch_lightning.LightningDataModule
+2. Wrap the model -> pytorch_lightning.LightningModule
+3. Configure plugins such as log_monitor, checkpoint_callback, and so on.
+
+For a complete demo, see the Randeng-BART example -> [docs](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/燃灯系列/BART-139M.html) [code](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/hf-ds/fengshen/examples/pretrain_bart)
+
+## Classification Tasks
+
+ The examples/classification directory provides a rich set of classification examples, including three one-click scripts:
+
+* demo_classification_afqmc_roberta.sh: fine-tune roberta with DDP
+* demo_classification_afqmc_roberta_deepspeed.sh: fine-tune roberta with deepspeed for faster training
+* demo_classification_afqmc_erlangshen_offload.sh: fine-tune our best-performing Erlangshen models with only 7 GB of GPU memory
+
+ All of the above use the AFQMC dataset; an introduction to it is available [here](https://www.cluebenchmarks.com/introduce.html).
+ Our preprocessed data files are hosted on Huggingface; click [here](https://huggingface.co/datasets/IDEA-CCNL/AFQMC) to go straight to the source files.
+ Converting your own data to the same format is all it takes to adapt other downstream classification tasks.
+ In the example scripts, only the following arguments need to change to use local files:
+
+ ```
+ --dataset_name IDEA-CCNL/AFQMC \
+
+ -------> change to
+
+ --data_dir $DATA_DIR \    # data directory
+ --train_data train.json \ # data files
+ --valid_data dev.json \
+ --test_data test.json \
+
+ ```
diff --git a/fengshen/__init__.py b/fengshen/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5829a3ac9e634d44d408d2ff6d22880e1c00805
--- /dev/null
+++ b/fengshen/__init__.py
@@ -0,0 +1,19 @@
+# coding=utf-8
+# Copyright 2021 The IDEA Authors. All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .models.longformer import LongformerConfig, LongformerModel
+from .models.roformer import RoFormerConfig, RoFormerModel
+from .models.megatron_t5 import T5Config, T5EncoderModel
+from .models.ubert import UbertPiplines, UbertModel
diff --git a/fengshen/cli/fengshen_pipeline.py b/fengshen/cli/fengshen_pipeline.py
new file mode 100644
index 0000000000000000000000000000000000000000..07c31349ef96fd86d0c14b807601c645b095372f
--- /dev/null
+++ b/fengshen/cli/fengshen_pipeline.py
@@ -0,0 +1,34 @@
+import sys
+from importlib import import_module
+from datasets import load_dataset
+import argparse
+
+
+def main():
+    if len(sys.argv) < 3:
+        raise Exception(
+            'args len < 3, example: fengshen_pipeline text_classification predict xxxxx')
+    pipeline_name = sys.argv[1]
+    method = sys.argv[2]
+    pipeline_class = getattr(import_module('fengshen.pipelines.'
+ pipeline_name), 'Pipeline') + + total_parser = argparse.ArgumentParser("FengShen Pipeline") + total_parser.add_argument('--model', default='', type=str) + total_parser.add_argument('--datasets', default='', type=str) + total_parser.add_argument('--text', default='', type=str) + total_parser = pipeline_class.add_pipeline_specific_args(total_parser) + args = total_parser.parse_args(args=sys.argv[3:]) + pipeline = pipeline_class(args=args, model=args.model) + + if method == 'predict': + print(pipeline(args.text)) + elif method == 'train': + datasets = load_dataset(args.datasets) + pipeline.train(datasets) + else: + raise Exception( + 'cmd not support, now only support {predict, train}') + + +if __name__ == '__main__': + main() diff --git a/fengshen/data/__init__.py b/fengshen/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9bad5790a5799b96f2e164d825c0b1f8ec0c2dfb --- /dev/null +++ b/fengshen/data/__init__.py @@ -0,0 +1 @@ +# coding=utf-8 diff --git a/fengshen/data/bert_dataloader/auto_split.sh b/fengshen/data/bert_dataloader/auto_split.sh new file mode 100644 index 0000000000000000000000000000000000000000..0a0f66d01df8f1728e44d9deb1d37e0396c5143a --- /dev/null +++ b/fengshen/data/bert_dataloader/auto_split.sh @@ -0,0 +1,10 @@ +files=`find $1 -type f -size +1024M` + +for p in $files +do +echo "processing $p" +name=`basename $p .json` +file=`dirname $p` +split -a 2 -C 300M $p $file/$name- && ls|grep -E "(-[a-zA-Z]{2})" |xargs -n1 -i{} mv {} {}.json +rm -f $p +done \ No newline at end of file diff --git a/fengshen/data/bert_dataloader/load.py b/fengshen/data/bert_dataloader/load.py new file mode 100644 index 0000000000000000000000000000000000000000..b36ce8ae72b74e9fd006f087ee0810a306badd7e --- /dev/null +++ b/fengshen/data/bert_dataloader/load.py @@ -0,0 +1,200 @@ +import os +import re +from pathlib import Path +import glob +from tqdm import tqdm +from contextlib import ExitStack +import datasets +import multiprocessing +from typing import cast, TextIO +from itertools import chain +import json +from concurrent.futures import ProcessPoolExecutor +from random import shuffle +from pytorch_lightning import LightningDataModule +from typing import Optional + +from torch.utils.data import DataLoader + + +# _SPLIT_DATA_PATH = '/data1/datas/wudao_180g_split/test' +_SPLIT_DATA_PATH = '/data1/datas/wudao_180g_split' +_CACHE_SPLIT_DATA_PATH = '/data1/datas/wudao_180g_FSData' + +# feats = datasets.Features({"text": datasets.Value('string')}) + + +class BertDataGenerate(object): + + def __init__(self, + data_files=_SPLIT_DATA_PATH, + save_path=_CACHE_SPLIT_DATA_PATH, + train_test_validation='950,49,1', + num_proc=1, + cache=True): + self.data_files = Path(data_files) + if save_path: + self.save_path = Path(save_path) + else: + self.save_path = self.file_check( + Path(self.data_files.parent, self.data_files.name+'_FSDataset'), + 'save') + self.num_proc = num_proc + self.cache = cache + self.split_idx = self.split_train_test_validation_index(train_test_validation) + if cache: + self.cache_path = self.file_check( + Path(self.save_path.parent, 'FSDataCache', self.data_files.name), 'cache') + else: + self.cache_path = None + + @staticmethod + def file_check(path, path_type): + print(path) + if not path.exists(): + path.mkdir(parents=True) + print(f"Since no {path_type} directory is specified, the program will automatically create it in {path} directory.") + return str(path) + + @staticmethod + def split_train_test_validation_index(train_test_validation): + split_idx_ = 
[int(i) for i in train_test_validation.split(',')] + idx_dict = { + 'train_rate': split_idx_[0]/sum(split_idx_), + 'test_rate': split_idx_[1]/sum(split_idx_[1:]) + } + return idx_dict + + def process(self, index, path): + print('saving dataset shard {}'.format(index)) + + ds = (datasets.load_dataset('json', data_files=str(path), + cache_dir=self.cache_path, + features=None)) + # ds = ds.map(self.cut_sent,input_columns='text') + # print(d) + # print('!!!',ds) + ds = ds['train'].train_test_split(train_size=self.split_idx['train_rate']) + ds_ = ds['test'].train_test_split(train_size=self.split_idx['test_rate']) + ds = datasets.DatasetDict({ + 'train': ds['train'], + 'test': ds_['train'], + 'validation': ds_['test'] + }) + # print('!!!!',ds) + ds.save_to_disk(Path(self.save_path, path.name)) + return 'saving dataset shard {} done'.format(index) + + def generate_cache_arrow(self) -> None: + ''' + 生成HF支持的缓存文件,加速后续的加载 + ''' + data_dict_paths = self.data_files.rglob('*') + p = ProcessPoolExecutor(max_workers=self.num_proc) + res = list() + + for index, path in enumerate(data_dict_paths): + res.append(p.submit(self.process, index, path)) + + p.shutdown(wait=True) + for future in res: + print(future.result(), flush=True) + + +def load_dataset(num_proc=4, **kargs): + cache_dict_paths = Path(_CACHE_SPLIT_DATA_PATH).glob('*') + ds = [] + res = [] + p = ProcessPoolExecutor(max_workers=num_proc) + for path in cache_dict_paths: + res.append(p.submit(datasets.load_from_disk, + str(path), **kargs)) + + p.shutdown(wait=True) + for future in res: + ds.append(future.result()) + # print(future.result()) + train = [] + test = [] + validation = [] + for ds_ in ds: + train.append(ds_['train']) + test.append(ds_['test']) + validation.append(ds_['validation']) + # ds = datasets.concatenate_datasets(ds) + # print(ds) + return datasets.DatasetDict({ + 'train': datasets.concatenate_datasets(train), + 'test': datasets.concatenate_datasets(test), + 'validation': datasets.concatenate_datasets(validation) + }) + + +class BertDataModule(LightningDataModule): + @ staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('Universal DataModule') + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_batchsize', default=32, type=int) + parser.add_argument('--val_batchsize', default=32, type=int) + parser.add_argument('--test_batchsize', default=32, type=int) + parser.add_argument('--datasets_name', type=str) + # parser.add_argument('--datasets_name', type=str) + parser.add_argument('--train_datasets_field', type=str, default='train') + parser.add_argument('--val_datasets_field', type=str, default='validation') + parser.add_argument('--test_datasets_field', type=str, default='test') + return parent_args + + def __init__( + self, + tokenizer, + collate_fn, + args, + **kwargs, + ): + super().__init__() + self.datasets = load_dataset(num_proc=args.num_workers) + self.tokenizer = tokenizer + self.collate_fn = collate_fn + self.save_hyperparameters(args) + + def setup(self, stage: Optional[str] = None) -> None: + self.train = DataLoader( + self.datasets[self.hparams.train_datasets_field], + batch_size=self.hparams.train_batchsize, + shuffle=True, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) + self.val = DataLoader( + self.datasets[self.hparams.val_datasets_field], + batch_size=self.hparams.val_batchsize, + shuffle=False, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) + self.test = DataLoader( + 
self.datasets[self.hparams.test_datasets_field], + batch_size=self.hparams.test_batchsize, + shuffle=False, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) + return + + def train_dataloader(self): + return self.train + + def val_dataloader(self): + return self.val + + def test_dataloader(self): + return self.test + + +if __name__ == '__main__': + # pre = PreProcessing(_SPLIT_DATA_PATH) + # pre.processing() + + dataset = BertDataGenerate(_SPLIT_DATA_PATH, num_proc=16) + dataset.generate_cache_arrow() diff --git a/fengshen/data/bert_dataloader/preprocessing.py b/fengshen/data/bert_dataloader/preprocessing.py new file mode 100644 index 0000000000000000000000000000000000000000..c40e39a8122a5cc4ebd57b558f451c371f6066a3 --- /dev/null +++ b/fengshen/data/bert_dataloader/preprocessing.py @@ -0,0 +1,110 @@ +import re +import json +import multiprocessing +from tqdm import tqdm +from pathlib import Path +from itertools import chain + +_SPLIT_DATA_PATH = '/data1/datas/wudao_180g' + + +def cut_sent(path): + """ + 中文分句,默认?、。、!、省略号分句,考虑双引号包裹的句子 + 采用分割替换的方式 + """ + path = Path(path) + # print(path) + save_path = str(Path('/data1/datas/wudao_180g_split', path.name)) + print('处理文件:', save_path) + with open(save_path, 'wt', encoding='utf-8') as w: + with open(path, 'rt', encoding='utf-8') as f: + for para in tqdm(f): + para = json.loads(para) + para_ = para['text'] + ' ' + # print('sentence piece......') + # pep8中 正则不能些 \? 要写成\\? + para_ = re.sub('([?。!\\?\\!…]+)([^”’]|[”’])', + r'\1#####\2', para_) + para_ = re.sub('([\\.]{3,})([^”’])', r'\1#####\2', para_) + + # 匹配 \1: 句子结束符紧挨’” \2: 非句子结束符号,被引号包裹的句子 + para_ = re.sub( + '([。!?\\?\\!…][”’])([^,。!?\\?\\!]|\\s)', r'\1#####\2', para_) + para_ = re.sub( + '([\\.]{3,}[”’])([^,。!?\\?\\!]|\\s)', r'\1#####\2', para_) + para_ = re.sub( + '([#]{5})([”’])([^,。!?\\?\\!])', r'\2#####\3', para_) + para_ = para_.strip() + # 一个512里面多个样本 + line_ = '' + for line in para_.split('#####'): + line = line.strip() + if len(line_) < 512 and len(line) > 0: + line_ += line + else: + w.writelines(json.dumps( + {'text': line_}, ensure_ascii=False)+'\n') + line_ = line + w.writelines(json.dumps( + {'text': line_}, ensure_ascii=False)+'\n') + + +def chain_iter(*filenames): + """ + 将多个文件读成一个迭代器 + """ + reader = [open(file, 'r') for file in filenames] + return chain(*reader) + + +class Config(object): + + def __init__(self, data_path=_SPLIT_DATA_PATH, num_worker=16, split_numb=600000, cut_sentence=True, output_file=None) -> None: + self.data_path = Path(data_path) + self.num_worker = num_worker + self.split_numb = split_numb + self.cut_sentence = cut_sentence + + +def processing1(): + args = Config() + p_ = [str(i) for i in args.data_path.glob('*')] + fin = chain_iter(*p_) + pool = multiprocessing.Pool(args.num_worker) + docs = pool.imap(cut_sent, fin, chunksize=args.num_worker) + + if not Path(args.data_path.parent, args.data_path.name+'_split').exists(): + Path(args.data_path.parent, args.data_path.name+'_split').mkdir() + writer = open(str(Path(args.data_path.parent, args.data_path.name + + '_split', 'sentence_level.json')), 'wt', encoding='utf-8') + for doc in tqdm(docs): + for sentence in doc: + writer.writelines(json.dumps( + {"text": sentence}, ensure_ascii=False)+'\n') + pool.close() + pool.join() + writer.close() + + +if __name__ == '__main__': + from time import process_time, perf_counter + from random import shuffle + st = process_time() + args = Config(num_worker=16) + + if not Path(args.data_path.parent, args.data_path.name+'_split').exists(): + 
Path(args.data_path.parent, args.data_path.name + + '_split').mkdir(parents=True) + + p_ = [str(i) for i in args.data_path.glob('*')] + # 简单shuffle + shuffle(p_) + + pool = multiprocessing.Pool(args.num_worker) + for item in p_: + pool.apply_async(func=cut_sent, args=(item,)) + pool.close() + pool.join() + cost_time = process_time() - st + print('DONE!! cost time : %.5f' % cost_time) diff --git a/fengshen/data/clip_dataloader/flickr.py b/fengshen/data/clip_dataloader/flickr.py new file mode 100644 index 0000000000000000000000000000000000000000..22155e039f74b49c8a4222a75144a2c134a6d507 --- /dev/null +++ b/fengshen/data/clip_dataloader/flickr.py @@ -0,0 +1,105 @@ +from torch.utils.data import Dataset, DataLoader +from torchvision.transforms import Normalize, Compose, RandomResizedCrop, InterpolationMode, ToTensor, Resize, \ + CenterCrop +from transformers import BertTokenizer +import pytorch_lightning as pl +from PIL import Image +import os + + +class flickr30k_CNA(Dataset): + def __init__(self, img_root_path, + annot_path, + transform=None): + self.images = [] + self.captions = [] + self.labels = [] + self.root = img_root_path + with open(annot_path, 'r') as f: + for line in f: + line = line.strip().split('\t') + key, caption = line[0].split('#')[0], line[1] + img_path = key + '.jpg' + self.images.append(img_path) + self.captions.append(caption) + self.labels.append(key) + self.transforms = transform + self.tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext") + + # NOTE large 模型 + self.context_length = 77 + + def __len__(self): + return len(self.images) + + def __getitem__(self, idx): + img_path = str(self.images[idx]) + image = self.transforms(Image.open(os.path.join(self.root, img_path))) + text = self.tokenizer(str(self.captions[idx]), max_length=self.context_length, + padding='max_length', truncation=True, return_tensors='pt')['input_ids'][0] + label = self.labels[idx] + return image, text, label + + +def _convert_to_rgb(image): + return image.convert('RGB') + + +def image_transform( + image_size: int, + is_train: bool, + mean=(0.48145466, 0.4578275, 0.40821073), + std=(0.26862954, 0.26130258, 0.27577711) +): + normalize = Normalize(mean=mean, std=std) + if is_train: + return Compose([ + RandomResizedCrop(image_size, scale=(0.9, 1.0), interpolation=InterpolationMode.BICUBIC), + _convert_to_rgb, + ToTensor(), + normalize, + ]) + else: + return Compose([ + Resize(image_size, interpolation=InterpolationMode.BICUBIC), + CenterCrop(image_size), + _convert_to_rgb, + ToTensor(), + normalize, + ]) + + +class FlickrDataModule(pl.LightningDataModule): + def __init__(self, args): + self.batch_size = args.batch_size + self.train_filename = args.train_filename # NOTE 标注的文件夹 + self.train_root = args.train_root # NOTE 图片地址 + self.val_filename = args.val_filename + self.val_root = args.val_root + self.test_filename = args.test_filename + self.test_root = args.test_root + + self.pretrain_model = args.pretrain_model + self.image_size = 224 + self.prepare_data_per_node = True + self._log_hyperparams = False + self.num_workers = args.num_workers + + def setup(self, stage=None): + # dataset + train_transform = image_transform(224, True) + val_transform = image_transform(224, False) + test_transform = image_transform(224, False) + + self.train_dataset = flickr30k_CNA(self.train_root, self.train_filename, transform=train_transform) + self.val_dataset = flickr30k_CNA(self.val_root, self.val_filename, transform=val_transform) + self.test_dataset = flickr30k_CNA(self.test_root, 
self.test_filename, transform=test_transform) + + def train_dataloader(self): + return DataLoader(self.train_dataset, batch_size=self.batch_size, num_workers=self.num_workers) + + def val_dataloader(self): + return DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=self.num_workers) + + def test_dataloader(self): + return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=self.num_workers) diff --git a/fengshen/data/data_utils/common_utils.py b/fengshen/data/data_utils/common_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3eef10ecb8c73257ab4338a0ea2e7839b82bcc7e --- /dev/null +++ b/fengshen/data/data_utils/common_utils.py @@ -0,0 +1,4 @@ +def padding_to_maxlength(ids, max_length, pad_id): + cur_len = len(ids) + len_diff = max_length - len(ids) + return ids + [pad_id] * len_diff, [1] * cur_len + [0] * len_diff diff --git a/fengshen/data/data_utils/mask_utils.py b/fengshen/data/data_utils/mask_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0009f00272bf6feff1dbd491153332584cb431e1 --- /dev/null +++ b/fengshen/data/data_utils/mask_utils.py @@ -0,0 +1,285 @@ +import collections + +import numpy as np + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", + ["index", "label"]) + + +def is_start_piece(piece): + """Check if the current word piece is the starting piece (BERT).""" + # When a word has been split into + # WordPieces, the first token does not have any marker and any subsequence + # tokens are prefixed with ##. So whenever we see the ## token, we + # append it to the previous set of word indexes. + return not piece.startswith("##") + + +def create_masked_lm_predictions(tokens, + vocab_id_list, vocab_id_to_token_dict, + masked_lm_prob, + cls_id, sep_id, mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=3, + do_whole_word_mask=True, + favor_longer_ngram=False, + do_permutation=False, + geometric_dist=False, + masking_style="bert", + zh_tokenizer=None): + """Creates the predictions for the masked LM objective. + Note: Tokens here are vocab ids and not text tokens.""" + ''' + modified from Megatron-LM + Args: + tokens: 输入 + vocab_id_list: 词表token_id_list + vocab_id_to_token_dict: token_id到token字典 + masked_lm_prob:mask概率 + cls_id、sep_id、mask_id:特殊token + max_predictions_per_seq:最大mask个数 + np_rng:mask随机数 + max_ngrams:最大词长度 + do_whole_word_mask:是否做全词掩码 + favor_longer_ngram:优先用长的词 + do_permutation:是否打乱 + geometric_dist:用np_rng.geometric做随机 + masking_style:mask类型 + zh_tokenizer:WWM的分词器,比如用jieba.lcut做分词之类的 + ''' + cand_indexes = [] + # Note(mingdachen): We create a list for recording if the piece is + # the starting piece of current token, where 1 means true, so that + # on-the-fly whole word masking is possible. + token_boundary = [0] * len(tokens) + # 如果没有指定中文分词器,那就直接按##算 + if zh_tokenizer is None: + for (i, token) in enumerate(tokens): + if token == cls_id or token == sep_id: + token_boundary[i] = 1 + continue + # Whole Word Masking means that if we mask all of the wordpieces + # corresponding to an original word. + # + # Note that Whole Word Masking does *not* change the training code + # at all -- we still predict each WordPiece independently, softmaxed + # over the entire vocabulary. 
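+            # e.g. a word tokenized as ["play", "##ing"] ends up in a single
+            # cand_indexes entry, so both pieces are masked together later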
+ if (do_whole_word_mask and len(cand_indexes) >= 1 and + not is_start_piece(vocab_id_to_token_dict[token])): + cand_indexes[-1].append(i) + else: + cand_indexes.append([i]) + if is_start_piece(vocab_id_to_token_dict[token]): + token_boundary[i] = 1 + else: + # 如果指定了中文分词器,那就先用分词器分词,然后再进行判断 + # 获取去掉CLS SEP的原始文本 + raw_tokens = [] + for t in tokens: + if t != cls_id and t != sep_id: + raw_tokens.append(t) + raw_tokens = [vocab_id_to_token_dict[i] for i in raw_tokens] + # 分词然后获取每次字开头的最长词的长度 + word_list = set(zh_tokenizer(''.join(raw_tokens), HMM=True)) + word_length_dict = {} + for w in word_list: + if len(w) < 1: + continue + if w[0] not in word_length_dict: + word_length_dict[w[0]] = len(w) + elif word_length_dict[w[0]] < len(w): + word_length_dict[w[0]] = len(w) + i = 0 + # 从词表里面检索 + while i < len(tokens): + token_id = tokens[i] + token = vocab_id_to_token_dict[token_id] + if len(token) == 0 or token_id == cls_id or token_id == sep_id: + token_boundary[i] = 1 + i += 1 + continue + word_max_length = 1 + if token[0] in word_length_dict: + word_max_length = word_length_dict[token[0]] + j = 0 + word = '' + word_end = i+1 + # 兼容以前##的形式,如果后面的词是##开头的,那么直接把后面的拼到前面当作一个词 + old_style = False + while word_end < len(tokens) and vocab_id_to_token_dict[tokens[word_end]].startswith('##'): + old_style = True + word_end += 1 + if not old_style: + while j < word_max_length and i+j < len(tokens): + cur_token = tokens[i+j] + word += vocab_id_to_token_dict[cur_token] + j += 1 + if word in word_list: + word_end = i+j + cand_indexes.append([p for p in range(i, word_end)]) + token_boundary[i] = 1 + i = word_end + + output_tokens = list(tokens) + + masked_lm_positions = [] + masked_lm_labels = [] + + if masked_lm_prob == 0: + return (output_tokens, masked_lm_positions, + masked_lm_labels, token_boundary) + + num_to_predict = min(max_predictions_per_seq, + max(1, int(round(len(tokens) * masked_lm_prob)))) + + ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64) + if not geometric_dist: + # Note(mingdachen): + # By default, we set the probilities to favor shorter ngram sequences. + pvals = 1. / np.arange(1, max_ngrams + 1) + pvals /= pvals.sum(keepdims=True) + if favor_longer_ngram: + pvals = pvals[::-1] + # 获取一个ngram的idx,对于每个word,记录他的ngram的word + ngram_indexes = [] + for idx in range(len(cand_indexes)): + ngram_index = [] + for n in ngrams: + ngram_index.append(cand_indexes[idx:idx + n]) + ngram_indexes.append(ngram_index) + + np_rng.shuffle(ngram_indexes) + + (masked_lms, masked_spans) = ([], []) + covered_indexes = set() + for cand_index_set in ngram_indexes: + if len(masked_lms) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes: + continue + + if not geometric_dist: + n = np_rng.choice(ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + else: + # Sampling "n" from the geometric distribution and clipping it to + # the max_ngrams. Using p=0.2 default from the SpanBERT paper + # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1) + n = min(np_rng.geometric(0.2), max_ngrams) + + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # Note(mingdachen): + # Repeatedly looking for a candidate that does not exceed the + # maximum number of predictions by trying shorter ngrams. 
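+        # cand_index_set[n - 1] covers n consecutive words, so decrementing n
+        # retries with progressively shorter spans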
+ while len(masked_lms) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. + if len(masked_lms) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + covered_indexes.add(index) + masked_token = None + token_id = tokens[index] + if masking_style == "bert": + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))] + elif masking_style == "t5": + masked_token = mask_id + else: + raise ValueError("invalid value of masking style") + + output_tokens[index] = masked_token + masked_lms.append(MaskedLmInstance(index=index, label=token_id)) + + masked_spans.append(MaskedLmInstance( + index=index_set, + label=[tokens[index] for index in index_set])) + + assert len(masked_lms) <= num_to_predict + np_rng.shuffle(ngram_indexes) + + select_indexes = set() + if do_permutation: + for cand_index_set in ngram_indexes: + if len(select_indexes) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes or index in select_indexes: + continue + + n = np.random.choice(ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + + while len(select_indexes) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. 
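+            # this is the permutation pass: indexes collected here are
+            # shuffled among themselves below rather than replaced by [MASK]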
+ if len(select_indexes) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes or index in select_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + select_indexes.add(index) + assert len(select_indexes) <= num_to_predict + + select_indexes = sorted(select_indexes) + permute_indexes = list(select_indexes) + np_rng.shuffle(permute_indexes) + orig_token = list(output_tokens) + + for src_i, tgt_i in zip(select_indexes, permute_indexes): + output_tokens[src_i] = orig_token[tgt_i] + masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + # Sort the spans by the index of the first span + masked_spans = sorted(masked_spans, key=lambda x: x.index[0]) + + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans) diff --git a/fengshen/data/data_utils/sentence_split.py b/fengshen/data/data_utils/sentence_split.py new file mode 100644 index 0000000000000000000000000000000000000000..1a25e4b51b13f86f4a8a6b39f497f85c050856b6 --- /dev/null +++ b/fengshen/data/data_utils/sentence_split.py @@ -0,0 +1,35 @@ +import re + + +class ChineseSentenceSplitter(object): + def merge_symmetry(self, sentences, symmetry=('“', '”')): + # '''合并对称符号,如双引号''' + effective_ = [] + merged = True + for index in range(len(sentences)): + if symmetry[0] in sentences[index] and symmetry[1] not in sentences[index]: + merged = False + effective_.append(sentences[index]) + elif symmetry[1] in sentences[index] and not merged: + merged = True + effective_[-1] += sentences[index] + elif symmetry[0] not in sentences[index] and symmetry[1] not in sentences[index] and not merged: + effective_[-1] += sentences[index] + else: + effective_.append(sentences[index]) + return [i.strip() for i in effective_ if len(i.strip()) > 0] + + def to_sentences(self, paragraph): + # """由段落切分成句子""" + sentences = re.split(r"(?|。|[!]+|!|\…\…)", paragraph) + sentences.append("") + sentences = ["".join(i) for i in zip(sentences[0::2], sentences[1::2])] + sentences = [i.strip() for i in sentences if len(i.strip()) > 0] + for j in range(1, len(sentences)): + if sentences[j][0] == '”': + sentences[j-1] = sentences[j-1] + '”' + sentences[j] = sentences[j][1:] + return self.merge_symmetry(sentences) + + def tokenize(self, text): + return self.to_sentences(text) diff --git a/fengshen/data/data_utils/sop_utils.py b/fengshen/data/data_utils/sop_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..505f14dca99638b10eee0a4017447401a71ef083 --- /dev/null +++ b/fengshen/data/data_utils/sop_utils.py @@ -0,0 +1,32 @@ + +# copy from megatron +def get_a_and_b_segments(sample, np_rng): + """Divide sample into a and b segments.""" + + # Number of sentences in the sample. + n_sentences = len(sample) + # Make sure we always have two sentences. + assert n_sentences > 1, 'make sure each sample has at least two sentences.' + + # First part: + # `a_end` is how many sentences go into the `A`. + a_end = 1 + if n_sentences >= 3: + # Note that randin in numpy is exclusive. 
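+        # e.g. with n_sentences == 4, a_end is drawn uniformly from {1, 2, 3}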
+ a_end = np_rng.randint(1, n_sentences) + tokens_a = [] + for j in range(a_end): + tokens_a.extend(sample[j]) + + # Second part: + tokens_b = [] + for j in range(a_end, n_sentences): + tokens_b.extend(sample[j]) + + # Random next: + is_next_random = False + if np_rng.random() < 0.5: + is_next_random = True + tokens_a, tokens_b = tokens_b, tokens_a + + return tokens_a, tokens_b, is_next_random diff --git a/fengshen/data/data_utils/token_type_utils.py b/fengshen/data/data_utils/token_type_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3b805d23b9aa4cda495d3b76ecba7effdc2854eb --- /dev/null +++ b/fengshen/data/data_utils/token_type_utils.py @@ -0,0 +1,25 @@ +def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id): + """Merge segments A and B, add [CLS] and [SEP] and build tokentypes.""" + + tokens = [] + tokentypes = [] + # [CLS]. + tokens.append(cls_id) + tokentypes.append(0) + # Segment A. + for token in tokens_a: + tokens.append(token) + tokentypes.append(0) + # [SEP]. + tokens.append(sep_id) + tokentypes.append(0) + # Segment B. + for token in tokens_b: + tokens.append(token) + tokentypes.append(1) + if tokens_b: + # [SEP]. + tokens.append(sep_id) + tokentypes.append(1) + + return tokens, tokentypes diff --git a/fengshen/data/data_utils/truncate_utils.py b/fengshen/data/data_utils/truncate_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ba4c6b653762c01a26da1bea9cb3d3cbeec08fd7 --- /dev/null +++ b/fengshen/data/data_utils/truncate_utils.py @@ -0,0 +1,19 @@ + +def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng): + """Truncates a pair of sequences to a maximum sequence length.""" + # print(len_a, len_b, max_num_tokens) + assert len_a > 0 + if len_a + len_b <= max_num_tokens: + return False + while len_a + len_b > max_num_tokens: + if len_a > len_b: + len_a -= 1 + tokens = tokens_a + else: + len_b -= 1 + tokens = tokens_b + if np_rng.random() < 0.5: + del tokens[0] + else: + tokens.pop() + return True diff --git a/fengshen/data/hubert/hubert_dataset.py b/fengshen/data/hubert/hubert_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..d8eaa25a5238740cc86a05af257aa3e0996f1499 --- /dev/null +++ b/fengshen/data/hubert/hubert_dataset.py @@ -0,0 +1,361 @@ +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. 
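+#
+# HuBERT-style dataset: reads a tsv manifest of audio files together with
+# frame- or sequence-level label files (code follows fairseq's HubertDataset).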
+ +import itertools +import logging +import os +import sys +from typing import Any, List, Optional, Union + +import numpy as np + +import torch +import torch.nn.functional as F +from fairseq.data import data_utils +from fairseq.data.fairseq_dataset import FairseqDataset + +logger = logging.getLogger(__name__) + + +def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('Hubert Dataset') + parser.add_argument('--data', type=str) + parser.add_argument('--sample_rate', type=float, default=16000) + parser.add_argument('--label_dir', type=str) + parser.add_argument('--labels', type=str, nargs='+') + parser.add_argument('--label_rate', type=float) + parser.add_argument('--max_keep_size', type=int, default=None) + parser.add_argument('--min_sample_size', type=int) + parser.add_argument('--max_sample_size', type=int) + parser.add_argument('--pad_audio', type=bool) + parser.add_argument('--normalize', type=bool) + parser.add_argument('--random_crop', type=bool) + parser.add_argument('--single_target', type=bool, default=False) + return parent_args + + +def load_audio(manifest_path, max_keep, min_keep): + n_long, n_short = 0, 0 + names, inds, sizes = [], [], [] + with open(manifest_path) as f: + root = f.readline().strip() + for ind, line in enumerate(f): + items = line.strip().split("\t") + assert len(items) == 2, line + sz = int(items[1]) + if min_keep is not None and sz < min_keep: + n_short += 1 + elif max_keep is not None and sz > max_keep: + n_long += 1 + else: + names.append(items[0]) + inds.append(ind) + sizes.append(sz) + tot = ind + 1 + logger.info( + ( + f"max_keep={max_keep}, min_keep={min_keep}, " + f"loaded {len(names)}, skipped {n_short} short and {n_long} long, " + f"longest-loaded={max(sizes)}, shortest-loaded={min(sizes)}" + ) + ) + return root, names, inds, tot, sizes + + +def load_label(label_path, inds, tot): + with open(label_path) as f: + labels = [line.rstrip() for line in f] + assert ( + len(labels) == tot + ), f"number of labels does not match ({len(labels)} != {tot})" + labels = [labels[i] for i in inds] + return labels + + +def load_label_offset(label_path, inds, tot): + with open(label_path) as f: + code_lengths = [len(line.encode("utf-8")) for line in f] + assert ( + len(code_lengths) == tot + ), f"number of labels does not match ({len(code_lengths)} != {tot})" + offsets = list(itertools.accumulate([0] + code_lengths)) + offsets = [(offsets[i], offsets[i + 1]) for i in inds] + return offsets + + +def verify_label_lengths( + audio_sizes, + audio_rate, + label_path, + label_rate, + inds, + tot, + tol=0.1, # tolerance in seconds +): + if label_rate < 0: + logger.info(f"{label_path} is sequence label. skipped") + return + + with open(label_path) as f: + lengths = [len(line.rstrip().split()) for line in f] + assert len(lengths) == tot + lengths = [lengths[i] for i in inds] + num_invalid = 0 + for i, ind in enumerate(inds): + dur_from_audio = audio_sizes[i] / audio_rate + dur_from_label = lengths[i] / label_rate + if abs(dur_from_audio - dur_from_label) > tol: + logger.warning( + ( + f"audio and label duration differ too much " + f"(|{dur_from_audio} - {dur_from_label}| > {tol}) " + f"in line {ind+1} of {label_path}. Check if `label_rate` " + f"is correctly set (currently {label_rate}). " + f"num. 
of samples = {audio_sizes[i]}; " + f"label length = {lengths[i]}" + ) + ) + num_invalid += 1 + if num_invalid > 0: + logger.warning( + f"total {num_invalid} (audio, label) pairs with mismatched lengths" + ) + + +class HubertDataset(FairseqDataset): + def __init__( + self, + manifest_path: str, + sample_rate: float, + label_paths: List[str], + label_rates: Union[List[float], float], # -1 for sequence labels + pad_list: List[str], + eos_list: List[str], + label_processors: Optional[List[Any]] = None, + max_keep_sample_size: Optional[int] = None, + min_keep_sample_size: Optional[int] = None, + max_sample_size: Optional[int] = None, + shuffle: bool = True, + pad_audio: bool = False, + normalize: bool = False, + store_labels: bool = True, + random_crop: bool = False, + single_target: bool = False, + ): + self.audio_root, self.audio_names, inds, tot, self.sizes = load_audio( + manifest_path, max_keep_sample_size, min_keep_sample_size + ) + self.sample_rate = sample_rate + self.shuffle = shuffle + self.random_crop = random_crop + + self.num_labels = len(label_paths) + self.pad_list = pad_list + self.eos_list = eos_list + self.label_processors = label_processors + self.single_target = single_target + self.label_rates = ( + [label_rates for _ in range(len(label_paths))] + if isinstance(label_rates, float) + else label_rates + ) + self.store_labels = store_labels + if store_labels: + self.label_list = [load_label(p, inds, tot) for p in label_paths] + else: + self.label_paths = label_paths + self.label_offsets_list = [ + load_label_offset(p, inds, tot) for p in label_paths + ] + assert label_processors is None or len(label_processors) == self.num_labels + for label_path, label_rate in zip(label_paths, self.label_rates): + verify_label_lengths( + self.sizes, sample_rate, label_path, label_rate, inds, tot + ) + + self.max_sample_size = ( + max_sample_size if max_sample_size is not None else sys.maxsize + ) + self.pad_audio = pad_audio + self.normalize = normalize + logger.info( + f"pad_audio={pad_audio}, random_crop={random_crop}, " + f"normalize={normalize}, max_sample_size={self.max_sample_size}" + ) + + def get_audio(self, index): + import soundfile as sf + + wav_path = os.path.join(self.audio_root, self.audio_names[index]) + wav, cur_sample_rate = sf.read(wav_path) + wav = torch.from_numpy(wav).float() + wav = self.postprocess(wav, cur_sample_rate) + return wav + + def get_label(self, index, label_idx): + if self.store_labels: + label = self.label_list[label_idx][index] + else: + with open(self.label_paths[label_idx]) as f: + offset_s, offset_e = self.label_offsets_list[label_idx][index] + f.seek(offset_s) + label = f.read(offset_e - offset_s) + + if self.label_processors is not None: + label = self.label_processors[label_idx](label) + return label + + def get_labels(self, index): + return [self.get_label(index, i) for i in range(self.num_labels)] + + def __getitem__(self, index): + wav = self.get_audio(index) + labels = self.get_labels(index) + return {"id": index, "source": wav, "label_list": labels} + + def __len__(self): + return len(self.sizes) + + def crop_to_max_size(self, wav, target_size): + size = len(wav) + diff = size - target_size + if diff <= 0: + return wav, 0 + + start, end = 0, target_size + if self.random_crop: + start = np.random.randint(0, diff + 1) + end = size - diff + start + return wav[start:end], start + + def collater(self, samples): + # target = max(sizes) -> random_crop not used + # target = max_sample_size -> random_crop used for long + samples = [s for s in samples if 
s["source"] is not None] + if len(samples) == 0: + return {} + + audios = [s["source"] for s in samples] + audio_sizes = [len(s) for s in audios] + if self.pad_audio: + audio_size = min(max(audio_sizes), self.max_sample_size) + else: + audio_size = min(min(audio_sizes), self.max_sample_size) + collated_audios, padding_mask, audio_starts = self.collater_audio( + audios, audio_size + ) + + targets_by_label = [ + [s["label_list"][i] for s in samples] for i in range(self.num_labels) + ] + targets_list, lengths_list, ntokens_list = self.collater_label( + targets_by_label, audio_size, audio_starts + ) + + net_input = {"source": collated_audios, "padding_mask": padding_mask} + batch = { + "id": torch.LongTensor([s["id"] for s in samples]), + "net_input": net_input, + } + + if self.single_target: + batch["target_lengths"] = lengths_list[0] + batch["ntokens"] = ntokens_list[0] + batch["target"] = targets_list[0] + else: + batch["target_lengths_list"] = lengths_list + batch["ntokens_list"] = ntokens_list + batch["target_list"] = targets_list + return batch + + def collater_audio(self, audios, audio_size): + collated_audios = audios[0].new_zeros(len(audios), audio_size) + padding_mask = ( + torch.BoolTensor(collated_audios.shape).fill_(False) + # if self.pad_audio else None + ) + audio_starts = [0 for _ in audios] + for i, audio in enumerate(audios): + diff = len(audio) - audio_size + if diff == 0: + collated_audios[i] = audio + elif diff < 0: + assert self.pad_audio + collated_audios[i] = torch.cat([audio, audio.new_full((-diff,), 0.0)]) + padding_mask[i, diff:] = True + else: + collated_audios[i], audio_starts[i] = self.crop_to_max_size( + audio, audio_size + ) + return collated_audios, padding_mask, audio_starts + + def collater_frm_label(self, targets, audio_size, audio_starts, label_rate, pad): + assert label_rate > 0 + s2f = label_rate / self.sample_rate + frm_starts = [int(round(s * s2f)) for s in audio_starts] + frm_size = int(round(audio_size * s2f)) + if not self.pad_audio: + rem_size = [len(t) - s for t, s in zip(targets, frm_starts)] + frm_size = min(frm_size, *rem_size) + targets = [t[s: s + frm_size] for t, s in zip(targets, frm_starts)] + logger.debug(f"audio_starts={audio_starts}") + logger.debug(f"frame_starts={frm_starts}") + logger.debug(f"frame_size={frm_size}") + + lengths = torch.LongTensor([len(t) for t in targets]) + ntokens = lengths.sum().item() + targets = data_utils.collate_tokens(targets, pad_idx=pad, left_pad=False) + return targets, lengths, ntokens + + def collater_seq_label(self, targets, pad): + lengths = torch.LongTensor([len(t) for t in targets]) + ntokens = lengths.sum().item() + targets = data_utils.collate_tokens(targets, pad_idx=pad, left_pad=False) + return targets, lengths, ntokens + + def collater_label(self, targets_by_label, audio_size, audio_starts): + targets_list, lengths_list, ntokens_list = [], [], [] + itr = zip(targets_by_label, self.label_rates, self.pad_list) + for targets, label_rate, pad in itr: + if label_rate == -1.0: + targets, lengths, ntokens = self.collater_seq_label(targets, pad) + else: + targets, lengths, ntokens = self.collater_frm_label( + targets, audio_size, audio_starts, label_rate, pad + ) + targets_list.append(targets) + lengths_list.append(lengths) + ntokens_list.append(ntokens) + return targets_list, lengths_list, ntokens_list + + def num_tokens(self, index): + return self.size(index) + + def size(self, index): + if self.pad_audio: + return self.sizes[index] + return min(self.sizes[index], self.max_sample_size) + + def 
ordered_indices(self): + if self.shuffle: + order = [np.random.permutation(len(self))] + else: + order = [np.arange(len(self))] + + order.append(self.sizes) + return np.lexsort(order)[::-1] + + def postprocess(self, wav, cur_sample_rate): + if wav.dim() == 2: + wav = wav.mean(-1) + assert wav.dim() == 1, wav.dim() + + if cur_sample_rate != self.sample_rate: + raise Exception(f"sr {cur_sample_rate} != {self.sample_rate}") + + if self.normalize: + with torch.no_grad(): + wav = F.layer_norm(wav, wav.shape) + return wav diff --git a/fengshen/data/megatron_dataloader/Makefile b/fengshen/data/megatron_dataloader/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..8f9db7686696fbea6c94b998db4b40ef426c748d --- /dev/null +++ b/fengshen/data/megatron_dataloader/Makefile @@ -0,0 +1,9 @@ +CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color +CPPFLAGS += $(shell python3 -m pybind11 --includes) +LIBNAME = helpers +LIBEXT = $(shell python3-config --extension-suffix) + +default: $(LIBNAME)$(LIBEXT) + +%$(LIBEXT): %.cpp + $(CXX) $(CXXFLAGS) $(CPPFLAGS) $< -o $@ diff --git a/fengshen/data/megatron_dataloader/__init__.py b/fengshen/data/megatron_dataloader/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5f898c6bdf89c6cf0243af102d04f6efed86b8 --- /dev/null +++ b/fengshen/data/megatron_dataloader/__init__.py @@ -0,0 +1 @@ +from . import indexed_dataset diff --git a/fengshen/data/megatron_dataloader/bart_dataset.py b/fengshen/data/megatron_dataloader/bart_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..82a22aa21eba9ac4794305c72efe3c25e2bdefb7 --- /dev/null +++ b/fengshen/data/megatron_dataloader/bart_dataset.py @@ -0,0 +1,443 @@ +"""BART Style dataset. Modified from fairseq.""" + +import numpy as np +import torch +import math +import re + +from fengshen.data.megatron_dataloader.dataset_utils import ( + get_samples_mapping +) + + +class BartDataset(torch.utils.data.Dataset): + def __init__(self, name, indexed_dataset, data_prefix, + num_epochs, max_num_samples, masked_lm_prob, + max_seq_length, short_seq_prob, seed, tokenizer, zh_tokenizer): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + + # Dataset. + self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping(self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 3, # account for added tokens + short_seq_prob, + self.seed, + self.name, + False) + + # Vocab stuff. 
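+        # keep id<->token lookups around; build_training_sample uses them for
+        # whole-word masking decisions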
+ self.vocab_size = tokenizer.vocab_size + inv_vocab = {v: k for k, v in tokenizer.vocab.items()} + self.vocab_id_list = list(inv_vocab.keys()) + self.vocab_id_to_token_dict = inv_vocab + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + self.tokenizer = tokenizer + + seg_tokens = ['。', ';', ';', '!', '!', '?', '?'] + seg_token_ids = [] + for t in seg_tokens: + if t in tokenizer.vocab: + seg_token_ids.append(tokenizer.vocab[t]) + else: + print('seg_token "{}" not in vocab'.format(t)) + self.seg_token_ids = set(seg_token_ids) + + self.zh_tokenizer = zh_tokenizer + + # Denoising ratios + self.permute_sentence_ratio = 1.0 + self.mask_ratio = masked_lm_prob # 0.15 + self.random_ratio = 0.1 + self.insert_ratio = 0.0 + self.rotate_ratio = 0.0 + self.mask_whole_word = 1 + self.item_transform_func = None + + self.mask_span_distribution = None + if False: + _lambda = 3 # Poisson lambda + + lambda_to_the_k = 1 + e_to_the_minus_lambda = math.exp(-_lambda) + k_factorial = 1 + ps = [] + for k in range(0, 128): + ps.append(e_to_the_minus_lambda * lambda_to_the_k / k_factorial) + lambda_to_the_k *= _lambda + k_factorial *= k + 1 + if ps[-1] < 0.0000001: + break + ps = torch.FloatTensor(ps) + self.mask_span_distribution = torch.distributions.Categorical(ps) + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + start_idx, end_idx, seq_length = self.samples_mapping[idx] + sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)] + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. + # We % 2**32 since numpy requres the seed to be between 0 and 2**32 - 1 + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return self.build_training_sample(sample, self.max_seq_length, np_rng) + + def build_training_sample(self, sample, max_seq_length, np_rng): + """Biuld training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + max_seq_length: Desired sequence length. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. 
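+
+        Returns:
+            A dict with "input_ids", "labels" and "attention_mask", all of
+            length max_seq_length; unused label positions are filled with -100.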
+ """ + # permute sentences + full_stops = [] + tokens = [self.cls_id] + for sent in sample: + for t in sent: + token = self.vocab_id_to_token_dict[t] + if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0: + # 兼容erlangshen ##的方式做whole word mask + t = self.tokenizer.convert_tokens_to_ids(token[2:]) + tokens.append(t) + if t in self.seg_token_ids: + tokens.append(self.sep_id) + if tokens[-1] != self.sep_id: + tokens.append(self.sep_id) + + if len(tokens) > max_seq_length: + tokens = tokens[:max_seq_length] + tokens[-1] = self.sep_id + tokens = torch.LongTensor(tokens) + full_stops = (tokens == self.sep_id).long() + assert (max_seq_length - tokens.shape[0]) >= 0, (tokens.size(), tokens[-1], max_seq_length) + + source, target = tokens, tokens[1:].clone() + use_decoder = 1 + # if torch.rand(1).item() < 0.5: + # use_decoder = 0 + + if self.permute_sentence_ratio > 0.0 and use_decoder == 1: + source = self.permute_sentences(source, full_stops, self.permute_sentence_ratio) + + if self.mask_ratio > 0.0: + replace_length = 1 if use_decoder else -1 + mask_ratio = self.mask_ratio * 2 if use_decoder else self.mask_ratio + source = self.add_whole_word_mask(source, mask_ratio, replace_length) + + if self.insert_ratio > 0.0: + raise NotImplementedError + source = self.add_insertion_noise(source, self.insert_ratio) + + if self.rotate_ratio > 0.0 and np.random.random() < self.rotate_ratio: + raise NotImplementedError + source = self.add_rolling_noise(source) + + # there can additional changes to make: + if self.item_transform_func is not None: + source, target = self.item_transform_func(source, target) + + assert (source >= 0).all() + # assert (source[1:-1] >= 1).all() + assert (source <= self.vocab_size).all() + assert source[0] == self.cls_id + assert source[-1] == self.sep_id + + # tokenizer = get_tokenizer() + # print(' '.join(tokenizer.tokenizer.convert_ids_to_tokens(source))) + # print(tokenizer.detokenize(target)) + # print(tokenizer.detokenize(source)) + # print() + + prev_output_tokens = torch.zeros_like(target) + prev_output_tokens[0] = self.sep_id # match the preprocessing in fairseq + prev_output_tokens[1:] = target[:-1] + + # src_padding_length = max_seq_length - source.shape[0] + # tgt_padding_length = max_seq_length - target.shape[0] + # assert src_padding_length >= 0, (source.size(), source[-1], max_seq_length) + # assert tgt_padding_length >= 0, (target.size(), target[-1], max_seq_length) + source_ = torch.full((max_seq_length,), self.pad_id, dtype=torch.long) + source_[:source.shape[0]] = source + target_ = torch.full((max_seq_length,), -100, dtype=torch.long) + # decoder not need bos in the front + target_[:target.shape[0]] = target + prev_output_tokens_ = torch.full((max_seq_length,), self.pad_id, dtype=torch.long) + prev_output_tokens_[:prev_output_tokens.shape[0]] = prev_output_tokens + + return { + "input_ids": source_, + "labels": target_, + # "decoder_input_ids": prev_output_tokens_, + "attention_mask": (source_ != self.pad_id).long() + } + + def permute_sentences(self, source, full_stops, p=1.0): + # Tokens that are full stops, where the previous token is not + sentence_ends = (full_stops[1:] * ~full_stops[:-1]).nonzero(as_tuple=False) + 2 + result = source.clone() + + num_sentences = sentence_ends.size(0) + num_to_permute = math.ceil((num_sentences * 2 * p) / 2.0) + substitutions = torch.randperm(num_sentences)[:num_to_permute] + ordering = torch.arange(0, num_sentences) + ordering[substitutions] = substitutions[torch.randperm(num_to_permute)] + + # Ignore at start + index = 1 + 
for i in ordering: + sentence = source[(sentence_ends[i - 1] if i > 0 else 1): sentence_ends[i]] + result[index: index + sentence.size(0)] = sentence + index += sentence.size(0) + return result + + def word_starts_en(self, source): + if self.mask_whole_word is not None: + is_word_start = self.mask_whole_word.gather(0, source) + else: + is_word_start = torch.ones(source.size()) + is_word_start[0] = 0 + is_word_start[-1] = 0 + return is_word_start + + def word_starts(self, source): + if self.mask_whole_word is None: + is_word_start = torch.ones(source.size()) + is_word_start[0] = 0 + is_word_start[-1] = 0 + return is_word_start + raw_tokens = [self.vocab_id_to_token_dict[i] for i in source.tolist()] + words = [raw_tokens[0]] + \ + self.zh_tokenizer(''.join(raw_tokens[1:-1]), HMM=True) + [raw_tokens[-1]] + + def _is_chinese_char(c): + """Checks whether CP is the #codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if len(c) > 1: + return all([_is_chinese_char(c_i) for c_i in c]) + cp = ord(c) + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def align_linear(atokens, btokens): + a2c = [] + c2b = [] + a2b = [] + length = 0 + for tok in atokens: + a2c.append([length + i for i in range(len(tok))]) + length += len(tok) + for i, tok in enumerate(btokens): + c2b.extend([i for _ in range(len(tok))]) + + for i, amap in enumerate(a2c): + bmap = [c2b[ci] for ci in amap] + a2b.append(list(set(bmap))) + return a2b + + raw_to_word_align = align_linear(raw_tokens, words) + is_word_start = torch.zeros(source.size()) + word_starts = [] + skip_cur_word = True + for i in range(1, len(raw_to_word_align)): + if raw_to_word_align[i-1] == raw_to_word_align[i]: + # not a word start, as they align to the same word + if not skip_cur_word and not _is_chinese_char(raw_tokens[i]): + word_starts.pop(-1) + skip_cur_word = True + continue + else: + is_word_start[i] = 1 + if _is_chinese_char(raw_tokens[i]): + word_starts.append(i) + skip_cur_word = False + is_word_start[0] = 0 + is_word_start[-1] = 0 + word_starts = torch.tensor(word_starts).long().view(-1, 1) + return is_word_start, word_starts + + def add_whole_word_mask(self, source, p, replace_length=1): + is_word_start, word_starts = self.word_starts(source) + num_to_mask_word = int(math.ceil(word_starts.size(0) * p)) + num_to_mask_char = int(math.ceil(word_starts.size(0) * p * 0.1)) + num_to_mask = num_to_mask_word + num_to_mask_char + if num_to_mask > word_starts.size(0): + word_starts = is_word_start.nonzero(as_tuple=False) + num_inserts = 0 + if num_to_mask == 0: + return source + + if self.mask_span_distribution is not None: + lengths = self.mask_span_distribution.sample(sample_shape=(num_to_mask,)) + + # Make sure we have enough to mask + cum_length = torch.cumsum(lengths, 0) + while 
cum_length[-1] < num_to_mask: + lengths = torch.cat( + [ + lengths, + self.mask_span_distribution.sample(sample_shape=(num_to_mask,)), + ], + dim=0, + ) + cum_length = torch.cumsum(lengths, 0) + + # Trim to masking budget + i = 0 + while cum_length[i] < num_to_mask: + i += 1 + lengths[i] = num_to_mask - (0 if i == 0 else cum_length[i - 1]) + num_to_mask = i + 1 + lengths = lengths[:num_to_mask] + + # Handle 0-length mask (inserts) separately + lengths = lengths[lengths > 0] + num_inserts = num_to_mask - lengths.size(0) + num_to_mask -= num_inserts + if num_to_mask == 0: + return self.add_insertion_noise(source, num_inserts / source.size(0)) + + assert (lengths > 0).all() + else: + lengths = torch.ones((num_to_mask,)).long() + assert is_word_start[-1] == 0 + indices = word_starts[ + torch.randperm(word_starts.size(0))[:num_to_mask] + ].squeeze(1) + mask_random = torch.FloatTensor(num_to_mask).uniform_() < self.random_ratio + source_length = source.size(0) + assert source_length - 1 not in indices + to_keep = torch.ones(source_length, dtype=torch.bool) + is_word_start[ + -1 + ] = 255 # acts as a long length, so spans don't go over the end of doc + if replace_length == 0: + to_keep[indices] = 0 + else: + # keep index, but replace it with [MASK] + # print(source.size(), word_starts.size(), indices.size(), mask_random.size()) + source[indices] = self.mask_id + source[indices[mask_random]] = torch.randint( + 1, self.vocab_size, size=(mask_random.sum(),) + ) + # sorted_indices = torch.sort(indices)[0] + # continue_mask_pos = ((sorted_indices + 1)[:-1] == sorted_indices[1:]) + # continue_mask_indices = sorted_indices[1:][continue_mask_pos] + # to_keep[continue_mask_indices] = 0 + + # for char indices, we already masked, the following loop handles word mask + indices = indices[:num_to_mask_word] + mask_random = mask_random[:num_to_mask_word] + if self.mask_span_distribution is not None: + assert len(lengths.size()) == 1 + assert lengths.size() == indices.size() + lengths -= 1 + while indices.size(0) > 0: + assert lengths.size() == indices.size() + lengths -= is_word_start[indices + 1].long() + uncompleted = lengths >= 0 + indices = indices[uncompleted] + 1 + mask_random = mask_random[uncompleted] + lengths = lengths[uncompleted] + if replace_length != -1: + # delete token + to_keep[indices] = 0 + else: + # keep index, but replace it with [MASK] + source[indices] = self.mask_id + source[indices[mask_random]] = torch.randint( + 1, self.vocab_size, size=(mask_random.sum(),) + ) + else: + # A bit faster when all lengths are 1 + while indices.size(0) > 0: + uncompleted = is_word_start[indices + 1] == 0 + indices = indices[uncompleted] + 1 + mask_random = mask_random[uncompleted] + if replace_length != -1: + # delete token + to_keep[indices] = 0 + else: + # keep index, but replace it with [MASK] + source[indices] = self.mask_id + source[indices[mask_random]] = torch.randint( + 1, self.vocab_size, size=(mask_random.sum(),) + ) + + assert source_length - 1 not in indices + + source = source[to_keep] + + if num_inserts > 0: + source = self.add_insertion_noise(source, num_inserts / source.size(0)) + + return source + + def add_permuted_noise(self, tokens, p): + num_words = len(tokens) + num_to_permute = math.ceil(((num_words * 2) * p) / 2.0) + substitutions = torch.randperm(num_words - 2)[:num_to_permute] + 1 + tokens[substitutions] = tokens[substitutions[torch.randperm(num_to_permute)]] + return tokens + + def add_rolling_noise(self, tokens): + offset = np.random.randint(1, max(1, tokens.size(-1) - 1) + 1) 
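+ # Rotate the inner tokens by a random offset while keeping the first ([CLS]) and
+ # last ([SEP]) positions fixed, e.g. [c, 1, 2, 3, 4, s] with offset=3 -> [c, 3, 4, 1, 2, s].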
+ tokens = torch.cat( + (tokens[0:1], tokens[offset:-1], tokens[1:offset], tokens[-1:]), + dim=0, + ) + return tokens + + def add_insertion_noise(self, tokens, p): + if p == 0.0: + return tokens + + num_tokens = len(tokens) + n = int(math.ceil(num_tokens * p)) + + noise_indices = torch.randperm(num_tokens + n - 2)[:n] + 1 + noise_mask = torch.zeros(size=(num_tokens + n,), dtype=torch.bool) + noise_mask[noise_indices] = 1 + result = torch.LongTensor(n + len(tokens)).fill_(-1) + + num_random = int(math.ceil(n * self.random_ratio)) + result[noise_indices[num_random:]] = self.mask_id + result[noise_indices[:num_random]] = torch.randint( + low=1, high=self.vocab_size, size=(num_random,) + ) + + result[~noise_mask] = tokens + + assert (result >= 0).all() + return result diff --git a/fengshen/data/megatron_dataloader/bert_dataset.py b/fengshen/data/megatron_dataloader/bert_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..2c007f060fd07fc9c6302b7f88e191469d599222 --- /dev/null +++ b/fengshen/data/megatron_dataloader/bert_dataset.py @@ -0,0 +1,196 @@ +# coding=utf-8 +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""BERT Style dataset.""" + + +import numpy as np +import torch + +from fengshen.data.megatron_dataloader.dataset_utils import ( + get_samples_mapping, + get_a_and_b_segments, + create_masked_lm_predictions, + create_tokens_and_tokentypes, +) + + +class BertDataset(torch.utils.data.Dataset): + + def __init__(self, name, indexed_dataset, data_prefix, + num_epochs, max_num_samples, masked_lm_prob, + max_seq_length, short_seq_prob, seed, binary_head, tokenizer, masking_style): + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.short_seq_prob = short_seq_prob + self.binary_head = binary_head + self.masking_style = masking_style + + # Dataset. + self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping(self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + # account for added tokens + self.max_seq_length - 3, + short_seq_prob, + self.seed, + self.name, + self.binary_head) + inv_vocab = {v: k for k, v in tokenizer.vocab.items()} + self.vocab_id_list = list(inv_vocab.keys()) + self.vocab_id_to_token_dict = inv_vocab + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + self.tokenizer = tokenizer + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + start_idx, end_idx, seq_length = self.samples_mapping[idx] + sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)] + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. 
+ # We % 2**32 since numpy requres the seed to be between 0 and 2**32 - 1 + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample(sample, seq_length, + self.max_seq_length, # needed for padding + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.cls_id, self.sep_id, + self.mask_id, self.pad_id, + self.masked_lm_prob, np_rng, + self.binary_head, + tokenizer=self.tokenizer, + masking_style=self.masking_style) + + +def build_training_sample(sample, + target_seq_length, max_seq_length, + vocab_id_list, vocab_id_to_token_dict, + cls_id, sep_id, mask_id, pad_id, + masked_lm_prob, np_rng, binary_head, + tokenizer, + masking_style='bert'): + """Biuld training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + """ + + if binary_head: + # We assume that we have at least two sentences in the sample + assert len(sample) > 1 + assert target_seq_length <= max_seq_length + + # Divide sample into two segments (A and B). + if binary_head: + tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, + np_rng) + else: + tokens_a = [] + for j in range(len(sample)): + tokens_a.extend(sample[j]) + tokens_b = [] + is_next_random = False + + if len(tokens_a) >= max_seq_length-3: + tokens_a = tokens_a[:max_seq_length-3] + + # Truncate to `target_sequence_length`. + max_num_tokens = target_seq_length + '''' + truncated = truncate_segments(tokens_a, tokens_b, len(tokens_a), + len(tokens_b), max_num_tokens, np_rng) + ''' + + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b, + cls_id, sep_id) + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions( + tokens, vocab_id_list, vocab_id_to_token_dict, masked_lm_prob, + cls_id, sep_id, mask_id, max_predictions_per_seq, np_rng, + tokenizer=tokenizer, + masking_style=masking_style) + + # Padding. + tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np \ + = pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length) + + train_sample = { + 'input_ids': tokens_np, + 'token_type_ids': tokentypes_np, + 'labels': labels_np, + 'next_sentence_label': int(is_next_random), + 'attention_mask': padding_mask_np} + return train_sample + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. 
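+ # (Below, labels are filled with -100 -- the default ignore_index of
+ # torch.nn.CrossEntropyLoss -- so padded and unmasked positions do not
+ # contribute to the MLM loss; loss_mask marks the masked positions.)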
+ filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, + dtype=np.int64) + + # Lables and loss mask. + labels = [-100] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np diff --git a/fengshen/data/megatron_dataloader/blendable_dataset.py b/fengshen/data/megatron_dataloader/blendable_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..ee24d4056b86333a13d4926e79283a0bc96bbea3 --- /dev/null +++ b/fengshen/data/megatron_dataloader/blendable_dataset.py @@ -0,0 +1,64 @@ +# coding=utf-8 +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Blendable dataset.""" + +import time + +import numpy as np +import torch + +from fengshen.data.megatron_dataloader.utils import print_rank_0 + + +class BlendableDataset(torch.utils.data.Dataset): + + def __init__(self, datasets, weights): + + self.datasets = datasets + num_datasets = len(datasets) + assert num_datasets == len(weights) + + self.size = 0 + for dataset in self.datasets: + self.size += len(dataset) + + # Normalize weights. + weights = np.array(weights, dtype=np.float64) + sum_weights = np.sum(weights) + assert sum_weights > 0.0 + weights /= sum_weights + + # Build indecies. + start_time = time.time() + assert num_datasets < 255 + self.dataset_index = np.zeros(self.size, dtype=np.uint8) + self.dataset_sample_index = np.zeros(self.size, dtype=np.int64) + + from fengshen.data.megatron_dataloader import helpers + helpers.build_blending_indices(self.dataset_index, + self.dataset_sample_index, + weights, num_datasets, self.size, + torch.distributed.get_rank() == 0) + print_rank_0('> elapsed time for building blendable dataset indices: ' + '{:.2f} (sec)'.format(time.time() - start_time)) + + def __len__(self): + return self.size + + def __getitem__(self, idx): + dataset_idx = self.dataset_index[idx] + sample_idx = self.dataset_sample_index[idx] + return self.datasets[dataset_idx][sample_idx] diff --git a/fengshen/data/megatron_dataloader/dataset_utils.py b/fengshen/data/megatron_dataloader/dataset_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9b579751573ff8ddf94882c032d4ed6cc168ba07 --- /dev/null +++ b/fengshen/data/megatron_dataloader/dataset_utils.py @@ -0,0 +1,788 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors, and NVIDIA. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +# Most of the code here has been copied from: +# https://github.com/google-research/albert/blob/master/create_pretraining_data.py +# with some modifications. + +import math +import time +import collections + +import numpy as np +import re + +from fengshen.data.megatron_dataloader.utils import ( + print_rank_0 +) +from fengshen.data.megatron_dataloader.blendable_dataset import BlendableDataset +from fengshen.data.megatron_dataloader.indexed_dataset import make_dataset as make_indexed_dataset + +DSET_TYPE_BERT = 'standard_bert' +DSET_TYPE_ICT = 'ict' +DSET_TYPE_T5 = 't5' +DSET_TYPE_BERT_CN_WWM = 'bert_cn_wwm' +DSET_TYPE_BART = 'bart' +DSET_TYPE_COCOLM = 'coco_lm' + +DSET_TYPES = [DSET_TYPE_BERT, DSET_TYPE_ICT, + DSET_TYPE_T5, DSET_TYPE_BERT_CN_WWM, + DSET_TYPE_BART, DSET_TYPE_COCOLM] + + +def get_datasets_weights_and_num_samples(data_prefix, + train_valid_test_num_samples): + + # The data prefix should be in the format of: + # weight-1, data-prefix-1, weight-2, data-prefix-2, .. + assert len(data_prefix) % 2 == 0 + num_datasets = len(data_prefix) // 2 + weights = [0] * num_datasets + prefixes = [0] * num_datasets + for i in range(num_datasets): + weights[i] = float(data_prefix[2 * i]) + prefixes[i] = (data_prefix[2 * i + 1]).strip() + # Normalize weights + weight_sum = 0.0 + for weight in weights: + weight_sum += weight + assert weight_sum > 0.0 + weights = [weight / weight_sum for weight in weights] + + # Add 0.5% (the 1.005 factor) so in case the bleding dataset does + # not uniformly distribute the number of samples, we still have + # samples left to feed to the network. + datasets_train_valid_test_num_samples = [] + for weight in weights: + datasets_train_valid_test_num_samples.append( + [int(math.ceil(val * weight * 1.005)) + for val in train_valid_test_num_samples]) + + return prefixes, weights, datasets_train_valid_test_num_samples + + +def compile_helper(): + """Compile helper function ar runtime. Make sure this + is invoked on a single process.""" + import os + import subprocess + path = os.path.abspath(os.path.dirname(__file__)) + ret = subprocess.run(['make', '-C', path]) + if ret.returncode != 0: + print("Making C++ dataset helpers module failed, exiting.") + import sys + sys.exit(1) + + +def get_a_and_b_segments(sample, np_rng): + """Divide sample into a and b segments.""" + + # Number of sentences in the sample. + n_sentences = len(sample) + # Make sure we always have two sentences. + assert n_sentences > 1, 'make sure each sample has at least two sentences.' + + # First part: + # `a_end` is how many sentences go into the `A`. + a_end = 1 + if n_sentences >= 3: + # Note that randin in numpy is exclusive. 
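+ # a_end is therefore drawn from [1, n_sentences - 1], so both segment A
+ # and segment B end up non-empty.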
+ a_end = np_rng.randint(1, n_sentences) + tokens_a = [] + for j in range(a_end): + tokens_a.extend(sample[j]) + + # Second part: + tokens_b = [] + for j in range(a_end, n_sentences): + tokens_b.extend(sample[j]) + + # Random next: + is_next_random = False + if np_rng.random() < 0.5: + is_next_random = True + tokens_a, tokens_b = tokens_b, tokens_a + + return tokens_a, tokens_b, is_next_random + + +def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng): + """Truncates a pair of sequences to a maximum sequence length.""" + # print(len_a, len_b, max_num_tokens) + assert len_a > 0 + if len_a + len_b <= max_num_tokens: + return False + while len_a + len_b > max_num_tokens: + if len_a > len_b: + len_a -= 1 + tokens = tokens_a + else: + len_b -= 1 + tokens = tokens_b + if np_rng.random() < 0.5: + del tokens[0] + else: + tokens.pop() + return True + + +def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id): + """Merge segments A and B, add [CLS] and [SEP] and build tokentypes.""" + + tokens = [] + tokentypes = [] + # [CLS]. + tokens.append(cls_id) + tokentypes.append(0) + # Segment A. + for token in tokens_a: + tokens.append(token) + tokentypes.append(0) + # [SEP]. + tokens.append(sep_id) + tokentypes.append(0) + # Segment B. + for token in tokens_b: + tokens.append(token) + tokentypes.append(1) + if tokens_b: + # [SEP]. + tokens.append(sep_id) + tokentypes.append(1) + + return tokens, tokentypes + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", + ["index", "label"]) + + +def is_start_piece(piece): + """Check if the current word piece is the starting piece (BERT).""" + # When a word has been split into + # WordPieces, the first token does not have any marker and any subsequence + # tokens are prefixed with ##. So whenever we see the ## token, we + # append it to the previous set of word indexes. + return not piece.startswith("##") + + +def create_masked_lm_predictions(tokens, + vocab_id_list, vocab_id_to_token_dict, + masked_lm_prob, + cls_id, sep_id, mask_id, + max_predictions_per_seq, + np_rng, + tokenizer, + max_ngrams=3, + do_whole_word_mask=True, + favor_longer_ngram=False, + do_permutation=False, + geometric_dist=False, + masking_style="bert", + zh_tokenizer=None): + """Creates the predictions for the masked LM objective. + Note: Tokens here are vocab ids and not text tokens.""" + + cand_indexes = [] + # Note(mingdachen): We create a list for recording if the piece is + # the starting piece of current token, where 1 means true, so that + # on-the-fly whole word masking is possible. + token_boundary = [0] * len(tokens) + + # 如果没有指定中文分词器,那就直接按##算 + if zh_tokenizer is None: + for (i, token) in enumerate(tokens): + if token == cls_id or token == sep_id: + token_boundary[i] = 1 + continue + # Whole Word Masking means that if we mask all of the wordpieces + # corresponding to an original word. + # + # Note that Whole Word Masking does *not* change the training code + # at all -- we still predict each WordPiece independently, softmaxed + # over the entire vocabulary. 
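+ # Example: the wordpieces ["play", "##ing"] are grouped into a single
+ # candidate [i, i + 1], so they get masked (or kept) together.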
+ if (do_whole_word_mask and len(cand_indexes) >= 1 and + not is_start_piece(vocab_id_to_token_dict[token])): + cand_indexes[-1].append(i) + else: + cand_indexes.append([i]) + if is_start_piece(vocab_id_to_token_dict[token]): + token_boundary[i] = 1 + else: + # 如果指定了中文分词器,那就先用分词器分词,然后再进行判断 + # 获取去掉CLS SEP的原始文本 + raw_tokens = [] + for t in tokens: + if t != cls_id and t != sep_id: + raw_tokens.append(t) + raw_tokens = [vocab_id_to_token_dict[i] for i in raw_tokens] + # 分词然后获取每次字开头的最长词的长度 + word_list = set(zh_tokenizer(''.join(raw_tokens), HMM=True)) + word_length_dict = {} + for w in word_list: + if len(w) < 1: + continue + if w[0] not in word_length_dict: + word_length_dict[w[0]] = len(w) + elif word_length_dict[w[0]] < len(w): + word_length_dict[w[0]] = len(w) + i = 0 + # 从词表里面检索 + while i < len(tokens): + token_id = tokens[i] + token = vocab_id_to_token_dict[token_id] + if len(token) == 0 or token_id == cls_id or token_id == sep_id: + token_boundary[i] = 1 + i += 1 + continue + word_max_length = 1 + if token[0] in word_length_dict: + word_max_length = word_length_dict[token[0]] + j = 0 + word = '' + word_end = i+1 + # 兼容以前##的形式,如果后面的词是##开头的,那么直接把后面的拼到前面当作一个词 + old_style = False + while word_end < len(tokens) and vocab_id_to_token_dict[tokens[word_end]].startswith('##'): + old_style = True + word_end += 1 + if not old_style: + while j < word_max_length and i+j < len(tokens): + cur_token = tokens[i+j] + word += vocab_id_to_token_dict[cur_token] + j += 1 + if word in word_list: + word_end = i+j + cand_indexes.append([p for p in range(i, word_end)]) + token_boundary[i] = 1 + i = word_end + + output_tokens = list(tokens) + # add by ganruyi + if masking_style == 'bert-cn-wwm': + # if non chinese is False, that means it is chinese + # then try to remove "##" which is added previously + new_token_ids = [] + for token_id in output_tokens: + token = tokenizer.convert_ids_to_tokens([token_id])[0] + if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0: + token = token[2:] + new_token_id = tokenizer.convert_tokens_to_ids([token])[ + 0] + new_token_ids.append(new_token_id) + output_tokens = new_token_ids + + masked_lm_positions = [] + masked_lm_labels = [] + + if masked_lm_prob == 0: + return (output_tokens, masked_lm_positions, + masked_lm_labels, token_boundary) + + num_to_predict = min(max_predictions_per_seq, + max(1, int(round(len(tokens) * masked_lm_prob)))) + + ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64) + if not geometric_dist: + # Note(mingdachen): + # By default, we set the probilities to favor shorter ngram sequences. + pvals = 1. / np.arange(1, max_ngrams + 1) + pvals /= pvals.sum(keepdims=True) + if favor_longer_ngram: + pvals = pvals[::-1] + # 获取一个ngram的idx,对于每个word,记录他的ngram的word + ngram_indexes = [] + for idx in range(len(cand_indexes)): + ngram_index = [] + for n in ngrams: + ngram_index.append(cand_indexes[idx:idx + n]) + ngram_indexes.append(ngram_index) + + np_rng.shuffle(ngram_indexes) + + (masked_lms, masked_spans) = ([], []) + covered_indexes = set() + for cand_index_set in ngram_indexes: + if len(masked_lms) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. 
+ for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes: + continue + + if not geometric_dist: + n = np_rng.choice(ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + else: + # Sampling "n" from the geometric distribution and clipping it to + # the max_ngrams. Using p=0.2 default from the SpanBERT paper + # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1) + n = min(np_rng.geometric(0.2), max_ngrams) + + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # Note(mingdachen): + # Repeatedly looking for a candidate that does not exceed the + # maximum number of predictions by trying shorter ngrams. + while len(masked_lms) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. + if len(masked_lms) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + covered_indexes.add(index) + masked_token = None + if masking_style == "bert": + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))] + elif masking_style == 'bert-cn-wwm': + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + # 如果是中文全词mask,去掉tokens里的## + token_id = tokens[index] + token = tokenizer.convert_ids_to_tokens([token_id])[ + 0] + if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0: + token = token[2:] + new_token_id = tokenizer.convert_tokens_to_ids([token])[ + 0] + masked_token = new_token_id + # 10% of the time, replace with random word + else: + masked_token = vocab_id_list[np_rng.randint( + 0, len(vocab_id_list))] + elif masking_style == "t5": + masked_token = mask_id + else: + raise ValueError("invalid value of masking style") + + output_tokens[index] = masked_token + masked_lms.append(MaskedLmInstance( + index=index, label=tokens[index])) + + masked_spans.append(MaskedLmInstance( + index=index_set, + label=[tokens[index] for index in index_set])) + + assert len(masked_lms) <= num_to_predict + np_rng.shuffle(ngram_indexes) + + select_indexes = set() + if do_permutation: + for cand_index_set in ngram_indexes: + if len(select_indexes) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes or index in select_indexes: + continue + + n = np.random.choice(ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + + while len(select_indexes) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. 
+ if len(select_indexes) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes or index in select_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + select_indexes.add(index) + assert len(select_indexes) <= num_to_predict + + select_indexes = sorted(select_indexes) + permute_indexes = list(select_indexes) + np_rng.shuffle(permute_indexes) + orig_token = list(output_tokens) + + for src_i, tgt_i in zip(select_indexes, permute_indexes): + output_tokens[src_i] = orig_token[tgt_i] + masked_lms.append(MaskedLmInstance( + index=src_i, label=orig_token[src_i])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + # Sort the spans by the index of the first span + masked_spans = sorted(masked_spans, key=lambda x: x.index[0]) + + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, + dtype=np.int64) + + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np + + +def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, short_seq_prob, seed, + tokenizer, + skip_warmup, binary_head=False, + max_seq_length_dec=None, + dataset_type='standard_bert', + zh_tokenizer=None, + span=None): + + if len(data_prefix) == 1: + return _build_train_valid_test_datasets(data_prefix[0], + data_impl, splits_string, + train_valid_test_num_samples, + max_seq_length, masked_lm_prob, + short_seq_prob, seed, + skip_warmup, + binary_head, + max_seq_length_dec, + tokenizer, + dataset_type=dataset_type, + zh_tokenizer=zh_tokenizer, + span=span) + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, + train_valid_test_num_samples) + prefixes, weights, datasets_train_valid_test_num_samples = output + + # Build individual datasets. 
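+ # One train/valid/test triple is built per data prefix; BlendableDataset
+ # below then draws samples from them in proportion to the normalized weights.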
+ train_datasets = [] + valid_datasets = [] + test_datasets = [] + for i in range(len(prefixes)): + train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( + prefixes[i], data_impl, splits_string, + datasets_train_valid_test_num_samples[i], + max_seq_length, masked_lm_prob, short_seq_prob, + seed, skip_warmup, binary_head, max_seq_length_dec, + tokenizer, dataset_type=dataset_type, zh_tokenizer=zh_tokenizer) + if train_ds: + train_datasets.append(train_ds) + if valid_ds: + valid_datasets.append(valid_ds) + if test_ds: + test_datasets.append(test_ds) + + # Blend. + blending_train_dataset = None + if train_datasets: + blending_train_dataset = BlendableDataset(train_datasets, weights) + blending_valid_dataset = None + if valid_datasets: + blending_valid_dataset = BlendableDataset(valid_datasets, weights) + blending_test_dataset = None + if test_datasets: + blending_test_dataset = BlendableDataset(test_datasets, weights) + + return (blending_train_dataset, blending_valid_dataset, + blending_test_dataset) + + +def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, short_seq_prob, seed, + skip_warmup, binary_head, + max_seq_length_dec, + tokenizer, + dataset_type='standard_bert', + zh_tokenizer=None, + span=None): + + if dataset_type not in DSET_TYPES: + raise ValueError("Invalid dataset_type: ", dataset_type) + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, + data_impl, + skip_warmup) + + # Get start and end indices of train/valid/train into doc-idx + # Note that doc-idx is desinged to be num-docs + 1 so we can + # easily iterate over it. + total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1 + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + + # Print stats about the splits. + print_rank_0(' > dataset split:') + + def print_split_stats(name, index): + print_rank_0(' {}:'.format(name)) + print_rank_0(' document indices in [{}, {}) total of {} ' + 'documents'.format(splits[index], splits[index + 1], + splits[index + 1] - splits[index])) + start_index = indexed_dataset.doc_idx[splits[index]] + end_index = indexed_dataset.doc_idx[splits[index + 1]] + print_rank_0(' sentence indices in [{}, {}) total of {} ' + 'sentences'.format(start_index, end_index, + end_index - start_index)) + print_split_stats('train', 0) + print_split_stats('validation', 1) + print_split_stats('test', 2) + + def build_dataset(index, name): + from fengshen.data.megatron_dataloader.bert_dataset import BertDataset + from fengshen.data.megatron_dataloader.bart_dataset import BartDataset + from fengshen.data.megatron_dataloader.cocolm_dataset import COCOLMDataset + dataset = None + if splits[index + 1] > splits[index]: + # Get the pointer to the original doc-idx so we can set it later. + doc_idx_ptr = indexed_dataset.get_doc_idx() + # Slice the doc-idx + start_index = splits[index] + # Add +1 so we can index into the dataset to get the upper bound. + end_index = splits[index + 1] + 1 + # New doc_idx view. + indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index]) + # Build the dataset accordingly. 
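+ # These kwargs are shared by every dataset type; the branches below only
+ # differ in the wrapper class (Bert / Bart / COCOLM) and its masking arguments.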
+ kwargs = dict( + name=name, + data_prefix=data_prefix, + num_epochs=None, + max_num_samples=train_valid_test_num_samples[index], + max_seq_length=max_seq_length, + seed=seed, + ) + + if dataset_type == DSET_TYPE_BERT or dataset_type == DSET_TYPE_BERT_CN_WWM: + dataset = BertDataset( + indexed_dataset=indexed_dataset, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + binary_head=binary_head, + # 增加参数区分bert和bert-cn-wwm + tokenizer=tokenizer, + masking_style='bert' if dataset_type == DSET_TYPE_BERT else 'bert-cn-wwm', + **kwargs + ) + elif dataset_type == DSET_TYPE_BART: + dataset = BartDataset( + indexed_dataset=indexed_dataset, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + tokenizer=tokenizer, + zh_tokenizer=zh_tokenizer, + **kwargs + ) + elif dataset_type == DSET_TYPE_COCOLM: + dataset = COCOLMDataset( + indexed_dataset=indexed_dataset, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + tokenizer=tokenizer, + masking_style='bert', + span=span, + **kwargs + ) + else: + raise NotImplementedError( + "Dataset type not fully implemented.") + + # Set the original pointer so dataset remains the main dataset. + indexed_dataset.set_doc_idx(doc_idx_ptr) + # Checks. + assert indexed_dataset.doc_idx[0] == 0 + assert indexed_dataset.doc_idx.shape[0] == \ + (total_num_of_documents + 1) + return dataset + + train_dataset = build_dataset(0, 'train') + valid_dataset = build_dataset(1, 'valid') + test_dataset = build_dataset(2, 'test') + + return (train_dataset, valid_dataset, test_dataset) + + +def get_indexed_dataset_(data_prefix, data_impl, skip_warmup): + + print_rank_0(' > building dataset index ...') + + start_time = time.time() + indexed_dataset = make_indexed_dataset(data_prefix, + data_impl, + skip_warmup) + assert indexed_dataset.sizes.shape[0] == indexed_dataset.doc_idx[-1] + print_rank_0(' > finished creating indexed dataset in {:4f} ' + 'seconds'.format(time.time() - start_time)) + + print_rank_0(' > indexed dataset stats:') + print_rank_0(' number of documents: {}'.format( + indexed_dataset.doc_idx.shape[0] - 1)) + print_rank_0(' number of sentences: {}'.format( + indexed_dataset.sizes.shape[0])) + + return indexed_dataset + + +def get_train_valid_test_split_(splits_string, size): + """ Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(',') != -1: + splits = [float(s) for s in splits_string.split(',')] + elif splits_string.find('/') != -1: + splits = [float(s) for s in splits_string.split('/')] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.) 
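+ # Example: splits_string "949,50,1" normalizes to [0.949, 0.05, 0.001]; the
+ # cumulative splits_index built below is then [0, d1, d2, size] in documents.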
+ splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def get_samples_mapping(indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + name, + binary_head): + """Get a list that maps a sample index to a starting + sentence index, end sentence index, and length""" + + if not num_epochs: + if not max_num_samples: + raise ValueError("Need to specify either max_num_samples " + "or num_epochs") + num_epochs = np.iinfo(np.int32).max - 1 + if not max_num_samples: + max_num_samples = np.iinfo(np.int64).max - 1 + + # Filename of the index mapping + indexmap_filename = data_prefix + indexmap_filename += '_{}_indexmap'.format(name) + if num_epochs != (np.iinfo(np.int32).max - 1): + indexmap_filename += '_{}ep'.format(num_epochs) + if max_num_samples != (np.iinfo(np.int64).max - 1): + indexmap_filename += '_{}mns'.format(max_num_samples) + indexmap_filename += '_{}msl'.format(max_seq_length) + indexmap_filename += '_{:0.2f}ssp'.format(short_seq_prob) + indexmap_filename += '_{}s'.format(seed) + indexmap_filename += '.npy' + + # This should be a barrier but nccl barrier assumes + # device_index=rank which is not the case for model + # parallel case + # ganruyi comment + # counts = torch.cuda.LongTensor([1]) + # torch.distributed.all_reduce( + # counts, group=mpu.get_data_parallel_group()) + # torch.distributed.all_reduce( + # counts, group=mpu.get_pipeline_model_parallel_group()) + # assert counts[0].item() == ( + # torch.distributed.get_world_size() // + # torch.distributed.get_world_size( + # group=mpu.get_tensor_model_parallel_group())) + + # Load indexed dataset. + print_rank_0(' > loading indexed mapping from {}'.format( + indexmap_filename)) + start_time = time.time() + samples_mapping = np.load( + indexmap_filename, allow_pickle=True, mmap_mode='r') + print_rank_0(' loaded indexed file in {:3.3f} seconds'.format( + time.time() - start_time)) + print_rank_0(' total number of samples: {}'.format( + samples_mapping.shape[0])) + + return samples_mapping diff --git a/fengshen/data/megatron_dataloader/helpers.cpp b/fengshen/data/megatron_dataloader/helpers.cpp new file mode 100644 index 0000000000000000000000000000000000000000..31277dd1ce3a449bf962ba5a4d6343e7a9c0b5f9 --- /dev/null +++ b/fengshen/data/megatron_dataloader/helpers.cpp @@ -0,0 +1,794 @@ +/* + coding=utf-8 + Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+ */ + +/* Helper methods for fast index mapping builds */ + +#include +#include +#include +#include +#include +#include +#include +#include + +namespace py = pybind11; +using namespace std; + +const int32_t LONG_SENTENCE_LEN = 512; + +void build_blending_indices(py::array_t &dataset_index, + py::array_t &dataset_sample_index, + const py::array_t &weights, + const int32_t num_datasets, + const int64_t size, const bool verbose) +{ + /* Given multiple datasets and a weighting array, build samples + such that it follows those wieghts.*/ + + if (verbose) + { + std::cout << "> building indices for blendable datasets ..." << std::endl; + } + + // Get the pointer access without the checks. + auto dataset_index_ptr = dataset_index.mutable_unchecked<1>(); + auto dataset_sample_index_ptr = dataset_sample_index.mutable_unchecked<1>(); + auto weights_ptr = weights.unchecked<1>(); + + // Initialize buffer for number of samples used for each dataset. + int64_t current_samples[num_datasets]; + for (int64_t i = 0; i < num_datasets; ++i) + { + current_samples[i] = 0; + } + + // For each sample: + for (int64_t sample_idx = 0; sample_idx < size; ++sample_idx) + { + + // Determine where the max error in sampling is happening. + auto sample_idx_double = std::max(static_cast(sample_idx), 1.0); + int64_t max_error_index = 0; + double max_error = weights_ptr[0] * sample_idx_double - + static_cast(current_samples[0]); + for (int64_t dataset_idx = 1; dataset_idx < num_datasets; ++dataset_idx) + { + double error = weights_ptr[dataset_idx] * sample_idx_double - + static_cast(current_samples[dataset_idx]); + if (error > max_error) + { + max_error = error; + max_error_index = dataset_idx; + } + } + + // Populate the indices. + dataset_index_ptr[sample_idx] = static_cast(max_error_index); + dataset_sample_index_ptr[sample_idx] = current_samples[max_error_index]; + + // Update the total samples. + current_samples[max_error_index] += 1; + } + + // print info + if (verbose) + { + std::cout << " > sample ratios:" << std::endl; + for (int64_t dataset_idx = 0; dataset_idx < num_datasets; ++dataset_idx) + { + auto ratio = static_cast(current_samples[dataset_idx]) / + static_cast(size); + std::cout << " dataset " << dataset_idx << ", input: " << weights_ptr[dataset_idx] << ", achieved: " << ratio << std::endl; + } + } +} + +py::array build_sample_idx(const py::array_t &sizes_, + const py::array_t &doc_idx_, + const int32_t seq_length, + const int32_t num_epochs, + const int64_t tokens_per_epoch) +{ + /* Sample index (sample_idx) is used for gpt2 like dataset for which + the documents are flattened and the samples are built based on this + 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2] + where [..., 0] contains the index into `doc_idx` and [..., 1] is the + starting offset in that document.*/ + + // Consistency checks. + assert(seq_length > 1); + assert(num_epochs > 0); + assert(tokens_per_epoch > 1); + + // Remove bound checks. + auto sizes = sizes_.unchecked<1>(); + auto doc_idx = doc_idx_.unchecked<1>(); + + // Mapping and it's length (1D). 
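+ // Each sample consumes seq_length tokens plus one extra for the shifted
+ // target, with consecutive samples sharing one boundary token, hence
+ // (num_epochs * tokens_per_epoch - 1) / seq_length samples in total.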
+ int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length; + int32_t *sample_idx = new int32_t[2 * (num_samples + 1)]; + + cout << " using:" << endl + << std::flush; + cout << " number of documents: " << doc_idx_.shape(0) / num_epochs << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " sequence length: " << seq_length << endl + << std::flush; + cout << " total number of samples: " << num_samples << endl + << std::flush; + + // Index into sample_idx. + int64_t sample_index = 0; + // Index into doc_idx. + int64_t doc_idx_index = 0; + // Begining offset for each document. + int32_t doc_offset = 0; + // Start with first document and no offset. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + + while (sample_index <= num_samples) + { + // Start with a fresh sequence. + int32_t remaining_seq_length = seq_length + 1; + while (remaining_seq_length != 0) + { + // Get the document length. + auto doc_id = doc_idx[doc_idx_index]; + auto doc_length = sizes[doc_id] - doc_offset; + // And add it to the current sequence. + remaining_seq_length -= doc_length; + // If we have more than a full sequence, adjust offset and set + // remaining length to zero so we return from the while loop. + // Note that -1 here is for the same reason we have -1 in + // `_num_epochs` calculations. + if (remaining_seq_length <= 0) + { + doc_offset += (remaining_seq_length + doc_length - 1); + remaining_seq_length = 0; + } + else + { + // Otherwise, start from the begining of the next document. + ++doc_idx_index; + doc_offset = 0; + } + } + // Record the sequence. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + } + + // Method to deallocate memory. + py::capsule free_when_done(sample_idx, [](void *mem_) + { + int32_t *mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(int32_t); + return py::array(std::vector{num_samples + 1, 2}, // shape + {2 * byte_size, byte_size}, // C-style contiguous strides + sample_idx, // the data pointer + free_when_done); // numpy array references +} + +inline int32_t get_target_sample_len(const int32_t short_seq_ratio, + const int32_t max_length, + std::mt19937 &rand32_gen) +{ + /* Training sample length. */ + if (short_seq_ratio == 0) + { + return max_length; + } + const auto random_number = rand32_gen(); + if ((random_number % short_seq_ratio) == 0) + { + return 2 + random_number % (max_length - 1); + } + return max_length; +} + +template +py::array build_mapping_impl(const py::array_t &docs_, + const py::array_t &sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const double short_seq_prob, + const int32_t seed, + const bool verbose, + const int32_t min_num_sent) +{ + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(short_seq_prob >= 0.0); + assert(short_seq_prob <= 1.0); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + + // For efficiency, convert probability to ratio. Note: rand() generates int. 
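+ // Example: short_seq_prob = 0.1 gives short_seq_ratio = 10, so
+ // get_target_sample_len() returns a shortened target for roughly 1 in 10 samples.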
+ int32_t short_seq_ratio = 0; + if (short_seq_prob > 0) + { + short_seq_ratio = static_cast(round(1.0 / short_seq_prob)); + } + + if (verbose) + { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl + << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " short sequence probability: " << short_seq_prob << endl + << std::flush; + cout << " short sequence ration (1/prob): " << short_seq_ratio << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and it's length (1D). + int64_t num_samples = -1; + DocIdx *maps = NULL; + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) + { + + // Set the seed so both iterations produce the same results. + std::mt19937 rand32_gen(seed); + + // Set the flag on second iteration. + second = (iteration == 1); + + // Counters: + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + + // Current map index. + uint64_t map_index = 0; + + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) + { + if (map_index >= max_num_samples) + { + if (verbose && (!second)) + { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) + { + + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + + // At the begining of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) + { + if (num_remain_sent == 0) + { + ++empty_docs; + } + if (num_remain_sent == 1) + { + ++one_sent_docs; + } + } + + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent > 1) + { + for (auto sent_index = sent_index_first; + sent_index < sent_index_last; ++sent_index) + { + if (sizes[sent_index] > LONG_SENTENCE_LEN) + { + if ((epoch == 0) && (!second)) + { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + + // If we have more than two sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) + { + + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + auto target_seq_len = get_target_sample_len(short_seq_ratio, + max_seq_length, + rand32_gen); + + // Loop through sentences. + for (auto sent_index = sent_index_first; + sent_index < sent_index_last; ++sent_index) + { + + // Add the size and number of sentences. 
+ seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and if not only one sentence is left in the document. + // and if we have at least two sentneces. + // and if we have reached end of the document. + if (((seq_len >= target_seq_len) && + (num_remain_sent > 1) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) + { + + // Check for overflow. + if ((3 * map_index + 2) > + std::numeric_limits::max()) + { + cout << "number of samples exceeded maximum " + << "allowed by type int64: " + << std::numeric_limits::max() + << endl; + throw std::overflow_error("Number of samples"); + } + + // Populate the map. + if (second) + { + const auto map_index_0 = 3 * map_index; + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(target_seq_len); + } + + // Update indices / counters. + ++map_index; + prev_start_index = sent_index + 1; + target_seq_len = get_target_sample_len(short_seq_ratio, + max_seq_length, + rand32_gen); + seq_len = 0; + num_sent = 0; + } + + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) + { + if (verbose) + { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[3 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) + { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 3 * i; + const auto j0 = 3 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void *mem_) + { + DocIdx *mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 3}, // shape + {3 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + +py::array build_mapping(const py::array_t &docs_, + const py::array_t &sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const double short_seq_prob, + const int seed, + const bool verbose, + const int32_t min_num_sent) +{ + + if (sizes_.size() > std::numeric_limits::max()) + { + if (verbose) + { + cout << " using uint64 for data mapping..." << endl + << std::flush; + } + return build_mapping_impl(docs_, sizes_, num_epochs, + max_num_samples, max_seq_length, + short_seq_prob, seed, verbose, + min_num_sent); + } + else + { + if (verbose) + { + cout << " using uint32 for data mapping..." 
<< endl + << std::flush; + } + return build_mapping_impl(docs_, sizes_, num_epochs, + max_num_samples, max_seq_length, + short_seq_prob, seed, verbose, + min_num_sent); + } +} + +template +py::array build_blocks_mapping_impl(const py::array_t &docs_, + const py::array_t &sizes_, + const py::array_t &titles_sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const int32_t seed, + const bool verbose, + const bool use_one_sent_blocks) +{ + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + auto titles_sizes = titles_sizes_.unchecked<1>(); + + if (verbose) + { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl + << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and its length (1D). + int64_t num_samples = -1; + DocIdx *maps = NULL; + + // Acceptable number of sentences per block. + int min_num_sent = 2; + if (use_one_sent_blocks) + { + min_num_sent = 1; + } + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) + { + + // Set the flag on second iteration. + second = (iteration == 1); + + // Current map index. + uint64_t map_index = 0; + + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) + { + // assign every block a unique id + int32_t block_id = 0; + + if (map_index >= max_num_samples) + { + if (verbose && (!second)) + { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) + { + + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + const auto target_seq_len = max_seq_length - titles_sizes[doc]; + + // At the begining of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) + { + if (num_remain_sent == 0) + { + ++empty_docs; + } + if (num_remain_sent == 1) + { + ++one_sent_docs; + } + } + // Detect documents with long sentences. 
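+ // A document containing any sentence longer than LONG_SENTENCE_LEN (512)
+ // tokens is skipped entirely and contributes no blocks.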
+ bool contains_long_sentence = false; + if (num_remain_sent >= min_num_sent) + { + for (auto sent_index = sent_index_first; + sent_index < sent_index_last; ++sent_index) + { + if (sizes[sent_index] > LONG_SENTENCE_LEN) + { + if ((epoch == 0) && (!second)) + { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have enough sentences and no long sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) + { + + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + + // Loop through sentences. + for (auto sent_index = sent_index_first; + sent_index < sent_index_last; ++sent_index) + { + + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and there are an acceptable number of sentences left + // and if we have at least the minimum number of sentences. + // or if we have reached end of the document. + if (((seq_len >= target_seq_len) && + (num_remain_sent >= min_num_sent) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) + { + + // Populate the map. + if (second) + { + const auto map_index_0 = 4 * map_index; + // Each sample has 4 items: the starting sentence index, ending sentence index, + // the index of the document from which the block comes (used for fetching titles) + // and the unique id of the block (used for creating block indexes) + + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(doc); + maps[map_index_0 + 3] = static_cast(block_id); + } + + // Update indices / counters. + ++map_index; + ++block_id; + prev_start_index = sent_index + 1; + seq_len = 0; + num_sent = 0; + } + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) + { + if (verbose) + { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[4 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) + { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 4 * i; + const auto j0 = 4 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + swap(maps[i0 + 3], maps[j0 + 3]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void *mem_) + { + DocIdx *mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. 
+ const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 4}, // shape + {4 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + +py::array build_blocks_mapping(const py::array_t &docs_, + const py::array_t &sizes_, + const py::array_t &titles_sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const int seed, + const bool verbose, + const bool use_one_sent_blocks) +{ + + if (sizes_.size() > std::numeric_limits::max()) + { + if (verbose) + { + cout << " using uint64 for data mapping..." << endl + << std::flush; + } + return build_blocks_mapping_impl(docs_, sizes_, titles_sizes_, + num_epochs, max_num_samples, max_seq_length, seed, verbose, use_one_sent_blocks); + } + else + { + if (verbose) + { + cout << " using uint32 for data mapping..." << endl + << std::flush; + } + return build_blocks_mapping_impl(docs_, sizes_, titles_sizes_, + num_epochs, max_num_samples, max_seq_length, seed, verbose, use_one_sent_blocks); + } +} + +PYBIND11_MODULE(helpers, m) +{ + m.def("build_mapping", &build_mapping); + m.def("build_blocks_mapping", &build_blocks_mapping); + m.def("build_sample_idx", &build_sample_idx); + m.def("build_blending_indices", &build_blending_indices); +} diff --git a/fengshen/data/megatron_dataloader/indexed_dataset.py b/fengshen/data/megatron_dataloader/indexed_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..9eba91d303ab11884d993b707ca1d166f540588b --- /dev/null +++ b/fengshen/data/megatron_dataloader/indexed_dataset.py @@ -0,0 +1,585 @@ +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + + +# copied from fairseq/fairseq/data/indexed_dataset.py +# Removed IndexedRawTextDataset since it relied on Fairseq dictionary +# other slight modifications to remove fairseq dependencies +# Added document index to index file and made it accessible. +# An empty sentence no longer separates documents. 
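+#
+# Usage sketch (illustrative only; the file names, vocab size and token ids
+# below are assumptions, not taken from this repository):
+#
+#     import torch
+#     builder = make_builder('demo.bin', impl='mmap', vocab_size=50000)
+#     builder.add_item(torch.tensor([101, 2769, 872, 102]))  # one tokenized sentence
+#     builder.end_document()                                 # mark a document boundary
+#     builder.finalize('demo.idx')
+#
+#     ds = make_dataset('demo', impl='mmap', skip_warmup=True)
+#     print(ds[0])  # numpy array of token ids for the first sentence
+#
+# The resulting .bin/.idx pair is the sentence-level index that the compiled
+# Megatron-style mapping helpers above consume when building sample mappings.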
+ +from functools import lru_cache +import os +import shutil +import struct +from itertools import accumulate + +import numpy as np +import torch +from fengshen.data.megatron_dataloader.utils import print_rank_0 + + +def __best_fitting_dtype(vocab_size=None): + if vocab_size is not None and vocab_size < 65500: + return np.uint16 + else: + return np.int32 + + +def get_available_dataset_impl(): + return ['lazy', 'cached', 'mmap'] + + +def infer_dataset_impl(path): + if IndexedDataset.exists(path): + with open(index_file_path(path), 'rb') as f: + magic = f.read(8) + if magic == IndexedDataset._HDR_MAGIC: + return 'cached' + elif magic == MMapIndexedDataset.Index._HDR_MAGIC[:8]: + return 'mmap' + else: + return None + else: + print(f"Dataset does not exist: {path}") + print("Path should be a basename that both .idx and " + ".bin can be appended to get full filenames.") + return None + + +def make_builder(out_file, impl, vocab_size=None): + if impl == 'mmap': + return MMapIndexedDatasetBuilder(out_file, + dtype=__best_fitting_dtype(vocab_size)) + else: + return IndexedDatasetBuilder(out_file) + + +def make_dataset(path, impl, skip_warmup=False): + if not IndexedDataset.exists(path): + print(f"Dataset does not exist: {path}") + print("Path should be a basename that both .idx " + "and .bin can be appended to get full filenames.") + return None + if impl == 'infer': + impl = infer_dataset_impl(path) + if impl == 'lazy' and IndexedDataset.exists(path): + return IndexedDataset(path) + elif impl == 'cached' and IndexedDataset.exists(path): + return IndexedCachedDataset(path) + elif impl == 'mmap' and MMapIndexedDataset.exists(path): + return MMapIndexedDataset(path, skip_warmup) + print(f"Unknown dataset implementation: {impl}") + return None + + +def dataset_exists(path, impl): + if impl == 'mmap': + return MMapIndexedDataset.exists(path) + else: + return IndexedDataset.exists(path) + + +def read_longs(f, n): + a = np.empty(n, dtype=np.int64) + f.readinto(a) + return a + + +def write_longs(f, a): + f.write(np.array(a, dtype=np.int64)) + + +dtypes = { + 1: np.uint8, + 2: np.int8, + 3: np.int16, + 4: np.int32, + 5: np.int64, + 6: np.float, + 7: np.double, + 8: np.uint16 +} + + +def code(dtype): + for k in dtypes.keys(): + if dtypes[k] == dtype: + return k + raise ValueError(dtype) + + +def index_file_path(prefix_path): + return prefix_path + '.idx' + + +def data_file_path(prefix_path): + return prefix_path + '.bin' + + +def create_doc_idx(sizes): + doc_idx = [0] + for i, s in enumerate(sizes): + if s == 0: + doc_idx.append(i + 1) + return doc_idx + + +class IndexedDataset(torch.utils.data.Dataset): + """Loader for IndexedDataset""" + _HDR_MAGIC = b'TNTIDX\x00\x00' + + def __init__(self, path): + super().__init__() + self.path = path + self.data_file = None + self.read_index(path) + + def read_index(self, path): + with open(index_file_path(path), 'rb') as f: + magic = f.read(8) + assert magic == self._HDR_MAGIC, ( + 'Index file doesn\'t match expected format. ' + 'Make sure that --dataset-impl is configured properly.' 
+ ) + version = f.read(8) + assert struct.unpack('= self._len: + raise IndexError('index out of range') + + def __del__(self): + if self.data_file: + self.data_file.close() + + # @lru_cache(maxsize=8) + def __getitem__(self, idx): + if not self.data_file: + self.read_data(self.path) + if isinstance(idx, int): + i = idx + self.check_index(i) + tensor_size = self.sizes[ + self.dim_offsets[i]:self.dim_offsets[i + 1]] + a = np.empty(tensor_size, dtype=self.dtype) + self.data_file.seek(self.data_offsets[i] * self.element_size) + self.data_file.readinto(a) + return a + elif isinstance(idx, slice): + start, stop, step = idx.indices(len(self)) + if step != 1: + raise ValueError( + "Slices into indexed_dataset must be contiguous") + sizes = self.sizes[self.dim_offsets[start]:self.dim_offsets[stop]] + size = sum(sizes) + a = np.empty(size, dtype=self.dtype) + self.data_file.seek(self.data_offsets[start] * self.element_size) + self.data_file.readinto(a) + offsets = list(accumulate(sizes)) + sents = np.split(a, offsets[:-1]) + return sents + + def __len__(self): + return self._len + + def num_tokens(self, index): + return self.sizes[index] + + def size(self, index): + return self.sizes[index] + + @staticmethod + def exists(path): + return ( + os.path.exists(index_file_path(path)) and os.path.exists( + data_file_path(path)) + ) + + @property + def supports_prefetch(self): + return False # avoid prefetching to save memory + + +class IndexedCachedDataset(IndexedDataset): + + def __init__(self, path): + super().__init__(path) + self.cache = None + self.cache_index = {} + + @property + def supports_prefetch(self): + return True + + def prefetch(self, indices): + if all(i in self.cache_index for i in indices): + return + if not self.data_file: + self.read_data(self.path) + indices = sorted(set(indices)) + total_size = 0 + for i in indices: + total_size += self.data_offsets[i + 1] - self.data_offsets[i] + self.cache = np.empty(total_size, dtype=self.dtype) + ptx = 0 + self.cache_index.clear() + for i in indices: + self.cache_index[i] = ptx + size = self.data_offsets[i + 1] - self.data_offsets[i] + a = self.cache[ptx: ptx + size] + self.data_file.seek(self.data_offsets[i] * self.element_size) + self.data_file.readinto(a) + ptx += size + if self.data_file: + # close and delete data file after prefetch so we can pickle + self.data_file.close() + self.data_file = None + + # @lru_cache(maxsize=8) + def __getitem__(self, idx): + if isinstance(idx, int): + i = idx + self.check_index(i) + tensor_size = self.sizes[ + self.dim_offsets[i]:self.dim_offsets[i + 1]] + a = np.empty(tensor_size, dtype=self.dtype) + ptx = self.cache_index[i] + np.copyto(a, self.cache[ptx: ptx + a.size]) + return a + elif isinstance(idx, slice): + # Hack just to make this work, can optimizer later if necessary + sents = [] + for i in range(*idx.indices(len(self))): + sents.append(self[i]) + return sents + + +class IndexedDatasetBuilder(object): + element_sizes = { + np.uint8: 1, + np.int8: 1, + np.int16: 2, + np.int32: 4, + np.int64: 8, + np.float: 4, + np.double: 8 + } + + def __init__(self, out_file, dtype=np.int32): + self.out_file = open(out_file, 'wb') + self.dtype = dtype + self.data_offsets = [0] + self.dim_offsets = [0] + self.sizes = [] + self.element_size = self.element_sizes[self.dtype] + self.doc_idx = [0] + + def add_item(self, tensor): + bytes = self.out_file.write(np.array(tensor.numpy(), dtype=self.dtype)) + self.data_offsets.append( + self.data_offsets[-1] + bytes / self.element_size) + for s in tensor.size(): + 
self.sizes.append(s) + self.dim_offsets.append(self.dim_offsets[-1] + len(tensor.size())) + + def end_document(self): + self.doc_idx.append(len(self.sizes)) + + def merge_file_(self, another_file): + index = IndexedDataset(another_file) + assert index.dtype == self.dtype + + begin = self.data_offsets[-1] + for offset in index.data_offsets[1:]: + self.data_offsets.append(begin + offset) + self.sizes.extend(index.sizes) + begin = self.dim_offsets[-1] + for dim_offset in index.dim_offsets[1:]: + self.dim_offsets.append(begin + dim_offset) + + with open(data_file_path(another_file), 'rb') as f: + while True: + data = f.read(1024) + if data: + self.out_file.write(data) + else: + break + + def finalize(self, index_file): + self.out_file.close() + index = open(index_file, 'wb') + index.write(b'TNTIDX\x00\x00') + index.write(struct.pack(' None: + return super().setup(stage) + + def train_dataloader(self): + return DataLoader( + self.train_dataset, + batch_size=self.hparams.train_batchsize, + shuffle=True, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) + + def val_dataloader(self): + return DataLoader( + self.valid_dataset, + batch_size=self.hparams.eval_batchsize, + shuffle=True, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) + + def test_dataloader(self): + return DataLoader( + self.test_dataset, + batch_size=self.hparams.test_batchsize, + shuffle=True, + num_workers=self.hparams.num_workers, + collate_fn=self.collate_fn, + ) diff --git a/fengshen/data/mmap_dataloader/mmap_index_dataset.py b/fengshen/data/mmap_dataloader/mmap_index_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..53b290c12a8825a483f14ca0535a813b36477fa1 --- /dev/null +++ b/fengshen/data/mmap_dataloader/mmap_index_dataset.py @@ -0,0 +1,53 @@ +import numpy as np +import torch +from typing import List +from torch.utils.data import Dataset + + +class MMapIndexDataset(Dataset): + # datapaths 是所有的内存映射文件的路径 + # input_tensor_name 是输入的tensor的名字 例如 ['input_ids'] 会存储在对应的文件里面 + def __init__(self, datapaths: List[str], input_tensor_name: List[str]): + dict_idx_fp = {} + dict_bin_fp = {} + idx_len = [] + for tensor_name in input_tensor_name: + idx_fp = [] + bin_fp = [] + len = 0 + for data_path in datapaths: + idx_fp += [np.load( + data_path + '_' + tensor_name + '.npy', mmap_mode='r')] + bin_fp += [np.memmap( + data_path + '_' + tensor_name + '.bin', + dtype='long', + mode='r')] + len += idx_fp[-1].shape[0] + idx_len += [idx_fp[-1].shape[0]] + dict_idx_fp[tensor_name] = idx_fp + dict_bin_fp[tensor_name] = bin_fp + #  通常情况下不同的tensor的长度是一样的 + self._len = len + + self._input_tensor_name = input_tensor_name + self._dict_idx_fp = dict_idx_fp + self._dict_bin_fp = dict_bin_fp + self._idx_len = idx_len + + def __len__(self): + return self._len + + def __getitem__(self, idx): + sample = {} + for i in range(len(self._idx_len)): + if idx >= self._idx_len[i]: + idx -= self._idx_len[i] + else: + break + for tensor_name in self._input_tensor_name: + sample[tensor_name] = torch.tensor(self._dict_bin_fp[tensor_name][i][ + self._dict_idx_fp[tensor_name][i][idx, 0]: + self._dict_idx_fp[tensor_name][i][idx, 1] + ], dtype=torch.long) + # print(sample) + return sample diff --git a/fengshen/data/preprocess.py b/fengshen/data/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..9bad5790a5799b96f2e164d825c0b1f8ec0c2dfb --- /dev/null +++ b/fengshen/data/preprocess.py @@ -0,0 +1 @@ +# coding=utf-8 diff --git a/fengshen/data/t5_dataloader/t5_datasets.py 
b/fengshen/data/t5_dataloader/t5_datasets.py new file mode 100644 index 0000000000000000000000000000000000000000..4fd55b8d0be1dd61881b8c782a7eea7a6123efdd --- /dev/null +++ b/fengshen/data/t5_dataloader/t5_datasets.py @@ -0,0 +1,562 @@ +# coding=utf8 +import json +from torch.utils.data import Dataset, DataLoader +from tqdm import tqdm +from transformers import BertTokenizer, MT5Config, MT5Tokenizer, BatchEncoding +import torch +import pytorch_lightning as pl +import numpy as np +from itertools import chain +import sys +sys.path.append('../../') + + +def compute_input_and_target_lengths(inputs_length, noise_density, mean_noise_span_length): + """This function is copy of `random_spans_helper `__ . + Training parameters to avoid padding with random_spans_noise_mask. + When training a model with random_spans_noise_mask, we would like to set the other + training hyperparmeters in a way that avoids padding. + This function helps us compute these hyperparameters. + We assume that each noise span in the input is replaced by extra_tokens_per_span_inputs sentinel tokens, + and each non-noise span in the targets is replaced by extra_tokens_per_span_targets sentinel tokens. + This function tells us the required number of tokens in the raw example (for split_tokens()) + as well as the length of the encoded targets. Note that this function assumes + the inputs and targets will have EOS appended and includes that in the reported length. + Args: + inputs_length: an integer - desired length of the tokenized inputs sequence + noise_density: a float + mean_noise_span_length: a float + Returns: + tokens_length: length of original text in tokens + targets_length: an integer - length in tokens of encoded targets sequence + """ + + def _tokens_length_to_inputs_length_targets_length(tokens_length): + num_noise_tokens = int(round(tokens_length * noise_density)) + num_nonnoise_tokens = tokens_length - num_noise_tokens + num_noise_spans = int(round(num_noise_tokens / mean_noise_span_length)) + # inputs contain all nonnoise tokens, sentinels for all noise spans + # and one EOS token. + _input_length = num_nonnoise_tokens + num_noise_spans + 1 + _output_length = num_noise_tokens + num_noise_spans + 1 + return _input_length, _output_length + + tokens_length = inputs_length + + while _tokens_length_to_inputs_length_targets_length(tokens_length + 1)[0] <= inputs_length: + tokens_length += 1 + + inputs_length, targets_length = _tokens_length_to_inputs_length_targets_length( + tokens_length) + + # minor hack to get the targets length to be equal to inputs length + # which is more likely to have been set to a nice round number. + if noise_density == 0.5 and targets_length > inputs_length: + tokens_length -= 1 + targets_length -= 1 + return tokens_length, targets_length + + +class UnsuperviseT5Dataset(Dataset): + ''' + Dataset Used for T5 unsuprvise pretrain. 
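+    The dataset only tokenizes raw text and groups it into fixed-length chunks of
+    expanded_inputs_length; the actual span corruption (sentinel masking) is applied
+    later in UnsuperviseT5DataModel.collate_fn.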
+ load_data_type = 0: load raw data from data path and save tokenized data, call function load_data + load_data_type = 1: load tokenized data from path, call function load_tokenized_data + load_data_type = 2: load tokenized data from memery data, call function load_tokenized_memory_data + ''' + + def __init__(self, data_path, args, load_data_type=0, data=None): + super().__init__() + + if args.tokenizer_type == 't5_tokenizer': + if args.new_vocab_path is not None: + self.tokenizer = MT5Tokenizer.from_pretrained(args.new_vocab_path) + else: + self.tokenizer = MT5Tokenizer.from_pretrained(args.pretrained_model_path) + else: + self.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path) + self.noise_density = 0.15 + self.mean_noise_span_length = 3 + self.text_column_name = args.text_column_name + self.dataset_num_workers = args.dataset_num_workers + self.max_seq_length = args.max_seq_length + self.remove_columns = args.remove_columns + # whether load tokenieze data + self.load_data_type = load_data_type + + if self.load_data_type == 0: + # T5-like span masked language modeling will fuse consecutively masked tokens to a single sentinel token. + # To ensure that the input length is `max_seq_length`, we need to increase the maximum length + # according to `mlm_probability` and `mean_noise_span_length`. + # We can also define the label length accordingly. + self.expanded_inputs_length, self.targets_length = compute_input_and_target_lengths( + inputs_length=self.max_seq_length, + noise_density=self.noise_density, + mean_noise_span_length=self.mean_noise_span_length, + ) + print('self.expanded_inputs_length, self.targets_length:{},{}'.format( + self.expanded_inputs_length, self.targets_length)) + self.data = self.load_data(data_path) + elif self.load_data_type == 1: + self.data = self.load_tokenized_data(data_path) + else: + assert data is not None + self.data = self.load_tokenized_memory_data(data) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path): + # TODO: large data process + from data.fs_datasets import load_dataset + samples = load_dataset( + # samples = datasets.load_from_disk(data_path)['train'] + data_path, num_proc=self.dataset_num_workers)['train'] + # print(samples) + tokenized_datasets = samples.map( + self.tokenize_function, + batched=True, + num_proc=self.dataset_num_workers, + # load_from_cache_file=not data_args.overwrite_cache, + ).map( + batched=True, + num_proc=self.dataset_num_workers, + remove_columns=self.remove_columns) + # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a + # remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value + # might be slower to preprocess. + # + # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: + # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + tokenized_datasets = tokenized_datasets.map( + self.group_texts, + batched=True, + num_proc=self.dataset_num_workers, + # load_from_cache_file=not data_args.overwrite_cache, + ) + return tokenized_datasets + ''' + The function load tokenized data saved from load_data function. 
+ ''' + + def load_tokenized_data(self, data_path): + from data.fs_datasets import load_dataset + samples = load_dataset(data_path)['train'] + return samples + + def load_tokenized_memory_data(self, data): + return data + + # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts. + # Since we make sure that all sequences are of the same length, no attention_mask is needed. + def tokenize_function(self, examples): + # 这里add_special_tokens=False,避免句子中间出现eos + return self.tokenizer(examples[self.text_column_name], + add_special_tokens=False, + return_attention_mask=False) + + # Main data processing function that will concatenate all texts from our dataset + # and generate chunks of expanded_inputs_length. + def group_texts(self, examples): + # Concatenate all texts. + concatenated_examples = { + k: list(chain(*examples[k])) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can + # customize this part to your needs. + if total_length >= self.expanded_inputs_length: + total_length = ( + total_length // self.expanded_inputs_length) * self.expanded_inputs_length + # Split by chunks of max_len. + result = { + k: [t[i: i + self.expanded_inputs_length] + for i in range(0, total_length, self.expanded_inputs_length)] + for k, t in concatenated_examples.items() + } + return result + + +class UnsuperviseT5DataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('UnsuperviseT5DataModel') + parser.add_argument('--dataset_num_workers', default=8, type=int) + parser.add_argument('--dataloader_num_workers', default=4, type=int) + parser.add_argument( + '--train_data_path', default='wudao_180g_mt5_tokenized', type=str) + parser.add_argument('--train_batchsize', default=2, type=int) + parser.add_argument('--valid_batchsize', default=2, type=int) + parser.add_argument('--train_split_size', default=None, type=float) + parser.add_argument('--tokenizer_type', default='t5_tokenizer', choices=['t5_tokenizer', 'bert_tokenizer']) + parser.add_argument('--text_column_name', default='text') + parser.add_argument('--remove_columns', nargs='+', default=[]) + return parent_args + + def __init__(self, args): + super().__init__() + self.save_hyperparameters(args) + if args.train_split_size is not None: + from data.fs_datasets import load_dataset + data_splits = load_dataset(args.train_data_path, num_proc=args.dataset_num_workers) + train_split = data_splits['train'] + test_split = data_splits['test'] + print('train:', train_split, '\ntest_data:', test_split) + self.train_dataset = UnsuperviseT5Dataset('', args, load_data_type=2, data=train_split) + self.test_dataset = UnsuperviseT5Dataset('', args, load_data_type=2, data=test_split) + else: + self.train_data = UnsuperviseT5Dataset(args.train_data_path, args, load_data_type=1) + + self.config = MT5Config.from_pretrained(args.pretrained_model_path) + self.noise_density = 0.15 + self.mean_noise_span_length = 3 + self.pad_token_id = self.config.pad_token_id + self.decoder_start_token_id = self.config.decoder_start_token_id + self.eos_token_id = self.config.eos_token_id + self.vocab_size = self.config.vocab_size + self.max_seq_length = args.max_seq_length + # 因为加载旧的spm里面已经包括了exrta_ids,但是T5Tokenizer会在spm的基础上再增加100个extra_ids,所以需要指定extra_ids=0 + if args.tokenizer_type == 't5_tokenizer' and 
args.new_vocab_path is not None: + self.tokenizer = MT5Tokenizer.from_pretrained(args.new_vocab_path, extra_ids=0) + # 如果是刚开始加载mt5,需要更新vocab_size为提取中英词之后的new_vocab_size + self.vocab_size = len(self.tokenizer) + + # T5-like span masked language modeling will fuse consecutively masked tokens to a single sentinel token. + # To ensure that the input length is `max_seq_length`, we need to increase the maximum length + # according to `mlm_probability` and `mean_noise_span_length`. We can also define the label length accordingly. + self.expanded_inputs_length, self.targets_length = compute_input_and_target_lengths( + inputs_length=self.max_seq_length, + noise_density=self.noise_density, + mean_noise_span_length=self.mean_noise_span_length, + ) + + def train_dataloader(self): + from fengshen.data.universal_datamodule.universal_sampler import PretrainingSampler + from fengshen.data.universal_datamodule.universal_datamodule import get_consume_samples + # 采用自定义的sampler,确保继续训练能正确取到数据 + consumed_samples = get_consume_samples(self) + batch_sampler = PretrainingSampler( + total_samples=len(self.train_dataset), + consumed_samples=consumed_samples, + micro_batch_size=self.hparams.train_batchsize, + data_parallel_rank=self.trainer.global_rank, + data_parallel_size=self.trainer.world_size, + ) + return DataLoader( + self.train_dataset, + batch_sampler=batch_sampler, + pin_memory=True, + num_workers=self.hparams.dataloader_num_workers, + collate_fn=self.collate_fn, + ) + + def val_dataloader(self): + sampler = torch.utils.data.distributed.DistributedSampler( + self.test_dataset, shuffle=False) + return DataLoader( + self.test_dataset, + sampler=sampler, + shuffle=False, + batch_size=self.hparams.valid_batchsize, + pin_memory=True, + num_workers=self.hparams.dataloader_num_workers, + collate_fn=self.collate_fn, + ) + + def predict_dataloader(self): + sampler = torch.utils.data.distributed.DistributedSampler( + self.test_dataset, shuffle=False) + return DataLoader( + self.test_data, + sampler=sampler, + shuffle=False, + batch_size=self.hparams.valid_batchsize, + pin_memory=True, + num_workers=self.hparams.dataloader_num_workers, + collate_fn=self.collate_fn, + ) + + def collate_fn(self, examples): + # convert list to dict and tensorize input + batch = BatchEncoding( + {k: np.array([examples[i][k] for i in range(len(examples))]) + for k, v in examples[0].items()} + ) + + input_ids = np.array(batch['input_ids']) + batch_size, expanded_input_length = input_ids.shape + mask_indices = np.asarray([self.random_spans_noise_mask( + expanded_input_length) for i in range(batch_size)]) + labels_mask = ~mask_indices + + input_ids_sentinel = self.create_sentinel_ids( + mask_indices.astype(np.int8)) + labels_sentinel = self.create_sentinel_ids(labels_mask.astype(np.int8)) + + batch["input_ids"] = self.filter_input_ids( + input_ids, input_ids_sentinel) + batch["labels"] = self.filter_input_ids(input_ids, labels_sentinel) + + if batch["input_ids"].shape[-1] != self.max_seq_length: + raise ValueError( + f"`input_ids` are incorrectly preprocessed. `input_ids` length is \ + {batch['input_ids'].shape[-1]}, but should be {self.targets_length}." + ) + + if batch["labels"].shape[-1] != self.targets_length: + raise ValueError( + f"`labels` are incorrectly preprocessed. `labels` length is \ + {batch['labels'].shape[-1]}, but should be {self.targets_length}." 
+ ) + + batch["decoder_input_ids"] = self.shift_tokens_right( + batch["labels"], self.pad_token_id, self.decoder_start_token_id + ) + + for k, v in batch.items(): + batch[k] = torch.tensor(v) + # print(k, batch[k], self.tokenizer.batch_decode(batch[k]), '\n', flush=True) + return batch + + def create_sentinel_ids(self, mask_indices): + """ + Sentinel ids creation given the indices that should be masked. + The start indices of each mask are replaced by the sentinel ids in increasing + order. Consecutive mask indices to be deleted are replaced with `-1`. + """ + start_indices = mask_indices - \ + np.roll(mask_indices, 1, axis=-1) * mask_indices + start_indices[:, 0] = mask_indices[:, 0] + + sentinel_ids = np.where(start_indices != 0, np.cumsum( + start_indices, axis=-1), start_indices) + sentinel_ids = np.where( + sentinel_ids != 0, (self.vocab_size - sentinel_ids), 0) + sentinel_ids -= mask_indices - start_indices + + return sentinel_ids + + def filter_input_ids(self, input_ids, sentinel_ids): + """ + Puts sentinel mask on `input_ids` and fuse consecutive mask tokens into a single mask token by deleting. + This will reduce the sequence length from `expanded_inputs_length` to `input_length`. + """ + batch_size = input_ids.shape[0] + + input_ids_full = np.where(sentinel_ids != 0, sentinel_ids, input_ids) + # input_ids tokens and sentinel tokens are >= 0, tokens < 0 are + # masked tokens coming after sentinel tokens and should be removed + input_ids = input_ids_full[input_ids_full >= + 0].reshape((batch_size, -1)) + input_ids = np.concatenate( + [input_ids, np.full((batch_size, 1), self.eos_token_id, dtype=np.int32)], axis=-1 + ) + return input_ids + + # Copied from transformers.models.bart.modeling_flax_bart.shift_tokens_right + def shift_tokens_right(self, input_ids: np.array, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray: + """ + Shift input ids one token to the right. + """ + shifted_input_ids = np.zeros_like(input_ids) + shifted_input_ids[:, 1:] = input_ids[:, :-1] + shifted_input_ids[:, 0] = decoder_start_token_id + + shifted_input_ids = np.where( + shifted_input_ids == -100, pad_token_id, shifted_input_ids) + return shifted_input_ids + + def random_spans_noise_mask(self, length): + """This function is copy of `random_spans_helper `__ . + Noise mask consisting of random spans of noise tokens. + The number of noise tokens and the number of noise spans and non-noise spans + are determined deterministically as follows: + num_noise_tokens = round(length * noise_density) + num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length) + Spans alternate between non-noise and noise, beginning with non-noise. + Subject to the above restrictions, all masks are equally likely. + Args: + length: an int32 scalar (length of the incoming token sequence) + noise_density: a float - approximate density of output mask + mean_noise_span_length: a number + Returns: + a boolean tensor with shape [length] + """ + + orig_length = length + + num_noise_tokens = int(np.round(length * self.noise_density)) + # avoid degeneracy by ensuring positive numbers of noise and nonnoise tokens. 
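+        # Worked example (illustrative): with length = 568 (the expanded input
+        # length that compute_input_and_target_lengths returns for
+        # max_seq_length = 512), noise_density = 0.15 and mean_noise_span_length = 3,
+        # num_noise_tokens rounds to 85 and num_noise_spans to 28, i.e. 28 noise
+        # spans interleaved with 28 non-noise spans over the 568 positions.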
+ num_noise_tokens = min(max(num_noise_tokens, 1), length - 1) + num_noise_spans = int( + np.round(num_noise_tokens / self.mean_noise_span_length)) + + # avoid degeneracy by ensuring positive number of noise spans + num_noise_spans = max(num_noise_spans, 1) + num_nonnoise_tokens = length - num_noise_tokens + + # pick the lengths of the noise spans and the non-noise spans + def _random_segmentation(num_items, num_segments): + """Partition a sequence of items randomly into non-empty segments. + Args: + num_items: an integer scalar > 0 + num_segments: an integer scalar in [1, num_items] + Returns: + a Tensor with shape [num_segments] containing positive integers that add + up to num_items + """ + mask_indices = np.arange(num_items - 1) < (num_segments - 1) + np.random.shuffle(mask_indices) + first_in_segment = np.pad(mask_indices, [[1, 0]]) + segment_id = np.cumsum(first_in_segment) + # count length of sub segments assuming that list is sorted + _, segment_length = np.unique(segment_id, return_counts=True) + return segment_length + + noise_span_lengths = _random_segmentation( + num_noise_tokens, num_noise_spans) + nonnoise_span_lengths = _random_segmentation( + num_nonnoise_tokens, num_noise_spans) + + interleaved_span_lengths = np.reshape( + np.stack([nonnoise_span_lengths, noise_span_lengths], + axis=1), [num_noise_spans * 2] + ) + span_starts = np.cumsum(interleaved_span_lengths)[:-1] + span_start_indicator = np.zeros((length,), dtype=np.int8) + span_start_indicator[span_starts] = True + span_num = np.cumsum(span_start_indicator) + is_noise = np.equal(span_num % 2, 1) + + return is_noise[:orig_length] + + +class TaskT5Dataset(Dataset): + def __init__(self, data_path, args): + super().__init__() + self.max_length = args.max_seq_length + if args.tokenizer_type == 't5_tokenizer': + self.tokenizer = MT5Tokenizer.from_pretrained(args.pretrained_model_path) + else: + self.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path) + self.data = self.load_data(data_path) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.encode(self.data[index]) + + def load_data(self, data_path): + samples = [] + with open(data_path, 'r', encoding='utf8') as f: + lines = f.readlines() + for line in tqdm(lines): + samples.append(json.loads(line)) + return samples + + def encode(self, item): + if item["textb"] != "": + text = item['question'] + ','.join(item['choice'])+'。' + f"""{item["texta"]}""" + f"""{item["textb"]}""" + else: + text = f"""{item["question"]}""" + ",".join(item["choice"]) + "。" + f"""{item["texta"]}""" + label = item['answer'] + encode_dict = self.tokenizer.encode_plus(text, max_length=self.max_length, padding='max_length', + truncation=True, return_tensors='pt') + decode_dict = self.tokenizer.encode_plus(label, max_length=16, padding='max_length', + truncation=True) + + answer_token = [] + max_label_len = 0 + choice_encode = [] # 用来确定模型生成的最大长度 + for a in item['choice']: + answer_encode = self.tokenizer.encode(a) + choice_encode.append(answer_encode) + if len(answer_encode) > max_label_len: + max_label_len = len(answer_encode) + for an in answer_encode: + if an not in answer_token: + answer_token.append(an) + + # bad_words_ids = [[i] for i in range(self.tokenizer.vocab_size) if i not in answer_token] #不生成这些token + + # while len(bad_words_ids)'}) + self.data_size = os.path.getsize(data_path)/1024/1024/1024 + self.data_type_name = name + self.data = self.load_data(data_path) + self.max_seq_length = args.max_seq_length + + def __len__(self): + 
return len(self.data) + + def __getitem__(self, index): + return self.encode(self.data[index]) + + def load_data(self, data_path): + # 有进度条展示 + if self.data_size <= 5: + with open(data_path, "rt", encoding='utf8') as f: + lines = f.readlines() + total_num = len(lines) + data_gen = lines + else: + data_gen = open(data_path, "rt", encoding='utf8') + total_num = None + + data = [] + with tqdm(total=total_num, desc=f'{self.data_type_name}处理进度', mininterval=0.3) as bar: + for idx, line in enumerate(data_gen): + data.append(self.data_parse(line)) + bar.update() + + if self.data_size > 5: + data_gen.close() + return data + + def data_parse(self, line): + """ + 解析不同格式的数据 + """ + dic = eval(line.strip()) + return dic + + def encode(self, item): + """ + 将数据转换成模型训练的输入 + """ + inputs_dict = self.tokenizer.encode_plus(item['Question']+item['answer'], + max_length=self.max_seq_length, padding='max_length', + truncation=True, return_tensors='pt') + target = inputs_dict['input_ids'] + labels = target.clone().detach() + labels[target == self.tokenizer.pad_token_id] = -100 + return { + "input_ids": inputs_dict['input_ids'].squeeze(), + "attention_mask": inputs_dict['attention_mask'].squeeze(), + "labels": labels.squeeze(), + "question": item['Question'], + "answer": item['answer'] + } + + +class GPT2QADataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('GPT2QADataModel') + parser.add_argument('--data_dir', type=str, required=True) + parser.add_argument('--num_workers', default=2, type=int) + parser.add_argument('--train_data', default='train.txt', type=str) + parser.add_argument('--valid_data', default='valid.txt', type=str) + parser.add_argument('--test_data', default='test.txt', type=str) + parser.add_argument('--train_batchsize', type=int, required=True) + parser.add_argument('--valid_batchsize', type=int, required=True) + parser.add_argument('--max_seq_length', default=1024, type=int) + return parent_args + + def __init__(self, args): + super().__init__() + self.args = args + self.train_batchsize = args.train_batchsize + self.valid_batchsize = args.valid_batchsize + if not args.do_eval_only: + self.train_data = GPT2QADataset(os.path.join( + args.data_dir, args.train_data), '训练集', args) + self.valid_data = GPT2QADataset(os.path.join( + args.data_dir, args.valid_data), '验证集', args) + self.test_data = GPT2QADataset(os.path.join( + args.data_dir, args.test_data), '测试集', args) + + def train_dataloader(self): + return DataLoader( + self.train_data, shuffle=True, + batch_size=self.train_batchsize, + pin_memory=False, num_workers=self.args.num_workers) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, + batch_size=self.valid_batchsize, + pin_memory=False, num_workers=self.args.num_workers) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, + batch_size=self.valid_batchsize, pin_memory=False, + num_workers=self.args.num_workers) + + +if __name__ == '__main__': + import argparse + modelfile = '/cognitive_comp/wuziwei/pretrained_model_hf/medical_v2' + datafile = '/cognitive_comp/wuziwei/task-data/medical_qa/medical_qa_train.txt' + parser = argparse.ArgumentParser(description='hf test', allow_abbrev=False) + group = parser.add_argument_group(title='test args') + group.add_argument('--pretrained-model-path', type=str, default=modelfile, + help='Number of transformer layers.') + group.add_argument('--max-seq-length', type=int, default=1024) + args = parser.parse_args() 
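+    # Smoke test (run this file directly): each line of the txt file is parsed
+    # with eval() into a dict holding 'Question' and 'answer'; encode() masks the
+    # pad positions of `labels` with -100 so they are ignored by the LM loss.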
+ + testml = GPT2QADataset(datafile, 'medical_qa', args=args) + + print(testml[10]) diff --git a/fengshen/data/task_dataloader/task_datasets.py b/fengshen/data/task_dataloader/task_datasets.py new file mode 100644 index 0000000000000000000000000000000000000000..a8fe7bcf732c61725853df92d9422f207d55f785 --- /dev/null +++ b/fengshen/data/task_dataloader/task_datasets.py @@ -0,0 +1,206 @@ +# coding=utf8 +from torch.utils.data import Dataset, DataLoader +from tqdm import tqdm +from transformers import AutoTokenizer +import json +import torch +import pytorch_lightning as pl +import os + + +class AbstractCollator: + """ + collector for summary task + """ + + def __init__(self, tokenizer, max_enc_length, max_dec_length, prompt): + self.tokenizer = tokenizer + self.max_enc_length = max_enc_length + self.max_dec_length = max_dec_length + self.prompt = prompt + + def __call__(self, samples): + + labels = [] + attn_mask = [] + # decoder_attn_mask = [] + source_inputs = [] + for sample in samples: + encode_dict = self.tokenizer.encode_plus( + self.prompt + sample['text'], + max_length=self.max_enc_length, + padding='max_length', + truncation=True, + return_tensors='pt') + decode_dict = self.tokenizer.encode_plus( + sample['summary'], + max_length=self.max_dec_length, + padding='max_length', + truncation=True, + return_tensors='pt') + source_inputs.append(encode_dict['input_ids'].squeeze()) + labels.append(decode_dict['input_ids'].squeeze()) + attn_mask.append(encode_dict['attention_mask'].squeeze()) + # decoder_attn_mask.append(decode_dict['attention_mask'].squeeze()) + # labels = torch.tensor(decode_dict['input']) + + source_inputs = torch.stack(source_inputs) + labels = torch.stack(labels) + attn_mask = torch.stack(attn_mask) + # decoder_attn_mask = torch.stack(decoder_attn_mask) + # decode_input_idxs = shift_tokens_right(labels, self.tokenizer.pad_token_id, self.tokenizer.pad_token_id) + end_token_index = torch.where(labels == self.tokenizer.eos_token_id)[1] + for idx, end_idx in enumerate(end_token_index): + labels[idx][end_idx + 1:] = -100 + + return { + "input_ids": source_inputs, + "attention_mask": attn_mask, + "labels": labels, + "text": [sample['text'] for sample in samples], + "summary": [sample['summary'] for sample in samples] + } + + +class LCSTSDataset(Dataset): + ''' + Dataset Used for LCSTS summary task. 
+ ''' + + def __init__(self, data_path, args): + super().__init__() + self.tokenizer = AutoTokenizer.from_pretrained( + args.pretrained_model_path, use_fast=False) + self.data = self.load_data(data_path) + self.prompt = args.prompt + self.max_enc_length = args.max_enc_length + self.max_dec_length = args.max_dec_length + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.encode(self.data[index]) + + def load_data(self, data_path): + with open(data_path, "r", encoding='utf8') as f: + lines = f.readlines() + samples = [] + for line in tqdm(lines): + obj = json.loads(line) + source = obj['text'] + target = obj['summary'] + samples.append({ + "text": source, + "summary": target + }) + return samples + + def cal_data(self, data_path): + with open(data_path, "r", encoding='utf8') as f: + lines = f.readlines() + samples = [] + enc_sizes = [] + dec_sizes = [] + for line in tqdm(lines): + obj = json.loads(line.strip()) + source = obj['text'] + target = obj['summary'] + enc_input_ids = self.tokenizer.encode(source) + target = self.tokenizer.encode(target) + enc_sizes.append(len(enc_input_ids)) + dec_sizes.append(len(target)-1) + samples.append({ + "enc_input_ids": enc_input_ids, + "dec_input_ids": target[:-1], + "label_ids": target[1:] + }) + max_enc_len = max(enc_sizes) + max_dec_len = max(dec_sizes) + import numpy as np + # mean of len(enc_input_ids): 74.68041911345998 + # mean of len(dec_input_ids): 14.02265483791283 + # max of len(enc_input_ids): 132 + # max of len(dec_input_ids): 31 + print('mean of len(enc_input_ids):', np.mean(enc_sizes), + 'mean of len(dec_input_ids):', np.mean(dec_sizes), + 'max of len(enc_input_ids):', max_enc_len, + 'max of len(dec_input_ids):', max_dec_len) + return samples + + def encode(self, item): + encode_dict = self.tokenizer.encode_plus( + self.prompt + item['text'], + max_length=self.max_enc_length, + padding='max_length', + truncation=True, + return_tensors='pt') + decode_dict = self.tokenizer.encode_plus( + item['summary'], + max_length=self.max_dec_length, + padding='max_length', + truncation=True) + + target = decode_dict['input_ids'] + # print('encode_dict shape:', encode_dict['input_ids'].shape) + labels = torch.tensor(target) + labels[target == self.tokenizer.pad_token_id] = -100 + return { + "input_ids": encode_dict['input_ids'].squeeze(), + "attention_mask": encode_dict['attention_mask'].squeeze(), + "labels": labels.squeeze(), + "text": item['text'], + "summary": item['summary'] + } + + +class LCSTSDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('LCSTSDataModel') + parser.add_argument( + '--data_dir', default='/cognitive_comp/ganruyi/data_datasets_LCSTS_LCSTS/', type=str) + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_data', default='train.jsonl', type=str) + parser.add_argument('--valid_data', default='valid.jsonl', type=str) + parser.add_argument('--test_data', default='test_public.jsonl', type=str) + parser.add_argument('--train_batchsize', default=128, type=int) + parser.add_argument('--valid_batchsize', default=128, type=int) + parser.add_argument('--max_enc_length', default=128, type=int) + parser.add_argument('--max_dec_length', default=30, type=int) + parser.add_argument('--prompt', default='summarize:', type=str) + return parent_args + + def __init__(self, args): + super().__init__() + self.args = args + self.train_batchsize = args.train_batchsize + self.valid_batchsize = 
args.valid_batchsize + if not args.do_eval_only: + self.train_data = LCSTSDataset(os.path.join( + args.data_dir, args.train_data), args) + self.valid_data = LCSTSDataset(os.path.join( + args.data_dir, args.valid_data), args) + self.test_data = LCSTSDataset(os.path.join( + args.data_dir, args.test_data), args) + + def train_dataloader(self): + return DataLoader(self.train_data, + shuffle=True, + batch_size=self.train_batchsize, + pin_memory=False, + num_workers=self.args.num_workers) + + def val_dataloader(self): + return DataLoader(self.valid_data, + shuffle=False, + batch_size=self.valid_batchsize, + pin_memory=False, + num_workers=self.args.num_workers) + + def predict_dataloader(self): + return DataLoader(self.test_data, + shuffle=False, + batch_size=self.valid_batchsize, + pin_memory=False, + num_workers=self.args.num_workers) diff --git a/fengshen/data/universal_datamodule/__init__.py b/fengshen/data/universal_datamodule/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..68169d26a8424ae877b5c7efc2b7be2e761cd3cb --- /dev/null +++ b/fengshen/data/universal_datamodule/__init__.py @@ -0,0 +1,4 @@ +from .universal_datamodule import UniversalDataModule +from .universal_sampler import PretrainingSampler, PretrainingRandomSampler + +__all__ = ['UniversalDataModule', 'PretrainingSampler', 'PretrainingRandomSampler'] diff --git a/fengshen/data/universal_datamodule/universal_datamodule.py b/fengshen/data/universal_datamodule/universal_datamodule.py new file mode 100644 index 0000000000000000000000000000000000000000..e73d985f661c77ebb452f5060cd30bfb1d8968be --- /dev/null +++ b/fengshen/data/universal_datamodule/universal_datamodule.py @@ -0,0 +1,161 @@ +from pytorch_lightning import LightningDataModule +from typing import Optional + +from torch.utils.data import DataLoader, DistributedSampler + + +def get_consume_samples(data_model: LightningDataModule) -> int: + if hasattr(data_model.trainer.lightning_module, 'consumed_samples'): + consumed_samples = data_model.trainer.lightning_module.consumed_samples + print('get consumed samples from model: {}'.format(consumed_samples)) + else: + world_size = data_model.trainer.world_size + consumed_samples = max(0, data_model.trainer.global_step - 1) * \ + data_model.hparams.train_batchsize * world_size * data_model.trainer.accumulate_grad_batches + print('calculate consumed samples: {}'.format(consumed_samples)) + return consumed_samples + + +class UniversalDataModule(LightningDataModule): + @ staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('Universal DataModule') + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--dataloader_workers', default=2, type=int) + parser.add_argument('--train_batchsize', default=32, type=int) + parser.add_argument('--val_batchsize', default=32, type=int) + parser.add_argument('--test_batchsize', default=32, type=int) + parser.add_argument('--datasets_name', type=str, default=None) + parser.add_argument('--train_datasets_field', type=str, default='train') + parser.add_argument('--val_datasets_field', type=str, default='validation') + parser.add_argument('--test_datasets_field', type=str, default='test') + parser.add_argument('--train_file', type=str, default=None) + parser.add_argument('--val_file', type=str, default=None) + parser.add_argument('--test_file', type=str, default=None) + parser.add_argument('--raw_file_type', type=str, default='json') + parser.add_argument('--sampler_type', type=str, + choices=['single', + 
'random'], + default='random') + return parent_args + + def __init__( + self, + tokenizer, + collate_fn, + args, + datasets=None, + **kwargs, + ): + super().__init__() + # 如果不传入datasets的名字,则可以在对象外部替换内部的datasets为模型需要的 + if datasets is not None: + self.datasets = datasets + elif args.datasets_name is not None: + from fengshen.data.fs_datasets import load_dataset + print('---------begin to load datasets {}'.format(args.datasets_name)) + self.datasets = load_dataset( + args.datasets_name, num_proc=args.num_workers) + print('---------ending load datasets {}'.format(args.datasets_name)) + else: + print('---------begin to load datasets from local file') + from datasets import load_dataset + self.datasets = load_dataset(args.raw_file_type, + data_files={ + args.train_datasets_field: args.train_file, + args.val_datasets_field: args.val_file, + args.test_datasets_field: args.test_file}) + print('---------end to load datasets from local file') + + self.tokenizer = tokenizer + self.collate_fn = collate_fn + self.save_hyperparameters(args) + + def get_custom_sampler(self, ds): + from .universal_sampler import PretrainingRandomSampler + from .universal_sampler import PretrainingSampler + world_size = self.trainer.world_size + consumed_samples = get_consume_samples(self) + # use the user default sampler + if self.hparams.sampler_type == 'random': + return PretrainingRandomSampler( + total_samples=len(ds), + # consumed_samples cal by global steps + consumed_samples=consumed_samples, + micro_batch_size=self.hparams.train_batchsize, + data_parallel_rank=self.trainer.global_rank, + data_parallel_size=world_size, + epoch=self.trainer.current_epoch, + ) + elif self.hparams.sampler_type == 'single': + return PretrainingSampler( + total_samples=len(ds), + # consumed_samples cal by global steps + consumed_samples=consumed_samples, + micro_batch_size=self.hparams.train_batchsize, + data_parallel_rank=self.trainer.global_rank, + data_parallel_size=world_size, + ) + else: + raise Exception('Unknown sampler type: {}'.format(self.hparams.sampler_type)) + + def setup(self, stage: Optional[str] = None) -> None: + return + + def train_dataloader(self): + ds = self.datasets[self.hparams.train_datasets_field] + + collate_fn = self.collate_fn + if collate_fn is None and hasattr(ds, 'collater'): + collate_fn = ds.collater + + if self.hparams.replace_sampler_ddp is False: + return DataLoader( + ds, + batch_sampler=self.get_custom_sampler(ds), + num_workers=self.hparams.dataloader_workers, + collate_fn=collate_fn, + pin_memory=True, + ) + return DataLoader( + ds, + batch_size=self.hparams.train_batchsize, + num_workers=self.hparams.dataloader_workers, + collate_fn=collate_fn, + pin_memory=True, + ) + + def val_dataloader(self): + ds = self.datasets[self.hparams.val_datasets_field] + collate_fn = self.collate_fn + if collate_fn is None and hasattr(ds, 'collater'): + collate_fn = ds.collater + + return DataLoader( + ds, + batch_size=self.hparams.val_batchsize, + shuffle=False, + num_workers=self.hparams.dataloader_workers, + collate_fn=collate_fn, + sampler=DistributedSampler( + ds, shuffle=False), + pin_memory=True, + ) + + def test_dataloader(self): + ds = self.datasets[self.hparams.test_datasets_field] + + collate_fn = self.collate_fn + if collate_fn is None and hasattr(ds, 'collater'): + collate_fn = ds.collater + + return DataLoader( + ds, + batch_size=self.hparams.test_batchsize, + shuffle=False, + num_workers=self.hparams.dataloader_workers, + collate_fn=collate_fn, + sampler=DistributedSampler( + ds, shuffle=False), + 
pin_memory=True, + ) diff --git a/fengshen/data/universal_datamodule/universal_sampler.py b/fengshen/data/universal_datamodule/universal_sampler.py new file mode 100644 index 0000000000000000000000000000000000000000..86db3016d0f9795f5c8e501da2ff55c6e34e7222 --- /dev/null +++ b/fengshen/data/universal_datamodule/universal_sampler.py @@ -0,0 +1,125 @@ +# coding=utf-8 +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Dataloaders.""" + + +import torch + + +class PretrainingSampler: + + def __init__(self, total_samples, consumed_samples, micro_batch_size, + data_parallel_rank, data_parallel_size, drop_last=True): + # Keep a copy of input params for later use. + self.total_samples = total_samples + self.consumed_samples = consumed_samples + self.micro_batch_size = micro_batch_size + self.data_parallel_rank = data_parallel_rank + self.micro_batch_times_data_parallel_size = \ + self.micro_batch_size * data_parallel_size + self.drop_last = drop_last + + # Sanity checks. + assert self.total_samples > 0, \ + 'no sample to consume: {}'.format(self.total_samples) + assert self.consumed_samples < self.total_samples, \ + 'no samples left to consume: {}, {}'.format(self.consumed_samples, + self.total_samples) + assert self.micro_batch_size > 0 + assert data_parallel_size > 0 + assert self.data_parallel_rank < data_parallel_size, \ + 'data_parallel_rank should be smaller than data size: {}, ' \ + '{}'.format(self.data_parallel_rank, data_parallel_size) + + def __len__(self): + return self.total_samples // self.micro_batch_times_data_parallel_size + + def get_start_end_idx(self): + start_idx = self.data_parallel_rank * self.micro_batch_size + end_idx = start_idx + self.micro_batch_size + return start_idx, end_idx + + def __iter__(self): + batch = [] + # Last batch will be dropped if drop_last is not set False + for idx in range(self.consumed_samples, self.total_samples): + batch.append(idx) + if len(batch) == self.micro_batch_times_data_parallel_size: + start_idx, end_idx = self.get_start_end_idx() + yield batch[start_idx:end_idx] + batch = [] + + # Check the last partial batch and see drop_last is set + if len(batch) > 0 and not self.drop_last: + start_idx, end_idx = self.get_start_end_idx() + yield batch[start_idx:end_idx] + + +class PretrainingRandomSampler: + + def __init__(self, total_samples, consumed_samples, micro_batch_size, + data_parallel_rank, data_parallel_size, epoch): + # Keep a copy of input params for later use. + self.total_samples = total_samples + self.consumed_samples = consumed_samples + self.micro_batch_size = micro_batch_size + self.data_parallel_rank = data_parallel_rank + self.data_parallel_size = data_parallel_size + self.micro_batch_times_data_parallel_size = \ + self.micro_batch_size * data_parallel_size + self.last_batch_size = \ + self.total_samples % self.micro_batch_times_data_parallel_size + self.epoch = epoch + + # Sanity checks. 
+ assert self.total_samples > 0, \ + 'no sample to consume: {}'.format(self.total_samples) + assert self.micro_batch_size > 0 + assert data_parallel_size > 0 + assert self.data_parallel_rank < data_parallel_size, \ + 'data_parallel_rank should be smaller than data size: {}, ' \ + '{}'.format(self.data_parallel_rank, data_parallel_size) + + def __len__(self): + return self.total_samples // self.micro_batch_times_data_parallel_size + + def __iter__(self): + active_total_samples = self.total_samples - self.last_batch_size + current_epoch_samples = self.consumed_samples % active_total_samples + assert current_epoch_samples % self.micro_batch_times_data_parallel_size == 0 + + # data sharding and random sampling + bucket_size = (self.total_samples // self.micro_batch_times_data_parallel_size) \ + * self.micro_batch_size + bucket_offset = current_epoch_samples // self.data_parallel_size + start_idx = self.data_parallel_rank * bucket_size + + g = torch.Generator() + g.manual_seed(self.epoch) + random_idx = torch.randperm(bucket_size, generator=g).tolist() + idx_range = [start_idx + x for x in random_idx[bucket_offset:]] + + batch = [] + # Last batch if not complete will be dropped. + for idx in idx_range: + batch.append(idx) + if len(batch) == self.micro_batch_size: + self.consumed_samples += self.micro_batch_times_data_parallel_size + yield batch + batch = [] + + def set_epoch(self, epoch): + self.epoch = epoch diff --git a/fengshen/examples/FastDemo/README.md b/fengshen/examples/FastDemo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..132519b95da3fd35f4c4fb6aae5d8c44faad3a42 --- /dev/null +++ b/fengshen/examples/FastDemo/README.md @@ -0,0 +1,105 @@ +# 「streamlit」快速搭建你的算法demo +在搭建demo之前,首先得做好这些准备工作: +- 模型训练完毕 +- 模型的入参确定 +- 安装streamlit库,`pip install streamlit` 就可以安装。 + +streamlit脚本的启动方式是 `streamlit run demo.py`,很简单就启动了一个demo页面,页面会随着脚本代码的改变实时刷新的。所以在没有经验的时候,可以创建一个demo.py的文件,照着下面的教程一步一步添加代码,看页面的展示情况。下面开始上干货,具体细节在代码注释中有说明! + +### 第一步 导包 +```python +import streamlit as st +# 其他包更具你的需要导入 +``` +[streamlit](https://streamlit.io)是一个用于构建机器学习、深度学习、数据可视化demo的python框架。它不需要你有web开发的经验,会写python就可以高效的开发你的demo。 + +### 第二步 页面导航信息以及布局配置 + +```python +st.set_page_config( + page_title="余元医疗问答", # 页面标签标题 + page_icon=":shark:", # 页面标签图标 + layout="wide", # 页面的布局 + initial_sidebar_state="expanded", # 左侧的sidebar的布局方式 + # 配置菜单按钮的信息 + menu_items={ + 'Get Help': 'https://www.extremelycoolapp.com/help', + 'Report a bug': "https://www.extremelycoolapp.com/bug", + 'About': "# This is a header. This is an *extremely* cool app!" 
+ } + ) +``` +这一步可以省略,如果想让app更加个性化,可以添加这些设置。 + +### 第三步 设置demo标题 +```python +st.title('Demo for MedicalQA') +``` +streamlit的每一个小组件对应于页面都有一个默认的样式展示。 + +### 第四步 配置demo的参数 + +```python +# 此处是用的sidebar,侧边栏作为参数配置模块 +st.sidebar.header("参数配置") +# 这里是在sidebar里面创建了表单,每个表单一定有一个标题和提交按钮 +sbform = st.sidebar.form("固定参数设置") +# slider是滑动条组建,可以配置数值型参数 +n_sample = sbform.slider("设置返回条数",min_value=1,max_value=10,value=3) +text_length = sbform.slider('生成长度:',min_value=32,max_value=512,value=64,step=32) +text_level = sbform.slider('文本多样性:',min_value=0.1,max_value=1.0,value=0.9,step=0.1) +# number_input也可以配置数值型参数 +model_id = sbform.number_input('选择模型号:',min_value=0,max_value=13,value=13,step=1) +# selectbox选择组建,只能选择配置的选项 +trans = sbform.selectbox('选择翻译内核',['百度通用','医疗生物']) +# 提交表单的配置,这些参数的赋值才生效 +sbform.form_submit_button("提交配置") + +# 这里是页面中的参数配置,也是demo的主体之一 +form = st.form("参数设置") +# 本demo是qa demo,所以要录入用户的文本输入,text_input组建可以实现 +input_text = form.text_input('请输入你的问题:',value='',placeholder='例如:糖尿病的症状有哪些?') +form.form_submit_button("提交") +``` +以上就把demo的参数基本配置完成了。 + +### 第五步 模型预测 +```python +# 定义一个前向预测的方法 +# @st.cache(suppress_st_warning=True) +def generate_qa(input_text,n_sample,model_id='7',length=64,translator='baidu',level=0.7): + # 这里我们是把模型用fastapi搭建了一个api服务 + URL = 'http://192.168.190.63:6605/qa' + data = { + "text":input_text,"n_sample":n_sample, + "model_id":model_id,"length":length, + 'translator':translator,'level':level + } + r = requests.get(URL,params=data) + return r.text +# 模型预测结果 +results = generate_qa(input_text,n_sample,model_id=str(model_id), + translator=translator,length=text_length,level=text_level) +``` +这里说明一下,由于demo展示机器没有GPU,所以模型部署采用的是Fastapi部署在后台的。如果demo展示的机器可以直接部署模型,这里可以直接把模型预测的方法写在这里,不需要另外部署模型,再用api的方式调用。这样做有一个值得注意的地方,因为streamlit的代码每一次运行,都是从头到尾执行一遍,就导致模型可能会重复加载,所以这里需要用到st.cache组建,当内容没有更新的时候,会把这一步的结果缓存,而不会重新执行。保证了效率不会因此而下降。 + +### 第六步 结果展示 +```python +with st.spinner('老夫正在思考中🤔...'): + if input_text: + results = generate_qa(input_text,n_sample,model_id=str(model_id), + translator=translator,length=text_length,level=text_level) + for idx,item in enumerate(eval(results),start=1): + st.markdown(f""" + **候选回答「{idx}」:**\n + """) + st.info('中文:%s'%item['fy_next_sentence']) + st.info('英文:%s'%item['next_sentence']) +``` +streamlit对不同格式的内容展示,有丰富的组建,对于文本可以用`st.markdown`组建以及`st.text`和`st.write`展示。更多组建和功能可以参考官方文档:https://docs.streamlit.io + +至此,一个完整的demo展示就完成了。效果图如下: + +![](./image/demo.png) + +完整的代码可以参考:`Fengshenbang-LM/fengshen/examples/FastDemo/YuyuanQA.py` diff --git a/fengshen/examples/FastDemo/YuyuanQA.py b/fengshen/examples/FastDemo/YuyuanQA.py new file mode 100644 index 0000000000000000000000000000000000000000..fed2d19bc61e0735f3868e1a30a532bd19fbb4b0 --- /dev/null +++ b/fengshen/examples/FastDemo/YuyuanQA.py @@ -0,0 +1,71 @@ +import requests +import langid +import streamlit as st +from translate import baiduTranslatorMedical +from translate import baiduTranslator + +langid.set_languages(['en', 'zh']) +lang_dic = {'zh': 'en', 'en': 'zh'} + +st.set_page_config( + page_title="余元医疗问答", + page_icon=":shark:", + # layout="wide", + initial_sidebar_state="expanded", + menu_items={ + 'Get Help': 'https://www.extremelycoolapp.com/help', + 'Report a bug': "https://www.extremelycoolapp.com/bug", + 'About': "# This is a header. This is an *extremely* cool app!" 
+ } +) +st.title('Demo for MedicalQA') + + +st.sidebar.header("参数配置") +sbform = st.sidebar.form("固定参数设置") +n_sample = sbform.slider("设置返回条数", min_value=1, max_value=10, value=3) +text_length = sbform.slider('生成长度:', min_value=32, max_value=512, value=64, step=32) +text_level = sbform.slider('文本多样性:', min_value=0.1, max_value=1.0, value=0.9, step=0.1) +model_id = sbform.number_input('选择模型号:', min_value=0, max_value=13, value=13, step=1) +trans = sbform.selectbox('选择翻译内核', ['百度通用', '医疗生物']) +sbform.form_submit_button("配置") + + +form = st.form("参数设置") +input_text = form.text_input('请输入你的问题:', value='', placeholder='例如:糖尿病的症状有哪些?') +if trans == '百度通用': + translator = 'baidu_common' +else: + translator = 'baidu' +if input_text: + lang = langid.classify(input_text)[0] + if translator == 'baidu': + st.write('**你的问题是:**', baiduTranslatorMedical(input_text, src=lang, dest=lang_dic[lang]).text) + else: + st.write('**你的问题是:**', baiduTranslator(input_text, src=lang, dest=lang_dic[lang]).text) + +form.form_submit_button("提交") + +# @st.cache(suppress_st_warning=True) + + +def generate_qa(input_text, n_sample, model_id='7', length=64, translator='baidu', level=0.7): + # st.write('调用了generate函数') + URL = 'http://192.168.190.63:6605/qa' + data = {"text": input_text, "n_sample": n_sample, "model_id": model_id, + "length": length, 'translator': translator, 'level': level} + r = requests.get(URL, params=data) + return r.text +# my_bar = st.progress(80) + + +with st.spinner('老夫正在思考中🤔...'): + if input_text: + results = generate_qa(input_text, n_sample, model_id=str(model_id), + translator=translator, length=text_length, level=text_level) + for idx, item in enumerate(eval(results), start=1): + st.markdown(f""" + **候选回答「{idx}」:**\n + """) + st.info('中文:%s' % item['fy_next_sentence']) + st.info('英文:%s' % item['next_sentence']) diff --git a/fengshen/examples/FastDemo/image/demo.png b/fengshen/examples/FastDemo/image/demo.png new file mode 100644 index 0000000000000000000000000000000000000000..3eee22e26192861429863058de716e457fc8fc57 Binary files /dev/null and b/fengshen/examples/FastDemo/image/demo.png differ diff --git a/fengshen/examples/classification/demo_classification_afqmc_erlangshen_offload.sh b/fengshen/examples/classification/demo_classification_afqmc_erlangshen_offload.sh new file mode 100644 index 0000000000000000000000000000000000000000..f5ff555aa60e3cebd544b92a18443eb7505f8ae8 --- /dev/null +++ b/fengshen/examples/classification/demo_classification_afqmc_erlangshen_offload.sh @@ -0,0 +1,103 @@ +MODEL_NAME="IDEA-CCNL/Erlangshen-MegatronBert-1.3B" + +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + +BATCH_SIZE=1 +VAL_BATCH_SIZE=1 +ZERO_STAGE=3 +config_json="./ds_config.json" + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 1000, + "gradient_clipping": 1, + "zero_optimization": { + "stage": ${ZERO_STAGE}, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9 + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export 
PL_DEEPSPEED_CONFIG_PATH=$config_json + +DATA_ARGS="\ + --dataset_name IDEA-CCNL/AFQMC \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-1 \ + --warmup_ratio 0.01 \ + --num_labels 2 \ + --model_type huggingface-auto \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 0 \ + --save_weights_only True \ + --dirpath . \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE}_offload \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --precision 16 \ + --default_root_dir . \ + " + +options=" \ + --pretrained_model_path $MODEL_NAME \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 finetune_classification.py $options + diff --git a/fengshen/examples/classification/demo_classification_afqmc_roberta.sh b/fengshen/examples/classification/demo_classification_afqmc_roberta.sh new file mode 100644 index 0000000000000000000000000000000000000000..bad55f2de72f66f02b583d9b191802c55cfe0a4b --- /dev/null +++ b/fengshen/examples/classification/demo_classification_afqmc_roberta.sh @@ -0,0 +1,62 @@ +MODEL_NAME="IDEA-CCNL/Erlangshen-Roberta-110M-NLI" + +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + +BATCH_SIZE=1 +VAL_BATCH_SIZE=1 + +DATA_ARGS="\ + --dataset_name IDEA-CCNL/AFQMC \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-2 \ + --warmup_ratio 0.01 \ + --num_labels 2 \ + --model_type huggingface-auto \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 0 \ + --save_weights_only True \ + --dirpath . \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy ddp \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --precision 16 \ + --default_root_dir . 
\ + " + +options=" \ + --pretrained_model_path $MODEL_NAME \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 finetune_classification.py $options + diff --git a/fengshen/examples/classification/demo_classification_afqmc_roberta_deepspeed.sh b/fengshen/examples/classification/demo_classification_afqmc_roberta_deepspeed.sh new file mode 100644 index 0000000000000000000000000000000000000000..48b003940a960454912a62731e5aec3b9046a6df --- /dev/null +++ b/fengshen/examples/classification/demo_classification_afqmc_roberta_deepspeed.sh @@ -0,0 +1,90 @@ +MODEL_NAME="IDEA-CCNL/Erlangshen-Roberta-110M-NLI" + +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + +BATCH_SIZE=32 +VAL_BATCH_SIZE=32 +ZERO_STAGE=1 +config_json="./ds_config.json" + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 1000, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + +DATA_ARGS="\ + --dataset_name IDEA-CCNL/AFQMC \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-2 \ + --warmup_ratio 0.01 \ + --num_labels 2 \ + --model_type huggingface-auto \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 0 \ + --save_weights_only True \ + --dirpath . \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --precision 16 \ + --default_root_dir . \ + " + +options=" \ + --pretrained_model_path $MODEL_NAME \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 finetune_classification.py $options + diff --git a/fengshen/examples/classification/finetune_classification.py b/fengshen/examples/classification/finetune_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..2e643f2fcf560b6c817d22946ad4a6610b647e13 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification.py @@ -0,0 +1,389 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
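# 本脚本是 examples/classification 下各 demo_*.sh / finetune_classification_*.sh 共用的分类微调入口:
# 通过 --model_type 选择模型封装(huggingface-auto、fengshen-roformer、fengshen-megatron_t5 等),
# 数据既可以用 --dataset_name 直接加载 Huggingface 数据集,也可以用 --data_dir 搭配
# --train_data/--valid_data/--test_data 读取本地 JSON 文件,字段名由 --texta_name/--textb_name/--label_name 指定。
# 一个最小的命令行示意(参数取自 demo_classification_afqmc_roberta.sh,路径与超参请按需调整):
#   python3 finetune_classification.py \
#       --pretrained_model_path IDEA-CCNL/Erlangshen-Roberta-110M-NLI \
#       --model_type huggingface-auto \
#       --dataset_name IDEA-CCNL/AFQMC \
#       --texta_name sentence1 --textb_name sentence2 --label_name label --id_name id \
#       --train_batchsize 1 --valid_batchsize 1 --max_length 128 \
#       --learning_rate 1e-5 --weight_decay 1e-2 --warmup_ratio 0.01 --num_labels 2 \
#       --monitor val_acc --mode max --dirpath . --filename model-{epoch:02d}-{val_acc:.4f} \
#       --max_epochs 67 --gpus 1 --num_nodes 1 --strategy ddp --precision 16 --default_root_dir .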
+# from fengshen.models.zen1 import ZenModel +from dataclasses import dataclass +from fengshen.models.megatron_t5 import T5EncoderModel +from fengshen.models.roformer import RoFormerModel +from fengshen.models.longformer import LongformerModel +# from fengshen.models.cocolm.modeling_cocolm import COCOLMForSequenceClassification +import numpy as np +import os +from tqdm import tqdm +import json +import torch +import pytorch_lightning as pl +import argparse +from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor +from torch.utils.data import Dataset, DataLoader +from torch.utils.data._utils.collate import default_collate +from transformers import ( + BertModel, + BertConfig, + MegatronBertModel, + MegatronBertConfig, + AutoModel, + AutoConfig, + AutoTokenizer, + AutoModelForSequenceClassification, +) +# os.environ["CUDA_VISIBLE_DEVICES"] = '6' + + +model_dict = {'huggingface-bert': BertModel, + 'fengshen-roformer': RoFormerModel, + 'huggingface-megatron_bert': MegatronBertModel, + 'fengshen-megatron_t5': T5EncoderModel, + 'fengshen-longformer': LongformerModel, + # 'fengshen-zen1': ZenModel, + 'huggingface-auto': AutoModelForSequenceClassification, + } + + +class TaskDataset(Dataset): + def __init__(self, data_path, args, label2id): + super().__init__() + self.args = args + self.label2id = label2id + self.max_length = args.max_length + self.data = self.load_data(data_path, args) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path, args): + with open(data_path, 'r', encoding='utf8') as f: + lines = f.readlines() + samples = [] + for line in tqdm(lines): + data = json.loads(line) + text_id = int(data[args.id_name] + ) if args.id_name in data.keys() else 0 + texta = data[args.texta_name] if args.texta_name in data.keys( + ) else '' + textb = data[args.textb_name] if args.textb_name in data.keys( + ) else '' + labels = self.label2id[data[args.label_name] + ] if args.label_name in data.keys() else 0 + samples.append({args.texta_name: texta, args.textb_name: textb, + args.label_name: labels, 'id': text_id}) + return samples + + +@dataclass +class TaskCollator: + args = None + tokenizer = None + + def __call__(self, samples): + sample_list = [] + for item in samples: + if item[self.args.texta_name] != '' and item[self.args.textb_name] != '': + if self.args.model_type != 'fengshen-roformer': + encode_dict = self.tokenizer.encode_plus( + [item[self.args.texta_name], item[self.args.textb_name]], + max_length=self.args.max_length, + padding='max_length', + truncation='longest_first') + else: + encode_dict = self.tokenizer.encode_plus( + [item[self.args.texta_name] + + self.tokenizer.eos_token+item[self.args.textb_name]], + max_length=self.args.max_length, + padding='max_length', + truncation='longest_first') + else: + encode_dict = self.tokenizer.encode_plus( + item[self.args.texta_name], + max_length=self.args.max_length, + padding='max_length', + truncation='longest_first') + sample = {} + for k, v in encode_dict.items(): + sample[k] = torch.tensor(v) + sample['labels'] = torch.tensor(item[self.args.label_name]).long() + sample['id'] = item['id'] + sample_list.append(sample) + return default_collate(sample_list) + + +class TaskDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('TASK NAME DataModel') + parser.add_argument('--data_dir', default='./data', type=str) + 
parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_data', default='train.json', type=str) + parser.add_argument('--valid_data', default='dev.json', type=str) + parser.add_argument('--test_data', default='test.json', type=str) + parser.add_argument('--train_batchsize', default=16, type=int) + parser.add_argument('--valid_batchsize', default=32, type=int) + parser.add_argument('--max_length', default=128, type=int) + + parser.add_argument('--texta_name', default='text', type=str) + parser.add_argument('--textb_name', default='sentence2', type=str) + parser.add_argument('--label_name', default='label', type=str) + parser.add_argument('--id_name', default='id', type=str) + + parser.add_argument('--dataset_name', default=None, type=str) + + return parent_args + + def __init__(self, args): + super().__init__() + self.train_batchsize = args.train_batchsize + self.valid_batchsize = args.valid_batchsize + self.tokenizer = AutoTokenizer.from_pretrained( + args.pretrained_model_path) + self.collator = TaskCollator() + self.collator.args = args + self.collator.tokenizer = self.tokenizer + if args.dataset_name is None: + self.label2id, self.id2label = self.load_schema(os.path.join( + args.data_dir, args.train_data), args) + self.train_data = TaskDataset(os.path.join( + args.data_dir, args.train_data), args, self.label2id) + self.valid_data = TaskDataset(os.path.join( + args.data_dir, args.valid_data), args, self.label2id) + self.test_data = TaskDataset(os.path.join( + args.data_dir, args.test_data), args, self.label2id) + else: + import datasets + ds = datasets.load_dataset(args.dataset_name) + self.train_data = ds['train'] + self.valid_data = ds['validation'] + self.test_data = ds['test'] + self.save_hyperparameters(args) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False, + collate_fn=self.collator) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def load_schema(self, data_path, args): + with open(data_path, 'r', encoding='utf8') as f: + lines = f.readlines() + label_list = [] + for line in tqdm(lines): + data = json.loads(line) + labels = data[args.label_name] if args.label_name in data.keys( + ) else 0 + if labels not in label_list: + label_list.append(labels) + + label2id, id2label = {}, {} + for i, k in enumerate(label_list): + label2id[k] = i + id2label[i] = k + return label2id, id2label + + +class taskModel(torch.nn.Module): + def __init__(self, args): + super().__init__() + self.args = args + print('args mode type:', args.model_type) + self.bert_encoder = model_dict[args.model_type].from_pretrained( + args.pretrained_model_path) + self.config = self.bert_encoder.config + self.cls_layer = torch.nn.Linear( + in_features=self.config.hidden_size, out_features=self.args.num_labels) + self.loss_func = torch.nn.CrossEntropyLoss() + + def forward(self, input_ids, attention_mask, token_type_ids, labels=None): + if self.args.model_type == 'fengshen-megatron_t5': + bert_output = self.bert_encoder( + input_ids=input_ids, attention_mask=attention_mask) # (bsz, seq, dim) + encode = bert_output.last_hidden_state[:, 0, :] + else: + bert_output = self.bert_encoder( + input_ids=input_ids, 
attention_mask=attention_mask, token_type_ids=token_type_ids) # (bsz, seq, dim) + encode = bert_output[1] + logits = self.cls_layer(encode) + if labels is not None: + loss = self.loss_func(logits, labels.view(-1,)) + return loss, logits + else: + return 0, logits + + +class LitModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--num_labels', default=2, type=int) + + return parent_args + + def __init__(self, args, num_data): + super().__init__() + self.args = args + self.num_data = num_data + self.model = model_dict[args.model_type].from_pretrained( + args.pretrained_model_path) + self.save_hyperparameters(args) + + def setup(self, stage) -> None: + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + del batch['id'] + output = self.model(**batch) + loss, logits = output[0], output[1] + acc = self.comput_metrix(logits, batch['labels']) + self.log('train_loss', loss) + self.log('train_acc', acc) + return loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + del batch['id'] + output = self.model(**batch) + loss, logits = output[0], output[1] + acc = self.comput_metrix(logits, batch['labels']) + self.log('val_loss', loss) + self.log('val_acc', acc, sync_dist=True) + + def predict_step(self, batch, batch_idx): + ids = batch['id'] + del batch['id'] + output = self.model(**batch) + return {ids, output.logits} + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./log/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + every_n_epochs=1, + filename=args.filename) + + +def save_test(data, args, data_model, rank): + file_name = args.output_save_path + f'.{rank}' + with open(file_name, 'w', encoding='utf-8') as f: + idx = 0 + for i in range(len(data)): + ids, batch = data[i] + for id, sample in 
zip(ids, batch): + tmp_result = dict() + label_id = np.argmax(sample.cpu().numpy()) + tmp_result['id'] = id.item() + tmp_result['label'] = data_model.id2label[label_id] + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + idx += 1 + print('save the result to '+file_name) + + +def main(): + pl.seed_everything(42) + + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser.add_argument('--pretrained_model_path', default='', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + total_parser.add_argument('--model_type', + default='huggingface-bert', type=str) + + # * Args for data preprocessing + total_parser = TaskDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = pl.Trainer.add_argparse_args(total_parser) + total_parser = TaskModelCheckpoint.add_argparse_args(total_parser) + + # * Args for base model + from fengshen.models.model_utils import add_module_args + total_parser = add_module_args(total_parser) + total_parser = LitModel.add_model_specific_args(total_parser) + + args = total_parser.parse_args() + print(args.pretrained_model_path) + + checkpoint_callback = TaskModelCheckpoint(args).callbacks + early_stop_callback = EarlyStopping( + monitor="val_acc", min_delta=0.00, patience=5, verbose=False, mode="max") + lr_monitor = LearningRateMonitor(logging_interval='step') + trainer = pl.Trainer.from_argparse_args(args, + callbacks=[ + checkpoint_callback, + lr_monitor, + early_stop_callback] + ) + + data_model = TaskDataModel(args) + model = LitModel(args, len(data_model.train_dataloader())) + + trainer.fit(model, data_model) + result = trainer.predict( + model, data_model, ckpt_path=trainer.checkpoint_callback.best_model_path) + save_test(result, args, data_model, trainer.global_rank) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/classification/finetune_classification.sh b/fengshen/examples/classification/finetune_classification.sh new file mode 100644 index 0000000000000000000000000000000000000000..993071ceb0ceeb44c0bf887abcdbc0c9f982c4d5 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification.sh @@ -0,0 +1,75 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=2 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. 
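# 本脚本以 fengshen-roformer 封装加载 IDEA-CCNL/Zhouwenwang-Unified-110M,在 tnews 数据上做分类微调;
# 下方 /cognitive_comp/... 开头的数据与 checkpoint 路径均为集群内部路径,换环境运行时需改成自己的路径。
# 脚本末尾默认用 python3 直接启动,被注释掉的 singularity 命令是 SLURM + 容器环境下的提交方式。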
+ + + +MODEL_TYPE=fengshen-roformer +PRETRAINED_MODEL_PATH=IDEA-CCNL/Zhouwenwang-Unified-110M + +ROOT_PATH=cognitive_comp +TASK=tnews + +DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/modelevaluation/tnews/ +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test1.1.json \ + --train_batchsize 32 \ + --valid_batchsize 128 \ + --max_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + " + +MODEL_ARGS="\ + --learning_rate 0.00002 \ + --weight_decay 0.1 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir ./log/ \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + --model_type $MODEL_TYPE \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +python3 $SCRIPT_PATH $options +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bart-base_afqmc.sh b/fengshen/examples/classification/finetune_classification_bart-base_afqmc.sh new file mode 100644 index 0000000000000000000000000000000000000000..2700d2ad3d6fca47238db033781905ac372b183a --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bart-base_afqmc.sh @@ -0,0 +1,143 @@ +#!/bin/bash +#SBATCH --job-name=afqmc-bart-base # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. 
+#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions + +MODEL_NAME=bart-base + +TASK=afqmc +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=8 +VAL_BATCH_SIZE=32 +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$MODEL_NAME/ + + +CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/ckpt/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK} +OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}/predict.json + + +config_json="./ds_config.${MODEL_NAME}.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-7, + "eps": 1e-12, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 1e-5, + "warmup_max_lr": 1e-4, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": false, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 64 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-6 \ + --weight_decay 1e-2 \ + --warmup 0.01 \ + --num_labels 2 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy $STRATEGY \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif +SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bart-base_ocnli.sh b/fengshen/examples/classification/finetune_classification_bart-base_ocnli.sh new file mode 100644 index 
0000000000000000000000000000000000000000..6ef4886993eb2c1c8938180c940ece9bb156b73f --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bart-base_ocnli.sh @@ -0,0 +1,143 @@ +#!/bin/bash +#SBATCH --job-name=ocnli-bart-base # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions + +MODEL_NAME=bart-base + +TASK=ocnli +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=8 +VAL_BATCH_SIZE=32 +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$MODEL_NAME/ + + +CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/ckpt/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK} +OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}/predict.json + + +config_json="./ds_config.${MODEL_NAME}.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-7, + "eps": 1e-12, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 1e-8, + "warmup_max_lr": 1e-6, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": false, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-6 \ + --weight_decay 1e-2 \ + --warmup 0.01 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy $STRATEGY \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + 
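# 下面通过 srun + singularity 容器在 SLURM 集群上提交训练;
# 若本地已配好依赖环境,也可以改用下方被注释掉的 python3 $SCRIPT_PATH $options 直接运行。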
+DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif +SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_afqmc.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_afqmc.sh new file mode 100644 index 0000000000000000000000000000000000000000..9d36b627d6cc1b0a8de575138eec6a7529b31137 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_afqmc.sh @@ -0,0 +1,146 @@ +#!/bin/bash +#SBATCH --job-name=afqmc # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=4 # total number of tasks across all nodes +#SBATCH --cpus-per-task=20 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:4 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + +set -x -e +echo "START TIME: $(date)" + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions + +BERT_NAME=bert-3.9B + +TASK=afqmc +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=8 +VAL_BATCH_SIZE=32 +ZERO_STAGE=2 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/fintune/ckpt/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${BERT_NAME}-${TASK} +OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${BERT_NAME}-${TASK}/predict.json + + +config_json="./ds_config.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 1000, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": 2 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-7, + "eps": 1e-12, + "weight_decay": 1e-1 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 1e-8, + "warmup_max_lr": 1e-6, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-2 \ + --warmup 0.01 \ + --num_labels 2 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + 
--every_n_train_steps 0 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 67 \ + --gpus 4 \ + --num_nodes 1 \ + --strategy $STRATEGY \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --precision 16 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif +SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun -N 1 --job-name=afqmc --jobid=151522 --ntasks=4 --cpus-per-task=15 --gres=gpu:4 -o %x-%j.log singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_cmnli.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_cmnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..da10752cff77be9462d17cbb45882543a5e0ed48 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_cmnli.sh @@ -0,0 +1,161 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. 
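# 本脚本在 cmnli(三分类 NLI)上微调 bert-3.9B:ds_config 启用 ZeRO stage 3 并把优化器状态和参数 offload 到 CPU,
# 同时内置 Adam 优化器与 WarmupLR 调度器,训练端对应 --strategy deepspeed_stage_3。
# 注意下方虽然定义了 ZERO_STAGE=2,但并未被引用,实际生效的是配置里写死的 stage 3。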
+ + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions + +BERT_NAME=bert-3.9B + +TASK=cmnli +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=16 +VAL_BATCH_SIZE=56 +ZERO_STAGE=2 + + +ROOT_PATH=cognitive_comp +DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME/ +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json + + +config_json="./ds_config.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 6553600, + "stage3_prefetch_bucket_size": 5898240, + "stage3_param_persistence_threshold": 25600, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-6, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-3 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-8, + "warmup_max_lr": 1e-6 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 0.000001 \ + --weight_decay 0.001 \ + --warmup 0.001 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 2 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --gradient_clip_val 0.1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B 
/cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_iflytek.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_iflytek.sh new file mode 100644 index 0000000000000000000000000000000000000000..13e08efc318a60eabec72cd4357f8aa9dd558f44 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_iflytek.sh @@ -0,0 +1,158 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions + +BERT_NAME=bert-3.9B + +TASK=iflytek +TEXTA_NAME=sentence +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=16 +VAL_BATCH_SIZE=56 +ZERO_STAGE=2 + + +ROOT_PATH=cognitive_comp +DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/Fengshenbang-LM/fengshen/scripts/log/$TASK +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json + + +config_json="./ds_config.$SLURM_JOBID.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 6553600, + "stage3_prefetch_bucket_size": 5898240, + "stage3_param_persistence_threshold": 25600, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-5, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-5 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 0.00001 \ + --weight_decay 0.01 \ + --warmup 0.001 \ + --num_labels 119 \ + " + 
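# iflytek 为 119 类的单句分类任务,因此上面的 num_labels 设为 119;上方 DATA_ARGS 也只配置了 texta_name,没有 textb_name。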
+MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 2 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_ocnli.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_ocnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..8d3107931f88671d54d50325b8d469a12ee4e224 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_ocnli.sh @@ -0,0 +1,163 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions + +BERT_NAME=bert-1.3B + +TASK=ocnli +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=16 +VAL_BATCH_SIZE=56 +ZERO_STAGE=2 + + +ROOT_PATH=cognitive_comp +DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json + + +config_json="./ds_config.$SLURM_JOBID.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 6553600, + "stage3_prefetch_bucket_size": 5898240, + "stage3_param_persistence_threshold": 25600, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-6, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-6 + } + }, + "scheduler": { + "type": "WarmupLR", + 
"params":{ + "warmup_min_lr": 5e-8, + "warmup_max_lr": 1e-6, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 0.000001 \ + --weight_decay 0.001 \ + --warmup 0.001 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 2 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --gradient_clip_val 0.1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_tnews.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_tnews.sh new file mode 100644 index 0000000000000000000000000000000000000000..62a2349bd4ce90d20f9747fd570cb070ea60be2f --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_tnews.sh @@ -0,0 +1,161 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=4 # total number of tasks across all nodes +#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:4 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. 
+ + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions + +BERT_NAME=bert-3.9B + +TASK=tnews +TEXTA_NAME=sentence +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=16 +VAL_BATCH_SIZE=56 +ZERO_STAGE=2 + + +ROOT_PATH=cognitive_comp +DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME/nograd +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json + + +config_json="./ds_config.$SLURM_JOBID.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 6553600, + "stage3_prefetch_bucket_size": 5898240, + "stage3_param_persistence_threshold": 25600, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-5, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-8, + "warmup_max_lr": 1e-5, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 0.00001 \ + --weight_decay 0.01 \ + --warmup 0.001 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 4 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --gradient_clip_val 0.1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec 
--nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_bert-3.9B_wsc.sh b/fengshen/examples/classification/finetune_classification_bert-3.9B_wsc.sh new file mode 100644 index 0000000000000000000000000000000000000000..5d05662f1a2252de3bbd4fd9719ef8d3262d9c63 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_bert-3.9B_wsc.sh @@ -0,0 +1,158 @@ +#!/bin/bash +#SBATCH --job-name=slurm-test # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions + +BERT_NAME=bert-3.9B + +TASK=wsc +TEXTA_NAME=texta +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=16 +VAL_BATCH_SIZE=56 +ZERO_STAGE=2 + + +ROOT_PATH=cognitive_comp +DATA_DIR=/cognitive_comp/yangping/data/unidata/multichoice/mrc_multichoice_data/other/cluewsc2020/ +PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/ + + +CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/ +DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/Fengshenbang-LM/fengshen/scripts/log/$TASK +OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json + + +config_json="./ds_config.$SLURM_JOBID.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 6553600, + "stage3_prefetch_bucket_size": 5898240, + "stage3_param_persistence_threshold": 25600, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-5, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-5 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 0.00001 \ + --weight_decay 0.01 \ + --warmup 0.001 \ + --num_labels 2 \ + " + 
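# wsc(CLUEWSC2020)为二分类任务,训练数据规模很小,下方相应地把验证间隔与 checkpoint 保存间隔都设为每 10 个 step。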
+MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 10 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 2 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 10 \ + --default_root_dir $DEFAULT_ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif +SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py + +# python3 $SCRIPT_PATH $options +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_zen1-base_afqmc.sh b/fengshen/examples/classification/finetune_classification_zen1-base_afqmc.sh new file mode 100644 index 0000000000000000000000000000000000000000..845e93093cc6390db2c332c22e860ff88688a657 --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_zen1-base_afqmc.sh @@ -0,0 +1,151 @@ +#!/bin/bash +#SBATCH --job-name=afqmc-bart-base # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:2 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=fengshen-zen1 + +TASK=afqmc +TEXTA_NAME=sentence1 +TEXTB_NAME=sentence2 +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=8 +VAL_BATCH_SIZE=32 +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/ZEN_pretrain_base_v0.1.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + + +config_json="${ROOT_DIR}/ds_config.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-7, + "eps": 1e-12, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 1e-5, + "warmup_max_lr": 1e-4, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": false, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 64 \ + --texta_name $TEXTA_NAME \ + --textb_name $TEXTB_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-2 \ + --warmup 0.01 \ + --num_labels 2 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy $STRATEGY \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --default_root_dir $ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +# python3 $SCRIPT_PATH $options +source activate base +# srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/finetune_classification_zen1-base_tnews.sh b/fengshen/examples/classification/finetune_classification_zen1-base_tnews.sh new file mode 100644 index 0000000000000000000000000000000000000000..eaa50ddac4376c8e86000852da138d0d4779126d --- /dev/null +++ b/fengshen/examples/classification/finetune_classification_zen1-base_tnews.sh @@ -0,0 +1,150 @@ +#!/bin/bash +#SBATCH --job-name=afqmc-bart-base # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=2 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:2 # 
number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + +export CUDA_VISIBLE_DEVICES='5' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=fengshen-zen1 + +TASK=tnews +TEXTA_NAME=sentence +LABEL_NAME=label +ID_NAME=id + + +BATCH_SIZE=8 +VAL_BATCH_SIZE=32 +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/ZEN_pretrain_base_v0.1.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + + +config_json="${ROOT_DIR}/ds_config.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +# reduce_bucket_size: hidden_size*hidden_size +# stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size +# stage3_param_persistence_threshold: 10 * hidden_size + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": $BATCH_SIZE, + "steps_per_print": 100, + "gradient_clipping": 0.1, + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 2e-5, + "eps": 1e-12, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 2e-8, + "warmup_max_lr": 2e-5, + "warmup_num_steps": 400, + "warmup_type": "linear" + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test1.1.json \ + --train_batchsize $BATCH_SIZE \ + --valid_batchsize $VAL_BATCH_SIZE \ + --max_length 128 \ + --texta_name $TEXTA_NAME \ + --label_name $LABEL_NAME \ + --id_name $ID_NAME \ + " + +MODEL_ARGS="\ + --learning_rate 1e-5 \ + --weight_decay 1e-2 \ + --warmup 0.01 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + + +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy $STRATEGY \ + --gradient_clip_val 1.0 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 1.0 \ + --default_root_dir $ROOT_DIR \ + " + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --output_save_path $OUTPUT_PATH \ + --model_type $MODEL_NAME \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py + +# python3 $SCRIPT_PATH $options +source activate base +singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# 
/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/classification/readme.md b/fengshen/examples/classification/readme.md new file mode 100644 index 0000000000000000000000000000000000000000..b90ce5a946acf55a6530b3c8d010a5ec2642f6ae --- /dev/null +++ b/fengshen/examples/classification/readme.md @@ -0,0 +1,23 @@ +## 分类下游任务 + +在当前目录下,我们提供丰富的分类任务的示例,其中我们提供三个一键式运行的示例。 + +- demo_classification_afqmc_roberta.sh 使用DDP微调roberta +- demo_classification_afqmc_roberta_deepspeed.sh 结合deepspeed微调roberta,获得更快的运算速度 +- demo_classification_afqmc_erlangshen_offload.sh 仅需7G显存即可微调我们效果最好的二郎神系列模型 + +上述示例均采用AFQMC的数据集,关于数据集的介绍可以在[这里](https://www.cluebenchmarks.com/introduce.html)找到。 +同时我们处理过的数据文件已经放在Huggingface上,点击[这里](https://huggingface.co/datasets/IDEA-CCNL/AFQMC)直达源文件。 +仅需要按我们的格式稍微处理一下数据集,即可适配下游不同的分类任务。 +在脚本示例中,仅需要修改如下参数即可适配本地文件 +``` + --dataset_name IDEA-CCNL/AFQMC \ + +-------> 修改为 + + --data_dir $DATA_DIR \ # 数据目录 + --train_data train.json \ # 数据文件 + --valid_data dev.json \ + --test_data test.json \ + +``` \ No newline at end of file diff --git a/fengshen/examples/clip_finetune/clip_finetune_flickr.py b/fengshen/examples/clip_finetune/clip_finetune_flickr.py new file mode 100644 index 0000000000000000000000000000000000000000..9cac74d87e861cf0ffff64c9ca03330208db90c3 --- /dev/null +++ b/fengshen/examples/clip_finetune/clip_finetune_flickr.py @@ -0,0 +1,259 @@ +import sys +sys.path.append('../../') +from data.clip_dataloader.flickr import FlickrDataModule +import pytorch_lightning as pl +import numpy as np +import torch +from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts +import torch.nn.functional as F +import math +import copy +import argparse +from transformers import CLIPModel, BertForSequenceClassification + +class CLIPLightning(pl.LightningModule): + def __init__(self, model_name='ViT-B/32', minibatch_size=2): + """A lightning wrapper for a CLIP model as specified in the paper. + + Args: + model_name (str): A case sensitive visual model name. + config (dict): A dictionary containing the CLIP instantiation parameters. 
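+            minibatch_size (int): Size of the chunks the incoming batch is split into;
+                features are first encoded under no_grad and cached, then each chunk is
+                re-encoded with gradients so the full-batch contrastive loss stays
+                memory-friendly (see training_step below).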
+ """ + super().__init__() + + self.prepare_data_per_node = True + self.model_name = 'ViT-B/32' + # self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") + self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") # NOTE load from openAI + self.text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese") + self.minibatch_size = minibatch_size + self.isViT = 'ViT' in self.model_name + self.automatic_optimization = False + + # Training loss: https://github.com/openai/CLIP/issues/83 + # Mini-batching thanks to https://github.com/crowsonkb / https://twitter.com/RiversHaveWings + # Multi-GPU support: https://github.com/MicPie/clasp + + def training_step(self, train_batch, idx): + # get optimizers and scheduler + optimizer = self.optimizers() + + image, text, labels = train_batch + n = math.ceil(len(image) // self.minibatch_size) + image_mbs = torch.chunk(image, n) + text_mbs = torch.chunk(text, n) + + with torch.no_grad(): + ims = [F.normalize(self.clip_model.get_image_features(im), dim=1) for im in image_mbs] + txt = [F.normalize(self.text_encoder(t).logits, dim=1) for t in text_mbs] + # gather from all GPUs 这里的LOSS要把所有GPU的汇集起来一起算才对 + ims = self.all_gather(torch.cat(ims)) + txt = self.all_gather(torch.cat(txt)) + + if len(ims.shape) == 3: + ims = list(ims) + txt = list(txt) + else: + ims = [ims] + txt = [txt] + + image_logits = torch.cat(ims) @ torch.cat(txt).t() * self.clip_model.logit_scale.exp() + ground_truth = torch.arange(len(image_logits)).long().to(image_logits.device) + loss = (F.cross_entropy(image_logits, ground_truth) + + F.cross_entropy(image_logits.t(), ground_truth)).div(2) + acc_i = (torch.argmax(image_logits, 1) == ground_truth).sum() + acc_t = (torch.argmax(image_logits, 0) == ground_truth).sum() + self.log_dict({'loss': loss / len(ims), 'acc': (acc_i + acc_t) / 2 / len(image) / len(ims)}, prog_bar=True) + + if isinstance(optimizer, list): + optimizer = optimizer[0] + optimizer.zero_grad() + + # image loss + for j, mb in enumerate(image_mbs[:-1]): + # 最后一部分样本舍弃。(对齐的bug) + images_tmp = copy.deepcopy(ims) + images_tmp[self.global_rank][j * self.minibatch_size:(j+1)*self.minibatch_size] = \ + F.normalize(self.clip_model.get_image_features(mb), dim=1) + image_logits = torch.cat(images_tmp) @ torch.cat(txt).t() * self.clip_model.logit_scale.exp() + ground_truth = torch.arange(len(image_logits)).long().to(image_logits.device) + loss = (F.cross_entropy(image_logits, ground_truth) + F.cross_entropy(image_logits.t(), ground_truth))/2 + self.manual_backward(loss) + + # text loss + for j, mb in enumerate(text_mbs[:-1]): + text_tmp = copy.deepcopy(txt) + text_tmp[self.global_rank][j * self.minibatch_size:(j+1)*self.minibatch_size] = \ + F.normalize(self.text_encoder(mb).logits, dim=1) + image_logits = torch.cat(ims) @ torch.cat(text_tmp).t() * self.clip_model.logit_scale.exp() + loss = (F.cross_entropy(image_logits, ground_truth) + F.cross_entropy(image_logits.t(), ground_truth))/2 + self.manual_backward(loss) + + optimizer.step() + lr_scheduler = self.lr_schedulers() + lr_scheduler.step() + self.clip_model.logit_scale.data.clamp_(-np.log(100), np.log(100)) + + def validation_step(self, val_batch, idx): + image, text, labels = val_batch + img_embed = self.clip_model.get_image_features(image) + txt_embed = self.text_encoder(text).logits + # print(img_embed.shape) + image_norm = F.normalize(img_embed, dim=1) + text_norm = F.normalize(txt_embed, dim=1) + image_logits = image_norm @ text_norm.t() * 
self.clip_model.logit_scale.exp() + text_logits = text_norm @ image_norm.t() * self.clip_model.logit_scale.exp() + # print(image_logits.shape) + # image_logits, text_logits = self.forward(image, text) + ground_truth = torch.arange(len(image_logits)).long().to(image_logits.device) + loss = (F.cross_entropy(image_logits, ground_truth) + F.cross_entropy(text_logits, ground_truth)).div(2) + self.log('val_loss', loss, prog_bar=True) + return [image_norm, text_norm, labels] + + def validation_epoch_end(self, outputs): + image_features = torch.cat([x[0] for x in outputs]) + text_features = torch.cat([x[1] for x in outputs]) + labels = [label for x in outputs for label in x[2]] + print(image_features.shape, text_features.shape, len(labels)) + self.get_metrics(image_features, text_features, labels, 100) + + def test_step(self, test_batch, idx): + image, text, labels = test_batch + image_features = self.clip_model.get_image_features(image) + text_features = self.text_encoder(text).logits + image_features = image_features / image_features.norm(dim=1, keepdim=True) + text_features = text_features / text_features.norm(dim=1, keepdim=True) + return [image_features, text_features, labels] + + def test_epoch_end(self, outputs): + image_features = torch.cat([x[0] for x in outputs]) + text_features = torch.cat([x[1] for x in outputs]) + labels = [label for x in outputs for label in x[2]] + print(image_features.shape, text_features.shape, len(labels)) + self.get_metrics(image_features, text_features, labels, 100) + + def get_metrics(self, image_features, text_features, labels, logit_scale): + # 计算相似度,支持多个样本的情况(比如一个图片有多个caption) + # img2txt计算的时候要用到,因为一张图片可能对应多个文本。 + # txt2img计算的时候不需要(一般一个text只有一个对应图片) + # metrics = {} + logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu() + logits_per_text = logits_per_image.t().detach().cpu() + + logits = {"image_to_text": logits_per_image, "text_to_image": logits_per_text} + + label2idx = {} # 计算label到idx的映射。 + repeat_id = [] + for i, label in enumerate(labels): + if label not in label2idx: + label2idx[label] = [i] + else: + # 表示该index的标签出现过,记录这个index,后续算txt2img分数的时候,这些index的权值要降低。 + label2idx[label].append(i) + repeat_id.append(i) + # print(label2idx) # 标注了每个label的idx + + # print('repeat_id:', repeat_id) + ground_truth = [label2idx[label] for label in labels] + # print(ground_truth) + + for name, logit in logits.items(): + # print(name, logit.shape) + if name == 'text_to_image': + logit[:, repeat_id] -= 1e8 # 这部分的分数要降低。(重复出现的图片,直接忽略) + r1_stat, r5_stat, r10_stat = [], [], [] + ranking = torch.argsort(logit, descending=True) # index of the largest element to the smallest + # print(name, ranking[:, :10]) + for i, each_query in enumerate(ranking[:, :10]): + for j, q in enumerate(each_query): + if q in ground_truth[i]: + if j == 0: + r1_stat.append(1) + r5_stat.append(1) + r10_stat.append(1) + break + if j < 5: + r5_stat.append(1) + r10_stat.append(1) + break + if j < 10: + r10_stat.append(1) + break + print(f'{name} r1:{sum(r1_stat)/len(logit)}, r5:{sum(r5_stat)/len(logit)}, r10:{sum(r10_stat)/len(logit)}') + + def configure_optimizers(self): + lr = { + "RN50": 5e-4, + "RN101": 5e-4, + "RN50x4": 5e-4, + "RN50x16": 4e-4, + "RN50x64": 3.6e-4, + "ViT-B/32": 5e-4, + "ViT-B/16": 5e-4, + "ViT-L/14": 4e-4, + "ViT-L/14-336px": 2e-5 + }[self.model_name] + + optimizer = torch.optim.AdamW( + [{'params': self.clip_model.parameters()}, {'params': self.text_encoder.parameters()}], + lr=lr, + betas=( + 0.9, + 0.98 if self.isViT else 0.999 + ), + eps=1e-6 if 
self.isViT else 1e-8, + weight_decay=0.2 + ) + + # Source: https://github.com/openai/CLIP/issues/107 + # Use pip install 'git+https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup' + lr_scheduler = CosineAnnealingWarmRestarts( + optimizer, + T_0=2000 + ) + # CosineAnnealingWarmupRestarts + return {'optimizer': optimizer, 'lr_scheduler': lr_scheduler} + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + + # model_name + parser.add_argument('--model', type=str, + default="ViT-B/32", + help='model definition') + + # experiment setting + parser.add_argument('--batch_size', type=int, default=128) + parser.add_argument('--num_epoches', type=int, default=1) + parser.add_argument('--num_gpus', type=int, default=2) + + # dataset + parser.add_argument('--train_filename', type=str, + help='dir or csv file') + parser.add_argument('--train_root', type=str, + help='image root path') + parser.add_argument('--val_filename', type=str, + help='dir or csv file') + parser.add_argument('--val_root', type=str, + help='image root path') + parser.add_argument('--test_filename', type=str, + help='dir or csv file') + parser.add_argument('--test_root', type=str, + help='image root path') + parser.add_argument('--num_workers', type=int, default=0) + + # huggingface pretrain model 定义 + parser.add_argument('--pretrain_model', type=str, + default="openai/clip-vit-base-patch32", + help='defalut load from openai') # "wf-genius/TaiYi-CLIP-ViT-B-32" 是我训好的 NOTE + + args = parser.parse_args() + dm = FlickrDataModule(args) + + model = CLIPLightning(model_name=args.model, minibatch_size=args.batch_size//2) + trainer = pl.Trainer(gpus=args.num_gpus, precision=16, max_epochs=args.num_epoches) + trainer.test(model, dm) # zero-shot test + trainer.fit(model, dm) # finetune on train set + trainer.test(model, dm) # test again + diff --git a/fengshen/examples/clip_finetune/finetune_flickr.sh b/fengshen/examples/clip_finetune/finetune_flickr.sh new file mode 100644 index 0000000000000000000000000000000000000000..0e8f8c79decdbd4a070188fbfa976bd4b90d0d8d --- /dev/null +++ b/fengshen/examples/clip_finetune/finetune_flickr.sh @@ -0,0 +1,10 @@ +python clip_finetune_flickr.py --batch_size 512 \ +--num_gpus 1 \ +--num_workers 20 \ +--train_filename /shared_space/ccnl/mm_data/Flickr30k-CNA/train/flickr30k_cna_train.txt \ +--val_filename /shared_space/ccnl/mm_data/Flickr30k-CNA/val/flickr30k_cna_val.txt \ +--test_filename /shared_space/ccnl/mm_data/Flickr30k-CNA/test/flickr30k_cn_test.txt \ +--train_root /shared_space/ccnl/mm_data/Flickr30k-CNA/flickr30k/images \ +--val_root /shared_space/ccnl/mm_data/Flickr30k-CNA/flickr30k/images \ +--test_root /shared_space/ccnl/mm_data/Flickr30k-CNA/flickr30k/images \ + diff --git a/fengshen/examples/clue_sim/README.md b/fengshen/examples/clue_sim/README.md new file mode 100644 index 0000000000000000000000000000000000000000..41b5b72129491139fa6f21e7cc2ea07d027a60c3 --- /dev/null +++ b/fengshen/examples/clue_sim/README.md @@ -0,0 +1,90 @@ +# 二郎神打CLUE语义匹配榜 + - [比赛介绍](#比赛介绍) + - [clue语义匹配榜打榜思路](#clue语义匹配榜-打榜思路) + - [数据集介绍](#数据集介绍) + - [环境](#环境) + - [用法](#用法) + - [提交](#提交) + +## 比赛介绍 +- clue的语义匹配榜 (https://www.cluebenchmarks.com/sim.html) +- clue sim官方实例 (https://github.com/CLUEbenchmark/QBQTC) + +## clue语义匹配榜 打榜思路 + +- 直接使用fengshenbang的二郎神模型,就打到了前三。 +- 为了解决标签平衡问题,设计了一个交叉熵平滑滤波loss,就达到了第一。 + +详细的思路讲解在知乎: 链接 + +## 数据集介绍 + +QQ浏览器搜索相关性数据集(QBQTC,QQ Browser Query Title Corpus),是QQ浏览器搜索引擎目前针对大搜场景构建的一个融合了相关性、权威性、内容质量、 +时效性等维度标注的学习排序(LTR)数据集,广泛应用在搜索引擎业务场景中。 + 
+相关性的含义:0,相关程度差;1,有一定相关性;2,非常相关。数字越大相关性越高。 + +**数据量统计** + +| 训练集(train) | 验证集(dev) | 公开测试集(test_public) | 私有测试集(test) | +| :----: | :----: | :----: | :----: | +| 180,000| 20,000| 5,000 | >=10,0000| + +**评测指标** + +f1_score来自于sklearn.metrics,计算公式如下: +`F1 = 2 * (precision * recall) / (precision + recall)` + +## 环境 +* Python >= 3.6 +* torch == 1.8.0+cu111 +* transforms == 4.6.0 +* pytorch-lightning == 1.3.2 +* 一张GPU: A100 40G + +## 用法 + +fengshenbang的二郎神模型的使用是非常简单的。 + +该example下的代码和思想继承自fengshen/examples/classification/finetune_classification.py + +如果需要直接使用该python脚本,把官方的数据集处理成如下形式: + +```json +{"sentence1": "应届生实习", "sentence2": "实习生招聘-应届生求职网", "label": "1", "id": 0} +``` + +然后修改其中的fengshen/examples/classification/finetune_classification.sh的参数即可。 + +下面介绍该example的用法: + +### 创建文件夹 + +- dataset 文件夹,下载官方数据集后放进来就行 +- weights 文件夹,用以存放二郎神模型 +- submissions 文件夹,用以存放需要评测的json文件 + +### Train +```bash +python main.py \ + --mode 'Train' \ + --model_path './weights/Erlangshen-MegatronBert-1.3B-Similarity' \ + --model_name 'IDEA-CCNL/Erlangshen-MegatronBert-1.3B-Similarity' +``` + +加载最优的模型用以Test set的预测。 + +### Test +```bash +python main.py \ + --mode 'Test' \ + --predict_model_path 'your_model_path' \ + --model_path './weights/Erlangshen-MegatronBert-1.3B-Similarity' \ + --model_name 'IDEA-CCNL/Erlangshen-MegatronBert-1.3B-Similarity' +``` + +## 提交 + +在路径 ./submissions 下,找到 qbqtc_predict.json 并且提交到测评系统 + +注意:名字必须为qbqtc_predict.json diff --git a/fengshen/examples/clue_sim/__init__.py b/fengshen/examples/clue_sim/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fengshen/examples/clue_sim/finetune_clue_sim.py b/fengshen/examples/clue_sim/finetune_clue_sim.py new file mode 100644 index 0000000000000000000000000000000000000000..b05f6ea6ce67c35cd39dedd924df0b663fd5a8b2 --- /dev/null +++ b/fengshen/examples/clue_sim/finetune_clue_sim.py @@ -0,0 +1,325 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
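+
+# Data format note (inferred from CustomDataset below): files under ./dataset/ are
+# JSON-lines with the QBQTC fields this script reads, e.g. (text borrowed from the
+# README example):
+#   {"id": 0, "query": "应届生实习", "title": "实习生招聘-应届生求职网", "label": "1"}
+# "label" is "0"/"1"/"2" (higher = more relevant); it is not read when mode='test',
+# so the unlabeled test split only needs "id", "query" and "title".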
+import json +import os +from sklearn import metrics +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader, ConcatDataset +import pytorch_lightning as pl +from collections import defaultdict +from transformers import AutoConfig, AutoModel, get_cosine_schedule_with_warmup +from loss import FocalLoss, LabelSmoothingCorrectionCrossEntropy + + +class CustomDataset(Dataset): + def __init__(self, file, tokenizer, max_len, mode='no_test'): + self.tokenizer = tokenizer + self.max_len = max_len + self.mode = mode + + self.ex_list = [] + with open('./dataset/' + file, "r", encoding='utf-8') as f: + for line in f: + sample = json.loads(line) + query = sample["query"] + title = sample["title"] + id = int(sample["id"]) + if self.mode == 'no_test': + relevant = int(sample["label"]) + self.ex_list.append((query, title, relevant, id)) + else: + self.ex_list.append((query, title, id)) + + def __len__(self): + return len(self.ex_list) + + def __getitem__(self, index): + if self.mode == 'no_test': + query, title, relevant, id = self.ex_list[index] + else: + query, title, id = self.ex_list[index] + + inputs = self.tokenizer.encode_plus( + query, title, + truncation=True, + add_special_tokens=True, + max_length=self.max_len, + padding='max_length', + return_token_type_ids=True + ) + ids = inputs['input_ids'] + mask = inputs['attention_mask'] + token_type_ids = inputs["token_type_ids"] + if self.mode == 'no_test': + return { + 'ids': torch.tensor(ids, dtype=torch.long), + 'mask': torch.tensor(mask, dtype=torch.long), + 'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long), + 'targets': torch.tensor(relevant, dtype=torch.float), + 'id': torch.tensor(id, dtype=torch.long) + } + else: + return { + 'ids': torch.tensor(ids, dtype=torch.long), + 'mask': torch.tensor(mask, dtype=torch.long), + 'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long), + 'id': torch.tensor(id, dtype=torch.long) + } + + +class CustomDataModule(pl.LightningDataModule): + def __init__(self, args, tokenizer): + super().__init__() + self.args = args + self.tokenizer = tokenizer + self.max_len = self.args.max_seq_length + self.train_dataset = None + self.val_dataset = None + + def setup(self, stage): + data_path = "./dataset" + assert os.path.exists(os.path.join(data_path, 'train.json')) + assert os.path.exists(os.path.join(data_path, 'dev.json')) + assert os.path.exists(os.path.join(data_path, 'test_public.json')) + if stage == 'fit': + self.train_dataset = CustomDataset('train.json', self.tokenizer, self.max_len) + self.val_dataset = CustomDataset('dev.json', self.tokenizer, self.max_len) + self.test_dataset = CustomDataset('test_public.json', self.tokenizer, self.max_len) + elif stage == 'test': + self.test_dataset = CustomDataset('test_public.json', self.tokenizer, self.max_len) + + def train_dataloader(self): + full_dataset = ConcatDataset([self.train_dataset, self.val_dataset]) + train_dataloader = DataLoader( + full_dataset, + batch_size=self.args.batch_size, + num_workers=4, + shuffle=True, + pin_memory=True, + drop_last=True) + return train_dataloader + + def val_dataloader(self): + val_dataloader = DataLoader( + self.test_dataset, + batch_size=self.args.val_batch_size, + num_workers=4, + shuffle=False, + pin_memory=True, + drop_last=False) + return val_dataloader + + def test_dataloader(self): + test_dataloader = DataLoader( + self.test_dataset, + batch_size=self.args.val_batch_size, + num_workers=4, + shuffle=False, + pin_memory=True, + drop_last=False) + return test_dataloader + 
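+
+# Minimal wiring sketch (mirrors main.py in this example): build the tokenizer from
+# args.model_name, share it across splits via the datamodule, then hand datamodule and
+# model to a pytorch_lightning.Trainer.
+#
+#   tokenizer = BertTokenizer.from_pretrained(args.model_name, cache_dir=args.model_path)
+#   dm = CustomDataModule(args=args, tokenizer=tokenizer)
+#   model = CustomModel(args=args)
+#   trainer = pl.Trainer(max_epochs=args.num_epochs, precision=16)
+#   trainer.fit(model, dm)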
+ +class CustomModel(pl.LightningModule): + def __init__(self, args): + super().__init__() + self.args = args + self.model = self.args.model_name + self.cache_dir = self.args.model_path + self.scheduler = self.args.scheduler + self.step_scheduler_after = "batch" + self.optimizer = self.args.optimizer + self.pooler = self.args.use_original_pooler + self.category = self.args.cate_performance + self.loss_func = self.args.loss_function + + hidden_dropout_prob: float = 0.1 + layer_norm_eps: float = 1e-7 + + config = AutoConfig.from_pretrained(self.model, cache_dir=self.cache_dir) + + config.update( + { + "output_hidden_states": False, + "hidden_dropout_prob": hidden_dropout_prob, + "layer_norm_eps": layer_norm_eps, + } + ) + self.transformer = AutoModel.from_pretrained(self.model, config=config, cache_dir=self.cache_dir) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.linear = torch.nn.Linear(config.hidden_size, self.args.num_labels, bias=True) # 分三类 + + def configure_optimizers(self): + """Prepare optimizer and schedule""" + model = self.transformer + no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": 0.01, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + + optimizer_index = ['Adam', 'AdamW'].index(self.optimizer) + optimizer = [ + torch.optim.Adam(optimizer_grouped_parameters, lr=self.args.learning_rate), + torch.optim.AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate)][optimizer_index] + + scheduler_index = ['StepLR', 'CosineWarmup', 'CosineAnnealingLR'].index(self.scheduler) + scheduler = [ + torch.optim.lr_scheduler.StepLR(optimizer, step_size=self.args.warmup_step, + gamma=self.args.warmup_proportion), + get_cosine_schedule_with_warmup( + optimizer, + num_warmup_steps=int(self.args.warmup_proportion * self.total_steps), + num_training_steps=self.total_steps, + ), + torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=2e-06)][scheduler_index] + + scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1} + return [optimizer], [scheduler] + + def setup(self, stage=None): + if stage != "fit": + return + # calculate total steps + train_dataloader = self.trainer.datamodule.train_dataloader() + gpus = 0 if self.trainer.gpus is None else self.trainer.gpus + tb_size = self.args.batch_size * max(1, gpus) + ab_size = self.trainer.accumulate_grad_batches * float(self.trainer.max_epochs) + self.total_steps = (len(train_dataloader.dataset) // tb_size) // ab_size + + def loss(self, outputs, targets): + lossf_index = ['CE', 'Focal', 'LSCE_correction'].index(self.loss_func) + loss_fct = [nn.CrossEntropyLoss(), FocalLoss(), LabelSmoothingCorrectionCrossEntropy()][lossf_index] + loss = loss_fct(outputs, targets) + return loss + + def category_performance_measure(self, labels_right, labels_pred, num_label=3): + text_labels = [i for i in range(num_label)] + + TP = dict.fromkeys(text_labels, 0) # 预测正确的各个类的数目 + TP_FP = dict.fromkeys(text_labels, 0) # 测试数据集中各个类的数目 + TP_FN = dict.fromkeys(text_labels, 0) # 预测结果中各个类的数目 + + label_dict = defaultdict(list) + for num in range(num_label): + label_dict[num].append(str(num)) + + # 计算TP等数量 + for i in range(0, len(labels_right)): + TP_FP[labels_right[i]] += 1 + TP_FN[labels_pred[i]] += 1 + if labels_right[i] == labels_pred[i]: + TP[labels_right[i]] += 1 + + # 计算准确率P,召回率R,F1值 + results = [] + for 
key in TP_FP: + P = float(TP[key]) / float(TP_FP[key] + 1e-9) + R = float(TP[key]) / float(TP_FN[key] + 1e-9) + F1 = P * R * 2 / (P + R) if (P + R) != 0 else 0 + # results.append("%s:\t P:%f\t R:%f\t F1:%f" % (key, P, R, F1)) + results.append(F1) + return results + + def monitor_metrics(self, outputs, targets): + pred = torch.argmax(outputs, dim=1).cpu().numpy().tolist() + targets = targets.int().cpu().numpy().tolist() + if self.category: + category_results = self.category_performance_measure( + labels_right=targets, + labels_pred=pred, + num_label=self.args.num_labels + ) + return {"f1": category_results} + else: + f1_score = metrics.f1_score(targets, pred, average="macro") + return {"f1": f1_score} + + def forward(self, ids, mask, token_type_ids, labels): + transformer_out = self.transformer(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids) + + if self.pooler: + pooler_output = transformer_out.pooler_output + else: + sequence_output = transformer_out.last_hidden_state + pooler_output = torch.mean(sequence_output, dim=1) + logits = self.linear(self.dropout(pooler_output)) + + labels_hat = torch.argmax(logits, dim=1) + correct_count = torch.sum(labels == labels_hat) + return logits, correct_count + + def predict(self, ids, mask, token_type_ids): + transformer_out = self.transformer(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids) + pooler_output = transformer_out.pooler_output + logits = self.linear(self.dropout(pooler_output)) + logits = torch.argmax(logits, dim=1) + return logits + + def training_step(self, batch, batch_idx): + ids, mask, token_type_ids, labels = batch['ids'], batch['mask'], batch['token_type_ids'], batch['targets'] + logits, correct_count = self.forward(ids, mask, token_type_ids, labels) + loss = self.loss(logits, labels.long()) + f1 = self.monitor_metrics(logits, labels)["f1"] + self.log("train_loss", loss, logger=True, prog_bar=True) + self.log('train_acc', correct_count.float() / len(labels), logger=True, prog_bar=True) + if self.category: + self.log("train_f1_key0", f1[0], logger=True, prog_bar=True) + self.log("train_f1_key1", f1[1], logger=True, prog_bar=True) + self.log("train_f1_key2", f1[2], logger=True, prog_bar=True) + else: + self.log("train_f1", f1, logger=True, prog_bar=True) + return loss + + def validation_step(self, batch, batch_idx): + ids, mask, token_type_ids, labels = batch['ids'], batch['mask'], batch['token_type_ids'], batch['targets'] + logits, correct_count = self.forward(ids, mask, token_type_ids, labels) + loss = self.loss(logits, labels.long()) + f1 = self.monitor_metrics(logits, labels)["f1"] + self.log("val_loss", loss, logger=True, prog_bar=True) + self.log("val_acc", correct_count.float() / len(labels), logger=True, prog_bar=True) + if self.category: + self.log("val_f1_key0", f1[0], logger=True, prog_bar=True) + self.log("val_f1_key1", f1[1], logger=True, prog_bar=True) + self.log("val_f1_key2", f1[2], logger=True, prog_bar=True) + else: + self.log("val_f1", f1, logger=True, prog_bar=True) + + def test_step(self, batch, batch_idx): + ids, mask, token_type_ids, labels = batch['ids'], batch['mask'], batch['token_type_ids'], batch['targets'] + logits, correct_count = self.forward(ids, mask, token_type_ids, labels) + loss = self.loss(logits, labels.long()) + f1 = self.monitor_metrics(logits, labels)["f1"] + self.log("test_loss", loss, logger=True, prog_bar=True) + self.log("test_acc", correct_count.float() / len(labels), logger=True, prog_bar=True) + if self.category: + self.log("test_f1_key0", f1[0], 
logger=True, prog_bar=True) + self.log("test_f1_key1", f1[1], logger=True, prog_bar=True) + self.log("test_f1_key2", f1[2], logger=True, prog_bar=True) + else: + self.log("test_f1", f1, logger=True, prog_bar=True) + return {"test_loss": loss, "logits": logits, "labels": labels} + + def predict_step(self, batch, batch_idx, dataloader_idx): + ids, mask, token_type_ids, id = batch['ids'], batch['mask'], batch['token_type_ids'], batch['id'] + logits = self.predict(ids, mask, token_type_ids) + return {'id': id.cpu().numpy().tolist(), 'logits': logits.cpu().numpy().tolist()} diff --git a/fengshen/examples/clue_sim/loss.py b/fengshen/examples/clue_sim/loss.py new file mode 100644 index 0000000000000000000000000000000000000000..537e2347f65aa952b0eb852c23a39901b0fef52e --- /dev/null +++ b/fengshen/examples/clue_sim/loss.py @@ -0,0 +1,77 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import torch +from torch.nn import functional as F + + +class FocalLoss(torch.nn.Module): + """Multi-class Focal loss implementation""" + + def __init__(self, gamma=2, weight=None, ignore_index=-100): + super(FocalLoss, self).__init__() + self.gamma = gamma + self.weight = weight + self.ignore_index = ignore_index + + def forward(self, input, target): + """ + input: [N, C] + target: [N, ] + """ + logpt = F.log_softmax(input, dim=1) + pt = torch.exp(logpt) + logpt = (1-pt)**self.gamma * logpt + loss = F.nll_loss(logpt, target, self.weight, ignore_index=self.ignore_index) + return loss + +# 交叉熵平滑滤波 防止过拟合 + + +class LabelSmoothingCorrectionCrossEntropy(torch.nn.Module): + def __init__(self, eps=0.1, reduction='mean', ignore_index=-100): + super(LabelSmoothingCorrectionCrossEntropy, self).__init__() + self.eps = eps + self.reduction = reduction + self.ignore_index = ignore_index + + def forward(self, output, target): + c = output.size()[-1] + log_preds = F.log_softmax(output, dim=-1) + if self.reduction == 'sum': + loss = -log_preds.sum() + else: + loss = -log_preds.sum(dim=-1) + if self.reduction == 'mean': + loss = loss.mean() + + # task specific + labels_hat = torch.argmax(output, dim=1) + lt_sum = labels_hat + target + abs_lt_sub = abs(labels_hat - target) + correction_loss = 0 + for i in range(c): + if lt_sum[i] == 0: + pass + elif lt_sum[i] == 1: + if abs_lt_sub[i] == 1: + pass + else: + correction_loss -= self.eps*(0.5945275813408382) + else: + correction_loss += self.eps*(1/0.32447699714575207) + correction_loss /= c + # print(correction_loss) + return loss*self.eps/c + (1-self.eps) * \ + F.nll_loss(log_preds, target, reduction=self.reduction, ignore_index=self.ignore_index) + correction_loss diff --git a/fengshen/examples/clue_sim/main.py b/fengshen/examples/clue_sim/main.py new file mode 100644 index 0000000000000000000000000000000000000000..91c5a732d8cb1a683aa34a3b3f7c158861cd4492 --- /dev/null +++ b/fengshen/examples/clue_sim/main.py @@ -0,0 +1,133 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. 
+ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import jsonlines +import torch +import pytorch_lightning as pl +from transformers import AutoTokenizer, BertTokenizer +from train_func import CustomDataset, CustomDataModule, CustomModel +import argparse +import os +import gpustat + +if __name__ == '__main__': + my_parser = argparse.ArgumentParser() + my_parser.add_argument( + "--model_path", default="./weights/Erlangshen-MegatronBert-1.3B-Similarity", type=str, required=False) + my_parser.add_argument( + "--model_name", default="IDEA-CCNL/Erlangshen-MegatronBert-1.3B-Similarity", type=str, required=False) + my_parser.add_argument("--max_seq_length", default=64, type=int, required=False) + my_parser.add_argument("--batch_size", default=32, type=int, required=False) + my_parser.add_argument("--val_batch_size", default=64, type=int, required=False) + my_parser.add_argument("--num_epochs", default=10, type=int, required=False) + my_parser.add_argument("--learning_rate", default=4e-5, type=float, required=False) + my_parser.add_argument("--warmup_proportion", default=0.2, type=int, required=False) + my_parser.add_argument("--warmup_step", default=2, type=int, required=False) + my_parser.add_argument("--num_labels", default=3, type=int, required=False) + my_parser.add_argument("--cate_performance", default=False, type=bool, required=False) + my_parser.add_argument("--use_original_pooler", default=True, type=bool, required=False) + my_parser.add_argument("--model_output_path", default='./pl_model', type=str, required=False) + my_parser.add_argument("--mode", type=str, choices=['Train', 'Test'], required=True) + my_parser.add_argument("--predict_model_path", default='./pl_model/', type=str, required=False) + my_parser.add_argument("--test_output_path", default='./submissions', type=str, required=False) + my_parser.add_argument("--optimizer", default='AdamW', type=str, required=False) # ['Adam', 'AdamW'] + # ['StepLR', 'CosineWarmup', 'CosineAnnealingLR'] + my_parser.add_argument("--scheduler", default='CosineWarmup', type=str, required=False) + my_parser.add_argument("--loss_function", default='LSCE_correction', type=str, + required=False) # ['CE', 'Focal', 'LSCE_correction'] + + args = my_parser.parse_args() + + print(args) + gpustat.print_gpustat() + + if 'Erlangshen' in args.model_name: + tokenizer = BertTokenizer.from_pretrained(args.model_name, cache_dir=args.model_path) + else: + tokenizer = AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.model_path) + + seed = 1919 + pl.seed_everything(seed) + + dm = CustomDataModule( + args=args, + tokenizer=tokenizer, + ) + + metric_index = 2 + checkpoint = pl.callbacks.ModelCheckpoint( + save_top_k=1, + verbose=True, + monitor=['val_loss', 'val_acc', 'val_f1'][metric_index], + mode=['min', 'max', 'max'][metric_index] + ) + + lr_monitor = pl.callbacks.LearningRateMonitor(logging_interval="step") + callbacks = [checkpoint, lr_monitor] + + logger = pl.loggers.TensorBoardLogger(save_dir=os.getcwd(), + name='lightning_logs/' + 
args.model_name.split('/')[-1]), + + trainer = pl.Trainer( + progress_bar_refresh_rate=50, + logger=logger, + gpus=-1 if torch.cuda.is_available() else None, + amp_backend='native', + amp_level='O2', + precision=16, + callbacks=callbacks, + gradient_clip_val=1.0, + max_epochs=args.num_epochs, + # accelerator='ddp', + # plugins='ddp_sharded', + ) + + if args.mode == 'Train': + print('Only Train') + model = CustomModel( + args=args, + ) + trainer.fit(model, dm) + + # Predict test, save results to json + if args.mode == 'Test': + print('Only Test') + test_loader = torch.utils.data.DataLoader( + CustomDataset('test.json', tokenizer, args.max_seq_length, 'test'), + batch_size=args.val_batch_size, + num_workers=4, + shuffle=False, + pin_memory=True, + drop_last=False + ) + + model = CustomModel(args=args).load_from_checkpoint(args.predict_model_path, args=args) + + predict_results = trainer.predict(model, test_loader, return_predictions=True) + + path = os.path.join( + args.test_output_path, + args.model_name.split('/')[-1].replace('-', '_')) + file_path = os.path.join(path, 'qbqtc_predict.json') + + if not os.path.exists(path): + os.makedirs(path) + if os.path.exists(file_path): + print('Json文件已存在, 将用本次结果替换') + + with jsonlines.open(file_path, 'w') as jsonf: + for predict_res in predict_results: + for i, p in zip(predict_res['id'], predict_res['logits']): + jsonf.write({"id": i, "label": str(p)}) + print('Json saved:', file_path) diff --git a/fengshen/examples/hubert/pretrain_hubert.py b/fengshen/examples/hubert/pretrain_hubert.py new file mode 100644 index 0000000000000000000000000000000000000000..6506364b9498c5b994c085e1a5342082283ef62b --- /dev/null +++ b/fengshen/examples/hubert/pretrain_hubert.py @@ -0,0 +1,287 @@ +import fengshen.data.hubert.hubert_dataset as datasets +from fengshen.data.universal_datamodule import UniversalDataModule +from transformers import HubertConfig, HubertModel +# from transformers.models.hubert.modeling_hubert import _compute_mask_indices +import argparse +from fairseq.data import Dictionary +from pytorch_lightning import ( + LightningModule, + Trainer, + loggers, +) +from pytorch_lightning.callbacks import LearningRateMonitor +import torch +import os +import torch.nn.functional as F +import torch.nn as nn + + +class LabelEncoder(object): + def __init__(self, dictionary: Dictionary): + self.dictionary = dictionary + + def __call__(self, label: str): + return self.dictionary.encode_line( + label, + append_eos=False, + add_if_not_exist=False, + ) + + +class HubertPretrainDataLoader(): + def __init__(self, args): + self.cfg = args + self.dictionaries = self.load_dictionaries() + self.load_datasets = {} + + # TODO 改成HuggingFace Tokenizer + def load_dictionaries(self): + label_dir = self.cfg.data if self.cfg.label_dir is None else self.cfg.label_dir + dictionaries = [ + Dictionary.load(f"{label_dir}/dict.{label}.txt") + for label in self.cfg.labels + ] + return dictionaries + + def get_label_dir(self): + if self.cfg.label_dir is None: + return self.cfg.data + return self.cfg.label_dir + + @property + def datasets(self): + return self.load_datasets + + def load_dataset(self, split: str, **kwargs): + manifest = f"{self.cfg.data}/{split}.tsv" + dicts = self.dictionaries + pad_list = [dict.pad() for dict in dicts] + eos_list = [dict.eos() for dict in dicts] + procs = [LabelEncoder(dict) for dict in dicts] + paths = [f"{self.get_label_dir()}/{split}.{lb}" for lb in self.cfg.labels] + + # hubert v1: pad_audio=True, random_crop=False; + self.load_datasets[split] = 
datasets.HubertDataset( + manifest, + sample_rate=self.cfg.sample_rate, + label_paths=paths, + label_rates=self.cfg.label_rate, + pad_list=pad_list, + eos_list=eos_list, + label_processors=procs, + max_keep_sample_size=self.cfg.max_keep_size, + min_keep_sample_size=self.cfg.min_sample_size, + max_sample_size=self.cfg.max_sample_size, + pad_audio=self.cfg.pad_audio, + normalize=self.cfg.normalize, + store_labels=False, + random_crop=self.cfg.random_crop, + single_target=self.cfg.single_target, + ) + + +def perpare_data(args): + loader = HubertPretrainDataLoader(args) + loader.load_dataset('train') + loader.load_dataset('valid') + return loader + + +class HubertLightning(LightningModule): + @staticmethod + def add_module_specific_args(parent_parser): + parser = parent_parser.add_argument_group('HuBert Lightning') + parser.add_argument('--pred_masked_weight', type=float, default=1.0) + parser.add_argument('--logit_temp', type=float, default=1.0) + parser.add_argument('--loss_weights', type=float, nargs='+') + # parser.add_argument('--mask_prob', type=float, default=0.65) + # parser.add_argument('--mask_length', type=int, default=10) + # parser.add_argument('--mask_selection', type=str, default='static', + # choice=["static", "uniform", "normal", "poisson"]) + # parser.add_argument('--mask_other', type=float, default=0) + # parser.add_argument('--no_mask_overlap', type=bool, default=False) + # parser.add_argument('--mask_min_space', type=int, default=1) + return parent_parser + + def __init__(self, args, loader, ** kwargs) -> None: + super().__init__() + self.save_hyperparameters(args) + config = HubertConfig.from_pretrained(args.model_path) + self.config = config + self.model = HubertModel(config=config) + self.num_classes = [len(d) for d in loader.dictionaries] + self.label_embs_concat = nn.Parameter( + torch.FloatTensor(sum(self.num_classes), self.config.conv_dim[-1] // 2) + ) + self.final_proj = nn.Linear( + self.config.hidden_size, self.config.conv_dim[-1] // 2 * len(loader.dictionaries) + ) + nn.init.uniform_(self.label_embs_concat) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + def compute_nce(self, x, pos, negs): + neg_is_pos = (pos == negs).all(-1) + pos = pos.unsqueeze(0) + targets = torch.cat([pos, negs], dim=0) + + logits = torch.cosine_similarity(x.float(), targets.float(), dim=-1).type_as(x) + logits /= self.hparams.logit_temp + if neg_is_pos.any(): + logits[1:][neg_is_pos] = float("-inf") + logits = logits.transpose(0, 1) # (num_x, num_cls+1) + return logits + + def forward(self, **batch): + + target_list = batch['target_list'] + padding_mask = batch['net_input']['padding_mask'] + input_values = batch['net_input']['source'] + output = self.model(input_values=input_values, + attention_mask=padding_mask, + target_list=target_list, + mask_time_indices=None, + return_dict=False) + + def 
compute_pred(proj_x, target, label_embs): + # compute logits for the i-th label set + y = torch.index_select(label_embs, 0, target.long()) + negs = label_embs.unsqueeze(1).expand(-1, proj_x.size(0), -1) + # proj_x: (S, D) + # y: (S, D) + # negs: (Neg, S, D) + return self.compute_nce(proj_x, y, negs) + + label_embs_list = self.label_embs_concat.split(self.num_classes, 0) + + x, extra_losses, target_list, mask_indices, padding_mask = output[ + 0], output[-4], output[-3], output[-2], output[-1] + + masked_indices = torch.logical_and(~padding_mask, mask_indices) + proj_x_m = self.final_proj(x[masked_indices]) + proj_x_m_list = proj_x_m.chunk(len(target_list), dim=-1) + logp_m_list = [ + compute_pred(proj_x_m, t[masked_indices], label_embs_list[i]) + for i, (proj_x_m, t) in enumerate(zip(proj_x_m_list, target_list)) + ] + + targ_m_list = [x.new_zeros(x.size(0), dtype=torch.long) for x in logp_m_list] + + loss = 0.0 + loss_m_list = [] + + for i, (logp_m, targ_m) in enumerate(zip(logp_m_list, targ_m_list)): + loss_m = F.cross_entropy(logp_m, targ_m) + loss_m_list.append(loss_m) + self.log(f"loss_m_{i}", loss_m.detach().item()) + + loss += self.hparams.pred_masked_weight * sum(loss_m_list) + + loss_weights = self.hparams.loss_weights + if loss_weights is not None: + if torch.is_tensor(extra_losses): + extra_losses = [extra_losses] + names = ['extra'] + if len(loss_weights) == 1 and len(extra_losses) != 1: + loss_weights = [loss_weights[0]] * len(extra_losses) + assert len(extra_losses) == len( + loss_weights + ), f"{len(extra_losses)}, {len(loss_weights)}" + for p, n, coef in zip(extra_losses, names, loss_weights): + if coef != 0 and p is not None: + p = coef * p.float() + loss += p + self.log(f"loss_{n}", p.item()) + + return {'loss': loss} + + def training_step(self, batch, batch_idx): + output = self(**batch) + self.log('train_loss', output['loss']) + return output + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float()) / y_true.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self(**batch) + # self.log('val_loss', output.loss, sync_dist=True) + # acc = self.comput_metrix(output.logits, batch['labels']) + # self.log('val_acc', acc, sync_dist=True) + return output + + def on_save_checkpoint(self, checkpoint) -> None: + # Save the current loop info in the mid of epoch + # if you lightning <= 1.6.0 uncomment the line below + # checkpoint['loops'] = self.trainer.checkpoint_connector._get_loops_state_dict() + if self.trainer.global_rank == 0: + self.model.save_pretrained(os.path.join( + self.trainer.checkpoint_callback.dirpath, + 'hf_pretrained_epoch{}_step{}'.format(self.trainer.current_epoch, self.trainer.global_step))) + + def on_load_checkpoint(self, checkpoint) -> None: + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +if __name__ == '__main__': + args_parser = argparse.ArgumentParser() + from fengshen.utils import UniversalCheckpoint + from fengshen.models.model_utils import add_module_args + args_parser = add_module_args(args_parser) + args_parser = datasets.add_data_specific_args(args_parser) + args_parser = UniversalDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) 
+ args_parser = HubertLightning.add_module_specific_args(args_parser) + args_parser = UniversalCheckpoint.add_argparse_args(args_parser) + args_parser.add_argument('--ckpt_path', type=str, ) + args = args_parser.parse_args() + + data_module = UniversalDataModule(args=args, tokenizer=None, collate_fn=None) + data_loader = perpare_data(args) + data_module.datasets = data_loader.datasets + module = HubertLightning(args, loader=data_loader) + + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'logs/'), + name=os.path.basename(os.path.dirname(args.model_path))) + checkpoint_callback = UniversalCheckpoint(args).callbacks + + if args.ckpt_path is not None and \ + not os.path.exists(args.ckpt_path): + print('--------warning no checkpoint found--------, remove args') + args.ckpt_path = None + + trainer = Trainer.from_argparse_args(args, + logger=logger, + callbacks=[ + lr_monitor, + checkpoint_callback]) + + trainer.fit(module, data_module, ckpt_path=args.ckpt_path) diff --git a/fengshen/examples/hubert/pretrain_hubert_base.sh b/fengshen/examples/hubert/pretrain_hubert_base.sh new file mode 100644 index 0000000000000000000000000000000000000000..11e5ddf38361d51910c35b02f10b7e285ab3f0fb --- /dev/null +++ b/fengshen/examples/hubert/pretrain_hubert_base.sh @@ -0,0 +1,120 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_bart # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) +#SBATCH -x dgx050 + +MODEL_NAME=hubert-base-ls960 +config_json="./$MODEL_NAME.ds_config.json" +export MASTER_PORT=29503 +MICRO_BATCH_SIZE=8 +ZERO_STAGE=1 + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "tensorboard": { + "enabled": true, + "output_path": "/data/training_model/fengshen-${MODEL_NAME}/ds-tb-logs", + "job_name": "${MODEL_NAME}" + }, + "#flops_profiler": { + "enabled": true, + "profile_step": 200, + "detailed": true, + "output_file": null + }, + "steps_per_print": 100, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE, + "zero_allow_untested_optimizer": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/home/gaoxinyu/torch_extendsions + +DATA_DIR=/data/common_data/librispeech_tsv/datas +LABELS_DIR=/data/common_data/librispeech_tsv/labels + +DATA_ARGS="\ + --dataloader_workers 2 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize 32 \ + --test_batchsize 8 \ + --val_datasets_field valid \ + --test_datasets_field valid \ + --sampler_type random \ + --data ${DATA_DIR} \ + --label_dir ${LABELS_DIR} \ + --labels km \ + --label_rate 100 \ + --max_sample_size 250000 \ + --min_sample_size 32000 \ + --pad_audio False \ + --random_crop True \ + --normalize False \ + " + +MODEL_ARGS="\ + --model_path /data/pretrained_model/$MODEL_NAME/ \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --warmup_ratio 0.01 \ + --pred_masked_weight 1.0 \ + --loss_weights 10 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor train_loss \ + 
--save_top_k 0 \ + --mode min \ + --every_n_train_steps 10000 \ + --dirpath /data/training_model/ckpt/fengshen-$MODEL_NAME \ + --filename model-{step:02d}-{train_loss:.4f} \ + --every_n_epochs 0 \ + --save_last \ + --not_save_on_train_epoch_end \ + " + +# deepspeed_stage_${ZERO_STAGE} \ +TRAINER_ARGS="\ + --gradient_clip_val 1.0 \ + --max_epochs 10 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --log_every_n_steps 100 \ + --val_check_interval 500 \ + --limit_val_batches 10 \ + --accumulate_grad_batches 1 \ + --precision 16 \ + --ckpt_path /data/training_model/ckpt/fengshen-${MODEL_NAME}/last.ckpt \ + --default_root_dir /data/training_model/fengshen-$MODEL_NAME \ + " + + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +export SCRIPT_PATH=pretrain_hubert.py + +eval python3 -m debugpy --listen localhost:53005 --wait-for-client $SCRIPT_PATH $options diff --git a/fengshen/examples/longformer/README.md b/fengshen/examples/longformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ef4706898b87d2f10eff5df2db24ae3a182ce673 --- /dev/null +++ b/fengshen/examples/longformer/README.md @@ -0,0 +1,34 @@ +# longformer model (Chinese),one model of [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM). +We modify the original position code of longformer to rotational position coding,and on the basis of [chinese_roformer_L-12_H-768_A-12.zip](https://github.com/ZhuiyiTechnology/roformer), use 180G of data to continue training + +## Usage +There is no structure of Longformer-base in [Transformers](https://github.com/huggingface/transformers), you can run follow code to get structure of longformer from [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM) + + ```shell + git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git + ``` + +### Load Model +```python +from fengshen import LongformerModel +from fengshen import LongformerConfig +from transformers import BertTokenizer + +tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M") +config = LongformerConfig.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M") +model = LongformerModel.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M") +``` + + + +## Citation +If you find the resource is useful, please cite the following website in your paper. 
+ +``` +@misc{Fengshenbang-LM, + title={Fengshenbang-LM}, + author={IDEA-CCNL}, + year={2021}, + howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}, +} +``` diff --git a/fengshen/examples/mt5_summary/fastapi_mt5_summary.py b/fengshen/examples/mt5_summary/fastapi_mt5_summary.py new file mode 100644 index 0000000000000000000000000000000000000000..44adaf8f5855260c683c0bcfe7986ffccc9f25c4 --- /dev/null +++ b/fengshen/examples/mt5_summary/fastapi_mt5_summary.py @@ -0,0 +1,93 @@ +import os +import sys +import uvicorn +import torch +from fastapi import Body, FastAPI +from transformers import T5Tokenizer, MT5ForConditionalGeneration +import pytorch_lightning as pl +sys.path.append(os.path.abspath(os.path.join( + os.path.dirname(__file__), os.path.pardir))) +os.environ["CUDA_VISIBLE_DEVICES"] = '5' +os.environ["MASTER_ADDR"] = '127.0.0.1' +os.environ["MASTER_PORT"] = '6000' +device = "cuda:0" if torch.cuda.is_available() else "cpu" +print('device') +pretrain_model_path = '/cognitive_comp/ganruyi/hf_models/google/mt5-large' +# pretrain_model_path = 'google/mt5-small' +model_path = '/cognitive_comp/ganruyi/fengshen/mt5_large_summary/ckpt/epoch-0-last.ckpt' +tokenizer = T5Tokenizer.from_pretrained(pretrain_model_path) +print('load tokenizer') + + +class MT5FinetuneSummary(pl.LightningModule): + + def __init__(self): + super().__init__() + self.model = MT5ForConditionalGeneration.from_pretrained(pretrain_model_path) + + +model = MT5FinetuneSummary.load_from_checkpoint(model_path) +print('load checkpoint') +model.to(device) +model.eval() +app = FastAPI() +print('server start') + +# def flask_gen(text: str, level: float = 0.9, n_sample: int = 5, length: int = 32, is_beam_search=False): + + +@app.post('/mt5_summary') +async def flask_gen(text: str = Body('', title='原文', embed=True), + n_sample: int = 5, length: int = 32, is_beam_search=False): + if len(text) > 128: + text = text[:128] + text = 'summary:'+text + print(text) + # inputs = tokenizer(text, return_tensors='pt') + inputs = tokenizer.encode_plus( + text, max_length=128, padding='max_length', truncation=True, return_tensors='pt') + # print(inputs) + if is_beam_search: + generated_ids = model.model.generate( + input_ids=inputs['input_ids'].to(device), + attention_mask=inputs['attention_mask'].to(device), + max_length=length, + num_beams=n_sample, + repetition_penalty=2.5, + length_penalty=1.0, + early_stopping=True, + num_return_sequences=n_sample + ) + else: + generated_ids = model.model.generate( + input_ids=inputs['input_ids'].to(device), + attention_mask=inputs['attention_mask'].to(device), + max_length=length, + do_sample=True, + temperature=1.0, + top_p=1.0, + repetition_penalty=2.5, + # early_stopping=True, + num_return_sequences=n_sample + ) + result = [] + # print(tokenizer.all_special_tokens) + for sample in generated_ids: + preds = [tokenizer.decode(sample, skip_special_tokens=True, + clean_up_tokenization_spaces=True)] + preds = ''.join(preds) + # print(preds) + result.append(preds) + return result + + +if __name__ == '__main__': + uvicorn.run(app, host="0.0.0.0", port=6607, log_level="debug") +# # article = "日前,方舟子发文直指林志颖旗下爱碧丽推销假保健品,引起哗然。调查发现, +# 爱碧丽没有自己的生产加工厂。其胶原蛋白饮品无核心研发,全部代工生产。号称有“逆生长”功效的爱碧丽“梦幻奇迹限量组”售价>高达1080元,实际成本仅为每瓶4元!" +# article = '''在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌! +# 今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。 +# 第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。 +# 第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油! 
+# 在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!''' + # flask_gen(article, length=30) diff --git a/fengshen/examples/mt5_summary/mt5_summary.py b/fengshen/examples/mt5_summary/mt5_summary.py new file mode 100644 index 0000000000000000000000000000000000000000..de564026ae7a32873cc39515f421adfb9d7e4568 --- /dev/null +++ b/fengshen/examples/mt5_summary/mt5_summary.py @@ -0,0 +1,233 @@ +from fengshen.data.task_dataloader.task_datasets import LCSTSDataModel +from transformers import T5Tokenizer, MT5ForConditionalGeneration +from transformers.optimization import get_linear_schedule_with_warmup +from pytorch_lightning import Trainer, loggers +from pytorch_lightning.callbacks import ModelCheckpoint +from transformers import AutoTokenizer +import pytorch_lightning as pl +import json +import argparse +import torch +import os +import sys +sys.path.append('./') + +# os.environ["CUDA_VISIBLE_DEVICES"] = '4,5,6,7' + + +def test(): + tokenizer = T5Tokenizer.from_pretrained("google/mt5-small") + article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien." + summary = "Weiter Verhandlung in Syrien." + article = "日前,方舟子发文直指林志颖旗下爱碧丽推销假保健品,引起哗然。调查发现,爱碧丽没有自己的生产加工厂。 \ + 其胶原蛋白饮品无核心研发,全部代工生产。号称有“逆生长”功效的爱碧丽“梦幻奇迹限量组”售价>高达1080元,实际成本仅为每瓶4元!" + summary = "林志颖公司疑涉虚假营销无厂房无研发" + inputs = tokenizer(article, rturn_tensors="pt") + tt = tokenizer.encode_plus(summary, max_length=64, + padding='max_length', truncation='longest_first') + print('tt:', tt) + print('inputs:', inputs) + with tokenizer.as_target_tokenizer(): + labels = tokenizer(summary, return_tensors="pt") + print('labels:', labels) + print('origin labels:', tokenizer.decode(labels['input_ids'][0])) + + model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small") + # outputs = model(input_ids=inputs["input_ids"], labels=labels["input_ids"]) + # print(outputs.keys()) + + # evaluation + model.eval() + generated_ids = model.generate( + input_ids=inputs['input_ids'], + attention_mask=inputs['attention_mask'], + max_length=150, + num_beams=2, + repetition_penalty=2.5, + length_penalty=1.0, + early_stopping=True + ) + preds = [tokenizer.decode(g, skip_special_tokens=True, + clean_up_tokenization_spaces=True) for g in generated_ids] + print(preds) + + +class MT5FinetuneSummaryModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./ckpt/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + parser.add_argument('--save_last', action='store_true', default=True) + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename, + save_last=args.save_last) + + +class MT5FinetuneSummary(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--learning_rate', default=1e-4, type=float) + 
parser.add_argument('--weight_decay', default=0.1, type=float) + parser.add_argument('--warmup', default=0.01, type=float) + return parent_args + + def __init__(self, args, num_data): + super().__init__() + self.args = args + self.num_data = num_data + print('num_data:', num_data) + self.model = MT5ForConditionalGeneration.from_pretrained(args.pretrained_model_path) + + def setup(self, stage) -> None: + if stage == 'fit': + num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0 + self.total_step = int(self.trainer.max_epochs * self.num_data / + (max(1, num_gpus) * self.trainer.accumulate_grad_batches)) + print('Total training step:', self.total_step) + + def training_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss) + return output.loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss) + # self.log('val_acc', acc) + + def predict_step(self, batch, batch_idx): + text = batch['text'] + summary = batch['summary'] + generated_ids = self.model.generate( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + max_length=self.args.max_dec_length + ) + return {"pred": generated_ids, "text": text, "summary": summary} + + def configure_optimizers(self): + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + paras = list( + filter(lambda p: p[1].requires_grad, self.named_parameters())) + paras = [{ + 'params': + [p for n, p in paras if not any(nd in n for nd in no_decay)], + 'weight_decay': self.args.weight_decay + }, { + 'params': [p for n, p in paras if any(nd in n for nd in no_decay)], + 'weight_decay': 0.0 + }] + optimizer = torch.optim.AdamW(paras, lr=self.args.learning_rate) + scheduler = get_linear_schedule_with_warmup( + optimizer, int(self.total_step * self.args.warmup), + self.total_step) + + return [{ + 'optimizer': optimizer, + 'lr_scheduler': { + 'scheduler': scheduler, + 'interval': 'step', + 'frequency': 1 + } + }] + + +def save_test(data, args, data_model): + tokenizer = AutoTokenizer.from_pretrained(args.pretrained_model_path) + with open(os.path.join(args.output_save_path), 'w', encoding='utf-8') as f: + for _, batch in enumerate(data): + texts = batch['text'] + summarys = batch['summary'] + preds = batch['pred'] + for idx, pred_ids in enumerate(preds): + text = texts[idx] + summary = summarys[idx] + tmp_result = dict() + preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) + for g in pred_ids] + tmp_result['summary'] = ''.join(preds) + tmp_result['label'] = summary + tmp_result['origin_text'] = text + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + print('save the result to '+args.output_save_path) + + +def main(): + total_parser = 
argparse.ArgumentParser("Summary Task") + total_parser.add_argument('--do_eval_only', action='store_true', default=False) + total_parser.add_argument('--pretrained_model_path', default='google/mt5-small', type=str) + total_parser.add_argument('--output_save_path', default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = LCSTSDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = MT5FinetuneSummaryModelCheckpoint.add_argparse_args(total_parser) + total_parser = MT5FinetuneSummary.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + + data_model = LCSTSDataModel(args) + if not args.do_eval_only: + model = MT5FinetuneSummary(args, len(data_model.train_dataloader())) + checkpoint_callback = MT5FinetuneSummaryModelCheckpoint(args).callbacks + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'log/'), name='mt5_summary') + trainer = Trainer.from_argparse_args(args, + logger=logger, + callbacks=[checkpoint_callback] + ) + trainer.fit(model, data_model) + else: + trainer = Trainer.from_argparse_args(args) + model = MT5FinetuneSummary.load_from_checkpoint( + args.resume_from_checkpoint, args=args, num_data=len(data_model.predict_dataloader())) + result = trainer.predict(model, data_model) + if torch.distributed.get_rank() == 0: + save_test(result, args, data_model) + + +if __name__ == '__main__': + main() + # test() + +''' +python examples/mt5_summary.py --gpus=1 --test_data=test_public.jsonl +--default_root_dir=/cognitive_comp/ganruyi/fengshen/mt5_summary/eval +--do_eval_only +--resume_from_checkpoint=/cognitive_comp/ganruyi/fengshen/mt5_summary/ckpt/model-epoch=01-train_loss=1.9166.ckpt +--strategy=ddp +''' diff --git a/fengshen/examples/mt5_summary/pretrain_mt5_summary.sh b/fengshen/examples/mt5_summary/pretrain_mt5_summary.sh new file mode 100644 index 0000000000000000000000000000000000000000..a77b88006211d6f7a432672f4ac29a58d9865d66 --- /dev/null +++ b/fengshen/examples/mt5_summary/pretrain_mt5_summary.sh @@ -0,0 +1,118 @@ +#!/bin/bash +#SBATCH --job-name=mt5_large_summary +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=4 +#SBATCH --gres=gpu:4 # number of gpus +#SBATCH -o /cognitive_comp/ganruyi/fengshen/mt5_large_summary/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/fengshen/mt5_large_summary/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=16 +ROOT_DIR=/cognitive_comp/ganruyi/fengshen/mt5_large_summary + +ZERO_STAGE=2 + +config_json="$ROOT_DIR/ds_config.$SLURM_JOBID.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": 16, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-5, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-5 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + 
"contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +# export PL_DEEPSPEED_CONFIG_PATH=$config_json + +TRAINER_ARGS=" + --max_epochs 2 \ + --gpus 4 \ + --num_nodes 1 \ + --strategy ddp \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor train_loss \ + --mode min \ + --save_last \ +" +DATA_DIR=/cognitive_comp/ganruyi/data_datasets_LCSTS_LCSTS/ +prompt="summary:" +DATA_ARGS=" + --data_dir $DATA_DIR + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data train.jsonl\ + --valid_data valid.jsonl\ + --test_data valid.jsonl\ + --prompt $prompt \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/hf_models/google/mt5-large \ + --output_save_path $ROOT_DIR/mt5_large_predict_lcsts.json \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --warmup 0.01 \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/fengshen/examples/mt5_summary.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +#singularity exec --nv -B /cognitive_comp/ganruyi/Megatron/:/cognitive_comp/ganruyi/Megatron/,/cognitive_comp/gaoxinyu/:/cognitive_comp/gaoxinyu/ $SINGULARITY_PATH python $CMD + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +clear; srun singularity exec --nv -B /cognitive_comp/ganruyi/:/cognitive_comp/ganruyi/ $SINGULARITY_PATH bash -c 'python $CMD' \ No newline at end of file diff --git a/fengshen/examples/pegasus/README.md b/fengshen/examples/pegasus/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f04b83c348ab2fe34a06a428523bc48169c7b478 --- /dev/null +++ b/fengshen/examples/pegasus/README.md @@ -0,0 +1,78 @@ +# 燃灯系列-Pegasus摘要模型预训练 +Pegasus预训练模型是专门为摘要任务而设计的预训练模型,相比于其它通用预训练模型,Pegasus 模型的架构设计更贴近下游的摘要任务,在摘要抽取的效果上的表现相比其他通用模型表现更好 + +### 模型架构和参数 +Pegasus的模型架构是标准的encoder-decoder的Transformer结构,训练任务是用的是GSG( Gap Sentences Generation)任务。GSG任务主要是通过对文本中的重要的句子进行mask,然后再通过decoder恢复。模型详细参数可看config.json + +1. base版本 + +| 配置 | 参数 | +| ---- | ---- | +| encoder layers | 12 | +| encoder_attention_heads | 12 | +| encoder_ffn_dim | 3072 | +| decoder layers | 12 | +| decoder_attention_heads| 12 | +| decoder_ffn_dim | 3072 | +| max_encode_length | 512 | + +2. large 版本 + +| 配置 | 参数 | +| ---- | ---- | +| encoder layers | 16 | +| encoder_attention_heads | 16 | +| encoder_ffn_dim | 4096 | +| decoder layers | 16 | +| decoder_attention_heads| 16 | +| decoder_ffn_dim | 4096 | +| max_encode_length | 1024 | + +### 训练数据 +训练数据使用的是wudao 180g数据。数据进行了简单的预处理包括: +1. 过滤过长单句(这样的句子通常会包括一些乱码句,无上下文语义的列表句、各种符号句,歌词句等) +2. 过滤句子数过少文本,如句子数少于3句则抛弃 + +### 模型 + +pegasus-base: [Randeng_pegasus_238M_summary](https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_238M_Summary)
+pegasus-large: [Randeng_pegasus_523M_summary](https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_523M_Summary)
+
+主要文件:
+- tokenizers_pegasus.py 中文版pegasus的tokenize实现
+- pretrain_pegasus.py 模型预训练的核心实现文件
+- pretrain_pegasus.sh 预训练脚本,具体参数可通过此脚本修改
+- data_utils.py 模型的一些工具代码
+
+#### 使用方式
+可直接通过Hugging Face或者pytorch-lightning框架调用。下面给出的例子是Hugging Face的调用方法:
+```python
+from transformers import PegasusForConditionalGeneration
+# Need to download tokenizers_pegasus.py and data_utils.py from the Fengshenbang-LM GitHub repo in advance,
+# or download them from https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_238M_Summary/tree/main
+# We strongly recommend cloning the Fengshenbang-LM repo:
+# 1. git clone https://github.com/IDEA-CCNL/Fengshenbang-LM
+# 2. cd Fengshenbang-LM/fengshen/examples/pegasus/
+# and then you will find the tokenizers_pegasus.py and data_utils.py needed by the pegasus model
+from tokenizers_pegasus import PegasusTokenizer
+
+model = PegasusForConditionalGeneration.from_pretrained("IDEA-CCNL/Randeng_Pegasus_238M_Summary")
+tokenizer = PegasusTokenizer.from_pretrained("path/to/vocab.txt")
+
+text = "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!"
+inputs = tokenizer(text, max_length=1024, return_tensors="pt")
+
+# Generate Summary
+summary_ids = model.generate(inputs["input_ids"])
+tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+```
+
+### 下游效果
+
+#### LCSTS摘要数据finetune后效果
+
+| model | rouge-1 | rouge-2 | rouge-L |
+| ---- | ---- | ---- | ---- |
+| Pegasus-base | 44.13 | 31.31 | 41.06 |
+| Pegasus-large | 49.42 | 37.91 | 46.63 |
+
diff --git a/fengshen/examples/pegasus/data_utils.py b/fengshen/examples/pegasus/data_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..879798749bc06d6857c01ec101baf5f3fb61d012
--- /dev/null
+++ b/fengshen/examples/pegasus/data_utils.py
@@ -0,0 +1,319 @@
+# -*- coding: utf-8 -*-
+
+import re
+import six
+import unicodedata
+import torch
+import rouge
+import numpy as np
+import random
+# from fengshen.examples.pegasus.pegasus_utils import text_segmentate
+import sys
+
+sys.path.append('../../../')
+
+rouge = rouge.Rouge()
+
+
+is_py2 = six.PY2
+
+if not is_py2:
+    basestring = str
+
+
+def _is_chinese_char(cp):
+    """Checks whether CP is the codepoint of a CJK character."""
+    # This defines a "chinese character" as anything in the CJK Unicode block:
+    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+    #
+    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+    # despite its name. The modern Korean Hangul alphabet is a different block,
+    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+    # space-separated words, so they are not treated specially and handled
+    # like the all of the other languages.
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or (cp >= 0x3400 and cp <= 0x4DBF) + or (cp >= 0x20000 and cp <= 0x2A6DF) + or (cp >= 0x2A700 and cp <= 0x2B73F) + or (cp >= 0x2B740 and cp <= 0x2B81F) + or (cp >= 0x2B820 and cp <= 0x2CEAF) + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F)): + return True + + return False + + +def _is_whitespace(char): + """Checks whether `char` is a whitespace character.""" + # \t, \n, and \r are technically control characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `char` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `char` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. + if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or ( + cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False + + +def is_string(s): + """判断是否是字符串 + """ + return isinstance(s, basestring) + + +def is_stopwords(word, stopwords): + if word in stopwords: + return True + else: + return False + + +def text_segmentate(text): + en_seg_pattern = '((?:\\!|\\?|\\.|\\n)+(?:\\s)+)' + ch_seg_pattern = '((?:?|!|。|\\n)+)' + try: + text = re.sub(en_seg_pattern, r'\1[SEP]', text) + # print("sub text: ", text) + except Exception as e: + print("input: ", text) + raise e + text = re.sub(ch_seg_pattern, r'\1[SEP]', text) + # print("sub ch text: ", text) + text_list = text.split("[SEP]") + text_list = list(filter(lambda x: len(x) != 0, text_list)) + return text_list + + +def load_stopwords(stopwords_path): + stopwords_dict = {} + with open(stopwords_path, "r") as rf: + for line in rf: + line = line.strip() + if line not in stopwords_dict: + stopwords_dict[line] = 0 + else: + pass + return stopwords_dict + + +def text_process(text, max_length): + """分割文本 + """ + texts = text_segmentate(text) + + result, length = [], 0 + for text in texts: + if length + len(text) > max_length * 1.3 and len(result) >= 3: + yield result + result, length = [], 0 + result.append(text) + length += len(text) + if result and len(result) >= 3: + yield result + + +def text_process_split_long_content(text, max_length): + """分割长文本 + """ + texts = text_segmentate(text) + + result, sentence_num = "", 0 + for text in texts: + if len(text) > 500: + if len(result) > 300 and sentence_num >= 3: + yield result + result, sentence_num = "", 0 + else: + result, sentence_num = "", 0 + continue + else: + if len(result) + len(text) > max_length * 1.1 and sentence_num >= 3: + yield result + result, sentence_num = "", 0 + result += text + sentence_num += 1 + + if result and sentence_num >= 3: + yield result + + +def gather_join(texts, idxs): + """取出对应的text,然后拼接起来 + """ + return ''.join([texts[i] for i in idxs]) + + +def gather_join_f1(texts_token, idsx): + join_texts = [] + for id in idsx: + 
join_texts.extend(texts_token[id]) + return join_texts + + +def compute_rouge(source, target): + """计算rouge-1、rouge-2、rouge-l + """ + source, target = ' '.join(source), ' '.join(target) + try: + scores = rouge.get_scores(hyps=source, refs=target) + return { + 'rouge-1': scores[0]['rouge-1']['f'], + 'rouge-2': scores[0]['rouge-2']['f'], + 'rouge-l': scores[0]['rouge-l']['f'], + } + except ValueError: + return { + 'rouge-1': 0.0, + 'rouge-2': 0.0, + 'rouge-l': 0.0, + } + + +def remove_stopwords(texts, stopwords_dict): + for i, text in enumerate(texts): + texts[i] = list(filter(lambda x: x not in stopwords_dict, text)) + return texts + + +def pseudo_summary_f1(texts, + stopwords, + tokenizer, + max_length, + rouge_strategy="rouge-l"): + """构建伪标签摘要数据集 + """ + summary_rate = 0.25 + max_length = max_length - 1 + texts_tokens = [] + sentece_idxs_vec = [] + for text in texts: + if len(texts) == 0: + continue + try: + ids = tokenizer.encode(text.strip())[:-1] + except ValueError: + print("error, input : ", text) + raise ValueError + sentece_idxs_vec.append(ids) + tokens = [tokenizer._convert_id_to_token(token) for token in ids] + texts_tokens.append(tokens) + + texts_tokens_rm = remove_stopwords(texts_tokens, stopwords) + source_idxs, target_idxs = list(range(len(texts))), [] + + assert len(texts_tokens) == len(texts) + # truncate_index = 0 + while True: + sims = [] + for i in source_idxs: + new_source_idxs = [j for j in source_idxs if j != i] + new_target_idxs = sorted(target_idxs + [i]) + new_source = gather_join_f1(texts_tokens_rm, new_source_idxs) + new_target = gather_join_f1(texts_tokens_rm, new_target_idxs) + sim = compute_rouge(new_source, new_target)[rouge_strategy] + sims.append(sim) + new_idx = source_idxs[np.argmax(sims)] + del sims + source_idxs.remove(new_idx) + target_idxs = sorted(target_idxs + [new_idx]) + source = gather_join(texts, source_idxs) + target = gather_join(texts, target_idxs) + try: + if (len(source_idxs) == 1 + or 1.0 * len(target) / len(source) > summary_rate): + break + except ZeroDivisionError as e: + print(e.meesage) + print(texts) + print("source: ", source) + print("target: ", target) + + if len(source) < len(target): + source, target = target, source + source_idxs, target_idxs = target_idxs, source_idxs + + return sentece_idxs_vec, source, target, source_idxs, target_idxs + + +def get_input_mask(sentence_id_vec, indexs): + target_idxs = [] + input_idxs = [] + kMaskSentenceTokenId = 2 + kEosTokenId = 1 + mask_sentence_options_cumulative_prob = [0.9, 0.9, 1, 1] + for index in indexs: + target_idxs.extend(sentence_id_vec[index]) + choice = random.uniform(0, 1) + if choice < mask_sentence_options_cumulative_prob[0]: + # print("mask index: ", index) + sentence_id_vec[index] = [kMaskSentenceTokenId] + elif choice < mask_sentence_options_cumulative_prob[1]: + # print("replace index: ", index) + replace_id = random.randint(0, len(sentence_id_vec)) + sentence_id_vec[index] = sentence_id_vec[replace_id] + elif choice < mask_sentence_options_cumulative_prob[2]: + pass + else: + sentence_id_vec[index] = [] + + target_idxs.append(kEosTokenId) + # print(sentence_id_vec) + for index, sentence_id in enumerate(sentence_id_vec): + # print(index, sentence_id) + if len(sentence_id) == 0: + continue + input_idxs.extend(sentence_id_vec[index]) + + input_idxs.append(kEosTokenId) + return input_idxs, target_idxs + + +def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, + decoder_start_token_id: int): + """ + Shift input ids one token to the right. 
+ """ + shifted_input_ids = input_ids.new_zeros(input_ids.shape) + shifted_input_ids[:, 1:] = input_ids[:, :-1].clone() + shifted_input_ids[:, 0] = decoder_start_token_id + + if pad_token_id is None: + raise ValueError("self.model.config.pad_token_id has to be defined.") + # replace possible -100 values in labels by `pad_token_id` + shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id) + + return shifted_input_ids + + +def padding_to_maxlength(ids, max_length, pad_id): + cur_len = len(ids) + len_diff = max_length - cur_len + return ids + [pad_id] * len_diff, [1] * cur_len + [0] * len_diff diff --git a/fengshen/examples/pegasus/pretrain_pegasus.py b/fengshen/examples/pegasus/pretrain_pegasus.py new file mode 100644 index 0000000000000000000000000000000000000000..0059355f5d5bf6d149e01fc3dc15d3a760932733 --- /dev/null +++ b/fengshen/examples/pegasus/pretrain_pegasus.py @@ -0,0 +1,181 @@ +# -*- coding: utf-8 -*- + + +from fengshen.models.model_utils import add_module_args +from transformers import PegasusForConditionalGeneration, PegasusConfig +from pytorch_lightning import Trainer, loggers, LightningModule +from pytorch_lightning.callbacks import LearningRateMonitor +from tokenizers_pegasus import PegasusTokenizer +from utils import UniversalCheckpoint +from data.universal_datamodule import UniversalDataModule +from data_utils import ( + get_input_mask, pseudo_summary_f1, shift_tokens_right, + padding_to_maxlength, load_stopwords, text_segmentate) +import argparse +import torch +import os +import sys + +sys.path.append('../../') + + +# os.environ["CUDA_VISIBLE_DEVICES"] = '6' + + +class FakeAbstractCollator: + + def __init__(self, tokenizer, stopwords_dict, max_enc_length): + self.tokenizer = tokenizer + self.max_seq_length = max_enc_length + self.stopwords_dict = stopwords_dict + + def __call__(self, samples): + # print("samples: ", samples) + labels = [] + attn_mask = [] + decoder_attn_mask = [] + source_inputs = [] + + for text in samples: + texts = text["chunks"] + text = text_segmentate(texts) + sentence_id_vec, source, target, source_idxs, target_idxs = pseudo_summary_f1( + text, self.stopwords_dict, self.tokenizer, self.max_seq_length, + "rouge-l") + source_idxs, target_idxs = get_input_mask(sentence_id_vec, + target_idxs) + if len(source_idxs) > self.max_seq_length: + if 2 not in source_idxs[self.max_seq_length - 1:]: + source_idxs = source_idxs[:self.max_seq_length] + source_idxs[-1] = self.tokenizer.eos_token_id + sys.stderr.write("Warning split long line: " + source + + "\n") + else: + continue + + source_idxs, attention_mask = padding_to_maxlength( + source_idxs, self.max_seq_length, self.tokenizer.pad_token_id) + label, target_attention_mask = padding_to_maxlength( + target_idxs, self.max_seq_length, self.tokenizer.pad_token_id) + # print("sample len: ", len(source_idxs)) + source_inputs.append(source_idxs) + attn_mask.append(attention_mask) + decoder_attn_mask.append(target_attention_mask) + labels.append(label) + labels = torch.tensor(labels) + decode_input_idxs = shift_tokens_right(labels, + self.tokenizer.pad_token_id, + self.tokenizer.pad_token_id) + end_token_index = torch.where(labels == self.tokenizer.eos_token_id)[1] + for idx, end_idx in enumerate(end_token_index): + labels[idx][end_idx + 1:] = -100 + + # print("call samples: ") + return { + "input_ids": torch.tensor(source_inputs), + "attention_mask": torch.tensor(attn_mask), + "labels": labels, + "decoder_input_ids": decode_input_idxs, + "decoder_attention_mask": torch.tensor(decoder_attn_mask) + } + 
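+# 用法示意:FakeAbstractCollator 接收形如 {"chunks": 原始文本} 的样本,
+# 按 GSG 策略挑选“伪摘要”句子并在 encoder 输入中打上 mask,返回可直接喂给模型的 batch。
+# 下面的路径与参数均为假设值,仅作说明,不是固定接口:
+# tokenizer = PegasusTokenizer.from_pretrained("/path/to/pegasus-base")
+# stopwords_dict = load_stopwords("/path/to/stopwords")
+# collator = FakeAbstractCollator(tokenizer, stopwords_dict, max_enc_length=512)
+# batch = collator([{"chunks": "第一句。第二句!第三句?第四句。"}])
+# batch 包含 input_ids / attention_mask / labels / decoder_input_ids / decoder_attention_mask
+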
+ +class PegasusChineseModel(LightningModule): + + def __init__(self, args, **kwargs): + super().__init__() + self.args = args + self.save_hyperparameters(args) + config = PegasusConfig.from_json_file( + os.path.join(args.model_path, "config.json")) + print("vocab_size: ", config.vocab_size) + self.model = PegasusForConditionalGeneration(config=config) + print("model.num_parameters: ", self.model.num_parameters()) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader( + ) + + # Calculate total steps + tb_size = self.hparams.train_batchsize * max(1, self.trainer.gpus) + ab_size = self.trainer.accumulate_grad_batches * float( + self.trainer.max_epochs) + self.total_steps = (len(train_loader.dataset) // + tb_size) // ab_size + print('Total training step:', self.total_steps) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + def training_step(self, batch, batch_idx): + output = self.model(**batch) + self.log('train_loss', output.loss, sync_dist=True) + return output.loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1, )) + y_true = labels.view(size=(-1, )).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float()) / labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self.model(**batch) + acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss, sync_dist=True) + self.log('val_acc', acc, sync_dist=True) + + def on_save_checkpoint(self, checkpoint) -> None: + if self.trainer._accelerator_connector.cluster_environment.global_rank( + ) == 0: + self.model.save_pretrained( + os.path.join( + self.trainer.checkpoint_callback.dirpath, + 'hf_pretrained_epoch{}_step{}'.format( + checkpoint['epoch'], checkpoint['global_step']))) + + +def main(): + args_parser = argparse.ArgumentParser("Pegasus Task") + + args_parser = UniversalDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) + args_parser = UniversalCheckpoint.add_argparse_args(args_parser) + args_parser = add_module_args(args_parser) + args_parser.add_argument('--deepspeed') + args_parser.add_argument( + '--stopword_path', + default="/cognitive_comp/dongxiaoqun/project/pegasus/own/pegasus/stopwords", + type=str) + args_parser.add_argument('--max_seq_length', default=1024, type=int) + args = args_parser.parse_args() + + tokenizer = PegasusTokenizer.from_pretrained(args.model_path) + stopwords_dict = load_stopwords(args.stopword_path) + collator = FakeAbstractCollator(tokenizer, stopwords_dict, + args.max_seq_length) + data_module = UniversalDataModule(tokenizer=tokenizer, + args=args, + collate_fn=collator) + module = PegasusChineseModel(args) + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger( + save_dir=os.path.join(args.default_root_dir, 'logs/'), + name=os.path.basename(os.path.dirname(args.model_path))) + checkpoint_callback = UniversalCheckpoint(args).callbacks + + # autotuning + if args.deepspeed is not None: + os.environ['PL_DEEPSPEED_CONFIG_PATH'] = args.deepspeed + + trainer = Trainer.from_argparse_args( + args, logger=logger, callbacks=[lr_monitor, checkpoint_callback]) + + trainer.fit(module, data_module) + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/pegasus/pretrain_pegasus.sh 
b/fengshen/examples/pegasus/pretrain_pegasus.sh new file mode 100644 index 0000000000000000000000000000000000000000..3a371ac45463317fa01fa84a72f5df6bb9ca6bd5 --- /dev/null +++ b/fengshen/examples/pegasus/pretrain_pegasus.sh @@ -0,0 +1,119 @@ +#!/bin/bash +#SBATCH --job-name=pegasus-base_last # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) + + +set -x -e + +echo "START TIME: $(date)" +MODEL_NAME=pegasus-base_test + +config_json="./$MODEL_NAME.ds_config.json" +export MASTER_PORT=$[RANDOM%10000+40000] + +MICRO_BATCH_SIZE=4 + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "zero_optimization": { + "stage": 1 + }, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "optimizer": { + "params": { + "betas": [ + 0.9, + 0.999 + ], + "eps": 1e-08, + "lr": 1e-04, + "weight_decay": 0.01 + }, + "type": "Adam" + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 80000000, + "warmup_num_steps" : 50000 + }, + "type": "WarmupDecayLR" + }, + "steps_per_print": 100, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE, + "zero_allow_untested_optimizer": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/dongxiaoqun/torch_extendsions + +DATA_ARGS="\ + --datasets_name wudao_180g_512 \ + --num_workers 20 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize 8 \ + --test_batchsize 8 \ + --max_seq_length 512 \ + --val_datasets_field valid \ + " + +MODEL_ARGS="\ + --model_path /cognitive_comp/dongxiaoqun/pretrained_model/pegasus-base/ \ + --learning_rate 1e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.001 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor train_loss \ + --save_top_k 3 \ + --mode min \ + --every_n_train_steps 200 \ + --dirpath /cognitive_comp/dongxiaoqun/train_model/fengshen-$MODEL_NAME_debug/ckpt \ + --filename model-{step:02d}-{train_loss:.4f} \ + --save_last \ + " + +TRAINER_ARGS="\ + --gradient_clip_val 1.0 \ + --max_epochs 1 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy ddp \ + --log_every_n_steps 100 \ + --val_check_interval 0.1 \ + --accumulate_grad_batches 8 \ + --default_root_dir /cognitive_comp/dongxiaoqun/train_model/fengshen-$MODEL_NAME_debug \ + --stopword_path /cognitive_comp/dongxiaoqun/pretrained_model/pegasus-large/stopwords \ + " + + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +SINGULARITY_PATH=/cognitive_comp/dongxiaoqun/software/docker/pytorch21_06_py3_docker_image_v2.sif +export SCRIPT_PATH=/cognitive_comp/dongxiaoqun/project/idea-ccnl/bug_fix/Fengshenbang-LM/fengshen/examples/pegasus/pretrain_pegasus.py + +# python $SCRIPT_PATH $options +source activate +conda activate torchnew +srun --nodes=1 --ntasks-per-node=1 --gres=gpu:2 --cpus-per-task=30 -o ${MODEL_NAME}-%J.log --jobid=226191 bash -c 'python3 $SCRIPT_PATH $options' diff --git a/fengshen/examples/pegasus/tokenizers_pegasus.py b/fengshen/examples/pegasus/tokenizers_pegasus.py new file mode 100644 index 0000000000000000000000000000000000000000..f532875987b59a42aca9ad35eb7a1945c992869b --- 
/dev/null +++ b/fengshen/examples/pegasus/tokenizers_pegasus.py @@ -0,0 +1,597 @@ +from fengshen.examples.pegasus.data_utils import ( + _is_control, + _is_punctuation, + _is_whitespace, + _is_chinese_char) +from transformers import PreTrainedTokenizer +from transformers import logging +from typing import List, Optional, Tuple, Union +import collections +import os +import unicodedata +import re +import jieba +import sys + +sys.path.append("../../../../") + +jieba.dt.tmp_dir = os.path.expanduser("~/.cache/") +# jieba.enable_parallel(8) +jieba.initialize() + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n") + vocab[token] = index + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class PegasusTokenizer(PreTrainedTokenizer): + # copy from BertTokenizer + r""" + Construct a Pegasus tokenizer. Based on WordPiece. + This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to + this superclass for more information regarding those methods. + Args: + vocab_file (`str`): + File containing the vocabulary. + do_lower_case (`bool`, *optional*, defaults to `True`): + Whether or not to lowercase the input when tokenizing. + do_basic_tokenize (`bool`, *optional*, defaults to `True`): + Whether or not to do basic tokenization before WordPiece. + never_split (`Iterable`, *optional*): + Collection of tokens which will never be split during tokenization. Only has an effect when + `do_basic_tokenize=True` + unk_token (`str`, *optional*, defaults to `"[UNK]"`): + The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this + token instead. + sep_token (`str`, *optional*, defaults to `"[SEP]"`): + The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for + sequence classification or for a text and a question for question answering. It is also used as the last + token of a sequence built with special tokens. + pad_token (`str`, *optional*, defaults to `"[PAD]"`): + The token used for padding, for example when batching sequences of different lengths. + cls_token (`str`, *optional*, defaults to `"[CLS]"`): + The classifier token which is used when doing sequence classification (classification of the whole sequence + instead of per-token classification). It is the first token of the sequence when built with special tokens. + mask_token (`str`, *optional*, defaults to `"[MASK]"`): + The token used for masking values. This is the token used when training this model with masked language + modeling. This is the token which the model will try to predict. + tokenize_chinese_chars (`bool`, *optional*, defaults to `True`): + Whether or not to tokenize Chinese characters. + This should likely be deactivated for Japanese (see this + [issue](https://github.com/huggingface/transformers/issues/328)). + strip_accents (`bool`, *optional*): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for `lowercase` (as in the original BERT). 
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    # pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    # pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
+    # max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+
+    def __init__(self,
+                 vocab_file,
+                 do_lower_case=True,
+                 do_basic_tokenize=True,
+                 never_split=None,
+                 pad_token="<pad>",
+                 eos_token="</s>",
+                 unk_token="<unk>",
+                 mask_token="<mask_2>",
+                 mask_token_sent="<mask_1>",
+                 additional_special_tokens=None,
+                 sep_token="[SEP]",
+                 cls_token="[CLS]",
+                 tokenize_chinese_chars=True,
+                 strip_accents=None,
+                 offset=100,
+                 pre_tokenizer=lambda x: jieba.cut(x, HMM=False),
+                 **kwargs):
+        self.offset = offset
+
+        if additional_special_tokens is not None:
+            if not isinstance(additional_special_tokens, list):
+                raise TypeError(
+                    f"additional_special_tokens should be of type {type(list)}, \
+                    but is {type(additional_special_tokens)}"
+                )
+
+            additional_special_tokens_extended = (
+                ([mask_token_sent] + additional_special_tokens)
+                if mask_token_sent not in additional_special_tokens
+                and mask_token_sent is not None else additional_special_tokens)
+
+            # fill additional tokens with ..., in case not all additional tokens are already taken
+            additional_special_tokens_extended += [
+                f"<unk_{i}>" for i in range(
+                    len(additional_special_tokens_extended), self.offset - 1)
+            ]
+
+            if len(set(additional_special_tokens_extended)) != len(
+                    additional_special_tokens_extended):
+                raise ValueError(
+                    f"Please make sure that the provided additional_special_tokens \
+                    do not contain an incorrectly shifted list of tokens. \
+                    Found {additional_special_tokens_extended}."
+                )
+            additional_special_tokens = additional_special_tokens_extended
+        else:
+            additional_special_tokens = [
+                mask_token_sent
+            ] if mask_token_sent is not None else []
+            # additional_special_tokens += [f"<unk_{i}>" for i in range(3, self.offset)]
+
+        # print("additional_special_tokens: ", additional_special_tokens)
+
+        if not os.path.isfile(vocab_file):
+            raise ValueError(
+                f"Can't find a vocabulary file at path '{vocab_file}'.
\ + To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`" + ) + + super().__init__( + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + eos_token=eos_token, + tokenize_chinese_chars=tokenize_chinese_chars, + additional_special_tokens=additional_special_tokens, + strip_accents=strip_accents, + **kwargs, + ) + + self.pre_tokenizer = pre_tokenizer + self.mask_token_sent = mask_token_sent + self.vocab = load_vocab(vocab_file) + + self.vocab[self.eos_token] = self.vocab.pop("[unused1]") + # self.vocab[self.eos_token] = self.vocab.pop("[unused2]") + self.vocab[self.pad_token] = self.vocab.pop("[PAD]") + self.vocab[self.unk_token] = self.vocab.pop("[UNK]") + + if self.mask_token_sent is not None: + self.vocab[self.mask_token] = self.vocab.pop("[unused3]") + self.vocab[self.mask_token_sent] = self.vocab.pop("[unused2]") + + self.ids_to_tokens = collections.OrderedDict([ + (ids, tok) for tok, ids in self.vocab.items() + ]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer( + do_lower_case=do_lower_case, + never_split=never_split, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + ) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, + unk_token=self.unk_token) + + @property + def do_lower_case(self): + return self.basic_tokenizer.do_lower_case + + @property + def vocab_size(self): + return len(self.vocab) + + def get_vocab(self): + return dict(self.vocab, **self.added_tokens_encoder) + + def _tokenize(self, text): + split_tokens = [] + # print("pegasus_tokenizer: ", text) + for text in self.pre_tokenizer(text): + if text in self.vocab: + split_tokens.append(text) + else: + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize( + text, never_split=self.all_special_tokens): + + # If the token is part of the never_split set + if token in self.basic_tokenizer.never_split: + split_tokens.append(token) + else: + split_tokens += self.wordpiece_tokenizer.tokenize( + token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.vocab.get(token, self.vocab.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.ids_to_tokens.get(index, self.unk_token) + + @staticmethod + def _cjk_punctuation(): + return u'\uff02\uff03\uff04\uff05\uff06\uff07\uff08\uff09\uff0a\uff0b\uff0c\uff0d\uff0f\uff1a\uff1b\uff1c\uff1d\ + \uff1e\uff20\uff3b\uff3c\uff3d\uff3e\uff3f\uff40\uff5b\uff5c\uff5d\uff5e\uff5f\uff60\uff62\ + \uff63\uff64\u3000\u3001\u3003\u3008\u3009\u300a\u300b\u300c\u300d\u300e\u300f\u3010\u3011\u3014\ + \u3015\u3016\u3017\u3018\u3019\u301a\u301b\u301c\u301d\u301e\u301f\u3030\u303e\u303f\u2013\u2014\ + \u2018\u2019\u201b\u201c\u201d\u201e\u201f\u2026\u2027\ufe4f\ufe51\ufe54\u00b7\uff01\uff1f\uff61\u3002' + + def convert_ids_to_tokens( + self, + ids: Union[int, List[int]], + skip_special_tokens: bool = False) -> Union[str, List[str]]: + """ + Converts a single index or a sequence of indices in a token or a sequence of tokens, using the vocabulary and + added tokens. 
+ Args: + ids (`int` or `List[int]`): + The token id (or token ids) to convert to tokens. + skip_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not to remove special tokens in the decoding. + Returns: + `str` or `List[str]`: The decoded token(s). + """ + if isinstance(ids, int): + if ids in self.added_tokens_decoder: + return self.added_tokens_decoder[ids] + else: + return self._convert_id_to_token(ids) + tokens = [] + for index in ids: + index = int(index) + if skip_special_tokens and index in self.all_special_ids and index != 2: + continue + if index in self.added_tokens_decoder: + tokens.append(self.added_tokens_decoder[index]) + else: + tokens.append(self._convert_id_to_token(index)) + return tokens + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + # for token in + # tokens = tokens or self.ids_to_tokens(ids) + # tokens = [token for token in tokens if not self._is_special(token)] + + text = '' + for i, token in enumerate(tokens): + if token[:2] == '##': + text += token[2:] + elif len(token) == 1 and _is_chinese_char(ord(token)): + text += token + elif len(token) == 1 and _is_punctuation(token): + text += token + text += ' ' + elif i > 0 and _is_chinese_char(ord(text[-1])): + text += token + elif tokens == "": + continue + else: + text += ' ' + text += token + + text = re.sub(' +', ' ', text) + text = re.sub('\' (re|m|s|t|ve|d|ll) ', '\'\\1 ', text) + punctuation = re.sub(' +', '', self._cjk_punctuation()).strip() + '+-/={(<[' + punctuation_regex = '|'.join([re.escape(p) for p in punctuation]) + punctuation_regex = '(%s) ' % punctuation_regex + text = re.sub(punctuation_regex, '\\1', text) + text = re.sub(r'(\d\.) (\d)', '\\1\\2', text) + + return text.strip() + # out_string = " ".join(tokens).replace(" ##", "").strip() + + def build_inputs_with_special_tokens( + self, + token_ids_0: List[int], + token_ids_1: Optional[List[int]] = None) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating + and adding special tokens. A PEGASUS sequence has the following format, where `X` represents the sequence: + - single sequence: `X ` + - pair of sequences: `A B ` (not intended use) + BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a + separator. + Args: + token_ids_0 (`List[int]`): + List of IDs to which the special tokens will be added. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + Returns: + `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. + """ + if token_ids_1 is None: + return token_ids_0 + [self.eos_token_id] + return token_ids_0 + token_ids_1 + [self.eos_token_id] + + def _special_token_mask(self, seq): + all_special_ids = set( + self.all_special_ids) # call it once instead of inside list comp + # all_special_ids.remove(self.unk_token_id) # is only sometimes special + + return [1 if x in all_special_ids else 0 for x in seq] + + def get_special_tokens_mask( + self, + token_ids_0: List[int], + token_ids_1: Optional[List[int]] = None, + already_has_special_tokens: bool = False) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + Args: + token_ids_0 (`List[int]`): + List of IDs. 
+ token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + return self._special_token_mask(token_ids_0) + elif token_ids_1 is None: + return self._special_token_mask(token_ids_0) + [self.eos_token_id] + else: + return self._special_token_mask(token_ids_0 + + token_ids_1) + [self.eos_token_id] + + def num_special_tokens_to_add(self, pair=False): + """Just EOS""" + return 1 + + def save_vocabulary(self, + save_directory: str, + filename_prefix: Optional[str] = None) -> Tuple[str]: + index = 0 + if os.path.isdir(save_directory): + vocab_file = os.path.join( + save_directory, + (filename_prefix + "-" if filename_prefix else "") + + VOCAB_FILES_NAMES["vocab_file"]) + else: + vocab_file = (filename_prefix + + "-" if filename_prefix else "") + save_directory + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), + key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!") + index = token_index + writer.write(token + "\n") + index += 1 + return (vocab_file, ) + + +class BasicTokenizer(object): + """ + Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.). + Args: + do_lower_case (`bool`, *optional*, defaults to `True`): + Whether or not to lowercase the input when tokenizing. + never_split (`Iterable`, *optional*): + Collection of tokens which will never be split during tokenization. Only has an effect when + `do_basic_tokenize=True` + tokenize_chinese_chars (`bool`, *optional*, defaults to `True`): + Whether or not to tokenize Chinese characters. + This should likely be deactivated for Japanese (see this + [issue](https://github.com/huggingface/transformers/issues/328)). + strip_accents: (`bool`, *optional*): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for `lowercase` (as in the original BERT). + """ + + def __init__(self, + do_lower_case=True, + never_split=None, + tokenize_chinese_chars=True, + strip_accents=None): + if never_split is None: + never_split = [] + self.do_lower_case = do_lower_case + self.never_split = set(never_split) + self.tokenize_chinese_chars = tokenize_chinese_chars + self.strip_accents = strip_accents + + def tokenize(self, text, never_split=None): + """ + Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see + WordPieceTokenizer. + Args: + never_split (`List[str]`, *optional*) + Kept for backward compatibility purposes. Now implemented directly at the base class level (see + [`PreTrainedTokenizer.tokenize`]) List of token not to split. + """ + # union() returns a new set by concatenating the two sets. + never_split = self.never_split.union( + set(never_split)) if never_split else self.never_split + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. 
This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + if self.tokenize_chinese_chars: + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if token not in never_split: + if self.do_lower_case: + token = token.lower() + if self.strip_accents is not False: + token = self._run_strip_accents(token) + elif self.strip_accents: + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token, never_split)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text, never_split=None): + """Splits punctuation on a piece of text.""" + if never_split is not None and text in never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token, max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """ + Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform + tokenization using the given vocabulary. + For example, `input = "unaffable"` wil return as output `["un", "##aff", "##able"]`. + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through *BasicTokenizer*. + Returns: + A list of wordpiece tokens. + """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens diff --git a/fengshen/examples/pretrain_bert/README.md b/fengshen/examples/pretrain_bert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1761095920188083853fb3df47927f0f9c008b76 --- /dev/null +++ b/fengshen/examples/pretrain_bert/README.md @@ -0,0 +1,78 @@ +# Bert预训练 + +## 背景 + +我们有持续收集了一部分语料,有一套自建的数据处理流程。位了验证数据处理的效果,从零开始预训练了2个base级别的Bert模型,一个是基于自建数据,一个是基于同行们开源的数据。总体来说数据效果差别不大,下面只介绍一下本次预训练的流程。 + +## 数据处理 + +我们的原始语料主要源自common crawl以及一些开源的高质量语料,经过一些列的数据清洗之后,我们的数据格式为jsonline。例如(摘自内部数据): +```json +{"text":"据悉,河南博物馆成立于1927年,目前拥有超过170000件(套)的文物收藏,包括Jiahu骨笛,雌性猫头鹰雕像,cloud-patterned铜禁,Duling Fangding,莲花和起重机广场,和玉柄剑,黄金从武则天滑落,四神云雾壁画和汝窑天蓝釉雕鹅颈瓶是九大镇厅的珍品。院中的藏品以史前文物、商周青铜器、陶瓷、玉器和石雕等为特色。高质量文物数量多、品种齐全、品位高、价值高。它们是见证中国文明发展、展示中国历史发展的文化艺术宝库。"} +{"text": "功夫不负有心人,1925年,万氏兄弟试制动画片初获成果,并获得了商务印书馆的大力支持。其后兄弟们再接再厉,直到1927年,一部黑白无声动画片《大闹画室》诞生了爱尔兰风笛。据《申报》记载,“该片内容画人与真人合作锁梦楼,滑稽处甚多,令人观后,捧腹不止。”此片曾远销美国放映,并大受赞誉。1930年夏俊娜,万古蟾到大中华百合影片公司工作,万氏兄弟采用了同样的手法拍摄了第二部动画短片《纸人捣乱记》,并于1931年上映。"} +``` + +处理脚本路径:`/cognitive_comp/wuziwei/codes/Fengshenbang-LM/fengshen/data/bert_dataloader` + +该路径下面有3个文件,`auto_split.sh`和`preprocessing.py`是原始数据预处理的脚本,`load.py是fs_data`的处理脚本,执行顺序如下: + +#### step 1 + +执行`auto_split.sh`文件,作用是分割大文件,超过1GB的文件,会自动分割未300M的小文件。使用方法如下: + +`sh auto_split.sh 你的数据文件路径` + +#### step 2 + +执行`preprocessing.py`文件,该文件的作用主要是分句,为什么不嵌入到collate_fn中做,是发现那样效率会慢一些,所以单独拿出来做了。 +执行`python preprocessing.py`即可,注意修改脚本内的文件路径。 + +#### step 3 + 
+`load.py`文件是用fsdata的方式加载数据集,也是执行即可。执行一遍,后续的加载可以实现180GB的数据秒入~ + +前面两步是为了提高load.py文件生成缓存文件的速度。经过这几步的处理以及collate_fn函数(bert mask 策略的实现),最终变成bert的输入。如下: + +*ps: collate_fn在`Fengshenbang-LM\fengshen\examples\pretrain_bert\pretrain_bert.py`脚本下,由DataCollate类实现。* + +```json +{ +"input_ids": torch.tensor(input_ids), +"labels": torch.tensor(batch_labels), +"attention_mask": torch.tensor(attention_mask), +"token_type_ids": torch.tensor(token_type_ids) +} +``` + +## 模型结构 + +模型结构即为标准的bert-base,即: +| 配置 | 参数 | +| :---------: | :---: | +| nlayers | 12 | +| nheaders | 12 | +| hidden-size | 768 | +| seq-length | 512 | +| vocab-size | 21128 | + +## 任务以及Mask策略 + +*mask策略的实现在`Fengshenbang-LM\fengshen\examples\pretrain_bert\pretrain_bert.py`的**DataCollate**类中* + +本次预训练取消了NSP任务,只做mask任务,具体mask策略如下: + +- 15%随机mask + - 80% mask + - 10% 随机替换 + - 10% 保持不变 +- 全词mask (wwm) +- n-gram mask + +由于加入了全词mask和n-gram mask 总体的mask token数量会比英文原始论文的mask比例略高 + +## 预训练执行流程 + +- 训练框架:[Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM) +- 脚本执行:`sh Fengshenbang-LM\fengshen\examples\pretrain_bert\pretrain_bert.sh` + +*具体配置见`Fengshenbang-LM\fengshen\examples\pretrain_bert\pretrain_bert.sh`* diff --git a/fengshen/examples/pretrain_bert/pretrain_bert.py b/fengshen/examples/pretrain_bert/pretrain_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..a07d7020e10503c4a2b15cfa8456de3264bd13f4 --- /dev/null +++ b/fengshen/examples/pretrain_bert/pretrain_bert.py @@ -0,0 +1,278 @@ +from data.bert_dataloader.load import BertDataModule +from transformers import ( + BertTokenizer, + BertConfig, + BertForPreTraining, + BertModel, + BertForMaskedLM +) +from pytorch_lightning import ( + LightningDataModule, + LightningModule, + loggers, + Trainer, +) +from pytorch_lightning.callbacks import ( + ModelCheckpoint, + LearningRateMonitor, +) +from typing import Optional +from torch.utils.data import DataLoader +from transformers.optimization import get_linear_schedule_with_warmup +import argparse +import sys +import torch +import os +import re +import jieba +import numpy as np + +# 如果没有安装fengshen模块,请把Fengshenbang-LM/fengshen加入到系统环境变量 +sys.path.insert(0, '../../../fengshen') + +os.environ["CUDA_VISIBLE_DEVICES"] = '0,1' + + +class DataCollate(object): + + def __init__(self, tokenizer, max_length, mask_rate=0.15, max_ngram=3, if_padding=True) -> None: + self.tokenizer = tokenizer + self.max_length = max_length + self.word_cuter = jieba.cut + self.vocab_length = len(tokenizer) + self.mask_rate = mask_rate + self.ignore_labels = -100 + self.ngrams = np.arange(1, max_ngram + 1, dtype=np.int64) + pvals = 1. 
/ np.arange(1, max_ngram + 1) + pvals /= pvals.sum(keepdims=True) # p(n) = 1/n / sigma(1/k) + self.pvals = pvals + self.padding = if_padding + + def token_process(self, token_id): + rand = np.random.random() + if rand <= 0.8: + return self.tokenizer.mask_token_id + elif rand <= 0.9: + return token_id + else: + return np.random.randint(1, self.vocab_length) + + def __call__(self, samples): + input_ids = [] + attention_mask = [] + token_type_ids = [] + batch_labels = [] + # print('^-^ batch size :',len(samples)) + for sample in samples: + word_list = list(self.word_cuter(sample['text'])) + mask_ids, labels = [], [] + + record = [] + for i in range(len(word_list)): + rands = np.random.random() + if i in record: + continue + word = word_list[i] + if rands > self.mask_rate and len(word) < 4: + word = word_list[i] + word_encode = tokenizer.encode(word, add_special_tokens=False) + for token in word_encode: + mask_ids.append(token) + labels.append(self.ignore_labels) + record.append(i) + else: + n = np.random.choice(self.ngrams, p=self.pvals) + for index in range(n): + ind = index + i + if ind in record or ind >= len(word_list): + continue + record.append(ind) + word = word_list[ind] + word_encode = tokenizer.encode(word, add_special_tokens=False) + for token in word_encode: + mask_ids.append(self.token_process(token)) + labels.append(token) + if self.padding: + if len(mask_ids) > self.max_length: + input_ids.append(mask_ids[:self.max_length]) + batch_labels.append(labels[:self.max_length]) + else: + lenght = len(mask_ids) + mask_ids.extend([0]*(self.max_length-lenght)) + labels.extend([-100]*(self.max_length-lenght)) + input_ids.append(mask_ids) + batch_labels.append(labels) + attention_mask.append([1]*self.max_length) + token_type_ids.append([0]*self.max_length) + + # print('sentence:',sample['text']) + # print('input_ids:',mask_ids) + # print('decode inputids:',self.tokenizer.decode(mask_ids)) + # print('labels',labels) + # print('decode labels:',self.tokenizer.decode(labels)) + # print('*'*20) + return { + 'input_ids': torch.tensor(input_ids), + 'labels': torch.tensor(batch_labels), + 'attention_mask': torch.tensor(attention_mask), + 'token_type_ids': torch.tensor(token_type_ids) + } + + +class Bert(LightningModule): + @staticmethod + def add_module_specific_args(args_parser): + parser = args_parser.add_argument_group('Bert') + parser.add_argument('--model_path', type=str, default='') + parser.add_argument('--learning_rate', default=1e-5, type=float) + parser.add_argument('--weight_decay', default=0.1, type=float) + parser.add_argument('--warmup', default=0.01, type=float) + return args_parser + + def __init__(self, args): + super().__init__() + self.save_hyperparameters(args) + self.bertconfig = BertConfig.from_pretrained(args.model_path) + # self.model = BertForPreTraining(self.bertconfig) + self.model = BertForMaskedLM(self.bertconfig) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + tb_size = self.hparams.train_batchsize * max(1, self.trainer.gpus) + ab_size = self.trainer.accumulate_grad_batches * float(self.trainer.max_epochs) + self.total_steps = (len(train_loader.dataset) // tb_size) // ab_size + + def configure_optimizers(self): + + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + paras = list( + filter(lambda p: p[1].requires_grad, self.named_parameters())) + paras = [{ + 'params': + [p for n, p in paras if not any(nd in n for nd in no_decay)], + 
'weight_decay': self.hparams.weight_decay + }, { + 'params': [p for n, p in paras if any(nd in n for nd in no_decay)], + 'weight_decay': 0.0 + }] + optimizer = torch.optim.AdamW(paras, lr=self.hparams.learning_rate) + scheduler = get_linear_schedule_with_warmup( + optimizer, int(self.total_steps * self.hparams.warmup), + self.total_steps) + + return [{ + 'optimizer': optimizer, + 'lr_scheduler': { + 'scheduler': scheduler, + 'interval': 'step', + 'frequency': 1 + } + }] + + def training_step(self, batch, batch_idx): + output = self.model(**batch) + # print(output) + self.log('train_loss', output.loss) + return output.loss + + def comput_metrix(self, logits, labels): + ones = torch.ones_like(labels) + zero = torch.zeros_like(labels) + mask = torch.where(labels < 0, zero, ones) + mask = mask.view(size=(-1,)).float() + # y_true=labels.view(size=(-1,)).float() + + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + corr = torch.multiply(corr.float(), mask) + acc = torch.sum(corr.float()) / torch.sum(mask) + return acc + + def validation_step(self, batch, batch_idx): + output = self.model(**batch) + # print(output) + acc = self.comput_metrix(output.logits, batch['labels']) + print('val_loss ', output.loss) + self.log('val_loss', output.loss) + self.log('val_acc', acc) + # pass + + def predict_step(self, batch, batch_idx): + output = self.model(**batch) + return output.prediction_logits + + +class CustomCKPT: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('ckpt call back') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./ckpt/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + parser.add_argument('--save_last', action='store_true', default=True) + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', action='store_true', default=False) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename, + save_last=args.save_last) + + +if __name__ == '__main__': + args_parser = argparse.ArgumentParser() + args_parser = BertDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) + args_parser = Bert.add_module_specific_args(args_parser) + args_parser = CustomCKPT.add_argparse_args(args_parser) + args_parser.add_argument('--deepspeed') + args_parser.add_argument('--seq_max_length') + + args = args_parser.parse_args() + + tokenizer = BertTokenizer.from_pretrained(args.model_path) + collate_fn = DataCollate(tokenizer, 512) + data_module = BertDataModule(tokenizer=tokenizer, args=args, collate_fn=collate_fn) + + print('data load complete') + + model = Bert(args) + print('model load complete') + + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'logs/'), + name=os.path.basename(os.path.dirname(args.model_path))) + checkpoint_callback = CustomCKPT(args).callbacks + + if 
args.resume_from_checkpoint is not None and \ + not os.path.exists(args.resume_from_checkpoint): + print('--------warning no checkpoint found--------, remove args') + del args.resume_from_checkpoint + + # autotuning + if args.deepspeed is not None: + os.environ['PL_DEEPSPEED_CONFIG_PATH'] = args.deepspeed + + trainer = Trainer.from_argparse_args(args, logger=logger, + callbacks=[ + lr_monitor, + checkpoint_callback]) + + trainer.fit(model, data_module) diff --git a/fengshen/examples/pretrain_bert/pretrain_bert.sh b/fengshen/examples/pretrain_bert/pretrain_bert.sh new file mode 100644 index 0000000000000000000000000000000000000000..f6e6453826d1c6408de4a7e064a7756529b0c6cd --- /dev/null +++ b/fengshen/examples/pretrain_bert/pretrain_bert.sh @@ -0,0 +1,116 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_bart # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) +#SBATCH -x dgx050 + + +MODEL_NAME=bert-1.3B + +config_json="./$MODEL_NAME.ds_config.json" +((MASTER_PORT=$RANDOM%10000+40000)) +echo $MASTER_PORT +ZERO_STAGE=2 +MICRO_BATCH_SIZE=16 + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": true, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "allgather_bucket_size": 2e8 + }, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "optimizer": { + "params": { + "betas": [ + 0.9, + 0.999 + ], + "eps": 1e-08, + "lr": 1e-04, + "weight_decay": 0.01 + }, + "type": "Adam" + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 536877, + "warmup_num_steps" : 50000 + }, + "type": "WarmupDecayLR" + }, + "steps_per_print": 100, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE, + "zero_allow_untested_optimizer": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/home/wuziwei/torch_extendsions + +DATA_ARGS="\ + --datasets_name wudao_180g \ + --num_workers 16 \ + --train_batchsize $MICRO_BATCH_SIZE + " + +MODEL_ARGS="\ + --model_path /data0/wuziwei/codes/Fengshenbang-LM/fengshen/examples/pretrain_bert/wudao180g_bert_base \ + --learning_rate 1e-5 \ + --weight_decay 0.01 \ + --warmup 0.001 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor train_loss \ + --save_top_k 3 \ + --mode min \ + --save_last \ + --every_n_train_steps 5000 \ + --dirpath /data0/wuziwei/codes/Fengshenbang-LM/fengshen/examples/pretrain_bert/$MODEL_NAME \ + --filename model-{step:02d}-{train_loss:.4f} \ + " +TRAINER_ARGS="\ + --gradient_clip_val 1.0 \ + --max_epochs 1 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy ddp \ + --log_every_n_steps 100 \ + --val_check_interval 0.1 \ + --check_val_every_n_epoch 1 \ + --accumulate_grad_batches 1 \ + --resume_from_checkpoint /data0/wuziwei/codes/Fengshenbang-LM/fengshen/examples/pretrain_bert/$MODEL_NAME/last.ckpt \ + --default_root_dir /data0/wuziwei/codes/Fengshenbang-LM/fengshen/examples/pretrain_bert/$MODEL_NAME \ + " + + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +export 
SCRIPT_PATH=/data0/wuziwei/codes/Fengshenbang-LM/fengshen/examples/pretrain_bert/pretrain_bert.py + +bash -c 'python3 $SCRIPT_PATH $options' + diff --git a/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen.py b/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen.py new file mode 100644 index 0000000000000000000000000000000000000000..1487abb15a7419b6c00056b6fcd78e96c8125d8b --- /dev/null +++ b/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen.py @@ -0,0 +1,237 @@ +from dataclasses import dataclass +from transformers import ( + MegatronBertConfig, + MegatronBertForPreTraining, + AutoTokenizer, +) +from pytorch_lightning import ( + LightningModule, + Trainer, +) +from pytorch_lightning.callbacks import ( + LearningRateMonitor, +) +import argparse +import torch +import os +import numpy as np +import time +from fengshen.data.universal_datamodule import UniversalDataModule +from fengshen.data.data_utils.sop_utils import get_a_and_b_segments +from fengshen.data.data_utils.truncate_utils import truncate_segments +from fengshen.data.data_utils.token_type_utils import create_tokens_and_tokentypes +from fengshen.data.data_utils.mask_utils import create_masked_lm_predictions +from fengshen.models.model_utils import ( + add_module_args, + configure_optimizers, + get_total_steps, +) +from fengshen.utils.universal_checkpoint import UniversalCheckpoint +from torch.utils.data._utils.collate import default_collate + +SHOW_DATA = False + + +@dataclass +class ErLangShenCollator: + ''' + 由input处理成samples,也就是最终模型的输入 + 其中主要处理逻辑在__call__里 + 包含Mask和Sop任务 + ''' + tokenizer: None # 分词 + max_seq_length: 512 + masked_lm_prob: 0.15 + content_key: str = 'text' + # 一些预处理操作 + + def setup(self): + from fengshen.data.data_utils.sentence_split import ChineseSentenceSplitter + self.sentence_split = ChineseSentenceSplitter() + self.np_rng = np.random.RandomState(seed=((int(time.time()) % 2**32))) + inv_vocab = {v: k for k, v in self.tokenizer.vocab.items()} + self.vocab_id_list = list(inv_vocab.keys()) + self.vocab_id_to_token_dict = inv_vocab + + def __call__(self, samples): + ''' + samples: 一个sample长这样{"text": "hello world"} + ''' + model_inputs = [] + for s in samples: + sentences = self.sentence_split.tokenize(s[self.content_key]) + # Divide sample into two segments (A and B). + tokenized_sentences = [self.tokenizer.convert_tokens_to_ids( + self.tokenizer.tokenize(sent)) for sent in sentences] + if len(tokenized_sentences) == 0: + print('find empty sentence') + continue + if len(tokenized_sentences) > 1: + tokens_a, tokens_b, is_next_random = get_a_and_b_segments(tokenized_sentences, + self.np_rng) + else: + tokens_a = tokenized_sentences[0] + tokens_b = [] + is_next_random = False + # max_seq_length - 3因为还需要拼上[CLS] [SEP] [SEP] + if len(tokens_a) == 0: + continue + _ = truncate_segments(tokens_a, tokens_b, len(tokens_a), + len(tokens_b), self.max_seq_length-3, self.np_rng) + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b, + self.tokenizer.cls_token_id, self.tokenizer.sep_token_id) + # Masking. + max_predictions_per_seq = self.masked_lm_prob * len(tokens) + (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions( + tokens, self.vocab_id_list, self.vocab_id_to_token_dict, self.masked_lm_prob, + self.tokenizer.cls_token_id, self.tokenizer.sep_token_id, self.tokenizer.mask_token_id, + max_predictions_per_seq, self.np_rng, + masking_style='bert') + + # Some checks. 
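+            # (added note) The assertions below sanity-check the lengths; after that,
+            # tokens and token types are padded to max_seq_length with pad_token_id,
+            # the padding mask marks real tokens with 1, and labels are -100 everywhere
+            # except the masked positions, so only masked tokens contribute to the MLM loss.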
+ num_tokens = len(tokens) + padding_length = self.max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [self.tokenizer.pad_token_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, + dtype=np.int64) + + # Lables and loss mask. + labels = [-100] * self.max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + labels_np = np.array(labels, dtype=np.int64) + model_inputs.append( + { + 'input_ids': tokens_np, + 'attention_mask': padding_mask_np, + 'token_type_ids': tokentypes_np, + 'labels': labels_np, + 'next_sentence_label': int(is_next_random) + } + ) + return default_collate(model_inputs) + + +class ErLangShenBert(LightningModule): + @staticmethod + def add_module_specific_args(parent_parser): + parser = parent_parser.add_argument_group('Erlangshen Bert') + parser.add_argument('--masked_lm_prob', type=float, default=0.15) + parser.add_argument('--max_seq_length', type=int, default=512) + parser.add_argument('--sample_content_key', type=str, default='text') + return parent_parser + + def __init__(self, args, tokenizer, **kwargs) -> None: + super().__init__() + self.save_hyperparameters(args) + config = MegatronBertConfig.from_pretrained(args.model_path) + self.config = config + self.tokenizer = tokenizer + self.model = MegatronBertForPreTraining(config) + + def setup(self, stage) -> None: + if stage == 'fit': + self.total_steps = get_total_steps(self.trainer, self.hparams) + print('Total steps: {}' .format(self.total_steps)) + + def configure_optimizers(self): + return configure_optimizers(self) + + def forward(self, **batch): + return self.model(**batch) + + def detokenize(self, token_ids): + toks = self.tokenizer.convert_ids_to_tokens(token_ids) + return self.tokenizer.convert_tokens_to_string(toks) + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.shape[0] + return acc + + def training_step(self, batch, batch_idx): + if self.trainer.global_rank == 0: + global SHOW_DATA + if not SHOW_DATA: + print(self.config) + print(self.model) + SHOW_DATA = True + print('source: {}'.format(batch['input_ids'][0])) + print('target: {}'.format(batch['labels'][0])) + print('source: {}'.format(self.detokenize(batch['input_ids'][0]))) + label_idx = batch['labels'][0] != -100 + print('target: {}'.format(self.detokenize( + batch['labels'][0][label_idx]))) + output = self(**batch) + self.log('train_loss', output.loss, sync_dist=True) + label_idx = batch['labels'] != -100 + acc = self.comput_metrix( + output.prediction_logits[label_idx].view(-1, output.prediction_logits.size(-1)), batch['labels'][label_idx]) + self.log('train_acc', acc, sync_dist=True) + return output.loss + + def validation_step(self, batch, batch_idx): + output = self(**batch) + self.log('val_loss', output.loss, sync_dist=True) + return output.loss + + def on_load_checkpoint(self, checkpoint) -> None: + # 兼容低版本lightning,低版本lightning从ckpt起来时steps数会被重置为0 + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = 
checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +if __name__ == '__main__': + args_parser = argparse.ArgumentParser() + args_parser = add_module_args(args_parser) + args_parser = UniversalDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) + args_parser = ErLangShenBert.add_module_specific_args(args_parser) + args_parser = UniversalCheckpoint.add_argparse_args(args_parser) + args = args_parser.parse_args() + + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + collate_fn = ErLangShenCollator( + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + masked_lm_prob=args.masked_lm_prob, + content_key=args.sample_content_key, + ) + collate_fn.setup() + data_module = UniversalDataModule(tokenizer=tokenizer, args=args, collate_fn=collate_fn) + print('data load complete') + + model = ErLangShenBert(args, tokenizer=tokenizer) + print('model load complete') + + lr_monitor = LearningRateMonitor(logging_interval='step') + checkpoint_callback = UniversalCheckpoint(args) + + # 做兼容,如果目录不存在的话把这个参数去掉,不然会报错 + if args.load_ckpt_path is not None and \ + not os.path.exists(args.load_ckpt_path): + print('--------warning no checkpoint found--------, remove args') + args.load_ckpt_path = None + + trainer = Trainer.from_argparse_args(args, + callbacks=[ + lr_monitor, + checkpoint_callback]) + + trainer.fit(model, data_module, ckpt_path=args.load_ckpt_path) diff --git a/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen_base.sh b/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen_base.sh new file mode 100644 index 0000000000000000000000000000000000000000..d3368c20dc1d5d287bef0619e341b35cc6228362 --- /dev/null +++ b/fengshen/examples/pretrain_erlangshen_bert/pretrain_erlangshen_base.sh @@ -0,0 +1,87 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_bart # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) +#SBATCH -x dgx050 + +# pwd=Fengshenbang-LM/fengshen/examples/pretrain_erlangshen +ROOT_DIR=../../workspace +export TORCH_EXTENSIONS_DIR=${ROOT_DIR}/torch_extendsions + +MODEL_NAME=erlangshen-bert-base +MODEL_ROOT_DIR=$ROOT_DIR/${MODEL_NAME} +if [ ! 
-d ${MODEL_ROOT_DIR} ];then + mkdir ${MODEL_ROOT_DIR} +fi + +NNODES=1 +GPUS_PER_NODE=1 + +MICRO_BATCH_SIZE=32 + +# 如果你不用Deepspeed的话 下面的一段话都可以删掉 Begin +CONFIG_JSON="$MODEL_ROOT_DIR/${MODEL_NAME}.ds_config.json" +ZERO_STAGE=1 +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $CONFIG_JSON +{ + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "fp16": { + "enabled": true + }, + "gradient_clipping": 2, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE +} +EOT +export PL_DEEPSPEED_CONFIG_PATH=$CONFIG_JSON +### End + +DATA_ARGS="\ + --dataloader_workers 2 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --datasets_name IDEA-CCNL/PretrainCorpusDemo \ + " +# 如果你有一批数据,可以参照IDEA-CCNL/PretrainCorpusDemo的格式处理,通过参数传入 +# --train_file train.json +# --val_file val.json +# --test_file test.json + +MODEL_ARGS="\ + --model_path $MODEL_ROOT_DIR/pretrain \ + --learning_rate 1e-4 \ + --weight_decay 1e-1 \ + --warmup_ratio 0.01 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --save_last \ + --save_ckpt_path ${MODEL_ROOT_DIR}/ckpt \ + --load_ckpt_path ${MODEL_ROOT_DIR}/ckpt/last.ckpt \ + " + +TRAINER_ARGS="\ + --max_epoch 1 \ + --gpus $GPUS_PER_NODE \ + --num_nodes $NNODES \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --log_every_n_steps 1 \ + --precision 16 \ + --default_root_dir ${MODEL_ROOT_DIR} \ + --replace_sampler_ddp False \ + " + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 pretrain_erlangshen.py $options diff --git a/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta.py b/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta.py new file mode 100644 index 0000000000000000000000000000000000000000..e6bd2f81781c5bfcdd55aa1514104f8dec5d8f50 --- /dev/null +++ b/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta.py @@ -0,0 +1,227 @@ +from dataclasses import dataclass +from transformers import ( + DebertaV2Config, + DebertaV2ForMaskedLM, + AutoTokenizer, +) +from pytorch_lightning import ( + LightningModule, + Trainer, +) +from pytorch_lightning.callbacks import ( + LearningRateMonitor, +) +import argparse +import torch +import os +import numpy as np +from fengshen.data.universal_datamodule import UniversalDataModule +from fengshen.data.data_utils.truncate_utils import truncate_segments +from fengshen.data.data_utils.token_type_utils import create_tokens_and_tokentypes +from fengshen.data.data_utils.mask_utils import create_masked_lm_predictions +from fengshen.models.model_utils import ( + add_module_args, + configure_optimizers, + get_total_steps, +) +from fengshen.utils.universal_checkpoint import UniversalCheckpoint +from torch.utils.data._utils.collate import default_collate + +SHOW_DATA = False + + +@dataclass +class DeBERTaV2Collator: + ''' + 由input处理成samples,也就是最终模型的输入 + 其中主要处理逻辑在__call__里 + 包含Mask任务,使用Whole Word Mask + ''' + tokenizer: None # 分词 + max_seq_length: 512 + masked_lm_prob: 0.15 + content_key: str = 'text' + # 一些预处理操作 + + def setup(self): + self.np_rng = np.random.RandomState(seed=42) + inv_vocab = {v: k for k, v in self.tokenizer.vocab.items()} + self.vocab_id_list = list(inv_vocab.keys()) + self.vocab_id_to_token_dict = inv_vocab + import jieba_fast + self.zh_tokenizer = jieba_fast.lcut + + def __call__(self, samples): + ''' + samples: 一个sample长这样{"text": "hello world"} + ''' + model_inputs = [] + for s in samples: + tokenized_sentences = self.tokenizer.convert_tokens_to_ids( 
+ self.tokenizer.tokenize(s[self.content_key])) + if len(tokenized_sentences) == 0: + print('find empty sentence') + continue + tokens_a = tokenized_sentences + # max_seq_length - 3因为还需要拼上[CLS] [SEP] [SEP] + if len(tokens_a) == 0: + continue + _ = truncate_segments(tokens_a, [], len(tokens_a), + 0, self.max_seq_length-3, self.np_rng) + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, [], + self.tokenizer.cls_token_id, self.tokenizer.sep_token_id) + # Masking. + max_predictions_per_seq = self.masked_lm_prob * len(tokens) + (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions( + tokens, self.vocab_id_list, self.vocab_id_to_token_dict, self.masked_lm_prob, + self.tokenizer.cls_token_id, self.tokenizer.sep_token_id, self.tokenizer.mask_token_id, + max_predictions_per_seq, self.np_rng, + masking_style='bert', + zh_tokenizer=self.zh_tokenizer) + + # Some checks. + num_tokens = len(tokens) + padding_length = self.max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [self.tokenizer.pad_token_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, + dtype=np.int64) + + # Lables and loss mask. + labels = [-100] * self.max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + labels_np = np.array(labels, dtype=np.int64) + model_inputs.append( + { + 'input_ids': tokens_np, + 'attention_mask': padding_mask_np, + 'token_type_ids': tokentypes_np, + 'labels': labels_np, + } + ) + return default_collate(model_inputs) + + +class ErlangshenDeBERTaV2(LightningModule): + @staticmethod + def add_module_specific_args(parent_parser): + parser = parent_parser.add_argument_group('Erlangshen Bert') + parser.add_argument('--masked_lm_prob', type=float, default=0.15) + parser.add_argument('--max_seq_length', type=int, default=512) + parser.add_argument('--sample_content_key', type=str, default='text') + return parent_parser + + def __init__(self, args, tokenizer, **kwargs) -> None: + super().__init__() + self.save_hyperparameters(args) + config = DebertaV2Config.from_pretrained(args.model_path) + self.config = config + self.tokenizer = tokenizer + self.model = DebertaV2ForMaskedLM(config) + + def setup(self, stage) -> None: + if stage == 'fit': + self.total_steps = get_total_steps(self.trainer, self.hparams) + print('Total steps: {}' .format(self.total_steps)) + + def configure_optimizers(self): + return configure_optimizers(self) + + def forward(self, **batch): + return self.model(**batch) + + def detokenize(self, token_ids): + toks = self.tokenizer.convert_ids_to_tokens(token_ids) + return self.tokenizer.convert_tokens_to_string(toks) + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.shape[0] + return acc + + def training_step(self, batch, batch_idx): + if self.trainer.global_rank == 0: + global SHOW_DATA + if not SHOW_DATA: + print(self.config) + print(self.model) + SHOW_DATA = True + print('source: {}'.format(batch['input_ids'][0])) + print('target: 
{}'.format(batch['labels'][0])) + print('source: {}'.format(self.detokenize(batch['input_ids'][0]))) + label_idx = batch['labels'][0] != -100 + print('target: {}'.format(self.detokenize( + batch['labels'][0][label_idx]))) + output = self(**batch) + self.log('train_loss', output.loss, sync_dist=True) + label_idx = batch['labels'] != -100 + acc = self.comput_metrix( + output.logits[label_idx].view(-1, output.logits.size(-1)), batch['labels'][label_idx]) + self.log('train_acc', acc, sync_dist=True) + return output.loss + + def validation_step(self, batch, batch_idx): + output = self(**batch) + self.log('val_loss', output.loss, sync_dist=True) + return output.loss + + def on_load_checkpoint(self, checkpoint) -> None: + # 兼容低版本lightning,低版本lightning从ckpt起来时steps数会被重置为0 + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +if __name__ == '__main__': + args_parser = argparse.ArgumentParser() + args_parser = add_module_args(args_parser) + args_parser = UniversalDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) + args_parser = ErlangshenDeBERTaV2.add_module_specific_args(args_parser) + args_parser = UniversalCheckpoint.add_argparse_args(args_parser) + args = args_parser.parse_args() + + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + collate_fn = DeBERTaV2Collator( + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + masked_lm_prob=args.masked_lm_prob, + content_key=args.sample_content_key, + ) + collate_fn.setup() + data_module = UniversalDataModule(tokenizer=tokenizer, args=args, collate_fn=collate_fn) + print('data load complete') + + model = ErlangshenDeBERTaV2(args, tokenizer=tokenizer) + print('model load complete') + + lr_monitor = LearningRateMonitor(logging_interval='step') + checkpoint_callback = UniversalCheckpoint(args) + + # 做兼容,如果目录不存在的话把这个参数去掉,不然会报错 + if args.load_ckpt_path is not None and \ + not os.path.exists(args.load_ckpt_path): + print('--------warning no checkpoint found--------, remove args') + args.load_ckpt_path = None + + trainer = Trainer.from_argparse_args(args, + callbacks=[ + lr_monitor, + checkpoint_callback]) + + trainer.fit(model, data_module, ckpt_path=args.load_ckpt_path) diff --git a/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta_base.sh b/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta_base.sh new file mode 100644 index 0000000000000000000000000000000000000000..bf6ad5cb30f14173854aa66bf91d731151ec47d7 --- /dev/null +++ b/fengshen/examples/pretrain_erlangshen_deberta_v2/pretrain_deberta_base.sh @@ -0,0 +1,88 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_bart # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) +#SBATCH -x dgx050 + +# pwd=Fengshenbang-LM/fengshen/examples/pretrain_erlangshen +ROOT_DIR=../../workspace +export TORCH_EXTENSIONS_DIR=${ROOT_DIR}/torch_extendsions + +MODEL_NAME=erlangshen-deberta-base +MODEL_ROOT_DIR=$ROOT_DIR/${MODEL_NAME} +if [ ! 
-d ${MODEL_ROOT_DIR} ];then + mkdir ${MODEL_ROOT_DIR} +fi + +NNODES=1 +GPUS_PER_NODE=1 + +MICRO_BATCH_SIZE=32 + +# 如果你不用Deepspeed的话 下面的一段话都可以删掉 Begin +CONFIG_JSON="$MODEL_ROOT_DIR/${MODEL_NAME}.ds_config.json" +ZERO_STAGE=1 +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $CONFIG_JSON +{ + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "fp16": { + "enabled": true + }, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE +} +EOT +export PL_DEEPSPEED_CONFIG_PATH=$CONFIG_JSON +### End + +DATA_ARGS="\ + --dataloader_workers 2 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --datasets_name IDEA-CCNL/PretrainCorpusDemo \ + " +# 如果你有一批数据,可以参照IDEA-CCNL/PretrainCorpusDemo的格式处理,通过参数传入 +# --train_file train.json +# --val_file val.json +# --test_file test.json + +MODEL_ARGS="\ + --model_path $MODEL_ROOT_DIR/pretrain \ + --learning_rate 1e-4 \ + --weight_decay 1e-1 \ + --warmup_ratio 0.01 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --save_last \ + --save_ckpt_path ${MODEL_ROOT_DIR}/ckpt \ + --load_ckpt_path ${MODEL_ROOT_DIR}/ckpt/last.ckpt \ + " + +TRAINER_ARGS="\ + --max_epoch 10 \ + --gpus $GPUS_PER_NODE \ + --num_nodes $NNODES \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --log_every_n_steps 1 \ + --precision 16 \ + --default_root_dir ${MODEL_ROOT_DIR} \ + --replace_sampler_ddp False \ + " + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 pretrain_deberta.py $options +#srun -N $NNODES --gres=gpu:$GPUS_PER_NODE --ntasks-per-node=$GPUS_PER_NODE --cpus-per-task=20 python3 pretrain_deberta.py $options diff --git a/fengshen/examples/pretrain_randeng_bart/pretrain_bart.py b/fengshen/examples/pretrain_randeng_bart/pretrain_bart.py new file mode 100644 index 0000000000000000000000000000000000000000..f8c779de17c7b990b05e0e189cc1c486b8678115 --- /dev/null +++ b/fengshen/examples/pretrain_randeng_bart/pretrain_bart.py @@ -0,0 +1,281 @@ +from transformers import AutoTokenizer, BartForConditionalGeneration, BartConfig +from pytorch_lightning import ( + LightningModule, + Trainer, +) +from pytorch_lightning.callbacks import LearningRateMonitor +from dataclasses import dataclass +import os +import argparse +import torch +import math +import time +from torch.utils.data._utils.collate import default_collate +from fengshen.data.data_utils.mask_utils import create_masked_lm_predictions +from fengshen.data.universal_datamodule import UniversalDataModule +from fengshen.utils import UniversalCheckpoint +from fengshen.models.model_utils import ( + get_total_steps, + configure_optimizers, + add_module_args, +) +import numpy as np +SHOW_DATA = False + + +@ dataclass +class BartCollator: + ''' + 由input处理成samples,也就是最终模型的输入 + 其中主要处理逻辑在__call__里 + 包含text infilling和sentence shuffle任务 + ''' + tokenizer: None # 分词 + max_seq_length: 512 + masked_lm_prob: 0.15 + permute_sentence_ratio: 1.0 + content_key: str = 'text' + + def setup(self): + from fengshen.data.data_utils.sentence_split import ChineseSentenceSplitter + self.sentence_split = ChineseSentenceSplitter() + self.np_rng = np.random.RandomState(seed=((int(time.time()) % 2**32))) + inv_vocab = {v: k for k, v in self.tokenizer.vocab.items()} + self.vocab_id_list = list(inv_vocab.keys()) + self.vocab_id_to_token_dict = inv_vocab + import jieba_fast + self.zh_tokenizer = jieba_fast.lcut + seg_tokens = ['。', ';', ';', '!', '!', '?', '?'] + seg_token_ids = [] + for t in 
seg_tokens: + if t in self.tokenizer.vocab: + seg_token_ids.append(self.tokenizer.vocab[t]) + else: + print('seg_token "{}" not in vocab'.format(t)) + self.seg_token_ids = set(seg_token_ids) + + def permute_sentences(self, source, full_stops, p=1.0): + # Tokens that are full stops, where the previous token is not + sentence_ends = (full_stops[1:] * ~full_stops[:-1]).nonzero(as_tuple=False) + 2 + result = source.clone() + + num_sentences = sentence_ends.size(0) + num_to_permute = math.ceil((num_sentences * 2 * p) / 2.0) + substitutions = torch.randperm(num_sentences)[:num_to_permute] + ordering = torch.arange(0, num_sentences) + ordering[substitutions] = substitutions[torch.randperm(num_to_permute)] + + # Ignore at start + index = 1 + for i in ordering: + sentence = source[(sentence_ends[i - 1] if i > 0 else 1): sentence_ends[i]] + result[index: index + sentence.size(0)] = sentence + index += sentence.size(0) + return result + + def __call__(self, samples): + ''' + samples: 一个sample长这样{"text": "hello world"} + ''' + model_inputs = [] + for s in samples: + sentences = self.sentence_split.tokenize(s[self.content_key]) + tokenized_sentences = [self.tokenizer.convert_tokens_to_ids( + self.tokenizer.tokenize(sent)) for sent in sentences] + if len(tokenized_sentences) == 0: + print('find empty sentence') + continue + + tokens = [self.tokenizer.cls_token_id] + for sent in tokenized_sentences: + for t in sent: + tokens.append(t) + if tokens[-1] != self.tokenizer.sep_token_id: + tokens.append(self.tokenizer.sep_token_id) + + if len(tokens) > self.max_seq_length: + # 找到最后的一句话,如果有的话,尽量保证最后一句话的完整 + last_pos = self.max_seq_length - 1 + for i in range(self.max_seq_length - 1, 0, -1): + if tokens[i-1] in self.seg_token_ids: + last_pos = i + break + tokens = tokens[:last_pos] + + tokens.append(self.tokenizer.sep_token_id) + tokens = torch.LongTensor(tokens) + + full_stops = torch.any(torch.stack([torch.eq(tokens, aelem).logical_or_( + torch.eq(tokens, aelem)) for aelem in self.seg_token_ids], dim=0), dim=0) + + assert (self.max_seq_length - + tokens.shape[0]) >= 0, (tokens.size(), tokens[-1], self.max_seq_length) + + source, target = tokens, tokens.clone() + + if self.permute_sentence_ratio > 0.0: + source = self.permute_sentences(source, full_stops, self.permute_sentence_ratio) + + if self.masked_lm_prob > 0.0: + mask_prob = self.masked_lm_prob * 2 + max_predictions_per_seq = mask_prob * len(source) + (source, _, _, _, _) = create_masked_lm_predictions( + source.numpy(), self.vocab_id_list, self.vocab_id_to_token_dict, mask_prob, + self.tokenizer.cls_token_id, self.tokenizer.sep_token_id, self.tokenizer.mask_token_id, + max_predictions_per_seq, self.np_rng, + masking_style='bert', zh_tokenizer=self.zh_tokenizer) + # 合并[MASK] 因为这里用的是Bert的mask函数,Bert是按字mask的, + # 这里把连续的mask合并成一个MASK从而达到span mask的效果 + span_mask_souce = [] + for t in source: + # 如果是连续的多个mask,则跳过 + if len(span_mask_souce) > 0 \ + and t is self.tokenizer.mask_token_id \ + and span_mask_souce[-1] is self.tokenizer.mask_token_id: + continue + span_mask_souce.append(t) + + source = torch.LongTensor(span_mask_souce) + + assert (source >= 0).all() + # assert (source[1:-1] >= 1).all(), source + assert (source <= self.tokenizer.vocab_size).all() + assert source[0] == self.tokenizer.cls_token_id + assert source[-1] == self.tokenizer.sep_token_id + + prev_output_tokens = torch.zeros_like(target) + # match the preprocessing in fairseq + prev_output_tokens[0] = self.tokenizer.sep_token_id + prev_output_tokens[1:] = target[:-1] + + source_ = 
torch.full((self.max_seq_length,), + self.tokenizer.pad_token_id, dtype=torch.long) + source_[:source.shape[0]] = source + target_ = torch.full((self.max_seq_length,), -100, dtype=torch.long) + target_[:target.shape[0]] = target + prev_output_tokens_ = torch.full( + (self.max_seq_length,), self.tokenizer.pad_token_id, dtype=torch.long) + prev_output_tokens_[:prev_output_tokens.shape[0]] = prev_output_tokens + attention_mask = torch.full((self.max_seq_length,), 0, dtype=torch.long) + attention_mask[:source.shape[0]] = 1 + model_inputs.append({ + "input_ids": source_, + "labels": target_, + "decoder_input_ids": prev_output_tokens_, + "attention_mask": attention_mask, + }) + return default_collate(model_inputs) + + +class RandengBart(LightningModule): + @staticmethod + def add_module_specific_args(parent_parser): + parser = parent_parser.add_argument_group('Randeng BART') + parser.add_argument('--masked_lm_prob', type=float, default=0.15) + parser.add_argument('--max_seq_length', type=int, default=512) + parser.add_argument('--sample_content_key', type=str, default='text') + parser.add_argument('--permute_sentence_ratio', type=str, default=1.0) + return parent_parser + + def __init__(self, args, tokenizer, **kwargs) -> None: + super().__init__() + self.save_hyperparameters(args) + config = BartConfig.from_pretrained(args.model_path) + self.model = BartForConditionalGeneration(config) + self.tokenizer = tokenizer + + def setup(self, stage) -> None: + if stage == 'fit': + self.total_steps = get_total_steps(self.trainer, self.hparams) + + def configure_optimizers(self): + return configure_optimizers(self) + + def detokenize(self, token_ids): + toks = self.tokenizer.convert_ids_to_tokens(token_ids) + return self.tokenizer.convert_tokens_to_string(toks) + + def training_step(self, batch, batch_idx): + if self.trainer.global_rank == 0: + global SHOW_DATA + if not SHOW_DATA: + SHOW_DATA = True + print('source: {}'.format(batch['input_ids'][0])) + print('target: {}'.format(batch['labels'][0])) + print('decoder source: {}'.format(batch['decoder_input_ids'][0])) + + print('source: {}'.format(self.detokenize(batch['input_ids'][0]))) + print('decoder source: {}'.format(self.detokenize(batch['decoder_input_ids'][0]))) + label_idx = batch['labels'][0] != -100 + print('target: {}'.format(self.detokenize( + batch['labels'][0][label_idx]))) + output = self.model(**batch) + acc = self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss, sync_dist=True) + self.log('train_acc', acc, sync_dist=True) + return output.loss + + def comput_metrix(self, logits, labels): + label_idx = labels != -100 + labels = labels[label_idx] + logits = logits[label_idx].view(-1, logits.size(-1)) + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.shape[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self.model(**batch) + acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss, sync_dist=True) + self.log('val_acc', acc, sync_dist=True) + + def on_load_checkpoint(self, checkpoint) -> None: + # 兼容低版本lightning,低版本lightning从ckpt起来时steps数会被重置为0 + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +if __name__ == '__main__': + args_parser = 
argparse.ArgumentParser() + args_parser = add_module_args(args_parser) + args_parser = UniversalDataModule.add_data_specific_args(args_parser) + args_parser = Trainer.add_argparse_args(args_parser) + args_parser = RandengBart.add_module_specific_args(args_parser) + args_parser = UniversalCheckpoint.add_argparse_args(args_parser) + args = args_parser.parse_args() + + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + + collator = BartCollator( + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + masked_lm_prob=args.masked_lm_prob, + content_key=args.sample_content_key, + permute_sentence_ratio=args.permute_sentence_ratio, + ) + # 准备一些额外参数 + collator.setup() + data_module = UniversalDataModule(tokenizer=tokenizer, args=args, collate_fn=collator) + + module = RandengBart(args, tokenizer=tokenizer) + + lr_monitor = LearningRateMonitor(logging_interval='step') + checkpoint_callback = UniversalCheckpoint(args) + + # 做兼容,如果目录不存在的话把这个参数去掉,不然会报错 + if args.load_ckpt_path is not None and \ + not os.path.exists(args.load_ckpt_path): + print('--------warning no checkpoint found--------, remove args') + args.load_ckpt_path = None + + trainer = Trainer.from_argparse_args(args, + callbacks=[ + lr_monitor, + checkpoint_callback]) + + trainer.fit(module, data_module, ckpt_path=args.load_ckpt_path) diff --git a/fengshen/examples/pretrain_randeng_bart/pretrain_bart_base.sh b/fengshen/examples/pretrain_randeng_bart/pretrain_bart_base.sh new file mode 100755 index 0000000000000000000000000000000000000000..2ac4d8d40a2135c7439c150d7b208f94ba002a0d --- /dev/null +++ b/fengshen/examples/pretrain_randeng_bart/pretrain_bart_base.sh @@ -0,0 +1,87 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_bart # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks-per-node=8 # number of tasks to run per node +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:8 # number of gpus per node +#SBATCH -o %x-%j.log # output and error log file names (%x for job id) +#SBATCH -x dgx050 + +# pwd=Fengshenbang-LM/fengshen/examples/pretrain_erlangshen +ROOT_DIR=../../workspace +export TORCH_EXTENSIONS_DIR=${ROOT_DIR}/torch_extendsions + +MODEL_NAME=randeng-bart-base +MODEL_ROOT_DIR=$ROOT_DIR/${MODEL_NAME} +if [ ! 
-d ${MODEL_ROOT_DIR} ];then + mkdir ${MODEL_ROOT_DIR} +fi + +NNODES=1 +GPUS_PER_NODE=1 + +MICRO_BATCH_SIZE=32 + +# 如果你不用Deepspeed的话 下面的一段话都可以删掉 Begin +CONFIG_JSON="$MODEL_ROOT_DIR/${MODEL_NAME}.ds_config.json" +ZERO_STAGE=1 +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $CONFIG_JSON +{ + "zero_optimization": { + "stage": ${ZERO_STAGE} + }, + "fp16": { + "enabled": true + }, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE +} +EOT +export PL_DEEPSPEED_CONFIG_PATH=$CONFIG_JSON +### End + +DATA_ARGS="\ + --dataloader_workers 2 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + " +# 如果你有一批数据,可以参照IDEA-CCNL/PretrainCorpusDemo的格式处理,通过参数传入 +# --train_file train.json +# --val_file val.json +# --test_file test.json + +MODEL_ARGS="\ + --model_path $MODEL_ROOT_DIR/pretrain \ + --learning_rate 1e-4 \ + --weight_decay 1e-1 \ + --warmup_ratio 0.01 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --save_last \ + --save_ckpt_path ${MODEL_ROOT_DIR}/ckpt \ + --load_ckpt_path ${MODEL_ROOT_DIR}/ckpt/last.ckpt \ + " + +TRAINER_ARGS="\ + --max_epoch 10 \ + --gpus $GPUS_PER_NODE \ + --num_nodes $NNODES \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --log_every_n_steps 1 \ + --precision 16 \ + --default_root_dir ${MODEL_ROOT_DIR} \ + --replace_sampler_ddp False \ + " + +export options=" \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ + " + +python3 pretrain_bart.py $options +#srun -N $NNODES --gres=gpu:$GPUS_PER_NODE --ntasks-per-node=$GPUS_PER_NODE --cpus-per-task=20 python3 pretrain_bart.py $options diff --git a/fengshen/examples/pretrain_t5/convert_ckpt_randeng_t5_char.sh b/fengshen/examples/pretrain_t5/convert_ckpt_randeng_t5_char.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c446fd8784477d1caa1519b614d759aa3cb6ec8 --- /dev/null +++ b/fengshen/examples/pretrain_t5/convert_ckpt_randeng_t5_char.sh @@ -0,0 +1,30 @@ +#!/bin/bash +set -x -e + +echo "START TIME: $(date)" +BIN_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/randeng_t5_char_57M +if [ ! -d ${BIN_DIR} ];then + mkdir ${BIN_DIR} + echo ${BIN_DIR} created!!!!!!!!!!!!!! +else + echo ${BIN_DIR} exist!!!!!!!!!!!!!!! +fi + +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + + +MODEL_ARGS=" + --ckpt_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/ckpt/last.ckpt/checkpoint/mp_rank_00_model_states.pt \ + --bin_path ${BIN_DIR}/pytorch_model.bin \ + --rm_prefix module.model. 
\ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/convert_ckpt_to_bin.py + +export CMD=" \ + $SCRIPTS_PATH \ + $MODEL_ARGS \ + " + +echo $CMD +/home/ganruyi/anaconda3/bin/python $CMD diff --git a/fengshen/examples/pretrain_t5/convert_ckpt_to_bin.py b/fengshen/examples/pretrain_t5/convert_ckpt_to_bin.py new file mode 100644 index 0000000000000000000000000000000000000000..2aeef8c860864d138b0c970baca72a568bf51a19 --- /dev/null +++ b/fengshen/examples/pretrain_t5/convert_ckpt_to_bin.py @@ -0,0 +1,37 @@ +import time +from builtins import print +import argparse + +import torch +# os.environ["CUDA_VISIBLE_DEVICES"] = '3' + + +def get_time_str(): + return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + + +def main(): + total_parser = argparse.ArgumentParser("Pretrain Unsupervise.") + total_parser.add_argument('--ckpt_path', default=None, type=str) + total_parser.add_argument('--bin_path', default=None, type=str) + total_parser.add_argument('--rm_prefix', default=None, type=str) + # * Args for base model + args = total_parser.parse_args() + print('Argument parse success.') + state_dict = torch.load(args.ckpt_path)['module'] + new_state_dict = {} + + if args.rm_prefix is not None: + prefix_len = len(args.rm_prefix) + for k, v in state_dict.items(): + if k[:prefix_len] == args.rm_prefix: + new_state_dict[k[prefix_len:]] = v + else: + new_state_dict[k] = v + else: + new_state_dict = state_dict + torch.save(new_state_dict, args.bin_path) + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/pretrain_t5/finetune_t5.py b/fengshen/examples/pretrain_t5/finetune_t5.py new file mode 100644 index 0000000000000000000000000000000000000000..497b1ca26817d2c1dbf8d1be4b5cea51ad846f4e --- /dev/null +++ b/fengshen/examples/pretrain_t5/finetune_t5.py @@ -0,0 +1,144 @@ +import time +from builtins import print +import sys +import os +import torch +import argparse +import pytorch_lightning as pl +from pytorch_lightning import Trainer, loggers +from transformers import MT5ForConditionalGeneration +from pytorch_lightning.callbacks import LearningRateMonitor +# os.environ["CUDA_VISIBLE_DEVICES"] = '3' + + +class MT5FinetuneModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--keep_tokens_path', default=None, type=str) + return parent_args + + def __init__(self, args): + super().__init__() + self.save_hyperparameters(args) + self.model = MT5ForConditionalGeneration.from_pretrained( + args.pretrained_model_path + ) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches * float(self.trainer.max_epochs) + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + def training_step(self, batch, batch_idx): + output = self.model( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + labels=batch['labels']) + acc = 
self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss, sync_dist=True) + self.log('train_acc', acc, sync_dist=True) + return output.loss + + def validation_step(self, batch, batch_idx): + # print('is out of index: ', batch['input_ids'][batch['input_ids'] >= 32598]) + output = self.model( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + labels=batch['labels']) + acc = self.comput_metrix(output.logits, batch['labels']) + cond_output = self.model.generate( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + force_words_ids=batch['force_words_ids'], + num_beams=2, + ) + cond_acc = self.comput_metrix(cond_output, batch['labels']) + self.log('val_loss', output.loss, sync_dist=True) + self.log('val_acc', acc, sync_dist=True) + self.log('cond_acc', cond_acc, sync_dist=True) + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/y_true.shape[0] + return acc + + def on_save_checkpoint(self, checkpoint) -> None: + # Save the current loop info in the mid of epoch + # if you lightning <= 1.6.0 uncomment the line below + # checkpoint['loops'] = self.trainer.checkpoint_connector._get_loops_state_dict() + if self.trainer.global_rank == 0 and self.trainer.global_step % self.hparams.every_n_train_steps == 0: + self.model.save_pretrained(os.path.join( + self.trainer.checkpoint_callback.dirpath, + 'hf_pretrained_epoch{}_step{}'.format(self.trainer.current_epoch, self.trainer.global_step))) + + def on_load_checkpoint(self, checkpoint) -> None: + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +def get_time_str(): + return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + + +def main(): + total_parser = argparse.ArgumentParser("Pretrain Unsupervise.") + total_parser.add_argument( + '--do_eval_only', action='store_true', default=False) + total_parser.add_argument( + '--pretrained_model_path', default=None, type=str) + total_parser.add_argument( + '--new_vocab_path', default=None, type=str) + total_parser.add_argument('--max_seq_length', default=1024, type=int) + total_parser.add_argument('--ckpt_path', default=None, type=str) + sys.path.append('../../../') + from fengshen.data.t5_dataloader.t5_datasets import TaskT5DataModel + from fengshen.utils.universal_checkpoint import UniversalCheckpoint + # * Args for data preprocessing + total_parser = TaskT5DataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = UniversalCheckpoint.add_argparse_args(total_parser) + total_parser = MT5FinetuneModel.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + print('Argument parse success.') + print('TaskT5DataModel load start {}'.format(get_time_str())) + data_model = TaskT5DataModel(args) + print('TaskT5DataModel load end {}'.format(get_time_str())) + if not args.do_eval_only: + model = MT5FinetuneModel(args) + checkpoint_callback = UniversalCheckpoint(args) + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'logs/')) + trainer = Trainer.from_argparse_args(args, + 
logger=logger, + callbacks=[checkpoint_callback, lr_monitor] + ) + trainer.fit(model, data_model, ckpt_path=args.ckpt_path) + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/pretrain_t5/finetune_unimc_randeng_t5_char_57M.sh b/fengshen/examples/pretrain_t5/finetune_unimc_randeng_t5_char_57M.sh new file mode 100644 index 0000000000000000000000000000000000000000..fccf833bdc954707bdc94d6bef3821239006a2c6 --- /dev/null +++ b/fengshen/examples/pretrain_t5/finetune_unimc_randeng_t5_char_57M.sh @@ -0,0 +1,129 @@ +#!/bin/bash +#SBATCH --job-name=finetune_unimc_randeng_t5_char_57M +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=32 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=64 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/finetune_unimc_randeng_t5_char_57M/ +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.finetune_unimc_randeng_t5_char_57M.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] +export CUDA_VISIBLE_DEVICES='6' + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 240000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 100000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.1 \ + --dataset_num_workers 4 \ + --dataloader_num_workers 4 \ + --replace_sampler_ddp False \ +" +# --accumulate_grad_batches 8 \ +TRAIN_DATA_DIR=/cognitive_comp/yangping/data/unidata/multiplechoice/pretraining_alldata/alldata/train.json +VALID_DATA_DIR=/cognitive_comp/yangping/data/unidata/multiplechoice/pretraining_alldata/alldata/dev.json + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data_path ${TRAIN_DATA_DIR} \ + --valid_data_path ${TRAIN_DATA_DIR} \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/randeng_t5_char_57M \ + --tokenizer_type bert_tokenizer \ +" + 
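+# NOTE: SCRIPTS_PATH below is an absolute path from the original author's environment;
+# point it at fengshen/examples/pretrain_t5/finetune_t5.py inside your own
+# Fengshenbang-LM checkout before launching.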
+SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/finetune_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +/home/ganruyi/anaconda3/bin/python $CMD +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + diff --git a/fengshen/examples/pretrain_t5/pretrain_mt5_small.sh b/fengshen/examples/pretrain_t5/pretrain_mt5_small.sh new file mode 100644 index 0000000000000000000000000000000000000000..4e9d49e3a83d9a886890740179a9ae3739a58654 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_mt5_small.sh @@ -0,0 +1,124 @@ +#!/bin/bash +#SBATCH --job-name=randeng_t5_77M +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o %x-%j.log +#SBATCH -e %x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=64 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_77M/ + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.t5_cn_small_pretrain.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 100000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 8 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 50000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.01 \ + --preprocessing_num_workers 20 \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_t5_tokenized_512 + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/hf_models/google/mt5-small \ + --new_vocab_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn.model \ + --keep_tokens_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn_keep_tokens.json \ +" +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ 
+ $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +srun --jobid=171866 --job-name=randeng_t5_77M --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 -e %x-%j.err -o %x-%j.log singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +# salloc --nodes=1 --gres=gpu:2 --cpus-per-gpu=20 -t 24:00:00 +# clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' +# clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python -u -m debugpy --listen 192.168.190.2:53005 --wait-for-client $CMD' \ No newline at end of file diff --git a/fengshen/examples/pretrain_t5/pretrain_mt5_small_continue.sh b/fengshen/examples/pretrain_t5/pretrain_mt5_small_continue.sh new file mode 100644 index 0000000000000000000000000000000000000000..0a539a7e6a7fb4b750b441df98dd49f166c3c49b --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_mt5_small_continue.sh @@ -0,0 +1,120 @@ +#!/bin/bash +#SBATCH --job-name=t5_cn_small_pretrain_v2 +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o %x-%j.log +#SBATCH -e %x-%j.err +#SBATCH -x dgx050 + +set -x -e +source activate base + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=32 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/t5_cn_small_pretrain_v2/ + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.t5_cn_small_pretrain_v2.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() + +cat < $config_json +{ + "zero_optimization": { + "stage": 1 + }, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "optimizer": { + "params": { + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-08, + "lr": 1e-04, + "weight_decay": 0.01 + }, + "type": "AdamW" + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 0, + "warmup_max_lr": 1e-4, + "warmup_num_steps": 10000 + } + }, + "steps_per_print": 100, + "gradient_clipping": 1, + "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE, + "zero_allow_untested_optimizer": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 8 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 0 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.01 \ + --preprocessing_num_workers 20 \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_mt5_tokenized + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 1024 \ +" + 
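+# DATA_ARGS reads the pre-tokenized wudao_180g_mt5_tokenized corpus and holds out 0.1% of it for
+# validation (--train_split_size 0.999); MODEL_ARGS below continues pretraining from the local
+# Randeng-T5-77M checkpoint instead of the original google/mt5-small weights.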
+MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/t5_cn_small_pretrain/Randeng-T5-77M \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --keep_tokens_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn_keep_tokens.json \ +" +# --resume_from_checkpoint /cognitive_comp/ganruyi/fengshen/t5_cn_small_pretrain/ckpt/last.ckpt \ + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +# salloc --nodes=1 --gres=gpu:2 --cpus-per-gpu=20 -t 24:00:00 +clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' +# clear; srun --job-name=t5_cn_small_pretrain_v2 --jobid=153124 --nodes=1 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=30 -o %x-%j.log -e %x-%j.err singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' diff --git a/fengshen/examples/pretrain_t5/pretrain_mt5_small_predict.sh b/fengshen/examples/pretrain_t5/pretrain_mt5_small_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..be643bb12ddf613e99a5f6ac3bd23f3ab0773a33 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_mt5_small_predict.sh @@ -0,0 +1,126 @@ +#!/bin/bash +#SBATCH --job-name=t5_cn_small_pretrain +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o /cognitive_comp/ganruyi/fengshen/t5_cn_small_pretrain/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/fengshen/t5_cn_small_pretrain/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=128 +ROOT_DIR=/cognitive_comp/ganruyi/fengshen/t5_cn_small_pretrain/ + +ZERO_STAGE=2 + +config_json="$ROOT_DIR/ds_config.t5_cn_small_pretrain.json" +export MASTER_PORT=$[RANDOM%10000+30000] +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": 128, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "AdamW", + "params": { + "lr": 1e-4, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 0, + "warmup_max_lr": 1e-4, + "warmup_num_steps": 10000 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_2 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 
10 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.01 \ + --accumulate_grad_batches 8 \ + --resume_from_checkpoint /cognitive_comp/ganruyi/fengshen/t5_cn_small_pretrain/old-ckpt/last.ckpt \ + --do_eval_only \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_mt5_tokenized + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data wudao_180g_mt5_tokenized\ + --train_split_size 0.999 \ + --max_seq_length 1024 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/hf_models/google/mt5-small \ + --new_vocab_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn.model \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --keep_tokens_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn_keep_tokens.json \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/fengshen/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +# clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' +/home/ganruyi/anaconda3/bin/python $CMD \ No newline at end of file diff --git a/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_10B.sh b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_10B.sh new file mode 100644 index 0000000000000000000000000000000000000000..6b85b4886dffc191c6d4856f66c2b3fd51817f69 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_10B.sh @@ -0,0 +1,129 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_randeng_t5_char_10B +#SBATCH --nodes=4 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=32 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o /cognitive_comp/ganruyi/experiments/randeng_t5_char_10B/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/experiments/randeng_t5_char_10B/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=1 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_char_10B/ +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +ZERO_STAGE=2 + +config_json="$ROOT_DIR/ds_config.randeng_t5_char_10B.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] +export CUDA_VISIBLE_DEVICES='1,2,3,4' + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "cpu_offload": true, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 100000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_${ZERO_STAGE} + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 4 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 1000000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.1 \ + --dataset_num_workers 4 \ + --dataloader_num_workers 4 \ + --replace_sampler_ddp False \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_bert_tokenized_512 + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data_path ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_10B/randeng_t5_char_10B \ + --tokenizer_type bert_tokenizer \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +/home/ganruyi/anaconda3/bin/python $CMD +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + diff --git a/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_57M.sh b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_57M.sh new file mode 100644 index 0000000000000000000000000000000000000000..8e86e8b077019a57c5a6ac28ab29749f1a2787aa --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_57M.sh @@ -0,0 +1,128 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_randeng_t5_char_57M +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=32 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=64 
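+# MICRO_BATCH_SIZE is the per-GPU batch size (train_micro_batch_size_per_gpu in the DeepSpeed
+# config below); with --gpus 8 and no gradient accumulation the effective global batch size is
+# 8 * 64 = 512.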
+ROOT_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/ +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.randeng_t5_char_57M.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] +# export CUDA_VISIBLE_DEVICES='4,5' + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 240000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 8 \ + --num_nodes 1 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 100000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.1 \ + --dataset_num_workers 4 \ + --dataloader_num_workers 4 \ + --replace_sampler_ddp False \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_bert_tokenized_512 + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data_path ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_57M/randeng_t5_char_57M \ + --tokenizer_type bert_tokenizer \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +/home/ganruyi/anaconda3/bin/python $CMD +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + diff --git a/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_700M.sh b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_700M.sh new file mode 100644 index 0000000000000000000000000000000000000000..5b3b2c6c87831ebce78d4f7e0ed133b7a8468ba2 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_char_700M.sh @@ -0,0 +1,129 @@ +#!/bin/bash +#SBATCH --job-name=pretrain_randeng_t5_char_700M +#SBATCH --nodes=2 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o 
/cognitive_comp/ganruyi/experiments/randeng_t5_char_700M/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/experiments/randeng_t5_char_700M/%x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=8 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_char_700M/ +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.randeng_t5_char_700M.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] +# export CUDA_VISIBLE_DEVICES='2,5' + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 400000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 8 \ + --num_nodes 2 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 100000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.1 \ + --dataset_num_workers 4 \ + --dataloader_num_workers 4 \ + --replace_sampler_ddp False \ + --accumulate_grad_batches 2 \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_bert_tokenized_512 + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data_path ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_700M/randeng_t5_char_700M \ + --tokenizer_type bert_tokenizer \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +# /home/ganruyi/anaconda3/bin/python $CMD +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + diff --git a/fengshen/examples/pretrain_t5/pretrain_randeng_t5_large.sh b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_large.sh new file mode 100644 index 0000000000000000000000000000000000000000..a91d7082a4c945fe78a2fb0ce99be7c7d9a02745 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_randeng_t5_large.sh @@ -0,0 +1,132 @@ +#!/bin/bash +#SBATCH --job-name=randeng_t5_large +#SBATCH 
--nodes=2 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o %x-%j.log +#SBATCH -e %x-%j.err + +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=8 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/randeng_t5_large_v2/ +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +ZERO_STAGE=1 + +config_json="$ROOT_DIR/ds_config.randeng_t5_large_pretrain.$SLURM_JOBID.json" +export MASTER_PORT=$[RANDOM%10000+30000] + +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 100000, + "warmup_num_steps" : 10000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +# strategy=ddp +strategy=deepspeed_stage_1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 8 \ + --num_nodes 2 \ + --strategy ${strategy} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --every_n_train_steps 1000000 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --val_check_interval 0.01 \ + --preprocessing_num_workers 20 \ +" +# --accumulate_grad_batches 8 \ +DATA_DIR=wudao_180g_t5_tokenized_512 + +DATA_ARGS=" + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data ${DATA_DIR} \ + --train_split_size 0.999 \ + --max_seq_length 512 \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/hf_models/google/mt5-large \ + --new_vocab_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn.model \ + --keep_tokens_path /cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn_keep_tokens.json \ +" +# --ckpt_path /cognitive_comp/ganruyi/experiments/randeng_t5_large/ckpt/last.ckpt \ + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/pretrain_t5.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD +# source activate base +# python $CMD +# srun --nodes=1 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 --jobid=171866 -e %x-%j.err -o %x-%j.log python $CMD + +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +srun --jobid=172781 --job-name=randeng_t5_large --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --cpus-per-task=30 -e randeng_t5_large-%j.err -o randeng_t5_large-%j.log singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +# salloc --nodes=1 --gres=gpu:2 
--cpus-per-gpu=20 -t 24:00:00 +# clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' +# clear; srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python -u -m debugpy --listen 192.168.190.2:53005 --wait-for-client $CMD' \ No newline at end of file diff --git a/fengshen/examples/pretrain_t5/pretrain_t5.py b/fengshen/examples/pretrain_t5/pretrain_t5.py new file mode 100644 index 0000000000000000000000000000000000000000..7a95bc8781ca5f4e0fa3ef0cb1eea98e5d4abbe6 --- /dev/null +++ b/fengshen/examples/pretrain_t5/pretrain_t5.py @@ -0,0 +1,175 @@ +import time +from builtins import print +import sys +import os +import torch +import argparse +import json +import pytorch_lightning as pl +from transformers import MT5Config, MT5Tokenizer +from pytorch_lightning import Trainer, loggers +from transformers import MT5ForConditionalGeneration +from pytorch_lightning.callbacks import LearningRateMonitor +# os.environ["CUDA_VISIBLE_DEVICES"] = '3' + + +class MT5PretrainModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--keep_tokens_path', default=None, type=str) + return parent_args + + def __init__(self, args): + super().__init__() + self.save_hyperparameters(args) + if args.tokenizer_type == 't5_tokenizer': + if args.new_vocab_path is not None: + # 用于从mt5继续训练,此时只保留中英文词表,spm采用新模型 + assert args.keep_tokens_path is not None + keep_tokens = json.load(open(args.keep_tokens_path)) + self.model = MT5ForConditionalGeneration.from_pretrained( + args.pretrained_model_path) + new_config = self.model.config + new_config.vocab_size = len(keep_tokens) + print('vocab_size:', new_config.vocab_size) + + new_state_dict = self.model.state_dict() + select_index = torch.tensor(keep_tokens) + new_state_dict['encoder.embed_tokens.weight'] = torch.index_select( + new_state_dict['encoder.embed_tokens.weight'], dim=0, index=select_index) + new_state_dict['shared.weight'] = torch.index_select( + new_state_dict['shared.weight'], dim=0, index=select_index) + new_state_dict['decoder.embed_tokens.weight'] = torch.index_select( + new_state_dict['decoder.embed_tokens.weight'], dim=0, index=select_index) + new_state_dict['lm_head.weight'] = torch.index_select( + new_state_dict['lm_head.weight'], dim=0, index=select_index) + self.model = MT5ForConditionalGeneration.from_pretrained( + args.pretrained_model_path, config=new_config, state_dict=new_state_dict) + # self.model = MT5ForConditionalGeneration(config=new_config) + else: + # 用于继续训练 + self.model = MT5ForConditionalGeneration.from_pretrained( + args.pretrained_model_path + ) + else: + self.model = MT5ForConditionalGeneration( + MT5Config.from_pretrained(args.pretrained_model_path) + ) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches * float(self.trainer.max_epochs) + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def 
configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + def training_step(self, batch, batch_idx): + output = self.model( + input_ids=batch['input_ids'], labels=batch['labels']) + acc = self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss, sync_dist=True) + self.log('train_acc', acc, sync_dist=True) + return output.loss + + def validation_step(self, batch, batch_idx): + # print('is out of index: ', batch['input_ids'][batch['input_ids'] >= 32598]) + output = self.model( + input_ids=batch['input_ids'], labels=batch['labels']) + acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss, sync_dist=True) + self.log('val_acc', acc, sync_dist=True) + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/y_true.shape[0] + return acc + + def on_save_checkpoint(self, checkpoint) -> None: + # Save the current loop info in the mid of epoch + # if you lightning <= 1.6.0 uncomment the line below + # checkpoint['loops'] = self.trainer.checkpoint_connector._get_loops_state_dict() + if self.trainer.global_rank == 0 and self.trainer.global_step % self.hparams.every_n_train_steps == 0: + self.model.save_pretrained(os.path.join( + self.trainer.checkpoint_callback.dirpath, + 'hf_pretrained_epoch{}_step{}'.format(self.trainer.current_epoch, self.trainer.global_step))) + + def on_load_checkpoint(self, checkpoint) -> None: + global_step_offset = checkpoint["global_step"] + if 'global_samples' in checkpoint: + self.consumed_samples = checkpoint['global_samples'] + self.trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset + + +def get_time_str(): + return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + + +def main(): + total_parser = argparse.ArgumentParser("Pretrain Unsupervise.") + total_parser.add_argument( + '--do_eval_only', action='store_true', default=False) + total_parser.add_argument( + '--pretrained_model_path', default=None, type=str) + total_parser.add_argument( + '--new_vocab_path', default=None, type=str) + total_parser.add_argument('--max_seq_length', default=1024, type=int) + total_parser.add_argument('--ckpt_path', default=None, type=str) + sys.path.append('../../../') + from fengshen.data.t5_dataloader.t5_datasets import UnsuperviseT5DataModel + from fengshen.utils.universal_checkpoint import UniversalCheckpoint + # * Args for data preprocessing + total_parser = UnsuperviseT5DataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = UniversalCheckpoint.add_argparse_args(total_parser) + total_parser = MT5PretrainModel.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + print('Argument parse success.') + print('UnsuperviseT5DataModel load start {}'.format(get_time_str())) + data_model = UnsuperviseT5DataModel(args) + print('UnsuperviseT5DataModel load end {}'.format(get_time_str())) + if not args.do_eval_only: + model = MT5PretrainModel(args) + checkpoint_callback = UniversalCheckpoint(args) + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'logs/')) + trainer = Trainer.from_argparse_args(args, + logger=logger, + 
callbacks=[checkpoint_callback, lr_monitor] + ) + trainer.fit(model, data_model, ckpt_path=args.ckpt_path) + else: + tokenizer = MT5Tokenizer.from_pretrained(args.new_vocab_path, extra_ids=0) + model = MT5PretrainModel(args=args, num_data=len(data_model.predict_dataloader())) + trainer = Trainer.from_argparse_args(args) + + result = trainer.predict(model, data_model) + result = result[0] + for i in range(4): + print(tokenizer.batch_decode(result['input_ids'][i])) + print(tokenizer.batch_decode(result['predict_ids'][i])) + print(tokenizer.batch_decode(result['labels'][i])) + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/pretrain_t5/process_data.py b/fengshen/examples/pretrain_t5/process_data.py new file mode 100644 index 0000000000000000000000000000000000000000..bae164f107f7ec3569227f3e40a292ee1641fd21 --- /dev/null +++ b/fengshen/examples/pretrain_t5/process_data.py @@ -0,0 +1,65 @@ +# coding=utf8 +import argparse +import sys +import os +from concurrent.futures import ProcessPoolExecutor + + +def _generate_cache_arrow(index, ds, path): + print('saving dataset shard {}'.format(index)) + ds.save_to_disk(os.path.join(path, 'part_{}'.format(index))) + return 'saving dataset shard {} done'.format(index) + + +def generate_arrow_cache(ds, args) -> None: + ''' + 读取wudao_180g等原数据或者tokenized之后的数据,并进行train test split + 同时利用seed 42做shuffle 缓存下来 + ''' + ds = ds.train_test_split(train_size=args.train_split_size, seed=42) + print(ds) + p = ProcessPoolExecutor(max_workers=args.preprocessing_num_workers) + res = [] + train_shard_part = args.saved_data_shards + for i in range(0, train_shard_part): + res.append(p.submit(_generate_cache_arrow, i, + ds['train'].shard(train_shard_part, i), args.saved_train_data_path)) + + p.shutdown(wait=True) + for future in res: + print(future.result(), flush=True) + + ds['test'].save_to_disk(args.saved_test_data_path) + print('done') + + +if __name__ == '__main__': + total_parser = argparse.ArgumentParser("Save data Task") + total_parser.add_argument( + '--new_vocab_path', default='/cognitive_comp/ganruyi/hf_models/t5_cn_small/sentencepiece_cn.model', type=str) + total_parser.add_argument('--preprocessing_num_workers', default=30, type=int) + total_parser.add_argument( + '--train_data_path', default='/cognitive_comp/common_data/test_wudao_180g_mt5_tokenized/', type=str) + total_parser.add_argument('--saved_data_shards', default=800, type=int) + total_parser.add_argument('--saved_train_data_path', default=None, type=str) + total_parser.add_argument('--saved_test_data_path', default=None, type=str) + total_parser.add_argument('--max_seq_length', default=512, type=int) + total_parser.add_argument('--train_split_size', default=0.999, type=float) + total_parser.add_argument('--pretrained_model_path', default=None, type=str) + total_parser.add_argument('--tokenizer_type', default='t5_tokenizer', choices=['t5_tokenizer', 'bert_tokenizer']) + total_parser.add_argument('--text_column_name', default='text') + total_parser.add_argument('--remove_columns', nargs='+', default=[]) + + # * Args for data preprocessing + args = total_parser.parse_args() + sys.path.append('../../../') + from fengshen.data.t5_dataloader.t5_datasets import UnsuperviseT5Dataset + ds = UnsuperviseT5Dataset(args.train_data_path, args) + print(ds) + generate_arrow_cache(ds.data, args=args) + # ds = UnsuperviseT5Dataset(args.train_data_path, args, load_data_type=0) + for i in range(0, 2): + print(ds.data[i]) + print(ds.tokenizer.decode(ds.data[i]['input_ids'])) + + print(ds.data) diff --git 
a/fengshen/examples/pretrain_t5/process_data_bert_tokenizer.sh b/fengshen/examples/pretrain_t5/process_data_bert_tokenizer.sh new file mode 100644 index 0000000000000000000000000000000000000000..b17187c6a26c0a5edf46cf2d9c5736338e6ff934 --- /dev/null +++ b/fengshen/examples/pretrain_t5/process_data_bert_tokenizer.sh @@ -0,0 +1,36 @@ +#!/bin/bash +#SBATCH --job-name=process_data_bert_tokenizer +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=1 # number of gpus +#SBATCH --cpus-per-task=120 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH -o /cognitive_comp/ganruyi/experiments/randeng_t5_char_77M/%x-%j.log +#SBATCH -e /cognitive_comp/ganruyi/experiments/randeng_t5_char_77M/%x-%j.err +set -x -e + +echo "START TIME: $(date)" + +DATA_ARGS=" + --tokenizer_type bert_tokenizer \ + --train_data_path wudao_180g \ + --train_split_size 0.999 \ + --max_seq_length 512 \ + --preprocessing_num_workers 100 \ + --saved_data_shards 800 \ + --saved_train_data_path /cognitive_comp/common_data/wudao_180g_bert_tokenized_512_train/ \ + --saved_test_data_path /cognitive_comp/common_data/wudao_180g_bert_tokenized_512_test/ \ + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_char_77M/randeng_t5_char_77M \ + --text_column_name text \ + --remove_columns token_type_ids text \ +" + + # --remove_columns text \ +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/pretrain_t5/process_data.py + +export CMD=" \ + $SCRIPTS_PATH \ + $DATA_ARGS \ + " + +echo $CMD +source activate base +/home/ganruyi/anaconda3/bin/python $CMD \ No newline at end of file diff --git a/fengshen/examples/summary/pretrain_bart_summary.sh b/fengshen/examples/summary/pretrain_bart_summary.sh new file mode 100644 index 0000000000000000000000000000000000000000..f8a6af24f935cc563891922b8a50cd293231367b --- /dev/null +++ b/fengshen/examples/summary/pretrain_bart_summary.sh @@ -0,0 +1,124 @@ +#!/bin/bash +#SBATCH --job-name=bart_summary +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=4 +#SBATCH --gres=gpu:4 # number of gpus +#SBATCH -o %x-%j.log + +set -x -e + +echo "START TIME: $(date)" +MODEL_NAME=bart-base +MICRO_BATCH_SIZE=16 +ROOT_DIR=/cognitive_comp/dongxiaoqun/finetune/${MODEL_NAME} + +ZERO_STAGE=1 +export TORCH_EXTENSIONS_DIR=/cognitive_comp/dongxiaoqun/torch_extendsions +config_json="./ds_config.${MODEL_NAME}.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 5e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-4 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +# export PL_DEEPSPEED_CONFIG_PATH=$config_json + +TRAINER_ARGS=" + --max_epochs 2 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --default_root_dir $ROOT_DIR \ + --dirpath 
$ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor val_loss \ + --mode min \ + --save_last \ + --every_n_train_steps 0 \ + --val_check_interval 0.1 \ +" + +prompt='"' +DATA_ARGS=" + --datasets_name lcsts \ + --num_workers 8 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --max_enc_length 128 \ + --max_dec_length 64 \ + --val_datasets_field val \ + --prompt $prompt \ +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/gaoxinyu/pretrained_model/bart-base \ + --output_save_path $ROOT_DIR/${MODEL_NAME}_predict_lcsts.json \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --precision 16 \ +" + +SCRIPTS_PATH=seq2seq_summary.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +#singularity exec --nv -B /cognitive_comp/ganruyi/Megatron/:/cognitive_comp/ganruyi/Megatron/,/cognitive_comp/gaoxinyu/:/cognitive_comp/gaoxinyu/ $SINGULARITY_PATH python $CMD + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" +# srun --nodes=1 --gres=gpu:4 --ntasks-per-node=4 --cpus-per-gpu=20 +source activate +conda activate torchnew +srun --nodes=1 --ntasks-per-node=1 --gres=gpu:1 --cpus-per-task=30 -o ${MODEL_NAME}-%J.log --jobid=229623 bash -c 'python3 $SCRIPT_PATH $CMD' diff --git a/fengshen/examples/summary/randeng_pegasus_523M_summary.sh b/fengshen/examples/summary/randeng_pegasus_523M_summary.sh new file mode 100644 index 0000000000000000000000000000000000000000..10f6d29a6acd1fe70117d0f1b8d33ce58cdb1384 --- /dev/null +++ b/fengshen/examples/summary/randeng_pegasus_523M_summary.sh @@ -0,0 +1,143 @@ +#!/bin/bash +#SBATCH --job-name=randeng_pegasus_523M_summary +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --cpus-per-task=30 +#SBATCH -o %x-%j.log + +set -x -e + +echo "START TIME: $(date)" +MODEL_NAME=randeng_pegasus_523M_summary_last +MICRO_BATCH_SIZE=128 +ROOT_DIR=/cognitive_comp/dongxiaoqun/finetune/${MODEL_NAME} + +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +output_save_path=$ROOT_DIR/${MODEL_NAME}.json +if [ -f ${output_save_path} ];then + echo ${output_save_path} exist, rm it!!!!!!!!!!!!!!!!! 
+ rm ${output_save_path} +fi + +ZERO_STAGE=1 + +config_json="${ROOT_DIR}/ds_config.${MODEL_NAME}.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 1000, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 5e-5, + "betas": [ + 0.9, + 0.999 + ], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_min_lr": 1e-8, + "warmup_max_lr": 1e-4, + "total_num_steps": 60000, + "warmup_num_steps" : 1000 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/dongxiaoqun/torch_extendsions +# export MASTER_PORT=$[RANDOM%10000+50000] +# +# --strategy deepspeed_stage_${ZERO_STAGE} \ +TRAINER_ARGS=" + --max_epochs 10 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor val_loss \ + --mode min \ + --save_last \ + --every_n_train_steps 10000 \ + --val_check_interval 0.1 \ +" +prompt='"' +DATA_ARGS=" + --datasets_name lcsts \ + --num_workers 30 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --max_enc_length 128 \ + --max_dec_length 64 \ + --val_datasets_field val \ + --prompt $prompt \ +" + +# --prompt $prompt \ +# --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_77M_summary/ckpt/hf_pretrained_epoch1_step75019 \ + +# mode_path="/cognitive_comp/dongxiaoqun/train_model/fengshen-pegasus-base/ckpt/hf_pretrained_epoch0_step22200/" +mode_path="/cognitive_comp/dongxiaoqun/train_model/fengshen-pegasus-large/ckpt/hf_pretrained_epoch0_step122000" +cp /cognitive_comp/dongxiaoqun/pretrained_model/pegasus-large/vocab.txt $mode_path/ + +MODEL_ARGS=" + --pretrained_model_path $mode_path \ + --output_save_path $output_save_path \ + --self_tokenizer \ +" + +SCRIPTS_PATH=/cognitive_comp/dongxiaoqun/debug/Fengshenbang-LM/fengshen/examples/summary/seq2seq_summary.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +source activate +conda activate torchnew +srun --nodes=1 --ntasks-per-node=1 --gres=gpu:1 --cpus-per-task=30 -o ${MODEL_NAME}-%J.log --jobid=229555 bash -c 'python3 $SCRIPT_PATH $CMD' + diff --git a/fengshen/examples/summary/randeng_t5_70M_summary.sh b/fengshen/examples/summary/randeng_t5_70M_summary.sh new file mode 100644 index 0000000000000000000000000000000000000000..403d8d4dd022bf90fe9f50854291ec4e48f13aff --- /dev/null +++ b/fengshen/examples/summary/randeng_t5_70M_summary.sh @@ -0,0 +1,128 @@ +#!/bin/bash +#SBATCH --job-name=randeng_t5_77M_summary +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=2 +#SBATCH --gres=gpu:2 # number of gpus +#SBATCH --cpus-per-task=30 +#SBATCH -o %x-%j.log + +set -x -e + +echo "START TIME: $(date)" 
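+# MODEL_NAME names this run; ROOT_DIR, the generated DeepSpeed config and the checkpoint
+# directory are all derived from it below.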
+MODEL_NAME=randeng_t5_77M_summary_test2 +MICRO_BATCH_SIZE=64 +ROOT_DIR=/cognitive_comp/dongxiaoqun/finetune/${MODEL_NAME} +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +output_save_path=$ROOT_DIR/${MODEL_NAME}.json +if [ -f ${output_save_path} ];then + echo ${output_save_path} exist, rm it!!!!!!!!!!!!!!!!! + rm ${output_save_path} +fi +ZERO_STAGE=1 + +config_json="${ROOT_DIR}/ds_config.${MODEL_NAME}.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 60000, + "warmup_num_steps" : 500 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/dongxiaoqun/torch_extendsions +# export MASTER_PORT=$[RANDOM%10000+30000] +# export PL_FAULT_TOLERANT_TRAINING=1 + +TRAINER_ARGS=" + --max_epochs 2 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor val_loss \ + --mode min \ + --save_last \ + --every_n_train_steps 0 \ + --val_check_interval 0.1 \ +" + +prompt="summary:" +DATA_ARGS=" + --datasets_name lcsts \ + --num_workers 30 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --max_enc_length 128 \ + --max_dec_length 64 \ + --val_datasets_field val \ + --prompt $prompt \ +" +# --prompt $prompt \ +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_77M/ckpt/hf_pretrained_epoch0_step183100 \ + --output_save_path $ROOT_DIR/randeng_t5_77M_predict_lcsts.json \ +" + +SCRIPTS_PATH=/cognitive_comp/dongxiaoqun/debug/Fengshenbang-LM/fengshen/examples/summary/seq2seq_summary.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " +echo $CMD +# python $CMD + +source activate +conda activate torchnew +srun --nodes=1 --ntasks-per-node=1 --gres=gpu:1 --cpus-per-task=30 -o ${MODEL_NAME}-%J.log --jobid=229623 bash -c 'python3 $SCRIPT_PATH $CMD' diff --git a/fengshen/examples/summary/randeng_t5_70M_summary_predict.sh b/fengshen/examples/summary/randeng_t5_70M_summary_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..ccbf410fa92b1d5e09c97d6ae3af7bb4ff121c64 --- /dev/null +++ b/fengshen/examples/summary/randeng_t5_70M_summary_predict.sh @@ -0,0 +1,138 @@ +#!/bin/bash +#SBATCH --job-name=randeng_t5_77M_summary_predict +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=2 +#SBATCH --gres=gpu:2 # number of gpus +#SBATCH --cpus-per-task=30 +#SBATCH -o %x-%j.log + +set -x -e + +echo "START TIME: 
$(date)" +MODEL_NAME=randeng_t5_77M_summary_predict +MICRO_BATCH_SIZE=16 +ROOT_DIR=/cognitive_comp/ganruyi/experiments/${MODEL_NAME} +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +output_save_path=$ROOT_DIR/randeng_t5_77M_predict_lcsts.json +if [ -f ${output_save_path} ];then + echo ${output_save_path} exist, rm it!!!!!!!!!!!!!!!!! + rm ${output_save_path} +fi + +ZERO_STAGE=1 + +config_json="${ROOT_DIR}/ds_config.${MODEL_NAME}.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "betas": [ + 0.9, + 0.95 + ], + "eps": 1e-8, + "weight_decay": 5e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-4 + } + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions +export MASTER_PORT=$[RANDOM%10000+50000] + +# --strategy deepspeed_stage_${ZERO_STAGE} \ +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 2 \ + --num_nodes 1 \ + --strategy ddp \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor train_loss \ + --mode min \ + --save_last \ + --every_n_train_steps 0 \ +" +DATA_DIR=/cognitive_comp/ganruyi/data_datasets_LCSTS_LCSTS/ +prompt="summary:" +DATA_ARGS=" + --datasets_name lcsts \ + --num_workers 30 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --max_enc_length 128 \ + --max_dec_length 64 \ + --val_datasets_field val \ + --prompt $prompt \ +" +# --prompt $prompt \ +# --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_77M_summary/ckpt/hf_pretrained_epoch1_step75019 \ + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/gaoxinyu/pretrained_model/bart-759M \ + --output_save_path $ROOT_DIR/randeng_t5_77M_predict_lcsts.json \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --precision 16 \ + --warmup 0.01 \ + --do_eval_only \ + --max_dec_length 32 \ +" + +SCRIPTS_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/summary/seq2seq_summary.py +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " +echo $CMD +source activate base +# srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' +python $CMD \ No newline at end of file diff --git a/fengshen/examples/summary/randeng_t5_784M_summary.sh b/fengshen/examples/summary/randeng_t5_784M_summary.sh new file mode 100644 index 0000000000000000000000000000000000000000..5b3e60c8784ac563eff09763591e00b6d250444f --- /dev/null +++ 
b/fengshen/examples/summary/randeng_t5_784M_summary.sh @@ -0,0 +1,130 @@ +#!/bin/bash +#SBATCH --job-name=randeng_t5_77M_summary +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=2 +#SBATCH --gres=gpu:2 # number of gpus +#SBATCH --cpus-per-task=30 +#SBATCH -o %x-%j.log + +set -x -e + +echo "START TIME: $(date)" +MODEL_NAME=randeng_t5_784M_summary +MICRO_BATCH_SIZE=8 +ROOT_DIR=/cognitive_comp/dongxiaoqun/finetune/${MODEL_NAME} +if [ ! -d ${ROOT_DIR} ];then + mkdir ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +ZERO_STAGE=1 + +config_json="${ROOT_DIR}/ds_config.${MODEL_NAME}.json" + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "train_micro_batch_size_per_gpu": ${MICRO_BATCH_SIZE}, + "steps_per_print": 100, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": false, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 50000000, + "allgather_bucket_size": 500000000 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-4, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "params": { + "warmup_max_lr": 1e-04, + "warmup_min_lr": 1e-05, + "total_num_steps": 60000, + "warmup_num_steps" : 500 + }, + "type": "WarmupDecayLR" + }, + "zero_allow_untested_optimizer": false, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/dongxiaoqun/torch_extendsions +# export MASTER_PORT=$[RANDOM%10000+30000] +# export PL_FAULT_TOLERANT_TRAINING=1 + +TRAINER_ARGS=" + --max_epochs 1 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_${ZERO_STAGE} \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor val_loss \ + --mode min \ + --save_last \ + --every_n_train_steps 0 \ + --val_check_interval 0.1 \ +" + +prompt="summary:" +DATA_ARGS=" + --datasets_name lcsts \ + --num_workers 30 \ + --train_batchsize $MICRO_BATCH_SIZE \ + --val_batchsize $MICRO_BATCH_SIZE \ + --test_batchsize $MICRO_BATCH_SIZE \ + --max_enc_length 128 \ + --max_dec_length 64 \ + --val_datasets_field val \ + --prompt $prompt \ +" +# --prompt $prompt \ +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/ganruyi/experiments/randeng_t5_large_v2/ckpt/hf_pretrained_epoch0_step732500 \ + --output_save_path $ROOT_DIR/randeng_t5_784M_predict_lcsts.json \ +" + +SCRIPTS_PATH=/cognitive_comp/dongxiaoqun/debug/Fengshenbang-LM/fengshen/examples/summary/seq2seq_summary.py +SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " +echo $CMD + +source activate +conda activate torchnew +srun --nodes=1 --ntasks-per-node=1 --gres=gpu:1 --cpus-per-task=30 -o ${MODEL_NAME}-%J.log --jobid=229668 bash -c 'python3 $SCRIPT_PATH $CMD' +# source activate base +# python $CMD + +# srun --jobid=229668 --nodes=1 --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=30 -e ${ROOT_DIR}/${MODEL_NAME}-%j.err -o ${ROOT_DIR}/${MODEL_NAME}-%j.log singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' + +# srun python $CMD +# srun singularity exec --nv 
-B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c '/home/ganruyi/anaconda3/bin/python $CMD' diff --git a/fengshen/examples/summary/seq2seq_summary.py b/fengshen/examples/summary/seq2seq_summary.py new file mode 100644 index 0000000000000000000000000000000000000000..c0c725c215d61dc5c6fa0fbf6603b7f06f0a317b --- /dev/null +++ b/fengshen/examples/summary/seq2seq_summary.py @@ -0,0 +1,197 @@ + +import torch +import os +import argparse +import json +import pytorch_lightning as pl +from fengshen.models.model_utils import add_module_args +from fengshen.data.task_dataloader.task_datasets import AbstractCollator +from fengshen.data.universal_datamodule import UniversalDataModule +from fengshen.utils.universal_checkpoint import UniversalCheckpoint +from fengshen.utils.utils import chinese_char_tokenize +from torchmetrics.text.rouge import ROUGEScore +from pytorch_lightning import Trainer, loggers +from pytorch_lightning.callbacks import LearningRateMonitor +from transformers import AutoTokenizer, AutoModelForSeq2SeqLM +import sys +sys.path.append('../../../') + + +# os.environ["CUDA_VISIBLE_DEVICES"] = '3,4' + + +class FinetuneSummary(pl.LightningModule): + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--rouge_keys', default='rougeL,rouge1,rouge2', type=str) + return parent_args + + def __init__(self, args, tokenizer=None): + super().__init__() + self.save_hyperparameters(args) + self.model = AutoModelForSeq2SeqLM.from_pretrained( + args.pretrained_model_path) + self.tokenizer = tokenizer + assert self.tokenizer, "tokenizer is None!" + self.rouge_keys = tuple(args.rouge_keys.split(',')) + self.rouge_metric = ROUGEScore(rouge_keys=self.rouge_keys, normalizer=lambda x: x) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + tb_size = self.hparams.train_batchsize * max(1, self.trainer.gpus) + ab_size = self.trainer.accumulate_grad_batches * \ + float(self.trainer.max_epochs) + self.total_steps = ( + len(train_loader.dataset) // tb_size) // ab_size + print('total_steps is :', self.total_steps) + + def training_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + self.log('train_loss', output.loss, sync_dist=True) + return output.loss + + def on_validation_start(self) -> None: + # rm file at validation start + prefix, ext = os.path.splitext(self.hparams.output_save_path) + file_path_rank = '{}_{}{}'.format( + prefix, self.trainer._accelerator_connector.cluster_environment.global_rank(), ext) + if os.path.exists(file_path_rank): + print('rm {}'.format(file_path_rank)) + os.remove(file_path_rank) + + def validation_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + generated_ids = self.model.generate( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + max_length=self.hparams.max_dec_length + ) + + preds = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) + labels = torch.where(batch['labels'] != -100, batch['labels'], + self.tokenizer.pad_token_id) + labels = self.tokenizer.batch_decode( + labels, skip_special_tokens=True, clean_up_tokenization_spaces=True) + # save preds for every rank + prefix, ext = 
os.path.splitext(self.hparams.output_save_path) + file_path_rank = '{}_{}{}'.format( + prefix, self.trainer._accelerator_connector.cluster_environment.global_rank(), ext) + self.save_prediction_to_file(preds=preds, texts=batch['text'], + summarys=batch['summary'], file_path=file_path_rank) + # you need to split chinese char with space for rouge metric + new_preds = [chinese_char_tokenize(p) for p in preds] + new_labels = [chinese_char_tokenize(label) for label in labels] + # update metric + self.rouge_metric.update(preds=new_preds, target=new_labels) + self.log('val_loss', output.loss, sync_dist=True) + + def validation_epoch_end(self, outputs): + # compute metric for all process + rouge_dict = self.rouge_metric.compute() + # reset the metric after once validation + self.rouge_metric.reset() + for k, v in rouge_dict.items(): + self.log('val_{}'.format(k), v, sync_dist=True) + if self.trainer._accelerator_connector.cluster_environment.global_rank() == 0: + print('rouge:\n', rouge_dict) + + def on_save_checkpoint(self, checkpoint) -> None: + if self.trainer._accelerator_connector.cluster_environment.global_rank() == 0: + self.model.save_pretrained(os.path.join( + self.trainer.checkpoint_callback.dirpath, + 'hf_pretrained_epoch{}_step{}'.format(checkpoint['epoch'], checkpoint['global_step']))) + + def save_prediction_to_file(self, preds, texts, summarys, file_path): + with open(file_path, 'a', encoding='utf-8') as f: + for idx, pred in enumerate(preds): + text = texts[idx] + summary = summarys[idx] + tmp_result = dict() + tmp_result['pred'] = pred + tmp_result['label'] = summary + tmp_result['text'] = text + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data + '\n') + + def predict_step(self, batch, batch_idx): + # print(batch) + texts = batch['text'] + # output summary and metrics + generated_ids = self.model.generate( + input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], + max_length=self.hparams.max_dec_length + ) + preds = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) + labels = self.tokenizer.batch_decode( + batch['labels'], skip_special_tokens=True, clean_up_tokenization_spaces=True) + print(batch_idx, len(preds), len(labels)) + self.save_prediction_to_file(preds, texts, labels) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +def main(): + total_parser = argparse.ArgumentParser("Summary Task") + total_parser.add_argument('--do_eval_only', + action='store_true', + default=False) + total_parser.add_argument('--pretrained_model_path', + default='google/mt5-small', + type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', + type=str) + total_parser.add_argument('--self_tokenizer', + action='store_true', + default=False) + total_parser.add_argument('--max_enc_length', default=1024, type=int) + total_parser.add_argument('--max_dec_length', default=256, type=int) + total_parser.add_argument('--prompt', default='summarize:', type=str) + # * Args for data preprocessing + # from fengshen.data.task_dataloader.task_datasets import LCSTSDataModel + total_parser = UniversalDataModule.add_data_specific_args(total_parser) + # * Args for training + total_parser = add_module_args(total_parser) + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = UniversalCheckpoint.add_argparse_args(total_parser) + total_parser = 
FinetuneSummary.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + + if args.self_tokenizer: + from fengshen.examples.pegasus.tokenizers_pegasus import PegasusTokenizer + tokenizer = PegasusTokenizer.from_pretrained(args.pretrained_model_path) + else: + tokenizer = AutoTokenizer.from_pretrained(args.pretrained_model_path, use_fast=False) + collator = AbstractCollator(tokenizer, args.max_enc_length, + args.max_dec_length, args.prompt) + data_model = UniversalDataModule(tokenizer=tokenizer, args=args, collate_fn=collator) + model = FinetuneSummary(args, tokenizer) + if not args.do_eval_only: + lr_monitor = LearningRateMonitor(logging_interval='step') + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'log/')) + checkpoint_callback = UniversalCheckpoint(args) + trainer = Trainer.from_argparse_args(args, + logger=logger, + callbacks=[lr_monitor, + checkpoint_callback] + ) + trainer.fit(model, data_model) + else: + trainer = Trainer.from_argparse_args(args) + # trainer.predict(model, data_model) + trainer.validate(model, data_model) + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/ubert/README.md b/fengshen/examples/ubert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fdad2ca0d948830c51bf141dceb907c4531a4690 --- /dev/null +++ b/fengshen/examples/ubert/README.md @@ -0,0 +1,280 @@ +# Ubert: 统一 NLU 任务新范式 +- 论文:[https://arxiv.org/pdf/2206.12094.pdf](https://arxiv.org/pdf/2206.12094.pdf) +- 知乎:[https://zhuanlan.zhihu.com/p/539958182?](https://zhuanlan.zhihu.com/p/539958182?) + +### 简介 +Ubert 是我们在做 [2022AIWIN 世界人工智能创新大赛:中文保险小样本多任务](http://ailab.aiwin.org.cn/competitions/68#results) 时提出的一种解决方案。并取得A/B榜榜首的成绩,且B榜综合成绩领先第二名超过 1 个百分点,领先第三名接近 5 个百分点。相比于官方提供的 baseline,提高 20 个百分点。Ubert 不仅可以完成 实体识别、事件抽取等常见抽取任务,还可以完成新闻分类、自然语言推理等分类任务,且所有任务是共享一个统一框架、统一任务、统一训练目标的模型。解题思路和方案可以参考我们的答辩PPT,或者参考我们的[知乎文章](https://zhuanlan.zhihu.com/p/539958182?) 
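+
+下面给出一个极简的示意代码(非官方完整示例):安装方式与 predict 接口见下文“快速开箱使用”一节;这里假设分类任务的预测输入与下文“数据预处理示例”中的格式一致,把各个类别描述作为 choices 传入,即可体验统一框架下的 Zero-Shot 文本分类:
+
+```python
+import argparse
+from fengshen import UbertPiplines
+
+total_parser = argparse.ArgumentParser("TASK NAME")
+total_parser = UbertPiplines.piplines_args(total_parser)
+args = total_parser.parse_args()
+
+# 类别描述写进 choices 的 entity_type 字段(此格式为假设,详见下文“数据预处理示例”)
+test_data = [{
+    "task_type": "分类任务",
+    "subtask_type": "文本分类",
+    "text": "7000亿美元救市方案将成期市毒药",
+    "choices": [
+        {"entity_type": "一则股票新闻"},
+        {"entity_type": "一则教育新闻"},
+        {"entity_type": "一则科学新闻"}
+    ],
+    "id": 0
+}]
+
+model = UbertPiplines(args)
+result = model.predict(test_data)
+for line in result:
+    print(line)
+```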
+ +## 开源模型列表 + 开源的模型是我们在比赛模型的基础上重新整理 70+ 份数据,共 100万+条样本,进行预训练而得到的,可直接开箱即用。开源模型地址如下: +| 模型 | 地址 | +|:---------:|:--------------:| +| Erlangshen-Ubert-110M-Chinese | [https://huggingface.co/IDEA-CCNL/Erlangshen-Ubert-110M-Chinese](https://huggingface.co/IDEA-CCNL/Erlangshen-Ubert-110M-Chinese) | +| Erlangshen-Ubert-330M-Chinese | [https://huggingface.co/IDEA-CCNL/Erlangshen-Ubert-330M-Chinese](https://huggingface.co/IDEA-CCNL/Erlangshen-Ubert-330M-Chinese) | + + +## 快速开箱使用 +安装我们的 fengshen 框架,我们暂且提供如下方式安装 +```python +git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git +cd Fengshenbang-LM +pip install --editable ./ +``` + +一键运行下面代码得到预测结果, 你可以任意修改示例 text 和要抽取的 entity_type,体验一下 Zero-Shot 性能 +```python +import argparse +from fengshen import UbertPiplines + +total_parser = argparse.ArgumentParser("TASK NAME") +total_parser = UbertPiplines.piplines_args(total_parser) +args = total_parser.parse_args() + +test_data=[ + { + "task_type": "抽取任务", + "subtask_type": "实体识别", + "text": "这也让很多业主据此认为,雅清苑是政府公务员挤对了国家的经适房政策。", + "choices": [ + {"entity_type": "小区名字"}, + {"entity_type": "岗位职责"} + ], + "id": 0} +] + +model = UbertPiplines(args) +result = model.predict(test_data) +for line in result: + print(line) +``` + +## 继续 finetune 使用 + +开源的模型我们已经经过大量的数据进行预训练而得到,可以直接进行 Zero-Shot,如果你还想继续finetune,可以参考我们的 [example.py](https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/main/fengshen/examples/ubert/example.py)。你只需要将我们数据预处理成为我们定义的格式,即可使用简单的几行代码完成模型的训练和推理。我们是复用 pytorch-lightning 的 trainer 。在训练时,可以直接传入 trainer 的参数,此外我们还定义了一些其他参数。常用的参数如下: + + +```sh +--pretrained_model_path #预训练模型的路径,默认 +--load_checkpoints_path #加载模型的路径,如果你finetune完,想加载模型进行预测可以传入这个参数 +--batchsize #批次大小, 默认 8 +--monitor #保存模型需要监控的变量,例如我们可监控 val_span_acc +--checkpoint_path #模型保存的路径, 默认 ./checkpoint +--save_top_k #最多保存几个模型, 默认 3 +--every_n_train_steps #多少步保存一次模型, 默认 100 +--learning_rate #学习率, 默认 2e-5 +--warmup #预热的概率, 默认 0.01 +--default_root_dir #模型日子默认输出路径 +--gradient_clip_val #梯度截断, 默认 0.25 +--gpus #gpu 的数量 +--check_val_every_n_epoch #多少次验证一次, 默认 100 +--max_epochs #多少个 epochs, 默认 5 +--max_length #句子最大长度, 默认 512 +--num_labels #训练每条样本最多取多少个label,超过则进行随机采样负样本, 默认 10 +``` + +## 数据预处理示例 + +整个模型的 Piplines 我们已经写好,所以为了方便,我们定义了数据格式。目前我们在预训练中主要含有一下几种任务类型 + +| task_type | subtask_type | +|:---------:|:--------------:| +| 分类任务 | 文本分类 | +| | 自然语言推理 | +| | 情感分析 | +| | 多项式阅读理解 | +| 抽取任务 | 实体识别 | +| | 事件抽取 | +| | 抽取式阅读理解 | +| | 关系抽取 | + +### 分类任务 + +#### 普通分类任务 +对于分类任务,我们把类别描述当作是 entity_type,我们主要关注 label 字段,label为 1 表示该该标签是正确的标签。如下面示例所示 +```json +{ + "task_type": "分类任务", + "subtask_type": "文本分类", + "text": "7000亿美元救市方案将成期市毒药", + "choices": [{ + "entity_type": "一则股票新闻", + "label": 1, + "entity_list": [] + }, { + "entity_type": "一则教育新闻", + "label": 0, + "entity_list": [] + }, { + "entity_type": "一则科学新闻", + "label": 0, + "entity_list": [] + }], + "id": 0 +} + +``` + +#### 自然语言推理 +```json +{ + "task_type": "分类任务", + "subtask_type": "自然语言推理", + "text": "在白云的蓝天下,一个孩子伸手摸着停在草地上的一架飞机的螺旋桨。", + "choices": [{ + "entity_type": "可以推断出:一个孩子正伸手摸飞机的螺旋桨。", + "label": 1, + "entity_list": [] + }, { + "entity_type": "不能推断出:一个孩子正伸手摸飞机的螺旋桨。", + "label": 0, + "entity_list": [] + }, { + "entity_type": "很难推断出:一个孩子正伸手摸飞机的螺旋桨。", + "label": 0, + "entity_list": [] + }], + "id": 0 +} +``` + + +#### 语义匹配 + +```json +{ + "task_type": "分类任务", + "subtask_type": "语义匹配", + "text": "不要借了我是试试看能否操作的", + "choices": [{ + "entity_type": "不能理解为:借款审核期间能否取消借款", + "label": 1, + "entity_list": [] + }, { + "entity_type": "可以理解为:借款审核期间能否取消借款", + "label": 0, + "entity_list": [] + }], + "id": 0 +} + +``` + +### 
抽取任务 +对于抽取任务,label 字段是无效的 +#### 实体识别 +```json +{ + "task_type": "抽取任务", + "subtask_type": "实体识别", + "text": "彭小军认为,国内银行现在走的是台湾的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", + "choices": [{ + "entity_type": "地址", + "label": 0, + "entity_list": [{ + "entity_name": "台湾", + "entity_type": "地址", + "entity_idx": [ + [15, 16] + ] + }] + }{ + "entity_type": "政府机构", + "label": 0, + "entity_list": [] + }, { + "entity_type": "电影名称", + "label": 0, + "entity_list": [] + }, { + "entity_type": "人物姓名", + "label": 0, + "entity_list": [{ + "entity_name": "彭小军", + "entity_type": "人物姓名", + "entity_idx": [ + [0, 2] + ] + }] + }, + "id": 0 +} + +``` +#### 事件抽取 +```json + +{ + "task_type": "抽取任务", + "subtask_type": "事件抽取", + "text": "小米9价格首降,6GB+128GB跌了200,却不如红米新机值得买", + "choices": [{ + "entity_type": "降价的时间", + "label": 0, + "entity_list": [] + }, { + "entity_type": "降价的降价方", + "label": 0, + "entity_list": [] + }, { + "entity_type": "降价的降价物", + "label": 0, + "entity_list": [{ + "entity_name": "小米9", + "entity_type": "降价的降价物", + "entity_idx": [ + [0, 2] + ] + }, { + "entity_name": "小米9", + "entity_type": "降价的降价物", + "entity_idx": [ + [0, 2] + ] + }] + }, { + "entity_type": "降价的降价幅度", + "label": 0, + "entity_list": [] + }], + "id": 0 +} +``` +#### 抽取式阅读理解 + +```json +{ + "task_type": "抽取任务", + "subtask_type": "抽取式阅读理解", + "text": "截至2014年7月1日,圣地亚哥人口估计为1381069人,是美国第八大城市,加利福尼亚州第二大城市。它是圣迭戈-蒂华纳城市群的一部分,是美国与底特律-温莎之后的第二大跨境城市群,人口4922723。圣地亚哥是加州的出生地,以全年温和的气候、天然的深水港、广阔的海滩、与美国海军的长期联系以及最近作为医疗和生物技术发展中心而闻名。", + "choices": [{ + "entity_type": "除了医疗保健,圣迭戈哪个就业部门已经强势崛起?", + "label": 0, + "entity_list": [{ + "entity_name": "生物技术发展", + "entity_idx": [ + [153, 158] + ] + }] + }, { + "entity_type": "在所有的军事部门中,哪一个在圣地亚哥的存在最为强大?", + "label": 0, + "entity_list": [{ + "entity_name": "美国海军", + "entity_idx": [ + [135, 138] + ] + }] + }, { + "entity_type": "在美国十大城市中,圣迭戈排名哪一位?", + "label": 0, + "entity_list": [{ + "entity_name": "第八", + "entity_idx": [ + [33, 34] + ] + }] + }], + "id": 0 +} +``` + diff --git a/fengshen/examples/ubert/example.py b/fengshen/examples/ubert/example.py new file mode 100644 index 0000000000000000000000000000000000000000..a36f649ce85404ce36be47f639d675aa88faeaf2 --- /dev/null +++ b/fengshen/examples/ubert/example.py @@ -0,0 +1,95 @@ +import argparse +from fengshen import UbertPiplines +import os +os.environ["CUDA_VISIBLE_DEVICES"] = '6' + + +def main(): + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser = UbertPiplines.piplines_args(total_parser) + args = total_parser.parse_args() + + # 设置一些训练要使用到的参数 + args.pretrained_model_path = 'IDEA-CCNL/Erlangshen-Ubert-110M-Chinese' #预训练模型的路径,我们提供的预训练模型存放在HuggingFace上 + args.default_root_dir = './' #默认主路径,用来放日志、tensorboard等 + args.max_epochs = 5 + args.gpus = 1 + args.batch_size = 1 + + # 只需要将数据处理成为下面数据的 json 样式就可以一键训练和预测,下面只是提供了一条示例样本 + train_data = [ + { + "task_type": "抽取任务", + "subtask_type": "实体识别", + "text": "彭小军认为,国内银行现在走的是台湾的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", + "choices": [ + {"entity_type": "地址", "label": 0, "entity_list": [ + {"entity_name": "台湾", "entity_type": "地址", "entity_idx": [[15, 16]]}]}, + {"entity_type": "书名", "label": 0, "entity_list": []}, + {"entity_type": "公司", "label": 0, "entity_list": []}, + {"entity_type": "游戏", "label": 0, "entity_list": []}, + {"entity_type": "政府机构", "label": 0, "entity_list": []}, + {"entity_type": "电影名称", "label": 0, "entity_list": []}, + {"entity_type": "人物姓名", "label": 0, "entity_list": [ + {"entity_name": "彭小军", "entity_type": "人物姓名", "entity_idx": [[0, 2]]}]}, + {"entity_type": "组织机构", "label": 0, "entity_list": []}, + {"entity_type": 
"岗位职位", "label": 0, "entity_list": []}, + {"entity_type": "旅游景点", "label": 0, "entity_list": []} + ], + "id": 0} + ] + dev_data = [ + { + "task_type": "抽取任务", + "subtask_type": "实体识别", + "text": "就天涯网推出彩票服务频道是否是业内人士所谓的打政策“擦边球”,记者近日对此事求证彩票监管部门。", + "choices": [ + {"entity_type": "地址", "label": 0, "entity_list": []}, + {"entity_type": "书名", "label": 0, "entity_list": []}, + {"entity_type": "公司", "label": 0, "entity_list": [ + {"entity_name": "天涯网", "entity_type": "公司", "entity_idx": [[1, 3]]}]}, + {"entity_type": "游戏", "label": 0, "entity_list": []}, + {"entity_type": "政府机构", "label": 0, "entity_list": []}, + {"entity_type": "电影名称", "label": 0, "entity_list": []}, + {"entity_type": "人物姓名", "label": 0, "entity_list": []}, + {"entity_type": "组织机构", "label": 0, "entity_list": [ + {"entity_name": "彩票监管部门", "entity_type": "组织机构", "entity_idx": [[40, 45]]}]}, + {"entity_type": "岗位职位", "label": 0, "entity_list": [ + {"entity_name": "记者", "entity_type": "岗位职位", "entity_idx": [[31, 32]]}]}, + {"entity_type": "旅游景点", "label": 0, "entity_list": []} + ], + + "id": 0} + + ] + test_data = [ + { + "task_type": "抽取任务", + "subtask_type": "实体识别", + "text": "这也让很多业主据此认为,雅清苑是政府公务员挤对了国家的经适房政策。", + "choices": [ + {"entity_type": "地址", "label": 0, "entity_list": [ + {"entity_name": "雅清苑", "entity_type": "地址", "entity_idx": [[12, 14]]}]}, + {"entity_type": "书名", "label": 0, "entity_list": []}, + {"entity_type": "公司", "label": 0, "entity_list": []}, + {"entity_type": "游戏", "label": 0, "entity_list": []}, + {"entity_type": "政府机构", "label": 0, "entity_list": []}, + {"entity_type": "电影名称", "label": 0, "entity_list": []}, + {"entity_type": "人物姓名", "label": 0, "entity_list": []}, + {"entity_type": "组织机构", "label": 0, "entity_list": []}, + {"entity_type": "岗位职位", "label": 0, "entity_list": [ + {"entity_name": "公务员", "entity_type": "岗位职位", "entity_idx": [[18, 20]]}]}, + {"entity_type": "旅游景点", "label": 0, "entity_list": []} + ], + "id": 0}, + ] + + model = UbertPiplines(args) + model.fit(train_data, dev_data) + result = model.predict(test_data) + for line in result: + print(line) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/wenzhong_qa/README.md b/fengshen/examples/wenzhong_qa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8b424909f39c5b1480fbc5cc7015e82714292930 --- /dev/null +++ b/fengshen/examples/wenzhong_qa/README.md @@ -0,0 +1,75 @@ +#
yuyuanQA模型finetune
+本示例主要实现了基于GPT2结构的Yuyuan医疗大模型的finetune:通过医疗问答对进行微调,使大模型具备closed-book QA(闭卷问答)的能力。
+### 数据和模型
+#### 模型:
+finetune的模型是yuyuan(余元)模型。余元模型采用GPT2结构,在预训练阶段主要使用PubMed上医疗相关的数据集进行预训练,是一个医疗领域的大模型。模型共有35亿参数,主要参数如下表所示:
+
+| 配置 | 参数 |
+| :---------: | :---: |
+| nlayers | 30 |
+| nheads | 32 |
+| hidden-size | 3072 |
+| seq-length | 1024 |
+
+预训练数据主要是医疗相关的论文、杂志期刊等,以英文语料为主。
+#### 数据:
+用于finetune的语料清洗自[MedQuAD](https://github.com/abachaa/MedQuAD)数据集,清洗完成后是下面的格式:
+```text
+......
+{'question':'.........','answer':'........'}
+{'question':'.........','answer':'........'}
+......
+```
+### finetune框架以及参数配置
+#### 框架:
+finetune使用的是IDEA研究院CCNL小组整合各大框架优点后开源的[封神框架](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen),具体代码可以参考[finetune_medicalQA.py](https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/dev_wzw/fengshen/examples/wenzhong_qa/finetune_medicalQA.py)和[medicalQADataset.py](https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/dev_wzw/fengshen/data/task_dataloader/medicalQADataset.py)。
+#### 训练参数:
+训练参数方面,我们采用了deepspeed相关的配置,使用集群的2个节点共16张A100,在很短的时间内完成了finetune。具体参数配置可以参考[finetune_GPT2_medicalQA.sh](https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/dev_wzw/fengshen/examples/wenzhong_qa/finetune_GPT2_medicalQA.sh)。
+### finetune后的效果以及使用
+#### 效果对比:
+finetune后的模型,用100对问答对基于BLEU分与之前用Megatron框架训练的模型进行了简单对比,效果比较接近。
+
+unsmoothed method:
+| 框架 | 1-gram | 2-gram | 3-gram | 4-gram |
+| -------- | ------------------ | ------------------ | ------------------ | ------------------- |
+| Fengshen | 0.5241376169070796 | 0.5215762466122144 | 0.4894353584800885 | 0.44840139357073466 |
+| Megatron | 0.5321340489166898 | 0.5110257474778213 | 0.4703745962926368 | 0.4310875933354554 |
+
+smoothed method:
+| 框架 | 1-gram | 2-gram | 3-gram | 4-gram |
+| -------- | ----------------- | ------------------ | ------------------ | ------------------ |
+| Fengshen | 0.717829796617609 | 0.6516910802858905 | 0.5859726677095979 | 0.525510691686505 |
+| Megatron | 0.776190980974117 | 0.6749801211321476 | 0.5897846253142169 | 0.5230773076722481 |
+
+#### 使用方式:
+支持直接用Huggingface或者pytorch-lightning框架调用。由于在finetune的时候加入了prompt,在问答时输入应该是:`Question:your question about medical? answer:`,接着模型就会以续写的方式回答你的问题。用Huggingface的调用代码可以参考下面的代码:
+```python
+from transformers import GPT2Tokenizer, GPT2LMHeadModel
+model_path = 'pretrained_model_hf/yuyuanQA-v1' # input your own model file path
+model = GPT2LMHeadModel.from_pretrained(model_path)
+tokenizer = GPT2Tokenizer.from_pretrained(model_path)
+model = model.cuda(6) # move your model to the GPU
+model.eval() # just do predict
+
+def answering(question):
+    # question = "What should gout patients pay attention to in diet?"
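+    # Build the prompt in the same "Question:<question> answer:" format used during finetuning;
+    # the model answers by continuing the text after "answer:".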
+ inputs = tokenizer(f'Question:{question} answer:',return_tensors='pt').input_ids.to(model.device) + + generation_output = model.generate(input_ids = inputs, + return_dict_in_generate=True, + output_scores=True, + max_length=150, + # max_new_tokens=80, + do_sample=True, + top_p = 0.9, + eos_token_id=50256, + pad_token_id=0, + num_return_sequences = 5) + answers = [] + for idx,sentence in enumerate(generation_output.sequences): + next_sentence = tokenizer.decode(sentence).split('<|endoftext|>')[0] + answer = next_sentence.split(sep='answer:',maxsplit=1)[1] + answers.append(answer) + return answers +answering('your question?') +``` \ No newline at end of file diff --git a/fengshen/examples/wenzhong_qa/finetune_GPT2_medicalQA.sh b/fengshen/examples/wenzhong_qa/finetune_GPT2_medicalQA.sh new file mode 100644 index 0000000000000000000000000000000000000000..d9a81670ed121ecfb9fa3e0e546f0773374087af --- /dev/null +++ b/fengshen/examples/wenzhong_qa/finetune_GPT2_medicalQA.sh @@ -0,0 +1,123 @@ +#!/bin/bash +#SBATCH --job-name=medical_qa_finetune +#SBATCH --nodes=2 +#SBATCH --ntasks-per-node=8 +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH -o /cognitive_comp/wuziwei/task/fs_medical_qa_finetune/%x-%j.log +#SBATCH -e /cognitive_comp/wuziwei/task/fs_medical_qa_finetune/%x-%j.err +#SBATCH -x dgx[050,049] + +#export NCCL_DEBUG=INFO + +# export PATH=$PATH:/cognitive_comp/wuziwei/codes/fengshen/fengshen +set -x -e + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=1 +ROOT_DIR=/cognitive_comp/wuziwei/task/fs_medical_qa_finetune + +ZERO_STAGE=2 + +config_json="$ROOT_DIR/training_config.json" +export MASTER_PORT=$[RANDOM%10000+30000] + +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + "zero_optimization": { + "stage": $ZERO_STAGE, + "contiguous_gradients": true, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "allgather_bucket_size": 2e8 + }, + "optimizer": { + "type": "Adam", + "params": { + "lr": 1e-5, + "betas": [0.9,0.95], + "eps": 1e-8, + "weight_decay": 1e-2 + } + }, + "scheduler": { + "type": "WarmupLR", + "params":{ + "warmup_min_lr": 5e-6, + "warmup_max_lr": 1e-5 + } + }, + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 32, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "activation_checkpointing": { + "partition_activations": false, + "contiguous_memory_optimization": false + }, + "wall_clock_breakdown": false, + "zero_allow_untested_optimizer": false, + "train_micro_batch_size_per_gpu": 1, + "steps_per_print": 100, + "gradient_clipping": 1.0 +} +EOT + +# export PL_DEEPSPEED_CONFIG_PATH=$config_json +export PL_DEEPSPEED_CONFIG_PATH=$config_json +export TORCH_EXTENSIONS_DIR=/cognitive_comp/wuziwei/torch_extendsions +TRAINER_ARGS=" + --max_epochs 10 \ + --gpus 16 \ + --num_nodes 2 \ + --strategy deepspeed_stage_2 \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor train_loss \ + --mode min \ + --save_last \ +" +DATA_DIR=/cognitive_comp/wuziwei/task-data/medical_qa +DATA_ARGS=" + --data_dir $DATA_DIR \ + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data train.txt \ + --valid_data valid.txt \ + --test_data test.txt +" + +# PRETRAINED_MODEL_PATH=/cognitive_comp/wuziwei/pretrained_model_hf/gpt2 +PRETRAINED_MODEL_PATH=/cognitive_comp/wuziwei/pretrained_model_hf/medical_v2 +MODEL_ARGS=" + --pretrained_model_path ${PRETRAINED_MODEL_PATH} \ + --output_save_path 
$ROOT_DIR/predict.json \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --warmup 0.01 \ +" + +SCRIPTS_PATH=/cognitive_comp/wuziwei/codes/fengshen/fengshen/examples/GPT_pretrain_finetune/finetune_medicalQA.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +SINGULARITY_PATH=/cognitive_comp/wuziwei/container/oneflow-cuda11.sif +# singularity exec --nv -B /cognitive_comp/wuziwei/:/cognitive_comp/wuziwei/ $SINGULARITY_PATH python $CMD + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" + +srun singularity exec --nv -B /cognitive_comp/wuziwei/:/cognitive_comp/wuziwei/ $SINGULARITY_PATH bash -c 'python $CMD' diff --git a/fengshen/examples/wenzhong_qa/finetune_medicalQA.py b/fengshen/examples/wenzhong_qa/finetune_medicalQA.py new file mode 100644 index 0000000000000000000000000000000000000000..1a79948d5f7fe736856e44392a834edfa6ac51d9 --- /dev/null +++ b/fengshen/examples/wenzhong_qa/finetune_medicalQA.py @@ -0,0 +1,176 @@ +from transformers import GPT2LMHeadModel +from data.task_dataloader.medicalQADataset import GPT2QADataModel +from transformers.optimization import get_linear_schedule_with_warmup +from pytorch_lightning import Trainer, loggers +from pytorch_lightning.callbacks import ModelCheckpoint +import pytorch_lightning as pl +import argparse +import torch +import os +import sys +sys.path.insert(0, '/cognitive_comp/wuziwei/codes/fengshen/fengshen') +# sys.path.append('../../') +# sys.path.append('../') +# os.environ["CUDA_VISIBLE_DEVICES"] = '4,5,6,7' + + +class GPT2FinetuneMedicalQAModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./ckpt/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + parser.add_argument('--save_last', action='store_true', default=True) + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=1000, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + # every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename, + save_last=args.save_last) + + +class GPT2FinetuneMedicalQA(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--learning_rate', default=1e-4, type=float) + parser.add_argument('--weight_decay', default=0.1, type=float) + parser.add_argument('--warmup', default=0.01, type=float) + return parent_args + + def __init__(self, args, num_data): + super().__init__() + self.args = args + self.num_data = num_data + print('num_data:', num_data) + self.model = GPT2LMHeadModel.from_pretrained( + args.pretrained_model_path) + + def setup(self, stage) -> None: + if stage == 'fit': + num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0 + self.total_step = int(self.trainer.max_epochs * self.num_data / + (max(1, num_gpus) * self.trainer.accumulate_grad_batches)) + print('Total training step:', self.total_step) + + def 
training_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss) + return output.loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self.model(input_ids=batch['input_ids'], + attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss) + # self.log('val_acc', acc) + + def configure_optimizers(self): + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + paras = list( + filter(lambda p: p[1].requires_grad, self.named_parameters())) + paras = [{ + 'params': + [p for n, p in paras if not any(nd in n for nd in no_decay)], + 'weight_decay': self.args.weight_decay + }, { + 'params': [p for n, p in paras if any(nd in n for nd in no_decay)], + 'weight_decay': 0.0 + }] + optimizer = torch.optim.AdamW(paras, lr=self.args.learning_rate) + scheduler = get_linear_schedule_with_warmup( + optimizer, int(self.total_step * self.args.warmup), + self.total_step) + + return [{ + 'optimizer': optimizer, + 'lr_scheduler': { + 'scheduler': scheduler, + 'interval': 'step', + 'frequency': 1 + } + }] + + +def main(): + total_parser = argparse.ArgumentParser("Summary Task") + total_parser.add_argument( + '--do_eval_only', action='store_true', default=False) + total_parser.add_argument( + '--pretrained_model_path', default=None, type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = GPT2QADataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = GPT2FinetuneMedicalQAModelCheckpoint.add_argparse_args( + total_parser) + total_parser = GPT2FinetuneMedicalQA.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + + data_model = GPT2QADataModel(args) + if not args.do_eval_only: + model = GPT2FinetuneMedicalQA(args, len(data_model.train_dataloader())) + checkpoint_callback = GPT2FinetuneMedicalQAModelCheckpoint( + args).callbacks + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'log/'), name='MedicalQA-GPT2') + trainer = Trainer.from_argparse_args(args, + logger=logger, + callbacks=[checkpoint_callback] + ) + trainer.fit(model, data_model) + + # result = trainer.predict(model, data_model) + # with open('test_results.txt', 'wt', encoding='utf-8') as w: + # for line in result: + # w.writelines(line) + + model.model.save_pretrained( + '/cognitive_comp/wuziwei/pretrained_model_hf') + else: + print('save to hf.....') + trainer = Trainer.from_argparse_args(args) + model = GPT2FinetuneMedicalQA( + args, len(data_model.predict_dataloader())) + + result = trainer.predict( + model, data_model, ckpt_path='/cognitive_comp/wuziwei/task/fs_medical_qa_finetune/ckpt/last.ckpt') + # with open('test_results.txt','wt',encoding='utf-8') as w: + # for line in result: + # 
w.writelines(line) + + model.model.save_pretrained( + '/cognitive_comp/wuziwei/pretrained_model_hf') + + +if __name__ == '__main__': + main() diff --git a/fengshen/examples/wenzhong_qa/finetune_wenzhong.py b/fengshen/examples/wenzhong_qa/finetune_wenzhong.py new file mode 100644 index 0000000000000000000000000000000000000000..bcdeda71fd2d2d70dd56148451ddf2d4946bf31c --- /dev/null +++ b/fengshen/examples/wenzhong_qa/finetune_wenzhong.py @@ -0,0 +1,153 @@ +# sys.path.append('./') +import os +import torch +import argparse +import pytorch_lightning as pl +from pytorch_lightning.callbacks import ModelCheckpoint +from pytorch_lightning import Trainer, loggers +from transformers.optimization import get_linear_schedule_with_warmup +from transformers import GPT2LMHeadModel +from fengshen.data.task_dataloader.medicalQADataset import GPT2QADataModel + + +class GPT2FinetuneMedicalQAModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./ckpt/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + parser.add_argument('--save_last', action='store_true', default=True) + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename, + save_last=args.save_last) + + +class GPT2FinetuneMedicalQA(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--learning_rate', default=1e-4, type=float) + parser.add_argument('--weight_decay', default=0.1, type=float) + parser.add_argument('--warmup', default=0.01, type=float) + return parent_args + + def __init__(self, args, num_data): + super().__init__() + self.args = args + self.num_data = num_data + print('num_data:', num_data) + self.model = GPT2LMHeadModel.from_pretrained(args.pretrained_model_path) + + def setup(self, stage) -> None: + if stage == 'fit': + num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0 + self.total_step = int(self.trainer.max_epochs * self.num_data + / (max(1, num_gpus) * self.trainer.accumulate_grad_batches)) + print('Total training step:', self.total_step) + + def training_step(self, batch, batch_idx): + output = self.model( + input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('train_loss', output.loss) + return output.loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float()) / labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + output = self.model( + input_ids=batch['input_ids'], 
attention_mask=batch['attention_mask'], labels=batch['labels']) + # output = self.model(input_ids=batch['input_ids'], labels=batch['labels']) + # acc = self.comput_metrix(output.logits, batch['labels']) + self.log('val_loss', output.loss) + # self.log('val_acc', acc) + + def configure_optimizers(self): + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + paras = list( + filter(lambda p: p[1].requires_grad, self.named_parameters())) + paras = [{ + 'params': + [p for n, p in paras if not any(nd in n for nd in no_decay)], + 'weight_decay': self.args.weight_decay + }, { + 'params': [p for n, p in paras if any(nd in n for nd in no_decay)], + 'weight_decay': 0.0 + }] + optimizer = torch.optim.AdamW(paras, lr=self.args.learning_rate) + scheduler = get_linear_schedule_with_warmup( + optimizer, int(self.total_step * self.args.warmup), + self.total_step) + + return [{ + 'optimizer': optimizer, + 'lr_scheduler': { + 'scheduler': scheduler, + 'interval': 'step', + 'frequency': 1 + } + }] + + +def main(): + total_parser = argparse.ArgumentParser("QA Task") + total_parser.add_argument('--do_eval_only', action='store_true', default=False) + total_parser.add_argument('--pretrained_model_path', default='google/mt5-small', type=str) + total_parser.add_argument('--output_save_path', default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = GPT2QADataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = Trainer.add_argparse_args(total_parser) + total_parser = GPT2FinetuneMedicalQAModelCheckpoint.add_argparse_args(total_parser) + total_parser = GPT2FinetuneMedicalQA.add_model_specific_args(total_parser) + # * Args for base model + args = total_parser.parse_args() + + data_model = GPT2QADataModel(args) + if not args.do_eval_only: + model = GPT2FinetuneMedicalQA(args, len(data_model.train_dataloader())) + checkpoint_callback = GPT2FinetuneMedicalQAModelCheckpoint(args).callbacks + logger = loggers.TensorBoardLogger(save_dir=os.path.join( + args.default_root_dir, 'log/'), name='WenZhong') + trainer = Trainer.from_argparse_args(args, + logger=logger, + callbacks=[checkpoint_callback] + ) + trainer.fit(model, data_model) + + +if __name__ == '__main__': + main() + # test() + +''' +# python examples/mt5_summary.py --gpus=1 --test_data=test_public.jsonl +# --default_root_dir=/cognitive_comp/ganruyi/fengshen/mt5_summary/eval +# --do_eval_only +# --resume_from_checkpoint=/cognitive_comp/ganruyi/fengshen/mt5_summary/ckpt/model-epoch=01-train_loss=1.9166.ckpt +# --strategy=ddp +''' diff --git a/fengshen/examples/wenzhong_qa/finetune_wenzhong.sh b/fengshen/examples/wenzhong_qa/finetune_wenzhong.sh new file mode 100644 index 0000000000000000000000000000000000000000..0100377bf5c54c0eba3088e3b09368a5b31f9c06 --- /dev/null +++ b/fengshen/examples/wenzhong_qa/finetune_wenzhong.sh @@ -0,0 +1,126 @@ +#!/bin/bash +#SBATCH --job-name=finetune_wenzhong +#SBATCH --cpus-per-task=50 +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=1 +#SBATCH --gres=gpu:1 # number of gpus +#SBATCH -o %x-%j.log +#SBATCH -e %x-%j.err + +set -x -e + +export MASTER_PORT=$[RANDOM%10000+50000] +export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/torch_extendsions + +echo "START TIME: $(date)" +MICRO_BATCH_SIZE=1 +ROOT_DIR=/cognitive_comp/gaoxinyu/FS/fengshen/fengshen + +ZERO_STAGE=3 + +config_json="$ROOT_DIR/ds_config.$SLURM_JOBID.json" +#config_json="$ROOT_DIR/ds_config.wzw.json" +# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size() +cat < $config_json +{ + 
"train_micro_batch_size_per_gpu":1, + "steps_per_print":100, + "gradient_clipping":1, + "zero_optimization":{ + "stage": $ZERO_STAGE, + "offload_optimizer":{ + "device":"cpu", + "pin_memory":true + }, + "offload_param":{ + "device":"cpu", + "pin_memory":true + }, + "overlap_comm":true, + "contiguous_gradients":true, + "sub_group_size":1000000000, + "stage3_max_live_parameters":1000000000, + "stage3_max_reuse_distance":1000000000, + "stage3_gather_fp16_weights_on_model_save":true + }, + "optimizer":{ + "type":"Adam", + "params":{ + "lr": 1e-5, + "weight_decay":0.01 + } + }, + "scheduler":{ + "type":"WarmupLR", + "params":{ + "warmup_min_lr":5e-6, + "warmup_max_lr":1e-5 + } + }, + "zero_allow_untested_optimizer":false, + "fp16":{ + "enabled":true, + "loss_scale":0, + "loss_scale_window":1000, + "hysteresis":2, + "min_loss_scale":1 + }, + "activation_checkpointing":{ + "partition_activations":false, + "contiguous_memory_optimization":false + }, + "wall_clock_breakdown":false +} +EOT + +export PL_DEEPSPEED_CONFIG_PATH=$config_json + +TRAINER_ARGS=" + --max_epochs 2 \ + --gpus 1 \ + --num_nodes 1 \ + --strategy deepspeed_stage_3 \ + --precision 16 \ + --default_root_dir $ROOT_DIR \ + --dirpath $ROOT_DIR/ckpt \ + --save_top_k 3 \ + --monitor train_loss \ + --mode min \ + --save_last \ +" +DATA_DIR=/cognitive_comp/gaoxinyu/data/yuyuan +DATA_ARGS=" + --data_dir $DATA_DIR \ + --train_batchsize $MICRO_BATCH_SIZE \ + --valid_batchsize $MICRO_BATCH_SIZE \ + --train_data train.txt \ + --valid_data valid.txt \ + --test_data test.txt +" + +MODEL_ARGS=" + --pretrained_model_path /cognitive_comp/gaoxinyu/hf_model/wenzhong \ + --output_save_path $ROOT_DIR/predict.json \ + --learning_rate 1e-4 \ + --weight_decay 0.1 \ + --warmup 0.01 \ +" + +SCRIPTS_PATH=/cognitive_comp/gaoxinyu/FS/fengshen/finetune_wenzhong.py + +export CMD=" \ + $SCRIPTS_PATH \ + $TRAINER_ARGS \ + $MODEL_ARGS \ + $DATA_ARGS \ + " + +echo $CMD + +SINGULARITY_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif + +# to debug - add echo (it exits and prints what it would have launched) +#run_cmd="$PY_LAUNCHER $CMD" + +clear; srun --jobid $SLURM_JOBID singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH bash -c 'python $CMD' +# bash -c 'python $CMD' \ No newline at end of file diff --git a/fengshen/examples/zen1_finetune/fengshen_sequence_level_ft_task.py b/fengshen/examples/zen1_finetune/fengshen_sequence_level_ft_task.py new file mode 100755 index 0000000000000000000000000000000000000000..1404571159ea95776c3953fdecb28a84031c1347 --- /dev/null +++ b/fengshen/examples/zen1_finetune/fengshen_sequence_level_ft_task.py @@ -0,0 +1,610 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from fengshen.models.zen1.tokenization import BertTokenizer +from fengshen.models.zen1.modeling import ZenForSequenceClassification +from fengshen.models.zen1.ngram_utils import ZenNgramDict +from pytorch_lightning.callbacks import LearningRateMonitor +import csv +from dataclasses import dataclass +import logging +import math +import numpy as np +import os +from tqdm import tqdm +import json +import torch +import pytorch_lightning as pl +from random import shuffle +import argparse +from pytorch_lightning.callbacks import ModelCheckpoint +from torch.utils.data import Dataset, DataLoader + +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.INFO) +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id, ngram_ids, ngram_positions, ngram_lengths, + ngram_tuples, ngram_seg_ids, ngram_masks): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + + self.ngram_ids = ngram_ids + self.ngram_positions = ngram_positions + self.ngram_lengths = ngram_lengths + self.ngram_tuples = ngram_tuples + self.ngram_seg_ids = ngram_seg_ids + self.ngram_masks = ngram_masks + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_examples(self, data_path, mode): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + # if sys.version_info[0] == 2: + # line = list(unicode(cell, 'utf-8') for cell in line) + lines.append(line) + return lines + + @classmethod + def _read_json(cls, input_file): + """Reads a jsonl file.""" + with open(input_file, "r", encoding="utf-8") as f: + lines = f.readlines() + samples = [] + for line in tqdm(lines): + data = json.loads(line) + samples.append(data) + return samples + + +class TnewsProcessor(DataProcessor): + """Processor for the tnews data set (HIT version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_json(os.path.join(data_dir, "train.json")), "train") + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % 
(set_type, i) + # text_a = line[0] + text_a = line['sentence'] + label = line['label'] if 'label' in line.keys() else None + examples.append( + InputExample(guid=guid, text_a=text_a, label=label)) + return examples + + +class OcnliProcessor(DataProcessor): + """Processor for the ocnli or cmnli data set (HIT version).""" + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % (set_type, i) + # text_a = line[0] + text_a = line['sentence1'] + text_b = line['sentence2'] + label = line['label'] if 'label' in line.keys() else None + # 特殊处理,cmnli有label为-的 + if label == '-': + label = None + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class IflytekProcessor(DataProcessor): + """Processor for the iflytek data set (HIT version).""" + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % (set_type, i) + # text_a = line[0] + text_a = line['sentence'] + label = line['label'] if 'label' in line.keys() else None + examples.append( + InputExample(guid=guid, text_a=text_a, label=label)) + return examples + + +def convert_examples_to_features(examples, label_map, max_seq_length, tokenizer, ngram_dict): + """Loads a data file into a list of `InputBatch`s.""" + + features = [] + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + logger.info("Writing example %d of %d" % (ex_index, len(examples))) + + tokens_a = tokenizer.tokenize(example.text_a) + + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + # Modifies `tokens_a` and `tokens_b` in place so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3" + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + else: + # Account for [CLS] and [SEP] with "- 2" + if len(tokens_a) > max_seq_length - 2: + tokens_a = tokens_a[:(max_seq_length - 2)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. 
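+        # Assemble the final inputs: token ids, segment ids and attention mask, zero-padded to max_seq_length.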
+ tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + segment_ids = [0] * len(tokens) + + if tokens_b: + tokens += tokens_b + ["[SEP]"] + segment_ids += [1] * (len(tokens_b) + 1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + padding = [0] * (max_seq_length - len(input_ids)) + input_ids += padding + input_mask += padding + segment_ids += padding + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + label_id = label_map[example.label] + + # ----------- code for ngram BEGIN----------- + ngram_matches = [] + # Filter the word segment from 2 to 7 to check whether there is a word + for p in range(2, 8): + for q in range(0, len(tokens) - p + 1): + character_segment = tokens[q:q + p] + # j is the starting position of the word + # i is the length of the current word + character_segment = tuple(character_segment) + if character_segment in ngram_dict.ngram_to_id_dict: + ngram_index = ngram_dict.ngram_to_id_dict[character_segment] + ngram_matches.append([ngram_index, q, p, character_segment]) + + shuffle(ngram_matches) + # max_word_in_seq_proportion = max_word_in_seq + max_word_in_seq_proportion = math.ceil((len(tokens) / max_seq_length) * ngram_dict.max_ngram_in_seq) + if len(ngram_matches) > max_word_in_seq_proportion: + ngram_matches = ngram_matches[:max_word_in_seq_proportion] + ngram_ids = [ngram[0] for ngram in ngram_matches] + ngram_positions = [ngram[1] for ngram in ngram_matches] + ngram_lengths = [ngram[2] for ngram in ngram_matches] + ngram_tuples = [ngram[3] for ngram in ngram_matches] + ngram_seg_ids = [0 if position < (len(tokens_a) + 2) else 1 for position in ngram_positions] + + ngram_mask_array = np.zeros(ngram_dict.max_ngram_in_seq, dtype=np.bool) + ngram_mask_array[:len(ngram_ids)] = 1 + + # record the masked positions + ngram_positions_matrix = np.zeros(shape=(max_seq_length, ngram_dict.max_ngram_in_seq), dtype=np.int32) + for i in range(len(ngram_ids)): + ngram_positions_matrix[ngram_positions[i]:ngram_positions[i] + ngram_lengths[i], i] = 1.0 + + # Zero-pad up to the max word in seq length. + padding = [0] * (ngram_dict.max_ngram_in_seq - len(ngram_ids)) + ngram_ids += padding + ngram_lengths += padding + ngram_seg_ids += padding + + # ----------- code for ngram END----------- + label_id = label_map[example.label] if example.label is not None else 0 + features.append( + InputFeatures(input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id, + ngram_ids=ngram_ids, + ngram_positions=ngram_positions_matrix, + ngram_lengths=ngram_lengths, + ngram_tuples=ngram_tuples, + ngram_seg_ids=ngram_seg_ids, + ngram_masks=ngram_mask_array)) + + return features + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
+ while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +class TaskDataset(Dataset): + def __init__(self, data_path, processor, mode='train'): + super().__init__() + self.data = self.load_data(data_path, processor, mode) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path, processor, mode): + if mode == "train": + examples = processor.get_examples(data_path, mode) + elif mode == "test": + examples = processor.get_examples(data_path, mode) + elif mode == "dev": + examples = processor.get_examples(data_path, mode) + return examples + + +@dataclass +class TaskCollator: + args = None + tokenizer = None + ngram_dict = None + label2id = None + + def __call__(self, samples): + features = convert_examples_to_features(samples, self.label2id, self.args.max_seq_length, self.tokenizer, self.ngram_dict) + # logger.info(" Num examples = %d", len(samples)) + input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) + ngram_ids = torch.tensor([f.ngram_ids for f in features], dtype=torch.long) + ngram_positions = torch.tensor([f.ngram_positions for f in features], dtype=torch.long) + # ngram_lengths = torch.tensor([f.ngram_lengths for f in features], dtype=torch.long) + # ngram_seg_ids = torch.tensor([f.ngram_seg_ids for f in features], dtype=torch.long) + # ngram_masks = torch.tensor([f.ngram_masks for f in features], dtype=torch.long) + + return { + 'input_ids': input_ids, + 'input_ngram_ids': ngram_ids, + 'ngram_position_matrix': ngram_positions, + 'attention_mask': input_mask, + 'token_type_ids': segment_ids, + 'labels': label_ids, + + } + # return default_collate(sample_list) + + +class TaskDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('TASK NAME DataModel') + parser.add_argument('--data_dir', default='./data', type=str) + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_data', default='train.json', type=str) + parser.add_argument('--valid_data', default='dev.json', type=str) + parser.add_argument('--test_data', default='test.json', type=str) + parser.add_argument('--train_batchsize', default=16, type=int) + parser.add_argument('--valid_batchsize', default=32, type=int) + parser.add_argument('--max_seq_length', default=128, type=int) + + parser.add_argument('--texta_name', default='text', type=str) + parser.add_argument('--textb_name', default='sentence2', type=str) + parser.add_argument('--label_name', default='label', type=str) + parser.add_argument('--id_name', default='id', type=str) + + parser.add_argument('--dataset_name', default=None, type=str) + parser.add_argument('--vocab_file', + type=str, default=None, + help="Vocabulary mapping/file BERT was pretrainined on") + parser.add_argument("--do_lower_case", + action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument('--task_name', default='tnews', type=str) + + return parent_args + + def __init__(self, args): + super().__init__() + self.train_batchsize = args.train_batchsize + self.valid_batchsize = 
args.valid_batchsize + self.collator = TaskCollator() + self.collator.args = args + self.collator.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path, do_lower_case=args.do_lower_case) + self.collator.ngram_dict = ZenNgramDict.from_pretrained(args.pretrained_model_path, tokenizer=self.collator.tokenizer) + + processors = { + 'afqmc': OcnliProcessor, + 'tnews': TnewsProcessor, + 'ocnli': OcnliProcessor, + 'cmnli': OcnliProcessor, + 'iflytek': IflytekProcessor, + } + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + if args.dataset_name is None: + self.label2id, self.id2label = self.load_schema(os.path.join( + args.data_dir, args.train_data), args) + self.train_data = TaskDataset(os.path.join( + args.data_dir, args.train_data), processor, mode='train') + self.valid_data = TaskDataset(os.path.join( + args.data_dir, args.valid_data), processor, mode='dev') + self.test_data = TaskDataset(os.path.join( + args.data_dir, args.test_data), processor, mode='test') + self.collator.label2id = self.label2id + else: + import datasets + ds = datasets.load_dataset(args.dataset_name) + self.train_data = ds['train'] + self.valid_data = ds['validation'] + self.test_data = ds['test'] + self.save_hyperparameters(args) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False, + collate_fn=self.collator) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def load_schema(self, data_path, args): + with open(data_path, 'r', encoding='utf8') as f: + lines = f.readlines() + label_list = [] + for line in tqdm(lines): + data = json.loads(line) + labels = data[args.label_name] if args.label_name in data.keys( + ) else 0 + if labels not in label_list: + label_list.append(labels) + + label2id, id2label = {}, {} + for i, k in enumerate(label_list): + label2id[k] = i + id2label[i] = k + return label2id, id2label + + +class LitModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--num_labels', default=2, type=int) + + return parent_args + + def __init__(self, args): + super().__init__() + self.model = ZenForSequenceClassification.from_pretrained(args.pretrained_model_path, num_labels=args.num_labels) + self.save_hyperparameters(args) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + loss, logits = self.model(**batch) + acc = self.comput_metrix(logits, batch['labels']) + self.log('train_loss', loss) + self.log('train_acc', acc) + return loss + + def comput_metrix(self, 
logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + loss, logits = self.model(**batch) + acc = self.comput_metrix(logits, batch['labels']) + self.log('val_loss', loss) + self.log('val_acc', acc) + + def predict_step(self, batch, batch_idx): + output = self.model(**batch) + return output.logits + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./log/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename) + + +def save_test(data, args, data_model): + with open(args.output_save_path, 'w', encoding='utf-8') as f: + idx = 0 + for i in range(len(data)): + batch = data[i] + for sample in batch: + tmp_result = dict() + label_id = np.argmax(sample.numpy()) + tmp_result['id'] = data_model.test_data.data[idx]['id'] + tmp_result['label'] = data_model.id2label[label_id] + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + idx += 1 + print('save the result to '+args.output_save_path) + + +def main(): + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser.add_argument('--pretrained_model_path', default='', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = TaskDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = pl.Trainer.add_argparse_args(total_parser) + total_parser = TaskModelCheckpoint.add_argparse_args(total_parser) + + # * Args for base model + from fengshen.models.model_utils import add_module_args + total_parser = add_module_args(total_parser) + total_parser = LitModel.add_model_specific_args(total_parser) + + args = total_parser.parse_args() + + checkpoint_callback = TaskModelCheckpoint(args).callbacks + lr_monitor = LearningRateMonitor(logging_interval='step') + trainer = pl.Trainer.from_argparse_args(args, + callbacks=[checkpoint_callback, lr_monitor] + ) + + data_model = TaskDataModel(args) + model = LitModel(args) + trainer.fit(model, data_model) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/zen1_finetune/fengshen_token_level_ft_task.py b/fengshen/examples/zen1_finetune/fengshen_token_level_ft_task.py new file mode 100755 index 0000000000000000000000000000000000000000..8cb77bbe0edf675300614982466e802964f8c625 --- /dev/null +++ b/fengshen/examples/zen1_finetune/fengshen_token_level_ft_task.py @@ -0,0 +1,647 @@ +# coding=utf-8 +# 
Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from fengshen.models.zen1.ngram_utils import ZenNgramDict +from fengshen.models.zen1.modeling import ZenForTokenClassification +from fengshen.metric.metric import SeqEntityScore +from fengshen.models.zen1.tokenization import BertTokenizer +from random import shuffle +from pytorch_lightning.callbacks import LearningRateMonitor +from dataclasses import dataclass +import logging +import math +import numpy as np +import os +import json +import torch +import pytorch_lightning as pl +import argparse +from pytorch_lightning.callbacks import ModelCheckpoint +from torch.utils.data import Dataset, DataLoader + +import torch.nn.functional as F +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.ERROR) +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. 
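+
+        Example (illustrative, token-level NER; tags follow OntoNotes4Processor below):
+            InputExample(guid="train-0",
+                         text_a=["我", "在", "北", "京"],
+                         label=["O", "O", "B-LOC", "E-LOC"])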
+ """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id, ngram_ids, ngram_positions, ngram_lengths, + ngram_tuples, ngram_seg_ids, ngram_masks, valid_ids=None, label_mask=None): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + self.valid_ids = valid_ids + self.label_mask = label_mask + + self.ngram_ids = ngram_ids + self.ngram_positions = ngram_positions + self.ngram_lengths = ngram_lengths + self.ngram_tuples = ngram_tuples + self.ngram_seg_ids = ngram_seg_ids + self.ngram_masks = ngram_masks + + +def convert_examples_to_features(examples, label_map, max_seq_length, tokenizer, ngram_dict): + """Loads a data file into a list of `InputBatch`s.""" + + # label_map = {label: i for i, label in enumerate(label_list, 1)} + + features = [] + for (ex_index, example) in enumerate(examples): + textlist = example.text_a + labellist = example.label + tokens = [] + labels = [] + valid = [] + label_mask = [] + for i, word in enumerate(textlist): + token = tokenizer.tokenize(word) + tokens.extend(token) + label_1 = labellist[i] + for m in range(len(token)): + if m == 0: + labels.append(label_1) + valid.append(1) + label_mask.append(1) + else: + valid.append(0) + if len(tokens) >= max_seq_length - 1: + tokens = tokens[0:(max_seq_length - 2)] + labels = labels[0:(max_seq_length - 2)] + valid = valid[0:(max_seq_length - 2)] + label_mask = label_mask[0:(max_seq_length - 2)] + ntokens = [] + segment_ids = [] + label_ids = [] + ntokens.append("[CLS]") + segment_ids.append(0) + valid.insert(0, 1) + label_mask.insert(0, 1) + label_ids.append(label_map["[CLS]"]) + for i, token in enumerate(tokens): + ntokens.append(token) + segment_ids.append(0) + if len(labels) > i: + label_ids.append(label_map[labels[i]]) + ntokens.append("[SEP]") + segment_ids.append(0) + valid.append(1) + label_mask.append(1) + label_ids.append(label_map["[SEP]"]) + input_ids = tokenizer.convert_tokens_to_ids(ntokens) + input_mask = [1] * len(input_ids) + label_mask = [1] * len(label_ids) + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + label_ids.append(0) + valid.append(1) + label_mask.append(0) + while len(label_ids) < max_seq_length: + label_ids.append(0) + label_mask.append(0) + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(label_ids) == max_seq_length + assert len(valid) == max_seq_length + assert len(label_mask) == max_seq_length + + # ----------- code for ngram BEGIN----------- + ngram_matches = [] + # Filter the ngram segment from 2 to 7 to check whether there is a ngram + for p in range(2, 8): + for q in range(0, len(tokens) - p + 1): + character_segment = tokens[q:q + p] + # j is the starting position of the ngram + # i is the length of the current ngram + character_segment = tuple(character_segment) + if character_segment in ngram_dict.ngram_to_id_dict: + ngram_index = ngram_dict.ngram_to_id_dict[character_segment] + ngram_matches.append([ngram_index, q, p, character_segment]) + + shuffle(ngram_matches) + + max_ngram_in_seq_proportion = math.ceil((len(tokens) / max_seq_length) * ngram_dict.max_ngram_in_seq) + if len(ngram_matches) > max_ngram_in_seq_proportion: + ngram_matches = ngram_matches[:max_ngram_in_seq_proportion] + + 
ngram_ids = [ngram[0] for ngram in ngram_matches] + ngram_positions = [ngram[1] for ngram in ngram_matches] + ngram_lengths = [ngram[2] for ngram in ngram_matches] + ngram_tuples = [ngram[3] for ngram in ngram_matches] + ngram_seg_ids = [0 if position < (len(tokens) + 2) else 1 for position in ngram_positions] + + ngram_mask_array = np.zeros(ngram_dict.max_ngram_in_seq, dtype=np.bool) + ngram_mask_array[:len(ngram_ids)] = 1 + + # record the masked positions + ngram_positions_matrix = np.zeros(shape=(max_seq_length, ngram_dict.max_ngram_in_seq), dtype=np.int32) + for i in range(len(ngram_ids)): + ngram_positions_matrix[ngram_positions[i]:ngram_positions[i] + ngram_lengths[i], i] = 1.0 + + # Zero-pad up to the max ngram in seq length. + padding = [0] * (ngram_dict.max_ngram_in_seq - len(ngram_ids)) + ngram_ids += padding + ngram_lengths += padding + ngram_seg_ids += padding + + # ----------- code for ngram END----------- + + features.append( + InputFeatures(input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_ids, + ngram_ids=ngram_ids, + ngram_positions=ngram_positions_matrix, + ngram_lengths=ngram_lengths, + ngram_tuples=ngram_tuples, + ngram_seg_ids=ngram_seg_ids, + ngram_masks=ngram_mask_array, + valid_ids=valid, + label_mask=label_mask)) + return features + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_examples(self, data_path, set_type, quotechar=' '): + """See base class.""" + return self._create_examples( + self._read_tsv(data_path, self.get_quotechar()), set_type) + + def _create_examples(self, lines, set_type): + examples = [] + for i, (sentence, label) in enumerate(lines): + guid = "%s-%s" % (set_type, i) + text_a = sentence + label = label + examples.append(InputExample(guid=guid, text_a=text_a, label=label)) + return examples + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + def get_quotechar(self): + return ' ' + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + ''' + read file + return format : + [ ['EU', 'B-ORG'], ['rejects', 'O'], ['German', 'B-MISC'], ['call', 'O'], ['to', 'O'], ['boycott', 'O'], ['British', 'B-MISC'], ['lamb', 'O'], ['.', 'O'] ] + ''' + f = open(input_file) + data = [] + sentence = [] + label = [] + for line in f: + if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n": + if len(sentence) > 0: + data.append((sentence, label)) + sentence = [] + label = [] + continue + splits = line.split(quotechar) + sentence.append(splits[0]) + label.append(splits[-1][:-1]) + + if len(sentence) > 0: + data.append((sentence, label)) + sentence = [] + label = [] + return data + + +class MSRAProcessor(DataProcessor): + """Processor for the msra data set.""" + + def get_labels(self): + return ['B-NR', 'B-NS', 'B-NT', 'E-NR', 'E-NS', 'E-NT', 'M-NR', + 'M-NS', 'M-NT', 'O', 'S-NR', 'S-NS', 'S-NT', '[CLS]', '[SEP]'] + + +class OntoNotes4Processor(DataProcessor): + """Processor for the OntoNotes4 data set.""" + + def get_labels(self): + return ['B-GPE', 'B-LOC', 'B-ORG', 'B-PER', 'E-GPE', 'E-LOC', + 'E-ORG', 'E-PER', 'M-GPE', 'M-LOC', 'M-ORG', 'M-PER', 'O', + 'S-GPE', 'S-LOC', 'S-ORG', 'S-PER', '[CLS]', '[SEP]'] + + +class WeiboProcessor(DataProcessor): + """Processor for the Weibo data set.""" + + def get_labels(self): + return ['B-GPE.NAM', 'B-GPE.NOM', 'B-LOC.NAM', 'B-LOC.NOM', + 'B-ORG.NAM', 'B-ORG.NOM', 'B-PER.NAM', 'B-PER.NOM', 'E-GPE.NAM', + 'E-GPE.NOM', 'E-LOC.NAM', 
'E-LOC.NOM', 'E-ORG.NAM', 'E-ORG.NOM', + 'E-PER.NAM', 'E-PER.NOM', 'M-GPE.NAM', 'M-LOC.NAM', 'M-LOC.NOM', + 'M-ORG.NAM', 'M-ORG.NOM', 'M-PER.NAM', 'M-PER.NOM', 'O', + 'S-GPE.NAM', 'S-LOC.NOM', 'S-PER.NAM', 'S-PER.NOM', '[CLS]', '[SEP]'] + + +class ResumeProcessor(DataProcessor): + """Processor for the resume data set.""" + + def get_labels(self): + return ['B-CONT', 'B-EDU', 'B-LOC', 'B-NAME', 'B-ORG', 'B-PRO', + 'B-RACE', 'B-TITLE', 'E-CONT', 'E-EDU', 'E-LOC', 'E-NAME', + 'E-ORG', 'E-PRO', 'E-RACE', 'E-TITLE', 'M-CONT', 'M-EDU', + 'M-LOC', 'M-NAME', 'M-ORG', 'M-PRO', 'M-RACE', 'M-TITLE', + 'O', 'S-NAME', 'S-ORG', 'S-RACE', '[CLS]', '[SEP]'] + + +class CMeEEProcessor(DataProcessor): + """Processor for the CMeEE data set.""" + + def get_quotechar(self): + return '\t' + + def get_labels(self): + return ['B-临床表现', 'B-医学检验项目', 'B-医疗程序', 'B-医疗设备', + 'B-微生物类', 'B-疾病', 'B-科室', 'B-药物', 'B-身体', 'I-临床表现', + 'I-医学检验项目', 'I-医疗程序', 'I-医疗设备', 'I-微生物类', + 'I-疾病', 'I-科室', 'I-药物', 'I-身体', 'O', '[CLS]', '[SEP]'] + + +class CLUENERProcessor(DataProcessor): + """Processor for the CLUENER data set.""" + + def get_quotechar(self): + return '\t' + + def get_labels(self): + return ['B-书名', 'B-公司', 'B-地址', 'B-姓名', 'B-政府', 'B-景点', + 'B-游戏', 'B-电影', 'B-组织机构', 'B-职位', 'I-书名', 'I-公司', + 'I-地址', 'I-姓名', 'I-政府', 'I-景点', 'I-游戏', 'I-电影', + 'I-组织机构', 'I-职位', 'O', '[CLS]', '[SEP]'] + + +class TaskDataset(Dataset): + def __init__(self, data_path, processor, mode='train'): + super().__init__() + self.data = self.load_data(data_path, processor, mode) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path, processor, mode): + if mode == "train": + examples = processor.get_examples(data_path, mode) + elif mode == "test": + examples = processor.get_examples(data_path, mode) + elif mode == "dev": + examples = processor.get_examples(data_path, mode) + return examples + + +@dataclass +class TaskCollator: + args = None + tokenizer = None + ngram_dict = None + label2id = None + + def __call__(self, samples): + features = convert_examples_to_features(samples, self.label2id, self.args.max_seq_length, self.tokenizer, self.ngram_dict) + # logger.info(" Num examples = %d", len(samples)) + + input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) + valid_ids = torch.tensor([f.valid_ids for f in features], dtype=torch.long) + + ngram_ids = torch.tensor([f.ngram_ids for f in features], dtype=torch.long) + ngram_positions = torch.tensor([f.ngram_positions for f in features], dtype=torch.long) + # ngram_lengths = torch.tensor([f.ngram_lengths for f in features], dtype=torch.long) + # ngram_seg_ids = torch.tensor([f.ngram_seg_ids for f in features], dtype=torch.long) + # ngram_masks = torch.tensor([f.ngram_masks for f in features], dtype=torch.long) + + # label_mask = torch.tensor([f.label_mask for f in features], dtype=torch.long) + return { + 'input_ids': input_ids, + 'ngram_ids': ngram_ids, + 'ngram_positions': ngram_positions, + 'attention_mask': input_mask, + 'token_type_ids': segment_ids, + 'labels': label_ids, + 'valid_ids': valid_ids, + } + + +class TaskDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('TASK NAME 
DataModel') + parser.add_argument('--data_dir', default='./data', type=str) + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_data', default='train.json', type=str) + parser.add_argument('--valid_data', default='dev.json', type=str) + parser.add_argument('--test_data', default='test.json', type=str) + parser.add_argument('--train_batchsize', default=16, type=int) + parser.add_argument('--valid_batchsize', default=32, type=int) + parser.add_argument('--max_seq_length', default=128, type=int) + + parser.add_argument('--texta_name', default='text', type=str) + parser.add_argument('--textb_name', default='sentence2', type=str) + parser.add_argument('--label_name', default='label', type=str) + parser.add_argument('--id_name', default='id', type=str) + + parser.add_argument('--dataset_name', default=None, type=str) + parser.add_argument('--vocab_file', + type=str, default=None, + help="Vocabulary mapping/file BERT was pretrainined on") + parser.add_argument("--do_lower_case", + action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument('--task_name', default='weibo', type=str) + + return parent_args + + def __init__(self, args): + super().__init__() + self.train_batchsize = args.train_batchsize + self.valid_batchsize = args.valid_batchsize + self.collator = TaskCollator() + self.collator.args = args + self.collator.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path, do_lower_case=args.do_lower_case) + self.collator.ngram_dict = ZenNgramDict.from_pretrained(args.pretrained_model_path, tokenizer=self.collator.tokenizer) + + processors = { + 'weibo': WeiboProcessor, + 'resume': ResumeProcessor, + 'msra': MSRAProcessor, + 'ontonotes4': OntoNotes4Processor, + 'cmeee': CMeEEProcessor, + 'cluener': CLUENERProcessor, + } + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + # 生成id映射 + label_list = processor.get_labels() + label2id = {label: i for i, label in enumerate(label_list, 1)} + label2id["[PAD]"] = 0 + self.id2label = {v: k for k, v in label2id.items()} + self.collator.label2id = label2id + + if args.dataset_name is None: + self.train_data = TaskDataset(os.path.join( + args.data_dir, args.train_data), processor, mode='train') + self.valid_data = TaskDataset(os.path.join( + args.data_dir, args.valid_data), processor, mode='dev') + self.test_data = TaskDataset(os.path.join( + args.data_dir, args.test_data), processor, mode='test') + + else: + import datasets + ds = datasets.load_dataset(args.dataset_name) + self.train_data = ds['train'] + self.valid_data = ds['validation'] + self.test_data = ds['test'] + self.save_hyperparameters(args) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False, + collate_fn=self.collator) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + +class LitModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--markup', default='bios', type=str) + parser.add_argument('--middle_prefix', default='I-', type=str) + return 
parent_args + + def __init__(self, args, id2label): + super().__init__() + # config = ZenConfig(os.path.join(args.pretrained_model_path, 'config.json')) + self.model = ZenForTokenClassification.from_pretrained(args.pretrained_model_path, num_labels=len(id2label)) + self.seq_entity_score = SeqEntityScore(id2label, markup=args.markup, middle_prefix=args.middle_prefix) + self.train_seq_entity_score = SeqEntityScore(id2label, markup=args.markup, middle_prefix=args.middle_prefix) + self.id2label = id2label + self.label2id = {v: k for k, v in id2label.items()} + self.save_hyperparameters(args) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss, _ = outputs + # logits = outputs.logits + # preds = torch.argmax(F.log_softmax(logits, dim=2), dim=2) + # preds = preds.detach().cpu().numpy() + # labels = batch['labels'].detach().cpu().numpy() + # num_labels = len(self.label2id) + # y_true = [] + # y_pred = [] + # for i, label in enumerate(labels): + # temp_1 = [] + # temp_2 = [] + # for j, m in enumerate(label): + # if j == 0: + # continue + # elif labels[i][j] == num_labels - 1: + # y_true.append(temp_1) + # y_pred.append(temp_2) + # break + # else: + # temp_1.append(self.id2label[labels[i][j]]) + # temp_2.append(self.id2label[preds[i][j]]) + + # self.train_seq_entity_score.update(y_true, y_pred) + # result = self.train_seq_entity_score.result() + # self.train_seq_entity_score.reset() + self.log('train_loss', loss) + + return loss + + def validation_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss, logits = outputs + preds = torch.argmax(F.log_softmax(logits, dim=2), dim=2) + preds = preds.detach().cpu().numpy() + labels = batch['labels'].detach().cpu().numpy() + num_labels = len(self.label2id) + y_true = [] + y_pred = [] + for i, label in enumerate(labels): + temp_1 = [] + temp_2 = [] + for j, m in enumerate(label): + if j == 0: + continue + elif labels[i][j] == num_labels - 1: + y_true.append(temp_1) + y_pred.append(temp_2) + break + else: + temp_1.append(self.id2label[labels[i][j]]) + temp_2.append(self.id2label[preds[i][j]]) + + self.seq_entity_score.update(y_true, y_pred) + self.log('val_loss', loss) + + def validation_epoch_end(self, outputs): + # compute metric for all process + score_dict, _ = self.seq_entity_score.result() + if self.trainer._accelerator_connector.cluster_environment.global_rank() == 0: + print('score_dict:\n', score_dict) + # reset the metric after once validation + self.seq_entity_score.reset() + for k, v in score_dict.items(): + self.log('val_{}'.format(k), v) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', 
default='min', type=str) + parser.add_argument('--dirpath', default='./log/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename) + + +def save_test(data, args, data_model): + with open(args.output_save_path, 'w', encoding='utf-8') as f: + idx = 0 + for i in range(len(data)): + batch = data[i] + for sample in batch: + tmp_result = dict() + label_id = np.argmax(sample.numpy()) + tmp_result['id'] = data_model.test_data.data[idx]['id'] + tmp_result['label'] = data_model.id2label[label_id] + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + idx += 1 + print('save the result to '+args.output_save_path) + + +def main(): + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser.add_argument('--pretrained_model_path', default='', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = TaskDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = pl.Trainer.add_argparse_args(total_parser) + total_parser = TaskModelCheckpoint.add_argparse_args(total_parser) + + # * Args for base model + from fengshen.models.model_utils import add_module_args + total_parser = add_module_args(total_parser) + total_parser = LitModel.add_model_specific_args(total_parser) + + args = total_parser.parse_args() + + checkpoint_callback = TaskModelCheckpoint(args).callbacks + lr_monitor = LearningRateMonitor(logging_interval='step') + trainer = pl.Trainer.from_argparse_args(args, + callbacks=[checkpoint_callback, lr_monitor] + ) + + data_model = TaskDataModel(args) + id2label = data_model.id2label + print('id2label:', id2label) + model = LitModel(args, id2label) + trainer.fit(model, data_model) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/zen1_finetune/fs_zen1_tnews.sh b/fengshen/examples/zen1_finetune/fs_zen1_tnews.sh new file mode 100644 index 0000000000000000000000000000000000000000..39f2b54063725514f3fd57fa56346a0796e26828 --- /dev/null +++ b/fengshen/examples/zen1_finetune/fs_zen1_tnews.sh @@ -0,0 +1,95 @@ +#!/bin/bash +#SBATCH --job-name=zen1_tnews # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='1' +export CUDA_LAUNCH_BLOCKING=1 +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen1 + +TASK=tnews + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! 
+else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! +fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/ZEN_pretrain_base_v0.1.0 +PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test1.1.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name tnews \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 400 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 400 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen1_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen1_finetune/ner_zen1_ontonotes4.sh b/fengshen/examples/zen1_finetune/ner_zen1_ontonotes4.sh new file mode 100644 index 0000000000000000000000000000000000000000..be51a3f3d709d761b6dcb4e5759cc5b92a09a609 --- /dev/null +++ b/fengshen/examples/zen1_finetune/ner_zen1_ontonotes4.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen1_base_ontonotes4 # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen1_base_ontonotes4/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='1' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen1_base + +TASK=ontonotes4 + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/OntoNotes4/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/ZEN_pretrain_base_v0.1.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 64 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --task_name ontonotes4 \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 200 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen1_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py b/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py new file mode 100755 index 0000000000000000000000000000000000000000..ed400468cc3d0820d4b34385f270639014039ad1 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py @@ -0,0 +1,649 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
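+#
+# Fine-tuning entry point for sequence-level (sentence classification) tasks
+# with ZEN2: the DataProcessors below (afqmc/tnews/ocnli/cmnli/iflytek) feed a
+# LightningDataModule, and ZenForSequenceClassification is trained with
+# pytorch-lightning. The zen1 example scripts in this diff (e.g.
+# fs_zen1_tnews.sh) show the same argument layout.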
+from fengshen.models.zen2.modeling import ZenForSequenceClassification +from fengshen.models.zen2.ngram_utils import ZenNgramDict +from fengshen.models.zen2.tokenization import BertTokenizer +from pytorch_lightning.callbacks import LearningRateMonitor +import csv +from dataclasses import dataclass +import logging +import math +import numpy as np +import os +from tqdm import tqdm +import json +import torch +import pytorch_lightning as pl +import argparse +from pytorch_lightning.callbacks import ModelCheckpoint +from torch.utils.data import Dataset, DataLoader + +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.INFO) +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None, qid=0): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + self.qid = qid + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id, + ngram_ids, ngram_starts, ngram_lengths, ngram_tuples, ngram_seg_ids, ngram_masks, ngram_freqs, + qid=-1): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + self.qid = qid + + self.ngram_ids = ngram_ids + self.ngram_starts = ngram_starts + self.ngram_lengths = ngram_lengths + self.ngram_tuples = ngram_tuples + self.ngram_seg_ids = ngram_seg_ids + self.ngram_masks = ngram_masks + self.ngram_freqs = ngram_freqs + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_examples(self, data_path, mode): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + # if sys.version_info[0] == 2: + # line = list(unicode(cell, 'utf-8') for cell in line) + lines.append(line) + return lines + + @classmethod + def _read_json(cls, input_file): + """Reads a jsonl file.""" + with open(input_file, "r", encoding="utf-8") as f: + lines = f.readlines() + samples = [] + for line in tqdm(lines): + data = json.loads(line) + samples.append(data) + return samples + + +class TnewsProcessor(DataProcessor): + """Processor for the tnews data set (HIT version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_json(os.path.join(data_dir, "train.json")), "train") + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in 
enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % (set_type, i) + # text_a = line[0] + text_a = line['sentence'] + label = line['label'] if 'label' in line.keys() else None + examples.append( + InputExample(guid=guid, text_a=text_a, label=label)) + return examples + + +class OcnliProcessor(DataProcessor): + """Processor for the ocnli or cmnli data set (HIT version).""" + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % (set_type, i) + # text_a = line[0] + text_a = line['sentence1'] + text_b = line['sentence2'] + label = line['label'] if 'label' in line.keys() else None + # 特殊处理,cmnli有label为-的 + if label == '-': + label = None + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class IflytekProcessor(DataProcessor): + """Processor for the iflytek data set (HIT version).""" + + def get_examples(self, data_path, mode): + return self._create_examples( + self._read_json(data_path), + set_type=mode + ) + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # if i == 0: + # continue + guid = "%s-%s" % (set_type, i) + # text_a = line[0] + text_a = line['sentence'] + label = line['label'] if 'label' in line.keys() else None + examples.append( + InputExample(guid=guid, text_a=text_a, label=label)) + return examples + + +def convert_examples_to_features(examples, label_map, max_seq_length, tokenizer, ngram_dict): + """Loads a data file into a list of `InputBatch`s.""" + + # label_map = {label : i for i, label in enumerate(label_list)} + features = [] + for (ex_index, example) in enumerate(examples): + tokens_a = tokenizer.tokenize(example.text_a) + + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + # Modifies `tokens_a` and `tokens_b` in place so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3" + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + else: + # Account for [CLS] and [SEP] with "- 2" + if len(tokens_a) > max_seq_length - 2: + tokens_a = tokens_a[:(max_seq_length - 2)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambigiously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. 
+ tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + segment_ids = [0] * len(tokens) + + if tokens_b: + tokens += tokens_b + ["[SEP]"] + segment_ids += [1] * (len(tokens_b) + 1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + padding = [0] * (max_seq_length - len(input_ids)) + input_ids += padding + input_mask += padding + segment_ids += padding + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + # ----------- code for ngram BEGIN----------- + ngram_matches = [] + # Filter the word segment from 2 to max_ngram_len to check whether there is a word + max_gram_n = ngram_dict.max_ngram_len + for p in range(2, max_gram_n): + for q in range(0, len(tokens) - p + 1): + character_segment = tokens[q:q + p] + # j is the starting position of the word + # i is the length of the current word + character_segment = tuple(character_segment) + if character_segment in ngram_dict.ngram_to_id_dict: + ngram_index = ngram_dict.ngram_to_id_dict[character_segment] + ngram_freq = ngram_dict.ngram_to_freq_dict[character_segment] + ngram_matches.append([ngram_index, q, p, character_segment, ngram_freq]) + + # shuffle(ngram_matches) + ngram_matches = sorted(ngram_matches, key=lambda s: s[0]) + # max_word_in_seq_proportion = max_word_in_seq + max_word_in_seq_proportion = math.ceil((len(tokens) / max_seq_length) * ngram_dict.max_ngram_in_seq) + if len(ngram_matches) > max_word_in_seq_proportion: + ngram_matches = ngram_matches[:max_word_in_seq_proportion] + ngram_ids = [ngram[0] for ngram in ngram_matches] + ngram_positions = [ngram[1] for ngram in ngram_matches] + ngram_lengths = [ngram[2] for ngram in ngram_matches] + ngram_tuples = [ngram[3] for ngram in ngram_matches] + ngram_freqs = [ngram[4] for ngram in ngram_matches] + ngram_seg_ids = [0 if position < len([id for id in segment_ids if id == 0]) else 1 for position in + ngram_positions] + + ngram_mask_array = np.zeros(ngram_dict.max_ngram_in_seq, dtype=np.bool) + ngram_mask_array[:len(ngram_ids)] = 1 + + # Zero-pad up to the max word in seq length. 
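+        # NOTE: the `np.bool` used above for ngram_mask_array is deprecated
+        # since NumPy 1.20 (removed in 1.24); the builtin `bool` is a drop-in
+        # replacement on newer NumPy versions.
+        # The padding below brings every ngram-related list up to
+        # ngram_dict.max_ngram_in_seq so the collator can stack the batch
+        # into fixed-size tensors.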
+ padding = [0] * (ngram_dict.max_ngram_in_seq - len(ngram_ids)) + ngram_ids += padding + ngram_positions += padding + ngram_lengths += padding + ngram_seg_ids += padding + ngram_freqs += padding + + # ----------- code for ngram END----------- + + label_id = label_map[example.label] if example.label is not None else 0 + # if ex_index < 5: + # logger.info("*** Example ***") + # logger.info("guid: %s" % (example.guid)) + # logger.info("tokens: %s" % " ".join( + # [str(x) for x in tokens])) + # logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) + # logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) + # logger.info( + # "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) + # logger.info("label: %s (id = %d)" % (example.label, label_id)) + # logger.info("ngram_ids: %s" % " ".join([str(x) for x in ngram_ids])) + # logger.info("ngram_positions: %s" % " ".join([str(x) for x in ngram_positions])) + # logger.info("ngram_lengths: %s" % " ".join([str(x) for x in ngram_lengths])) + # logger.info("ngram_tuples: %s" % " ".join([str(x) for x in ngram_tuples])) + # logger.info("ngram_seg_ids: %s" % " ".join([str(x) for x in ngram_seg_ids])) + # logger.info("ngram_freqs: %s" % " ".join([str(x) for x in ngram_freqs])) + + features.append( + InputFeatures(input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id, + ngram_ids=ngram_ids, + ngram_starts=ngram_positions, + ngram_lengths=ngram_lengths, + ngram_tuples=ngram_tuples, + ngram_seg_ids=ngram_seg_ids, + ngram_masks=ngram_mask_array, + ngram_freqs=ngram_freqs, + qid=example.qid)) + return features + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
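+    # Illustrative walk-through (hypothetical lengths): with max_length=6,
+    # len(tokens_a)=5 and len(tokens_b)=3, the loop pops from tokens_a twice
+    # (5+3=8 -> 7 -> 6) and stops once the pair fits, leaving lengths 3 and 3.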
+ while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +class TaskDataset(Dataset): + def __init__(self, data_path, processor, mode='train'): + super().__init__() + self.data = self.load_data(data_path, processor, mode) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path, processor, mode): + if mode == "train": + examples = processor.get_examples(data_path, mode) + elif mode == "test": + examples = processor.get_examples(data_path, mode) + elif mode == "dev": + examples = processor.get_examples(data_path, mode) + return examples + + +@dataclass +class TaskCollator: + args = None + tokenizer = None + ngram_dict = None + label2id = None + + def __call__(self, samples): + features = convert_examples_to_features(samples, self.label2id, self.args.max_seq_length, self.tokenizer, self.ngram_dict) + # logger.info(" Num examples = %d", len(samples)) + input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) + # qids = torch.tensor([f.qid for f in features], dtype=torch.long) + + ngram_ids = torch.tensor([f.ngram_ids for f in features], dtype=torch.long) + ngram_starts = torch.tensor([f.ngram_starts for f in features], dtype=torch.long) + ngram_lengths = torch.tensor([f.ngram_lengths for f in features], dtype=torch.long) + # ngram_seg_ids = torch.tensor([f.ngram_seg_ids for f in features], dtype=torch.long) + # ngram_masks = torch.tensor([f.ngram_masks for f in features], dtype=torch.long) + ngram_freqs = torch.tensor([f.ngram_freqs for f in features], dtype=torch.long) + + batch_size = len(samples) + ngram_positions_matrix = torch.zeros( + size=(batch_size, self.args.max_seq_length, self.ngram_dict.max_ngram_in_seq), + dtype=torch.int) + for batch_id in range(batch_size): + ngram_id = ngram_ids[batch_id] + ngram_start = ngram_starts[batch_id] + ngram_length = ngram_lengths[batch_id] + for i in range(len(ngram_id)): + ngram_positions_matrix[batch_id][ngram_start[i]:ngram_start[i] + ngram_length[i], i] = ngram_freqs[batch_id][i] + ngram_positions_matrix[batch_id] \ + = torch.div(ngram_positions_matrix[batch_id], + torch.stack([torch.sum(ngram_positions_matrix[batch_id], 1)] * + ngram_positions_matrix[batch_id].size(1)).t() + 1e-10) + + return { + 'input_ids': input_ids, + 'input_ngram_ids': ngram_ids, + 'ngram_position_matrix': ngram_positions_matrix, + 'attention_mask': input_mask, + 'token_type_ids': segment_ids, + 'labels': label_ids + + } + + # return default_collate(sample_list) + + +class TaskDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('TASK NAME DataModel') + parser.add_argument('--data_dir', default='./data', type=str) + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--train_data', default='train.json', type=str) + parser.add_argument('--valid_data', default='dev.json', type=str) + parser.add_argument('--test_data', default='test.json', type=str) + parser.add_argument('--train_batchsize', default=16, type=int) + parser.add_argument('--valid_batchsize', default=32, type=int) + 
parser.add_argument('--max_seq_length', default=128, type=int) + + parser.add_argument('--texta_name', default='text', type=str) + parser.add_argument('--textb_name', default='sentence2', type=str) + parser.add_argument('--label_name', default='label', type=str) + parser.add_argument('--id_name', default='id', type=str) + + parser.add_argument('--dataset_name', default=None, type=str) + parser.add_argument('--vocab_file', + type=str, default=None, + help="Vocabulary mapping/file BERT was pretrainined on") + parser.add_argument("--do_lower_case", + action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument('--task_name', default='tnews', type=str) + + return parent_args + + def __init__(self, args): + super().__init__() + self.train_batchsize = args.train_batchsize + self.valid_batchsize = args.valid_batchsize + self.collator = TaskCollator() + self.collator.args = args + self.collator.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path, do_lower_case=args.do_lower_case) + self.collator.ngram_dict = ZenNgramDict.from_pretrained(args.pretrained_model_path, tokenizer=self.collator.tokenizer) + + processors = { + 'afqmc': OcnliProcessor, + 'tnews': TnewsProcessor, + 'ocnli': OcnliProcessor, + 'cmnli': OcnliProcessor, + 'iflytek': IflytekProcessor, + } + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + if args.dataset_name is None: + self.label2id, self.id2label = self.load_schema(os.path.join( + args.data_dir, args.train_data), args) + self.train_data = TaskDataset(os.path.join( + args.data_dir, args.train_data), processor, mode='train') + self.valid_data = TaskDataset(os.path.join( + args.data_dir, args.valid_data), processor, mode='dev') + self.test_data = TaskDataset(os.path.join( + args.data_dir, args.test_data), processor, mode='test') + self.collator.label2id = self.label2id + else: + import datasets + ds = datasets.load_dataset(args.dataset_name) + self.train_data = ds['train'] + self.valid_data = ds['validation'] + self.test_data = ds['test'] + self.save_hyperparameters(args) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False, + collate_fn=self.collator) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def load_schema(self, data_path, args): + with open(data_path, 'r', encoding='utf8') as f: + lines = f.readlines() + label_list = [] + for line in tqdm(lines): + data = json.loads(line) + labels = data[args.label_name] if args.label_name in data.keys( + ) else 0 + if labels not in label_list: + label_list.append(labels) + + label2id, id2label = {}, {} + for i, k in enumerate(label_list): + label2id[k] = i + id2label[i] = k + return label2id, id2label + + +class LitModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--num_labels', default=2, type=int) + + return parent_args + + def __init__(self, args): + super().__init__() + self.model = ZenForSequenceClassification.from_pretrained(args.pretrained_model_path, num_labels=args.num_labels) + self.save_hyperparameters(args) 
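+
+    # Step-count bookkeeping: setup() below derives the total number of
+    # optimizer steps. Worked example with hypothetical numbers: 10_000
+    # training samples, max_epochs=5, train_batchsize=16, world_size=1,
+    # accumulate_grad_batches=1 -> total_steps = (10_000 * 5 // 16) // 1 = 3125.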
+ + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + loss, logits = self.model(**batch) + acc = self.comput_metrix(logits, batch['labels']) + self.log('train_loss', loss) + self.log('train_acc', acc) + return loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).float() + corr = torch.eq(y_pred, y_true) + acc = torch.sum(corr.float())/labels.size()[0] + return acc + + def validation_step(self, batch, batch_idx): + loss, logits = self.model(**batch) + acc = self.comput_metrix(logits, batch['labels']) + self.log('val_loss', loss) + self.log('val_acc', acc) + + def predict_step(self, batch, batch_idx): + output = self.model(**batch) + return output.logits + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./log/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename) + + +def save_test(data, args, data_model): + with open(args.output_save_path, 'w', encoding='utf-8') as f: + idx = 0 + for i in range(len(data)): + batch = data[i] + for sample in batch: + tmp_result = dict() + label_id = np.argmax(sample.numpy()) + tmp_result['id'] = data_model.test_data.data[idx]['id'] + tmp_result['label'] = data_model.id2label[label_id] + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + idx += 1 + print('save the result to '+args.output_save_path) + + +def main(): + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser.add_argument('--pretrained_model_path', default='', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = TaskDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = pl.Trainer.add_argparse_args(total_parser) + total_parser = TaskModelCheckpoint.add_argparse_args(total_parser) + + # * Args for base model + from fengshen.models.model_utils import add_module_args + total_parser = 
add_module_args(total_parser) + total_parser = LitModel.add_model_specific_args(total_parser) + + args = total_parser.parse_args() + + checkpoint_callback = TaskModelCheckpoint(args).callbacks + lr_monitor = LearningRateMonitor(logging_interval='step') + trainer = pl.Trainer.from_argparse_args(args, + callbacks=[checkpoint_callback, lr_monitor] + ) + + data_model = TaskDataModel(args) + model = LitModel(args) + trainer.fit(model, data_model) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py b/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py new file mode 100755 index 0000000000000000000000000000000000000000..619847c1555311226be69d7d0558368dfd048546 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py @@ -0,0 +1,678 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from fengshen.models.zen2.modeling import ZenForTokenClassification +from fengshen.metric.metric import SeqEntityScore +from fengshen.models.zen2.tokenization import BertTokenizer +from fengshen.models.zen2.ngram_utils import ZenNgramDict +from pytorch_lightning.callbacks import LearningRateMonitor +from dataclasses import dataclass +import logging +import math +import numpy as np +import os +import json +import torch +import pytorch_lightning as pl +import argparse +from pytorch_lightning.callbacks import ModelCheckpoint +from torch.utils.data import Dataset, DataLoader + +import torch.nn.functional as F +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.ERROR) +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. 
+ """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id, ngram_ids, ngram_positions, ngram_lengths, + ngram_tuples, ngram_seg_ids, ngram_masks, valid_ids=None, label_mask=None, b_use_valid_filter=False): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + self.valid_ids = valid_ids + self.label_mask = label_mask + + self.ngram_ids = ngram_ids + self.ngram_positions = ngram_positions + self.ngram_lengths = ngram_lengths + self.ngram_tuples = ngram_tuples + self.ngram_seg_ids = ngram_seg_ids + self.ngram_masks = ngram_masks + + self.b_use_valid_filter = b_use_valid_filter + + +def convert_examples_to_features(examples, label_map, max_seq_length, tokenizer, ngram_dict): + """Loads a data file into a list of `InputBatch`s.""" + + # label_map = {label: i for i, label in enumerate(label_list, 1)} + # label_map["[PAD]"] = 0 + + features = [] + b_use_valid_filter = False + for (ex_index, example) in enumerate(examples): + textlist = example.text_a + labellist = example.label + tokens = [] + labels = [] + valid = [] + label_mask = [] + for i, word in enumerate(textlist): + token = tokenizer.tokenize(word) + if len(tokens) + len(token) > max_seq_length - 2: + break + tokens.extend(token) + label_1 = labellist[i] + for m in range(len(token)): + if m == 0: + labels.append(label_1) + valid.append(1) + label_mask.append(1) + else: + valid.append(0) + b_use_valid_filter = True + ntokens = [] + segment_ids = [] + label_ids = [] + ntokens.append("[CLS]") + segment_ids.append(0) + valid.insert(0, 1) + label_mask.insert(0, 1) + label_ids.append(label_map["[CLS]"]) + for i, token in enumerate(tokens): + ntokens.append(token) + segment_ids.append(0) + if len(labels) > i: + label_ids.append(label_map[labels[i]]) + ntokens.append("[SEP]") + segment_ids.append(0) + valid.append(1) + label_mask.append(1) + label_ids.append(label_map["[SEP]"]) + input_ids = tokenizer.convert_tokens_to_ids(ntokens) + input_mask = [1] * len(input_ids) + label_mask = [1] * len(label_ids) + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + label_ids.append(0) + valid.append(1) + label_mask.append(0) + while len(label_ids) < max_seq_length: + label_ids.append(0) + label_mask.append(0) + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(label_ids) == max_seq_length + assert len(valid) == max_seq_length + assert len(label_mask) == max_seq_length + + # ----------- code for ngram BEGIN----------- + ngram_matches = [] + # Filter the ngram segment from 2 to 7 to check whether there is a ngram + max_gram_n = ngram_dict.max_ngram_len + for p in range(2, max_gram_n): + for q in range(0, len(tokens) - p + 1): + character_segment = tokens[q:q + p] + # j is the starting position of the ngram + # i is the length of the current ngram + character_segment = tuple(character_segment) + if character_segment in ngram_dict.ngram_to_id_dict: + ngram_index = ngram_dict.ngram_to_id_dict[character_segment] + ngram_freq = ngram_dict.ngram_to_freq_dict[character_segment] + ngram_matches.append([ngram_index, q, p, character_segment, ngram_freq]) + + ngram_matches = sorted(ngram_matches, key=lambda s: s[0]) + + max_ngram_in_seq_proportion = math.ceil((len(tokens) / 
max_seq_length) * ngram_dict.max_ngram_in_seq)
+        if len(ngram_matches) > max_ngram_in_seq_proportion:
+            ngram_matches = ngram_matches[:max_ngram_in_seq_proportion]
+
+        ngram_ids = [ngram[0] for ngram in ngram_matches]
+        ngram_positions = [ngram[1] for ngram in ngram_matches]
+        ngram_lengths = [ngram[2] for ngram in ngram_matches]
+        ngram_tuples = [ngram[3] for ngram in ngram_matches]
+        ngram_freqs = [ngram[4] for ngram in ngram_matches]
+        ngram_seg_ids = [0 if position < (len(tokens) + 2) else 1 for position in ngram_positions]
+
+        ngram_mask_array = np.zeros(ngram_dict.max_ngram_in_seq, dtype=bool)
+        ngram_mask_array[:len(ngram_ids)] = 1
+
+        # record the masked positions
+        ngram_positions_matrix = np.zeros(shape=(max_seq_length, ngram_dict.max_ngram_in_seq), dtype=np.int32)
+        for i in range(len(ngram_ids)):
+            ngram_positions_matrix[ngram_positions[i]:ngram_positions[i] + ngram_lengths[i], i] = ngram_freqs[i]
+        ngram_positions_matrix = torch.from_numpy(ngram_positions_matrix.astype(float))
+        ngram_positions_matrix = torch.div(ngram_positions_matrix, torch.stack(
+            [torch.sum(ngram_positions_matrix, 1)] * ngram_positions_matrix.size(1)).t() + 1e-10)
+        ngram_positions_matrix = ngram_positions_matrix.numpy()
+
+        # Zero-pad up to the max ngram in seq length.
+        padding = [0] * (ngram_dict.max_ngram_in_seq - len(ngram_ids))
+        ngram_ids += padding
+        ngram_lengths += padding
+        ngram_seg_ids += padding
+
+        # ----------- code for ngram END-----------
+
+        if ex_index < 5:
+            logger.info("*** Example ***")
+            logger.info("guid: %s" % (example.guid))
+            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
+            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
+            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
+            logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
+            logger.info("label: %s (id = %s)" % (",".join([str(x) for x in example.label]), ",".join([str(x) for x in label_ids])))
+            logger.info("valid: %s" % " ".join([str(x) for x in valid]))
+            logger.info("b_use_valid_filter: %s" % str(b_use_valid_filter))
+            logger.info("ngram_ids: %s" % " ".join([str(x) for x in ngram_ids]))
+            logger.info("ngram_positions: %s" % " ".join([str(x) for x in ngram_positions]))
+            logger.info("ngram_lengths: %s" % " ".join([str(x) for x in ngram_lengths]))
+            logger.info("ngram_tuples: %s" % " ".join([str(x) for x in ngram_tuples]))
+            logger.info("ngram_seg_ids: %s" % " ".join([str(x) for x in ngram_seg_ids]))
+
+        features.append(
+            InputFeatures(input_ids=input_ids,
+                          input_mask=input_mask,
+                          segment_ids=segment_ids,
+                          label_id=label_ids,
+                          ngram_ids=ngram_ids,
+                          ngram_positions=ngram_positions_matrix,
+                          ngram_lengths=ngram_lengths,
+                          ngram_tuples=ngram_tuples,
+                          ngram_seg_ids=ngram_seg_ids,
+                          ngram_masks=ngram_mask_array,
+                          valid_ids=valid,
+                          label_mask=label_mask,
+                          b_use_valid_filter=b_use_valid_filter))
+    return features
+
+
+class DataProcessor(object):
+    """Base class for data converters for sequence classification data sets."""
+
+    def get_examples(self, data_path, set_type, quotechar=' '):
+        """See base class."""
+        return self._create_examples(
+            self._read_tsv(data_path, self.get_quotechar()), set_type)
+
+    def _create_examples(self, lines, set_type):
+        examples = []
+        for i, (sentence, label) in enumerate(lines):
+            guid = "%s-%s" % (set_type, i)
+            text_a = sentence
+            examples.append(InputExample(guid=guid, text_a=text_a, label=label))
+        return examples
+
+    def get_labels(self):
+        """Gets the list of labels for this data 
set.""" + raise NotImplementedError() + + def get_quotechar(self): + return ' ' + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + ''' + read file + return format : + [ ['EU', 'B-ORG'], ['rejects', 'O'], ['German', 'B-MISC'], ['call', 'O'], ['to', 'O'], ['boycott', 'O'], ['British', 'B-MISC'], ['lamb', 'O'], ['.', 'O'] ] + ''' + f = open(input_file) + data = [] + sentence = [] + label = [] + for line in f: + if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n": + if len(sentence) > 0: + data.append((sentence, label)) + sentence = [] + label = [] + continue + splits = line.split(quotechar) + sentence.append(splits[0]) + label.append(splits[-1][:-1]) + + if len(sentence) > 0: + data.append((sentence, label)) + sentence = [] + label = [] + return data + + +class MSRAProcessor(DataProcessor): + """Processor for the msra data set.""" + + def get_labels(self): + return ['B-NR', 'B-NS', 'B-NT', 'E-NR', 'E-NS', 'E-NT', 'M-NR', + 'M-NS', 'M-NT', 'O', 'S-NR', 'S-NS', 'S-NT', '[CLS]', '[SEP]'] + + +class OntoNotes4Processor(DataProcessor): + """Processor for the OntoNotes4 data set.""" + + def get_labels(self): + return ['B-GPE', 'B-LOC', 'B-ORG', 'B-PER', 'E-GPE', 'E-LOC', + 'E-ORG', 'E-PER', 'M-GPE', 'M-LOC', 'M-ORG', 'M-PER', 'O', + 'S-GPE', 'S-LOC', 'S-ORG', 'S-PER', '[CLS]', '[SEP]'] + + +class WeiboProcessor(DataProcessor): + """Processor for the Weibo data set.""" + + def get_labels(self): + return ['B-GPE.NAM', 'B-GPE.NOM', 'B-LOC.NAM', 'B-LOC.NOM', + 'B-ORG.NAM', 'B-ORG.NOM', 'B-PER.NAM', 'B-PER.NOM', 'E-GPE.NAM', + 'E-GPE.NOM', 'E-LOC.NAM', 'E-LOC.NOM', 'E-ORG.NAM', 'E-ORG.NOM', + 'E-PER.NAM', 'E-PER.NOM', 'M-GPE.NAM', 'M-LOC.NAM', 'M-LOC.NOM', + 'M-ORG.NAM', 'M-ORG.NOM', 'M-PER.NAM', 'M-PER.NOM', 'O', + 'S-GPE.NAM', 'S-LOC.NOM', 'S-PER.NAM', 'S-PER.NOM', '[CLS]', '[SEP]'] + + +class ResumeProcessor(DataProcessor): + """Processor for the resume data set.""" + + def get_labels(self): + return ['B-CONT', 'B-EDU', 'B-LOC', 'B-NAME', 'B-ORG', 'B-PRO', + 'B-RACE', 'B-TITLE', 'E-CONT', 'E-EDU', 'E-LOC', 'E-NAME', + 'E-ORG', 'E-PRO', 'E-RACE', 'E-TITLE', 'M-CONT', 'M-EDU', + 'M-LOC', 'M-NAME', 'M-ORG', 'M-PRO', 'M-RACE', 'M-TITLE', + 'O', 'S-NAME', 'S-ORG', 'S-RACE', '[CLS]', '[SEP]'] + + +class CMeEEProcessor(DataProcessor): + """Processor for the CMeEE data set.""" + + def get_quotechar(self): + return '\t' + + def get_labels(self): + return ['B-临床表现', 'B-医学检验项目', 'B-医疗程序', 'B-医疗设备', + 'B-微生物类', 'B-疾病', 'B-科室', 'B-药物', 'B-身体', 'I-临床表现', + 'I-医学检验项目', 'I-医疗程序', 'I-医疗设备', 'I-微生物类', + 'I-疾病', 'I-科室', 'I-药物', 'I-身体', 'O', '[CLS]', '[SEP]'] + + +class CLUENERProcessor(DataProcessor): + """Processor for the CLUENER data set.""" + + def get_quotechar(self): + return '\t' + + def get_labels(self): + return ['B-书名', 'B-公司', 'B-地址', 'B-姓名', 'B-政府', 'B-景点', + 'B-游戏', 'B-电影', 'B-组织机构', 'B-职位', 'I-书名', 'I-公司', + 'I-地址', 'I-姓名', 'I-政府', 'I-景点', 'I-游戏', 'I-电影', + 'I-组织机构', 'I-职位', 'O', '[CLS]', '[SEP]'] + + +class TaskDataset(Dataset): + def __init__(self, data_path, processor, mode='train'): + super().__init__() + self.data = self.load_data(data_path, processor, mode) + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.data[index] + + def load_data(self, data_path, processor, mode): + if mode == "train": + examples = processor.get_examples(data_path, mode) + elif mode == "test": + examples = processor.get_examples(data_path, mode) + elif mode == "dev": + examples = processor.get_examples(data_path, mode) + return examples + + 
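+
+# Data flow recap: a DataProcessor reads CoNLL-style token/label files into
+# InputExample objects, TaskDataset simply wraps that list, and the
+# TaskCollator defined below turns a batch of examples into ZEN2 inputs
+# (token ids plus the n-gram id/position tensors). A minimal standalone
+# sketch of that pipeline, with placeholder paths, would look like:
+#
+#     processor = WeiboProcessor()
+#     dataset = TaskDataset('/path/to/weibo/train.txt', processor, mode='train')
+#     collator = TaskCollator()
+#     collator.args = args  # argparse Namespace providing max_seq_length
+#     collator.tokenizer = BertTokenizer.from_pretrained('/path/to/zen2_checkpoint')
+#     collator.ngram_dict = ZenNgramDict.from_pretrained('/path/to/zen2_checkpoint',
+#                                                        tokenizer=collator.tokenizer)
+#     collator.label2id = {label: i for i, label in enumerate(processor.get_labels(), 1)}
+#     collator.label2id['[PAD]'] = 0
+#     loader = DataLoader(dataset, batch_size=8, collate_fn=collator)
+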
+@dataclass
+class TaskCollator:
+    args = None
+    tokenizer = None
+    ngram_dict = None
+    label2id = None
+
+    def __call__(self, samples):
+        features = convert_examples_to_features(samples, self.label2id, self.args.max_seq_length, self.tokenizer, self.ngram_dict)
+        # logger.info(" Num examples = %d", len(samples))
+
+        input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+        input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
+        segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
+        label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
+        valid_ids = torch.tensor([f.valid_ids for f in features], dtype=torch.long)
+
+        ngram_ids = torch.tensor([f.ngram_ids for f in features], dtype=torch.long)
+        ngram_positions = torch.tensor([f.ngram_positions for f in features], dtype=torch.long)
+        # ngram_lengths = torch.tensor([f.ngram_lengths for f in features], dtype=torch.long)
+        # ngram_seg_ids = torch.tensor([f.ngram_seg_ids for f in features], dtype=torch.long)
+        # ngram_masks = torch.tensor([f.ngram_masks for f in features], dtype=torch.long)
+
+        # label_mask = torch.tensor([f.label_mask for f in features], dtype=torch.long)
+        b_use_valid_filter = torch.tensor([f.b_use_valid_filter for f in features], dtype=torch.bool)
+        # keep only the first flag for the whole batch?
+        # b_use_valid_filter = b_use_valid_filter.detach().cpu().numpy()[0]
+        b_use_valid_filter = b_use_valid_filter[0]
+        return {
+            'input_ids': input_ids,
+            'input_ngram_ids': ngram_ids,
+            'ngram_position_matrix': ngram_positions,
+            'attention_mask': input_mask,
+            'token_type_ids': segment_ids,
+            'labels': label_ids,
+            'valid_ids': valid_ids,
+            'b_use_valid_filter': b_use_valid_filter,
+        }
+
+
+class TaskDataModel(pl.LightningDataModule):
+    @staticmethod
+    def add_data_specific_args(parent_args):
+        parser = parent_args.add_argument_group('TASK NAME DataModel')
+        parser.add_argument('--data_dir', default='./data', type=str)
+        parser.add_argument('--num_workers', default=8, type=int)
+        parser.add_argument('--train_data', default='train.json', type=str)
+        parser.add_argument('--valid_data', default='dev.json', type=str)
+        parser.add_argument('--test_data', default='test.json', type=str)
+        parser.add_argument('--train_batchsize', default=16, type=int)
+        parser.add_argument('--valid_batchsize', default=32, type=int)
+        parser.add_argument('--max_seq_length', default=128, type=int)
+
+        parser.add_argument('--texta_name', default='text', type=str)
+        parser.add_argument('--textb_name', default='sentence2', type=str)
+        parser.add_argument('--label_name', default='label', type=str)
+        parser.add_argument('--id_name', default='id', type=str)
+
+        parser.add_argument('--dataset_name', default=None, type=str)
+        parser.add_argument('--vocab_file',
+                            type=str, default=None,
+                            help="Vocabulary mapping/file BERT was pretrained on")
+        parser.add_argument("--do_lower_case",
+                            action='store_true',
+                            help="Set this flag if you are using an uncased model.")
+        parser.add_argument('--task_name', default='weibo', type=str)
+
+        return parent_args
+
+    def __init__(self, args):
+        super().__init__()
+        self.train_batchsize = args.train_batchsize
+        self.valid_batchsize = args.valid_batchsize
+        self.collator = TaskCollator()
+        self.collator.args = args
+        self.collator.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path, do_lower_case=args.do_lower_case)
+        self.collator.ngram_dict = ZenNgramDict.from_pretrained(args.pretrained_model_path, tokenizer=self.collator.tokenizer)
+
+        processors = {
+            'weibo': 
WeiboProcessor, + 'resume': ResumeProcessor, + 'msra': MSRAProcessor, + 'ontonotes4': OntoNotes4Processor, + 'cmeee': CMeEEProcessor, + 'cluener': CLUENERProcessor, + } + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + # 生成id映射 + label_list = processor.get_labels() + label2id = {label: i for i, label in enumerate(label_list, 1)} + label2id["[PAD]"] = 0 + self.id2label = {v: k for k, v in label2id.items()} + self.collator.label2id = label2id + + if args.dataset_name is None: + self.train_data = TaskDataset(os.path.join( + args.data_dir, args.train_data), processor, mode='train') + self.valid_data = TaskDataset(os.path.join( + args.data_dir, args.valid_data), processor, mode='dev') + self.test_data = TaskDataset(os.path.join( + args.data_dir, args.test_data), processor, mode='test') + + else: + import datasets + ds = datasets.load_dataset(args.dataset_name) + self.train_data = ds['train'] + self.valid_data = ds['validation'] + self.test_data = ds['test'] + self.save_hyperparameters(args) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False, + collate_fn=self.collator) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + def predict_dataloader(self): + return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False, + collate_fn=self.collator) + + +class LitModel(pl.LightningModule): + + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + parser.add_argument('--markup', default='bios', type=str) + parser.add_argument('--middle_prefix', default='I-', type=str) + return parent_args + + def __init__(self, args, id2label): + super().__init__() + # config = ZenConfig(os.path.join(args.pretrained_model_path, 'config.json')) + self.model = ZenForTokenClassification.from_pretrained(args.pretrained_model_path, num_labels=len(id2label)) + self.seq_entity_score = SeqEntityScore(id2label, markup=args.markup, middle_prefix=args.middle_prefix) + self.train_seq_entity_score = SeqEntityScore(id2label, markup=args.markup, middle_prefix=args.middle_prefix) + self.id2label = id2label + self.label2id = {v: k for k, v in id2label.items()} + self.save_hyperparameters(args) + + def setup(self, stage) -> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss = outputs.loss + # logits = outputs.logits + # preds = torch.argmax(F.log_softmax(logits, dim=2), dim=2) + # preds = preds.detach().cpu().numpy() + # labels = batch['labels'].detach().cpu().numpy() + # num_labels = len(self.label2id) + # y_true = [] + # y_pred = [] + # for i, label in enumerate(labels): + # temp_1 = [] + # temp_2 = [] + # for j, m in enumerate(label): + # if j == 0: + # continue + 
# elif labels[i][j] == num_labels - 1: + # y_true.append(temp_1) + # y_pred.append(temp_2) + # break + # else: + # temp_1.append(self.id2label[labels[i][j]]) + # temp_2.append(self.id2label[preds[i][j]]) + + # self.train_seq_entity_score.update(y_true, y_pred) + # result = self.train_seq_entity_score.result() + # self.train_seq_entity_score.reset() + self.log('train_loss', loss) + + return loss + + def validation_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss = outputs.loss + logits = outputs.logits + preds = torch.argmax(F.log_softmax(logits, dim=2), dim=2) + preds = preds.detach().cpu().numpy() + labels = batch['labels'].detach().cpu().numpy() + num_labels = len(self.label2id) + y_true = [] + y_pred = [] + for i, label in enumerate(labels): + temp_1 = [] + temp_2 = [] + for j, m in enumerate(label): + if j == 0: + continue + elif labels[i][j] == num_labels - 1: + y_true.append(temp_1) + y_pred.append(temp_2) + break + else: + temp_1.append(self.id2label[labels[i][j]]) + temp_2.append(self.id2label[preds[i][j]]) + + self.seq_entity_score.update(y_true, y_pred) + self.log('val_loss', loss) + + def validation_epoch_end(self, outputs): + # compute metric for all process + score_dict, _ = self.seq_entity_score.result() + if self.trainer._accelerator_connector.cluster_environment.global_rank() == 0: + print('score_dict:\n', score_dict) + # reset the metric after once validation + self.seq_entity_score.reset() + for k, v in score_dict.items(): + self.log('val_{}'.format(k), v) + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--dirpath', default='./log/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + parser.add_argument('--save_weights_only', default=True, type=bool) + + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.dirpath, + filename=args.filename) + + +def save_test(data, args, data_model): + with open(args.output_save_path, 'w', encoding='utf-8') as f: + idx = 0 + for i in range(len(data)): + batch = data[i] + for sample in batch: + tmp_result = dict() + label_id = np.argmax(sample.numpy()) + tmp_result['id'] = data_model.test_data.data[idx]['id'] + tmp_result['label'] = data_model.id2label[label_id] + json_data = json.dumps(tmp_result, ensure_ascii=False) + f.write(json_data+'\n') + idx += 1 + print('save the result to '+args.output_save_path) + + +def main(): + total_parser = argparse.ArgumentParser("TASK NAME") + total_parser.add_argument('--pretrained_model_path', default='', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + # * Args for data preprocessing + total_parser = TaskDataModel.add_data_specific_args(total_parser) + # * Args for training + total_parser = pl.Trainer.add_argparse_args(total_parser) + total_parser = 
TaskModelCheckpoint.add_argparse_args(total_parser) + + # * Args for base model + from fengshen.models.model_utils import add_module_args + total_parser = add_module_args(total_parser) + total_parser = LitModel.add_model_specific_args(total_parser) + + args = total_parser.parse_args() + + checkpoint_callback = TaskModelCheckpoint(args).callbacks + lr_monitor = LearningRateMonitor(logging_interval='step') + trainer = pl.Trainer.from_argparse_args(args, + callbacks=[checkpoint_callback, lr_monitor] + ) + + data_model = TaskDataModel(args) + id2label = data_model.id2label + print('id2label:', id2label) + model = LitModel(args, id2label) + trainer.fit(model, data_model) + + +if __name__ == "__main__": + main() diff --git a/fengshen/examples/zen2_finetune/fs_zen2_base_afqmc.sh b/fengshen/examples/zen2_finetune/fs_zen2_base_afqmc.sh new file mode 100644 index 0000000000000000000000000000000000000000..7143e61be485f0d6dc2d7912b5b30250df408b75 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_base_afqmc.sh @@ -0,0 +1,94 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_afqmc # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=afqmc + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +# PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name afqmc \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 2 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_base_cmnli.sh b/fengshen/examples/zen2_finetune/fs_zen2_base_cmnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..f6f4f7e9eec1d11a2bf1d09f8d57303ca139f8e2 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_base_cmnli.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_cmnli # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='4' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=cmnli + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/cmnli_public/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 64 \ + --valid_batchsize 32 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name cmnli \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_base_iflytek.sh b/fengshen/examples/zen2_finetune/fs_zen2_base_iflytek.sh new file mode 100644 index 0000000000000000000000000000000000000000..9171a7c3264a856915fd9147096f097b8ebd43c8 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_base_iflytek.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_iflytek # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='0' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=iflytek + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name iflytek \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 119 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_base_ocnli.sh b/fengshen/examples/zen2_finetune/fs_zen2_base_ocnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..f635330a4b260391a3f9d4b01998ce8305d55b8e --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_base_ocnli.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_ocnli # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='1' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=ocnli + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name ocnli \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_base_tnews.sh b/fengshen/examples/zen2_finetune/fs_zen2_base_tnews.sh new file mode 100644 index 0000000000000000000000000000000000000000..dee88afbe2639a514745771538d6c0d40e8d3329 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_base_tnews.sh @@ -0,0 +1,94 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_tnews # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=tnews + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +# PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 +PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test1.1.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name tnews \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 400 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 400 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_large_afqmc.sh b/fengshen/examples/zen2_finetune/fs_zen2_large_afqmc.sh new file mode 100644 index 0000000000000000000000000000000000000000..1f44844a127b5bb39226c56b70bba85957dd735a --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_large_afqmc.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_afqmc # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='1' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=afqmc + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name afqmc \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 2 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_large_cmnli.sh b/fengshen/examples/zen2_finetune/fs_zen2_large_cmnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..b2d6dfff35668596c0c748003b7b937d98604922 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_large_cmnli.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_cmnli # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='3' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=cmnli + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/cmnli_public/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 32 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name cmnli \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_large_iflytek.sh b/fengshen/examples/zen2_finetune/fs_zen2_large_iflytek.sh new file mode 100644 index 0000000000000000000000000000000000000000..7afd7b24d27ddd1a6834935222a100351111d570 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_large_iflytek.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_iflytek # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='5' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=iflytek + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name iflytek \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 119 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 7 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_large_ocnli.sh b/fengshen/examples/zen2_finetune/fs_zen2_large_ocnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..5598ee8027a9bc41c4c196d71d98341557e0f4eb --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_large_ocnli.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_ocnli # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +export CUDA_VISIBLE_DEVICES='6' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=ocnli + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name ocnli \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --num_labels 3 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/fs_zen2_large_tnews.sh b/fengshen/examples/zen2_finetune/fs_zen2_large_tnews.sh new file mode 100644 index 0000000000000000000000000000000000000000..ec081cd3191f951c3815af423329540a219b0114 --- /dev/null +++ b/fengshen/examples/zen2_finetune/fs_zen2_large_tnews.sh @@ -0,0 +1,93 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_tnews # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=tnews + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/classification_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/ +PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.json \ + --valid_data dev.json \ + --test_data test1.1.json \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 128 \ + --texta_name sentence \ + --label_name label \ + --id_name id \ + --task_name tnews \ + " + +MODEL_ARGS="\ + --learning_rate 2e-5 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --num_labels 15 \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_acc \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 400 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_acc:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 10 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 400 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_sequence_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_cluener.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_cluener.sh new file mode 100644 index 0000000000000000000000000000000000000000..04b97b5fe5123af3170523dfde0ae008a78b2428 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_cluener.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_cluener # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_cluener/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=cluener + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/CLUENER/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.txt \ + --valid_data dev.char.txt \ + --test_data dev.char.txt \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name cluener \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bio \ + --middle_prefix I- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_cmeee.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_cmeee.sh new file mode 100644 index 0000000000000000000000000000000000000000..46f27f142891c62587f6c7184c372f4883215bbf --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_cmeee.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_cmeee # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_cmeee/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=cmeee + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/CMeEE/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bio \ + --valid_data dev.char.bio \ + --test_data dev.char.bio \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name cmeee \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bio \ + --middle_prefix I- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_msra.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_msra.sh new file mode 100644 index 0000000000000000000000000000000000000000..397c3ea6adc3d9f275389509aa41d0e4050b3c14 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_msra.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_msra # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_msra/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=msra + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/MSRA/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train_dev.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name msra \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 800 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 800 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_ontonotes4.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_ontonotes4.sh new file mode 100644 index 0000000000000000000000000000000000000000..1e1237967712a6862e5770e90d4e8db8d074d320 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_ontonotes4.sh @@ -0,0 +1,92 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_ontonotes4 # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_ontonotes4/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=ontonotes4 + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
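+  # Note: two PRETRAINED_MODEL_PATH assignments follow; the second (the Hugging Face id
+  # IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese) wins and overrides the local zh_zen_base_2.0 path.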
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/OntoNotes4/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 +PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name ontonotes4 \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 200 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_resume.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_resume.sh new file mode 100644 index 0000000000000000000000000000000000000000..a7aee577ed035c0f39b883aa8a2a4dd6fffd479d --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_resume.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_resume # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_resume/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=resume + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/Resume/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name resume \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_base_weibo.sh b/fengshen/examples/zen2_finetune/ner_zen2_base_weibo.sh new file mode 100644 index 0000000000000000000000000000000000000000..b3f4667e59fe0b7ba98f37dec65e12fdf6faf555 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_base_weibo.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_base_weibo # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_base_weibo/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_base + +TASK=weibo + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
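+  # Weibo NER is a comparatively small corpus, hence the frequent validation (--val_check_interval 20) below.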
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/weibo/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_base_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.all.bmes \ + --valid_data test.all.bmes \ + --test_data test.all.bmes \ + --train_batchsize 32 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name weibo \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 20 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_cluener.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_cluener.sh new file mode 100644 index 0000000000000000000000000000000000000000..07193e3f15ca69755853623a57fee0a573db6593 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_cluener.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_cluener # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_large_cluener/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=cluener + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
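+  # The zen2_large scripts halve --train_batchsize to 16 (the zen2_base ones use 32), presumably to fit
+  # the larger checkpoint in a single GPU's memory.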
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/CLUENER/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.txt \ + --valid_data dev.char.txt \ + --test_data dev.char.txt \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name cluener \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bio \ + --middle_prefix I- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 200 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_cmeee.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_cmeee.sh new file mode 100644 index 0000000000000000000000000000000000000000..02409b04501bf6155481673b3acd0bd22914d3f3 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_cmeee.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_cmeee # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_large_cmeee/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=cmeee + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/CMeEE/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bio \ + --valid_data dev.char.bio \ + --test_data dev.char.bio \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name cmeee \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bio \ + --middle_prefix I- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 200 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_msra.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_msra.sh new file mode 100644 index 0000000000000000000000000000000000000000..cef8f1f70babc94ed77dc585fbba47f5b45ff7a5 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_msra.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_msra # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_large_msra/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=msra + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/MSRA/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train_dev.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name msra \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 800 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 800 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_ontonotes4.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_ontonotes4.sh new file mode 100644 index 0000000000000000000000000000000000000000..f8bb41316b4cec4bb94fa36ac9bc39c9f3ce41f8 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_ontonotes4.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_ontonotes4 # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_large_ontonotes4/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=ontonotes4 + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
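+  # Unlike its sibling zen2_large scripts, this one references the checkpoint by Hugging Face model id
+  # (IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese) rather than a local zh_zen_large_2.0 directory.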
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/OntoNotes4/ +PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name ontonotes4 \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 200 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 200 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_resume.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_resume.sh new file mode 100644 index 0000000000000000000000000000000000000000..e21a61f48a96f1d831c90d3cbc3a9cbe8eb7de38 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_resume.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_resume # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o /cognitive_comp/ganruyi/experiments/ner_finetune/zen2_large_resume/%x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=resume + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/Resume/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.char.bmes \ + --valid_data test.char.bmes \ + --test_data test.char.bmes \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name resume \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 100 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/examples/zen2_finetune/ner_zen2_large_weibo.sh b/fengshen/examples/zen2_finetune/ner_zen2_large_weibo.sh new file mode 100644 index 0000000000000000000000000000000000000000..7fab2998437ef8c12dcd93466371d0324eec4c79 --- /dev/null +++ b/fengshen/examples/zen2_finetune/ner_zen2_large_weibo.sh @@ -0,0 +1,91 @@ +#!/bin/bash +#SBATCH --job-name=zen2_large_weibo # create a short name for your job +#SBATCH --nodes=1 # node count +#SBATCH --ntasks=1 # total number of tasks across all nodes +#SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks) +#SBATCH --gres=gpu:1 # number of gpus per node +#SBATCH --mail-type=ALL # send email when job begins, ends or failed etc. +#SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id) + + +# export CUDA_VISIBLE_DEVICES='2' +export TORCH_EXTENSIONS_DIR=/cognitive_comp/ganruyi/tmp/torch_extendsions + +MODEL_NAME=zen2_large + +TASK=weibo + +ZERO_STAGE=1 +STRATEGY=deepspeed_stage_${ZERO_STAGE} + +ROOT_DIR=/cognitive_comp/ganruyi/experiments/ner_finetune/${MODEL_NAME}_${TASK} +if [ ! -d ${ROOT_DIR} ];then + mkdir -p ${ROOT_DIR} + echo ${ROOT_DIR} created!!!!!!!!!!!!!! +else + echo ${ROOT_DIR} exist!!!!!!!!!!!!!!! 
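+  # Note: the #SBATCH -o above has no directory prefix, so the Slurm log lands in the submission
+  # directory rather than in the per-task experiment directory used by the other scripts.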
+fi + +DATA_DIR=/cognitive_comp/lujunyu/data_zh/NER_Aligned/weibo/ +PRETRAINED_MODEL_PATH=/cognitive_comp/ganruyi/hf_models/zen/zh_zen_large_2.0 + +CHECKPOINT_PATH=${ROOT_DIR}/ckpt/ +OUTPUT_PATH=${ROOT_DIR}/predict.json + +DATA_ARGS="\ + --data_dir $DATA_DIR \ + --train_data train.all.bmes \ + --valid_data test.all.bmes \ + --test_data test.all.bmes \ + --train_batchsize 16 \ + --valid_batchsize 16 \ + --max_seq_length 256 \ + --task_name weibo \ + " + +MODEL_ARGS="\ + --learning_rate 3e-5 \ + --weight_decay 0.1 \ + --warmup_ratio 0.01 \ + --markup bioes \ + --middle_prefix M- \ + " + +MODEL_CHECKPOINT_ARGS="\ + --monitor val_f1 \ + --save_top_k 3 \ + --mode max \ + --every_n_train_steps 100 \ + --save_weights_only True \ + --dirpath $CHECKPOINT_PATH \ + --filename model-{epoch:02d}-{val_f1:.4f} \ + " + +TRAINER_ARGS="\ + --max_epochs 30 \ + --gpus 1 \ + --check_val_every_n_epoch 1 \ + --val_check_interval 20 \ + --default_root_dir $ROOT_DIR \ + " + + +options=" \ + --pretrained_model_path $PRETRAINED_MODEL_PATH \ + --vocab_file $PRETRAINED_MODEL_PATH/vocab.txt \ + --do_lower_case \ + --output_save_path $OUTPUT_PATH \ + $DATA_ARGS \ + $MODEL_ARGS \ + $MODEL_CHECKPOINT_ARGS \ + $TRAINER_ARGS \ +" +SCRIPT_PATH=/cognitive_comp/ganruyi/Fengshenbang-LM/fengshen/examples/zen2_finetune/fengshen_token_level_ft_task.py +/home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + +# SINGULARITY_PATH=/cognitive_comp/ganruyi/pytorch21_06_py3_docker_image_v2.sif +# python3 $SCRIPT_PATH $options +# source activate base +# singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $SINGULARITY_PATH /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options +# /home/ganruyi/anaconda3/bin/python $SCRIPT_PATH $options + diff --git a/fengshen/metric/metric.py b/fengshen/metric/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..c59c2c06d0c5cb583ba0cb3943da5d8f95308b75 --- /dev/null +++ b/fengshen/metric/metric.py @@ -0,0 +1,91 @@ +# coding=utf-8 +from collections import Counter +import torch +from torch import nn +# import seqeval + +from .utils_ner import get_entities + + +class metrics_mlm_acc(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, logits, labels, masked_lm_metric): + + # if len(list(logits.shape))==3: + mask_label_size = 0 + for i in masked_lm_metric: + for j in i: + if j > 0: + mask_label_size += 1 + + y_pred = torch.argmax(logits, dim=-1) + + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)) + masked_lm_metric = masked_lm_metric.view(size=(-1,)) + + corr = torch.eq(y_pred, y_true) + corr = torch.multiply(masked_lm_metric, corr) + + acc = torch.sum(corr.float())/mask_label_size + return acc + + +class SeqEntityScore(object): + def __init__(self, id2label, markup='bios', middle_prefix='I-'): + self.id2label = id2label + self.markup = markup + self.middle_prefix = middle_prefix + self.reset() + + def reset(self): + self.origins = [] + self.founds = [] + self.rights = [] + + def compute(self, origin, found, right): + recall = 0 if origin == 0 else (right / origin) + precision = 0 if found == 0 else (right / found) + f1 = 0. 
if recall + precision == 0 else (2 * precision * recall) / (precision + recall) + return recall, precision, f1 + + def result(self): + class_info = {} + origin_counter = Counter([x[0] for x in self.origins]) + found_counter = Counter([x[0] for x in self.founds]) + right_counter = Counter([x[0] for x in self.rights]) + for type_, count in origin_counter.items(): + origin = count + found = found_counter.get(type_, 0) + right = right_counter.get(type_, 0) + # print('origin:', origin, ' found:', found, ' right:', right) + recall, precision, f1 = self.compute(origin, found, right) + class_info[type_] = {"acc": round(precision, 4), 'recall': round(recall, 4), 'f1': round(f1, 4)} + origin = len(self.origins) + found = len(self.founds) + right = len(self.rights) + recall, precision, f1 = self.compute(origin, found, right) + return {'acc': precision, 'recall': recall, 'f1': f1}, class_info + + def update(self, label_paths, pred_paths): + ''' + labels_paths: [[],[],[],....] + pred_paths: [[],[],[],.....] + + :param label_paths: + :param pred_paths: + :return: + Example: + >>> labels_paths = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] + >>> pred_paths = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] + ''' + for label_path, pre_path in zip(label_paths, pred_paths): + label_entities = get_entities(label_path, self.id2label, self.markup, self.middle_prefix) + pre_entities = get_entities(pre_path, self.id2label, self.markup, self.middle_prefix) + # print('label:', label_path, ',label_entities: ', label_entities) + # print('pred:', pre_path, ',pre_entities: ', pre_entities) + self.origins.extend(label_entities) + self.founds.extend(pre_entities) + self.rights.extend([pre_entity for pre_entity in pre_entities if pre_entity in label_entities]) diff --git a/fengshen/metric/utils_ner.py b/fengshen/metric/utils_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..20efe33defdcbef59d75e83a1bf993eaadd962c8 --- /dev/null +++ b/fengshen/metric/utils_ner.py @@ -0,0 +1,261 @@ +import csv +import json +import torch +from transformers import BertTokenizer + + +class CNerTokenizer(BertTokenizer): + def __init__(self, vocab_file, do_lower_case=True): + super().__init__(vocab_file=str(vocab_file), do_lower_case=do_lower_case) + self.vocab_file = str(vocab_file) + self.do_lower_case = do_lower_case + + def tokenize(self, text): + _tokens = [] + for c in text: + if self.do_lower_case: + c = c.lower() + if c in self.vocab: + _tokens.append(c) + else: + _tokens.append('[UNK]') + return _tokens + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_train_examples(self, data_dir): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8-sig") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + @classmethod + def _read_text(self, input_file): + lines = [] + with open(input_file, 'r') as f: + words = [] + labels = [] + for line in f: + if 
line.startswith("-DOCSTART-") or line == "" or line == "\n": + if words: + lines.append({"words": words, "labels": labels}) + words = [] + labels = [] + else: + splits = line.split(" ") + words.append(splits[0]) + if len(splits) > 1: + labels.append(splits[-1].replace("\n", "")) + else: + # Examples could have no label for mode = "test" + labels.append("O") + if words: + lines.append({"words": words, "labels": labels}) + return lines + + @classmethod + def _read_json(self, input_file): + lines = [] + with open(input_file, 'r', encoding='utf8') as f: + for line in f: + line = json.loads(line.strip()) + text = line['text'] + label_entities = line.get('label', None) + words = list(text) + labels = ['O'] * len(words) + if label_entities is not None: + for key, value in label_entities.items(): + for sub_name, sub_index in value.items(): + for start_index, end_index in sub_index: + assert ''.join(words[start_index:end_index+1]) == sub_name + if start_index == end_index: + labels[start_index] = 'S-'+key + else: + if end_index - start_index == 1: + labels[start_index] = 'B-' + key + labels[end_index] = 'E-' + key + else: + labels[start_index] = 'B-' + key + labels[start_index + 1:end_index] = ['I-' + key] * (len(sub_name) - 2) + labels[end_index] = 'E-' + key + lines.append({"words": words, "labels": labels}) + return lines + + +def get_entity_bios(seq, id2label, middle_prefix='I-'): + """Gets entities from sequence. + note: BIOS + Args: + seq (list): sequence of labels. + Returns: + list: list of (chunk_type, chunk_start, chunk_end). + Example: + # >>> seq = ['B-PER', 'I-PER', 'O', 'S-LOC'] + # >>> get_entity_bios(seq) + [['PER', 0,1], ['LOC', 3, 3]] + """ + chunks = [] + chunk = [-1, -1, -1] + for indx, tag in enumerate(seq): + if not isinstance(tag, str): + tag = id2label[tag] + if tag.startswith("S-"): + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + chunk[1] = indx + chunk[2] = indx + chunk[0] = tag.split('-')[1] + chunks.append(chunk) + chunk = (-1, -1, -1) + if tag.startswith("B-"): + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + chunk[1] = indx + chunk[0] = tag.split('-')[1] + elif tag.startswith(middle_prefix) and chunk[1] != -1: + _type = tag.split('-')[1] + if _type == chunk[0]: + chunk[2] = indx + if indx == len(seq) - 1: + chunks.append(chunk) + else: + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + return chunks + + +def get_entity_bio(seq, id2label, middle_prefix='I-'): + """Gets entities from sequence. + note: BIO + Args: + seq (list): sequence of labels. + Returns: + list: list of (chunk_type, chunk_start, chunk_end). + Example: + seq = ['B-PER', 'I-PER', 'O', 'B-LOC'] + get_entity_bio(seq) + #output + [['PER', 0,1], ['LOC', 3, 3]] + """ + chunks = [] + chunk = [-1, -1, -1] + for indx, tag in enumerate(seq): + if not isinstance(tag, str): + tag = id2label[tag] + if tag.startswith("B-"): + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + chunk[1] = indx + chunk[0] = tag.split('-')[1] + chunk[2] = indx + if indx == len(seq) - 1: + chunks.append(chunk) + elif tag.startswith(middle_prefix) and chunk[1] != -1: + _type = tag.split('-')[1] + if _type == chunk[0]: + chunk[2] = indx + + if indx == len(seq) - 1: + chunks.append(chunk) + else: + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + return chunks + + +def get_entity_bioes(seq, id2label, middle_prefix='I-'): + """Gets entities from sequence. + note: BIOS + Args: + seq (list): sequence of labels. 
+ Returns: + list: list of (chunk_type, chunk_start, chunk_end). + Example: + # >>> seq = ['B-PER', 'I-PER', 'O', 'S-LOC'] + # >>> get_entity_bios(seq) + [['PER', 0,1], ['LOC', 3, 3]] + """ + chunks = [] + chunk = [-1, -1, -1] + for indx, tag in enumerate(seq): + if not isinstance(tag, str): + tag = id2label[tag] + if tag.startswith("S-"): + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + chunk[1] = indx + chunk[2] = indx + chunk[0] = tag.split('-')[1] + chunks.append(chunk) + chunk = (-1, -1, -1) + if tag.startswith("B-"): + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + chunk[1] = indx + chunk[0] = tag.split('-')[1] + elif (tag.startswith(middle_prefix) or tag.startswith("E-")) and chunk[1] != -1: + _type = tag.split('-')[1] + if _type == chunk[0]: + chunk[2] = indx + if indx == len(seq) - 1: + chunks.append(chunk) + else: + if chunk[2] != -1: + chunks.append(chunk) + chunk = [-1, -1, -1] + return chunks + + +def get_entities(seq, id2label, markup='bio', middle_prefix='I-'): + ''' + :param seq: + :param id2label: + :param markup: + :return: + ''' + assert markup in ['bio', 'bios', 'bioes'] + if markup == 'bio': + return get_entity_bio(seq, id2label, middle_prefix) + elif markup == 'bios': + return get_entity_bios(seq, id2label, middle_prefix) + else: + return get_entity_bioes(seq, id2label, middle_prefix) + + +def bert_extract_item(start_logits, end_logits): + S = [] + start_pred = torch.argmax(start_logits, -1).cpu().numpy()[0][1:-1] + end_pred = torch.argmax(end_logits, -1).cpu().numpy()[0][1:-1] + for i, s_l in enumerate(start_pred): + if s_l == 0: + continue + for j, e_l in enumerate(end_pred[i:]): + if s_l == e_l: + S.append((s_l, i, i + j)) + break + return S diff --git a/fengshen/models/__init__.py b/fengshen/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9bad5790a5799b96f2e164d825c0b1f8ec0c2dfb --- /dev/null +++ b/fengshen/models/__init__.py @@ -0,0 +1 @@ +# coding=utf-8 diff --git a/fengshen/models/auto/__init__.py b/fengshen/models/auto/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ef185f32cc2d9f9b30db1a6a681ce2df34936351 --- /dev/null +++ b/fengshen/models/auto/__init__.py @@ -0,0 +1,56 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
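+
+# Lazy-import scaffolding modeled on transformers' own `auto` package: the names declared in
+# `_import_structure` below are only imported on first attribute access, via `_LazyModule`.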
+ +from typing import TYPE_CHECKING + +from transformers.file_utils import _LazyModule, is_torch_available + + +_import_structure = { + "auto_factory": ["get_values"], + "configuration_auto": ["ALL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CONFIG_MAPPING", "MODEL_NAMES_MAPPING", "AutoConfig"], + "tokenization_auto": ["TOKENIZER_MAPPING", "AutoTokenizer"], +} + +if is_torch_available(): + _import_structure["modeling_auto"] = [ + "AutoModel", + "AutoModelForMaskedLM", + "AutoModelForMultipleChoice", + "AutoModelForPreTraining", + "AutoModelForQuestionAnswering", + "AutoModelForSequenceClassification", + "AutoModelForTokenClassification", + ] + +if TYPE_CHECKING: + from .auto_factory import get_values + from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, MODEL_NAMES_MAPPING, AutoConfig + from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer + if is_torch_available(): + from .modeling_auto import ( + AutoModel, + AutoModelForMaskedLM, + AutoModelForMultipleChoice, + AutoModelForPreTraining, + AutoModelForQuestionAnswering, + AutoModelForSequenceClassification, + AutoModelForTokenClassification, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure) diff --git a/fengshen/models/auto/auto_factory.py b/fengshen/models/auto/auto_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..688bbd4853284305d047be0552077f721e2f97de --- /dev/null +++ b/fengshen/models/auto/auto_factory.py @@ -0,0 +1,644 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Factory function to build auto-model classes.""" +import importlib +from collections import OrderedDict + +from transformers.configuration_utils import PretrainedConfig +from transformers.file_utils import copy_func +from transformers.utils import logging +from .configuration_auto import AutoConfig, model_type_to_module_name, replace_list_option_in_docstrings +from .dynamic import get_class_from_dynamic_module + + +logger = logging.get_logger(__name__) + + +CLASS_DOCSTRING = """ + This is a generic model class that will be instantiated as one of the model classes of the library when created + with the [`~BaseAutoModelClass.from_pretrained`] class method or the [`~BaseAutoModelClass.from_config`] class + method. + + This class cannot be instantiated directly using `__init__()` (throws an error). +""" + +FROM_CONFIG_DOCSTRING = """ + Instantiates one of the model classes of the library from a configuration. + + Note: + Loading a model from its configuration file does **not** load the model weights. It only affects the + model's configuration. Use [`~BaseAutoModelClass.from_pretrained`] to load the model weights. 
+ + Args: + config ([`PretrainedConfig`]): + The model class to instantiate is selected based on the configuration class: + + List options + + Examples: + + ```python + >>> from transformers import AutoConfig, BaseAutoModelClass + + >>> # Download configuration from huggingface.co and cache. + >>> config = AutoConfig.from_pretrained("checkpoint_placeholder") + >>> model = BaseAutoModelClass.from_config(config) + ``` +""" + +FROM_PRETRAINED_TORCH_DOCSTRING = """ + Instantiate one of the model classes of the library from a pretrained model. + + The model class to instantiate is selected based on the `model_type` property of the config object (either + passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by + falling back to using pattern matching on `pretrained_model_name_or_path`: + + List options + + The model is set in evaluation mode by default using `model.eval()` (so for instance, dropout modules are + deactivated). To train the model, you should first set it back in training mode with `model.train()` + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + Can be either: + + - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a + user or organization name, like `dbmdz/bert-base-german-cased`. + - A path to a *directory* containing model weights saved using + [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. + - A path or url to a *tensorflow index checkpoint file* (e.g, `./tf_model/model.ckpt.index`). In + this case, `from_tf` should be set to `True` and a configuration object should be provided as + `config` argument. This loading path is slower than converting the TensorFlow checkpoint in a + PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + model_args (additional positional arguments, *optional*): + Will be passed along to the underlying model `__init__()` method. + config ([`PretrainedConfig`], *optional*): + Configuration for the model to use instead of an automatically loaded configuration. Configuration can + be automatically loaded when: + + - The model is a model provided by the library (loaded with the *model id* string of a pretrained + model). + - The model was saved using [`~PreTrainedModel.save_pretrained`] and is reloaded by supplying the + save directory. + - The model is loaded by supplying a local directory as `pretrained_model_name_or_path` and a + configuration JSON file named *config.json* is found in the directory. + state_dict (*Dict[str, torch.Tensor]*, *optional*): + A state dictionary to use instead of a state dictionary loaded from saved weights file. + + This option can be used if you want to create a model from a pretrained configuration but load your own + weights. In this case though, you should check if using [`~PreTrainedModel.save_pretrained`] and + [`~PreTrainedModel.from_pretrained`] is not a simpler option. + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the + standard cache should not be used. + from_tf (`bool`, *optional*, defaults to `False`): + Load the model weights from a TensorFlow checkpoint save file (see docstring of + `pretrained_model_name_or_path` argument). 
+ force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received files. Will attempt to resume the download if such a + file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + output_loading_info(`bool`, *optional*, defaults to `False`): + Whether ot not to also return a dictionary containing missing keys, unexpected keys and error messages. + local_files_only(`bool`, *optional*, defaults to `False`): + Whether or not to only look at local files (e.g., not try downloading the model). + revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. + trust_remote_code (`bool`, *optional*, defaults to `False`): + Whether or not to allow for custom models defined on the Hub in their own modeling files. This option + should only be set to `True` for repositories you trust and in which you have read the code, as it will + execute code present on the Hub on your local machine. + kwargs (additional keyword arguments, *optional*): + Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., + `output_attentions=True`). Behaves differently depending on whether a `config` is provided or + automatically loaded: + + - If a configuration is provided with `config`, `**kwargs` will be directly passed to the + underlying model's `__init__` method (we assume all relevant updates to the configuration have + already been done) + - If a configuration is not provided, `kwargs` will be first passed to the configuration class + initialization function ([`~PretrainedConfig.from_pretrained`]). Each key of `kwargs` that + corresponds to a configuration attribute will be used to override said attribute with the + supplied `kwargs` value. Remaining keys that do not correspond to any configuration attribute + will be passed to the underlying model's `__init__` function. + + Examples: + + ```python + >>> from transformers import AutoConfig, BaseAutoModelClass + + >>> # Download model and configuration from huggingface.co and cache. + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder") + + >>> # Update configuration during loading + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder", output_attentions=True) + >>> model.config.output_attentions + True + + >>> # Loading from a TF checkpoint file instead of a PyTorch model (slower) + >>> config = AutoConfig.from_pretrained("./tf_model/shortcut_placeholder_tf_model_config.json") + >>> model = BaseAutoModelClass.from_pretrained( + ... "./tf_model/shortcut_placeholder_tf_checkpoint.ckpt.index", from_tf=True, config=config + ... ) + ``` +""" + +FROM_PRETRAINED_TF_DOCSTRING = """ + Instantiate one of the model classes of the library from a pretrained model. 
+ + The model class to instantiate is selected based on the `model_type` property of the config object (either + passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by + falling back to using pattern matching on `pretrained_model_name_or_path`: + + List options + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + Can be either: + + - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a + user or organization name, like `dbmdz/bert-base-german-cased`. + - A path to a *directory* containing model weights saved using + [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. + - A path or url to a *PyTorch state_dict save file* (e.g, `./pt_model/pytorch_model.bin`). In this + case, `from_pt` should be set to `True` and a configuration object should be provided as `config` + argument. This loading path is slower than converting the PyTorch model in a TensorFlow model + using the provided conversion scripts and loading the TensorFlow model afterwards. + model_args (additional positional arguments, *optional*): + Will be passed along to the underlying model `__init__()` method. + config ([`PretrainedConfig`], *optional*): + Configuration for the model to use instead of an automatically loaded configuration. Configuration can + be automatically loaded when: + + - The model is a model provided by the library (loaded with the *model id* string of a pretrained + model). + - The model was saved using [`~PreTrainedModel.save_pretrained`] and is reloaded by supplying the + save directory. + - The model is loaded by supplying a local directory as `pretrained_model_name_or_path` and a + configuration JSON file named *config.json* is found in the directory. + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the + standard cache should not be used. + from_pt (`bool`, *optional*, defaults to `False`): + Load the model weights from a PyTorch checkpoint save file (see docstring of + `pretrained_model_name_or_path` argument). + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received files. Will attempt to resume the download if such a + file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + output_loading_info(`bool`, *optional*, defaults to `False`): + Whether ot not to also return a dictionary containing missing keys, unexpected keys and error messages. + local_files_only(`bool`, *optional*, defaults to `False`): + Whether or not to only look at local files (e.g., not try downloading the model). + revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. 
+ trust_remote_code (`bool`, *optional*, defaults to `False`): + Whether or not to allow for custom models defined on the Hub in their own modeling files. This option + should only be set to `True` for repositories you trust and in which you have read the code, as it will + execute code present on the Hub on your local machine. + kwargs (additional keyword arguments, *optional*): + Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., + `output_attentions=True`). Behaves differently depending on whether a `config` is provided or + automatically loaded: + + - If a configuration is provided with `config`, `**kwargs` will be directly passed to the + underlying model's `__init__` method (we assume all relevant updates to the configuration have + already been done) + - If a configuration is not provided, `kwargs` will be first passed to the configuration class + initialization function ([`~PretrainedConfig.from_pretrained`]). Each key of `kwargs` that + corresponds to a configuration attribute will be used to override said attribute with the + supplied `kwargs` value. Remaining keys that do not correspond to any configuration attribute + will be passed to the underlying model's `__init__` function. + + Examples: + + ```python + >>> from transformers import AutoConfig, BaseAutoModelClass + + >>> # Download model and configuration from huggingface.co and cache. + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder") + + >>> # Update configuration during loading + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder", output_attentions=True) + >>> model.config.output_attentions + True + + >>> # Loading from a PyTorch checkpoint file instead of a TensorFlow model (slower) + >>> config = AutoConfig.from_pretrained("./pt_model/shortcut_placeholder_pt_model_config.json") + >>> model = BaseAutoModelClass.from_pretrained( + ... "./pt_model/shortcut_placeholder_pytorch_model.bin", from_pt=True, config=config + ... ) + ``` +""" + +FROM_PRETRAINED_FLAX_DOCSTRING = """ + Instantiate one of the model classes of the library from a pretrained model. + + The model class to instantiate is selected based on the `model_type` property of the config object (either + passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by + falling back to using pattern matching on `pretrained_model_name_or_path`: + + List options + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + Can be either: + + - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a + user or organization name, like `dbmdz/bert-base-german-cased`. + - A path to a *directory* containing model weights saved using + [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. + - A path or url to a *PyTorch state_dict save file* (e.g, `./pt_model/pytorch_model.bin`). In this + case, `from_pt` should be set to `True` and a configuration object should be provided as `config` + argument. This loading path is slower than converting the PyTorch model in a TensorFlow model + using the provided conversion scripts and loading the TensorFlow model afterwards. + model_args (additional positional arguments, *optional*): + Will be passed along to the underlying model `__init__()` method. 
+        config ([`PretrainedConfig`], *optional*):
+            Configuration for the model to use instead of an automatically loaded configuration. Configuration can
+            be automatically loaded when:
+
+                - The model is a model provided by the library (loaded with the *model id* string of a pretrained
+                  model).
+                - The model was saved using [`~PreTrainedModel.save_pretrained`] and is reloaded by supplying the
+                  save directory.
+                - The model is loaded by supplying a local directory as `pretrained_model_name_or_path` and a
+                  configuration JSON file named *config.json* is found in the directory.
+        cache_dir (`str` or `os.PathLike`, *optional*):
+            Path to a directory in which a downloaded pretrained model configuration should be cached if the
+            standard cache should not be used.
+        from_pt (`bool`, *optional*, defaults to `False`):
+            Load the model weights from a PyTorch checkpoint save file (see docstring of
+            `pretrained_model_name_or_path` argument).
+        force_download (`bool`, *optional*, defaults to `False`):
+            Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+            cached versions if they exist.
+        resume_download (`bool`, *optional*, defaults to `False`):
+            Whether or not to delete incompletely received files. Will attempt to resume the download if such a
+            file exists.
+        proxies (`Dict[str, str]`, *optional*):
+            A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
+            'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+        output_loading_info(`bool`, *optional*, defaults to `False`):
+            Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.
+        local_files_only(`bool`, *optional*, defaults to `False`):
+            Whether or not to only look at local files (e.g., not try downloading the model).
+        revision(`str`, *optional*, defaults to `"main"`):
+            The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
+            git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
+            identifier allowed by git.
+        trust_remote_code (`bool`, *optional*, defaults to `False`):
+            Whether or not to allow for custom models defined on the Hub in their own modeling files. This option
+            should only be set to `True` for repositories you trust and in which you have read the code, as it will
+            execute code present on the Hub on your local machine.
+        kwargs (additional keyword arguments, *optional*):
+            Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
+            `output_attentions=True`). Behaves differently depending on whether a `config` is provided or
+            automatically loaded:
+
+                - If a configuration is provided with `config`, `**kwargs` will be directly passed to the
+                  underlying model's `__init__` method (we assume all relevant updates to the configuration have
+                  already been done)
+                - If a configuration is not provided, `kwargs` will be first passed to the configuration class
+                  initialization function ([`~PretrainedConfig.from_pretrained`]). Each key of `kwargs` that
+                  corresponds to a configuration attribute will be used to override said attribute with the
+                  supplied `kwargs` value. Remaining keys that do not correspond to any configuration attribute
+                  will be passed to the underlying model's `__init__` function.
+ + Examples: + + ```python + >>> from transformers import AutoConfig, BaseAutoModelClass + + >>> # Download model and configuration from huggingface.co and cache. + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder") + + >>> # Update configuration during loading + >>> model = BaseAutoModelClass.from_pretrained("checkpoint_placeholder", output_attentions=True) + >>> model.config.output_attentions + True + + >>> # Loading from a PyTorch checkpoint file instead of a TensorFlow model (slower) + >>> config = AutoConfig.from_pretrained("./pt_model/shortcut_placeholder_pt_model_config.json") + >>> model = BaseAutoModelClass.from_pretrained( + ... "./pt_model/shortcut_placeholder_pytorch_model.bin", from_pt=True, config=config + ... ) + ``` +""" + + +def _get_model_class(config, model_mapping): + supported_models = model_mapping[type(config)] + if not isinstance(supported_models, (list, tuple)): + return supported_models + + name_to_model = {model.__name__: model for model in supported_models} + architectures = getattr(config, "architectures", []) + for arch in architectures: + if arch in name_to_model: + return name_to_model[arch] + elif f"TF{arch}" in name_to_model: + return name_to_model[f"TF{arch}"] + elif f"Flax{arch}" in name_to_model: + return name_to_model[f"Flax{arch}"] + + # If not architecture is set in the config or match the supported models, the first element of the tuple is the + # defaults. + return supported_models[0] + + +class _BaseAutoModelClass: + # Base class for auto models. + _model_mapping = None + + def __init__(self, *args, **kwargs): + raise EnvironmentError( + f"{self.__class__.__name__} is designed to be instantiated " + f"using the `{self.__class__.__name__}.from_pretrained(pretrained_model_name_or_path)` or " + f"`{self.__class__.__name__}.from_config(config)` methods." + ) + + @classmethod + def from_config(cls, config, **kwargs): + trust_remote_code = kwargs.pop("trust_remote_code", False) + if hasattr(config, "auto_map") and cls.__name__ in config.auto_map: + if not trust_remote_code: + raise ValueError( + "Loading this model requires you to execute the modeling file in that repo " + "on your local machine. Make sure you have read the code there to avoid malicious use, then set " + "the option `trust_remote_code=True` to remove this error." + ) + if kwargs.get("revision", None) is None: + logger.warn( + "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure " + "no malicious code has been contributed in a newer revision." + ) + class_ref = config.auto_map[cls.__name__] + module_file, class_name = class_ref.split(".") + model_class = get_class_from_dynamic_module( + config.name_or_path, module_file + ".py", class_name, **kwargs) + return model_class._from_config(config, **kwargs) + elif type(config) in cls._model_mapping.keys(): + model_class = _get_model_class(config, cls._model_mapping) + return model_class._from_config(config, **kwargs) + + raise ValueError( + f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n" + f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}." 
+ ) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + config = kwargs.pop("config", None) + trust_remote_code = kwargs.pop("trust_remote_code", False) + kwargs["_from_auto"] = True + if not isinstance(config, PretrainedConfig): + config, kwargs = AutoConfig.from_pretrained( + pretrained_model_name_or_path, return_unused_kwargs=True, trust_remote_code=trust_remote_code, **kwargs + ) + if hasattr(config, "auto_map") and cls.__name__ in config.auto_map: + if not trust_remote_code: + raise ValueError( + f"Loading {pretrained_model_name_or_path} requires you to execute the modeling file in that repo " + "on your local machine. Make sure you have read the code there to avoid malicious use, then set " + "the option `trust_remote_code=True` to remove this error." + ) + if kwargs.get("revision", None) is None: + logger.warn( + "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure " + "no malicious code has been contributed in a newer revision." + ) + class_ref = config.auto_map[cls.__name__] + module_file, class_name = class_ref.split(".") + model_class = get_class_from_dynamic_module( + pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs + ) + return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs) + elif type(config) in cls._model_mapping.keys(): + model_class = _get_model_class(config, cls._model_mapping) + return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs) + raise ValueError( + f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n" + f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}." + ) + + @classmethod + def register(cls, config_class, model_class): + """ + Register a new model for this class. + + Args: + config_class ([`PretrainedConfig`]): + The configuration corresponding to the model to register. + model_class ([`PreTrainedModel`]): + The model to register. + """ + if hasattr(model_class, "config_class") and model_class.config_class != config_class: + raise ValueError( + "The model class you are passing has a `config_class` attribute that is not consistent with the " + f"config class you passed (model has {model_class.config_class} and you passed {config_class}. Fix " + "one of those so they match!" + ) + cls._model_mapping.register(config_class, model_class) + + +def insert_head_doc(docstring, head_doc=""): + if len(head_doc) > 0: + return docstring.replace( + "one of the model classes of the library ", + f"one of the model classes of the library (with a {head_doc} head) ", + ) + return docstring.replace( + "one of the model classes of the library ", "one of the base model classes of the library " + ) + + +def auto_class_update(cls, checkpoint_for_example="bert-base-cased", head_doc=""): + # Create a new class with the right name from the base class + model_mapping = cls._model_mapping + name = cls.__name__ + class_docstring = insert_head_doc(CLASS_DOCSTRING, head_doc=head_doc) + cls.__doc__ = class_docstring.replace("BaseAutoModelClass", name) + + # Now we need to copy and re-register `from_config` and `from_pretrained` as class methods otherwise we can't + # have a specific docstrings for them. 
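+    # `copy_func` gives each generated auto class its own function objects, so the shared docstring
+    # templates can have their "BaseAutoModelClass", "checkpoint_placeholder" and "shortcut_placeholder"
+    # tokens replaced per class without mutating the methods on `_BaseAutoModelClass` itself.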
+ from_config = copy_func(_BaseAutoModelClass.from_config) + from_config_docstring = insert_head_doc( + FROM_CONFIG_DOCSTRING, head_doc=head_doc) + from_config_docstring = from_config_docstring.replace( + "BaseAutoModelClass", name) + from_config_docstring = from_config_docstring.replace( + "checkpoint_placeholder", checkpoint_for_example) + from_config.__doc__ = from_config_docstring + from_config = replace_list_option_in_docstrings( + model_mapping._model_mapping, use_model_types=False)(from_config) + cls.from_config = classmethod(from_config) + + if name.startswith("TF"): + from_pretrained_docstring = FROM_PRETRAINED_TF_DOCSTRING + elif name.startswith("Flax"): + from_pretrained_docstring = FROM_PRETRAINED_FLAX_DOCSTRING + else: + from_pretrained_docstring = FROM_PRETRAINED_TORCH_DOCSTRING + from_pretrained = copy_func(_BaseAutoModelClass.from_pretrained) + from_pretrained_docstring = insert_head_doc( + from_pretrained_docstring, head_doc=head_doc) + from_pretrained_docstring = from_pretrained_docstring.replace( + "BaseAutoModelClass", name) + from_pretrained_docstring = from_pretrained_docstring.replace( + "checkpoint_placeholder", checkpoint_for_example) + shortcut = checkpoint_for_example.split("/")[-1].split("-")[0] + from_pretrained_docstring = from_pretrained_docstring.replace( + "shortcut_placeholder", shortcut) + from_pretrained.__doc__ = from_pretrained_docstring + from_pretrained = replace_list_option_in_docstrings( + model_mapping._model_mapping)(from_pretrained) + cls.from_pretrained = classmethod(from_pretrained) + return cls + + +def get_values(model_mapping): + result = [] + for model in model_mapping.values(): + if isinstance(model, (list, tuple)): + result += list(model) + else: + result.append(model) + + return result + + +def getattribute_from_module(module, attr): + if attr is None: + return None + if isinstance(attr, tuple): + return tuple(getattribute_from_module(module, a) for a in attr) + if hasattr(module, attr): + return getattr(module, attr) + # Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the + # object at the top level. + transformers_module = importlib.import_module("fengshen") + return getattribute_from_module(transformers_module, attr) + + +class _LazyAutoMapping(OrderedDict): + """ + " A mapping config to object (model or tokenizer for instance) that will load keys and values when it is accessed. 
+ + Args: + + - config_mapping: The map model type to config class + - model_mapping: The map model type to model (or tokenizer) class + """ + + def __init__(self, config_mapping, model_mapping): + self._config_mapping = config_mapping + self._reverse_config_mapping = { + v: k for k, v in config_mapping.items()} + self._model_mapping = model_mapping + self._extra_content = {} + self._modules = {} + + def __getitem__(self, key): + if key in self._extra_content: + return self._extra_content[key] + model_type = self._reverse_config_mapping[key.__name__] + if model_type not in self._model_mapping: + raise KeyError(key) + model_name = self._model_mapping[model_type] + return self._load_attr_from_module(model_type, model_name) + + def _load_attr_from_module(self, model_type, attr): + module_name = model_type_to_module_name(model_type) + if module_name not in self._modules: + self._modules[module_name] = importlib.import_module( + f".{module_name}", "fengshen.models") + return getattribute_from_module(self._modules[module_name], attr) + + def keys(self): + mapping_keys = [ + self._load_attr_from_module(key, name) + for key, name in self._config_mapping.items() + if key in self._model_mapping.keys() + ] + return mapping_keys + list(self._extra_content.keys()) + + def get(self, key, default): + try: + return self.__getitem__(key) + except KeyError: + return default + + def __bool__(self): + return bool(self.keys()) + + def values(self): + mapping_values = [ + self._load_attr_from_module(key, name) + for key, name in self._model_mapping.items() + if key in self._config_mapping.keys() + ] + return mapping_values + list(self._extra_content.values()) + + def items(self): + mapping_items = [ + ( + self._load_attr_from_module(key, self._config_mapping[key]), + self._load_attr_from_module(key, self._model_mapping[key]), + ) + for key in self._model_mapping.keys() + if key in self._config_mapping.keys() + ] + return mapping_items + list(self._extra_content.items()) + + def __iter__(self): + return iter(self.keys()) + + def __contains__(self, item): + if item in self._extra_content: + return True + if not hasattr(item, "__name__") or item.__name__ not in self._reverse_config_mapping: + return False + model_type = self._reverse_config_mapping[item.__name__] + return model_type in self._model_mapping + + def register(self, key, value): + """ + Register a new model in this mapping. + """ + if hasattr(key, "__name__") and key.__name__ in self._reverse_config_mapping: + model_type = self._reverse_config_mapping[key.__name__] + if model_type in self._model_mapping.keys(): + raise ValueError( + f"'{key}' is already used by a Transformers model.") + + self._extra_content[key] = value diff --git a/fengshen/models/auto/configuration_auto.py b/fengshen/models/auto/configuration_auto.py new file mode 100644 index 0000000000000000000000000000000000000000..81676226e57ca519273b98328a1afe6961c37ce3 --- /dev/null +++ b/fengshen/models/auto/configuration_auto.py @@ -0,0 +1,403 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Config class.""" +import importlib +import re +import warnings +from collections import OrderedDict +from typing import List, Union + +from transformers.configuration_utils import PretrainedConfig +from transformers.file_utils import CONFIG_NAME +from transformers.utils import logging +from .dynamic import get_class_from_dynamic_module + + +logger = logging.get_logger(__name__) + +CONFIG_MAPPING_NAMES = OrderedDict( + [ + # Add configs here + ("roformer", "RoFormerConfig"), + ("longformer", "LongformerConfig"), + ] +) + +CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( + [ + # Add archive maps here + ("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("longformer", "LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ] +) + +MODEL_NAMES_MAPPING = OrderedDict( + [ + # Add full (and cased) model names here + ("roformer", "Roformer"), + ("longformer", "Longformer"), + ] +) + +SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict([("openai-gpt", "openai")]) + + +def model_type_to_module_name(key): + """Converts a config key to the corresponding module.""" + # Special treatment + if key in SPECIAL_MODEL_TYPE_TO_MODULE_NAME: + return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key] + + return key.replace("-", "_") + + +def config_class_to_model_type(config): + """Converts a config class name to the corresponding model type""" + for key, cls in CONFIG_MAPPING_NAMES.items(): + if cls == config: + return key + return None + + +class _LazyConfigMapping(OrderedDict): + """ + A dictionary that lazily load its values when they are requested. + """ + + def __init__(self, mapping): + self._mapping = mapping + self._extra_content = {} + self._modules = {} + + def __getitem__(self, key): + if key in self._extra_content: + return self._extra_content[key] + if key not in self._mapping: + raise KeyError(key) + value = self._mapping[key] + module_name = model_type_to_module_name(key) + if module_name not in self._modules: + self._modules[module_name] = importlib.import_module(f".{module_name}", "fengshen.models") + + return getattr(self._modules[module_name], value) + + def keys(self): + return list(self._mapping.keys()) + list(self._extra_content.keys()) + + def values(self): + return [self[k] for k in self._mapping.keys()] + list(self._extra_content.values()) + + def items(self): + return [(k, self[k]) for k in self._mapping.keys()] + list(self._extra_content.items()) + + def __iter__(self): + return iter(list(self._mapping.keys()) + list(self._extra_content.keys())) + + def __contains__(self, item): + return item in self._mapping or item in self._extra_content + + def register(self, key, value): + """ + Register a new configuration in this mapping. + """ + if key in self._mapping.keys(): + raise ValueError(f"'{key}' is already used by a Transformers config, pick another name.") + self._extra_content[key] = value + + +CONFIG_MAPPING = _LazyConfigMapping(CONFIG_MAPPING_NAMES) + + +class _LazyLoadAllMappings(OrderedDict): + """ + A mapping that will load all pairs of key values at the first access (either by indexing, requestions keys, values, + etc.) + + Args: + mapping: The mapping to load. + """ + + def __init__(self, mapping): + self._mapping = mapping + self._initialized = False + self._data = {} + + def _initialize(self): + if self._initialized: + return + warnings.warn( + "ALL_PRETRAINED_CONFIG_ARCHIVE_MAP is deprecated and will be removed in v5 of Transformers. 
" + "It does not contain all available model checkpoints, far from it. Checkout hf.co/models for that.", + FutureWarning, + ) + + for model_type, map_name in self._mapping.items(): + module_name = model_type_to_module_name(model_type) + module = importlib.import_module(f".{module_name}", "transformers.models") + mapping = getattr(module, map_name) + self._data.update(mapping) + + self._initialized = True + + def __getitem__(self, key): + self._initialize() + return self._data[key] + + def keys(self): + self._initialize() + return self._data.keys() + + def values(self): + self._initialize() + return self._data.values() + + def items(self): + self._initialize() + return self._data.keys() + + def __iter__(self): + self._initialize() + return iter(self._data) + + def __contains__(self, item): + self._initialize() + return item in self._data + + +ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = _LazyLoadAllMappings(CONFIG_ARCHIVE_MAP_MAPPING_NAMES) + + +def _get_class_name(model_class: Union[str, List[str]]): + if isinstance(model_class, (list, tuple)): + return " or ".join([f"[`{c}`]" for c in model_class if c is not None]) + return f"[`{model_class}`]" + + +def _list_model_options(indent, config_to_class=None, use_model_types=True): + if config_to_class is None and not use_model_types: + raise ValueError("Using `use_model_types=False` requires a `config_to_class` dictionary.") + if use_model_types: + if config_to_class is None: + model_type_to_name = {model_type: f"[`{config}`]" for model_type, config in CONFIG_MAPPING_NAMES.items()} + else: + model_type_to_name = { + model_type: _get_class_name(model_class) + for model_type, model_class in config_to_class.items() + if model_type in MODEL_NAMES_MAPPING + } + lines = [ + f"{indent}- **{model_type}** -- {model_type_to_name[model_type]} ({MODEL_NAMES_MAPPING[model_type]} model)" + for model_type in sorted(model_type_to_name.keys()) + ] + else: + config_to_name = { + CONFIG_MAPPING_NAMES[config]: _get_class_name(clas) + for config, clas in config_to_class.items() + if config in CONFIG_MAPPING_NAMES + } + config_to_model_name = { + config: MODEL_NAMES_MAPPING[model_type] for model_type, config in CONFIG_MAPPING_NAMES.items() + } + lines = [ + f"{indent}- [`{config_name}`] configuration class: {config_to_name[config_name]} ({config_to_model_name[config_name]} model)" + for config_name in sorted(config_to_name.keys()) + ] + return "\n".join(lines) + + +def replace_list_option_in_docstrings(config_to_class=None, use_model_types=True): + def docstring_decorator(fn): + docstrings = fn.__doc__ + lines = docstrings.split("\n") + i = 0 + while i < len(lines) and re.search(r"^(\s*)List options\s*$", lines[i]) is None: + i += 1 + if i < len(lines): + indent = re.search(r"^(\s*)List options\s*$", lines[i]).groups()[0] + if use_model_types: + indent = f"{indent} " + lines[i] = _list_model_options(indent, config_to_class=config_to_class, use_model_types=use_model_types) + docstrings = "\n".join(lines) + else: + raise ValueError( + f"The function {fn} should have an empty 'List options' in its docstring as placeholder, current docstring is:\n{docstrings}" + ) + fn.__doc__ = docstrings + return fn + + return docstring_decorator + + +class AutoConfig: + r""" + This is a generic configuration class that will be instantiated as one of the configuration classes of the library + when created with the [`~AutoConfig.from_pretrained`] class method. + + This class cannot be instantiated directly using `__init__()` (throws an error). 
+ """ + + def __init__(self): + raise EnvironmentError( + "AutoConfig is designed to be instantiated " + "using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method." + ) + + @classmethod + def for_model(cls, model_type: str, *args, **kwargs): + if model_type in CONFIG_MAPPING: + config_class = CONFIG_MAPPING[model_type] + return config_class(*args, **kwargs) + raise ValueError( + f"Unrecognized model identifier: {model_type}. Should contain one of {', '.join(CONFIG_MAPPING.keys())}" + ) + + @classmethod + @replace_list_option_in_docstrings() + def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): + r""" + Instantiate one of the configuration classes of the library from a pretrained model configuration. + + The configuration class to instantiate is selected based on the `model_type` property of the config object that + is loaded, or when it's missing, by falling back to using pattern matching on `pretrained_model_name_or_path`: + + List options + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + Can be either: + + - A string, the *model id* of a pretrained model configuration hosted inside a model repo on + huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or + namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. + - A path to a *directory* containing a configuration file saved using the + [`~PretrainedConfig.save_pretrained`] method, or the [`~PreTrainedModel.save_pretrained`] method, + e.g., `./my_model_directory/`. + - A path or url to a saved configuration JSON *file*, e.g., + `./my_model_directory/configuration.json`. + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the + standard cache should not be used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download the model weights and configuration files and override the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received files. Will attempt to resume the download if such a + file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. + return_unused_kwargs (`bool`, *optional*, defaults to `False`): + If `False`, then this function returns just the final configuration object. + + If `True`, then this functions returns a `Tuple(config, unused_kwargs)` where *unused_kwargs* is a + dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the + part of `kwargs` which has not been used to update `config` and is otherwise ignored. + trust_remote_code (`bool`, *optional*, defaults to `False`): + Whether or not to allow for custom models defined on the Hub in their own modeling files. This option + should only be set to `True` for repositories you trust and in which you have read the code, as it will + execute code present on the Hub on your local machine. 
+ kwargs(additional keyword arguments, *optional*): + The values in kwargs of any keys which are configuration attributes will be used to override the loaded + values. Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled + by the `return_unused_kwargs` keyword parameter. + + Examples: + + ```python + >>> from transformers import AutoConfig + + >>> # Download configuration from huggingface.co and cache. + >>> config = AutoConfig.from_pretrained("bert-base-uncased") + + >>> # Download configuration from huggingface.co (user-uploaded) and cache. + >>> config = AutoConfig.from_pretrained("dbmdz/bert-base-german-cased") + + >>> # If configuration file is in a directory (e.g., was saved using *save_pretrained('./test/saved_model/')*). + >>> config = AutoConfig.from_pretrained("./test/bert_saved_model/") + + >>> # Load a specific configuration file. + >>> config = AutoConfig.from_pretrained("./test/bert_saved_model/my_configuration.json") + + >>> # Change some config attributes when loading a pretrained config. + >>> config = AutoConfig.from_pretrained("bert-base-uncased", output_attentions=True, foo=False) + >>> config.output_attentions + True + + >>> config, unused_kwargs = AutoConfig.from_pretrained( + ... "bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True + ... ) + >>> config.output_attentions + True + + >>> config.unused_kwargs + {'foo': False} + ```""" + kwargs["_from_auto"] = True + kwargs["name_or_path"] = pretrained_model_name_or_path + trust_remote_code = kwargs.pop("trust_remote_code", False) + config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) + if "auto_map" in config_dict and "AutoConfig" in config_dict["auto_map"]: + if not trust_remote_code: + raise ValueError( + f"Loading {pretrained_model_name_or_path} requires you to execute the configuration file in that repo " + "on your local machine. Make sure you have read the code there to avoid malicious use, then set " + "the option `trust_remote_code=True` to remove this error." + ) + if kwargs.get("revision", None) is None: + logger.warn( + "Explicitly passing a `revision` is encouraged when loading a configuration with custom code to " + "ensure no malicious code has been contributed in a newer revision." + ) + class_ref = config_dict["auto_map"]["AutoConfig"] + module_file, class_name = class_ref.split(".") + config_class = get_class_from_dynamic_module( + pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs + ) + return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif "model_type" in config_dict: + config_class = CONFIG_MAPPING[config_dict["model_type"]] + return config_class.from_dict(config_dict, **kwargs) + else: + # Fallback: use pattern matching on the string. + for pattern, config_class in CONFIG_MAPPING.items(): + if pattern in str(pretrained_model_name_or_path): + return config_class.from_dict(config_dict, **kwargs) + + raise ValueError( + f"Unrecognized model in {pretrained_model_name_or_path}. " + f"Should have a `model_type` key in its {CONFIG_NAME}, or contain one of the following strings " + f"in its name: {', '.join(CONFIG_MAPPING.keys())}" + ) + + @staticmethod + def register(model_type, config): + """ + Register a new configuration for this class. + + Args: + model_type (`str`): The model type like "bert" or "gpt". + config ([`PretrainedConfig`]): The config to register. 
+ """ + if issubclass(config, PretrainedConfig) and config.model_type != model_type: + raise ValueError( + "The config you are passing has a `model_type` attribute that is not consistent with the model type " + f"you passed (config has {config.model_type} and you passed {model_type}. Fix one of those so they " + "match!" + ) + CONFIG_MAPPING.register(model_type, config) diff --git a/fengshen/models/auto/dynamic.py b/fengshen/models/auto/dynamic.py new file mode 100644 index 0000000000000000000000000000000000000000..5760f6e9292195674d7096996cf3cc0ac35aa0c4 --- /dev/null +++ b/fengshen/models/auto/dynamic.py @@ -0,0 +1,235 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Utilities to dynamically load model and tokenizer from the Hub.""" + +import importlib +import os +import re +import shutil +import sys +from pathlib import Path +from typing import Dict, Optional, Union + +from transformers.file_utils import ( + HF_MODULES_CACHE, + TRANSFORMERS_DYNAMIC_MODULE_NAME, + cached_path, + hf_bucket_url, + is_offline_mode, +) +from transformers.utils import logging + + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + + +def init_hf_modules(): + """ + Creates the cache directory for modules with an init, and adds it to the Python path. + """ + # This function has already been executed if HF_MODULES_CACHE already is in the Python path. + if HF_MODULES_CACHE in sys.path: + return + + sys.path.append(HF_MODULES_CACHE) + os.makedirs(HF_MODULES_CACHE, exist_ok=True) + init_path = Path(HF_MODULES_CACHE) / "__init__.py" + if not init_path.exists(): + init_path.touch() + + +def create_dynamic_module(name: Union[str, os.PathLike]): + """ + Creates a dynamic module in the cache directory for modules. + """ + init_hf_modules() + dynamic_module_path = Path(HF_MODULES_CACHE) / name + # If the parent module does not exist yet, recursively create it. + if not dynamic_module_path.parent.exists(): + create_dynamic_module(dynamic_module_path.parent) + os.makedirs(dynamic_module_path, exist_ok=True) + init_path = dynamic_module_path / "__init__.py" + if not init_path.exists(): + init_path.touch() + + +def check_imports(filename): + """ + Check if the current Python environment contains all the libraries that are imported in a file. 
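+    Raises an `ImportError` listing the missing packages, if any.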
+ """ + with open(filename, "r", encoding="utf-8") as f: + content = f.read() + + # Imports of the form `import xxx` + imports = re.findall("^\s*import\s+(\S+)\s*$", content, flags=re.MULTILINE) + # Imports of the form `from xxx import yyy` + imports += re.findall("^\s*from\s+(\S+)\s+import", content, flags=re.MULTILINE) + # Only keep the top-level module + imports = [imp.split(".")[0] for imp in imports if not imp.startswith(".")] + + # Unique-ify and test we got them all + imports = list(set(imports)) + missing_packages = [] + for imp in imports: + try: + importlib.import_module(imp) + except ImportError: + missing_packages.append(imp) + + if len(missing_packages) > 0: + raise ImportError( + "This modeling file requires the following packages that were not found in your environment: " + f"{', '.join(missing_packages)}. Run `pip install {' '.join(missing_packages)}`" + ) + + +def get_class_in_module(class_name, module_path): + """ + Import a module on the cache directory for modules and extract a class from it. + """ + module_path = module_path.replace(os.path.sep, ".") + module = importlib.import_module(module_path) + return getattr(module, class_name) + + +def get_class_from_dynamic_module( + pretrained_model_name_or_path: Union[str, os.PathLike], + module_file: str, + class_name: str, + cache_dir: Optional[Union[str, os.PathLike]] = None, + force_download: bool = False, + resume_download: bool = False, + proxies: Optional[Dict[str, str]] = None, + use_auth_token: Optional[Union[bool, str]] = None, + revision: Optional[str] = None, + local_files_only: bool = False, + **kwargs, +): + """ + Extracts a class from a module file, present in the local folder or repository of a model. + + + + Calling this function will execute the code in the module file found locally or downloaded from the Hub. It should + therefore only be called on trusted repos. + + + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + This can be either: + + - a string, the *model id* of a pretrained model configuration hosted inside a model repo on + huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced + under a user or organization name, like `dbmdz/bert-base-german-cased`. + - a path to a *directory* containing a configuration file saved using the + [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. + + module_file (`str`): + The name of the module file containing the class to look for. + class_name (`str`): + The name of the class to import in the module. + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the standard + cache should not be used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force to (re-)download the configuration files and override the cached versions if they + exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received file. Attempts to resume the download if such a file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated + when running `transformers-cli login` (stored in `~/.huggingface`). 
+ revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. + local_files_only (`bool`, *optional*, defaults to `False`): + If `True`, will only try to load the tokenizer configuration from local files. + + + + Passing `use_auth_token=True` is required when you want to use a private model. + + + + Returns: + `type`: The class, dynamically imported from the module. + + Examples: + + ```python + # Download module *modeling.py* from huggingface.co and cache then extract the class *MyBertModel* from this + # module. + cls = get_class_from_dynamic_module("sgugger/my-bert-model", "modeling.py", "MyBertModel") + ```""" + if is_offline_mode() and not local_files_only: + logger.info("Offline mode: forcing local_files_only=True") + local_files_only = True + + # Download and cache module_file from the repo `pretrained_model_name_or_path` of grab it if it's a local file. + pretrained_model_name_or_path = str(pretrained_model_name_or_path) + if os.path.isdir(pretrained_model_name_or_path): + module_file_or_url = os.path.join(pretrained_model_name_or_path, module_file) + submodule = "local" + else: + module_file_or_url = hf_bucket_url( + pretrained_model_name_or_path, filename=module_file, revision=revision, mirror=None + ) + submodule = pretrained_model_name_or_path.replace("/", os.path.sep) + + try: + # Load from URL or cache if already cached + resolved_module_file = cached_path( + module_file_or_url, + cache_dir=cache_dir, + force_download=force_download, + proxies=proxies, + resume_download=resume_download, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + ) + + except EnvironmentError: + logger.error(f"Could not locate the {module_file} inside {pretrained_model_name_or_path}.") + raise + + # Check we have all the requirements in our environment + check_imports(resolved_module_file) + + # Now we move the module inside our cached dynamic modules. + full_submodule = TRANSFORMERS_DYNAMIC_MODULE_NAME + os.path.sep + submodule + create_dynamic_module(full_submodule) + submodule_path = Path(HF_MODULES_CACHE) / full_submodule + if submodule == "local": + # We always copy local files (we could hash the file to see if there was a change, and give them the name of + # that hash, to only copy when there is a modification but it seems overkill for now). + # The only reason we do the copy is to avoid putting too many folders in sys.path. + module_name = module_file + shutil.copy(resolved_module_file, submodule_path / module_file) + else: + # The module file will end up being named module_file + the etag. This way we get the benefit of versioning. 
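+        # `resolved_module_file` points at the cached copy whose filename embeds the etag, so the module
+        # name built below stays unique per remote revision and the copy is skipped when it already exists.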
+ resolved_module_file_name = Path(resolved_module_file).name + module_name_parts = [module_file.replace(".py", "")] + resolved_module_file_name.split(".") + module_name = "_".join(module_name_parts) + ".py" + if not (submodule_path / module_name).exists(): + shutil.copy(resolved_module_file, submodule_path / module_name) + + # And lastly we get the class inside our newly created module + final_module = os.path.join(full_submodule, module_name.replace(".py", "")) + return get_class_in_module(class_name, final_module) diff --git a/fengshen/models/auto/modeling_auto.py b/fengshen/models/auto/modeling_auto.py new file mode 100644 index 0000000000000000000000000000000000000000..3805e86d239d63d826092fa811261b2334e608f7 --- /dev/null +++ b/fengshen/models/auto/modeling_auto.py @@ -0,0 +1,272 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Model class.""" + +import warnings +from collections import OrderedDict + +from transformers.utils import logging +from .auto_factory import _BaseAutoModelClass, _LazyAutoMapping, auto_class_update +from .configuration_auto import CONFIG_MAPPING_NAMES + + +logger = logging.get_logger(__name__) + + +MODEL_MAPPING_NAMES = OrderedDict( + [ + # Base model mapping + ("roformer", "RoFormerModel"), + ("longformer", "LongformerModel"), + ] +) + +MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict( + [ + # Model for pre-training mapping + ("longformer", "LongformerForMaskedLM"), + ] +) + +MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict( + [ + # Model with LM heads mapping + ("roformer", "RoFormerForMaskedLM"), + ("longformer", "LongformerForMaskedLM"), + ] +) + +MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict( + [ + # Model for Causal LM mapping + ("roformer", "RoFormerForCausalLM"), + ] +) + + +MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict( + [ + # Model for Masked LM mapping + ("roformer", "RoFormerForMaskedLM"), + ("longformer", "LongformerForMaskedLM"), + ] +) + + +MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict( + [ + # Model for Seq2Seq Causal LM mapping + ("t5", "T5ForConditionalGeneration"), + + ] +) + +MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( + [ + ("speech-encoder-decoder", "SpeechEncoderDecoderModel"), + ("speech_to_text", "Speech2TextForConditionalGeneration"), + ] +) + +MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( + [ + # Model for Sequence Classification mapping + ("roformer", "RoFormerForSequenceClassification"), + ("longformer", "LongformerForSequenceClassification"), + ] +) + +MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict( + [ + # Model for Question Answering mapping + ("roformer", "RoFormerForQuestionAnswering"), + ("longformer", "LongformerForQuestionAnswering"), + ] +) + +MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict( + [ + # Model for Table Question Answering mapping + ("tapas", "TapasForQuestionAnswering"), + ] +) + +MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict( + [ + # Model for Token 
Classification mapping + ("roformer", "RoFormerForTokenClassification"), + ("longformer", "LongformerForTokenClassification"), + ] +) + +MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES = OrderedDict( + [ + # Model for Multiple Choice mapping + ("roformer", "RoFormerForMultipleChoice"), + ("longformer", "LongformerForMultipleChoice"), + ] +) + + + + +MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_MAPPING_NAMES) + +MODEL_FOR_PRETRAINING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_PRETRAINING_MAPPING_NAMES) + +MODEL_WITH_LM_HEAD_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_WITH_LM_HEAD_MAPPING_NAMES) + +MODEL_FOR_CAUSAL_LM_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_CAUSAL_LM_MAPPING_NAMES) + +MODEL_FOR_MASKED_LM_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_MASKED_LM_MAPPING_NAMES) + +MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES +) +MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES +) +MODEL_FOR_QUESTION_ANSWERING_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES +) +MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING_NAMES +) +MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES +) +MODEL_FOR_MULTIPLE_CHOICE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES) + +MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES) + + + +class AutoModel(_BaseAutoModelClass): + _model_mapping = MODEL_MAPPING + + +AutoModel = auto_class_update(AutoModel) + + +class AutoModelForPreTraining(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_PRETRAINING_MAPPING + + +AutoModelForPreTraining = auto_class_update(AutoModelForPreTraining, head_doc="pretraining") + + +# Private on purpose, the public class will add the deprecation warnings. 
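+# The deprecated public `AutoModelWithLMHead` defined at the end of this file only wraps it with a
+# `FutureWarning`; new code should call `AutoModelForCausalLM`, `AutoModelForMaskedLM` or
+# `AutoModelForSeq2SeqLM` directly, e.g. (checkpoint name purely illustrative):
+#     AutoModelForMaskedLM.from_pretrained("IDEA-CCNL/some-longformer-checkpoint")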
+class _AutoModelWithLMHead(_BaseAutoModelClass): + _model_mapping = MODEL_WITH_LM_HEAD_MAPPING + + +_AutoModelWithLMHead = auto_class_update(_AutoModelWithLMHead, head_doc="language modeling") + + +class AutoModelForCausalLM(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_CAUSAL_LM_MAPPING + + +AutoModelForCausalLM = auto_class_update(AutoModelForCausalLM, head_doc="causal language modeling") + + +class AutoModelForMaskedLM(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_MASKED_LM_MAPPING + + +AutoModelForMaskedLM = auto_class_update(AutoModelForMaskedLM, head_doc="masked language modeling") + + +class AutoModelForSeq2SeqLM(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING + + +AutoModelForSeq2SeqLM = auto_class_update( + AutoModelForSeq2SeqLM, head_doc="sequence-to-sequence language modeling", checkpoint_for_example="t5-base" +) + + +class AutoModelForSequenceClassification(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING + + +AutoModelForSequenceClassification = auto_class_update( + AutoModelForSequenceClassification, head_doc="sequence classification" +) + + +class AutoModelForQuestionAnswering(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_QUESTION_ANSWERING_MAPPING + + +AutoModelForQuestionAnswering = auto_class_update(AutoModelForQuestionAnswering, head_doc="question answering") + + +class AutoModelForTableQuestionAnswering(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING + + +AutoModelForTableQuestionAnswering = auto_class_update( + AutoModelForTableQuestionAnswering, + head_doc="table question answering", + checkpoint_for_example="google/tapas-base-finetuned-wtq", +) + + +class AutoModelForTokenClassification(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING + + +AutoModelForTokenClassification = auto_class_update(AutoModelForTokenClassification, head_doc="token classification") + + +class AutoModelForMultipleChoice(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_MULTIPLE_CHOICE_MAPPING + + +AutoModelForMultipleChoice = auto_class_update(AutoModelForMultipleChoice, head_doc="multiple choice") + + + +class AutoModelForSpeechSeq2Seq(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING + + +AutoModelForSpeechSeq2Seq = auto_class_update( + AutoModelForSpeechSeq2Seq, head_doc="sequence-to-sequence speech-to-text modeing" +) + + + +class AutoModelWithLMHead(_AutoModelWithLMHead): + @classmethod + def from_config(cls, config): + warnings.warn( + "The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use " + "`AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and " + "`AutoModelForSeq2SeqLM` for encoder-decoder models.", + FutureWarning, + ) + return super().from_config(config) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + warnings.warn( + "The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. 
Please use " + "`AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and " + "`AutoModelForSeq2SeqLM` for encoder-decoder models.", + FutureWarning, + ) + return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) diff --git a/fengshen/models/auto/tokenization_auto.py b/fengshen/models/auto/tokenization_auto.py new file mode 100644 index 0000000000000000000000000000000000000000..6555191bef55336708cabc5e9b17c0322318a417 --- /dev/null +++ b/fengshen/models/auto/tokenization_auto.py @@ -0,0 +1,449 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Tokenizer class.""" + +import importlib +import json +import os +from collections import OrderedDict +from pathlib import Path +from typing import TYPE_CHECKING, Dict, Optional, Tuple, Union + +from transformers.configuration_utils import PretrainedConfig +from transformers.file_utils import ( + cached_path, + get_list_of_files, + hf_bucket_url, + is_offline_mode, + is_sentencepiece_available, + is_tokenizers_available, +) +from transformers.tokenization_utils import PreTrainedTokenizer +from transformers.tokenization_utils_base import TOKENIZER_CONFIG_FILE +from transformers.tokenization_utils_fast import PreTrainedTokenizerFast +from transformers.utils import logging +# from ..encoder_decoder import EncoderDecoderConfig +from .auto_factory import _LazyAutoMapping +from .configuration_auto import ( + CONFIG_MAPPING_NAMES, + AutoConfig, + config_class_to_model_type, + model_type_to_module_name, + replace_list_option_in_docstrings, +) +from .dynamic import get_class_from_dynamic_module + + +logger = logging.get_logger(__name__) + +if TYPE_CHECKING: + # This significantly improves completion suggestion performance when + # the transformers package is used with Microsoft's Pylance language server. 
+ TOKENIZER_MAPPING_NAMES: OrderedDict[str, + Tuple[Optional[str], Optional[str]]] = OrderedDict() +else: + TOKENIZER_MAPPING_NAMES = OrderedDict( + [ + ("roformer", ("RoFormerTokenizer", None)), + ("longformer", ("LongformerTokenizer", None)), + ] + ) + +TOKENIZER_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, TOKENIZER_MAPPING_NAMES) + +CONFIG_TO_TYPE = {v: k for k, v in CONFIG_MAPPING_NAMES.items()} + + +def tokenizer_class_from_name(class_name: str): + if class_name == "PreTrainedTokenizerFast": + return PreTrainedTokenizerFast + + for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items(): + if class_name in tokenizers: + module_name = model_type_to_module_name(module_name) + + module = importlib.import_module( + f".{module_name}", "transformers.models") + return getattr(module, class_name) + + for config, tokenizers in TOKENIZER_MAPPING._extra_content.items(): + for tokenizer in tokenizers: + if getattr(tokenizer, "__name__", None) == class_name: + return tokenizer + + return None + + +def get_tokenizer_config( + pretrained_model_name_or_path: Union[str, os.PathLike], + cache_dir: Optional[Union[str, os.PathLike]] = None, + force_download: bool = False, + resume_download: bool = False, + proxies: Optional[Dict[str, str]] = None, + use_auth_token: Optional[Union[bool, str]] = None, + revision: Optional[str] = None, + local_files_only: bool = False, + **kwargs, +): + """ + Loads the tokenizer configuration from a pretrained model tokenizer configuration. + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + This can be either: + + - a string, the *model id* of a pretrained model configuration hosted inside a model repo on + huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced + under a user or organization name, like `dbmdz/bert-base-german-cased`. + - a path to a *directory* containing a configuration file saved using the + [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. + + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the standard + cache should not be used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force to (re-)download the configuration files and override the cached versions if they + exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received file. Attempts to resume the download if such a file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated + when running `transformers-cli login` (stored in `~/.huggingface`). + revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. + local_files_only (`bool`, *optional*, defaults to `False`): + If `True`, will only try to load the tokenizer configuration from local files. + + + + Passing `use_auth_token=True` is required when you want to use a private model. 
+ + + + Returns: + `Dict`: The configuration of the tokenizer. + + Examples: + + ```python + # Download configuration from huggingface.co and cache. + tokenizer_config = get_tokenizer_config("bert-base-uncased") + # This model does not have a tokenizer config so the result will be an empty dict. + tokenizer_config = get_tokenizer_config("xlm-roberta-base") + + # Save a pretrained tokenizer locally and you can reload its config + from transformers import AutoTokenizer + + tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") + tokenizer.save_pretrained("tokenizer-test") + tokenizer_config = get_tokenizer_config("tokenizer-test") + ```""" + if is_offline_mode() and not local_files_only: + logger.info("Offline mode: forcing local_files_only=True") + local_files_only = True + + # Will raise a ValueError if `pretrained_model_name_or_path` is not a valid path or model identifier + repo_files = get_list_of_files( + pretrained_model_name_or_path, + revision=revision, + use_auth_token=use_auth_token, + local_files_only=local_files_only, + ) + if TOKENIZER_CONFIG_FILE not in [Path(f).name for f in repo_files]: + return {} + + pretrained_model_name_or_path = str(pretrained_model_name_or_path) + if os.path.isdir(pretrained_model_name_or_path): + config_file = os.path.join( + pretrained_model_name_or_path, TOKENIZER_CONFIG_FILE) + else: + config_file = hf_bucket_url( + pretrained_model_name_or_path, filename=TOKENIZER_CONFIG_FILE, revision=revision, mirror=None + ) + + try: + # Load from URL or cache if already cached + resolved_config_file = cached_path( + config_file, + cache_dir=cache_dir, + force_download=force_download, + proxies=proxies, + resume_download=resume_download, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + ) + + except EnvironmentError: + logger.info( + "Could not locate the tokenizer configuration file, will try to use the model config instead.") + return {} + + with open(resolved_config_file, encoding="utf-8") as reader: + return json.load(reader) + + +class AutoTokenizer: + r""" + This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when + created with the [`AutoTokenizer.from_pretrained`] class method. + + This class cannot be instantiated directly using `__init__()` (throws an error). + """ + + def __init__(self): + raise EnvironmentError( + "AutoTokenizer is designed to be instantiated " + "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method." + ) + + @classmethod + @replace_list_option_in_docstrings(TOKENIZER_MAPPING_NAMES) + def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): + r""" + Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary. + + The tokenizer class to instantiate is selected based on the `model_type` property of the config object (either + passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by + falling back to using pattern matching on `pretrained_model_name_or_path`: + + List options + + Params: + pretrained_model_name_or_path (`str` or `os.PathLike`): + Can be either: + + - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a + user or organization name, like `dbmdz/bert-base-german-cased`. 
+ - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved + using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. + - A path or url to a single saved vocabulary file if and only if the tokenizer only requires a + single vocabulary file (like Bert or XLNet), e.g.: `./my_model_directory/vocab.txt`. (Not + applicable to all derived classes) + inputs (additional positional arguments, *optional*): + Will be passed along to the Tokenizer `__init__()` method. + config ([`PretrainedConfig`], *optional*) + The configuration object used to dertermine the tokenizer class to instantiate. + cache_dir (`str` or `os.PathLike`, *optional*): + Path to a directory in which a downloaded pretrained model configuration should be cached if the + standard cache should not be used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download the model weights and configuration files and override the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to delete incompletely received files. Will attempt to resume the download if such a + file exists. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + revision(`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a + git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any + identifier allowed by git. + subfolder (`str`, *optional*): + In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for + facebook/rag-token-base), specify it here. + use_fast (`bool`, *optional*, defaults to `True`): + Whether or not to try to load the fast version of the tokenizer. + tokenizer_type (`str`, *optional*): + Tokenizer type to be loaded. + trust_remote_code (`bool`, *optional*, defaults to `False`): + Whether or not to allow for custom models defined on the Hub in their own modeling files. This option + should only be set to `True` for repositories you trust and in which you have read the code, as it will + execute code present on the Hub on your local machine. + kwargs (additional keyword arguments, *optional*): + Will be passed to the Tokenizer `__init__()` method. Can be used to set special tokens like + `bos_token`, `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`, + `additional_special_tokens`. See parameters in the `__init__()` for more details. + + Examples: + + ```python + >>> from transformers import AutoTokenizer + + >>> # Download vocabulary from huggingface.co and cache. + >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") + + >>> # Download vocabulary from huggingface.co (user-uploaded) and cache. + >>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased") + + >>> # If vocabulary files are in a directory (e.g. 
tokenizer was saved using *save_pretrained('./test/saved_model/')*) + >>> tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/") + ```""" + config = kwargs.pop("config", None) + kwargs["_from_auto"] = True + + use_fast = kwargs.pop("use_fast", True) + tokenizer_type = kwargs.pop("tokenizer_type", None) + trust_remote_code = kwargs.pop("trust_remote_code", False) + + # First, let's see whether the tokenizer_type is passed so that we can leverage it + if tokenizer_type is not None: + tokenizer_class = None + tokenizer_class_tuple = TOKENIZER_MAPPING_NAMES.get( + tokenizer_type, None) + + if tokenizer_class_tuple is None: + raise ValueError( + f"Passed `tokenizer_type` {tokenizer_type} does not exist. `tokenizer_type` should be one of " + f"{', '.join(c for c in TOKENIZER_MAPPING_NAMES.keys())}." + ) + + tokenizer_class_name, tokenizer_fast_class_name = tokenizer_class_tuple + + if use_fast and tokenizer_fast_class_name is not None: + tokenizer_class = tokenizer_class_from_name( + tokenizer_fast_class_name) + + if tokenizer_class is None: + tokenizer_class = tokenizer_class_from_name( + tokenizer_class_name) + + if tokenizer_class is None: + raise ValueError( + f"Tokenizer class {tokenizer_class_name} is not currently imported.") + + return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + + # Next, let's try to use the tokenizer_config file to get the tokenizer class. + tokenizer_config = get_tokenizer_config( + pretrained_model_name_or_path, **kwargs) + + config_tokenizer_class = tokenizer_config.get("tokenizer_class") + tokenizer_auto_map = tokenizer_config.get("auto_map") + + # If that did not work, let's try to use the config. + if config_tokenizer_class is None: + if not isinstance(config, PretrainedConfig): + config = AutoConfig.from_pretrained( + pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs + ) + config_tokenizer_class = config.tokenizer_class + if hasattr(config, "auto_map") and "AutoTokenizer" in config.auto_map: + tokenizer_auto_map = config.auto_map["AutoTokenizer"] + + # If we have the tokenizer class from the tokenizer config or the model config we're good! + if config_tokenizer_class is not None: + tokenizer_class = None + if tokenizer_auto_map is not None: + if not trust_remote_code: + raise ValueError( + f"Loading {pretrained_model_name_or_path} requires you to execute the tokenizer file in that repo " + "on your local machine. Make sure you have read the code there to avoid malicious use, then set " + "the option `trust_remote_code=True` to remove this error." + ) + if kwargs.get("revision", None) is None: + logger.warn( + "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure " + "no malicious code has been contributed in a newer revision." 
+ ) + + if use_fast and tokenizer_auto_map[1] is not None: + class_ref = tokenizer_auto_map[1] + else: + class_ref = tokenizer_auto_map[0] + + module_file, class_name = class_ref.split(".") + tokenizer_class = get_class_from_dynamic_module( + pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs + ) + + elif use_fast and not config_tokenizer_class.endswith("Fast"): + tokenizer_class_candidate = f"{config_tokenizer_class}Fast" + tokenizer_class = tokenizer_class_from_name( + tokenizer_class_candidate) + if tokenizer_class is None: + tokenizer_class_candidate = config_tokenizer_class + tokenizer_class = tokenizer_class_from_name( + tokenizer_class_candidate) + + if tokenizer_class is None: + raise ValueError( + f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported." + ) + return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + + model_type = config_class_to_model_type(type(config).__name__) + if model_type is not None: + tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type( + config)] + if tokenizer_class_fast and (use_fast or tokenizer_class_py is None): + return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + else: + if tokenizer_class_py is not None: + return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + else: + raise ValueError( + "This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed " + "in order to use this tokenizer." + ) + + raise ValueError( + f"Unrecognized configuration class {config.__class__} to build an AutoTokenizer.\n" + f"Model type should be one of {', '.join(c.__name__ for c in TOKENIZER_MAPPING.keys())}." + ) + + def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None): + """ + Register a new tokenizer in this mapping. + + + Args: + config_class ([`PretrainedConfig`]): + The configuration corresponding to the model to register. + slow_tokenizer_class ([`PretrainedTokenizer`], *optional*): + The slow tokenizer to register. + slow_tokenizer_class ([`PretrainedTokenizerFast`], *optional*): + The fast tokenizer to register. + """ + if slow_tokenizer_class is None and fast_tokenizer_class is None: + raise ValueError( + "You need to pass either a `slow_tokenizer_class` or a `fast_tokenizer_class") + if slow_tokenizer_class is not None and issubclass(slow_tokenizer_class, PreTrainedTokenizerFast): + raise ValueError( + "You passed a fast tokenizer in the `slow_tokenizer_class`.") + if fast_tokenizer_class is not None and issubclass(fast_tokenizer_class, PreTrainedTokenizer): + raise ValueError( + "You passed a slow tokenizer in the `fast_tokenizer_class`.") + + if ( + slow_tokenizer_class is not None + and fast_tokenizer_class is not None + and issubclass(fast_tokenizer_class, PreTrainedTokenizerFast) + and fast_tokenizer_class.slow_tokenizer_class != slow_tokenizer_class + ): + raise ValueError( + "The fast tokenizer class you are passing has a `slow_tokenizer_class` attribute that is not " + "consistent with the slow tokenizer class you passed (fast tokenizer has " + f"{fast_tokenizer_class.slow_tokenizer_class} and you passed {slow_tokenizer_class}. Fix one of those " + "so they match!" + ) + + # Avoid resetting a set slow/fast tokenizer if we are passing just the other ones. 
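+        # TOKENIZER_MAPPING._extra_content holds the pairs registered at runtime, so an
+        # already-registered (slow, fast) pair is merged with the newly passed classes below.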
+ if config_class in TOKENIZER_MAPPING._extra_content: + existing_slow, existing_fast = TOKENIZER_MAPPING[config_class] + if slow_tokenizer_class is None: + slow_tokenizer_class = existing_slow + if fast_tokenizer_class is None: + fast_tokenizer_class = existing_fast + + TOKENIZER_MAPPING.register( + config_class, (slow_tokenizer_class, fast_tokenizer_class)) diff --git a/fengshen/models/bart/modeling_bart.py b/fengshen/models/bart/modeling_bart.py new file mode 100644 index 0000000000000000000000000000000000000000..f9a58ac8036fbc0bb9334b083b12a5599950d355 --- /dev/null +++ b/fengshen/models/bart/modeling_bart.py @@ -0,0 +1,423 @@ +import warnings +from pytorch_lightning import LightningModule +from fengshen.models import transformer_utils + +import torch +import torch.utils.checkpoint +from torch import nn +import torch.nn.functional as F + +from dataclasses import dataclass +from typing import Optional, Tuple + +from transformers.file_utils import * +from transformers.modeling_outputs import * +from transformers.models.bart import * +from transformers.models.bart.modeling_bart import BartClassificationHead + + +_CONFIG_FOR_DOC = "BartConfig" + + +# ------------------------ ZZ: CBart addition ------------------------ + + +def _reorder_buffer(attn_cache, new_order): + for k, input_buffer_k in attn_cache.items(): + if input_buffer_k is not None: + attn_cache[k] = input_buffer_k.index_select(0, new_order) + return attn_cache + + +def _make_linear_from_emb(emb): + vocab_size, emb_size = emb.weight.shape + lin_layer = nn.Linear(vocab_size, emb_size, bias=False) + lin_layer.weight.data = emb.weight.data + return lin_layer + + +BART_GENERATION_EXAMPLE = r""" + Summarization example:: + + >>> from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig + + >>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn') + >>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn') + + >>> ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs." + >>> inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt') + + >>> # Generate Summary + >>> summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True) + >>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]) + + Mask filling example:: + + >>> from transformers import BartTokenizer, BartForConditionalGeneration + >>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large') + >>> TXT = "My friends are but they eat too many carbs." + + >>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large') + >>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids'] + >>> logits = model(input_ids).logits + + >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() + >>> probs = logits[0, masked_index].softmax(dim=0) + >>> values, predictions = probs.topk(5) + + >>> tokenizer.decode(predictions).split() +""" + + +@dataclass +class CBartLMOutput(ModelOutput): + """ + Base class for CBart specific language models outputs. + + Args: + .... 
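+        loss (`torch.FloatTensor`, *optional*):
+            Combined loss, computed as ``encoder_loss * loss_weight + decoder_loss``.
+        encoder_loss (`torch.FloatTensor`, *optional*):
+            Token-level classification (or regression) loss from the encoder head.
+        decoder_loss (`torch.FloatTensor`, *optional*):
+            Masked language modeling loss from the decoder.
+        encoder_logits (`torch.FloatTensor`):
+            Scores produced by the classification head on the encoder's last hidden state.
+        The remaining fields mirror :class:`~transformers.modeling_outputs.Seq2SeqLMOutput`.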
+ """ + loss: Optional[torch.FloatTensor] = None + encoder_loss: Optional[torch.FloatTensor] = None + decoder_loss: Optional[torch.FloatTensor] = None + encoder_logits: torch.FloatTensor = None + logits: torch.FloatTensor = None + past_key_values: Optional[Tuple[torch.FloatTensor]] = None + decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None + decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None + encoder_last_hidden_state: Optional[torch.FloatTensor] = None + encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None + encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +class BartForTextInfill(BartPretrainedModel): + """ + this class is designed for text infilling. + During training, the encoder is used to predict replace, insert, + and the decoder is used to generate original input. + Compared with BartForConditionalGeneration class, + we add a module over the encoder and add a new loss for the encoder. + """ + base_model_prefix = "model" + authorized_missing_keys = [r"final_logits_bias", + r"encoder\.version", r"decoder\.version"] + + def __init__(self, config: BartConfig): + super().__init__(config) + base_model = BartModel(config) + self.model = base_model + self.register_buffer("final_logits_bias", torch.zeros( + (1, self.model.shared.num_embeddings))) + # print( config.encoder_loss_type, config.num_labels) + + # add a new attribute into BartConfig class (revise BartConfig) + self.encoder_loss_type = config.encoder_loss_type + self.num_labels = config.num_labels + if self.encoder_loss_type == 0: # 0 is classification loss, 1 is regression loss + # add a classification module for the encoder + self.classification_head = BartClassificationHead( + config.d_model, config.d_model, config.num_labels, config.classif_dropout, + ) + else: + # add a regression module for the encoder + self.classification_head = BartClassificationHead( + config.d_model, config.d_model, 1, config.classif_dropout, + ) + + self.model._init_weights(self.classification_head.dense) + self.model._init_weights(self.classification_head.out_proj) + self.loss_weight = config.loss_weight + self.register_buffer("label_weights", torch.zeros((self.num_labels))) + + def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding: + old_num_tokens = self.model.shared.num_embeddings + new_embeddings = super().resize_token_embeddings(new_num_tokens) + self.model.shared = new_embeddings + self._resize_final_logits_bias(new_num_tokens, old_num_tokens) + return new_embeddings + + def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None: + if new_num_tokens <= old_num_tokens: + new_bias = self.final_logits_bias[:, :new_num_tokens] + else: + extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), + device=self.final_logits_bias.device) + new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1) + self.register_buffer("final_logits_bias", new_bias) + + @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) + @add_end_docstrings(BART_GENERATION_EXAMPLE) + def forward( + self, + input_ids, + attention_mask=None, + encoder_outputs=None, + decoder_input_ids=None, + decoder_attention_mask=None, + past_key_values=None, + encoder_labels=None, + labels=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=True, + **unused, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`): + Labels for computing the 
masked language modeling loss. + Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring). + Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens + with labels in ``[0, ..., config.vocab_size]``. + + Returns: + + Conditional generation example:: + + # Mask filling only works for bart-large + from transformers import BartTokenizer, BartForConditionalGeneration + tokenizer = BartTokenizer.from_pretrained('facebook/bart-large') + TXT = "My friends are but they eat too many carbs." + + model = BartForConditionalGeneration.from_pretrained('facebook/bart-large') + input_ids = tokenizer([TXT], return_tensors='pt')['input_ids'] + logits = model(input_ids).logits + + masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() + probs = logits[0, masked_index].softmax(dim=0) + values, predictions = probs.topk(5) + + tokenizer.decode(predictions).split() + # ['good', 'great', 'all', 'really', 'very'] + """ + if "lm_labels" in unused: + warnings.warn( + "The `lm_labels` argument is deprecated and will be removed in a future version, use `labels` instead.", + FutureWarning, + ) + labels = unused.pop("lm_labels") + if "decoder_cached_states" in unused: + warnings.warn( + "The `decoder_cached_states` argument is deprecated and will be removed in a future version, use `decoder_past_key_values` instead.", + FutureWarning, + ) + decoder_past_key_values = unused.pop("decoder_cached_states") + return_dict = return_dict if return_dict is not None else False + + if labels is not None: + use_cache = False + + outputs = self.model( + input_ids, + attention_mask=attention_mask, + decoder_input_ids=decoder_input_ids, + encoder_outputs=encoder_outputs, + decoder_attention_mask=decoder_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + # logits and loss for the encoder + # last hidden state + encoder_last_hidden_state = outputs['encoder_last_hidden_state'] + # eos_mask = input_ids.eq(self.config.eos_token_id) + # if len(torch.unique(eos_mask.sum(1))) > 1: + # raise ValueError("All examples must have the same number of tokens.") + # sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :] + encoder_logits = self.classification_head(encoder_last_hidden_state) + encoder_loss = None + if encoder_labels is not None: + # classification loss + if self.encoder_loss_type == 0: + # ZZ: seems like MSE loss does not support weighting, so only CEL has weighting applied for now + loss_fct = nn.CrossEntropyLoss(weight=self.label_weights) + encoder_loss = loss_fct( + encoder_logits.view(-1, self.config.num_labels), encoder_labels.view(-1)) + # regression loss + else: + encoder_logits = encoder_logits.view( + encoder_logits.size(0), -1) + encoder_logits = torch.sigmoid( + encoder_logits) * self.num_labels - 0.5 + loss_fct = nn.MSELoss(reduction='none') + _loss = loss_fct(encoder_logits, encoder_labels) + encoder_loss = torch.mean(_loss[encoder_labels >= 0]) + # encoder_loss =_loss[encoder_labels>=0] + + # logits and loss for the decoder + lm_logits = F.linear( + outputs[0], self.model.shared.weight, bias=self.final_logits_bias) + masked_lm_loss = None + if labels is not None: + loss_fct = nn.CrossEntropyLoss() + # TODO(SS): do we need to ignore pad tokens in labels? 
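+            # One option (not enabled here) would be nn.CrossEntropyLoss(ignore_index=self.config.pad_token_id),
+            # so that padding positions in `labels` would not contribute to the decoder loss.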
+ masked_lm_loss = loss_fct( + lm_logits.view(-1, self.config.vocab_size), labels.view(-1)) + + loss = None + if masked_lm_loss is not None and encoder_loss is not None: + loss = encoder_loss * self.loss_weight + masked_lm_loss + + if not return_dict: + output = (lm_logits,) + outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CBartLMOutput( + loss=loss, + encoder_loss=encoder_loss, + decoder_loss=masked_lm_loss, + encoder_logits=encoder_logits, + logits=lm_logits, + past_key_values=outputs.past_key_values, + decoder_hidden_states=outputs.decoder_hidden_states, + decoder_attentions=outputs.decoder_attentions, + encoder_last_hidden_state=outputs.encoder_last_hidden_state, + encoder_hidden_states=outputs.encoder_hidden_states, + encoder_attentions=outputs.encoder_attentions, + ) + + def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs): + assert past is not None, "past has to be defined for encoder_outputs" + + encoder_outputs, past_key_values = past + return { + "input_ids": None, # encoder_outputs is defined. input_ids not needed + "encoder_outputs": encoder_outputs, + "past_key_values": past_key_values, + "decoder_input_ids": decoder_input_ids, + "attention_mask": attention_mask, + # change this to avoid caching (presumably for debugging) + "use_cache": use_cache, + } + + def adjust_logits_during_generation(self, logits, cur_len, max_length): + if cur_len == 1: + self._force_token_ids_generation(logits, self.config.bos_token_id) + if cur_len == max_length - 1 and self.config.eos_token_id is not None: + self._force_token_ids_generation(logits, self.config.eos_token_id) + return logits + + def _force_token_ids_generation(self, scores, token_ids) -> None: + """force one of token_ids to be generated by setting prob of all other tokens to 0""" + if isinstance(token_ids, int): + token_ids = [token_ids] + all_but_token_ids_mask = torch.tensor( + [x for x in range(self.config.vocab_size) if x not in token_ids], + dtype=torch.long, + device=next(self.parameters()).device, + ) + assert len( + scores.shape) == 2, "scores should be of rank 2 with shape: [batch_size, vocab_size]" + scores[:, all_but_token_ids_mask] = -float("inf") + + @staticmethod + def _reorder_cache(past, beam_idx): + ((enc_out, enc_mask), past_key_values) = past + reordered_past = [] + for layer_past in past_key_values: + # get the correct batch idx from decoder layer's batch dim for cross and self-attn + layer_past_new = { + attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items() + } + reordered_past.append(layer_past_new) + + new_enc_out = enc_out if enc_out is None else enc_out.index_select( + 0, beam_idx) + new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select( + 0, beam_idx) + + past = ((new_enc_out, new_enc_mask), reordered_past) + return past + + def get_encoder(self): + return self.model.encoder + + def get_output_embeddings(self): + return _make_linear_from_emb(self.model.shared) # make it on the fly + + def get_encoder_logits(self, input_ids, attention_mask=None): + # print(input_ids, attention_mask) + # encoder_outputs = self.model.get_encoder_outputs( + # self, + # input_ids, + # attention_mask=attention_mask, + # output_attentions=None, + # output_hidden_states=None, + # return_dict=None, + # ) + + encoder_outputs = self.model.encoder( + input_ids=input_ids, + attention_mask=attention_mask, + return_dict=True + ) + # logits and loss for the encoder + # last hidden state + 
encoder_last_hidden_state = encoder_outputs['last_hidden_state'] + encoder_logits = self.classification_head(encoder_last_hidden_state) + + # classification + if self.encoder_loss_type == 0: + # probs = torch.softmax(encoder_logits,dim=-1) + pass + # regression + else: + encoder_logits = encoder_logits.view(encoder_logits.size(0), -1) + encoder_logits = torch.sigmoid( + encoder_logits) * self.num_labels - 0.5 + return encoder_outputs, encoder_logits + + +class CBartLightning(LightningModule): + @staticmethod + def add_module_specific_args(parent_args): + parser = parent_args.add_argument_group("CBart specific parameters") + parser.add_argument('--num_labels', type=int, default=3) + parser.add_argument('--encoder_loss_type', type=int, default=0) + parser.add_argument('--loss_weight', type=float, default=1.0) + parser.add_argument('--label_weights', type=float, nargs='+', default=[1.0, 1.0, 1.0]) + parser.add_argument('--masked_lm', type=float, default=0) + return parent_args + + def __init__( + self, + args, + **kwargs, + ): + super().__init__() + self.save_hyperparameters(args) + self.model = BartForTextInfill.from_pretrained(args.model_path, num_labels=self.hparams.num_labels, + encoder_loss_type=self.hparams.encoder_loss_type, + loss_weight=self.hparams.loss_weight,) + self.model.label_weights = torch.tensor( + self.hparams.label_weights, dtype=torch.half) + + def forward(self, **inputs): + return self.model(**inputs) + + def training_step(self, batch, batch_idx): + outputs = self(**batch) + return outputs + + def validation_step(self, batch, batch_idx, dataloader_idx=0): + outputs = self(**batch) + val_loss = outputs["loss"] + + return {"loss": val_loss} + + def setup(self, stage=None) -> None: + if stage != "fit": + return + # Get dataloader by calling it - train_dataloader() is called after setup() by default + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + + # Calculate total steps + tb_size = self.hparams.train_batchsize * max(1, self.trainer.gpus) + ab_size = self.trainer.accumulate_grad_batches * float(self.trainer.max_epochs) + self.total_steps = (len(train_loader.dataset) // tb_size) // ab_size + + def configure_optimizers(self): + transformer_utils.configure_optimizers(self) diff --git a/fengshen/models/longformer/__init__.py b/fengshen/models/longformer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..8c068ccdcd2a786128a6a90032fea2ff74d3ea0f --- /dev/null +++ b/fengshen/models/longformer/__init__.py @@ -0,0 +1,55 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
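+
+# Minimal usage sketch (the checkpoint path below is illustrative, not a real model id):
+#
+#     from fengshen.models.longformer import LongformerModel, LongformerTokenizer
+#     tokenizer = LongformerTokenizer.from_pretrained("path/to/chinese-longformer-checkpoint")
+#     model = LongformerModel.from_pretrained("path/to/chinese-longformer-checkpoint")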
+ +from typing import TYPE_CHECKING + +from transformers.file_utils import _LazyModule, is_torch_available + + +_import_structure = { + "configuration_longformer": ["LongformerConfig"], + "tokenization_longformer": ["LongformerTokenizer"], +} + +if is_torch_available(): + _import_structure["modeling_longformer"] = [ + "LongformerModel", + "LongformerForMaskedLM", + "LongformerForMultipleChoice", + "LongformerPreTrainedModel", + "LongformerForQuestionAnswering", + "LongformerForSequenceClassification", + "LongformerForTokenClassification", + ] + + +if TYPE_CHECKING: + from .configuration_longformer import LongformerConfig + from .tokenization_longformer import LongformerTokenizer + + if is_torch_available(): + from .modeling_longformer import ( + LongformerModel, + LongformerForMaskedLM, + LongformerForMultipleChoice, + LongformerPreTrainedModel, + LongformerForQuestionAnswering, + LongformerForSequenceClassification, + LongformerForTokenClassification, + ) +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure) diff --git a/fengshen/models/longformer/configuration_longformer.py b/fengshen/models/longformer/configuration_longformer.py new file mode 100644 index 0000000000000000000000000000000000000000..14ad2b5557d4d0cd9d2397308b6a823c1789bb31 --- /dev/null +++ b/fengshen/models/longformer/configuration_longformer.py @@ -0,0 +1,16 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from transformers import LongformerConfig diff --git a/fengshen/models/longformer/modeling_longformer.py b/fengshen/models/longformer/modeling_longformer.py new file mode 100644 index 0000000000000000000000000000000000000000..697782a467a212926bba68e8a6791545f3c9f6e2 --- /dev/null +++ b/fengshen/models/longformer/modeling_longformer.py @@ -0,0 +1,2485 @@ +# coding=utf-8 +# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch Longformer model. 
""" + +import math +from dataclasses import dataclass +from typing import Optional, Tuple +from numpy.lib.function_base import kaiser + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss + +from transformers.activations import ACT2FN, gelu +from transformers.file_utils import ( + ModelOutput, + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + replace_return_docstrings, +) +from transformers.modeling_utils import ( + PreTrainedModel, + apply_chunking_to_forward, + find_pruneable_heads_and_indices, + prune_linear_layer, +) +from transformers.utils import logging +from transformers import LongformerConfig + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "allenai/longformer-base-4096" +_CONFIG_FOR_DOC = "LongformerConfig" +_TOKENIZER_FOR_DOC = "LongformerTokenizer" + +LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "allenai/longformer-base-4096", + "allenai/longformer-large-4096", + "allenai/longformer-large-4096-finetuned-triviaqa", + "allenai/longformer-base-4096-extra.pos.embd.only", + "allenai/longformer-large-4096-extra.pos.embd.only", + # See all Longformer models at https://huggingface.co/models?filter=longformer +] + + +@dataclass +class LongformerBaseModelOutput(ModelOutput): + """ + Base class for Longformer's outputs, with potential hidden states, local and global attentions. + + Args: + last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. 
If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. + global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + last_hidden_state: torch.FloatTensor + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerBaseModelOutputWithPooling(ModelOutput): + """ + Base class for Longformer's outputs that also contains a pooling of the last hidden states. + + Args: + last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + pooler_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`): + Last layer hidden-state of the first token of the sequence (classification token) further processed by a + Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence + prediction (classification) objective during pretraining. + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. 
If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. + global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + last_hidden_state: torch.FloatTensor + pooler_output: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerMaskedLMOutput(ModelOutput): + """ + Base class for masked language models outputs. + + Args: + loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided): + Masked language modeling (MLM) loss. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`): + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. 
+ global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerQuestionAnsweringModelOutput(ModelOutput): + """ + Base class for outputs of question answering Longformer models. + + Args: + loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided): + Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. + start_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): + Span-start scores (before SoftMax). + end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): + Span-end scores (before SoftMax). + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. 
+ global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + loss: Optional[torch.FloatTensor] = None + start_logits: torch.FloatTensor = None + end_logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerSequenceClassifierOutput(ModelOutput): + """ + Base class for outputs of sentence classification models. + + Args: + loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided): + Classification (or regression if config.num_labels==1) loss. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`): + Classification (or regression if config.num_labels==1) scores (before SoftMax). + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. 
+ global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerMultipleChoiceModelOutput(ModelOutput): + """ + Base class for outputs of multiple choice Longformer models. + + Args: + loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided): + Classification loss. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`): + `num_choices` is the second dimension of the input tensors. (see `input_ids` above). + + Classification scores (before SoftMax). + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. 
+ global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. + + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class LongformerTokenClassifierOutput(ModelOutput): + """ + Base class for outputs of token classification models. + + Args: + loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) : + Classification loss. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`): + Classification scores (before SoftMax). + hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x + attention_window + 1)`, where ``x`` is the number of tokens with global attention + mask. + + Local attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token in the sequence to every token with + global attention (first ``x`` values) and to every token in the attention window (remaining + ``attention_window + 1`` values). Note that the first ``x`` values refer to tokens with fixed positions in + the text, but the remaining ``attention_window + 1`` values refer to tokens with relative positions: the + attention weight of a token to itself is located at index ``x + attention_window / 2`` and the + ``attention_window / 2`` preceding (succeeding) values are the attention weights to the ``attention_window + / 2`` preceding (succeeding) tokens. If the attention window contains a token with global attention, the + attention weight at the corresponding index is set to 0; the value should be accessed from the first ``x`` + attention weights. If a token has global attention, the attention weights to all other tokens in + :obj:`attentions` is set to 0, the values should be accessed from :obj:`global_attentions`. + global_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, x)`, where ``x`` is the number of tokens with global attention mask. 
+ + Global attentions weights after the attention softmax, used to compute the weighted average in the + self-attention heads. Those are the attention weights from every token with global attention to every token + in the sequence. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + global_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +def _get_question_end_index(input_ids, sep_token_id): + """ + Computes the index of the first occurrence of `sep_token_id`. + """ + + sep_token_indices = (input_ids == sep_token_id).nonzero() + batch_size = input_ids.shape[0] + + assert sep_token_indices.shape[1] == 2, "`input_ids` should have two dimensions" + assert ( + sep_token_indices.shape[0] == 3 * batch_size + ), f"There should be exactly three separator tokens: {sep_token_id} in every sample for questions answering. You might also consider to set `global_attention_mask` manually in the forward function to avoid this error." + return sep_token_indices.view(batch_size, 3, 2)[:, 0, 1] + + +def _compute_global_attention_mask(input_ids, sep_token_id, before_sep_token=True): + """ + Computes global attention mask by putting attention on all tokens before `sep_token_id` if `before_sep_token is + True` else after `sep_token_id`. + """ + question_end_index = _get_question_end_index(input_ids, sep_token_id) + question_end_index = question_end_index.unsqueeze( + dim=1) # size: batch_size x 1 + # bool attention mask with True in locations of global attention + attention_mask = torch.arange(input_ids.shape[1], device=input_ids.device) + if before_sep_token is True: + attention_mask = (attention_mask.expand_as(input_ids) + < question_end_index).to(torch.uint8) + else: + # last token is separation token and should not be counted and in the middle are two separation tokens + attention_mask = (attention_mask.expand_as(input_ids) > (question_end_index + 1)).to(torch.uint8) * ( + attention_mask.expand_as(input_ids) < input_ids.shape[-1] + ).to(torch.uint8) + + return attention_mask + + +def create_position_ids_from_input_ids(input_ids, padding_idx): + """ + Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols + are ignored. This is modified from fairseq's `utils.make_positions`. + + Args: + x: torch.Tensor x: + + Returns: torch.Tensor + """ + # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA. + mask = input_ids.ne(padding_idx).int() + incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask + return incremental_indices.long() + padding_idx + + +class LongformerEmbeddings(nn.Module): + """ + Same as BertEmbeddings with a tiny tweak for positional embeddings indexing. 
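+
+    Note: in this variant the absolute position embeddings are not added in ``forward``
+    (the corresponding code is commented out below); positional information is instead
+    injected through rotary position embeddings (RoPE) inside the self-attention layers.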
+ """ + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding( + config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding( + config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding( + config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm( + config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + + # Modify + # self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + # self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + + # self.padding_idx = config.pad_token_id + # self.position_embeddings = nn.Embedding( + # config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx + # ) + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None): + + # if position_ids is None: + # if input_ids is not None: + # # Create the position ids from the input token ids. Any padded tokens remain padded. + # position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx).to(input_ids.device) + # else: + # position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds) + + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + # if position_ids is None: + # position_ids = self.position_ids[:, :seq_length] + + if token_type_ids is None: + token_type_ids = torch.zeros( + input_shape, dtype=torch.long, device=self.position_ids.device) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + + # Modify + # position_embeddings = self.position_embeddings(position_ids) + + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + def create_position_ids_from_inputs_embeds(self, inputs_embeds): + """ + We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids. 
+ + Args: + inputs_embeds: torch.Tensor inputs_embeds: + + Returns: torch.Tensor + """ + input_shape = inputs_embeds.size()[:-1] + sequence_length = input_shape[1] + + position_ids = torch.arange( + self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device + ) + return position_ids.unsqueeze(0).expand(input_shape) + + +class RoPEmbedding(nn.Module): + def __init__(self, d_model): + super(RoPEmbedding, self).__init__() + self.d_model = d_model + div_term = torch.exp(torch.arange( + 0, d_model, 2).float() * (-math.log(10000.0) / d_model)) + self.register_buffer('div_term', div_term) + + def forward(self, x, seq_dim=0): + x = x # [seq_len,num_head,batch_size,per_head_hidden_size] + t = torch.arange(x.size(seq_dim), device=x.device).type_as( + self.div_term) + sinusoid_inp = torch.outer(t, self.div_term) + sin, cos = sinusoid_inp.sin(), sinusoid_inp.cos() # [s, hn] + o_shape = (sin.size(0), 1, 1, sin.size(1)) + sin, cos = sin.view(*o_shape), cos.view(*o_shape) # [s, 1, 1, hn] + sin = torch.repeat_interleave(sin, 2, dim=-1) + cos = torch.repeat_interleave(cos, 2, dim=-1) + x2 = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).reshape_as(x) + x = cos * x + sin * x2 + return x + + +class LongformerSelfAttention(nn.Module): + def __init__(self, config, layer_id): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + self.config = config + self.num_heads = config.num_attention_heads + self.head_dim = int(config.hidden_size / config.num_attention_heads) + self.embed_dim = config.hidden_size + + self.query = nn.Linear(config.hidden_size, self.embed_dim) + self.key = nn.Linear(config.hidden_size, self.embed_dim) + self.value = nn.Linear(config.hidden_size, self.embed_dim) + + # separate projection layers for tokens with global attention + # self.query_global = nn.Linear(config.hidden_size, self.embed_dim) + # self.key_global = nn.Linear(config.hidden_size, self.embed_dim) + # self.value_global = nn.Linear(config.hidden_size, self.embed_dim) + + self.dropout = config.attention_probs_dropout_prob + + self.layer_id = layer_id + attention_window = config.attention_window[self.layer_id] + assert ( + attention_window % 2 == 0 + ), f"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}" + assert ( + attention_window > 0 + ), f"`attention_window` for layer {self.layer_id} has to be positive. Given {attention_window}" + + self.one_sided_attn_window_size = attention_window // 2 + self.rope_emb = RoPEmbedding(self.head_dim) + + def forward( + self, + hidden_states, + attention_mask=None, + layer_head_mask=None, + is_index_masked=None, + is_index_global_attn=None, + is_global_attn=None, + output_attentions=False, + ): + """ + :class:`LongformerSelfAttention` expects `len(hidden_states)` to be multiple of `attention_window`. Padding to + `attention_window` happens in :meth:`LongformerModel.forward` to avoid redoing the padding on each layer. 
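+
+        When `config.use_sparse_attention` is False, the sliding-window logic below is skipped and a
+        standard full self-attention (with RoPE applied to queries and keys) is computed instead.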
+ + The `attention_mask` is changed in :meth:`LongformerModel.forward` from 0, 1, 2 to: + + * -10000: no attention + * 0: local attention + * +10000: global attention + """ + + # print(attention_mask.shape) + if not self.config.use_sparse_attention: # 如果不使用稀疏attention,则使用标准的attention + hidden_states = hidden_states.transpose(0, 1) + # project hidden states + query_vectors = self.query(hidden_states) + key_vectors = self.key(hidden_states) + value_vectors = self.value(hidden_states) + + seq_len, batch_size, embed_dim = hidden_states.size() + assert ( + embed_dim == self.embed_dim + ), f"hidden_states should have embed_dim = {self.embed_dim}, but has {embed_dim}" + + # normalize query + + # query_vectors = query_vectors.view(seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1) + # key_vectors = key_vectors.view(seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1) + + # print('query_vectors',query_vectors.shape) + + query_vectors = query_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(1, 2) + key_vectors = key_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(1, 2) + + query_vectors = self.rope_emb(query_vectors) + key_vectors = self.rope_emb(key_vectors) + + query_vectors = query_vectors.transpose(0, 2) # [b,mh,s,hd] + key_vectors = key_vectors.transpose(0, 2).transpose(2, 3) + + # print('query_vectors',query_vectors.shape) + + query_vectors /= math.sqrt(self.head_dim) + + attention_mask = self.get_extended_attention_mask( + attention_mask, attention_mask.shape, attention_mask.device) + attn_scores = torch.matmul( + query_vectors, key_vectors)+attention_mask + + attn_scores = torch.nn.functional.softmax(attn_scores, dim=-1) + + value_vectors = value_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1).transpose(1, 2) + outputs = torch.matmul(attn_scores, value_vectors).transpose( + 1, 2).contiguous().view(batch_size, seq_len, self.num_heads*self.head_dim) + + # print('output',outputs.shape) + outputs = (outputs,) + return outputs+(attn_scores,) + + # print('hidden.shape',hidden_states.shape) + # print('attention_mask.shape',attention_mask.shape) + # print('att_mask:',attention_mask) + + hidden_states = hidden_states.transpose(0, 1) + + # project hidden states + query_vectors = self.query(hidden_states) + key_vectors = self.key(hidden_states) + value_vectors = self.value(hidden_states) + + seq_len, batch_size, embed_dim = hidden_states.size() + assert ( + embed_dim == self.embed_dim + ), f"hidden_states should have embed_dim = {self.embed_dim}, but has {embed_dim}" + + # normalize query + + # query_vectors = query_vectors.view(seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1) + # key_vectors = key_vectors.view(seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1) + + query_vectors = query_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(1, 2) + key_vectors = key_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(1, 2) + + query_vectors = self.rope_emb(query_vectors) + key_vectors = self.rope_emb(key_vectors) + + query_vectors = query_vectors.transpose(1, 2).transpose(0, 1) + key_vectors = key_vectors.transpose(1, 2).transpose(0, 1) + + query_vectors /= math.sqrt(self.head_dim) + + attn_scores = self._sliding_chunks_query_key_matmul( + query_vectors, key_vectors, self.one_sided_attn_window_size + ) + # print('att:',attn_scores.shape) + # values to pad for attention probs + 
remove_from_windowed_attention_mask = ( + attention_mask != 0)[:, :, None, None] + + # cast to fp32/fp16 then replace 1's with -inf + float_mask = remove_from_windowed_attention_mask.type_as(query_vectors).masked_fill( + remove_from_windowed_attention_mask, -10000.0 + ) + # diagonal mask with zeros everywhere and -inf inplace of padding + diagonal_mask = self._sliding_chunks_query_key_matmul( + float_mask.new_ones(size=float_mask.size() + ), float_mask, self.one_sided_attn_window_size + ) + + # pad local attention probs + attn_scores += diagonal_mask + + assert list(attn_scores.size()) == [ + batch_size, + seq_len, + self.num_heads, + self.one_sided_attn_window_size * 2 + 1, + ], f"local_attn_probs should be of size ({batch_size}, {seq_len}, {self.num_heads}, {self.one_sided_attn_window_size * 2 + 1}), but is of size {attn_scores.size()}" + + # compute local attention probs from global attention keys and contact over window dim + if is_global_attn: + # compute global attn indices required through out forward fn + ( + max_num_global_attn_indices, + is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero, + ) = self._get_global_attn_indices(is_index_global_attn) + # calculate global attn probs from global key + + global_key_attn_scores = self._concat_with_global_key_attn_probs( + query_vectors=query_vectors, + key_vectors=key_vectors, + max_num_global_attn_indices=max_num_global_attn_indices, + is_index_global_attn_nonzero=is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero=is_local_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero=is_local_index_no_global_attn_nonzero, + ) + # concat to local_attn_probs + # (batch_size, seq_len, num_heads, extra attention count + 2*window+1) + attn_scores = torch.cat( + (global_key_attn_scores, attn_scores), dim=-1) + + # free memory + del global_key_attn_scores + + attn_probs = nn.functional.softmax( + attn_scores, dim=-1, dtype=torch.float32 + ) # use fp32 for numerical stability + + if layer_head_mask is not None: + assert layer_head_mask.size() == ( + self.num_heads, + ), f"Head mask for a single layer should be of size {(self.num_heads,)}, but is {layer_head_mask.size()}" + attn_probs = layer_head_mask.view(1, 1, -1, 1) * attn_probs + + # softmax sometimes inserts NaN if all positions are masked, replace them with 0 + attn_probs = torch.masked_fill( + attn_probs, is_index_masked[:, :, None, None], 0.0) + attn_probs = attn_probs.type_as(attn_scores) + + # free memory + del attn_scores + + # apply dropout + attn_probs = nn.functional.dropout( + attn_probs, p=self.dropout, training=self.training) + + value_vectors = value_vectors.view( + seq_len, batch_size, self.num_heads, self.head_dim).transpose(0, 1) + + # compute local attention output with global attention value and add + if is_global_attn: + # compute sum of global and local attn + attn_output = self._compute_attn_output_with_global_indices( + value_vectors=value_vectors, + attn_probs=attn_probs, + max_num_global_attn_indices=max_num_global_attn_indices, + is_index_global_attn_nonzero=is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero=is_local_index_global_attn_nonzero, + ) + else: + # compute local attn only + attn_output = self._sliding_chunks_matmul_attn_probs_value( + attn_probs, value_vectors, self.one_sided_attn_window_size + ) + + assert attn_output.size() == (batch_size, seq_len, self.num_heads, + self.head_dim), "Unexpected size" + attn_output = attn_output.transpose(0, 1).reshape( + 
seq_len, batch_size, embed_dim).contiguous() + + # compute value for global attention and overwrite to attention output + # TODO: remove the redundant computation + if is_global_attn: + global_attn_output, global_attn_probs = self._compute_global_attn_output_from_hidden( + global_query_vectors=query_vectors, + global_key_vectors=key_vectors, + global_value_vectors=value_vectors, + max_num_global_attn_indices=max_num_global_attn_indices, + layer_head_mask=layer_head_mask, + is_local_index_global_attn_nonzero=is_local_index_global_attn_nonzero, + is_index_global_attn_nonzero=is_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero=is_local_index_no_global_attn_nonzero, + is_index_masked=is_index_masked, + ) + # print('global_attn_output',global_attn_output.shape) + # get only non zero global attn output + nonzero_global_attn_output = global_attn_output[ + is_local_index_global_attn_nonzero[0], :, is_local_index_global_attn_nonzero[1] + ] + # print('nonzero_global_attn_output',nonzero_global_attn_output.shape) + # overwrite values with global attention + attn_output[is_index_global_attn_nonzero[::-1]] = nonzero_global_attn_output.view( + len(is_local_index_global_attn_nonzero[0]), -1 + ) + # The attention weights for tokens with global attention are + # just filler values, they were never used to compute the output. + # Fill with 0 now, the correct values are in 'global_attn_probs'. + attn_probs[is_index_global_attn_nonzero] = 0 + + outputs = (attn_output.transpose(0, 1),) + + if output_attentions: + outputs += (attn_probs,) + + return outputs + (global_attn_probs,) if (is_global_attn and output_attentions) else outputs + + @staticmethod + def _pad_and_transpose_last_two_dims(hidden_states_padded, padding): + """pads rows and then flips rows and columns""" + hidden_states_padded = nn.functional.pad( + hidden_states_padded, padding + ) # padding value is not important because it will be overwritten + hidden_states_padded = hidden_states_padded.view( + *hidden_states_padded.size()[:-2], hidden_states_padded.size(-1), hidden_states_padded.size(-2) + ) + return hidden_states_padded + + @staticmethod + def _pad_and_diagonalize(chunked_hidden_states): + """ + shift every row 1 step right, converting columns into diagonals. + + Example:: + + chunked_hidden_states: [ 0.4983, 2.6918, -0.0071, 1.0492, + -1.8348, 0.7672, 0.2986, 0.0285, + -0.7584, 0.4206, -0.0405, 0.1599, + 2.0514, -1.1600, 0.5372, 0.2629 ] + window_overlap = num_rows = 4 + (pad & diagonalize) => + [ 0.4983, 2.6918, -0.0071, 1.0492, 0.0000, 0.0000, 0.0000 + 0.0000, -1.8348, 0.7672, 0.2986, 0.0285, 0.0000, 0.0000 + 0.0000, 0.0000, -0.7584, 0.4206, -0.0405, 0.1599, 0.0000 + 0.0000, 0.0000, 0.0000, 2.0514, -1.1600, 0.5372, 0.2629 ] + """ + total_num_heads, num_chunks, window_overlap, hidden_dim = chunked_hidden_states.size() + chunked_hidden_states = nn.functional.pad( + chunked_hidden_states, (0, window_overlap + 1) + ) # total_num_heads x num_chunks x window_overlap x (hidden_dim+window_overlap+1). 
Padding value is not important because it'll be overwritten + chunked_hidden_states = chunked_hidden_states.view( + total_num_heads, num_chunks, -1 + ) # total_num_heads x num_chunks x window_overlap*window_overlap+window_overlap + chunked_hidden_states = chunked_hidden_states[ + :, :, :-window_overlap + ] # total_num_heads x num_chunks x window_overlap*window_overlap + chunked_hidden_states = chunked_hidden_states.view( + total_num_heads, num_chunks, window_overlap, window_overlap + hidden_dim + ) + chunked_hidden_states = chunked_hidden_states[:, :, :, :-1] + return chunked_hidden_states + + @staticmethod + def _chunk(hidden_states, window_overlap): + """convert into overlapping chunks. Chunk size = 2w, overlap size = w""" + + # non-overlapping chunks of size = 2w + hidden_states = hidden_states.view( + hidden_states.size(0), + hidden_states.size(1) // (window_overlap * 2), + window_overlap * 2, + hidden_states.size(2), + ) + + # use `as_strided` to make the chunks overlap with an overlap size = window_overlap + chunk_size = list(hidden_states.size()) + chunk_size[1] = chunk_size[1] * 2 - 1 + + chunk_stride = list(hidden_states.stride()) + chunk_stride[1] = chunk_stride[1] // 2 + return hidden_states.as_strided(size=chunk_size, stride=chunk_stride) + + @staticmethod + def _mask_invalid_locations(input_tensor, affected_seq_len) -> torch.Tensor: + beginning_mask_2d = input_tensor.new_ones( + affected_seq_len, affected_seq_len + 1).tril().flip(dims=[0]) + beginning_mask = beginning_mask_2d[None, :, None, :] + ending_mask = beginning_mask.flip(dims=(1, 3)) + beginning_input = input_tensor[:, + :affected_seq_len, :, : affected_seq_len + 1] + beginning_mask = beginning_mask.expand(beginning_input.size()) + # `== 1` converts to bool or uint8 + beginning_input.masked_fill_(beginning_mask == 1, -float("inf")) + ending_input = input_tensor[:, - + affected_seq_len:, :, -(affected_seq_len + 1):] + ending_mask = ending_mask.expand(ending_input.size()) + # `== 1` converts to bool or uint8 + ending_input.masked_fill_(ending_mask == 1, -float("inf")) + + def _sliding_chunks_query_key_matmul(self, query: torch.Tensor, key: torch.Tensor, window_overlap: int): + """ + Matrix multiplication of query and key tensors using with a sliding window attention pattern. This + implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer) with an + overlap of size window_overlap + """ + batch_size, seq_len, num_heads, head_dim = query.size() + assert ( + seq_len % (window_overlap * 2) == 0 + ), f"Sequence length should be multiple of {window_overlap * 2}. 
Given {seq_len}" + assert query.size() == key.size() + + chunks_count = seq_len // window_overlap - 1 + + # group batch_size and num_heads dimensions into one, then chunk seq_len into chunks of size window_overlap * 2 + query = query.transpose(1, 2).reshape( + batch_size * num_heads, seq_len, head_dim) + key = key.transpose(1, 2).reshape( + batch_size * num_heads, seq_len, head_dim) + + query = self._chunk(query, window_overlap) + key = self._chunk(key, window_overlap) + + # matrix multiplication + # bcxd: batch_size * num_heads x chunks x 2window_overlap x head_dim + # bcyd: batch_size * num_heads x chunks x 2window_overlap x head_dim + # bcxy: batch_size * num_heads x chunks x 2window_overlap x 2window_overlap + diagonal_chunked_attention_scores = torch.einsum( + "bcxd,bcyd->bcxy", (query, key)) # multiply + + # convert diagonals into columns + diagonal_chunked_attention_scores = self._pad_and_transpose_last_two_dims( + diagonal_chunked_attention_scores, padding=(0, 0, 0, 1) + ) + + # allocate space for the overall attention matrix where the chunks are combined. The last dimension + # has (window_overlap * 2 + 1) columns. The first (window_overlap) columns are the window_overlap lower triangles (attention from a word to + # window_overlap previous words). The following column is attention score from each word to itself, then + # followed by window_overlap columns for the upper triangle. + + diagonal_attention_scores = diagonal_chunked_attention_scores.new_empty( + (batch_size * num_heads, chunks_count + 1, + window_overlap, window_overlap * 2 + 1) + ) + + # copy parts from diagonal_chunked_attention_scores into the combined matrix of attentions + # - copying the main diagonal and the upper triangle + diagonal_attention_scores[:, :-1, :, window_overlap:] = diagonal_chunked_attention_scores[ + :, :, :window_overlap, : window_overlap + 1 + ] + diagonal_attention_scores[:, -1, :, window_overlap:] = diagonal_chunked_attention_scores[ + :, -1, window_overlap:, : window_overlap + 1 + ] + # - copying the lower triangle + diagonal_attention_scores[:, 1:, :, :window_overlap] = diagonal_chunked_attention_scores[ + :, :, -(window_overlap + 1): -1, window_overlap + 1: + ] + + diagonal_attention_scores[:, 0, 1:window_overlap, 1:window_overlap] = diagonal_chunked_attention_scores[ + :, 0, : window_overlap - 1, 1 - window_overlap: + ] + + # separate batch_size and num_heads dimensions again + diagonal_attention_scores = diagonal_attention_scores.view( + batch_size, num_heads, seq_len, 2 * window_overlap + 1 + ).transpose(2, 1) + + self._mask_invalid_locations(diagonal_attention_scores, window_overlap) + return diagonal_attention_scores + + def _sliding_chunks_matmul_attn_probs_value( + self, attn_probs: torch.Tensor, value: torch.Tensor, window_overlap: int + ): + """ + Same as _sliding_chunks_query_key_matmul but for attn_probs and value tensors. 
Returned tensor will be of the + same shape as `attn_probs` + """ + batch_size, seq_len, num_heads, head_dim = value.size() + + assert seq_len % (window_overlap * 2) == 0 + assert attn_probs.size()[:3] == value.size()[:3] + assert attn_probs.size(3) == 2 * window_overlap + 1 + chunks_count = seq_len // window_overlap - 1 + # group batch_size and num_heads dimensions into one, then chunk seq_len into chunks of size 2 window overlap + + chunked_attn_probs = attn_probs.transpose(1, 2).reshape( + batch_size * num_heads, seq_len // window_overlap, window_overlap, 2 * window_overlap + 1 + ) + + # group batch_size and num_heads dimensions into one + value = value.transpose(1, 2).reshape( + batch_size * num_heads, seq_len, head_dim) + + # pad seq_len with w at the beginning of the sequence and another window overlap at the end + padded_value = nn.functional.pad( + value, (0, 0, window_overlap, window_overlap), value=-1) + + # chunk padded_value into chunks of size 3 window overlap and an overlap of size window overlap + chunked_value_size = (batch_size * num_heads, + chunks_count + 1, 3 * window_overlap, head_dim) + chunked_value_stride = padded_value.stride() + chunked_value_stride = ( + chunked_value_stride[0], + window_overlap * chunked_value_stride[1], + chunked_value_stride[1], + chunked_value_stride[2], + ) + chunked_value = padded_value.as_strided( + size=chunked_value_size, stride=chunked_value_stride) + + chunked_attn_probs = self._pad_and_diagonalize(chunked_attn_probs) + + context = torch.einsum( + "bcwd,bcdh->bcwh", (chunked_attn_probs, chunked_value)) + return context.view(batch_size, num_heads, seq_len, head_dim).transpose(1, 2) + + @staticmethod + def _get_global_attn_indices(is_index_global_attn): + """compute global attn indices required throughout forward pass""" + # helper variable + num_global_attn_indices = is_index_global_attn.long().sum(dim=1) + + # max number of global attn indices in batch + max_num_global_attn_indices = num_global_attn_indices.max() + + # indices of global attn + is_index_global_attn_nonzero = is_index_global_attn.nonzero( + as_tuple=True) + + # helper variable + is_local_index_global_attn = torch.arange( + max_num_global_attn_indices, device=is_index_global_attn.device + ) < num_global_attn_indices.unsqueeze(dim=-1) + + # location of the non-padding values within global attention indices + is_local_index_global_attn_nonzero = is_local_index_global_attn.nonzero( + as_tuple=True) + + # location of the padding values within global attention indices + is_local_index_no_global_attn_nonzero = ( + is_local_index_global_attn == 0).nonzero(as_tuple=True) + return ( + max_num_global_attn_indices, + is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero, + ) + + def _concat_with_global_key_attn_probs( + self, + key_vectors, + query_vectors, + max_num_global_attn_indices, + is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero, + ): + batch_size = key_vectors.shape[0] + + # create only global key vectors + key_vectors_only_global = key_vectors.new_zeros( + batch_size, max_num_global_attn_indices, self.num_heads, self.head_dim + ) + + key_vectors_only_global[is_local_index_global_attn_nonzero] = key_vectors[is_index_global_attn_nonzero] + + # (batch_size, seq_len, num_heads, max_num_global_attn_indices) + attn_probs_from_global_key = torch.einsum( + "blhd,bshd->blhs", (query_vectors, key_vectors_only_global)) + + attn_probs_from_global_key[ + 
is_local_index_no_global_attn_nonzero[0], :, :, is_local_index_no_global_attn_nonzero[1] + ] = -10000.0 + + return attn_probs_from_global_key + + def _compute_attn_output_with_global_indices( + self, + value_vectors, + attn_probs, + max_num_global_attn_indices, + is_index_global_attn_nonzero, + is_local_index_global_attn_nonzero, + ): + batch_size = attn_probs.shape[0] + + # cut local attn probs to global only + attn_probs_only_global = attn_probs.narrow( + -1, 0, max_num_global_attn_indices) + # get value vectors for global only + value_vectors_only_global = value_vectors.new_zeros( + batch_size, max_num_global_attn_indices, self.num_heads, self.head_dim + ) + value_vectors_only_global[is_local_index_global_attn_nonzero] = value_vectors[is_index_global_attn_nonzero] + + # use `matmul` because `einsum` crashes sometimes with fp16 + # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v)) + # compute attn output only global + attn_output_only_global = torch.matmul( + attn_probs_only_global.transpose( + 1, 2), value_vectors_only_global.transpose(1, 2) + ).transpose(1, 2) + + # reshape attn probs + attn_probs_without_global = attn_probs.narrow( + -1, max_num_global_attn_indices, attn_probs.size(-1) - max_num_global_attn_indices + ).contiguous() + + # compute attn output with global + attn_output_without_global = self._sliding_chunks_matmul_attn_probs_value( + attn_probs_without_global, value_vectors, self.one_sided_attn_window_size + ) + return attn_output_only_global + attn_output_without_global + + def _compute_global_attn_output_from_hidden( + self, + global_query_vectors, + global_key_vectors, + global_value_vectors, + max_num_global_attn_indices, + layer_head_mask, + is_local_index_global_attn_nonzero, + is_index_global_attn_nonzero, + is_local_index_no_global_attn_nonzero, + is_index_masked, + ): + + global_query_vectors = global_query_vectors.transpose(0, 1) + seq_len, batch_size, _, _ = global_query_vectors.shape + global_query_vectors_only_global = global_query_vectors.new_zeros( + max_num_global_attn_indices, batch_size, self.num_heads, self.head_dim) + global_query_vectors_only_global[is_local_index_global_attn_nonzero[::-1]] = global_query_vectors[ + is_index_global_attn_nonzero[::-1] + ] + + seq_len_q, batch_size_q, _, _ = global_query_vectors_only_global.shape + + # print('global_query_vectors_only_global',global_query_vectors_only_global.shape) + + global_query_vectors_only_global = global_query_vectors_only_global.view( + seq_len_q, batch_size_q, self.num_heads, self.head_dim) + global_key_vectors = global_key_vectors.transpose(0, 1) + global_value_vectors = global_value_vectors.transpose(0, 1) + + # reshape + global_query_vectors_only_global = ( + global_query_vectors_only_global.contiguous() + .view(max_num_global_attn_indices, batch_size * self.num_heads, self.head_dim) + .transpose(0, 1) + ) # (batch_size * self.num_heads, max_num_global_attn_indices, head_dim) + global_key_vectors = ( + global_key_vectors.contiguous().view(-1, batch_size * self.num_heads, + self.head_dim).transpose(0, 1) + ) # batch_size * self.num_heads, seq_len, head_dim) + global_value_vectors = ( + global_value_vectors.contiguous().view(-1, batch_size * self.num_heads, + self.head_dim).transpose(0, 1) + ) # batch_size * self.num_heads, seq_len, head_dim) + + # compute attn scores + + global_attn_scores = torch.bmm( + global_query_vectors_only_global, global_key_vectors.transpose(1, 2)) + + assert list(global_attn_scores.size()) == [ + batch_size * self.num_heads, + 
max_num_global_attn_indices, + seq_len, + ], f"global_attn_scores have the wrong size. Size should be {(batch_size * self.num_heads, max_num_global_attn_indices, seq_len)}, but is {global_attn_scores.size()}." + + global_attn_scores = global_attn_scores.view( + batch_size, self.num_heads, max_num_global_attn_indices, seq_len) + + global_attn_scores[ + is_local_index_no_global_attn_nonzero[0], :, is_local_index_no_global_attn_nonzero[1], : + ] = -10000.0 + + global_attn_scores = global_attn_scores.masked_fill( + is_index_masked[:, None, None, :], + -10000.0, + ) + + global_attn_scores = global_attn_scores.view( + batch_size * self.num_heads, max_num_global_attn_indices, seq_len) + + # compute global attn probs + global_attn_probs_float = nn.functional.softmax( + global_attn_scores, dim=-1, dtype=torch.float32 + ) # use fp32 for numerical stability + + # apply layer head masking + if layer_head_mask is not None: + assert layer_head_mask.size() == ( + self.num_heads, + ), f"Head mask for a single layer should be of size {(self.num_heads,)}, but is {layer_head_mask.size()}" + global_attn_probs_float = layer_head_mask.view(1, -1, 1, 1) * global_attn_probs_float.view( + batch_size, self.num_heads, max_num_global_attn_indices, seq_len + ) + global_attn_probs_float = global_attn_probs_float.view( + batch_size * self.num_heads, max_num_global_attn_indices, seq_len + ) + + global_attn_probs = nn.functional.dropout( + global_attn_probs_float.type_as(global_attn_scores), p=self.dropout, training=self.training + ) + + # global attn output + global_attn_output = torch.bmm(global_attn_probs, global_value_vectors) + + assert list(global_attn_output.size()) == [ + batch_size * self.num_heads, + max_num_global_attn_indices, + self.head_dim, + ], f"global_attn_output tensor has the wrong size. Size should be {(batch_size * self.num_heads, max_num_global_attn_indices, self.head_dim)}, but is {global_attn_output.size()}." + + global_attn_probs = global_attn_probs.view( + batch_size, self.num_heads, max_num_global_attn_indices, seq_len) + global_attn_output = global_attn_output.view( + batch_size, self.num_heads, max_num_global_attn_indices, self.head_dim + ) + return global_attn_output, global_attn_probs + + def get_extended_attention_mask(self, attention_mask, input_shape, device): + """ + Makes broadcastable attention and causal masks so that future and masked tokens are ignored. + + Arguments: + attention_mask (:obj:`torch.Tensor`): + Mask with ones indicating tokens to attend to, zeros for tokens to ignore. + input_shape (:obj:`Tuple[int]`): + The shape of the input to the model. + device: (:obj:`torch.device`): + The device of the input to the model. + + Returns: + :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`. + """ + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
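+        # A worked example of the mapping implemented below (illustrative values only):
+        #   incoming row    [-10000.,   0., +10000.]   # masked / local / global token
+        #   binary mask     [      0,   1,       1 ]   # torch.where(attention_mask < 0, 0, 1)
+        #   additive mask   [-10000.,   0.,      0.]   # (1.0 - mask) * -10000.0, broadcast over all heads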
+ + ones = torch.ones_like(attention_mask) + zero = torch.zeros_like(attention_mask) + attention_mask = torch.where(attention_mask < 0, zero, ones) + + if attention_mask.dim() == 3: + extended_attention_mask = attention_mask[:, None, :, :] + elif attention_mask.dim() == 2: + extended_attention_mask = attention_mask[:, None, None, :] + else: + raise ValueError( + f"Wrong shape for input_ids (shape {input_shape}) or attention_mask (shape {attention_mask.shape})" + ) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + # extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + return extended_attention_mask + + +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput +class LongformerSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm( + config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class LongformerAttention(nn.Module): + def __init__(self, config, layer_id=0): + super().__init__() + self.self = LongformerSelfAttention(config, layer_id) + self.output = LongformerSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - \ + len(heads) + self.self.all_head_size = self.self.attention_head_size * \ + self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states, + attention_mask=None, + layer_head_mask=None, + is_index_masked=None, + is_index_global_attn=None, + is_global_attn=None, + output_attentions=False, + ): + self_outputs = self.self( + hidden_states, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + is_index_masked=is_index_masked, + is_index_global_attn=is_index_global_attn, + is_global_attn=is_global_attn, + output_attentions=output_attentions, + ) + attn_output = self.output(self_outputs[0], hidden_states) + outputs = (attn_output,) + self_outputs[1:] + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertIntermediate +class LongformerIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + 
self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertOutput +class LongformerOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm( + config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class LongformerLayer(nn.Module): + def __init__(self, config, layer_id=0): + super().__init__() + self.attention = LongformerAttention(config, layer_id) + self.intermediate = LongformerIntermediate(config) + self.output = LongformerOutput(config) + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + + def forward( + self, + hidden_states, + attention_mask=None, + layer_head_mask=None, + is_index_masked=None, + is_index_global_attn=None, + is_global_attn=None, + output_attentions=False, + ): + self_attn_outputs = self.attention( + hidden_states, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + is_index_masked=is_index_masked, + is_index_global_attn=is_index_global_attn, + is_global_attn=is_global_attn, + output_attentions=output_attentions, + ) + attn_output = self_attn_outputs[0] + outputs = self_attn_outputs[1:] + + layer_output = apply_chunking_to_forward( + self.ff_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attn_output + ) + outputs = (layer_output,) + outputs + return outputs + + def ff_chunk(self, attn_output): + intermediate_output = self.intermediate(attn_output) + layer_output = self.output(intermediate_output, attn_output) + return layer_output + + +class LongformerEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList( + [LongformerLayer(config, layer_id=i) for i in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + + is_index_masked = attention_mask < 0 + is_index_global_attn = attention_mask > 0 + is_global_attn = is_index_global_attn.flatten().any().item() + + all_hidden_states = () if output_hidden_states else None + # All local attentions. + all_attentions = () if output_attentions else None + all_global_attentions = () if (output_attentions and is_global_attn) else None + + # check if head_mask has a correct number of layers specified if desired + if head_mask is not None: + assert head_mask.size()[0] == ( + len(self.layer) + ), f"The head_mask should be specified for {len(self.layer)} layers, but it is for {head_mask.size()[0]}." 
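+        # Run the stack layer by layer; with `gradient_checkpointing` enabled during training,
+        # each layer's activations are recomputed in the backward pass to trade compute for memory.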
+ for idx, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, is_global_attn, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + head_mask[idx] if head_mask is not None else None, + is_index_masked, + is_index_global_attn, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask=attention_mask, + layer_head_mask=head_mask[idx] if head_mask is not None else None, + is_index_masked=is_index_masked, + is_index_global_attn=is_index_global_attn, + is_global_attn=is_global_attn, + output_attentions=output_attentions, + ) + hidden_states = layer_outputs[0] + + if output_attentions: + # bzs x seq_len x num_attn_heads x (num_global_attn + attention_window_len + 1) => bzs x num_attn_heads x seq_len x (num_global_attn + attention_window_len + 1) + all_attentions = all_attentions + \ + (layer_outputs[1].transpose(1, 2),) + + if is_global_attn: + # bzs x num_attn_heads x num_global_attn x seq_len => bzs x num_attn_heads x seq_len x num_global_attn + all_global_attentions = all_global_attentions + \ + (layer_outputs[2].transpose(2, 3),) + + # Add last layer + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v for v in [hidden_states, all_hidden_states, all_attentions, all_global_attentions] if v is not None + ) + return LongformerBaseModelOutput( + last_hidden_state=hidden_states, + hidden_states=all_hidden_states, + attentions=all_attentions, + global_attentions=all_global_attentions, + ) + + +# Copied from transformers.models.bert.modeling_bert.BertPooler +class LongformerPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +# Copied from transformers.models.roberta.modeling_roberta.RobertaLMHead with Roberta->Longformer +class LongformerLMHead(nn.Module): + """Longformer Head for masked language modeling.""" + + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.layer_norm = nn.LayerNorm( + config.hidden_size, eps=config.layer_norm_eps) + + self.decoder = nn.Linear(config.hidden_size, config.vocab_size) + self.bias = nn.Parameter(torch.zeros(config.vocab_size)) + self.decoder.bias = self.bias + + def forward(self, features, **kwargs): + x = self.dense(features) + x = gelu(x) + x = self.layer_norm(x) + + # project back to size of vocabulary with bias + x = self.decoder(x) + + return x + + def _tie_weights(self): + # To tie those two weights if they get disconnected (on TPU or when the bias is resized) + self.bias = self.decoder.bias + + +class LongformerPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = LongformerConfig + base_model_prefix = "longformer" + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +LONGFORMER_START_DOCSTRING = r""" + + This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic + methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. + + Parameters: + config (:class:`~transformers.LongformerConfig`): Model configuration class with all the parameters of the + model. Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. +""" + +LONGFORMER_INPUTS_DOCSTRING = r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using :class:`~transformers.LongformerTokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are input IDs? <../glossary.html#input-ids>`__ + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): + Mask to decide the attention given on each token, local attention or global attention. Tokens with global + attention attends to all other tokens, and all other tokens attend to them. This is important for + task-specific finetuning because it makes the model more flexible at representing the task. For example, + for classification, the token should be given global attention. For QA, all question tokens should also + have global attention. Please refer to the `Longformer paper `__ for more + details. Mask values selected in ``[0, 1]``: + + - 0 for local attention (a sliding window attention), + - 1 for global attention (tokens that attend to all other tokens, and all other tokens attend to them). + + head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in ``[0, 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. 
+ + decoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in ``[0, 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0, + 1]``: + + - 0 corresponds to a `sentence A` token, + - 1 corresponds to a `sentence B` token. + + `What are token type IDs? <../glossary.html#token-type-ids>`_ + position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. + + `What are position IDs? <../glossary.html#position-ids>`_ + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Longformer Model outputting raw hidden-states without any specific head on top.", + LONGFORMER_START_DOCSTRING, +) +class LongformerModel(LongformerPreTrainedModel): + """ + This class copied code from :class:`~transformers.RobertaModel` and overwrote standard self-attention with + longformer self-attention to provide the ability to process long sequences following the self-attention approach + described in `Longformer: the Long-Document Transformer `__ by Iz Beltagy, + Matthew E. Peters, and Arman Cohan. Longformer self-attention combines a local (sliding window) and global + attention to extend to long documents without the O(n^2) increase in memory and compute. + + The self-attention module :obj:`LongformerSelfAttention` implemented here supports the combination of local and + global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive and + dilated attention are more relevant for autoregressive language modeling than finetuning on downstream tasks. + Future release will add support for autoregressive attention, but the support for dilated attention requires a + custom CUDA kernel to be memory and compute efficient. 
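+
+    In this implementation, sparse (sliding window + global) attention is only used when
+    `config.use_sparse_attention` is True; otherwise the encoder falls back to dense
+    self-attention and the input is not padded to a multiple of `config.attention_window`.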
+ + """ + + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + + if isinstance(config.attention_window, int): + assert config.attention_window % 2 == 0, "`config.attention_window` has to be an even value" + assert config.attention_window > 0, "`config.attention_window` has to be positive" + config.attention_window = [ + config.attention_window] * config.num_hidden_layers # one value per layer + else: + assert len(config.attention_window) == config.num_hidden_layers, ( + "`len(config.attention_window)` should equal `config.num_hidden_layers`. " + f"Expected {config.num_hidden_layers}, given {len(config.attention_window)}" + ) + + self.embeddings = LongformerEmbeddings(config) + self.encoder = LongformerEncoder(config) + self.pooler = LongformerPooler(config) if add_pooling_layer else None + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def _pad_to_window_size( + self, + input_ids: torch.Tensor, + attention_mask: torch.Tensor, + token_type_ids: torch.Tensor, + position_ids: torch.Tensor, + inputs_embeds: torch.Tensor, + pad_token_id: int, + ): + """A helper function to pad tokens and mask to work with implementation of Longformer self-attention.""" + # padding + attention_window = ( + self.config.attention_window + if isinstance(self.config.attention_window, int) + else max(self.config.attention_window) + ) + + assert attention_window % 2 == 0, f"`attention_window` should be an even value. 
Given {attention_window}" + input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape + batch_size, seq_len = input_shape[:2] + + padding_len = (attention_window - seq_len % + attention_window) % attention_window + if padding_len > 0: + logger.info( + f"Input ids are automatically padded from {seq_len} to {seq_len + padding_len} to be a multiple of " + f"`config.attention_window`: {attention_window}" + ) + if input_ids is not None: + input_ids = nn.functional.pad( + input_ids, (0, padding_len), value=pad_token_id) + if position_ids is not None: + # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings + position_ids = nn.functional.pad( + position_ids, (0, padding_len), value=pad_token_id) + if inputs_embeds is not None: + input_ids_padding = inputs_embeds.new_full( + (batch_size, padding_len), + self.config.pad_token_id, + dtype=torch.long, + ) + inputs_embeds_padding = self.embeddings(input_ids_padding) + inputs_embeds = torch.cat( + [inputs_embeds, inputs_embeds_padding], dim=-2) + + attention_mask = nn.functional.pad( + attention_mask, (0, padding_len), value=False + ) # no attention on the padding tokens + token_type_ids = nn.functional.pad( + token_type_ids, (0, padding_len), value=0) # pad with token_type_id = 0 + + return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds + + def _merge_to_attention_mask(self, attention_mask: torch.Tensor, global_attention_mask: torch.Tensor): + # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn) + # (global_attention_mask + 1) => 1 for local attention, 2 for global attention + # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention + if attention_mask is not None: + attention_mask = attention_mask * (global_attention_mask + 1) + else: + # simply use `global_attention_mask` as `attention_mask` + # if no `attention_mask` is given + attention_mask = global_attention_mask + 1 + return attention_mask + + @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=LongformerBaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + token_type_ids=None, + position_ids=None, + inputs_embeds=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + + Returns: + + Examples:: + + >>> import torch + >>> from transformers import LongformerModel, LongformerTokenizer + + >>> model = LongformerModel.from_pretrained('allenai/longformer-base-4096') + >>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096') + + >>> SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000) # long input document + >>> input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0) # batch of size 1 + + >>> attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention + >>> global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to global attention to be deactivated for all tokens + >>> global_attention_mask[:, [1, 4, 21,]] = 1 # Set global attention to random tokens for the sake of this example + ... # Usually, set global attention based on the task. For example, + ... # classification: the token + ... # QA: question tokens + ... 
# LM: potentially on the beginning of sentences and paragraphs + >>> outputs = model(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask) + >>> sequence_output = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output + """ + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError( + "You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError( + "You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + if token_type_ids is None: + token_type_ids = torch.zeros( + input_shape, dtype=torch.long, device=device) + + # merge `global_attention_mask` and `attention_mask` + if global_attention_mask is not None: + attention_mask = self._merge_to_attention_mask( + attention_mask, global_attention_mask) + + if self.config.use_sparse_attention: + padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size( + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + pad_token_id=self.config.pad_token_id, + ) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)[ + :, 0, 0, : + ] + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds + ) + + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler( + sequence_output) if self.pooler is not None else None + + # undo padding + if self.config.use_sparse_attention: + if padding_len > 0: + # unpad `sequence_output` because the calling function is expecting a length == input_ids.size(1) + sequence_output = sequence_output[:, :-padding_len] + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return LongformerBaseModelOutputWithPooling( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + global_attentions=encoder_outputs.global_attentions, + ) + + +@add_start_docstrings("""Longformer Model with a `language modeling` head on top. 
""", LONGFORMER_START_DOCSTRING) +class LongformerForMaskedLM(LongformerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + + self.longformer = LongformerModel(config, add_pooling_layer=False) + self.lm_head = LongformerLMHead(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.lm_head.decoder + + def set_output_embeddings(self, new_embeddings): + self.lm_head.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=LongformerMaskedLMOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + token_type_ids=None, + position_ids=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ..., + config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` + kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): + Used to hide legacy arguments that have been deprecated. + + Returns: + + Examples:: + + >>> import torch + >>> from transformers import LongformerForMaskedLM, LongformerTokenizer + + >>> model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096') + >>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096') + + >>> SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000) # long input document + >>> input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0) # batch of size 1 + + >>> attention_mask = None # default is local attention everywhere, which is a good choice for MaskedLM + ... # check ``LongformerModel.forward`` for more details how to set `attention_mask` + >>> outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids) + >>> loss = outputs.loss + >>> prediction_logits = output.logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.longformer( + input_ids, + attention_mask=attention_mask, + global_attention_mask=global_attention_mask, + head_mask=head_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = outputs[0] + prediction_scores = self.lm_head(sequence_output) + + masked_lm_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + masked_lm_loss = loss_fct( + prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output + + return LongformerMaskedLMOutput( + loss=masked_lm_loss, + logits=prediction_scores, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + global_attentions=outputs.global_attentions, + ) + + +@add_start_docstrings( + """ + Longformer Model transformer with a sequence classification/regression head on top (a linear layer on top of the + pooled output) e.g. 
for GLUE tasks. + """, + LONGFORMER_START_DOCSTRING, +) +class LongformerForSequenceClassification(LongformerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + self.config = config + + self.longformer = LongformerModel(config, add_pooling_layer=False) + self.classifier = LongformerClassificationHead(config) + + self.init_weights() + + @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=LongformerSequenceClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + token_type_ids=None, + position_ids=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if global_attention_mask is None: + logger.info("Initializing global attention on CLS token...") + global_attention_mask = torch.zeros_like(input_ids) + # global attention on cls token + global_attention_mask[:, 0] = 1 + + outputs = self.longformer( + input_ids, + attention_mask=attention_mask, + global_attention_mask=global_attention_mask, + head_mask=head_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = outputs[0] + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + if self.config.problem_type is None: + if self.num_labels == 1: + self.config.problem_type = "regression" + elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): + self.config.problem_type = "single_label_classification" + else: + self.config.problem_type = "multi_label_classification" + + if self.config.problem_type == "regression": + loss_fct = MSELoss() + if self.num_labels == 1: + loss = loss_fct(logits.squeeze(), labels.squeeze()) + else: + loss = loss_fct(logits, labels) + elif self.config.problem_type == "single_label_classification": + loss_fct = CrossEntropyLoss() + loss = loss_fct( + logits.view(-1, self.num_labels), labels.view(-1)) + elif self.config.problem_type == "multi_label_classification": + loss_fct = BCEWithLogitsLoss() + loss = loss_fct(logits, labels) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return LongformerSequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + global_attentions=outputs.global_attentions, + ) + + +class LongformerClassificationHead(nn.Module): + """Head for sentence-level classification tasks.""" + + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, 
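+                             # Illustrative note: `LongformerForSequenceClassification.forward` above
+                             # picks its loss from `config.problem_type`, inferring it when unset:
+                             #   * num_labels == 1                 -> "regression" (MSELoss)
+                             #   * num_labels > 1, integer labels  -> "single_label_classification" (CrossEntropyLoss)
+                             #   * num_labels > 1, float labels    -> "multi_label_classification" (BCEWithLogitsLoss)
+                             # e.g. labels = torch.tensor([1, 0]) selects cross-entropy, while a float
+                             # multi-hot tensor such as torch.tensor([[1., 0., 1.]]) selects BCE-with-logits.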
config.hidden_size) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.out_proj = nn.Linear(config.hidden_size, config.num_labels) + + def forward(self, hidden_states, **kwargs): + # take token (equiv. to [CLS]) + hidden_states = hidden_states[:, 0, :] + hidden_states = self.dropout(hidden_states) + hidden_states = self.dense(hidden_states) + hidden_states = torch.tanh(hidden_states) + hidden_states = self.dropout(hidden_states) + output = self.out_proj(hidden_states) + return output + + +@add_start_docstrings( + """ + Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / + TriviaQA (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). + """, + LONGFORMER_START_DOCSTRING, +) +class LongformerForQuestionAnswering(LongformerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.longformer = LongformerModel(config, add_pooling_layer=False) + self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=LongformerQuestionAnsweringModelOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + token_type_ids=None, + position_ids=None, + inputs_embeds=None, + start_positions=None, + end_positions=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. + end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. 
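+
+            A minimal sketch of how out-of-range gold positions are handled: they are
+            clamped to ``sequence_length`` and then ignored by the span loss, which is
+            built with ``ignore_index=sequence_length``::
+
+                >>> import torch
+                >>> start_logits = torch.randn(1, 10)           # sequence_length == 10
+                >>> ignored_index = start_logits.size(1)
+                >>> torch.tensor([42]).clamp(0, ignored_index)  # out-of-range target
+                tensor([10])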
+ + Returns: + + Examples:: + + >>> from transformers import LongformerTokenizer, LongformerForQuestionAnswering + >>> import torch + + >>> tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa") + >>> model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa") + + >>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet" + >>> encoding = tokenizer(question, text, return_tensors="pt") + >>> input_ids = encoding["input_ids"] + + >>> # default is local attention everywhere + >>> # the forward method will automatically set global attention on question tokens + >>> attention_mask = encoding["attention_mask"] + + >>> outputs = model(input_ids, attention_mask=attention_mask) + >>> start_logits = outputs.start_logits + >>> end_logits = outputs.end_logits + >>> all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) + + >>> answer_tokens = all_tokens[torch.argmax(start_logits) :torch.argmax(end_logits)+1] + >>> answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) # remove space prepending space token + + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if global_attention_mask is None: + if input_ids is None: + logger.warning( + "It is not possible to automatically generate the `global_attention_mask` because input_ids is None. Please make sure that it is correctly set." + ) + else: + # set global attention on question tokens automatically + global_attention_mask = _compute_global_attention_mask( + input_ids, self.config.sep_token_id) + + outputs = self.longformer( + input_ids, + attention_mask=attention_mask, + global_attention_mask=global_attention_mask, + head_mask=head_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1).contiguous() + end_logits = end_logits.squeeze(-1).contiguous() + + total_loss = None + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions = start_positions.clamp(0, ignored_index) + end_positions = end_positions.clamp(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if not return_dict: + output = (start_logits, end_logits) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return LongformerQuestionAnsweringModelOutput( + loss=total_loss, + start_logits=start_logits, + end_logits=end_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + global_attentions=outputs.global_attentions, + ) + + +@add_start_docstrings( + """ + Longformer Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
+ for Named-Entity-Recognition (NER) tasks. + """, + LONGFORMER_START_DOCSTRING, +) +class LongformerForTokenClassification(LongformerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.longformer = LongformerModel(config, add_pooling_layer=False) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=LongformerTokenClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + token_type_ids=None, + position_ids=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels - + 1]``. + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.longformer( + input_ids, + attention_mask=attention_mask, + global_attention_mask=global_attention_mask, + head_mask=head_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + # Only keep active parts of the loss + if attention_mask is not None: + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels) + active_labels = torch.where( + active_loss, labels.view(-1), torch.tensor( + loss_fct.ignore_index).type_as(labels) + ) + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct( + logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return LongformerTokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + global_attentions=outputs.global_attentions, + ) + + +@add_start_docstrings( + """ + Longformer Model with a multiple choice classification head on top (a linear layer on top of the pooled output and + a softmax) e.g. for RocStories/SWAG tasks. 
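+
+    A minimal sketch of the expected input layout (shapes only, no checkpoint needed):
+    choices are encoded as ``(batch_size, num_choices, sequence_length)`` and the
+    forward pass flattens them to ``(batch_size * num_choices, sequence_length)``
+    before scoring each choice with a single linear head::
+
+        >>> import torch
+        >>> batch_size, num_choices, seq_length = 2, 4, 16
+        >>> input_ids = torch.randint(5, 100, (batch_size, num_choices, seq_length))
+        >>> input_ids.view(-1, input_ids.size(-1)).shape   # what the encoder sees internally
+        torch.Size([8, 16])
+        >>> # the returned logits are reshaped back to (batch_size, num_choices)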
+ """, + LONGFORMER_START_DOCSTRING, +) +class LongformerForMultipleChoice(LongformerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.longformer = LongformerModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + self.init_weights() + + @add_start_docstrings_to_model_forward( + LONGFORMER_INPUTS_DOCSTRING.format( + "batch_size, num_choices, sequence_length") + ) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=LongformerMultipleChoiceModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + token_type_ids=None, + attention_mask=None, + global_attention_mask=None, + head_mask=None, + labels=None, + position_ids=None, + inputs_embeds=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the multiple choice classification loss. Indices should be in ``[0, ..., + num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See + :obj:`input_ids` above) + """ + num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # set global attention on question tokens + if global_attention_mask is None and input_ids is not None: + logger.info("Initializing global attention on multiple choice...") + # put global attention on all tokens after `config.sep_token_id` + global_attention_mask = torch.stack( + [ + _compute_global_attention_mask( + input_ids[:, i], self.config.sep_token_id, before_sep_token=False) + for i in range(num_choices) + ], + dim=1, + ) + + flat_input_ids = input_ids.view(-1, input_ids.size(-1) + ) if input_ids is not None else None + flat_position_ids = position_ids.view( + -1, position_ids.size(-1)) if position_ids is not None else None + flat_token_type_ids = token_type_ids.view( + -1, token_type_ids.size(-1)) if token_type_ids is not None else None + flat_attention_mask = attention_mask.view( + -1, attention_mask.size(-1)) if attention_mask is not None else None + flat_global_attention_mask = ( + global_attention_mask.view(-1, global_attention_mask.size(-1)) + if global_attention_mask is not None + else None + ) + flat_inputs_embeds = ( + inputs_embeds.view(-1, inputs_embeds.size(-2), + inputs_embeds.size(-1)) + if inputs_embeds is not None + else None + ) + + outputs = self.longformer( + flat_input_ids, + position_ids=flat_position_ids, + token_type_ids=flat_token_type_ids, + attention_mask=flat_attention_mask, + global_attention_mask=flat_global_attention_mask, + head_mask=head_mask, + inputs_embeds=flat_inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.view(-1, num_choices) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + + if not return_dict: + output = (reshaped_logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return LongformerMultipleChoiceModelOutput( + loss=loss, + logits=reshaped_logits, + hidden_states=outputs.hidden_states, + 
attentions=outputs.attentions, + global_attentions=outputs.global_attentions, + ) diff --git a/fengshen/models/longformer/tokenization_longformer.py b/fengshen/models/longformer/tokenization_longformer.py new file mode 100644 index 0000000000000000000000000000000000000000..16b3452ec7545e39b9ef1de276cf1fe8111a35fa --- /dev/null +++ b/fengshen/models/longformer/tokenization_longformer.py @@ -0,0 +1,16 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from transformers import BertTokenizer as LongformerTokenizer diff --git a/fengshen/models/megatron_t5/__init__.py b/fengshen/models/megatron_t5/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..84f78136331c5ef4975697bc6a77910bba7429bd --- /dev/null +++ b/fengshen/models/megatron_t5/__init__.py @@ -0,0 +1,49 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import TYPE_CHECKING + +from transformers.file_utils import _LazyModule, is_torch_available + + +_import_structure = { + "configuration_megatron_t5": ["T5Config"], + "tokenization_megatron_t5": ["T5Tokenizer"], +} + +if is_torch_available(): + _import_structure["modeling_megatron_t5"] = [ + "T5Model", + "T5EncoderModel", + "T5ForConditionalGeneration" + ] + + +if TYPE_CHECKING: + from .configuration_megatron_t5 import T5Config + from .tokenization_megatron_t5 import T5Tokenizer + + if is_torch_available(): + from .modeling_megatron_t5 import ( + T5Model, + T5EncoderModel, + T5ForConditionalGeneration + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule( + __name__, globals()["__file__"], _import_structure) diff --git a/fengshen/models/megatron_t5/configuration_megatron_t5.py b/fengshen/models/megatron_t5/configuration_megatron_t5.py new file mode 100644 index 0000000000000000000000000000000000000000..18b960e947cfd162d79d6b017fb77e30707c4c2e --- /dev/null +++ b/fengshen/models/megatron_t5/configuration_megatron_t5.py @@ -0,0 +1,255 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" T5 model configuration """ +from collections import OrderedDict +from typing import Any, Dict, Iterable, Mapping, Optional + +from transformers import PreTrainedTokenizer, TensorType + +from transformers import is_torch_available +from transformers.configuration_utils import PretrainedConfig +from transformers.onnx import OnnxConfigWithPast +from transformers.utils import logging + + +logger = logging.get_logger(__name__) + +T5_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "T5-small": "https://huggingface.co/T5-small/resolve/main/config.json", + "T5-base": "https://huggingface.co/T5-base/resolve/main/config.json", + "T5-large": "https://huggingface.co/T5-large/resolve/main/config.json", + "T5-3b": "https://huggingface.co/T5-3b/resolve/main/config.json", + "T5-11b": "https://huggingface.co/T5-11b/resolve/main/config.json", +} + + +class T5Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a :class:`~transformers.T5Model` or a + :class:`~transformers.TFT5Model`. It is used to instantiate a T5 model according to the specified arguments, + defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration + to that of the T5 `T5-small `__ architecture. + + Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model + outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + Arguments: + vocab_size (:obj:`int`, `optional`, defaults to 32128): + Vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the + :obj:`inputs_ids` passed when calling :class:`~transformers.T5Model` or :class:`~transformers.TFT5Model`. + d_model (:obj:`int`, `optional`, defaults to 512): + Size of the encoder layers and the pooler layer. + d_kv (:obj:`int`, `optional`, defaults to 64): + Size of the key, query, value projections per attention head. :obj:`d_kv` has to be equal to :obj:`d_model + // num_heads`. + d_ff (:obj:`int`, `optional`, defaults to 2048): + Size of the intermediate feed forward layer in each :obj:`T5Block`. + num_layers (:obj:`int`, `optional`, defaults to 6): + Number of hidden layers in the Transformer encoder. + num_decoder_layers (:obj:`int`, `optional`): + Number of hidden layers in the Transformer decoder. Will use the same value as :obj:`num_layers` if not + set. + num_heads (:obj:`int`, `optional`, defaults to 8): + Number of attention heads for each attention layer in the Transformer encoder. + relative_attention_num_buckets (:obj:`int`, `optional`, defaults to 32): + The number of buckets to use for each attention layer. + dropout_rate (:obj:`float`, `optional`, defaults to 0.1): + The ratio for all dropout layers. + layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-6): + The epsilon used by the layer normalization layers. + initializer_factor (:obj:`float`, `optional`, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). 
+ feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`): + Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses + the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`. + use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not the model should return the last key/values attentions (not used by all models). + gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): + If True, use gradient checkpointing to save memory at the expense of slower backward pass. + """ + model_type = "T5" + keys_to_ignore_at_inference = ["past_key_values"] + + def __init__( + self, + vocab_size=32128, + d_model=512, + d_kv=64, + d_ff=2048, + num_layers=6, + num_decoder_layers=None, + num_heads=8, + relative_attention_num_buckets=32, + dropout_rate=0.1, + layer_norm_epsilon=1e-5, + initializer_factor=1.0, + feed_forward_proj="gelu", + is_encoder_decoder=True, + use_cache=True, + pad_token_id=0, + eos_token_id=1, + gradient_checkpointing=False, + **kwargs + ): + super().__init__( + pad_token_id=pad_token_id, + eos_token_id=eos_token_id, + is_encoder_decoder=is_encoder_decoder, + **kwargs, + ) + self.vocab_size = vocab_size + self.d_model = d_model + self.d_kv = d_kv + self.d_ff = d_ff + self.num_layers = num_layers + self.num_decoder_layers = ( + num_decoder_layers if num_decoder_layers is not None else self.num_layers + ) # default = symmetry + self.num_heads = num_heads + self.relative_attention_num_buckets = relative_attention_num_buckets + self.dropout_rate = dropout_rate + self.layer_norm_epsilon = layer_norm_epsilon + self.initializer_factor = initializer_factor + self.feed_forward_proj = feed_forward_proj + self.use_cache = use_cache + self.gradient_checkpointing = gradient_checkpointing + + @property + def hidden_size(self): + return self.d_model + + @property + def num_attention_heads(self): + return self.num_heads + + @property + def num_hidden_layers(self): + return self.num_layers + + +class T5OnnxConfig(OnnxConfigWithPast): + @property + def inputs(self) -> Mapping[str, Mapping[int, str]]: + common_inputs = OrderedDict( + [ + ("input_ids", {0: "batch", 1: "encoder_sequence"}), + ("attention_mask", {0: "batch", 1: "encoder_sequence"}), + ("decoder_input_ids", {0: "batch"}), + ("decoder_attention_mask", {0: "batch"}), + ] + ) + + if self.use_past: + for i in range(0, self._config.num_layers): + common_inputs[f"past_key_values.{i}.decoder.key"] = { + 0: "batch", 2: "past_sequence"} + common_inputs[f"past_key_values.{i}.decoder.value"] = { + 0: "batch", 2: "past_sequence"} + common_inputs[f"past_key_values.{i}.encoder.key"] = { + 0: "batch", 2: "past_sequence"} + common_inputs[f"past_key_values.{i}.encoder.value"] = { + 0: "batch", 2: "past_sequence"} + + return common_inputs + + @property + def outputs(self) -> Mapping[str, Mapping[int, str]]: + common_outputs = super().outputs + + if "last_hidden_state" in common_outputs: + common_outputs["last_hidden_state"] = { + 0: "batch", 1: "decoder_sequence"} + + if self.use_past: + for i in range(self._config.num_layers): + common_outputs[f"present.{i}.decoder.key"] = { + 0: "batch", 2: "decoder_sequence"} + common_outputs[f"present.{i}.decoder.value"] = { + 0: "batch", 2: "decoder_sequence"} + common_outputs[f"present.{i}.encoder.key"] = { + 0: "batch", 2: "encoder_sequence"} + common_outputs[f"present.{i}.encoder.value"] = { + 0: "batch", 2: "encoder_sequence"} + + if self.task == "default": + 
common_outputs["encoder_last_hidden_state"] = { + 0: "batch", 2: "encoder_sequence"} + + return common_outputs + + def generate_dummy_inputs( + self, + tokenizer: PreTrainedTokenizer, + batch_size: int = -1, + seq_length: int = -1, + is_pair: bool = False, + framework: Optional[TensorType] = None, + ) -> Mapping[str, Any]: + + # Generate encoder inputs + encoder_inputs = super().generate_dummy_inputs( + tokenizer, batch_size, seq_length, is_pair, framework) + + # Generate decoder inputs + decoder_inputs = super().generate_dummy_inputs( + tokenizer, batch_size, 1, is_pair, framework) + decoder_inputs = {f"decoder_{name}": tensor for name, + tensor in decoder_inputs.items()} + + ordered_inputs = dict(**encoder_inputs, **decoder_inputs) + if self.use_past: + if not is_torch_available(): + raise ValueError( + "Cannot generate dummy past_keys inputs without PyTorch installed.") + else: + import torch + batch = encoder_inputs["input_ids"].shape[0] + encoder_seq_length = encoder_inputs["input_ids"].shape[1] + encoder_shape = ( + batch, + self._config.num_heads, + encoder_seq_length, + self._config.hidden_size // self._config.num_heads, + ) + decoder_shape = (batch, self._config.num_heads, 1, + self._config.hidden_size // self._config.num_heads) + + ordered_inputs["past_key_values"] = [] + for _ in range(self._config.num_layers): + ordered_inputs["past_key_values"].append( + ( + torch.zeros(decoder_shape), + torch.zeros(decoder_shape), + torch.zeros(encoder_shape), + torch.zeros(encoder_shape), + ) + ) + + return ordered_inputs + + @staticmethod + def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]: + if name in ["present", "past_key_values"]: + flatten_output = {} + for idx, t in enumerate(field): + flatten_output[f"{name}.{idx}.decoder.key"] = t[0] + flatten_output[f"{name}.{idx}.decoder.value"] = t[1] + flatten_output[f"{name}.{idx}.encoder.key"] = t[2] + flatten_output[f"{name}.{idx}.encoder.value"] = t[3] + + return flatten_output + + return super().flatten_output_collection_property(name, field) diff --git a/fengshen/models/megatron_t5/modeling_megatron_t5.py b/fengshen/models/megatron_t5/modeling_megatron_t5.py new file mode 100644 index 0000000000000000000000000000000000000000..82ad4fb8126b9a4c0b0bb7debed95b999b5cf097 --- /dev/null +++ b/fengshen/models/megatron_t5/modeling_megatron_t5.py @@ -0,0 +1,2086 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch T5 model. 
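+
+A minimal configuration sketch (import path as exported by ``fengshen.models.megatron_t5``;
+no pretrained checkpoint is assumed)::
+
+    >>> from fengshen.models.megatron_t5 import T5Config
+    >>> config = T5Config(vocab_size=32128, d_model=512, num_layers=6, num_heads=8)
+    >>> config.hidden_size            # property alias of d_model
+    512
+    >>> config.num_attention_heads    # property alias of num_heads
+    8
+    >>> config.feed_forward_proj      # this port defaults to "gelu"
+    'gelu'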
""" + + +import copy +import math +import os +import warnings + +import torch +from torch import nn +from torch.nn import CrossEntropyLoss +from torch.utils.checkpoint import checkpoint + +from transformers.activations import ACT2FN +from transformers.file_utils import ( + DUMMY_INPUTS, + DUMMY_MASK, + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_torch_fx_proxy, + replace_return_docstrings, +) +from transformers.modeling_outputs import ( + BaseModelOutput, + BaseModelOutputWithPastAndCrossAttentions, + Seq2SeqLMOutput, + Seq2SeqModelOutput, +) +from transformers.modeling_utils import PreTrainedModel, find_pruneable_heads_and_indices, prune_linear_layer +from transformers.utils import logging +from transformers.utils.model_parallel_utils import assert_device_map, get_device_map +from .configuration_megatron_t5 import T5Config +import numpy as np + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "T5Config" +_TOKENIZER_FOR_DOC = "T5Tokenizer" +_CHECKPOINT_FOR_DOC = "T5-small" + +#################################################### +# This dict contains ids and associated url +# for the pretrained weights provided with the models +#################################################### +T5_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "T5-small", + "T5-base", + "T5-large", + "T5-3b", + "T5-11b", + # See all T5 models at https://huggingface.co/models?filter=T5 +] + + +#################################################### +# This is a conversion method from TF 1.0 to PyTorch +# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28 +#################################################### + +def load_tf_weights_in_T5(model, config, tf_checkpoint_path): + """Load tf checkpoints in a pytorch model.""" + try: + import re + + import numpy as np + import tensorflow as tf + except ImportError: + logger.error( + "Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions." 
+ ) + raise + tf_path = os.path.abspath(tf_checkpoint_path) + logger.info(f"Converting TensorFlow checkpoint from {tf_path}") + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + tf_weights = {} + for name, shape in init_vars: + logger.info(f"Loading TF weight {name} with shape {shape}") + array = tf.train.load_variable(tf_path, name) + names.append(name) + tf_weights[name] = array + + for txt_name in names: + name = txt_name.split("/") + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if any( + n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", + "AdamWeightDecayOptimizer_1", "global_step"] + for n in name + ): + logger.info(f"Skipping {'/'.join(name)}") + tf_weights.pop(txt_name, None) + continue + if "_slot_" in name[-1]: + logger.info(f"Skipping {'/'.join(name)}") + tf_weights.pop(txt_name, None) + continue + pointer = model + array = tf_weights[txt_name] + + for m_name in name: + if re.fullmatch(r"[A-Za-z]+_\d+", m_name): + scope_names = re.split(r"_(\d+)", m_name) + else: + scope_names = [m_name] + if scope_names[0] in ["kernel", "scale", "embedding"]: + pointer = getattr(pointer, "weight") + elif scope_names[0] == "self_attention": + pointer = getattr(pointer, "layer") + pointer = pointer[0] + elif scope_names[0] == "enc_dec_attention": + pointer = getattr(pointer, "layer") + pointer = pointer[1] + elif scope_names[0] == "dense_relu_dense": + pointer = getattr(pointer, "layer") + pointer = pointer[2] + elif scope_names[0] == "rms_norm": + if hasattr(pointer, "layer_norm"): + pointer = getattr(pointer, "layer_norm") + elif hasattr(pointer, "final_layer_norm"): + pointer = getattr(pointer, "final_layer_norm") + elif scope_names[0] == "scale": + pointer = getattr(pointer, "weight") + elif scope_names[0] == "output_bias" or scope_names[0] == "beta": + pointer = getattr(pointer, "bias") + elif scope_names[0] == "squad": + pointer = getattr(pointer, "classifier") + elif scope_names[0] == "decoder" and name[1] == "logits": + continue + elif scope_names[0] == "logits": + pointer = getattr(pointer, "lm_head") + elif scope_names[0] == "wi" and len(scope_names) > 1 and scope_names[1].isdigit(): + pointer = getattr(pointer, f"wi_{scope_names[1]}") + continue + else: + try: + pointer = getattr(pointer, scope_names[0]) + except AttributeError: + logger.info(f"Skipping {'/'.join(name)}") + continue + if len(scope_names) >= 2: + num = int(scope_names[1]) + pointer = pointer[num] + if scope_names[0] not in ["kernel", "scale", "embedding"]: + pointer = getattr(pointer, "weight") + if scope_names[0] != "embedding": + logger.info( + f"Transposing numpy weight of shape {array.shape} for {name}") + array = np.transpose(array) + try: + assert ( + pointer.shape == array.shape + ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched" + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info(f"Initialize PyTorch weight {name}") + pointer.data = torch.from_numpy(array.astype(np.float32)) + tf_weights.pop(txt_name, None) + + logger.info( + f"Weights not copied to PyTorch model: {', '.join(tf_weights.keys())}.") + return model + + +#################################################### +# PyTorch Models are constructed by sub-classing +# - torch.nn.Module for the layers and +# - PreTrainedModel for the models (it-self a sub-class of nn.Module) +#################################################### +PARALLELIZE_DOCSTRING = 
r""" + This is an experimental feature and is a subject to change at a moment's notice. + + Uses a device map to distribute attention modules of the model across several devices. If no device map is given, + it will evenly distribute blocks across all devices. + + Args: + device_map (:obj:`Dict[int, list]`, optional, defaults to None): + A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always + automatically mapped to the first device (for esoteric reasons). That means that the first device should + have fewer attention modules mapped to it than other devices. For reference, the T5 models have the + following number of attention modules: + + - T5-small: 6 + - T5-base: 12 + - T5-large: 24 + - T5-3b: 24 + - T5-11b: 24 + + Example:: + + # Here is an example of a device map on a machine with 4 GPUs using T5-3b, + # which has a total of 24 attention modules: + model = T5ForConditionalGeneration.from_pretrained('T5-3b') + device_map = {0: [0, 1, 2], + + 1: [3, 4, 5, 6, 7, 8, 9], + 2: [10, 11, 12, 13, 14, 15, 16], + 3: [17, 18, 19, 20, 21, 22, 23]} + model.parallelize(device_map) +""" +DEPARALLELIZE_DOCSTRING = r""" + Moves the model to cpu from a model parallel state. + + Example:: + + # On a 4 GPU machine with T5-3b: + model = T5ForConditionalGeneration.from_pretrained('T5-3b') + device_map = {0: [0, 1, 2], + + 1: [3, 4, 5, 6, 7, 8, 9], + 2: [10, 11, 12, 13, 14, 15, 16], + 3: [17, 18, 19, 20, 21, 22, 23]} + model.parallelize(device_map) # Splits the model across several devices + model.deparallelize() # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache() +""" + + +class T5LayerNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-6): + """ + Construct a layernorm module in the T5 style No bias and no subtraction of mean. 
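+
+        A minimal numeric sketch of the scaling applied in ``forward`` (with the
+        weight at its initial value of ones this is a pure RMS normalisation)::
+
+            >>> import torch
+            >>> x = torch.tensor([[3.0, 4.0]])
+            >>> variance = x.pow(2).mean(-1, keepdim=True)   # mean of squares, no mean subtraction
+            >>> x * torch.rsqrt(variance + 1e-6)
+            tensor([[0.8485, 1.1314]])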
+ """ + super().__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.variance_epsilon = eps + + def forward(self, hidden_states): + # layer norm should always be calculated in float32 + variance = hidden_states.to(torch.float32).pow( + 2).mean(-1, keepdim=True) + hidden_states = hidden_states * \ + torch.rsqrt(variance + self.variance_epsilon) + + # convert into float16 if necessary + if self.weight.dtype == torch.float16: + hidden_states = hidden_states.to(torch.float16) + return self.weight * hidden_states + + +class T5DenseReluDense(nn.Module): + def __init__(self, config): + super().__init__() + # @IDEA modified -> bias=False -> bias=True + self.wi = nn.Linear(config.d_model, config.d_ff, bias=True) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=True) + self.dropout = nn.Dropout(config.dropout_rate) + + def forward(self, hidden_states): + hidden_states = self.wi(hidden_states) + hidden_states = nn.functional.relu(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + + +class T5DenseGeluDense(nn.Module): + def __init__(self, config): + super().__init__() + # @IDEA modified -> bias=False -> bias=True + self.wi = nn.Linear(config.d_model, config.d_ff, bias=True) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=True) + self.dropout = nn.Dropout(config.dropout_rate) + + def forward(self, hidden_states): + hidden_states = self.wi(hidden_states) + hidden_states = nn.functional.gelu(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + + +class T5DenseGatedGeluDense(nn.Module): + def __init__(self, config): + super().__init__() + # @IDEA modified -> bias=False -> bias=True + self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=True) + self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=True) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=True) + self.dropout = nn.Dropout(config.dropout_rate) + self.gelu_act = ACT2FN["gelu_new"] + + def forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + + +class T5LayerFF(nn.Module): + def __init__(self, config): + super().__init__() + # @IDEA modified -> T5LayerNorm -> nn.LayerNorm + # self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) + self.layer_norm = nn.LayerNorm( + config.d_model, eps=config.layer_norm_epsilon) + if config.feed_forward_proj == "relu": + self.DenseReluDense = T5DenseReluDense(config) + elif config.feed_forward_proj == "gelu": + self.DenseReluDense = T5DenseGeluDense(config) + else: + raise ValueError( + f"{self.config.feed_forward_proj} is not supported. 
Choose between `relu` and `gated-gelu`" + ) + self.dropout = nn.Dropout(config.dropout_rate) + + def forward(self, hidden_states): + forwarded_states = self.layer_norm(hidden_states) + forwarded_states = self.DenseReluDense(forwarded_states) + hidden_states = hidden_states + self.dropout(forwarded_states) + return hidden_states + + +class T5Attention(nn.Module): + def __init__(self, config: T5Config, has_relative_attention_bias=False): + super().__init__() + self.is_decoder = config.is_decoder + self.has_relative_attention_bias = has_relative_attention_bias + + self.relative_attention_num_buckets = config.relative_attention_num_buckets + self.d_model = config.d_model + self.key_value_proj_dim = config.d_kv + self.n_heads = config.num_heads + self.dropout = config.dropout_rate + self.inner_dim = self.n_heads * self.key_value_proj_dim + + # Mesh TensorFlow initialization to avoid scaling before softmax + # @IDEA modified -> bias=False -> bias=True + + self.q = nn.Linear(self.d_model, self.inner_dim, bias=True) + self.k = nn.Linear(self.d_model, self.inner_dim, bias=True) + self.v = nn.Linear(self.d_model, self.inner_dim, bias=True) + + self.o = nn.Linear(self.inner_dim, self.d_model, bias=True) + + if self.has_relative_attention_bias: + self.relative_attention_bias = nn.Embedding( + self.relative_attention_num_buckets, self.n_heads) + self.pruned_heads = set() + self.gradient_checkpointing = False + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.n_heads, self.key_value_proj_dim, self.pruned_heads + ) + # Prune linear layers + self.q = prune_linear_layer(self.q, index) + self.k = prune_linear_layer(self.k, index) + self.v = prune_linear_layer(self.v, index) + + self.o = prune_linear_layer(self.o, index, dim=1) + # Update hyper params + self.n_heads = self.n_heads - len(heads) + self.inner_dim = self.key_value_proj_dim * self.n_heads + self.pruned_heads = self.pruned_heads.union(heads) + + @staticmethod + def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128): + """ + Adapted from Mesh Tensorflow: + https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593 + + Translate relative position to a bucket number for relative attention. The relative position is defined as + memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to + position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for + small absolute relative_position and larger buckets for larger absolute relative_positions. All relative + positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket. 
+ This should allow for more graceful generalization to longer sequences than the model has been trained on + + Args: + relative_position: an int32 Tensor + bidirectional: a boolean - whether the attention is bidirectional + num_buckets: an integer + max_distance: an integer + + Returns: + a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets) + """ + relative_buckets = 0 + if bidirectional: + num_buckets //= 2 + relative_buckets += (relative_position > + 0).to(torch.long) * num_buckets + relative_position = torch.abs(relative_position) + else: + relative_position = - \ + torch.min(relative_position, + torch.zeros_like(relative_position)) + # now relative_position is in the range [0, inf) + + # half of the buckets are for exact increments in positions + max_exact = num_buckets // 2 + is_small = relative_position < max_exact + + # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance + relative_postion_if_large = max_exact + ( + torch.log(relative_position.float() / max_exact) + / math.log(max_distance / max_exact) + * (num_buckets - max_exact) + ).to(torch.long) + relative_postion_if_large = torch.min( + relative_postion_if_large, torch.full_like( + relative_postion_if_large, num_buckets - 1) + ) + + relative_buckets += torch.where(is_small, + relative_position, relative_postion_if_large) + return relative_buckets + + def compute_bias(self, query_length, key_length): + """Compute binned relative position bias""" + context_position = torch.arange( + query_length, dtype=torch.long, device=self.relative_attention_bias.weight.device + )[:, None] + memory_position = torch.arange( + key_length, dtype=torch.long, device=self.relative_attention_bias.weight.device + )[None, :] + relative_position = memory_position - \ + context_position # shape (query_length, key_length) + relative_position_bucket = self._relative_position_bucket( + relative_position, # shape (query_length, key_length) + bidirectional=(not self.is_decoder), + num_buckets=self.relative_attention_num_buckets, + ) + # shape (query_length, key_length, num_heads) + values = self.relative_attention_bias(relative_position_bucket) + # shape (1, num_heads, query_length, key_length) + values = values.permute([2, 0, 1]).unsqueeze(0) + return values + + def forward( + self, + hidden_states, + mask=None, + key_value_states=None, + position_bias=None, + past_key_value=None, + layer_head_mask=None, + query_length=None, + use_cache=False, + output_attentions=False, + ): + """ + Self-attention (if key_value_states is None) or attention over source sentence (provided by key_value_states). + """ + # Input is (batch_size, seq_length, dim) + # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length) + # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head) + batch_size, seq_length = hidden_states.shape[:2] + + real_seq_length = seq_length + + if past_key_value is not None: + assert ( + len(past_key_value) == 2 + ), f"past_key_value should have 2 past states: keys and values. 
Got { len(past_key_value)} past states" + real_seq_length += past_key_value[0].shape[2] if query_length is None else query_length + + key_length = real_seq_length if key_value_states is None else key_value_states.shape[ + 1] + + def shape(states): + """projection""" + return states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2) + + def unshape(states): + """reshape""" + return states.transpose(1, 2).contiguous().view(batch_size, -1, self.inner_dim) + + def project(hidden_states, proj_layer, key_value_states, past_key_value): + """projects hidden states correctly to key/query states""" + if key_value_states is None: + # self-attn + # (batch_size, n_heads, seq_length, dim_per_head) + hidden_states = shape(proj_layer(hidden_states)) + elif past_key_value is None: + # cross-attn + # (batch_size, n_heads, seq_length, dim_per_head) + hidden_states = shape(proj_layer(key_value_states)) + + if past_key_value is not None: + if key_value_states is None: + # self-attn + # (batch_size, n_heads, key_length, dim_per_head) + hidden_states = torch.cat( + [past_key_value, hidden_states], dim=2) + else: + # cross-attn + hidden_states = past_key_value + return hidden_states + + # get query states + # (batch_size, n_heads, seq_length, dim_per_head) + query_states = shape(self.q(hidden_states)) + + # get key/value states + key_states = project( + hidden_states, self.k, key_value_states, past_key_value[ + 0] if past_key_value is not None else None + ) + value_states = project( + hidden_states, self.v, key_value_states, past_key_value[ + 1] if past_key_value is not None else None + ) + + # compute scores + scores = torch.matmul( + query_states, key_states.transpose(3, 2) + ) # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9 + + if position_bias is None: + if not self.has_relative_attention_bias: + position_bias = torch.zeros( + (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype + ) + if self.gradient_checkpointing and self.training: + position_bias.requires_grad = True + else: + position_bias = self.compute_bias(real_seq_length, key_length) + + # if key and values are already calculated + # we want only the last query position bias + if past_key_value is not None: + position_bias = position_bias[:, :, -hidden_states.size(1):, :] + + if mask is not None: + # (batch_size, n_heads, seq_length, key_length) + position_bias = position_bias + mask + + # @IDEA modified -> delete scores += position_bias, use absolute positional + # scores += position_bias + scores = scores / math.sqrt(self.key_value_proj_dim) + + if mask is not None: + scores = scores + mask + + attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as( + scores + ) # (batch_size, n_heads, seq_length, key_length) + + attn_weights = nn.functional.dropout( + attn_weights, p=0, training=self.training + ) # (batch_size, n_heads, seq_length, key_length) + + # Mask heads if we want to + if layer_head_mask is not None: + attn_weights = attn_weights * layer_head_mask + + # (batch_size, seq_length, dim) + attn_output = unshape(torch.matmul(attn_weights, value_states)) + + attn_output = self.o(attn_output) + + present_key_value_state = (key_states, value_states) if ( + self.is_decoder and use_cache) else None + outputs = (attn_output,) + \ + (present_key_value_state,) + (position_bias,) + + if output_attentions: + outputs = outputs + (attn_weights,) + return outputs + + +class T5LayerSelfAttention(nn.Module): + def __init__(self, config, 
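+                 # Illustrative note on the attention variant implemented in `T5Attention`
+                 # above, as flagged by its @IDEA comments: the q/k/v/o projections carry
+                 # biases, scores are scaled by 1/sqrt(d_kv) instead of folding the scale
+                 # into the weight init, and the relative position bias term is no longer
+                 # added to the scores; absolute position embeddings (see `T5Embeddings`
+                 # below) take over that role.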
has_relative_attention_bias=False): + super().__init__() + + # @IDEA modified -> T5LayerNorm -> nn.LayerNorm + # self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) + self.layer_norm = nn.LayerNorm( + config.d_model, eps=config.layer_norm_epsilon) + self.SelfAttention = T5Attention( + config, has_relative_attention_bias=has_relative_attention_bias) + + self.dropout = nn.Dropout(config.dropout_rate) + + def forward( + self, + hidden_states, + attention_mask=None, + position_bias=None, + layer_head_mask=None, + past_key_value=None, + use_cache=False, + output_attentions=False, + ): + normed_hidden_states = self.layer_norm(hidden_states) + attention_output = self.SelfAttention( + normed_hidden_states, + mask=attention_mask, + position_bias=position_bias, + layer_head_mask=layer_head_mask, + past_key_value=past_key_value, + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = hidden_states + self.dropout(attention_output[0]) + # add attentions if we output them + outputs = (hidden_states,) + attention_output[1:] + return outputs + + +class T5LayerCrossAttention(nn.Module): + def __init__(self, config): + super().__init__() + # @IDEA modified -> T5LayerNorm -> nn.LayerNorm + # self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) + self.layer_norm = nn.LayerNorm( + config.d_model, eps=config.layer_norm_epsilon) + + self.EncDecAttention = T5Attention( + config, has_relative_attention_bias=False) + + self.dropout = nn.Dropout(config.dropout_rate) + + def forward( + self, + hidden_states, + key_value_states, + attention_mask=None, + position_bias=None, + layer_head_mask=None, + past_key_value=None, + use_cache=False, + query_length=None, + output_attentions=False, + ): + normed_hidden_states = self.layer_norm(hidden_states) + attention_output = self.EncDecAttention( + normed_hidden_states, + mask=attention_mask, + key_value_states=key_value_states, + position_bias=position_bias, + layer_head_mask=layer_head_mask, + past_key_value=past_key_value, + use_cache=use_cache, + query_length=query_length, + output_attentions=output_attentions, + ) + layer_output = hidden_states + self.dropout(attention_output[0]) + # add attentions if we output them + outputs = (layer_output,) + attention_output[1:] + return outputs + + +class T5Block(nn.Module): + def __init__(self, config, has_relative_attention_bias=False): + super().__init__() + self.is_decoder = config.is_decoder + # @IDEA modified -> + # self.layer = nn.ModuleList() + # self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias)) + # if self.is_decoder: + # self.layer.append(T5LayerCrossAttention(config)) + + # self.layer.append(T5LayerFF(config)) + + self.T5LayerSelfAttention = T5LayerSelfAttention( + config, has_relative_attention_bias=has_relative_attention_bias) + if self.is_decoder: + self.T5LayerCrossAttention = T5LayerCrossAttention( + config) + self.T5LayerFF = T5LayerFF(config) + + def forward( + self, + hidden_states, + attention_mask=None, + position_bias=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + encoder_decoder_position_bias=None, + layer_head_mask=None, + cross_attn_layer_head_mask=None, + past_key_value=None, + use_cache=False, + output_attentions=False, + return_dict=True, + ): + + if past_key_value is not None: + assert self.is_decoder, "Only decoder can use `past_key_values`" + expected_num_past_key_values = 2 if encoder_hidden_states is None else 4 + + if len(past_key_value) != 
expected_num_past_key_values: + raise ValueError( + f"There should be {expected_num_past_key_values} past states. " + f"{'2 (past / key) for cross attention. ' if expected_num_past_key_values == 4 else ''}" + f"Got {len(past_key_value)} past key / value states" + ) + + self_attn_past_key_value = past_key_value[:2] + cross_attn_past_key_value = past_key_value[2:] + else: + self_attn_past_key_value, cross_attn_past_key_value = None, None + + # @IDEA modified -> self.layer[0] -> self.T5LayerSelfAttention + self_attention_outputs = self.T5LayerSelfAttention( + hidden_states, + attention_mask=attention_mask, + position_bias=position_bias, + layer_head_mask=layer_head_mask, + past_key_value=self_attn_past_key_value, + use_cache=use_cache, + output_attentions=output_attentions, + ) + hidden_states, present_key_value_state = self_attention_outputs[:2] + # Keep self-attention outputs and relative position weights + attention_outputs = self_attention_outputs[2:] + + # clamp inf values to enable fp16 training + if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): + clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + hidden_states = torch.clamp( + hidden_states, min=-clamp_value, max=clamp_value) + + do_cross_attention = self.is_decoder and encoder_hidden_states is not None + if do_cross_attention: + # the actual query length is unknown for cross attention + # if using past key value states. Need to inject it here + if present_key_value_state is not None: + query_length = present_key_value_state[0].shape[2] + else: + query_length = None + # @IDEA modified -> self.layer[1] -> self.T5LayerCrossAttention + cross_attention_outputs = self.T5LayerCrossAttention( + hidden_states, + key_value_states=encoder_hidden_states, + attention_mask=encoder_attention_mask, + position_bias=encoder_decoder_position_bias, + layer_head_mask=cross_attn_layer_head_mask, + past_key_value=cross_attn_past_key_value, + query_length=query_length, + use_cache=use_cache, + output_attentions=output_attentions, + ) + hidden_states = cross_attention_outputs[0] + + # clamp inf values to enable fp16 training + if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): + clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + hidden_states = torch.clamp( + hidden_states, min=-clamp_value, max=clamp_value) + + # Combine self attn and cross attn key value states + if present_key_value_state is not None: + present_key_value_state = present_key_value_state + \ + cross_attention_outputs[1] + + # Keep cross-attention outputs and relative position weights + attention_outputs = attention_outputs + cross_attention_outputs[2:] + + # Apply Feed Forward layer + # @IDEA modified -> self.layer[-1] -> self.T5LayerFF + hidden_states = self.T5LayerFF(hidden_states) + + # clamp inf values to enable fp16 training + if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): + clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + hidden_states = torch.clamp( + hidden_states, min=-clamp_value, max=clamp_value) + + outputs = (hidden_states,) + + if use_cache: + outputs = outputs + (present_key_value_state,) + attention_outputs + else: + outputs = outputs + attention_outputs + + # hidden-states, present_key_value_states, (self-attention position bias), + # (self-attention weights), (cross-attention position bias), (cross-attention weights) + return outputs + + +class T5PreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface 
for downloading and loading pretrained + models. + """ + + config_class = T5Config + load_tf_weights = load_tf_weights_in_T5 + base_model_prefix = "transformer" + is_parallelizable = True + supports_gradient_checkpointing = True + + @property + def dummy_inputs(self): + input_ids = torch.tensor(DUMMY_INPUTS) + input_mask = torch.tensor(DUMMY_MASK) + dummy_inputs = { + "decoder_input_ids": input_ids, + "input_ids": input_ids, + "decoder_attention_mask": input_mask, + } + return dummy_inputs + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_factor # Used for testing weights initialization + if isinstance(module, T5LayerNorm): + module.weight.data.fill_(factor * 1.0) + elif isinstance(module, (T5Model, T5ForConditionalGeneration, T5EncoderModel)): + # Mesh TensorFlow embeddings initialization + # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d + # /mesh_tensorflow/layers.py#L1624 + # @IDEA modified -> module.shared.weight -> module.shared.word_embeddings.weight + # module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0) + module.shared.word_embeddings.weight.data.normal_( + mean=0.0, std=factor * 1.0) + module.shared.position_embeddings.weight.data.normal_( + mean=0.0, std=factor * 1.0) + elif isinstance(module, T5DenseReluDense): + # Mesh TensorFlow FF initialization + # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow + # /transformer/transformer_layers.py#L56 + # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/ + # mesh_tensorflow/layers.py#L89 + module.wi.weight.data.normal_( + mean=0.0, std=factor * ((self.config.d_model) ** -0.5)) + if hasattr(module.wi, "bias") and module.wi.bias is not None: + module.wi.bias.data.zero_() + module.wo.weight.data.normal_( + mean=0.0, std=factor * ((self.config.d_ff) ** -0.5)) + if hasattr(module.wo, "bias") and module.wo.bias is not None: + module.wo.bias.data.zero_() + elif isinstance(module, T5DenseGeluDense): + module.wi_0.weight.data.normal_( + mean=0.0, std=factor * ((self.config.d_model) ** -0.5)) + if hasattr(module.wi_0, "bias") and module.wi_0.bias is not None: + module.wi_0.bias.data.zero_() + module.wi_1.weight.data.normal_( + mean=0.0, std=factor * ((self.config.d_model) ** -0.5)) + if hasattr(module.wi, "bias") and module.wi.bias is not None: + module.wi.bias.data.zero_() + module.wo.weight.data.normal_( + mean=0.0, std=factor * ((self.config.d_ff) ** -0.5)) + if hasattr(module.wo, "bias") and module.wo.bias is not None: + module.wo.bias.data.zero_() + elif isinstance(module, T5Attention): + # Mesh TensorFlow attention initialization to avoid scaling before softmax + # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d + # /mesh_tensorflow/transformer/attention.py#L136 + d_model = self.config.d_model + key_value_proj_dim = self.config.d_kv + n_heads = self.config.num_heads + module.q.weight.data.normal_( + mean=0.0, std=factor * ((d_model * key_value_proj_dim) ** -0.5)) + module.k.weight.data.normal_( + mean=0.0, std=factor * (d_model ** -0.5)) + module.v.weight.data.normal_( + mean=0.0, std=factor * (d_model ** -0.5)) + + module.o.weight.data.normal_( + mean=0.0, std=factor * ((n_heads * key_value_proj_dim) ** -0.5)) + if module.has_relative_attention_bias: + module.relative_attention_bias.weight.data.normal_( + mean=0.0, std=factor * ((d_model) ** -0.5)) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, (T5Attention, 
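+                                   # A worked sketch of `_shift_right` (defined just below), assuming the
+                                   # usual T5 setup where decoder_start_token_id == pad_token_id == 0:
+                                   #   labels                        [[ 8, 15, -100, -100]]
+                                   #   after the right shift         [[ 0,  8,   15, -100]]
+                                   #   after -100 -> pad_token_id    [[ 0,  8,   15,    0]]
+                                   # i.e. the tensor that is typically fed to the decoder as
+                                   # `decoder_input_ids` when only `labels` are provided.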
T5Stack)): + module.gradient_checkpointing = value + + def _shift_right(self, input_ids): + decoder_start_token_id = self.config.decoder_start_token_id + pad_token_id = self.config.pad_token_id + + assert ( + decoder_start_token_id is not None + ), "self.model.config.decoder_start_token_id has to be defined. "\ + "In T5 it is usually set to the pad_token_id. See T5 docs for more information" + + # shift inputs to the right + if is_torch_fx_proxy(input_ids): + # Item assignment is not supported natively for proxies. + shifted_input_ids = torch.full( + input_ids.shape[:-1] + (1,), decoder_start_token_id) + shifted_input_ids = torch.cat( + [shifted_input_ids, input_ids[..., :-1]], dim=-1) + else: + shifted_input_ids = input_ids.new_zeros(input_ids.shape) + shifted_input_ids[..., 1:] = input_ids[..., :-1].clone() + shifted_input_ids[..., 0] = decoder_start_token_id + + assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined." + # replace possible -100 values in labels by `pad_token_id` + shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id) + + assert torch.all(shifted_input_ids >= 0).item( + ), "Verify that `shifted_input_ids` has only positive values" + + return shifted_input_ids + + +class T5Embeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding( + config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding( + config.max_position_embeddings, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + + # In Megatron, layer-norm is applied after the 1st dropout. + # self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.dropout_rate) + + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.register_buffer("position_ids", torch.arange( + config.max_position_embeddings).expand((1, -1))) + self.position_embedding_type = getattr( + config, "position_embedding_type", "absolute") + + def forward( + self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0 + ): + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, + past_key_values_length: seq_length + past_key_values_length] + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + + embeddings = inputs_embeds + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + + # Megatron BERT moves that layer norm after the drop-out (and to each layer). 
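+        # NOTE: unlike the upstream Hugging Face T5, which relies solely on relative
+        # position bias inside the attention layers, this Megatron-style variant adds
+        # learned absolute position embeddings here and leaves layer norm to the
+        # individual layers, so only dropout is applied below.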
+ # embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class T5Stack(T5PreTrainedModel): + def __init__(self, config, embed_tokens=None): + super().__init__(config) + + self.embed_tokens = embed_tokens + self.is_decoder = config.is_decoder + + # @IDEA modified -> has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers) + # -> has_relative_attention_bias=False + self.block = nn.ModuleList( + [T5Block(config, has_relative_attention_bias=False) + for _ in range(config.num_layers)] + ) + # @IDEA modified -> T5LayerNorm -> nn.LayerNorm + # self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon) + self.final_layer_norm = nn.LayerNorm( + config.d_model, eps=config.layer_norm_epsilon) + + self.dropout = nn.Dropout(config.dropout_rate) + + self.init_weights() + # Model parallel + self.model_parallel = False + self.device_map = None + self.gradient_checkpointing = False + + @add_start_docstrings(PARALLELIZE_DOCSTRING) + def parallelize(self, device_map=None): + # Check validity of device_map + self.device_map = ( + get_device_map(len(self.block), range( + torch.cuda.device_count())) if device_map is None else device_map + ) + assert_device_map(self.device_map, len(self.block)) + self.model_parallel = True + self.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + \ + str(min(self.device_map.keys())) + self.last_device = "cuda:" + str(max(self.device_map.keys())) + # Load onto devices + for k, v in self.device_map.items(): + for layer in v: + cuda_device = "cuda:" + str(k) + self.block[layer] = self.block[layer].to(cuda_device) + + # Set embed_tokens to first layer + + self.embed_tokens = self.embed_tokens.to(self.first_device) + self.embeddings = self.embeddings.to(self.first_device) + # Set final layer norm to last device + self.final_layer_norm = self.final_layer_norm.to(self.last_device) + + @add_start_docstrings(PARALLELIZE_DOCSTRING) + def deparallelize(self): + self.model_parallel = False + self.device_map = None + self.first_device = "cpu" + self.last_device = "cpu" + for i in range(len(self.block)): + self.block[i] = self.block[i].to("cpu") + self.embed_tokens = self.embed_tokens.to("cpu") + self.final_layer_norm = self.final_layer_norm.to("cpu") + torch.cuda.empty_cache() + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, new_embeddings): + self.embed_tokens = new_embeddings + + def forward( + self, + input_ids=None, + position_ids=None, + attention_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + inputs_embeds=None, + head_mask=None, + cross_attn_head_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + # Model parallel + if self.model_parallel: + torch.cuda.set_device(self.first_device) + self.embed_tokens = self.embed_tokens.to(self.first_device) + use_cache = use_cache if use_cache is not None else self.config.use_cache + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + err_msg_prefix = "decoder_" if self.is_decoder else "" + raise ValueError( + f"You cannot specify both 
{err_msg_prefix}input_ids and {err_msg_prefix}inputs_embeds at the same time" + ) + elif input_ids is not None: + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + err_msg_prefix = "decoder_" if self.is_decoder else "" + raise ValueError( + f"You have to specify either {err_msg_prefix}input_ids or {err_msg_prefix}inputs_embeds") + + if inputs_embeds is None: + assert self.embed_tokens is not None, "You have to initialize the model with valid token embeddings" + # @IDEA modified -> self.embed_tokens(input_ids=input_ids) -> + # self.embed_tokens(input_ids=input_ids,osition_ids=position_ids,) + # inputs_embeds = self.embed_tokens(input_ids=input_ids) + inputs_embeds = self.embed_tokens(input_ids=input_ids) + + batch_size, seq_length = input_shape + + # required mask seq length can be calculated via length of past + mask_seq_length = past_key_values[0][0].shape[2] + \ + seq_length if past_key_values is not None else seq_length + + if use_cache is True: + assert self.is_decoder, f":obj:`use_cache` can only be set to `True` if {self} is used as a decoder" + + if attention_mask is None: + attention_mask = torch.ones( + batch_size, mask_seq_length).to(inputs_embeds.device) + if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None: + encoder_seq_length = encoder_hidden_states.shape[1] + encoder_attention_mask = torch.ones( + batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long + ) + + # initialize past_key_values with `None` if past does not exist + if past_key_values is None: + past_key_values = [None] * len(self.block) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
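+        # get_extended_attention_mask (inherited from PreTrainedModel) broadcasts the
+        # mask to [batch_size, 1, seq_length, seq_length] for decoders (combining it
+        # with the causal mask) or [batch_size, 1, 1, seq_length] for encoders, and
+        # turns 1/0 entries into additive biases of 0 / a large negative value.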
+ extended_attention_mask = self.get_extended_attention_mask( + attention_mask, input_shape, inputs_embeds.device) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.is_decoder and encoder_hidden_states is not None: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() + encoder_hidden_shape = ( + encoder_batch_size, encoder_sequence_length) + if encoder_attention_mask is None: + encoder_attention_mask = torch.ones( + encoder_hidden_shape, device=inputs_embeds.device) + encoder_extended_attention_mask = self.invert_attention_mask( + encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + head_mask = self.get_head_mask(head_mask, self.config.num_layers) + cross_attn_head_mask = self.get_head_mask( + cross_attn_head_mask, self.config.num_layers) + present_key_value_states = () if use_cache else None + all_hidden_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + all_cross_attentions = () if (output_attentions and self.is_decoder) else None + position_bias = None + encoder_decoder_position_bias = None + + hidden_states = self.dropout(inputs_embeds) + + for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)): + + layer_head_mask = head_mask[i] + cross_attn_layer_head_mask = cross_attn_head_mask[i] + # Model parallel + if self.model_parallel: + torch.cuda.set_device(hidden_states.device) + # Ensure that attention_mask is always on the same device as hidden_states + if attention_mask is not None: + attention_mask = attention_mask.to(hidden_states.device) + if position_bias is not None: + position_bias = position_bias.to(hidden_states.device) + if encoder_hidden_states is not None: + encoder_hidden_states = encoder_hidden_states.to( + hidden_states.device) + if encoder_extended_attention_mask is not None: + encoder_extended_attention_mask = encoder_extended_attention_mask.to( + hidden_states.device) + if encoder_decoder_position_bias is not None: + encoder_decoder_position_bias = encoder_decoder_position_bias.to( + hidden_states.device) + if layer_head_mask is not None: + layer_head_mask = layer_head_mask.to(hidden_states.device) + if cross_attn_layer_head_mask is not None: + cross_attn_layer_head_mask = cross_attn_layer_head_mask.to( + hidden_states.device) + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warn( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + + def create_custom_forward(module): + def custom_forward(*inputs): + return tuple(module(*inputs, use_cache, output_attentions)) + + return custom_forward + + layer_outputs = checkpoint( + create_custom_forward(layer_module), + hidden_states, + extended_attention_mask, + position_bias, + encoder_hidden_states, + encoder_extended_attention_mask, + encoder_decoder_position_bias, + layer_head_mask, + cross_attn_layer_head_mask, + None, # past_key_value is always None with gradient checkpointing + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask=extended_attention_mask, + position_bias=position_bias, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + encoder_decoder_position_bias=encoder_decoder_position_bias, + layer_head_mask=layer_head_mask, + cross_attn_layer_head_mask=cross_attn_layer_head_mask, + past_key_value=past_key_value, + use_cache=use_cache, + output_attentions=output_attentions, + ) + + # layer_outputs is a tuple with: + # hidden-states, key-value-states, (self-attention position bias), (self-attention weights), + # (cross-attention position bias), (cross-attention weights) + if use_cache is False: + layer_outputs = layer_outputs[:1] + (None,) + layer_outputs[1:] + + hidden_states, present_key_value_state = layer_outputs[:2] + + # We share the position biases between the layers - the first layer store them + # layer_outputs = hidden-states, key-value-states (self-attention position bias), (self-attention weights), + # (cross-attention position bias), (cross-attention weights) + position_bias = layer_outputs[2] + if self.is_decoder and encoder_hidden_states is not None: + encoder_decoder_position_bias = layer_outputs[4 if output_attentions else 3] + # append next layer key value states + if use_cache: + present_key_value_states = present_key_value_states + \ + (present_key_value_state,) + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[3],) + if self.is_decoder: + all_cross_attentions = all_cross_attentions + \ + (layer_outputs[5],) + + # Model Parallel: If it's the last layer for that device, put things on the next device + if self.model_parallel: + for k, v in self.device_map.items(): + if i == v[-1] and "cuda:" + str(k) != self.last_device: + hidden_states = hidden_states.to("cuda:" + str(k + 1)) + + hidden_states = self.final_layer_norm(hidden_states) + hidden_states = self.dropout(hidden_states) + + # Add last layer + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + present_key_value_states, + all_hidden_states, + all_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=present_key_value_states, + hidden_states=all_hidden_states, + attentions=all_attentions, + cross_attentions=all_cross_attentions, + ) + + +T5_START_DOCSTRING = r""" + + The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer + `__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, + Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder decoder transformer pre-trained in a text-to-text + denoising generative setting. + + This model inherits from :class:`~transformers.PreTrainedModel`. 
Check the superclass documentation for the generic + methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. + + Parameters: + config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. +""" + +T5_INPUTS_DOCSTRING = """ + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you + should be able to pad the inputs on both the right and the left. + + Indices can be obtained using :class:`~transformers.T5Tokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + detail. + + `What are input IDs? <../glossary.html#input-ids>`__ + + To know more on how to prepare :obj:`input_ids` for pretraining take a look a `T5 Training + <./T5.html#training>`__. + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): + Indices of decoder input sequence tokens in the vocabulary. + + Indices can be obtained using :class:`~transformers.T5Tokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are decoder input IDs? <../glossary.html#decoder-input-ids>`__ + + T5 uses the :obj:`pad_token_id` as the starting token for :obj:`decoder_input_ids` generation. If + :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see + :obj:`past_key_values`). + + To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at `T5 Training + <./T5.html#training>`__. + decoder_attention_mask (:obj:`torch.BoolTensor` of shape + :obj:`(batch_size, target_sequence_length)`, `optional`): + Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will + also be used by default. + head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in ``[0, + 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + decoder_head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or + :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in ``[0, + 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. 
+ + cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_heads,)` or + :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in + ``[0, 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`): + Tuple consists of (:obj:`last_hidden_state`, :obj:`optional`: `hidden_states`, :obj:`optional`: + `attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a + sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of + the decoder. + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having + 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size) + `, `optional`): + Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded + representation. If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_inputs_embeds` + have to be input (see :obj:`past_key_values`). This is useful if you want more control over how to convert + :obj:`decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + + If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both unset, :obj:`decoder_inputs_embeds` + takes the value of :obj:`inputs_embeds`. + + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). + + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + +T5_ENCODER_INPUTS_DOCSTRING = r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you + should be able to pad the inputs on both the right and the left. + + Indices can be obtained using :class:`~transformers.T5Tokenizer`. 
See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + detail. + + To know more on how to prepare :obj:`input_ids` for pretraining take a look a `T5 Training + <./T5.html#training>`__. + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + +# Warning message for FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask +__HEAD_MASK_WARNING_MSG = """ +The input argument `head_mask` was split into two arguments `head_mask` and `decoder_head_mask`. Currently, +`decoder_head_mask` is set to copy `head_mask`, but this feature is deprecated and will be removed in future versions. +If you do not want to use any `decoder_head_mask` now, please set `decoder_head_mask = torch.ones(num_layers, +num_heads)`. +""" + + +@add_start_docstrings( + "The bare T5 Model transformer outputting raw hidden-states without any specific head on top.", + T5_START_DOCSTRING, +) +class T5LMHead(nn.Module): + """Masked LM head for T5 + + Arguments: + mpu_vocab_size: model parallel size of vocabulary. + hidden_size: hidden size + init_method: init method for weight initialization + layernorm_epsilon: tolerance for layer norm divisions + parallel_output: wether output logits being distributed or not. 
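+
+    Note: in this implementation the head holds only the output bias; the projection
+    itself reuses the shared word-embedding weight that is passed to forward() as
+    word_embeddings_weight.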
+ """ + + def __init__(self, config): + super(T5LMHead, self).__init__() + + self.bias = torch.nn.Parameter(torch.zeros(config.vocab_size)) + + def forward(self, hidden_states, word_embeddings_weight): + output = torch.nn.functional.linear(hidden_states, + word_embeddings_weight, + bias=self.bias) + return output + + +class T5Model(T5PreTrainedModel): + _keys_to_ignore_on_load_missing = [ + r"encoder\.embed_tokens\.weight", + r"decoder\.embed_tokens\.weight", + ] + _keys_to_ignore_on_load_unexpected = [ + r"decoder\.block\.0\.layer\.1\.EncDecAttention\.relative_attention_bias\.weight", + ] + + def __init__(self, config: T5Config): + super().__init__(config) + # @IDEA modified -> nn.Embedding -> T5Embeddings + # self.shared = nn.Embedding(config.vocab_size, config.d_model) + self.shared = T5Embeddings(config) + + encoder_config = copy.deepcopy(config) + encoder_config.is_decoder = False + encoder_config.use_cache = False + encoder_config.is_encoder_decoder = False + self.encoder = T5Stack(encoder_config, self.shared) + + decoder_config = copy.deepcopy(config) + decoder_config.is_decoder = True + decoder_config.is_encoder_decoder = False + decoder_config.num_layers = config.num_decoder_layers + self.decoder = T5Stack(decoder_config, self.shared) + + self.init_weights() + + # Model parallel + self.model_parallel = False + self.device_map = None + + @add_start_docstrings(PARALLELIZE_DOCSTRING) + def parallelize(self, device_map=None): + self.device_map = ( + get_device_map(len(self.encoder.block), + range(torch.cuda.device_count())) + if device_map is None + else device_map + ) + assert_device_map(self.device_map, len(self.encoder.block)) + self.encoder.parallelize(self.device_map) + self.decoder.parallelize(self.device_map) + self.model_parallel = True + + @add_start_docstrings(DEPARALLELIZE_DOCSTRING) + def deparallelize(self): + self.encoder.deparallelize() + self.decoder.deparallelize() + self.encoder = self.encoder.to("cpu") + self.decoder = self.decoder.to("cpu") + self.model_parallel = False + self.device_map = None + torch.cuda.empty_cache() + + def get_input_embeddings(self): + return self.shared + + def set_input_embeddings(self, new_embeddings): + self.shared = new_embeddings + self.encoder.set_input_embeddings(new_embeddings) + self.decoder.set_input_embeddings(new_embeddings) + + def get_encoder(self): + return self.encoder + + def get_decoder(self): + return self.decoder + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(T5_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=Seq2SeqModelOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + decoder_input_ids=None, + decoder_attention_mask=None, + head_mask=None, + decoder_head_mask=None, + cross_attn_head_mask=None, + encoder_outputs=None, + past_key_values=None, + inputs_embeds=None, + decoder_inputs_embeds=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + Returns: + + Example:: + + >>> from transformers import T5Tokenizer, T5Model + + >>> tokenizer = T5Tokenizer.from_pretrained('T5-small') + >>> model = T5Model.from_pretrained('T5-small') + + >>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", + return_tensors="pt").input_ids # Batch size 1 + >>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1 + + >>> # forward pass + >>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids) + >>> last_hidden_states = outputs.last_hidden_state + """ + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask + if head_mask is not None and decoder_head_mask is None: + if self.config.num_layers == self.config.num_decoder_layers: + warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning) + decoder_head_mask = head_mask + + # Encode if needed (training, first prediction pass) + if encoder_outputs is None: + encoder_outputs = self.encoder( + input_ids=input_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + elif return_dict and not isinstance(encoder_outputs, BaseModelOutput): + encoder_outputs = BaseModelOutput( + last_hidden_state=encoder_outputs[0], + hidden_states=encoder_outputs[1] if len( + encoder_outputs) > 1 else None, + attentions=encoder_outputs[2] if len( + encoder_outputs) > 2 else None, + ) + + hidden_states = encoder_outputs[0] + if self.model_parallel: + torch.cuda.set_device(self.decoder.first_device) + # Set device for model parallelism + if self.model_parallel: + torch.cuda.set_device(self.decoder.first_device) + hidden_states = hidden_states.to(self.decoder.first_device) + if decoder_input_ids is not None: + decoder_input_ids = decoder_input_ids.to( + self.decoder.first_device) + if attention_mask is not None: + attention_mask = attention_mask.to(self.decoder.first_device) + if decoder_attention_mask is not None: + decoder_attention_mask = decoder_attention_mask.to( + self.decoder.first_device) + + # Decode + decoder_outputs = self.decoder( + input_ids=decoder_input_ids, + attention_mask=decoder_attention_mask, + inputs_embeds=decoder_inputs_embeds, + past_key_values=past_key_values, + encoder_hidden_states=hidden_states, + encoder_attention_mask=attention_mask, + head_mask=decoder_head_mask, + cross_attn_head_mask=cross_attn_head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + if not return_dict: + return decoder_outputs + encoder_outputs + + return Seq2SeqModelOutput( + last_hidden_state=decoder_outputs.last_hidden_state, + past_key_values=decoder_outputs.past_key_values, + decoder_hidden_states=decoder_outputs.hidden_states, + decoder_attentions=decoder_outputs.attentions, + cross_attentions=decoder_outputs.cross_attentions, + encoder_last_hidden_state=encoder_outputs.last_hidden_state, + encoder_hidden_states=encoder_outputs.hidden_states, + encoder_attentions=encoder_outputs.attentions, + ) + + +@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING) +class T5ForConditionalGeneration(T5PreTrainedModel): + _keys_to_ignore_on_load_missing = [ + r"encoder\.embed_tokens\.weight", + r"decoder\.embed_tokens\.weight", + r"lm_head\.weight", + ] + _keys_to_ignore_on_load_unexpected = [ + r"decoder\.block\.0\.layer\.1\.EncDecAttention\.relative_attention_bias\.weight", + ] + + def __init__(self, config): + super().__init__(config) + self.model_dim = config.d_model + + # @IDEA modified -> nn.Embedding -> T5Embeddings + # self.shared = nn.Embedding(config.vocab_size, config.d_model) + self.shared = T5Embeddings(config) + + encoder_config = copy.deepcopy(config) + encoder_config.is_decoder = False + encoder_config.use_cache = False + encoder_config.is_encoder_decoder = False + self.encoder = T5Stack(encoder_config, self.shared) + + decoder_config = copy.deepcopy(config) + decoder_config.is_decoder = True + decoder_config.is_encoder_decoder = False + decoder_config.num_layers = config.num_decoder_layers + self.decoder = T5Stack(decoder_config, self.shared) + + # @IDEA modified -> add self.lm_head_bias + self.lm_head_bias = torch.nn.Parameter(torch.zeros(config.vocab_size)) + + self.init_weights() + + # Model parallel + self.model_parallel = False + self.device_map = None + + @add_start_docstrings(PARALLELIZE_DOCSTRING) + def parallelize(self, device_map=None): + self.device_map = ( + get_device_map(len(self.encoder.block), + range(torch.cuda.device_count())) + if device_map is None + else device_map + ) + assert_device_map(self.device_map, len(self.encoder.block)) + self.encoder.parallelize(self.device_map) + self.decoder.parallelize(self.device_map) + self.lm_head = self.lm_head.to(self.decoder.first_device) + self.model_parallel = True + + @add_start_docstrings(DEPARALLELIZE_DOCSTRING) + def deparallelize(self): + self.encoder.deparallelize() + self.decoder.deparallelize() + self.encoder = self.encoder.to("cpu") + self.decoder = self.decoder.to("cpu") + self.lm_head = self.lm_head.to("cpu") + self.model_parallel = False + self.device_map = None + torch.cuda.empty_cache() + + def get_input_embeddings(self): + return self.shared + + def set_input_embeddings(self, new_embeddings): + self.shared = new_embeddings + self.encoder.set_input_embeddings(new_embeddings) + self.decoder.set_input_embeddings(new_embeddings) + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def get_output_embeddings(self): + return self.lm_head_bias + + def get_encoder(self): + return self.encoder + + def get_decoder(self): + return self.decoder + + def generate(self, input_ids=None, max_length=512): + + input_ids = torch.tensor(input_ids) + if len(input_ids.shape) < 2: + input_ids = input_ids.unsqueeze(0) + decode_input_id = [21128] # [BOS]的token_id为21128 + for i in range(max_length): + tensor_decode_input_id = torch.tensor([decode_input_id]) 
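+            # NOTE: this lightweight generate() re-runs a full encoder-decoder forward
+            # pass over the whole decoded prefix at every step (no past_key_values
+            # cache) and samples the next token from the softmax over the last
+            # position; decoding stops when the [EOS] id (21129) is drawn or
+            # max_length steps have been produced.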
+ forword_output = self.forward(input_ids=input_ids, + decoder_input_ids=tensor_decode_input_id) + logits = forword_output.logits + logits = torch.nn.functional.softmax( + logits, dim=-1).cpu().detach().numpy()[0] + + last_output_id = int(np.random.choice( + logits.shape[1], p=logits[-1])) + if last_output_id == 21129: # [EOS]的token_id为21129 + break + else: + decode_input_id.append(last_output_id) + + return decode_input_id + + @add_start_docstrings_to_model_forward(T5_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + decoder_input_ids=None, + decoder_attention_mask=None, + head_mask=None, + decoder_head_mask=None, + cross_attn_head_mask=None, + encoder_outputs=None, + past_key_values=None, + inputs_embeds=None, + decoder_inputs_embeds=None, + labels=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[-100, 0, ..., + config.vocab_size - 1]`. All labels set to ``-100`` are ignored (masked), the loss is only computed for + labels in ``[0, ..., config.vocab_size]`` + + Returns: + Examples:: + + >>> from transformers import T5Tokenizer, T5ForConditionalGeneration + + >>> tokenizer = T5Tokenizer.from_pretrained('T5-small') + >>> model = T5ForConditionalGeneration.from_pretrained('T5-small') + + >>> # training + >>> input_ids = tokenizer('The walks in park', return_tensors='pt').input_ids + >>> labels = tokenizer(' cute dog the ', return_tensors='pt').input_ids + >>> outputs = model(input_ids=input_ids, labels=labels) + >>> loss = outputs.loss + >>> logits = outputs.logits + + >>> # inference + >>> input_ids = tokenizer("summarize: studies have shown that owning a dog is good for you", + return_tensors="pt").input_ids # Batch size 1 + >>> outputs = model.generate(input_ids) + >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True)) + >>> # studies have shown that owning a dog is good for you. 
+ """ + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask + if head_mask is not None and decoder_head_mask is None: + if self.config.num_layers == self.config.num_decoder_layers: + warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning) + decoder_head_mask = head_mask + + # Encode if needed (training, first prediction pass) + if encoder_outputs is None: + # Convert encoder inputs in embeddings if needed + encoder_outputs = self.encoder( + input_ids=input_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + elif return_dict and not isinstance(encoder_outputs, BaseModelOutput): + encoder_outputs = BaseModelOutput( + last_hidden_state=encoder_outputs[0], + hidden_states=encoder_outputs[1] if len( + encoder_outputs) > 1 else None, + attentions=encoder_outputs[2] if len( + encoder_outputs) > 2 else None, + ) + + hidden_states = encoder_outputs[0] + + if self.model_parallel: + torch.cuda.set_device(self.decoder.first_device) + + if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None: + # get decoder inputs from shifting lm labels to the right + decoder_input_ids = self._shift_right(labels) + + # Set device for model parallelism + if self.model_parallel: + torch.cuda.set_device(self.decoder.first_device) + hidden_states = hidden_states.to(self.decoder.first_device) + if decoder_input_ids is not None: + decoder_input_ids = decoder_input_ids.to( + self.decoder.first_device) + if attention_mask is not None: + attention_mask = attention_mask.to(self.decoder.first_device) + if decoder_attention_mask is not None: + decoder_attention_mask = decoder_attention_mask.to( + self.decoder.first_device) + + # Decode + decoder_outputs = self.decoder( + input_ids=decoder_input_ids, + attention_mask=decoder_attention_mask, + inputs_embeds=decoder_inputs_embeds, + past_key_values=past_key_values, + encoder_hidden_states=hidden_states, + encoder_attention_mask=attention_mask, + head_mask=decoder_head_mask, + cross_attn_head_mask=cross_attn_head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = decoder_outputs.last_hidden_state + + # Set device for model parallelism + # if self.model_parallel: + # torch.cuda.set_device(self.encoder.first_device) + # self.lm_head = self.lm_head.to(self.encoder.first_device) + # sequence_output = sequence_output.to(self.lm_head.weight.device) + + # if self.config.tie_word_embeddings: + # # Rescale output before projecting on vocab + # # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/ + # mesh_tensorflow/transformer/transformer.py#L586 + # sequence_output = sequence_output * (self.model_dim ** -0.5) + + lm_logits = torch.nn.functional.linear( + sequence_output, self.shared.word_embeddings.weight, bias=self.lm_head_bias) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=-100) + loss = loss_fct( + lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1)) + # @IDEA modified(thom): Add z_loss https://github.com/tensorflow/mesh/blob/ + # fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666 + + if not 
return_dict: + output = (lm_logits,) + decoder_outputs[1:] + encoder_outputs + return ((loss,) + output) if loss is not None else output + + return Seq2SeqLMOutput( + loss=loss, + logits=lm_logits, + past_key_values=decoder_outputs.past_key_values, + decoder_hidden_states=decoder_outputs.hidden_states, + decoder_attentions=decoder_outputs.attentions, + cross_attentions=decoder_outputs.cross_attentions, + encoder_last_hidden_state=encoder_outputs.last_hidden_state, + encoder_hidden_states=encoder_outputs.hidden_states, + encoder_attentions=encoder_outputs.attentions, + ) + + def prepare_inputs_for_generation( + self, + input_ids, + past=None, + attention_mask=None, + head_mask=None, + decoder_head_mask=None, + cross_attn_head_mask=None, + use_cache=None, + encoder_outputs=None, + **kwargs + ): + + # cut decoder_input_ids if past is used + if past is not None: + input_ids = input_ids[:, -1:] + + return { + "decoder_input_ids": input_ids, + "past_key_values": past, + "encoder_outputs": encoder_outputs, + "attention_mask": attention_mask, + "head_mask": head_mask, + "decoder_head_mask": decoder_head_mask, + "cross_attn_head_mask": cross_attn_head_mask, + "use_cache": use_cache, + } + + def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor): + return self._shift_right(labels) + + def _reorder_cache(self, past, beam_idx): + # if decoder past is not included in output + # speedy decoding is disabled and no need to reorder + if past is None: + logger.warning( + "You might want to consider setting `use_cache=True` to speed up decoding") + return past + + reordered_decoder_past = () + for layer_past_states in past: + # get the correct batch idx from layer past batch dim + # batch dim of `past` is at 2nd position + reordered_layer_past_states = () + for layer_past_state in layer_past_states: + # need to set correct `past` for each of the four key / value states + reordered_layer_past_states = reordered_layer_past_states + ( + layer_past_state.index_select( + 0, beam_idx.to(layer_past_state.device)), + ) + + assert reordered_layer_past_states[0].shape == layer_past_states[0].shape + assert len(reordered_layer_past_states) == len(layer_past_states) + + reordered_decoder_past = reordered_decoder_past + \ + (reordered_layer_past_states,) + return reordered_decoder_past + + +@add_start_docstrings( + "The bare T5 Model transformer outputting encoder's raw hidden-states without any specific head on top.", + T5_START_DOCSTRING, +) +class T5EncoderModel(T5PreTrainedModel): + authorized_missing_keys = [ + r"encoder\.embed_tokens\.weight", + ] + + def __init__(self, config: T5Config): + super().__init__(config) + # @IDEA modified -> nn.Embedding -> T5Embeddings + # self.shared = nn.Embedding(config.vocab_size, config.d_model) + self.shared = T5Embeddings(config) + + encoder_config = copy.deepcopy(config) + encoder_config.use_cache = False + encoder_config.is_encoder_decoder = False + self.encoder = T5Stack(encoder_config, self.shared) + + self.init_weights() + + # Model parallel + self.model_parallel = False + self.device_map = None + + @add_start_docstrings(PARALLELIZE_DOCSTRING) + def parallelize(self, device_map=None): + self.device_map = ( + get_device_map(len(self.encoder.block), + range(torch.cuda.device_count())) + if device_map is None + else device_map + ) + assert_device_map(self.device_map, len(self.encoder.block)) + self.encoder.parallelize(self.device_map) + self.model_parallel = True + + @add_start_docstrings(DEPARALLELIZE_DOCSTRING) + def deparallelize(self): + 
self.encoder.deparallelize() + self.encoder = self.encoder.to("cpu") + self.model_parallel = False + self.device_map = None + torch.cuda.empty_cache() + + def get_input_embeddings(self): + return self.shared + + def set_input_embeddings(self, new_embeddings): + self.shared = new_embeddings + self.encoder.set_input_embeddings(new_embeddings) + + def get_encoder(self): + return self.encoder + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(T5_ENCODER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BaseModelOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + head_mask=None, + inputs_embeds=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + Returns: + + Example:: + + >>> from transformers import T5Tokenizer, T5EncoderModel + >>> tokenizer = T5Tokenizer.from_pretrained('T5-small') + >>> model = T5EncoderModel.from_pretrained('T5-small') + >>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", + return_tensors="pt").input_ids # Batch size 1 + >>> outputs = model(input_ids=input_ids) + >>> last_hidden_states = outputs.last_hidden_state + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_outputs = self.encoder( + input_ids=input_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return encoder_outputs diff --git a/fengshen/models/megatron_t5/tokenization_megatron_t5.py b/fengshen/models/megatron_t5/tokenization_megatron_t5.py new file mode 100644 index 0000000000000000000000000000000000000000..d96b7e1743ae8c7ecb4aa3871907a9dc070cf74b --- /dev/null +++ b/fengshen/models/megatron_t5/tokenization_megatron_t5.py @@ -0,0 +1,32 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
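+#
+# This module provides a thin tokenizer wrapper for the Megatron-T5 models in this
+# repo: from_pretrained() builds a BertTokenizer over a BERT-style vocabulary and
+# registers [BOS]/[EOS] plus the extra_id sentinel tokens as additional special tokens.
+#
+# Usage sketch (the vocabulary path below is illustrative, not shipped with this file):
+#   from fengshen.models.megatron_t5.tokenization_megatron_t5 import T5Tokenizer
+#   tokenizer = T5Tokenizer.from_pretrained('/path/to/bert_style_vocab')
+#   input_ids = tokenizer('今天天气真好', return_tensors='pt').input_ids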
+""" T5Tokenizer """
+
+from transformers import BertTokenizer
+
+
+class T5Tokenizer():
+    def __init__(self, extra_id_num=118):
+        self.extra_id_num = extra_id_num
+
+    @classmethod
+    def from_pretrained(cls, vocab_path):
+        cls.extra_id_num = 118
+        cls.T5_special_tokens = ['[BOS]', '[EOS]']
+        for i in range(cls.extra_id_num):
+            cls.T5_special_tokens.append(f'<extra_id_{i}>')
+        tokenizer = BertTokenizer.from_pretrained(vocab_path, additional_special_tokens=cls.T5_special_tokens)
+
+        return tokenizer
diff --git a/fengshen/models/model_utils.py b/fengshen/models/model_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..09ce3e4ab99d661ee9f364c35ea0f987c4e47c93
--- /dev/null
+++ b/fengshen/models/model_utils.py
@@ -0,0 +1,66 @@
+from pytorch_lightning import LightningModule
+
+from pytorch_lightning.strategies import DeepSpeedStrategy
+from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
+from transformers.optimization import AdamW, get_scheduler
+
+
+def add_module_args(parent_args):
+    parser = parent_args.add_argument_group('Basic Module')
+    parser.add_argument('--learning_rate', default=5e-5, type=float)
+    parser.add_argument('--weight_decay', default=1e-1, type=float)
+    parser.add_argument('--warmup_steps', default=0, type=int)
+    parser.add_argument('--warmup_ratio', default=0.1, type=float)
+    parser.add_argument('--adam_beta1', default=0.9, type=float)
+    parser.add_argument('--adam_beta2', default=0.999, type=float)
+    parser.add_argument('--adam_epsilon', default=1e-8, type=float)
+    parser.add_argument('--model_path', default=None, type=str)
+    parser.add_argument('--scheduler_type', default='polynomial', type=str)
+    return parent_args
+
+
+def configure_optimizers(pl_model: LightningModule):
+    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'layer_norm.', 'layernorm.']
+    optimizer_grouped_params = [
+        {'params': [p for n, p in pl_model.named_parameters() if not any(
+            nd in n for nd in no_decay)], 'weight_decay': pl_model.hparams.weight_decay},
+        {'params': [p for n, p in pl_model.named_parameters() if any(
+            nd in n for nd in no_decay)], 'weight_decay': 0.0}
+    ]
+    # Configure optimizer.
+    if isinstance(pl_model.trainer.strategy, DeepSpeedStrategy):
+        if 'offload_optimizer' in pl_model.trainer.training_type_plugin.config['zero_optimization']:
+            optimizer = DeepSpeedCPUAdam(
+                optimizer_grouped_params, adamw_mode=True,
+                lr=pl_model.hparams.learning_rate,
+                betas=(pl_model.hparams.adam_beta1, pl_model.hparams.adam_beta2), eps=pl_model.hparams.adam_epsilon)
+        else:
+            optimizer = FusedAdam(
+                optimizer_grouped_params, adam_w_mode=True,
+                lr=pl_model.hparams.learning_rate,
+                betas=(pl_model.hparams.adam_beta1, pl_model.hparams.adam_beta2), eps=pl_model.hparams.adam_epsilon)
+    else:
+        optimizer = AdamW(optimizer_grouped_params, lr=pl_model.hparams.learning_rate,
+                          betas=(pl_model.hparams.adam_beta1, pl_model.hparams.adam_beta2),
+                          eps=pl_model.hparams.adam_epsilon)
+    # Configure learning rate scheduler.
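+    # If --warmup_steps is left at its default of 0, the warmup length falls back to
+    # warmup_ratio * total_steps; the resulting scheduler is stepped once per
+    # optimizer step ("interval": "step").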
+ warmup_steps = pl_model.hparams.warmup_ratio * \ + pl_model.total_steps if pl_model.hparams.warmup_steps == 0 else pl_model.hparams.warmup_steps + scheduler = get_scheduler(name=pl_model.hparams.scheduler_type, optimizer=optimizer, + num_warmup_steps=warmup_steps, num_training_steps=pl_model.total_steps) + scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1} + return [optimizer], [scheduler] + + +def get_total_steps(trainer, hparams): + train_loader = trainer._data_connector._train_dataloader_source.dataloader() + # Calculate total steps + if trainer.max_epochs > 0: + world_size = trainer.world_size + tb_size = hparams.train_batchsize * max(1, world_size) + ab_size = trainer.accumulate_grad_batches + total_steps = (len(train_loader.dataset) * + trainer.max_epochs // tb_size) // ab_size + else: + total_steps = trainer.max_steps + return total_steps diff --git a/fengshen/models/roformer/__init__.py b/fengshen/models/roformer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..c55c090f25446ec2cf60d632dacdb53a8928e25e --- /dev/null +++ b/fengshen/models/roformer/__init__.py @@ -0,0 +1,57 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import TYPE_CHECKING + +from transformers.file_utils import _LazyModule, is_torch_available + + +_import_structure = { + "configuration_roformer": ["RoFormerConfig"], + "tokenization_roformer": ["RoFormerTokenizer"], +} + +if is_torch_available(): + _import_structure["modeling_roformer"] = [ + "RoFormerModel", + "RoFormerForMaskedLM", + "RoFormerForMultipleChoice", + "RoFormerPreTrainedModel", + "RoFormerForQuestionAnswering", + "RoFormerForSequenceClassification", + "RoFormerForTokenClassification", + ] + + +if TYPE_CHECKING: + from .configuration_roformer import RoFormerConfig + from .tokenization_roformer import RoFormerTokenizer + + if is_torch_available(): + from .modeling_roformer import ( + RoFormerModel, + RoFormerForMaskedLM, + RoFormerForMultipleChoice, + RoFormerPreTrainedModel, + RoFormerForQuestionAnswering, + RoFormerForSequenceClassification, + RoFormerForTokenClassification, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule( + __name__, globals()["__file__"], _import_structure) diff --git a/fengshen/models/roformer/configuration_roformer.py b/fengshen/models/roformer/configuration_roformer.py new file mode 100644 index 0000000000000000000000000000000000000000..4818b31bd215b11d4ca952437869319fc25ae5b5 --- /dev/null +++ b/fengshen/models/roformer/configuration_roformer.py @@ -0,0 +1,133 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" RoFormer model configuration """ + + +from transformers.configuration_utils import PretrainedConfig +from transformers.utils import logging + + +logger = logging.get_logger(__name__) + +RoFormer_PRETRAINED_CONFIG_ARCHIVE_MAP = { + # See all RoFormer models at https://huggingface.co/models?filter=bert +} + + +class RoFormerConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a :class:`~transformers.RoFormerModel`. It is + used to instantiate a RoFormer model according to the specified arguments, defining the model architecture. + Instantiating a configuration with the defaults will yield a similar configuration to that of the RoFormer + `megatron-bert-uncased-345m `__ architecture. + + Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model + outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + + Args: + vocab_size (:obj:`int`, `optional`, defaults to 29056): + Vocabulary size of the RoFormer model. Defines the number of different tokens that can be represented + by the :obj:`inputs_ids` passed when calling :class:`~transformers.RoFormerModel`. + hidden_size (:obj:`int`, `optional`, defaults to 1024): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (:obj:`int`, `optional`, defaults to 24): + Number of hidden layers in the Transformer encoder. + num_attention_heads (:obj:`int`, `optional`, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (:obj:`int`, `optional`, defaults to 4096): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, + :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported. + hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (:obj:`int`, `optional`, defaults to 512): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_size (:obj:`int`, `optional`, defaults to 2): + The vocabulary size of the :obj:`token_type_ids` passed when calling + :class:`~transformers.RoFormerModel`. + initializer_range (:obj:`float`, `optional`, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): + The epsilon used by the layer normalization layers. 
+ gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): + If True, use gradient checkpointing to save memory at the expense of slower backward pass. + position_embedding_type (:obj:`str`, `optional`, defaults to :obj:`"absolute"`): + Type of position embedding. Choose one of :obj:`"absolute"`, :obj:`"relative_key"`, + :obj:`"relative_key_query"`. For positional embeddings use :obj:`"absolute"`. For more information on + :obj:`"relative_key"`, please refer to `Self-Attention with Relative Position Representations (Shaw et al.) + `__. For more information on :obj:`"relative_key_query"`, please refer to + `Method 4` in `Improve Transformer Models with Better Relative Position Embeddings (Huang et al.) + `__. + use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if ``config.is_decoder=True``. + + Examples:: + + >>> from transformers import RoFormerModel, RoFormerConfig + + >>> # Initializing a RoFormer bert-base-uncased style configuration + >>> configuration = RoFormerConfig() + + >>> # Initializing a model from the bert-base-uncased style configuration + >>> model = RoFormerModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + """ + model_type = "roformer" + + def __init__( + self, + vocab_size=29056, + hidden_size=1024, + num_hidden_layers=24, + num_attention_heads=16, + intermediate_size=4096, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + gradient_checkpointing=False, + position_embedding_type="absolute", + use_cache=True, + **kwargs + ): + super().__init__(pad_token_id=pad_token_id, **kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.gradient_checkpointing = gradient_checkpointing + self.position_embedding_type = position_embedding_type + self.use_cache = use_cache diff --git a/fengshen/models/roformer/modeling_roformer.py b/fengshen/models/roformer/modeling_roformer.py new file mode 100644 index 0000000000000000000000000000000000000000..3f67d34c6484108890f21983a0924b3a748e97b4 --- /dev/null +++ b/fengshen/models/roformer/modeling_roformer.py @@ -0,0 +1,1954 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch RoFormer model. 
""" + +import math +import os +import warnings +from dataclasses import dataclass +from typing import Optional, Tuple + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss, MSELoss + +from transformers.activations import ACT2FN +from transformers.file_utils import ( + ModelOutput, + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + replace_return_docstrings, +) +from transformers.modeling_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + BaseModelOutputWithPoolingAndCrossAttentions, + CausalLMOutputWithCrossAttentions, + MaskedLMOutput, + MultipleChoiceModelOutput, + NextSentencePredictorOutput, + QuestionAnsweringModelOutput, + SequenceClassifierOutput, + TokenClassifierOutput, +) +from transformers.modeling_utils import ( + PreTrainedModel, + apply_chunking_to_forward, + find_pruneable_heads_and_indices, + prune_linear_layer, +) +from transformers.utils import logging +from .configuration_roformer import RoFormerConfig + + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "RoFormerConfig" +_TOKENIZER_FOR_DOC = "BertTokenizer" +_CHECKPOINT_FOR_DOC = "nvidia/megatron-bert-cased-345m" + +RoFormer_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "nvidia/megatron-bert-cased-345m", + # See all RoFormer models at https://huggingface.co/models?filter=RoFormer +] + + +def load_tf_weights_in_RoFormer(model, config, tf_checkpoint_path): + """Load tf checkpoints in a pytorch model.""" + try: + import re + + import numpy as np + import tensorflow as tf + except ImportError: + logger.error( + "Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions." + ) + raise + tf_path = os.path.abspath(tf_checkpoint_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + logger.info(f"Loading TF weight {name} with shape {shape}") + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split("/") + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if any( + n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", + "AdamWeightDecayOptimizer_1", "global_step"] + for n in name + ): + logger.info(f"Skipping {'/'.join(name)}") + continue + pointer = model + for m_name in name: + if re.fullmatch(r"[A-Za-z]+_\d+", m_name): + scope_names = re.split(r"_(\d+)", m_name) + else: + scope_names = [m_name] + if scope_names[0] == "kernel" or scope_names[0] == "gamma": + pointer = getattr(pointer, "weight") + elif scope_names[0] == "output_bias" or scope_names[0] == "beta": + pointer = getattr(pointer, "bias") + elif scope_names[0] == "output_weights": + pointer = getattr(pointer, "weight") + elif scope_names[0] == "squad": + pointer = getattr(pointer, "classifier") + else: + try: + pointer = getattr(pointer, scope_names[0]) + except AttributeError: + logger.info(f"Skipping {'/'.join(name)}") + continue + if len(scope_names) >= 2: + num = int(scope_names[1]) + pointer = pointer[num] + if m_name[-11:] == "_embeddings": + pointer = getattr(pointer, "weight") + elif m_name == "kernel": + array = np.transpose(array) + try: + assert ( + pointer.shape == array.shape + ), f"Pointer shape 
{pointer.shape} and array shape {array.shape} mismatched" + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +class RoFormerEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding( + config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + + # @IDEA modified -> roformer removed the position_embedding, and add the totary position embedding in the self_attention_layer + # self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + + self.token_type_embeddings = nn.Embedding( + config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + + # In Megatron, layer-norm is applied after the 1st dropout. + # self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.register_buffer("position_ids", torch.arange( + config.max_position_embeddings).expand((1, -1))) + self.position_embedding_type = getattr( + config, "position_embedding_type", "absolute") + + def forward( + self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0 + ): + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, + past_key_values_length: seq_length + past_key_values_length] + + if token_type_ids is None: + token_type_ids = torch.zeros( + input_shape, dtype=torch.long, device=self.position_ids.device) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + + # @IDEA modified -> roformer removed the position_embedding + # if self.position_embedding_type == "absolute": + # position_embeddings = self.position_embeddings(position_ids) + # embeddings += position_embeddings + + # Megatron BERT moves that layer norm after the drop-out (and to each layer). 
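+        # No LayerNorm is applied to the embeddings in this implementation: a LayerNorm
+        # sits inside every RoFormerAttention/RoFormerLayer instead, and RoFormerEncoder
+        # applies one final LayerNorm after the last layer.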
+ # embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RoPEmbedding(nn.Module): + def __init__(self, d_model): + super(RoPEmbedding, self).__init__() + self.d_model = d_model + div_term = torch.exp(torch.arange( + 0, d_model, 2).float() * (-math.log(10000.0) / d_model)) + self.register_buffer('div_term', div_term) + + def forward(self, x, seq_dim=0): + # x 是 [s, b, np, hn],例如query和key + x = x.permute(2, 1, 0, 3) + t = torch.arange(x.size(seq_dim), device=x.device).type_as( + self.div_term) + sinusoid_inp = torch.outer(t, self.div_term) + sin, cos = sinusoid_inp.sin(), sinusoid_inp.cos() # [s, hn] + o_shape = (sin.size(0), 1, 1, sin.size(1)) + sin, cos = sin.view(*o_shape), cos.view(*o_shape) # [s, 1, 1, hn] + sin = torch.repeat_interleave(sin, 2, dim=-1) + cos = torch.repeat_interleave(cos, 2, dim=-1) + x2 = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).reshape_as(x) + x = cos * x + sin * x2 + return x.permute(2, 1, 0, 3) + + +# Copied from transformers.models.bert.modeling_bert.BertSelfAttention with Bert->RoFormer +class RoFormerSelfAttention(nn.Module): + def __init__(self, config): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int( + config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = getattr( + config, "position_embedding_type", "absolute") + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = config.max_position_embeddings + self.distance_embedding = nn.Embedding( + 2 * config.max_position_embeddings - 1, self.attention_head_size) + # @IDEA modified -> add rope positional embedding + self.rope_emb = RoPEmbedding(self.attention_head_size) + + self.is_decoder = config.is_decoder + + def transpose_for_scores(self, x): + new_x_shape = x.size()[ + :-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + mixed_query_layer = self.query(hidden_states) + + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. 
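+        # The four branches below cover: (1) cross-attention with cached key/values,
+        # (2) fresh cross-attention, (3) cached uni-directional self-attention during
+        # decoding, and (4) plain self-attention without a cache.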
+ is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_layer = past_key_value[0] + value_layer = past_key_value[1] + attention_mask = encoder_attention_mask + elif is_cross_attention: + key_layer = self.transpose_for_scores( + self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores( + self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = torch.cat([past_key_value[0], key_layer], dim=2) + value_layer = torch.cat([past_key_value[1], value_layer], dim=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + query_layer = self.transpose_for_scores(mixed_query_layer) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. + + # @IDEA modified -> add rope positional embedding + # print('query_layer.shape') + # print(query_layer.shape) + # query_layer.hsape -> [batch_size,num_head,seq_len,per_head_hidden_size] + query_layer = self.rope_emb(query_layer) + key_layer = self.rope_emb(key_layer) + + attention_scores = torch.matmul( + query_layer, key_layer.transpose(-1, -2)) + + """ @IDEA modified -> removed the megatron positional + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + seq_length = hidden_states.size()[1] + position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1) + position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1) + distance = position_ids_l - position_ids_r + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + """ + + attention_scores = attention_scores / \ + math.sqrt(self.attention_head_size) + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in RoFormerModel forward() function) + attention_scores = attention_scores 
+ attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[ + :-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else ( + context_layer,) + + if self.is_decoder: + outputs = outputs + (past_key_value,) + return outputs + + +# Based transformers.models.bert.modeling_bert.BertSelfOutput. Moved LayerNorm to RoFormerAttention below. +class RoFormerSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, residual): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + return residual + hidden_states + + +# Based transformers.models.bert.modeling_bert.BertAttention. Added LayerNorm. +class RoFormerAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.ln = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.self = RoFormerSelfAttention(config) + self.output = RoFormerSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - \ + len(heads) + self.self.all_head_size = self.self.attention_head_size * \ + self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + ln_outputs = self.ln(hidden_states) + self_outputs = self.self( + ln_outputs, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + # add attentions if we output them + outputs = (attention_output,) + self_outputs[1:] + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->RoFormer +class RoFormerIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = 
self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Based on transformers.models.bert.modeling_bert.BertOutput. Moved LayerNorm to RoFormerLayer below. +class RoFormerOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + return input_tensor + hidden_states + + +# Based on transformers.models.bert.modeling_bert.BertLayer. Added LayerNorm. +class RoFormerLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = RoFormerAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + assert self.is_decoder, f"{self} should be used as a decoder model if cross attention is added" + self.crossattention = RoFormerAttention(config) + self.ln = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.intermediate = RoFormerIntermediate(config) + self.output = RoFormerOutput(config) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[: + 2] if past_key_value is not None else None + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + output_attentions=output_attentions, + past_key_value=self_attn_past_key_value, + ) + attention_output = self_attention_outputs[0] + + # if decoder, the last output is tuple of self-attn cache + if self.is_decoder: + outputs = self_attention_outputs[1:-1] + present_key_value = self_attention_outputs[-1] + else: + # add self attentions if we output attention weights + outputs = self_attention_outputs[1:] + + cross_attn_present_key_value = None + if self.is_decoder and encoder_hidden_states is not None: + assert hasattr( + self, "crossattention" + ), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`" + + # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple + cross_attn_past_key_value = past_key_value[-2: + ] if past_key_value is not None else None + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + cross_attn_past_key_value, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + # add cross attentions if we output attention weights + outputs = outputs + cross_attention_outputs[1:-1] + + # add cross-attn cache to positions 3,4 of present_key_value tuple + cross_attn_present_key_value = cross_attention_outputs[-1] + present_key_value = present_key_value + cross_attn_present_key_value + + layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output + ) + outputs = (layer_output,) + outputs + + # if decoder, return the attn key/values as the last output + if self.is_decoder: + outputs = outputs + 
(present_key_value,) + + return outputs + + def feed_forward_chunk(self, attention_output): + ln_output = self.ln(attention_output) + intermediate_output = self.intermediate(ln_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +def roformer_extended_attention_mask(attention_mask, tokentype_ids): + # copy from bert_model.py and + # https://github.com/bojone/bert4keras/blob/8836dc01fa99aa54947a15db5aa60a0ab6c0c036/bert4keras/models.py#L382 + # We create a 3D attention mask from a 2D tensor mask. + # [b, 1, s] + attention_mask_b1s = attention_mask.unsqueeze(1) + # [b, s, 1] + attention_mask_bs1 = attention_mask.unsqueeze(2) + # [b, s, s] + padding_mask_bss = attention_mask_b1s * attention_mask_bs1 + + # Convert attention mask to binary: + padding_mask_bss = (padding_mask_bss < 0.5) + + # 根据tokentype_ids来获取相应的双向或者单向mask,注意 + # 这里改变了原本实现中的小于等于号,因为megatron中的mask + # 中非mask部分为0,mask部分为1 + idx = torch.cumsum(tokentype_ids, dim=1) + causal_mask = idx[:, None, :] > idx[:, :, None] + # 合并两个mask + mask = torch.logical_or(causal_mask, padding_mask_bss) + mask = mask.unsqueeze(1) # [b, 1, s, s] + return mask + + +class RoFormerEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([RoFormerLayer(config) + for _ in range(config.num_hidden_layers)]) + + # The final layer norm. We removed the 1st LN, moved LN to each hidden layer and this one + # is simply the final LN (Transformer's BERT has it attached to each hidden layer). + self.ln = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + + next_decoder_cache = () if use_cache else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + + if use_cache: + logger.warn( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, past_key_value, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + + # Because we moved the layer-norm at the end of the hidden layer, we have non-normali- + # zed data here. If that's really needed, we must apply LN to match Transformer's BERT. 
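+            # (`self.ln` below applies the final LayerNorm once the layer loop finishes,
+            # so `last_hidden_state` is returned normalized.)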
+ + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + \ + (layer_outputs[2],) + + # Finalize the hidden states. + hidden_states = self.ln(hidden_states) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + +# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->RoFormer +class RoFormerPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +# Copied from transformers.models.bert.modeling_bert.BertPredictionHeadTransform with Bert->RoFormer +class RoFormerPredictionHeadTransform(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + if isinstance(config.hidden_act, str): + self.transform_act_fn = ACT2FN[config.hidden_act] + else: + self.transform_act_fn = config.hidden_act + self.LayerNorm = nn.LayerNorm( + config.hidden_size, eps=config.layer_norm_eps) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.transform_act_fn(hidden_states) + hidden_states = self.LayerNorm(hidden_states) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertLMPredictionHead with Bert->RoFormer +class RoFormerLMPredictionHead(nn.Module): + def __init__(self, config): + super().__init__() + self.transform = RoFormerPredictionHeadTransform(config) + + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
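+        # `bias=False` here because the decoder weight is shared with the input word
+        # embeddings; the standalone `self.bias` below supplies the output-only bias.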
+ self.decoder = nn.Linear( + config.hidden_size, config.vocab_size, bias=False) + self.bias = nn.Parameter(torch.zeros(config.vocab_size)) + + # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings` + self.decoder.bias = self.bias + + def forward(self, hidden_states): + hidden_states = self.transform(hidden_states) + hidden_states = self.decoder(hidden_states) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertOnlyMLMHead with Bert->RoFormer +class RoFormerOnlyMLMHead(nn.Module): + def __init__(self, config): + super().__init__() + self.predictions = RoFormerLMPredictionHead(config) + + def forward(self, sequence_output): + prediction_scores = self.predictions(sequence_output) + return prediction_scores + + +# Copied from transformers.models.bert.modeling_bert.BertOnlyNSPHead with Bert->RoFormer +class RoFormerOnlyNSPHead(nn.Module): + def __init__(self, config): + super().__init__() + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, pooled_output): + seq_relationship_score = self.seq_relationship(pooled_output) + return seq_relationship_score + + +# Copied from transformers.models.bert.modeling_bert.BertPreTrainingHeads with Bert->RoFormer +class RoFormerPreTrainingHeads(nn.Module): + def __init__(self, config): + super().__init__() + self.predictions = RoFormerLMPredictionHead(config) + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, sequence_output, pooled_output): + prediction_scores = self.predictions(sequence_output) + seq_relationship_score = self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class RoFormerPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = RoFormerConfig + load_tf_weights = load_tf_weights_in_RoFormer + base_model_prefix = "bert" + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, (nn.Linear, nn.Embedding)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + +@dataclass +# Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->RoFormer +class RoFormerForPreTrainingOutput(ModelOutput): + """ + Output type of :class:`~transformers.RoFormerForPreTraining`. + + Args: + loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`): + Total loss as the sum of the masked language modeling loss and the next sequence prediction + (classification) loss. + prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`): + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`): + Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation + before SoftMax). 
+ hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): + Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + loss: Optional[torch.FloatTensor] = None + prediction_logits: torch.FloatTensor = None + seq_relationship_logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +RoFormer_START_DOCSTRING = r""" + + This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic + methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. + + Parameters: + config (:class:`~transformers.RoFormerConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. +""" + +RoFormer_INPUTS_DOCSTRING = r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using :class:`~transformers.BertTokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are input IDs? <../glossary.html#input-ids>`__ + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0, + 1]``: + + - 0 corresponds to a `sentence A` token, + - 1 corresponds to a `sentence B` token. + + `What are token type IDs? <../glossary.html#token-type-ids>`_ + position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. + + `What are position IDs? <../glossary.html#position-ids>`_ + head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules. 
Mask values selected in ``[0, 1]``: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare RoFormer Model transformer outputting raw hidden-states without any specific head on top.", + RoFormer_START_DOCSTRING, +) +class RoFormerModel(RoFormerPreTrainedModel): + """ + + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in `Attention is + all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. + + To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration + set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder` + argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an + input to the forward pass. + """ + + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + + self.embeddings = RoFormerEmbeddings(config) + self.encoder = RoFormerEncoder(config) + + self.pooler = RoFormerPooler(config) if add_pooling_layer else None + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPoolingAndCrossAttentions, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). 
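+
+        Example (a minimal usage sketch added for illustration; the tiny configuration
+        values below are hypothetical and only meant to keep the snippet fast)::
+
+            >>> import torch
+            >>> from fengshen.models.roformer.configuration_roformer import RoFormerConfig
+            >>> from fengshen.models.roformer.modeling_roformer import RoFormerModel
+
+            >>> config = RoFormerConfig(vocab_size=100, hidden_size=64, num_hidden_layers=2,
+            ...                         num_attention_heads=2, intermediate_size=128)
+            >>> model = RoFormerModel(config)
+            >>> input_ids = torch.randint(0, config.vocab_size, (1, 8))
+            >>> outputs = model(input_ids=input_ids)
+            >>> outputs.last_hidden_state.shape
+            torch.Size([1, 8, 64])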
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.is_decoder: + use_cache = use_cache if use_cache is not None else self.config.use_cache + else: + use_cache = False + + if input_ids is not None and inputs_embeds is not None: + raise ValueError( + "You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + batch_size, seq_length = input_shape + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size, seq_length = input_shape + else: + raise ValueError( + "You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + # past_key_values_length + past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0 + + if attention_mask is None: + attention_mask = torch.ones( + ((batch_size, seq_length + past_key_values_length)), device=device) + if token_type_ids is None: + token_type_ids = torch.zeros( + input_shape, dtype=torch.long, device=device) + + # @IDEA modified -> get_extended_attention_mask -> roformer_extended_attention_mask + extended_attention_mask = roformer_extended_attention_mask( + attention_mask, token_type_ids) + """ + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device) + """ + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.config.is_decoder and encoder_hidden_states is not None: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() + encoder_hidden_shape = ( + encoder_batch_size, encoder_sequence_length) + if encoder_attention_mask is None: + encoder_attention_mask = torch.ones( + encoder_hidden_shape, device=device) + encoder_extended_attention_mask = self.invert_attention_mask( + encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask( + head_mask, self.config.num_hidden_layers) + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + inputs_embeds=inputs_embeds, + past_key_values_length=past_key_values_length, + ) + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = 
self.pooler( + sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + past_key_values=encoder_outputs.past_key_values, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) + + +@add_start_docstrings( + """ + RoFormer Model with two heads on top as done during the pretraining: a `masked language modeling` head and a + `next sentence prediction (classification)` head. + """, + RoFormer_START_DOCSTRING, +) +class RoFormerForPreTraining(RoFormerPreTrainedModel): + def __init__(self, config, add_binary_head=True): + super().__init__(config) + + self.bert = RoFormerModel(config) + self.cls = RoFormerPreTrainingHeads(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=RoFormerForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + next_sentence_label=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`): + Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ..., + config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` + next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`): + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair + (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``: + + - 0 indicates sequence B is a continuation of sequence A, + - 1 indicates sequence B is a random sequence. + kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): + Used to hide legacy arguments that have been deprecated. 
+ + Returns: + + Example:: + + >>> from transformers import BertTokenizer, RoFormerForPreTraining + >>> import torch + + >>> tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m') + >>> model = RoFormerForPreTraining.from_pretrained('nvidia/megatron-bert-cased-345m') + + >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> prediction_logits = outputs.prediction_logits + >>> seq_relationship_logits = outputs.seq_relationship_logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls( + sequence_output, pooled_output) + + total_loss = None + if labels is not None and next_sentence_label is not None: + loss_fct = CrossEntropyLoss() + masked_lm_loss = loss_fct( + prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + next_sentence_loss = loss_fct( + seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + total_loss = masked_lm_loss + next_sentence_loss + + if not return_dict: + output = (prediction_scores, seq_relationship_score) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return RoFormerForPreTrainingOutput( + loss=total_loss, + prediction_logits=prediction_scores, + seq_relationship_logits=seq_relationship_score, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """RoFormer Model with a `language modeling` head on top for CLM fine-tuning. """, + RoFormer_START_DOCSTRING, +) +class RoFormerForCausalLM(RoFormerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + _keys_to_ignore_on_load_missing = [ + r"position_ids", r"predictions.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + + if not config.is_decoder: + logger.warning( + "If you want to use `RoFormerForCausalLM` as a standalone, add `is_decoder=True.`") + + self.bert = RoFormerModel(config, add_pooling_layer=False) + self.cls = RoFormerOnlyMLMHead(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + labels=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. 
+ encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in + ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are + ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]`` + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). + + Returns: + + Example:: + + >>> from transformers import BertTokenizer, RoFormerForCausalLM, RoFormerConfig + >>> import torch + + >>> tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m') + >>> model = RoFormerLMHeadModel.from_pretrained('nvidia/megatron-bert-cased-345m', is_decoder=True) + + >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> prediction_logits = outputs.logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if labels is not None: + use_cache = False + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output) + + lm_loss = None + if labels is not None: + # we are doing next-token prediction; shift prediction scores and input ids by one + shifted_prediction_scores = prediction_scores[:, + :-1, :].contiguous() + labels = labels[:, 1:].contiguous() + loss_fct = CrossEntropyLoss() + lm_loss = loss_fct( + shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((lm_loss,) + output) if lm_loss is not None else output + + return CausalLMOutputWithCrossAttentions( + loss=lm_loss, + logits=prediction_scores, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + cross_attentions=outputs.cross_attentions, + ) + + def 
prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = input_ids.new_ones(input_shape) + + # cut decoder_input_ids if past is used + if past is not None: + input_ids = input_ids[:, -1:] + + return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past} + + def _reorder_cache(self, past, beam_idx): + reordered_past = () + for layer_past in past: + reordered_past += (tuple(past_state.index_select(0, beam_idx) + for past_state in layer_past),) + return reordered_past + + +@add_start_docstrings("""RoFormer Model with a `language modeling` head on top. """, RoFormer_START_DOCSTRING) +class RoFormerForMaskedLM(RoFormerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler", r"seq_relationship"] + _keys_to_ignore_on_load_missing = [ + r"position_ids", r"predictions.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + + if config.is_decoder: + logger.warning( + "If you want to use `RoFormerForMaskedLM` make sure `config.is_decoder=False` for " + "bi-directional self-attention." + ) + + self.bert = RoFormerModel(config, add_pooling_layer=False) + self.cls = RoFormerOnlyMLMHead(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=MaskedLMOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ..., + config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` + """ + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output) + + masked_lm_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() # -100 index = padding token + masked_lm_loss = loss_fct( + prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output + + return MaskedLMOutput( + loss=masked_lm_loss, + logits=prediction_scores, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + effective_batch_size = input_shape[0] + + # add a dummy token + assert self.config.pad_token_id is not None, "The PAD token should be defined for generation" + attention_mask = torch.cat( + [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1) + dummy_token = torch.full( + (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device + ) + input_ids = torch.cat([input_ids, dummy_token], dim=1) + + return {"input_ids": input_ids, "attention_mask": attention_mask} + + +@add_start_docstrings( + """RoFormer Model with a `next sentence prediction (classification)` head on top. """, + RoFormer_START_DOCSTRING, +) +class RoFormerForNextSentencePrediction(RoFormerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"predictions"] + + def __init__(self, config): + super().__init__(config) + + self.bert = RoFormerModel(config) + self.cls = RoFormerOnlyNSPHead(config) + + self.init_weights() + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + **kwargs + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair + (see ``input_ids`` docstring). Indices should be in ``[0, 1]``: + + - 0 indicates sequence B is a continuation of sequence A, + - 1 indicates sequence B is a random sequence. 
+ + Returns: + + Example:: + + >>> from transformers import BertTokenizer, RoFormerForNextSentencePrediction + >>> import torch + + >>> tokenizer = BertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m') + >>> model = RoFormerForNextSentencePrediction.from_pretrained('nvidia/megatron-bert-cased-345m') + + >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." + >>> next_sentence = "The sky is blue due to the shorter wavelength of blue light." + >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt') + + >>> outputs = model(**encoding, labels=torch.LongTensor([1])) + >>> logits = outputs.logits + >>> assert logits[0, 0] < logits[0, 1] # next sentence was random + """ + + if "next_sentence_label" in kwargs: + warnings.warn( + "The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.", + FutureWarning, + ) + labels = kwargs.pop("next_sentence_label") + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + seq_relationship_scores = self.cls(pooled_output) + + next_sentence_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + next_sentence_loss = loss_fct( + seq_relationship_scores.view(-1, 2), labels.view(-1)) + + if not return_dict: + output = (seq_relationship_scores,) + outputs[2:] + return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output + + return NextSentencePredictorOutput( + loss=next_sentence_loss, + logits=seq_relationship_scores, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + RoFormer Model transformer with a sequence classification/regression head on top (a linear layer on top of the + pooled output) e.g. for GLUE tasks. + """, + RoFormer_START_DOCSTRING, +) +class RoFormerForSequenceClassification(RoFormerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bert = RoFormerModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=SequenceClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). 
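+
+        Example (a minimal usage sketch added for illustration; the tiny configuration
+        and the two-label setup below are hypothetical)::
+
+            >>> import torch
+            >>> from fengshen.models.roformer.configuration_roformer import RoFormerConfig
+            >>> from fengshen.models.roformer.modeling_roformer import RoFormerForSequenceClassification
+
+            >>> config = RoFormerConfig(vocab_size=100, hidden_size=64, num_hidden_layers=2,
+            ...                         num_attention_heads=2, intermediate_size=128, num_labels=2)
+            >>> model = RoFormerForSequenceClassification(config)
+            >>> input_ids = torch.randint(0, config.vocab_size, (2, 8))
+            >>> outputs = model(input_ids=input_ids, labels=torch.tensor([0, 1]))
+            >>> outputs.logits.shape
+            torch.Size([2, 2])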
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + loss = None + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct( + logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + RoFormer Model with a multiple choice classification head on top (a linear layer on top of the pooled output + and a softmax) e.g. for RocStories/SWAG tasks. + """, + RoFormer_START_DOCSTRING, +) +class RoFormerForMultipleChoice(RoFormerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.bert = RoFormerModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + self.init_weights() + + @add_start_docstrings_to_model_forward( + RoFormer_INPUTS_DOCSTRING.format( + "batch_size, num_choices, sequence_length") + ) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=MultipleChoiceModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the multiple choice classification loss. Indices should be in ``[0, ..., + num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. 
(See + :obj:`input_ids` above) + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] + + input_ids = input_ids.view(-1, input_ids.size(-1) + ) if input_ids is not None else None + attention_mask = attention_mask.view( + -1, attention_mask.size(-1)) if attention_mask is not None else None + token_type_ids = token_type_ids.view( + -1, token_type_ids.size(-1)) if token_type_ids is not None else None + position_ids = position_ids.view(-1, position_ids.size(-1) + ) if position_ids is not None else None + inputs_embeds = ( + inputs_embeds.view(-1, inputs_embeds.size(-2), + inputs_embeds.size(-1)) + if inputs_embeds is not None + else None + ) + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.view(-1, num_choices) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + + if not return_dict: + output = (reshaped_logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return MultipleChoiceModelOutput( + loss=loss, + logits=reshaped_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + RoFormer Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. + for Named-Entity-Recognition (NER) tasks. + """, + RoFormer_START_DOCSTRING, +) +class RoFormerForTokenClassification(RoFormerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bert = RoFormerModel(config, add_pooling_layer=False) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=TokenClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels - + 1]``. 
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + # Only keep active parts of the loss + if attention_mask is not None: + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels) + active_labels = torch.where( + active_loss, labels.view(-1), torch.tensor( + loss_fct.ignore_index).type_as(labels) + ) + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct( + logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + RoFormer Model with a span classification head on top for extractive question-answering tasks like SQuAD (a + linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). + """, + RoFormer_START_DOCSTRING, +) +class RoFormerForQuestionAnswering(RoFormerPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bert = RoFormerModel(config, add_pooling_layer=False) + self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(RoFormer_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + processor_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=QuestionAnsweringModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + start_positions=None, + end_positions=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. + end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. 
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1).contiguous() + end_logits = end_logits.squeeze(-1).contiguous() + + total_loss = None + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions = start_positions.clamp(0, ignored_index) + end_positions = end_positions.clamp(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if not return_dict: + output = (start_logits, end_logits) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return QuestionAnsweringModelOutput( + loss=total_loss, + start_logits=start_logits, + end_logits=end_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) diff --git a/fengshen/models/roformer/tokenization_roformer.py b/fengshen/models/roformer/tokenization_roformer.py new file mode 100644 index 0000000000000000000000000000000000000000..9b9267367e256b46fccc0ad196c326d28c0ebb0c --- /dev/null +++ b/fengshen/models/roformer/tokenization_roformer.py @@ -0,0 +1,16 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from transformers import BertTokenizer as RoFormerTokenizer diff --git a/fengshen/models/transfo_xl_denoise/__init__.py b/fengshen/models/transfo_xl_denoise/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fengshen/models/transfo_xl_denoise/configuration_transfo_xl_denoise.py b/fengshen/models/transfo_xl_denoise/configuration_transfo_xl_denoise.py new file mode 100644 index 0000000000000000000000000000000000000000..bbd0e8bbbca977f23b3e77d51d6f7fe3fb2092cc --- /dev/null +++ b/fengshen/models/transfo_xl_denoise/configuration_transfo_xl_denoise.py @@ -0,0 +1,119 @@ +# coding=utf-8 +# Copyright 2022 IDEA-CCNL and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" TransfoXLDenoise model configuration """
+
+from transformers.configuration_utils import PretrainedConfig
+
+
+Transfo_XL_Denoise_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "transformer-xl-1b-base": "https://huggingface.co/transformer-xl-1b-base/resolve/main/config.json",
+    # See all TransfoXLDenoise models at https://huggingface.co/models?filter=transfo_xl_denoise
+}
+
+
+class TransfoXLDenoiseConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`~TransfoXLDenoiseModel`].
+    It is used to instantiate a TransfoXLDenoise model according to the specified arguments, defining the model
+    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
+    the TransfoXLDenoise [transformer-xl-1b-base](https://huggingface.co/transformer-xl-1b-base) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used
+    to control the model outputs. Read the documentation from [`PretrainedConfig`]
+    for more information.
+
+    Args:
+        num_layers (`int`, *optional*, defaults to 32):
+            Number of transformer layers.
+        vocab_size (`int`, *optional*, defaults to 50048):
+            Vocabulary size of the TransfoXLDenoise model. Defines the number of different tokens that can be
+            represented by the `inputs_ids` passed when calling [`~TransfoXLDenoiseModel`].
+        hidden_size (`int`, *optional*, defaults to 1600):
+            Dimension of the transformer layers.
+        num_attention_heads (`int`, *optional*, defaults to 25):
+            Number of attention heads for each attention layer. `hidden_size` must be divisible by this value.
+        embedding_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout probability applied to the word and position embeddings.
+        attention_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention probabilities.
+        output_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout probability for the outputs of the attention and MLP blocks.
+        max_sequence_length (`int`, *optional*, defaults to 512):
+            The maximum sequence length used for the (absolute) position embeddings.
+        max_memory_length (`int`, *optional*, defaults to 512):
+            The maximum number of cached hidden states (Transformer-XL style memory) carried across segments.
+        checkpoint_activations (`bool`, *optional*, defaults to `False`):
+            Whether to use activation checkpointing inside the transformer layers.
+        checkpoint_num_layers (`int`, *optional*, defaults to 1):
+            Number of layers per activation-checkpointing chunk.
+        parallel_output (`bool`, *optional*, defaults to `True`):
+            Whether the logits are kept in model-parallel form; kept from the original Megatron-style
+            implementation.
+        relative_encoding (`bool`, *optional*, defaults to `True`):
+            Whether to use Transformer-XL style relative position encoding instead of learned absolute position
+            embeddings.
+ use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + Example: + + ```python + >>> from transformers import TransfoXLDenoiseModel, TransfoXLDenoiseConfig + + >>> # Initializing a TransfoXLDenoise transformer-xl-1b-base style configuration + >>> configuration = TransfoXLDenoiseConfig() + + >>> # Initializing a model from the transformer-xl-1b-base style configuration + >>> model = TransfoXLDenoiseModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ``` +""" + model_type = "transfo_xl_denoise" + + def __init__( + self, + num_layers=32, + vocab_size=50048, + hidden_size=1600, + num_attention_heads=25, + embedding_dropout_prob=0.1, + attention_dropout_prob=0.1, + output_dropout_prob=0.1, + max_sequence_length=512, + max_memory_length=512, + checkpoint_activations=False, + checkpoint_num_layers=1, + parallel_output=True, + relative_encoding=True, + **kwargs + ): + self.num_layers = num_layers + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_attention_heads = num_attention_heads + self.embedding_dropout_prob = embedding_dropout_prob + self.attention_dropout_prob = attention_dropout_prob + self.output_dropout_prob = output_dropout_prob + self.max_sequence_length = max_sequence_length + self.max_memory_length = max_memory_length + self.checkpoint_activations = checkpoint_activations + self.checkpoint_num_layers = checkpoint_num_layers + self.parallel_output = parallel_output + self.relative_encoding = relative_encoding + super().__init__(**kwargs) diff --git a/fengshen/models/transfo_xl_denoise/generate.py b/fengshen/models/transfo_xl_denoise/generate.py new file mode 100644 index 0000000000000000000000000000000000000000..e5aa6402b6d78803fb56aead3decafa7a83eaa31 --- /dev/null +++ b/fengshen/models/transfo_xl_denoise/generate.py @@ -0,0 +1,102 @@ +import torch +import torch.nn.functional as F +from fengshen.models.transfo_xl_denoise.tokenization_transfo_xl_denoise import TransfoXLDenoiseTokenizer +from fengshen.models.transfo_xl_denoise.modeling_transfo_xl_denoise import TransfoXLDenoiseModel + + +def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + # This function has been mostly taken from huggingface conversational ai code at + # https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313 + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + if top_p > 0.0: + # convert to 1D + sorted_logits, sorted_indices = torch.sort(logits, dim=-1, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + for i in range(sorted_indices.size()[0]): + indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]] + logits[i][indices_to_remove] = filter_value + return logits + + +def get_masks_and_position_ids(data, mem_length=None): + # Extract batch size and sequence length. 
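+    # The mask built just below covers the concatenation [memory ++ current segment]
+    # along the key axis (seq_length + mem_length columns for each of the seq_length
+    # query positions):
+    #   * torch.tril(..., mem_length) keeps it causal: a query never attends to keys
+    #     that come after its own position in the current segment;
+    #   * torch.triu(..., 1 - seq_length + mem_length) additionally limits how far back
+    #     into the cached memory a query may look, so the attention window has a fixed
+    #     width instead of growing with the memory.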
+ batch_size, seq_length = data.size() + # Attention mask (lower triangular). + attention_mask = torch.ones((1, seq_length, seq_length + mem_length), device=data.device) + attention_mask = torch.tril(torch.triu(attention_mask, 1 - seq_length + mem_length), mem_length) + attention_mask = attention_mask.unsqueeze(1) + # Position ids. + position_ids = torch.arange(seq_length, dtype=torch.long, + device=data.device) + position_ids = position_ids.unsqueeze(0).expand_as(data) + return attention_mask, position_ids + + +def get_batch(context_tokens, mem_length, batch_size=1): + tokens = context_tokens + tokens = tokens.view(batch_size, -1).contiguous() + # Get the masks and postition ids. + attention_mask, position_ids = get_masks_and_position_ids(tokens, mem_length=mem_length) + return tokens, attention_mask, position_ids + + +def denoise_generate(model, + tokenizer, + input_text, + device=0, + mem_length=512, + temperature=1., + top_p=0.9, + eod_token=50000): + ''' Generate with fixed prompt pretrained ''' + prompt = f"“{input_text}”改写后是“" + res = [] + counter = 0 + tokens, attention_mask, position_ids = get_batch( + torch.LongTensor(tokenizer.encode(prompt)), mem_length, batch_size=1) + tokens, attention_mask, position_ids = tokens.cuda( + device), attention_mask.cuda(device), position_ids.cuda(device) + org_context_length = tokens.shape[-1] + model = model.cuda(device) + while counter < 100: + if counter == 0: + mems = [] # empty at the begining + output = model(input_ids=tokens, attention_mask=attention_mask, + position_ids=position_ids, hidden_states=mems) + logits, mems = output.logits, output.hidden_states + else: + index = org_context_length + counter + output = model(input_ids=tokens[:, index - 1: index], position_ids=tokens.new_ones((1, 1)) * (index - 1), + attention_mask=tokens.new_ones(1, 1, 1, mem_length + 1, device=device, + dtype=torch.float), hidden_states=mems) + logits, mems = output.logits, output.hidden_states + logits = logits[:, -1] + logits /= temperature + logits = top_k_logits(logits, top_k=0, top_p=top_p) + log_probs = F.softmax(logits, dim=-1) + prev = torch.multinomial(log_probs, num_samples=1)[0] + is_end = prev == eod_token + if is_end: + break + tokens = torch.cat((tokens, prev.view(1, 1)), dim=1) + counter += 1 + res.append(tokenizer.decode(tokens.view(-1).contiguous().tolist())) + return res + + +if __name__ == "__main__": + device = 1 + tokenizer = TransfoXLDenoiseTokenizer.from_pretrained('IDEA-CCNL/Bigan-Transformer-XL-denoise-1.1B') + model = TransfoXLDenoiseModel.from_pretrained('IDEA-CCNL/Bigan-Transformer-XL-denoise-1.1B') + input_text = "凡是有成就的人, 都很严肃地对待生命自己的" + res = denoise_generate(model, tokenizer, input_text) + print(res) diff --git a/fengshen/models/transfo_xl_denoise/modeling_transfo_xl_denoise.py b/fengshen/models/transfo_xl_denoise/modeling_transfo_xl_denoise.py new file mode 100755 index 0000000000000000000000000000000000000000..04fb81f284036eea063ba83c25bff981f9febbf5 --- /dev/null +++ b/fengshen/models/transfo_xl_denoise/modeling_transfo_xl_denoise.py @@ -0,0 +1,769 @@ +# coding=utf-8 +# Copyright 2022 IDEA-CCNL The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch TransfoXLDenoise model. """ + + +import math +import torch +import torch.utils.checkpoint as checkpoint +import torch.nn.functional as F +from dataclasses import dataclass +from typing import Optional, Tuple + +from transformers.modeling_utils import ( + PreTrainedModel +) +from transformers.modeling_outputs import ModelOutput +from .configuration_transfo_xl_denoise import TransfoXLDenoiseConfig + + +_CHECKPOINT_FOR_DOC = "transformer-xl-1b-base" +_CONFIG_FOR_DOC = "TransfoXLDenoiseConfig" +_TOKENIZER_FOR_DOC = "TransfoXLDenoiseTokenizer" + +Transfo_XL_Denoise_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general + usage and behavior. + + Parameters: + config ([`~TransfoXLDenoiseConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +Transfo_XL_Denoise_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`TransfoXLDenoiseTokenizer`]. + See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range `[0, config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert *input_ids* indices into associated vectors + than the model's internal embedding lookup matrix. 
+ output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +Transfo_XL_Denoise_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "transformer-xl-1b-base", +] + + +@dataclass +class TransfoXLDenoiseModelOutput(ModelOutput): + logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + + +class PositionalEmbedding(torch.nn.Module): + def __init__(self, hidden_size): + super(PositionalEmbedding, self).__init__() + + self.hidden_size = hidden_size + + inv_freq = 1 / (10000 ** (torch.arange(0.0, hidden_size, 2.0) / hidden_size)) + self.register_buffer('inv_freq', inv_freq) + + def forward(self, pos_seq, bsz=None): + sinusoid_inp = torch.ger(pos_seq, self.inv_freq) + pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1) + + if bsz is not None: + return pos_emb[None, :, :].expand(bsz, -1, -1) + else: + return pos_emb[None, :, :] + + +def ensure_divisibility(numerator, denominator): + """Ensure that numerator is divisible by the denominator.""" + assert numerator % denominator == 0, '{} is not divisible by {}'.format( + numerator, denominator) + + +def divide(numerator, denominator): + """Ensure that numerator is divisible by the denominator and return + the division value.""" + ensure_divisibility(numerator, denominator) + return numerator // denominator + + +def scaled_init_method(sigma, num_layers): + """Init method based on N(0, sigma/sqrt(2*num_layers).""" + std = sigma / math.sqrt(2.0 * num_layers) + + def init_(tensor): + return torch.nn.init.normal_(tensor, mean=0.0, std=std) + + return init_ + + +def unscaled_init_method(sigma): + """Init method based on N(0, sigma).""" + def init_(tensor): + return torch.nn.init.normal_(tensor, mean=0.0, std=sigma) + + return init_ + + +@torch.jit.script +def gelu_impl(x): + """OpenAI's gelu implementation.""" + return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * x + * (1.0 + 0.044715 * x * x))) + + +def gelu(x): + return gelu_impl(x) + + +class GPT2SelfAttention(torch.nn.Module): + """Parallel self-attention layer for GPT2. + + Self-attention layer takes input with size [b, s, h] where b is + the batch size, s is the sequence lenght, and h is the hidden size + and creates output of the same size. + Arguments: + hidden_size: total hidden size of the layer (h). + num_attention_heads: number of attention heads (n). Note that we + require n to be divisible by number of GPUs + used to parallelize the model. Also, we + require hidden size to be divisible by n. + dropout_prob: dropout probability for the attention scores. + init_method: weight initialization. + output_layer_init_method: output layer initialization. If None, use + `init_method`. + We use the following notation: + h: hidden_size + n: num_attention_heads + p: number of partitions + np: n/p + hp: h/p + hn: h/n + b: batch size + s: sequence length + """ + + def __init__(self, hidden_size, num_attention_heads, + attention_dropout_prob, output_dropout_prob, + init_method, output_layer_init_method=None, relative_encoding=False): + super(GPT2SelfAttention, self).__init__() + # Set output layer initialization if not provided. 
+ if output_layer_init_method is None: + output_layer_init_method = init_method + # Per attention head and per partition values. + self.hidden_size_per_partition = hidden_size + self.hidden_size_per_attention_head = divide(hidden_size, + num_attention_heads) + self.num_attention_heads_per_partition = num_attention_heads + self.relative_encoding = relative_encoding + # Strided linear layer. + self.query_key_value = torch.nn.Linear(hidden_size, + 3 * hidden_size, bias=True) + + if relative_encoding: + self.relative = torch.nn.Linear(hidden_size, hidden_size, bias=True) + # Dropout. Note that for a single iteration, this layer will generate + # different outputs on different number of parallel partitions but + # on average it should not be partition dependent. + self.attention_dropout = torch.nn.Dropout(attention_dropout_prob) + + # Output. + self.dense = torch.nn.Linear(hidden_size, hidden_size, bias=True) + self.output_dropout = torch.nn.Dropout(output_dropout_prob) + + def _transpose_for_scores(self, tensor): + """Transpose a 3D tensor [b, s, np*hn] into a 4D tensor with + size [b, np, s, hn]. + """ + new_tensor_shape = tensor.size()[:-1] + \ + (self.num_attention_heads_per_partition, + self.hidden_size_per_attention_head) + tensor = tensor.view(*new_tensor_shape) + return tensor.permute(0, 2, 1, 3) + + @staticmethod + def _rel_shift(x, zero_triu=False): + # ql x kl x bsz x h + # bsz x h x ql x kl + zero_pad = torch.zeros((*x.size()[:-2], x.size(-2), 1), + device=x.device, dtype=x.dtype) + x_padded = torch.cat([zero_pad, x], dim=-1) + + x_padded = x_padded.view(*x.size()[:-2], x.size(-1) + 1, x.size(-2)) + + x = x_padded[:, :, 1:].view_as(x) + + if zero_triu: + ones = torch.ones((x.size(0), x.size(1))) + x = x * torch.tril(ones, x.size(1) - x.size(0))[:, :, None, None] + + return x + + @staticmethod + def _rel_shift_latest(x: torch.Tensor): + ndims = x.dim() + x_shape = x.size() + row_dim = 2 + col_dim = row_dim + 1 + assert col_dim < ndims + tgt_shape_1, tgt_shape_2 = [], [] + for i in range(ndims): + if i == row_dim: + tgt_shape_1.append(x_shape[col_dim]) + tgt_shape_2.append(x_shape[row_dim]) + elif i == col_dim: + tgt_shape_1.append(x_shape[row_dim]) + tgt_shape_2.append(x_shape[col_dim] - 1) + else: + tgt_shape_1.append(x_shape[i]) + tgt_shape_2.append(x_shape[i]) + x = x.view(*tgt_shape_1) + x = x[:, :, 1:, :] + x = x.view(*tgt_shape_2) + return x + + def forward(self, hidden_states, ltor_mask, position_embeddings=None, r_w_bias=None, r_r_bias=None, mem=None): + # hidden_states: [b, s, h] + # ltor_mask: [1, 1, s, s] + + # Attention heads. [b, s, hp] + query_length = hidden_states.size(1) + + if mem is None: + mixed_x_layer = self.query_key_value(hidden_states) + (mixed_query_layer, + mixed_key_layer, + mixed_value_layer) = torch.chunk(mixed_x_layer, 3, dim=-1) + else: + cat = torch.cat((mem, hidden_states), 1) + mixed_x_layer = self.query_key_value(cat) + (mixed_query_layer, + mixed_key_layer, + mixed_value_layer) = torch.chunk(mixed_x_layer, 3, dim=-1) + mixed_query_layer = mixed_query_layer[:, -query_length:] + + # Reshape and transpose [b, np, s, hn] + query_layer = self._transpose_for_scores(mixed_query_layer) + key_layer = self._transpose_for_scores(mixed_key_layer) + value_layer = self._transpose_for_scores(mixed_value_layer) + if self.relative_encoding: + relative_layer = self.relative(position_embeddings) + relative_layer = self._transpose_for_scores( + relative_layer) # 1 (bsz) x n_head x klen x d_head + # Raw attention scores. 
[b, np, qs, ks] + rw_head_q = query_layer + r_w_bias.unsqueeze(1) + ac_score = torch.matmul(rw_head_q, key_layer.transpose(-1, -2)) + rr_head_q = query_layer + r_r_bias.unsqueeze(1) + bd_score = torch.matmul(rr_head_q, relative_layer.transpose(-1, -2)) + bd_score = self._rel_shift(bd_score) # qlen x klen x bsz x n_head + # bd_score = bd_score.permute(2, 3, 0, 1) # bsz n_head qlen klen + + attention_scores = ac_score + bd_score + else: + # Raw attention scores. [b, np, s, s] + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + attention_scores = attention_scores / math.sqrt( + self.hidden_size_per_attention_head) + + # Apply the left to right attention mask. + attention_scores = torch.mul(attention_scores, ltor_mask) - \ + 10000.0 * (1.0 - ltor_mask) + + # Attention probabilities. [b, np, s, s] + attention_probs = torch.nn.Softmax(dim=-1)(attention_scores) + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + # with get_cuda_rng_tracker().fork(): + # attention_probs = self.attention_dropout(attention_probs) + + # Context layer. + # [b, np, s, hn] + context_layer = torch.matmul(attention_probs, value_layer) + # [b, s, np, hn] + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + \ + (self.hidden_size_per_partition,) + # [b, s, hp] + context_layer = context_layer.view(*new_context_layer_shape) + + # Output. [b, s, h] + output = self.dense(context_layer) + output = self.output_dropout(output) + + return output + + +class GPT2MLP(torch.nn.Module): + """MLP for GPT2. + + MLP will take the input with h hidden state, project it to 4*h + hidden dimension, perform gelu transformation, and project the + state back into h hidden dimension. At the end, dropout is also + applied. + + Arguments: + hidden_size: The hidden size of the self attention. + output_dropout_prob: dropout probability for the outputs + after self attention and final output. + init_method: initialization method used for the weights. Note + that all biases are initialized to zero and + layernorm weight are initialized to one. + output_layer_init_method: output layer initialization. If None, + use `init_method`. + """ + + def __init__(self, hidden_size, output_dropout_prob, init_method, + output_layer_init_method=None): + super(GPT2MLP, self).__init__() + # Set output layer initialization if not provided. + if output_layer_init_method is None: + output_layer_init_method = init_method + # Project to 4h. + self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size) + # Project back to h. + self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size) + self.dropout = torch.nn.Dropout(output_dropout_prob) + + def forward(self, hidden_states): + # [b, s, 4hp] + intermediate_parallel = self.dense_h_to_4h(hidden_states) + intermediate_parallel = gelu(intermediate_parallel) + + # [b, s, h] + output = self.dense_4h_to_h(intermediate_parallel) + output = self.dropout(output) + return output + + +class GPT2TransformerLayer(torch.nn.Module): + """A single layer transformer for GPT2. + + We use the following notation: + h: hidden size + n: number of attention heads + b: batch size + s: sequence length + Transformore layer takes input with size [b, s, h] and returns an + output of the same size. + + Arguments: + hidden_size: The hidden size of the self attention. + num_attention_heads: number of attention head in the self + attention. 
+ attention_dropout_prob: dropout probability of the attention + score in self attention. + output_dropout_prob: dropout probability for the outputs + after self attention and final output. + layernorm_epsilon: epsilon used in layernorm to avoid + division by zero. + init_method: initialization method used for the weights. Note + that all biases are initialized to zero and + layernorm weight are initialized to one. + output_layer_init_method: output layers (attention output and + mlp output) initialization. If None, + use `init_method`. + """ + + def __init__(self, + hidden_size, + num_attention_heads, + attention_dropout_prob, + output_dropout_prob, + layernorm_epsilon, + init_method, + output_layer_init_method=None, + relative_encoding=False): + super(GPT2TransformerLayer, self).__init__() + # Set output layer initialization if not provided. + if output_layer_init_method is None: + output_layer_init_method = init_method + + # Layernorm on the input data. + self.input_layernorm = torch.nn.LayerNorm(hidden_size, eps=layernorm_epsilon) + + # Self attention. + self.attention = GPT2SelfAttention( + hidden_size, + num_attention_heads, + attention_dropout_prob, + output_dropout_prob, + init_method, + output_layer_init_method=output_layer_init_method, + relative_encoding=relative_encoding) + + # Layernorm on the input data. + self.post_attention_layernorm = torch.nn.LayerNorm(hidden_size, + eps=layernorm_epsilon) + + # MLP + self.mlp = GPT2MLP( + hidden_size, + output_dropout_prob, + init_method, + output_layer_init_method=output_layer_init_method) + + def forward(self, hidden_states, ltor_mask, position_embeddings=None, r_w_bias=None, r_r_bias=None, mem=None): + # hidden_states: [b, s, h] + # ltor_mask: [1, 1, s, s] + + # Layer norm at the begining of the transformer layer. + layernorm_output = self.input_layernorm(hidden_states) + mem = self.input_layernorm(mem) if mem is not None else None + # Self attention. + attention_output = self.attention( + layernorm_output, ltor_mask, position_embeddings, r_w_bias, r_r_bias, mem) + # Residual connection. + # print(f'hz {hidden_states.shape}, attn {attention_output.shape}') + layernorm_input = hidden_states + attention_output + # Layer norm post the self attention. + layernorm_output = self.post_attention_layernorm(layernorm_input) + # MLP. + mlp_output = self.mlp(layernorm_output) + # Second residual connection. + output = layernorm_input + mlp_output + + return output + + +class GPT2Transformer(torch.nn.Module): + """GPT-2 transformer. + + This module takes input from embedding layer and it's output can + be used directly by a logit layer. It consists of L (num-layers) + blocks of: + layer norm + self attention + residual connection + layer norm + mlp + residual connection + followed by a final layer norm. + + Arguments: + num_layers: Number of transformer layers. + hidden_size: The hidden size of the self attention. + num_attention_heads: number of attention head in the self + attention. + attention_dropout_prob: dropout probability of the attention + score in self attention. + output_dropout_prob: dropout probability for the outputs + after self attention and final output. + checkpoint_activations: if True, checkpoint activations. + checkpoint_num_layers: number of layers to checkpoint. This + is basically the chunk size in checkpoitning. + layernorm_epsilon: epsilon used in layernorm to avoid + division by zero. + init_method_std: standard deviation of the init method which has + the form N(0, std). 
+ use_scaled_init_for_output_weights: If Ture use 1/sqrt(2*num_layers) + scaling for the output weights ( + output of self attention and mlp). + """ + + def __init__(self, + num_layers, + hidden_size, + num_attention_heads, + max_sequence_length, + max_memory_length, + embedding_dropout_prob, + attention_dropout_prob, + output_dropout_prob, + checkpoint_activations, + checkpoint_num_layers=1, + layernorm_epsilon=1.0e-5, + init_method_std=0.02, + use_scaled_init_for_output_weights=True, + relative_encoding=False): + super(GPT2Transformer, self).__init__() + # Store activation checkpoiting flag. + self.checkpoint_activations = checkpoint_activations + self.checkpoint_num_layers = checkpoint_num_layers + self.max_memory_length = max_memory_length + + output_layer_init_method = None + if use_scaled_init_for_output_weights: + output_layer_init_method = scaled_init_method(init_method_std, + num_layers) + # Embeddings dropout + self.embedding_dropout = torch.nn.Dropout(embedding_dropout_prob) + self.relative_encoding = relative_encoding + if relative_encoding: + # Relative position embedding + self.position_embeddings = PositionalEmbedding(hidden_size) + # Per attention head and per partition values. + self.hidden_size_per_attention_head = divide(hidden_size, + num_attention_heads) + self.num_attention_heads_per_partition = num_attention_heads + self.r_w_bias = torch.nn.Parameter( + torch.Tensor(self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)) + self.r_r_bias = torch.nn.Parameter( + torch.Tensor(self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)) + + # Always initialize bias to zero. + with torch.no_grad(): + self.r_w_bias.zero_() + self.r_r_bias.zero_() + else: + # Position embedding (serial). + self.position_embeddings = torch.nn.Embedding(max_sequence_length, + hidden_size) + # Initialize the position embeddings. + torch.nn.init.normal_(self.position_embeddings.weight, mean=0.0, std=init_method_std) + + def get_layer(): + return GPT2TransformerLayer( + hidden_size, + num_attention_heads, + attention_dropout_prob, + output_dropout_prob, + layernorm_epsilon, + unscaled_init_method(init_method_std), + output_layer_init_method=output_layer_init_method, + relative_encoding=relative_encoding) + + # Transformer layers. + self.layers = torch.nn.ModuleList( + [get_layer() for _ in range(num_layers)]) + + # Final layer norm before output. 
+ self.final_layernorm = torch.nn.LayerNorm(hidden_size, eps=layernorm_epsilon) + + def forward(self, hidden_states, position_ids, attention_mask, *mems): + batch_size, query_length = hidden_states.size()[:2] + memory_length = mems[0].size(1) if mems else 0 + key_length = query_length + memory_length + attention_mask = attention_mask[:, :, :, -query_length - memory_length:] + if self.relative_encoding: + # why drop twice here + # hidden_states = self.embedding_dropout(hidden_states) + position_sequence = torch.arange(key_length - 1, -1, -1.0, device=hidden_states.device, + dtype=hidden_states.dtype) + position_embeddings = self.position_embeddings(position_sequence) + # Apply dropout + position_embeddings = self.embedding_dropout(position_embeddings) + hidden_states = self.embedding_dropout(hidden_states) + else: + position_embeddings = self.position_embeddings(position_ids) + hidden_states = hidden_states + position_embeddings + hidden_states = self.embedding_dropout(hidden_states) + if self.max_memory_length > 0: + mem_layers = [hidden_states.detach()] + else: + mem_layers = [] + + def custom(start, end): + def custom_forward(*inputs): + layers_ = self.layers[start:end] + x_, inputs = inputs[0], inputs[1:] + if self.relative_encoding: + inputs, mems_ = inputs[:4], inputs[4:] + else: + inputs, mems_ = inputs[:1], inputs[1:] + for i, layer in enumerate(layers_): + mem_i_ = mems_[i] if mems_ else None + x_ = layer(x_, *inputs, mem=mem_i_) + if self.max_memory_length > 0: + mem_layers.append(x_.detach()) + return x_ + return custom_forward + + if self.checkpoint_activations: + la = 0 + num_layers = len(self.layers) + chunk_length = self.checkpoint_num_layers + while la < num_layers: + args = [hidden_states, attention_mask] + if self.relative_encoding: + args += [position_embeddings, self.r_w_bias, self.r_r_bias] + if mems: + args += mems[la: la + chunk_length] + hidden_states = checkpoint(custom(la, la + chunk_length), *args) + la += chunk_length + else: + for i, layer in enumerate(self.layers): + args = [hidden_states, attention_mask] + if self.relative_encoding: + args += [position_embeddings, self.r_w_bias, self.r_r_bias] + mem_i = mems[i] if mems else None + hidden_states = layer(*args, mem=mem_i) + if self.max_memory_length > 0: + mem_layers.append(hidden_states.detach()) + + # Final layer norm. + output = self.final_layernorm(hidden_states) + if self.max_memory_length > 0: + mem_layers = self.update_mems(mem_layers, mems) + + return (output, *mem_layers) + + def update_mems(self, hiddens, mems): + memory_length = mems[0].size(1) if mems else 0 + query_length = hiddens[0].size(1) + new_memory_length = min(self.max_memory_length, memory_length + query_length) + new_mems = [] + with torch.no_grad(): + for i in range(len(hiddens)): + if new_memory_length <= query_length: + new_mems.append(hiddens[i][:, -new_memory_length:]) + else: + new_mems.append( + torch.cat( + (mems[i][:, -new_memory_length + query_length:], hiddens[i]), dim=1)) + return new_mems + + +class TransfoXLDenoisePreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and + a simple interface for downloading and loading pretrained models. 
+ """ + + config_class = TransfoXLDenoiseConfig + base_model_prefix = "transfo_xl_denoise" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """ Initialize the weights """ + pass # to bypass the not implement error + + +class TransfoXLDenoiseModel(TransfoXLDenoisePreTrainedModel): + """GPT-2 Language model. + + The output of the forward method are the logits (parallel or + serial depending on the `parallel_output` flag. + """ + + def __init__(self, config: TransfoXLDenoiseConfig): + super().__init__(config) + self.config = config + # Word embeddings (parallel). + self.word_embeddings = torch.nn.Embedding(config.vocab_size, config.hidden_size) + # Transformer + self.transformer = GPT2Transformer(config.num_layers, + config.hidden_size, + config.num_attention_heads, + config.max_sequence_length, + config.max_memory_length, + config.embedding_dropout_prob, + config.attention_dropout_prob, + config.output_dropout_prob, + config.checkpoint_activations, + config.checkpoint_num_layers, + relative_encoding=config.relative_encoding) + + def forward( + self, + input_ids=None, + attention_mask=None, + position_ids=None, + hidden_states=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + **unused, + ): + r""" + encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention + if the model is configured as a decoder. + encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask + is used in the cross-attention if the model is configured as a decoder. + Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with + each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` + (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` + instead of all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up + decoding (see `past_key_values`). + """ + # Embeddings. 
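+        # Note on the arguments handled below (the naming is a little unusual):
+        #   * `hidden_states` is the list of Transformer-XL memories returned by the
+        #     previous forward call (an empty list on the first segment), not the
+        #     per-layer hidden states of the current call.
+        #   * the transformer returns (output, *new_memories); the new memories are
+        #     exposed again as `hidden_states` on TransfoXLDenoiseModelOutput so they
+        #     can be fed back in on the next segment (see generate.py).
+        #   * logits reuse the input word-embedding matrix as the output projection
+        #     via F.linear(output, self.word_embeddings.weight).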
+ # one-hot batch_size * seq_len * vocab_size, can use gradient + # if input_ids.shape[-1] == self.word_embeddings.weight.shape[0]: + # words_embeddings = torch.einsum("ijk,kl->ijl", input_ids, self.word_embeddings.weight) + # else: + # print(f'input_ids {input_ids.device}, word_embedding {self.word_embeddings.weight.device}') + # words_embeddings = self.word_embeddings(input_ids) + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + assert input_ids is not None and attention_mask is not None and position_ids is not None, \ + "You have to specify input_ids, attention_mask, and position_ids. Check tokenizer.encode_plus for details" + if not hidden_states: + hidden_states = [] + embeddings = self.word_embeddings(input_ids) + + # Transformer. + transformer_output = self.transformer( + embeddings, position_ids, attention_mask, *hidden_states) + logits, *hidden_states = transformer_output + logits = F.linear(logits, self.word_embeddings.weight) + + if not return_dict: + return logits, hidden_states + + return TransfoXLDenoiseModelOutput( + logits=logits, + hidden_states=hidden_states + ) diff --git a/fengshen/models/transfo_xl_denoise/tokenization_transfo_xl_denoise.py b/fengshen/models/transfo_xl_denoise/tokenization_transfo_xl_denoise.py new file mode 100644 index 0000000000000000000000000000000000000000..9b454c8cc236a114074c8a099878f8e464f87ad5 --- /dev/null +++ b/fengshen/models/transfo_xl_denoise/tokenization_transfo_xl_denoise.py @@ -0,0 +1,82 @@ +# coding=utf-8 +# Copyright 2022 IDEA-CCNL and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for TransfoXLDenoise.""" + +import sentencepiece as spm +from transformers.tokenization_utils import PreTrainedTokenizer + +VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "transformer-xl-1b-base": + "https://huggingface.co/IDEA-CCNL/Bigan-Transformer-XL-denoise-1.1B/resolve/main/spiece.model", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "transformer-xl-1b-base": 512, +} + + +class TransfoXLDenoiseTokenizer(PreTrainedTokenizer): + """ + Construct a TransfoXLDenoise tokenizer. Based on pretrained sentence piece + + Args: + vocab_file (`str`): + Path to the vocabulary file. 
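+
+    Example (a minimal usage sketch; the checkpoint id is the one referenced in
+    `PRETRAINED_VOCAB_FILES_MAP` above and in `generate.py`):
+
+    ```python
+    >>> from fengshen.models.transfo_xl_denoise.tokenization_transfo_xl_denoise import TransfoXLDenoiseTokenizer
+
+    >>> tokenizer = TransfoXLDenoiseTokenizer.from_pretrained("IDEA-CCNL/Bigan-Transformer-XL-denoise-1.1B")
+    >>> ids = tokenizer.encode("今天天气不错")  # SentencePiece pieces mapped to vocabulary ids
+    >>> text = tokenizer.decode(ids)  # pieces joined back, "▁" restored to spaces
+    ```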
+ """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + model_input_names = ["input_ids", "attention_mask"] + SPIECE_UNDERLINE = "▁" + + def __init__( + self, + vocab_file, + unk_token="<|endoftext|>", + bos_token="<|endoftext|>", + eos_token="<|endoftext|>", + **kwargs + ): + super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs) + "Initialisation" + self.sp_model = spm.SentencePieceProcessor() + self.sp_model.Load(vocab_file) + + @property + def vocab_size(self): + "Returns vocab size" + return len(self.sp_model) + + def _tokenize(self, text): + """ Returns a tokenized string. """ + return self.sp_model.EncodeAsPieces(text) + + def _convert_token_to_id(self, token): + """ Converts a token (str) in an id using the vocab. """ + return self.sp_model.PieceToId(token) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.sp_model.IdToPiece(index) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = "".join(tokens).replace(self.SPIECE_UNDERLINE, " ").strip() + return out_string diff --git a/fengshen/models/transformer_utils.py b/fengshen/models/transformer_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fengshen/models/ubert/__init__.py b/fengshen/models/ubert/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..027dbd44683cfbd9a4804345db6633cc46ff8c23 --- /dev/null +++ b/fengshen/models/ubert/__init__.py @@ -0,0 +1,17 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from .modeling_ubert import UbertPiplines, UbertModel, UbertDataset diff --git a/fengshen/models/ubert/modeling_ubert.py b/fengshen/models/ubert/modeling_ubert.py new file mode 100644 index 0000000000000000000000000000000000000000..5f200c24110302020788fdc64801df3be84e3efa --- /dev/null +++ b/fengshen/models/ubert/modeling_ubert.py @@ -0,0 +1,698 @@ +# coding=utf-8 +# Copyright 2021 The IDEA Authors. All rights reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
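+
+# How Ubert casts different tasks as span extraction (summary of the code in this file):
+#   * UbertDataset.encode builds one (texta, textb) pair per candidate label:
+#       texta = task_type + '[SEP]' + subtask_type + '[SEP]' + entity_type, textb = the raw text.
+#   * Supervision is a (max_length, max_length) span matrix: for extraction tasks,
+#     span_labels[start_token, end_token] = 1 for every gold entity span, where character
+#     offsets are mapped to token positions by tokenizing the text prefix up to the offset;
+#     for classification tasks ('分类任务') the single cell (0, 0) carries the label.
+#   * span_labels_mask holds -10000 at invalid cells and 0 at valid ones; it is added to the
+#     logits in UbertModel.forward so masked cells are effectively excluded.
+#   * All-negative choices are randomly subsampled and, if needed, padded so that every item
+#     contributes exactly args.num_labels sequence pairs.
+#   * UbertModel scores every (start, end) pair with a biaffine head on top of BERT and is
+#     trained with a weighted mix of multilabel softmax losses and BCE.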
+ +from logging import basicConfig, setLogRecordFactory +import torch +from torch import nn +import json +from tqdm import tqdm +import os +import numpy as np +from transformers import ( + AutoTokenizer, + AutoModelForSequenceClassification, + BertTokenizer, + file_utils +) +import pytorch_lightning as pl + +from pytorch_lightning.callbacks import ModelCheckpoint +from pytorch_lightning import trainer, loggers +from torch.utils.data import Dataset, DataLoader +from transformers.optimization import get_linear_schedule_with_warmup +from transformers import BertForPreTraining, BertForMaskedLM, BertModel +from transformers import BertConfig, BertForTokenClassification, BertPreTrainedModel +import transformers +import unicodedata +import re +import argparse + + +transformers.logging.set_verbosity_error() +# os.environ["CUDA_VISIBLE_DEVICES"] = '6' + + +def search(pattern, sequence): + n = len(pattern) + res = [] + for i in range(len(sequence)): + if sequence[i:i + n] == pattern: + res.append([i, i + n-1]) + return res + + +class UbertDataset(Dataset): + def __init__(self, data, tokenizer, args, used_mask=True): + super().__init__() + self.tokenizer = tokenizer + self.max_length = args.max_length + self.num_labels = args.num_labels + self.used_mask = used_mask + self.data = data + self.args = args + + def __len__(self): + return len(self.data) + + def __getitem__(self, index): + return self.encode(self.data[index], self.used_mask) + + def encode(self, item, used_mask=False): + input_ids1 = [] + attention_mask1 = [] + token_type_ids1 = [] + span_labels1 = [] + span_labels_masks1 = [] + + input_ids0 = [] + attention_mask0 = [] + token_type_ids0 = [] + span_labels0 = [] + span_labels_masks0 = [] + + subtask_type = item['subtask_type'] + for choice in item['choices']: + try: + texta = item['task_type'] + '[SEP]' + \ + subtask_type + '[SEP]' + choice['entity_type'] + textb = item['text'] + encode_dict = self.tokenizer.encode_plus(texta, textb, + max_length=self.max_length, + padding='max_length', + truncation='longest_first') + + encode_sent = encode_dict['input_ids'] + encode_token_type_ids = encode_dict['token_type_ids'] + encode_attention_mask = encode_dict['attention_mask'] + span_label = np.zeros((self.max_length, self.max_length)) + span_label_mask = np.zeros( + (self.max_length, self.max_length))-10000 + + if item['task_type'] == '分类任务': + span_label_mask[0, 0] = 0 + span_label[0, 0] = choice['label'] + + else: + question_len = len(self.tokenizer.encode(texta)) + span_label_mask[question_len:, question_len:] = np.zeros( + (self.max_length-question_len, self.max_length-question_len)) + for entity in choice['entity_list']: + # if 'entity_name' in entity.keys() and entity['entity_name']=='': + # continue + entity_idx_list = entity['entity_idx'] + if entity_idx_list == []: + continue + for entity_idx in entity_idx_list: + if entity_idx == []: + continue + start_idx_text = item['text'][:entity_idx[0]] + start_idx_text_encode = self.tokenizer.encode( + start_idx_text, add_special_tokens=False) + start_idx = question_len + \ + len(start_idx_text_encode) + + end_idx_text = item['text'][:entity_idx[1]+1] + end_idx_text_encode = self.tokenizer.encode( + end_idx_text, add_special_tokens=False) + end_idx = question_len + \ + len(end_idx_text_encode) - 1 + if start_idx < self.max_length and end_idx < self.max_length: + span_label[start_idx, end_idx] = 1 + + if np.sum(span_label) < 1: + input_ids0.append(encode_sent) + attention_mask0.append(encode_attention_mask) + 
token_type_ids0.append(encode_token_type_ids) + span_labels0.append(span_label) + span_labels_masks0.append(span_label_mask) + else: + input_ids1.append(encode_sent) + attention_mask1.append(encode_attention_mask) + token_type_ids1.append(encode_token_type_ids) + span_labels1.append(span_label) + span_labels_masks1.append(span_label_mask) + except: + print(item) + print(texta) + print(textb) + + randomize = np.arange(len(input_ids0)) + np.random.shuffle(randomize) + cur = 0 + count = len(input_ids1) + while count < self.args.num_labels: + if cur < len(randomize): + input_ids1.append(input_ids0[randomize[cur]]) + attention_mask1.append(attention_mask0[randomize[cur]]) + token_type_ids1.append(token_type_ids0[randomize[cur]]) + span_labels1.append(span_labels0[randomize[cur]]) + span_labels_masks1.append(span_labels_masks0[randomize[cur]]) + cur += 1 + count += 1 + + while len(input_ids1) < self.args.num_labels: + input_ids1.append([0]*self.max_length) + attention_mask1.append([0]*self.max_length) + token_type_ids1.append([0]*self.max_length) + span_labels1.append(np.zeros((self.max_length, self.max_length))) + span_labels_masks1.append( + np.zeros((self.max_length, self.max_length))-10000) + + input_ids = input_ids1[:self.args.num_labels] + attention_mask = attention_mask1[:self.args.num_labels] + token_type_ids = token_type_ids1[:self.args.num_labels] + span_labels = span_labels1[:self.args.num_labels] + span_labels_masks = span_labels_masks1[:self.args.num_labels] + + span_labels = np.array(span_labels) + span_labels_masks = np.array(span_labels_masks) + if np.sum(span_labels) < 1: + span_labels[-1, -1, -1] = 1 + span_labels_masks[-1, -1, -1] = 10000 + + sample = { + "input_ids": torch.tensor(input_ids).long(), + "token_type_ids": torch.tensor(token_type_ids).long(), + "attention_mask": torch.tensor(attention_mask).float(), + "span_labels": torch.tensor(span_labels).float(), + "span_labels_mask": torch.tensor(span_labels_masks).float() + } + + return sample + + +class UbertDataModel(pl.LightningDataModule): + @staticmethod + def add_data_specific_args(parent_args): + parser = parent_args.add_argument_group('TASK NAME DataModel') + parser.add_argument('--num_workers', default=8, type=int) + parser.add_argument('--batchsize', default=8, type=int) + parser.add_argument('--max_length', default=128, type=int) + return parent_args + + def __init__(self, train_data, val_data, tokenizer, args): + super().__init__() + self.batchsize = args.batchsize + + self.train_data = UbertDataset(train_data, tokenizer, args, True) + self.valid_data = UbertDataset(val_data, tokenizer, args, False) + + def train_dataloader(self): + return DataLoader(self.train_data, shuffle=True, batch_size=self.batchsize, pin_memory=False) + + def val_dataloader(self): + return DataLoader(self.valid_data, shuffle=False, batch_size=self.batchsize, pin_memory=False) + + +class biaffine(nn.Module): + def __init__(self, in_size, out_size, bias_x=True, bias_y=True): + super().__init__() + self.bias_x = bias_x + self.bias_y = bias_y + self.out_size = out_size + self.U = torch.nn.Parameter(torch.zeros( + in_size + int(bias_x), out_size, in_size + int(bias_y))) + torch.nn.init.normal_(self.U, mean=0, std=0.1) + + def forward(self, x, y): + if self.bias_x: + x = torch.cat((x, torch.ones_like(x[..., :1])), dim=-1) + if self.bias_y: + y = torch.cat((y, torch.ones_like(y[..., :1])), dim=-1) + bilinar_mapping = torch.einsum('bxi,ioj,byj->bxyo', x, self.U, y) + return bilinar_mapping + + +class MultilabelCrossEntropy(nn.Module): + def 
__init__(self): + super().__init__() + + def forward(self, y_pred, y_true): + y_true = y_true.float() + y_pred = torch.mul((1.0 - torch.mul(y_true, 2.0)), y_pred) + y_pred_neg = y_pred - torch.mul(y_true, 1e12) + y_pred_pos = y_pred - torch.mul(1.0 - y_true, 1e12) + zeros = torch.zeros_like(y_pred[..., :1]) + y_pred_neg = torch.cat([y_pred_neg, zeros], axis=-1) + y_pred_pos = torch.cat([y_pred_pos, zeros], axis=-1) + neg_loss = torch.logsumexp(y_pred_neg, axis=-1) + pos_loss = torch.logsumexp(y_pred_pos, axis=-1) + loss = torch.mean(neg_loss + pos_loss) + return loss + + +class UbertModel(BertPreTrainedModel): + + def __init__(self, config): + super().__init__(config) + self.bert = BertModel(config) + self.query_layer = torch.nn.Sequential(torch.nn.Linear(in_features=self.config.hidden_size, + out_features=self.config.biaffine_size), + torch.nn.GELU()) + self.key_layer = torch.nn.Sequential(torch.nn.Linear(in_features=self.config.hidden_size, out_features=self.config.biaffine_size), + torch.nn.GELU()) + self.biaffine_query_key_cls = biaffine(self.config.biaffine_size, 1) + self.loss_softmax = MultilabelCrossEntropy() + self.loss_sigmoid = torch.nn.BCEWithLogitsLoss(reduction='mean') + + def forward(self, + input_ids, + attention_mask, + token_type_ids, + span_labels=None, + span_labels_mask=None): + + batch_size, num_label, seq_len = input_ids.shape + + input_ids = input_ids.view(-1, seq_len) + attention_mask = attention_mask.view(-1, seq_len) + token_type_ids = token_type_ids.view(-1, seq_len) + + batch_size, seq_len = input_ids.shape + outputs = self.bert(input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + output_hidden_states=True) # (bsz, seq, dim) + + hidden_states = outputs[0] + batch_size, seq_len, hidden_size = hidden_states.shape + + query = self.query_layer(hidden_states) + key = self.key_layer(hidden_states) + + span_logits = self.biaffine_query_key_cls( + query, key).reshape(-1, num_label, seq_len, seq_len) + + span_logits = span_logits + span_labels_mask + + if span_labels == None: + return 0, span_logits + else: + soft_loss1 = self.loss_softmax( + span_logits.reshape(-1, num_label, seq_len*seq_len), span_labels.reshape(-1, num_label, seq_len*seq_len)) + soft_loss2 = self.loss_softmax(span_logits.permute( + 0, 2, 3, 1), span_labels.permute(0, 2, 3, 1)) + sig_loss = self.loss_sigmoid(span_logits, span_labels) + all_loss = 10*(100*sig_loss+soft_loss1+soft_loss2) + return all_loss, span_logits + + +class UbertLitModel(pl.LightningModule): + @staticmethod + def add_model_specific_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--learning_rate', default=1e-5, type=float) + parser.add_argument('--weight_decay', default=0.1, type=float) + parser.add_argument('--warmup', default=0.01, type=float) + parser.add_argument('--num_labels', default=10, type=int) + + return parent_args + + def __init__(self, args, num_data=1): + super().__init__() + self.args = args + self.num_data = num_data + self.model = UbertModel.from_pretrained( + self.args.pretrained_model_path) + self.count = 0 + + def setup(self, stage) -> None: + if stage == 'fit': + num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0 + self.total_step = int(self.trainer.max_epochs * self.num_data / + (max(1, num_gpus) * self.trainer.accumulate_grad_batches)) + print('Total training step:', self.total_step) + + def training_step(self, batch, batch_idx): + loss, span_logits = self.model(**batch) + span_acc, recall, precise = 
self.comput_metrix_span( + span_logits, batch['span_labels']) + self.log('train_loss', loss) + self.log('train_span_acc', span_acc) + self.log('train_span_recall', recall) + self.log('train_span_precise', precise) + + return loss + + def validation_step(self, batch, batch_idx): + loss, span_logits = self.model(**batch) + span_acc, recall, precise = self.comput_metrix_span( + span_logits, batch['span_labels']) + + self.log('val_loss', loss) + self.log('val_span_acc', span_acc) + self.log('val_span_recall', recall) + self.log('val_span_precise', precise) + + def predict_step(self, batch, batch_idx): + loss, span_logits = self.model(**batch) + span_acc = self.comput_metrix_span(span_logits, batch['span_labels']) + return span_acc.item() + + def configure_optimizers(self): + + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + paras = list( + filter(lambda p: p[1].requires_grad, self.named_parameters())) + paras = [{ + 'params': + [p for n, p in paras if not any(nd in n for nd in no_decay)], + 'weight_decay': self.args.weight_decay + }, { + 'params': [p for n, p in paras if any(nd in n for nd in no_decay)], + 'weight_decay': 0.0 + }] + optimizer = torch.optim.AdamW(paras, lr=self.args.learning_rate) + scheduler = get_linear_schedule_with_warmup( + optimizer, int(self.total_step * self.args.warmup), + self.total_step) + + return [{ + 'optimizer': optimizer, + 'lr_scheduler': { + 'scheduler': scheduler, + 'interval': 'step', + 'frequency': 1 + } + }] + + def comput_metrix_span(self, logits, labels): + ones = torch.ones_like(logits) + zero = torch.zeros_like(logits) + logits = torch.where(logits < 0, zero, ones) + y_pred = logits.view(size=(-1,)) + y_true = labels.view(size=(-1,)) + corr = torch.eq(y_pred, y_true).float() + corr = torch.multiply(y_true, corr) + recall = torch.sum(corr.float())/(torch.sum(y_true.float())+1e-5) + precise = torch.sum(corr.float())/(torch.sum(y_pred.float())+1e-5) + f1 = 2*recall*precise/(recall+precise+1e-5) + return f1, recall, precise + + +class TaskModelCheckpoint: + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('BaseModel') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--checkpoint_path', + default='./checkpoint/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_epochs', default=1, type=float) + parser.add_argument('--every_n_train_steps', default=100, type=float) + + parser.add_argument('--save_weights_only', default=True, type=bool) + return parent_args + + def __init__(self, args): + self.callbacks = ModelCheckpoint(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + save_last=True, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.checkpoint_path, + filename=args.filename) + + +class OffsetMapping: + def __init__(self): + self._do_lower_case = True + + @staticmethod + def stem(token): + if token[:2] == '##': + return token[2:] + else: + return token + + @staticmethod + def _is_control(ch): + return unicodedata.category(ch) in ('Cc', 'Cf') + + @staticmethod + def _is_special(ch): + return bool(ch) and (ch[0] == '[') and (ch[-1] == ']') + + def rematch(self, text, tokens): + if self._do_lower_case: + text = text.lower() + + normalized_text, char_mapping = '', [] + for i, ch in 
enumerate(text): + if self._do_lower_case: + ch = unicodedata.normalize('NFD', ch) + ch = ''.join( + [c for c in ch if unicodedata.category(c) != 'Mn']) + ch = ''.join([ + c for c in ch + if not (ord(c) == 0 or ord(c) == 0xfffd or self._is_control(c)) + ]) + normalized_text += ch + char_mapping.extend([i] * len(ch)) + + text, token_mapping, offset = normalized_text, [], 0 + for token in tokens: + if self._is_special(token): + token_mapping.append([offset]) + offset += 1 + else: + token = self.stem(token) + start = text[offset:].index(token) + offset + end = start + len(token) + token_mapping.append(char_mapping[start:end]) + offset = end + + return token_mapping + + +class extractModel: + def get_actual_id(self, text, query_text, tokenizer, args): + text_encode = tokenizer.encode(text) + one_input_encode = tokenizer.encode(query_text) + text_start_id = search(text_encode[1:-1], one_input_encode)[0][0] + text_end_id = text_start_id+len(text_encode)-1 + if text_end_id > args.max_length: + text_end_id = args.max_length + + text_token = tokenizer.tokenize(text) + text_mapping = OffsetMapping().rematch(text, text_token) + + return text_start_id, text_end_id, text_mapping, one_input_encode + + def extract_index(self, span_logits, sample_length, split_value=0.5): + result = [] + for i in range(sample_length): + for j in range(i, sample_length): + if span_logits[i, j] > split_value: + result.append((i, j, span_logits[i, j])) + return result + + def extract_entity(self, text, entity_idx, text_start_id, text_mapping): + start_split = text_mapping[entity_idx[0]-text_start_id] if entity_idx[0] - \ + text_start_id < len(text_mapping) and entity_idx[0]-text_start_id >= 0 else [] + end_split = text_mapping[entity_idx[1]-text_start_id] if entity_idx[1] - \ + text_start_id < len(text_mapping) and entity_idx[1]-text_start_id >= 0 else [] + entity = '' + if start_split != [] and end_split != []: + entity = text[start_split[0]:end_split[-1]+1] + return entity + + def extract(self, batch_data, model, tokenizer, args): + input_ids = [] + attention_mask = [] + token_type_ids = [] + span_labels_masks = [] + + for item in batch_data: + input_ids0 = [] + attention_mask0 = [] + token_type_ids0 = [] + span_labels_masks0 = [] + for choice in item['choices']: + texta = item['task_type'] + '[SEP]' + \ + item['subtask_type'] + '[SEP]' + choice['entity_type'] + textb = item['text'] + encode_dict = tokenizer.encode_plus(texta, textb, + max_length=args.max_length, + padding='max_length', + truncation='longest_first') + + encode_sent = encode_dict['input_ids'] + encode_token_type_ids = encode_dict['token_type_ids'] + encode_attention_mask = encode_dict['attention_mask'] + span_label_mask = np.zeros( + (args.max_length, args.max_length))-10000 + + if item['task_type'] == '分类任务': + span_label_mask[0, 0] = 0 + else: + question_len = len(tokenizer.encode(texta)) + span_label_mask[question_len:, question_len:] = np.zeros( + (args.max_length-question_len, args.max_length-question_len)) + input_ids0.append(encode_sent) + attention_mask0.append(encode_attention_mask) + token_type_ids0.append(encode_token_type_ids) + span_labels_masks0.append(span_label_mask) + + input_ids.append(input_ids0) + attention_mask.append(attention_mask0) + token_type_ids.append(token_type_ids0) + span_labels_masks.append(span_labels_masks0) + + input_ids = torch.tensor(input_ids).to(model.device) + attention_mask = torch.tensor(attention_mask).to(model.device) + token_type_ids = torch.tensor(token_type_ids).to(model.device) + span_labels_mask = 
torch.tensor(span_labels_masks).to(model.device) + + _, span_logits = model.model(input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + span_labels=None, + span_labels_mask=span_labels_mask) + + span_logits = torch.nn.functional.sigmoid(span_logits) + span_logits = span_logits.cpu().detach().numpy() + + for i, item in enumerate(batch_data): + if item['task_type'] == '分类任务': + cls_idx = 0 + max_c = np.argmax(span_logits[i, :, cls_idx, cls_idx]) + batch_data[i]['choices'][max_c]['label'] = 1 + batch_data[i]['choices'][max_c]['score'] = span_logits[i, + max_c, cls_idx, cls_idx] + else: + if item['subtask_type'] == '抽取式阅读理解': + for c in range(len(item['choices'])): + texta = item['subtask_type'] + \ + '[SEP]' + choice['entity_type'] + textb = item['text'] + text_start_id, text_end_id, offset_mapping, input_ids = self.get_actual_id( + item['text'], texta+'[SEP]'+textb, tokenizer, args) + logits = span_logits[i, c, :, :] + max_index = np.unravel_index( + np.argmax(logits, axis=None), logits.shape) + entity_list = [] + if logits[max_index] > args.threshold: + + entity = self.extract_entity( + item['text'], (max_index[0], max_index[1]), text_start_id, offset_mapping) + entity = { + 'entity_name': entity, + 'score': logits[max_index] + } + if entity not in entity_list: + entity_list.append(entity) + batch_data[i]['choices'][c]['entity_list'] = entity_list + else: + for c in range(len(item['choices'])): + texta = item['task_type'] + '[SEP]' + item['subtask_type'] + \ + '[SEP]' + item['choices'][c]['entity_type'] + + textb = item['text'] + text_start_id, text_end_id, offset_mapping, input_ids = self.get_actual_id( + item['text'], texta+'[SEP]'+textb, tokenizer, args) + logits = span_logits[i, c, :, :] + sample_length = len(input_ids) + entity_idx_type_list = self.extract_index( + logits, sample_length, split_value=args.threshold) + entity_list = [] + + for entity_idx in entity_idx_type_list: + entity = self.extract_entity( + item['text'], (entity_idx[0], entity_idx[1]), text_start_id, offset_mapping) + entity = { + 'entity_name': entity, + 'score': entity_idx[2] + } + if entity not in entity_list: + entity_list.append(entity) + batch_data[i]['choices'][c]['entity_list'] = entity_list + return batch_data + + +class UbertPiplines: + @staticmethod + def piplines_args(parent_args): + total_parser = parent_args.add_argument_group("piplines args") + total_parser.add_argument( + '--pretrained_model_path', default='IDEA-CCNL/Erlangshen-Ubert-110M-Chinese', type=str) + total_parser.add_argument('--output_save_path', + default='./predict.json', type=str) + + total_parser.add_argument('--load_checkpoints_path', + default='', type=str) + + total_parser.add_argument('--max_extract_entity_number', + default=1, type=float) + + total_parser.add_argument('--train', action='store_true') + + total_parser.add_argument('--threshold', + default=0.5, type=float) + + total_parser = UbertDataModel.add_data_specific_args(total_parser) + total_parser = TaskModelCheckpoint.add_argparse_args(total_parser) + total_parser = UbertLitModel.add_model_specific_args(total_parser) + total_parser = pl.Trainer.add_argparse_args(parent_args) + + return parent_args + + def __init__(self, args): + + if args.load_checkpoints_path != '': + self.model = UbertLitModel.load_from_checkpoint( + args.load_checkpoints_path, args=args) + else: + self.model = UbertLitModel(args) + + self.args = args + self.checkpoint_callback = TaskModelCheckpoint(args).callbacks + self.logger = 
loggers.TensorBoardLogger(save_dir=args.default_root_dir) + self.trainer = pl.Trainer.from_argparse_args(args, + logger=self.logger, + callbacks=[self.checkpoint_callback]) + + self.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path, + additional_special_tokens=['[unused'+str(i+1)+']' for i in range(99)]) + + self.em = extractModel() + + def fit(self, train_data, dev_data): + data_model = UbertDataModel( + train_data, dev_data, self.tokenizer, self.args) + self.model.num_data = len(train_data) + self.trainer.fit(self.model, data_model) + + def predict(self, test_data, cuda=True): + result = [] + start = 0 + if cuda: + self.model = self.model.cuda() + self.model.eval() + while start < len(test_data): + batch_data = test_data[start:start+self.args.batchsize] + start += self.args.batchsize + + batch_result = self.em.extract( + batch_data, self.model, self.tokenizer, self.args) + result.extend(batch_result) + return result diff --git a/fengshen/models/zen1/__init__.py b/fengshen/models/zen1/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..2dec07c8fb965677ba8c8d3b0a13809d0199d301 --- /dev/null +++ b/fengshen/models/zen1/__init__.py @@ -0,0 +1,6 @@ +from .ngram_utils import ZenNgramDict, NGRAM_DICT_NAME +from .modeling import ZenConfig, ZenModel, ZenForPreTraining, ZenForTokenClassification, ZenForSequenceClassification +from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer +version = "0.1.0" +__all__ = ['ZenNgramDict', 'NGRAM_DICT_NAME', "ZenConfig", "ZenModel", "ZenForPreTraining", "ZenForTokenClassification", + "ZenForSequenceClassification", "BertTokenizer", "BasicTokenizer", "WordpieceTokenizer"] diff --git a/fengshen/models/zen1/configuration_zen1.py b/fengshen/models/zen1/configuration_zen1.py new file mode 100644 index 0000000000000000000000000000000000000000..c7cbeb5657ea07b2a4e8429199a6091be39864c8 --- /dev/null +++ b/fengshen/models/zen1/configuration_zen1.py @@ -0,0 +1,80 @@ +# coding=utf-8 +# Copyright 2022 IDEA-CCNL and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" TransfoXLDenoise model configuration """ + +from transformers.configuration_utils import PretrainedConfig + + +class ZenConfig(PretrainedConfig): + + """Configuration class to store the configuration of a `ZenModel`. + """ + + def __init__(self, + # vocab_size_or_config_json_file, + # word_vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + num_hidden_word_layers=6, + **kwargs): + """Constructs ZenConfig. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `BertModel`. + hidden_size: Size of the encoder layers and the pooler layer. + num_hidden_layers: Number of hidden layers in the Transformer encoder. 
+ num_attention_heads: Number of attention heads for each attention layer in + the Transformer encoder. + intermediate_size: The size of the "intermediate" (i.e., feed-forward) + layer in the Transformer encoder. + hidden_act: The non-linear activation function (function or string) in the + encoder and pooler. If string, "gelu", "relu" and "swish" are supported. + hidden_dropout_prob: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob: The dropout ratio for the attention + probabilities. + max_position_embeddings: The maximum sequence length that this model might + ever be used with. Typically set this to something large just in case + (e.g., 512 or 1024 or 2048). + type_vocab_size: The vocabulary size of the `token_type_ids` passed into + `BertModel`. + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + layer_norm_eps: The epsilon used by LayerNorm. + """ + # self.vocab_size = vocab_size_or_config_json_file + # self.word_size = word_vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.num_hidden_word_layers = num_hidden_word_layers + super().__init__(**kwargs) diff --git a/fengshen/models/zen1/modeling.py b/fengshen/models/zen1/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..510c1f890c21517605d49e62655719ff6c8bd8b2 --- /dev/null +++ b/fengshen/models/zen1/modeling.py @@ -0,0 +1,1085 @@ +# coding: utf-8 +# Copyright 2019 Sinovation Ventures AI Institute +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# This file is partially derived from the code at +# https://github.com/huggingface/transformers/tree/master/transformers +# +# Original copyright notice: +# +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
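+
+# Illustrative sketch of how ZenConfig (defined in configuration_zen1.py above) is combined with the
+# model classes in this file. The vocabulary sizes below are dummy values; in practice they come from
+# the checkpoint's config.json, e.g. via from_pretrained with the hub id listed in
+# PRETRAINED_MODEL_ARCHIVE_MAP below.
+#
+#     from fengshen.models.zen1 import ZenConfig, ZenModel
+#
+#     config = ZenConfig(vocab_size=21128, word_size=100000,
+#                        hidden_size=768, num_hidden_layers=12,
+#                        num_attention_heads=12, intermediate_size=3072)
+#     model = ZenModel(config)
+#     # or: model = ZenModel.from_pretrained('IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese')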
+"""PyTorch ZEN model classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals +import copy +import logging +import math +import os +import sys + +import torch +from torch import nn +from torch.nn import CrossEntropyLoss +from transformers import PreTrainedModel + +from .configuration_zen1 import ZenConfig + +logger = logging.getLogger(__name__) + +PRETRAINED_MODEL_ARCHIVE_MAP = { + 'IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese/resolve/main/pytorch_model.bin', +} +PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese/resolve/main/config.json', +} +BERT_CONFIG_NAME = 'bert_config.json' +TF_WEIGHTS_NAME = 'model.ckpt' + + +def prune_linear_layer(layer, index, dim=0): + """ Prune a linear layer (a model parameters) to keep only entries in index. + Return the pruned layer as a new layer with requires_grad=True. + Used to remove heads. + """ + index = index.to(layer.weight.device) + W = layer.weight.index_select(dim, index).clone().detach() + if layer.bias is not None: + if dim == 1: + b = layer.bias.clone().detach() + else: + b = layer.bias[index].clone().detach() + new_size = list(layer.weight.size()) + new_size[dim] = len(index) + new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device) + new_layer.weight.requires_grad = False + new_layer.weight.copy_(W.contiguous()) + new_layer.weight.requires_grad = True + if layer.bias is not None: + new_layer.bias.requires_grad = False + new_layer.bias.copy_(b.contiguous()) + new_layer.bias.requires_grad = True + return new_layer + + +def load_tf_weights_in_bert(model, tf_checkpoint_path): + """ Load tf checkpoints in a pytorch model + """ + try: + import re + import numpy as np + import tensorflow as tf + except ImportError: + print("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + tf_path = os.path.abspath(tf_checkpoint_path) + print("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + print("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split('/') + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if any(n in ["adam_v", "adam_m", "global_step"] for n in name): + print("Skipping {}".format("/".join(name))) + continue + pointer = model + for m_name in name: + if re.fullmatch(r'[A-Za-z]+_\d+', m_name): + lname = re.split(r'_(\d+)', m_name) + else: + lname = [m_name] + if lname[0] == 'kernel' or lname[0] == 'gamma': + pointer = getattr(pointer, 'weight') + elif lname[0] == 'output_bias' or lname[0] == 'beta': + pointer = getattr(pointer, 'bias') + elif lname[0] == 'output_weights': + pointer = getattr(pointer, 'weight') + elif lname[0] == 'squad': + pointer = getattr(pointer, 'classifier') + else: + try: + pointer = getattr(pointer, lname[0]) + except AttributeError: + print("Skipping {}".format("/".join(name))) + continue + if len(lname) >= 2: + num = int(lname[1]) + pointer = pointer[num] + if m_name[-11:] == '_embeddings': + pointer = getattr(pointer, 'weight') + elif m_name == 'kernel': + array = np.transpose(array) + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + print("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +def gelu(x): + """Implementation of the gelu activation function. + For information: OpenAI GPT's gelu is slightly different (and gives slightly different results): + 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + Also see https://arxiv.org/abs/1606.08415 + """ + return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) + + +def swish(x): + return x * torch.sigmoid(x) + + +ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish} + + +try: + # from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm + from torch.nn import LayerNorm as BertLayerNorm +except ImportError: + logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .") + + class BertLayerNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-12): + """Construct a layernorm module in the TF style (epsilon inside the square root). + """ + super(BertLayerNorm, self).__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.bias = nn.Parameter(torch.zeros(hidden_size)) + self.variance_epsilon = eps + + def forward(self, x): + u = x.mean(-1, keepdim=True) + s = (x - u).pow(2).mean(-1, keepdim=True) + x = (x - u) / torch.sqrt(s + self.variance_epsilon) + return self.weight * x + self.bias + + +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings. 
+ """ + + def __init__(self, config): + super(BertEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None): + seq_length = input_ids.size(1) + position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + words_embeddings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = words_embeddings + position_embeddings + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertWordEmbeddings(nn.Module): + """Construct the embeddings from ngram, position and token_type embeddings. + """ + + def __init__(self, config): + super(BertWordEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.word_size, config.hidden_size, padding_idx=0) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None): + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + words_embeddings = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = words_embeddings + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertSelfAttention, self).__init__() + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError( + "The hidden size (%d) is not a multiple of the number of attention " + "heads (%d)" % (config.hidden_size, config.num_attention_heads)) + self.output_attentions = output_attentions + self.keep_multihead_output = keep_multihead_output + self.multihead_output = None + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 
3) + + def forward(self, hidden_states, attention_mask, head_mask=None): + mixed_query_layer = self.query(hidden_states) + mixed_key_layer = self.key(hidden_states) + mixed_value_layer = self.value(hidden_states) + + query_layer = self.transpose_for_scores(mixed_query_layer) + key_layer = self.transpose_for_scores(mixed_key_layer) + value_layer = self.transpose_for_scores(mixed_value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + if self.keep_multihead_output: + self.multihead_output = context_layer + self.multihead_output.retain_grad() + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + if self.output_attentions: + return attention_probs, context_layer + return context_layer + + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super(BertSelfOutput, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertAttention, self).__init__() + self.output_attentions = output_attentions + self.self = BertSelfAttention(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.output = BertSelfOutput(config) + + def prune_heads(self, heads): + if len(heads) == 0: + return + mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size) + for head in heads: + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + # Update hyper params + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + + def forward(self, input_tensor, attention_mask, head_mask=None): + self_output = self.self(input_tensor, attention_mask, head_mask) + if self.output_attentions: + attentions, self_output = self_output + attention_output = 
self.output(self_output, input_tensor) + if self.output_attentions: + return attentions, attention_output + return attention_output + + +class BertIntermediate(nn.Module): + def __init__(self, config): + super(BertIntermediate, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + # if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Module): + def __init__(self, config): + super(BertOutput, self).__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertLayer, self).__init__() + self.output_attentions = output_attentions + self.attention = BertAttention(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward(self, hidden_states, attention_mask, head_mask=None): + attention_output = self.attention(hidden_states, attention_mask, head_mask) + if self.output_attentions: + attentions, attention_output = attention_output + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + if self.output_attentions: + return attentions, layer_output + return layer_output + + +class ZenEncoder(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenEncoder, self).__init__() + self.output_attentions = output_attentions + layer = BertLayer(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)]) + self.word_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_word_layers)]) + self.num_hidden_word_layers = config.num_hidden_word_layers + + def forward(self, hidden_states, ngram_hidden_states, ngram_position_matrix, attention_mask, + ngram_attention_mask, + output_all_encoded_layers=True, head_mask=None): + # Need to check what is the attention masking doing here + all_encoder_layers = [] + all_attentions = [] + num_hidden_ngram_layers = self.num_hidden_word_layers + for i, layer_module in enumerate(self.layer): + hidden_states = layer_module(hidden_states, attention_mask, head_mask[i]) + if i < num_hidden_ngram_layers: + ngram_hidden_states = self.word_layers[i](ngram_hidden_states, ngram_attention_mask, head_mask[i]) + if self.output_attentions: + ngram_attentions, ngram_hidden_states = ngram_hidden_states + if self.output_attentions: + attentions, hidden_states = hidden_states + all_attentions.append(attentions) + hidden_states += 
torch.bmm(ngram_position_matrix.float(), ngram_hidden_states.float()) + if output_all_encoded_layers: + all_encoder_layers.append(hidden_states) + if not output_all_encoded_layers: + all_encoder_layers.append(hidden_states) + if self.output_attentions: + return all_attentions, all_encoder_layers + return all_encoder_layers + + +class BertPooler(nn.Module): + def __init__(self, config): + super(BertPooler, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPredictionHeadTransform(nn.Module): + def __init__(self, config): + super(BertPredictionHeadTransform, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + # if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2): + self.transform_act_fn = ACT2FN[config.hidden_act] + else: + self.transform_act_fn = config.hidden_act + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.transform_act_fn(hidden_states) + hidden_states = self.LayerNorm(hidden_states) + return hidden_states + + +class BertLMPredictionHead(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(BertLMPredictionHead, self).__init__() + self.transform = BertPredictionHeadTransform(config) + + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
+ self.decoder = nn.Linear(bert_model_embedding_weights.size(1), + bert_model_embedding_weights.size(0), + bias=False) + self.decoder.weight = bert_model_embedding_weights + self.bias = nn.Parameter(torch.zeros(bert_model_embedding_weights.size(0))) + + def forward(self, hidden_states): + hidden_states = self.transform(hidden_states) + hidden_states = self.decoder(hidden_states) + self.bias + return hidden_states + + +class ZenOnlyMLMHead(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(ZenOnlyMLMHead, self).__init__() + self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights) + + def forward(self, sequence_output): + prediction_scores = self.predictions(sequence_output) + return prediction_scores + + +class ZenOnlyNSPHead(nn.Module): + def __init__(self, config): + super(ZenOnlyNSPHead, self).__init__() + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, pooled_output): + seq_relationship_score = self.seq_relationship(pooled_output) + return seq_relationship_score + + +class ZenPreTrainingHeads(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(ZenPreTrainingHeads, self).__init__() + self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights) + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, sequence_output, pooled_output): + prediction_scores = self.predictions(sequence_output) + seq_relationship_score = self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class ZenPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = ZenConfig + base_model_prefix = "IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +class ZenModel(ZenPreTrainedModel): + """ZEN model ("BERT-based Chinese (Z) text encoder Enhanced by N-gram representations"). + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). 
+ `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + + Outputs: Tuple of (encoded_layers, pooled_output) + `encoded_layers`: controled by `output_all_encoded_layers` argument: + - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end + of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each + encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size], + - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding + to the last attention block of shape [batch_size, sequence_length, hidden_size], + `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a + classifier pretrained on top of the hidden state associated to the first character of the + input (`CLS`) to train on the Next-Sentence task (see BERT's paper). + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenModel, self).__init__(config) + self.output_attentions = output_attentions + self.embeddings = BertEmbeddings(config) + self.word_embeddings = BertWordEmbeddings(config) + self.encoder = ZenEncoder(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.pooler = BertPooler(config) + self.init_weights() + + def prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def get_multihead_outputs(self): + """ Gather all multi-head outputs. + Return: list (layers) of multihead module outputs with gradients + """ + return [layer.attention.self.multihead_output for layer in self.encoder.layer] + + def forward(self, input_ids, + input_ngram_ids, + ngram_position_matrix, + token_type_ids=None, + ngram_token_type_ids=None, + attention_mask=None, + ngram_attention_mask=None, + output_all_encoded_layers=True, + head_mask=None): + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + if ngram_attention_mask is None: + ngram_attention_mask = torch.ones_like(input_ngram_ids) + if ngram_token_type_ids is None: + ngram_token_type_ids = torch.zeros_like(input_ngram_ids) + + # We create a 3D attention mask from a 2D tensor mask. 
+ # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) + extended_ngram_attention_mask = ngram_attention_mask.unsqueeze(1).unsqueeze(2) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + + extended_ngram_attention_mask = extended_ngram_attention_mask.to(dtype=next(self.parameters()).dtype) + extended_ngram_attention_mask = (1.0 - extended_ngram_attention_mask) * -10000.0 + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand_as(self.config.num_hidden_layers, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze( + -1) # We can specify head_mask for each layer + head_mask = head_mask.to( + dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.num_hidden_layers + + embedding_output = self.embeddings(input_ids, token_type_ids) + ngram_embedding_output = self.word_embeddings(input_ngram_ids, ngram_token_type_ids) + + encoded_layers = self.encoder(embedding_output, + ngram_embedding_output, + ngram_position_matrix, + extended_attention_mask, + extended_ngram_attention_mask, + output_all_encoded_layers=output_all_encoded_layers, + head_mask=head_mask) + if self.output_attentions: + all_attentions, encoded_layers = encoded_layers + sequence_output = encoded_layers[-1] + pooled_output = self.pooler(sequence_output) + if not output_all_encoded_layers: + encoded_layers = encoded_layers[-1] + if self.output_attentions: + return all_attentions, encoded_layers, pooled_output + return encoded_layers, pooled_output + + +class ZenForPreTraining(ZenPreTrainedModel): + """ZEN model with pre-training heads. + This module comprises the ZEN model followed by the two pre-training heads: + - the masked language modeling head, and + - the next sentence classification head. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. 
Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss + is only computed for the labels set in [0, ..., vocab_size] + `next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size] + with indices selected in [0, 1]. + 0 => next sentence is the continuation, 1 => next sentence is a random sentence. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `masked_lm_labels` and `next_sentence_label` are not `None`: + Outputs the total_loss which is the sum of the masked language modeling loss and the next + sentence classification loss. + if `masked_lm_labels` or `next_sentence_label` is `None`: + Outputs a tuple comprising + - the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and + - the next sentence classification logits of shape [batch_size, 2]. 
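+
+    Example usage (a minimal illustrative sketch; the config values and tensors below are dummies,
+    and the ngram inputs are normally built with ZenNgramDict rather than written by hand):
+
+        # two sequences of length 3, each with 2 candidate ngrams
+        input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+        input_ngram_ids = torch.LongTensor([[3, 7], [2, 0]])
+        ngram_position_matrix = torch.zeros(2, 3, 2, dtype=torch.long)
+
+        config = ZenConfig(vocab_size=21128, word_size=100000)
+        model = ZenForPreTraining(config)
+        prediction_scores, seq_relationship_score = model(input_ids, input_ngram_ids, ngram_position_matrix)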
+ + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForPreTraining, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, + ngram_token_type_ids=None, + attention_mask=None, + ngram_attention_mask=None, + masked_lm_labels=None, + next_sentence_label=None, head_mask=None): + outputs = self.bert(input_ids, + input_ngram_ids, + ngram_position_matrix, + token_type_ids, + ngram_token_type_ids, + attention_mask, + ngram_attention_mask, + output_all_encoded_layers=False, head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, pooled_output = outputs + else: + sequence_output, pooled_output = outputs + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output) + + if masked_lm_labels is not None and next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + total_loss = masked_lm_loss + next_sentence_loss + return total_loss + elif self.output_attentions: + return all_attentions, prediction_scores, seq_relationship_score + return prediction_scores, seq_relationship_score + + +class ZenForMaskedLM(ZenPreTrainedModel): + """ZEN model with the masked language modeling head. + This module comprises the ZEN model followed by the masked language modeling head. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss + is only computed for the labels set in [0, ..., vocab_size] + `head_mask`: an optional torch.LongTensor of shape [num_heads] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. 
+ `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `masked_lm_labels` is not `None`: + Outputs the masked language modeling loss. + if `masked_lm_labels` is `None`: + Outputs the masked language modeling logits of shape [batch_size, sequence_length, vocab_size]. + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForMaskedLM, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenOnlyMLMHead(config, self.bert.embeddings.word_embeddings.weight) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, masked_lm_labels=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, _ = outputs + else: + sequence_output, _ = outputs + prediction_scores = self.cls(sequence_output) + + if masked_lm_labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + return masked_lm_loss + elif self.output_attentions: + return all_attentions, prediction_scores + return prediction_scores + + +class ZenForNextSentencePrediction(ZenPreTrainedModel): + """ZEN model with next sentence prediction head. + This module comprises the ZEN model followed by the next sentence classification head. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] + with indices selected in [0, 1]. + 0 => next sentence is the continuation, 1 => next sentence is a random sentence. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. 
+ It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `next_sentence_label` is not `None`: + Outputs the total_loss which is the sum of the masked language modeling loss and the next + sentence classification loss. + if `next_sentence_label` is `None`: + Outputs the next sentence classification logits of shape [batch_size, 2]. + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForNextSentencePrediction, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenOnlyNSPHead(config) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, next_sentence_label=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, _, pooled_output = outputs + else: + _, pooled_output = outputs + seq_relationship_score = self.cls(pooled_output) + + if next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + return next_sentence_loss + elif self.output_attentions: + return all_attentions, seq_relationship_score + return seq_relationship_score + + +class ZenForSequenceClassification(ZenPreTrainedModel): + """ZEN model for classification. + This module is composed of the ZEN model with a linear layer on top of + the pooled output. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + `num_labels`: the number of classes for the classifier. Default = 2. + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts + `extract_features.py`, `run_classifier.py` and `run_squad.py`) + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `labels`: labels for the classification output: torch.LongTensor of shape [batch_size] + with indices selected in [0, ..., num_labels]. 
+ `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `labels` is not `None`: + Outputs the CrossEntropy classification loss of the output with the labels. + if `labels` is `None`: + Outputs the classification logits of shape [batch_size, num_labels]. + + """ + + def __init__(self, config, num_labels=2, output_attentions=False, keep_multihead_output=False): + # super().__init__(config, num_labels, output_attentions, keep_multihead_output) + super().__init__(config) + self.config = config + self.output_attentions = output_attentions + self.num_labels = config.num_labels + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, self.num_labels) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, labels=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, _, pooled_output = outputs + else: + _, pooled_output = outputs + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + # print('logits***************', logits, labels) + # breakpoint() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + return loss, logits + elif self.output_attentions: + return all_attentions, logits + return loss, logits + + +class ZenForTokenClassification(ZenPreTrainedModel): + """ZEN model for token-level classification. + This module is composed of the ZEN model with a linear layer on top of + the full hidden state of the last layer. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + `num_labels`: the number of classes for the classifier. Default = 2. + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. 
+ `labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [0, ..., num_labels]. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `labels` is not `None`: + Outputs the CrossEntropy classification loss of the output with the labels. + if `labels` is `None`: + Outputs the classification logits of shape [batch_size, sequence_length, num_labels]. + + """ + + def __init__(self, config, num_labels=2, output_attentions=False, keep_multihead_output=False): + super().__init__(config) + self.output_attentions = output_attentions + self.num_labels = config.num_labels + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + self.init_weights() + + def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, valid_ids=None, + attention_mask_label=None, ngram_ids=None, ngram_positions=None, head_mask=None): + outputs = self.bert(input_ids, ngram_ids, ngram_positions, token_type_ids, attention_mask, + output_all_encoded_layers=False, head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, _ = outputs + else: + sequence_output, _ = outputs + + batch_size, max_len, feat_dim = sequence_output.shape + valid_output = torch.zeros(batch_size, max_len, feat_dim, dtype=torch.float32, device=input_ids.device) + + if self.num_labels == 38: + # just for POS to filter/mask input_ids=0 + for i in range(batch_size): + temp = sequence_output[i][valid_ids[i] == 1] + valid_output[i][:temp.size(0)] = temp + else: + valid_output = sequence_output + + sequence_output = self.dropout(valid_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=0) + # Only keep active parts of the loss + attention_mask_label = None + if attention_mask_label is not None: + active_loss = attention_mask_label.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels)[active_loss] + active_labels = labels.view(-1)[active_loss] + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + return loss, logits + else: + return loss, logits diff --git a/fengshen/models/zen1/ngram_utils.py b/fengshen/models/zen1/ngram_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..917f770fab84db4c8a55b11a296afdb61f8283c9 --- /dev/null +++ b/fengshen/models/zen1/ngram_utils.py @@ -0,0 +1,106 @@ +# coding: utf-8 +# Copyright 2019 Sinovation Ventures AI Institute +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""utils for ngram for ZEN model.""" + +import os +import logging + +from transformers import cached_path + +NGRAM_DICT_NAME = 'ngram.txt' + +logger = logging.getLogger(__name__) +PRETRAINED_VOCAB_ARCHIVE_MAP = {'IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese/resolve/main/ngram.txt'} + + +class ZenNgramDict(object): + """ + Dict class to store the ngram + """ + + def __init__(self, ngram_freq_path, tokenizer, max_ngram_in_seq=128): + """Constructs ZenNgramDict + + :param ngram_freq_path: ngrams with frequency + """ + if os.path.isdir(ngram_freq_path): + ngram_freq_path = os.path.join(ngram_freq_path, NGRAM_DICT_NAME) + self.ngram_freq_path = ngram_freq_path + self.max_ngram_in_seq = max_ngram_in_seq + self.id_to_ngram_list = ["[pad]"] + self.ngram_to_id_dict = {"[pad]": 0} + self.ngram_to_freq_dict = {} + + logger.info("loading ngram frequency file {}".format(ngram_freq_path)) + with open(ngram_freq_path, "r", encoding="utf-8") as fin: + for i, line in enumerate(fin): + ngram, freq = line.split(",") + tokens = tuple(tokenizer.tokenize(ngram)) + self.ngram_to_freq_dict[ngram] = freq + self.id_to_ngram_list.append(tokens) + self.ngram_to_id_dict[tokens] = i + 1 + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, **kwargs): + """ + Instantiate a PreTrainedBertModel from a pre-trained model file. + Download and cache the pre-trained model file if needed. + """ + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + ngram_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path] + if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is a cased model but you have not set " + "`do_lower_case` to False. We are setting `do_lower_case=False` for you but " + "you may want to check this behavior.") + kwargs['do_lower_case'] = False + elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is an uncased model but you have set " + "`do_lower_case` to False. We are setting `do_lower_case=True` for you " + "but you may want to check this behavior.") + kwargs['do_lower_case'] = True + else: + ngram_file = pretrained_model_name_or_path + if os.path.isdir(ngram_file): + ngram_file = os.path.join(ngram_file, NGRAM_DICT_NAME) + # redirect to the cache, if necessary + try: + resolved_ngram_file = cached_path(ngram_file, cache_dir=cache_dir) + except EnvironmentError: + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + logger.error( + "Couldn't reach server at '{}' to download vocabulary.".format( + ngram_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). 
" + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()), + ngram_file)) + return None + if resolved_ngram_file == ngram_file: + logger.info("loading vocabulary file {}".format(ngram_file)) + else: + logger.info("loading vocabulary file {} from cache at {}".format( + ngram_file, resolved_ngram_file)) + # Instantiate ngram. + ngram_dict = cls(resolved_ngram_file, **kwargs) + return ngram_dict + + def save(self, ngram_freq_path): + with open(ngram_freq_path, "w", encoding="utf-8") as fout: + for ngram, freq in self.ngram_to_freq_dict.items(): + fout.write("{},{}\n".format(ngram, freq)) diff --git a/fengshen/models/zen1/tokenization.py b/fengshen/models/zen1/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..bbc94e2417ff42ffcfb18284b8cb396415e630b1 --- /dev/null +++ b/fengshen/models/zen1/tokenization.py @@ -0,0 +1,438 @@ +# coding=utf-8 +# This file is derived from the code at +# https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py +# +# Original copyright notice: +# +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import logging +import os +import unicodedata +from io import open + +from transformers import cached_path + +logger = logging.getLogger(__name__) + +PRETRAINED_VOCAB_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt", + 'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt", + 'IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN1-224M-Chinese/resolve/main/vocab.txt', +} +PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = { + 'bert-base-uncased': 512, + 'bert-large-uncased': 512, + 'bert-base-cased': 512, + 'bert-large-cased': 512, + 'bert-base-multilingual-uncased': 512, + 'bert-base-multilingual-cased': 512, + 'bert-base-chinese': 512, + 'bert-base-german-cased': 512, + 'bert-large-uncased-whole-word-masking': 512, + 'bert-large-cased-whole-word-masking': 512, + 'bert-large-uncased-whole-word-masking-finetuned-squad': 512, + 'bert-large-cased-whole-word-masking-finetuned-squad': 512, + 'bert-base-cased-finetuned-mrpc': 512, +} +VOCAB_NAME = 'vocab.txt' + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + index = 0 + with open(vocab_file, "r", encoding="utf-8") as reader: + while True: + token = reader.readline() + if not token: + break + token = token.strip() + vocab[token] = index + index += 1 + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(object): + """Runs end-to-end tokenization: punctuation splitting + wordpiece""" + + def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a 
BertTokenizer. + + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input + Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; + Effective maximum length is always the minimum of this + value (if specified) and the underlying BERT model's + sequence length. + never_split: List of tokens which will never be split during tokenization. + Only has an effect when do_wordpiece_only=False + """ + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict( + [(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, + never_split=never_split) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + self.max_len = max_len if max_len is not None else int(1e12) + + def tokenize(self, text): + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def convert_tokens_to_ids(self, tokens): + """Converts a sequence of tokens into ids using the vocab.""" + ids = [] + for token in tokens: + ids.append(self.vocab[token]) + if len(ids) > self.max_len: + logger.warning( + "Token indices sequence length is longer than the specified maximum " + " sequence length for this BERT model ({} > {}). Running this" + " sequence through BERT will result in indexing errors".format(len(ids), self.max_len) + ) + return ids + + def convert_ids_to_tokens(self, ids): + """Converts a sequence of ids in wordpiece tokens using the vocab.""" + tokens = [] + for i in ids: + tokens.append(self.ids_to_tokens[i]) + return tokens + + def save_vocabulary(self, vocab_path): + """Save the tokenizer vocabulary to a directory or file.""" + index = 0 + if os.path.isdir(vocab_path): + vocab_file = os.path.join(vocab_path, VOCAB_NAME) + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!".format(vocab_file)) + index = token_index + writer.write(token + u'\n') + index += 1 + return vocab_file + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs): + """ + Instantiate a PreTrainedBertModel from a pre-trained model file. + Download and cache the pre-trained model file if needed. + """ + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path] + if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is a cased model but you have not set " + "`do_lower_case` to False. 
We are setting `do_lower_case=False` for you but " + "you may want to check this behavior.") + kwargs['do_lower_case'] = False + elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is an uncased model but you have set " + "`do_lower_case` to False. We are setting `do_lower_case=True` for you " + "but you may want to check this behavior.") + kwargs['do_lower_case'] = True + else: + vocab_file = pretrained_model_name_or_path + if os.path.isdir(vocab_file): + vocab_file = os.path.join(vocab_file, VOCAB_NAME) + # redirect to the cache, if necessary + try: + resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir) + except EnvironmentError: + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + logger.error( + "Couldn't reach server at '{}' to download vocabulary.".format( + vocab_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). " + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()), + vocab_file)) + return None + if resolved_vocab_file == vocab_file: + logger.info("loading vocabulary file {}".format(vocab_file)) + else: + logger.info("loading vocabulary file {} from cache at {}".format( + vocab_file, resolved_vocab_file)) + if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP: + # if we're using a pretrained model, ensure the tokenizer wont index sequences longer + # than the number of positional embeddings + max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name_or_path] + kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len) + # Instantiate tokenizer. + tokenizer = cls(resolved_vocab_file, *inputs, **kwargs) + return tokenizer + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, + do_lower_case=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + self.never_split = never_split + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = self._clean_text(text) + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
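        # Added note: _tokenize_chinese_chars pads every CJK codepoint with spaces,
        # so a hypothetical input such as "ZEN模型" becomes "ZEN 模  型 " and the
        # whitespace split below yields ["ZEN", "模", "型"]; Chinese text is therefore
        # tokenized character by character before WordPiece runs.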
+ text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case and token not in self.never_split: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + if text in self.never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer`. + + Returns: + A list of wordpiece tokens. 
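        Walking through the example above (assuming "un", "##aff" and "##able" are
        all in the vocabulary while "unaffable" is not): the longest prefix found at
        position 0 is "un"; the search then restarts at position 2 with the "##"
        continuation prefix and matches "##aff", then "##able". If no match exists at
        some position, the whole token is replaced by `unk_token`.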
+ """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. + if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/fengshen/models/zen2/__init__.py b/fengshen/models/zen2/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..67450bc9ff1c9f7c80131fb149ab3389fda15cea --- /dev/null +++ b/fengshen/models/zen2/__init__.py @@ -0,0 +1,12 @@ +from .configuration_zen2 import ZenConfig +from .modeling import ZenForPreTraining, ZenForTokenClassification, ZenForSequenceClassification, ZenForQuestionAnswering, ZenModel +from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer, _is_whitespace, whitespace_tokenize, convert_to_unicode, _is_punctuation, _is_control, VOCAB_NAME +from .ngram_utils import ZenNgramDict, NGRAM_DICT_NAME, extract_ngram_feature, construct_ngram_matrix +__all__ = [ + 'ZenConfig', 'ZenForPreTraining', 'ZenForTokenClassification', 'ZenForSequenceClassification', + 'ZenForQuestionAnswering', 'ZenModel', 'BertTokenizer', 'BasicTokenizer', + 'WordpieceTokenizer', '_is_whitespace', 'whitespace_tokenize', 'convert_to_unicode', + '_is_punctuation', '_is_control', 'VOCAB_NAME', 'ZenNgramDict', 'NGRAM_DICT_NAME', + 'extract_ngram_feature', 'construct_ngram_matrix', +] +version = "0.1.0" diff --git a/fengshen/models/zen2/configuration_zen2.py b/fengshen/models/zen2/configuration_zen2.py new file mode 100644 index 0000000000000000000000000000000000000000..c7cbeb5657ea07b2a4e8429199a6091be39864c8 --- /dev/null +++ b/fengshen/models/zen2/configuration_zen2.py @@ -0,0 +1,80 @@ +# coding=utf-8 +# Copyright 2022 IDEA-CCNL and The HuggingFace Inc. team. All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" ZEN2 model configuration """
+
+from transformers.configuration_utils import PretrainedConfig
+
+
+class ZenConfig(PretrainedConfig):
+
+    """Configuration class to store the configuration of a `ZenModel`.
+    """
+
+    def __init__(self,
+                 # vocab_size_or_config_json_file,
+                 # word_vocab_size,
+                 hidden_size=768,
+                 num_hidden_layers=12,
+                 num_attention_heads=12,
+                 intermediate_size=3072,
+                 hidden_act="gelu",
+                 hidden_dropout_prob=0.1,
+                 attention_probs_dropout_prob=0.1,
+                 max_position_embeddings=512,
+                 type_vocab_size=2,
+                 initializer_range=0.02,
+                 layer_norm_eps=1e-12,
+                 num_hidden_word_layers=6,
+                 **kwargs):
+        """Constructs ZenConfig.
+
+        Args:
+            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `BertModel`.
+            hidden_size: Size of the encoder layers and the pooler layer.
+            num_hidden_layers: Number of hidden layers in the Transformer encoder.
+            num_attention_heads: Number of attention heads for each attention layer in
+                the Transformer encoder.
+            intermediate_size: The size of the "intermediate" (i.e., feed-forward)
+                layer in the Transformer encoder.
+            hidden_act: The non-linear activation function (function or string) in the
+                encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
+            hidden_dropout_prob: The dropout probability for all fully connected
+                layers in the embeddings, encoder, and pooler.
+            attention_probs_dropout_prob: The dropout ratio for the attention
+                probabilities.
+            max_position_embeddings: The maximum sequence length that this model might
+                ever be used with. Typically set this to something large just in case
+                (e.g., 512 or 1024 or 2048).
+            type_vocab_size: The vocabulary size of the `token_type_ids` passed into
+                `BertModel`.
+            initializer_range: The stdev of the truncated_normal_initializer for
+                initializing all weight matrices.
+            num_hidden_word_layers: Number of Transformer layers in the n-gram (word) encoder.
+            layer_norm_eps: The epsilon used by LayerNorm.
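        Example (an illustrative sketch; the vocabulary sizes below are toy values.
        Because the explicit size parameters are commented out above, `vocab_size`
        and `word_size` are forwarded through **kwargs, stored on the config by
        PretrainedConfig, and read by the modeling code):

        ```python
        config = ZenConfig(vocab_size=21128, word_size=104089,
                           hidden_size=768, num_hidden_layers=12,
                           num_attention_heads=12, intermediate_size=3072,
                           num_hidden_word_layers=6)
        ```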
+ """ + # self.vocab_size = vocab_size_or_config_json_file + # self.word_size = word_vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.num_hidden_word_layers = num_hidden_word_layers + super().__init__(**kwargs) diff --git a/fengshen/models/zen2/modeling.py b/fengshen/models/zen2/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..2b730c060e5a78dae6576aafaf18bc0fba9ce280 --- /dev/null +++ b/fengshen/models/zen2/modeling.py @@ -0,0 +1,1316 @@ +# coding: utf-8 +# Copyright 2019 Sinovation Ventures AI Institute +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# This file is partially derived from the code at +# https://github.com/huggingface/transformers/tree/master/transformers +# +# Original copyright notice: +# +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""PyTorch ZEN2 model classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import copy +import logging +import math +import os + +import torch +from torch import nn +from torch.nn import CrossEntropyLoss +from dataclasses import dataclass +from typing import Optional +from transformers import PreTrainedModel + +from fengshen.models.zen2.configuration_zen2 import ZenConfig + +logger = logging.getLogger(__name__) + +PRETRAINED_MODEL_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-pytorch_model.bin", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-pytorch_model.bin", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin", + 'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-pytorch_model.bin", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-pytorch_model.bin", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin", + 'IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese/resolve/main/pytorch_model.bin', + 'IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese/resolve/main/pytorch_model.bin', +} +PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json", + 'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json", + 
'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json", + 'IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese/resolve/main/config.json', + 'IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese/resolve/main/config.json', +} +BERT_CONFIG_NAME = 'bert_config.json' +TF_WEIGHTS_NAME = 'model.ckpt' + + +def prune_linear_layer(layer, index, dim=0): + """ Prune a linear layer (a model parameters) to keep only entries in index. + Return the pruned layer as a new layer with requires_grad=True. + Used to remove heads. + """ + index = index.to(layer.weight.device) + W = layer.weight.index_select(dim, index).clone().detach() + if layer.bias is not None: + if dim == 1: + b = layer.bias.clone().detach() + else: + b = layer.bias[index].clone().detach() + new_size = list(layer.weight.size()) + new_size[dim] = len(index) + new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device) + new_layer.weight.requires_grad = False + new_layer.weight.copy_(W.contiguous()) + new_layer.weight.requires_grad = True + if layer.bias is not None: + new_layer.bias.requires_grad = False + new_layer.bias.copy_(b.contiguous()) + new_layer.bias.requires_grad = True + return new_layer + + +def load_tf_weights_in_bert(model, tf_checkpoint_path): + """ Load tf checkpoints in a pytorch model + """ + try: + import re + import numpy as np + import tensorflow as tf + except ImportError: + print("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + tf_path = os.path.abspath(tf_checkpoint_path) + print("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + print("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split('/') + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if any(n in ["adam_v", "adam_m", "global_step"] for n in name): + print("Skipping {}".format("/".join(name))) + continue + pointer = model + for m_name in name: + if re.fullmatch(r'[A-Za-z]+_\d+', m_name): + name_lists = re.split(r'_(\d+)', m_name) + else: + name_lists = [m_name] + if name_lists[0] == 'kernel' or name_lists[0] == 'gamma': + pointer = getattr(pointer, 'weight') + elif name_lists[0] == 'output_bias' or name_lists[0] == 'beta': + pointer = getattr(pointer, 'bias') + elif name_lists[0] == 'output_weights': + pointer = getattr(pointer, 'weight') + elif name_lists[0] == 'squad': + pointer = getattr(pointer, 'classifier') + else: + try: + pointer = getattr(pointer, name_lists[0]) + except AttributeError: + print("Skipping {}".format("/".join(name))) + continue + if len(name_lists) >= 2: + num = int(name_lists[1]) + pointer = pointer[num] + if m_name[-11:] == '_embeddings': + pointer = getattr(pointer, 'weight') + elif m_name == 'kernel': + array = np.transpose(array) + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + print("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +def gelu(x): + """Implementation of the gelu activation function. + For information: OpenAI GPT's gelu is slightly different (and gives slightly different results): + 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + Also see https://arxiv.org/abs/1606.08415 + """ + return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) + + +def swish(x): + return x * torch.sigmoid(x) + + +ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish} + + +try: + # from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm + from torch.nn import LayerNorm as BertLayerNorm +except ImportError: + logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .") + + class BertLayerNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-12): + """Construct a layernorm module in the TF style (epsilon inside the square root). + """ + super(BertLayerNorm, self).__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.bias = nn.Parameter(torch.zeros(hidden_size)) + self.variance_epsilon = eps + + def forward(self, x): + u = x.mean(-1, keepdim=True) + s = (x - u).pow(2).mean(-1, keepdim=True) + x = (x - u) / torch.sqrt(s + self.variance_epsilon) + return self.weight * x + self.bias + + +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings. 
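    Added note: despite the wording above, this ZEN2 variant does not add absolute
    position embeddings here; position information is injected inside self-attention
    through RelativeSinusoidalPositionalEmbedding below.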
+ """ + + def __init__(self, config): + super(BertEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None): + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + words_embeddings = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = words_embeddings + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertWordEmbeddings(nn.Module): + """Construct the embeddings from ngram, position and token_type embeddings. + """ + + def __init__(self, config): + super(BertWordEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.word_size, config.hidden_size, padding_idx=0) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None): + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + words_embeddings = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = words_embeddings + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RelativeSinusoidalPositionalEmbedding(nn.Module): + """This module produces sinusoidal positional embeddings of any length. + Padding symbols are ignored. + """ + + def __init__(self, embedding_dim, padding_idx, init_size=1568): + """ + + :param embedding_dim: 每个位置的dimension + :param padding_idx: + :param init_size: + """ + super().__init__() + self.embedding_dim = embedding_dim + self.padding_idx = padding_idx + assert init_size % 2 == 0 + weights = self.get_embedding( + init_size+1, + embedding_dim, + padding_idx, + ) + self.register_buffer('weights', weights) + self.register_buffer('_float_tensor', torch.FloatTensor(1)) + + def get_embedding(self, num_embeddings, embedding_dim, padding_idx=None): + """Build sinusoidal embeddings. + This matches the implementation in tensor2tensor, but differs slightly + from the description in Section 3.5 of "Attention Is All You Need". 
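        Concretely, positions run from -num_embeddings//2 to num_embeddings//2 - 1,
        the first half of each embedding holds the sin components and the second half
        the cos components (concatenated rather than interleaved), and the row at
        `padding_idx` is zeroed out.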
+ """ + half_dim = embedding_dim // 2 + emb = math.log(10000) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb) + emb = torch.arange(-num_embeddings//2, num_embeddings//2, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0) + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1) + if embedding_dim % 2 == 1: + # zero pad + emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1) + if padding_idx is not None: + emb[padding_idx, :] = 0 + self.origin_shift = num_embeddings//2 + 1 + return emb + + def forward(self, input): + """Input is expected to be of size [bsz x seqlen]. + """ + bsz, _, _, seq_len = input.size() + max_pos = self.padding_idx + seq_len + if max_pos > self.origin_shift: + # recompute/expand embeddings if needed + weights = self.get_embedding( + max_pos*2, + self.embedding_dim, + self.padding_idx, + ) + weights = weights.to(self._float_tensor) + del self.weights + self.origin_shift = weights.size(0)//2 + self.register_buffer('weights', weights) + + positions = torch.arange(-seq_len, seq_len).to(input.device).long() + self.origin_shift # 2*seq_len + embed = self.weights.index_select(0, positions.long()).detach() + return embed + + +class BertSelfAttention(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertSelfAttention, self).__init__() + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError( + "The hidden size (%d) is not a multiple of the number of attention " + "heads (%d)" % (config.hidden_size, config.num_attention_heads)) + self.output_attentions = output_attentions + self.keep_multihead_output = keep_multihead_output + self.multihead_output = None + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.softmax = nn.Softmax(dim=-1) + + self.position_embedding = RelativeSinusoidalPositionalEmbedding(self.attention_head_size, 0, 1200) + self.r_r_bias = nn.Parameter( + nn.init.xavier_normal_(torch.zeros(self.num_attention_heads, self.attention_head_size))) + self.r_w_bias = nn.Parameter( + nn.init.xavier_normal_(torch.zeros(self.num_attention_heads, self.attention_head_size))) + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward(self, hidden_states, attention_mask, head_mask=None): + position_embedding = self.position_embedding(attention_mask) + + mixed_query_layer = self.query(hidden_states) + mixed_key_layer = self.key(hidden_states) + mixed_value_layer = self.value(hidden_states) + + query_layer = self.transpose_for_scores(mixed_query_layer) + key_layer = self.transpose_for_scores(mixed_key_layer) + value_layer = self.transpose_for_scores(mixed_value_layer) + + rw_head_q = query_layer + self.r_r_bias[:, None] + AC = torch.einsum('bnqd,bnkd->bnqk', [rw_head_q.float(), key_layer.float()]) # b x n x l x d, n是head + + D_ = torch.einsum('nd,ld->nl', self.r_w_bias.float(), position_embedding.float())[None, :, + None] # head x 2max_len, 每个head对位置的bias + B_ = 
torch.einsum('bnqd,ld->bnql', query_layer.float(), + position_embedding.float()) # bsz x head x max_len x 2max_len,每个query对每个shift的偏移 + BD = B_ + D_ # bsz x head x max_len x 2max_len, 要转换为bsz x head x max_len x max_len + BD = self._shift(BD) + attention_scores = AC + BD + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = self.softmax(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs.type_as(value_layer), value_layer) + if self.keep_multihead_output: + self.multihead_output = context_layer + self.multihead_output.retain_grad() + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + if self.output_attentions: + return attention_probs, context_layer + return context_layer + + def _shift(self, BD): + """ + 类似 + -3 -2 -1 0 1 2 + -3 -2 -1 0 1 2 + -3 -2 -1 0 1 2 + + 转换为 + 0 1 2 + -1 0 1 + -2 -1 0 + + :param BD: batch_size x n_head x max_len x 2max_len + :return: batch_size x n_head x max_len x max_len + """ + bsz, n_head, max_len, _ = BD.size() + zero_pad = BD.new_zeros(bsz, n_head, max_len, 1) + BD = torch.cat([BD, zero_pad], dim=-1).view(bsz, n_head, -1, max_len) # bsz x n_head x (2max_len+1) x max_len + BD = BD[:, :, :-1].view(bsz, n_head, max_len, -1) # bsz x n_head x 2max_len x max_len + BD = BD[:, :, :, max_len:] + return BD + + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super(BertSelfOutput, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertAttention, self).__init__() + self.output_attentions = output_attentions + self.self = BertSelfAttention(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.output = BertSelfOutput(config) + + def prune_heads(self, heads): + if len(heads) == 0: + return + mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size) + for head in heads: + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + # Update hyper params + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + 
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + + def forward(self, input_tensor, attention_mask, head_mask=None): + self_output = self.self(input_tensor, attention_mask, head_mask) + if self.output_attentions: + attentions, self_output = self_output + attention_output = self.output(self_output, input_tensor) + if self.output_attentions: + return attentions, attention_output + return attention_output + + +class BertIntermediate(nn.Module): + def __init__(self, config): + super(BertIntermediate, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + # if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Module): + def __init__(self, config): + super(BertOutput, self).__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(BertLayer, self).__init__() + self.output_attentions = output_attentions + self.attention = BertAttention(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward(self, hidden_states, attention_mask, head_mask=None): + attention_output = self.attention(hidden_states, attention_mask, head_mask) + if self.output_attentions: + attentions, attention_output = attention_output + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + if self.output_attentions: + return attentions, layer_output + return layer_output + + +class ZenEncoder(nn.Module): + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenEncoder, self).__init__() + self.output_attentions = output_attentions + layer = BertLayer(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)]) + self.word_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_word_layers)]) + self.num_hidden_word_layers = config.num_hidden_word_layers + + def forward(self, hidden_states, ngram_hidden_states, ngram_position_matrix, attention_mask, + ngram_attention_mask, + output_all_encoded_layers=True, head_mask=None): + # Need to check what is the attention masking doing here + all_encoder_layers = [] + all_attentions = [] + num_hidden_ngram_layers = self.num_hidden_word_layers + for i, layer_module in enumerate(self.layer): + hidden_states = layer_module(hidden_states, attention_mask, head_mask[i]) + if i < num_hidden_ngram_layers: + ngram_hidden_states = 
self.word_layers[i](ngram_hidden_states, ngram_attention_mask, head_mask[i]) + if self.output_attentions: + ngram_attentions, ngram_hidden_states = ngram_hidden_states + all_attentions.append(ngram_attentions) + if self.output_attentions: + attentions, hidden_states = hidden_states + all_attentions.append(attentions) + hidden_states += torch.bmm(ngram_position_matrix.float(), ngram_hidden_states.float()) + if output_all_encoded_layers: + all_encoder_layers.append(hidden_states) + if not output_all_encoded_layers: + all_encoder_layers.append(hidden_states) + if self.output_attentions: + return all_attentions, all_encoder_layers + return all_encoder_layers + + +class BertPooler(nn.Module): + def __init__(self, config): + super(BertPooler, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPredictionHeadTransform(nn.Module): + def __init__(self, config): + super(BertPredictionHeadTransform, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + # if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + if isinstance(config.hidden_act, str): + self.transform_act_fn = ACT2FN[config.hidden_act] + else: + self.transform_act_fn = config.hidden_act + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.transform_act_fn(hidden_states) + hidden_states = self.LayerNorm(hidden_states) + return hidden_states + + +class BertLMPredictionHead(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(BertLMPredictionHead, self).__init__() + self.transform = BertPredictionHeadTransform(config) + + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
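+        # Because the decoder weight is tied to the word embedding matrix (see the assignment just
+        # below), projecting hidden states back onto the vocabulary adds no new weight parameters;
+        # the per-token output bias is the only newly learned parameter of this head.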
+ self.decoder = nn.Linear(bert_model_embedding_weights.size(1), + bert_model_embedding_weights.size(0), + bias=False) + self.decoder.weight = bert_model_embedding_weights + self.bias = nn.Parameter(torch.zeros(bert_model_embedding_weights.size(0))) + + def forward(self, hidden_states): + hidden_states = self.transform(hidden_states) + hidden_states = self.decoder(hidden_states) + self.bias + return hidden_states + + +class ZenOnlyMLMHead(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(ZenOnlyMLMHead, self).__init__() + self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights) + + def forward(self, sequence_output): + prediction_scores = self.predictions(sequence_output) + return prediction_scores + + +class ZenOnlyNSPHead(nn.Module): + def __init__(self, config): + super(ZenOnlyNSPHead, self).__init__() + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, pooled_output): + seq_relationship_score = self.seq_relationship(pooled_output) + return seq_relationship_score + + +class ZenPreTrainingHeads(nn.Module): + def __init__(self, config, bert_model_embedding_weights): + super(ZenPreTrainingHeads, self).__init__() + self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights) + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, sequence_output, pooled_output): + prediction_scores = self.predictions(sequence_output) + seq_relationship_score = self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class ZenPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = ZenConfig + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_( + mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +class ZenModel(ZenPreTrainedModel): + """ZEN model ("BERT-based Chinese (Z) text encoder Enhanced by N-gram representations"). + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). 
+ `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + + Outputs: Tuple of (encoded_layers, pooled_output) + `encoded_layers`: controled by `output_all_encoded_layers` argument: + - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end + of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each + encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size], + - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding + to the last attention block of shape [batch_size, sequence_length, hidden_size], + `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a + classifier pretrained on top of the hidden state associated to the first character of the + input (`CLS`) to train on the Next-Sentence task (see BERT's paper). + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenModel, self).__init__(config) + self.output_attentions = output_attentions + self.embeddings = BertEmbeddings(config) + self.word_embeddings = BertWordEmbeddings(config) + self.encoder = ZenEncoder(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.pooler = BertPooler(config) + self.init_weights() + + def prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def get_multihead_outputs(self): + """ Gather all multi-head outputs. + Return: list (layers) of multihead module outputs with gradients + """ + return [layer.attention.self.multihead_output for layer in self.encoder.layer] + + def forward(self, input_ids, + input_ngram_ids, + ngram_position_matrix, + token_type_ids=None, + ngram_token_type_ids=None, + attention_mask=None, + ngram_attention_mask=None, + output_all_encoded_layers=True, + head_mask=None): + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + if ngram_attention_mask is None: + ngram_attention_mask = torch.ones_like(input_ngram_ids) + if ngram_token_type_ids is None: + ngram_token_type_ids = torch.zeros_like(input_ngram_ids) + + # We create a 3D attention mask from a 2D tensor mask. 
+        # Sizes are [batch_size, 1, 1, to_seq_length]
+        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
+        # this attention mask is simpler than the triangular masking of causal attention
+        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
+        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
+        extended_ngram_attention_mask = ngram_attention_mask.unsqueeze(1).unsqueeze(2)
+
+        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
+        # masked positions, this operation will create a tensor which is 0.0 for
+        # positions we want to attend and -10000.0 for masked positions.
+        # Since we are adding it to the raw scores before the softmax, this is
+        # effectively the same as removing these entirely.
+        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
+        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
+
+        extended_ngram_attention_mask = extended_ngram_attention_mask.to(dtype=next(self.parameters()).dtype)
+        extended_ngram_attention_mask = (1.0 - extended_ngram_attention_mask) * -10000.0
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        if head_mask is not None:
+            if head_mask.dim() == 1:
+                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
+                head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
+            elif head_mask.dim() == 2:
+                head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer
+            head_mask = head_mask.to(dtype=next(self.parameters()).dtype)  # switch to float if needed + fp16 compatibility
+        else:
+            head_mask = [None] * self.config.num_hidden_layers
+
+        embedding_output = self.embeddings(input_ids, token_type_ids)
+        ngram_embedding_output = self.word_embeddings(input_ngram_ids, ngram_token_type_ids)
+
+        encoded_layers = self.encoder(embedding_output,
+                                      ngram_embedding_output,
+                                      ngram_position_matrix,
+                                      extended_attention_mask,
+                                      extended_ngram_attention_mask,
+                                      output_all_encoded_layers=output_all_encoded_layers,
+                                      head_mask=head_mask)
+        if self.output_attentions:
+            all_attentions, encoded_layers = encoded_layers
+        sequence_output = encoded_layers[-1]
+        pooled_output = self.pooler(sequence_output)
+        if not output_all_encoded_layers:
+            encoded_layers = encoded_layers[-1]
+        if self.output_attentions:
+            return all_attentions, encoded_layers, pooled_output
+        return encoded_layers, pooled_output
+
+
+class ZenForPreTraining(ZenPreTrainedModel):
+    """ZEN model with pre-training heads.
+    This module comprises the ZEN model followed by the two pre-training heads:
+    - the masked language modeling head, and
+    - the next sentence classification head.
+
+    Params:
+        `config`: a BertConfig class instance with the configuration to build a new model
+        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
+        `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient.
+            This can be used to compute head importance metrics.
Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss + is only computed for the labels set in [0, ..., vocab_size] + `next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size] + with indices selected in [0, 1]. + 0 => next sentence is the continuation, 1 => next sentence is a random sentence. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `masked_lm_labels` and `next_sentence_label` are not `None`: + Outputs the total_loss which is the sum of the masked language modeling loss and the next + sentence classification loss. + if `masked_lm_labels` or `next_sentence_label` is `None`: + Outputs a tuple comprising + - the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and + - the next sentence classification logits of shape [batch_size, 2]. 
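+
+    Example usage (a minimal sketch: the token ids, the single ngram per sequence and its position
+    matrix are made-up placeholders, and `config` is assumed to be an already constructed ZenConfig):
+    ```python
+    # Already been converted into WordPiece token ids
+    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+    # one hypothetical ngram id per sequence, covering the first two tokens
+    input_ngram_ids = torch.LongTensor([[7], [3]])
+    # [batch_size, sequence_length, ngram_sequence_length]
+    ngram_position_matrix = torch.FloatTensor([[[1.0], [1.0], [0.0]],
+                                               [[1.0], [1.0], [0.0]]])
+
+    model = ZenForPreTraining(config)
+    prediction_scores, seq_relationship_score = model(
+        input_ids, input_ngram_ids, ngram_position_matrix,
+        token_type_ids=token_type_ids, attention_mask=input_mask)
+    ```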
+ + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForPreTraining, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, + ngram_token_type_ids=None, + attention_mask=None, + ngram_attention_mask=None, + masked_lm_labels=None, + next_sentence_label=None, head_mask=None): + outputs = self.bert(input_ids, + input_ngram_ids, + ngram_position_matrix, + token_type_ids, + ngram_token_type_ids, + attention_mask, + ngram_attention_mask, + output_all_encoded_layers=False, head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, pooled_output = outputs + else: + sequence_output, pooled_output = outputs + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output) + + if masked_lm_labels is not None and next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + total_loss = masked_lm_loss + next_sentence_loss + return total_loss + elif self.output_attentions: + return all_attentions, prediction_scores, seq_relationship_score + return prediction_scores, seq_relationship_score + + +class ZenForMaskedLM(ZenPreTrainedModel): + """ZEN model with the masked language modeling head. + This module comprises the ZEN model followed by the masked language modeling head. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss + is only computed for the labels set in [0, ..., vocab_size] + `head_mask`: an optional torch.LongTensor of shape [num_heads] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. 
+ `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `masked_lm_labels` is not `None`: + Outputs the masked language modeling loss. + if `masked_lm_labels` is `None`: + Outputs the masked language modeling logits of shape [batch_size, sequence_length, vocab_size]. + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForMaskedLM, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenOnlyMLMHead(config, self.bert.embeddings.word_embeddings.weight) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, masked_lm_labels=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, _ = outputs + else: + sequence_output, _ = outputs + prediction_scores = self.cls(sequence_output) + + if masked_lm_labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + return masked_lm_loss + elif self.output_attentions: + return all_attentions, prediction_scores + return prediction_scores + + +class ZenForNextSentencePrediction(ZenPreTrainedModel): + """ZEN model with next sentence prediction head. + This module comprises the ZEN model followed by the next sentence classification head. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] + with indices selected in [0, 1]. + 0 => next sentence is the continuation, 1 => next sentence is a random sentence. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. 
+ It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `next_sentence_label` is not `None`: + Outputs the total_loss which is the sum of the masked language modeling loss and the next + sentence classification loss. + if `next_sentence_label` is `None`: + Outputs the next sentence classification logits of shape [batch_size, 2]. + + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForNextSentencePrediction, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.cls = ZenOnlyNSPHead(config) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, next_sentence_label=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, _, pooled_output = outputs + else: + _, pooled_output = outputs + seq_relationship_score = self.cls(pooled_output) + + if next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + return next_sentence_loss + elif self.output_attentions: + return all_attentions, seq_relationship_score + return seq_relationship_score + + +class ZenForSequenceClassification(ZenPreTrainedModel): + """ZEN model for classification. + This module is composed of the ZEN model with a linear layer on top of + the pooled output. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + `num_labels`: the number of classes for the classifier. Default = 2. + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts + `extract_features.py`, `run_classifier.py` and `run_squad.py`) + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `labels`: labels for the classification output: torch.LongTensor of shape [batch_size] + with indices selected in [0, ..., num_labels]. 
+ `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `labels` is not `None`: + Outputs the CrossEntropy classification loss of the output with the labels. + if `labels` is `None`: + Outputs the classification logits of shape [batch_size, num_labels]. + + """ + + def __init__(self, config, num_labels=2, output_attentions=False, keep_multihead_output=False): + super(ZenForSequenceClassification, self).__init__(config) + self.output_attentions = output_attentions + self.num_labels = config.num_labels + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, self.num_labels) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, labels=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, + attention_mask=attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, _, pooled_output = outputs + else: + _, pooled_output = outputs + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + return loss, logits + elif self.output_attentions: + return all_attentions, logits + return loss, logits + + +@dataclass +class TokenClassifierOutput: + """ + Base class for outputs of token classification models. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + + +class ZenForTokenClassification(ZenPreTrainedModel): + """ZEN model for token-level classification. + This module is composed of the ZEN model with a linear layer on top of + the full hidden state of the last layer. + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + `num_labels`: the number of classes for the classifier. Default = 2. + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. 
+ `labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length] + with indices selected in [0, ..., num_labels]. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + `input_ngram_ids`: input_ids of ngrams. + `ngram_token_type_ids`: token_type_ids of ngrams. + `ngram_attention_mask`: attention_mask of ngrams. + `ngram_position_matrix`: position matrix of ngrams. + + Outputs: + if `labels` is not `None`: + Outputs the CrossEntropy classification loss of the output with the labels. + if `labels` is `None`: + Outputs the classification logits of shape [batch_size, sequence_length, num_labels]. + + """ + + def __init__(self, config, num_labels=2, output_attentions=False, keep_multihead_output=False): + super(ZenForTokenClassification, self).__init__(config) + self.output_attentions = output_attentions + self.num_labels = config.num_labels + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, self.num_labels) + self.init_weights() + + def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, valid_ids=None, + input_ngram_ids=None, ngram_position_matrix=None, head_mask=None, b_use_valid_filter=False): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, + attention_mask=attention_mask, output_all_encoded_layers=False, head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, _ = outputs + else: + sequence_output, _ = outputs + + # if b_use_valid_filter: + # batch_size, max_len, feat_dim = sequence_output.shape + # valid_output = torch.zeros(batch_size, max_len, feat_dim, dtype=sequence_output.dtype, + # device=input_ids.device) + # for i in range(batch_size): + # temp = sequence_output[i][valid_ids[i] == 1] + # valid_output[i][:temp.size(0)] = temp + # else: + # valid_output = sequence_output + valid_output = sequence_output + + sequence_output = self.dropout(valid_output) + logits = self.classifier(sequence_output) + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=0) + # Only keep active parts of the loss + # attention_mask_label = None + # if attention_mask_label is not None: + if attention_mask is not None: + # active_loss = attention_mask_label.view(-1) == 1 + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels)[active_loss] + active_labels = labels.view(-1)[active_loss] + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + return TokenClassifierOutput(loss, logits) + else: + return TokenClassifierOutput(loss, logits) + + +class ZenForQuestionAnswering(ZenPreTrainedModel): + """BERT model for Question Answering (span extraction). + This module is composed of the BERT model with a linear layer on top of + the sequence output that computes start_logits and end_logits + + Params: + `config`: a BertConfig class instance with the configuration to build a new model + `output_attentions`: If True, also output attentions weights computed by the model at each layer. 
Default: False + `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. + This can be used to compute head importance metrics. Default: False + + Inputs: + `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] + with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts + `extract_features.py`, `run_classifier.py` and `run_squad.py`) + `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token + types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to + a `sentence B` token (see BERT paper for more details). + `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices + selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max + input sequence length in the current batch. It's the mask that we typically use for attention when + a batch has varying length sentences. + `start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size]. + Positions are clamped to the length of the sequence and position outside of the sequence are not taken + into account for computing the loss. + `end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size]. + Positions are clamped to the length of the sequence and position outside of the sequence are not taken + into account for computing the loss. + `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. + It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked. + + Outputs: + if `start_positions` and `end_positions` are not `None`: + Outputs the total_loss which is the sum of the CrossEntropy loss for the start and end token positions. + if `start_positions` or `end_positions` is `None`: + Outputs a tuple of start_logits, end_logits which are the logits respectively for the start and end + position tokens of shape [batch_size, sequence_length]. 
+ + Example usage: + ```python + # Already been converted into WordPiece token ids + input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]]) + input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]]) + token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]]) + + config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768, + num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072) + + model = BertForQuestionAnswering(config) + start_logits, end_logits = model(input_ids, token_type_ids, input_mask) + ``` + """ + + def __init__(self, config, output_attentions=False, keep_multihead_output=False): + super(ZenForQuestionAnswering, self).__init__(config) + self.output_attentions = output_attentions + self.bert = ZenModel(config, output_attentions=output_attentions, + keep_multihead_output=keep_multihead_output) + self.qa_outputs = nn.Linear(config.hidden_size, 2) + self.init_weights() + + def forward(self, input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids=None, attention_mask=None, start_positions=None, + end_positions=None, head_mask=None): + outputs = self.bert(input_ids, input_ngram_ids, ngram_position_matrix, token_type_ids, + attention_mask=attention_mask, + output_all_encoded_layers=False, + head_mask=head_mask) + if self.output_attentions: + all_attentions, sequence_output, _ = outputs + else: + sequence_output, _ = outputs + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1) + end_logits = end_logits.squeeze(-1) + + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions.clamp_(0, ignored_index) + end_positions.clamp_(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + return total_loss + elif self.output_attentions: + return all_attentions, start_logits, end_logits + return start_logits, end_logits diff --git a/fengshen/models/zen2/ngram_utils.py b/fengshen/models/zen2/ngram_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..dcf474bf6f11c91159b9fdbaaeb2a3f7b6858539 --- /dev/null +++ b/fengshen/models/zen2/ngram_utils.py @@ -0,0 +1,192 @@ +# coding: utf-8 +# Copyright 2019 Sinovation Ventures AI Institute +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""utils for ngram for ZEN2 model.""" + +import os +import logging +import math +import numpy as np +import torch +from transformers import cached_path + +NGRAM_DICT_NAME = 'ngram.txt' + +logger = logging.getLogger(__name__) +PRETRAINED_VOCAB_ARCHIVE_MAP = { + 'IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese/resolve/main/ngram.txt', + 'IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese/resolve/main/ngram.txt', +} + + +class ZenNgramDict(object): + """ + Dict class to store the ngram + """ + + def __init__(self, ngram_freq_path, tokenizer=None, max_ngram_in_seq=128): + """Constructs ZenNgramDict + + :param ngram_freq_path: ngrams with frequency + """ + if os.path.isdir(ngram_freq_path): + ngram_freq_path = os.path.join(ngram_freq_path, NGRAM_DICT_NAME) + self.ngram_freq_path = ngram_freq_path + self.max_ngram_in_seq = max_ngram_in_seq + self.max_ngram_len = 8 + self.id_to_ngram_list = ["[pad]"] + self.ngram_to_id_dict = {"[pad]": 0} + self.ngram_to_freq_dict = {} + + logger.info("loading ngram frequency file {}".format(ngram_freq_path)) + with open(ngram_freq_path, "r", encoding="utf-8") as fin: + for i, line in enumerate(fin): + items = line.strip().split(",") + if len(items) != 2: + continue + ngram, freq = items + # self.ngram_to_freq_dict[ngram] = int(freq) + if tokenizer: + tokens = tuple(tokenizer.tokenize(ngram)) + if len([token for token in tokens if "[UNK]" in token]) > 0: + tokens = ngram + else: + tokens = tuple(ngram.split(" ")) + self.id_to_ngram_list.append(tokens) + self.ngram_to_id_dict[tokens] = i + 1 + self.ngram_to_freq_dict[tokens] = int(freq) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, **kwargs): + """ + Instantiate a PreTrainedBertModel from a pre-trained model file. + Download and cache the pre-trained model file if needed. + """ + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + ngram_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path] + if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is a cased model but you have not set " + "`do_lower_case` to False. We are setting `do_lower_case=False` for you but " + "you may want to check this behavior.") + kwargs['do_lower_case'] = False + elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is an uncased model but you have set " + "`do_lower_case` to False. We are setting `do_lower_case=True` for you " + "but you may want to check this behavior.") + kwargs['do_lower_case'] = True + else: + ngram_file = pretrained_model_name_or_path + if os.path.isdir(ngram_file): + ngram_file = os.path.join(ngram_file, NGRAM_DICT_NAME) + # redirect to the cache, if necessary + try: + resolved_ngram_file = cached_path(ngram_file, cache_dir=cache_dir) + except EnvironmentError: + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + logger.error( + "Couldn't reach server at '{}' to download vocabulary.".format( + ngram_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). 
" + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()), + ngram_file)) + return None + if resolved_ngram_file == ngram_file: + logger.info("loading vocabulary file {}".format(ngram_file)) + else: + logger.info("loading vocabulary file {} from cache at {}".format( + ngram_file, resolved_ngram_file)) + # Instantiate ngram. + ngram_dict = cls(resolved_ngram_file, **kwargs) + return ngram_dict + + def save(self, ngram_freq_path): + ngram_freq_path = os.path.join(ngram_freq_path, NGRAM_DICT_NAME) + with open(ngram_freq_path, "w+", encoding="utf-8") as fout: + for ngram, freq in self.ngram_to_freq_dict.items(): + fout.write("{},{}\n".format(" ".join(ngram), freq)) + + +def extract_ngram_feature(tokens, ngram_dict, max_seq_len, seg_id_limit): + # ----------- code for ngram BEGIN----------- + ngram_matches = [] + # Filter the word segment from 2 to max_ngram_len to check whether there is a word + max_gram_n = ngram_dict.max_ngram_len + for p in range(2, max_gram_n): + for q in range(0, len(tokens) - p + 1): + character_segment = tokens[q:q + p] + # j is the starting position of the word + # i is the length of the current word + character_segment = tuple(character_segment) + if character_segment in ngram_dict.ngram_to_id_dict: + ngram_index = ngram_dict.ngram_to_id_dict[character_segment] + ngram_freq = ngram_dict.ngram_to_freq_dict[character_segment] + ngram_matches.append([ngram_index, q, p, character_segment, ngram_freq]) + + # shuffle(ngram_matches) + ngram_matches = sorted(ngram_matches, key=lambda s: s[0]) + # max_word_in_seq_proportion = max_word_in_seq + max_word_in_seq_proportion = math.ceil((len(tokens) / max_seq_len) * ngram_dict.max_ngram_in_seq) + if len(ngram_matches) > max_word_in_seq_proportion: + ngram_matches = ngram_matches[:max_word_in_seq_proportion] + ngram_ids = [ngram[0] for ngram in ngram_matches] + ngram_positions = [ngram[1] for ngram in ngram_matches] + ngram_lengths = [ngram[2] for ngram in ngram_matches] + ngram_tuples = [ngram[3] for ngram in ngram_matches] + ngram_freqs = [ngram[4] for ngram in ngram_matches] + ngram_seg_ids = [0 if position < seg_id_limit else 1 for position in + ngram_positions] + + ngram_mask_array = np.zeros(ngram_dict.max_ngram_in_seq, dtype=np.bool) + ngram_mask_array[:len(ngram_ids)] = 1 + + # Zero-pad up to the max word in seq length. 
+ padding = [0] * (ngram_dict.max_ngram_in_seq - len(ngram_ids)) + ngram_ids += padding + ngram_positions += padding + ngram_lengths += padding + ngram_seg_ids += padding + ngram_freqs += padding + + # ----------- code for ngram END----------- + + return { + "ngram_ids": ngram_ids, + "ngram_positions": ngram_positions, + "ngram_lengths": ngram_lengths, + "ngram_tuples": ngram_tuples, + "ngram_seg_ids": ngram_seg_ids, + "ngram_masks": ngram_mask_array, + "ngram_freqs": ngram_freqs, + } + + +def construct_ngram_matrix(ngram_data, max_seq_length): + max_ngram_in_sequence = len(ngram_data["ngram_ids"]) + ngram_ids_num = len([x for x in ngram_data["ngram_masks"] if x == 1]) + + ngram_positions_matrix = np.zeros(shape=(max_seq_length, max_ngram_in_sequence), dtype=np.float) + for i in range(ngram_ids_num): + ngram_positions_matrix[ngram_data["ngram_positions"][i]: + ngram_data["ngram_positions"][i] + ngram_data["ngram_lengths"][i], i] = \ + ngram_data["ngram_freqs"][i] + ngram_positions_matrix_t = torch.from_numpy(ngram_positions_matrix.astype(np.float)) + ngram_positions_matrix_t = torch.div(ngram_positions_matrix_t, + torch.stack([torch.sum(ngram_positions_matrix_t, 1)] * ngram_positions_matrix_t.size(1)).t() + 1e-10) + + return ngram_positions_matrix_t.numpy() diff --git a/fengshen/models/zen2/tokenization.py b/fengshen/models/zen2/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..e89c857e5f81f0b40a06b8dcc8c9344069a8d781 --- /dev/null +++ b/fengshen/models/zen2/tokenization.py @@ -0,0 +1,460 @@ +# coding=utf-8 +# This file is derived from the code at +# https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py +# +# Original copyright notice: +# +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import logging +import os +import six +import unicodedata +from io import open + +from transformers import cached_path + +logger = logging.getLogger(__name__) + +PRETRAINED_VOCAB_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt", + 'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt", + 'IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-345M-Chinese/resolve/main/vocab.txt', + 'IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese': 'https://huggingface.co/IDEA-CCNL/Erlangshen-ZEN2-668M-Chinese/resolve/main/vocab.txt', +} +PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = { + 'bert-base-uncased': 512, + 'bert-large-uncased': 512, + 'bert-base-cased': 512, + 'bert-large-cased': 512, + 'bert-base-multilingual-uncased': 512, + 'bert-base-multilingual-cased': 512, + 'bert-base-chinese': 512, + 'bert-base-german-cased': 512, + 'bert-large-uncased-whole-word-masking': 512, + 'bert-large-cased-whole-word-masking': 512, + 'bert-large-uncased-whole-word-masking-finetuned-squad': 512, + 'bert-large-cased-whole-word-masking-finetuned-squad': 512, + 'bert-base-cased-finetuned-mrpc': 512, +} +VOCAB_NAME = 'vocab.txt' + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + # elif isinstance(text, unicode): + # return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a 
dictionary.""" + vocab = collections.OrderedDict() + index = 0 + with open(vocab_file, "r", encoding="utf-8") as reader: + while True: + token = reader.readline() + if not token: + break + token = token.strip() + vocab[token] = index + index += 1 + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(object): + """Runs end-to-end tokenization: punctuation splitting + wordpiece""" + + def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BertTokenizer. + + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input + Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; + Effective maximum length is always the minimum of this + value (if specified) and the underlying BERT model's + sequence length. + never_split: List of tokens which will never be split during tokenization. + Only has an effect when do_wordpiece_only=False + """ + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict( + [(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, + never_split=never_split) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + self.max_len = max_len if max_len is not None else int(1e12) + + def tokenize(self, text): + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def convert_tokens_to_ids(self, tokens): + """Converts a sequence of tokens into ids using the vocab.""" + ids = [] + for token in tokens: + ids.append(self.vocab[token]) + if len(ids) > self.max_len: + logger.warning( + "Token indices sequence length is longer than the specified maximum " + " sequence length for this BERT model ({} > {}). Running this" + " sequence through BERT will result in indexing errors".format(len(ids), self.max_len) + ) + return ids + + def convert_ids_to_tokens(self, ids): + """Converts a sequence of ids in wordpiece tokens using the vocab.""" + tokens = [] + for i in ids: + tokens.append(self.ids_to_tokens[i]) + return tokens + + def save_vocabulary(self, vocab_path): + """Save the tokenizer vocabulary to a directory or file.""" + index = 0 + if os.path.isdir(vocab_path): + vocab_file = os.path.join(vocab_path, VOCAB_NAME) + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: vocabulary indices are not consecutive." 
+ " Please check that the vocabulary is not corrupted!".format(vocab_file)) + index = token_index + writer.write(token + u'\n') + index += 1 + return vocab_file + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs): + """ + Instantiate a PreTrainedBertModel from a pre-trained model file. + Download and cache the pre-trained model file if needed. + """ + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path] + if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is a cased model but you have not set " + "`do_lower_case` to False. We are setting `do_lower_case=False` for you but " + "you may want to check this behavior.") + kwargs['do_lower_case'] = False + elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True): + logger.warning("The pre-trained model you are loading is an uncased model but you have set " + "`do_lower_case` to False. We are setting `do_lower_case=True` for you " + "but you may want to check this behavior.") + kwargs['do_lower_case'] = True + else: + vocab_file = pretrained_model_name_or_path + if os.path.isdir(vocab_file): + vocab_file = os.path.join(vocab_file, VOCAB_NAME) + # redirect to the cache, if necessary + try: + resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir) + except EnvironmentError: + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + logger.error( + "Couldn't reach server at '{}' to download vocabulary.".format( + vocab_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). " + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()), + vocab_file)) + return None + if resolved_vocab_file == vocab_file: + logger.info("loading vocabulary file {}".format(vocab_file)) + else: + logger.info("loading vocabulary file {} from cache at {}".format( + vocab_file, resolved_vocab_file)) + if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP: + # if we're using a pretrained model, ensure the tokenizer wont index sequences longer + # than the number of positional embeddings + max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name_or_path] + kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len) + # Instantiate tokenizer. + tokenizer = cls(resolved_vocab_file, *inputs, **kwargs) + return tokenizer + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, + do_lower_case=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + self.never_split = never_split + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = self._clean_text(text) + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. 
This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case and token not in self.never_split: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + if text in self.never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. 
+ + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer`. + + Returns: + A list of wordpiece tokens. + """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. 
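+    # For reference, the four ranges checked below are '!'-'/' (33-47),
+    # ':'-'@' (58-64), '['-'`' (91-96) and '{'-'~' (123-126), i.e. all
+    # printable non-alphanumeric ASCII characters.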
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/fengshen/pipelines/base.py b/fengshen/pipelines/base.py new file mode 100644 index 0000000000000000000000000000000000000000..f8e4a109c3d8a232201a255ba1a5bb77f008a78c --- /dev/null +++ b/fengshen/pipelines/base.py @@ -0,0 +1,2 @@ +_CONFIG_MODEL_TYPE = 'fengshen_model_type' +_CONFIG_TOKENIZER_TYPE = 'fengshen_tokenizer_type' diff --git a/fengshen/pipelines/test.py b/fengshen/pipelines/test.py new file mode 100644 index 0000000000000000000000000000000000000000..6ae421da62760cbf84e9121edf562d40e5e7f047 --- /dev/null +++ b/fengshen/pipelines/test.py @@ -0,0 +1,19 @@ +from fengshen.pipelines.text_classification import TextClassificationPipeline +import argparse +from datasets import load_dataset + + +# 预测 支持批量 +pipe = TextClassificationPipeline( + model='/data/gaoxinyu/pretrained_model/deberta-base-sp', device=-1) +print(pipe(['今天心情不好今天很开心', '今天心情很好今天很开心'])) + +# 训练 支持各种超参调整 +total_parser = argparse.ArgumentParser("test") +total_parser = TextClassificationPipeline.add_data_specific_args(total_parser) +args = total_parser.parse_args() +datasets = load_dataset('IDEA-CCNL/AFQMC') +pipe = TextClassificationPipeline( + args=args, + model='/data/gaoxinyu/pretrained_model/deberta-base-sp', device=-1) +pipe.train(datasets) diff --git a/fengshen/pipelines/text_classification.py b/fengshen/pipelines/text_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9aa89ce7cd99bd9a89e1371907e3502604e9b1 --- /dev/null +++ b/fengshen/pipelines/text_classification.py @@ -0,0 +1,232 @@ +import torch +from torch.utils.data._utils.collate import default_collate +from dataclasses import dataclass +from typing import Dict, List +from .base import ( + _CONFIG_MODEL_TYPE, + _CONFIG_TOKENIZER_TYPE) +from fengshen.models.roformer import RoFormerForSequenceClassification +from fengshen.models.longformer import LongformerForSequenceClassification +from fengshen.models.zen1 import ZenForSequenceClassification +from transformers import ( + BertConfig, + AutoModelForSequenceClassification, + AutoTokenizer, +) +from transformers.models.auto.tokenization_auto import get_tokenizer_config +from transformers.pipelines.base import PipelineException, GenericTensor +from transformers import TextClassificationPipeline as HuggingfacePipe +import pytorch_lightning as pl +from fengshen.data.universal_datamodule import UniversalDataModule +from fengshen.utils.universal_checkpoint import UniversalCheckpoint +from fengshen.models.model_utils import add_module_args +import torchmetrics + +_model_dict = { + 'fengshen-roformer': RoFormerForSequenceClassification, + # 'fengshen-megatron_t5': T5EncoderModel, TODO 实现T5EncoderForSequenceClassification + 'fengshen-longformer': LongformerForSequenceClassification, + 'fengshen-zen1': ZenForSequenceClassification, + 'huggingface-auto': AutoModelForSequenceClassification, +} + +_tokenizer_dict = {} + +_ATTR_PREPARE_INPUT = '_prepare_inputs_for_sequence_classification' + + +class _taskModel(pl.LightningModule): + @staticmethod + def add_model_specific_args(parent_args): + _ = parent_args.add_argument_group('text classification task model') + return parent_args + + def __init__(self, args, model): + super().__init__() + self.model = model + self.acc_metrics = torchmetrics.Accuracy() + self.save_hyperparameters(args) + + def setup(self, stage) 
-> None: + if stage == 'fit': + train_loader = self.trainer._data_connector._train_dataloader_source.dataloader() + # Calculate total steps + if self.trainer.max_epochs > 0: + world_size = self.trainer.world_size + tb_size = self.hparams.train_batchsize * max(1, world_size) + ab_size = self.trainer.accumulate_grad_batches + self.total_steps = (len(train_loader.dataset) * + self.trainer.max_epochs // tb_size) // ab_size + else: + self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches + + print('Total steps: {}' .format(self.total_steps)) + + def training_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss, _ = outputs[0], outputs[1] + self.log('train_loss', loss) + return loss + + def comput_metrix(self, logits, labels): + y_pred = torch.argmax(logits, dim=-1) + y_pred = y_pred.view(size=(-1,)) + y_true = labels.view(size=(-1,)).long() + acc = self.acc_metrics(y_pred.long(), y_true.long()) + return acc + + def validation_step(self, batch, batch_idx): + outputs = self.model(**batch) + loss, logits = outputs[0], outputs[1] + acc = self.comput_metrix(logits, batch['labels']) + self.log('val_loss', loss) + self.log('val_acc', acc) + + def predict_step(self, batch, batch_idx): + output = self.model(**batch) + return output.logits + + def configure_optimizers(self): + from fengshen.models.model_utils import configure_optimizers + return configure_optimizers(self) + + +@dataclass +class _Collator: + tokenizer = None + texta_name = 'sentence' + textb_name = 'sentence2' + label_name = 'label' + max_length = 512 + model_type = 'huggingface-auto' + + def __call__(self, samples): + sample_list = [] + for item in samples: + if self.textb_name in item and item[self.textb_name] != '': + if self.model_type != 'fengshen-roformer': + encode_dict = self.tokenizer.encode_plus( + [item[self.texta_name], item[self.textb_name]], + max_length=self.max_length, + padding='max_length', + truncation='longest_first') + else: + encode_dict = self.tokenizer.encode_plus( + [item[self.texta_name]+'[SEP]'+item[self.textb_name]], + max_length=self.max_length, + padding='max_length', + truncation='longest_first') + else: + encode_dict = self.tokenizer.encode_plus( + item[self.texta_name], + max_length=self.max_length, + padding='max_length', + truncation='longest_first') + sample = {} + for k, v in encode_dict.items(): + sample[k] = torch.tensor(v) + if self.label_name in item: + sample['labels'] = torch.tensor(item[self.label_name]).long() + sample_list.append(sample) + return default_collate(sample_list) + + +class TextClassificationPipeline(HuggingfacePipe): + @staticmethod + def add_pipeline_specific_args(parent_args): + parser = parent_args.add_argument_group('SequenceClassificationPipeline') + parser.add_argument('--texta_name', default='sentence', type=str) + parser.add_argument('--textb_name', default='sentence2', type=str) + parser.add_argument('--label_name', default='label', type=str) + parser.add_argument('--max_length', default=512, type=int) + parser.add_argument('--device', default=-1, type=int) + parser = _taskModel.add_model_specific_args(parent_args) + parser = UniversalDataModule.add_data_specific_args(parent_args) + parser = UniversalCheckpoint.add_argparse_args(parent_args) + parser = pl.Trainer.add_argparse_args(parent_args) + parser = add_module_args(parent_args) + return parent_args + + def __init__(self, + model: str = None, + args=None, + **kwargs): + self.args = args + self.model_name = model + self.model_type = 'huggingface-auto' + # 
Use BertConfig only for compatibility: we just need to read fengshen_model_type from it, so any Config class would do here
+        config = BertConfig.from_pretrained(model)
+        if hasattr(config, _CONFIG_MODEL_TYPE):
+            self.model_type = config.fengshen_model_type
+        if self.model_type not in _model_dict:
+            raise PipelineException(self.model_name, ' not in model type dict')
+        # Load the model and use its config
+        self.model = _model_dict[self.model_type].from_pretrained(model)
+        self.config = self.model.config
+        # Load the tokenizer
+        tokenizer_config = get_tokenizer_config(model, **kwargs)
+        self.tokenizer = None
+        # get_tokenizer_config returns a plain dict, so look the key up instead of using hasattr
+        if _CONFIG_TOKENIZER_TYPE in tokenizer_config:
+            tokenizer_type = tokenizer_config[_CONFIG_TOKENIZER_TYPE]
+            if tokenizer_type in _tokenizer_dict:
+                self.tokenizer = _tokenizer_dict[tokenizer_type].from_pretrained(model)
+        if self.tokenizer is None:
+            self.tokenizer = AutoTokenizer.from_pretrained(model)
+        # Set up the data-processing (collator) module
+        c = _Collator()
+        c.tokenizer = self.tokenizer
+        c.model_type = self.model_type
+        if args is not None:
+            c.texta_name = self.args.texta_name
+            c.textb_name = self.args.textb_name
+            c.label_name = self.args.label_name
+            c.max_length = self.args.max_length
+        self.collator = c
+        device = -1 if args is None else args.device
+        super().__init__(model=self.model,
+                         tokenizer=self.tokenizer,
+                         framework='pt',
+                         device=device,
+                         **kwargs)
+
+    def train(self,
+              datasets: Dict):
+        """
+        Args:
+            datasets is a dict like
+            {
+                test: Dataset()
+                validation: Dataset()
+                train: Dataset()
+            }
+        """
+        checkpoint_callback = UniversalCheckpoint(self.args)
+        trainer = pl.Trainer.from_argparse_args(self.args,
+                                                callbacks=[checkpoint_callback]
+                                                )
+
+        data_model = UniversalDataModule(
+            datasets=datasets,
+            tokenizer=self.tokenizer,
+            collate_fn=self.collator,
+            args=self.args)
+        model = _taskModel(self.args, self.model)
+
+        trainer.fit(model, data_model)
+        return
+
+    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
+        # If the model exposes its own preprocessing hook, use it
+        if hasattr(self.model, _ATTR_PREPARE_INPUT):
+            return getattr(self.model, _ATTR_PREPARE_INPUT)(inputs, self.tokenizer, **tokenizer_kwargs)
+        samples = []
+        if isinstance(inputs, str):
+            samples.append({self.collator.texta_name: inputs})
+        else:
+            # __call__ already guarantees the input type, so a bare else is fine here
+            for i in inputs:
+                samples.append({self.collator.texta_name: i})
+        return self.collator(samples)
+
+
+Pipeline = TextClassificationPipeline
diff --git a/fengshen/requirement.txt b/fengshen/requirement.txt
new file mode 100644
index 0000000000000000000000000000000000000000..4ee69068ca4c3fd0e51337df1006908e9ed1bc83
--- /dev/null
+++ b/fengshen/requirement.txt
@@ -0,0 +1,8 @@
+transformers>=4.17.0
+datasets>=2.0.0
+pytorch_lightning==1.6.3
+deepspeed==0.5.10
+jieba-fast>=0.53
+jieba>=0.40.0
+protobuf==3.20.1
+
diff --git a/fengshen/tokenizer/sentencepiece/pretrain_google_sp.sh b/fengshen/tokenizer/sentencepiece/pretrain_google_sp.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e7dd39f59dac0314a9b285c02f05156fda67e622
--- /dev/null
+++ b/fengshen/tokenizer/sentencepiece/pretrain_google_sp.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+#SBATCH --job-name=google_sp
+#SBATCH --nodes=1
+#SBATCH --cpus-per-task=100
+#SBATCH --ntasks-per-node=1
+#SBATCH -o %x-%j.log
+
+set -x -e
+
+echo "START TIME: $(date)"
+
+BIN_PATH=/cognitive_comp/gaoxinyu/sentencepiece/sentencepiece/bin/usr/local/bin/spm_train
+export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cognitive_comp/gaoxinyu/sentencepiece/sentencepiece/bin/usr/local/lib
+INPUT_FILE=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/tokenizer/sentencepiece/shuffle_corpus_59132213.txt
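+# INPUT_FILE is the full shuffled corpus written by shuffle_corpus.py; the
+# *_SMALL file below is a 1,000,000-line subset (not referenced by the options
+# further down), presumably kept for quick trial runs of spm_train.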
+INPUT_FILE_SMALL=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/tokenizer/sentencepiece/shuffle_corpus_1000000.txt + + +VOCAB_SIZE=40000 +COV=0.9995 +MAX_LENGTH=6 +TYPE=bpe +SEED=42 +MAX_INPUT_LENGTH=100000 + +OPTION="\ + --input=${INPUT_FILE} \ + --vocab_size=${VOCAB_SIZE} \ + --character_coverage=${COV} \ + --max_sentencepiece_length=${MAX_LENGTH} \ + --model_type=${TYPE} \ + --model_prefix=${TYPE}_v${VOCAB_SIZE}_s${SEED}_cov${COV}_max${MAX_LENGTH} \ + --random_seed=${SEED} \ + --max_sentence_length=100000 \ + --shuffle_input_sentence=true \ + --input_sentence_size=${MAX_INPUT_LENGTH} \ + --minloglevel 1 \ + --num_threads=100 \ + --train_extremely_large_corpus=true \ + " + +eval $BIN_PATH $OPTION \ No newline at end of file diff --git a/fengshen/tokenizer/sentencepiece/shuffle_corpus.py b/fengshen/tokenizer/sentencepiece/shuffle_corpus.py new file mode 100644 index 0000000000000000000000000000000000000000..9b3bdf1fc55f3bdd78ca5d540f80d5b612188b68 --- /dev/null +++ b/fengshen/tokenizer/sentencepiece/shuffle_corpus.py @@ -0,0 +1,18 @@ +import sys +import os +from tqdm import tqdm +sys.path.append('../../') + +if __name__ == '__main__': + from data.fs_datasets import load_dataset + dataset = load_dataset('wudao_180g', num_proc=100) + print('dataset loaded', flush=True) + + shuffle_ds = dataset['train'].shuffle(seed=42, writer_batch_size=1000) + print('dataset shuffled', flush=True) + need_size = len(shuffle_ds) + + f = open('shuffle_corpus_{}.txt'.format(need_size), 'w', encoding='utf-8') + for i in tqdm(range(0, need_size)): + f.write(shuffle_ds[i]['text'] + os.linesep) + f.close() diff --git a/fengshen/tokenizer/tokenizer.py b/fengshen/tokenizer/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..9bad5790a5799b96f2e164d825c0b1f8ec0c2dfb --- /dev/null +++ b/fengshen/tokenizer/tokenizer.py @@ -0,0 +1 @@ +# coding=utf-8 diff --git a/fengshen/utils/__init__.py b/fengshen/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..b392268121643a5ce6fdc6d6a0f712ad8dd867a9 --- /dev/null +++ b/fengshen/utils/__init__.py @@ -0,0 +1,3 @@ +from .universal_checkpoint import UniversalCheckpoint +from .utils import chinese_char_tokenize +__all__ = ['UniversalCheckpoint', 'chinese_char_tokenize'] diff --git a/fengshen/utils/convert_py_to_npy.py b/fengshen/utils/convert_py_to_npy.py new file mode 100644 index 0000000000000000000000000000000000000000..0d652169b59ffdc7ca977318ee72187b2ce73c1f --- /dev/null +++ b/fengshen/utils/convert_py_to_npy.py @@ -0,0 +1,54 @@ +import argparse +import torch +import glob +import os +import numpy as np + + +class MMapIndexDataset(): + def __init__(self, datapath): + self.idxfp = np.load(datapath + '.npy', mmap_mode='r') + self.binfp = np.memmap(datapath + '.bin', dtype='long', mode='r') + + def __len__(self): + return self.idxfp.shape[0] + + def __getitem__(self, idx): + return self.binfp[self.idxfp[idx, 0]:self.idxfp[idx, 1]] + + +def convert_py_to_npy(input_tensor, bin_out, idx_out): + idx = torch.empty(len(input_tensor), 2, dtype=torch.long) + start = 0 + for i, input in enumerate(input_tensor): + idx[i] = torch.tensor([start, start + len(input)]) + start += len(input) + np.save(idx_out, idx) + binfp = np.memmap(bin_out, dtype='long', mode='w+', shape=(start)) + start = 0 + for i, input in enumerate(input_tensor): + for j, idx in enumerate(input): + binfp[start + j] = idx + start += len(input) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description="Text infilling.") + 
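+    # Note: --data_path is expected to point at a torch.save()'d dict holding the
+    # token-id lists named in process_key below; each list gets rewritten as a
+    # .bin/.npy memmap pair that MMapIndexDataset above can read back lazily.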
parser.add_argument('--data_path', type=str, + default='/cognitive_comp/gaoxinyu/data/wudao') + args = parser.parse_args() + process_key = [ + 'incorrect_input_ids_list', + 'label_ids_list', + 'target_ids_list', + ] + if os.path.exists(args.data_path): + print(f'''Loading data from {args.data_path}''') + data_dict = torch.load(args.data_path) + for k in process_key: + bin_out = ('_' + k + '.bin').join(args.data_path.rsplit('.pt', 1)) + idx_out = ('_' + k).join(args.data_path.rsplit('.pt', 1)) + convert_py_to_npy(data_dict[k], bin_out, idx_out) + else: + print( + f'Please create the synthetic datafile {args.data_path} with create_synthetic_data.py.') diff --git a/fengshen/utils/universal_checkpoint.py b/fengshen/utils/universal_checkpoint.py new file mode 100644 index 0000000000000000000000000000000000000000..3e62f6993f22fa01dad5cabc4d832e36478c5deb --- /dev/null +++ b/fengshen/utils/universal_checkpoint.py @@ -0,0 +1,34 @@ +from pytorch_lightning.callbacks import ModelCheckpoint + + +class UniversalCheckpoint(ModelCheckpoint): + @staticmethod + def add_argparse_args(parent_args): + parser = parent_args.add_argument_group('universal checkpoint callback') + + parser.add_argument('--monitor', default='train_loss', type=str) + parser.add_argument('--mode', default='min', type=str) + parser.add_argument('--save_ckpt_path', default='./ckpt/', type=str) + parser.add_argument('--load_ckpt_path', default='./ckpt/', type=str) + parser.add_argument( + '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str) + parser.add_argument('--save_last', action='store_true', default=False) + parser.add_argument('--save_top_k', default=3, type=float) + parser.add_argument('--every_n_train_steps', default=None, type=float) + parser.add_argument('--save_weights_only', action='store_true', default=False) + parser.add_argument('--every_n_epochs', default=None, type=int) + parser.add_argument('--save_on_train_epoch_end', action='store_true', default=None) + + return parent_args + + def __init__(self, args): + super().__init__(monitor=args.monitor, + save_top_k=args.save_top_k, + mode=args.mode, + every_n_train_steps=args.every_n_train_steps, + save_weights_only=args.save_weights_only, + dirpath=args.save_ckpt_path, + filename=args.filename, + save_last=args.save_last, + every_n_epochs=args.every_n_epochs, + save_on_train_epoch_end=args.save_on_train_epoch_end) diff --git a/fengshen/utils/utils.py b/fengshen/utils/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..02d9121cb1f55d142b62e304468695a0feda753b --- /dev/null +++ b/fengshen/utils/utils.py @@ -0,0 +1,58 @@ +# coding=utf-8 +import jieba + + +def jieba_tokenize(str): + return jieba.lcut(str) + + +_UCODE_RANGES = ( + ("\u3400", "\u4db5"), # CJK Unified Ideographs Extension A, release 3.0 + ("\u4e00", "\u9fa5"), # CJK Unified Ideographs, release 1.1 + ("\u9fa6", "\u9fbb"), # CJK Unified Ideographs, release 4.1 + ("\uf900", "\ufa2d"), # CJK Compatibility Ideographs, release 1.1 + ("\ufa30", "\ufa6a"), # CJK Compatibility Ideographs, release 3.2 + ("\ufa70", "\ufad9"), # CJK Compatibility Ideographs, release 4.1 + ("\u20000", "\u2a6d6"), # (UTF16) CJK Unified Ideographs Extension B, release 3.1 + ("\u2f800", "\u2fa1d"), # (UTF16) CJK Compatibility Supplement, release 3.1 + ("\uff00", "\uffef"), # Full width ASCII, full width of English punctuation, + # half width Katakana, half wide half width kana, Korean alphabet + ("\u2e80", "\u2eff"), # CJK Radicals Supplement + ("\u3000", "\u303f"), # CJK punctuation mark + ("\u31c0", 
"\u31ef"), # CJK stroke + ("\u2f00", "\u2fdf"), # Kangxi Radicals + ("\u2ff0", "\u2fff"), # Chinese character structure + ("\u3100", "\u312f"), # Phonetic symbols + ("\u31a0", "\u31bf"), # Phonetic symbols (Taiwanese and Hakka expansion) + ("\ufe10", "\ufe1f"), + ("\ufe30", "\ufe4f"), + ("\u2600", "\u26ff"), + ("\u2700", "\u27bf"), + ("\u3200", "\u32ff"), + ("\u3300", "\u33ff"), +) + + +def is_chinese_char(uchar): + for start, end in _UCODE_RANGES: + if start <= uchar <= end: + return True + return False + + +def chinese_char_tokenize(line): + line = line.strip() + line_in_chars = "" + + for char in line: + if is_chinese_char(char): + line_in_chars += " " + line_in_chars += char + line_in_chars += " " + else: + line_in_chars += char + + return line_in_chars + +# s = '中国的首都是哪里?1,2,3d回答' +# print(chinese_char_tokenize(s)) diff --git a/fengshen/workspace/erlangshen-bert-base/pretrain/config.json b/fengshen/workspace/erlangshen-bert-base/pretrain/config.json new file mode 100644 index 0000000000000000000000000000000000000000..6fe73faf0b8edec537616f52988f9e985e940526 --- /dev/null +++ b/fengshen/workspace/erlangshen-bert-base/pretrain/config.json @@ -0,0 +1,18 @@ +{ + "vocab_size": 12800, + "hidden_size": 768, + "num_hidden_layers": 12, + "num_attention_heads": 12, + "hidden_act": "gelu_new", + "intermediate_size": 3072, + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 512, + "type_vocab_size": 2, + "initializer_range": 0.02, + "layer_norm_eps": 1e-12, + "gradient_checkpointing": false, + "position_embedding_type": "absolute", + "use_cache": false, + "model_type": "megatron-bert" +} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-bert-base/pretrain/special_tokens_map.json b/fengshen/workspace/erlangshen-bert-base/pretrain/special_tokens_map.json new file mode 100644 index 0000000000000000000000000000000000000000..e7b0375001f109a6b8873d756ad4f7bbb15fbaa5 --- /dev/null +++ b/fengshen/workspace/erlangshen-bert-base/pretrain/special_tokens_map.json @@ -0,0 +1 @@ +{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-bert-base/pretrain/tokenizer_config.json b/fengshen/workspace/erlangshen-bert-base/pretrain/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..53d88ea9bb5c978402b7d9bb4a80690171e5491c --- /dev/null +++ b/fengshen/workspace/erlangshen-bert-base/pretrain/tokenizer_config.json @@ -0,0 +1,15 @@ +{ + "do_lower_case": true, + "do_basic_tokenize": true, + "never_split": null, + "unk_token": "[UNK]", + "sep_token": "[SEP]", + "pad_token": "[PAD]", + "cls_token": "[CLS]", + "mask_token": "[MASK]", + "tokenize_chinese_chars": true, + "strip_accents": null, + "special_tokens_map_file": null, + "name_or_path": "/cognitive_comp/gaoxinyu/pretrained_model/bert-1.3B", + "tokenizer_class": "BertTokenizer" +} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-bert-base/pretrain/vocab.txt b/fengshen/workspace/erlangshen-bert-base/pretrain/vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..437c75cb0090ba6d449e478fae3f7421ab1de961 --- /dev/null +++ b/fengshen/workspace/erlangshen-bert-base/pretrain/vocab.txt @@ -0,0 +1,12800 @@ +[PAD] +[CLS] +[SEP] +[UNK] +[MASK] +[unused1] +[unused2] +[unused3] +[unused4] +[unused5] +[unused6] +[unused7] +[unused8] +[unused9] +[unused10] +[unused11] +[unused12] +[unused13] +[unused14] 
+[unused15] +[unused16] +[unused17] +[unused18] +[unused19] +[unused20] +[unused21] +[unused22] +[unused23] +[unused24] +[unused25] +[unused26] +[unused27] +[unused28] +[unused29] +[unused30] +[unused31] +[unused32] +[unused33] +[unused34] +[unused35] +[unused36] +[unused37] +[unused38] +[unused39] +[unused40] +[unused41] +[unused42] +[unused43] +[unused44] +[unused45] +[unused46] +[unused47] +[unused48] +[unused49] +[unused50] +[unused51] +[unused52] +[unused53] +[unused54] +[unused55] +[unused56] +[unused57] +[unused58] +[unused59] +[unused60] +[unused61] +[unused62] +[unused63] +[unused64] +[unused65] +[unused66] +[unused67] +[unused68] +[unused69] +[unused70] +[unused71] +[unused72] +[unused73] +[unused74] +[unused75] +[unused76] +[unused77] +[unused78] +[unused79] +[unused80] +[unused81] +[unused82] +[unused83] +[unused84] +[unused85] +[unused86] +[unused87] +[unused88] +[unused89] +[unused90] +[unused91] +[unused92] +[unused93] +[unused94] +[unused95] +[unused96] +[unused97] +[unused98] +[unused99] +! +" +# +$ +% +& +' +( +) +* ++ +, +- +. +/ +: +; +< += +> +? +@ +[ +\ +] +^ +_ +` +{ +| +} +~ +· +– +— +‘ +’ +‛ +“ +” +„ +‟ +… +‧ +、 +。 +〃 +〈 +〉 +《 +》 +「 +」 +『 +』 +【 +】 +〔 +〕 +〖 +〗 +〜 +〝 +〞 +〰 +﹏ +﹑ +﹔ +! +" +# +$ +% +& +' +( +) +* ++ +, +- +: +; +< += +> +? +@ +[ +\ +] +^ +_ +` +{ +| +} +~ +。 +「 +」 +、 +0 +##0 +1 +##1 +2 +##2 +3 +##3 +4 +##4 +5 +##5 +6 +##6 +7 +##7 +8 +##8 +9 +##9 +一 +丁 +七 +丄 +丅 +丆 +万 +丈 +三 +上 +下 +丌 +不 +与 +丏 +丐 +丑 +专 +且 +丕 +世 +丘 +丙 +业 +丛 +东 +丝 +丞 +両 +丢 +两 +严 +丧 +丨 +个 +丫 +中 +丰 +串 +临 +丶 +丸 +丹 +为 +主 +丼 +丽 +举 +丿 +乂 +乃 +乄 +久 +乇 +么 +义 +之 +乌 +乍 +乎 +乏 +乐 +乒 +乓 +乔 +乖 +乗 +乘 +乙 +乛 +乜 +九 +乞 +也 +习 +乡 +书 +乩 +买 +乱 +乳 +乸 +乾 +亀 +了 +亇 +予 +争 +亊 +事 +二 +亍 +于 +亏 +云 +互 +亓 +五 +井 +亘 +亚 +些 +亜 +亟 +亡 +亢 +交 +亥 +亦 +产 +亨 +亩 +享 +京 +亭 +亮 +亲 +亳 +亵 +亶 +亸 +亹 +人 +亻 +亽 +亾 +亿 +什 +仁 +仂 +仃 +仄 +仅 +仆 +仇 +仉 +今 +介 +仍 +从 +仏 +仑 +仓 +仔 +仕 +他 +仗 +付 +仙 +仝 +仞 +仟 +仡 +代 +令 +以 +仨 +仩 +仪 +仫 +们 +仮 +仰 +仲 +仳 +仵 +件 +价 +任 +仼 +份 +仿 +企 +伉 +伊 +伋 +伍 +伎 +伏 +伐 +休 +众 +优 +伙 +会 +伛 +伝 +伞 +伟 +传 +伢 +伤 +伥 +伦 +伧 +伪 +伫 +伯 +估 +伱 +伲 +伴 +伶 +伷 +伸 +伺 +似 +伽 +伾 +佀 +佃 +但 +佉 +位 +低 +住 +佐 +佑 +体 +何 +佗 +佘 +余 +佚 +佛 +作 +佝 +佞 +佟 +你 +佢 +佣 +佤 +佥 +佧 +佩 +佬 +佯 +佰 +佳 +佴 +佶 +佷 +佺 +佻 +佼 +佾 +使 +侁 +侂 +侃 +侄 +來 +侈 +侉 +例 +侍 +侏 +侑 +侔 +侗 +侘 +供 +依 +侠 +価 +侣 +侥 +侦 +侧 +侨 +侩 +侪 +侬 +侮 +侯 +侵 +便 +促 +俄 +俅 +俊 +俎 +俏 +俐 +俑 +俗 +俘 +俚 +俛 +俜 +保 +俞 +俟 +信 +俢 +俣 +俤 +俦 +俨 +俩 +俪 +俬 +俭 +修 +俯 +俱 +俳 +俶 +俸 +俺 +俾 +倅 +倌 +倍 +倏 +倒 +倓 +倔 +倕 +倘 +候 +倚 +倜 +倞 +借 +倡 +値 +倥 +倦 +倧 +倨 +倩 +倪 +倬 +倭 +倮 +倶 +债 +倻 +值 +倾 +偁 +偃 +假 +偈 +偌 +偍 +偎 +偏 +偓 +偕 +做 +停 +偠 +健 +偬 +偰 +偲 +偶 +偷 +偻 +偾 +偿 +傀 +傅 +傈 +傉 +傍 +傒 +傕 +傣 +傥 +傧 +储 +傩 +催 +傲 +傺 +傻 +働 +像 +僖 +僚 +僜 +僣 +僦 +僧 +僬 +僭 +僮 +僰 +僳 +僵 +僻 +僾 +儁 +儆 +儇 +儋 +儒 +儞 +儡 +儿 +兀 +允 +元 +兄 +充 +兆 +先 +光 +克 +免 +兎 +児 +兑 +兔 +兕 +兖 +党 +兜 +兢 +入 +全 +八 +公 +六 +兮 +兰 +共 +兲 +关 +兴 +兵 +其 +具 +典 +兹 +养 +兼 +兽 +冀 +内 +円 +冇 +冈 +冉 +册 +再 +冏 +冒 +冕 +冗 +写 +冚 +军 +农 +冠 +冢 +冤 +冥 +冧 +冨 +冬 +冮 +冯 +冰 +冲 +决 +冴 +况 +冶 +冷 +冻 +冼 +冽 +净 +凃 +凄 +准 +凇 +凉 +凊 +凋 +凌 +减 +凑 +凖 +凛 +凝 +几 +凡 +凤 +処 +凪 +凫 +凭 +凯 +凰 +凳 +凶 +凸 +凹 +出 +击 +凼 +函 +凿 +刀 +刁 +刂 +刃 +分 +切 +刈 +刊 +刋 +刍 +刎 +刑 +划 +刖 +列 +刘 +则 +刚 +创 +初 +删 +判 +刨 +利 +别 +刬 +刭 +刮 +到 +刳 +制 +刷 +券 +刹 +刺 +刻 +刽 +刿 +剀 +剁 +剂 +剃 +剅 +削 +剌 +前 +剐 +剑 +剔 +剖 +剜 +剞 +剡 +剣 +剥 +剧 +剩 +剪 +副 +割 +剽 +剿 +劂 +劄 +劈 +劏 +劓 +力 +劝 +办 +功 +加 +务 +劢 +劣 +动 +助 +努 +劫 +劬 +劭 +励 +劲 +劳 +劵 +効 +劻 +劼 +劾 +势 +勃 +勅 +勇 +勉 +勋 +勍 +勐 +勑 +勒 +勔 +勖 +勘 +募 +勠 +勤 +勰 +勲 +勳 +勷 +勺 +勾 +勿 +匀 +匂 +匄 +包 +匆 +匈 +匋 +匍 +匏 +匐 +匕 +化 +北 +匙 +匚 +匜 +匝 +匠 +匡 +匣 +匦 +匪 +匮 +匹 +区 +医 +匾 +匿 +區 +十 +千 +卅 +升 +午 +卉 +半 +卌 +卍 +华 +协 +卐 +卑 +卒 +卓 +单 +卖 +南 +単 +博 +卜 +卝 +卞 +卟 +占 +卡 +卢 +卣 +卤 +卦 +卧 +卨 +卫 +卬 +卮 +卯 +印 +危 +卲 +即 +却 +卵 +卷 +卸 +卺 +卽 +卿 +厂 +厄 +厅 +历 +厉 +压 +厌 +厍 +厓 +厔 +厕 +厘 +厚 +厝 +原 +厢 +厣 +厥 +厦 
+厨 +厩 +厮 +厶 +去 +县 +叁 +参 +叆 +又 +叉 +及 +友 +双 +反 +収 +发 +叒 +叔 +叕 +取 +受 +变 +叙 +叛 +叟 +叠 +叡 +口 +古 +句 +另 +叨 +叩 +只 +叫 +召 +叭 +叮 +可 +台 +叱 +史 +右 +叵 +叶 +号 +司 +叹 +叻 +叼 +叽 +吁 +吃 +各 +吅 +吆 +吇 +合 +吉 +吊 +吋 +同 +名 +后 +吏 +吐 +向 +吒 +吓 +吔 +吕 +吖 +吗 +吙 +吚 +君 +吝 +吞 +吟 +吠 +吡 +吥 +否 +吧 +吨 +吩 +含 +听 +吭 +吮 +启 +吱 +吲 +吴 +吵 +吸 +吹 +吻 +吼 +吽 +吾 +吿 +呀 +呃 +呆 +呈 +呉 +告 +呋 +呎 +呐 +呑 +呒 +呓 +呔 +呕 +呖 +呗 +员 +呙 +呛 +呜 +呢 +呣 +呤 +呦 +周 +呪 +呬 +呯 +呱 +呲 +味 +呴 +呵 +呶 +呷 +呸 +呻 +呼 +命 +呾 +咀 +咁 +咂 +咄 +咆 +咋 +和 +咎 +咏 +咐 +咒 +咔 +咕 +咖 +咗 +咘 +咙 +咚 +咛 +咝 +咢 +咣 +咤 +咥 +咦 +咧 +咨 +咩 +咪 +咫 +咬 +咭 +咯 +咱 +咲 +咳 +咴 +咸 +咻 +咽 +咾 +咿 +哀 +品 +哂 +哄 +哆 +哇 +哈 +哉 +哋 +哌 +响 +哎 +哏 +哐 +哑 +哒 +哓 +哔 +哕 +哗 +哙 +哚 +哝 +哞 +哟 +哥 +哦 +哧 +哨 +哩 +哪 +哭 +哮 +哲 +哺 +哼 +哽 +哿 +唁 +唃 +唆 +唇 +唉 +唎 +唏 +唐 +唑 +唔 +唛 +唠 +唢 +唤 +唦 +唧 +唬 +售 +唯 +唰 +唱 +唳 +唵 +唷 +唻 +唼 +唾 +唿 +啁 +啃 +啄 +啅 +商 +啉 +啊 +啐 +啓 +啕 +啖 +啜 +啝 +啡 +啤 +啥 +啦 +啧 +啪 +啫 +啬 +啭 +啮 +啯 +啰 +啱 +啲 +啵 +啶 +啷 +啸 +啻 +啼 +啾 +喀 +喁 +喂 +喃 +善 +喆 +喇 +喈 +喉 +喊 +喋 +喎 +喏 +喑 +喔 +喘 +喙 +喜 +喝 +喟 +喦 +喧 +喰 +喱 +喳 +喵 +営 +喷 +喹 +喺 +喻 +喽 +喾 +嗄 +嗅 +嗉 +嗌 +嗍 +嗑 +嗒 +嗓 +嗔 +嗖 +嗜 +嗝 +嗞 +嗟 +嗡 +嗣 +嗤 +嗥 +嗦 +嗨 +嗪 +嗫 +嗬 +嗮 +嗯 +嗰 +嗲 +嗳 +嗵 +嗷 +嗽 +嗾 +嘀 +嘁 +嘅 +嘈 +嘉 +嘌 +嘎 +嘏 +嘘 +嘚 +嘛 +嘞 +嘟 +嘢 +嘣 +嘤 +嘦 +嘧 +嘬 +嘭 +嘱 +嘲 +嘴 +嘶 +嘹 +嘻 +嘿 +噁 +噉 +噌 +噎 +噐 +噔 +噗 +噘 +噙 +噜 +噢 +噤 +器 +噩 +噪 +噫 +噬 +噱 +噶 +噻 +噼 +嚅 +嚈 +嚎 +嚏 +嚐 +嚒 +嚓 +嚟 +嚣 +嚧 +嚩 +嚭 +嚯 +嚷 +嚼 +囊 +囍 +囔 +囖 +囗 +囘 +囚 +四 +囝 +回 +囟 +因 +囡 +团 +団 +囤 +囧 +囫 +囬 +园 +囯 +困 +囱 +囲 +図 +围 +囵 +囷 +囹 +固 +国 +图 +囿 +圃 +圄 +圆 +圈 +圉 +圊 +國 +圌 +圏 +圜 +圝 +圞 +土 +圣 +圧 +在 +圩 +圪 +圬 +圭 +圮 +圯 +地 +圳 +圹 +场 +圻 +圾 +址 +坂 +均 +坊 +坌 +坍 +坎 +坏 +坐 +坑 +块 +坚 +坛 +坜 +坝 +坞 +坟 +坠 +坡 +坤 +坦 +坨 +坩 +坪 +坫 +坬 +坭 +坯 +坳 +坷 +坻 +坼 +垂 +垃 +垄 +垅 +垆 +型 +垌 +垍 +垒 +垓 +垕 +垚 +垛 +垞 +垟 +垠 +垡 +垢 +垣 +垤 +垦 +垧 +垩 +垫 +垭 +垮 +垱 +垲 +垴 +垵 +垸 +埂 +埃 +埇 +埈 +埋 +埌 +城 +埏 +埒 +埔 +埕 +埗 +埘 +埙 +埚 +埜 +埝 +域 +埠 +埤 +埭 +埯 +埴 +埵 +埸 +培 +基 +埼 +埽 +堀 +堂 +堃 +堆 +堇 +堈 +堉 +堋 +堌 +堍 +堑 +堕 +堙 +堞 +堠 +堡 +堤 +堨 +堪 +堰 +堵 +堺 +堽 +塁 +塄 +塅 +塆 +塌 +塍 +塑 +塔 +塘 +塚 +塝 +塞 +塩 +填 +塬 +塭 +塱 +塽 +塾 +墀 +墁 +境 +墅 +墉 +墒 +墓 +墕 +増 +墘 +墙 +增 +墟 +墨 +墩 +墫 +壁 +壅 +壆 +壊 +壑 +壕 +壤 +士 +壬 +壮 +声 +壱 +売 +壳 +壶 +壸 +壹 +处 +备 +変 +复 +夏 +夑 +夔 +夕 +外 +夙 +多 +夜 +够 +夤 +夥 +大 +夨 +天 +太 +夫 +夬 +夭 +央 +夯 +失 +夲 +头 +夶 +夷 +夸 +夹 +夺 +夼 +奀 +奁 +奂 +奄 +奇 +奈 +奉 +奋 +奌 +奎 +奏 +契 +奔 +奕 +奖 +套 +奘 +奚 +奠 +奢 +奥 +奨 +奭 +女 +奴 +奶 +奸 +她 +好 +妁 +如 +妃 +妄 +妆 +妇 +妈 +妊 +妍 +妒 +妓 +妖 +妗 +妘 +妙 +妞 +妠 +妣 +妤 +妥 +妨 +妩 +妪 +妫 +妬 +妮 +妯 +妲 +妹 +妺 +妻 +妼 +妾 +姁 +姆 +姈 +姉 +姊 +始 +姐 +姑 +姒 +姓 +委 +姗 +姘 +姚 +姜 +姝 +姞 +姣 +姤 +姥 +姨 +姫 +姬 +姮 +姱 +姵 +姹 +姻 +姽 +姿 +威 +娃 +娄 +娅 +娆 +娇 +娈 +娉 +娌 +娑 +娓 +娘 +娜 +娟 +娠 +娡 +娣 +娥 +娩 +娭 +娱 +娲 +娴 +娶 +娼 +婀 +婆 +婉 +婊 +婕 +婚 +婠 +婢 +婧 +婪 +婬 +婳 +婴 +婵 +婶 +婷 +婺 +婻 +婼 +婿 +媄 +媒 +媗 +媚 +媛 +媜 +媞 +媪 +媲 +媳 +媵 +媸 +媺 +媾 +嫁 +嫂 +嫄 +嫉 +嫌 +嫒 +嫔 +嫖 +嫘 +嫚 +嫠 +嫡 +嫣 +嫦 +嫩 +嫪 +嫫 +嫰 +嫱 +嫲 +嫽 +嬅 +嬉 +嬖 +嬗 +嬛 +嬜 +嬢 +嬲 +嬴 +嬷 +嬾 +嬿 +孀 +子 +孑 +孒 +孓 +孔 +孕 +孖 +字 +存 +孙 +孚 +孛 +孜 +孝 +孟 +孢 +季 +孤 +孥 +学 +孩 +孪 +孬 +孰 +孱 +孳 +孵 +孺 +孽 +宀 +宁 +它 +宄 +宅 +宇 +守 +安 +宋 +完 +宍 +宏 +宓 +宕 +宗 +官 +宙 +定 +宛 +宜 +宝 +实 +実 +宠 +审 +客 +宣 +室 +宥 +宦 +宪 +宫 +宬 +宰 +害 +宴 +宵 +家 +宸 +容 +宽 +宾 +宿 +寀 +寂 +寃 +寄 +寅 +密 +寇 +富 +寐 +寒 +寓 +寔 +寘 +寛 +寝 +寞 +察 +寡 +寤 +寥 +寨 +寮 +寯 +寰 +寳 +寸 +对 +寺 +寻 +导 +対 +寿 +封 +専 +射 +尅 +将 +尉 +尊 +對 +小 +尐 +少 +尒 +尓 +尔 +尕 +尖 +尘 +尙 +尚 +尛 +尜 +尝 +尢 +尤 +尧 +尨 +尪 +尬 +就 +尴 +尸 +尹 +尺 +尻 +尼 +尽 +尾 +尿 +局 +屁 +层 +屃 +屄 +居 +屈 +屉 +届 +屋 +屌 +屍 +屎 +屏 +屐 +屑 +展 +屙 +属 +屠 +屡 +屣 +履 +屦 +屮 +屯 +山 +屹 +屺 +屾 +屿 +岀 +岁 +岂 +岈 +岌 +岐 +岑 +岔 +岕 +岖 +岗 +岘 +岙 +岚 +岛 +岜 +岞 +岢 +岣 +岩 +岫 +岬 +岭 +岱 +岳 +岵 +岷 +岸 +岺 +岽 +岿 +峁 +峄 +峇 +峋 +峒 +峕 +峙 +峠 +峡 +峣 +峤 +峥 +峦 +峨 +峩 +峪 +峭 +峯 +峰 +峻 +崀 +崁 +崂 +崃 +崄 +崆 +崇 +崎 +崐 +崑 +崔 +崖 +崚 +崛 +崞 +崟 +崤 +崦 +崧 +崩 +崭 +崮 +崴 +崽 +崾 +嵇 +嵊 +嵋 +嵌 +嵎 +嵖 +嵗 +嵘 +嵚 +嵛 +嵝 +嵩 +嵬 +嵯 +嵴 +嶂 +嶋 +嶙 +嶝 +嶲 +嶷 +巂 +巅 +巇 +巉 +巍 +巎 +巘 +巜 +川 +州 +巡 +巢 +巣 +工 +左 +巧 +巨 +巩 +巫 +差 +巯 +己 +已 +巳 +巴 +巷 +巻 +巽 +巾 +巿 +币 +市 +布 +帅 +帆 +师 +希 +帏 +帐 +帑 +帔 +帕 +帖 +帘 +帙 +帚 +帛 +帜 +帝 +带 +帧 +席 +帮 +帯 +帰 +帷 
+常 +帻 +帼 +帽 +幂 +幄 +幅 +幌 +幔 +幕 +幛 +幞 +幡 +幢 +干 +平 +年 +幵 +并 +幷 +幸 +幺 +幻 +幼 +幽 +广 +庀 +庁 +広 +庄 +庆 +庇 +床 +庋 +序 +庐 +庑 +库 +应 +底 +庖 +店 +庙 +庚 +府 +庞 +废 +庠 +庤 +庥 +度 +座 +庭 +庵 +庶 +康 +庸 +庹 +庾 +廆 +廉 +廊 +廋 +廌 +廑 +廒 +廓 +廕 +廖 +廙 +廛 +廞 +廨 +廪 +廯 +延 +廷 +建 +廻 +廼 +廾 +廿 +开 +弁 +异 +弃 +弄 +弇 +弈 +弊 +弋 +式 +弐 +弑 +弓 +引 +弗 +弘 +弛 +弟 +张 +弢 +弥 +弦 +弧 +弩 +弭 +弯 +弱 +弹 +强 +弼 +弾 +彀 +归 +当 +录 +彖 +彗 +彘 +彝 +彟 +彡 +形 +彤 +彦 +彧 +彩 +彪 +彬 +彭 +彰 +影 +彳 +彵 +彷 +役 +彻 +彼 +往 +征 +徂 +径 +待 +徇 +很 +徉 +徊 +律 +徐 +徒 +従 +徕 +得 +徘 +徙 +徜 +御 +徧 +徨 +循 +徭 +微 +徳 +徴 +徵 +德 +徼 +徽 +心 +忄 +必 +忆 +忉 +忌 +忍 +忏 +忐 +忑 +忒 +忖 +志 +忘 +忙 +応 +忝 +忞 +忠 +忡 +忤 +忧 +忪 +快 +忭 +忱 +念 +忸 +忻 +忽 +忾 +忿 +怀 +态 +怂 +怃 +怄 +怅 +怆 +怍 +怎 +怏 +怒 +怔 +怕 +怖 +怙 +怛 +怜 +思 +怠 +怡 +急 +怦 +性 +怨 +怩 +怪 +怫 +怯 +怱 +怳 +怵 +怹 +总 +怼 +怿 +恁 +恂 +恃 +恋 +恍 +恏 +恐 +恒 +恕 +恙 +恚 +恠 +恢 +恣 +恤 +恨 +恩 +恪 +恫 +恬 +恭 +息 +恰 +恳 +恵 +恶 +恸 +恹 +恺 +恻 +恼 +恽 +恿 +悃 +悄 +悉 +悌 +悍 +悒 +悔 +悖 +悚 +悛 +悝 +悟 +悠 +患 +悦 +您 +悩 +悪 +悫 +悬 +悭 +悯 +悰 +悱 +悲 +悳 +悴 +悸 +悻 +悼 +情 +惆 +惇 +惊 +惋 +惑 +惔 +惕 +惘 +惚 +惛 +惜 +惝 +惟 +惠 +惢 +惣 +惦 +惧 +惨 +惩 +惪 +惫 +惬 +惭 +惮 +惯 +惰 +想 +惴 +惶 +惹 +惺 +愀 +愁 +愆 +愈 +愉 +愍 +愎 +意 +愔 +愕 +愚 +感 +愠 +愣 +愤 +愦 +愧 +愫 +愬 +愰 +愽 +愿 +慆 +慈 +慊 +慌 +慎 +慑 +慕 +慜 +慝 +慢 +慥 +慧 +慨 +慰 +慵 +慷 +慾 +憋 +憍 +憎 +憔 +憙 +憧 +憨 +憩 +憬 +憷 +憺 +憾 +懂 +懈 +懊 +懋 +懐 +懑 +懒 +懔 +懦 +懮 +懵 +懽 +懿 +戆 +戈 +戊 +戋 +戌 +戍 +戎 +戏 +成 +我 +戒 +戓 +戕 +或 +戗 +战 +戚 +戛 +戟 +戡 +戢 +戥 +戦 +截 +戬 +戮 +戯 +戳 +戴 +户 +戻 +戽 +戾 +房 +所 +扁 +扃 +扆 +扇 +扈 +扉 +手 +扌 +才 +扎 +扑 +扒 +打 +扔 +托 +扛 +扞 +扣 +扥 +扦 +执 +扩 +扪 +扫 +扬 +扭 +扮 +扯 +扰 +扳 +扶 +批 +扼 +扽 +找 +承 +技 +抃 +抄 +抉 +把 +抑 +抒 +抓 +抔 +投 +抖 +抗 +折 +抚 +抛 +抜 +抟 +抠 +抡 +抢 +护 +报 +抧 +抨 +披 +抬 +抱 +抳 +抵 +抹 +抺 +抻 +押 +抽 +抿 +拂 +拄 +担 +拆 +拇 +拈 +拉 +拊 +拌 +拍 +拎 +拏 +拐 +拒 +拓 +拔 +拖 +拗 +拘 +拙 +招 +拜 +拟 +拢 +拣 +拥 +拦 +拧 +拨 +择 +括 +拭 +拮 +拯 +拱 +拳 +拴 +拶 +拷 +拼 +拽 +拾 +拿 +挀 +持 +挂 +指 +挈 +按 +挎 +挑 +挒 +挖 +挚 +挛 +挝 +挞 +挟 +挠 +挡 +挢 +挣 +挤 +挥 +挨 +挪 +挫 +振 +挲 +挹 +挺 +挻 +挼 +挽 +捂 +捃 +捅 +捆 +捉 +捋 +捌 +捍 +捎 +捏 +捐 +捕 +捜 +捞 +损 +捡 +换 +捣 +捧 +捩 +捭 +据 +捯 +捱 +捶 +捷 +捺 +捻 +捽 +掀 +掂 +掇 +授 +掉 +掊 +掌 +掎 +掏 +掐 +排 +掖 +掘 +掞 +掠 +探 +掣 +掤 +接 +控 +推 +掩 +措 +掬 +掭 +掮 +掰 +掲 +掳 +掴 +掷 +掸 +掺 +掼 +掾 +揄 +揆 +揉 +揍 +描 +提 +插 +揖 +揠 +握 +揣 +揩 +揪 +揭 +揲 +援 +揵 +揶 +揸 +揺 +揽 +揾 +揿 +搀 +搁 +搂 +搅 +搋 +搏 +搐 +搓 +搔 +搜 +搞 +搠 +搡 +搢 +搥 +搦 +搧 +搨 +搪 +搬 +搭 +搴 +搵 +携 +搽 +搿 +摁 +摄 +摅 +摆 +摇 +摈 +摊 +摒 +摔 +摘 +摛 +摞 +摧 +摩 +摭 +摸 +摹 +摺 +摽 +撂 +撃 +撄 +撅 +撇 +撑 +撒 +撕 +撘 +撙 +撝 +撞 +撤 +撩 +撬 +播 +撮 +撰 +撵 +撷 +撸 +撺 +撼 +擀 +擂 +擅 +操 +擎 +擒 +擗 +擘 +擞 +擢 +擤 +擦 +擫 +擿 +攀 +攒 +攘 +攞 +攥 +攫 +支 +攲 +攴 +攵 +收 +攸 +改 +攻 +攽 +放 +政 +故 +效 +敉 +敌 +敎 +敏 +救 +敔 +敕 +敖 +教 +敚 +敛 +敝 +敞 +敢 +散 +敦 +敫 +敬 +数 +敲 +整 +敷 +敻 +文 +斉 +斋 +斌 +斎 +斐 +斑 +斓 +斗 +料 +斛 +斜 +斝 +斟 +斡 +斤 +斥 +斧 +斩 +斫 +断 +斯 +新 +斲 +斶 +方 +於 +施 +旁 +旃 +旄 +旅 +旆 +旋 +旌 +旎 +族 +旒 +旖 +旗 +旛 +无 +既 +旣 +日 +旦 +旧 +旨 +早 +旬 +旭 +旮 +旯 +旰 +旱 +旳 +旴 +时 +旷 +旸 +旺 +旻 +旼 +昀 +昂 +昃 +昆 +昇 +昉 +昊 +昌 +明 +昏 +易 +昔 +昕 +昙 +昚 +昝 +昞 +星 +映 +春 +昧 +昨 +昪 +昫 +昭 +是 +昰 +昱 +昳 +昴 +昵 +昶 +昺 +昼 +显 +晁 +時 +晃 +晄 +晋 +晌 +晏 +晒 +晓 +晔 +晕 +晖 +晗 +晙 +晚 +晞 +晟 +晡 +晢 +晤 +晦 +晧 +晨 +晩 +晬 +普 +景 +晰 +晳 +晴 +晶 +晷 +晸 +智 +晻 +晾 +暁 +暂 +暄 +暇 +暌 +暍 +暎 +暐 +暑 +暕 +暖 +暗 +暝 +暠 +暧 +暨 +暮 +暴 +暸 +暹 +暻 +暾 +曈 +曌 +曕 +曙 +曛 +曜 +曝 +曦 +曩 +曰 +曱 +曲 +曳 +更 +曷 +曹 +曺 +曼 +曽 +曾 +替 +最 +會 +朅 +月 +有 +朊 +朋 +服 +朏 +朐 +朓 +朔 +朕 +朗 +望 +朝 +期 +朥 +朦 +木 +未 +末 +本 +札 +术 +朱 +朲 +朴 +朵 +机 +朽 +朿 +杀 +杂 +权 +杆 +杈 +杉 +杌 +李 +杏 +材 +村 +杓 +杖 +杜 +杞 +束 +杠 +条 +来 +杧 +杨 +杪 +杭 +杮 +杯 +杰 +杲 +杳 +杵 +杷 +杼 +松 +板 +极 +构 +枇 +枉 +枋 +枏 +析 +枓 +枕 +林 +枘 +枚 +果 +枝 +枞 +枟 +枠 +枢 +枣 +枥 +枧 +枨 +枪 +枫 +枭 +枯 +枰 +枱 +枳 +枵 +架 +枷 +枸 +枹 +柁 +柃 +柄 +柊 +柏 +某 +柑 +柒 +染 +柔 +柘 +柙 +柚 +柜 +柝 +柞 +柟 +柠 +柢 +查 +柩 +柬 +柯 +柰 +柱 +柳 +柴 +柷 +査 +柽 +柾 +柿 +栀 +栃 +栄 +栅 +标 +栈 +栉 +栊 +栋 +栌 +栎 +栏 +树 +栒 +栓 +栖 +栗 +栝 +栞 +栟 +校 +栢 +栩 +株 +栱 +栲 +栳 +栴 +样 +核 +根 +栻 +格 +栽 +栾 +栿 +桀 +桁 +桂 +桃 +桄 +桅 +框 +案 +桉 +桌 +桎 +桐 +桑 +桓 +桔 +桕 +桖 +桜 +桠 +桡 +桢 +档 +桤 +桥 +桦 +桧 +桨 +桩 +桫 +桴 +桶 +桷 +梁 +梃 +梅 +梆 +梏 +梓 +梗 +梢 +梣 +梦 +梧 +梨 +梭 +梯 +械 +梳 +梵 +梶 +梼 +梿 +检 +棂 
+棉 +棋 +棍 +棐 +棒 +棕 +棘 +棚 +棠 +棣 +棨 +棪 +棫 +森 +棰 +棱 +棵 +棹 +棺 +棻 +棼 +椀 +椁 +椅 +椇 +椋 +植 +椎 +椐 +椒 +椛 +検 +椟 +椠 +椤 +椪 +椭 +椰 +椴 +椹 +椽 +椿 +楂 +楔 +楗 +楙 +楚 +楛 +楝 +楞 +楠 +楢 +楣 +楤 +楦 +楩 +楪 +楫 +業 +楮 +楯 +楶 +楷 +楸 +楹 +楼 +楽 +榀 +概 +榃 +榄 +榆 +榇 +榈 +榉 +榊 +榎 +榔 +榕 +榖 +榘 +榛 +榜 +榧 +榨 +榫 +榭 +榱 +榴 +榷 +榻 +榼 +槁 +槃 +槅 +槊 +槌 +槎 +槐 +槑 +槔 +様 +槙 +槚 +槛 +槜 +槟 +槠 +槭 +槱 +槲 +槵 +槻 +槽 +槿 +樊 +樋 +樗 +樘 +樛 +樟 +模 +樨 +権 +横 +樫 +樯 +樱 +樵 +樽 +樾 +橄 +橇 +橐 +橘 +橙 +橚 +橛 +橡 +橥 +橦 +橱 +橹 +橼 +檀 +檄 +檇 +檎 +檐 +檗 +檞 +檠 +檩 +檫 +檬 +檵 +櫂 +櫆 +欠 +次 +欢 +欣 +欤 +欧 +欲 +欷 +欸 +欹 +欺 +欻 +款 +歃 +歆 +歇 +歉 +歌 +歔 +歘 +歙 +止 +正 +此 +步 +武 +歧 +歩 +歪 +歳 +歴 +歹 +歺 +死 +歼 +殁 +殂 +殃 +殄 +殆 +殇 +殉 +殊 +残 +殍 +殑 +殒 +殓 +殖 +殚 +殛 +殡 +殢 +殪 +殳 +殴 +段 +殷 +殿 +毁 +毂 +毅 +毋 +毌 +母 +毎 +每 +毐 +毑 +毒 +毓 +比 +毕 +毖 +毗 +毘 +毙 +毛 +毡 +毫 +毯 +毳 +毵 +毽 +氀 +氂 +氅 +氆 +氇 +氍 +氏 +氐 +民 +氓 +气 +氖 +気 +氘 +氙 +氚 +氛 +氟 +氡 +氢 +氤 +氦 +氧 +氨 +氩 +氪 +氮 +氯 +氰 +氲 +水 +氵 +氷 +永 +氹 +氺 +氽 +氿 +汀 +汁 +求 +汆 +汇 +汉 +汊 +汐 +汕 +汖 +汗 +汛 +汜 +汝 +汞 +江 +池 +污 +汤 +汧 +汨 +汩 +汪 +汭 +汰 +汲 +汴 +汶 +汸 +汹 +汽 +汾 +沁 +沂 +沃 +沄 +沅 +沆 +沇 +沈 +沉 +沌 +沏 +沐 +沒 +沓 +沔 +沕 +沙 +沚 +沛 +沟 +没 +沢 +沣 +沤 +沥 +沦 +沧 +沨 +沩 +沪 +沫 +沬 +沭 +沮 +沱 +河 +沴 +沵 +沸 +油 +治 +沼 +沽 +沾 +沿 +泃 +泄 +泅 +泇 +泉 +泊 +泌 +泐 +泓 +泔 +法 +泖 +泗 +泚 +泛 +泞 +泠 +泡 +波 +泣 +泥 +注 +泩 +泪 +泫 +泮 +泯 +泰 +泱 +泳 +泵 +泷 +泸 +泺 +泻 +泼 +泽 +泾 +洁 +洄 +洇 +洈 +洊 +洋 +洌 +洎 +洑 +洒 +洗 +洙 +洛 +洞 +洢 +洣 +津 +洧 +洨 +洪 +洫 +洮 +洱 +洲 +洳 +洵 +洸 +洹 +洺 +活 +洼 +洽 +派 +流 +浃 +浄 +浅 +浆 +浇 +浈 +浉 +浊 +测 +浍 +济 +浏 +浐 +浑 +浒 +浓 +浔 +浙 +浚 +浛 +浜 +浞 +浠 +浡 +浣 +浥 +浦 +浩 +浪 +浬 +浮 +浯 +浴 +海 +浸 +浼 +涂 +涅 +消 +涉 +涌 +涎 +涐 +涑 +涓 +涔 +涕 +涘 +涙 +涛 +涝 +涞 +涟 +涠 +涡 +涢 +涣 +涤 +润 +涧 +涨 +涩 +涪 +涫 +涮 +涯 +液 +涴 +涵 +涸 +涿 +淀 +淄 +淅 +淆 +淇 +淋 +淌 +淏 +淑 +淖 +淘 +淙 +淛 +淝 +淞 +淠 +淡 +淤 +淦 +淩 +淫 +淬 +淮 +淯 +深 +淳 +混 +淸 +淹 +添 +淼 +渀 +渃 +清 +済 +渉 +渊 +渋 +渌 +渍 +渎 +渐 +渑 +渔 +渕 +渖 +渗 +渚 +渝 +渟 +渠 +渡 +渣 +渤 +渥 +温 +渫 +渭 +港 +渲 +渴 +游 +渺 +渼 +湃 +湄 +湉 +湋 +湍 +湎 +湑 +湓 +湔 +湖 +湘 +湛 +湜 +湟 +湣 +湫 +湮 +湲 +湳 +湴 +湾 +湿 +満 +溁 +溃 +溅 +溆 +溇 +溉 +溍 +溏 +源 +溘 +溜 +溞 +溟 +溢 +溥 +溦 +溧 +溪 +溯 +溱 +溲 +溴 +溶 +溷 +溺 +溽 +滁 +滂 +滃 +滆 +滇 +滈 +滉 +滋 +滍 +滏 +滑 +滓 +滔 +滕 +滗 +滘 +滙 +滚 +滝 +滞 +滟 +滠 +满 +滢 +滤 +滥 +滦 +滨 +滩 +滴 +滹 +漀 +漂 +漆 +漈 +漉 +漏 +漓 +演 +漕 +漠 +漩 +漪 +漫 +漭 +漯 +漱 +漳 +漶 +漷 +漾 +潆 +潇 +潋 +潍 +潏 +潘 +潜 +潞 +潟 +潢 +潦 +潭 +潮 +潲 +潴 +潸 +潺 +潼 +潽 +潾 +澂 +澄 +澈 +澉 +澌 +澍 +澎 +澐 +澔 +澜 +澡 +澥 +澧 +澪 +澳 +澶 +澹 +激 +濂 +濆 +濉 +濊 +濑 +濒 +濙 +濛 +濞 +濠 +濡 +濩 +濬 +濮 +濯 +瀀 +瀍 +瀑 +瀚 +瀛 +瀞 +瀣 +瀬 +瀹 +瀼 +灌 +灏 +灞 +火 +灬 +灭 +灯 +灰 +灵 +灶 +灸 +灼 +灾 +灿 +炀 +炁 +炅 +炆 +炉 +炊 +炎 +炒 +炔 +炕 +炖 +炘 +炙 +炜 +炝 +炟 +炤 +炩 +炫 +炬 +炭 +炮 +炯 +炱 +炳 +炷 +炸 +点 +為 +炻 +炼 +炽 +烀 +烁 +烂 +烃 +烈 +烊 +烎 +烔 +烘 +烙 +烛 +烜 +烝 +烟 +烤 +烦 +烧 +烨 +烩 +烫 +烬 +热 +烯 +烷 +烹 +烺 +烽 +焉 +焊 +焌 +焐 +焓 +焕 +焖 +焗 +焘 +焙 +焚 +焜 +焞 +焦 +焮 +焯 +焰 +焱 +然 +焼 +煅 +煇 +煊 +煌 +煎 +煐 +煕 +煖 +煚 +煜 +煞 +煤 +煦 +照 +煨 +煮 +煲 +煳 +煴 +煸 +煺 +煽 +熄 +熇 +熊 +熏 +熔 +熘 +熙 +熜 +熟 +熠 +熥 +熨 +熬 +熳 +熵 +熹 +熺 +燃 +燊 +燋 +燎 +燏 +燔 +燕 +燚 +燠 +燥 +燧 +燮 +燹 +燻 +燿 +爀 +爆 +爇 +爨 +爪 +爬 +爰 +爱 +爲 +爵 +父 +爷 +爸 +爹 +爻 +爽 +爿 +牀 +牁 +牂 +片 +版 +牋 +牌 +牍 +牐 +牒 +牖 +牙 +牛 +牝 +牟 +牠 +牡 +牢 +牤 +牦 +牧 +物 +牯 +牲 +牵 +特 +牺 +牻 +牾 +犀 +犁 +犄 +犇 +犊 +犍 +犏 +犒 +犟 +犨 +犬 +犭 +犯 +犰 +犴 +状 +犷 +犸 +犹 +犼 +犽 +狁 +狂 +狃 +狄 +狈 +狌 +狍 +狎 +狐 +狒 +狗 +狙 +狛 +狝 +狞 +狠 +狡 +狨 +狩 +独 +狭 +狮 +狯 +狰 +狱 +狲 +狳 +狴 +狷 +狸 +狻 +狼 +猀 +猁 +猃 +猄 +猇 +猊 +猋 +猎 +猓 +猕 +猖 +猗 +猛 +猜 +猝 +猞 +猟 +猡 +猢 +猥 +猩 +猪 +猫 +猬 +献 +猰 +猱 +猴 +猷 +猹 +猾 +猿 +獍 +獐 +獒 +獗 +獠 +獣 +獬 +獭 +獴 +獾 +玁 +玄 +率 +玉 +王 +玎 +玏 +玑 +玕 +玖 +玗 +玘 +玙 +玚 +玛 +玟 +玠 +玡 +玢 +玥 +玦 +玧 +玩 +玫 +玭 +玮 +环 +现 +玲 +玳 +玷 +玹 +玺 +玻 +珀 +珂 +珅 +珈 +珉 +珊 +珍 +珏 +珐 +珑 +珖 +珙 +珝 +珞 +珠 +珣 +珥 +珦 +珧 +珩 +珪 +班 +珰 +珲 +珵 +珹 +珺 +珽 +琀 +球 +琅 +理 +琇 +琉 +琊 +琋 +琍 +琎 +琏 +琐 +琚 +琛 +琢 +琤 +琥 +琦 +琨 +琪 +琬 +琮 +琯 +琰 +琲 +琳 +琴 +琵 +琶 +琹 +琼 +瑀 +瑁 +瑄 +瑆 +瑊 +瑒 +瑕 +瑗 +瑙 +瑚 +瑛 +瑜 +瑞 +瑟 +瑠 +瑢 +瑧 +瑨 +瑭 +瑰 +瑱 +瑶 +瑷 +瑸 +瑺 +瑾 +璀 +璁 +璂 +璃 +璆 +璇 +璈 +璋 +璎 +璐 +璘 +璜 +璞 +璟 +璠 +璧 +璨 +璩 +璪 +璮 +璲 +璺 +璿 +瓌 +瓒 +瓘 +瓛 +瓜 +瓞 +瓟 +瓠 +瓢 +瓣 +瓤 +瓦 +瓮 +瓯 +瓴 +瓶 +瓷 +瓿 +甃 +甄 +甍 +甏 +甑 +甓 +甗 +甘 +甙 +甚 
+甜 +生 +甡 +產 +甥 +用 +甩 +甪 +甫 +甬 +甭 +甯 +田 +由 +甲 +申 +电 +男 +甸 +町 +画 +甾 +畀 +畅 +畈 +畊 +畋 +界 +畎 +畏 +畑 +畔 +留 +畚 +畛 +畜 +畠 +畤 +略 +畦 +番 +畯 +畲 +畴 +畸 +畹 +畿 +疃 +疆 +疋 +疍 +疎 +疏 +疑 +疒 +疔 +疖 +疗 +疙 +疚 +疝 +疟 +疠 +疡 +疣 +疤 +疥 +疫 +疬 +疮 +疯 +疰 +疱 +疲 +疳 +疴 +疵 +疸 +疹 +疼 +疽 +疾 +痂 +痄 +病 +症 +痈 +痉 +痊 +痍 +痒 +痔 +痕 +痖 +痘 +痛 +痞 +痢 +痣 +痤 +痦 +痧 +痨 +痩 +痪 +痫 +痰 +痱 +痴 +痹 +痼 +痿 +瘀 +瘁 +瘅 +瘆 +瘊 +瘌 +瘐 +瘕 +瘖 +瘗 +瘘 +瘙 +瘛 +瘟 +瘠 +瘢 +瘤 +瘥 +瘦 +瘩 +瘪 +瘫 +瘰 +瘳 +瘴 +瘵 +瘸 +瘼 +瘾 +瘿 +癀 +癃 +癌 +癍 +癎 +癒 +癔 +癖 +癜 +癞 +癣 +癫 +癯 +癸 +発 +登 +發 +白 +百 +癿 +皂 +的 +皆 +皇 +皈 +皋 +皎 +皐 +皑 +皒 +皓 +皕 +皖 +皙 +皛 +皝 +皞 +皤 +皦 +皮 +皱 +皲 +皴 +皿 +盂 +盃 +盅 +盆 +盈 +盉 +益 +盌 +盍 +盎 +盏 +盐 +监 +盒 +盔 +盖 +盗 +盘 +盛 +盝 +盟 +盥 +盦 +盨 +盩 +目 +盯 +盱 +盲 +直 +相 +盹 +盼 +盾 +眀 +省 +眄 +眇 +眈 +眉 +看 +県 +眙 +眚 +眛 +眞 +真 +眠 +眦 +眨 +眩 +眬 +眭 +眯 +眵 +眶 +眷 +眸 +眺 +眼 +着 +睁 +睇 +睐 +睑 +睒 +睚 +睛 +睡 +睢 +督 +睥 +睦 +睨 +睪 +睫 +睬 +睱 +睹 +睺 +睽 +睾 +睿 +瞀 +瞄 +瞅 +瞋 +瞌 +瞎 +瞑 +瞒 +瞟 +瞠 +瞢 +瞥 +瞧 +瞩 +瞪 +瞬 +瞭 +瞰 +瞳 +瞻 +瞽 +瞿 +矍 +矗 +矛 +矜 +矞 +矢 +矣 +知 +矧 +矩 +矫 +矬 +短 +矮 +石 +矶 +矸 +矽 +矾 +矿 +砀 +码 +砂 +砉 +砌 +砍 +砑 +砒 +研 +砕 +砖 +砗 +砚 +砜 +砝 +砟 +砢 +砣 +砥 +砦 +砧 +砩 +砬 +砭 +砰 +砳 +破 +砵 +砷 +砸 +砹 +砺 +砻 +砼 +砾 +础 +硅 +硇 +硌 +硎 +硏 +硐 +硒 +硔 +硕 +硖 +硗 +硙 +硚 +硝 +硪 +硫 +硬 +确 +硷 +硼 +碁 +碇 +碉 +碌 +碍 +碎 +碏 +碑 +碓 +碗 +碘 +碚 +碛 +碜 +碟 +碡 +碣 +碥 +碧 +碰 +碱 +碲 +碳 +碴 +碶 +碹 +碾 +磁 +磅 +磉 +磊 +磋 +磐 +磔 +磕 +磙 +磜 +磡 +磦 +磨 +磬 +磲 +磴 +磷 +磺 +磻 +磾 +礁 +礅 +礐 +礓 +礞 +礤 +礴 +示 +礻 +礼 +礽 +社 +祀 +祁 +祂 +祆 +祇 +祈 +祉 +祊 +祎 +祏 +祐 +祓 +祔 +祖 +祗 +祘 +祚 +祛 +祜 +祝 +神 +祟 +祠 +祢 +祥 +祧 +票 +祭 +祯 +祲 +祷 +祸 +祹 +祺 +祼 +祾 +禀 +禁 +禄 +禅 +禇 +禊 +禋 +福 +禑 +禔 +禖 +禘 +禚 +禛 +禟 +禤 +禧 +禩 +禳 +禵 +禹 +禺 +离 +禽 +禾 +秀 +私 +秂 +秃 +秆 +秉 +秋 +种 +科 +秒 +秕 +秘 +租 +秣 +秤 +秦 +秧 +秩 +秫 +秬 +秭 +积 +称 +秸 +移 +秽 +秾 +稀 +稃 +程 +稍 +税 +稔 +稗 +稙 +稚 +稞 +稠 +稣 +稲 +稳 +稷 +稹 +稻 +稼 +稽 +稿 +穂 +穆 +穉 +穏 +穑 +穗 +穣 +穰 +穴 +究 +穷 +穹 +空 +穿 +突 +窃 +窄 +窅 +窆 +窈 +窊 +窋 +窍 +窑 +窒 +窓 +窕 +窖 +窗 +窘 +窜 +窝 +窟 +窠 +窣 +窥 +窦 +窨 +窭 +窰 +窳 +窸 +窿 +立 +竑 +竖 +站 +竜 +竝 +竞 +竟 +章 +竣 +童 +竦 +竭 +端 +竹 +竺 +竽 +竿 +笃 +笄 +笆 +笈 +笊 +笋 +笏 +笑 +笔 +笕 +笙 +笛 +笞 +笠 +笤 +笥 +符 +笨 +笪 +笫 +第 +笮 +笱 +笳 +笸 +笹 +笺 +笼 +笾 +筅 +筇 +等 +筊 +筋 +筌 +筏 +筐 +筑 +筒 +答 +策 +筘 +筚 +筛 +筜 +筝 +筠 +筭 +筮 +筯 +筱 +筲 +筴 +筵 +筷 +筹 +筼 +签 +简 +箅 +箌 +箍 +箐 +箓 +箔 +箕 +算 +箜 +箝 +管 +箦 +箧 +箨 +箩 +箪 +箫 +箬 +箭 +箱 +箴 +箸 +篁 +篆 +篇 +篌 +篑 +篓 +篙 +篚 +篝 +篡 +篥 +篦 +篪 +篮 +篯 +篱 +篷 +篼 +篾 +簃 +簇 +簋 +簌 +簏 +簕 +簖 +簟 +簠 +簦 +簧 +簪 +簰 +簸 +簿 +籀 +籁 +籍 +籓 +籙 +米 +籴 +籺 +类 +籼 +籽 +粄 +粉 +粑 +粒 +粕 +粗 +粘 +粜 +粝 +粞 +粟 +粢 +粤 +粥 +粦 +粧 +粪 +粬 +粮 +粱 +粲 +粳 +粹 +粼 +粽 +精 +粿 +糁 +糅 +糊 +糌 +糍 +糕 +糖 +糗 +糙 +糜 +糟 +糠 +糨 +糬 +糯 +糸 +系 +紊 +紘 +素 +索 +紧 +紫 +紬 +紮 +累 +経 +絜 +絪 +絮 +絵 +絶 +絷 +絺 +綎 +綖 +継 +続 +綝 +綦 +綫 +綮 +総 +緑 +緾 +縁 +縂 +縠 +縢 +縻 +繁 +繇 +繋 +繸 +繻 +纁 +纂 +纚 +纛 +纟 +纠 +纡 +红 +纣 +纤 +纥 +约 +级 +纨 +纩 +纪 +纫 +纬 +纭 +纮 +纯 +纰 +纱 +纲 +纳 +纵 +纶 +纷 +纸 +纹 +纺 +纻 +纽 +纾 +线 +绀 +绁 +绂 +练 +组 +绅 +细 +织 +终 +绉 +绊 +绌 +绍 +绎 +经 +绐 +绑 +绒 +结 +绔 +绕 +绗 +绘 +给 +绚 +绛 +络 +绝 +绞 +统 +绠 +绡 +绢 +绣 +绥 +绦 +继 +绨 +绩 +绪 +绫 +续 +绮 +绯 +绰 +绱 +绲 +绳 +维 +绵 +绶 +绷 +绸 +绹 +绺 +绻 +综 +绽 +绾 +绿 +缀 +缁 +缂 +缃 +缄 +缅 +缆 +缇 +缈 +缉 +缊 +缋 +缌 +缍 +缎 +缐 +缑 +缒 +缓 +缔 +缕 +编 +缗 +缘 +缙 +缚 +缛 +缜 +缝 +缟 +缠 +缡 +缢 +缣 +缤 +缥 +缦 +缧 +缨 +缩 +缪 +缫 +缬 +缭 +缮 +缯 +缰 +缱 +缲 +缳 +缴 +缵 +缶 +缷 +缸 +缺 +罂 +罃 +罄 +罅 +罍 +罐 +网 +罔 +罕 +罗 +罘 +罚 +罝 +罟 +罠 +罡 +罢 +罣 +罥 +罨 +罩 +罪 +置 +罱 +署 +罴 +罹 +罽 +罾 +羁 +羊 +羌 +美 +羑 +羔 +羕 +羚 +羝 +羞 +羟 +羡 +羣 +群 +羧 +羮 +羯 +羰 +羲 +羸 +羹 +羼 +羽 +羿 +翀 +翁 +翃 +翅 +翈 +翊 +翌 +翎 +翔 +翕 +翘 +翙 +翚 +翛 +翟 +翠 +翡 +翥 +翦 +翩 +翫 +翮 +翰 +翱 +翳 +翻 +翼 +翾 +耀 +老 +考 +耄 +者 +耆 +耋 +而 +耍 +耐 +耒 +耔 +耕 +耗 +耘 +耙 +耜 +耦 +耧 +耨 +耩 +耪 +耳 +耵 +耶 +耷 +耸 +耻 +耽 +耿 +聂 +聃 +聆 +聊 +聋 +职 +聍 +聒 +联 +聕 +聘 +聚 +聡 +聩 +聪 +聴 +聼 +聿 +肃 +肄 +肆 +肇 +肉 +肋 +肌 +肓 +肖 +肘 +肚 +肛 +肜 +肝 +肟 +肠 +股 +肢 +肤 +肥 +肩 +肪 +肫 +肮 +肯 +肱 +育 +肴 +肸 +肺 +肼 +肽 +肾 +肿 +胀 +胁 +胂 +胃 +胄 +胆 +背 +胍 +胎 +胖 +胗 +胙 +胚 +胛 +胜 +胝 +胞 +胠 +胡 +胤 +胥 +胧 +胨 +胪 +胫 +胬 +胭 +胯 +胰 +胱 +胳 +胴 +胶 +胸 +胺 +胼 +能 +脁 +脂 +脆 +脇 +脉 +脊 +脍 +脏 +脐 +脑 +脒 +脓 +脔 +脖 +脘 +脚 +脞 +脢 +脩 +脬 +脯 +脱 +脲 +脳 +脷 +脸 +脾 +脿 +腆 +腈 +腊 
+腋 +腌 +腐 +腑 +腓 +腔 +腕 +腘 +腙 +腚 +腠 +腥 +腧 +腩 +腭 +腮 +腰 +腱 +腴 +腹 +腺 +腻 +腼 +腾 +腿 +膀 +膂 +膈 +膊 +膏 +膑 +膘 +膛 +膜 +膝 +膦 +膨 +膳 +膺 +膻 +臀 +臁 +臂 +臃 +臆 +臊 +臌 +臑 +臓 +臜 +臞 +臣 +臧 +自 +臬 +臭 +至 +致 +臵 +臻 +臼 +臾 +舀 +舁 +舂 +舄 +舅 +舆 +舌 +舍 +舎 +舐 +舒 +舔 +舖 +舘 +舛 +舜 +舞 +舟 +舡 +舢 +舣 +舨 +航 +舫 +般 +舯 +舰 +舱 +舲 +舳 +舴 +舵 +舶 +舷 +舸 +船 +舺 +舻 +舾 +艄 +艇 +艉 +艋 +艏 +艘 +艟 +艨 +艮 +良 +艰 +色 +艳 +艹 +艺 +艽 +艾 +艿 +节 +芃 +芄 +芈 +芊 +芋 +芍 +芎 +芏 +芑 +芒 +芗 +芘 +芙 +芜 +芝 +芟 +芡 +芣 +芤 +芥 +芦 +芨 +芩 +芪 +芫 +芬 +芭 +芮 +芯 +芰 +花 +芳 +芴 +芶 +芷 +芸 +芹 +芽 +芾 +苁 +苄 +苇 +苈 +苊 +苋 +苌 +苍 +苎 +苏 +苑 +苒 +苓 +苔 +苕 +苗 +苘 +苛 +苜 +苞 +苟 +苡 +苢 +苣 +苤 +若 +苦 +苫 +苭 +苯 +英 +苴 +苷 +苹 +苺 +苻 +苼 +苾 +茀 +茁 +茂 +范 +茄 +茅 +茆 +茇 +茈 +茉 +茌 +茎 +茏 +茑 +茔 +茕 +茗 +茚 +茛 +茜 +茝 +茧 +茨 +茫 +茬 +茭 +茯 +茱 +茳 +茴 +茵 +茶 +茸 +茹 +茺 +茼 +荀 +荃 +荄 +荅 +荆 +荇 +荈 +草 +荏 +荐 +荑 +荒 +荔 +荖 +荘 +荚 +荛 +荜 +荞 +荟 +荠 +荡 +荣 +荤 +荥 +荦 +荧 +荨 +荩 +荪 +荫 +荬 +荭 +荮 +药 +荳 +荷 +荸 +荻 +荼 +荽 +莃 +莅 +莆 +莉 +莎 +莒 +莓 +莘 +莙 +莛 +莜 +莞 +莠 +莨 +莩 +莪 +莫 +莱 +莲 +莳 +莴 +莶 +获 +莸 +莹 +莺 +莼 +莽 +菀 +菁 +菂 +菅 +菇 +菈 +菉 +菊 +菌 +菏 +菑 +菓 +菔 +菖 +菘 +菜 +菝 +菟 +菠 +菡 +菩 +菪 +菫 +菰 +菱 +菲 +菸 +菹 +菽 +菿 +萁 +萃 +萄 +萆 +萋 +萌 +萍 +萎 +萏 +萑 +萘 +萜 +萝 +萢 +萤 +营 +萦 +萧 +萨 +萩 +萱 +萸 +萼 +落 +葆 +葎 +葑 +葖 +著 +葙 +葚 +葛 +葜 +葡 +葢 +董 +葩 +葫 +葬 +葭 +葱 +葳 +葵 +葶 +葸 +葺 +蒂 +蒇 +蒉 +蒋 +蒌 +蒍 +蒎 +蒐 +蒗 +蒙 +蒜 +蒟 +蒡 +蒨 +蒯 +蒲 +蒴 +蒸 +蒹 +蒺 +蒻 +蒽 +蒾 +蒿 +蓁 +蓂 +蓄 +蓇 +蓉 +蓊 +蓍 +蓐 +蓑 +蓓 +蓖 +蓝 +蓟 +蓠 +蓢 +蓣 +蓥 +蓦 +蓬 +蓼 +蓿 +蔀 +蔌 +蔑 +蔓 +蔗 +蔘 +蔚 +蔟 +蔡 +蔫 +蔬 +蔴 +蔵 +蔷 +蔸 +蔹 +蔺 +蔻 +蔼 +蔽 +蕃 +蕅 +蕈 +蕉 +蕊 +蕖 +蕗 +蕙 +蕞 +蕡 +蕤 +蕨 +蕰 +蕲 +蕴 +蕹 +蕺 +蕻 +蕾 +薄 +薅 +薆 +薇 +薏 +薙 +薛 +薜 +薢 +薤 +薨 +薪 +薫 +薬 +薮 +薯 +薰 +薷 +薹 +藁 +藉 +藏 +藐 +藓 +藕 +藜 +藟 +藠 +藤 +藦 +藨 +藩 +藻 +藿 +蘅 +蘑 +蘖 +蘧 +蘩 +蘸 +蘼 +虎 +虏 +虐 +虑 +虒 +虓 +虔 +虚 +虞 +虢 +虫 +虬 +虮 +虱 +虹 +虺 +虻 +虽 +虾 +虿 +蚀 +蚁 +蚂 +蚊 +蚋 +蚌 +蚍 +蚓 +蚕 +蚖 +蚜 +蚝 +蚡 +蚣 +蚤 +蚧 +蚨 +蚩 +蚪 +蚬 +蚯 +蚰 +蚱 +蚴 +蚵 +蚶 +蚺 +蛀 +蛄 +蛆 +蛇 +蛉 +蛊 +蛋 +蛎 +蛏 +蛐 +蛔 +蛙 +蛛 +蛞 +蛟 +蛤 +蛩 +蛭 +蛮 +蛰 +蛱 +蛲 +蛳 +蛴 +蛸 +蛹 +蛾 +蜀 +蜂 +蜃 +蜇 +蜈 +蜉 +蜊 +蜍 +蜑 +蜒 +蜓 +蜕 +蜗 +蜘 +蜚 +蜜 +蜞 +蜡 +蜢 +蜣 +蜥 +蜩 +蜮 +蜱 +蜴 +蜷 +蜻 +蜾 +蜿 +蝇 +蝈 +蝉 +蝌 +蝎 +蝓 +蝗 +蝙 +蝠 +蝣 +蝤 +蝥 +蝮 +蝰 +蝲 +蝴 +蝶 +蝻 +蝼 +蝽 +蝾 +螂 +螃 +螅 +螈 +螋 +融 +螓 +螟 +螣 +螨 +螫 +螬 +螭 +螯 +螳 +螵 +螺 +螽 +蟀 +蟆 +蟊 +蟋 +蟌 +蟑 +蟒 +蟛 +蟜 +蟠 +蟥 +蟪 +蟮 +蟳 +蟹 +蟾 +蠃 +蠊 +蠋 +蠍 +蠓 +蠕 +蠖 +蠡 +蠢 +蠲 +蠹 +蠼 +血 +衄 +衅 +衆 +行 +衍 +衎 +衔 +街 +衙 +衞 +衡 +衢 +衣 +补 +表 +衩 +衪 +衫 +衬 +衮 +衰 +衲 +衷 +衽 +衾 +衿 +袁 +袂 +袄 +袅 +袆 +袈 +袋 +袍 +袒 +袓 +袖 +袛 +袜 +袢 +袤 +袪 +被 +袭 +袮 +袱 +袴 +袷 +袼 +裁 +裂 +装 +裆 +裈 +裉 +裒 +裔 +裕 +裘 +裙 +裛 +裟 +裡 +裢 +裤 +裥 +裨 +裪 +裱 +裳 +裴 +裸 +裹 +裼 +裾 +褀 +褂 +褆 +褊 +褐 +褒 +褓 +褔 +褙 +褚 +褛 +褡 +褥 +褪 +褫 +褰 +褴 +褶 +襁 +襄 +襌 +襕 +襜 +襞 +襟 +襦 +襻 +西 +要 +覃 +覆 +覇 +覚 +覧 +覩 +観 +见 +观 +规 +觅 +视 +觇 +览 +觉 +觊 +觋 +觌 +觎 +觏 +觐 +觑 +角 +觚 +觜 +觞 +解 +觥 +触 +觯 +觳 +觽 +觿 +言 +訇 +訏 +訚 +訫 +訳 +訾 +詈 +詜 +詝 +詧 +詹 +誉 +誊 +誐 +誓 +説 +読 +諡 +諲 +諴 +謇 +謦 +譞 +警 +譬 +譲 +讌 +讠 +计 +订 +讣 +认 +讥 +讦 +讧 +讨 +让 +讪 +讫 +讬 +训 +议 +讯 +记 +讲 +讳 +讴 +讵 +讶 +讷 +许 +讹 +论 +讼 +讽 +设 +访 +诀 +证 +诂 +诃 +评 +诅 +识 +诈 +诉 +诊 +诋 +诌 +词 +诎 +诏 +译 +诒 +诓 +诔 +试 +诖 +诗 +诘 +诙 +诚 +诛 +诜 +话 +诞 +诟 +诠 +诡 +询 +诣 +诤 +该 +详 +诧 +诨 +诩 +诫 +诬 +语 +诮 +误 +诰 +诱 +诲 +诳 +说 +诵 +诶 +请 +诸 +诹 +诺 +读 +诼 +诽 +课 +诿 +谀 +谁 +谂 +调 +谄 +谅 +谆 +谇 +谈 +谊 +谋 +谌 +谍 +谎 +谏 +谐 +谑 +谒 +谓 +谔 +谕 +谖 +谗 +谘 +谙 +谚 +谛 +谜 +谝 +谞 +谟 +谠 +谡 +谢 +谣 +谤 +谥 +谦 +谧 +谨 +谩 +谪 +谫 +谬 +谭 +谮 +谯 +谰 +谱 +谲 +谳 +谴 +谵 +谶 +谷 +谿 +豁 +豆 +豇 +豉 +豊 +豌 +豕 +豚 +象 +豢 +豨 +豪 +豫 +豳 +豸 +豹 +豺 +貂 +貅 +貉 +貊 +貌 +貐 +貔 +貘 +賨 +賸 +贇 +贝 +贞 +负 +贠 +贡 +财 +责 +贤 +败 +账 +货 +质 +贩 +贪 +贫 +贬 +购 +贮 +贯 +贰 +贱 +贲 +贳 +贴 +贵 +贶 +贷 +贸 +费 +贺 +贻 +贼 +贽 +贾 +贿 +赀 +赁 +赂 +赃 +资 +赅 +赈 +赉 +赊 +赋 +赌 +赍 +赎 +赏 +赐 +赑 +赓 +赔 +赕 +赖 +赘 +赙 +赚 +赛 +赜 +赝 +赞 +赟 +赠 +赡 +赢 +赣 +赤 +赦 +赧 +赪 +赫 +赭 +赮 +走 +赳 +赴 +赵 +赶 +起 +趁 +趄 +超 +越 +趋 +趔 +趟 +趣 +趯 +趱 +足 +趴 +趵 +趸 +趺 +趼 +趾 +趿 +跂 +跃 +跄 +跆 +跋 +跌 +跎 +跏 +跑 +跖 +跗 +跚 +跛 +距 +跞 +跟 +跢 +跣 +跤 +跨 +跩 +跪 +跫 +跬 +路 +跳 +践 +跶 +跷 +跸 +跹 +跺 +跻 +跽 +踅 +踉 +踊 +踌 +踏 +踔 +踝 +踞 +踟 +踢 +踣 +踩 +踪 +踬 +踮 +踯 +踰 +踱 +踵 +踹 +踽 +蹀 +蹁 +蹂 +蹄 +蹇 +蹈 +蹉 +蹊 +蹋 +蹑 +蹒 +蹓 +蹙 +蹚 +蹟 +蹦 +蹩 +蹬 
+蹭 +蹰 +蹲 +蹴 +蹶 +蹻 +蹼 +蹿 +躁 +躄 +躅 +躇 +躏 +躐 +躔 +躜 +躞 +身 +躬 +躯 +躲 +躺 +転 +軽 +輋 +轘 +车 +轧 +轨 +轩 +轫 +转 +轭 +轮 +软 +轰 +轱 +轲 +轳 +轴 +轵 +轶 +轸 +轹 +轺 +轻 +轼 +载 +轾 +轿 +辂 +较 +辄 +辅 +辆 +辇 +辈 +辉 +辊 +辋 +辍 +辎 +辏 +辐 +辑 +输 +辔 +辕 +辖 +辗 +辘 +辙 +辚 +辛 +辜 +辞 +辟 +辣 +辨 +辩 +辫 +辰 +辱 +辶 +边 +辺 +辻 +込 +辽 +达 +辿 +迁 +迂 +迄 +迅 +过 +迈 +迍 +迎 +运 +近 +迓 +返 +迕 +还 +这 +进 +远 +违 +连 +迟 +迢 +迤 +迥 +迦 +迨 +迩 +迪 +迫 +迭 +迮 +述 +迳 +迷 +迸 +迹 +追 +退 +送 +适 +逃 +逄 +逅 +逆 +选 +逊 +逋 +逍 +透 +逐 +逑 +递 +途 +逖 +逗 +這 +通 +逛 +逝 +逞 +速 +造 +逡 +逢 +逦 +逭 +逮 +逯 +進 +逵 +逶 +逸 +逹 +逺 +逻 +逼 +逾 +遁 +遂 +遄 +遇 +遍 +遏 +遐 +遑 +遒 +道 +遗 +遘 +遛 +遢 +遣 +遥 +遨 +遭 +遮 +遯 +遴 +遵 +遶 +遹 +遽 +避 +邀 +邂 +邃 +邅 +邈 +邉 +邋 +邑 +邓 +邕 +邗 +邙 +邛 +邝 +邠 +邡 +邢 +那 +邦 +邨 +邪 +邬 +邮 +邯 +邰 +邱 +邳 +邴 +邵 +邶 +邸 +邹 +邺 +邻 +邽 +邾 +郁 +郄 +郅 +郇 +郈 +郊 +郎 +郏 +郐 +郑 +郓 +郕 +郗 +郚 +郛 +郜 +郝 +郞 +郡 +郢 +郤 +郦 +郧 +部 +郪 +郫 +郭 +郯 +郴 +郷 +郸 +都 +郾 +郿 +鄀 +鄂 +鄄 +鄗 +鄘 +鄙 +鄚 +鄜 +鄞 +鄠 +鄢 +鄣 +鄩 +鄫 +鄮 +鄯 +鄱 +鄹 +酂 +酃 +酆 +酉 +酊 +酋 +酌 +配 +酎 +酐 +酒 +酗 +酚 +酝 +酞 +酡 +酢 +酣 +酤 +酥 +酩 +酪 +酬 +酮 +酯 +酰 +酱 +酲 +酴 +酵 +酶 +酷 +酸 +酹 +酺 +酽 +酾 +酿 +醂 +醅 +醇 +醉 +醋 +醌 +醍 +醐 +醑 +醒 +醚 +醛 +醢 +醣 +醪 +醭 +醮 +醯 +醴 +醵 +醺 +醿 +釆 +采 +釉 +释 +里 +重 +野 +量 +金 +釜 +釭 +釿 +鈇 +鈈 +鈊 +鈎 +鈡 +鉄 +鉏 +鉨 +鉴 +鉷 +銎 +銙 +銭 +銮 +鋆 +鋈 +鋐 +鋗 +鋬 +鋹 +錞 +錡 +錤 +録 +錾 +鍀 +鍪 +鎌 +鎏 +鎚 +鏊 +鏐 +鏖 +鐏 +鑨 +鑫 +钅 +钆 +钇 +针 +钉 +钊 +钋 +钌 +钍 +钎 +钏 +钐 +钒 +钓 +钔 +钕 +钖 +钗 +钙 +钚 +钛 +钜 +钝 +钞 +钟 +钠 +钡 +钢 +钣 +钤 +钥 +钦 +钧 +钨 +钩 +钪 +钫 +钬 +钭 +钮 +钯 +钰 +钱 +钲 +钳 +钴 +钵 +钶 +钷 +钹 +钺 +钻 +钼 +钽 +钾 +钿 +铀 +铁 +铂 +铃 +铄 +铅 +铆 +铈 +铉 +铊 +铋 +铌 +铍 +铎 +铐 +铑 +铒 +铓 +铕 +铖 +铗 +铙 +铚 +铛 +铜 +铝 +铟 +铠 +铡 +铢 +铣 +铤 +铥 +铦 +铧 +铨 +铩 +铪 +铫 +铬 +铭 +铮 +铯 +铰 +铱 +铲 +铳 +铵 +银 +铷 +铸 +铺 +铼 +铽 +链 +铿 +销 +锁 +锂 +锃 +锄 +锅 +锆 +锇 +锈 +锉 +锋 +锌 +锍 +锎 +锏 +锐 +锑 +锒 +锔 +锕 +锖 +锗 +锘 +错 +锚 +锛 +锜 +锝 +锞 +锟 +锠 +锡 +锢 +锣 +锤 +锥 +锦 +锨 +锩 +锪 +锫 +锬 +锭 +键 +锯 +锰 +锱 +锲 +锳 +锴 +锵 +锶 +锷 +锸 +锹 +锺 +锻 +锼 +锽 +镀 +镁 +镂 +镄 +镅 +镆 +镇 +镈 +镉 +镊 +镋 +镌 +镍 +镎 +镏 +镐 +镑 +镒 +镓 +镔 +镕 +镖 +镗 +镘 +镙 +镚 +镛 +镜 +镝 +镞 +镟 +镠 +镡 +镢 +镣 +镥 +镦 +镧 +镨 +镪 +镫 +镬 +镭 +镯 +镰 +镱 +镲 +镳 +镶 +长 +開 +閟 +関 +閦 +闇 +闍 +闘 +门 +闩 +闪 +闫 +闭 +问 +闯 +闰 +闱 +闲 +闳 +间 +闵 +闶 +闷 +闸 +闹 +闺 +闻 +闼 +闽 +闾 +闿 +阀 +阁 +阂 +阃 +阄 +阅 +阆 +阇 +阈 +阉 +阊 +阋 +阌 +阍 +阎 +阏 +阐 +阑 +阒 +阓 +阔 +阕 +阖 +阗 +阙 +阚 +阛 +阜 +阝 +队 +阡 +阪 +阮 +阱 +防 +阳 +阴 +阵 +阶 +阻 +阼 +阿 +陀 +陂 +附 +际 +陆 +陇 +陈 +陉 +陋 +陌 +降 +限 +陔 +陕 +陛 +陞 +陟 +陡 +院 +除 +陨 +险 +陪 +陬 +陲 +陵 +陶 +陷 +険 +隂 +隅 +隆 +隈 +隋 +隍 +随 +隐 +隔 +隗 +隘 +隙 +障 +隠 +隣 +隧 +隰 +隳 +隶 +隹 +隼 +隽 +难 +雀 +雁 +雄 +雅 +集 +雇 +雉 +雊 +雌 +雍 +雎 +雏 +雑 +雒 +雕 +雠 +雨 +雩 +雪 +雫 +雯 +雱 +雳 +零 +雷 +雹 +雾 +需 +霁 +霂 +霄 +霅 +霆 +震 +霈 +霉 +霊 +霍 +霎 +霏 +霓 +霖 +霙 +霜 +霞 +霪 +霭 +霰 +露 +霸 +霹 +霾 +靑 +青 +靓 +靖 +静 +靛 +非 +靠 +靡 +面 +靥 +革 +靬 +靳 +靴 +靶 +靺 +靼 +鞅 +鞋 +鞍 +鞑 +鞔 +鞘 +鞞 +鞠 +鞣 +鞥 +鞨 +鞫 +鞬 +鞭 +鞮 +鞯 +鞴 +韘 +韡 +韦 +韧 +韩 +韪 +韫 +韬 +韭 +音 +韵 +韶 +頔 +頞 +頠 +頫 +頵 +頼 +顒 +顔 +顕 +顗 +页 +顶 +顷 +顸 +项 +顺 +须 +顼 +顽 +顾 +顿 +颀 +颁 +颂 +颃 +预 +颅 +领 +颇 +颈 +颉 +颊 +颋 +颌 +颍 +颎 +颏 +颐 +频 +颓 +颔 +颖 +颗 +题 +颙 +颚 +颛 +颜 +额 +颞 +颟 +颠 +颡 +颢 +颤 +颦 +颧 +风 +飏 +飐 +飑 +飒 +飓 +飕 +飖 +飘 +飙 +飚 +飞 +食 +飧 +飨 +餍 +餐 +餗 +餮 +饔 +饕 +饣 +饥 +饦 +饧 +饨 +饩 +饪 +饫 +饬 +饭 +饮 +饯 +饰 +饱 +饲 +饴 +饵 +饶 +饷 +饸 +饹 +饺 +饼 +饽 +饿 +馀 +馁 +馃 +馄 +馅 +馆 +馇 +馈 +馊 +馋 +馍 +馏 +馐 +馑 +馒 +馓 +馔 +馕 +首 +馗 +馘 +香 +馥 +馨 +駄 +駅 +駆 +騄 +騑 +騒 +験 +驎 +驒 +驩 +马 +驭 +驮 +驯 +驰 +驱 +驳 +驴 +驶 +驷 +驸 +驹 +驺 +驻 +驼 +驽 +驾 +驿 +骀 +骁 +骂 +骃 +骄 +骅 +骆 +骇 +骈 +骉 +骊 +骋 +验 +骍 +骎 +骏 +骐 +骑 +骓 +骕 +骖 +骗 +骘 +骙 +骚 +骛 +骜 +骝 +骞 +骟 +骠 +骡 +骢 +骤 +骥 +骧 +骨 +骰 +骱 +骶 +骷 +骸 +骺 +骼 +髀 +髁 +髂 +髃 +髅 +髈 +髋 +髌 +髎 +髑 +髓 +高 +髙 +髡 +髦 +髪 +髫 +髭 +髯 +髹 +髻 +鬃 +鬈 +鬐 +鬓 +鬘 +鬟 +鬣 +鬯 +鬲 +鬶 +鬻 +鬼 +魁 +魂 +魃 +魄 +魅 +魆 +魇 +魈 +魉 +魋 +魍 +魏 +魑 +魔 +魟 +鮀 +鮈 +鮋 +鮟 +鮠 +鮨 +鮰 +鰕 +鰤 +鱀 +鱇 +鱓 +鱬 +鱲 +鱻 +鱼 +鱿 +鲀 +鲁 +鲂 +鲃 +鲅 +鲆 +鲇 +鲈 +鲉 +鲊 +鲋 +鲌 +鲍 +鲎 +鲏 +鲐 +鲑 +鲔 +鲕 +鲗 +鲘 +鲙 +鲚 +鲛 +鲜 +鲞 +鲟 +鲠 +鲡 +鲢 +鲣 +鲤 +鲥 +鲦 +鲧 +鲨 +鲩 +鲫 +鲭 +鲮 +鲱 +鲲 +鲳 +鲴 +鲵 +鲶 +鲷 +鲸 +鲹 +鲺 +鲻 +鲼 +鲽 +鲿 +鳀 +鳃 +鳄 +鳅 +鳇 +鳉 +鳊 +鳌 +鳍 +鳎 +鳏 +鳐 +鳑 +鳓 +鳔 +鳕 +鳖 +鳗 +鳙 +鳚 +鳜 +鳝 +鳞 +鳟 +鳡 +鳢 +鳣 +鳯 +鳽 +鳾 +鴂 +鴞 +鴷 +鵖 
+鵙 +鵟 +鵺 +鶒 +鶲 +鷇 +鷉 +鷟 +鸂 +鸊 +鸑 +鸟 +鸠 +鸡 +鸢 +鸣 +鸥 +鸦 +鸨 +鸩 +鸪 +鸫 +鸬 +鸭 +鸮 +鸯 +鸰 +鸱 +鸲 +鸳 +鸵 +鸶 +鸷 +鸸 +鸹 +鸺 +鸻 +鸽 +鸾 +鸿 +鹀 +鹁 +鹂 +鹃 +鹄 +鹅 +鹆 +鹇 +鹈 +鹉 +鹊 +鹋 +鹌 +鹍 +鹎 +鹏 +鹑 +鹓 +鹕 +鹖 +鹗 +鹘 +鹚 +鹛 +鹜 +鹞 +鹟 +鹠 +鹡 +鹢 +鹣 +鹤 +鹦 +鹧 +鹨 +鹩 +鹪 +鹫 +鹬 +鹭 +鹮 +鹰 +鹱 +鹳 +鹾 +鹿 +麂 +麇 +麈 +麋 +麐 +麑 +麒 +麓 +麝 +麟 +麤 +麦 +麴 +麸 +麹 +麻 +麽 +麾 +麿 +黄 +黉 +黍 +黎 +黏 +黐 +黑 +黒 +黔 +默 +黙 +黛 +黜 +黝 +黟 +黠 +黡 +黢 +黥 +黧 +黩 +黯 +黻 +黼 +黾 +鼋 +鼍 +鼎 +鼐 +鼓 +鼗 +鼙 +鼠 +鼢 +鼩 +鼬 +鼯 +鼱 +鼷 +鼹 +鼻 +鼽 +鼾 +齁 +齐 +齑 +齢 +齮 +齿 +龀 +龁 +龃 +龄 +龅 +龆 +龇 +龈 +龉 +龊 +龋 +龌 +龑 +龘 +龙 +龚 +龛 +龟 +龠 +龢 +A +B +C +D +E +F +G +H +I +J +K +L +M +N +O +P +Q +R +S +T +U +V +W +X +Y +Z +a +b +c +d +e +f +g +h +i +j +k +l +m +n +o +p +q +r +s +t +u +v +w +x +y +z +AA +AB +AC +AD +AE +AF +AG +AH +AI +AJ +AK +AL +AM +AN +AP +AQ +AR +AS +AT +AU +AV +AW +AX +AZ +Al +An +Au +Aw +BA +BB +BC +BD +BE +BF +BG +BH +BI +BJ +BK +BL +BM +BN +BO +BP +BQ +BR +BS +BT +BU +BV +BW +BY +Bo +Br +Bu +CA +CB +CC +CD +CE +CF +CG +CH +CI +CJ +CK +CL +CM +CN +CO +CP +CQ +CR +CS +CT +CU +CV +CW +CX +CY +CZ +Ca +Ch +Cl +Co +Cu +DA +DB +DC +DD +DE +DF +DG +DH +DI +DJ +DK +DL +DM +DN +DO +DQ +DR +DS +DT +DV +DW +DX +DY +DZ +Da +De +Di +Do +Dr +Du +EA +EB +EC +ED +EE +EF +EG +EH +EI +EK +EL +EM +EN +EP +EQ +ER +ES +ET +EU +EV +EW +EX +EZ +Ed +En +Ev +Ex +FA +FB +FC +FD +FE +FF +FG +FH +FI +FJ +FL +FM +FN +FO +FP +FR +FS +FT +FU +FW +FX +FY +FZ +Fa +Fi +Fl +Fo +Fr +Fu +GA +GB +GC +GD +GE +GF +GG +GH +GI +GJ +GK +GL +GM +GN +GO +GP +GQ +GR +GS +GT +GU +GW +GX +GY +GZ +Ga +Go +Gr +Gu +HA +HB +HC +HD +HE +HF +HG +HH +HI +HJ +HK +HL +HO +HP +HQ +HR +HS +HT +HU +HV +HW +HX +HY +HZ +Ha +He +Hi +Ho +Hu +Hz +IB +IC +ID +IE +IF +IG +IH +II +IK +IL +IM +IN +IO +IP +IQ +IR +IS +IT +IU +IV +IX +If +In +JA +JB +JC +JD +JF +JG +JH +JI +JJ +JK +JL +JM +JO +JP +JQ +JR +JS +JT +JU +JW +JX +JY +JZ +Ja +Ji +Jo +Ju +KA +KB +KC +KD +KE +KF +KG +KH +KI +KJ +KK +KL +KM +KN +KO +KP +KR +KS +KT +KV +KW +KX +KY +KZ +LA +LB +LC +LD +LE +LF +LG +LH +LI +LJ +LK +LL +LM +LN +LO +LP +LQ +LR +LS +LT +LU +LV +LW +LX +LY +LZ +La +Le +Li +Lo +Lu +MA +MB +MC +MD +ME +MF +MG +MH +MI +MJ +MK +ML +MM +MN +MO +MP +MQ +MR +MS +MT +MU +MV +MW +MX +MY +Ma +Me +Mi +Mo +Mu +My +NA +NB +NC +ND +NE +NF +NG +NH +NI +NJ +NK +NL +NN +NO +NP +NR +NS +NT +NU +NV +NW +NX +NY +NZ +Na +Ne +No +Nu +OA +OB +OC +OD +OE +OF +OG +OH +OK +OL +OM +ON +OO +OP +OR +OS +OT +OU +OV +OZ +Of +Oh +On +Op +Or +Ou +Ox +PA +PB +PC +PD +PE +PF +PG +PH +PI +PJ +PK +PL +PM +PN +PO +PP +PQ +PR +PS +PT +PU +PV +PW +PX +Pa +Ph +Pl +Po +Pr +Pu +QA +QB +QC +QE +QF +QG +QJ +QL +QQ +QR +QS +QT +QU +QW +QY +Qi +Qu +RA +RB +RC +RE +RF +RG +RH +RI +RJ +RK +RL +RM +RN +RO +RP +RQ +RR +RS +RT +RV +RW +RX +RZ +Ra +Re +Ro +Ru +SA +SB +SC +SD +SE +SF +SG +SH +SI +SJ +SK +SL +SM +SN +SO +SP +SQ +SR +SS +ST +SU +SV +SW +SX +SY +SZ +Sc +Sh +So +Sp +St +Su +Sw +Sy +TA +TB +TC +TD +TE +TF +TG +TH +TI +TJ +TK +TL +TM +TN +TO +TP +TQ +TR +TS +TT +TU +TV +TW +TX +TY +TZ +Th +To +Tr +Tw +UA +UC +UD +UE +UF +UG +UH +UI +UK +UL +UM +UN +UP +UR +US +UT +UU +UV +UW +UX +Ub +Un +Up +VA +VB +VC +VE +VF +VG +VH +VI +VJ +VK +VL +VM +VN +VO +VP +VR +VS +VT +VU +VV +VX +Vi +Vo +WA +WB +WC +WE +WH +WI +WJ +WL +WM +WN +WO +WQ +WR +WS +WT +WU +WW +WX +WZ +Wa +We +Wi +Wo +Wu +XB +XC +XD +XF +XG +XH +XI +XJ +XK +XL +XM +XO +XP +XQ +XR +XS +XT +XU +XV +XW +XX +XY +XZ +Xi +Xu +YA +YB +YC +YD +YE +YF +YG +YH +YJ +YL +YM +YO +YP +YS +YT +YU +YX +YY +YZ +Ya +Yo +Yu +ZA +ZB +ZC +ZD +ZE +ZF +ZG +ZH +ZI +ZJ +ZL +ZM +ZN +ZO +ZQ +ZR +ZS +ZU +ZW +ZX +ZY +ZZ +Zh +ab +aj +an +ap +ar +bb +be +bj +bo +bu +by +ca +cb +cf +ch +cl +cm +co +cp 
+cv +dB +da +de +di +dj +dn +do +dr +dv +ed +em +en +ep +eq +ev +ex +ez +fa +fe +ff +fi +fl +fo +fr +fu +gb +gd +gh +gi +go +gp +gr +gu +gz +ha +he +hi +ho +hp +hz +iP +iT +ib +ic +id +if +ig +im +in +io +ip +iq +is +it +jQ +ja +ji +jj +jo +jq +ju +kJ +kN +kW +kg +kn +kz +la +ld +le +lg +li +ll +lo +lp +lz +ma +mb +me +mi +mm +mo +mp +mq +mu +mv +my +na +nb +ng +no +nv +ob +of +oh +ok +ol +on +op +or +ou +ow +oz +pH +pa +pc +ph +pk +pl +po +pp +pr +pu +pv +qf +qq +qu +qz +ra +re +rn +ro +rq +se +sh +sk +so +sp +sq +st +su +sw +sz +th +ti +to +tr +tv +tw +ub +uc +uf +ui +uk +un +up +us +uv +ux +uz +vc +vi +vo +vr +wa +we +wh +wi +wo +wr +ww +xj +xq +xx +ya +ye +yj +yo +yu +yy +yz +zf +zh +zi +zj +zq +zu +zz +AAA +AAC +ABA +ABB +ABC +ABO +ABS +ABT +ACA +ACC +ACD +ACE +ACG +ACK +ACL +ACM +ACP +ACR +ACS +ACT +ADA +ADC +ADD +ADF +ADI +ADO +ADP +ADR +ADS +ADV +AED +AES +AFC +AFP +AFS +AGB +AGC +AGE +AGM +AGP +AGV +AIA +AIC +AIG +AIM +AIP +AIR +AIS +AIX +AKB +AKM +ALA +ALL +ALT +AMA +AMC +AMD +AMG +AMI +AML +AMP +AMR +AMS +AMT +AMX +AND +AOC +AOE +AOL +APA +APC +APE +APG +API +APK +APL +APM +APP +APS +APT +APU +ARA +ARC +ARE +ARM +ARP +ART +ASA +ASC +ASF +ASM +ASP +ASR +AST +ATA +ATC +ATF +ATI +ATK +ATM +ATP +ATS +ATV +ATX +AUC +AUG +AUX +AVC +AVG +AVI +AVR +AVS +AVX +AWM +AWS +All +And +Ang +App +Aqu +BAC +BAD +BAE +BAR +BAT +BAU +BBA +BBB +BBC +BBE +BBQ +BBS +BBT +BCD +BEA +BEC +BEI +BET +BGA +BGM +BGP +BIG +BIM +BIS +BLG +BMC +BMD +BMG +BMI +BMP +BMW +BMX +BNC +BOD +BOM +BOT +BOX +BOY +BPM +BPO +BRN +BRT +BSA +BSC +BSD +BSI +BSM +BSP +BSS +BTC +BTR +BTS +BTV +BUG +BUN +BUS +BWV +Bur +Bus +But +CAA +CAC +CAD +CAE +CAI +CAJ +CAM +CAN +CAP +CAR +CAS +CAT +CBA +CBC +CBD +CBN +CBR +CBS +CCA +CCC +CCD +CCF +CCG +CCI +CCK +CCM +CCN +CCP +CCS +CDC +CDM +CDN +CDO +CDP +CDR +CDS +CEA +CEC +CEO +CES +CET +CFA +CFC +CFD +CFO +CFR +CGI +CHA +CHM +CHO +CIA +CIC +CID +CIE +CIF +CIK +CIO +CIP +CIS +CLA +CLI +CLM +CLS +CMA +CMC +CME +CML +CMM +CMO +CMP +CMS +CMV +CNC +CNG +CNN +CNS +COB +COC +COD +COM +CON +COO +COP +COS +COX +CPA +CPC +CPE +CPI +CPL +CPM +CPP +CPR +CPS +CPU +CQC +CRC +CRM +CRP +CRS +CRT +CSA +CSF +CSI +CSM +CSP +CSR +CSS +CST +CTA +CTC +CTI +CTO +CTP +CTS +CUB +CUT +CVD +CVN +CVS +CVT +CXW +CYP +Car +Cha +Chr +Chu +Com +Con +Cou +Cur +DAB +DAC +DAO +DAS +DAT +DAY +DBA +DBM +DCD +DCE +DCF +DCS +DCT +DDC +DDD +DDG +DDN +DDR +DDS +DDT +DEA +DEC +DEM +DES +DFS +DFT +DHA +DHL +DIC +DID +DIF +DIN +DIP +DIV +DIY +DLC +DLL +DLP +DLT +DMA +DMC +DMD +DMF +DMI +DMO +DMZ +DNA +DNF +DNS +DNV +DOC +DOI +DOM +DON +DOS +DOT +DPI +DPP +DPS +DRM +DRX +DSA +DSC +DSG +DSL +DSM +DSP +DSS +DTC +DTE +DTM +DTS +DTU +DVB +DVD +DVI +DVR +DWG +DYG +Day +Div +Don +Dou +Dow +EAN +EAP +EBD +EBS +ECC +ECM +ECO +ECT +ECU +ECW +EDA +EDG +EDI +EDM +EDP +EDR +EEG +EEP +EFR +EGF +EHS +EIA +EJB +EMA +EMC +EMI +EMP +EMS +END +EOS +EPA +EPC +EPO +EPR +EPS +ERP +ESC +ESD +ESI +ESL +ESP +ESR +EST +ETC +ETF +ETH +ETL +ETS +EVA +EVE +EVO +EXE +EXO +EXP +EYE +Eff +Ell +Emb +Emp +End +Eng +Equ +Eur +Eva +Exc +Exp +FAA +FAB +FAG +FAL +FAN +FAO +FAQ +FAT +FBI +FCA +FCC +FCI +FCS +FDA +FDD +FDI +FEM +FES +FET +FFT +FGO +FHD +FIA +FLV +FLY +FMS +FNC +FOB +FOF +FOR +FOX +FPC +FPS +FPX +FRP +FSA +FSB +FSC +FSH +FTA +FTC +FTP +FUE +FUN +Fin +Fiv +Fly +For +Fou +Fuj +Fun +Fut +GAP +GAT +GAY +GBA +GBK +GBT +GBU +GCC +GCS +GCT +GDI +GDP +GEN +GEO +GET +GFP +GHz +GIA +GIF +GIS +GLA +GLC +GLP +GLS +GMA +GMC +GMP +GMS +GMT +GMV +GND +GNP +GNU +GOD +GOT +GPA +GPL +GPS +GPT +GPU +GRC +GRE +GRF +GSH +GSM +GSP +GTA +GTI +GTO +GTP +GTR +GTS +GTX +GUI 
+Giv +Gmb +Gua +Gui +Gun +Guo +Guy +HAD +HAL +HBA +HBO +HBV +HBs +HCG +HCI +HCV +HCl +HDD +HDL +HDR +HDV +HEY +HFC +HGH +HGT +HID +HIP +HIS +HIT +HIV +HLA +HMG +HMI +HMS +HOP +HOT +HOW +HPC +HPV +HRC +HRT +HSE +HSK +HSV +HTC +HUB +HUD +HVG +Haz +Her +Hom +Hon +Hou +How +Hua +Hub +Hum +Hun +IAI +IAS +IAT +IBC +IBF +IBM +ICA +ICC +ICD +ICE +ICO +ICP +ICQ +ICS +ICT +ICU +IDC +IDD +IDE +IDF +IDG +IDS +IEC +IET +IFA +IFC +IFI +IFN +IGF +IGN +IIA +III +IIS +IKO +IMA +IMC +IMD +IME +IMF +IMG +IMO +IMS +IMT +INA +INC +INF +ING +INS +INT +IOS +IPA +IPC +IPO +IPS +IPX +IRC +IRI +ISA +ISI +ISM +ISO +ISP +ITC +ITF +ITO +ITS +ITT +ITU +ITV +IVR +Imp +InC +Inf +Inj +Int +JAR +JBL +JBT +JCB +JCR +JDB +JDG +JET +JGJ +JIS +JIT +JKL +JOE +JPG +JSF +JSP +JST +JTA +JVC +JVM +JYJ +JYP +Jac +Jam +Jan +Jap +Jav +Jay +Jin +Joh +Jon +Jul +Jun +Jus +KAB +KAT +KBS +KDF +KDJ +KEY +KFC +KFR +KID +KIS +KJm +KOF +KOH +KOL +KPI +KPL +KTV +KVM +Kin +Kon +Kur +LAB +LAN +LBS +LCA +LCD +LCK +LCS +LDA +LDH +LDL +LDP +LED +LEE +LEO +LES +LET +LGA +LGD +LIN +LIU +LLC +LME +LMS +LNG +LOF +LOL +LOW +LPG +LPL +LPR +LRC +LSA +LSD +LSI +LSP +LTD +LTE +LUC +LUN +LVM +Laz +Lib +Lif +Lin +Liu +Liz +Lon +Lou +Low +Luc +Lum +Luo +Lux +MAC +MAD +MAG +MAN +MAO +MAP +MAR +MAS +MAT +MAX +MAY +MBA +MBC +MBO +MBR +MBS +MCA +MCC +MCM +MCN +MCP +MCS +MCU +MDA +MDI +MDL +MDR +MDS +MEN +MES +MFA +MFC +MHC +MHz +MIB +MIC +MID +MIL +MIN +MIS +MIT +MIX +MKV +MLC +MLF +MMA +MMC +MMI +MMO +MMS +MMX +MOD +MOM +MOS +MOV +MPA +MPC +MPG +MPI +MPS +MPV +MPa +MRC +MRI +MRO +MRP +MSA +MSC +MSI +MSN +MTI +MTK +MTS +MTU +MTV +MVC +MVP +Mac +Mag +Maj +Man +Mar +Max +May +Mic +Min +Mon +Mou +Mur +NAD +NAS +NAT +NBA +NBC +NBL +NCT +NDS +NEC +NEO +NES +NET +NEW +NEX +NFA +NFC +NFL +NFS +NGC +NGN +NGO +NHK +NHL +NIC +NIH +NLP +NME +NMR +NOT +NOW +NOX +NOx +NPC +NPN +NPR +NSA +NSC +NSF +NSK +NTN +NTP +NTT +NTV +NVH +NWA +NXT +NYT +Nic +Nob +Nor +Nov +Now +Nur +OAD +OBD +OCG +OCP +OCR +OCT +ODM +OEM +OFF +OGG +OLE +OMG +ONE +ONU +OOO +OPC +OPP +ORC +OSD +OSI +OSS +OST +OTA +OTC +OTG +OTT +OUT +OVA +OVP +Obj +Off +Oly +Ope +Oph +Opt +Our +Out +Ove +PAC +PAD +PAH +PAL +PAM +PAN +PAS +PBS +PBT +PCA +PCB +PCD +PCI +PCL +PCM +PCR +PCS +PCT +PDA +PDB +PDC +PDD +PDF +PDM +PDP +PDU +PEG +PEP +PER +PES +PET +PFA +PFC +PGA +PGC +PHP +PHS +PIC +PID +PIM +PIN +PKI +PLA +PLC +PLD +PLL +PLM +PMC +PMI +PMP +PND +PNG +PNP +POE +POM +PON +POP +POS +PPA +PPC +PPG +PPH +PPI +PPM +PPP +PPR +PPS +PPT +PPV +PRL +PRO +PSA +PSD +PSE +PSG +PSI +PSK +PSP +PSS +PSV +PSW +PSY +PTA +PTC +PTH +PTT +PUB +PVA +PVC +PVE +PVP +PWM +Par +Per +Pic +Pow +Pro +Pur +QAM +QDI +QFP +QGh +QOS +QPI +QPS +QRS +QTL +Qin +Qua +Que +RAM +RAP +RAR +RAS +RAW +RBC +RCA +RCS +RDF +RDS +RED +REF +REG +REM +REX +RFC +RGB +RIA +RIM +RIP +RMB +RMS +RNA +RNG +ROC +ROE +ROI +ROM +RPC +RPG +RPM +RRW +RSA +RSC +RSI +RSS +RTA +RTC +RTK +RTP +RTS +RTU +RTX +RUN +RUS +Ray +Raz +Ric +Riv +Rom +Rou +Rub +Run +Rus +SAC +SAE +SAM +SAN +SAO +SAP +SAR +SAS +SAT +SAY +SBR +SBS +SCE +SCH +SCI +SCM +SCP +SCR +SDH +SDI +SDK +SDR +SDS +SEA +SEC +SEE +SEM +SEO +SER +SET +SFC +SFP +SGH +SGI +SGS +SHA +SHE +SID +SIG +SIM +SIP +SIR +SIS +SKF +SKT +SKU +SKY +SLA +SLC +SLE +SLG +SLI +SLR +SLS +SMA +SMB +SMC +SMD +SMG +SMI +SMP +SMS +SMT +SNK +SNP +SNR +SNS +SOA +SOC +SOD +SOI +SOP +SOS +SPA +SPC +SPD +SPE +SPF +SPI +SPR +SPS +SPT +SPV +SQL +SQU +SRS +SRT +SSA +SSC +SSD +SSE +SSH +SSL +SSR +SSS +SST +STC +STD +STK +STL +STM +STN +STP +STR +STS +SUB +SUN +SUV +SVC +SVD +SVG +SVM +SWF +SXG +SYN +SYS +Sch +Ser +She +Siz +Som +Sou +Squ +Sub +Sum +Sun 
+Sup +Suz +TAB +TAC +TAG +TAO +TBC +TBM +TBS +TCG +TCL +TCM +TCO +TCP +TCR +TCS +TCT +TDD +TDI +TDM +TDP +TDS +TEC +TED +TEL +TEM +TES +TEU +TEX +TFT +TGA +TGF +TGV +THD +THE +TIA +TIF +TKO +TLC +TLS +TMD +TMP +TMS +TMT +TNA +TNF +TNT +TOC +TOD +TOE +TOM +TOP +TPC +TPE +TPM +TPO +TPP +TPR +TPS +TPU +TQM +TSC +TSH +TSI +TSP +TTL +TTS +TTT +TUV +TVB +TVC +TVP +TVS +TWO +TXT +Tay +The +Tom +Tou +Tow +Tur +UAR +UBC +UCC +UCL +UDP +UFC +UFO +UGC +UHF +UIP +UMD +UML +UNI +UPC +UPS +URL +USA +USB +USD +USM +USP +USS +UTC +UTF +UTP +UTR +UVA +UVB +UWB +UZI +Umb +Uni +Upp +Uzi +VAC +VAR +VBA +VBR +VBS +VCC +VCD +VCR +VDC +VDE +VGA +VHF +VHS +VIA +VII +VIP +VIS +VMw +VOA +VOB +VOC +VOD +VOL +VPN +VPS +VRP +VSS +VTE +VVT +Ver +Vic +Vid +Vis +Viv +WAN +WAP +WAV +WAY +WBA +WBC +WBO +WBS +WCG +WCW +WDM +WDS +WEB +WEP +WEY +WGK +WHO +WIN +WMA +WMS +WMV +WOW +WPA +WPF +WPS +WRC +WSA +WTA +WTI +WTO +WVG +WWE +WWF +WWW +Way +Wha +Whe +Whi +Who +Why +WiF +Win +Wiz +Wom +Wor +Wou +XGA +XII +XML +XPS +XXX +XYZ +YAG +YES +YOU +YZB +Yin +You +Yua +Yuk +Yun +ZIP +ZOL +Zer +Zha +Zhu +Zom +Zon +Zou +abb +abc +abo +abs +act +adj +aff +all +and +ang +any +app +aws +bbb +bbc +bbq +bbs +but +cAM +cDN +cGM +can +car +cba +cha +chi +col +com +con +cor +cou +cpi +cpu +dan +day +des +did +dif +dis +div +diy +doc +don +dow +eAA +eSA +ech +eff +emb +emp +end +eng +eqc +equ +euv +eve +exc +exe +exp +fac +fil +fin +fir +fiv +fla +fly +for +fox +fre +fri +gAS +gdp +gen +giv +gmp +gon +goo +got +gps +gra +gre +gro +had +har +has +hav +haz +her +his +hiv +hol +hom +hou +how +iBT +iOS +iPa +iPh +iPo +iSC +ima +imp +inc +inf +inj +int +ipa +iph +ipo +isb +iso +jam +jap +jav +jay +jus +kHz +kJm +kdj +kin +lay +laz +lck +lea +led +let +lib +lif +lin +liq +lis +lit +liv +liz +lly +lng +loc +lof +log +loo +los +low +mRN +mac +mad +maj +man +mar +mat +max +may +maz +mba +men +mic +min +mmH +mod +mon +mor +mys +nVI +nba +nex +nic +not +nov +now +nxp +obj +off +one +ope +opp +our +out +ove +par +pay +per +phe +php +piz +pla +pow +ppp +pre +pro +pvc +qHD +qgh +qua +que +qui +rRN +ray +raz +rea +rec +red +ref +reg +rem +rep +req +res +rev +ric +riv +rmb +rng +rom +rou +say +sch +sha +she +shi +sho +sim +sin +siz +som +sou +spa +spe +sql +squ +sta +ste +sto +str +sty +sub +suv +tRN +tha +the +thi +thr +tim +tip +top +tow +tpp +tra +tur +tuv +two +ubc +uiv +unc +und +uni +unk +ups +usb +uva +uvb +uzi +val +var +ver +vie +vip +vis +viv +wan +was +way +web +wer +wha +whi +who +why +wif +wit +wom +won +wor +wou +www +xin +xxx +yin +you +zha +zhi +zho +zhu +zon +zzf +zzy +AAAA +AACS +ABCD +ACCA +ACCE +ACCP +ACDC +ACGN +ACID +ACPI +ACTH +ADHD +ADPC +ADSL +AIDS +AJAX +ALPH +AMEX +AMOL +ANGE +ANSI +ANSY +APEC +APPL +APTE +ARDS +ARPA +ARPG +ASCE +ASCI +ASIA +ASIC +ASIN +ASME +ASSO +ASTM +ASUS +AUDI +AUTO +AVCH +AWAR +Andr +BABY +BACK +BAND +BANG +BANK +BASI +BASS +BATT +BEAS +BEAT +BEST +BEYO +BIGB +BIOS +BLAC +BLEA +BLOG +BLOO +BLUE +BOBO +BOOK +BOOL +BOOM +BOPP +BOSS +BOYS +BRAV +BREA +BUFF +Buck +Buff +Bull +Bung +Buzz +CADC +CALL +CAPC +CAPP +CARD +CASE +CASI +CAST +CATI +CATV +CAXA +CCFL +CCIE +CCNA +CCTV +CDMA +CEPA +CERN +CHAN +CHAP +CHAR +CHEN +CHIN +CHOR +CIMS +CIPA +CISC +CITE +CITY +CLAM +CLAN +CLAS +CLOS +CLUB +CMMB +CMMI +CMOS +CMYK +CNAS +CNBC +CNBL +CNKI +CNNI +COCO +CODE +COLL +COLO +COMB +COME +COMI +COMP +CONT +COOL +CORB +CORE +COSM +COSP +COST +COUN +COVI +CPLD +CREA +CROS +CSCD +CSDN +CSMA +CSOL +CSSC +CSTN +CTRL +CUBA +CUDA +CURR +CVBS +Chin +Chur +DANC +DARK +DARP +DASH +DATA +DAYS +DCDC +DDNS +DDOS +DDRI +DELL +DEMO +DESI 
+DEST +DHCP +DIGI +DIMM +DISC +DIVX +DLNA +DOHC +DOTA +DOWN +DRAG +DRAM +DREA +DRIV +DSLR +DVDC +DVGA +DWDM +DWOR +EAST +EASY +EBIT +ECMO +EDGE +EDIT +EDTA +EGFR +EINE +ELIS +ELLE +EMBA +ENER +ENGI +ENTE +EPDM +EPIS +EPON +EPSO +EPUB +ERCP +ERRO +ESET +ESPN +ETSI +EVDO +EVER +EXCE +EXIL +EXPO +Ever +Exch +Exer +FACE +FALS +FANS +FANU +FAST +FDDI +FIBA +FIDI +FIFA +FIFO +FILE +FINA +FIRE +FIRS +FISH +FIVE +FLAC +FLAS +FLOW +FMVP +FORT +FPGA +FREE +FROM +FTTH +FULL +FWVG +FXCM +Fuck +Full +Fund +Fung +Fuzz +GABA +GALA +GAME +GANK +GATT +GEAR +GENE +GHOS +GIRL +GLON +GMAT +GNSS +GOLD +GOOD +GOOG +GPRS +GREE +GROU +GSMG +GUCC +GUND +GUTS +Gund +HACC +HAPP +HARD +HART +HDCP +HDMI +HDPE +HDTV +HEAD +HEAR +HELL +HEPA +HERO +HIFI +HIGH +HIPH +HKEY +HOLD +HOME +HOST +HOUS +HPLC +HSDP +HSPA +HTML +HTTP +HUNT +Hugh +Hung +ICAN +ICMP +ICON +IDEA +IDOL +IEEE +IELT +IETF +IFPI +IGBT +IGMP +IMAX +IMDB +INFO +INTE +IPAD +IPTV +ISBN +ISDN +ISIS +ISOI +ISRC +ISSN +ISTP +ITER +ITIL +IUCN +Inte +Inve +JACK +JAPA +JAVA +JAZZ +JBOD +JOHN +JOJO +JOKE +JOUR +JPEG +JUMP +JUST +Jack +Jake +Jazz +John +Joke +July +Jump +Jung +KING +KISS +KONA +KOYO +LASI +LAST +LEED +LEEP +LESS +LEVE +LEXU +LIFE +LIKE +LIMI +LINE +LINK +LINU +LIST +LIVE +LLDP +LOCA +LOFT +LOGO +LOLI +LONG +LOOK +LOVE +LPGA +LTPS +LVDS +Ligh +Like +Lily +Lind +Ling +Liqu +Live +Luck +Luke +MACD +MACH +MAGI +MALL +MAMA +MARK +MAST +MATL +MATX +MAYA +MBLA +MEDI +MEGA +MEMS +MERS +META +MIDI +MIDP +MIMO +MINI +MIPS +MISS +MIUI +MMOR +MOBA +MODB +MODE +MOMO +MOOC +MOON +MORE +MOSF +MOTO +MOVI +MPEG +MPLS +MSCI +MSDS +MTBF +MUSI +Mach +Make +Maur +Mazz +NACH +NADH +NADP +NAMC +NAME +NANA +NAND +NASA +NASD +NATO +NAVE +NCAA +NCAP +NCIS +NEDC +NEOP +NERV +NEST +NEWS +NEXT +NICO +NIGH +NIKE +NINE +NOKI +NOTE +NOVA +NSAI +NTFS +NTSC +NULL +NURB +NVID +NYSE +Nove +ODBC +OECD +OFDM +OFFI +OLAP +OLED +ONLI +ONLY +OPEC +OPEN +OPPO +ORAC +ORIC +ORIG +OSPF +OVER +Oper +PACS +PAGE +PARK +PART +PASS +PCMC +PDCA +PEEK +PERC +PERF +PETS +PHEV +PHIL +PHOT +PICC +PIEC +PLAN +PLAY +PLUS +PMMA +PNAS +POLO +POSE +POST +POWE +PPTP +PPTV +PRAD +PROD +PROF +PROJ +PSTN +PTFE +PUNK +PVDF +Pric +Prin +Priv +Priz +Prom +QFII +QVGA +QVOD +QWER +Quic +Quin +Quiz +RADI +RAID +RAIN +REAC +READ +REAL +REIT +RESE +RFID +RIDE +RISC +RMON +RMRM +RMVB +ROAD +ROCK +ROHS +ROOT +ROSE +RTEC +RWBY +Ruby +SAAS +SAMS +SARS +SATA +SCAD +SCAR +SCDM +SCHO +SCIE +SCSI +SDHC +SDMM +SDRA +SDSD +SDXC +SECA +SECC +SECT +SEED +SEGA +SELE +SERV +SEVE +SFDA +SHIF +SHIN +SHOC +SHOP +SHOW +SIDE +SIEM +SING +SIZE +SKIP +SMAP +SMAR +SMIL +SMTP +SNMP +SOAP +SOCK +SOHO +SOLO +SONG +SONY +SOSO +SOUL +SPAC +SPCC +SPDI +SPEC +SPEE +SPIE +SPOR +SPSS +SRAM +SSCI +STAF +STAG +STAR +STAT +STEM +STEP +STER +STOP +STOR +STUD +STYL +SUMM +SUPE +SUSE +SWAT +SWIF +SWOT +SYST +Subj +Sull +Sund +Sung +Supp +TABL +TANK +TCPI +TDMA +TEAM +TECH +TEST +TEXT +TFBO +TFSI +TFTP +THIS +THRE +TIFF +TIME +TIMK +TIPS +TOEF +TOKY +TOSH +TOUC +TOUR +TOWN +TRAC +TRIP +TRIZ +TRUE +TVBS +TVOC +TWIC +TYPE +Ther +Thin +Thom +Thou +UCLA +UHMW +ULTR +UMTS +UNES +UNIT +UNIV +UNIX +Unic +Unit +Univ +VAIO +VCCI +VEGF +VERS +VHDL +VIDE +VIER +VIII +VISA +VISI +VIST +VIVO +VLAN +VLSI +VOCA +VOGU +VOIP +VRay +VSAT +Vick +Vill +WANG +WAPI +WASD +WAVE +WCBA +WCDM +WEEK +WEST +WHAT +WHIT +WIFI +WIND +WITH +WLAN +WORD +WORK +WORL +WQVG +WXGA +Wang +Wher +WiMA +Will +Wind +Wing +XBOX +XBRL +XHTM +XVID +XXXX +YAMA +YANG +YEAH +YONE +YOUN +YOUR +YOYO +Yong +Your +ZAFT +ZARA +ZERO +ZGMF +ZHAN +ZONE +Zhon +Zhou +abby +abou +andr +appl +baby +back +blic +call 
+char +chic +chin +coff +coll +comb +comm +comp +cond +cons +cont +dick +diff +ding +dock +doin +dong +down +ever +exch +find +foll +four +from +fron +goin +good +goog +gove +hack +hall +hand +hang +happ +have +here +high +home +into +inve +jack +java +jazz +jump +jung +just +know +life +ligh +like +lily +ling +liqu +live +lock +logo +lond +long +look +love +macd +mach +make +mapp +mmer +nove +okay +only +oper +oppo +othe +over +play +pray +pric +prin +priv +priz +prod +prom +quic +real +requ +righ +scho +shou +show +some +star +stat +stay +stom +subj +such +suff +supp +take +than +they +thin +thou +toke +uber +unic +univ +upon +usdj +user +usin +vill +vivo +wake +wall +wang +want +wave +were +what +when +wifi +will +wind +wing +with +work +xing +xxxx +year +your +zhon +China +Inter +Journ +china +every +inter +iphon +thing +think +where +which +Univer +univer +Windows +windows +##A +##B +##C +##D +##E +##F +##G +##H +##I +##J +##K +##L +##M +##N +##O +##P +##Q +##R +##S +##T +##U +##V +##W +##X +##Y +##Z +##a +##b +##c +##d +##e +##f +##g +##h +##i +##j +##k +##l +##m +##n +##o +##p +##q +##r +##s +##t +##u +##v +##w +##x +##y +##z +##AA +##AB +##AC +##AD +##AE +##AF +##AG +##AH +##AI +##AK +##AL +##AM +##AN +##AO +##AP +##AQ +##AR +##AS +##AT +##AV +##AW +##AX +##AY +##AZ +##BA +##BB +##BC +##BE +##BG +##BI +##BM +##BN +##BO +##BP +##BR +##BS +##BT +##BU +##BY +##CA +##CB +##CC +##CD +##CE +##CF +##CG +##CH +##CI +##CK +##CL +##CM +##CN +##CO +##CP +##CR +##CS +##CT +##CU +##DA +##DB +##DC +##DD +##DE +##DI +##DL +##DM +##DN +##DO +##DP +##DR +##DS +##DT +##DU +##DX +##DY +##EA +##EB +##EC +##ED +##EE +##EF +##EG +##EI +##EK +##EL +##EM +##EN +##EO +##EP +##ER +##ES +##ET +##EV +##EW +##EX +##EY +##FA +##FC +##FD +##FE +##FF +##FI +##FL +##FO +##FP +##FR +##FS +##FT +##FU +##FX +##Fi +##GA +##GC +##GE +##GF +##GH +##GI +##GL +##GN +##GO +##GP +##GR +##GS +##GU +##GY +##HA +##HC +##HD +##HE +##HG +##HI +##HM +##HN +##HO +##HP +##HR +##HS +##HT +##IA +##IB +##IC +##ID +##IE +##IF +##IG +##II +##IK +##IL +##IM +##IN +##IO +##IP +##IR +##IS +##IT +##IU +##IV +##IX +##IZ +##JI +##JO +##Jo +##Ju +##KA +##KE +##KI +##KK +##KO +##KU +##KY +##LA +##LC +##LD +##LE +##LF +##LG +##LI +##LK +##LL +##LM +##LO +##LP +##LS +##LT +##LU +##LV +##LY +##MA +##MB +##MC +##MD +##ME +##MF +##MI +##ML +##MM +##MN +##MO +##MP +##MR +##MS +##MT +##MV +##MY +##NA +##NC +##ND +##NE +##NG +##NI +##NJ +##NK +##NN +##NO +##NP +##NS +##NT +##NU +##NX +##NY +##NZ +##OB +##OC +##OD +##OE +##OF +##OG +##OH +##OI +##OK +##OL +##OM +##ON +##OO +##OP +##OR +##OS +##OT +##OU +##OV +##OW +##OX +##PA +##PC +##PD +##PE +##PF +##PG +##PH +##PI +##PL +##PM +##PO +##PP +##PR +##PS +##PT +##PU +##QU +##Qu +##RA +##RB +##RC +##RD +##RE +##RF +##RG +##RH +##RI +##RK +##RL +##RM +##RN +##RO +##RP +##RR +##RS +##RT +##RU +##RY +##SA +##SB +##SC +##SD +##SE +##SF +##SH +##SI +##SK +##SL +##SM +##SN +##SO +##SP +##SS +##ST +##SU +##SY +##TA +##TC +##TD +##TE +##TH +##TI +##TM +##TO +##TP +##TR +##TS +##TT +##TU +##TV +##TY +##Tu +##UB +##UC +##UD +##UE +##UF +##UG +##UI +##UK +##UL +##UM +##UN +##UP +##UR +##US +##UT +##VA +##VB +##VC +##VD +##VE +##VI +##VO +##VP +##VR +##VT +##WA +##WC +##WE +##WF +##WI +##WL +##WM +##WO +##WS +##XA +##XC +##XE +##XG +##XO +##XP +##XT +##XX +##XY +##YA +##YE +##YL +##YO +##YP +##YS +##YT +##ZA +##ZB +##ZE +##ZI +##ZO +##ZR +##ZU +##ZX +##ZZ +##ab +##ag +##al +##am +##an +##ar +##as +##at +##ax +##ay +##az +##bi +##bj +##bl +##bo +##by +##ce +##ch +##ci +##ck +##cq +##ct +##dj +##ed +##en +##er +##ew 
+##ex +##ff +##fi +##gh +##gn +##ha +##he +##ho +##hz +##ic +##id +##im +##in +##is +##it +##iv +##ix +##iz +##jj +##jo +##ke +##ky +##kz +##ld +##le +##lf +##ll +##ly +##mb +##mp +##na +##nc +##nd +##ng +##nj +##nk +##nn +##nt +##nz +##ob +##oj +##ok +##ol +##om +##on +##op +##or +##ou +##ow +##ox +##ph +##pp +##pu +##pv +##qf +##ql +##qq +##qu +##re +##rk +##ro +##ry +##sh +##sq +##st +##th +##ty +##ub +##ul +##um +##un +##ur +##us +##uv +##ux +##uz +##ve +##vi +##wn +##ws +##ww +##xp +##xx +##xy +##zh +##zy +##zz +##ACE +##ACH +##ACT +##ADE +##AGE +##AIN +##AME +##AND +##ANG +##ANO +##ANT +##ARD +##ARE +##ASS +##AST +##ATE +##BER +##BLE +##BOX +##BSD +##Bay +##CAD +##CAL +##CAM +##COM +##CSE +##DEO +##DER +##DIA +##DNA +##DSL +##DVD +##EAM +##EAR +##ECT +##EEN +##ENS +##ENT +##ERA +##ERS +##ESE +##ESS +##FTA +##GER +##GHT +##GIS +##IAL +##IBA +##IBU +##ICE +##ICS +##IDE +##INA +##INE +##ING +##INT +##INY +##ION +##IPS +##ITE +##IVE +##KER +##KON +##LAY +##LLA +##LOR +##MAN +##MAS +##MAX +##MES +##NAD +##NAL +##NCE +##NET +##NEY +##NIC +##NNA +##OCK +##ODE +##OME +##ONE +##ORA +##OWS +##Off +##PAC +##PER +##PRS +##RAN +##RIS +##RNA +##ROM +##RON +##ROR +##SCO +##SHI +##SIC +##SOL +##SON +##SQL +##TAL +##TED +##TER +##TML +##TON +##TRA +##UND +##UNG +##UPA +##USB +##USE +##VEL +##VER +##VGA +##VID +##WER +##You +##abl +##aby +##ach +##ack +##act +##ain +##ake +##all +##aly +##anc +##and +##ang +##ank +##app +##ard +##ark +##art +##ary +##ash +##ath +##auv +##ave +##avi +##azi +##azy +##azz +##bVI +##bby +##ber +##bje +##ble +##cGI +##cho +##com +##cqu +##day +##der +##ebo +##ect +##ell +##emb +##enc +##eng +##ent +##erJ +##ern +##erv +##ery +##eve +##ews +##exp +##ext +##ezy +##fer +##ffe +##fic +##for +##gaz +##ger +##ght +##gin +##hen +##her +##hev +##hin +##hon +##hou +##iRF +##ial +##ica +##ice +##ich +##ick +##iff +##igh +##ike +##ill +##ily +##ime +##ine +##ing +##ink +##ion +##iqu +##ish +##ith +##ive +##iza +##ize +##izz +##jin +##ker +##kin +##lDR +##lay +##laz +##lex +##lic +##lin +##liz +##llo +##lly +##man +##maz +##men +##mer +##min +##mpl +##mpo +##nGL +##nRH +##nal +##ner +##ngz +##niz +##now +##nxp +##oCA +##obj +##ock +##oll +##omb +##ome +##omm +##omp +##one +##ong +##ook +##ork +##orm +##ort +##ory +##oul +##oup +##our +##ous +##out +##ove +##own +##ows +##per +##phe +##ply +##por +##ppl +##ppy +##qqu +##qua +##que +##qui +##raz +##rch +##ric +##rou +##son +##tBI +##tch +##ter +##the +##tic +##tim +##tiv +##tur +##uch +##uck +##uct +##uff +##ugh +##umb +##ung +##ure +##urn +##vel +##ven +##ver +##vic +##vid +##vin +##war +##way +##whe +##wor +##www +##xxx +##ymb +##yth +##zhe +##zym +##zzy +##ATIO +##CESS +##CIAT +##CTIO +##CTOR +##ENGI +##ERSI +##HCSD +##INES +##INUE +##IONA +##LOID +##MENT +##NEER +##NOLO +##NTER +##NTSC +##ORMA +##OSHO +##RISE +##RNAT +##RNET +##SATA +##SION +##TION +##TTLE +##VERS +##ally +##arch +##ayer +##azer +##azin +##bert +##book +##chin +##ctor +##ding +##echn +##erPC +##erVR +##eriz +##erve +##ever +##ffer +##ffff +##ffic +##fter +##ghly +##hell +##ical +##iche +##icke +##ific +##ight +##iver +##izon +##izzy +##king +##lack +##land +##llow +##mber +##ngin +##ning +##omic +##onom +##othe +##ouch +##ough +##ound +##ower +##pper +##ppin +##pter +##ster +##ther +##tion +##tive +##tter +##ture +##urch +##vely +##ction +##ctive +##enter +##erica +##ional +##thing diff --git a/fengshen/workspace/erlangshen-deberta-base/pretrain/README.md b/fengshen/workspace/erlangshen-deberta-base/pretrain/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..942a13a2a1a2eca8fe42ecaab03d33b16e4c0700 --- /dev/null +++ b/fengshen/workspace/erlangshen-deberta-base/pretrain/README.md @@ -0,0 +1,54 @@ +--- +language: + - zh + +license: apache-2.0 + +tags: + - bert + +inference: true + +widget: +- text: "生活的真谛是[MASK]。" +--- +# Erlangshen-Deberta-97M-Chinese, one model of [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM). +A 97-million-parameter DeBERTa-v2 base model with an encoder-only transformer structure, pre-trained on 180G of Chinese data for 7 days on 24 A100 (40G) GPUs, consuming 1B samples in total. + + +## Task Description + +Erlangshen-Deberta-97M-Chinese is pre-trained with a BERT-style masked language modeling task, following the DeBERTa [paper](https://readpaper.com/paper/3033187248). + + +## Usage +```python +from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline +import torch + +tokenizer=AutoTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-DeBERTa-v2-97M-Chinese', use_fast=False) +model=AutoModelForMaskedLM.from_pretrained('IDEA-CCNL/Erlangshen-DeBERTa-v2-97M-Chinese') +text = '生活的真谛是[MASK]。' +fillmask_pipe = FillMaskPipeline(model, tokenizer, device=7) +print(fillmask_pipe(text, top_k=10)) +``` + +## Finetune + +We present dev-set results on some downstream tasks. + +| Model | OCNLI | CMNLI | +| ---------------------------------- | ----- | ------ | +| RoBERTa-base | 0.743 | 0.7973 | +| **Erlangshen-Deberta-97M-Chinese** | 0.752 | 0.807 | + +## Citation +If you find this resource useful, please cite the following website in your paper. +``` +@misc{Fengshenbang-LM, + title={Fengshenbang-LM}, + author={IDEA-CCNL}, + year={2022}, + howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}, +} +``` \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-deberta-base/pretrain/config.json b/fengshen/workspace/erlangshen-deberta-base/pretrain/config.json new file mode 100644 index 0000000000000000000000000000000000000000..00b0a4e997b3dc7008980c27ceb510418144809f --- /dev/null +++ b/fengshen/workspace/erlangshen-deberta-base/pretrain/config.json @@ -0,0 +1,27 @@ +{ + "model_type": "deberta-v2", + "architectures": [ + "DebertaV2ForMaskedLM" + ], + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "relative_attention": true, + "position_buckets": 256, + "norm_rel_ebd": "layer_norm", + "share_att_key": true, + "pos_att_type": "c2p|p2c", + "conv_kernel_size": 3, + "conv_act": "gelu", + "layer_norm_eps": 1e-7, + "max_relative_positions": -1, + "position_biased_input": false, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 0, + "vocab_size": 12800 +} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-deberta-base/pretrain/special_tokens_map.json b/fengshen/workspace/erlangshen-deberta-base/pretrain/special_tokens_map.json new file mode 100644 index 0000000000000000000000000000000000000000..e7b0375001f109a6b8873d756ad4f7bbb15fbaa5 --- /dev/null +++ b/fengshen/workspace/erlangshen-deberta-base/pretrain/special_tokens_map.json @@ -0,0 +1 @@ +{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-deberta-base/pretrain/tokenizer_config.json b/fengshen/workspace/erlangshen-deberta-base/pretrain/tokenizer_config.json new file mode 100644 index
0000000000000000000000000000000000000000..63c7eb71d0058343534b09eb59770646b9368208 --- /dev/null +++ b/fengshen/workspace/erlangshen-deberta-base/pretrain/tokenizer_config.json @@ -0,0 +1,14 @@ +{ + "do_lower_case": true, + "do_basic_tokenize": true, + "never_split": null, + "unk_token": "[UNK]", + "sep_token": "[SEP]", + "pad_token": "[PAD]", + "cls_token": "[CLS]", + "mask_token": "[MASK]", + "tokenize_chinese_chars": true, + "strip_accents": null, + "special_tokens_map_file": null, + "tokenizer_class": "BertTokenizer" +} \ No newline at end of file diff --git a/fengshen/workspace/erlangshen-deberta-base/pretrain/vocab.txt b/fengshen/workspace/erlangshen-deberta-base/pretrain/vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..437c75cb0090ba6d449e478fae3f7421ab1de961 --- /dev/null +++ b/fengshen/workspace/erlangshen-deberta-base/pretrain/vocab.txt @@ -0,0 +1,12800 @@ +[PAD] +[CLS] +[SEP] +[UNK] +[MASK] +[unused1] +[unused2] +[unused3] +[unused4] +[unused5] +[unused6] +[unused7] +[unused8] +[unused9] +[unused10] +[unused11] +[unused12] +[unused13] +[unused14] +[unused15] +[unused16] +[unused17] +[unused18] +[unused19] +[unused20] +[unused21] +[unused22] +[unused23] +[unused24] +[unused25] +[unused26] +[unused27] +[unused28] +[unused29] +[unused30] +[unused31] +[unused32] +[unused33] +[unused34] +[unused35] +[unused36] +[unused37] +[unused38] +[unused39] +[unused40] +[unused41] +[unused42] +[unused43] +[unused44] +[unused45] +[unused46] +[unused47] +[unused48] +[unused49] +[unused50] +[unused51] +[unused52] +[unused53] +[unused54] +[unused55] +[unused56] +[unused57] +[unused58] +[unused59] +[unused60] +[unused61] +[unused62] +[unused63] +[unused64] +[unused65] +[unused66] +[unused67] +[unused68] +[unused69] +[unused70] +[unused71] +[unused72] +[unused73] +[unused74] +[unused75] +[unused76] +[unused77] +[unused78] +[unused79] +[unused80] +[unused81] +[unused82] +[unused83] +[unused84] +[unused85] +[unused86] +[unused87] +[unused88] +[unused89] +[unused90] +[unused91] +[unused92] +[unused93] +[unused94] +[unused95] +[unused96] +[unused97] +[unused98] +[unused99] +! +" +# +$ +% +& +' +( +) +* ++ +, +- +. +/ +: +; +< += +> +? +@ +[ +\ +] +^ +_ +` +{ +| +} +~ +· +– +— +‘ +’ +‛ +“ +” +„ +‟ +… +‧ +、 +。 +〃 +〈 +〉 +《 +》 +「 +」 +『 +』 +【 +】 +〔 +〕 +〖 +〗 +〜 +〝 +〞 +〰 +﹏ +﹑ +﹔ +! +" +# +$ +% +& +' +( +) +* ++ +, +- +: +; +< += +> +? 
+@ +[ +\ +] +^ +_ +` +{ +| +} +~ +。 +「 +」 +、 +0 +##0 +1 +##1 +2 +##2 +3 +##3 +4 +##4 +5 +##5 +6 +##6 +7 +##7 +8 +##8 +9 +##9 +一 +丁 +七 +丄 +丅 +丆 +万 +丈 +三 +上 +下 +丌 +不 +与 +丏 +丐 +丑 +专 +且 +丕 +世 +丘 +丙 +业 +丛 +东 +丝 +丞 +両 +丢 +两 +严 +丧 +丨 +个 +丫 +中 +丰 +串 +临 +丶 +丸 +丹 +为 +主 +丼 +丽 +举 +丿 +乂 +乃 +乄 +久 +乇 +么 +义 +之 +乌 +乍 +乎 +乏 +乐 +乒 +乓 +乔 +乖 +乗 +乘 +乙 +乛 +乜 +九 +乞 +也 +习 +乡 +书 +乩 +买 +乱 +乳 +乸 +乾 +亀 +了 +亇 +予 +争 +亊 +事 +二 +亍 +于 +亏 +云 +互 +亓 +五 +井 +亘 +亚 +些 +亜 +亟 +亡 +亢 +交 +亥 +亦 +产 +亨 +亩 +享 +京 +亭 +亮 +亲 +亳 +亵 +亶 +亸 +亹 +人 +亻 +亽 +亾 +亿 +什 +仁 +仂 +仃 +仄 +仅 +仆 +仇 +仉 +今 +介 +仍 +从 +仏 +仑 +仓 +仔 +仕 +他 +仗 +付 +仙 +仝 +仞 +仟 +仡 +代 +令 +以 +仨 +仩 +仪 +仫 +们 +仮 +仰 +仲 +仳 +仵 +件 +价 +任 +仼 +份 +仿 +企 +伉 +伊 +伋 +伍 +伎 +伏 +伐 +休 +众 +优 +伙 +会 +伛 +伝 +伞 +伟 +传 +伢 +伤 +伥 +伦 +伧 +伪 +伫 +伯 +估 +伱 +伲 +伴 +伶 +伷 +伸 +伺 +似 +伽 +伾 +佀 +佃 +但 +佉 +位 +低 +住 +佐 +佑 +体 +何 +佗 +佘 +余 +佚 +佛 +作 +佝 +佞 +佟 +你 +佢 +佣 +佤 +佥 +佧 +佩 +佬 +佯 +佰 +佳 +佴 +佶 +佷 +佺 +佻 +佼 +佾 +使 +侁 +侂 +侃 +侄 +來 +侈 +侉 +例 +侍 +侏 +侑 +侔 +侗 +侘 +供 +依 +侠 +価 +侣 +侥 +侦 +侧 +侨 +侩 +侪 +侬 +侮 +侯 +侵 +便 +促 +俄 +俅 +俊 +俎 +俏 +俐 +俑 +俗 +俘 +俚 +俛 +俜 +保 +俞 +俟 +信 +俢 +俣 +俤 +俦 +俨 +俩 +俪 +俬 +俭 +修 +俯 +俱 +俳 +俶 +俸 +俺 +俾 +倅 +倌 +倍 +倏 +倒 +倓 +倔 +倕 +倘 +候 +倚 +倜 +倞 +借 +倡 +値 +倥 +倦 +倧 +倨 +倩 +倪 +倬 +倭 +倮 +倶 +债 +倻 +值 +倾 +偁 +偃 +假 +偈 +偌 +偍 +偎 +偏 +偓 +偕 +做 +停 +偠 +健 +偬 +偰 +偲 +偶 +偷 +偻 +偾 +偿 +傀 +傅 +傈 +傉 +傍 +傒 +傕 +傣 +傥 +傧 +储 +傩 +催 +傲 +傺 +傻 +働 +像 +僖 +僚 +僜 +僣 +僦 +僧 +僬 +僭 +僮 +僰 +僳 +僵 +僻 +僾 +儁 +儆 +儇 +儋 +儒 +儞 +儡 +儿 +兀 +允 +元 +兄 +充 +兆 +先 +光 +克 +免 +兎 +児 +兑 +兔 +兕 +兖 +党 +兜 +兢 +入 +全 +八 +公 +六 +兮 +兰 +共 +兲 +关 +兴 +兵 +其 +具 +典 +兹 +养 +兼 +兽 +冀 +内 +円 +冇 +冈 +冉 +册 +再 +冏 +冒 +冕 +冗 +写 +冚 +军 +农 +冠 +冢 +冤 +冥 +冧 +冨 +冬 +冮 +冯 +冰 +冲 +决 +冴 +况 +冶 +冷 +冻 +冼 +冽 +净 +凃 +凄 +准 +凇 +凉 +凊 +凋 +凌 +减 +凑 +凖 +凛 +凝 +几 +凡 +凤 +処 +凪 +凫 +凭 +凯 +凰 +凳 +凶 +凸 +凹 +出 +击 +凼 +函 +凿 +刀 +刁 +刂 +刃 +分 +切 +刈 +刊 +刋 +刍 +刎 +刑 +划 +刖 +列 +刘 +则 +刚 +创 +初 +删 +判 +刨 +利 +别 +刬 +刭 +刮 +到 +刳 +制 +刷 +券 +刹 +刺 +刻 +刽 +刿 +剀 +剁 +剂 +剃 +剅 +削 +剌 +前 +剐 +剑 +剔 +剖 +剜 +剞 +剡 +剣 +剥 +剧 +剩 +剪 +副 +割 +剽 +剿 +劂 +劄 +劈 +劏 +劓 +力 +劝 +办 +功 +加 +务 +劢 +劣 +动 +助 +努 +劫 +劬 +劭 +励 +劲 +劳 +劵 +効 +劻 +劼 +劾 +势 +勃 +勅 +勇 +勉 +勋 +勍 +勐 +勑 +勒 +勔 +勖 +勘 +募 +勠 +勤 +勰 +勲 +勳 +勷 +勺 +勾 +勿 +匀 +匂 +匄 +包 +匆 +匈 +匋 +匍 +匏 +匐 +匕 +化 +北 +匙 +匚 +匜 +匝 +匠 +匡 +匣 +匦 +匪 +匮 +匹 +区 +医 +匾 +匿 +區 +十 +千 +卅 +升 +午 +卉 +半 +卌 +卍 +华 +协 +卐 +卑 +卒 +卓 +单 +卖 +南 +単 +博 +卜 +卝 +卞 +卟 +占 +卡 +卢 +卣 +卤 +卦 +卧 +卨 +卫 +卬 +卮 +卯 +印 +危 +卲 +即 +却 +卵 +卷 +卸 +卺 +卽 +卿 +厂 +厄 +厅 +历 +厉 +压 +厌 +厍 +厓 +厔 +厕 +厘 +厚 +厝 +原 +厢 +厣 +厥 +厦 +厨 +厩 +厮 +厶 +去 +县 +叁 +参 +叆 +又 +叉 +及 +友 +双 +反 +収 +发 +叒 +叔 +叕 +取 +受 +变 +叙 +叛 +叟 +叠 +叡 +口 +古 +句 +另 +叨 +叩 +只 +叫 +召 +叭 +叮 +可 +台 +叱 +史 +右 +叵 +叶 +号 +司 +叹 +叻 +叼 +叽 +吁 +吃 +各 +吅 +吆 +吇 +合 +吉 +吊 +吋 +同 +名 +后 +吏 +吐 +向 +吒 +吓 +吔 +吕 +吖 +吗 +吙 +吚 +君 +吝 +吞 +吟 +吠 +吡 +吥 +否 +吧 +吨 +吩 +含 +听 +吭 +吮 +启 +吱 +吲 +吴 +吵 +吸 +吹 +吻 +吼 +吽 +吾 +吿 +呀 +呃 +呆 +呈 +呉 +告 +呋 +呎 +呐 +呑 +呒 +呓 +呔 +呕 +呖 +呗 +员 +呙 +呛 +呜 +呢 +呣 +呤 +呦 +周 +呪 +呬 +呯 +呱 +呲 +味 +呴 +呵 +呶 +呷 +呸 +呻 +呼 +命 +呾 +咀 +咁 +咂 +咄 +咆 +咋 +和 +咎 +咏 +咐 +咒 +咔 +咕 +咖 +咗 +咘 +咙 +咚 +咛 +咝 +咢 +咣 +咤 +咥 +咦 +咧 +咨 +咩 +咪 +咫 +咬 +咭 +咯 +咱 +咲 +咳 +咴 +咸 +咻 +咽 +咾 +咿 +哀 +品 +哂 +哄 +哆 +哇 +哈 +哉 +哋 +哌 +响 +哎 +哏 +哐 +哑 +哒 +哓 +哔 +哕 +哗 +哙 +哚 +哝 +哞 +哟 +哥 +哦 +哧 +哨 +哩 +哪 +哭 +哮 +哲 +哺 +哼 +哽 +哿 +唁 +唃 +唆 +唇 +唉 +唎 +唏 +唐 +唑 +唔 +唛 +唠 +唢 +唤 +唦 +唧 +唬 +售 +唯 +唰 +唱 +唳 +唵 +唷 +唻 +唼 +唾 +唿 +啁 +啃 +啄 +啅 +商 +啉 +啊 +啐 +啓 +啕 +啖 +啜 +啝 +啡 +啤 +啥 +啦 +啧 +啪 +啫 +啬 +啭 +啮 +啯 +啰 +啱 +啲 +啵 +啶 +啷 +啸 +啻 +啼 +啾 +喀 +喁 +喂 +喃 +善 +喆 +喇 +喈 +喉 +喊 +喋 +喎 +喏 +喑 +喔 +喘 +喙 +喜 +喝 +喟 +喦 +喧 +喰 +喱 +喳 +喵 +営 +喷 +喹 +喺 +喻 +喽 +喾 +嗄 +嗅 +嗉 +嗌 +嗍 +嗑 +嗒 +嗓 +嗔 +嗖 +嗜 +嗝 +嗞 +嗟 +嗡 +嗣 +嗤 +嗥 +嗦 +嗨 +嗪 +嗫 +嗬 +嗮 +嗯 +嗰 +嗲 +嗳 +嗵 +嗷 +嗽 +嗾 +嘀 +嘁 +嘅 +嘈 +嘉 +嘌 +嘎 +嘏 +嘘 +嘚 +嘛 +嘞 +嘟 +嘢 +嘣 +嘤 +嘦 +嘧 +嘬 +嘭 +嘱 +嘲 +嘴 +嘶 +嘹 +嘻 +嘿 +噁 +噉 +噌 +噎 +噐 +噔 +噗 +噘 +噙 +噜 +噢 +噤 +器 +噩 +噪 +噫 +噬 +噱 +噶 +噻 +噼 +嚅 +嚈 +嚎 +嚏 +嚐 +嚒 +嚓 +嚟 +嚣 +嚧 +嚩 +嚭 +嚯 +嚷 +嚼 +囊 +囍 +囔 +囖 +囗 +囘 +囚 +四 +囝 +回 +囟 +因 +囡 +团 
+団 +囤 +囧 +囫 +囬 +园 +囯 +困 +囱 +囲 +図 +围 +囵 +囷 +囹 +固 +国 +图 +囿 +圃 +圄 +圆 +圈 +圉 +圊 +國 +圌 +圏 +圜 +圝 +圞 +土 +圣 +圧 +在 +圩 +圪 +圬 +圭 +圮 +圯 +地 +圳 +圹 +场 +圻 +圾 +址 +坂 +均 +坊 +坌 +坍 +坎 +坏 +坐 +坑 +块 +坚 +坛 +坜 +坝 +坞 +坟 +坠 +坡 +坤 +坦 +坨 +坩 +坪 +坫 +坬 +坭 +坯 +坳 +坷 +坻 +坼 +垂 +垃 +垄 +垅 +垆 +型 +垌 +垍 +垒 +垓 +垕 +垚 +垛 +垞 +垟 +垠 +垡 +垢 +垣 +垤 +垦 +垧 +垩 +垫 +垭 +垮 +垱 +垲 +垴 +垵 +垸 +埂 +埃 +埇 +埈 +埋 +埌 +城 +埏 +埒 +埔 +埕 +埗 +埘 +埙 +埚 +埜 +埝 +域 +埠 +埤 +埭 +埯 +埴 +埵 +埸 +培 +基 +埼 +埽 +堀 +堂 +堃 +堆 +堇 +堈 +堉 +堋 +堌 +堍 +堑 +堕 +堙 +堞 +堠 +堡 +堤 +堨 +堪 +堰 +堵 +堺 +堽 +塁 +塄 +塅 +塆 +塌 +塍 +塑 +塔 +塘 +塚 +塝 +塞 +塩 +填 +塬 +塭 +塱 +塽 +塾 +墀 +墁 +境 +墅 +墉 +墒 +墓 +墕 +増 +墘 +墙 +增 +墟 +墨 +墩 +墫 +壁 +壅 +壆 +壊 +壑 +壕 +壤 +士 +壬 +壮 +声 +壱 +売 +壳 +壶 +壸 +壹 +处 +备 +変 +复 +夏 +夑 +夔 +夕 +外 +夙 +多 +夜 +够 +夤 +夥 +大 +夨 +天 +太 +夫 +夬 +夭 +央 +夯 +失 +夲 +头 +夶 +夷 +夸 +夹 +夺 +夼 +奀 +奁 +奂 +奄 +奇 +奈 +奉 +奋 +奌 +奎 +奏 +契 +奔 +奕 +奖 +套 +奘 +奚 +奠 +奢 +奥 +奨 +奭 +女 +奴 +奶 +奸 +她 +好 +妁 +如 +妃 +妄 +妆 +妇 +妈 +妊 +妍 +妒 +妓 +妖 +妗 +妘 +妙 +妞 +妠 +妣 +妤 +妥 +妨 +妩 +妪 +妫 +妬 +妮 +妯 +妲 +妹 +妺 +妻 +妼 +妾 +姁 +姆 +姈 +姉 +姊 +始 +姐 +姑 +姒 +姓 +委 +姗 +姘 +姚 +姜 +姝 +姞 +姣 +姤 +姥 +姨 +姫 +姬 +姮 +姱 +姵 +姹 +姻 +姽 +姿 +威 +娃 +娄 +娅 +娆 +娇 +娈 +娉 +娌 +娑 +娓 +娘 +娜 +娟 +娠 +娡 +娣 +娥 +娩 +娭 +娱 +娲 +娴 +娶 +娼 +婀 +婆 +婉 +婊 +婕 +婚 +婠 +婢 +婧 +婪 +婬 +婳 +婴 +婵 +婶 +婷 +婺 +婻 +婼 +婿 +媄 +媒 +媗 +媚 +媛 +媜 +媞 +媪 +媲 +媳 +媵 +媸 +媺 +媾 +嫁 +嫂 +嫄 +嫉 +嫌 +嫒 +嫔 +嫖 +嫘 +嫚 +嫠 +嫡 +嫣 +嫦 +嫩 +嫪 +嫫 +嫰 +嫱 +嫲 +嫽 +嬅 +嬉 +嬖 +嬗 +嬛 +嬜 +嬢 +嬲 +嬴 +嬷 +嬾 +嬿 +孀 +子 +孑 +孒 +孓 +孔 +孕 +孖 +字 +存 +孙 +孚 +孛 +孜 +孝 +孟 +孢 +季 +孤 +孥 +学 +孩 +孪 +孬 +孰 +孱 +孳 +孵 +孺 +孽 +宀 +宁 +它 +宄 +宅 +宇 +守 +安 +宋 +完 +宍 +宏 +宓 +宕 +宗 +官 +宙 +定 +宛 +宜 +宝 +实 +実 +宠 +审 +客 +宣 +室 +宥 +宦 +宪 +宫 +宬 +宰 +害 +宴 +宵 +家 +宸 +容 +宽 +宾 +宿 +寀 +寂 +寃 +寄 +寅 +密 +寇 +富 +寐 +寒 +寓 +寔 +寘 +寛 +寝 +寞 +察 +寡 +寤 +寥 +寨 +寮 +寯 +寰 +寳 +寸 +对 +寺 +寻 +导 +対 +寿 +封 +専 +射 +尅 +将 +尉 +尊 +對 +小 +尐 +少 +尒 +尓 +尔 +尕 +尖 +尘 +尙 +尚 +尛 +尜 +尝 +尢 +尤 +尧 +尨 +尪 +尬 +就 +尴 +尸 +尹 +尺 +尻 +尼 +尽 +尾 +尿 +局 +屁 +层 +屃 +屄 +居 +屈 +屉 +届 +屋 +屌 +屍 +屎 +屏 +屐 +屑 +展 +屙 +属 +屠 +屡 +屣 +履 +屦 +屮 +屯 +山 +屹 +屺 +屾 +屿 +岀 +岁 +岂 +岈 +岌 +岐 +岑 +岔 +岕 +岖 +岗 +岘 +岙 +岚 +岛 +岜 +岞 +岢 +岣 +岩 +岫 +岬 +岭 +岱 +岳 +岵 +岷 +岸 +岺 +岽 +岿 +峁 +峄 +峇 +峋 +峒 +峕 +峙 +峠 +峡 +峣 +峤 +峥 +峦 +峨 +峩 +峪 +峭 +峯 +峰 +峻 +崀 +崁 +崂 +崃 +崄 +崆 +崇 +崎 +崐 +崑 +崔 +崖 +崚 +崛 +崞 +崟 +崤 +崦 +崧 +崩 +崭 +崮 +崴 +崽 +崾 +嵇 +嵊 +嵋 +嵌 +嵎 +嵖 +嵗 +嵘 +嵚 +嵛 +嵝 +嵩 +嵬 +嵯 +嵴 +嶂 +嶋 +嶙 +嶝 +嶲 +嶷 +巂 +巅 +巇 +巉 +巍 +巎 +巘 +巜 +川 +州 +巡 +巢 +巣 +工 +左 +巧 +巨 +巩 +巫 +差 +巯 +己 +已 +巳 +巴 +巷 +巻 +巽 +巾 +巿 +币 +市 +布 +帅 +帆 +师 +希 +帏 +帐 +帑 +帔 +帕 +帖 +帘 +帙 +帚 +帛 +帜 +帝 +带 +帧 +席 +帮 +帯 +帰 +帷 +常 +帻 +帼 +帽 +幂 +幄 +幅 +幌 +幔 +幕 +幛 +幞 +幡 +幢 +干 +平 +年 +幵 +并 +幷 +幸 +幺 +幻 +幼 +幽 +广 +庀 +庁 +広 +庄 +庆 +庇 +床 +庋 +序 +庐 +庑 +库 +应 +底 +庖 +店 +庙 +庚 +府 +庞 +废 +庠 +庤 +庥 +度 +座 +庭 +庵 +庶 +康 +庸 +庹 +庾 +廆 +廉 +廊 +廋 +廌 +廑 +廒 +廓 +廕 +廖 +廙 +廛 +廞 +廨 +廪 +廯 +延 +廷 +建 +廻 +廼 +廾 +廿 +开 +弁 +异 +弃 +弄 +弇 +弈 +弊 +弋 +式 +弐 +弑 +弓 +引 +弗 +弘 +弛 +弟 +张 +弢 +弥 +弦 +弧 +弩 +弭 +弯 +弱 +弹 +强 +弼 +弾 +彀 +归 +当 +录 +彖 +彗 +彘 +彝 +彟 +彡 +形 +彤 +彦 +彧 +彩 +彪 +彬 +彭 +彰 +影 +彳 +彵 +彷 +役 +彻 +彼 +往 +征 +徂 +径 +待 +徇 +很 +徉 +徊 +律 +徐 +徒 +従 +徕 +得 +徘 +徙 +徜 +御 +徧 +徨 +循 +徭 +微 +徳 +徴 +徵 +德 +徼 +徽 +心 +忄 +必 +忆 +忉 +忌 +忍 +忏 +忐 +忑 +忒 +忖 +志 +忘 +忙 +応 +忝 +忞 +忠 +忡 +忤 +忧 +忪 +快 +忭 +忱 +念 +忸 +忻 +忽 +忾 +忿 +怀 +态 +怂 +怃 +怄 +怅 +怆 +怍 +怎 +怏 +怒 +怔 +怕 +怖 +怙 +怛 +怜 +思 +怠 +怡 +急 +怦 +性 +怨 +怩 +怪 +怫 +怯 +怱 +怳 +怵 +怹 +总 +怼 +怿 +恁 +恂 +恃 +恋 +恍 +恏 +恐 +恒 +恕 +恙 +恚 +恠 +恢 +恣 +恤 +恨 +恩 +恪 +恫 +恬 +恭 +息 +恰 +恳 +恵 +恶 +恸 +恹 +恺 +恻 +恼 +恽 +恿 +悃 +悄 +悉 +悌 +悍 +悒 +悔 +悖 +悚 +悛 +悝 +悟 +悠 +患 +悦 +您 +悩 +悪 +悫 +悬 +悭 +悯 +悰 +悱 +悲 +悳 +悴 +悸 +悻 +悼 +情 +惆 +惇 +惊 +惋 +惑 +惔 +惕 +惘 +惚 +惛 +惜 +惝 +惟 +惠 +惢 +惣 +惦 +惧 +惨 +惩 +惪 +惫 +惬 +惭 +惮 +惯 +惰 +想 +惴 +惶 +惹 +惺 +愀 +愁 +愆 +愈 +愉 +愍 +愎 +意 +愔 +愕 +愚 +感 +愠 +愣 +愤 +愦 +愧 +愫 +愬 +愰 +愽 +愿 +慆 +慈 +慊 +慌 +慎 +慑 +慕 +慜 +慝 +慢 +慥 +慧 +慨 +慰 +慵 +慷 +慾 +憋 +憍 +憎 +憔 +憙 +憧 +憨 +憩 +憬 +憷 +憺 +憾 +懂 +懈 +懊 +懋 +懐 +懑 +懒 +懔 +懦 +懮 +懵 +懽 +懿 +戆 +戈 +戊 +戋 +戌 +戍 +戎 +戏 +成 +我 +戒 +戓 +戕 +或 +戗 +战 +戚 +戛 +戟 +戡 +戢 +戥 +戦 +截 +戬 +戮 +戯 +戳 +戴 +户 +戻 
+戽 +戾 +房 +所 +扁 +扃 +扆 +扇 +扈 +扉 +手 +扌 +才 +扎 +扑 +扒 +打 +扔 +托 +扛 +扞 +扣 +扥 +扦 +执 +扩 +扪 +扫 +扬 +扭 +扮 +扯 +扰 +扳 +扶 +批 +扼 +扽 +找 +承 +技 +抃 +抄 +抉 +把 +抑 +抒 +抓 +抔 +投 +抖 +抗 +折 +抚 +抛 +抜 +抟 +抠 +抡 +抢 +护 +报 +抧 +抨 +披 +抬 +抱 +抳 +抵 +抹 +抺 +抻 +押 +抽 +抿 +拂 +拄 +担 +拆 +拇 +拈 +拉 +拊 +拌 +拍 +拎 +拏 +拐 +拒 +拓 +拔 +拖 +拗 +拘 +拙 +招 +拜 +拟 +拢 +拣 +拥 +拦 +拧 +拨 +择 +括 +拭 +拮 +拯 +拱 +拳 +拴 +拶 +拷 +拼 +拽 +拾 +拿 +挀 +持 +挂 +指 +挈 +按 +挎 +挑 +挒 +挖 +挚 +挛 +挝 +挞 +挟 +挠 +挡 +挢 +挣 +挤 +挥 +挨 +挪 +挫 +振 +挲 +挹 +挺 +挻 +挼 +挽 +捂 +捃 +捅 +捆 +捉 +捋 +捌 +捍 +捎 +捏 +捐 +捕 +捜 +捞 +损 +捡 +换 +捣 +捧 +捩 +捭 +据 +捯 +捱 +捶 +捷 +捺 +捻 +捽 +掀 +掂 +掇 +授 +掉 +掊 +掌 +掎 +掏 +掐 +排 +掖 +掘 +掞 +掠 +探 +掣 +掤 +接 +控 +推 +掩 +措 +掬 +掭 +掮 +掰 +掲 +掳 +掴 +掷 +掸 +掺 +掼 +掾 +揄 +揆 +揉 +揍 +描 +提 +插 +揖 +揠 +握 +揣 +揩 +揪 +揭 +揲 +援 +揵 +揶 +揸 +揺 +揽 +揾 +揿 +搀 +搁 +搂 +搅 +搋 +搏 +搐 +搓 +搔 +搜 +搞 +搠 +搡 +搢 +搥 +搦 +搧 +搨 +搪 +搬 +搭 +搴 +搵 +携 +搽 +搿 +摁 +摄 +摅 +摆 +摇 +摈 +摊 +摒 +摔 +摘 +摛 +摞 +摧 +摩 +摭 +摸 +摹 +摺 +摽 +撂 +撃 +撄 +撅 +撇 +撑 +撒 +撕 +撘 +撙 +撝 +撞 +撤 +撩 +撬 +播 +撮 +撰 +撵 +撷 +撸 +撺 +撼 +擀 +擂 +擅 +操 +擎 +擒 +擗 +擘 +擞 +擢 +擤 +擦 +擫 +擿 +攀 +攒 +攘 +攞 +攥 +攫 +支 +攲 +攴 +攵 +收 +攸 +改 +攻 +攽 +放 +政 +故 +效 +敉 +敌 +敎 +敏 +救 +敔 +敕 +敖 +教 +敚 +敛 +敝 +敞 +敢 +散 +敦 +敫 +敬 +数 +敲 +整 +敷 +敻 +文 +斉 +斋 +斌 +斎 +斐 +斑 +斓 +斗 +料 +斛 +斜 +斝 +斟 +斡 +斤 +斥 +斧 +斩 +斫 +断 +斯 +新 +斲 +斶 +方 +於 +施 +旁 +旃 +旄 +旅 +旆 +旋 +旌 +旎 +族 +旒 +旖 +旗 +旛 +无 +既 +旣 +日 +旦 +旧 +旨 +早 +旬 +旭 +旮 +旯 +旰 +旱 +旳 +旴 +时 +旷 +旸 +旺 +旻 +旼 +昀 +昂 +昃 +昆 +昇 +昉 +昊 +昌 +明 +昏 +易 +昔 +昕 +昙 +昚 +昝 +昞 +星 +映 +春 +昧 +昨 +昪 +昫 +昭 +是 +昰 +昱 +昳 +昴 +昵 +昶 +昺 +昼 +显 +晁 +時 +晃 +晄 +晋 +晌 +晏 +晒 +晓 +晔 +晕 +晖 +晗 +晙 +晚 +晞 +晟 +晡 +晢 +晤 +晦 +晧 +晨 +晩 +晬 +普 +景 +晰 +晳 +晴 +晶 +晷 +晸 +智 +晻 +晾 +暁 +暂 +暄 +暇 +暌 +暍 +暎 +暐 +暑 +暕 +暖 +暗 +暝 +暠 +暧 +暨 +暮 +暴 +暸 +暹 +暻 +暾 +曈 +曌 +曕 +曙 +曛 +曜 +曝 +曦 +曩 +曰 +曱 +曲 +曳 +更 +曷 +曹 +曺 +曼 +曽 +曾 +替 +最 +會 +朅 +月 +有 +朊 +朋 +服 +朏 +朐 +朓 +朔 +朕 +朗 +望 +朝 +期 +朥 +朦 +木 +未 +末 +本 +札 +术 +朱 +朲 +朴 +朵 +机 +朽 +朿 +杀 +杂 +权 +杆 +杈 +杉 +杌 +李 +杏 +材 +村 +杓 +杖 +杜 +杞 +束 +杠 +条 +来 +杧 +杨 +杪 +杭 +杮 +杯 +杰 +杲 +杳 +杵 +杷 +杼 +松 +板 +极 +构 +枇 +枉 +枋 +枏 +析 +枓 +枕 +林 +枘 +枚 +果 +枝 +枞 +枟 +枠 +枢 +枣 +枥 +枧 +枨 +枪 +枫 +枭 +枯 +枰 +枱 +枳 +枵 +架 +枷 +枸 +枹 +柁 +柃 +柄 +柊 +柏 +某 +柑 +柒 +染 +柔 +柘 +柙 +柚 +柜 +柝 +柞 +柟 +柠 +柢 +查 +柩 +柬 +柯 +柰 +柱 +柳 +柴 +柷 +査 +柽 +柾 +柿 +栀 +栃 +栄 +栅 +标 +栈 +栉 +栊 +栋 +栌 +栎 +栏 +树 +栒 +栓 +栖 +栗 +栝 +栞 +栟 +校 +栢 +栩 +株 +栱 +栲 +栳 +栴 +样 +核 +根 +栻 +格 +栽 +栾 +栿 +桀 +桁 +桂 +桃 +桄 +桅 +框 +案 +桉 +桌 +桎 +桐 +桑 +桓 +桔 +桕 +桖 +桜 +桠 +桡 +桢 +档 +桤 +桥 +桦 +桧 +桨 +桩 +桫 +桴 +桶 +桷 +梁 +梃 +梅 +梆 +梏 +梓 +梗 +梢 +梣 +梦 +梧 +梨 +梭 +梯 +械 +梳 +梵 +梶 +梼 +梿 +检 +棂 +棉 +棋 +棍 +棐 +棒 +棕 +棘 +棚 +棠 +棣 +棨 +棪 +棫 +森 +棰 +棱 +棵 +棹 +棺 +棻 +棼 +椀 +椁 +椅 +椇 +椋 +植 +椎 +椐 +椒 +椛 +検 +椟 +椠 +椤 +椪 +椭 +椰 +椴 +椹 +椽 +椿 +楂 +楔 +楗 +楙 +楚 +楛 +楝 +楞 +楠 +楢 +楣 +楤 +楦 +楩 +楪 +楫 +業 +楮 +楯 +楶 +楷 +楸 +楹 +楼 +楽 +榀 +概 +榃 +榄 +榆 +榇 +榈 +榉 +榊 +榎 +榔 +榕 +榖 +榘 +榛 +榜 +榧 +榨 +榫 +榭 +榱 +榴 +榷 +榻 +榼 +槁 +槃 +槅 +槊 +槌 +槎 +槐 +槑 +槔 +様 +槙 +槚 +槛 +槜 +槟 +槠 +槭 +槱 +槲 +槵 +槻 +槽 +槿 +樊 +樋 +樗 +樘 +樛 +樟 +模 +樨 +権 +横 +樫 +樯 +樱 +樵 +樽 +樾 +橄 +橇 +橐 +橘 +橙 +橚 +橛 +橡 +橥 +橦 +橱 +橹 +橼 +檀 +檄 +檇 +檎 +檐 +檗 +檞 +檠 +檩 +檫 +檬 +檵 +櫂 +櫆 +欠 +次 +欢 +欣 +欤 +欧 +欲 +欷 +欸 +欹 +欺 +欻 +款 +歃 +歆 +歇 +歉 +歌 +歔 +歘 +歙 +止 +正 +此 +步 +武 +歧 +歩 +歪 +歳 +歴 +歹 +歺 +死 +歼 +殁 +殂 +殃 +殄 +殆 +殇 +殉 +殊 +残 +殍 +殑 +殒 +殓 +殖 +殚 +殛 +殡 +殢 +殪 +殳 +殴 +段 +殷 +殿 +毁 +毂 +毅 +毋 +毌 +母 +毎 +每 +毐 +毑 +毒 +毓 +比 +毕 +毖 +毗 +毘 +毙 +毛 +毡 +毫 +毯 +毳 +毵 +毽 +氀 +氂 +氅 +氆 +氇 +氍 +氏 +氐 +民 +氓 +气 +氖 +気 +氘 +氙 +氚 +氛 +氟 +氡 +氢 +氤 +氦 +氧 +氨 +氩 +氪 +氮 +氯 +氰 +氲 +水 +氵 +氷 +永 +氹 +氺 +氽 +氿 +汀 +汁 +求 +汆 +汇 +汉 +汊 +汐 +汕 +汖 +汗 +汛 +汜 +汝 +汞 +江 +池 +污 +汤 +汧 +汨 +汩 +汪 +汭 +汰 +汲 +汴 +汶 +汸 +汹 +汽 +汾 +沁 +沂 +沃 +沄 +沅 +沆 +沇 +沈 +沉 +沌 +沏 +沐 +沒 +沓 +沔 +沕 +沙 +沚 +沛 +沟 +没 +沢 +沣 +沤 +沥 +沦 +沧 +沨 +沩 +沪 +沫 +沬 +沭 +沮 +沱 +河 +沴 +沵 +沸 +油 +治 +沼 +沽 +沾 +沿 +泃 +泄 +泅 +泇 +泉 +泊 +泌 +泐 +泓 +泔 +法 +泖 +泗 +泚 +泛 +泞 +泠 +泡 +波 +泣 +泥 +注 +泩 +泪 +泫 +泮 +泯 +泰 +泱 +泳 +泵 +泷 +泸 +泺 +泻 +泼 +泽 +泾 +洁 +洄 +洇 +洈 +洊 +洋 +洌 +洎 +洑 +洒 +洗 +洙 +洛 +洞 +洢 +洣 +津 +洧 +洨 +洪 +洫 +洮 +洱 +洲 +洳 +洵 +洸 +洹 +洺 +活 +洼 +洽 
+派 +流 +浃 +浄 +浅 +浆 +浇 +浈 +浉 +浊 +测 +浍 +济 +浏 +浐 +浑 +浒 +浓 +浔 +浙 +浚 +浛 +浜 +浞 +浠 +浡 +浣 +浥 +浦 +浩 +浪 +浬 +浮 +浯 +浴 +海 +浸 +浼 +涂 +涅 +消 +涉 +涌 +涎 +涐 +涑 +涓 +涔 +涕 +涘 +涙 +涛 +涝 +涞 +涟 +涠 +涡 +涢 +涣 +涤 +润 +涧 +涨 +涩 +涪 +涫 +涮 +涯 +液 +涴 +涵 +涸 +涿 +淀 +淄 +淅 +淆 +淇 +淋 +淌 +淏 +淑 +淖 +淘 +淙 +淛 +淝 +淞 +淠 +淡 +淤 +淦 +淩 +淫 +淬 +淮 +淯 +深 +淳 +混 +淸 +淹 +添 +淼 +渀 +渃 +清 +済 +渉 +渊 +渋 +渌 +渍 +渎 +渐 +渑 +渔 +渕 +渖 +渗 +渚 +渝 +渟 +渠 +渡 +渣 +渤 +渥 +温 +渫 +渭 +港 +渲 +渴 +游 +渺 +渼 +湃 +湄 +湉 +湋 +湍 +湎 +湑 +湓 +湔 +湖 +湘 +湛 +湜 +湟 +湣 +湫 +湮 +湲 +湳 +湴 +湾 +湿 +満 +溁 +溃 +溅 +溆 +溇 +溉 +溍 +溏 +源 +溘 +溜 +溞 +溟 +溢 +溥 +溦 +溧 +溪 +溯 +溱 +溲 +溴 +溶 +溷 +溺 +溽 +滁 +滂 +滃 +滆 +滇 +滈 +滉 +滋 +滍 +滏 +滑 +滓 +滔 +滕 +滗 +滘 +滙 +滚 +滝 +滞 +滟 +滠 +满 +滢 +滤 +滥 +滦 +滨 +滩 +滴 +滹 +漀 +漂 +漆 +漈 +漉 +漏 +漓 +演 +漕 +漠 +漩 +漪 +漫 +漭 +漯 +漱 +漳 +漶 +漷 +漾 +潆 +潇 +潋 +潍 +潏 +潘 +潜 +潞 +潟 +潢 +潦 +潭 +潮 +潲 +潴 +潸 +潺 +潼 +潽 +潾 +澂 +澄 +澈 +澉 +澌 +澍 +澎 +澐 +澔 +澜 +澡 +澥 +澧 +澪 +澳 +澶 +澹 +激 +濂 +濆 +濉 +濊 +濑 +濒 +濙 +濛 +濞 +濠 +濡 +濩 +濬 +濮 +濯 +瀀 +瀍 +瀑 +瀚 +瀛 +瀞 +瀣 +瀬 +瀹 +瀼 +灌 +灏 +灞 +火 +灬 +灭 +灯 +灰 +灵 +灶 +灸 +灼 +灾 +灿 +炀 +炁 +炅 +炆 +炉 +炊 +炎 +炒 +炔 +炕 +炖 +炘 +炙 +炜 +炝 +炟 +炤 +炩 +炫 +炬 +炭 +炮 +炯 +炱 +炳 +炷 +炸 +点 +為 +炻 +炼 +炽 +烀 +烁 +烂 +烃 +烈 +烊 +烎 +烔 +烘 +烙 +烛 +烜 +烝 +烟 +烤 +烦 +烧 +烨 +烩 +烫 +烬 +热 +烯 +烷 +烹 +烺 +烽 +焉 +焊 +焌 +焐 +焓 +焕 +焖 +焗 +焘 +焙 +焚 +焜 +焞 +焦 +焮 +焯 +焰 +焱 +然 +焼 +煅 +煇 +煊 +煌 +煎 +煐 +煕 +煖 +煚 +煜 +煞 +煤 +煦 +照 +煨 +煮 +煲 +煳 +煴 +煸 +煺 +煽 +熄 +熇 +熊 +熏 +熔 +熘 +熙 +熜 +熟 +熠 +熥 +熨 +熬 +熳 +熵 +熹 +熺 +燃 +燊 +燋 +燎 +燏 +燔 +燕 +燚 +燠 +燥 +燧 +燮 +燹 +燻 +燿 +爀 +爆 +爇 +爨 +爪 +爬 +爰 +爱 +爲 +爵 +父 +爷 +爸 +爹 +爻 +爽 +爿 +牀 +牁 +牂 +片 +版 +牋 +牌 +牍 +牐 +牒 +牖 +牙 +牛 +牝 +牟 +牠 +牡 +牢 +牤 +牦 +牧 +物 +牯 +牲 +牵 +特 +牺 +牻 +牾 +犀 +犁 +犄 +犇 +犊 +犍 +犏 +犒 +犟 +犨 +犬 +犭 +犯 +犰 +犴 +状 +犷 +犸 +犹 +犼 +犽 +狁 +狂 +狃 +狄 +狈 +狌 +狍 +狎 +狐 +狒 +狗 +狙 +狛 +狝 +狞 +狠 +狡 +狨 +狩 +独 +狭 +狮 +狯 +狰 +狱 +狲 +狳 +狴 +狷 +狸 +狻 +狼 +猀 +猁 +猃 +猄 +猇 +猊 +猋 +猎 +猓 +猕 +猖 +猗 +猛 +猜 +猝 +猞 +猟 +猡 +猢 +猥 +猩 +猪 +猫 +猬 +献 +猰 +猱 +猴 +猷 +猹 +猾 +猿 +獍 +獐 +獒 +獗 +獠 +獣 +獬 +獭 +獴 +獾 +玁 +玄 +率 +玉 +王 +玎 +玏 +玑 +玕 +玖 +玗 +玘 +玙 +玚 +玛 +玟 +玠 +玡 +玢 +玥 +玦 +玧 +玩 +玫 +玭 +玮 +环 +现 +玲 +玳 +玷 +玹 +玺 +玻 +珀 +珂 +珅 +珈 +珉 +珊 +珍 +珏 +珐 +珑 +珖 +珙 +珝 +珞 +珠 +珣 +珥 +珦 +珧 +珩 +珪 +班 +珰 +珲 +珵 +珹 +珺 +珽 +琀 +球 +琅 +理 +琇 +琉 +琊 +琋 +琍 +琎 +琏 +琐 +琚 +琛 +琢 +琤 +琥 +琦 +琨 +琪 +琬 +琮 +琯 +琰 +琲 +琳 +琴 +琵 +琶 +琹 +琼 +瑀 +瑁 +瑄 +瑆 +瑊 +瑒 +瑕 +瑗 +瑙 +瑚 +瑛 +瑜 +瑞 +瑟 +瑠 +瑢 +瑧 +瑨 +瑭 +瑰 +瑱 +瑶 +瑷 +瑸 +瑺 +瑾 +璀 +璁 +璂 +璃 +璆 +璇 +璈 +璋 +璎 +璐 +璘 +璜 +璞 +璟 +璠 +璧 +璨 +璩 +璪 +璮 +璲 +璺 +璿 +瓌 +瓒 +瓘 +瓛 +瓜 +瓞 +瓟 +瓠 +瓢 +瓣 +瓤 +瓦 +瓮 +瓯 +瓴 +瓶 +瓷 +瓿 +甃 +甄 +甍 +甏 +甑 +甓 +甗 +甘 +甙 +甚 +甜 +生 +甡 +產 +甥 +用 +甩 +甪 +甫 +甬 +甭 +甯 +田 +由 +甲 +申 +电 +男 +甸 +町 +画 +甾 +畀 +畅 +畈 +畊 +畋 +界 +畎 +畏 +畑 +畔 +留 +畚 +畛 +畜 +畠 +畤 +略 +畦 +番 +畯 +畲 +畴 +畸 +畹 +畿 +疃 +疆 +疋 +疍 +疎 +疏 +疑 +疒 +疔 +疖 +疗 +疙 +疚 +疝 +疟 +疠 +疡 +疣 +疤 +疥 +疫 +疬 +疮 +疯 +疰 +疱 +疲 +疳 +疴 +疵 +疸 +疹 +疼 +疽 +疾 +痂 +痄 +病 +症 +痈 +痉 +痊 +痍 +痒 +痔 +痕 +痖 +痘 +痛 +痞 +痢 +痣 +痤 +痦 +痧 +痨 +痩 +痪 +痫 +痰 +痱 +痴 +痹 +痼 +痿 +瘀 +瘁 +瘅 +瘆 +瘊 +瘌 +瘐 +瘕 +瘖 +瘗 +瘘 +瘙 +瘛 +瘟 +瘠 +瘢 +瘤 +瘥 +瘦 +瘩 +瘪 +瘫 +瘰 +瘳 +瘴 +瘵 +瘸 +瘼 +瘾 +瘿 +癀 +癃 +癌 +癍 +癎 +癒 +癔 +癖 +癜 +癞 +癣 +癫 +癯 +癸 +発 +登 +發 +白 +百 +癿 +皂 +的 +皆 +皇 +皈 +皋 +皎 +皐 +皑 +皒 +皓 +皕 +皖 +皙 +皛 +皝 +皞 +皤 +皦 +皮 +皱 +皲 +皴 +皿 +盂 +盃 +盅 +盆 +盈 +盉 +益 +盌 +盍 +盎 +盏 +盐 +监 +盒 +盔 +盖 +盗 +盘 +盛 +盝 +盟 +盥 +盦 +盨 +盩 +目 +盯 +盱 +盲 +直 +相 +盹 +盼 +盾 +眀 +省 +眄 +眇 +眈 +眉 +看 +県 +眙 +眚 +眛 +眞 +真 +眠 +眦 +眨 +眩 +眬 +眭 +眯 +眵 +眶 +眷 +眸 +眺 +眼 +着 +睁 +睇 +睐 +睑 +睒 +睚 +睛 +睡 +睢 +督 +睥 +睦 +睨 +睪 +睫 +睬 +睱 +睹 +睺 +睽 +睾 +睿 +瞀 +瞄 +瞅 +瞋 +瞌 +瞎 +瞑 +瞒 +瞟 +瞠 +瞢 +瞥 +瞧 +瞩 +瞪 +瞬 +瞭 +瞰 +瞳 +瞻 +瞽 +瞿 +矍 +矗 +矛 +矜 +矞 +矢 +矣 +知 +矧 +矩 +矫 +矬 +短 +矮 +石 +矶 +矸 +矽 +矾 +矿 +砀 +码 +砂 +砉 +砌 +砍 +砑 +砒 +研 +砕 +砖 +砗 +砚 +砜 +砝 +砟 +砢 +砣 +砥 +砦 +砧 +砩 +砬 +砭 +砰 +砳 +破 +砵 +砷 +砸 +砹 +砺 +砻 +砼 +砾 +础 +硅 +硇 +硌 +硎 +硏 +硐 +硒 +硔 +硕 +硖 +硗 +硙 +硚 +硝 +硪 +硫 +硬 +确 +硷 +硼 +碁 +碇 +碉 +碌 +碍 +碎 +碏 +碑 +碓 +碗 +碘 +碚 +碛 +碜 +碟 +碡 +碣 +碥 +碧 +碰 +碱 +碲 +碳 +碴 +碶 +碹 +碾 +磁 +磅 +磉 +磊 +磋 +磐 +磔 +磕 +磙 +磜 +磡 +磦 +磨 +磬 +磲 +磴 +磷 +磺 +磻 +磾 +礁 +礅 +礐 +礓 +礞 +礤 +礴 +示 +礻 +礼 +礽 +社 +祀 
+祁 +祂 +祆 +祇 +祈 +祉 +祊 +祎 +祏 +祐 +祓 +祔 +祖 +祗 +祘 +祚 +祛 +祜 +祝 +神 +祟 +祠 +祢 +祥 +祧 +票 +祭 +祯 +祲 +祷 +祸 +祹 +祺 +祼 +祾 +禀 +禁 +禄 +禅 +禇 +禊 +禋 +福 +禑 +禔 +禖 +禘 +禚 +禛 +禟 +禤 +禧 +禩 +禳 +禵 +禹 +禺 +离 +禽 +禾 +秀 +私 +秂 +秃 +秆 +秉 +秋 +种 +科 +秒 +秕 +秘 +租 +秣 +秤 +秦 +秧 +秩 +秫 +秬 +秭 +积 +称 +秸 +移 +秽 +秾 +稀 +稃 +程 +稍 +税 +稔 +稗 +稙 +稚 +稞 +稠 +稣 +稲 +稳 +稷 +稹 +稻 +稼 +稽 +稿 +穂 +穆 +穉 +穏 +穑 +穗 +穣 +穰 +穴 +究 +穷 +穹 +空 +穿 +突 +窃 +窄 +窅 +窆 +窈 +窊 +窋 +窍 +窑 +窒 +窓 +窕 +窖 +窗 +窘 +窜 +窝 +窟 +窠 +窣 +窥 +窦 +窨 +窭 +窰 +窳 +窸 +窿 +立 +竑 +竖 +站 +竜 +竝 +竞 +竟 +章 +竣 +童 +竦 +竭 +端 +竹 +竺 +竽 +竿 +笃 +笄 +笆 +笈 +笊 +笋 +笏 +笑 +笔 +笕 +笙 +笛 +笞 +笠 +笤 +笥 +符 +笨 +笪 +笫 +第 +笮 +笱 +笳 +笸 +笹 +笺 +笼 +笾 +筅 +筇 +等 +筊 +筋 +筌 +筏 +筐 +筑 +筒 +答 +策 +筘 +筚 +筛 +筜 +筝 +筠 +筭 +筮 +筯 +筱 +筲 +筴 +筵 +筷 +筹 +筼 +签 +简 +箅 +箌 +箍 +箐 +箓 +箔 +箕 +算 +箜 +箝 +管 +箦 +箧 +箨 +箩 +箪 +箫 +箬 +箭 +箱 +箴 +箸 +篁 +篆 +篇 +篌 +篑 +篓 +篙 +篚 +篝 +篡 +篥 +篦 +篪 +篮 +篯 +篱 +篷 +篼 +篾 +簃 +簇 +簋 +簌 +簏 +簕 +簖 +簟 +簠 +簦 +簧 +簪 +簰 +簸 +簿 +籀 +籁 +籍 +籓 +籙 +米 +籴 +籺 +类 +籼 +籽 +粄 +粉 +粑 +粒 +粕 +粗 +粘 +粜 +粝 +粞 +粟 +粢 +粤 +粥 +粦 +粧 +粪 +粬 +粮 +粱 +粲 +粳 +粹 +粼 +粽 +精 +粿 +糁 +糅 +糊 +糌 +糍 +糕 +糖 +糗 +糙 +糜 +糟 +糠 +糨 +糬 +糯 +糸 +系 +紊 +紘 +素 +索 +紧 +紫 +紬 +紮 +累 +経 +絜 +絪 +絮 +絵 +絶 +絷 +絺 +綎 +綖 +継 +続 +綝 +綦 +綫 +綮 +総 +緑 +緾 +縁 +縂 +縠 +縢 +縻 +繁 +繇 +繋 +繸 +繻 +纁 +纂 +纚 +纛 +纟 +纠 +纡 +红 +纣 +纤 +纥 +约 +级 +纨 +纩 +纪 +纫 +纬 +纭 +纮 +纯 +纰 +纱 +纲 +纳 +纵 +纶 +纷 +纸 +纹 +纺 +纻 +纽 +纾 +线 +绀 +绁 +绂 +练 +组 +绅 +细 +织 +终 +绉 +绊 +绌 +绍 +绎 +经 +绐 +绑 +绒 +结 +绔 +绕 +绗 +绘 +给 +绚 +绛 +络 +绝 +绞 +统 +绠 +绡 +绢 +绣 +绥 +绦 +继 +绨 +绩 +绪 +绫 +续 +绮 +绯 +绰 +绱 +绲 +绳 +维 +绵 +绶 +绷 +绸 +绹 +绺 +绻 +综 +绽 +绾 +绿 +缀 +缁 +缂 +缃 +缄 +缅 +缆 +缇 +缈 +缉 +缊 +缋 +缌 +缍 +缎 +缐 +缑 +缒 +缓 +缔 +缕 +编 +缗 +缘 +缙 +缚 +缛 +缜 +缝 +缟 +缠 +缡 +缢 +缣 +缤 +缥 +缦 +缧 +缨 +缩 +缪 +缫 +缬 +缭 +缮 +缯 +缰 +缱 +缲 +缳 +缴 +缵 +缶 +缷 +缸 +缺 +罂 +罃 +罄 +罅 +罍 +罐 +网 +罔 +罕 +罗 +罘 +罚 +罝 +罟 +罠 +罡 +罢 +罣 +罥 +罨 +罩 +罪 +置 +罱 +署 +罴 +罹 +罽 +罾 +羁 +羊 +羌 +美 +羑 +羔 +羕 +羚 +羝 +羞 +羟 +羡 +羣 +群 +羧 +羮 +羯 +羰 +羲 +羸 +羹 +羼 +羽 +羿 +翀 +翁 +翃 +翅 +翈 +翊 +翌 +翎 +翔 +翕 +翘 +翙 +翚 +翛 +翟 +翠 +翡 +翥 +翦 +翩 +翫 +翮 +翰 +翱 +翳 +翻 +翼 +翾 +耀 +老 +考 +耄 +者 +耆 +耋 +而 +耍 +耐 +耒 +耔 +耕 +耗 +耘 +耙 +耜 +耦 +耧 +耨 +耩 +耪 +耳 +耵 +耶 +耷 +耸 +耻 +耽 +耿 +聂 +聃 +聆 +聊 +聋 +职 +聍 +聒 +联 +聕 +聘 +聚 +聡 +聩 +聪 +聴 +聼 +聿 +肃 +肄 +肆 +肇 +肉 +肋 +肌 +肓 +肖 +肘 +肚 +肛 +肜 +肝 +肟 +肠 +股 +肢 +肤 +肥 +肩 +肪 +肫 +肮 +肯 +肱 +育 +肴 +肸 +肺 +肼 +肽 +肾 +肿 +胀 +胁 +胂 +胃 +胄 +胆 +背 +胍 +胎 +胖 +胗 +胙 +胚 +胛 +胜 +胝 +胞 +胠 +胡 +胤 +胥 +胧 +胨 +胪 +胫 +胬 +胭 +胯 +胰 +胱 +胳 +胴 +胶 +胸 +胺 +胼 +能 +脁 +脂 +脆 +脇 +脉 +脊 +脍 +脏 +脐 +脑 +脒 +脓 +脔 +脖 +脘 +脚 +脞 +脢 +脩 +脬 +脯 +脱 +脲 +脳 +脷 +脸 +脾 +脿 +腆 +腈 +腊 +腋 +腌 +腐 +腑 +腓 +腔 +腕 +腘 +腙 +腚 +腠 +腥 +腧 +腩 +腭 +腮 +腰 +腱 +腴 +腹 +腺 +腻 +腼 +腾 +腿 +膀 +膂 +膈 +膊 +膏 +膑 +膘 +膛 +膜 +膝 +膦 +膨 +膳 +膺 +膻 +臀 +臁 +臂 +臃 +臆 +臊 +臌 +臑 +臓 +臜 +臞 +臣 +臧 +自 +臬 +臭 +至 +致 +臵 +臻 +臼 +臾 +舀 +舁 +舂 +舄 +舅 +舆 +舌 +舍 +舎 +舐 +舒 +舔 +舖 +舘 +舛 +舜 +舞 +舟 +舡 +舢 +舣 +舨 +航 +舫 +般 +舯 +舰 +舱 +舲 +舳 +舴 +舵 +舶 +舷 +舸 +船 +舺 +舻 +舾 +艄 +艇 +艉 +艋 +艏 +艘 +艟 +艨 +艮 +良 +艰 +色 +艳 +艹 +艺 +艽 +艾 +艿 +节 +芃 +芄 +芈 +芊 +芋 +芍 +芎 +芏 +芑 +芒 +芗 +芘 +芙 +芜 +芝 +芟 +芡 +芣 +芤 +芥 +芦 +芨 +芩 +芪 +芫 +芬 +芭 +芮 +芯 +芰 +花 +芳 +芴 +芶 +芷 +芸 +芹 +芽 +芾 +苁 +苄 +苇 +苈 +苊 +苋 +苌 +苍 +苎 +苏 +苑 +苒 +苓 +苔 +苕 +苗 +苘 +苛 +苜 +苞 +苟 +苡 +苢 +苣 +苤 +若 +苦 +苫 +苭 +苯 +英 +苴 +苷 +苹 +苺 +苻 +苼 +苾 +茀 +茁 +茂 +范 +茄 +茅 +茆 +茇 +茈 +茉 +茌 +茎 +茏 +茑 +茔 +茕 +茗 +茚 +茛 +茜 +茝 +茧 +茨 +茫 +茬 +茭 +茯 +茱 +茳 +茴 +茵 +茶 +茸 +茹 +茺 +茼 +荀 +荃 +荄 +荅 +荆 +荇 +荈 +草 +荏 +荐 +荑 +荒 +荔 +荖 +荘 +荚 +荛 +荜 +荞 +荟 +荠 +荡 +荣 +荤 +荥 +荦 +荧 +荨 +荩 +荪 +荫 +荬 +荭 +荮 +药 +荳 +荷 +荸 +荻 +荼 +荽 +莃 +莅 +莆 +莉 +莎 +莒 +莓 +莘 +莙 +莛 +莜 +莞 +莠 +莨 +莩 +莪 +莫 +莱 +莲 +莳 +莴 +莶 +获 +莸 +莹 +莺 +莼 +莽 +菀 +菁 +菂 +菅 +菇 +菈 +菉 +菊 +菌 +菏 +菑 +菓 +菔 +菖 +菘 +菜 +菝 +菟 +菠 +菡 +菩 +菪 +菫 +菰 +菱 +菲 +菸 +菹 +菽 +菿 +萁 +萃 +萄 +萆 +萋 +萌 +萍 +萎 +萏 +萑 +萘 +萜 +萝 +萢 +萤 +营 +萦 +萧 +萨 +萩 +萱 +萸 +萼 +落 +葆 +葎 +葑 +葖 +著 +葙 +葚 +葛 +葜 +葡 +葢 +董 +葩 +葫 +葬 +葭 +葱 +葳 +葵 +葶 +葸 +葺 +蒂 +蒇 +蒉 +蒋 +蒌 +蒍 +蒎 +蒐 +蒗 +蒙 +蒜 +蒟 +蒡 +蒨 +蒯 +蒲 +蒴 +蒸 +蒹 +蒺 +蒻 +蒽 +蒾 +蒿 +蓁 +蓂 +蓄 +蓇 +蓉 +蓊 +蓍 +蓐 +蓑 +蓓 +蓖 +蓝 +蓟 +蓠 +蓢 +蓣 +蓥 +蓦 +蓬 +蓼 +蓿 +蔀 +蔌 +蔑 +蔓 
+蔗 +蔘 +蔚 +蔟 +蔡 +蔫 +蔬 +蔴 +蔵 +蔷 +蔸 +蔹 +蔺 +蔻 +蔼 +蔽 +蕃 +蕅 +蕈 +蕉 +蕊 +蕖 +蕗 +蕙 +蕞 +蕡 +蕤 +蕨 +蕰 +蕲 +蕴 +蕹 +蕺 +蕻 +蕾 +薄 +薅 +薆 +薇 +薏 +薙 +薛 +薜 +薢 +薤 +薨 +薪 +薫 +薬 +薮 +薯 +薰 +薷 +薹 +藁 +藉 +藏 +藐 +藓 +藕 +藜 +藟 +藠 +藤 +藦 +藨 +藩 +藻 +藿 +蘅 +蘑 +蘖 +蘧 +蘩 +蘸 +蘼 +虎 +虏 +虐 +虑 +虒 +虓 +虔 +虚 +虞 +虢 +虫 +虬 +虮 +虱 +虹 +虺 +虻 +虽 +虾 +虿 +蚀 +蚁 +蚂 +蚊 +蚋 +蚌 +蚍 +蚓 +蚕 +蚖 +蚜 +蚝 +蚡 +蚣 +蚤 +蚧 +蚨 +蚩 +蚪 +蚬 +蚯 +蚰 +蚱 +蚴 +蚵 +蚶 +蚺 +蛀 +蛄 +蛆 +蛇 +蛉 +蛊 +蛋 +蛎 +蛏 +蛐 +蛔 +蛙 +蛛 +蛞 +蛟 +蛤 +蛩 +蛭 +蛮 +蛰 +蛱 +蛲 +蛳 +蛴 +蛸 +蛹 +蛾 +蜀 +蜂 +蜃 +蜇 +蜈 +蜉 +蜊 +蜍 +蜑 +蜒 +蜓 +蜕 +蜗 +蜘 +蜚 +蜜 +蜞 +蜡 +蜢 +蜣 +蜥 +蜩 +蜮 +蜱 +蜴 +蜷 +蜻 +蜾 +蜿 +蝇 +蝈 +蝉 +蝌 +蝎 +蝓 +蝗 +蝙 +蝠 +蝣 +蝤 +蝥 +蝮 +蝰 +蝲 +蝴 +蝶 +蝻 +蝼 +蝽 +蝾 +螂 +螃 +螅 +螈 +螋 +融 +螓 +螟 +螣 +螨 +螫 +螬 +螭 +螯 +螳 +螵 +螺 +螽 +蟀 +蟆 +蟊 +蟋 +蟌 +蟑 +蟒 +蟛 +蟜 +蟠 +蟥 +蟪 +蟮 +蟳 +蟹 +蟾 +蠃 +蠊 +蠋 +蠍 +蠓 +蠕 +蠖 +蠡 +蠢 +蠲 +蠹 +蠼 +血 +衄 +衅 +衆 +行 +衍 +衎 +衔 +街 +衙 +衞 +衡 +衢 +衣 +补 +表 +衩 +衪 +衫 +衬 +衮 +衰 +衲 +衷 +衽 +衾 +衿 +袁 +袂 +袄 +袅 +袆 +袈 +袋 +袍 +袒 +袓 +袖 +袛 +袜 +袢 +袤 +袪 +被 +袭 +袮 +袱 +袴 +袷 +袼 +裁 +裂 +装 +裆 +裈 +裉 +裒 +裔 +裕 +裘 +裙 +裛 +裟 +裡 +裢 +裤 +裥 +裨 +裪 +裱 +裳 +裴 +裸 +裹 +裼 +裾 +褀 +褂 +褆 +褊 +褐 +褒 +褓 +褔 +褙 +褚 +褛 +褡 +褥 +褪 +褫 +褰 +褴 +褶 +襁 +襄 +襌 +襕 +襜 +襞 +襟 +襦 +襻 +西 +要 +覃 +覆 +覇 +覚 +覧 +覩 +観 +见 +观 +规 +觅 +视 +觇 +览 +觉 +觊 +觋 +觌 +觎 +觏 +觐 +觑 +角 +觚 +觜 +觞 +解 +觥 +触 +觯 +觳 +觽 +觿 +言 +訇 +訏 +訚 +訫 +訳 +訾 +詈 +詜 +詝 +詧 +詹 +誉 +誊 +誐 +誓 +説 +読 +諡 +諲 +諴 +謇 +謦 +譞 +警 +譬 +譲 +讌 +讠 +计 +订 +讣 +认 +讥 +讦 +讧 +讨 +让 +讪 +讫 +讬 +训 +议 +讯 +记 +讲 +讳 +讴 +讵 +讶 +讷 +许 +讹 +论 +讼 +讽 +设 +访 +诀 +证 +诂 +诃 +评 +诅 +识 +诈 +诉 +诊 +诋 +诌 +词 +诎 +诏 +译 +诒 +诓 +诔 +试 +诖 +诗 +诘 +诙 +诚 +诛 +诜 +话 +诞 +诟 +诠 +诡 +询 +诣 +诤 +该 +详 +诧 +诨 +诩 +诫 +诬 +语 +诮 +误 +诰 +诱 +诲 +诳 +说 +诵 +诶 +请 +诸 +诹 +诺 +读 +诼 +诽 +课 +诿 +谀 +谁 +谂 +调 +谄 +谅 +谆 +谇 +谈 +谊 +谋 +谌 +谍 +谎 +谏 +谐 +谑 +谒 +谓 +谔 +谕 +谖 +谗 +谘 +谙 +谚 +谛 +谜 +谝 +谞 +谟 +谠 +谡 +谢 +谣 +谤 +谥 +谦 +谧 +谨 +谩 +谪 +谫 +谬 +谭 +谮 +谯 +谰 +谱 +谲 +谳 +谴 +谵 +谶 +谷 +谿 +豁 +豆 +豇 +豉 +豊 +豌 +豕 +豚 +象 +豢 +豨 +豪 +豫 +豳 +豸 +豹 +豺 +貂 +貅 +貉 +貊 +貌 +貐 +貔 +貘 +賨 +賸 +贇 +贝 +贞 +负 +贠 +贡 +财 +责 +贤 +败 +账 +货 +质 +贩 +贪 +贫 +贬 +购 +贮 +贯 +贰 +贱 +贲 +贳 +贴 +贵 +贶 +贷 +贸 +费 +贺 +贻 +贼 +贽 +贾 +贿 +赀 +赁 +赂 +赃 +资 +赅 +赈 +赉 +赊 +赋 +赌 +赍 +赎 +赏 +赐 +赑 +赓 +赔 +赕 +赖 +赘 +赙 +赚 +赛 +赜 +赝 +赞 +赟 +赠 +赡 +赢 +赣 +赤 +赦 +赧 +赪 +赫 +赭 +赮 +走 +赳 +赴 +赵 +赶 +起 +趁 +趄 +超 +越 +趋 +趔 +趟 +趣 +趯 +趱 +足 +趴 +趵 +趸 +趺 +趼 +趾 +趿 +跂 +跃 +跄 +跆 +跋 +跌 +跎 +跏 +跑 +跖 +跗 +跚 +跛 +距 +跞 +跟 +跢 +跣 +跤 +跨 +跩 +跪 +跫 +跬 +路 +跳 +践 +跶 +跷 +跸 +跹 +跺 +跻 +跽 +踅 +踉 +踊 +踌 +踏 +踔 +踝 +踞 +踟 +踢 +踣 +踩 +踪 +踬 +踮 +踯 +踰 +踱 +踵 +踹 +踽 +蹀 +蹁 +蹂 +蹄 +蹇 +蹈 +蹉 +蹊 +蹋 +蹑 +蹒 +蹓 +蹙 +蹚 +蹟 +蹦 +蹩 +蹬 +蹭 +蹰 +蹲 +蹴 +蹶 +蹻 +蹼 +蹿 +躁 +躄 +躅 +躇 +躏 +躐 +躔 +躜 +躞 +身 +躬 +躯 +躲 +躺 +転 +軽 +輋 +轘 +车 +轧 +轨 +轩 +轫 +转 +轭 +轮 +软 +轰 +轱 +轲 +轳 +轴 +轵 +轶 +轸 +轹 +轺 +轻 +轼 +载 +轾 +轿 +辂 +较 +辄 +辅 +辆 +辇 +辈 +辉 +辊 +辋 +辍 +辎 +辏 +辐 +辑 +输 +辔 +辕 +辖 +辗 +辘 +辙 +辚 +辛 +辜 +辞 +辟 +辣 +辨 +辩 +辫 +辰 +辱 +辶 +边 +辺 +辻 +込 +辽 +达 +辿 +迁 +迂 +迄 +迅 +过 +迈 +迍 +迎 +运 +近 +迓 +返 +迕 +还 +这 +进 +远 +违 +连 +迟 +迢 +迤 +迥 +迦 +迨 +迩 +迪 +迫 +迭 +迮 +述 +迳 +迷 +迸 +迹 +追 +退 +送 +适 +逃 +逄 +逅 +逆 +选 +逊 +逋 +逍 +透 +逐 +逑 +递 +途 +逖 +逗 +這 +通 +逛 +逝 +逞 +速 +造 +逡 +逢 +逦 +逭 +逮 +逯 +進 +逵 +逶 +逸 +逹 +逺 +逻 +逼 +逾 +遁 +遂 +遄 +遇 +遍 +遏 +遐 +遑 +遒 +道 +遗 +遘 +遛 +遢 +遣 +遥 +遨 +遭 +遮 +遯 +遴 +遵 +遶 +遹 +遽 +避 +邀 +邂 +邃 +邅 +邈 +邉 +邋 +邑 +邓 +邕 +邗 +邙 +邛 +邝 +邠 +邡 +邢 +那 +邦 +邨 +邪 +邬 +邮 +邯 +邰 +邱 +邳 +邴 +邵 +邶 +邸 +邹 +邺 +邻 +邽 +邾 +郁 +郄 +郅 +郇 +郈 +郊 +郎 +郏 +郐 +郑 +郓 +郕 +郗 +郚 +郛 +郜 +郝 +郞 +郡 +郢 +郤 +郦 +郧 +部 +郪 +郫 +郭 +郯 +郴 +郷 +郸 +都 +郾 +郿 +鄀 +鄂 +鄄 +鄗 +鄘 +鄙 +鄚 +鄜 +鄞 +鄠 +鄢 +鄣 +鄩 +鄫 +鄮 +鄯 +鄱 +鄹 +酂 +酃 +酆 +酉 +酊 +酋 +酌 +配 +酎 +酐 +酒 +酗 +酚 +酝 +酞 +酡 +酢 +酣 +酤 +酥 +酩 +酪 +酬 +酮 +酯 +酰 +酱 +酲 +酴 +酵 +酶 +酷 +酸 +酹 +酺 +酽 +酾 +酿 +醂 +醅 +醇 +醉 +醋 +醌 +醍 +醐 +醑 +醒 +醚 +醛 +醢 +醣 +醪 +醭 +醮 +醯 +醴 +醵 +醺 +醿 +釆 +采 +釉 +释 +里 +重 +野 +量 +金 +釜 +釭 +釿 +鈇 +鈈 +鈊 +鈎 +鈡 +鉄 +鉏 +鉨 +鉴 +鉷 +銎 +銙 +銭 +銮 +鋆 +鋈 +鋐 +鋗 +鋬 +鋹 +錞 +錡 +錤 +録 +錾 +鍀 +鍪 +鎌 +鎏 +鎚 +鏊 +鏐 +鏖 +鐏 +鑨 +鑫 +钅 +钆 +钇 +针 +钉 +钊 +钋 +钌 +钍 +钎 +钏 +钐 +钒 +钓 +钔 +钕 +钖 +钗 +钙 +钚 +钛 +钜 +钝 +钞 +钟 +钠 +钡 +钢 +钣 +钤 +钥 +钦 +钧 +钨 +钩 +钪 +钫 +钬 
+钭 +钮 +钯 +钰 +钱 +钲 +钳 +钴 +钵 +钶 +钷 +钹 +钺 +钻 +钼 +钽 +钾 +钿 +铀 +铁 +铂 +铃 +铄 +铅 +铆 +铈 +铉 +铊 +铋 +铌 +铍 +铎 +铐 +铑 +铒 +铓 +铕 +铖 +铗 +铙 +铚 +铛 +铜 +铝 +铟 +铠 +铡 +铢 +铣 +铤 +铥 +铦 +铧 +铨 +铩 +铪 +铫 +铬 +铭 +铮 +铯 +铰 +铱 +铲 +铳 +铵 +银 +铷 +铸 +铺 +铼 +铽 +链 +铿 +销 +锁 +锂 +锃 +锄 +锅 +锆 +锇 +锈 +锉 +锋 +锌 +锍 +锎 +锏 +锐 +锑 +锒 +锔 +锕 +锖 +锗 +锘 +错 +锚 +锛 +锜 +锝 +锞 +锟 +锠 +锡 +锢 +锣 +锤 +锥 +锦 +锨 +锩 +锪 +锫 +锬 +锭 +键 +锯 +锰 +锱 +锲 +锳 +锴 +锵 +锶 +锷 +锸 +锹 +锺 +锻 +锼 +锽 +镀 +镁 +镂 +镄 +镅 +镆 +镇 +镈 +镉 +镊 +镋 +镌 +镍 +镎 +镏 +镐 +镑 +镒 +镓 +镔 +镕 +镖 +镗 +镘 +镙 +镚 +镛 +镜 +镝 +镞 +镟 +镠 +镡 +镢 +镣 +镥 +镦 +镧 +镨 +镪 +镫 +镬 +镭 +镯 +镰 +镱 +镲 +镳 +镶 +长 +開 +閟 +関 +閦 +闇 +闍 +闘 +门 +闩 +闪 +闫 +闭 +问 +闯 +闰 +闱 +闲 +闳 +间 +闵 +闶 +闷 +闸 +闹 +闺 +闻 +闼 +闽 +闾 +闿 +阀 +阁 +阂 +阃 +阄 +阅 +阆 +阇 +阈 +阉 +阊 +阋 +阌 +阍 +阎 +阏 +阐 +阑 +阒 +阓 +阔 +阕 +阖 +阗 +阙 +阚 +阛 +阜 +阝 +队 +阡 +阪 +阮 +阱 +防 +阳 +阴 +阵 +阶 +阻 +阼 +阿 +陀 +陂 +附 +际 +陆 +陇 +陈 +陉 +陋 +陌 +降 +限 +陔 +陕 +陛 +陞 +陟 +陡 +院 +除 +陨 +险 +陪 +陬 +陲 +陵 +陶 +陷 +険 +隂 +隅 +隆 +隈 +隋 +隍 +随 +隐 +隔 +隗 +隘 +隙 +障 +隠 +隣 +隧 +隰 +隳 +隶 +隹 +隼 +隽 +难 +雀 +雁 +雄 +雅 +集 +雇 +雉 +雊 +雌 +雍 +雎 +雏 +雑 +雒 +雕 +雠 +雨 +雩 +雪 +雫 +雯 +雱 +雳 +零 +雷 +雹 +雾 +需 +霁 +霂 +霄 +霅 +霆 +震 +霈 +霉 +霊 +霍 +霎 +霏 +霓 +霖 +霙 +霜 +霞 +霪 +霭 +霰 +露 +霸 +霹 +霾 +靑 +青 +靓 +靖 +静 +靛 +非 +靠 +靡 +面 +靥 +革 +靬 +靳 +靴 +靶 +靺 +靼 +鞅 +鞋 +鞍 +鞑 +鞔 +鞘 +鞞 +鞠 +鞣 +鞥 +鞨 +鞫 +鞬 +鞭 +鞮 +鞯 +鞴 +韘 +韡 +韦 +韧 +韩 +韪 +韫 +韬 +韭 +音 +韵 +韶 +頔 +頞 +頠 +頫 +頵 +頼 +顒 +顔 +顕 +顗 +页 +顶 +顷 +顸 +项 +顺 +须 +顼 +顽 +顾 +顿 +颀 +颁 +颂 +颃 +预 +颅 +领 +颇 +颈 +颉 +颊 +颋 +颌 +颍 +颎 +颏 +颐 +频 +颓 +颔 +颖 +颗 +题 +颙 +颚 +颛 +颜 +额 +颞 +颟 +颠 +颡 +颢 +颤 +颦 +颧 +风 +飏 +飐 +飑 +飒 +飓 +飕 +飖 +飘 +飙 +飚 +飞 +食 +飧 +飨 +餍 +餐 +餗 +餮 +饔 +饕 +饣 +饥 +饦 +饧 +饨 +饩 +饪 +饫 +饬 +饭 +饮 +饯 +饰 +饱 +饲 +饴 +饵 +饶 +饷 +饸 +饹 +饺 +饼 +饽 +饿 +馀 +馁 +馃 +馄 +馅 +馆 +馇 +馈 +馊 +馋 +馍 +馏 +馐 +馑 +馒 +馓 +馔 +馕 +首 +馗 +馘 +香 +馥 +馨 +駄 +駅 +駆 +騄 +騑 +騒 +験 +驎 +驒 +驩 +马 +驭 +驮 +驯 +驰 +驱 +驳 +驴 +驶 +驷 +驸 +驹 +驺 +驻 +驼 +驽 +驾 +驿 +骀 +骁 +骂 +骃 +骄 +骅 +骆 +骇 +骈 +骉 +骊 +骋 +验 +骍 +骎 +骏 +骐 +骑 +骓 +骕 +骖 +骗 +骘 +骙 +骚 +骛 +骜 +骝 +骞 +骟 +骠 +骡 +骢 +骤 +骥 +骧 +骨 +骰 +骱 +骶 +骷 +骸 +骺 +骼 +髀 +髁 +髂 +髃 +髅 +髈 +髋 +髌 +髎 +髑 +髓 +高 +髙 +髡 +髦 +髪 +髫 +髭 +髯 +髹 +髻 +鬃 +鬈 +鬐 +鬓 +鬘 +鬟 +鬣 +鬯 +鬲 +鬶 +鬻 +鬼 +魁 +魂 +魃 +魄 +魅 +魆 +魇 +魈 +魉 +魋 +魍 +魏 +魑 +魔 +魟 +鮀 +鮈 +鮋 +鮟 +鮠 +鮨 +鮰 +鰕 +鰤 +鱀 +鱇 +鱓 +鱬 +鱲 +鱻 +鱼 +鱿 +鲀 +鲁 +鲂 +鲃 +鲅 +鲆 +鲇 +鲈 +鲉 +鲊 +鲋 +鲌 +鲍 +鲎 +鲏 +鲐 +鲑 +鲔 +鲕 +鲗 +鲘 +鲙 +鲚 +鲛 +鲜 +鲞 +鲟 +鲠 +鲡 +鲢 +鲣 +鲤 +鲥 +鲦 +鲧 +鲨 +鲩 +鲫 +鲭 +鲮 +鲱 +鲲 +鲳 +鲴 +鲵 +鲶 +鲷 +鲸 +鲹 +鲺 +鲻 +鲼 +鲽 +鲿 +鳀 +鳃 +鳄 +鳅 +鳇 +鳉 +鳊 +鳌 +鳍 +鳎 +鳏 +鳐 +鳑 +鳓 +鳔 +鳕 +鳖 +鳗 +鳙 +鳚 +鳜 +鳝 +鳞 +鳟 +鳡 +鳢 +鳣 +鳯 +鳽 +鳾 +鴂 +鴞 +鴷 +鵖 +鵙 +鵟 +鵺 +鶒 +鶲 +鷇 +鷉 +鷟 +鸂 +鸊 +鸑 +鸟 +鸠 +鸡 +鸢 +鸣 +鸥 +鸦 +鸨 +鸩 +鸪 +鸫 +鸬 +鸭 +鸮 +鸯 +鸰 +鸱 +鸲 +鸳 +鸵 +鸶 +鸷 +鸸 +鸹 +鸺 +鸻 +鸽 +鸾 +鸿 +鹀 +鹁 +鹂 +鹃 +鹄 +鹅 +鹆 +鹇 +鹈 +鹉 +鹊 +鹋 +鹌 +鹍 +鹎 +鹏 +鹑 +鹓 +鹕 +鹖 +鹗 +鹘 +鹚 +鹛 +鹜 +鹞 +鹟 +鹠 +鹡 +鹢 +鹣 +鹤 +鹦 +鹧 +鹨 +鹩 +鹪 +鹫 +鹬 +鹭 +鹮 +鹰 +鹱 +鹳 +鹾 +鹿 +麂 +麇 +麈 +麋 +麐 +麑 +麒 +麓 +麝 +麟 +麤 +麦 +麴 +麸 +麹 +麻 +麽 +麾 +麿 +黄 +黉 +黍 +黎 +黏 +黐 +黑 +黒 +黔 +默 +黙 +黛 +黜 +黝 +黟 +黠 +黡 +黢 +黥 +黧 +黩 +黯 +黻 +黼 +黾 +鼋 +鼍 +鼎 +鼐 +鼓 +鼗 +鼙 +鼠 +鼢 +鼩 +鼬 +鼯 +鼱 +鼷 +鼹 +鼻 +鼽 +鼾 +齁 +齐 +齑 +齢 +齮 +齿 +龀 +龁 +龃 +龄 +龅 +龆 +龇 +龈 +龉 +龊 +龋 +龌 +龑 +龘 +龙 +龚 +龛 +龟 +龠 +龢 +A +B +C +D +E +F +G +H +I +J +K +L +M +N +O +P +Q +R +S +T +U +V +W +X +Y +Z +a +b +c +d +e +f +g +h +i +j +k +l +m +n +o +p +q +r +s +t +u +v +w +x +y +z +AA +AB +AC +AD +AE +AF +AG +AH +AI +AJ +AK +AL +AM +AN +AP +AQ +AR +AS +AT +AU +AV +AW +AX +AZ +Al +An +Au +Aw +BA +BB +BC +BD +BE +BF +BG +BH +BI +BJ +BK +BL +BM +BN +BO +BP +BQ +BR +BS +BT +BU +BV +BW +BY +Bo +Br +Bu +CA +CB +CC +CD +CE +CF +CG +CH +CI +CJ +CK +CL +CM +CN +CO +CP +CQ +CR +CS +CT +CU +CV +CW +CX +CY +CZ +Ca +Ch +Cl +Co +Cu +DA +DB +DC +DD +DE +DF +DG +DH +DI +DJ +DK +DL +DM +DN +DO +DQ +DR +DS +DT +DV +DW +DX +DY +DZ +Da +De +Di +Do +Dr +Du +EA +EB +EC +ED +EE +EF +EG +EH +EI +EK +EL +EM +EN +EP +EQ +ER +ES +ET +EU +EV +EW +EX +EZ +Ed +En +Ev +Ex +FA +FB +FC +FD +FE +FF +FG 
+FH +FI +FJ +FL +FM +FN +FO +FP +FR +FS +FT +FU +FW +FX +FY +FZ +Fa +Fi +Fl +Fo +Fr +Fu +GA +GB +GC +GD +GE +GF +GG +GH +GI +GJ +GK +GL +GM +GN +GO +GP +GQ +GR +GS +GT +GU +GW +GX +GY +GZ +Ga +Go +Gr +Gu +HA +HB +HC +HD +HE +HF +HG +HH +HI +HJ +HK +HL +HO +HP +HQ +HR +HS +HT +HU +HV +HW +HX +HY +HZ +Ha +He +Hi +Ho +Hu +Hz +IB +IC +ID +IE +IF +IG +IH +II +IK +IL +IM +IN +IO +IP +IQ +IR +IS +IT +IU +IV +IX +If +In +JA +JB +JC +JD +JF +JG +JH +JI +JJ +JK +JL +JM +JO +JP +JQ +JR +JS +JT +JU +JW +JX +JY +JZ +Ja +Ji +Jo +Ju +KA +KB +KC +KD +KE +KF +KG +KH +KI +KJ +KK +KL +KM +KN +KO +KP +KR +KS +KT +KV +KW +KX +KY +KZ +LA +LB +LC +LD +LE +LF +LG +LH +LI +LJ +LK +LL +LM +LN +LO +LP +LQ +LR +LS +LT +LU +LV +LW +LX +LY +LZ +La +Le +Li +Lo +Lu +MA +MB +MC +MD +ME +MF +MG +MH +MI +MJ +MK +ML +MM +MN +MO +MP +MQ +MR +MS +MT +MU +MV +MW +MX +MY +Ma +Me +Mi +Mo +Mu +My +NA +NB +NC +ND +NE +NF +NG +NH +NI +NJ +NK +NL +NN +NO +NP +NR +NS +NT +NU +NV +NW +NX +NY +NZ +Na +Ne +No +Nu +OA +OB +OC +OD +OE +OF +OG +OH +OK +OL +OM +ON +OO +OP +OR +OS +OT +OU +OV +OZ +Of +Oh +On +Op +Or +Ou +Ox +PA +PB +PC +PD +PE +PF +PG +PH +PI +PJ +PK +PL +PM +PN +PO +PP +PQ +PR +PS +PT +PU +PV +PW +PX +Pa +Ph +Pl +Po +Pr +Pu +QA +QB +QC +QE +QF +QG +QJ +QL +QQ +QR +QS +QT +QU +QW +QY +Qi +Qu +RA +RB +RC +RE +RF +RG +RH +RI +RJ +RK +RL +RM +RN +RO +RP +RQ +RR +RS +RT +RV +RW +RX +RZ +Ra +Re +Ro +Ru +SA +SB +SC +SD +SE +SF +SG +SH +SI +SJ +SK +SL +SM +SN +SO +SP +SQ +SR +SS +ST +SU +SV +SW +SX +SY +SZ +Sc +Sh +So +Sp +St +Su +Sw +Sy +TA +TB +TC +TD +TE +TF +TG +TH +TI +TJ +TK +TL +TM +TN +TO +TP +TQ +TR +TS +TT +TU +TV +TW +TX +TY +TZ +Th +To +Tr +Tw +UA +UC +UD +UE +UF +UG +UH +UI +UK +UL +UM +UN +UP +UR +US +UT +UU +UV +UW +UX +Ub +Un +Up +VA +VB +VC +VE +VF +VG +VH +VI +VJ +VK +VL +VM +VN +VO +VP +VR +VS +VT +VU +VV +VX +Vi +Vo +WA +WB +WC +WE +WH +WI +WJ +WL +WM +WN +WO +WQ +WR +WS +WT +WU +WW +WX +WZ +Wa +We +Wi +Wo +Wu +XB +XC +XD +XF +XG +XH +XI +XJ +XK +XL +XM +XO +XP +XQ +XR +XS +XT +XU +XV +XW +XX +XY +XZ +Xi +Xu +YA +YB +YC +YD +YE +YF +YG +YH +YJ +YL +YM +YO +YP +YS +YT +YU +YX +YY +YZ +Ya +Yo +Yu +ZA +ZB +ZC +ZD +ZE +ZF +ZG +ZH +ZI +ZJ +ZL +ZM +ZN +ZO +ZQ +ZR +ZS +ZU +ZW +ZX +ZY +ZZ +Zh +ab +aj +an +ap +ar +bb +be +bj +bo +bu +by +ca +cb +cf +ch +cl +cm +co +cp +cv +dB +da +de +di +dj +dn +do +dr +dv +ed +em +en +ep +eq +ev +ex +ez +fa +fe +ff +fi +fl +fo +fr +fu +gb +gd +gh +gi +go +gp +gr +gu +gz +ha +he +hi +ho +hp +hz +iP +iT +ib +ic +id +if +ig +im +in +io +ip +iq +is +it +jQ +ja +ji +jj +jo +jq +ju +kJ +kN +kW +kg +kn +kz +la +ld +le +lg +li +ll +lo +lp +lz +ma +mb +me +mi +mm +mo +mp +mq +mu +mv +my +na +nb +ng +no +nv +ob +of +oh +ok +ol +on +op +or +ou +ow +oz +pH +pa +pc +ph +pk +pl +po +pp +pr +pu +pv +qf +qq +qu +qz +ra +re +rn +ro +rq +se +sh +sk +so +sp +sq +st +su +sw +sz +th +ti +to +tr +tv +tw +ub +uc +uf +ui +uk +un +up +us +uv +ux +uz +vc +vi +vo +vr +wa +we +wh +wi +wo +wr +ww +xj +xq +xx +ya +ye +yj +yo +yu +yy +yz +zf +zh +zi +zj +zq +zu +zz +AAA +AAC +ABA +ABB +ABC +ABO +ABS +ABT +ACA +ACC +ACD +ACE +ACG +ACK +ACL +ACM +ACP +ACR +ACS +ACT +ADA +ADC +ADD +ADF +ADI +ADO +ADP +ADR +ADS +ADV +AED +AES +AFC +AFP +AFS +AGB +AGC +AGE +AGM +AGP +AGV +AIA +AIC +AIG +AIM +AIP +AIR +AIS +AIX +AKB +AKM +ALA +ALL +ALT +AMA +AMC +AMD +AMG +AMI +AML +AMP +AMR +AMS +AMT +AMX +AND +AOC +AOE +AOL +APA +APC +APE +APG +API +APK +APL +APM +APP +APS +APT +APU +ARA +ARC +ARE +ARM +ARP +ART +ASA +ASC +ASF +ASM +ASP +ASR +AST +ATA +ATC +ATF +ATI +ATK +ATM +ATP +ATS +ATV +ATX +AUC +AUG +AUX +AVC +AVG +AVI +AVR +AVS 
+AVX +AWM +AWS +All +And +Ang +App +Aqu +BAC +BAD +BAE +BAR +BAT +BAU +BBA +BBB +BBC +BBE +BBQ +BBS +BBT +BCD +BEA +BEC +BEI +BET +BGA +BGM +BGP +BIG +BIM +BIS +BLG +BMC +BMD +BMG +BMI +BMP +BMW +BMX +BNC +BOD +BOM +BOT +BOX +BOY +BPM +BPO +BRN +BRT +BSA +BSC +BSD +BSI +BSM +BSP +BSS +BTC +BTR +BTS +BTV +BUG +BUN +BUS +BWV +Bur +Bus +But +CAA +CAC +CAD +CAE +CAI +CAJ +CAM +CAN +CAP +CAR +CAS +CAT +CBA +CBC +CBD +CBN +CBR +CBS +CCA +CCC +CCD +CCF +CCG +CCI +CCK +CCM +CCN +CCP +CCS +CDC +CDM +CDN +CDO +CDP +CDR +CDS +CEA +CEC +CEO +CES +CET +CFA +CFC +CFD +CFO +CFR +CGI +CHA +CHM +CHO +CIA +CIC +CID +CIE +CIF +CIK +CIO +CIP +CIS +CLA +CLI +CLM +CLS +CMA +CMC +CME +CML +CMM +CMO +CMP +CMS +CMV +CNC +CNG +CNN +CNS +COB +COC +COD +COM +CON +COO +COP +COS +COX +CPA +CPC +CPE +CPI +CPL +CPM +CPP +CPR +CPS +CPU +CQC +CRC +CRM +CRP +CRS +CRT +CSA +CSF +CSI +CSM +CSP +CSR +CSS +CST +CTA +CTC +CTI +CTO +CTP +CTS +CUB +CUT +CVD +CVN +CVS +CVT +CXW +CYP +Car +Cha +Chr +Chu +Com +Con +Cou +Cur +DAB +DAC +DAO +DAS +DAT +DAY +DBA +DBM +DCD +DCE +DCF +DCS +DCT +DDC +DDD +DDG +DDN +DDR +DDS +DDT +DEA +DEC +DEM +DES +DFS +DFT +DHA +DHL +DIC +DID +DIF +DIN +DIP +DIV +DIY +DLC +DLL +DLP +DLT +DMA +DMC +DMD +DMF +DMI +DMO +DMZ +DNA +DNF +DNS +DNV +DOC +DOI +DOM +DON +DOS +DOT +DPI +DPP +DPS +DRM +DRX +DSA +DSC +DSG +DSL +DSM +DSP +DSS +DTC +DTE +DTM +DTS +DTU +DVB +DVD +DVI +DVR +DWG +DYG +Day +Div +Don +Dou +Dow +EAN +EAP +EBD +EBS +ECC +ECM +ECO +ECT +ECU +ECW +EDA +EDG +EDI +EDM +EDP +EDR +EEG +EEP +EFR +EGF +EHS +EIA +EJB +EMA +EMC +EMI +EMP +EMS +END +EOS +EPA +EPC +EPO +EPR +EPS +ERP +ESC +ESD +ESI +ESL +ESP +ESR +EST +ETC +ETF +ETH +ETL +ETS +EVA +EVE +EVO +EXE +EXO +EXP +EYE +Eff +Ell +Emb +Emp +End +Eng +Equ +Eur +Eva +Exc +Exp +FAA +FAB +FAG +FAL +FAN +FAO +FAQ +FAT +FBI +FCA +FCC +FCI +FCS +FDA +FDD +FDI +FEM +FES +FET +FFT +FGO +FHD +FIA +FLV +FLY +FMS +FNC +FOB +FOF +FOR +FOX +FPC +FPS +FPX +FRP +FSA +FSB +FSC +FSH +FTA +FTC +FTP +FUE +FUN +Fin +Fiv +Fly +For +Fou +Fuj +Fun +Fut +GAP +GAT +GAY +GBA +GBK +GBT +GBU +GCC +GCS +GCT +GDI +GDP +GEN +GEO +GET +GFP +GHz +GIA +GIF +GIS +GLA +GLC +GLP +GLS +GMA +GMC +GMP +GMS +GMT +GMV +GND +GNP +GNU +GOD +GOT +GPA +GPL +GPS +GPT +GPU +GRC +GRE +GRF +GSH +GSM +GSP +GTA +GTI +GTO +GTP +GTR +GTS +GTX +GUI +Giv +Gmb +Gua +Gui +Gun +Guo +Guy +HAD +HAL +HBA +HBO +HBV +HBs +HCG +HCI +HCV +HCl +HDD +HDL +HDR +HDV +HEY +HFC +HGH +HGT +HID +HIP +HIS +HIT +HIV +HLA +HMG +HMI +HMS +HOP +HOT +HOW +HPC +HPV +HRC +HRT +HSE +HSK +HSV +HTC +HUB +HUD +HVG +Haz +Her +Hom +Hon +Hou +How +Hua +Hub +Hum +Hun +IAI +IAS +IAT +IBC +IBF +IBM +ICA +ICC +ICD +ICE +ICO +ICP +ICQ +ICS +ICT +ICU +IDC +IDD +IDE +IDF +IDG +IDS +IEC +IET +IFA +IFC +IFI +IFN +IGF +IGN +IIA +III +IIS +IKO +IMA +IMC +IMD +IME +IMF +IMG +IMO +IMS +IMT +INA +INC +INF +ING +INS +INT +IOS +IPA +IPC +IPO +IPS +IPX +IRC +IRI +ISA +ISI +ISM +ISO +ISP +ITC +ITF +ITO +ITS +ITT +ITU +ITV +IVR +Imp +InC +Inf +Inj +Int +JAR +JBL +JBT +JCB +JCR +JDB +JDG +JET +JGJ +JIS +JIT +JKL +JOE +JPG +JSF +JSP +JST +JTA +JVC +JVM +JYJ +JYP +Jac +Jam +Jan +Jap +Jav +Jay +Jin +Joh +Jon +Jul +Jun +Jus +KAB +KAT +KBS +KDF +KDJ +KEY +KFC +KFR +KID +KIS +KJm +KOF +KOH +KOL +KPI +KPL +KTV +KVM +Kin +Kon +Kur +LAB +LAN +LBS +LCA +LCD +LCK +LCS +LDA +LDH +LDL +LDP +LED +LEE +LEO +LES +LET +LGA +LGD +LIN +LIU +LLC +LME +LMS +LNG +LOF +LOL +LOW +LPG +LPL +LPR +LRC +LSA +LSD +LSI +LSP +LTD +LTE +LUC +LUN +LVM +Laz +Lib +Lif +Lin +Liu +Liz +Lon +Lou +Low +Luc +Lum +Luo +Lux +MAC +MAD +MAG +MAN +MAO +MAP +MAR +MAS +MAT +MAX +MAY +MBA +MBC +MBO +MBR 
+MBS +MCA +MCC +MCM +MCN +MCP +MCS +MCU +MDA +MDI +MDL +MDR +MDS +MEN +MES +MFA +MFC +MHC +MHz +MIB +MIC +MID +MIL +MIN +MIS +MIT +MIX +MKV +MLC +MLF +MMA +MMC +MMI +MMO +MMS +MMX +MOD +MOM +MOS +MOV +MPA +MPC +MPG +MPI +MPS +MPV +MPa +MRC +MRI +MRO +MRP +MSA +MSC +MSI +MSN +MTI +MTK +MTS +MTU +MTV +MVC +MVP +Mac +Mag +Maj +Man +Mar +Max +May +Mic +Min +Mon +Mou +Mur +NAD +NAS +NAT +NBA +NBC +NBL +NCT +NDS +NEC +NEO +NES +NET +NEW +NEX +NFA +NFC +NFL +NFS +NGC +NGN +NGO +NHK +NHL +NIC +NIH +NLP +NME +NMR +NOT +NOW +NOX +NOx +NPC +NPN +NPR +NSA +NSC +NSF +NSK +NTN +NTP +NTT +NTV +NVH +NWA +NXT +NYT +Nic +Nob +Nor +Nov +Now +Nur +OAD +OBD +OCG +OCP +OCR +OCT +ODM +OEM +OFF +OGG +OLE +OMG +ONE +ONU +OOO +OPC +OPP +ORC +OSD +OSI +OSS +OST +OTA +OTC +OTG +OTT +OUT +OVA +OVP +Obj +Off +Oly +Ope +Oph +Opt +Our +Out +Ove +PAC +PAD +PAH +PAL +PAM +PAN +PAS +PBS +PBT +PCA +PCB +PCD +PCI +PCL +PCM +PCR +PCS +PCT +PDA +PDB +PDC +PDD +PDF +PDM +PDP +PDU +PEG +PEP +PER +PES +PET +PFA +PFC +PGA +PGC +PHP +PHS +PIC +PID +PIM +PIN +PKI +PLA +PLC +PLD +PLL +PLM +PMC +PMI +PMP +PND +PNG +PNP +POE +POM +PON +POP +POS +PPA +PPC +PPG +PPH +PPI +PPM +PPP +PPR +PPS +PPT +PPV +PRL +PRO +PSA +PSD +PSE +PSG +PSI +PSK +PSP +PSS +PSV +PSW +PSY +PTA +PTC +PTH +PTT +PUB +PVA +PVC +PVE +PVP +PWM +Par +Per +Pic +Pow +Pro +Pur +QAM +QDI +QFP +QGh +QOS +QPI +QPS +QRS +QTL +Qin +Qua +Que +RAM +RAP +RAR +RAS +RAW +RBC +RCA +RCS +RDF +RDS +RED +REF +REG +REM +REX +RFC +RGB +RIA +RIM +RIP +RMB +RMS +RNA +RNG +ROC +ROE +ROI +ROM +RPC +RPG +RPM +RRW +RSA +RSC +RSI +RSS +RTA +RTC +RTK +RTP +RTS +RTU +RTX +RUN +RUS +Ray +Raz +Ric +Riv +Rom +Rou +Rub +Run +Rus +SAC +SAE +SAM +SAN +SAO +SAP +SAR +SAS +SAT +SAY +SBR +SBS +SCE +SCH +SCI +SCM +SCP +SCR +SDH +SDI +SDK +SDR +SDS +SEA +SEC +SEE +SEM +SEO +SER +SET +SFC +SFP +SGH +SGI +SGS +SHA +SHE +SID +SIG +SIM +SIP +SIR +SIS +SKF +SKT +SKU +SKY +SLA +SLC +SLE +SLG +SLI +SLR +SLS +SMA +SMB +SMC +SMD +SMG +SMI +SMP +SMS +SMT +SNK +SNP +SNR +SNS +SOA +SOC +SOD +SOI +SOP +SOS +SPA +SPC +SPD +SPE +SPF +SPI +SPR +SPS +SPT +SPV +SQL +SQU +SRS +SRT +SSA +SSC +SSD +SSE +SSH +SSL +SSR +SSS +SST +STC +STD +STK +STL +STM +STN +STP +STR +STS +SUB +SUN +SUV +SVC +SVD +SVG +SVM +SWF +SXG +SYN +SYS +Sch +Ser +She +Siz +Som +Sou +Squ +Sub +Sum +Sun +Sup +Suz +TAB +TAC +TAG +TAO +TBC +TBM +TBS +TCG +TCL +TCM +TCO +TCP +TCR +TCS +TCT +TDD +TDI +TDM +TDP +TDS +TEC +TED +TEL +TEM +TES +TEU +TEX +TFT +TGA +TGF +TGV +THD +THE +TIA +TIF +TKO +TLC +TLS +TMD +TMP +TMS +TMT +TNA +TNF +TNT +TOC +TOD +TOE +TOM +TOP +TPC +TPE +TPM +TPO +TPP +TPR +TPS +TPU +TQM +TSC +TSH +TSI +TSP +TTL +TTS +TTT +TUV +TVB +TVC +TVP +TVS +TWO +TXT +Tay +The +Tom +Tou +Tow +Tur +UAR +UBC +UCC +UCL +UDP +UFC +UFO +UGC +UHF +UIP +UMD +UML +UNI +UPC +UPS +URL +USA +USB +USD +USM +USP +USS +UTC +UTF +UTP +UTR +UVA +UVB +UWB +UZI +Umb +Uni +Upp +Uzi +VAC +VAR +VBA +VBR +VBS +VCC +VCD +VCR +VDC +VDE +VGA +VHF +VHS +VIA +VII +VIP +VIS +VMw +VOA +VOB +VOC +VOD +VOL +VPN +VPS +VRP +VSS +VTE +VVT +Ver +Vic +Vid +Vis +Viv +WAN +WAP +WAV +WAY +WBA +WBC +WBO +WBS +WCG +WCW +WDM +WDS +WEB +WEP +WEY +WGK +WHO +WIN +WMA +WMS +WMV +WOW +WPA +WPF +WPS +WRC +WSA +WTA +WTI +WTO +WVG +WWE +WWF +WWW +Way +Wha +Whe +Whi +Who +Why +WiF +Win +Wiz +Wom +Wor +Wou +XGA +XII +XML +XPS +XXX +XYZ +YAG +YES +YOU +YZB +Yin +You +Yua +Yuk +Yun +ZIP +ZOL +Zer +Zha +Zhu +Zom +Zon +Zou +abb +abc +abo +abs +act +adj +aff +all +and +ang +any +app +aws +bbb +bbc +bbq +bbs +but +cAM +cDN +cGM +can +car +cba +cha +chi +col +com +con +cor +cou +cpi +cpu +dan +day +des +did +dif 
+dis +div +diy +doc +don +dow +eAA +eSA +ech +eff +emb +emp +end +eng +eqc +equ +euv +eve +exc +exe +exp +fac +fil +fin +fir +fiv +fla +fly +for +fox +fre +fri +gAS +gdp +gen +giv +gmp +gon +goo +got +gps +gra +gre +gro +had +har +has +hav +haz +her +his +hiv +hol +hom +hou +how +iBT +iOS +iPa +iPh +iPo +iSC +ima +imp +inc +inf +inj +int +ipa +iph +ipo +isb +iso +jam +jap +jav +jay +jus +kHz +kJm +kdj +kin +lay +laz +lck +lea +led +let +lib +lif +lin +liq +lis +lit +liv +liz +lly +lng +loc +lof +log +loo +los +low +mRN +mac +mad +maj +man +mar +mat +max +may +maz +mba +men +mic +min +mmH +mod +mon +mor +mys +nVI +nba +nex +nic +not +nov +now +nxp +obj +off +one +ope +opp +our +out +ove +par +pay +per +phe +php +piz +pla +pow +ppp +pre +pro +pvc +qHD +qgh +qua +que +qui +rRN +ray +raz +rea +rec +red +ref +reg +rem +rep +req +res +rev +ric +riv +rmb +rng +rom +rou +say +sch +sha +she +shi +sho +sim +sin +siz +som +sou +spa +spe +sql +squ +sta +ste +sto +str +sty +sub +suv +tRN +tha +the +thi +thr +tim +tip +top +tow +tpp +tra +tur +tuv +two +ubc +uiv +unc +und +uni +unk +ups +usb +uva +uvb +uzi +val +var +ver +vie +vip +vis +viv +wan +was +way +web +wer +wha +whi +who +why +wif +wit +wom +won +wor +wou +www +xin +xxx +yin +you +zha +zhi +zho +zhu +zon +zzf +zzy +AAAA +AACS +ABCD +ACCA +ACCE +ACCP +ACDC +ACGN +ACID +ACPI +ACTH +ADHD +ADPC +ADSL +AIDS +AJAX +ALPH +AMEX +AMOL +ANGE +ANSI +ANSY +APEC +APPL +APTE +ARDS +ARPA +ARPG +ASCE +ASCI +ASIA +ASIC +ASIN +ASME +ASSO +ASTM +ASUS +AUDI +AUTO +AVCH +AWAR +Andr +BABY +BACK +BAND +BANG +BANK +BASI +BASS +BATT +BEAS +BEAT +BEST +BEYO +BIGB +BIOS +BLAC +BLEA +BLOG +BLOO +BLUE +BOBO +BOOK +BOOL +BOOM +BOPP +BOSS +BOYS +BRAV +BREA +BUFF +Buck +Buff +Bull +Bung +Buzz +CADC +CALL +CAPC +CAPP +CARD +CASE +CASI +CAST +CATI +CATV +CAXA +CCFL +CCIE +CCNA +CCTV +CDMA +CEPA +CERN +CHAN +CHAP +CHAR +CHEN +CHIN +CHOR +CIMS +CIPA +CISC +CITE +CITY +CLAM +CLAN +CLAS +CLOS +CLUB +CMMB +CMMI +CMOS +CMYK +CNAS +CNBC +CNBL +CNKI +CNNI +COCO +CODE +COLL +COLO +COMB +COME +COMI +COMP +CONT +COOL +CORB +CORE +COSM +COSP +COST +COUN +COVI +CPLD +CREA +CROS +CSCD +CSDN +CSMA +CSOL +CSSC +CSTN +CTRL +CUBA +CUDA +CURR +CVBS +Chin +Chur +DANC +DARK +DARP +DASH +DATA +DAYS +DCDC +DDNS +DDOS +DDRI +DELL +DEMO +DESI +DEST +DHCP +DIGI +DIMM +DISC +DIVX +DLNA +DOHC +DOTA +DOWN +DRAG +DRAM +DREA +DRIV +DSLR +DVDC +DVGA +DWDM +DWOR +EAST +EASY +EBIT +ECMO +EDGE +EDIT +EDTA +EGFR +EINE +ELIS +ELLE +EMBA +ENER +ENGI +ENTE +EPDM +EPIS +EPON +EPSO +EPUB +ERCP +ERRO +ESET +ESPN +ETSI +EVDO +EVER +EXCE +EXIL +EXPO +Ever +Exch +Exer +FACE +FALS +FANS +FANU +FAST +FDDI +FIBA +FIDI +FIFA +FIFO +FILE +FINA +FIRE +FIRS +FISH +FIVE +FLAC +FLAS +FLOW +FMVP +FORT +FPGA +FREE +FROM +FTTH +FULL +FWVG +FXCM +Fuck +Full +Fund +Fung +Fuzz +GABA +GALA +GAME +GANK +GATT +GEAR +GENE +GHOS +GIRL +GLON +GMAT +GNSS +GOLD +GOOD +GOOG +GPRS +GREE +GROU +GSMG +GUCC +GUND +GUTS +Gund +HACC +HAPP +HARD +HART +HDCP +HDMI +HDPE +HDTV +HEAD +HEAR +HELL +HEPA +HERO +HIFI +HIGH +HIPH +HKEY +HOLD +HOME +HOST +HOUS +HPLC +HSDP +HSPA +HTML +HTTP +HUNT +Hugh +Hung +ICAN +ICMP +ICON +IDEA +IDOL +IEEE +IELT +IETF +IFPI +IGBT +IGMP +IMAX +IMDB +INFO +INTE +IPAD +IPTV +ISBN +ISDN +ISIS +ISOI +ISRC +ISSN +ISTP +ITER +ITIL +IUCN +Inte +Inve +JACK +JAPA +JAVA +JAZZ +JBOD +JOHN +JOJO +JOKE +JOUR +JPEG +JUMP +JUST +Jack +Jake +Jazz +John +Joke +July +Jump +Jung +KING +KISS +KONA +KOYO +LASI +LAST +LEED +LEEP +LESS +LEVE +LEXU +LIFE +LIKE +LIMI +LINE +LINK +LINU +LIST +LIVE +LLDP +LOCA +LOFT +LOGO +LOLI +LONG +LOOK +LOVE +LPGA 
+LTPS +LVDS +Ligh +Like +Lily +Lind +Ling +Liqu +Live +Luck +Luke +MACD +MACH +MAGI +MALL +MAMA +MARK +MAST +MATL +MATX +MAYA +MBLA +MEDI +MEGA +MEMS +MERS +META +MIDI +MIDP +MIMO +MINI +MIPS +MISS +MIUI +MMOR +MOBA +MODB +MODE +MOMO +MOOC +MOON +MORE +MOSF +MOTO +MOVI +MPEG +MPLS +MSCI +MSDS +MTBF +MUSI +Mach +Make +Maur +Mazz +NACH +NADH +NADP +NAMC +NAME +NANA +NAND +NASA +NASD +NATO +NAVE +NCAA +NCAP +NCIS +NEDC +NEOP +NERV +NEST +NEWS +NEXT +NICO +NIGH +NIKE +NINE +NOKI +NOTE +NOVA +NSAI +NTFS +NTSC +NULL +NURB +NVID +NYSE +Nove +ODBC +OECD +OFDM +OFFI +OLAP +OLED +ONLI +ONLY +OPEC +OPEN +OPPO +ORAC +ORIC +ORIG +OSPF +OVER +Oper +PACS +PAGE +PARK +PART +PASS +PCMC +PDCA +PEEK +PERC +PERF +PETS +PHEV +PHIL +PHOT +PICC +PIEC +PLAN +PLAY +PLUS +PMMA +PNAS +POLO +POSE +POST +POWE +PPTP +PPTV +PRAD +PROD +PROF +PROJ +PSTN +PTFE +PUNK +PVDF +Pric +Prin +Priv +Priz +Prom +QFII +QVGA +QVOD +QWER +Quic +Quin +Quiz +RADI +RAID +RAIN +REAC +READ +REAL +REIT +RESE +RFID +RIDE +RISC +RMON +RMRM +RMVB +ROAD +ROCK +ROHS +ROOT +ROSE +RTEC +RWBY +Ruby +SAAS +SAMS +SARS +SATA +SCAD +SCAR +SCDM +SCHO +SCIE +SCSI +SDHC +SDMM +SDRA +SDSD +SDXC +SECA +SECC +SECT +SEED +SEGA +SELE +SERV +SEVE +SFDA +SHIF +SHIN +SHOC +SHOP +SHOW +SIDE +SIEM +SING +SIZE +SKIP +SMAP +SMAR +SMIL +SMTP +SNMP +SOAP +SOCK +SOHO +SOLO +SONG +SONY +SOSO +SOUL +SPAC +SPCC +SPDI +SPEC +SPEE +SPIE +SPOR +SPSS +SRAM +SSCI +STAF +STAG +STAR +STAT +STEM +STEP +STER +STOP +STOR +STUD +STYL +SUMM +SUPE +SUSE +SWAT +SWIF +SWOT +SYST +Subj +Sull +Sund +Sung +Supp +TABL +TANK +TCPI +TDMA +TEAM +TECH +TEST +TEXT +TFBO +TFSI +TFTP +THIS +THRE +TIFF +TIME +TIMK +TIPS +TOEF +TOKY +TOSH +TOUC +TOUR +TOWN +TRAC +TRIP +TRIZ +TRUE +TVBS +TVOC +TWIC +TYPE +Ther +Thin +Thom +Thou +UCLA +UHMW +ULTR +UMTS +UNES +UNIT +UNIV +UNIX +Unic +Unit +Univ +VAIO +VCCI +VEGF +VERS +VHDL +VIDE +VIER +VIII +VISA +VISI +VIST +VIVO +VLAN +VLSI +VOCA +VOGU +VOIP +VRay +VSAT +Vick +Vill +WANG +WAPI +WASD +WAVE +WCBA +WCDM +WEEK +WEST +WHAT +WHIT +WIFI +WIND +WITH +WLAN +WORD +WORK +WORL +WQVG +WXGA +Wang +Wher +WiMA +Will +Wind +Wing +XBOX +XBRL +XHTM +XVID +XXXX +YAMA +YANG +YEAH +YONE +YOUN +YOUR +YOYO +Yong +Your +ZAFT +ZARA +ZERO +ZGMF +ZHAN +ZONE +Zhon +Zhou +abby +abou +andr +appl +baby +back +blic +call +char +chic +chin +coff +coll +comb +comm +comp +cond +cons +cont +dick +diff +ding +dock +doin +dong +down +ever +exch +find +foll +four +from +fron +goin +good +goog +gove +hack +hall +hand +hang +happ +have +here +high +home +into +inve +jack +java +jazz +jump +jung +just +know +life +ligh +like +lily +ling +liqu +live +lock +logo +lond +long +look +love +macd +mach +make +mapp +mmer +nove +okay +only +oper +oppo +othe +over +play +pray +pric +prin +priv +priz +prod +prom +quic +real +requ +righ +scho +shou +show +some +star +stat +stay +stom +subj +such +suff +supp +take +than +they +thin +thou +toke +uber +unic +univ +upon +usdj +user +usin +vill +vivo +wake +wall +wang +want +wave +were +what +when +wifi +will +wind +wing +with +work +xing +xxxx +year +your +zhon +China +Inter +Journ +china +every +inter +iphon +thing +think +where +which +Univer +univer +Windows +windows +##A +##B +##C +##D +##E +##F +##G +##H +##I +##J +##K +##L +##M +##N +##O +##P +##Q +##R +##S +##T +##U +##V +##W +##X +##Y +##Z +##a +##b +##c +##d +##e +##f +##g +##h +##i +##j +##k +##l +##m +##n +##o +##p +##q +##r +##s +##t +##u +##v +##w +##x +##y +##z +##AA +##AB +##AC +##AD +##AE +##AF +##AG +##AH +##AI +##AK +##AL +##AM +##AN +##AO +##AP +##AQ +##AR +##AS +##AT +##AV +##AW +##AX 
+##AY +##AZ +##BA +##BB +##BC +##BE +##BG +##BI +##BM +##BN +##BO +##BP +##BR +##BS +##BT +##BU +##BY +##CA +##CB +##CC +##CD +##CE +##CF +##CG +##CH +##CI +##CK +##CL +##CM +##CN +##CO +##CP +##CR +##CS +##CT +##CU +##DA +##DB +##DC +##DD +##DE +##DI +##DL +##DM +##DN +##DO +##DP +##DR +##DS +##DT +##DU +##DX +##DY +##EA +##EB +##EC +##ED +##EE +##EF +##EG +##EI +##EK +##EL +##EM +##EN +##EO +##EP +##ER +##ES +##ET +##EV +##EW +##EX +##EY +##FA +##FC +##FD +##FE +##FF +##FI +##FL +##FO +##FP +##FR +##FS +##FT +##FU +##FX +##Fi +##GA +##GC +##GE +##GF +##GH +##GI +##GL +##GN +##GO +##GP +##GR +##GS +##GU +##GY +##HA +##HC +##HD +##HE +##HG +##HI +##HM +##HN +##HO +##HP +##HR +##HS +##HT +##IA +##IB +##IC +##ID +##IE +##IF +##IG +##II +##IK +##IL +##IM +##IN +##IO +##IP +##IR +##IS +##IT +##IU +##IV +##IX +##IZ +##JI +##JO +##Jo +##Ju +##KA +##KE +##KI +##KK +##KO +##KU +##KY +##LA +##LC +##LD +##LE +##LF +##LG +##LI +##LK +##LL +##LM +##LO +##LP +##LS +##LT +##LU +##LV +##LY +##MA +##MB +##MC +##MD +##ME +##MF +##MI +##ML +##MM +##MN +##MO +##MP +##MR +##MS +##MT +##MV +##MY +##NA +##NC +##ND +##NE +##NG +##NI +##NJ +##NK +##NN +##NO +##NP +##NS +##NT +##NU +##NX +##NY +##NZ +##OB +##OC +##OD +##OE +##OF +##OG +##OH +##OI +##OK +##OL +##OM +##ON +##OO +##OP +##OR +##OS +##OT +##OU +##OV +##OW +##OX +##PA +##PC +##PD +##PE +##PF +##PG +##PH +##PI +##PL +##PM +##PO +##PP +##PR +##PS +##PT +##PU +##QU +##Qu +##RA +##RB +##RC +##RD +##RE +##RF +##RG +##RH +##RI +##RK +##RL +##RM +##RN +##RO +##RP +##RR +##RS +##RT +##RU +##RY +##SA +##SB +##SC +##SD +##SE +##SF +##SH +##SI +##SK +##SL +##SM +##SN +##SO +##SP +##SS +##ST +##SU +##SY +##TA +##TC +##TD +##TE +##TH +##TI +##TM +##TO +##TP +##TR +##TS +##TT +##TU +##TV +##TY +##Tu +##UB +##UC +##UD +##UE +##UF +##UG +##UI +##UK +##UL +##UM +##UN +##UP +##UR +##US +##UT +##VA +##VB +##VC +##VD +##VE +##VI +##VO +##VP +##VR +##VT +##WA +##WC +##WE +##WF +##WI +##WL +##WM +##WO +##WS +##XA +##XC +##XE +##XG +##XO +##XP +##XT +##XX +##XY +##YA +##YE +##YL +##YO +##YP +##YS +##YT +##ZA +##ZB +##ZE +##ZI +##ZO +##ZR +##ZU +##ZX +##ZZ +##ab +##ag +##al +##am +##an +##ar +##as +##at +##ax +##ay +##az +##bi +##bj +##bl +##bo +##by +##ce +##ch +##ci +##ck +##cq +##ct +##dj +##ed +##en +##er +##ew +##ex +##ff +##fi +##gh +##gn +##ha +##he +##ho +##hz +##ic +##id +##im +##in +##is +##it +##iv +##ix +##iz +##jj +##jo +##ke +##ky +##kz +##ld +##le +##lf +##ll +##ly +##mb +##mp +##na +##nc +##nd +##ng +##nj +##nk +##nn +##nt +##nz +##ob +##oj +##ok +##ol +##om +##on +##op +##or +##ou +##ow +##ox +##ph +##pp +##pu +##pv +##qf +##ql +##qq +##qu +##re +##rk +##ro +##ry +##sh +##sq +##st +##th +##ty +##ub +##ul +##um +##un +##ur +##us +##uv +##ux +##uz +##ve +##vi +##wn +##ws +##ww +##xp +##xx +##xy +##zh +##zy +##zz +##ACE +##ACH +##ACT +##ADE +##AGE +##AIN +##AME +##AND +##ANG +##ANO +##ANT +##ARD +##ARE +##ASS +##AST +##ATE +##BER +##BLE +##BOX +##BSD +##Bay +##CAD +##CAL +##CAM +##COM +##CSE +##DEO +##DER +##DIA +##DNA +##DSL +##DVD +##EAM +##EAR +##ECT +##EEN +##ENS +##ENT +##ERA +##ERS +##ESE +##ESS +##FTA +##GER +##GHT +##GIS +##IAL +##IBA +##IBU +##ICE +##ICS +##IDE +##INA +##INE +##ING +##INT +##INY +##ION +##IPS +##ITE +##IVE +##KER +##KON +##LAY +##LLA +##LOR +##MAN +##MAS +##MAX +##MES +##NAD +##NAL +##NCE +##NET +##NEY +##NIC +##NNA +##OCK +##ODE +##OME +##ONE +##ORA +##OWS +##Off +##PAC +##PER +##PRS +##RAN +##RIS +##RNA +##ROM +##RON +##ROR +##SCO +##SHI +##SIC +##SOL +##SON +##SQL +##TAL +##TED +##TER +##TML +##TON +##TRA +##UND +##UNG +##UPA +##USB 
+##USE +##VEL +##VER +##VGA +##VID +##WER +##You +##abl +##aby +##ach +##ack +##act +##ain +##ake +##all +##aly +##anc +##and +##ang +##ank +##app +##ard +##ark +##art +##ary +##ash +##ath +##auv +##ave +##avi +##azi +##azy +##azz +##bVI +##bby +##ber +##bje +##ble +##cGI +##cho +##com +##cqu +##day +##der +##ebo +##ect +##ell +##emb +##enc +##eng +##ent +##erJ +##ern +##erv +##ery +##eve +##ews +##exp +##ext +##ezy +##fer +##ffe +##fic +##for +##gaz +##ger +##ght +##gin +##hen +##her +##hev +##hin +##hon +##hou +##iRF +##ial +##ica +##ice +##ich +##ick +##iff +##igh +##ike +##ill +##ily +##ime +##ine +##ing +##ink +##ion +##iqu +##ish +##ith +##ive +##iza +##ize +##izz +##jin +##ker +##kin +##lDR +##lay +##laz +##lex +##lic +##lin +##liz +##llo +##lly +##man +##maz +##men +##mer +##min +##mpl +##mpo +##nGL +##nRH +##nal +##ner +##ngz +##niz +##now +##nxp +##oCA +##obj +##ock +##oll +##omb +##ome +##omm +##omp +##one +##ong +##ook +##ork +##orm +##ort +##ory +##oul +##oup +##our +##ous +##out +##ove +##own +##ows +##per +##phe +##ply +##por +##ppl +##ppy +##qqu +##qua +##que +##qui +##raz +##rch +##ric +##rou +##son +##tBI +##tch +##ter +##the +##tic +##tim +##tiv +##tur +##uch +##uck +##uct +##uff +##ugh +##umb +##ung +##ure +##urn +##vel +##ven +##ver +##vic +##vid +##vin +##war +##way +##whe +##wor +##www +##xxx +##ymb +##yth +##zhe +##zym +##zzy +##ATIO +##CESS +##CIAT +##CTIO +##CTOR +##ENGI +##ERSI +##HCSD +##INES +##INUE +##IONA +##LOID +##MENT +##NEER +##NOLO +##NTER +##NTSC +##ORMA +##OSHO +##RISE +##RNAT +##RNET +##SATA +##SION +##TION +##TTLE +##VERS +##ally +##arch +##ayer +##azer +##azin +##bert +##book +##chin +##ctor +##ding +##echn +##erPC +##erVR +##eriz +##erve +##ever +##ffer +##ffff +##ffic +##fter +##ghly +##hell +##ical +##iche +##icke +##ific +##ight +##iver +##izon +##izzy +##king +##lack +##land +##llow +##mber +##ngin +##ning +##omic +##onom +##othe +##ouch +##ough +##ound +##ower +##pper +##ppin +##pter +##ster +##ther +##tion +##tive +##tter +##ture +##urch +##vely +##ction +##ctive +##enter +##erica +##ional +##thing diff --git a/fengshen/workspace/randeng-bart-base/pretrain/config.json b/fengshen/workspace/randeng-bart-base/pretrain/config.json new file mode 100644 index 0000000000000000000000000000000000000000..693bf7a8ab25b35bea914122c631a4a9d38204ec --- /dev/null +++ b/fengshen/workspace/randeng-bart-base/pretrain/config.json @@ -0,0 +1,52 @@ +{ + "_name_or_path": "bart-base", + "activation_dropout": 0.1, + "activation_function": "gelu", + "add_bias_logits": false, + "add_final_layer_norm": false, + "architectures": [ + "BartForConditionalGeneration" + ], + "attention_dropout": 0.1, + "bos_token_id": 0, + "classif_dropout": 0.1, + "classifier_dropout": 0.0, + "d_model": 768, + "decoder_attention_heads": 12, + "decoder_ffn_dim": 3072, + "decoder_layerdrop": 0.0, + "decoder_layers": 6, + "decoder_start_token_id": 2, + "dropout": 0.1, + "encoder_attention_heads": 12, + "encoder_ffn_dim": 3072, + "encoder_layerdrop": 0.0, + "encoder_layers": 6, + "eos_token_id": 2, + "forced_eos_token_id": 2, + "id2label": { + "0": "LABEL_0", + "1": "LABEL_1", + "2": "LABEL_2" + }, + "init_std": 0.02, + "is_encoder_decoder": true, + "label2id": { + "LABEL_0": 0, + "LABEL_1": 1, + "LABEL_2": 2 + }, + "max_position_embeddings": 1024, + "model_type": "bart", + "no_repeat_ngram_size": 3, + "normalize_before": false, + "normalize_embedding": true, + "num_beams": 4, + "num_hidden_layers": 6, + "pad_token_id": 1, + "scale_embedding": false, + "torch_dtype": "float16", + 
"transformers_version": "4.16.0.dev0", + "use_cache": true, + "vocab_size": 21128 +} \ No newline at end of file diff --git a/fengshen/workspace/randeng-bart-base/pretrain/special_tokens_map.json b/fengshen/workspace/randeng-bart-base/pretrain/special_tokens_map.json new file mode 100644 index 0000000000000000000000000000000000000000..e7b0375001f109a6b8873d756ad4f7bbb15fbaa5 --- /dev/null +++ b/fengshen/workspace/randeng-bart-base/pretrain/special_tokens_map.json @@ -0,0 +1 @@ +{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"} \ No newline at end of file diff --git a/fengshen/workspace/randeng-bart-base/pretrain/tokenizer_config.json b/fengshen/workspace/randeng-bart-base/pretrain/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..53d88ea9bb5c978402b7d9bb4a80690171e5491c --- /dev/null +++ b/fengshen/workspace/randeng-bart-base/pretrain/tokenizer_config.json @@ -0,0 +1,15 @@ +{ + "do_lower_case": true, + "do_basic_tokenize": true, + "never_split": null, + "unk_token": "[UNK]", + "sep_token": "[SEP]", + "pad_token": "[PAD]", + "cls_token": "[CLS]", + "mask_token": "[MASK]", + "tokenize_chinese_chars": true, + "strip_accents": null, + "special_tokens_map_file": null, + "name_or_path": "/cognitive_comp/gaoxinyu/pretrained_model/bert-1.3B", + "tokenizer_class": "BertTokenizer" +} \ No newline at end of file diff --git a/fengshen/workspace/randeng-bart-base/pretrain/vocab.txt b/fengshen/workspace/randeng-bart-base/pretrain/vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..66b6d20eebda7da8499e68ef9b1980990c0042cc --- /dev/null +++ b/fengshen/workspace/randeng-bart-base/pretrain/vocab.txt @@ -0,0 +1,21128 @@ +[PAD] +[unused1] +[unused2] +[unused3] +[unused4] +[unused5] +[unused6] +[unused7] +[unused8] +[unused9] +[unused10] +[unused11] +[unused12] +[unused13] +[unused14] +[unused15] +[unused16] +[unused17] +[unused18] +[unused19] +[unused20] +[unused21] +[unused22] +[unused23] +[unused24] +[unused25] +[unused26] +[unused27] +[unused28] +[unused29] +[unused30] +[unused31] +[unused32] +[unused33] +[unused34] +[unused35] +[unused36] +[unused37] +[unused38] +[unused39] +[unused40] +[unused41] +[unused42] +[unused43] +[unused44] +[unused45] +[unused46] +[unused47] +[unused48] +[unused49] +[unused50] +[unused51] +[unused52] +[unused53] +[unused54] +[unused55] +[unused56] +[unused57] +[unused58] +[unused59] +[unused60] +[unused61] +[unused62] +[unused63] +[unused64] +[unused65] +[unused66] +[unused67] +[unused68] +[unused69] +[unused70] +[unused71] +[unused72] +[unused73] +[unused74] +[unused75] +[unused76] +[unused77] +[unused78] +[unused79] +[unused80] +[unused81] +[unused82] +[unused83] +[unused84] +[unused85] +[unused86] +[unused87] +[unused88] +[unused89] +[unused90] +[unused91] +[unused92] +[unused93] +[unused94] +[unused95] +[unused96] +[unused97] +[unused98] +[unused99] +[UNK] +[CLS] +[SEP] +[MASK] + + +! +" +# +$ +% +& +' +( +) +* ++ +, +- +. +/ +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +: +; +< += +> +? 
+@ +[ +\ +] +^ +_ +a +b +c +d +e +f +g +h +i +j +k +l +m +n +o +p +q +r +s +t +u +v +w +x +y +z +{ +| +} +~ +£ +¤ +¥ +§ +© +« +® +° +± +² +³ +µ +· +¹ +º +» +¼ +× +ß +æ +÷ +ø +đ +ŋ +ɔ +ə +ɡ +ʰ +ˇ +ˈ +ˊ +ˋ +ˍ +ː +˙ +˚ +ˢ +α +β +γ +δ +ε +η +θ +ι +κ +λ +μ +ν +ο +π +ρ +ς +σ +τ +υ +φ +χ +ψ +ω +а +б +в +г +д +е +ж +з +и +к +л +м +н +о +п +р +с +т +у +ф +х +ц +ч +ш +ы +ь +я +і +ا +ب +ة +ت +د +ر +س +ع +ل +م +ن +ه +و +ي +۩ +ก +ง +น +ม +ย +ร +อ +า +เ +๑ +་ +ღ +ᄀ +ᄁ +ᄂ +ᄃ +ᄅ +ᄆ +ᄇ +ᄈ +ᄉ +ᄋ +ᄌ +ᄎ +ᄏ +ᄐ +ᄑ +ᄒ +ᅡ +ᅢ +ᅣ +ᅥ +ᅦ +ᅧ +ᅨ +ᅩ +ᅪ +ᅬ +ᅭ +ᅮ +ᅯ +ᅲ +ᅳ +ᅴ +ᅵ +ᆨ +ᆫ +ᆯ +ᆷ +ᆸ +ᆺ +ᆻ +ᆼ +ᗜ +ᵃ +ᵉ +ᵍ +ᵏ +ᵐ +ᵒ +ᵘ +‖ +„ +† +• +‥ +‧ + +‰ +′ +″ +‹ +› +※ +‿ +⁄ +ⁱ +⁺ +ⁿ +₁ +₂ +₃ +₄ +€ +℃ +№ +™ +ⅰ +ⅱ +ⅲ +ⅳ +ⅴ +← +↑ +→ +↓ +↔ +↗ +↘ +⇒ +∀ +− +∕ +∙ +√ +∞ +∟ +∠ +∣ +∥ +∩ +∮ +∶ +∼ +∽ +≈ +≒ +≡ +≤ +≥ +≦ +≧ +≪ +≫ +⊙ +⋅ +⋈ +⋯ +⌒ +① +② +③ +④ +⑤ +⑥ +⑦ +⑧ +⑨ +⑩ +⑴ +⑵ +⑶ +⑷ +⑸ +⒈ +⒉ +⒊ +⒋ +ⓒ +ⓔ +ⓘ +─ +━ +│ +┃ +┅ +┆ +┊ +┌ +└ +├ +┣ +═ +║ +╚ +╞ +╠ +╭ +╮ +╯ +╰ +╱ +╳ +▂ +▃ +▅ +▇ +█ +▉ +▋ +▌ +▍ +▎ +■ +□ +▪ +▫ +▬ +▲ +△ +▶ +► +▼ +▽ +◆ +◇ +○ +◎ +● +◕ +◠ +◢ +◤ +☀ +★ +☆ +☕ +☞ +☺ +☼ +♀ +♂ +♠ +♡ +♣ +♥ +♦ +♪ +♫ +♬ +✈ +✔ +✕ +✖ +✦ +✨ +✪ +✰ +✿ +❀ +❤ +➜ +➤ +⦿ +、 +。 +〃 +々 +〇 +〈 +〉 +《 +》 +「 +」 +『 +』 +【 +】 +〓 +〔 +〕 +〖 +〗 +〜 +〝 +〞 +ぁ +あ +ぃ +い +う +ぇ +え +お +か +き +く +け +こ +さ +し +す +せ +そ +た +ち +っ +つ +て +と +な +に +ぬ +ね +の +は +ひ +ふ +へ +ほ +ま +み +む +め +も +ゃ +や +ゅ +ゆ +ょ +よ +ら +り +る +れ +ろ +わ +を +ん +゜ +ゝ +ァ +ア +ィ +イ +ゥ +ウ +ェ +エ +ォ +オ +カ +キ +ク +ケ +コ +サ +シ +ス +セ +ソ +タ +チ +ッ +ツ +テ +ト +ナ +ニ +ヌ +ネ +ノ +ハ +ヒ +フ +ヘ +ホ +マ +ミ +ム +メ +モ +ャ +ヤ +ュ +ユ +ョ +ヨ +ラ +リ +ル +レ +ロ +ワ +ヲ +ン +ヶ +・ +ー +ヽ +ㄅ +ㄆ +ㄇ +ㄉ +ㄋ +ㄌ +ㄍ +ㄎ +ㄏ +ㄒ +ㄚ +ㄛ +ㄞ +ㄟ +ㄢ +ㄤ +ㄥ +ㄧ +ㄨ +ㆍ +㈦ +㊣ +㎡ +㗎 +一 +丁 +七 +万 +丈 +三 +上 +下 +不 +与 +丐 +丑 +专 +且 +丕 +世 +丘 +丙 +业 +丛 +东 +丝 +丞 +丟 +両 +丢 +两 +严 +並 +丧 +丨 +个 +丫 +中 +丰 +串 +临 +丶 +丸 +丹 +为 +主 +丼 +丽 +举 +丿 +乂 +乃 +久 +么 +义 +之 +乌 +乍 +乎 +乏 +乐 +乒 +乓 +乔 +乖 +乗 +乘 +乙 +乜 +九 +乞 +也 +习 +乡 +书 +乩 +买 +乱 +乳 +乾 +亀 +亂 +了 +予 +争 +事 +二 +于 +亏 +云 +互 +五 +井 +亘 +亙 +亚 +些 +亜 +亞 +亟 +亡 +亢 +交 +亥 +亦 +产 +亨 +亩 +享 +京 +亭 +亮 +亲 +亳 +亵 +人 +亿 +什 +仁 +仃 +仄 +仅 +仆 +仇 +今 +介 +仍 +从 +仏 +仑 +仓 +仔 +仕 +他 +仗 +付 +仙 +仝 +仞 +仟 +代 +令 +以 +仨 +仪 +们 +仮 +仰 +仲 +件 +价 +任 +份 +仿 +企 +伉 +伊 +伍 +伎 +伏 +伐 +休 +伕 +众 +优 +伙 +会 +伝 +伞 +伟 +传 +伢 +伤 +伦 +伪 +伫 +伯 +估 +伴 +伶 +伸 +伺 +似 +伽 +佃 +但 +佇 +佈 +位 +低 +住 +佐 +佑 +体 +佔 +何 +佗 +佘 +余 +佚 +佛 +作 +佝 +佞 +佟 +你 +佢 +佣 +佤 +佥 +佩 +佬 +佯 +佰 +佳 +併 +佶 +佻 +佼 +使 +侃 +侄 +來 +侈 +例 +侍 +侏 +侑 +侖 +侗 +供 +依 +侠 +価 +侣 +侥 +侦 +侧 +侨 +侬 +侮 +侯 +侵 +侶 +侷 +便 +係 +促 +俄 +俊 +俎 +俏 +俐 +俑 +俗 +俘 +俚 +保 +俞 +俟 +俠 +信 +俨 +俩 +俪 +俬 +俭 +修 +俯 +俱 +俳 +俸 +俺 +俾 +倆 +倉 +個 +倌 +倍 +倏 +們 +倒 +倔 +倖 +倘 +候 +倚 +倜 +借 +倡 +値 +倦 +倩 +倪 +倫 +倬 +倭 +倶 +债 +值 +倾 +偃 +假 +偈 +偉 +偌 +偎 +偏 +偕 +做 +停 +健 +側 +偵 +偶 +偷 +偻 +偽 +偿 +傀 +傅 +傍 +傑 +傘 +備 +傚 +傢 +傣 +傥 +储 +傩 +催 +傭 +傲 +傳 +債 +傷 +傻 +傾 +僅 +働 +像 +僑 +僕 +僖 +僚 +僥 +僧 +僭 +僮 +僱 +僵 +價 +僻 +儀 +儂 +億 +儆 +儉 +儋 +儒 +儕 +儘 +償 +儡 +優 +儲 +儷 +儼 +儿 +兀 +允 +元 +兄 +充 +兆 +兇 +先 +光 +克 +兌 +免 +児 +兑 +兒 +兔 +兖 +党 +兜 +兢 +入 +內 +全 +兩 +八 +公 +六 +兮 +兰 +共 +兲 +关 +兴 +兵 +其 +具 +典 +兹 +养 +兼 +兽 +冀 +内 +円 +冇 +冈 +冉 +冊 +册 +再 +冏 +冒 +冕 +冗 +写 +军 +农 +冠 +冢 +冤 +冥 +冨 +冪 +冬 +冯 +冰 +冲 +决 +况 +冶 +冷 +冻 +冼 +冽 +冾 +净 +凄 +准 +凇 +凈 +凉 +凋 +凌 +凍 +减 +凑 +凛 +凜 +凝 +几 +凡 +凤 +処 +凪 +凭 +凯 +凰 +凱 +凳 +凶 +凸 +凹 +出 +击 +函 +凿 +刀 +刁 +刃 +分 +切 +刈 +刊 +刍 +刎 +刑 +划 +列 +刘 +则 +刚 +创 +初 +删 +判 +別 +刨 +利 +刪 +别 +刮 +到 +制 +刷 +券 +刹 +刺 +刻 +刽 +剁 +剂 +剃 +則 +剉 +削 +剋 +剌 +前 +剎 +剐 +剑 +剔 +剖 +剛 +剜 +剝 +剣 +剤 +剥 +剧 +剩 +剪 +副 +割 +創 +剷 +剽 +剿 +劃 +劇 +劈 +劉 +劊 +劍 +劏 +劑 +力 +劝 +办 +功 +加 +务 +劣 +动 +助 +努 +劫 +劭 +励 +劲 +劳 +労 +劵 +効 +劾 +势 +勁 +勃 +勇 +勉 +勋 +勐 +勒 +動 +勖 +勘 +務 +勛 +勝 +勞 +募 +勢 +勤 +勧 +勳 +勵 +勸 +勺 +勻 +勾 +勿 +匀 +包 +匆 +匈 +匍 +匐 +匕 +化 +北 +匙 +匝 +匠 +匡 +匣 +匪 +匮 +匯 +匱 +匹 +区 +医 +匾 +匿 +區 +十 +千 +卅 +升 +午 +卉 +半 +卍 +华 +协 +卑 +卒 +卓 +協 +单 +卖 +南 +単 +博 +卜 +卞 +卟 +占 +卡 +卢 +卤 +卦 +卧 +卫 +卮 +卯 +印 +危 +即 +却 +卵 +卷 +卸 +卻 +卿 
+厂 +厄 +厅 +历 +厉 +压 +厌 +厕 +厘 +厚 +厝 +原 +厢 +厥 +厦 +厨 +厩 +厭 +厮 +厲 +厳 +去 +县 +叁 +参 +參 +又 +叉 +及 +友 +双 +反 +収 +发 +叔 +取 +受 +变 +叙 +叛 +叟 +叠 +叡 +叢 +口 +古 +句 +另 +叨 +叩 +只 +叫 +召 +叭 +叮 +可 +台 +叱 +史 +右 +叵 +叶 +号 +司 +叹 +叻 +叼 +叽 +吁 +吃 +各 +吆 +合 +吉 +吊 +吋 +同 +名 +后 +吏 +吐 +向 +吒 +吓 +吕 +吖 +吗 +君 +吝 +吞 +吟 +吠 +吡 +否 +吧 +吨 +吩 +含 +听 +吭 +吮 +启 +吱 +吳 +吴 +吵 +吶 +吸 +吹 +吻 +吼 +吽 +吾 +呀 +呂 +呃 +呆 +呈 +告 +呋 +呎 +呐 +呓 +呕 +呗 +员 +呛 +呜 +呢 +呤 +呦 +周 +呱 +呲 +味 +呵 +呷 +呸 +呻 +呼 +命 +咀 +咁 +咂 +咄 +咆 +咋 +和 +咎 +咏 +咐 +咒 +咔 +咕 +咖 +咗 +咘 +咙 +咚 +咛 +咣 +咤 +咦 +咧 +咨 +咩 +咪 +咫 +咬 +咭 +咯 +咱 +咲 +咳 +咸 +咻 +咽 +咿 +哀 +品 +哂 +哄 +哆 +哇 +哈 +哉 +哋 +哌 +响 +哎 +哏 +哐 +哑 +哒 +哔 +哗 +哟 +員 +哥 +哦 +哧 +哨 +哩 +哪 +哭 +哮 +哲 +哺 +哼 +哽 +唁 +唄 +唆 +唇 +唉 +唏 +唐 +唑 +唔 +唠 +唤 +唧 +唬 +售 +唯 +唰 +唱 +唳 +唷 +唸 +唾 +啃 +啄 +商 +啉 +啊 +問 +啓 +啕 +啖 +啜 +啞 +啟 +啡 +啤 +啥 +啦 +啧 +啪 +啫 +啬 +啮 +啰 +啱 +啲 +啵 +啶 +啷 +啸 +啻 +啼 +啾 +喀 +喂 +喃 +善 +喆 +喇 +喉 +喊 +喋 +喎 +喏 +喔 +喘 +喙 +喚 +喜 +喝 +喟 +喧 +喪 +喫 +喬 +單 +喰 +喱 +喲 +喳 +喵 +営 +喷 +喹 +喺 +喻 +喽 +嗅 +嗆 +嗇 +嗎 +嗑 +嗒 +嗓 +嗔 +嗖 +嗚 +嗜 +嗝 +嗟 +嗡 +嗣 +嗤 +嗦 +嗨 +嗪 +嗬 +嗯 +嗰 +嗲 +嗳 +嗶 +嗷 +嗽 +嘀 +嘅 +嘆 +嘈 +嘉 +嘌 +嘍 +嘎 +嘔 +嘖 +嘗 +嘘 +嘚 +嘛 +嘜 +嘞 +嘟 +嘢 +嘣 +嘤 +嘧 +嘩 +嘭 +嘮 +嘯 +嘰 +嘱 +嘲 +嘴 +嘶 +嘸 +嘹 +嘻 +嘿 +噁 +噌 +噎 +噓 +噔 +噗 +噙 +噜 +噠 +噢 +噤 +器 +噩 +噪 +噬 +噱 +噴 +噶 +噸 +噹 +噻 +噼 +嚀 +嚇 +嚎 +嚏 +嚐 +嚓 +嚕 +嚟 +嚣 +嚥 +嚨 +嚮 +嚴 +嚷 +嚼 +囂 +囉 +囊 +囍 +囑 +囔 +囗 +囚 +四 +囝 +回 +囟 +因 +囡 +团 +団 +囤 +囧 +囪 +囫 +园 +困 +囱 +囲 +図 +围 +囹 +固 +国 +图 +囿 +圃 +圄 +圆 +圈 +國 +圍 +圏 +園 +圓 +圖 +團 +圜 +土 +圣 +圧 +在 +圩 +圭 +地 +圳 +场 +圻 +圾 +址 +坂 +均 +坊 +坍 +坎 +坏 +坐 +坑 +块 +坚 +坛 +坝 +坞 +坟 +坠 +坡 +坤 +坦 +坨 +坪 +坯 +坳 +坵 +坷 +垂 +垃 +垄 +型 +垒 +垚 +垛 +垠 +垢 +垣 +垦 +垩 +垫 +垭 +垮 +垵 +埂 +埃 +埋 +城 +埔 +埕 +埗 +域 +埠 +埤 +埵 +執 +埸 +培 +基 +埼 +堀 +堂 +堃 +堅 +堆 +堇 +堑 +堕 +堙 +堡 +堤 +堪 +堯 +堰 +報 +場 +堵 +堺 +堿 +塊 +塌 +塑 +塔 +塗 +塘 +塚 +塞 +塢 +塩 +填 +塬 +塭 +塵 +塾 +墀 +境 +墅 +墉 +墊 +墒 +墓 +増 +墘 +墙 +墜 +增 +墟 +墨 +墩 +墮 +墳 +墻 +墾 +壁 +壅 +壆 +壇 +壊 +壑 +壓 +壕 +壘 +壞 +壟 +壢 +壤 +壩 +士 +壬 +壮 +壯 +声 +売 +壳 +壶 +壹 +壺 +壽 +处 +备 +変 +复 +夏 +夔 +夕 +外 +夙 +多 +夜 +够 +夠 +夢 +夥 +大 +天 +太 +夫 +夭 +央 +夯 +失 +头 +夷 +夸 +夹 +夺 +夾 +奂 +奄 +奇 +奈 +奉 +奋 +奎 +奏 +奐 +契 +奔 +奕 +奖 +套 +奘 +奚 +奠 +奢 +奥 +奧 +奪 +奬 +奮 +女 +奴 +奶 +奸 +她 +好 +如 +妃 +妄 +妆 +妇 +妈 +妊 +妍 +妒 +妓 +妖 +妘 +妙 +妝 +妞 +妣 +妤 +妥 +妨 +妩 +妪 +妮 +妲 +妳 +妹 +妻 +妾 +姆 +姉 +姊 +始 +姍 +姐 +姑 +姒 +姓 +委 +姗 +姚 +姜 +姝 +姣 +姥 +姦 +姨 +姪 +姫 +姬 +姹 +姻 +姿 +威 +娃 +娄 +娅 +娆 +娇 +娉 +娑 +娓 +娘 +娛 +娜 +娟 +娠 +娣 +娥 +娩 +娱 +娲 +娴 +娶 +娼 +婀 +婁 +婆 +婉 +婊 +婕 +婚 +婢 +婦 +婧 +婪 +婭 +婴 +婵 +婶 +婷 +婺 +婿 +媒 +媚 +媛 +媞 +媧 +媲 +媳 +媽 +媾 +嫁 +嫂 +嫉 +嫌 +嫑 +嫔 +嫖 +嫘 +嫚 +嫡 +嫣 +嫦 +嫩 +嫲 +嫵 +嫻 +嬅 +嬉 +嬌 +嬗 +嬛 +嬢 +嬤 +嬪 +嬰 +嬴 +嬷 +嬸 +嬿 +孀 +孃 +子 +孑 +孔 +孕 +孖 +字 +存 +孙 +孚 +孛 +孜 +孝 +孟 +孢 +季 +孤 +学 +孩 +孪 +孫 +孬 +孰 +孱 +孳 +孵 +學 +孺 +孽 +孿 +宁 +它 +宅 +宇 +守 +安 +宋 +完 +宏 +宓 +宕 +宗 +官 +宙 +定 +宛 +宜 +宝 +实 +実 +宠 +审 +客 +宣 +室 +宥 +宦 +宪 +宫 +宮 +宰 +害 +宴 +宵 +家 +宸 +容 +宽 +宾 +宿 +寂 +寄 +寅 +密 +寇 +富 +寐 +寒 +寓 +寛 +寝 +寞 +察 +寡 +寢 +寥 +實 +寧 +寨 +審 +寫 +寬 +寮 +寰 +寵 +寶 +寸 +对 +寺 +寻 +导 +対 +寿 +封 +専 +射 +将 +將 +專 +尉 +尊 +尋 +對 +導 +小 +少 +尔 +尕 +尖 +尘 +尚 +尝 +尤 +尧 +尬 +就 +尴 +尷 +尸 +尹 +尺 +尻 +尼 +尽 +尾 +尿 +局 +屁 +层 +屄 +居 +屆 +屈 +屉 +届 +屋 +屌 +屍 +屎 +屏 +屐 +屑 +展 +屜 +属 +屠 +屡 +屢 +層 +履 +屬 +屯 +山 +屹 +屿 +岀 +岁 +岂 +岌 +岐 +岑 +岔 +岖 +岗 +岘 +岙 +岚 +岛 +岡 +岩 +岫 +岬 +岭 +岱 +岳 +岷 +岸 +峇 +峋 +峒 +峙 +峡 +峤 +峥 +峦 +峨 +峪 +峭 +峯 +峰 +峴 +島 +峻 +峽 +崁 +崂 +崆 +崇 +崎 +崑 +崔 +崖 +崗 +崙 +崛 +崧 +崩 +崭 +崴 +崽 +嵇 +嵊 +嵋 +嵌 +嵐 +嵘 +嵩 +嵬 +嵯 +嶂 +嶄 +嶇 +嶋 +嶙 +嶺 +嶼 +嶽 +巅 +巍 +巒 +巔 +巖 +川 +州 +巡 +巢 +工 +左 +巧 +巨 +巩 +巫 +差 +己 +已 +巳 +巴 +巷 +巻 +巽 +巾 +巿 +币 +市 +布 +帅 +帆 +师 +希 +帐 +帑 +帕 +帖 +帘 +帚 +帛 +帜 +帝 +帥 +带 +帧 +師 +席 +帮 +帯 +帰 +帳 +帶 +帷 +常 +帼 +帽 +幀 +幂 +幄 +幅 +幌 +幔 +幕 +幟 +幡 +幢 +幣 +幫 +干 +平 +年 +并 +幸 +幹 +幺 +幻 +幼 +幽 +幾 +广 +庁 +広 +庄 +庆 +庇 +床 +序 +庐 +库 +应 +底 +庖 +店 +庙 +庚 +府 +庞 +废 +庠 +度 +座 +庫 +庭 +庵 +庶 +康 +庸 +庹 +庾 +廁 +廂 +廃 +廈 +廉 +廊 +廓 +廖 +廚 +廝 +廟 +廠 +廢 +廣 +廬 +廳 +延 +廷 +建 +廿 +开 +弁 +异 +弃 +弄 +弈 +弊 +弋 +式 +弑 +弒 +弓 +弔 +引 +弗 +弘 +弛 +弟 +张 +弥 +弦 +弧 +弩 +弭 +弯 +弱 +張 +強 +弹 +强 +弼 +弾 +彅 +彆 +彈 +彌 +彎 +归 +当 +录 +彗 +彙 +彝 +形 +彤 +彥 +彦 +彧 +彩 
+彪 +彫 +彬 +彭 +彰 +影 +彷 +役 +彻 +彼 +彿 +往 +征 +径 +待 +徇 +很 +徉 +徊 +律 +後 +徐 +徑 +徒 +従 +徕 +得 +徘 +徙 +徜 +從 +徠 +御 +徨 +復 +循 +徬 +微 +徳 +徴 +徵 +德 +徹 +徼 +徽 +心 +必 +忆 +忌 +忍 +忏 +忐 +忑 +忒 +忖 +志 +忘 +忙 +応 +忠 +忡 +忤 +忧 +忪 +快 +忱 +念 +忻 +忽 +忿 +怀 +态 +怂 +怅 +怆 +怎 +怏 +怒 +怔 +怕 +怖 +怙 +怜 +思 +怠 +怡 +急 +怦 +性 +怨 +怪 +怯 +怵 +总 +怼 +恁 +恃 +恆 +恋 +恍 +恐 +恒 +恕 +恙 +恚 +恢 +恣 +恤 +恥 +恨 +恩 +恪 +恫 +恬 +恭 +息 +恰 +恳 +恵 +恶 +恸 +恺 +恻 +恼 +恿 +悄 +悅 +悉 +悌 +悍 +悔 +悖 +悚 +悟 +悠 +患 +悦 +您 +悩 +悪 +悬 +悯 +悱 +悲 +悴 +悵 +悶 +悸 +悻 +悼 +悽 +情 +惆 +惇 +惊 +惋 +惑 +惕 +惘 +惚 +惜 +惟 +惠 +惡 +惦 +惧 +惨 +惩 +惫 +惬 +惭 +惮 +惯 +惰 +惱 +想 +惴 +惶 +惹 +惺 +愁 +愆 +愈 +愉 +愍 +意 +愕 +愚 +愛 +愜 +感 +愣 +愤 +愧 +愫 +愷 +愿 +慄 +慈 +態 +慌 +慎 +慑 +慕 +慘 +慚 +慟 +慢 +慣 +慧 +慨 +慫 +慮 +慰 +慳 +慵 +慶 +慷 +慾 +憂 +憊 +憋 +憎 +憐 +憑 +憔 +憚 +憤 +憧 +憨 +憩 +憫 +憬 +憲 +憶 +憾 +懂 +懇 +懈 +應 +懊 +懋 +懑 +懒 +懦 +懲 +懵 +懶 +懷 +懸 +懺 +懼 +懾 +懿 +戀 +戈 +戊 +戌 +戍 +戎 +戏 +成 +我 +戒 +戕 +或 +战 +戚 +戛 +戟 +戡 +戦 +截 +戬 +戮 +戰 +戲 +戳 +戴 +戶 +户 +戸 +戻 +戾 +房 +所 +扁 +扇 +扈 +扉 +手 +才 +扎 +扑 +扒 +打 +扔 +払 +托 +扛 +扣 +扦 +执 +扩 +扪 +扫 +扬 +扭 +扮 +扯 +扰 +扱 +扳 +扶 +批 +扼 +找 +承 +技 +抄 +抉 +把 +抑 +抒 +抓 +投 +抖 +抗 +折 +抚 +抛 +抜 +択 +抟 +抠 +抡 +抢 +护 +报 +抨 +披 +抬 +抱 +抵 +抹 +押 +抽 +抿 +拂 +拄 +担 +拆 +拇 +拈 +拉 +拋 +拌 +拍 +拎 +拐 +拒 +拓 +拔 +拖 +拗 +拘 +拙 +拚 +招 +拜 +拟 +拡 +拢 +拣 +拥 +拦 +拧 +拨 +择 +括 +拭 +拮 +拯 +拱 +拳 +拴 +拷 +拼 +拽 +拾 +拿 +持 +挂 +指 +挈 +按 +挎 +挑 +挖 +挙 +挚 +挛 +挝 +挞 +挟 +挠 +挡 +挣 +挤 +挥 +挨 +挪 +挫 +振 +挲 +挹 +挺 +挽 +挾 +捂 +捅 +捆 +捉 +捋 +捌 +捍 +捎 +捏 +捐 +捕 +捞 +损 +捡 +换 +捣 +捧 +捨 +捩 +据 +捱 +捲 +捶 +捷 +捺 +捻 +掀 +掂 +掃 +掇 +授 +掉 +掌 +掏 +掐 +排 +掖 +掘 +掙 +掛 +掠 +採 +探 +掣 +接 +控 +推 +掩 +措 +掬 +掰 +掲 +掳 +掴 +掷 +掸 +掺 +揀 +揃 +揄 +揆 +揉 +揍 +描 +提 +插 +揖 +揚 +換 +握 +揣 +揩 +揪 +揭 +揮 +援 +揶 +揸 +揹 +揽 +搀 +搁 +搂 +搅 +損 +搏 +搐 +搓 +搔 +搖 +搗 +搜 +搞 +搡 +搪 +搬 +搭 +搵 +搶 +携 +搽 +摀 +摁 +摄 +摆 +摇 +摈 +摊 +摒 +摔 +摘 +摞 +摟 +摧 +摩 +摯 +摳 +摸 +摹 +摺 +摻 +撂 +撃 +撅 +撇 +撈 +撐 +撑 +撒 +撓 +撕 +撚 +撞 +撤 +撥 +撩 +撫 +撬 +播 +撮 +撰 +撲 +撵 +撷 +撸 +撻 +撼 +撿 +擀 +擁 +擂 +擄 +擅 +擇 +擊 +擋 +操 +擎 +擒 +擔 +擘 +據 +擞 +擠 +擡 +擢 +擦 +擬 +擰 +擱 +擲 +擴 +擷 +擺 +擼 +擾 +攀 +攏 +攒 +攔 +攘 +攙 +攜 +攝 +攞 +攢 +攣 +攤 +攥 +攪 +攫 +攬 +支 +收 +攸 +改 +攻 +放 +政 +故 +效 +敌 +敍 +敎 +敏 +救 +敕 +敖 +敗 +敘 +教 +敛 +敝 +敞 +敢 +散 +敦 +敬 +数 +敲 +整 +敵 +敷 +數 +斂 +斃 +文 +斋 +斌 +斎 +斐 +斑 +斓 +斗 +料 +斛 +斜 +斟 +斡 +斤 +斥 +斧 +斩 +斫 +斬 +断 +斯 +新 +斷 +方 +於 +施 +旁 +旃 +旅 +旋 +旌 +旎 +族 +旖 +旗 +无 +既 +日 +旦 +旧 +旨 +早 +旬 +旭 +旮 +旱 +时 +旷 +旺 +旻 +昀 +昂 +昆 +昇 +昉 +昊 +昌 +明 +昏 +易 +昔 +昕 +昙 +星 +映 +春 +昧 +昨 +昭 +是 +昱 +昴 +昵 +昶 +昼 +显 +晁 +時 +晃 +晉 +晋 +晌 +晏 +晒 +晓 +晔 +晕 +晖 +晗 +晚 +晝 +晞 +晟 +晤 +晦 +晨 +晩 +普 +景 +晰 +晴 +晶 +晷 +智 +晾 +暂 +暄 +暇 +暈 +暉 +暌 +暐 +暑 +暖 +暗 +暝 +暢 +暧 +暨 +暫 +暮 +暱 +暴 +暸 +暹 +曄 +曆 +曇 +曉 +曖 +曙 +曜 +曝 +曠 +曦 +曬 +曰 +曲 +曳 +更 +書 +曹 +曼 +曾 +替 +最 +會 +月 +有 +朋 +服 +朐 +朔 +朕 +朗 +望 +朝 +期 +朦 +朧 +木 +未 +末 +本 +札 +朮 +术 +朱 +朴 +朵 +机 +朽 +杀 +杂 +权 +杆 +杈 +杉 +李 +杏 +材 +村 +杓 +杖 +杜 +杞 +束 +杠 +条 +来 +杨 +杭 +杯 +杰 +東 +杳 +杵 +杷 +杼 +松 +板 +极 +构 +枇 +枉 +枋 +析 +枕 +林 +枚 +果 +枝 +枢 +枣 +枪 +枫 +枭 +枯 +枰 +枱 +枳 +架 +枷 +枸 +柄 +柏 +某 +柑 +柒 +染 +柔 +柘 +柚 +柜 +柞 +柠 +柢 +查 +柩 +柬 +柯 +柱 +柳 +柴 +柵 +査 +柿 +栀 +栃 +栄 +栅 +标 +栈 +栉 +栋 +栎 +栏 +树 +栓 +栖 +栗 +校 +栩 +株 +样 +核 +根 +格 +栽 +栾 +桀 +桁 +桂 +桃 +桅 +框 +案 +桉 +桌 +桎 +桐 +桑 +桓 +桔 +桜 +桠 +桡 +桢 +档 +桥 +桦 +桧 +桨 +桩 +桶 +桿 +梁 +梅 +梆 +梏 +梓 +梗 +條 +梟 +梢 +梦 +梧 +梨 +梭 +梯 +械 +梳 +梵 +梶 +检 +棂 +棄 +棉 +棋 +棍 +棒 +棕 +棗 +棘 +棚 +棟 +棠 +棣 +棧 +森 +棱 +棲 +棵 +棹 +棺 +椁 +椅 +椋 +植 +椎 +椒 +検 +椪 +椭 +椰 +椹 +椽 +椿 +楂 +楊 +楓 +楔 +楚 +楝 +楞 +楠 +楣 +楨 +楫 +業 +楮 +極 +楷 +楸 +楹 +楼 +楽 +概 +榄 +榆 +榈 +榉 +榔 +榕 +榖 +榛 +榜 +榨 +榫 +榭 +榮 +榱 +榴 +榷 +榻 +槁 +槃 +構 +槌 +槍 +槎 +槐 +槓 +様 +槛 +槟 +槤 +槭 +槲 +槳 +槻 +槽 +槿 +樁 +樂 +樊 +樑 +樓 +標 +樞 +樟 +模 +樣 +権 +横 +樫 +樯 +樱 +樵 +樸 +樹 +樺 +樽 +樾 +橄 +橇 +橋 +橐 +橘 +橙 +機 +橡 +橢 +橫 +橱 +橹 +橼 +檀 +檄 +檎 +檐 +檔 +檗 +檜 +檢 +檬 +檯 +檳 +檸 +檻 +櫃 +櫚 +櫛 +櫥 +櫸 +櫻 +欄 +權 +欒 +欖 +欠 +次 +欢 +欣 +欧 +欲 +欸 +欺 +欽 +款 +歆 +歇 +歉 +歌 +歎 +歐 +歓 +歙 +歛 +歡 +止 +正 +此 +步 +武 +歧 +歩 +歪 +歯 +歲 +歳 +歴 +歷 +歸 +歹 +死 +歼 +殁 +殃 +殆 +殇 +殉 +殊 +残 +殒 +殓 +殖 +殘 +殞 +殡 +殤 +殭 +殯 +殲 +殴 +段 +殷 +殺 +殼 +殿 +毀 +毁 +毂 +毅 +毆 +毋 +母 +毎 +每 +毒 +毓 +比 +毕 +毗 +毘 +毙 +毛 +毡 +毫 +毯 
+毽 +氈 +氏 +氐 +民 +氓 +气 +氖 +気 +氙 +氛 +氟 +氡 +氢 +氣 +氤 +氦 +氧 +氨 +氪 +氫 +氮 +氯 +氰 +氲 +水 +氷 +永 +氹 +氾 +汀 +汁 +求 +汆 +汇 +汉 +汎 +汐 +汕 +汗 +汙 +汛 +汝 +汞 +江 +池 +污 +汤 +汨 +汩 +汪 +汰 +汲 +汴 +汶 +汹 +決 +汽 +汾 +沁 +沂 +沃 +沅 +沈 +沉 +沌 +沏 +沐 +沒 +沓 +沖 +沙 +沛 +沟 +没 +沢 +沣 +沥 +沦 +沧 +沪 +沫 +沭 +沮 +沱 +河 +沸 +油 +治 +沼 +沽 +沾 +沿 +況 +泄 +泉 +泊 +泌 +泓 +法 +泗 +泛 +泞 +泠 +泡 +波 +泣 +泥 +注 +泪 +泫 +泮 +泯 +泰 +泱 +泳 +泵 +泷 +泸 +泻 +泼 +泽 +泾 +洁 +洄 +洋 +洒 +洗 +洙 +洛 +洞 +津 +洩 +洪 +洮 +洱 +洲 +洵 +洶 +洸 +洹 +活 +洼 +洽 +派 +流 +浃 +浄 +浅 +浆 +浇 +浊 +测 +济 +浏 +浑 +浒 +浓 +浔 +浙 +浚 +浜 +浣 +浦 +浩 +浪 +浬 +浮 +浯 +浴 +海 +浸 +涂 +涅 +涇 +消 +涉 +涌 +涎 +涓 +涔 +涕 +涙 +涛 +涝 +涞 +涟 +涠 +涡 +涣 +涤 +润 +涧 +涨 +涩 +涪 +涮 +涯 +液 +涵 +涸 +涼 +涿 +淀 +淄 +淅 +淆 +淇 +淋 +淌 +淑 +淒 +淖 +淘 +淙 +淚 +淞 +淡 +淤 +淦 +淨 +淩 +淪 +淫 +淬 +淮 +深 +淳 +淵 +混 +淹 +淺 +添 +淼 +清 +済 +渉 +渊 +渋 +渍 +渎 +渐 +渔 +渗 +渙 +渚 +減 +渝 +渠 +渡 +渣 +渤 +渥 +渦 +温 +測 +渭 +港 +渲 +渴 +游 +渺 +渾 +湃 +湄 +湊 +湍 +湖 +湘 +湛 +湟 +湧 +湫 +湮 +湯 +湳 +湾 +湿 +満 +溃 +溅 +溉 +溏 +源 +準 +溜 +溝 +溟 +溢 +溥 +溧 +溪 +溫 +溯 +溱 +溴 +溶 +溺 +溼 +滁 +滂 +滄 +滅 +滇 +滋 +滌 +滑 +滓 +滔 +滕 +滙 +滚 +滝 +滞 +滟 +满 +滢 +滤 +滥 +滦 +滨 +滩 +滬 +滯 +滲 +滴 +滷 +滸 +滾 +滿 +漁 +漂 +漆 +漉 +漏 +漓 +演 +漕 +漠 +漢 +漣 +漩 +漪 +漫 +漬 +漯 +漱 +漲 +漳 +漸 +漾 +漿 +潆 +潇 +潋 +潍 +潑 +潔 +潘 +潛 +潜 +潞 +潟 +潢 +潤 +潦 +潧 +潭 +潮 +潰 +潴 +潸 +潺 +潼 +澀 +澄 +澆 +澈 +澍 +澎 +澗 +澜 +澡 +澤 +澧 +澱 +澳 +澹 +激 +濁 +濂 +濃 +濑 +濒 +濕 +濘 +濛 +濟 +濠 +濡 +濤 +濫 +濬 +濮 +濯 +濱 +濺 +濾 +瀅 +瀆 +瀉 +瀋 +瀏 +瀑 +瀕 +瀘 +瀚 +瀛 +瀝 +瀞 +瀟 +瀧 +瀨 +瀬 +瀰 +瀾 +灌 +灏 +灑 +灘 +灝 +灞 +灣 +火 +灬 +灭 +灯 +灰 +灵 +灶 +灸 +灼 +災 +灾 +灿 +炀 +炁 +炅 +炉 +炊 +炎 +炒 +炔 +炕 +炖 +炙 +炜 +炫 +炬 +炭 +炮 +炯 +炳 +炷 +炸 +点 +為 +炼 +炽 +烁 +烂 +烃 +烈 +烊 +烏 +烘 +烙 +烛 +烟 +烤 +烦 +烧 +烨 +烩 +烫 +烬 +热 +烯 +烷 +烹 +烽 +焉 +焊 +焕 +焖 +焗 +焘 +焙 +焚 +焜 +無 +焦 +焯 +焰 +焱 +然 +焼 +煅 +煉 +煊 +煌 +煎 +煒 +煖 +煙 +煜 +煞 +煤 +煥 +煦 +照 +煨 +煩 +煮 +煲 +煸 +煽 +熄 +熊 +熏 +熒 +熔 +熙 +熟 +熠 +熨 +熬 +熱 +熵 +熹 +熾 +燁 +燃 +燄 +燈 +燉 +燊 +燎 +燒 +燔 +燕 +燙 +燜 +營 +燥 +燦 +燧 +燭 +燮 +燴 +燻 +燼 +燿 +爆 +爍 +爐 +爛 +爪 +爬 +爭 +爰 +爱 +爲 +爵 +父 +爷 +爸 +爹 +爺 +爻 +爽 +爾 +牆 +片 +版 +牌 +牍 +牒 +牙 +牛 +牝 +牟 +牠 +牡 +牢 +牦 +牧 +物 +牯 +牲 +牴 +牵 +特 +牺 +牽 +犀 +犁 +犄 +犊 +犍 +犒 +犢 +犧 +犬 +犯 +状 +犷 +犸 +犹 +狀 +狂 +狄 +狈 +狎 +狐 +狒 +狗 +狙 +狞 +狠 +狡 +狩 +独 +狭 +狮 +狰 +狱 +狸 +狹 +狼 +狽 +猎 +猕 +猖 +猗 +猙 +猛 +猜 +猝 +猥 +猩 +猪 +猫 +猬 +献 +猴 +猶 +猷 +猾 +猿 +獄 +獅 +獎 +獐 +獒 +獗 +獠 +獣 +獨 +獭 +獰 +獲 +獵 +獷 +獸 +獺 +獻 +獼 +獾 +玄 +率 +玉 +王 +玑 +玖 +玛 +玟 +玠 +玥 +玩 +玫 +玮 +环 +现 +玲 +玳 +玷 +玺 +玻 +珀 +珂 +珅 +珈 +珉 +珊 +珍 +珏 +珐 +珑 +珙 +珞 +珠 +珣 +珥 +珩 +珪 +班 +珮 +珲 +珺 +現 +球 +琅 +理 +琇 +琉 +琊 +琍 +琏 +琐 +琛 +琢 +琥 +琦 +琨 +琪 +琬 +琮 +琰 +琲 +琳 +琴 +琵 +琶 +琺 +琼 +瑀 +瑁 +瑄 +瑋 +瑕 +瑗 +瑙 +瑚 +瑛 +瑜 +瑞 +瑟 +瑠 +瑣 +瑤 +瑩 +瑪 +瑯 +瑰 +瑶 +瑾 +璀 +璁 +璃 +璇 +璉 +璋 +璎 +璐 +璜 +璞 +璟 +璧 +璨 +環 +璽 +璿 +瓊 +瓏 +瓒 +瓜 +瓢 +瓣 +瓤 +瓦 +瓮 +瓯 +瓴 +瓶 +瓷 +甄 +甌 +甕 +甘 +甙 +甚 +甜 +生 +產 +産 +甥 +甦 +用 +甩 +甫 +甬 +甭 +甯 +田 +由 +甲 +申 +电 +男 +甸 +町 +画 +甾 +畀 +畅 +界 +畏 +畑 +畔 +留 +畜 +畝 +畢 +略 +畦 +番 +畫 +異 +畲 +畳 +畴 +當 +畸 +畹 +畿 +疆 +疇 +疊 +疏 +疑 +疔 +疖 +疗 +疙 +疚 +疝 +疟 +疡 +疣 +疤 +疥 +疫 +疮 +疯 +疱 +疲 +疳 +疵 +疸 +疹 +疼 +疽 +疾 +痂 +病 +症 +痈 +痉 +痊 +痍 +痒 +痔 +痕 +痘 +痙 +痛 +痞 +痠 +痢 +痣 +痤 +痧 +痨 +痪 +痫 +痰 +痱 +痴 +痹 +痺 +痼 +痿 +瘀 +瘁 +瘋 +瘍 +瘓 +瘘 +瘙 +瘟 +瘠 +瘡 +瘢 +瘤 +瘦 +瘧 +瘩 +瘪 +瘫 +瘴 +瘸 +瘾 +療 +癇 +癌 +癒 +癖 +癜 +癞 +癡 +癢 +癣 +癥 +癫 +癬 +癮 +癱 +癲 +癸 +発 +登 +發 +白 +百 +皂 +的 +皆 +皇 +皈 +皋 +皎 +皑 +皓 +皖 +皙 +皚 +皮 +皰 +皱 +皴 +皺 +皿 +盂 +盃 +盅 +盆 +盈 +益 +盎 +盏 +盐 +监 +盒 +盔 +盖 +盗 +盘 +盛 +盜 +盞 +盟 +盡 +監 +盤 +盥 +盧 +盪 +目 +盯 +盱 +盲 +直 +相 +盹 +盼 +盾 +省 +眈 +眉 +看 +県 +眙 +眞 +真 +眠 +眦 +眨 +眩 +眯 +眶 +眷 +眸 +眺 +眼 +眾 +着 +睁 +睇 +睏 +睐 +睑 +睛 +睜 +睞 +睡 +睢 +督 +睥 +睦 +睨 +睪 +睫 +睬 +睹 +睽 +睾 +睿 +瞄 +瞅 +瞇 +瞋 +瞌 +瞎 +瞑 +瞒 +瞓 +瞞 +瞟 +瞠 +瞥 +瞧 +瞩 +瞪 +瞬 +瞭 +瞰 +瞳 +瞻 +瞼 +瞿 +矇 +矍 +矗 +矚 +矛 +矜 +矢 +矣 +知 +矩 +矫 +短 +矮 +矯 +石 +矶 +矽 +矾 +矿 +码 +砂 +砌 +砍 +砒 +研 +砖 +砗 +砚 +砝 +砣 +砥 +砧 +砭 +砰 +砲 +破 +砷 +砸 +砺 +砼 +砾 +础 +硅 +硐 +硒 +硕 +硝 +硫 +硬 +确 +硯 +硼 +碁 +碇 +碉 +碌 +碍 +碎 +碑 +碓 +碗 +碘 +碚 +碛 +碟 +碣 +碧 +碩 +碰 +碱 +碳 +碴 +確 +碼 +碾 +磁 +磅 +磊 +磋 +磐 +磕 +磚 +磡 +磨 +磬 +磯 +磲 +磷 +磺 +礁 +礎 +礙 +礡 +礦 +礪 +礫 +礴 +示 +礼 +社 +祀 +祁 +祂 +祇 +祈 +祉 +祎 +祐 +祕 +祖 +祗 +祚 +祛 +祜 +祝 +神 +祟 +祠 +祢 +祥 +票 +祭 +祯 +祷 
+祸 +祺 +祿 +禀 +禁 +禄 +禅 +禍 +禎 +福 +禛 +禦 +禧 +禪 +禮 +禱 +禹 +禺 +离 +禽 +禾 +禿 +秀 +私 +秃 +秆 +秉 +秋 +种 +科 +秒 +秘 +租 +秣 +秤 +秦 +秧 +秩 +秭 +积 +称 +秸 +移 +秽 +稀 +稅 +程 +稍 +税 +稔 +稗 +稚 +稜 +稞 +稟 +稠 +稣 +種 +稱 +稲 +稳 +稷 +稹 +稻 +稼 +稽 +稿 +穀 +穂 +穆 +穌 +積 +穎 +穗 +穢 +穩 +穫 +穴 +究 +穷 +穹 +空 +穿 +突 +窃 +窄 +窈 +窍 +窑 +窒 +窓 +窕 +窖 +窗 +窘 +窜 +窝 +窟 +窠 +窥 +窦 +窨 +窩 +窪 +窮 +窯 +窺 +窿 +竄 +竅 +竇 +竊 +立 +竖 +站 +竜 +竞 +竟 +章 +竣 +童 +竭 +端 +競 +竹 +竺 +竽 +竿 +笃 +笆 +笈 +笋 +笏 +笑 +笔 +笙 +笛 +笞 +笠 +符 +笨 +第 +笹 +笺 +笼 +筆 +等 +筊 +筋 +筍 +筏 +筐 +筑 +筒 +答 +策 +筛 +筝 +筠 +筱 +筲 +筵 +筷 +筹 +签 +简 +箇 +箋 +箍 +箏 +箐 +箔 +箕 +算 +箝 +管 +箩 +箫 +箭 +箱 +箴 +箸 +節 +篁 +範 +篆 +篇 +築 +篑 +篓 +篙 +篝 +篠 +篡 +篤 +篩 +篪 +篮 +篱 +篷 +簇 +簌 +簍 +簡 +簦 +簧 +簪 +簫 +簷 +簸 +簽 +簾 +簿 +籁 +籃 +籌 +籍 +籐 +籟 +籠 +籤 +籬 +籮 +籲 +米 +类 +籼 +籽 +粄 +粉 +粑 +粒 +粕 +粗 +粘 +粟 +粤 +粥 +粧 +粪 +粮 +粱 +粲 +粳 +粵 +粹 +粼 +粽 +精 +粿 +糅 +糊 +糍 +糕 +糖 +糗 +糙 +糜 +糞 +糟 +糠 +糧 +糬 +糯 +糰 +糸 +系 +糾 +紀 +紂 +約 +紅 +紉 +紊 +紋 +納 +紐 +紓 +純 +紗 +紘 +紙 +級 +紛 +紜 +素 +紡 +索 +紧 +紫 +紮 +累 +細 +紳 +紹 +紺 +終 +絃 +組 +絆 +経 +結 +絕 +絞 +絡 +絢 +給 +絨 +絮 +統 +絲 +絳 +絵 +絶 +絹 +綁 +綏 +綑 +經 +継 +続 +綜 +綠 +綢 +綦 +綫 +綬 +維 +綱 +網 +綴 +綵 +綸 +綺 +綻 +綽 +綾 +綿 +緊 +緋 +総 +緑 +緒 +緘 +線 +緝 +緞 +締 +緣 +編 +緩 +緬 +緯 +練 +緹 +緻 +縁 +縄 +縈 +縛 +縝 +縣 +縫 +縮 +縱 +縴 +縷 +總 +績 +繁 +繃 +繆 +繇 +繋 +織 +繕 +繚 +繞 +繡 +繩 +繪 +繫 +繭 +繳 +繹 +繼 +繽 +纂 +續 +纍 +纏 +纓 +纔 +纖 +纜 +纠 +红 +纣 +纤 +约 +级 +纨 +纪 +纫 +纬 +纭 +纯 +纰 +纱 +纲 +纳 +纵 +纶 +纷 +纸 +纹 +纺 +纽 +纾 +线 +绀 +练 +组 +绅 +细 +织 +终 +绊 +绍 +绎 +经 +绑 +绒 +结 +绔 +绕 +绘 +给 +绚 +绛 +络 +绝 +绞 +统 +绡 +绢 +绣 +绥 +绦 +继 +绩 +绪 +绫 +续 +绮 +绯 +绰 +绳 +维 +绵 +绶 +绷 +绸 +绻 +综 +绽 +绾 +绿 +缀 +缄 +缅 +缆 +缇 +缈 +缉 +缎 +缓 +缔 +缕 +编 +缘 +缙 +缚 +缜 +缝 +缠 +缢 +缤 +缥 +缨 +缩 +缪 +缭 +缮 +缰 +缱 +缴 +缸 +缺 +缽 +罂 +罄 +罌 +罐 +网 +罔 +罕 +罗 +罚 +罡 +罢 +罩 +罪 +置 +罰 +署 +罵 +罷 +罹 +羁 +羅 +羈 +羊 +羌 +美 +羔 +羚 +羞 +羟 +羡 +羣 +群 +羥 +羧 +羨 +義 +羯 +羲 +羸 +羹 +羽 +羿 +翁 +翅 +翊 +翌 +翎 +習 +翔 +翘 +翟 +翠 +翡 +翦 +翩 +翰 +翱 +翳 +翹 +翻 +翼 +耀 +老 +考 +耄 +者 +耆 +耋 +而 +耍 +耐 +耒 +耕 +耗 +耘 +耙 +耦 +耨 +耳 +耶 +耷 +耸 +耻 +耽 +耿 +聂 +聆 +聊 +聋 +职 +聒 +联 +聖 +聘 +聚 +聞 +聪 +聯 +聰 +聲 +聳 +聴 +聶 +職 +聽 +聾 +聿 +肃 +肄 +肅 +肆 +肇 +肉 +肋 +肌 +肏 +肓 +肖 +肘 +肚 +肛 +肝 +肠 +股 +肢 +肤 +肥 +肩 +肪 +肮 +肯 +肱 +育 +肴 +肺 +肽 +肾 +肿 +胀 +胁 +胃 +胄 +胆 +背 +胍 +胎 +胖 +胚 +胛 +胜 +胝 +胞 +胡 +胤 +胥 +胧 +胫 +胭 +胯 +胰 +胱 +胳 +胴 +胶 +胸 +胺 +能 +脂 +脅 +脆 +脇 +脈 +脉 +脊 +脍 +脏 +脐 +脑 +脓 +脖 +脘 +脚 +脛 +脣 +脩 +脫 +脯 +脱 +脲 +脳 +脸 +脹 +脾 +腆 +腈 +腊 +腋 +腌 +腎 +腐 +腑 +腓 +腔 +腕 +腥 +腦 +腩 +腫 +腭 +腮 +腰 +腱 +腳 +腴 +腸 +腹 +腺 +腻 +腼 +腾 +腿 +膀 +膈 +膊 +膏 +膑 +膘 +膚 +膛 +膜 +膝 +膠 +膦 +膨 +膩 +膳 +膺 +膻 +膽 +膾 +膿 +臀 +臂 +臃 +臆 +臉 +臊 +臍 +臓 +臘 +臟 +臣 +臥 +臧 +臨 +自 +臬 +臭 +至 +致 +臺 +臻 +臼 +臾 +舀 +舂 +舅 +舆 +與 +興 +舉 +舊 +舌 +舍 +舎 +舐 +舒 +舔 +舖 +舗 +舛 +舜 +舞 +舟 +航 +舫 +般 +舰 +舱 +舵 +舶 +舷 +舸 +船 +舺 +舾 +艇 +艋 +艘 +艙 +艦 +艮 +良 +艰 +艱 +色 +艳 +艷 +艹 +艺 +艾 +节 +芃 +芈 +芊 +芋 +芍 +芎 +芒 +芙 +芜 +芝 +芡 +芥 +芦 +芩 +芪 +芫 +芬 +芭 +芮 +芯 +花 +芳 +芷 +芸 +芹 +芻 +芽 +芾 +苁 +苄 +苇 +苋 +苍 +苏 +苑 +苒 +苓 +苔 +苕 +苗 +苛 +苜 +苞 +苟 +苡 +苣 +若 +苦 +苫 +苯 +英 +苷 +苹 +苻 +茁 +茂 +范 +茄 +茅 +茉 +茎 +茏 +茗 +茜 +茧 +茨 +茫 +茬 +茭 +茯 +茱 +茲 +茴 +茵 +茶 +茸 +茹 +茼 +荀 +荃 +荆 +草 +荊 +荏 +荐 +荒 +荔 +荖 +荘 +荚 +荞 +荟 +荠 +荡 +荣 +荤 +荥 +荧 +荨 +荪 +荫 +药 +荳 +荷 +荸 +荻 +荼 +荽 +莅 +莆 +莉 +莊 +莎 +莒 +莓 +莖 +莘 +莞 +莠 +莢 +莧 +莪 +莫 +莱 +莲 +莴 +获 +莹 +莺 +莽 +莿 +菀 +菁 +菅 +菇 +菈 +菊 +菌 +菏 +菓 +菖 +菘 +菜 +菟 +菠 +菡 +菩 +華 +菱 +菲 +菸 +菽 +萁 +萃 +萄 +萊 +萋 +萌 +萍 +萎 +萘 +萝 +萤 +营 +萦 +萧 +萨 +萩 +萬 +萱 +萵 +萸 +萼 +落 +葆 +葉 +著 +葚 +葛 +葡 +董 +葦 +葩 +葫 +葬 +葭 +葯 +葱 +葳 +葵 +葷 +葺 +蒂 +蒋 +蒐 +蒔 +蒙 +蒜 +蒞 +蒟 +蒡 +蒨 +蒲 +蒸 +蒹 +蒻 +蒼 +蒿 +蓁 +蓄 +蓆 +蓉 +蓋 +蓑 +蓓 +蓖 +蓝 +蓟 +蓦 +蓬 +蓮 +蓼 +蓿 +蔑 +蔓 +蔔 +蔗 +蔘 +蔚 +蔡 +蔣 +蔥 +蔫 +蔬 +蔭 +蔵 +蔷 +蔺 +蔻 +蔼 +蔽 +蕁 +蕃 +蕈 +蕉 +蕊 +蕎 +蕙 +蕤 +蕨 +蕩 +蕪 +蕭 +蕲 +蕴 +蕻 +蕾 +薄 +薅 +薇 +薈 +薊 +薏 +薑 +薔 +薙 +薛 +薦 +薨 +薩 +薪 +薬 +薯 +薰 +薹 +藉 +藍 +藏 +藐 +藓 +藕 +藜 +藝 +藤 +藥 +藩 +藹 +藻 +藿 +蘆 +蘇 +蘊 +蘋 +蘑 +蘚 +蘭 +蘸 +蘼 +蘿 +虎 +虏 +虐 +虑 +虔 +處 +虚 +虛 +虜 +虞 +號 +虢 +虧 +虫 +虬 +虱 +虹 +虻 +虽 +虾 +蚀 +蚁 +蚂 +蚊 +蚌 +蚓 +蚕 +蚜 +蚝 +蚣 +蚤 +蚩 +蚪 +蚯 +蚱 +蚵 +蛀 +蛆 +蛇 +蛊 +蛋 +蛎 +蛐 +蛔 +蛙 +蛛 +蛟 +蛤 +蛭 +蛮 +蛰 +蛳 +蛹 +蛻 +蛾 +蜀 +蜂 +蜃 +蜆 +蜇 +蜈 +蜊 +蜍 +蜒 +蜓 +蜕 +蜗 +蜘 +蜚 +蜜 +蜡 +蜢 +蜥 +蜱 
+蜴 +蜷 +蜻 +蜿 +蝇 +蝈 +蝉 +蝌 +蝎 +蝕 +蝗 +蝙 +蝟 +蝠 +蝦 +蝨 +蝴 +蝶 +蝸 +蝼 +螂 +螃 +融 +螞 +螢 +螨 +螯 +螳 +螺 +蟀 +蟄 +蟆 +蟋 +蟎 +蟑 +蟒 +蟠 +蟬 +蟲 +蟹 +蟻 +蟾 +蠅 +蠍 +蠔 +蠕 +蠛 +蠟 +蠡 +蠢 +蠣 +蠱 +蠶 +蠹 +蠻 +血 +衄 +衅 +衆 +行 +衍 +術 +衔 +街 +衙 +衛 +衝 +衞 +衡 +衢 +衣 +补 +表 +衩 +衫 +衬 +衮 +衰 +衲 +衷 +衹 +衾 +衿 +袁 +袂 +袄 +袅 +袈 +袋 +袍 +袒 +袖 +袜 +袞 +袤 +袪 +被 +袭 +袱 +裁 +裂 +装 +裆 +裊 +裏 +裔 +裕 +裘 +裙 +補 +裝 +裟 +裡 +裤 +裨 +裱 +裳 +裴 +裸 +裹 +製 +裾 +褂 +複 +褐 +褒 +褓 +褔 +褚 +褥 +褪 +褫 +褲 +褶 +褻 +襁 +襄 +襟 +襠 +襪 +襬 +襯 +襲 +西 +要 +覃 +覆 +覇 +見 +規 +覓 +視 +覚 +覦 +覧 +親 +覬 +観 +覷 +覺 +覽 +觀 +见 +观 +规 +觅 +视 +览 +觉 +觊 +觎 +觐 +觑 +角 +觞 +解 +觥 +触 +觸 +言 +訂 +計 +訊 +討 +訓 +訕 +訖 +託 +記 +訛 +訝 +訟 +訣 +訥 +訪 +設 +許 +訳 +訴 +訶 +診 +註 +証 +詆 +詐 +詔 +評 +詛 +詞 +詠 +詡 +詢 +詣 +試 +詩 +詫 +詬 +詭 +詮 +詰 +話 +該 +詳 +詹 +詼 +誅 +誇 +誉 +誌 +認 +誓 +誕 +誘 +語 +誠 +誡 +誣 +誤 +誥 +誦 +誨 +說 +説 +読 +誰 +課 +誹 +誼 +調 +諄 +談 +請 +諏 +諒 +論 +諗 +諜 +諡 +諦 +諧 +諫 +諭 +諮 +諱 +諳 +諷 +諸 +諺 +諾 +謀 +謁 +謂 +謄 +謊 +謎 +謐 +謔 +謗 +謙 +講 +謝 +謠 +謨 +謬 +謹 +謾 +譁 +證 +譎 +譏 +識 +譙 +譚 +譜 +警 +譬 +譯 +議 +譲 +譴 +護 +譽 +讀 +變 +讓 +讚 +讞 +计 +订 +认 +讥 +讧 +讨 +让 +讪 +讫 +训 +议 +讯 +记 +讲 +讳 +讴 +讶 +讷 +许 +讹 +论 +讼 +讽 +设 +访 +诀 +证 +诃 +评 +诅 +识 +诈 +诉 +诊 +诋 +词 +诏 +译 +试 +诗 +诘 +诙 +诚 +诛 +话 +诞 +诟 +诠 +诡 +询 +诣 +诤 +该 +详 +诧 +诩 +诫 +诬 +语 +误 +诰 +诱 +诲 +说 +诵 +诶 +请 +诸 +诺 +读 +诽 +课 +诿 +谀 +谁 +调 +谄 +谅 +谆 +谈 +谊 +谋 +谌 +谍 +谎 +谏 +谐 +谑 +谒 +谓 +谔 +谕 +谗 +谘 +谙 +谚 +谛 +谜 +谟 +谢 +谣 +谤 +谥 +谦 +谧 +谨 +谩 +谪 +谬 +谭 +谯 +谱 +谲 +谴 +谶 +谷 +豁 +豆 +豇 +豈 +豉 +豊 +豌 +豎 +豐 +豔 +豚 +象 +豢 +豪 +豫 +豬 +豹 +豺 +貂 +貅 +貌 +貓 +貔 +貘 +貝 +貞 +負 +財 +貢 +貧 +貨 +販 +貪 +貫 +責 +貯 +貰 +貳 +貴 +貶 +買 +貸 +費 +貼 +貽 +貿 +賀 +賁 +賂 +賃 +賄 +資 +賈 +賊 +賑 +賓 +賜 +賞 +賠 +賡 +賢 +賣 +賤 +賦 +質 +賬 +賭 +賴 +賺 +購 +賽 +贅 +贈 +贊 +贍 +贏 +贓 +贖 +贛 +贝 +贞 +负 +贡 +财 +责 +贤 +败 +账 +货 +质 +贩 +贪 +贫 +贬 +购 +贮 +贯 +贰 +贱 +贲 +贴 +贵 +贷 +贸 +费 +贺 +贻 +贼 +贾 +贿 +赁 +赂 +赃 +资 +赅 +赈 +赊 +赋 +赌 +赎 +赏 +赐 +赓 +赔 +赖 +赘 +赚 +赛 +赝 +赞 +赠 +赡 +赢 +赣 +赤 +赦 +赧 +赫 +赭 +走 +赳 +赴 +赵 +赶 +起 +趁 +超 +越 +趋 +趕 +趙 +趟 +趣 +趨 +足 +趴 +趵 +趸 +趺 +趾 +跃 +跄 +跆 +跋 +跌 +跎 +跑 +跖 +跚 +跛 +距 +跟 +跡 +跤 +跨 +跩 +跪 +路 +跳 +践 +跷 +跹 +跺 +跻 +踉 +踊 +踌 +踏 +踐 +踝 +踞 +踟 +踢 +踩 +踪 +踮 +踱 +踴 +踵 +踹 +蹂 +蹄 +蹇 +蹈 +蹉 +蹊 +蹋 +蹑 +蹒 +蹙 +蹟 +蹣 +蹤 +蹦 +蹩 +蹬 +蹭 +蹲 +蹴 +蹶 +蹺 +蹼 +蹿 +躁 +躇 +躉 +躊 +躋 +躍 +躏 +躪 +身 +躬 +躯 +躲 +躺 +軀 +車 +軋 +軌 +軍 +軒 +軟 +転 +軸 +軼 +軽 +軾 +較 +載 +輒 +輓 +輔 +輕 +輛 +輝 +輟 +輩 +輪 +輯 +輸 +輻 +輾 +輿 +轄 +轅 +轆 +轉 +轍 +轎 +轟 +车 +轧 +轨 +轩 +转 +轭 +轮 +软 +轰 +轲 +轴 +轶 +轻 +轼 +载 +轿 +较 +辄 +辅 +辆 +辇 +辈 +辉 +辊 +辍 +辐 +辑 +输 +辕 +辖 +辗 +辘 +辙 +辛 +辜 +辞 +辟 +辣 +辦 +辨 +辩 +辫 +辭 +辮 +辯 +辰 +辱 +農 +边 +辺 +辻 +込 +辽 +达 +迁 +迂 +迄 +迅 +过 +迈 +迎 +运 +近 +返 +还 +这 +进 +远 +违 +连 +迟 +迢 +迤 +迥 +迦 +迩 +迪 +迫 +迭 +述 +迴 +迷 +迸 +迹 +迺 +追 +退 +送 +适 +逃 +逅 +逆 +选 +逊 +逍 +透 +逐 +递 +途 +逕 +逗 +這 +通 +逛 +逝 +逞 +速 +造 +逢 +連 +逮 +週 +進 +逵 +逶 +逸 +逻 +逼 +逾 +遁 +遂 +遅 +遇 +遊 +運 +遍 +過 +遏 +遐 +遑 +遒 +道 +達 +違 +遗 +遙 +遛 +遜 +遞 +遠 +遢 +遣 +遥 +遨 +適 +遭 +遮 +遲 +遴 +遵 +遶 +遷 +選 +遺 +遼 +遽 +避 +邀 +邁 +邂 +邃 +還 +邇 +邈 +邊 +邋 +邏 +邑 +邓 +邕 +邛 +邝 +邢 +那 +邦 +邨 +邪 +邬 +邮 +邯 +邰 +邱 +邳 +邵 +邸 +邹 +邺 +邻 +郁 +郅 +郊 +郎 +郑 +郜 +郝 +郡 +郢 +郤 +郦 +郧 +部 +郫 +郭 +郴 +郵 +郷 +郸 +都 +鄂 +鄉 +鄒 +鄔 +鄙 +鄞 +鄢 +鄧 +鄭 +鄰 +鄱 +鄲 +鄺 +酉 +酊 +酋 +酌 +配 +酐 +酒 +酗 +酚 +酝 +酢 +酣 +酥 +酩 +酪 +酬 +酮 +酯 +酰 +酱 +酵 +酶 +酷 +酸 +酿 +醃 +醇 +醉 +醋 +醍 +醐 +醒 +醚 +醛 +醜 +醞 +醣 +醪 +醫 +醬 +醮 +醯 +醴 +醺 +釀 +釁 +采 +釉 +释 +釋 +里 +重 +野 +量 +釐 +金 +釗 +釘 +釜 +針 +釣 +釦 +釧 +釵 +鈀 +鈉 +鈍 +鈎 +鈔 +鈕 +鈞 +鈣 +鈦 +鈪 +鈴 +鈺 +鈾 +鉀 +鉄 +鉅 +鉉 +鉑 +鉗 +鉚 +鉛 +鉤 +鉴 +鉻 +銀 +銃 +銅 +銑 +銓 +銖 +銘 +銜 +銬 +銭 +銮 +銳 +銷 +銹 +鋁 +鋅 +鋒 +鋤 +鋪 +鋰 +鋸 +鋼 +錄 +錐 +錘 +錚 +錠 +錢 +錦 +錨 +錫 +錮 +錯 +録 +錳 +錶 +鍊 +鍋 +鍍 +鍛 +鍥 +鍰 +鍵 +鍺 +鍾 +鎂 +鎊 +鎌 +鎏 +鎔 +鎖 +鎗 +鎚 +鎧 +鎬 +鎮 +鎳 +鏈 +鏖 +鏗 +鏘 +鏞 +鏟 +鏡 +鏢 +鏤 +鏽 +鐘 +鐮 +鐲 +鐳 +鐵 +鐸 +鐺 +鑄 +鑊 +鑑 +鑒 +鑣 +鑫 +鑰 +鑲 +鑼 +鑽 +鑾 +鑿 +针 +钉 +钊 +钎 +钏 +钒 +钓 +钗 +钙 +钛 +钜 +钝 +钞 +钟 +钠 +钡 +钢 +钣 +钤 +钥 +钦 +钧 +钨 +钩 +钮 +钯 +钰 +钱 +钳 +钴 +钵 +钺 +钻 +钼 +钾 +钿 +铀 +铁 +铂 +铃 +铄 +铅 +铆 +铉 +铎 +铐 +铛 +铜 +铝 +铠 +铡 +铢 +铣 +铤 +铨 +铩 +铬 +铭 +铮 +铰 +铲 +铵 +银 +铸 +铺 +链 +铿 +销 +锁 +锂 +锄 +锅 +锆 +锈 +锉 +锋 +锌 +锏 +锐 +锑 +错 +锚 +锟 +锡 +锢 +锣 +锤 +锥 +锦 +锭 +键 +锯 +锰 +锲 +锵 +锹 
+锺 +锻 +镀 +镁 +镂 +镇 +镉 +镌 +镍 +镐 +镑 +镕 +镖 +镗 +镛 +镜 +镣 +镭 +镯 +镰 +镳 +镶 +長 +长 +門 +閃 +閉 +開 +閎 +閏 +閑 +閒 +間 +閔 +閘 +閡 +関 +閣 +閥 +閨 +閩 +閱 +閲 +閹 +閻 +閾 +闆 +闇 +闊 +闌 +闍 +闔 +闕 +闖 +闘 +關 +闡 +闢 +门 +闪 +闫 +闭 +问 +闯 +闰 +闲 +间 +闵 +闷 +闸 +闹 +闺 +闻 +闽 +闾 +阀 +阁 +阂 +阅 +阆 +阇 +阈 +阉 +阎 +阐 +阑 +阔 +阕 +阖 +阙 +阚 +阜 +队 +阡 +阪 +阮 +阱 +防 +阳 +阴 +阵 +阶 +阻 +阿 +陀 +陂 +附 +际 +陆 +陇 +陈 +陋 +陌 +降 +限 +陕 +陛 +陝 +陞 +陟 +陡 +院 +陣 +除 +陨 +险 +陪 +陰 +陲 +陳 +陵 +陶 +陷 +陸 +険 +陽 +隅 +隆 +隈 +隊 +隋 +隍 +階 +随 +隐 +隔 +隕 +隘 +隙 +際 +障 +隠 +隣 +隧 +隨 +險 +隱 +隴 +隶 +隸 +隻 +隼 +隽 +难 +雀 +雁 +雄 +雅 +集 +雇 +雉 +雋 +雌 +雍 +雎 +雏 +雑 +雒 +雕 +雖 +雙 +雛 +雜 +雞 +離 +難 +雨 +雪 +雯 +雰 +雲 +雳 +零 +雷 +雹 +電 +雾 +需 +霁 +霄 +霆 +震 +霈 +霉 +霊 +霍 +霎 +霏 +霑 +霓 +霖 +霜 +霞 +霧 +霭 +霰 +露 +霸 +霹 +霽 +霾 +靂 +靄 +靈 +青 +靓 +靖 +静 +靚 +靛 +靜 +非 +靠 +靡 +面 +靥 +靦 +革 +靳 +靴 +靶 +靼 +鞅 +鞋 +鞍 +鞏 +鞑 +鞘 +鞠 +鞣 +鞦 +鞭 +韆 +韋 +韌 +韓 +韜 +韦 +韧 +韩 +韬 +韭 +音 +韵 +韶 +韻 +響 +頁 +頂 +頃 +項 +順 +須 +頌 +預 +頑 +頒 +頓 +頗 +領 +頜 +頡 +頤 +頫 +頭 +頰 +頷 +頸 +頹 +頻 +頼 +顆 +題 +額 +顎 +顏 +顔 +願 +顛 +類 +顧 +顫 +顯 +顱 +顴 +页 +顶 +顷 +项 +顺 +须 +顼 +顽 +顾 +顿 +颁 +颂 +预 +颅 +领 +颇 +颈 +颉 +颊 +颌 +颍 +颐 +频 +颓 +颔 +颖 +颗 +题 +颚 +颛 +颜 +额 +颞 +颠 +颡 +颢 +颤 +颦 +颧 +風 +颯 +颱 +颳 +颶 +颼 +飄 +飆 +风 +飒 +飓 +飕 +飘 +飙 +飚 +飛 +飞 +食 +飢 +飨 +飩 +飪 +飯 +飲 +飼 +飽 +飾 +餃 +餅 +餉 +養 +餌 +餐 +餒 +餓 +餘 +餚 +餛 +餞 +餡 +館 +餮 +餵 +餾 +饅 +饈 +饋 +饌 +饍 +饑 +饒 +饕 +饗 +饞 +饥 +饨 +饪 +饬 +饭 +饮 +饯 +饰 +饱 +饲 +饴 +饵 +饶 +饷 +饺 +饼 +饽 +饿 +馀 +馁 +馄 +馅 +馆 +馈 +馋 +馍 +馏 +馒 +馔 +首 +馗 +香 +馥 +馨 +馬 +馭 +馮 +馳 +馴 +駁 +駄 +駅 +駆 +駐 +駒 +駕 +駛 +駝 +駭 +駱 +駿 +騁 +騎 +騏 +験 +騙 +騨 +騰 +騷 +驀 +驅 +驊 +驍 +驒 +驕 +驗 +驚 +驛 +驟 +驢 +驥 +马 +驭 +驮 +驯 +驰 +驱 +驳 +驴 +驶 +驷 +驸 +驹 +驻 +驼 +驾 +驿 +骁 +骂 +骄 +骅 +骆 +骇 +骈 +骊 +骋 +验 +骏 +骐 +骑 +骗 +骚 +骛 +骜 +骞 +骠 +骡 +骤 +骥 +骧 +骨 +骯 +骰 +骶 +骷 +骸 +骼 +髂 +髅 +髋 +髏 +髒 +髓 +體 +髖 +高 +髦 +髪 +髮 +髯 +髻 +鬃 +鬆 +鬍 +鬓 +鬚 +鬟 +鬢 +鬣 +鬥 +鬧 +鬱 +鬼 +魁 +魂 +魄 +魅 +魇 +魍 +魏 +魔 +魘 +魚 +魯 +魷 +鮑 +鮨 +鮪 +鮭 +鮮 +鯉 +鯊 +鯖 +鯛 +鯨 +鯰 +鯽 +鰍 +鰓 +鰭 +鰲 +鰻 +鰾 +鱈 +鱉 +鱔 +鱗 +鱷 +鱸 +鱼 +鱿 +鲁 +鲈 +鲍 +鲑 +鲛 +鲜 +鲟 +鲢 +鲤 +鲨 +鲫 +鲱 +鲲 +鲶 +鲷 +鲸 +鳃 +鳄 +鳅 +鳌 +鳍 +鳕 +鳖 +鳗 +鳝 +鳞 +鳥 +鳩 +鳳 +鳴 +鳶 +鴉 +鴕 +鴛 +鴦 +鴨 +鴻 +鴿 +鵑 +鵜 +鵝 +鵡 +鵬 +鵰 +鵲 +鶘 +鶩 +鶯 +鶴 +鷗 +鷲 +鷹 +鷺 +鸚 +鸞 +鸟 +鸠 +鸡 +鸢 +鸣 +鸥 +鸦 +鸨 +鸪 +鸭 +鸯 +鸳 +鸵 +鸽 +鸾 +鸿 +鹂 +鹃 +鹄 +鹅 +鹈 +鹉 +鹊 +鹌 +鹏 +鹑 +鹕 +鹘 +鹜 +鹞 +鹤 +鹦 +鹧 +鹫 +鹭 +鹰 +鹳 +鹵 +鹹 +鹼 +鹽 +鹿 +麂 +麋 +麒 +麓 +麗 +麝 +麟 +麥 +麦 +麩 +麴 +麵 +麸 +麺 +麻 +麼 +麽 +麾 +黃 +黄 +黍 +黎 +黏 +黑 +黒 +黔 +默 +黛 +黜 +黝 +點 +黠 +黨 +黯 +黴 +鼋 +鼎 +鼐 +鼓 +鼠 +鼬 +鼹 +鼻 +鼾 +齁 +齊 +齋 +齐 +齒 +齡 +齢 +齣 +齦 +齿 +龄 +龅 +龈 +龊 +龋 +龌 +龍 +龐 +龔 +龕 +龙 +龚 +龛 +龜 +龟 +︰ +︱ +︶ +︿ +﹁ +﹂ +﹍ +﹏ +﹐ +﹑ +﹒ +﹔ +﹕ +﹖ +﹗ +﹙ +﹚ +﹝ +﹞ +﹡ +﹣ +! +" +# +$ +% +& +' +( +) +* ++ +, +- +. +/ +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +: +; +< += +> +? +@ +[ +\ +] +^ +_ +` +a +b +c +d +e +f +g +h +i +j +k +l +m +n +o +p +q +r +s +t +u +v +w +x +y +z +{ +| +} +~ +。 +「 +」 +、 +・ +ッ +ー +イ +ク +シ +ス +ト +ノ +フ +ラ +ル +ン +゙ +゚ + ̄ +¥ +👍 +🔥 +😂 +😎 +... 
+yam +10 +2017 +12 +11 +2016 +20 +30 +15 +06 +lofter +##s +2015 +by +16 +14 +18 +13 +24 +17 +2014 +21 +##0 +22 +19 +25 +23 +com +100 +00 +05 +2013 +##a +03 +09 +08 +28 +##2 +50 +01 +04 +##1 +27 +02 +2012 +##3 +26 +##e +07 +##8 +##5 +##6 +##4 +##9 +##7 +29 +2011 +40 +##t +2010 +##o +##d +##i +2009 +##n +app +www +the +##m +31 +##c +##l +##y +##r +##g +2008 +60 +http +200 +qq +##p +80 +##f +google +pixnet +90 +cookies +tripadvisor +500 +##er +##k +35 +##h +facebook +2007 +2000 +70 +##b +of +##x +##u +45 +300 +iphone +32 +1000 +2006 +48 +ip +36 +in +38 +3d +##w +##ing +55 +ctrip +##on +##v +33 +##の +to +34 +400 +id +2005 +it +37 +windows +llc +top +99 +42 +39 +000 +led +at +##an +41 +51 +52 +46 +49 +43 +53 +44 +##z +android +58 +and +59 +2004 +56 +vr +##か +5000 +2003 +47 +blogthis +twitter +54 +##le +150 +ok +2018 +57 +75 +cn +no +ios +##in +##mm +##00 +800 +on +te +3000 +65 +2001 +360 +95 +ig +lv +120 +##ng +##を +##us +##に +pc +てす +── +600 +##te +85 +2002 +88 +##ed +html +ncc +wifi +email +64 +blog +is +##10 +##て +mail +online +##al +dvd +##ic +studio +##は +##℃ +##ia +##と +line +vip +72 +##q +98 +##ce +##en +for +##is +##ra +##es +##j +usb +net +cp +1999 +asia +4g +##cm +diy +new +3c +##お +ta +66 +language +vs +apple +tw +86 +web +##ne +ipad +62 +you +##re +101 +68 +##tion +ps +de +bt +pony +atm +##2017 +1998 +67 +##ch +ceo +##or +go +##na +av +pro +cafe +96 +pinterest +97 +63 +pixstyleme3c +##ta +more +said +##2016 +1997 +mp3 +700 +##ll +nba +jun +##20 +92 +tv +1995 +pm +61 +76 +nbsp +250 +##ie +linux +##ma +cd +110 +hd +##17 +78 +##ion +77 +6000 +am +##th +##st +94 +##se +##et +69 +180 +gdp +my +105 +81 +abc +89 +flash +79 +one +93 +1990 +1996 +##ck +gps +##も +##ly +web885 +106 +2020 +91 +##ge +4000 +1500 +xd +boss +isbn +1994 +org +##ry +me +love +##11 +0fork +73 +##12 +3g +##ter +##ar +71 +82 +##la +hotel +130 +1970 +pk +83 +87 +140 +ie +##os +##30 +##el +74 +##50 +seo +cpu +##ml +p2p +84 +may +##る +sun +tue +internet +cc +posted +youtube +##at +##ン +##man +ii +##ル +##15 +abs +nt +pdf +yahoo +ago +1980 +##it +news +mac +104 +##てす +##me +##り +java +1992 +spa +##de +##nt +hk +all +plus +la +1993 +##mb +##16 +##ve +west +##da +160 +air +##い +##ps +から +##to +1989 +logo +htc +php +https +fi +momo +##son +sat +##ke +##80 +ebd +suv +wi +day +apk +##88 +##um +mv +galaxy +wiki +or +brake +##ス +1200 +する +this +1991 +mon +##こ +❤2017 +po +##ない +javascript +life +home +june +##ss +system +900 +##ー +##0 +pp +1988 +world +fb +4k +br +##as +ic +ai +leonardo +safari +##60 +live +free +xx +wed +win7 +kiehl +##co +lg +o2o +##go +us +235 +1949 +mm +しい +vfm +kanye +##90 +##2015 +##id +jr +##ey +123 +rss +##sa +##ro +##am +##no +thu +fri +350 +##sh +##ki +103 +comments +name +##のて +##pe +##ine +max +1987 +8000 +uber +##mi +##ton +wordpress +office +1986 +1985 +##ment +107 +bd +win10 +##ld +##li +gmail +bb +dior +##rs +##ri +##rd +##ます +up +cad +##® +dr +して +read +##21 +をお +##io +##99 +url +1984 +pvc +paypal +show +policy +##40 +##ty +##18 +with +##★ +##01 +txt +102 +##ba +dna +from +post +mini +ar +taiwan +john +##ga +privacy +agoda +##13 +##ny +word +##24 +##22 +##by +##ur +##hz +1982 +##ang +265 +cookie +netscape +108 +##ka +##~ +##ad +house +share +note +ibm +code +hello +nike +sim +survey +##016 +1979 +1950 +wikia +##32 +##017 +5g +cbc +##tor +##kg +1983 +##rt +##14 +campaign +store +2500 +os +##ct +##ts +##° +170 +api +##ns +365 +excel +##な +##ao +##ら +##し +~~ +##nd +university +163 +には +518 +##70 +##ya +##il +##25 +pierre +ipo +0020 +897 +##23 +hotels +##ian +のお +125 +years +6606 +##ers +##26 +high 
+##day +time +##ay +bug +##line +##く +##す +##be +xp +talk2yam +yamservice +10000 +coco +##dy +sony +##ies +1978 +microsoft +david +people +##ha +1960 +instagram +intel +その +##ot +iso +1981 +##va +115 +##mo +##land +xxx +man +co +ltxsw +##ation +baby +220 +##pa +##ol +1945 +7000 +tag +450 +##ue +msn +##31 +oppo +##ト +##ca +control +##om +st +chrome +##ure +##ん +be +##き +lol +##19 +した +##bo +240 +lady +##100 +##way +##から +4600 +##ko +##do +##un +4s +corporation +168 +##ni +herme +##28 +cp +978 +##up +##06 +ui +##ds +ppt +admin +three +します +bbc +re +128 +##48 +ca +##015 +##35 +hp +##ee +tpp +##た +##ive +×× +root +##cc +##ました +##ble +##ity +adobe +park +114 +et +oled +city +##ex +##ler +##ap +china +##book +20000 +view +##ice +global +##km +your +hong +##mg +out +##ms +ng +ebay +##29 +menu +ubuntu +##cy +rom +##view +open +ktv +do +server +##lo +if +english +##ね +##5 +##oo +1600 +##02 +step1 +kong +club +135 +july +inc +1976 +mr +hi +##net +touch +##ls +##ii +michael +lcd +##05 +##33 +phone +james +step2 +1300 +ios9 +##box +dc +##2 +##ley +samsung +111 +280 +pokemon +css +##ent +##les +いいえ +##1 +s8 +atom +play +bmw +##said +sa +etf +ctrl +♥yoyo♥ +##55 +2025 +##2014 +##66 +adidas +amazon +1958 +##ber +##ner +visa +##77 +##der +1800 +connectivity +##hi +firefox +109 +118 +hr +so +style +mark +pop +ol +skip +1975 +as +##27 +##ir +##61 +190 +mba +##う +##ai +le +##ver +1900 +cafe2017 +lte +super +113 +129 +##ron +amd +like +##☆ +are +##ster +we +##sk +paul +data +international +##ft +longchamp +ssd +good +##ート +##ti +reply +##my +↓↓↓ +apr +star +##ker +source +136 +js +112 +get +force +photo +##one +126 +##2013 +##ow +link +bbs +1972 +goods +##lin +python +119 +##ip +game +##ics +##ません +blue +##● +520 +##45 +page +itunes +##03 +1955 +260 +1968 +gt +gif +618 +##ff +##47 +group +くたさい +about +bar +ganji +##nce +music +lee +not +1977 +1971 +1973 +##per +an +faq +comment +##って +days +##ock +116 +##bs +1974 +1969 +v1 +player +1956 +xbox +sql +fm +f1 +139 +##ah +210 +##lv +##mp +##000 +melody +1957 +##3 +550 +17life +199 +1966 +xml +market +##au +##71 +999 +##04 +what +gl +##95 +##age +tips +##68 +book +##ting +mysql +can +1959 +230 +##ung +wonderland +watch +10℃ +##ction +9000 +mar +mobile +1946 +1962 +article +##db +part +▲top +party +って +1967 +1964 +1948 +##07 +##ore +##op +この +dj +##78 +##38 +010 +main +225 +1965 +##ong +art +320 +ad +134 +020 +##73 +117 +pm2 +japan +228 +##08 +ts +1963 +##ica +der +sm +##36 +2019 +##wa +ct +##7 +##や +##64 +1937 +homemesh +search +##85 +##れは +##tv +##di +macbook +##9 +##くたさい +service +##♥ +type +った +750 +##ier +##si +##75 +##います +##ok +best +##ット +goris +lock +##った +cf +3m +big +##ut +ftp +carol +##vi +10 +1961 +happy +sd +##ac +122 +anti +pe +cnn +iii +1920 +138 +##ラ +1940 +esp +jan +tags +##98 +##51 +august +vol +##86 +154 +##™ +##fs +##れ +##sion +design +ac +##ム +press +jordan +ppp +that +key +check +##6 +##tt +##㎡ +1080p +##lt +power +##42 +1952 +##bc +vivi +##ック +he +133 +121 +jpg +##rry +201 +175 +3500 +1947 +nb +##ted +##rn +しています +1954 +usd +##t00 +master +##ンク +001 +model +##58 +al +##09 +1953 +##34 +ram +goo +ても +##ui +127 +1930 +red +##ary +rpg +item +##pm +##41 +270 +##za +project +##2012 +hot +td +blogabstract +##ger +##62 +650 +##44 +gr2 +##します +##m +black +electronic +nfc +year +asus +また +html5 +cindy +##hd +m3 +132 +esc +##od +booking +##53 +fed +tvb +##81 +##ina +mit +165 +##いる +chan +192 +distribution +next +になる +peter +bios +steam +cm +1941 +にも +pk10 +##ix +##65 +##91 +dec +nasa +##ana +icecat +00z +b1 +will +##46 +li +se +##ji +##み +##ard +oct 
+##ain +jp +##ze +##bi +cio +##56 +smart +h5 +##39 +##port +curve +vpn +##nm +##dia +utc +##あり +12345678910 +##52 +rmvb +chanel +a4 +miss +##and +##im +media +who +##63 +she +girl +5s +124 +vera +##して +class +vivo +king +##フ +##ei +national +ab +1951 +5cm +888 +145 +ipod +ap +1100 +5mm +211 +ms +2756 +##69 +mp4 +msci +##po +##89 +131 +mg +index +380 +##bit +##out +##zz +##97 +##67 +158 +apec +##8 +photoshop +opec +¥799 +ては +##96 +##tes +##ast +2g +○○ +##ール +¥2899 +##ling +##よ +##ory +1938 +##ical +kitty +content +##43 +step3 +##cn +win8 +155 +vc +1400 +iphone7 +robert +##した +tcl +137 +beauty +##87 +en +dollars +##ys +##oc +step +pay +yy +a1 +##2011 +##lly +##ks +##♪ +1939 +188 +download +1944 +sep +exe +ph +います +school +gb +center +pr +street +##board +uv +##37 +##lan +winrar +##que +##ua +##com +1942 +1936 +480 +gpu +##4 +ettoday +fu +tom +##54 +##ren +##via +149 +##72 +b2b +144 +##79 +##tch +rose +arm +mb +##49 +##ial +##nn +nvidia +step4 +mvp +00㎡ +york +156 +##イ +how +cpi +591 +2765 +gov +kg +joe +##xx +mandy +pa +##ser +copyright +fashion +1935 +don +##け +ecu +##ist +##art +erp +wap +have +##lm +talk +##ek +##ning +##if +ch +##ite +video +1943 +cs +san +iot +look +##84 +##2010 +##ku +october +##ux +trump +##hs +##ide +box +141 +first +##ins +april +##ight +##83 +185 +angel +protected +aa +151 +162 +x1 +m2 +##fe +##× +##ho +size +143 +min +ofo +fun +gomaji +ex +hdmi +food +dns +march +chris +kevin +##のか +##lla +##pp +##ec +ag +ems +6s +720p +##rm +##ham +off +##92 +asp +team +fandom +ed +299 +▌♥ +##ell +info +されています +##82 +sina +4066 +161 +##able +##ctor +330 +399 +315 +dll +rights +ltd +idc +jul +3kg +1927 +142 +ma +surface +##76 +##ク +~~~ +304 +mall +eps +146 +green +##59 +map +space +donald +v2 +sodu +##light +1931 +148 +1700 +まて +310 +reserved +htm +##han +##57 +2d +178 +mod +##ise +##tions +152 +ti +##shi +doc +1933 +icp +055 +wang +##ram +shopping +aug +##pi +##well +now +wam +b2 +からお +##hu +236 +1928 +##gb +266 +f2 +##93 +153 +mix +##ef +##uan +bwl +##plus +##res +core +##ess +tea +5℃ +hktvmall +nhk +##ate +list +##ese +301 +feb +4m +inn +ての +nov +159 +12345 +daniel +##ci +pass +##bet +##nk +coffee +202 +ssl +airbnb +##ute +fbi +woshipm +skype +ea +cg +sp +##fc +##www +yes +edge +alt +007 +##94 +fpga +##ght +##gs +iso9001 +さい +##ile +##wood +##uo +image +lin +icon +american +##em +1932 +set +says +##king +##tive +blogger +##74 +なと +256 +147 +##ox +##zy +##red +##ium +##lf +nokia +claire +##リ +##ding +november +lohas +##500 +##tic +##マ +##cs +##ある +##che +##ire +##gy +##ult +db +january +win +##カ +166 +road +ptt +##ま +##つ +198 +##fa +##mer +anna +pchome +はい +udn +ef +420 +##time +##tte +2030 +##ア +g20 +white +かかります +1929 +308 +garden +eleven +di +##おります +chen +309b +777 +172 +young +cosplay +ちてない +4500 +bat +##123 +##tra +##ては +kindle +npc +steve +etc +##ern +##| +call +xperia +ces +travel +sk +s7 +##ous +1934 +##int +みいたたけます +183 +edu +file +cho +qr +##car +##our +186 +##ant +##d +eric +1914 +rends +##jo +##する +mastercard +##2000 +kb +##min +290 +##ino +vista +##ris +##ud +jack +2400 +##set +169 +pos +1912 +##her +##ou +taipei +しく +205 +beta +##ませんか +232 +##fi +express +255 +body +##ill +aphojoy +user +december +meiki +##ick +tweet +richard +##av +##ᆫ +iphone6 +##dd +ちてすか +views +##mark +321 +pd +##00 +times +##▲ +level +##ash +10g +point +5l +##ome +208 +koreanmall +##ak +george +q2 +206 +wma +tcp +##200 +スタッフ +full +mlb +##lle +##watch +tm +run +179 +911 +smith +business +##und +1919 +color +##tal +222 +171 +##less +moon +4399 +##rl +update +pcb +shop +499 +157 +little +なし 
+end +##mhz +van +dsp +easy +660 +##house +##key +history +##o +oh +##001 +##hy +##web +oem +let +was +##2009 +##gg +review +##wan +182 +##°c +203 +uc +title +##val +united +233 +2021 +##ons +doi +trivago +overdope +sbs +##ance +##ち +grand +special +573032185 +imf +216 +wx17house +##so +##ーム +audi +##he +london +william +##rp +##ake +science +beach +cfa +amp +ps4 +880 +##800 +##link +##hp +crm +ferragamo +bell +make +##eng +195 +under +zh +photos +2300 +##style +##ント +via +176 +da +##gi +company +i7 +##ray +thomas +370 +ufo +i5 +##max +plc +ben +back +research +8g +173 +mike +##pc +##ッフ +september +189 +##ace +vps +february +167 +pantos +wp +lisa +1921 +★★ +jquery +night +long +offer +##berg +##news +1911 +##いて +ray +fks +wto +せます +over +164 +340 +##all +##rus +1924 +##888 +##works +blogtitle +loftpermalink +##→ +187 +martin +test +ling +km +##め +15000 +fda +v3 +##ja +##ロ +wedding +かある +outlet +family +##ea +をこ +##top +story +##ness +salvatore +##lu +204 +swift +215 +room +している +oracle +##ul +1925 +sam +b2c +week +pi +rock +##のは +##a +##けと +##ean +##300 +##gle +cctv +after +chinese +##back +powered +x2 +##tan +1918 +##nes +##イン +canon +only +181 +##zi +##las +say +##oe +184 +##sd +221 +##bot +##world +##zo +sky +made +top100 +just +1926 +pmi +802 +234 +gap +##vr +177 +les +174 +▲topoct +ball +vogue +vi +ing +ofweek +cos +##list +##ort +▲topmay +##なら +##lon +として +last +##tc +##of +##bus +##gen +real +eva +##コ +a3 +nas +##lie +##ria +##coin +##bt +▲topapr +his +212 +cat +nata +vive +health +⋯⋯ +drive +sir +▲topmar +du +cup +##カー +##ook +##よう +##sy +alex +msg +tour +しました +3ce +##word +193 +ebooks +r8 +block +318 +##より +2200 +nice +pvp +207 +months +1905 +rewards +##ther +1917 +0800 +##xi +##チ +##sc +micro +850 +gg +blogfp +op +1922 +daily +m1 +264 +true +##bb +ml +##tar +##のお +##ky +anthony +196 +253 +##yo +state +218 +##ara +##aa +##rc +##tz +##ston +より +gear +##eo +##ade +ge +see +1923 +##win +##ura +ss +heart +##den +##ita +down +##sm +el +png +2100 +610 +rakuten +whatsapp +bay +dream +add +##use +680 +311 +pad +gucci +mpv +##ode +##fo +island +▲topjun +##▼ +223 +jason +214 +chicago +##❤ +しの +##hone +io +##れる +##ことか +sogo +be2 +##ology +990 +cloud +vcd +##con +2~3 +##ford +##joy +##kb +##こさいます +##rade +but +##ach +docker +##ful +rfid +ul +##ase +hit +ford +##star +580 +##○ +11 +a2 +sdk +reading +edited +##are +cmos +##mc +238 +siri +light +##ella +##ため +bloomberg +##read +pizza +##ison +jimmy +##vm +college +node +journal +ba +18k +##play +245 +##cer +20 +magic +##yu +191 +jump +288 +tt +##ings +asr +##lia +3200 +step5 +network +##cd +mc +いします +1234 +pixstyleme +273 +##600 +2800 +money +★★★★★ +1280 +12 +430 +bl +みの +act +##tus +tokyo +##rial +##life +emba +##ae +saas +tcs +##rk +##wang +summer +##sp +ko +##ving +390 +premium +##その +netflix +##ヒ +uk +mt +##lton +right +frank +two +209 +える +##ple +##cal +021 +##んな +##sen +##ville +hold +nexus +dd +##ius +てお +##mah +##なく +tila +zero +820 +ce +##tin +resort +##ws +charles +old +p10 +5d +report +##360 +##ru +##には +bus +vans +lt +##est +pv +##レ +links +rebecca +##ツ +##dm +azure +##365 +きな +limited +bit +4gb +##mon +1910 +moto +##eam +213 +1913 +var +eos +なとの +226 +blogspot +された +699 +e3 +dos +dm +fc +##ments +##ik +##kw +boy +##bin +##ata +960 +er +##せ +219 +##vin +##tu +##ula +194 +##∥ +station +##ろ +##ature +835 +files +zara +hdr +top10 +nature +950 +magazine +s6 +marriott +##シ +avira +case +##っと +tab +##ran +tony +##home +oculus +im +##ral +jean +saint +cry +307 +rosie +##force +##ini +ice +##bert +のある +##nder +##mber +pet +2600 +##◆ +plurk 
+▲topdec +##sis +00kg +▲topnov +720 +##ence +tim +##ω +##nc +##ても +##name +log +ips +great +ikea +malaysia +unix +##イト +3600 +##ncy +##nie +12000 +akb48 +##ye +##oid +404 +##chi +##いた +oa +xuehai +##1000 +##orm +##rf +275 +さん +##ware +##リー +980 +ho +##pro +text +##era +560 +bob +227 +##ub +##2008 +8891 +scp +avi +##zen +2022 +mi +wu +museum +qvod +apache +lake +jcb +▲topaug +★★★ +ni +##hr +hill +302 +ne +weibo +490 +ruby +##ーシ +##ヶ +##row +4d +▲topjul +iv +##ish +github +306 +mate +312 +##スト +##lot +##ane +andrew +のハイト +##tina +t1 +rf +ed2k +##vel +##900 +way +final +りの +ns +5a +705 +197 +##メ +sweet +bytes +##ene +▲topjan +231 +##cker +##2007 +##px +100g +topapp +229 +helpapp +rs +low +14k +g4g +care +630 +ldquo +あり +##fork +leave +rm +edition +##gan +##zon +##qq +▲topsep +##google +##ism +gold +224 +explorer +##zer +toyota +category +select +visual +##labels +restaurant +##md +posts +s1 +##ico +もっと +angelababy +123456 +217 +sports +s3 +mbc +1915 +してくたさい +shell +x86 +candy +##new +kbs +face +xl +470 +##here +4a +swissinfo +v8 +▲topfeb +dram +##ual +##vice +3a +##wer +sport +q1 +ios10 +public +int +card +##c +ep +au +rt +##れた +1080 +bill +##mll +kim +30 +460 +wan +##uk +##ミ +x3 +298 +0t +scott +##ming +239 +e5 +##3d +h7n9 +worldcat +brown +##あります +##vo +##led +##580 +##ax +249 +410 +##ert +paris +##~6 +polo +925 +##lr +599 +##ナ +capital +##hing +bank +cv +1g +##chat +##s +##たい +adc +##ule +2m +##e +digital +hotmail +268 +##pad +870 +bbq +quot +##ring +before +wali +##まて +mcu +2k +2b +という +costco +316 +north +333 +switch +##city +##p +philips +##mann +management +panasonic +##cl +##vd +##ping +##rge +alice +##lk +##ましょう +css3 +##ney +vision +alpha +##ular +##400 +##tter +lz +にお +##ありません +mode +gre +1916 +pci +##tm +237 +1~2 +##yan +##そ +について +##let +##キ +work +war +coach +ah +mary +##ᅵ +huang +##pt +a8 +pt +follow +##berry +1895 +##ew +a5 +ghost +##ション +##wn +##og +south +##code +girls +##rid +action +villa +git +r11 +table +games +##cket +error +##anonymoussaid +##ag +here +##ame +##gc +qa +##■ +##lis +gmp +##gin +vmalife +##cher +yu +wedding +##tis +demo +dragon +530 +soho +social +bye +##rant +river +orz +acer +325 +##↑ +##ース +##ats +261 +del +##ven +440 +ups +##ように +##ター +305 +value +macd +yougou +##dn +661 +##ano +ll +##urt +##rent +continue +script +##wen +##ect +paper +263 +319 +shift +##chel +##フト +##cat +258 +x5 +fox +243 +##さん +car +aaa +##blog +loading +##yn +##tp +kuso +799 +si +sns +イカせるテンマ +ヒンクテンマ3 +rmb +vdc +forest +central +prime +help +ultra +##rmb +##ような +241 +square +688 +##しい +のないフロクに +##field +##reen +##ors +##ju +c1 +start +510 +##air +##map +cdn +##wo +cba +stephen +m8 +100km +##get +opera +##base +##ood +vsa +com™ +##aw +##ail +251 +なのて +count +t2 +##ᅡ +##een +2700 +hop +##gp +vsc +tree +##eg +##ose +816 +285 +##ories +##shop +alphago +v4 +1909 +simon +##ᆼ +fluke62max +zip +スホンサー +##sta +louis +cr +bas +##~10 +bc +##yer +hadoop +##ube +##wi +1906 +0755 +hola +##low +place +centre +5v +d3 +##fer +252 +##750 +##media +281 +540 +0l +exchange +262 +series +##ハー +##san +eb +##bank +##k +q3 +##nge +##mail +take +##lp +259 +1888 +client +east +cache +event +vincent +##ールを +きを +##nse +sui +855 +adchoice +##и +##stry +##なたの +246 +##zone +ga +apps +sea +##ab +248 +cisco +##タ +##rner +kymco +##care +dha +##pu +##yi +minkoff +royal +p1 +への +annie +269 +collection +kpi +playstation +257 +になります +866 +bh +##bar +queen +505 +radio +1904 +andy +armani +##xy +manager +iherb +##ery +##share +spring +raid +johnson +1908 +##ob +volvo +hall +##ball +v6 +our +taylor +##hk +bi +242 +##cp 
+kate +bo +water +technology +##rie +サイトは +277 +##ona +##sl +hpv +303 +gtx +hip +rdquo +jayz +stone +##lex +##rum +namespace +##やり +620 +##ale +##atic +des +##erson +##ql +##ves +##type +enter +##この +##てきます +d2 +##168 +##mix +##bian +との +a9 +jj +ky +##lc +access +movie +##hc +リストに +tower +##ration +##mit +ます +##nch +ua +tel +prefix +##o2 +1907 +##point +1901 +ott +~10 +##http +##ury +baidu +##ink +member +##logy +bigbang +nownews +##js +##shot +##tb +##こと +247 +eba +##tics +##lus +ける +v5 +spark +##ama +there +##ions +god +##lls +##down +hiv +##ress +burberry +day2 +##kv +◆◆ +jeff +related +film +edit +joseph +283 +##ark +cx +32gb +order +g9 +30000 +##ans +##tty +s5 +##bee +かあります +thread +xr +buy +sh +005 +land +spotify +mx +##ari +276 +##verse +×email +sf +why +##ことて +244 +7headlines +nego +sunny +dom +exo +401 +666 +positioning +fit +rgb +##tton +278 +kiss +alexa +adam +lp +みリストを +##g +mp +##ties +##llow +amy +##du +np +002 +institute +271 +##rth +##lar +2345 +590 +##des +sidebar +15 +imax +site +##cky +##kit +##ime +##009 +season +323 +##fun +##ンター +##ひ +gogoro +a7 +pu +lily +fire +twd600 +##ッセーシを +いて +##vis +30ml +##cture +##をお +information +##オ +close +friday +##くれる +yi +nick +てすか +##tta +##tel +6500 +##lock +cbd +economy +254 +かお +267 +tinker +double +375 +8gb +voice +##app +oops +channel +today +985 +##right +raw +xyz +##+ +jim +edm +##cent +7500 +supreme +814 +ds +##its +##asia +dropbox +##てすか +##tti +books +272 +100ml +##tle +##ller +##ken +##more +##boy +sex +309 +##dom +t3 +##ider +##なります +##unch +1903 +810 +feel +5500 +##かった +##put +により +s2 +mo +##gh +men +ka +amoled +div +##tr +##n1 +port +howard +##tags +ken +dnf +##nus +adsense +##а +ide +##へ +buff +thunder +##town +##ique +has +##body +auto +pin +##erry +tee +てした +295 +number +##the +##013 +object +psp +cool +udnbkk +16gb +##mic +miui +##tro +most +r2 +##alk +##nity +1880 +±0 +##いました +428 +s4 +law +version +##oa +n1 +sgs +docomo +##tf +##ack +henry +fc2 +##ded +##sco +##014 +##rite +286 +0mm +linkedin +##ada +##now +wii +##ndy +ucbug +##◎ +sputniknews +legalminer +##ika +##xp +2gb +##bu +q10 +oo +b6 +come +##rman +cheese +ming +maker +##gm +nikon +##fig +ppi +kelly +##ります +jchere +てきます +ted +md +003 +fgo +tech +##tto +dan +soc +##gl +##len +hair +earth +640 +521 +img +##pper +##a1 +##てきる +##ロク +acca +##ition +##ference +suite +##ig +outlook +##mond +##cation +398 +##pr +279 +101vip +358 +##999 +282 +64gb +3800 +345 +airport +##over +284 +##おり +jones +##ith +lab +##su +##いるのて +co2 +town +piece +##llo +no1 +vmware +24h +##qi +focus +reader +##admin +##ora +tb +false +##log +1898 +know +lan +838 +##ces +f4 +##ume +motel +stop +##oper +na +flickr +netcomponents +##af +##─ +pose +williams +local +##ound +##cg +##site +##iko +いお +274 +5m +gsm +con +##ath +1902 +friends +##hip +cell +317 +##rey +780 +cream +##cks +012 +##dp +facebooktwitterpinterestgoogle +sso +324 +shtml +song +swiss +##mw +##キンク +lumia +xdd +string +tiffany +522 +marc +られた +insee +russell +sc +dell +##ations +ok +camera +289 +##vs +##flow +##late +classic +287 +##nter +stay +g1 +mtv +512 +##ever +##lab +##nger +qe +sata +ryan +d1 +50ml +cms +##cing +su +292 +3300 +editor +296 +##nap +security +sunday +association +##ens +##700 +##bra +acg +##かり +sofascore +とは +mkv +##ign +jonathan +gary +build +labels +##oto +tesla +moba +qi +gohappy +general +ajax +1024 +##かる +サイト +society +##test +##urs +wps +fedora +##ich +mozilla +328 +##480 +##dr +usa +urn +##lina +##r +grace +##die +##try +##ader +1250 +##なり +elle +570 +##chen +##ᆯ +price +##ten +uhz +##ough +eq +##hen 
+states +push +session +balance +wow +506 +##cus +##py +when +##ward +##ep +34e +wong +library +prada +##サイト +##cle +running +##ree +313 +ck +date +q4 +##ctive +##ool +##> +mk +##ira +##163 +388 +die +secret +rq +dota +buffet +は1ヶ +e6 +##ez +pan +368 +ha +##card +##cha +2a +##さ +alan +day3 +eye +f3 +##end +france +keep +adi +rna +tvbs +##ala +solo +nova +##え +##tail +##ょう +support +##ries +##なる +##ved +base +copy +iis +fps +##ways +hero +hgih +profile +fish +mu +ssh +entertainment +chang +##wd +click +cake +##ond +pre +##tom +kic +pixel +##ov +##fl +product +6a +##pd +dear +##gate +es +yumi +audio +##² +##sky +echo +bin +where +##ture +329 +##ape +find +sap +isis +##なと +nand +##101 +##load +##ream +band +a6 +525 +never +##post +festival +50cm +##we +555 +guide +314 +zenfone +##ike +335 +gd +forum +jessica +strong +alexander +##ould +software +allen +##ious +program +360° +else +lohasthree +##gar +することかてきます +please +##れます +rc +##ggle +##ric +bim +50000 +##own +eclipse +355 +brian +3ds +##side +061 +361 +##other +##ける +##tech +##ator +485 +engine +##ged +##t +plaza +##fit +cia +ngo +westbrook +shi +tbs +50mm +##みませんか +sci +291 +reuters +##ily +contextlink +##hn +af +##cil +bridge +very +##cel +1890 +cambridge +##ize +15g +##aid +##data +790 +frm +##head +award +butler +##sun +meta +##mar +america +ps3 +puma +pmid +##すか +lc +670 +kitchen +##lic +オーフン5 +きなしソフトサーヒス +そして +day1 +future +★★★★ +##text +##page +##rris +pm1 +##ket +fans +##っています +1001 +christian +bot +kids +trackback +##hai +c3 +display +##hl +n2 +1896 +idea +さんも +##sent +airmail +##ug +##men +pwm +けます +028 +##lution +369 +852 +awards +schemas +354 +asics +wikipedia +font +##tional +##vy +c2 +293 +##れている +##dget +##ein +っている +contact +pepper +スキル +339 +##~5 +294 +##uel +##ument +730 +##hang +みてす +q5 +##sue +rain +##ndi +wei +swatch +##cept +わせ +331 +popular +##ste +##tag +p2 +501 +trc +1899 +##west +##live +justin +honda +ping +messenger +##rap +v9 +543 +##とは +unity +appqq +はすへて +025 +leo +##tone +##テ +##ass +uniqlo +##010 +502 +her +jane +memory +moneydj +##tical +human +12306 +していると +##m2 +coc +miacare +##mn +tmt +##core +vim +kk +##may +fan +target +use +too +338 +435 +2050 +867 +737 +fast +##2c +services +##ope +omega +energy +##わ +pinkoi +1a +##なから +##rain +jackson +##ement +##シャンルの +374 +366 +そんな +p9 +rd +##ᆨ +1111 +##tier +##vic +zone +##│ +385 +690 +dl +isofix +cpa +m4 +322 +kimi +めて +davis +##lay +lulu +##uck +050 +weeks +qs +##hop +920 +##n +ae +##ear +~5 +eia +405 +##fly +korea +jpeg +boost +##ship +small +##リア +1860 +eur +297 +425 +valley +##iel +simple +##ude +rn +k2 +##ena +されます +non +patrick +しているから +##ナー +feed +5757 +30g +process +well +qqmei +##thing +they +aws +lu +pink +##ters +##kin +または +board +##vertisement +wine +##ien +unicode +##dge +r1 +359 +##tant +いを +##twitter +##3c +cool1 +される +##れて +##l +isp +##012 +standard +45㎡2 +402 +##150 +matt +##fu +326 +##iner +googlemsn +pixnetfacebookyahoo +##ラン +x7 +886 +##uce +メーカー +sao +##ev +##きました +##file +9678 +403 +xddd +shirt +6l +##rio +##hat +3mm +givenchy +ya +bang +##lio +monday +crystal +ロクイン +##abc +336 +head +890 +ubuntuforumwikilinuxpastechat +##vc +##~20 +##rity +cnc +7866 +ipv6 +null +1897 +##ost +yang +imsean +tiger +##fet +##ンス +352 +##= +dji +327 +ji +maria +##come +##んて +foundation +3100 +##beth +##なった +1m +601 +active +##aft +##don +3p +sr +349 +emma +##khz +living +415 +353 +1889 +341 +709 +457 +sas +x6 +##face +pptv +x4 +##mate +han +sophie +##jing +337 +fifa +##mand +other +sale +inwedding +##gn +てきちゃいます +##mmy +##pmlast +bad +nana +nbc +してみてくたさいね 
+なとはお +##wu +##かあります +##あ +note7 +single +##340 +せからこ +してくたさい♪この +しにはとんとんワークケートを +するとあなたにもっとマッチした +ならワークケートへ +もみつかっちゃうかも +ワークケートの +##bel +window +##dio +##ht +union +age +382 +14 +##ivity +##y +コメント +domain +neo +##isa +##lter +5k +f5 +steven +##cts +powerpoint +tft +self +g2 +ft +##テル +zol +##act +mwc +381 +343 +もう +nbapop +408 +てある +eds +ace +##room +previous +author +tomtom +il +##ets +hu +financial +☆☆☆ +っています +bp +5t +chi +1gb +##hg +fairmont +cross +008 +gay +h2 +function +##けて +356 +also +1b +625 +##ータ +##raph +1894 +3~5 +##ils +i3 +334 +avenue +##host +による +##bon +##tsu +message +navigation +50g +fintech +h6 +##ことを +8cm +##ject +##vas +##firm +credit +##wf +xxxx +form +##nor +##space +huawei +plan +json +sbl +##dc +machine +921 +392 +wish +##120 +##sol +windows7 +edward +##ために +development +washington +##nsis +lo +818 +##sio +##ym +##bor +planet +##~8 +##wt +ieee +gpa +##めて +camp +ann +gm +##tw +##oka +connect +##rss +##work +##atus +wall +chicken +soul +2mm +##times +fa +##ather +##cord +009 +##eep +hitachi +gui +harry +##pan +e1 +disney +##press +##ーション +wind +386 +frigidaire +##tl +liu +hsu +332 +basic +von +ev +いた +てきる +スホンサーサイト +learning +##ull +expedia +archives +change +##wei +santa +cut +ins +6gb +turbo +brand +cf1 +508 +004 +return +747 +##rip +h1 +##nis +##をこ +128gb +##にお +3t +application +しており +emc +rx +##oon +384 +quick +412 +15058 +wilson +wing +chapter +##bug +beyond +##cms +##dar +##oh +zoom +e2 +trip +sb +##nba +rcep +342 +aspx +ci +080 +gc +gnu +める +##count +advanced +dance +dv +##url +##ging +367 +8591 +am09 +shadow +battle +346 +##i +##cia +##という +emily +##のてす +##tation +host +ff +techorz +sars +##mini +##mporary +##ering +nc +4200 +798 +##next +cma +##mbps +##gas +##ift +##dot +##ィ +455 +##~17 +amana +##りの +426 +##ros +ir +00㎡1 +##eet +##ible +##↓ +710 +ˋ▽ˊ +##aka +dcs +iq +##v +l1 +##lor +maggie +##011 +##iu +588 +##~1 +830 +##gt +1tb +articles +create +##burg +##iki +database +fantasy +##rex +##cam +dlc +dean +##you +hard +path +gaming +victoria +maps +cb +##lee +##itor +overchicstoretvhome +systems +##xt +416 +p3 +sarah +760 +##nan +407 +486 +x9 +install +second +626 +##ann +##ph +##rcle +##nic +860 +##nar +ec +##とう +768 +metro +chocolate +##rian +~4 +##table +##しています +skin +##sn +395 +mountain +##0mm +inparadise +6m +7x24 +ib +4800 +##jia +eeworld +creative +g5 +g3 +357 +parker +ecfa +village +からの +18000 +sylvia +サーヒス +hbl +##ques +##onsored +##x2 +##きます +##v4 +##tein +ie6 +383 +##stack +389 +ver +##ads +##baby +sound +bbe +##110 +##lone +##uid +ads +022 +gundam +351 +thinkpad +006 +scrum +match +##ave +mems +##470 +##oy +##なりました +##talk +glass +lamigo +span +##eme +job +##a5 +jay +wade +kde +498 +##lace +ocean +tvg +##covery +##r3 +##ners +##rea +junior +think +##aine +cover +##ision +##sia +↓↓ +##bow +msi +413 +458 +406 +##love +711 +801 +soft +z2 +##pl +456 +1840 +mobil +mind +##uy +427 +nginx +##oi +めた +##rr +6221 +##mple +##sson +##ーシてす +371 +##nts +91tv +comhd +crv3000 +##uard +1868 +397 +deep +lost +field +gallery +##bia +rate +spf +redis +traction +930 +icloud +011 +なら +fe +jose +372 +##tory +into +sohu +fx +899 +379 +kicstart2 +##hia +すく +##~3 +##sit +ra +24 +##walk +##xure +500g +##pact +pacific +xa +natural +carlo +##250 +##walker +1850 +##can +cto +gigi +516 +##サー +pen +##hoo +ob +matlab +##b +##yy +13913459 +##iti +mango +##bbs +sense +c5 +oxford +##ニア +walker +jennifer +##ola +course +##bre +701 +##pus +##rder +lucky +075 +##ぁ +ivy +なお +##nia +sotheby +side +##ugh +joy +##orage +##ush +##bat +##dt +364 +r9 +##2d +##gio +511 +country +wear 
+##lax +##~7 +##moon +393 +seven +study +411 +348 +lonzo +8k +##ェ +evolution +##イフ +##kk +gs +kd +##レス +arduino +344 +b12 +##lux +arpg +##rdon +cook +##x5 +dark +five +##als +##ida +とても +sign +362 +##ちの +something +20mm +##nda +387 +##posted +fresh +tf +1870 +422 +cam +##mine +##skip +##form +##ssion +education +394 +##tee +dyson +stage +##jie +want +##night +epson +pack +あります +##ppy +テリヘル +##█ +wd +##eh +##rence +left +##lvin +golden +mhz +discovery +##trix +##n2 +loft +##uch +##dra +##sse +speed +~1 +1mdb +sorry +welcome +##urn +wave +gaga +##lmer +teddy +##160 +トラックハック +せよ +611 +##f2016 +378 +rp +##sha +rar +##あなたに +##きた +840 +holiday +##ュー +373 +074 +##vg +##nos +##rail +gartner +gi +6p +##dium +kit +488 +b3 +eco +##ろう +20g +sean +##stone +autocad +nu +##np +f16 +write +029 +m5 +##ias +images +atp +##dk +fsm +504 +1350 +ve +52kb +##xxx +##のに +##cake +414 +unit +lim +ru +1v +##ification +published +angela +16g +analytics +ak +##q +##nel +gmt +##icon +again +##₂ +##bby +ios11 +445 +かこさいます +waze +いてす +##ハ +9985 +##ust +##ティー +framework +##007 +iptv +delete +52sykb +cl +wwdc +027 +30cm +##fw +##ての +1389 +##xon +brandt +##ses +##dragon +tc +vetements +anne +monte +modern +official +##へて +##ere +##nne +##oud +もちろん +50 +etnews +##a2 +##graphy +421 +863 +##ちゃん +444 +##rtex +##てお +l2 +##gma +mount +ccd +たと +archive +morning +tan +ddos +e7 +##ホ +day4 +##ウ +gis +453 +its +495 +factory +bruce +pg +##ito +ってくたさい +guest +cdma +##lling +536 +n3 +しかし +3~4 +mega +eyes +ro +13 +women +dac +church +##jun +singapore +##facebook +6991 +starbucks +##tos +##stin +##shine +zen +##mu +tina +20℃ +1893 +##たけて +503 +465 +request +##gence +qt +##っ +1886 +347 +363 +q7 +##zzi +diary +##tore +409 +##ead +468 +cst +##osa +canada +agent +va +##jiang +##ちは +##ーク +##lam +sg +##nix +##sday +##よって +g6 +##master +bing +##zl +charlie +16 +8mm +nb40 +##ーン +thai +##ルフ +ln284ct +##itz +##2f +bonnie +##food +##lent +originals +##stro +##lts +418 +∟∣ +##bscribe +children +ntd +yesstyle +##かも +hmv +##tment +d5 +2cm +arts +sms +##pn +##я +##いい +topios9 +539 +lifestyle +virtual +##ague +xz +##deo +muji +024 +unt +##nnis +##ᅩ +faq1 +1884 +396 +##ette +fly +64㎡ +はしめまして +441 +curry +##pop +のこ +release +##← +##◆◆ +##cast +073 +ありな +500ml +##ews +5c +##stle +ios7 +##ima +787 +dog +lenovo +##r4 +roger +013 +cbs +vornado +100m +417 +##desk +##クok +##ald +1867 +9595 +2900 +##van +oil +##x +some +break +common +##jy +##lines +g7 +twice +419 +ella +nano +belle +にこ +##mes +##self +##note +jb +##ことかてきます +benz +##との +##ova +451 +save +##wing +##ますのて +kai +りは +##hua +##rect +rainer +##unge +448 +##0m +adsl +##かな +guestname +##uma +##kins +##zu +tokichoi +##price +county +##med +##mus +rmk +391 +address +vm +えて +openload +##group +##hin +##iginal +amg +urban +##oz +jobs +emi +##public +beautiful +##sch +album +##dden +##bell +jerry +works +hostel +miller +##drive +##rmin +##10 +376 +boot +828 +##370 +##fx +##cm~ +1885 +##nome +##ctionary +##oman +##lish +##cr +##hm +433 +##how +432 +francis +xi +c919 +b5 +evernote +##uc +vga +##3000 +coupe +##urg +##cca +##uality +019 +6g +れる +multi +##また +##ett +em +hey +##ani +##tax +##rma +inside +than +740 +leonnhurt +##jin +ict +れた +bird +notes +200mm +くの +##dical +##lli +result +442 +iu +ee +438 +smap +gopro +##last +yin +pure +998 +32g +けた +5kg +##dan +##rame +mama +##oot +bean +marketing +##hur +2l +bella +sync +xuite +##ground +515 +discuz +##getrelax +##ince +##bay +##5s +cj +##イス +gmat +apt +##pass +jing +##rix +c4 +rich +##とても +niusnews +##ello +bag +770 +##eting +##mobile +18 +culture +015 +##のてすか 
+377 +1020 +area +##ience +616 +details +gp +universal +silver +dit +はお +private +ddd +u11 +kanshu +##ified +fung +##nny +dx +##520 +tai +475 +023 +##fr +##lean +3s +##pin +429 +##rin +25000 +ly +rick +##bility +usb3 +banner +##baru +##gion +metal +dt +vdf +1871 +karl +qualcomm +bear +1010 +oldid +ian +jo +##tors +population +##ernel +1882 +mmorpg +##mv +##bike +603 +##© +ww +friend +##ager +exhibition +##del +##pods +fpx +structure +##free +##tings +kl +##rley +##copyright +##mma +california +3400 +orange +yoga +4l +canmake +honey +##anda +##コメント +595 +nikkie +##ルハイト +dhl +publishing +##mall +##gnet +20cm +513 +##クセス +##┅ +e88 +970 +##dog +fishbase +##! +##" +### +##$ +##% +##& +##' +##( +##) +##* +##+ +##, +##- +##. +##/ +##: +##; +##< +##= +##> +##? +##@ +##[ +##\ +##] +##^ +##_ +##{ +##| +##} +##~ +##£ +##¤ +##¥ +##§ +##« +##± +##³ +##µ +##· +##¹ +##º +##» +##¼ +##ß +##æ +##÷ +##ø +##đ +##ŋ +##ɔ +##ə +##ɡ +##ʰ +##ˇ +##ˈ +##ˊ +##ˋ +##ˍ +##ː +##˙ +##˚ +##ˢ +##α +##β +##γ +##δ +##ε +##η +##θ +##ι +##κ +##λ +##μ +##ν +##ο +##π +##ρ +##ς +##σ +##τ +##υ +##φ +##χ +##ψ +##б +##в +##г +##д +##е +##ж +##з +##к +##л +##м +##н +##о +##п +##р +##с +##т +##у +##ф +##х +##ц +##ч +##ш +##ы +##ь +##і +##ا +##ب +##ة +##ت +##د +##ر +##س +##ع +##ل +##م +##ن +##ه +##و +##ي +##۩ +##ก +##ง +##น +##ม +##ย +##ร +##อ +##า +##เ +##๑ +##་ +##ღ +##ᄀ +##ᄁ +##ᄂ +##ᄃ +##ᄅ +##ᄆ +##ᄇ +##ᄈ +##ᄉ +##ᄋ +##ᄌ +##ᄎ +##ᄏ +##ᄐ +##ᄑ +##ᄒ +##ᅢ +##ᅣ +##ᅥ +##ᅦ +##ᅧ +##ᅨ +##ᅪ +##ᅬ +##ᅭ +##ᅮ +##ᅯ +##ᅲ +##ᅳ +##ᅴ +##ᆷ +##ᆸ +##ᆺ +##ᆻ +##ᗜ +##ᵃ +##ᵉ +##ᵍ +##ᵏ +##ᵐ +##ᵒ +##ᵘ +##‖ +##„ +##† +##• +##‥ +##‧ +## +##‰ +##′ +##″ +##‹ +##› +##※ +##‿ +##⁄ +##ⁱ +##⁺ +##ⁿ +##₁ +##₃ +##₄ +##€ +##№ +##ⅰ +##ⅱ +##ⅲ +##ⅳ +##ⅴ +##↔ +##↗ +##↘ +##⇒ +##∀ +##− +##∕ +##∙ +##√ +##∞ +##∟ +##∠ +##∣ +##∩ +##∮ +##∶ +##∼ +##∽ +##≈ +##≒ +##≡ +##≤ +##≥ +##≦ +##≧ +##≪ +##≫ +##⊙ +##⋅ +##⋈ +##⋯ +##⌒ +##① +##② +##③ +##④ +##⑤ +##⑥ +##⑦ +##⑧ +##⑨ +##⑩ +##⑴ +##⑵ +##⑶ +##⑷ +##⑸ +##⒈ +##⒉ +##⒊ +##⒋ +##ⓒ +##ⓔ +##ⓘ +##━ +##┃ +##┆ +##┊ +##┌ +##└ +##├ +##┣ +##═ +##║ +##╚ +##╞ +##╠ +##╭ +##╮ +##╯ +##╰ +##╱ +##╳ +##▂ +##▃ +##▅ +##▇ +##▉ +##▋ +##▌ +##▍ +##▎ +##□ +##▪ +##▫ +##▬ +##△ +##▶ +##► +##▽ +##◇ +##◕ +##◠ +##◢ +##◤ +##☀ +##☕ +##☞ +##☺ +##☼ +##♀ +##♂ +##♠ +##♡ +##♣ +##♦ +##♫ +##♬ +##✈ +##✔ +##✕ +##✖ +##✦ +##✨ +##✪ +##✰ +##✿ +##❀ +##➜ +##➤ +##⦿ +##、 +##。 +##〃 +##々 +##〇 +##〈 +##〉 +##《 +##》 +##「 +##」 +##『 +##』 +##【 +##】 +##〓 +##〔 +##〕 +##〖 +##〗 +##〜 +##〝 +##〞 +##ぃ +##ぇ +##ぬ +##ふ +##ほ +##む +##ゃ +##ゅ +##ゆ +##ょ +##゜ +##ゝ +##ァ +##ゥ +##エ +##ォ +##ケ +##サ +##セ +##ソ +##ッ +##ニ +##ヌ +##ネ +##ノ +##ヘ +##モ +##ャ +##ヤ +##ュ +##ユ +##ョ +##ヨ +##ワ +##ヲ +##・ +##ヽ +##ㄅ +##ㄆ +##ㄇ +##ㄉ +##ㄋ +##ㄌ +##ㄍ +##ㄎ +##ㄏ +##ㄒ +##ㄚ +##ㄛ +##ㄞ +##ㄟ +##ㄢ +##ㄤ +##ㄥ +##ㄧ +##ㄨ +##ㆍ +##㈦ +##㊣ +##㗎 +##一 +##丁 +##七 +##万 +##丈 +##三 +##上 +##下 +##不 +##与 +##丐 +##丑 +##专 +##且 +##丕 +##世 +##丘 +##丙 +##业 +##丛 +##东 +##丝 +##丞 +##丟 +##両 +##丢 +##两 +##严 +##並 +##丧 +##丨 +##个 +##丫 +##中 +##丰 +##串 +##临 +##丶 +##丸 +##丹 +##为 +##主 +##丼 +##丽 +##举 +##丿 +##乂 +##乃 +##久 +##么 +##义 +##之 +##乌 +##乍 +##乎 +##乏 +##乐 +##乒 +##乓 +##乔 +##乖 +##乗 +##乘 +##乙 +##乜 +##九 +##乞 +##也 +##习 +##乡 +##书 +##乩 +##买 +##乱 +##乳 +##乾 +##亀 +##亂 +##了 +##予 +##争 +##事 +##二 +##于 +##亏 +##云 +##互 +##五 +##井 +##亘 +##亙 +##亚 +##些 +##亜 +##亞 +##亟 +##亡 +##亢 +##交 +##亥 +##亦 +##产 +##亨 +##亩 +##享 +##京 +##亭 +##亮 +##亲 +##亳 +##亵 +##人 +##亿 +##什 +##仁 +##仃 +##仄 +##仅 +##仆 +##仇 +##今 +##介 +##仍 +##从 +##仏 +##仑 +##仓 +##仔 +##仕 +##他 +##仗 +##付 +##仙 +##仝 +##仞 +##仟 +##代 +##令 +##以 +##仨 +##仪 +##们 +##仮 +##仰 +##仲 +##件 +##价 +##任 +##份 +##仿 +##企 +##伉 +##伊 +##伍 +##伎 +##伏 +##伐 +##休 +##伕 +##众 +##优 +##伙 +##会 +##伝 +##伞 +##伟 +##传 +##伢 
+##伤 +##伦 +##伪 +##伫 +##伯 +##估 +##伴 +##伶 +##伸 +##伺 +##似 +##伽 +##佃 +##但 +##佇 +##佈 +##位 +##低 +##住 +##佐 +##佑 +##体 +##佔 +##何 +##佗 +##佘 +##余 +##佚 +##佛 +##作 +##佝 +##佞 +##佟 +##你 +##佢 +##佣 +##佤 +##佥 +##佩 +##佬 +##佯 +##佰 +##佳 +##併 +##佶 +##佻 +##佼 +##使 +##侃 +##侄 +##來 +##侈 +##例 +##侍 +##侏 +##侑 +##侖 +##侗 +##供 +##依 +##侠 +##価 +##侣 +##侥 +##侦 +##侧 +##侨 +##侬 +##侮 +##侯 +##侵 +##侶 +##侷 +##便 +##係 +##促 +##俄 +##俊 +##俎 +##俏 +##俐 +##俑 +##俗 +##俘 +##俚 +##保 +##俞 +##俟 +##俠 +##信 +##俨 +##俩 +##俪 +##俬 +##俭 +##修 +##俯 +##俱 +##俳 +##俸 +##俺 +##俾 +##倆 +##倉 +##個 +##倌 +##倍 +##倏 +##們 +##倒 +##倔 +##倖 +##倘 +##候 +##倚 +##倜 +##借 +##倡 +##値 +##倦 +##倩 +##倪 +##倫 +##倬 +##倭 +##倶 +##债 +##值 +##倾 +##偃 +##假 +##偈 +##偉 +##偌 +##偎 +##偏 +##偕 +##做 +##停 +##健 +##側 +##偵 +##偶 +##偷 +##偻 +##偽 +##偿 +##傀 +##傅 +##傍 +##傑 +##傘 +##備 +##傚 +##傢 +##傣 +##傥 +##储 +##傩 +##催 +##傭 +##傲 +##傳 +##債 +##傷 +##傻 +##傾 +##僅 +##働 +##像 +##僑 +##僕 +##僖 +##僚 +##僥 +##僧 +##僭 +##僮 +##僱 +##僵 +##價 +##僻 +##儀 +##儂 +##億 +##儆 +##儉 +##儋 +##儒 +##儕 +##儘 +##償 +##儡 +##優 +##儲 +##儷 +##儼 +##儿 +##兀 +##允 +##元 +##兄 +##充 +##兆 +##兇 +##先 +##光 +##克 +##兌 +##免 +##児 +##兑 +##兒 +##兔 +##兖 +##党 +##兜 +##兢 +##入 +##內 +##全 +##兩 +##八 +##公 +##六 +##兮 +##兰 +##共 +##兲 +##关 +##兴 +##兵 +##其 +##具 +##典 +##兹 +##养 +##兼 +##兽 +##冀 +##内 +##円 +##冇 +##冈 +##冉 +##冊 +##册 +##再 +##冏 +##冒 +##冕 +##冗 +##写 +##军 +##农 +##冠 +##冢 +##冤 +##冥 +##冨 +##冪 +##冬 +##冯 +##冰 +##冲 +##决 +##况 +##冶 +##冷 +##冻 +##冼 +##冽 +##冾 +##净 +##凄 +##准 +##凇 +##凈 +##凉 +##凋 +##凌 +##凍 +##减 +##凑 +##凛 +##凜 +##凝 +##几 +##凡 +##凤 +##処 +##凪 +##凭 +##凯 +##凰 +##凱 +##凳 +##凶 +##凸 +##凹 +##出 +##击 +##函 +##凿 +##刀 +##刁 +##刃 +##分 +##切 +##刈 +##刊 +##刍 +##刎 +##刑 +##划 +##列 +##刘 +##则 +##刚 +##创 +##初 +##删 +##判 +##別 +##刨 +##利 +##刪 +##别 +##刮 +##到 +##制 +##刷 +##券 +##刹 +##刺 +##刻 +##刽 +##剁 +##剂 +##剃 +##則 +##剉 +##削 +##剋 +##剌 +##前 +##剎 +##剐 +##剑 +##剔 +##剖 +##剛 +##剜 +##剝 +##剣 +##剤 +##剥 +##剧 +##剩 +##剪 +##副 +##割 +##創 +##剷 +##剽 +##剿 +##劃 +##劇 +##劈 +##劉 +##劊 +##劍 +##劏 +##劑 +##力 +##劝 +##办 +##功 +##加 +##务 +##劣 +##动 +##助 +##努 +##劫 +##劭 +##励 +##劲 +##劳 +##労 +##劵 +##効 +##劾 +##势 +##勁 +##勃 +##勇 +##勉 +##勋 +##勐 +##勒 +##動 +##勖 +##勘 +##務 +##勛 +##勝 +##勞 +##募 +##勢 +##勤 +##勧 +##勳 +##勵 +##勸 +##勺 +##勻 +##勾 +##勿 +##匀 +##包 +##匆 +##匈 +##匍 +##匐 +##匕 +##化 +##北 +##匙 +##匝 +##匠 +##匡 +##匣 +##匪 +##匮 +##匯 +##匱 +##匹 +##区 +##医 +##匾 +##匿 +##區 +##十 +##千 +##卅 +##升 +##午 +##卉 +##半 +##卍 +##华 +##协 +##卑 +##卒 +##卓 +##協 +##单 +##卖 +##南 +##単 +##博 +##卜 +##卞 +##卟 +##占 +##卡 +##卢 +##卤 +##卦 +##卧 +##卫 +##卮 +##卯 +##印 +##危 +##即 +##却 +##卵 +##卷 +##卸 +##卻 +##卿 +##厂 +##厄 +##厅 +##历 +##厉 +##压 +##厌 +##厕 +##厘 +##厚 +##厝 +##原 +##厢 +##厥 +##厦 +##厨 +##厩 +##厭 +##厮 +##厲 +##厳 +##去 +##县 +##叁 +##参 +##參 +##又 +##叉 +##及 +##友 +##双 +##反 +##収 +##发 +##叔 +##取 +##受 +##变 +##叙 +##叛 +##叟 +##叠 +##叡 +##叢 +##口 +##古 +##句 +##另 +##叨 +##叩 +##只 +##叫 +##召 +##叭 +##叮 +##可 +##台 +##叱 +##史 +##右 +##叵 +##叶 +##号 +##司 +##叹 +##叻 +##叼 +##叽 +##吁 +##吃 +##各 +##吆 +##合 +##吉 +##吊 +##吋 +##同 +##名 +##后 +##吏 +##吐 +##向 +##吒 +##吓 +##吕 +##吖 +##吗 +##君 +##吝 +##吞 +##吟 +##吠 +##吡 +##否 +##吧 +##吨 +##吩 +##含 +##听 +##吭 +##吮 +##启 +##吱 +##吳 +##吴 +##吵 +##吶 +##吸 +##吹 +##吻 +##吼 +##吽 +##吾 +##呀 +##呂 +##呃 +##呆 +##呈 +##告 +##呋 +##呎 +##呐 +##呓 +##呕 +##呗 +##员 +##呛 +##呜 +##呢 +##呤 +##呦 +##周 +##呱 +##呲 +##味 +##呵 +##呷 +##呸 +##呻 +##呼 +##命 +##咀 +##咁 +##咂 +##咄 +##咆 +##咋 +##和 +##咎 +##咏 +##咐 +##咒 +##咔 +##咕 +##咖 +##咗 +##咘 +##咙 +##咚 +##咛 +##咣 +##咤 +##咦 +##咧 +##咨 +##咩 +##咪 +##咫 +##咬 +##咭 +##咯 +##咱 +##咲 +##咳 +##咸 +##咻 +##咽 +##咿 +##哀 +##品 +##哂 +##哄 +##哆 +##哇 +##哈 +##哉 +##哋 +##哌 +##响 +##哎 +##哏 +##哐 +##哑 +##哒 +##哔 +##哗 +##哟 +##員 +##哥 +##哦 +##哧 +##哨 +##哩 +##哪 +##哭 +##哮 +##哲 +##哺 +##哼 +##哽 +##唁 +##唄 +##唆 +##唇 +##唉 +##唏 +##唐 +##唑 +##唔 +##唠 +##唤 +##唧 +##唬 +##售 +##唯 +##唰 +##唱 +##唳 
+##唷 +##唸 +##唾 +##啃 +##啄 +##商 +##啉 +##啊 +##問 +##啓 +##啕 +##啖 +##啜 +##啞 +##啟 +##啡 +##啤 +##啥 +##啦 +##啧 +##啪 +##啫 +##啬 +##啮 +##啰 +##啱 +##啲 +##啵 +##啶 +##啷 +##啸 +##啻 +##啼 +##啾 +##喀 +##喂 +##喃 +##善 +##喆 +##喇 +##喉 +##喊 +##喋 +##喎 +##喏 +##喔 +##喘 +##喙 +##喚 +##喜 +##喝 +##喟 +##喧 +##喪 +##喫 +##喬 +##單 +##喰 +##喱 +##喲 +##喳 +##喵 +##営 +##喷 +##喹 +##喺 +##喻 +##喽 +##嗅 +##嗆 +##嗇 +##嗎 +##嗑 +##嗒 +##嗓 +##嗔 +##嗖 +##嗚 +##嗜 +##嗝 +##嗟 +##嗡 +##嗣 +##嗤 +##嗦 +##嗨 +##嗪 +##嗬 +##嗯 +##嗰 +##嗲 +##嗳 +##嗶 +##嗷 +##嗽 +##嘀 +##嘅 +##嘆 +##嘈 +##嘉 +##嘌 +##嘍 +##嘎 +##嘔 +##嘖 +##嘗 +##嘘 +##嘚 +##嘛 +##嘜 +##嘞 +##嘟 +##嘢 +##嘣 +##嘤 +##嘧 +##嘩 +##嘭 +##嘮 +##嘯 +##嘰 +##嘱 +##嘲 +##嘴 +##嘶 +##嘸 +##嘹 +##嘻 +##嘿 +##噁 +##噌 +##噎 +##噓 +##噔 +##噗 +##噙 +##噜 +##噠 +##噢 +##噤 +##器 +##噩 +##噪 +##噬 +##噱 +##噴 +##噶 +##噸 +##噹 +##噻 +##噼 +##嚀 +##嚇 +##嚎 +##嚏 +##嚐 +##嚓 +##嚕 +##嚟 +##嚣 +##嚥 +##嚨 +##嚮 +##嚴 +##嚷 +##嚼 +##囂 +##囉 +##囊 +##囍 +##囑 +##囔 +##囗 +##囚 +##四 +##囝 +##回 +##囟 +##因 +##囡 +##团 +##団 +##囤 +##囧 +##囪 +##囫 +##园 +##困 +##囱 +##囲 +##図 +##围 +##囹 +##固 +##国 +##图 +##囿 +##圃 +##圄 +##圆 +##圈 +##國 +##圍 +##圏 +##園 +##圓 +##圖 +##團 +##圜 +##土 +##圣 +##圧 +##在 +##圩 +##圭 +##地 +##圳 +##场 +##圻 +##圾 +##址 +##坂 +##均 +##坊 +##坍 +##坎 +##坏 +##坐 +##坑 +##块 +##坚 +##坛 +##坝 +##坞 +##坟 +##坠 +##坡 +##坤 +##坦 +##坨 +##坪 +##坯 +##坳 +##坵 +##坷 +##垂 +##垃 +##垄 +##型 +##垒 +##垚 +##垛 +##垠 +##垢 +##垣 +##垦 +##垩 +##垫 +##垭 +##垮 +##垵 +##埂 +##埃 +##埋 +##城 +##埔 +##埕 +##埗 +##域 +##埠 +##埤 +##埵 +##執 +##埸 +##培 +##基 +##埼 +##堀 +##堂 +##堃 +##堅 +##堆 +##堇 +##堑 +##堕 +##堙 +##堡 +##堤 +##堪 +##堯 +##堰 +##報 +##場 +##堵 +##堺 +##堿 +##塊 +##塌 +##塑 +##塔 +##塗 +##塘 +##塚 +##塞 +##塢 +##塩 +##填 +##塬 +##塭 +##塵 +##塾 +##墀 +##境 +##墅 +##墉 +##墊 +##墒 +##墓 +##増 +##墘 +##墙 +##墜 +##增 +##墟 +##墨 +##墩 +##墮 +##墳 +##墻 +##墾 +##壁 +##壅 +##壆 +##壇 +##壊 +##壑 +##壓 +##壕 +##壘 +##壞 +##壟 +##壢 +##壤 +##壩 +##士 +##壬 +##壮 +##壯 +##声 +##売 +##壳 +##壶 +##壹 +##壺 +##壽 +##处 +##备 +##変 +##复 +##夏 +##夔 +##夕 +##外 +##夙 +##多 +##夜 +##够 +##夠 +##夢 +##夥 +##大 +##天 +##太 +##夫 +##夭 +##央 +##夯 +##失 +##头 +##夷 +##夸 +##夹 +##夺 +##夾 +##奂 +##奄 +##奇 +##奈 +##奉 +##奋 +##奎 +##奏 +##奐 +##契 +##奔 +##奕 +##奖 +##套 +##奘 +##奚 +##奠 +##奢 +##奥 +##奧 +##奪 +##奬 +##奮 +##女 +##奴 +##奶 +##奸 +##她 +##好 +##如 +##妃 +##妄 +##妆 +##妇 +##妈 +##妊 +##妍 +##妒 +##妓 +##妖 +##妘 +##妙 +##妝 +##妞 +##妣 +##妤 +##妥 +##妨 +##妩 +##妪 +##妮 +##妲 +##妳 +##妹 +##妻 +##妾 +##姆 +##姉 +##姊 +##始 +##姍 +##姐 +##姑 +##姒 +##姓 +##委 +##姗 +##姚 +##姜 +##姝 +##姣 +##姥 +##姦 +##姨 +##姪 +##姫 +##姬 +##姹 +##姻 +##姿 +##威 +##娃 +##娄 +##娅 +##娆 +##娇 +##娉 +##娑 +##娓 +##娘 +##娛 +##娜 +##娟 +##娠 +##娣 +##娥 +##娩 +##娱 +##娲 +##娴 +##娶 +##娼 +##婀 +##婁 +##婆 +##婉 +##婊 +##婕 +##婚 +##婢 +##婦 +##婧 +##婪 +##婭 +##婴 +##婵 +##婶 +##婷 +##婺 +##婿 +##媒 +##媚 +##媛 +##媞 +##媧 +##媲 +##媳 +##媽 +##媾 +##嫁 +##嫂 +##嫉 +##嫌 +##嫑 +##嫔 +##嫖 +##嫘 +##嫚 +##嫡 +##嫣 +##嫦 +##嫩 +##嫲 +##嫵 +##嫻 +##嬅 +##嬉 +##嬌 +##嬗 +##嬛 +##嬢 +##嬤 +##嬪 +##嬰 +##嬴 +##嬷 +##嬸 +##嬿 +##孀 +##孃 +##子 +##孑 +##孔 +##孕 +##孖 +##字 +##存 +##孙 +##孚 +##孛 +##孜 +##孝 +##孟 +##孢 +##季 +##孤 +##学 +##孩 +##孪 +##孫 +##孬 +##孰 +##孱 +##孳 +##孵 +##學 +##孺 +##孽 +##孿 +##宁 +##它 +##宅 +##宇 +##守 +##安 +##宋 +##完 +##宏 +##宓 +##宕 +##宗 +##官 +##宙 +##定 +##宛 +##宜 +##宝 +##实 +##実 +##宠 +##审 +##客 +##宣 +##室 +##宥 +##宦 +##宪 +##宫 +##宮 +##宰 +##害 +##宴 +##宵 +##家 +##宸 +##容 +##宽 +##宾 +##宿 +##寂 +##寄 +##寅 +##密 +##寇 +##富 +##寐 +##寒 +##寓 +##寛 +##寝 +##寞 +##察 +##寡 +##寢 +##寥 +##實 +##寧 +##寨 +##審 +##寫 +##寬 +##寮 +##寰 +##寵 +##寶 +##寸 +##对 +##寺 +##寻 +##导 +##対 +##寿 +##封 +##専 +##射 +##将 +##將 +##專 +##尉 +##尊 +##尋 +##對 +##導 +##小 +##少 +##尔 +##尕 +##尖 +##尘 +##尚 +##尝 +##尤 +##尧 +##尬 +##就 +##尴 +##尷 +##尸 +##尹 +##尺 +##尻 +##尼 +##尽 +##尾 +##尿 +##局 +##屁 +##层 +##屄 +##居 +##屆 +##屈 +##屉 +##届 +##屋 +##屌 +##屍 +##屎 +##屏 +##屐 +##屑 +##展 +##屜 +##属 +##屠 +##屡 +##屢 +##層 +##履 +##屬 +##屯 +##山 +##屹 +##屿 +##岀 +##岁 +##岂 
+##岌 +##岐 +##岑 +##岔 +##岖 +##岗 +##岘 +##岙 +##岚 +##岛 +##岡 +##岩 +##岫 +##岬 +##岭 +##岱 +##岳 +##岷 +##岸 +##峇 +##峋 +##峒 +##峙 +##峡 +##峤 +##峥 +##峦 +##峨 +##峪 +##峭 +##峯 +##峰 +##峴 +##島 +##峻 +##峽 +##崁 +##崂 +##崆 +##崇 +##崎 +##崑 +##崔 +##崖 +##崗 +##崙 +##崛 +##崧 +##崩 +##崭 +##崴 +##崽 +##嵇 +##嵊 +##嵋 +##嵌 +##嵐 +##嵘 +##嵩 +##嵬 +##嵯 +##嶂 +##嶄 +##嶇 +##嶋 +##嶙 +##嶺 +##嶼 +##嶽 +##巅 +##巍 +##巒 +##巔 +##巖 +##川 +##州 +##巡 +##巢 +##工 +##左 +##巧 +##巨 +##巩 +##巫 +##差 +##己 +##已 +##巳 +##巴 +##巷 +##巻 +##巽 +##巾 +##巿 +##币 +##市 +##布 +##帅 +##帆 +##师 +##希 +##帐 +##帑 +##帕 +##帖 +##帘 +##帚 +##帛 +##帜 +##帝 +##帥 +##带 +##帧 +##師 +##席 +##帮 +##帯 +##帰 +##帳 +##帶 +##帷 +##常 +##帼 +##帽 +##幀 +##幂 +##幄 +##幅 +##幌 +##幔 +##幕 +##幟 +##幡 +##幢 +##幣 +##幫 +##干 +##平 +##年 +##并 +##幸 +##幹 +##幺 +##幻 +##幼 +##幽 +##幾 +##广 +##庁 +##広 +##庄 +##庆 +##庇 +##床 +##序 +##庐 +##库 +##应 +##底 +##庖 +##店 +##庙 +##庚 +##府 +##庞 +##废 +##庠 +##度 +##座 +##庫 +##庭 +##庵 +##庶 +##康 +##庸 +##庹 +##庾 +##廁 +##廂 +##廃 +##廈 +##廉 +##廊 +##廓 +##廖 +##廚 +##廝 +##廟 +##廠 +##廢 +##廣 +##廬 +##廳 +##延 +##廷 +##建 +##廿 +##开 +##弁 +##异 +##弃 +##弄 +##弈 +##弊 +##弋 +##式 +##弑 +##弒 +##弓 +##弔 +##引 +##弗 +##弘 +##弛 +##弟 +##张 +##弥 +##弦 +##弧 +##弩 +##弭 +##弯 +##弱 +##張 +##強 +##弹 +##强 +##弼 +##弾 +##彅 +##彆 +##彈 +##彌 +##彎 +##归 +##当 +##录 +##彗 +##彙 +##彝 +##形 +##彤 +##彥 +##彦 +##彧 +##彩 +##彪 +##彫 +##彬 +##彭 +##彰 +##影 +##彷 +##役 +##彻 +##彼 +##彿 +##往 +##征 +##径 +##待 +##徇 +##很 +##徉 +##徊 +##律 +##後 +##徐 +##徑 +##徒 +##従 +##徕 +##得 +##徘 +##徙 +##徜 +##從 +##徠 +##御 +##徨 +##復 +##循 +##徬 +##微 +##徳 +##徴 +##徵 +##德 +##徹 +##徼 +##徽 +##心 +##必 +##忆 +##忌 +##忍 +##忏 +##忐 +##忑 +##忒 +##忖 +##志 +##忘 +##忙 +##応 +##忠 +##忡 +##忤 +##忧 +##忪 +##快 +##忱 +##念 +##忻 +##忽 +##忿 +##怀 +##态 +##怂 +##怅 +##怆 +##怎 +##怏 +##怒 +##怔 +##怕 +##怖 +##怙 +##怜 +##思 +##怠 +##怡 +##急 +##怦 +##性 +##怨 +##怪 +##怯 +##怵 +##总 +##怼 +##恁 +##恃 +##恆 +##恋 +##恍 +##恐 +##恒 +##恕 +##恙 +##恚 +##恢 +##恣 +##恤 +##恥 +##恨 +##恩 +##恪 +##恫 +##恬 +##恭 +##息 +##恰 +##恳 +##恵 +##恶 +##恸 +##恺 +##恻 +##恼 +##恿 +##悄 +##悅 +##悉 +##悌 +##悍 +##悔 +##悖 +##悚 +##悟 +##悠 +##患 +##悦 +##您 +##悩 +##悪 +##悬 +##悯 +##悱 +##悲 +##悴 +##悵 +##悶 +##悸 +##悻 +##悼 +##悽 +##情 +##惆 +##惇 +##惊 +##惋 +##惑 +##惕 +##惘 +##惚 +##惜 +##惟 +##惠 +##惡 +##惦 +##惧 +##惨 +##惩 +##惫 +##惬 +##惭 +##惮 +##惯 +##惰 +##惱 +##想 +##惴 +##惶 +##惹 +##惺 +##愁 +##愆 +##愈 +##愉 +##愍 +##意 +##愕 +##愚 +##愛 +##愜 +##感 +##愣 +##愤 +##愧 +##愫 +##愷 +##愿 +##慄 +##慈 +##態 +##慌 +##慎 +##慑 +##慕 +##慘 +##慚 +##慟 +##慢 +##慣 +##慧 +##慨 +##慫 +##慮 +##慰 +##慳 +##慵 +##慶 +##慷 +##慾 +##憂 +##憊 +##憋 +##憎 +##憐 +##憑 +##憔 +##憚 +##憤 +##憧 +##憨 +##憩 +##憫 +##憬 +##憲 +##憶 +##憾 +##懂 +##懇 +##懈 +##應 +##懊 +##懋 +##懑 +##懒 +##懦 +##懲 +##懵 +##懶 +##懷 +##懸 +##懺 +##懼 +##懾 +##懿 +##戀 +##戈 +##戊 +##戌 +##戍 +##戎 +##戏 +##成 +##我 +##戒 +##戕 +##或 +##战 +##戚 +##戛 +##戟 +##戡 +##戦 +##截 +##戬 +##戮 +##戰 +##戲 +##戳 +##戴 +##戶 +##户 +##戸 +##戻 +##戾 +##房 +##所 +##扁 +##扇 +##扈 +##扉 +##手 +##才 +##扎 +##扑 +##扒 +##打 +##扔 +##払 +##托 +##扛 +##扣 +##扦 +##执 +##扩 +##扪 +##扫 +##扬 +##扭 +##扮 +##扯 +##扰 +##扱 +##扳 +##扶 +##批 +##扼 +##找 +##承 +##技 +##抄 +##抉 +##把 +##抑 +##抒 +##抓 +##投 +##抖 +##抗 +##折 +##抚 +##抛 +##抜 +##択 +##抟 +##抠 +##抡 +##抢 +##护 +##报 +##抨 +##披 +##抬 +##抱 +##抵 +##抹 +##押 +##抽 +##抿 +##拂 +##拄 +##担 +##拆 +##拇 +##拈 +##拉 +##拋 +##拌 +##拍 +##拎 +##拐 +##拒 +##拓 +##拔 +##拖 +##拗 +##拘 +##拙 +##拚 +##招 +##拜 +##拟 +##拡 +##拢 +##拣 +##拥 +##拦 +##拧 +##拨 +##择 +##括 +##拭 +##拮 +##拯 +##拱 +##拳 +##拴 +##拷 +##拼 +##拽 +##拾 +##拿 +##持 +##挂 +##指 +##挈 +##按 +##挎 +##挑 +##挖 +##挙 +##挚 +##挛 +##挝 +##挞 +##挟 +##挠 +##挡 +##挣 +##挤 +##挥 +##挨 +##挪 +##挫 +##振 +##挲 +##挹 +##挺 +##挽 +##挾 +##捂 +##捅 +##捆 +##捉 +##捋 +##捌 +##捍 +##捎 +##捏 +##捐 +##捕 +##捞 +##损 +##捡 +##换 +##捣 +##捧 +##捨 +##捩 +##据 +##捱 +##捲 +##捶 +##捷 +##捺 +##捻 +##掀 +##掂 +##掃 +##掇 +##授 +##掉 +##掌 +##掏 +##掐 +##排 +##掖 +##掘 +##掙 +##掛 +##掠 +##採 +##探 +##掣 +##接 +##控 
+##推 +##掩 +##措 +##掬 +##掰 +##掲 +##掳 +##掴 +##掷 +##掸 +##掺 +##揀 +##揃 +##揄 +##揆 +##揉 +##揍 +##描 +##提 +##插 +##揖 +##揚 +##換 +##握 +##揣 +##揩 +##揪 +##揭 +##揮 +##援 +##揶 +##揸 +##揹 +##揽 +##搀 +##搁 +##搂 +##搅 +##損 +##搏 +##搐 +##搓 +##搔 +##搖 +##搗 +##搜 +##搞 +##搡 +##搪 +##搬 +##搭 +##搵 +##搶 +##携 +##搽 +##摀 +##摁 +##摄 +##摆 +##摇 +##摈 +##摊 +##摒 +##摔 +##摘 +##摞 +##摟 +##摧 +##摩 +##摯 +##摳 +##摸 +##摹 +##摺 +##摻 +##撂 +##撃 +##撅 +##撇 +##撈 +##撐 +##撑 +##撒 +##撓 +##撕 +##撚 +##撞 +##撤 +##撥 +##撩 +##撫 +##撬 +##播 +##撮 +##撰 +##撲 +##撵 +##撷 +##撸 +##撻 +##撼 +##撿 +##擀 +##擁 +##擂 +##擄 +##擅 +##擇 +##擊 +##擋 +##操 +##擎 +##擒 +##擔 +##擘 +##據 +##擞 +##擠 +##擡 +##擢 +##擦 +##擬 +##擰 +##擱 +##擲 +##擴 +##擷 +##擺 +##擼 +##擾 +##攀 +##攏 +##攒 +##攔 +##攘 +##攙 +##攜 +##攝 +##攞 +##攢 +##攣 +##攤 +##攥 +##攪 +##攫 +##攬 +##支 +##收 +##攸 +##改 +##攻 +##放 +##政 +##故 +##效 +##敌 +##敍 +##敎 +##敏 +##救 +##敕 +##敖 +##敗 +##敘 +##教 +##敛 +##敝 +##敞 +##敢 +##散 +##敦 +##敬 +##数 +##敲 +##整 +##敵 +##敷 +##數 +##斂 +##斃 +##文 +##斋 +##斌 +##斎 +##斐 +##斑 +##斓 +##斗 +##料 +##斛 +##斜 +##斟 +##斡 +##斤 +##斥 +##斧 +##斩 +##斫 +##斬 +##断 +##斯 +##新 +##斷 +##方 +##於 +##施 +##旁 +##旃 +##旅 +##旋 +##旌 +##旎 +##族 +##旖 +##旗 +##无 +##既 +##日 +##旦 +##旧 +##旨 +##早 +##旬 +##旭 +##旮 +##旱 +##时 +##旷 +##旺 +##旻 +##昀 +##昂 +##昆 +##昇 +##昉 +##昊 +##昌 +##明 +##昏 +##易 +##昔 +##昕 +##昙 +##星 +##映 +##春 +##昧 +##昨 +##昭 +##是 +##昱 +##昴 +##昵 +##昶 +##昼 +##显 +##晁 +##時 +##晃 +##晉 +##晋 +##晌 +##晏 +##晒 +##晓 +##晔 +##晕 +##晖 +##晗 +##晚 +##晝 +##晞 +##晟 +##晤 +##晦 +##晨 +##晩 +##普 +##景 +##晰 +##晴 +##晶 +##晷 +##智 +##晾 +##暂 +##暄 +##暇 +##暈 +##暉 +##暌 +##暐 +##暑 +##暖 +##暗 +##暝 +##暢 +##暧 +##暨 +##暫 +##暮 +##暱 +##暴 +##暸 +##暹 +##曄 +##曆 +##曇 +##曉 +##曖 +##曙 +##曜 +##曝 +##曠 +##曦 +##曬 +##曰 +##曲 +##曳 +##更 +##書 +##曹 +##曼 +##曾 +##替 +##最 +##會 +##月 +##有 +##朋 +##服 +##朐 +##朔 +##朕 +##朗 +##望 +##朝 +##期 +##朦 +##朧 +##木 +##未 +##末 +##本 +##札 +##朮 +##术 +##朱 +##朴 +##朵 +##机 +##朽 +##杀 +##杂 +##权 +##杆 +##杈 +##杉 +##李 +##杏 +##材 +##村 +##杓 +##杖 +##杜 +##杞 +##束 +##杠 +##条 +##来 +##杨 +##杭 +##杯 +##杰 +##東 +##杳 +##杵 +##杷 +##杼 +##松 +##板 +##极 +##构 +##枇 +##枉 +##枋 +##析 +##枕 +##林 +##枚 +##果 +##枝 +##枢 +##枣 +##枪 +##枫 +##枭 +##枯 +##枰 +##枱 +##枳 +##架 +##枷 +##枸 +##柄 +##柏 +##某 +##柑 +##柒 +##染 +##柔 +##柘 +##柚 +##柜 +##柞 +##柠 +##柢 +##查 +##柩 +##柬 +##柯 +##柱 +##柳 +##柴 +##柵 +##査 +##柿 +##栀 +##栃 +##栄 +##栅 +##标 +##栈 +##栉 +##栋 +##栎 +##栏 +##树 +##栓 +##栖 +##栗 +##校 +##栩 +##株 +##样 +##核 +##根 +##格 +##栽 +##栾 +##桀 +##桁 +##桂 +##桃 +##桅 +##框 +##案 +##桉 +##桌 +##桎 +##桐 +##桑 +##桓 +##桔 +##桜 +##桠 +##桡 +##桢 +##档 +##桥 +##桦 +##桧 +##桨 +##桩 +##桶 +##桿 +##梁 +##梅 +##梆 +##梏 +##梓 +##梗 +##條 +##梟 +##梢 +##梦 +##梧 +##梨 +##梭 +##梯 +##械 +##梳 +##梵 +##梶 +##检 +##棂 +##棄 +##棉 +##棋 +##棍 +##棒 +##棕 +##棗 +##棘 +##棚 +##棟 +##棠 +##棣 +##棧 +##森 +##棱 +##棲 +##棵 +##棹 +##棺 +##椁 +##椅 +##椋 +##植 +##椎 +##椒 +##検 +##椪 +##椭 +##椰 +##椹 +##椽 +##椿 +##楂 +##楊 +##楓 +##楔 +##楚 +##楝 +##楞 +##楠 +##楣 +##楨 +##楫 +##業 +##楮 +##極 +##楷 +##楸 +##楹 +##楼 +##楽 +##概 +##榄 +##榆 +##榈 +##榉 +##榔 +##榕 +##榖 +##榛 +##榜 +##榨 +##榫 +##榭 +##榮 +##榱 +##榴 +##榷 +##榻 +##槁 +##槃 +##構 +##槌 +##槍 +##槎 +##槐 +##槓 +##様 +##槛 +##槟 +##槤 +##槭 +##槲 +##槳 +##槻 +##槽 +##槿 +##樁 +##樂 +##樊 +##樑 +##樓 +##標 +##樞 +##樟 +##模 +##樣 +##権 +##横 +##樫 +##樯 +##樱 +##樵 +##樸 +##樹 +##樺 +##樽 +##樾 +##橄 +##橇 +##橋 +##橐 +##橘 +##橙 +##機 +##橡 +##橢 +##橫 +##橱 +##橹 +##橼 +##檀 +##檄 +##檎 +##檐 +##檔 +##檗 +##檜 +##檢 +##檬 +##檯 +##檳 +##檸 +##檻 +##櫃 +##櫚 +##櫛 +##櫥 +##櫸 +##櫻 +##欄 +##權 +##欒 +##欖 +##欠 +##次 +##欢 +##欣 +##欧 +##欲 +##欸 +##欺 +##欽 +##款 +##歆 +##歇 +##歉 +##歌 +##歎 +##歐 +##歓 +##歙 +##歛 +##歡 +##止 +##正 +##此 +##步 +##武 +##歧 +##歩 +##歪 +##歯 +##歲 +##歳 +##歴 +##歷 +##歸 +##歹 +##死 +##歼 +##殁 +##殃 +##殆 +##殇 +##殉 +##殊 +##残 +##殒 +##殓 +##殖 +##殘 +##殞 +##殡 +##殤 +##殭 +##殯 +##殲 +##殴 +##段 +##殷 +##殺 +##殼 +##殿 +##毀 +##毁 +##毂 +##毅 +##毆 +##毋 +##母 +##毎 +##每 +##毒 +##毓 
+##比 +##毕 +##毗 +##毘 +##毙 +##毛 +##毡 +##毫 +##毯 +##毽 +##氈 +##氏 +##氐 +##民 +##氓 +##气 +##氖 +##気 +##氙 +##氛 +##氟 +##氡 +##氢 +##氣 +##氤 +##氦 +##氧 +##氨 +##氪 +##氫 +##氮 +##氯 +##氰 +##氲 +##水 +##氷 +##永 +##氹 +##氾 +##汀 +##汁 +##求 +##汆 +##汇 +##汉 +##汎 +##汐 +##汕 +##汗 +##汙 +##汛 +##汝 +##汞 +##江 +##池 +##污 +##汤 +##汨 +##汩 +##汪 +##汰 +##汲 +##汴 +##汶 +##汹 +##決 +##汽 +##汾 +##沁 +##沂 +##沃 +##沅 +##沈 +##沉 +##沌 +##沏 +##沐 +##沒 +##沓 +##沖 +##沙 +##沛 +##沟 +##没 +##沢 +##沣 +##沥 +##沦 +##沧 +##沪 +##沫 +##沭 +##沮 +##沱 +##河 +##沸 +##油 +##治 +##沼 +##沽 +##沾 +##沿 +##況 +##泄 +##泉 +##泊 +##泌 +##泓 +##法 +##泗 +##泛 +##泞 +##泠 +##泡 +##波 +##泣 +##泥 +##注 +##泪 +##泫 +##泮 +##泯 +##泰 +##泱 +##泳 +##泵 +##泷 +##泸 +##泻 +##泼 +##泽 +##泾 +##洁 +##洄 +##洋 +##洒 +##洗 +##洙 +##洛 +##洞 +##津 +##洩 +##洪 +##洮 +##洱 +##洲 +##洵 +##洶 +##洸 +##洹 +##活 +##洼 +##洽 +##派 +##流 +##浃 +##浄 +##浅 +##浆 +##浇 +##浊 +##测 +##济 +##浏 +##浑 +##浒 +##浓 +##浔 +##浙 +##浚 +##浜 +##浣 +##浦 +##浩 +##浪 +##浬 +##浮 +##浯 +##浴 +##海 +##浸 +##涂 +##涅 +##涇 +##消 +##涉 +##涌 +##涎 +##涓 +##涔 +##涕 +##涙 +##涛 +##涝 +##涞 +##涟 +##涠 +##涡 +##涣 +##涤 +##润 +##涧 +##涨 +##涩 +##涪 +##涮 +##涯 +##液 +##涵 +##涸 +##涼 +##涿 +##淀 +##淄 +##淅 +##淆 +##淇 +##淋 +##淌 +##淑 +##淒 +##淖 +##淘 +##淙 +##淚 +##淞 +##淡 +##淤 +##淦 +##淨 +##淩 +##淪 +##淫 +##淬 +##淮 +##深 +##淳 +##淵 +##混 +##淹 +##淺 +##添 +##淼 +##清 +##済 +##渉 +##渊 +##渋 +##渍 +##渎 +##渐 +##渔 +##渗 +##渙 +##渚 +##減 +##渝 +##渠 +##渡 +##渣 +##渤 +##渥 +##渦 +##温 +##測 +##渭 +##港 +##渲 +##渴 +##游 +##渺 +##渾 +##湃 +##湄 +##湊 +##湍 +##湖 +##湘 +##湛 +##湟 +##湧 +##湫 +##湮 +##湯 +##湳 +##湾 +##湿 +##満 +##溃 +##溅 +##溉 +##溏 +##源 +##準 +##溜 +##溝 +##溟 +##溢 +##溥 +##溧 +##溪 +##溫 +##溯 +##溱 +##溴 +##溶 +##溺 +##溼 +##滁 +##滂 +##滄 +##滅 +##滇 +##滋 +##滌 +##滑 +##滓 +##滔 +##滕 +##滙 +##滚 +##滝 +##滞 +##滟 +##满 +##滢 +##滤 +##滥 +##滦 +##滨 +##滩 +##滬 +##滯 +##滲 +##滴 +##滷 +##滸 +##滾 +##滿 +##漁 +##漂 +##漆 +##漉 +##漏 +##漓 +##演 +##漕 +##漠 +##漢 +##漣 +##漩 +##漪 +##漫 +##漬 +##漯 +##漱 +##漲 +##漳 +##漸 +##漾 +##漿 +##潆 +##潇 +##潋 +##潍 +##潑 +##潔 +##潘 +##潛 +##潜 +##潞 +##潟 +##潢 +##潤 +##潦 +##潧 +##潭 +##潮 +##潰 +##潴 +##潸 +##潺 +##潼 +##澀 +##澄 +##澆 +##澈 +##澍 +##澎 +##澗 +##澜 +##澡 +##澤 +##澧 +##澱 +##澳 +##澹 +##激 +##濁 +##濂 +##濃 +##濑 +##濒 +##濕 +##濘 +##濛 +##濟 +##濠 +##濡 +##濤 +##濫 +##濬 +##濮 +##濯 +##濱 +##濺 +##濾 +##瀅 +##瀆 +##瀉 +##瀋 +##瀏 +##瀑 +##瀕 +##瀘 +##瀚 +##瀛 +##瀝 +##瀞 +##瀟 +##瀧 +##瀨 +##瀬 +##瀰 +##瀾 +##灌 +##灏 +##灑 +##灘 +##灝 +##灞 +##灣 +##火 +##灬 +##灭 +##灯 +##灰 +##灵 +##灶 +##灸 +##灼 +##災 +##灾 +##灿 +##炀 +##炁 +##炅 +##炉 +##炊 +##炎 +##炒 +##炔 +##炕 +##炖 +##炙 +##炜 +##炫 +##炬 +##炭 +##炮 +##炯 +##炳 +##炷 +##炸 +##点 +##為 +##炼 +##炽 +##烁 +##烂 +##烃 +##烈 +##烊 +##烏 +##烘 +##烙 +##烛 +##烟 +##烤 +##烦 +##烧 +##烨 +##烩 +##烫 +##烬 +##热 +##烯 +##烷 +##烹 +##烽 +##焉 +##焊 +##焕 +##焖 +##焗 +##焘 +##焙 +##焚 +##焜 +##無 +##焦 +##焯 +##焰 +##焱 +##然 +##焼 +##煅 +##煉 +##煊 +##煌 +##煎 +##煒 +##煖 +##煙 +##煜 +##煞 +##煤 +##煥 +##煦 +##照 +##煨 +##煩 +##煮 +##煲 +##煸 +##煽 +##熄 +##熊 +##熏 +##熒 +##熔 +##熙 +##熟 +##熠 +##熨 +##熬 +##熱 +##熵 +##熹 +##熾 +##燁 +##燃 +##燄 +##燈 +##燉 +##燊 +##燎 +##燒 +##燔 +##燕 +##燙 +##燜 +##營 +##燥 +##燦 +##燧 +##燭 +##燮 +##燴 +##燻 +##燼 +##燿 +##爆 +##爍 +##爐 +##爛 +##爪 +##爬 +##爭 +##爰 +##爱 +##爲 +##爵 +##父 +##爷 +##爸 +##爹 +##爺 +##爻 +##爽 +##爾 +##牆 +##片 +##版 +##牌 +##牍 +##牒 +##牙 +##牛 +##牝 +##牟 +##牠 +##牡 +##牢 +##牦 +##牧 +##物 +##牯 +##牲 +##牴 +##牵 +##特 +##牺 +##牽 +##犀 +##犁 +##犄 +##犊 +##犍 +##犒 +##犢 +##犧 +##犬 +##犯 +##状 +##犷 +##犸 +##犹 +##狀 +##狂 +##狄 +##狈 +##狎 +##狐 +##狒 +##狗 +##狙 +##狞 +##狠 +##狡 +##狩 +##独 +##狭 +##狮 +##狰 +##狱 +##狸 +##狹 +##狼 +##狽 +##猎 +##猕 +##猖 +##猗 +##猙 +##猛 +##猜 +##猝 +##猥 +##猩 +##猪 +##猫 +##猬 +##献 +##猴 +##猶 +##猷 +##猾 +##猿 +##獄 +##獅 +##獎 +##獐 +##獒 +##獗 +##獠 +##獣 +##獨 +##獭 +##獰 +##獲 +##獵 +##獷 +##獸 +##獺 +##獻 +##獼 +##獾 +##玄 +##率 +##玉 +##王 +##玑 +##玖 +##玛 +##玟 +##玠 +##玥 +##玩 +##玫 +##玮 +##环 +##现 +##玲 +##玳 +##玷 +##玺 +##玻 +##珀 +##珂 +##珅 
+##珈 +##珉 +##珊 +##珍 +##珏 +##珐 +##珑 +##珙 +##珞 +##珠 +##珣 +##珥 +##珩 +##珪 +##班 +##珮 +##珲 +##珺 +##現 +##球 +##琅 +##理 +##琇 +##琉 +##琊 +##琍 +##琏 +##琐 +##琛 +##琢 +##琥 +##琦 +##琨 +##琪 +##琬 +##琮 +##琰 +##琲 +##琳 +##琴 +##琵 +##琶 +##琺 +##琼 +##瑀 +##瑁 +##瑄 +##瑋 +##瑕 +##瑗 +##瑙 +##瑚 +##瑛 +##瑜 +##瑞 +##瑟 +##瑠 +##瑣 +##瑤 +##瑩 +##瑪 +##瑯 +##瑰 +##瑶 +##瑾 +##璀 +##璁 +##璃 +##璇 +##璉 +##璋 +##璎 +##璐 +##璜 +##璞 +##璟 +##璧 +##璨 +##環 +##璽 +##璿 +##瓊 +##瓏 +##瓒 +##瓜 +##瓢 +##瓣 +##瓤 +##瓦 +##瓮 +##瓯 +##瓴 +##瓶 +##瓷 +##甄 +##甌 +##甕 +##甘 +##甙 +##甚 +##甜 +##生 +##產 +##産 +##甥 +##甦 +##用 +##甩 +##甫 +##甬 +##甭 +##甯 +##田 +##由 +##甲 +##申 +##电 +##男 +##甸 +##町 +##画 +##甾 +##畀 +##畅 +##界 +##畏 +##畑 +##畔 +##留 +##畜 +##畝 +##畢 +##略 +##畦 +##番 +##畫 +##異 +##畲 +##畳 +##畴 +##當 +##畸 +##畹 +##畿 +##疆 +##疇 +##疊 +##疏 +##疑 +##疔 +##疖 +##疗 +##疙 +##疚 +##疝 +##疟 +##疡 +##疣 +##疤 +##疥 +##疫 +##疮 +##疯 +##疱 +##疲 +##疳 +##疵 +##疸 +##疹 +##疼 +##疽 +##疾 +##痂 +##病 +##症 +##痈 +##痉 +##痊 +##痍 +##痒 +##痔 +##痕 +##痘 +##痙 +##痛 +##痞 +##痠 +##痢 +##痣 +##痤 +##痧 +##痨 +##痪 +##痫 +##痰 +##痱 +##痴 +##痹 +##痺 +##痼 +##痿 +##瘀 +##瘁 +##瘋 +##瘍 +##瘓 +##瘘 +##瘙 +##瘟 +##瘠 +##瘡 +##瘢 +##瘤 +##瘦 +##瘧 +##瘩 +##瘪 +##瘫 +##瘴 +##瘸 +##瘾 +##療 +##癇 +##癌 +##癒 +##癖 +##癜 +##癞 +##癡 +##癢 +##癣 +##癥 +##癫 +##癬 +##癮 +##癱 +##癲 +##癸 +##発 +##登 +##發 +##白 +##百 +##皂 +##的 +##皆 +##皇 +##皈 +##皋 +##皎 +##皑 +##皓 +##皖 +##皙 +##皚 +##皮 +##皰 +##皱 +##皴 +##皺 +##皿 +##盂 +##盃 +##盅 +##盆 +##盈 +##益 +##盎 +##盏 +##盐 +##监 +##盒 +##盔 +##盖 +##盗 +##盘 +##盛 +##盜 +##盞 +##盟 +##盡 +##監 +##盤 +##盥 +##盧 +##盪 +##目 +##盯 +##盱 +##盲 +##直 +##相 +##盹 +##盼 +##盾 +##省 +##眈 +##眉 +##看 +##県 +##眙 +##眞 +##真 +##眠 +##眦 +##眨 +##眩 +##眯 +##眶 +##眷 +##眸 +##眺 +##眼 +##眾 +##着 +##睁 +##睇 +##睏 +##睐 +##睑 +##睛 +##睜 +##睞 +##睡 +##睢 +##督 +##睥 +##睦 +##睨 +##睪 +##睫 +##睬 +##睹 +##睽 +##睾 +##睿 +##瞄 +##瞅 +##瞇 +##瞋 +##瞌 +##瞎 +##瞑 +##瞒 +##瞓 +##瞞 +##瞟 +##瞠 +##瞥 +##瞧 +##瞩 +##瞪 +##瞬 +##瞭 +##瞰 +##瞳 +##瞻 +##瞼 +##瞿 +##矇 +##矍 +##矗 +##矚 +##矛 +##矜 +##矢 +##矣 +##知 +##矩 +##矫 +##短 +##矮 +##矯 +##石 +##矶 +##矽 +##矾 +##矿 +##码 +##砂 +##砌 +##砍 +##砒 +##研 +##砖 +##砗 +##砚 +##砝 +##砣 +##砥 +##砧 +##砭 +##砰 +##砲 +##破 +##砷 +##砸 +##砺 +##砼 +##砾 +##础 +##硅 +##硐 +##硒 +##硕 +##硝 +##硫 +##硬 +##确 +##硯 +##硼 +##碁 +##碇 +##碉 +##碌 +##碍 +##碎 +##碑 +##碓 +##碗 +##碘 +##碚 +##碛 +##碟 +##碣 +##碧 +##碩 +##碰 +##碱 +##碳 +##碴 +##確 +##碼 +##碾 +##磁 +##磅 +##磊 +##磋 +##磐 +##磕 +##磚 +##磡 +##磨 +##磬 +##磯 +##磲 +##磷 +##磺 +##礁 +##礎 +##礙 +##礡 +##礦 +##礪 +##礫 +##礴 +##示 +##礼 +##社 +##祀 +##祁 +##祂 +##祇 +##祈 +##祉 +##祎 +##祐 +##祕 +##祖 +##祗 +##祚 +##祛 +##祜 +##祝 +##神 +##祟 +##祠 +##祢 +##祥 +##票 +##祭 +##祯 +##祷 +##祸 +##祺 +##祿 +##禀 +##禁 +##禄 +##禅 +##禍 +##禎 +##福 +##禛 +##禦 +##禧 +##禪 +##禮 +##禱 +##禹 +##禺 +##离 +##禽 +##禾 +##禿 +##秀 +##私 +##秃 +##秆 +##秉 +##秋 +##种 +##科 +##秒 +##秘 +##租 +##秣 +##秤 +##秦 +##秧 +##秩 +##秭 +##积 +##称 +##秸 +##移 +##秽 +##稀 +##稅 +##程 +##稍 +##税 +##稔 +##稗 +##稚 +##稜 +##稞 +##稟 +##稠 +##稣 +##種 +##稱 +##稲 +##稳 +##稷 +##稹 +##稻 +##稼 +##稽 +##稿 +##穀 +##穂 +##穆 +##穌 +##積 +##穎 +##穗 +##穢 +##穩 +##穫 +##穴 +##究 +##穷 +##穹 +##空 +##穿 +##突 +##窃 +##窄 +##窈 +##窍 +##窑 +##窒 +##窓 +##窕 +##窖 +##窗 +##窘 +##窜 +##窝 +##窟 +##窠 +##窥 +##窦 +##窨 +##窩 +##窪 +##窮 +##窯 +##窺 +##窿 +##竄 +##竅 +##竇 +##竊 +##立 +##竖 +##站 +##竜 +##竞 +##竟 +##章 +##竣 +##童 +##竭 +##端 +##競 +##竹 +##竺 +##竽 +##竿 +##笃 +##笆 +##笈 +##笋 +##笏 +##笑 +##笔 +##笙 +##笛 +##笞 +##笠 +##符 +##笨 +##第 +##笹 +##笺 +##笼 +##筆 +##等 +##筊 +##筋 +##筍 +##筏 +##筐 +##筑 +##筒 +##答 +##策 +##筛 +##筝 +##筠 +##筱 +##筲 +##筵 +##筷 +##筹 +##签 +##简 +##箇 +##箋 +##箍 +##箏 +##箐 +##箔 +##箕 +##算 +##箝 +##管 +##箩 +##箫 +##箭 +##箱 +##箴 +##箸 +##節 +##篁 +##範 +##篆 +##篇 +##築 +##篑 +##篓 +##篙 +##篝 +##篠 +##篡 +##篤 +##篩 +##篪 +##篮 +##篱 +##篷 +##簇 +##簌 +##簍 +##簡 +##簦 +##簧 +##簪 +##簫 +##簷 +##簸 +##簽 +##簾 +##簿 +##籁 +##籃 +##籌 +##籍 +##籐 +##籟 +##籠 +##籤 +##籬 +##籮 +##籲 +##米 +##类 +##籼 +##籽 
+##粄 +##粉 +##粑 +##粒 +##粕 +##粗 +##粘 +##粟 +##粤 +##粥 +##粧 +##粪 +##粮 +##粱 +##粲 +##粳 +##粵 +##粹 +##粼 +##粽 +##精 +##粿 +##糅 +##糊 +##糍 +##糕 +##糖 +##糗 +##糙 +##糜 +##糞 +##糟 +##糠 +##糧 +##糬 +##糯 +##糰 +##糸 +##系 +##糾 +##紀 +##紂 +##約 +##紅 +##紉 +##紊 +##紋 +##納 +##紐 +##紓 +##純 +##紗 +##紘 +##紙 +##級 +##紛 +##紜 +##素 +##紡 +##索 +##紧 +##紫 +##紮 +##累 +##細 +##紳 +##紹 +##紺 +##終 +##絃 +##組 +##絆 +##経 +##結 +##絕 +##絞 +##絡 +##絢 +##給 +##絨 +##絮 +##統 +##絲 +##絳 +##絵 +##絶 +##絹 +##綁 +##綏 +##綑 +##經 +##継 +##続 +##綜 +##綠 +##綢 +##綦 +##綫 +##綬 +##維 +##綱 +##網 +##綴 +##綵 +##綸 +##綺 +##綻 +##綽 +##綾 +##綿 +##緊 +##緋 +##総 +##緑 +##緒 +##緘 +##線 +##緝 +##緞 +##締 +##緣 +##編 +##緩 +##緬 +##緯 +##練 +##緹 +##緻 +##縁 +##縄 +##縈 +##縛 +##縝 +##縣 +##縫 +##縮 +##縱 +##縴 +##縷 +##總 +##績 +##繁 +##繃 +##繆 +##繇 +##繋 +##織 +##繕 +##繚 +##繞 +##繡 +##繩 +##繪 +##繫 +##繭 +##繳 +##繹 +##繼 +##繽 +##纂 +##續 +##纍 +##纏 +##纓 +##纔 +##纖 +##纜 +##纠 +##红 +##纣 +##纤 +##约 +##级 +##纨 +##纪 +##纫 +##纬 +##纭 +##纯 +##纰 +##纱 +##纲 +##纳 +##纵 +##纶 +##纷 +##纸 +##纹 +##纺 +##纽 +##纾 +##线 +##绀 +##练 +##组 +##绅 +##细 +##织 +##终 +##绊 +##绍 +##绎 +##经 +##绑 +##绒 +##结 +##绔 +##绕 +##绘 +##给 +##绚 +##绛 +##络 +##绝 +##绞 +##统 +##绡 +##绢 +##绣 +##绥 +##绦 +##继 +##绩 +##绪 +##绫 +##续 +##绮 +##绯 +##绰 +##绳 +##维 +##绵 +##绶 +##绷 +##绸 +##绻 +##综 +##绽 +##绾 +##绿 +##缀 +##缄 +##缅 +##缆 +##缇 +##缈 +##缉 +##缎 +##缓 +##缔 +##缕 +##编 +##缘 +##缙 +##缚 +##缜 +##缝 +##缠 +##缢 +##缤 +##缥 +##缨 +##缩 +##缪 +##缭 +##缮 +##缰 +##缱 +##缴 +##缸 +##缺 +##缽 +##罂 +##罄 +##罌 +##罐 +##网 +##罔 +##罕 +##罗 +##罚 +##罡 +##罢 +##罩 +##罪 +##置 +##罰 +##署 +##罵 +##罷 +##罹 +##羁 +##羅 +##羈 +##羊 +##羌 +##美 +##羔 +##羚 +##羞 +##羟 +##羡 +##羣 +##群 +##羥 +##羧 +##羨 +##義 +##羯 +##羲 +##羸 +##羹 +##羽 +##羿 +##翁 +##翅 +##翊 +##翌 +##翎 +##習 +##翔 +##翘 +##翟 +##翠 +##翡 +##翦 +##翩 +##翰 +##翱 +##翳 +##翹 +##翻 +##翼 +##耀 +##老 +##考 +##耄 +##者 +##耆 +##耋 +##而 +##耍 +##耐 +##耒 +##耕 +##耗 +##耘 +##耙 +##耦 +##耨 +##耳 +##耶 +##耷 +##耸 +##耻 +##耽 +##耿 +##聂 +##聆 +##聊 +##聋 +##职 +##聒 +##联 +##聖 +##聘 +##聚 +##聞 +##聪 +##聯 +##聰 +##聲 +##聳 +##聴 +##聶 +##職 +##聽 +##聾 +##聿 +##肃 +##肄 +##肅 +##肆 +##肇 +##肉 +##肋 +##肌 +##肏 +##肓 +##肖 +##肘 +##肚 +##肛 +##肝 +##肠 +##股 +##肢 +##肤 +##肥 +##肩 +##肪 +##肮 +##肯 +##肱 +##育 +##肴 +##肺 +##肽 +##肾 +##肿 +##胀 +##胁 +##胃 +##胄 +##胆 +##背 +##胍 +##胎 +##胖 +##胚 +##胛 +##胜 +##胝 +##胞 +##胡 +##胤 +##胥 +##胧 +##胫 +##胭 +##胯 +##胰 +##胱 +##胳 +##胴 +##胶 +##胸 +##胺 +##能 +##脂 +##脅 +##脆 +##脇 +##脈 +##脉 +##脊 +##脍 +##脏 +##脐 +##脑 +##脓 +##脖 +##脘 +##脚 +##脛 +##脣 +##脩 +##脫 +##脯 +##脱 +##脲 +##脳 +##脸 +##脹 +##脾 +##腆 +##腈 +##腊 +##腋 +##腌 +##腎 +##腐 +##腑 +##腓 +##腔 +##腕 +##腥 +##腦 +##腩 +##腫 +##腭 +##腮 +##腰 +##腱 +##腳 +##腴 +##腸 +##腹 +##腺 +##腻 +##腼 +##腾 +##腿 +##膀 +##膈 +##膊 +##膏 +##膑 +##膘 +##膚 +##膛 +##膜 +##膝 +##膠 +##膦 +##膨 +##膩 +##膳 +##膺 +##膻 +##膽 +##膾 +##膿 +##臀 +##臂 +##臃 +##臆 +##臉 +##臊 +##臍 +##臓 +##臘 +##臟 +##臣 +##臥 +##臧 +##臨 +##自 +##臬 +##臭 +##至 +##致 +##臺 +##臻 +##臼 +##臾 +##舀 +##舂 +##舅 +##舆 +##與 +##興 +##舉 +##舊 +##舌 +##舍 +##舎 +##舐 +##舒 +##舔 +##舖 +##舗 +##舛 +##舜 +##舞 +##舟 +##航 +##舫 +##般 +##舰 +##舱 +##舵 +##舶 +##舷 +##舸 +##船 +##舺 +##舾 +##艇 +##艋 +##艘 +##艙 +##艦 +##艮 +##良 +##艰 +##艱 +##色 +##艳 +##艷 +##艹 +##艺 +##艾 +##节 +##芃 +##芈 +##芊 +##芋 +##芍 +##芎 +##芒 +##芙 +##芜 +##芝 +##芡 +##芥 +##芦 +##芩 +##芪 +##芫 +##芬 +##芭 +##芮 +##芯 +##花 +##芳 +##芷 +##芸 +##芹 +##芻 +##芽 +##芾 +##苁 +##苄 +##苇 +##苋 +##苍 +##苏 +##苑 +##苒 +##苓 +##苔 +##苕 +##苗 +##苛 +##苜 +##苞 +##苟 +##苡 +##苣 +##若 +##苦 +##苫 +##苯 +##英 +##苷 +##苹 +##苻 +##茁 +##茂 +##范 +##茄 +##茅 +##茉 +##茎 +##茏 +##茗 +##茜 +##茧 +##茨 +##茫 +##茬 +##茭 +##茯 +##茱 +##茲 +##茴 +##茵 +##茶 +##茸 +##茹 +##茼 +##荀 +##荃 +##荆 +##草 +##荊 +##荏 +##荐 +##荒 +##荔 +##荖 +##荘 +##荚 +##荞 +##荟 +##荠 +##荡 +##荣 +##荤 +##荥 +##荧 +##荨 +##荪 +##荫 +##药 +##荳 +##荷 +##荸 +##荻 +##荼 +##荽 +##莅 +##莆 +##莉 +##莊 +##莎 +##莒 +##莓 +##莖 +##莘 +##莞 +##莠 +##莢 +##莧 +##莪 +##莫 +##莱 +##莲 +##莴 +##获 
+##莹 +##莺 +##莽 +##莿 +##菀 +##菁 +##菅 +##菇 +##菈 +##菊 +##菌 +##菏 +##菓 +##菖 +##菘 +##菜 +##菟 +##菠 +##菡 +##菩 +##華 +##菱 +##菲 +##菸 +##菽 +##萁 +##萃 +##萄 +##萊 +##萋 +##萌 +##萍 +##萎 +##萘 +##萝 +##萤 +##营 +##萦 +##萧 +##萨 +##萩 +##萬 +##萱 +##萵 +##萸 +##萼 +##落 +##葆 +##葉 +##著 +##葚 +##葛 +##葡 +##董 +##葦 +##葩 +##葫 +##葬 +##葭 +##葯 +##葱 +##葳 +##葵 +##葷 +##葺 +##蒂 +##蒋 +##蒐 +##蒔 +##蒙 +##蒜 +##蒞 +##蒟 +##蒡 +##蒨 +##蒲 +##蒸 +##蒹 +##蒻 +##蒼 +##蒿 +##蓁 +##蓄 +##蓆 +##蓉 +##蓋 +##蓑 +##蓓 +##蓖 +##蓝 +##蓟 +##蓦 +##蓬 +##蓮 +##蓼 +##蓿 +##蔑 +##蔓 +##蔔 +##蔗 +##蔘 +##蔚 +##蔡 +##蔣 +##蔥 +##蔫 +##蔬 +##蔭 +##蔵 +##蔷 +##蔺 +##蔻 +##蔼 +##蔽 +##蕁 +##蕃 +##蕈 +##蕉 +##蕊 +##蕎 +##蕙 +##蕤 +##蕨 +##蕩 +##蕪 +##蕭 +##蕲 +##蕴 +##蕻 +##蕾 +##薄 +##薅 +##薇 +##薈 +##薊 +##薏 +##薑 +##薔 +##薙 +##薛 +##薦 +##薨 +##薩 +##薪 +##薬 +##薯 +##薰 +##薹 +##藉 +##藍 +##藏 +##藐 +##藓 +##藕 +##藜 +##藝 +##藤 +##藥 +##藩 +##藹 +##藻 +##藿 +##蘆 +##蘇 +##蘊 +##蘋 +##蘑 +##蘚 +##蘭 +##蘸 +##蘼 +##蘿 +##虎 +##虏 +##虐 +##虑 +##虔 +##處 +##虚 +##虛 +##虜 +##虞 +##號 +##虢 +##虧 +##虫 +##虬 +##虱 +##虹 +##虻 +##虽 +##虾 +##蚀 +##蚁 +##蚂 +##蚊 +##蚌 +##蚓 +##蚕 +##蚜 +##蚝 +##蚣 +##蚤 +##蚩 +##蚪 +##蚯 +##蚱 +##蚵 +##蛀 +##蛆 +##蛇 +##蛊 +##蛋 +##蛎 +##蛐 +##蛔 +##蛙 +##蛛 +##蛟 +##蛤 +##蛭 +##蛮 +##蛰 +##蛳 +##蛹 +##蛻 +##蛾 +##蜀 +##蜂 +##蜃 +##蜆 +##蜇 +##蜈 +##蜊 +##蜍 +##蜒 +##蜓 +##蜕 +##蜗 +##蜘 +##蜚 +##蜜 +##蜡 +##蜢 +##蜥 +##蜱 +##蜴 +##蜷 +##蜻 +##蜿 +##蝇 +##蝈 +##蝉 +##蝌 +##蝎 +##蝕 +##蝗 +##蝙 +##蝟 +##蝠 +##蝦 +##蝨 +##蝴 +##蝶 +##蝸 +##蝼 +##螂 +##螃 +##融 +##螞 +##螢 +##螨 +##螯 +##螳 +##螺 +##蟀 +##蟄 +##蟆 +##蟋 +##蟎 +##蟑 +##蟒 +##蟠 +##蟬 +##蟲 +##蟹 +##蟻 +##蟾 +##蠅 +##蠍 +##蠔 +##蠕 +##蠛 +##蠟 +##蠡 +##蠢 +##蠣 +##蠱 +##蠶 +##蠹 +##蠻 +##血 +##衄 +##衅 +##衆 +##行 +##衍 +##術 +##衔 +##街 +##衙 +##衛 +##衝 +##衞 +##衡 +##衢 +##衣 +##补 +##表 +##衩 +##衫 +##衬 +##衮 +##衰 +##衲 +##衷 +##衹 +##衾 +##衿 +##袁 +##袂 +##袄 +##袅 +##袈 +##袋 +##袍 +##袒 +##袖 +##袜 +##袞 +##袤 +##袪 +##被 +##袭 +##袱 +##裁 +##裂 +##装 +##裆 +##裊 +##裏 +##裔 +##裕 +##裘 +##裙 +##補 +##裝 +##裟 +##裡 +##裤 +##裨 +##裱 +##裳 +##裴 +##裸 +##裹 +##製 +##裾 +##褂 +##複 +##褐 +##褒 +##褓 +##褔 +##褚 +##褥 +##褪 +##褫 +##褲 +##褶 +##褻 +##襁 +##襄 +##襟 +##襠 +##襪 +##襬 +##襯 +##襲 +##西 +##要 +##覃 +##覆 +##覇 +##見 +##規 +##覓 +##視 +##覚 +##覦 +##覧 +##親 +##覬 +##観 +##覷 +##覺 +##覽 +##觀 +##见 +##观 +##规 +##觅 +##视 +##览 +##觉 +##觊 +##觎 +##觐 +##觑 +##角 +##觞 +##解 +##觥 +##触 +##觸 +##言 +##訂 +##計 +##訊 +##討 +##訓 +##訕 +##訖 +##託 +##記 +##訛 +##訝 +##訟 +##訣 +##訥 +##訪 +##設 +##許 +##訳 +##訴 +##訶 +##診 +##註 +##証 +##詆 +##詐 +##詔 +##評 +##詛 +##詞 +##詠 +##詡 +##詢 +##詣 +##試 +##詩 +##詫 +##詬 +##詭 +##詮 +##詰 +##話 +##該 +##詳 +##詹 +##詼 +##誅 +##誇 +##誉 +##誌 +##認 +##誓 +##誕 +##誘 +##語 +##誠 +##誡 +##誣 +##誤 +##誥 +##誦 +##誨 +##說 +##説 +##読 +##誰 +##課 +##誹 +##誼 +##調 +##諄 +##談 +##請 +##諏 +##諒 +##論 +##諗 +##諜 +##諡 +##諦 +##諧 +##諫 +##諭 +##諮 +##諱 +##諳 +##諷 +##諸 +##諺 +##諾 +##謀 +##謁 +##謂 +##謄 +##謊 +##謎 +##謐 +##謔 +##謗 +##謙 +##講 +##謝 +##謠 +##謨 +##謬 +##謹 +##謾 +##譁 +##證 +##譎 +##譏 +##識 +##譙 +##譚 +##譜 +##警 +##譬 +##譯 +##議 +##譲 +##譴 +##護 +##譽 +##讀 +##變 +##讓 +##讚 +##讞 +##计 +##订 +##认 +##讥 +##讧 +##讨 +##让 +##讪 +##讫 +##训 +##议 +##讯 +##记 +##讲 +##讳 +##讴 +##讶 +##讷 +##许 +##讹 +##论 +##讼 +##讽 +##设 +##访 +##诀 +##证 +##诃 +##评 +##诅 +##识 +##诈 +##诉 +##诊 +##诋 +##词 +##诏 +##译 +##试 +##诗 +##诘 +##诙 +##诚 +##诛 +##话 +##诞 +##诟 +##诠 +##诡 +##询 +##诣 +##诤 +##该 +##详 +##诧 +##诩 +##诫 +##诬 +##语 +##误 +##诰 +##诱 +##诲 +##说 +##诵 +##诶 +##请 +##诸 +##诺 +##读 +##诽 +##课 +##诿 +##谀 +##谁 +##调 +##谄 +##谅 +##谆 +##谈 +##谊 +##谋 +##谌 +##谍 +##谎 +##谏 +##谐 +##谑 +##谒 +##谓 +##谔 +##谕 +##谗 +##谘 +##谙 +##谚 +##谛 +##谜 +##谟 +##谢 +##谣 +##谤 +##谥 +##谦 +##谧 +##谨 +##谩 +##谪 +##谬 +##谭 +##谯 +##谱 +##谲 +##谴 +##谶 +##谷 +##豁 +##豆 +##豇 +##豈 +##豉 +##豊 +##豌 +##豎 +##豐 +##豔 +##豚 +##象 +##豢 +##豪 +##豫 +##豬 +##豹 +##豺 +##貂 +##貅 +##貌 +##貓 +##貔 +##貘 +##貝 +##貞 +##負 +##財 +##貢 +##貧 +##貨 +##販 +##貪 +##貫 +##責 +##貯 +##貰 +##貳 +##貴 +##貶 +##買 +##貸 
+##費 +##貼 +##貽 +##貿 +##賀 +##賁 +##賂 +##賃 +##賄 +##資 +##賈 +##賊 +##賑 +##賓 +##賜 +##賞 +##賠 +##賡 +##賢 +##賣 +##賤 +##賦 +##質 +##賬 +##賭 +##賴 +##賺 +##購 +##賽 +##贅 +##贈 +##贊 +##贍 +##贏 +##贓 +##贖 +##贛 +##贝 +##贞 +##负 +##贡 +##财 +##责 +##贤 +##败 +##账 +##货 +##质 +##贩 +##贪 +##贫 +##贬 +##购 +##贮 +##贯 +##贰 +##贱 +##贲 +##贴 +##贵 +##贷 +##贸 +##费 +##贺 +##贻 +##贼 +##贾 +##贿 +##赁 +##赂 +##赃 +##资 +##赅 +##赈 +##赊 +##赋 +##赌 +##赎 +##赏 +##赐 +##赓 +##赔 +##赖 +##赘 +##赚 +##赛 +##赝 +##赞 +##赠 +##赡 +##赢 +##赣 +##赤 +##赦 +##赧 +##赫 +##赭 +##走 +##赳 +##赴 +##赵 +##赶 +##起 +##趁 +##超 +##越 +##趋 +##趕 +##趙 +##趟 +##趣 +##趨 +##足 +##趴 +##趵 +##趸 +##趺 +##趾 +##跃 +##跄 +##跆 +##跋 +##跌 +##跎 +##跑 +##跖 +##跚 +##跛 +##距 +##跟 +##跡 +##跤 +##跨 +##跩 +##跪 +##路 +##跳 +##践 +##跷 +##跹 +##跺 +##跻 +##踉 +##踊 +##踌 +##踏 +##踐 +##踝 +##踞 +##踟 +##踢 +##踩 +##踪 +##踮 +##踱 +##踴 +##踵 +##踹 +##蹂 +##蹄 +##蹇 +##蹈 +##蹉 +##蹊 +##蹋 +##蹑 +##蹒 +##蹙 +##蹟 +##蹣 +##蹤 +##蹦 +##蹩 +##蹬 +##蹭 +##蹲 +##蹴 +##蹶 +##蹺 +##蹼 +##蹿 +##躁 +##躇 +##躉 +##躊 +##躋 +##躍 +##躏 +##躪 +##身 +##躬 +##躯 +##躲 +##躺 +##軀 +##車 +##軋 +##軌 +##軍 +##軒 +##軟 +##転 +##軸 +##軼 +##軽 +##軾 +##較 +##載 +##輒 +##輓 +##輔 +##輕 +##輛 +##輝 +##輟 +##輩 +##輪 +##輯 +##輸 +##輻 +##輾 +##輿 +##轄 +##轅 +##轆 +##轉 +##轍 +##轎 +##轟 +##车 +##轧 +##轨 +##轩 +##转 +##轭 +##轮 +##软 +##轰 +##轲 +##轴 +##轶 +##轻 +##轼 +##载 +##轿 +##较 +##辄 +##辅 +##辆 +##辇 +##辈 +##辉 +##辊 +##辍 +##辐 +##辑 +##输 +##辕 +##辖 +##辗 +##辘 +##辙 +##辛 +##辜 +##辞 +##辟 +##辣 +##辦 +##辨 +##辩 +##辫 +##辭 +##辮 +##辯 +##辰 +##辱 +##農 +##边 +##辺 +##辻 +##込 +##辽 +##达 +##迁 +##迂 +##迄 +##迅 +##过 +##迈 +##迎 +##运 +##近 +##返 +##还 +##这 +##进 +##远 +##违 +##连 +##迟 +##迢 +##迤 +##迥 +##迦 +##迩 +##迪 +##迫 +##迭 +##述 +##迴 +##迷 +##迸 +##迹 +##迺 +##追 +##退 +##送 +##适 +##逃 +##逅 +##逆 +##选 +##逊 +##逍 +##透 +##逐 +##递 +##途 +##逕 +##逗 +##這 +##通 +##逛 +##逝 +##逞 +##速 +##造 +##逢 +##連 +##逮 +##週 +##進 +##逵 +##逶 +##逸 +##逻 +##逼 +##逾 +##遁 +##遂 +##遅 +##遇 +##遊 +##運 +##遍 +##過 +##遏 +##遐 +##遑 +##遒 +##道 +##達 +##違 +##遗 +##遙 +##遛 +##遜 +##遞 +##遠 +##遢 +##遣 +##遥 +##遨 +##適 +##遭 +##遮 +##遲 +##遴 +##遵 +##遶 +##遷 +##選 +##遺 +##遼 +##遽 +##避 +##邀 +##邁 +##邂 +##邃 +##還 +##邇 +##邈 +##邊 +##邋 +##邏 +##邑 +##邓 +##邕 +##邛 +##邝 +##邢 +##那 +##邦 +##邨 +##邪 +##邬 +##邮 +##邯 +##邰 +##邱 +##邳 +##邵 +##邸 +##邹 +##邺 +##邻 +##郁 +##郅 +##郊 +##郎 +##郑 +##郜 +##郝 +##郡 +##郢 +##郤 +##郦 +##郧 +##部 +##郫 +##郭 +##郴 +##郵 +##郷 +##郸 +##都 +##鄂 +##鄉 +##鄒 +##鄔 +##鄙 +##鄞 +##鄢 +##鄧 +##鄭 +##鄰 +##鄱 +##鄲 +##鄺 +##酉 +##酊 +##酋 +##酌 +##配 +##酐 +##酒 +##酗 +##酚 +##酝 +##酢 +##酣 +##酥 +##酩 +##酪 +##酬 +##酮 +##酯 +##酰 +##酱 +##酵 +##酶 +##酷 +##酸 +##酿 +##醃 +##醇 +##醉 +##醋 +##醍 +##醐 +##醒 +##醚 +##醛 +##醜 +##醞 +##醣 +##醪 +##醫 +##醬 +##醮 +##醯 +##醴 +##醺 +##釀 +##釁 +##采 +##釉 +##释 +##釋 +##里 +##重 +##野 +##量 +##釐 +##金 +##釗 +##釘 +##釜 +##針 +##釣 +##釦 +##釧 +##釵 +##鈀 +##鈉 +##鈍 +##鈎 +##鈔 +##鈕 +##鈞 +##鈣 +##鈦 +##鈪 +##鈴 +##鈺 +##鈾 +##鉀 +##鉄 +##鉅 +##鉉 +##鉑 +##鉗 +##鉚 +##鉛 +##鉤 +##鉴 +##鉻 +##銀 +##銃 +##銅 +##銑 +##銓 +##銖 +##銘 +##銜 +##銬 +##銭 +##銮 +##銳 +##銷 +##銹 +##鋁 +##鋅 +##鋒 +##鋤 +##鋪 +##鋰 +##鋸 +##鋼 +##錄 +##錐 +##錘 +##錚 +##錠 +##錢 +##錦 +##錨 +##錫 +##錮 +##錯 +##録 +##錳 +##錶 +##鍊 +##鍋 +##鍍 +##鍛 +##鍥 +##鍰 +##鍵 +##鍺 +##鍾 +##鎂 +##鎊 +##鎌 +##鎏 +##鎔 +##鎖 +##鎗 +##鎚 +##鎧 +##鎬 +##鎮 +##鎳 +##鏈 +##鏖 +##鏗 +##鏘 +##鏞 +##鏟 +##鏡 +##鏢 +##鏤 +##鏽 +##鐘 +##鐮 +##鐲 +##鐳 +##鐵 +##鐸 +##鐺 +##鑄 +##鑊 +##鑑 +##鑒 +##鑣 +##鑫 +##鑰 +##鑲 +##鑼 +##鑽 +##鑾 +##鑿 +##针 +##钉 +##钊 +##钎 +##钏 +##钒 +##钓 +##钗 +##钙 +##钛 +##钜 +##钝 +##钞 +##钟 +##钠 +##钡 +##钢 +##钣 +##钤 +##钥 +##钦 +##钧 +##钨 +##钩 +##钮 +##钯 +##钰 +##钱 +##钳 +##钴 +##钵 +##钺 +##钻 +##钼 +##钾 +##钿 +##铀 +##铁 +##铂 +##铃 +##铄 +##铅 +##铆 +##铉 +##铎 +##铐 +##铛 +##铜 +##铝 +##铠 +##铡 +##铢 +##铣 +##铤 +##铨 +##铩 +##铬 +##铭 +##铮 +##铰 +##铲 +##铵 +##银 +##铸 +##铺 +##链 +##铿 +##销 +##锁 +##锂 +##锄 +##锅 +##锆 +##锈 +##锉 +##锋 +##锌 +##锏 +##锐 +##锑 +##错 +##锚 +##锟 +##锡 +##锢 +##锣 +##锤 
+##锥 +##锦 +##锭 +##键 +##锯 +##锰 +##锲 +##锵 +##锹 +##锺 +##锻 +##镀 +##镁 +##镂 +##镇 +##镉 +##镌 +##镍 +##镐 +##镑 +##镕 +##镖 +##镗 +##镛 +##镜 +##镣 +##镭 +##镯 +##镰 +##镳 +##镶 +##長 +##长 +##門 +##閃 +##閉 +##開 +##閎 +##閏 +##閑 +##閒 +##間 +##閔 +##閘 +##閡 +##関 +##閣 +##閥 +##閨 +##閩 +##閱 +##閲 +##閹 +##閻 +##閾 +##闆 +##闇 +##闊 +##闌 +##闍 +##闔 +##闕 +##闖 +##闘 +##關 +##闡 +##闢 +##门 +##闪 +##闫 +##闭 +##问 +##闯 +##闰 +##闲 +##间 +##闵 +##闷 +##闸 +##闹 +##闺 +##闻 +##闽 +##闾 +##阀 +##阁 +##阂 +##阅 +##阆 +##阇 +##阈 +##阉 +##阎 +##阐 +##阑 +##阔 +##阕 +##阖 +##阙 +##阚 +##阜 +##队 +##阡 +##阪 +##阮 +##阱 +##防 +##阳 +##阴 +##阵 +##阶 +##阻 +##阿 +##陀 +##陂 +##附 +##际 +##陆 +##陇 +##陈 +##陋 +##陌 +##降 +##限 +##陕 +##陛 +##陝 +##陞 +##陟 +##陡 +##院 +##陣 +##除 +##陨 +##险 +##陪 +##陰 +##陲 +##陳 +##陵 +##陶 +##陷 +##陸 +##険 +##陽 +##隅 +##隆 +##隈 +##隊 +##隋 +##隍 +##階 +##随 +##隐 +##隔 +##隕 +##隘 +##隙 +##際 +##障 +##隠 +##隣 +##隧 +##隨 +##險 +##隱 +##隴 +##隶 +##隸 +##隻 +##隼 +##隽 +##难 +##雀 +##雁 +##雄 +##雅 +##集 +##雇 +##雉 +##雋 +##雌 +##雍 +##雎 +##雏 +##雑 +##雒 +##雕 +##雖 +##雙 +##雛 +##雜 +##雞 +##離 +##難 +##雨 +##雪 +##雯 +##雰 +##雲 +##雳 +##零 +##雷 +##雹 +##電 +##雾 +##需 +##霁 +##霄 +##霆 +##震 +##霈 +##霉 +##霊 +##霍 +##霎 +##霏 +##霑 +##霓 +##霖 +##霜 +##霞 +##霧 +##霭 +##霰 +##露 +##霸 +##霹 +##霽 +##霾 +##靂 +##靄 +##靈 +##青 +##靓 +##靖 +##静 +##靚 +##靛 +##靜 +##非 +##靠 +##靡 +##面 +##靥 +##靦 +##革 +##靳 +##靴 +##靶 +##靼 +##鞅 +##鞋 +##鞍 +##鞏 +##鞑 +##鞘 +##鞠 +##鞣 +##鞦 +##鞭 +##韆 +##韋 +##韌 +##韓 +##韜 +##韦 +##韧 +##韩 +##韬 +##韭 +##音 +##韵 +##韶 +##韻 +##響 +##頁 +##頂 +##頃 +##項 +##順 +##須 +##頌 +##預 +##頑 +##頒 +##頓 +##頗 +##領 +##頜 +##頡 +##頤 +##頫 +##頭 +##頰 +##頷 +##頸 +##頹 +##頻 +##頼 +##顆 +##題 +##額 +##顎 +##顏 +##顔 +##願 +##顛 +##類 +##顧 +##顫 +##顯 +##顱 +##顴 +##页 +##顶 +##顷 +##项 +##顺 +##须 +##顼 +##顽 +##顾 +##顿 +##颁 +##颂 +##预 +##颅 +##领 +##颇 +##颈 +##颉 +##颊 +##颌 +##颍 +##颐 +##频 +##颓 +##颔 +##颖 +##颗 +##题 +##颚 +##颛 +##颜 +##额 +##颞 +##颠 +##颡 +##颢 +##颤 +##颦 +##颧 +##風 +##颯 +##颱 +##颳 +##颶 +##颼 +##飄 +##飆 +##风 +##飒 +##飓 +##飕 +##飘 +##飙 +##飚 +##飛 +##飞 +##食 +##飢 +##飨 +##飩 +##飪 +##飯 +##飲 +##飼 +##飽 +##飾 +##餃 +##餅 +##餉 +##養 +##餌 +##餐 +##餒 +##餓 +##餘 +##餚 +##餛 +##餞 +##餡 +##館 +##餮 +##餵 +##餾 +##饅 +##饈 +##饋 +##饌 +##饍 +##饑 +##饒 +##饕 +##饗 +##饞 +##饥 +##饨 +##饪 +##饬 +##饭 +##饮 +##饯 +##饰 +##饱 +##饲 +##饴 +##饵 +##饶 +##饷 +##饺 +##饼 +##饽 +##饿 +##馀 +##馁 +##馄 +##馅 +##馆 +##馈 +##馋 +##馍 +##馏 +##馒 +##馔 +##首 +##馗 +##香 +##馥 +##馨 +##馬 +##馭 +##馮 +##馳 +##馴 +##駁 +##駄 +##駅 +##駆 +##駐 +##駒 +##駕 +##駛 +##駝 +##駭 +##駱 +##駿 +##騁 +##騎 +##騏 +##験 +##騙 +##騨 +##騰 +##騷 +##驀 +##驅 +##驊 +##驍 +##驒 +##驕 +##驗 +##驚 +##驛 +##驟 +##驢 +##驥 +##马 +##驭 +##驮 +##驯 +##驰 +##驱 +##驳 +##驴 +##驶 +##驷 +##驸 +##驹 +##驻 +##驼 +##驾 +##驿 +##骁 +##骂 +##骄 +##骅 +##骆 +##骇 +##骈 +##骊 +##骋 +##验 +##骏 +##骐 +##骑 +##骗 +##骚 +##骛 +##骜 +##骞 +##骠 +##骡 +##骤 +##骥 +##骧 +##骨 +##骯 +##骰 +##骶 +##骷 +##骸 +##骼 +##髂 +##髅 +##髋 +##髏 +##髒 +##髓 +##體 +##髖 +##高 +##髦 +##髪 +##髮 +##髯 +##髻 +##鬃 +##鬆 +##鬍 +##鬓 +##鬚 +##鬟 +##鬢 +##鬣 +##鬥 +##鬧 +##鬱 +##鬼 +##魁 +##魂 +##魄 +##魅 +##魇 +##魍 +##魏 +##魔 +##魘 +##魚 +##魯 +##魷 +##鮑 +##鮨 +##鮪 +##鮭 +##鮮 +##鯉 +##鯊 +##鯖 +##鯛 +##鯨 +##鯰 +##鯽 +##鰍 +##鰓 +##鰭 +##鰲 +##鰻 +##鰾 +##鱈 +##鱉 +##鱔 +##鱗 +##鱷 +##鱸 +##鱼 +##鱿 +##鲁 +##鲈 +##鲍 +##鲑 +##鲛 +##鲜 +##鲟 +##鲢 +##鲤 +##鲨 +##鲫 +##鲱 +##鲲 +##鲶 +##鲷 +##鲸 +##鳃 +##鳄 +##鳅 +##鳌 +##鳍 +##鳕 +##鳖 +##鳗 +##鳝 +##鳞 +##鳥 +##鳩 +##鳳 +##鳴 +##鳶 +##鴉 +##鴕 +##鴛 +##鴦 +##鴨 +##鴻 +##鴿 +##鵑 +##鵜 +##鵝 +##鵡 +##鵬 +##鵰 +##鵲 +##鶘 +##鶩 +##鶯 +##鶴 +##鷗 +##鷲 +##鷹 +##鷺 +##鸚 +##鸞 +##鸟 +##鸠 +##鸡 +##鸢 +##鸣 +##鸥 +##鸦 +##鸨 +##鸪 +##鸭 +##鸯 +##鸳 +##鸵 +##鸽 +##鸾 +##鸿 +##鹂 +##鹃 +##鹄 +##鹅 +##鹈 +##鹉 +##鹊 +##鹌 +##鹏 +##鹑 +##鹕 +##鹘 +##鹜 +##鹞 +##鹤 +##鹦 +##鹧 +##鹫 +##鹭 +##鹰 +##鹳 +##鹵 +##鹹 +##鹼 +##鹽 +##鹿 +##麂 +##麋 +##麒 +##麓 +##麗 +##麝 +##麟 +##麥 +##麦 +##麩 +##麴 +##麵 +##麸 +##麺 +##麻 +##麼 +##麽 +##麾 +##黃 +##黄 +##黍 +##黎 +##黏 +##黑 +##黒 +##黔 
+##默 +##黛 +##黜 +##黝 +##點 +##黠 +##黨 +##黯 +##黴 +##鼋 +##鼎 +##鼐 +##鼓 +##鼠 +##鼬 +##鼹 +##鼻 +##鼾 +##齁 +##齊 +##齋 +##齐 +##齒 +##齡 +##齢 +##齣 +##齦 +##齿 +##龄 +##龅 +##龈 +##龊 +##龋 +##龌 +##龍 +##龐 +##龔 +##龕 +##龙 +##龚 +##龛 +##龜 +##龟 +##︰ +##︱ +##︶ +##︿ +##﹁ +##﹂ +##﹍ +##﹏ +##﹐ +##﹑ +##﹒ +##﹔ +##﹕ +##﹖ +##﹗ +##﹙ +##﹚ +##﹝ +##﹞ +##﹡ +##﹣ +##! +##" +### +##$ +##% +##& +##' +##( +##) +##* +##, +##- +##. +##/ +##: +##; +##< +##? +##@ +##[ +##\ +##] +##^ +##_ +##` +##f +##h +##j +##u +##w +##z +##{ +##} +##。 +##「 +##」 +##、 +##・ +##ッ +##ー +##イ +##ク +##シ +##ス +##ト +##ノ +##フ +##ラ +##ル +##ン +##゙ +##゚ +## ̄ +##¥ +##👍 +##🔥 +##😂 +##😎 diff --git a/fengshen/workspace/readme.md b/fengshen/workspace/readme.md new file mode 100644 index 0000000000000000000000000000000000000000..ddeb5b856454074f86c9c3d079377cc58859897f --- /dev/null +++ b/fengshen/workspace/readme.md @@ -0,0 +1,3 @@ +# Readme + +这个目录主要用来存放训练中产生的日志文件、Checkpoint,以及一些examples初始化时需要的配置文件。 diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..e08c4834eaad45bb4accce382195722276d6deef --- /dev/null +++ b/requirements.txt @@ -0,0 +1,3 @@ +transformers +torch +gradio