HaloMaster committed on
Commit
50f0fbb
1 Parent(s): 00cd7be

add fengshen

This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
Files changed (50)
  1. fengshen/README.md +105 -0
  2. fengshen/__init__.py +19 -0
  3. fengshen/cli/fengshen_pipeline.py +34 -0
  4. fengshen/data/__init__.py +1 -0
  5. fengshen/data/bert_dataloader/auto_split.sh +10 -0
  6. fengshen/data/bert_dataloader/load.py +200 -0
  7. fengshen/data/bert_dataloader/preprocessing.py +110 -0
  8. fengshen/data/clip_dataloader/flickr.py +105 -0
  9. fengshen/data/data_utils/common_utils.py +4 -0
  10. fengshen/data/data_utils/mask_utils.py +285 -0
  11. fengshen/data/data_utils/sentence_split.py +35 -0
  12. fengshen/data/data_utils/sop_utils.py +32 -0
  13. fengshen/data/data_utils/token_type_utils.py +25 -0
  14. fengshen/data/data_utils/truncate_utils.py +19 -0
  15. fengshen/data/hubert/hubert_dataset.py +361 -0
  16. fengshen/data/megatron_dataloader/Makefile +9 -0
  17. fengshen/data/megatron_dataloader/__init__.py +1 -0
  18. fengshen/data/megatron_dataloader/bart_dataset.py +443 -0
  19. fengshen/data/megatron_dataloader/bert_dataset.py +196 -0
  20. fengshen/data/megatron_dataloader/blendable_dataset.py +64 -0
  21. fengshen/data/megatron_dataloader/dataset_utils.py +788 -0
  22. fengshen/data/megatron_dataloader/helpers.cpp +794 -0
  23. fengshen/data/megatron_dataloader/indexed_dataset.py +585 -0
  24. fengshen/data/megatron_dataloader/utils.py +24 -0
  25. fengshen/data/mmap_dataloader/mmap_datamodule.py +68 -0
  26. fengshen/data/mmap_dataloader/mmap_index_dataset.py +53 -0
  27. fengshen/data/preprocess.py +1 -0
  28. fengshen/data/t5_dataloader/t5_datasets.py +562 -0
  29. fengshen/data/task_dataloader/__init__.py +3 -0
  30. fengshen/data/task_dataloader/medicalQADataset.py +137 -0
  31. fengshen/data/task_dataloader/task_datasets.py +206 -0
  32. fengshen/data/universal_datamodule/__init__.py +4 -0
  33. fengshen/data/universal_datamodule/universal_datamodule.py +161 -0
  34. fengshen/data/universal_datamodule/universal_sampler.py +125 -0
  35. fengshen/examples/FastDemo/README.md +105 -0
  36. fengshen/examples/FastDemo/YuyuanQA.py +71 -0
  37. fengshen/examples/FastDemo/image/demo.png +0 -0
  38. fengshen/examples/classification/demo_classification_afqmc_erlangshen_offload.sh +103 -0
  39. fengshen/examples/classification/demo_classification_afqmc_roberta.sh +62 -0
  40. fengshen/examples/classification/demo_classification_afqmc_roberta_deepspeed.sh +90 -0
  41. fengshen/examples/classification/finetune_classification.py +389 -0
  42. fengshen/examples/classification/finetune_classification.sh +75 -0
  43. fengshen/examples/classification/finetune_classification_bart-base_afqmc.sh +143 -0
  44. fengshen/examples/classification/finetune_classification_bart-base_ocnli.sh +143 -0
  45. fengshen/examples/classification/finetune_classification_bert-3.9B_afqmc.sh +146 -0
  46. fengshen/examples/classification/finetune_classification_bert-3.9B_cmnli.sh +161 -0
  47. fengshen/examples/classification/finetune_classification_bert-3.9B_iflytek.sh +158 -0
  48. fengshen/examples/classification/finetune_classification_bert-3.9B_ocnli.sh +163 -0
  49. fengshen/examples/classification/finetune_classification_bert-3.9B_tnews.sh +161 -0
  50. fengshen/examples/classification/finetune_classification_bert-3.9B_wsc.sh +158 -0
fengshen/README.md ADDED
@@ -0,0 +1,105 @@
+ ## Latest Releases
+
+ * \[2022.09.13\] [Updated pretraining code for the ErLangShen DeBERTa series](https://huggingface.co/IDEA-CCNL/Erlangshen-DeBERTa-v2-97M-Chinese)
+ * \[2022.09.13\] [Updated pretraining code for the RanDeng BART series](https://huggingface.co/IDEA-CCNL/Randeng-BART-139M)
+ * \[2022.09.13\] [Updated pretraining code for the ErLangShen BERT series](https://huggingface.co/IDEA-CCNL/Erlangshen-MegatronBert-1.3B)
+ * \[2022.05.11\] [Updated the TaiYi ViT multimodal models and downstream task examples](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/太乙系列/Taiyi-vit-87M-D.html)
+ * \[2022.05.11\] [Updated the BiGan Transformer-XL denoising model and downstream task examples](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/比干系列/Bigan-Transformer-XL-denoise-1.1B.html)
+ * \[2022.05.11\] [Updated the ErLangShen downstream task examples](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/二郎神系列/Erlangshen-Roberta-110M-NLI.html)
+
+ # Navigation
+
+ - [Navigation](#navigation)
+ - [Framework Overview](#framework-overview)
+ - [Requirements](#requirements)
+ - [Project Structure](#project-structure)
+ - [Design](#design)
+ - [Downstream Classification Tasks](#downstream-classification-tasks)
+
+ ## Framework Overview
+
+ The FengShen training framework is a key part of the Fengshenbang open-source large-model project and plays a central role in both producing and applying large models. FengShen covers pretraining on massive datasets as well as finetuning on a wide range of downstream tasks. Fengshenbang focuses on open-sourcing large NLP models, but growing model sizes cause problems not only for training but also for everyday use. To address both, FengShen draws on the best existing open-source solutions and redesigns the pipeline, so that users can pick any of the rich set of pretrained models from Fengshenbang and use FengShen to finetune downstream tasks quickly.
+
+ All examples and documentation are available in our [Wiki](https://fengshenbang-doc.readthedocs.io/zh/latest/index.html).
+ All models can be found on our [Huggingface page](https://huggingface.co/IDEA-CCNL).
+
+ With our framework you get:
+
+ 1. Better performance than native torch, with up to <font color=#0000FF >**300%**</font> faster training
+ 2. Support for larger models, up to the <font color=#0000FF >**tens-of-billions parameter**</font> scale for training and finetuning
+ 3. Support for <font color=#0000FF >**TB-scale**</font> datasets, so even a home machine benefits from pretrained models
+ 4. Rich pretraining and downstream task examples that start training with a single command
+ 5. Adaptation to diverse environments, running on CPU, GPU, TPU and other devices
+ 6. Built-in mainstream distributed training logic, supporting DDP, Zero Optimizer and other distributed optimizations without code changes
+
+ ![avatar](../pics/fengshen_pic.png)
+
+ ## Requirements
+
+ * Python >= 3.8
+ * torch >= 1.8
+ * transformers >= 3.2.0
+ * pytorch-lightning >= 1.5.10
+
+ From the Fengshenbang-LM root directory:
+ pip install --editable ./
+
+ ## Project Structure
+
+ ```
+ ├── data                     # multiple data-processing pipelines and datasets
+ │   ├── cbart_dataloader
+ |   ├── fs_datasets          # wrappers around transformers datasets, adding Chinese datasets (open-sourcing in progress)
+ |   ├── universal_datamodule # bridges fs_datasets and the lightning datamodule to avoid duplicated work
+ │   ├── megatron_dataloader  # TB-scale dataset processing and training based on Megatron
+ │   ├── mmap_dataloader      # generic memmap-style data loading
+ │   └── task_dataloader      # loaders for various downstream tasks
+ ├── examples                 # rich examples, from pretraining all the way to downstream tasks
+ ├── metric                   # metric computation, including user-defined metrics
+ ├── losses                   # custom losses for specialised needs
+ ├── tokenizer                # custom tokenizers, e.g. our SentencePiece training code
+ ├── models                   # model zoo
+ │   ├── auto                 # automatic model class resolution
+ │   ├── bart
+ │   ├── longformer
+ │   ├── megatron_t5
+ │   └── roformer
+ └── utils                    # utility functions
+ ```
+
+ ## Design
+
+ The FengShen framework is currently built on Pytorch-Lightning & Transformers. On top of this foundation we continuously open-source Chinese pretrained models and provide rich examples; every Fengshenbang model comes with matching pretraining and downstream-task code.
+
+ Development on FengShen generally follows the three steps below (a minimal sketch in code follows this README):
+
+ 1. Wrap the data-processing pipeline -> pytorch_lightning.LightningDataModule
+ 2. Wrap the model structure -> pytorch_lightning.LightningModule
+ 3. Configure plugins such as log_monitor, checkpoint_callback, etc.
+
+ For a complete demo, see the Randeng-BART example -> [docs](https://fengshenbang-doc.readthedocs.io/zh/latest/docs/燃灯系列/BART-139M.html) [code](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/hf-ds/fengshen/examples/pretrain_bart)
+
+ ## Downstream Classification Tasks
+
+ The examples/classification directory contains a rich set of classification examples, including three one-click scripts:
+
+ * demo_classification_afqmc_roberta.sh finetunes roberta with DDP
+ * demo_classification_afqmc_roberta_deepspeed.sh finetunes roberta with deepspeed for faster training
+ * demo_classification_afqmc_erlangshen_offload.sh finetunes our best-performing ErLangShen models with only 7 GB of GPU memory
+
+ All of the above use the AFQMC dataset; an introduction to the dataset can be found [here](https://www.cluebenchmarks.com/introduce.html).
+ Our preprocessed data files are also hosted on Huggingface; click [here](https://huggingface.co/datasets/IDEA-CCNL/AFQMC) for the source files.
+ Other downstream classification tasks can be adapted by lightly preprocessing their datasets into the same format.
+ In the example scripts, only the following arguments need to change to point at local files:
+
+ ```
+ --dataset_name IDEA-CCNL/AFQMC \
+
+ -------> change to
+
+ --data_dir $DATA_DIR \       # data directory
+ --train_data train.json \    # data files
+ --valid_data dev.json \
+ --test_data test.json \
+
+ ```
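The three development steps above map directly onto plain PyTorch Lightning components. The sketch below is illustrative only: `ToyDataModule` and `ToyModel` are made-up placeholders, not models or datamodules shipped in this commit.

```python
# Minimal sketch of the three-step workflow: (1) a LightningDataModule for the data,
# (2) a LightningModule for the model, (3) plugins/callbacks passed to the Trainer.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyDataModule(pl.LightningDataModule):      # step 1: wrap data processing
    def setup(self, stage=None):
        x, y = torch.randn(64, 8), torch.randn(64, 1)
        self.ds = TensorDataset(x, y)

    def train_dataloader(self):
        return DataLoader(self.ds, batch_size=16, shuffle=True)


class ToyModel(pl.LightningModule):               # step 2: wrap the model
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == '__main__':
    # step 3: configure plugins such as checkpointing; loggers and monitors attach the same way
    checkpoint_callback = pl.callbacks.ModelCheckpoint()
    trainer = pl.Trainer(max_epochs=1, callbacks=[checkpoint_callback])
    trainer.fit(ToyModel(), datamodule=ToyDataModule())
```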
fengshen/__init__.py ADDED
@@ -0,0 +1,19 @@
1
+ # coding=utf-8
2
+ # Copyright 2021 The IDEA Authors. All rights reserved.
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ from .models.longformer import LongformerConfig, LongformerModel
17
+ from .models.roformer import RoFormerConfig, RoFormerModel
18
+ from .models.megatron_t5 import T5Config, T5EncoderModel
19
+ from .models.ubert import UbertPiplines, UbertModel
fengshen/cli/fengshen_pipeline.py ADDED
@@ -0,0 +1,34 @@
1
+ import sys
2
+ from importlib import import_module
3
+ from datasets import load_dataset
4
+ import argparse
5
+
6
+
7
+ def main():
8
+ if len(sys.argv) < 3:
9
+ raise Exception(
10
+ 'args len < 3, example: fengshen_pipeline text_classification predict xxxxx')
11
+ pipeline_name = sys.argv[1]
12
+ method = sys.argv[2]
13
+ pipeline_class = getattr(import_module('fengshen.pipelines.' + pipeline_name), 'Pipeline')
14
+
15
+ total_parser = argparse.ArgumentParser("FengShen Pipeline")
16
+ total_parser.add_argument('--model', default='', type=str)
17
+ total_parser.add_argument('--datasets', default='', type=str)
18
+ total_parser.add_argument('--text', default='', type=str)
19
+ total_parser = pipeline_class.add_pipeline_specific_args(total_parser)
20
+ args = total_parser.parse_args(args=sys.argv[3:])
21
+ pipeline = pipeline_class(args=args, model=args.model)
22
+
23
+ if method == 'predict':
24
+ print(pipeline(args.text))
25
+ elif method == 'train':
26
+ datasets = load_dataset(args.datasets)
27
+ pipeline.train(datasets)
28
+ else:
29
+ raise Exception(
30
+ 'cmd not support, now only support {predict, train}')
31
+
32
+
33
+ if __name__ == '__main__':
34
+ main()
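The entry point above fixes a small CLI contract: `argv[1]` selects a module under `fengshen.pipelines`, `argv[2]` selects `predict` or `train`, and everything after that goes to argparse. Below is a hedged sketch of driving it programmatically; the pipeline name `text_classification` comes only from the usage string in the error message, and the model path and input text are placeholders.

```python
# Sketch only: exercises the argv contract of fengshen_pipeline.main().
# 'text_classification', the model path and the input text are placeholders,
# not verified pipeline or checkpoint names.
import sys
from fengshen.cli import fengshen_pipeline

sys.argv = [
    'fengshen_pipeline',     # argv[0]: program name
    'text_classification',   # argv[1]: pipeline module under fengshen.pipelines
    'predict',               # argv[2]: method, one of {predict, train}
    '--model', '/path/to/local/model',
    '--text', '今天心情很好',
]
fengshen_pipeline.main()
```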
fengshen/data/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # coding=utf-8
fengshen/data/bert_dataloader/auto_split.sh ADDED
@@ -0,0 +1,10 @@
1
+ files=`find $1 -type f -size +1024M`
2
+
3
+ for p in $files
4
+ do
5
+ echo "processing $p"
6
+ name=`basename $p .json`
7
+ file=`dirname $p`
8
+ split -a 2 -C 300M $p $file/$name- && ls|grep -E "(-[a-zA-Z]{2})" |xargs -n1 -i{} mv {} {}.json
9
+ rm -f $p
10
+ done
fengshen/data/bert_dataloader/load.py ADDED
@@ -0,0 +1,200 @@
1
+ import os
2
+ import re
3
+ from pathlib import Path
4
+ import glob
5
+ from tqdm import tqdm
6
+ from contextlib import ExitStack
7
+ import datasets
8
+ import multiprocessing
9
+ from typing import cast, TextIO
10
+ from itertools import chain
11
+ import json
12
+ from concurrent.futures import ProcessPoolExecutor
13
+ from random import shuffle
14
+ from pytorch_lightning import LightningDataModule
15
+ from typing import Optional
16
+
17
+ from torch.utils.data import DataLoader
18
+
19
+
20
+ # _SPLIT_DATA_PATH = '/data1/datas/wudao_180g_split/test'
21
+ _SPLIT_DATA_PATH = '/data1/datas/wudao_180g_split'
22
+ _CACHE_SPLIT_DATA_PATH = '/data1/datas/wudao_180g_FSData'
23
+
24
+ # feats = datasets.Features({"text": datasets.Value('string')})
25
+
26
+
27
+ class BertDataGenerate(object):
28
+
29
+ def __init__(self,
30
+ data_files=_SPLIT_DATA_PATH,
31
+ save_path=_CACHE_SPLIT_DATA_PATH,
32
+ train_test_validation='950,49,1',
33
+ num_proc=1,
34
+ cache=True):
35
+ self.data_files = Path(data_files)
36
+ if save_path:
37
+ self.save_path = Path(save_path)
38
+ else:
39
+ self.save_path = self.file_check(
40
+ Path(self.data_files.parent, self.data_files.name+'_FSDataset'),
41
+ 'save')
42
+ self.num_proc = num_proc
43
+ self.cache = cache
44
+ self.split_idx = self.split_train_test_validation_index(train_test_validation)
45
+ if cache:
46
+ self.cache_path = self.file_check(
47
+ Path(self.save_path.parent, 'FSDataCache', self.data_files.name), 'cache')
48
+ else:
49
+ self.cache_path = None
50
+
51
+ @staticmethod
52
+ def file_check(path, path_type):
53
+ print(path)
54
+ if not path.exists():
55
+ path.mkdir(parents=True)
56
+ print(f"Since no {path_type} directory is specified, the program will automatically create it in {path} directory.")
57
+ return str(path)
58
+
59
+ @staticmethod
60
+ def split_train_test_validation_index(train_test_validation):
61
+ split_idx_ = [int(i) for i in train_test_validation.split(',')]
62
+ idx_dict = {
63
+ 'train_rate': split_idx_[0]/sum(split_idx_),
64
+ 'test_rate': split_idx_[1]/sum(split_idx_[1:])
65
+ }
66
+ return idx_dict
67
+
68
+ def process(self, index, path):
69
+ print('saving dataset shard {}'.format(index))
70
+
71
+ ds = (datasets.load_dataset('json', data_files=str(path),
72
+ cache_dir=self.cache_path,
73
+ features=None))
74
+ # ds = ds.map(self.cut_sent,input_columns='text')
75
+ # print(d)
76
+ # print('!!!',ds)
77
+ ds = ds['train'].train_test_split(train_size=self.split_idx['train_rate'])
78
+ ds_ = ds['test'].train_test_split(train_size=self.split_idx['test_rate'])
79
+ ds = datasets.DatasetDict({
80
+ 'train': ds['train'],
81
+ 'test': ds_['train'],
82
+ 'validation': ds_['test']
83
+ })
84
+ # print('!!!!',ds)
85
+ ds.save_to_disk(Path(self.save_path, path.name))
86
+ return 'saving dataset shard {} done'.format(index)
87
+
88
+ def generate_cache_arrow(self) -> None:
89
+ '''
90
+ Generate HF-compatible cache files to speed up subsequent loading
91
+ '''
92
+ data_dict_paths = self.data_files.rglob('*')
93
+ p = ProcessPoolExecutor(max_workers=self.num_proc)
94
+ res = list()
95
+
96
+ for index, path in enumerate(data_dict_paths):
97
+ res.append(p.submit(self.process, index, path))
98
+
99
+ p.shutdown(wait=True)
100
+ for future in res:
101
+ print(future.result(), flush=True)
102
+
103
+
104
+ def load_dataset(num_proc=4, **kargs):
105
+ cache_dict_paths = Path(_CACHE_SPLIT_DATA_PATH).glob('*')
106
+ ds = []
107
+ res = []
108
+ p = ProcessPoolExecutor(max_workers=num_proc)
109
+ for path in cache_dict_paths:
110
+ res.append(p.submit(datasets.load_from_disk,
111
+ str(path), **kargs))
112
+
113
+ p.shutdown(wait=True)
114
+ for future in res:
115
+ ds.append(future.result())
116
+ # print(future.result())
117
+ train = []
118
+ test = []
119
+ validation = []
120
+ for ds_ in ds:
121
+ train.append(ds_['train'])
122
+ test.append(ds_['test'])
123
+ validation.append(ds_['validation'])
124
+ # ds = datasets.concatenate_datasets(ds)
125
+ # print(ds)
126
+ return datasets.DatasetDict({
127
+ 'train': datasets.concatenate_datasets(train),
128
+ 'test': datasets.concatenate_datasets(test),
129
+ 'validation': datasets.concatenate_datasets(validation)
130
+ })
131
+
132
+
133
+ class BertDataModule(LightningDataModule):
134
+ @ staticmethod
135
+ def add_data_specific_args(parent_args):
136
+ parser = parent_args.add_argument_group('Universal DataModule')
137
+ parser.add_argument('--num_workers', default=8, type=int)
138
+ parser.add_argument('--train_batchsize', default=32, type=int)
139
+ parser.add_argument('--val_batchsize', default=32, type=int)
140
+ parser.add_argument('--test_batchsize', default=32, type=int)
141
+ parser.add_argument('--datasets_name', type=str)
142
+ # parser.add_argument('--datasets_name', type=str)
143
+ parser.add_argument('--train_datasets_field', type=str, default='train')
144
+ parser.add_argument('--val_datasets_field', type=str, default='validation')
145
+ parser.add_argument('--test_datasets_field', type=str, default='test')
146
+ return parent_args
147
+
148
+ def __init__(
149
+ self,
150
+ tokenizer,
151
+ collate_fn,
152
+ args,
153
+ **kwargs,
154
+ ):
155
+ super().__init__()
156
+ self.datasets = load_dataset(num_proc=args.num_workers)
157
+ self.tokenizer = tokenizer
158
+ self.collate_fn = collate_fn
159
+ self.save_hyperparameters(args)
160
+
161
+ def setup(self, stage: Optional[str] = None) -> None:
162
+ self.train = DataLoader(
163
+ self.datasets[self.hparams.train_datasets_field],
164
+ batch_size=self.hparams.train_batchsize,
165
+ shuffle=True,
166
+ num_workers=self.hparams.num_workers,
167
+ collate_fn=self.collate_fn,
168
+ )
169
+ self.val = DataLoader(
170
+ self.datasets[self.hparams.val_datasets_field],
171
+ batch_size=self.hparams.val_batchsize,
172
+ shuffle=False,
173
+ num_workers=self.hparams.num_workers,
174
+ collate_fn=self.collate_fn,
175
+ )
176
+ self.test = DataLoader(
177
+ self.datasets[self.hparams.test_datasets_field],
178
+ batch_size=self.hparams.test_batchsize,
179
+ shuffle=False,
180
+ num_workers=self.hparams.num_workers,
181
+ collate_fn=self.collate_fn,
182
+ )
183
+ return
184
+
185
+ def train_dataloader(self):
186
+ return self.train
187
+
188
+ def val_dataloader(self):
189
+ return self.val
190
+
191
+ def test_dataloader(self):
192
+ return self.test
193
+
194
+
195
+ if __name__ == '__main__':
196
+ # pre = PreProcessing(_SPLIT_DATA_PATH)
197
+ # pre.processing()
198
+
199
+ dataset = BertDataGenerate(_SPLIT_DATA_PATH, num_proc=16)
200
+ dataset.generate_cache_arrow()
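`load.py` is meant to be used in two phases: `BertDataGenerate.generate_cache_arrow()` shards the raw JSON files into HF Arrow caches with train/test/validation splits, and `load_dataset()` reads every cached shard back and concatenates the splits. A minimal sketch, assuming the hard-coded `_SPLIT_DATA_PATH` and `_CACHE_SPLIT_DATA_PATH` directories exist and contain JSON-lines files with a `text` field:

```python
# Two-phase usage sketch for load.py; the paths come from the module-level
# constants and must exist on the local machine for this to run.
from fengshen.data.bert_dataloader.load import BertDataGenerate, load_dataset

# Phase 1: build per-shard Arrow caches with train/test/validation splits.
BertDataGenerate(num_proc=16).generate_cache_arrow()

# Phase 2: load every cached shard and concatenate the splits into one DatasetDict.
ds = load_dataset(num_proc=4)
print(ds['train'][0]['text'])
```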
fengshen/data/bert_dataloader/preprocessing.py ADDED
@@ -0,0 +1,110 @@
1
+ import re
2
+ import json
3
+ import multiprocessing
4
+ from tqdm import tqdm
5
+ from pathlib import Path
6
+ from itertools import chain
7
+
8
+ _SPLIT_DATA_PATH = '/data1/datas/wudao_180g'
9
+
10
+
11
+ def cut_sent(path):
12
+ """
13
+ Split Chinese text into sentences: by default splits on ?, 。, !, and ellipses, and keeps sentences wrapped in double quotes intact.
14
+ Implemented by inserting split markers and then re-splitting.
15
+ """
16
+ path = Path(path)
17
+ # print(path)
18
+ save_path = str(Path('/data1/datas/wudao_180g_split', path.name))
19
+ print('处理文件:', save_path)
20
+ with open(save_path, 'wt', encoding='utf-8') as w:
21
+ with open(path, 'rt', encoding='utf-8') as f:
22
+ for para in tqdm(f):
23
+ para = json.loads(para)
24
+ para_ = para['text'] + ' '
25
+ # print('sentence piece......')
26
+ # For pep8, the regex cannot contain a bare \? here; it has to be written as \\?
27
+ para_ = re.sub('([?。!\\?\\!…]+)([^”’]|[”’])',
28
+ r'\1#####\2', para_)
29
+ para_ = re.sub('([\\.]{3,})([^”’])', r'\1#####\2', para_)
30
+
31
+ # Match \1: sentence-ending punctuation immediately followed by ’ or ”; \2: a non-terminal character, i.e. a sentence wrapped in quotes
32
+ para_ = re.sub(
33
+ '([。!?\\?\\!…][”’])([^,。!?\\?\\!]|\\s)', r'\1#####\2', para_)
34
+ para_ = re.sub(
35
+ '([\\.]{3,}[”’])([^,。!?\\?\\!]|\\s)', r'\1#####\2', para_)
36
+ para_ = re.sub(
37
+ '([#]{5})([”’])([^,。!?\\?\\!])', r'\2#####\3', para_)
38
+ para_ = para_.strip()
39
+ # Pack multiple samples into one chunk of up to 512 characters
40
+ line_ = ''
41
+ for line in para_.split('#####'):
42
+ line = line.strip()
43
+ if len(line_) < 512 and len(line) > 0:
44
+ line_ += line
45
+ else:
46
+ w.writelines(json.dumps(
47
+ {'text': line_}, ensure_ascii=False)+'\n')
48
+ line_ = line
49
+ w.writelines(json.dumps(
50
+ {'text': line_}, ensure_ascii=False)+'\n')
51
+
52
+
53
+ def chain_iter(*filenames):
54
+ """
55
+ Read several files as a single chained iterator
56
+ """
57
+ reader = [open(file, 'r') for file in filenames]
58
+ return chain(*reader)
59
+
60
+
61
+ class Config(object):
62
+
63
+ def __init__(self, data_path=_SPLIT_DATA_PATH, num_worker=16, split_numb=600000, cut_sentence=True, output_file=None) -> None:
64
+ self.data_path = Path(data_path)
65
+ self.num_worker = num_worker
66
+ self.split_numb = split_numb
67
+ self.cut_sentence = cut_sentence
68
+
69
+
70
+ def processing1():
71
+ args = Config()
72
+ p_ = [str(i) for i in args.data_path.glob('*')]
73
+ fin = chain_iter(*p_)
74
+ pool = multiprocessing.Pool(args.num_worker)
75
+ docs = pool.imap(cut_sent, fin, chunksize=args.num_worker)
76
+
77
+ if not Path(args.data_path.parent, args.data_path.name+'_split').exists():
78
+ Path(args.data_path.parent, args.data_path.name+'_split').mkdir()
79
+ writer = open(str(Path(args.data_path.parent, args.data_path.name +
80
+ '_split', 'sentence_level.json')), 'wt', encoding='utf-8')
81
+ for doc in tqdm(docs):
82
+ for sentence in doc:
83
+ writer.writelines(json.dumps(
84
+ {"text": sentence}, ensure_ascii=False)+'\n')
85
+ pool.close()
86
+ pool.join()
87
+ writer.close()
88
+
89
+
90
+ if __name__ == '__main__':
91
+ from time import process_time, perf_counter
92
+ from random import shuffle
93
+ st = process_time()
94
+ args = Config(num_worker=16)
95
+
96
+ if not Path(args.data_path.parent, args.data_path.name+'_split').exists():
97
+ Path(args.data_path.parent, args.data_path.name +
98
+ '_split').mkdir(parents=True)
99
+
100
+ p_ = [str(i) for i in args.data_path.glob('*')]
101
+ # simple shuffle
102
+ shuffle(p_)
103
+
104
+ pool = multiprocessing.Pool(args.num_worker)
105
+ for item in p_:
106
+ pool.apply_async(func=cut_sent, args=(item,))
107
+ pool.close()
108
+ pool.join()
109
+ cost_time = process_time() - st
110
+ print('DONE!! cost time : %.5f' % cost_time)
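The regexes in `cut_sent` insert a `#####` sentinel after sentence-final punctuation (keeping trailing quotes attached) and then re-pack the pieces into chunks of roughly 512 characters. Below is a stripped-down sketch of the same idea on an in-memory string, using only the first of the substitution rules:

```python
# Minimal re-statement of the sentinel-split-and-pack idea from cut_sent(),
# operating on a string instead of the wudao JSON files.
import re


def split_and_pack(text, max_len=512):
    # insert a '#####' sentinel after runs of sentence-final punctuation
    text = re.sub('([?。!\\?\\!…]+)([^”’]|[”’])', r'\1#####\2', text + ' ')
    chunks, buf = [], ''
    for piece in (p.strip() for p in text.split('#####')):
        if len(buf) < max_len and piece:
            buf += piece
        else:
            chunks.append(buf)
            buf = piece
    chunks.append(buf)
    return [c for c in chunks if c]


print(split_and_pack('今天天气真好!我们出去玩吧。好的,走!'))
```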
fengshen/data/clip_dataloader/flickr.py ADDED
@@ -0,0 +1,105 @@
1
+ from torch.utils.data import Dataset, DataLoader
2
+ from torchvision.transforms import Normalize, Compose, RandomResizedCrop, InterpolationMode, ToTensor, Resize, \
3
+ CenterCrop
4
+ from transformers import BertTokenizer
5
+ import pytorch_lightning as pl
6
+ from PIL import Image
7
+ import os
8
+
9
+
10
+ class flickr30k_CNA(Dataset):
11
+ def __init__(self, img_root_path,
12
+ annot_path,
13
+ transform=None):
14
+ self.images = []
15
+ self.captions = []
16
+ self.labels = []
17
+ self.root = img_root_path
18
+ with open(annot_path, 'r') as f:
19
+ for line in f:
20
+ line = line.strip().split('\t')
21
+ key, caption = line[0].split('#')[0], line[1]
22
+ img_path = key + '.jpg'
23
+ self.images.append(img_path)
24
+ self.captions.append(caption)
25
+ self.labels.append(key)
26
+ self.transforms = transform
27
+ self.tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
28
+
29
+ # NOTE context length used by the large model
30
+ self.context_length = 77
31
+
32
+ def __len__(self):
33
+ return len(self.images)
34
+
35
+ def __getitem__(self, idx):
36
+ img_path = str(self.images[idx])
37
+ image = self.transforms(Image.open(os.path.join(self.root, img_path)))
38
+ text = self.tokenizer(str(self.captions[idx]), max_length=self.context_length,
39
+ padding='max_length', truncation=True, return_tensors='pt')['input_ids'][0]
40
+ label = self.labels[idx]
41
+ return image, text, label
42
+
43
+
44
+ def _convert_to_rgb(image):
45
+ return image.convert('RGB')
46
+
47
+
48
+ def image_transform(
49
+ image_size: int,
50
+ is_train: bool,
51
+ mean=(0.48145466, 0.4578275, 0.40821073),
52
+ std=(0.26862954, 0.26130258, 0.27577711)
53
+ ):
54
+ normalize = Normalize(mean=mean, std=std)
55
+ if is_train:
56
+ return Compose([
57
+ RandomResizedCrop(image_size, scale=(0.9, 1.0), interpolation=InterpolationMode.BICUBIC),
58
+ _convert_to_rgb,
59
+ ToTensor(),
60
+ normalize,
61
+ ])
62
+ else:
63
+ return Compose([
64
+ Resize(image_size, interpolation=InterpolationMode.BICUBIC),
65
+ CenterCrop(image_size),
66
+ _convert_to_rgb,
67
+ ToTensor(),
68
+ normalize,
69
+ ])
70
+
71
+
72
+ class FlickrDataModule(pl.LightningDataModule):
73
+ def __init__(self, args):
74
+ self.batch_size = args.batch_size
75
+ self.train_filename = args.train_filename # NOTE annotation file
76
+ self.train_root = args.train_root # NOTE image root directory
77
+ self.val_filename = args.val_filename
78
+ self.val_root = args.val_root
79
+ self.test_filename = args.test_filename
80
+ self.test_root = args.test_root
81
+
82
+ self.pretrain_model = args.pretrain_model
83
+ self.image_size = 224
84
+ self.prepare_data_per_node = True
85
+ self._log_hyperparams = False
86
+ self.num_workers = args.num_workers
87
+
88
+ def setup(self, stage=None):
89
+ # dataset
90
+ train_transform = image_transform(224, True)
91
+ val_transform = image_transform(224, False)
92
+ test_transform = image_transform(224, False)
93
+
94
+ self.train_dataset = flickr30k_CNA(self.train_root, self.train_filename, transform=train_transform)
95
+ self.val_dataset = flickr30k_CNA(self.val_root, self.val_filename, transform=val_transform)
96
+ self.test_dataset = flickr30k_CNA(self.test_root, self.test_filename, transform=test_transform)
97
+
98
+ def train_dataloader(self):
99
+ return DataLoader(self.train_dataset, batch_size=self.batch_size, num_workers=self.num_workers)
100
+
101
+ def val_dataloader(self):
102
+ return DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=self.num_workers)
103
+
104
+ def test_dataloader(self):
105
+ return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=self.num_workers)
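`FlickrDataModule` only needs an args object carrying the attributes read in its `__init__`, and the annotation files are expected to contain one `<image-key>#<n>\t<caption>` line per caption. A sketch of wiring it up; all paths are placeholders:

```python
# Sketch only: the paths below are placeholders; the tokenizer used by the dataset
# is hard-coded to "hfl/chinese-roberta-wwm-ext" in flickr30k_CNA above.
from types import SimpleNamespace
from fengshen.data.clip_dataloader.flickr import FlickrDataModule

args = SimpleNamespace(
    batch_size=32,
    num_workers=4,
    pretrain_model='hfl/chinese-roberta-wwm-ext',   # stored on the datamodule, not used for tokenisation
    train_filename='annotations/train.txt', train_root='images/train',
    val_filename='annotations/val.txt', val_root='images/val',
    test_filename='annotations/test.txt', test_root='images/test',
)

dm = FlickrDataModule(args)
dm.setup()
image, text_ids, label = dm.train_dataset[0]   # (transformed image, token ids, image key)
```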
fengshen/data/data_utils/common_utils.py ADDED
@@ -0,0 +1,4 @@
1
+ def padding_to_maxlength(ids, max_length, pad_id):
2
+ cur_len = len(ids)
3
+ len_diff = max_length - len(ids)
4
+ return ids + [pad_id] * len_diff, [1] * cur_len + [0] * len_diff
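`padding_to_maxlength` pads a list of token ids to a fixed length with `pad_id` and returns the padded ids together with the matching attention mask. A quick usage example:

```python
# padding_to_maxlength returns (padded ids, attention mask): 1 marks real tokens, 0 marks padding.
from fengshen.data.data_utils.common_utils import padding_to_maxlength

ids, mask = padding_to_maxlength([101, 2769, 102], max_length=6, pad_id=0)
assert ids == [101, 2769, 102, 0, 0, 0]
assert mask == [1, 1, 1, 0, 0, 0]
```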
fengshen/data/data_utils/mask_utils.py ADDED
@@ -0,0 +1,285 @@
1
+ import collections
2
+
3
+ import numpy as np
4
+
5
+ MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
6
+ ["index", "label"])
7
+
8
+
9
+ def is_start_piece(piece):
10
+ """Check if the current word piece is the starting piece (BERT)."""
11
+ # When a word has been split into
12
+ # WordPieces, the first token does not have any marker and any subsequence
13
+ # tokens are prefixed with ##. So whenever we see the ## token, we
14
+ # append it to the previous set of word indexes.
15
+ return not piece.startswith("##")
16
+
17
+
18
+ def create_masked_lm_predictions(tokens,
19
+ vocab_id_list, vocab_id_to_token_dict,
20
+ masked_lm_prob,
21
+ cls_id, sep_id, mask_id,
22
+ max_predictions_per_seq,
23
+ np_rng,
24
+ max_ngrams=3,
25
+ do_whole_word_mask=True,
26
+ favor_longer_ngram=False,
27
+ do_permutation=False,
28
+ geometric_dist=False,
29
+ masking_style="bert",
30
+ zh_tokenizer=None):
31
+ """Creates the predictions for the masked LM objective.
32
+ Note: Tokens here are vocab ids and not text tokens."""
33
+ '''
34
+ modified from Megatron-LM
35
+ Args:
36
+ tokens: input token ids
37
+ vocab_id_list: list of token ids in the vocabulary
38
+ vocab_id_to_token_dict: dict mapping token id to token
39
+ masked_lm_prob: masking probability
40
+ cls_id, sep_id, mask_id: special token ids
41
+ max_predictions_per_seq: maximum number of masked positions
42
+ np_rng: random state used for masking
43
+ max_ngrams: maximum word (n-gram) length
44
+ do_whole_word_mask: whether to apply whole-word masking
45
+ favor_longer_ngram: prefer longer n-grams
46
+ do_permutation: whether to additionally permute tokens
47
+ geometric_dist: sample n-gram lengths with np_rng.geometric
48
+ masking_style: masking style ("bert" or "t5")
49
+ zh_tokenizer: word segmenter used for whole-word masking, e.g. jieba.lcut
50
+ '''
51
+ cand_indexes = []
52
+ # Note(mingdachen): We create a list for recording if the piece is
53
+ # the starting piece of current token, where 1 means true, so that
54
+ # on-the-fly whole word masking is possible.
55
+ token_boundary = [0] * len(tokens)
56
+ # If no Chinese word segmenter is given, fall back to the plain ## (WordPiece) convention
57
+ if zh_tokenizer is None:
58
+ for (i, token) in enumerate(tokens):
59
+ if token == cls_id or token == sep_id:
60
+ token_boundary[i] = 1
61
+ continue
62
+ # Whole Word Masking means that if we mask all of the wordpieces
63
+ # corresponding to an original word.
64
+ #
65
+ # Note that Whole Word Masking does *not* change the training code
66
+ # at all -- we still predict each WordPiece independently, softmaxed
67
+ # over the entire vocabulary.
68
+ if (do_whole_word_mask and len(cand_indexes) >= 1 and
69
+ not is_start_piece(vocab_id_to_token_dict[token])):
70
+ cand_indexes[-1].append(i)
71
+ else:
72
+ cand_indexes.append([i])
73
+ if is_start_piece(vocab_id_to_token_dict[token]):
74
+ token_boundary[i] = 1
75
+ else:
76
+ # If a Chinese word segmenter is given, segment the text first and then decide word boundaries
77
+ # Recover the raw text without CLS and SEP
78
+ raw_tokens = []
79
+ for t in tokens:
80
+ if t != cls_id and t != sep_id:
81
+ raw_tokens.append(t)
82
+ raw_tokens = [vocab_id_to_token_dict[i] for i in raw_tokens]
83
+ # Segment the text, then record the longest word starting at each character
84
+ word_list = set(zh_tokenizer(''.join(raw_tokens), HMM=True))
85
+ word_length_dict = {}
86
+ for w in word_list:
87
+ if len(w) < 1:
88
+ continue
89
+ if w[0] not in word_length_dict:
90
+ word_length_dict[w[0]] = len(w)
91
+ elif word_length_dict[w[0]] < len(w):
92
+ word_length_dict[w[0]] = len(w)
93
+ i = 0
94
+ # Look tokens up against the segmented word list
95
+ while i < len(tokens):
96
+ token_id = tokens[i]
97
+ token = vocab_id_to_token_dict[token_id]
98
+ if len(token) == 0 or token_id == cls_id or token_id == sep_id:
99
+ token_boundary[i] = 1
100
+ i += 1
101
+ continue
102
+ word_max_length = 1
103
+ if token[0] in word_length_dict:
104
+ word_max_length = word_length_dict[token[0]]
105
+ j = 0
106
+ word = ''
107
+ word_end = i+1
108
+ # Backwards compatible with the old ## convention: if the following tokens start with ##, merge them into the preceding token as one word
109
+ old_style = False
110
+ while word_end < len(tokens) and vocab_id_to_token_dict[tokens[word_end]].startswith('##'):
111
+ old_style = True
112
+ word_end += 1
113
+ if not old_style:
114
+ while j < word_max_length and i+j < len(tokens):
115
+ cur_token = tokens[i+j]
116
+ word += vocab_id_to_token_dict[cur_token]
117
+ j += 1
118
+ if word in word_list:
119
+ word_end = i+j
120
+ cand_indexes.append([p for p in range(i, word_end)])
121
+ token_boundary[i] = 1
122
+ i = word_end
123
+
124
+ output_tokens = list(tokens)
125
+
126
+ masked_lm_positions = []
127
+ masked_lm_labels = []
128
+
129
+ if masked_lm_prob == 0:
130
+ return (output_tokens, masked_lm_positions,
131
+ masked_lm_labels, token_boundary)
132
+
133
+ num_to_predict = min(max_predictions_per_seq,
134
+ max(1, int(round(len(tokens) * masked_lm_prob))))
135
+
136
+ ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64)
137
+ if not geometric_dist:
138
+ # Note(mingdachen):
139
+ # By default, we set the probabilities to favor shorter ngram sequences.
140
+ pvals = 1. / np.arange(1, max_ngrams + 1)
141
+ pvals /= pvals.sum(keepdims=True)
142
+ if favor_longer_ngram:
143
+ pvals = pvals[::-1]
144
+ # Build the n-gram index: for each word, record the n-grams that start at it
145
+ ngram_indexes = []
146
+ for idx in range(len(cand_indexes)):
147
+ ngram_index = []
148
+ for n in ngrams:
149
+ ngram_index.append(cand_indexes[idx:idx + n])
150
+ ngram_indexes.append(ngram_index)
151
+
152
+ np_rng.shuffle(ngram_indexes)
153
+
154
+ (masked_lms, masked_spans) = ([], [])
155
+ covered_indexes = set()
156
+ for cand_index_set in ngram_indexes:
157
+ if len(masked_lms) >= num_to_predict:
158
+ break
159
+ if not cand_index_set:
160
+ continue
161
+ # Note(mingdachen):
162
+ # Skip current piece if they are covered in lm masking or previous ngrams.
163
+ for index_set in cand_index_set[0]:
164
+ for index in index_set:
165
+ if index in covered_indexes:
166
+ continue
167
+
168
+ if not geometric_dist:
169
+ n = np_rng.choice(ngrams[:len(cand_index_set)],
170
+ p=pvals[:len(cand_index_set)] /
171
+ pvals[:len(cand_index_set)].sum(keepdims=True))
172
+ else:
173
+ # Sampling "n" from the geometric distribution and clipping it to
174
+ # the max_ngrams. Using p=0.2 default from the SpanBERT paper
175
+ # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1)
176
+ n = min(np_rng.geometric(0.2), max_ngrams)
177
+
178
+ index_set = sum(cand_index_set[n - 1], [])
179
+ n -= 1
180
+ # Note(mingdachen):
181
+ # Repeatedly looking for a candidate that does not exceed the
182
+ # maximum number of predictions by trying shorter ngrams.
183
+ while len(masked_lms) + len(index_set) > num_to_predict:
184
+ if n == 0:
185
+ break
186
+ index_set = sum(cand_index_set[n - 1], [])
187
+ n -= 1
188
+ # If adding a whole-word mask would exceed the maximum number of
189
+ # predictions, then just skip this candidate.
190
+ if len(masked_lms) + len(index_set) > num_to_predict:
191
+ continue
192
+ is_any_index_covered = False
193
+ for index in index_set:
194
+ if index in covered_indexes:
195
+ is_any_index_covered = True
196
+ break
197
+ if is_any_index_covered:
198
+ continue
199
+ for index in index_set:
200
+ covered_indexes.add(index)
201
+ masked_token = None
202
+ token_id = tokens[index]
203
+ if masking_style == "bert":
204
+ # 80% of the time, replace with [MASK]
205
+ if np_rng.random() < 0.8:
206
+ masked_token = mask_id
207
+ else:
208
+ # 10% of the time, keep original
209
+ if np_rng.random() < 0.5:
210
+ masked_token = tokens[index]
211
+ # 10% of the time, replace with random word
212
+ else:
213
+ masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))]
214
+ elif masking_style == "t5":
215
+ masked_token = mask_id
216
+ else:
217
+ raise ValueError("invalid value of masking style")
218
+
219
+ output_tokens[index] = masked_token
220
+ masked_lms.append(MaskedLmInstance(index=index, label=token_id))
221
+
222
+ masked_spans.append(MaskedLmInstance(
223
+ index=index_set,
224
+ label=[tokens[index] for index in index_set]))
225
+
226
+ assert len(masked_lms) <= num_to_predict
227
+ np_rng.shuffle(ngram_indexes)
228
+
229
+ select_indexes = set()
230
+ if do_permutation:
231
+ for cand_index_set in ngram_indexes:
232
+ if len(select_indexes) >= num_to_predict:
233
+ break
234
+ if not cand_index_set:
235
+ continue
236
+ # Note(mingdachen):
237
+ # Skip current piece if they are covered in lm masking or previous ngrams.
238
+ for index_set in cand_index_set[0]:
239
+ for index in index_set:
240
+ if index in covered_indexes or index in select_indexes:
241
+ continue
242
+
243
+ n = np.random.choice(ngrams[:len(cand_index_set)],
244
+ p=pvals[:len(cand_index_set)] /
245
+ pvals[:len(cand_index_set)].sum(keepdims=True))
246
+ index_set = sum(cand_index_set[n - 1], [])
247
+ n -= 1
248
+
249
+ while len(select_indexes) + len(index_set) > num_to_predict:
250
+ if n == 0:
251
+ break
252
+ index_set = sum(cand_index_set[n - 1], [])
253
+ n -= 1
254
+ # If adding a whole-word mask would exceed the maximum number of
255
+ # predictions, then just skip this candidate.
256
+ if len(select_indexes) + len(index_set) > num_to_predict:
257
+ continue
258
+ is_any_index_covered = False
259
+ for index in index_set:
260
+ if index in covered_indexes or index in select_indexes:
261
+ is_any_index_covered = True
262
+ break
263
+ if is_any_index_covered:
264
+ continue
265
+ for index in index_set:
266
+ select_indexes.add(index)
267
+ assert len(select_indexes) <= num_to_predict
268
+
269
+ select_indexes = sorted(select_indexes)
270
+ permute_indexes = list(select_indexes)
271
+ np_rng.shuffle(permute_indexes)
272
+ orig_token = list(output_tokens)
273
+
274
+ for src_i, tgt_i in zip(select_indexes, permute_indexes):
275
+ output_tokens[src_i] = orig_token[tgt_i]
276
+ masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i]))
277
+
278
+ masked_lms = sorted(masked_lms, key=lambda x: x.index)
279
+ # Sort the spans by the index of the first span
280
+ masked_spans = sorted(masked_spans, key=lambda x: x.index[0])
281
+
282
+ for p in masked_lms:
283
+ masked_lm_positions.append(p.index)
284
+ masked_lm_labels.append(p.label)
285
+ return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans)
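`create_masked_lm_predictions` works on vocabulary ids rather than text and returns the masked token sequence together with the masked positions, labels, token boundaries and masked spans. A toy invocation; the seven-entry vocabulary and the ids are made up purely for illustration:

```python
# Toy whole-word-masking call; the vocabulary and ids are illustrative only.
import numpy as np
from fengshen.data.data_utils.mask_utils import create_masked_lm_predictions

vocab = {0: '[CLS]', 1: '[SEP]', 2: '[MASK]', 3: '今', 4: '天', 5: '很', 6: '好'}
tokens = [0, 3, 4, 5, 6, 1]                      # [CLS] 今 天 很 好 [SEP]

(output_tokens, masked_positions, masked_labels,
 token_boundary, masked_spans) = create_masked_lm_predictions(
    tokens,
    vocab_id_list=list(vocab.keys()),
    vocab_id_to_token_dict=vocab,
    masked_lm_prob=0.15,
    cls_id=0, sep_id=1, mask_id=2,
    max_predictions_per_seq=2,
    np_rng=np.random.RandomState(42))

print(output_tokens, masked_positions, masked_labels)
```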
fengshen/data/data_utils/sentence_split.py ADDED
@@ -0,0 +1,35 @@
1
+ import re
2
+
3
+
4
+ class ChineseSentenceSplitter(object):
5
+ def merge_symmetry(self, sentences, symmetry=('“', '”')):
6
+ # '''Merge symmetric symbols such as paired double quotes'''
7
+ effective_ = []
8
+ merged = True
9
+ for index in range(len(sentences)):
10
+ if symmetry[0] in sentences[index] and symmetry[1] not in sentences[index]:
11
+ merged = False
12
+ effective_.append(sentences[index])
13
+ elif symmetry[1] in sentences[index] and not merged:
14
+ merged = True
15
+ effective_[-1] += sentences[index]
16
+ elif symmetry[0] not in sentences[index] and symmetry[1] not in sentences[index] and not merged:
17
+ effective_[-1] += sentences[index]
18
+ else:
19
+ effective_.append(sentences[index])
20
+ return [i.strip() for i in effective_ if len(i.strip()) > 0]
21
+
22
+ def to_sentences(self, paragraph):
23
+ # """Split a paragraph into sentences"""
24
+ sentences = re.split(r"(?|。|[!]+|!|\…\…)", paragraph)
25
+ sentences.append("")
26
+ sentences = ["".join(i) for i in zip(sentences[0::2], sentences[1::2])]
27
+ sentences = [i.strip() for i in sentences if len(i.strip()) > 0]
28
+ for j in range(1, len(sentences)):
29
+ if sentences[j][0] == '”':
30
+ sentences[j-1] = sentences[j-1] + '”'
31
+ sentences[j] = sentences[j][1:]
32
+ return self.merge_symmetry(sentences)
33
+
34
+ def tokenize(self, text):
35
+ return self.to_sentences(text)
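`ChineseSentenceSplitter.tokenize` splits a paragraph on Chinese sentence-final punctuation and re-attaches quoted speech to its closing quote. A usage example, assuming the leading character of the split pattern above is the full-width question mark (as the surrounding full-width punctuation suggests):

```python
# Usage example for ChineseSentenceSplitter; quoted speech stays attached to its closing quote.
from fengshen.data.data_utils.sentence_split import ChineseSentenceSplitter

splitter = ChineseSentenceSplitter()
print(splitter.tokenize('他说:“今天天气很好。”我们便出门了。真是愉快的一天!'))
```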
fengshen/data/data_utils/sop_utils.py ADDED
@@ -0,0 +1,32 @@
1
+
2
+ # copy from megatron
3
+ def get_a_and_b_segments(sample, np_rng):
4
+ """Divide sample into a and b segments."""
5
+
6
+ # Number of sentences in the sample.
7
+ n_sentences = len(sample)
8
+ # Make sure we always have two sentences.
9
+ assert n_sentences > 1, 'make sure each sample has at least two sentences.'
10
+
11
+ # First part:
12
+ # `a_end` is how many sentences go into the `A`.
13
+ a_end = 1
14
+ if n_sentences >= 3:
15
+ # Note that randint in numpy is exclusive.
16
+ a_end = np_rng.randint(1, n_sentences)
17
+ tokens_a = []
18
+ for j in range(a_end):
19
+ tokens_a.extend(sample[j])
20
+
21
+ # Second part:
22
+ tokens_b = []
23
+ for j in range(a_end, n_sentences):
24
+ tokens_b.extend(sample[j])
25
+
26
+ # Random next:
27
+ is_next_random = False
28
+ if np_rng.random() < 0.5:
29
+ is_next_random = True
30
+ tokens_a, tokens_b = tokens_b, tokens_a
31
+
32
+ return tokens_a, tokens_b, is_next_random
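`get_a_and_b_segments` takes a tokenised document (a list of sentences, each a list of token ids), splits it into A/B segments and randomly swaps them half of the time for the sentence-order objective. A quick example with toy ids:

```python
# Toy example: split three "sentences" of token ids into A/B segments.
import numpy as np
from fengshen.data.data_utils.sop_utils import get_a_and_b_segments

sample = [[11, 12, 13], [21, 22], [31, 32, 33]]
tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, np.random.RandomState(0))
print(tokens_a, tokens_b, is_next_random)
```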
fengshen/data/data_utils/token_type_utils.py ADDED
@@ -0,0 +1,25 @@
1
+ def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
2
+ """Merge segments A and B, add [CLS] and [SEP] and build tokentypes."""
3
+
4
+ tokens = []
5
+ tokentypes = []
6
+ # [CLS].
7
+ tokens.append(cls_id)
8
+ tokentypes.append(0)
9
+ # Segment A.
10
+ for token in tokens_a:
11
+ tokens.append(token)
12
+ tokentypes.append(0)
13
+ # [SEP].
14
+ tokens.append(sep_id)
15
+ tokentypes.append(0)
16
+ # Segment B.
17
+ for token in tokens_b:
18
+ tokens.append(token)
19
+ tokentypes.append(1)
20
+ if tokens_b:
21
+ # [SEP].
22
+ tokens.append(sep_id)
23
+ tokentypes.append(1)
24
+
25
+ return tokens, tokentypes
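`create_tokens_and_tokentypes` joins two segments with the `[CLS]`/`[SEP]` ids and builds the matching segment (token-type) ids. A quick example:

```python
# Segment A gets token type 0 (including [CLS] and its [SEP]); segment B gets token type 1.
from fengshen.data.data_utils.token_type_utils import create_tokens_and_tokentypes

tokens, tokentypes = create_tokens_and_tokentypes(
    tokens_a=[11, 12], tokens_b=[21], cls_id=101, sep_id=102)
assert tokens == [101, 11, 12, 102, 21, 102]
assert tokentypes == [0, 0, 0, 0, 1, 1]
```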
fengshen/data/data_utils/truncate_utils.py ADDED
@@ -0,0 +1,19 @@
1
+
2
+ def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng):
3
+ """Truncates a pair of sequences to a maximum sequence length."""
4
+ # print(len_a, len_b, max_num_tokens)
5
+ assert len_a > 0
6
+ if len_a + len_b <= max_num_tokens:
7
+ return False
8
+ while len_a + len_b > max_num_tokens:
9
+ if len_a > len_b:
10
+ len_a -= 1
11
+ tokens = tokens_a
12
+ else:
13
+ len_b -= 1
14
+ tokens = tokens_b
15
+ if np_rng.random() < 0.5:
16
+ del tokens[0]
17
+ else:
18
+ tokens.pop()
19
+ return True
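`truncate_segments` trims the two token lists in place, removing tokens from the front or back of the longer list at random until their combined length fits, and reports whether anything was removed. A quick example:

```python
# truncate_segments modifies tokens_a / tokens_b in place and returns True if it truncated.
import numpy as np
from fengshen.data.data_utils.truncate_utils import truncate_segments

tokens_a = [1, 2, 3, 4, 5]
tokens_b = [6, 7, 8]
truncated = truncate_segments(tokens_a, tokens_b, len(tokens_a), len(tokens_b),
                              max_num_tokens=6, np_rng=np.random.RandomState(0))
print(truncated, tokens_a, tokens_b)   # True, with len(tokens_a) + len(tokens_b) <= 6
```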
fengshen/data/hubert/hubert_dataset.py ADDED
@@ -0,0 +1,361 @@
1
+ # Copyright (c) Facebook, Inc. and its affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import itertools
7
+ import logging
8
+ import os
9
+ import sys
10
+ from typing import Any, List, Optional, Union
11
+
12
+ import numpy as np
13
+
14
+ import torch
15
+ import torch.nn.functional as F
16
+ from fairseq.data import data_utils
17
+ from fairseq.data.fairseq_dataset import FairseqDataset
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+
22
+ def add_data_specific_args(parent_args):
23
+ parser = parent_args.add_argument_group('Hubert Dataset')
24
+ parser.add_argument('--data', type=str)
25
+ parser.add_argument('--sample_rate', type=float, default=16000)
26
+ parser.add_argument('--label_dir', type=str)
27
+ parser.add_argument('--labels', type=str, nargs='+')
28
+ parser.add_argument('--label_rate', type=float)
29
+ parser.add_argument('--max_keep_size', type=int, default=None)
30
+ parser.add_argument('--min_sample_size', type=int)
31
+ parser.add_argument('--max_sample_size', type=int)
32
+ parser.add_argument('--pad_audio', type=bool)
33
+ parser.add_argument('--normalize', type=bool)
34
+ parser.add_argument('--random_crop', type=bool)
35
+ parser.add_argument('--single_target', type=bool, default=False)
36
+ return parent_args
37
+
38
+
39
+ def load_audio(manifest_path, max_keep, min_keep):
40
+ n_long, n_short = 0, 0
41
+ names, inds, sizes = [], [], []
42
+ with open(manifest_path) as f:
43
+ root = f.readline().strip()
44
+ for ind, line in enumerate(f):
45
+ items = line.strip().split("\t")
46
+ assert len(items) == 2, line
47
+ sz = int(items[1])
48
+ if min_keep is not None and sz < min_keep:
49
+ n_short += 1
50
+ elif max_keep is not None and sz > max_keep:
51
+ n_long += 1
52
+ else:
53
+ names.append(items[0])
54
+ inds.append(ind)
55
+ sizes.append(sz)
56
+ tot = ind + 1
57
+ logger.info(
58
+ (
59
+ f"max_keep={max_keep}, min_keep={min_keep}, "
60
+ f"loaded {len(names)}, skipped {n_short} short and {n_long} long, "
61
+ f"longest-loaded={max(sizes)}, shortest-loaded={min(sizes)}"
62
+ )
63
+ )
64
+ return root, names, inds, tot, sizes
65
+
66
+
67
+ def load_label(label_path, inds, tot):
68
+ with open(label_path) as f:
69
+ labels = [line.rstrip() for line in f]
70
+ assert (
71
+ len(labels) == tot
72
+ ), f"number of labels does not match ({len(labels)} != {tot})"
73
+ labels = [labels[i] for i in inds]
74
+ return labels
75
+
76
+
77
+ def load_label_offset(label_path, inds, tot):
78
+ with open(label_path) as f:
79
+ code_lengths = [len(line.encode("utf-8")) for line in f]
80
+ assert (
81
+ len(code_lengths) == tot
82
+ ), f"number of labels does not match ({len(code_lengths)} != {tot})"
83
+ offsets = list(itertools.accumulate([0] + code_lengths))
84
+ offsets = [(offsets[i], offsets[i + 1]) for i in inds]
85
+ return offsets
86
+
87
+
88
+ def verify_label_lengths(
89
+ audio_sizes,
90
+ audio_rate,
91
+ label_path,
92
+ label_rate,
93
+ inds,
94
+ tot,
95
+ tol=0.1, # tolerance in seconds
96
+ ):
97
+ if label_rate < 0:
98
+ logger.info(f"{label_path} is sequence label. skipped")
99
+ return
100
+
101
+ with open(label_path) as f:
102
+ lengths = [len(line.rstrip().split()) for line in f]
103
+ assert len(lengths) == tot
104
+ lengths = [lengths[i] for i in inds]
105
+ num_invalid = 0
106
+ for i, ind in enumerate(inds):
107
+ dur_from_audio = audio_sizes[i] / audio_rate
108
+ dur_from_label = lengths[i] / label_rate
109
+ if abs(dur_from_audio - dur_from_label) > tol:
110
+ logger.warning(
111
+ (
112
+ f"audio and label duration differ too much "
113
+ f"(|{dur_from_audio} - {dur_from_label}| > {tol}) "
114
+ f"in line {ind+1} of {label_path}. Check if `label_rate` "
115
+ f"is correctly set (currently {label_rate}). "
116
+ f"num. of samples = {audio_sizes[i]}; "
117
+ f"label length = {lengths[i]}"
118
+ )
119
+ )
120
+ num_invalid += 1
121
+ if num_invalid > 0:
122
+ logger.warning(
123
+ f"total {num_invalid} (audio, label) pairs with mismatched lengths"
124
+ )
125
+
126
+
127
+ class HubertDataset(FairseqDataset):
128
+ def __init__(
129
+ self,
130
+ manifest_path: str,
131
+ sample_rate: float,
132
+ label_paths: List[str],
133
+ label_rates: Union[List[float], float], # -1 for sequence labels
134
+ pad_list: List[str],
135
+ eos_list: List[str],
136
+ label_processors: Optional[List[Any]] = None,
137
+ max_keep_sample_size: Optional[int] = None,
138
+ min_keep_sample_size: Optional[int] = None,
139
+ max_sample_size: Optional[int] = None,
140
+ shuffle: bool = True,
141
+ pad_audio: bool = False,
142
+ normalize: bool = False,
143
+ store_labels: bool = True,
144
+ random_crop: bool = False,
145
+ single_target: bool = False,
146
+ ):
147
+ self.audio_root, self.audio_names, inds, tot, self.sizes = load_audio(
148
+ manifest_path, max_keep_sample_size, min_keep_sample_size
149
+ )
150
+ self.sample_rate = sample_rate
151
+ self.shuffle = shuffle
152
+ self.random_crop = random_crop
153
+
154
+ self.num_labels = len(label_paths)
155
+ self.pad_list = pad_list
156
+ self.eos_list = eos_list
157
+ self.label_processors = label_processors
158
+ self.single_target = single_target
159
+ self.label_rates = (
160
+ [label_rates for _ in range(len(label_paths))]
161
+ if isinstance(label_rates, float)
162
+ else label_rates
163
+ )
164
+ self.store_labels = store_labels
165
+ if store_labels:
166
+ self.label_list = [load_label(p, inds, tot) for p in label_paths]
167
+ else:
168
+ self.label_paths = label_paths
169
+ self.label_offsets_list = [
170
+ load_label_offset(p, inds, tot) for p in label_paths
171
+ ]
172
+ assert label_processors is None or len(label_processors) == self.num_labels
173
+ for label_path, label_rate in zip(label_paths, self.label_rates):
174
+ verify_label_lengths(
175
+ self.sizes, sample_rate, label_path, label_rate, inds, tot
176
+ )
177
+
178
+ self.max_sample_size = (
179
+ max_sample_size if max_sample_size is not None else sys.maxsize
180
+ )
181
+ self.pad_audio = pad_audio
182
+ self.normalize = normalize
183
+ logger.info(
184
+ f"pad_audio={pad_audio}, random_crop={random_crop}, "
185
+ f"normalize={normalize}, max_sample_size={self.max_sample_size}"
186
+ )
187
+
188
+ def get_audio(self, index):
189
+ import soundfile as sf
190
+
191
+ wav_path = os.path.join(self.audio_root, self.audio_names[index])
192
+ wav, cur_sample_rate = sf.read(wav_path)
193
+ wav = torch.from_numpy(wav).float()
194
+ wav = self.postprocess(wav, cur_sample_rate)
195
+ return wav
196
+
197
+ def get_label(self, index, label_idx):
198
+ if self.store_labels:
199
+ label = self.label_list[label_idx][index]
200
+ else:
201
+ with open(self.label_paths[label_idx]) as f:
202
+ offset_s, offset_e = self.label_offsets_list[label_idx][index]
203
+ f.seek(offset_s)
204
+ label = f.read(offset_e - offset_s)
205
+
206
+ if self.label_processors is not None:
207
+ label = self.label_processors[label_idx](label)
208
+ return label
209
+
210
+ def get_labels(self, index):
211
+ return [self.get_label(index, i) for i in range(self.num_labels)]
212
+
213
+ def __getitem__(self, index):
214
+ wav = self.get_audio(index)
215
+ labels = self.get_labels(index)
216
+ return {"id": index, "source": wav, "label_list": labels}
217
+
218
+ def __len__(self):
219
+ return len(self.sizes)
220
+
221
+ def crop_to_max_size(self, wav, target_size):
222
+ size = len(wav)
223
+ diff = size - target_size
224
+ if diff <= 0:
225
+ return wav, 0
226
+
227
+ start, end = 0, target_size
228
+ if self.random_crop:
229
+ start = np.random.randint(0, diff + 1)
230
+ end = size - diff + start
231
+ return wav[start:end], start
232
+
233
+ def collater(self, samples):
234
+ # target = max(sizes) -> random_crop not used
235
+ # target = max_sample_size -> random_crop used for long
236
+ samples = [s for s in samples if s["source"] is not None]
237
+ if len(samples) == 0:
238
+ return {}
239
+
240
+ audios = [s["source"] for s in samples]
241
+ audio_sizes = [len(s) for s in audios]
242
+ if self.pad_audio:
243
+ audio_size = min(max(audio_sizes), self.max_sample_size)
244
+ else:
245
+ audio_size = min(min(audio_sizes), self.max_sample_size)
246
+ collated_audios, padding_mask, audio_starts = self.collater_audio(
247
+ audios, audio_size
248
+ )
249
+
250
+ targets_by_label = [
251
+ [s["label_list"][i] for s in samples] for i in range(self.num_labels)
252
+ ]
253
+ targets_list, lengths_list, ntokens_list = self.collater_label(
254
+ targets_by_label, audio_size, audio_starts
255
+ )
256
+
257
+ net_input = {"source": collated_audios, "padding_mask": padding_mask}
258
+ batch = {
259
+ "id": torch.LongTensor([s["id"] for s in samples]),
260
+ "net_input": net_input,
261
+ }
262
+
263
+ if self.single_target:
264
+ batch["target_lengths"] = lengths_list[0]
265
+ batch["ntokens"] = ntokens_list[0]
266
+ batch["target"] = targets_list[0]
267
+ else:
268
+ batch["target_lengths_list"] = lengths_list
269
+ batch["ntokens_list"] = ntokens_list
270
+ batch["target_list"] = targets_list
271
+ return batch
272
+
273
+ def collater_audio(self, audios, audio_size):
274
+ collated_audios = audios[0].new_zeros(len(audios), audio_size)
275
+ padding_mask = (
276
+ torch.BoolTensor(collated_audios.shape).fill_(False)
277
+ # if self.pad_audio else None
278
+ )
279
+ audio_starts = [0 for _ in audios]
280
+ for i, audio in enumerate(audios):
281
+ diff = len(audio) - audio_size
282
+ if diff == 0:
283
+ collated_audios[i] = audio
284
+ elif diff < 0:
285
+ assert self.pad_audio
286
+ collated_audios[i] = torch.cat([audio, audio.new_full((-diff,), 0.0)])
287
+ padding_mask[i, diff:] = True
288
+ else:
289
+ collated_audios[i], audio_starts[i] = self.crop_to_max_size(
290
+ audio, audio_size
291
+ )
292
+ return collated_audios, padding_mask, audio_starts
293
+
294
+ def collater_frm_label(self, targets, audio_size, audio_starts, label_rate, pad):
295
+ assert label_rate > 0
296
+ s2f = label_rate / self.sample_rate
297
+ frm_starts = [int(round(s * s2f)) for s in audio_starts]
298
+ frm_size = int(round(audio_size * s2f))
299
+ if not self.pad_audio:
300
+ rem_size = [len(t) - s for t, s in zip(targets, frm_starts)]
301
+ frm_size = min(frm_size, *rem_size)
302
+ targets = [t[s: s + frm_size] for t, s in zip(targets, frm_starts)]
303
+ logger.debug(f"audio_starts={audio_starts}")
304
+ logger.debug(f"frame_starts={frm_starts}")
305
+ logger.debug(f"frame_size={frm_size}")
306
+
307
+ lengths = torch.LongTensor([len(t) for t in targets])
308
+ ntokens = lengths.sum().item()
309
+ targets = data_utils.collate_tokens(targets, pad_idx=pad, left_pad=False)
310
+ return targets, lengths, ntokens
311
+
312
+ def collater_seq_label(self, targets, pad):
313
+ lengths = torch.LongTensor([len(t) for t in targets])
314
+ ntokens = lengths.sum().item()
315
+ targets = data_utils.collate_tokens(targets, pad_idx=pad, left_pad=False)
316
+ return targets, lengths, ntokens
317
+
318
+ def collater_label(self, targets_by_label, audio_size, audio_starts):
319
+ targets_list, lengths_list, ntokens_list = [], [], []
320
+ itr = zip(targets_by_label, self.label_rates, self.pad_list)
321
+ for targets, label_rate, pad in itr:
322
+ if label_rate == -1.0:
323
+ targets, lengths, ntokens = self.collater_seq_label(targets, pad)
324
+ else:
325
+ targets, lengths, ntokens = self.collater_frm_label(
326
+ targets, audio_size, audio_starts, label_rate, pad
327
+ )
328
+ targets_list.append(targets)
329
+ lengths_list.append(lengths)
330
+ ntokens_list.append(ntokens)
331
+ return targets_list, lengths_list, ntokens_list
332
+
333
+ def num_tokens(self, index):
334
+ return self.size(index)
335
+
336
+ def size(self, index):
337
+ if self.pad_audio:
338
+ return self.sizes[index]
339
+ return min(self.sizes[index], self.max_sample_size)
340
+
341
+ def ordered_indices(self):
342
+ if self.shuffle:
343
+ order = [np.random.permutation(len(self))]
344
+ else:
345
+ order = [np.arange(len(self))]
346
+
347
+ order.append(self.sizes)
348
+ return np.lexsort(order)[::-1]
349
+
350
+ def postprocess(self, wav, cur_sample_rate):
351
+ if wav.dim() == 2:
352
+ wav = wav.mean(-1)
353
+ assert wav.dim() == 1, wav.dim()
354
+
355
+ if cur_sample_rate != self.sample_rate:
356
+ raise Exception(f"sr {cur_sample_rate} != {self.sample_rate}")
357
+
358
+ if self.normalize:
359
+ with torch.no_grad():
360
+ wav = F.layer_norm(wav, wav.shape)
361
+ return wav
fengshen/data/megatron_dataloader/Makefile ADDED
@@ -0,0 +1,9 @@
1
+ CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color
2
+ CPPFLAGS += $(shell python3 -m pybind11 --includes)
3
+ LIBNAME = helpers
4
+ LIBEXT = $(shell python3-config --extension-suffix)
5
+
6
+ default: $(LIBNAME)$(LIBEXT)
7
+
8
+ %$(LIBEXT): %.cpp
9
+ $(CXX) $(CXXFLAGS) $(CPPFLAGS) $< -o $@
fengshen/data/megatron_dataloader/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from . import indexed_dataset
fengshen/data/megatron_dataloader/bart_dataset.py ADDED
@@ -0,0 +1,443 @@
1
+ """BART Style dataset. Modified from fairseq."""
2
+
3
+ import numpy as np
4
+ import torch
5
+ import math
6
+ import re
7
+
8
+ from fengshen.data.megatron_dataloader.dataset_utils import (
9
+ get_samples_mapping
10
+ )
11
+
12
+
13
+ class BartDataset(torch.utils.data.Dataset):
14
+ def __init__(self, name, indexed_dataset, data_prefix,
15
+ num_epochs, max_num_samples, masked_lm_prob,
16
+ max_seq_length, short_seq_prob, seed, tokenizer, zh_tokenizer):
17
+
18
+ # Params to store.
19
+ self.name = name
20
+ self.seed = seed
21
+ self.masked_lm_prob = masked_lm_prob
22
+ self.max_seq_length = max_seq_length
23
+
24
+ # Dataset.
25
+ self.indexed_dataset = indexed_dataset
26
+
27
+ # Build the samples mapping.
28
+ self.samples_mapping = get_samples_mapping(self.indexed_dataset,
29
+ data_prefix,
30
+ num_epochs,
31
+ max_num_samples,
32
+ self.max_seq_length - 3, # account for added tokens
33
+ short_seq_prob,
34
+ self.seed,
35
+ self.name,
36
+ False)
37
+
38
+ # Vocab stuff.
39
+ self.vocab_size = tokenizer.vocab_size
40
+ inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
41
+ self.vocab_id_list = list(inv_vocab.keys())
42
+ self.vocab_id_to_token_dict = inv_vocab
43
+ self.cls_id = tokenizer.cls_token_id
44
+ self.sep_id = tokenizer.sep_token_id
45
+ self.mask_id = tokenizer.mask_token_id
46
+ self.pad_id = tokenizer.pad_token_id
47
+ self.tokenizer = tokenizer
48
+
49
+ seg_tokens = ['。', ';', ';', '!', '!', '?', '?']
50
+ seg_token_ids = []
51
+ for t in seg_tokens:
52
+ if t in tokenizer.vocab:
53
+ seg_token_ids.append(tokenizer.vocab[t])
54
+ else:
55
+ print('seg_token "{}" not in vocab'.format(t))
56
+ self.seg_token_ids = set(seg_token_ids)
57
+
58
+ self.zh_tokenizer = zh_tokenizer
59
+
60
+ # Denoising ratios
61
+ self.permute_sentence_ratio = 1.0
62
+ self.mask_ratio = masked_lm_prob # 0.15
63
+ self.random_ratio = 0.1
64
+ self.insert_ratio = 0.0
65
+ self.rotate_ratio = 0.0
66
+ self.mask_whole_word = 1
67
+ self.item_transform_func = None
68
+
69
+ self.mask_span_distribution = None
70
+ if False:
71
+ _lambda = 3 # Poisson lambda
72
+
73
+ lambda_to_the_k = 1
74
+ e_to_the_minus_lambda = math.exp(-_lambda)
75
+ k_factorial = 1
76
+ ps = []
77
+ for k in range(0, 128):
78
+ ps.append(e_to_the_minus_lambda * lambda_to_the_k / k_factorial)
79
+ lambda_to_the_k *= _lambda
80
+ k_factorial *= k + 1
81
+ if ps[-1] < 0.0000001:
82
+ break
83
+ ps = torch.FloatTensor(ps)
84
+ self.mask_span_distribution = torch.distributions.Categorical(ps)
85
+
86
+ def __len__(self):
87
+ return self.samples_mapping.shape[0]
88
+
89
+ def __getitem__(self, idx):
90
+ start_idx, end_idx, seq_length = self.samples_mapping[idx]
91
+ sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)]
92
+ # Note that this rng state should be numpy and not python since
93
+ # python randint is inclusive whereas the numpy one is exclusive.
94
+ # We % 2**32 since numpy requires the seed to be between 0 and 2**32 - 1
95
+ np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32))
96
+ return self.build_training_sample(sample, self.max_seq_length, np_rng)
97
+
98
+ def build_training_sample(self, sample, max_seq_length, np_rng):
99
+ """Biuld training sample.
100
+
101
+ Arguments:
102
+ sample: A list of sentences in which each sentence is a list of token ids.
103
+ max_seq_length: Desired sequence length.
104
+ np_rng: Random number generator. Note that this rng state should be
105
+ numpy and not python since python randint is inclusive for
106
+ the upper bound whereas the numpy one is exclusive.
107
+ """
108
+ # permute sentences
109
+ full_stops = []
110
+ tokens = [self.cls_id]
111
+ for sent in sample:
112
+ for t in sent:
113
+ token = self.vocab_id_to_token_dict[t]
114
+ if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0:
115
+ # compatible with erlangshen's ## convention for whole word masking
116
+ t = self.tokenizer.convert_tokens_to_ids(token[2:])
117
+ tokens.append(t)
118
+ if t in self.seg_token_ids:
119
+ tokens.append(self.sep_id)
120
+ if tokens[-1] != self.sep_id:
121
+ tokens.append(self.sep_id)
122
+
123
+ if len(tokens) > max_seq_length:
124
+ tokens = tokens[:max_seq_length]
125
+ tokens[-1] = self.sep_id
126
+ tokens = torch.LongTensor(tokens)
127
+ full_stops = (tokens == self.sep_id).long()
128
+ assert (max_seq_length - tokens.shape[0]) >= 0, (tokens.size(), tokens[-1], max_seq_length)
129
+
130
+ source, target = tokens, tokens[1:].clone()
131
+ use_decoder = 1
132
+ # if torch.rand(1).item() < 0.5:
133
+ # use_decoder = 0
134
+
135
+ if self.permute_sentence_ratio > 0.0 and use_decoder == 1:
136
+ source = self.permute_sentences(source, full_stops, self.permute_sentence_ratio)
137
+
138
+ if self.mask_ratio > 0.0:
139
+ replace_length = 1 if use_decoder else -1
140
+ mask_ratio = self.mask_ratio * 2 if use_decoder else self.mask_ratio
141
+ source = self.add_whole_word_mask(source, mask_ratio, replace_length)
142
+
143
+ if self.insert_ratio > 0.0:
144
+ raise NotImplementedError
145
+ source = self.add_insertion_noise(source, self.insert_ratio)
146
+
147
+ if self.rotate_ratio > 0.0 and np.random.random() < self.rotate_ratio:
148
+ raise NotImplementedError
149
+ source = self.add_rolling_noise(source)
150
+
151
+ # there can be additional changes to make:
152
+ if self.item_transform_func is not None:
153
+ source, target = self.item_transform_func(source, target)
154
+
155
+ assert (source >= 0).all()
156
+ # assert (source[1:-1] >= 1).all()
157
+ assert (source <= self.vocab_size).all()
158
+ assert source[0] == self.cls_id
159
+ assert source[-1] == self.sep_id
160
+
161
+ # tokenizer = get_tokenizer()
162
+ # print(' '.join(tokenizer.tokenizer.convert_ids_to_tokens(source)))
163
+ # print(tokenizer.detokenize(target))
164
+ # print(tokenizer.detokenize(source))
165
+ # print()
166
+
167
+ prev_output_tokens = torch.zeros_like(target)
168
+ prev_output_tokens[0] = self.sep_id # match the preprocessing in fairseq
169
+ prev_output_tokens[1:] = target[:-1]
170
+
171
+ # src_padding_length = max_seq_length - source.shape[0]
172
+ # tgt_padding_length = max_seq_length - target.shape[0]
173
+ # assert src_padding_length >= 0, (source.size(), source[-1], max_seq_length)
174
+ # assert tgt_padding_length >= 0, (target.size(), target[-1], max_seq_length)
175
+ source_ = torch.full((max_seq_length,), self.pad_id, dtype=torch.long)
176
+ source_[:source.shape[0]] = source
177
+ target_ = torch.full((max_seq_length,), -100, dtype=torch.long)
178
+ # decoder not need bos in the front
179
+ target_[:target.shape[0]] = target
180
+ prev_output_tokens_ = torch.full((max_seq_length,), self.pad_id, dtype=torch.long)
181
+ prev_output_tokens_[:prev_output_tokens.shape[0]] = prev_output_tokens
182
+
183
+ return {
184
+ "input_ids": source_,
185
+ "labels": target_,
186
+ # "decoder_input_ids": prev_output_tokens_,
187
+ "attention_mask": (source_ != self.pad_id).long()
188
+ }
189
+
190
+ def permute_sentences(self, source, full_stops, p=1.0):
191
+ # Tokens that are full stops, where the previous token is not
192
+ sentence_ends = (full_stops[1:] * ~full_stops[:-1]).nonzero(as_tuple=False) + 2
193
+ result = source.clone()
194
+
195
+ num_sentences = sentence_ends.size(0)
196
+ num_to_permute = math.ceil((num_sentences * 2 * p) / 2.0)
197
+ substitutions = torch.randperm(num_sentences)[:num_to_permute]
198
+ ordering = torch.arange(0, num_sentences)
199
+ ordering[substitutions] = substitutions[torch.randperm(num_to_permute)]
200
+
201
+ # Ignore <bos> at start
202
+ index = 1
203
+ for i in ordering:
204
+ sentence = source[(sentence_ends[i - 1] if i > 0 else 1): sentence_ends[i]]
205
+ result[index: index + sentence.size(0)] = sentence
206
+ index += sentence.size(0)
207
+ return result
208
+
209
+ def word_starts_en(self, source):
210
+ if self.mask_whole_word is not None:
211
+ is_word_start = self.mask_whole_word.gather(0, source)
212
+ else:
213
+ is_word_start = torch.ones(source.size())
214
+ is_word_start[0] = 0
215
+ is_word_start[-1] = 0
216
+ return is_word_start
217
+
218
+ def word_starts(self, source):
219
+ if self.mask_whole_word is None:
220
+ is_word_start = torch.ones(source.size())
221
+ is_word_start[0] = 0
222
+ is_word_start[-1] = 0
223
+ return is_word_start
224
+ raw_tokens = [self.vocab_id_to_token_dict[i] for i in source.tolist()]
225
+ words = [raw_tokens[0]] + \
226
+ self.zh_tokenizer(''.join(raw_tokens[1:-1]), HMM=True) + [raw_tokens[-1]]
227
+
228
+ def _is_chinese_char(c):
229
+ """Checks whether CP is the #codepoint of a CJK character."""
230
+ # This defines a "chinese character" as anything in the CJK Unicode block:
231
+ # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
232
+ #
233
+ # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
234
+ # despite its name. The modern Korean Hangul alphabet is a different block,
235
+ # as is Japanese Hiragana and Katakana. Those alphabets are used to write
236
+ # space-separated words, so they are not treated specially and handled
237
+ # like all of the other languages.
238
+ if len(c) > 1:
239
+ return all([_is_chinese_char(c_i) for c_i in c])
240
+ cp = ord(c)
241
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
242
+ (cp >= 0x3400 and cp <= 0x4DBF) or #
243
+ (cp >= 0x20000 and cp <= 0x2A6DF) or #
244
+ (cp >= 0x2A700 and cp <= 0x2B73F) or #
245
+ (cp >= 0x2B740 and cp <= 0x2B81F) or #
246
+ (cp >= 0x2B820 and cp <= 0x2CEAF) or
247
+ (cp >= 0xF900 and cp <= 0xFAFF) or #
248
+ (cp >= 0x2F800 and cp <= 0x2FA1F)): #
249
+ return True
250
+
251
+ return False
252
+
253
+ def align_linear(atokens, btokens):
254
+ a2c = []
255
+ c2b = []
256
+ a2b = []
257
+ length = 0
258
+ for tok in atokens:
259
+ a2c.append([length + i for i in range(len(tok))])
260
+ length += len(tok)
261
+ for i, tok in enumerate(btokens):
262
+ c2b.extend([i for _ in range(len(tok))])
263
+
264
+ for i, amap in enumerate(a2c):
265
+ bmap = [c2b[ci] for ci in amap]
266
+ a2b.append(list(set(bmap)))
267
+ return a2b
268
+
269
+ raw_to_word_align = align_linear(raw_tokens, words)
270
+ is_word_start = torch.zeros(source.size())
271
+ word_starts = []
272
+ skip_cur_word = True
273
+ for i in range(1, len(raw_to_word_align)):
274
+ if raw_to_word_align[i-1] == raw_to_word_align[i]:
275
+ # not a word start, as they align to the same word
276
+ if not skip_cur_word and not _is_chinese_char(raw_tokens[i]):
277
+ word_starts.pop(-1)
278
+ skip_cur_word = True
279
+ continue
280
+ else:
281
+ is_word_start[i] = 1
282
+ if _is_chinese_char(raw_tokens[i]):
283
+ word_starts.append(i)
284
+ skip_cur_word = False
285
+ is_word_start[0] = 0
286
+ is_word_start[-1] = 0
287
+ word_starts = torch.tensor(word_starts).long().view(-1, 1)
288
+ return is_word_start, word_starts
289
+
290
+ def add_whole_word_mask(self, source, p, replace_length=1):
291
+ is_word_start, word_starts = self.word_starts(source)
292
+ num_to_mask_word = int(math.ceil(word_starts.size(0) * p))
293
+ num_to_mask_char = int(math.ceil(word_starts.size(0) * p * 0.1))
294
+ num_to_mask = num_to_mask_word + num_to_mask_char
295
+ if num_to_mask > word_starts.size(0):
296
+ word_starts = is_word_start.nonzero(as_tuple=False)
297
+ num_inserts = 0
298
+ if num_to_mask == 0:
299
+ return source
300
+
301
+ if self.mask_span_distribution is not None:
302
+ lengths = self.mask_span_distribution.sample(sample_shape=(num_to_mask,))
303
+
304
+ # Make sure we have enough to mask
305
+ cum_length = torch.cumsum(lengths, 0)
306
+ while cum_length[-1] < num_to_mask:
307
+ lengths = torch.cat(
308
+ [
309
+ lengths,
310
+ self.mask_span_distribution.sample(sample_shape=(num_to_mask,)),
311
+ ],
312
+ dim=0,
313
+ )
314
+ cum_length = torch.cumsum(lengths, 0)
315
+
316
+ # Trim to masking budget
317
+ i = 0
318
+ while cum_length[i] < num_to_mask:
319
+ i += 1
320
+ lengths[i] = num_to_mask - (0 if i == 0 else cum_length[i - 1])
321
+ num_to_mask = i + 1
322
+ lengths = lengths[:num_to_mask]
323
+
324
+ # Handle 0-length mask (inserts) separately
325
+ lengths = lengths[lengths > 0]
326
+ num_inserts = num_to_mask - lengths.size(0)
327
+ num_to_mask -= num_inserts
328
+ if num_to_mask == 0:
329
+ return self.add_insertion_noise(source, num_inserts / source.size(0))
330
+
331
+ assert (lengths > 0).all()
332
+ else:
333
+ lengths = torch.ones((num_to_mask,)).long()
334
+ assert is_word_start[-1] == 0
335
+ indices = word_starts[
336
+ torch.randperm(word_starts.size(0))[:num_to_mask]
337
+ ].squeeze(1)
338
+ mask_random = torch.FloatTensor(num_to_mask).uniform_() < self.random_ratio
339
+ source_length = source.size(0)
340
+ assert source_length - 1 not in indices
341
+ to_keep = torch.ones(source_length, dtype=torch.bool)
342
+ is_word_start[
343
+ -1
344
+ ] = 255 # acts as a long length, so spans don't go over the end of doc
345
+ if replace_length == 0:
346
+ to_keep[indices] = 0
347
+ else:
348
+ # keep index, but replace it with [MASK]
349
+ # print(source.size(), word_starts.size(), indices.size(), mask_random.size())
350
+ source[indices] = self.mask_id
351
+ source[indices[mask_random]] = torch.randint(
352
+ 1, self.vocab_size, size=(mask_random.sum(),)
353
+ )
354
+ # sorted_indices = torch.sort(indices)[0]
355
+ # continue_mask_pos = ((sorted_indices + 1)[:-1] == sorted_indices[1:])
356
+ # continue_mask_indices = sorted_indices[1:][continue_mask_pos]
357
+ # to_keep[continue_mask_indices] = 0
358
+
359
+ # for char indices, we already masked, the following loop handles word mask
360
+ indices = indices[:num_to_mask_word]
361
+ mask_random = mask_random[:num_to_mask_word]
362
+ if self.mask_span_distribution is not None:
363
+ assert len(lengths.size()) == 1
364
+ assert lengths.size() == indices.size()
365
+ lengths -= 1
366
+ while indices.size(0) > 0:
367
+ assert lengths.size() == indices.size()
368
+ lengths -= is_word_start[indices + 1].long()
369
+ uncompleted = lengths >= 0
370
+ indices = indices[uncompleted] + 1
371
+ mask_random = mask_random[uncompleted]
372
+ lengths = lengths[uncompleted]
373
+ if replace_length != -1:
374
+ # delete token
375
+ to_keep[indices] = 0
376
+ else:
377
+ # keep index, but replace it with [MASK]
378
+ source[indices] = self.mask_id
379
+ source[indices[mask_random]] = torch.randint(
380
+ 1, self.vocab_size, size=(mask_random.sum(),)
381
+ )
382
+ else:
383
+ # A bit faster when all lengths are 1
384
+ while indices.size(0) > 0:
385
+ uncompleted = is_word_start[indices + 1] == 0
386
+ indices = indices[uncompleted] + 1
387
+ mask_random = mask_random[uncompleted]
388
+ if replace_length != -1:
389
+ # delete token
390
+ to_keep[indices] = 0
391
+ else:
392
+ # keep index, but replace it with [MASK]
393
+ source[indices] = self.mask_id
394
+ source[indices[mask_random]] = torch.randint(
395
+ 1, self.vocab_size, size=(mask_random.sum(),)
396
+ )
397
+
398
+ assert source_length - 1 not in indices
399
+
400
+ source = source[to_keep]
401
+
402
+ if num_inserts > 0:
403
+ source = self.add_insertion_noise(source, num_inserts / source.size(0))
404
+
405
+ return source
406
+
407
+ def add_permuted_noise(self, tokens, p):
408
+ num_words = len(tokens)
409
+ num_to_permute = math.ceil(((num_words * 2) * p) / 2.0)
410
+ substitutions = torch.randperm(num_words - 2)[:num_to_permute] + 1
411
+ tokens[substitutions] = tokens[substitutions[torch.randperm(num_to_permute)]]
412
+ return tokens
413
+
414
+ def add_rolling_noise(self, tokens):
415
+ offset = np.random.randint(1, max(1, tokens.size(-1) - 1) + 1)
416
+ tokens = torch.cat(
417
+ (tokens[0:1], tokens[offset:-1], tokens[1:offset], tokens[-1:]),
418
+ dim=0,
419
+ )
420
+ return tokens
421
+
422
+ def add_insertion_noise(self, tokens, p):
423
+ if p == 0.0:
424
+ return tokens
425
+
426
+ num_tokens = len(tokens)
427
+ n = int(math.ceil(num_tokens * p))
428
+
429
+ noise_indices = torch.randperm(num_tokens + n - 2)[:n] + 1
430
+ noise_mask = torch.zeros(size=(num_tokens + n,), dtype=torch.bool)
431
+ noise_mask[noise_indices] = 1
432
+ result = torch.LongTensor(n + len(tokens)).fill_(-1)
433
+
434
+ num_random = int(math.ceil(n * self.random_ratio))
435
+ result[noise_indices[num_random:]] = self.mask_id
436
+ result[noise_indices[:num_random]] = torch.randint(
437
+ low=1, high=self.vocab_size, size=(num_random,)
438
+ )
439
+
440
+ result[~noise_mask] = tokens
441
+
442
+ assert (result >= 0).all()
443
+ return result
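A minimal, self-contained sketch (not part of the commit, token ids made up for illustration) of the padding convention used by `build_training_sample` above: encoder inputs are padded with `pad_id`, labels with `-100` so the loss ignores padding, and `prev_output_tokens` shifts the target right with `[SEP]` in front, as in fairseq.

```python
import torch

max_seq_length = 8
pad_id, sep_id = 0, 102                            # hypothetical ids for [PAD] and [SEP]

source = torch.tensor([101, 5, 6, 102, 7, 102])    # [CLS] w1 w2 [SEP] w3 [SEP]
target = source[1:].clone()                        # predict the next token

prev_output_tokens = torch.zeros_like(target)
prev_output_tokens[0] = sep_id                     # decoder starts from [SEP]
prev_output_tokens[1:] = target[:-1]

source_ = torch.full((max_seq_length,), pad_id, dtype=torch.long)
source_[:source.shape[0]] = source
target_ = torch.full((max_seq_length,), -100, dtype=torch.long)
target_[:target.shape[0]] = target

print(source_)                     # tensor([101,   5,   6, 102,   7, 102,   0,   0])
print(target_)                     # tensor([   5,   6, 102,   7, 102, -100, -100, -100])
print((source_ != pad_id).long())  # attention_mask: 1 on real tokens, 0 on padding
```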
fengshen/data/megatron_dataloader/bert_dataset.py ADDED
@@ -0,0 +1,196 @@
1
+ # coding=utf-8
2
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ """BERT Style dataset."""
17
+
18
+
19
+ import numpy as np
20
+ import torch
21
+
22
+ from fengshen.data.megatron_dataloader.dataset_utils import (
23
+ get_samples_mapping,
24
+ get_a_and_b_segments,
25
+ create_masked_lm_predictions,
26
+ create_tokens_and_tokentypes,
27
+ )
28
+
29
+
30
+ class BertDataset(torch.utils.data.Dataset):
31
+
32
+ def __init__(self, name, indexed_dataset, data_prefix,
33
+ num_epochs, max_num_samples, masked_lm_prob,
34
+ max_seq_length, short_seq_prob, seed, binary_head, tokenizer, masking_style):
35
+ # Params to store.
36
+ self.name = name
37
+ self.seed = seed
38
+ self.masked_lm_prob = masked_lm_prob
39
+ self.max_seq_length = max_seq_length
40
+ self.short_seq_prob = short_seq_prob
41
+ self.binary_head = binary_head
42
+ self.masking_style = masking_style
43
+
44
+ # Dataset.
45
+ self.indexed_dataset = indexed_dataset
46
+
47
+ # Build the samples mapping.
48
+ self.samples_mapping = get_samples_mapping(self.indexed_dataset,
49
+ data_prefix,
50
+ num_epochs,
51
+ max_num_samples,
52
+ # account for added tokens
53
+ self.max_seq_length - 3,
54
+ short_seq_prob,
55
+ self.seed,
56
+ self.name,
57
+ self.binary_head)
58
+ inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
59
+ self.vocab_id_list = list(inv_vocab.keys())
60
+ self.vocab_id_to_token_dict = inv_vocab
61
+ self.cls_id = tokenizer.cls_token_id
62
+ self.sep_id = tokenizer.sep_token_id
63
+ self.mask_id = tokenizer.mask_token_id
64
+ self.pad_id = tokenizer.pad_token_id
65
+ self.tokenizer = tokenizer
66
+
67
+ def __len__(self):
68
+ return self.samples_mapping.shape[0]
69
+
70
+ def __getitem__(self, idx):
71
+ start_idx, end_idx, seq_length = self.samples_mapping[idx]
72
+ sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)]
73
+ # Note that this rng state should be numpy and not python since
74
+ # python randint is inclusive whereas the numpy one is exclusive.
75
+ # We % 2**32 since numpy requires the seed to be between 0 and 2**32 - 1
76
+ np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32))
77
+ return build_training_sample(sample, seq_length,
78
+ self.max_seq_length, # needed for padding
79
+ self.vocab_id_list,
80
+ self.vocab_id_to_token_dict,
81
+ self.cls_id, self.sep_id,
82
+ self.mask_id, self.pad_id,
83
+ self.masked_lm_prob, np_rng,
84
+ self.binary_head,
85
+ tokenizer=self.tokenizer,
86
+ masking_style=self.masking_style)
87
+
88
+
89
+ def build_training_sample(sample,
90
+ target_seq_length, max_seq_length,
91
+ vocab_id_list, vocab_id_to_token_dict,
92
+ cls_id, sep_id, mask_id, pad_id,
93
+ masked_lm_prob, np_rng, binary_head,
94
+ tokenizer,
95
+ masking_style='bert'):
96
+ """Biuld training sample.
97
+
98
+ Arguments:
99
+ sample: A list of sentences in which each sentence is a list of token ids.
100
+ target_seq_length: Desired sequence length.
101
+ max_seq_length: Maximum length of the sequence. All values are padded to
102
+ this length.
103
+ vocab_id_list: List of vocabulary ids. Used to pick a random id.
104
+ vocab_id_to_token_dict: A dictionary from vocab ids to text tokens.
105
+ cls_id: Start of example id.
106
+ sep_id: Separator id.
107
+ mask_id: Mask token id.
108
+ pad_id: Padding token id.
109
+ masked_lm_prob: Probability to mask tokens.
110
+ np_rng: Random number generator. Note that this rng state should be
111
+ numpy and not python since python randint is inclusive for
112
+ the upper bound whereas the numpy one is exclusive.
113
+ """
114
+
115
+ if binary_head:
116
+ # We assume that we have at least two sentences in the sample
117
+ assert len(sample) > 1
118
+ assert target_seq_length <= max_seq_length
119
+
120
+ # Divide sample into two segments (A and B).
121
+ if binary_head:
122
+ tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample,
123
+ np_rng)
124
+ else:
125
+ tokens_a = []
126
+ for j in range(len(sample)):
127
+ tokens_a.extend(sample[j])
128
+ tokens_b = []
129
+ is_next_random = False
130
+
131
+ if len(tokens_a) >= max_seq_length-3:
132
+ tokens_a = tokens_a[:max_seq_length-3]
133
+
134
+ # Truncate to `target_sequence_length`.
135
+ max_num_tokens = target_seq_length
136
+ '''
137
+ truncated = truncate_segments(tokens_a, tokens_b, len(tokens_a),
138
+ len(tokens_b), max_num_tokens, np_rng)
139
+ '''
140
+
141
+ # Build tokens and tokentypes.
142
+ tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b,
143
+ cls_id, sep_id)
144
+ # Masking.
145
+ max_predictions_per_seq = masked_lm_prob * max_num_tokens
146
+ (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions(
147
+ tokens, vocab_id_list, vocab_id_to_token_dict, masked_lm_prob,
148
+ cls_id, sep_id, mask_id, max_predictions_per_seq, np_rng,
149
+ tokenizer=tokenizer,
150
+ masking_style=masking_style)
151
+
152
+ # Padding.
153
+ tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np \
154
+ = pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
155
+ masked_labels, pad_id, max_seq_length)
156
+
157
+ train_sample = {
158
+ 'input_ids': tokens_np,
159
+ 'token_type_ids': tokentypes_np,
160
+ 'labels': labels_np,
161
+ 'next_sentence_label': int(is_next_random),
162
+ 'attention_mask': padding_mask_np}
163
+ return train_sample
164
+
165
+
166
+ def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
167
+ masked_labels, pad_id, max_seq_length):
168
+ """Pad sequences and convert them to numpy."""
169
+
170
+ # Some checks.
171
+ num_tokens = len(tokens)
172
+ padding_length = max_seq_length - num_tokens
173
+ assert padding_length >= 0
174
+ assert len(tokentypes) == num_tokens
175
+ assert len(masked_positions) == len(masked_labels)
176
+
177
+ # Tokens and token types.
178
+ filler = [pad_id] * padding_length
179
+ tokens_np = np.array(tokens + filler, dtype=np.int64)
180
+ tokentypes_np = np.array(tokentypes + filler, dtype=np.int64)
181
+
182
+ # Padding mask.
183
+ padding_mask_np = np.array([1] * num_tokens + [0] * padding_length,
184
+ dtype=np.int64)
185
+
186
+ # Labels and loss mask.
187
+ labels = [-100] * max_seq_length
188
+ loss_mask = [0] * max_seq_length
189
+ for i in range(len(masked_positions)):
190
+ assert masked_positions[i] < num_tokens
191
+ labels[masked_positions[i]] = masked_labels[i]
192
+ loss_mask[masked_positions[i]] = 1
193
+ labels_np = np.array(labels, dtype=np.int64)
194
+ loss_mask_np = np.array(loss_mask, dtype=np.int64)
195
+
196
+ return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np
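A hedged usage sketch of `pad_and_convert_to_numpy` defined above (it assumes the module is importable as `fengshen.data.megatron_dataloader.bert_dataset`); the token ids are made up for illustration.

```python
from fengshen.data.megatron_dataloader.bert_dataset import pad_and_convert_to_numpy

tokens = [101, 21, 103, 23, 102]  # hypothetical ids: [CLS] w1 [MASK] w3 [SEP]
tokentypes = [0, 0, 0, 0, 0]
masked_positions = [2]            # index 2 was replaced by [MASK]
masked_labels = [22]              # its original id becomes the MLM label
pad_id, max_seq_length = 0, 8

tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np = \
    pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
                             masked_labels, pad_id, max_seq_length)

print(labels_np)        # -100 everywhere except index 2, which holds 22
print(padding_mask_np)  # [1 1 1 1 1 0 0 0]
print(loss_mask_np)     # [0 0 1 0 0 0 0 0]
```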
fengshen/data/megatron_dataloader/blendable_dataset.py ADDED
@@ -0,0 +1,64 @@
1
+ # coding=utf-8
2
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ """Blendable dataset."""
17
+
18
+ import time
19
+
20
+ import numpy as np
21
+ import torch
22
+
23
+ from fengshen.data.megatron_dataloader.utils import print_rank_0
24
+
25
+
26
+ class BlendableDataset(torch.utils.data.Dataset):
27
+
28
+ def __init__(self, datasets, weights):
29
+
30
+ self.datasets = datasets
31
+ num_datasets = len(datasets)
32
+ assert num_datasets == len(weights)
33
+
34
+ self.size = 0
35
+ for dataset in self.datasets:
36
+ self.size += len(dataset)
37
+
38
+ # Normalize weights.
39
+ weights = np.array(weights, dtype=np.float64)
40
+ sum_weights = np.sum(weights)
41
+ assert sum_weights > 0.0
42
+ weights /= sum_weights
43
+
44
+ # Build indices.
45
+ start_time = time.time()
46
+ assert num_datasets < 255
47
+ self.dataset_index = np.zeros(self.size, dtype=np.uint8)
48
+ self.dataset_sample_index = np.zeros(self.size, dtype=np.int64)
49
+
50
+ from fengshen.data.megatron_dataloader import helpers
51
+ helpers.build_blending_indices(self.dataset_index,
52
+ self.dataset_sample_index,
53
+ weights, num_datasets, self.size,
54
+ torch.distributed.get_rank() == 0)
55
+ print_rank_0('> elapsed time for building blendable dataset indices: '
56
+ '{:.2f} (sec)'.format(time.time() - start_time))
57
+
58
+ def __len__(self):
59
+ return self.size
60
+
61
+ def __getitem__(self, idx):
62
+ dataset_idx = self.dataset_index[idx]
63
+ sample_idx = self.dataset_sample_index[idx]
64
+ return self.datasets[dataset_idx][sample_idx]
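For reference, a pure-Python sketch (an illustrative assumption, not the compiled `helpers` extension) of the greedy rule behind `helpers.build_blending_indices`: at each step, draw the next sample from whichever dataset lags its target weight the most.

```python
import numpy as np

def blend_indices(weights, size):
    weights = np.asarray(weights, dtype=np.float64)
    weights /= weights.sum()                       # normalize, as BlendableDataset does
    current = np.zeros(len(weights), dtype=np.int64)
    dataset_index = np.zeros(size, dtype=np.uint8)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    for i in range(size):
        error = weights * max(i, 1) - current      # how far each dataset lags its target
        d = int(np.argmax(error))
        dataset_index[i] = d                       # which dataset to draw from
        dataset_sample_index[i] = current[d]       # which sample within that dataset
        current[d] += 1
    return dataset_index, dataset_sample_index

idx, sample_idx = blend_indices([0.7, 0.3], 10)
print(idx)  # 7 picks from dataset 0 and 3 from dataset 1, interleaved
```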
fengshen/data/megatron_dataloader/dataset_utils.py ADDED
@@ -0,0 +1,788 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The Google AI Language Team Authors, and NVIDIA.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+
17
+ # Most of the code here has been copied from:
18
+ # https://github.com/google-research/albert/blob/master/create_pretraining_data.py
19
+ # with some modifications.
20
+
21
+ import math
22
+ import time
23
+ import collections
24
+
25
+ import numpy as np
26
+ import re
27
+
28
+ from fengshen.data.megatron_dataloader.utils import (
29
+ print_rank_0
30
+ )
31
+ from fengshen.data.megatron_dataloader.blendable_dataset import BlendableDataset
32
+ from fengshen.data.megatron_dataloader.indexed_dataset import make_dataset as make_indexed_dataset
33
+
34
+ DSET_TYPE_BERT = 'standard_bert'
35
+ DSET_TYPE_ICT = 'ict'
36
+ DSET_TYPE_T5 = 't5'
37
+ DSET_TYPE_BERT_CN_WWM = 'bert_cn_wwm'
38
+ DSET_TYPE_BART = 'bart'
39
+ DSET_TYPE_COCOLM = 'coco_lm'
40
+
41
+ DSET_TYPES = [DSET_TYPE_BERT, DSET_TYPE_ICT,
42
+ DSET_TYPE_T5, DSET_TYPE_BERT_CN_WWM,
43
+ DSET_TYPE_BART, DSET_TYPE_COCOLM]
44
+
45
+
46
+ def get_datasets_weights_and_num_samples(data_prefix,
47
+ train_valid_test_num_samples):
48
+
49
+ # The data prefix should be in the format of:
50
+ # weight-1, data-prefix-1, weight-2, data-prefix-2, ..
51
+ assert len(data_prefix) % 2 == 0
52
+ num_datasets = len(data_prefix) // 2
53
+ weights = [0] * num_datasets
54
+ prefixes = [0] * num_datasets
55
+ for i in range(num_datasets):
56
+ weights[i] = float(data_prefix[2 * i])
57
+ prefixes[i] = (data_prefix[2 * i + 1]).strip()
58
+ # Normalize weights
59
+ weight_sum = 0.0
60
+ for weight in weights:
61
+ weight_sum += weight
62
+ assert weight_sum > 0.0
63
+ weights = [weight / weight_sum for weight in weights]
64
+
65
+ # Add 0.5% (the 1.005 factor) so in case the blending dataset does
66
+ # not uniformly distribute the number of samples, we still have
67
+ # samples left to feed to the network.
68
+ datasets_train_valid_test_num_samples = []
69
+ for weight in weights:
70
+ datasets_train_valid_test_num_samples.append(
71
+ [int(math.ceil(val * weight * 1.005))
72
+ for val in train_valid_test_num_samples])
73
+
74
+ return prefixes, weights, datasets_train_valid_test_num_samples
75
+
76
+
77
+ def compile_helper():
78
+ """Compile helper function ar runtime. Make sure this
79
+ is invoked on a single process."""
80
+ import os
81
+ import subprocess
82
+ path = os.path.abspath(os.path.dirname(__file__))
83
+ ret = subprocess.run(['make', '-C', path])
84
+ if ret.returncode != 0:
85
+ print("Making C++ dataset helpers module failed, exiting.")
86
+ import sys
87
+ sys.exit(1)
88
+
89
+
90
+ def get_a_and_b_segments(sample, np_rng):
91
+ """Divide sample into a and b segments."""
92
+
93
+ # Number of sentences in the sample.
94
+ n_sentences = len(sample)
95
+ # Make sure we always have two sentences.
96
+ assert n_sentences > 1, 'make sure each sample has at least two sentences.'
97
+
98
+ # First part:
99
+ # `a_end` is how many sentences go into the `A`.
100
+ a_end = 1
101
+ if n_sentences >= 3:
102
+ # Note that randint in numpy is exclusive.
103
+ a_end = np_rng.randint(1, n_sentences)
104
+ tokens_a = []
105
+ for j in range(a_end):
106
+ tokens_a.extend(sample[j])
107
+
108
+ # Second part:
109
+ tokens_b = []
110
+ for j in range(a_end, n_sentences):
111
+ tokens_b.extend(sample[j])
112
+
113
+ # Random next:
114
+ is_next_random = False
115
+ if np_rng.random() < 0.5:
116
+ is_next_random = True
117
+ tokens_a, tokens_b = tokens_b, tokens_a
118
+
119
+ return tokens_a, tokens_b, is_next_random
120
+
121
+
122
+ def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng):
123
+ """Truncates a pair of sequences to a maximum sequence length."""
124
+ # print(len_a, len_b, max_num_tokens)
125
+ assert len_a > 0
126
+ if len_a + len_b <= max_num_tokens:
127
+ return False
128
+ while len_a + len_b > max_num_tokens:
129
+ if len_a > len_b:
130
+ len_a -= 1
131
+ tokens = tokens_a
132
+ else:
133
+ len_b -= 1
134
+ tokens = tokens_b
135
+ if np_rng.random() < 0.5:
136
+ del tokens[0]
137
+ else:
138
+ tokens.pop()
139
+ return True
140
+
141
+
142
+ def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
143
+ """Merge segments A and B, add [CLS] and [SEP] and build tokentypes."""
144
+
145
+ tokens = []
146
+ tokentypes = []
147
+ # [CLS].
148
+ tokens.append(cls_id)
149
+ tokentypes.append(0)
150
+ # Segment A.
151
+ for token in tokens_a:
152
+ tokens.append(token)
153
+ tokentypes.append(0)
154
+ # [SEP].
155
+ tokens.append(sep_id)
156
+ tokentypes.append(0)
157
+ # Segment B.
158
+ for token in tokens_b:
159
+ tokens.append(token)
160
+ tokentypes.append(1)
161
+ if tokens_b:
162
+ # [SEP].
163
+ tokens.append(sep_id)
164
+ tokentypes.append(1)
165
+
166
+ return tokens, tokentypes
167
+
168
+
169
+ MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
170
+ ["index", "label"])
171
+
172
+
173
+ def is_start_piece(piece):
174
+ """Check if the current word piece is the starting piece (BERT)."""
175
+ # When a word has been split into
176
+ # WordPieces, the first token does not have any marker and any subsequent
177
+ # tokens are prefixed with ##. So whenever we see the ## token, we
178
+ # append it to the previous set of word indexes.
179
+ return not piece.startswith("##")
180
+
181
+
182
+ def create_masked_lm_predictions(tokens,
183
+ vocab_id_list, vocab_id_to_token_dict,
184
+ masked_lm_prob,
185
+ cls_id, sep_id, mask_id,
186
+ max_predictions_per_seq,
187
+ np_rng,
188
+ tokenizer,
189
+ max_ngrams=3,
190
+ do_whole_word_mask=True,
191
+ favor_longer_ngram=False,
192
+ do_permutation=False,
193
+ geometric_dist=False,
194
+ masking_style="bert",
195
+ zh_tokenizer=None):
196
+ """Creates the predictions for the masked LM objective.
197
+ Note: Tokens here are vocab ids and not text tokens."""
198
+
199
+ cand_indexes = []
200
+ # Note(mingdachen): We create a list for recording if the piece is
201
+ # the starting piece of current token, where 1 means true, so that
202
+ # on-the-fly whole word masking is possible.
203
+ token_boundary = [0] * len(tokens)
204
+
205
+ # if no Chinese tokenizer is given, fall back to the ## convention
206
+ if zh_tokenizer is None:
207
+ for (i, token) in enumerate(tokens):
208
+ if token == cls_id or token == sep_id:
209
+ token_boundary[i] = 1
210
+ continue
211
+ # Whole Word Masking means that we mask all of the wordpieces
212
+ # corresponding to an original word.
213
+ #
214
+ # Note that Whole Word Masking does *not* change the training code
215
+ # at all -- we still predict each WordPiece independently, softmaxed
216
+ # over the entire vocabulary.
217
+ if (do_whole_word_mask and len(cand_indexes) >= 1 and
218
+ not is_start_piece(vocab_id_to_token_dict[token])):
219
+ cand_indexes[-1].append(i)
220
+ else:
221
+ cand_indexes.append([i])
222
+ if is_start_piece(vocab_id_to_token_dict[token]):
223
+ token_boundary[i] = 1
224
+ else:
225
+ # if a Chinese tokenizer is given, segment the text with it first, then decide word boundaries
226
+ # get the raw text with CLS and SEP removed
227
+ raw_tokens = []
228
+ for t in tokens:
229
+ if t != cls_id and t != sep_id:
230
+ raw_tokens.append(t)
231
+ raw_tokens = [vocab_id_to_token_dict[i] for i in raw_tokens]
232
+ # segment the text, then record for each leading character the length of the longest word starting with it
233
+ word_list = set(zh_tokenizer(''.join(raw_tokens), HMM=True))
234
+ word_length_dict = {}
235
+ for w in word_list:
236
+ if len(w) < 1:
237
+ continue
238
+ if w[0] not in word_length_dict:
239
+ word_length_dict[w[0]] = len(w)
240
+ elif word_length_dict[w[0]] < len(w):
241
+ word_length_dict[w[0]] = len(w)
242
+ i = 0
243
+ # look the tokens up in the word list
244
+ while i < len(tokens):
245
+ token_id = tokens[i]
246
+ token = vocab_id_to_token_dict[token_id]
247
+ if len(token) == 0 or token_id == cls_id or token_id == sep_id:
248
+ token_boundary[i] = 1
249
+ i += 1
250
+ continue
251
+ word_max_length = 1
252
+ if token[0] in word_length_dict:
253
+ word_max_length = word_length_dict[token[0]]
254
+ j = 0
255
+ word = ''
256
+ word_end = i+1
257
+ # keep compatibility with the old ## form: if the following pieces start with ##, merge them into the preceding token as one word
258
+ old_style = False
259
+ while word_end < len(tokens) and vocab_id_to_token_dict[tokens[word_end]].startswith('##'):
260
+ old_style = True
261
+ word_end += 1
262
+ if not old_style:
263
+ while j < word_max_length and i+j < len(tokens):
264
+ cur_token = tokens[i+j]
265
+ word += vocab_id_to_token_dict[cur_token]
266
+ j += 1
267
+ if word in word_list:
268
+ word_end = i+j
269
+ cand_indexes.append([p for p in range(i, word_end)])
270
+ token_boundary[i] = 1
271
+ i = word_end
272
+
273
+ output_tokens = list(tokens)
274
+ # add by ganruyi
275
+ if masking_style == 'bert-cn-wwm':
276
+ # if non chinese is False, that means it is chinese
277
+ # then try to remove "##" which is added previously
278
+ new_token_ids = []
279
+ for token_id in output_tokens:
280
+ token = tokenizer.convert_ids_to_tokens([token_id])[0]
281
+ if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0:
282
+ token = token[2:]
283
+ new_token_id = tokenizer.convert_tokens_to_ids([token])[
284
+ 0]
285
+ new_token_ids.append(new_token_id)
286
+ output_tokens = new_token_ids
287
+
288
+ masked_lm_positions = []
289
+ masked_lm_labels = []
290
+
291
+ if masked_lm_prob == 0:
292
+ return (output_tokens, masked_lm_positions,
293
+ masked_lm_labels, token_boundary)
294
+
295
+ num_to_predict = min(max_predictions_per_seq,
296
+ max(1, int(round(len(tokens) * masked_lm_prob))))
297
+
298
+ ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64)
299
+ if not geometric_dist:
300
+ # Note(mingdachen):
301
+ # By default, we set the probabilities to favor shorter ngram sequences.
302
+ pvals = 1. / np.arange(1, max_ngrams + 1)
303
+ pvals /= pvals.sum(keepdims=True)
304
+ if favor_longer_ngram:
305
+ pvals = pvals[::-1]
306
+ # build the ngram index: for each word, record its candidate ngram word spans
307
+ ngram_indexes = []
308
+ for idx in range(len(cand_indexes)):
309
+ ngram_index = []
310
+ for n in ngrams:
311
+ ngram_index.append(cand_indexes[idx:idx + n])
312
+ ngram_indexes.append(ngram_index)
313
+
314
+ np_rng.shuffle(ngram_indexes)
315
+
316
+ (masked_lms, masked_spans) = ([], [])
317
+ covered_indexes = set()
318
+ for cand_index_set in ngram_indexes:
319
+ if len(masked_lms) >= num_to_predict:
320
+ break
321
+ if not cand_index_set:
322
+ continue
323
+ # Note(mingdachen):
324
+ # Skip current piece if they are covered in lm masking or previous ngrams.
325
+ for index_set in cand_index_set[0]:
326
+ for index in index_set:
327
+ if index in covered_indexes:
328
+ continue
329
+
330
+ if not geometric_dist:
331
+ n = np_rng.choice(ngrams[:len(cand_index_set)],
332
+ p=pvals[:len(cand_index_set)] /
333
+ pvals[:len(cand_index_set)].sum(keepdims=True))
334
+ else:
335
+ # Sampling "n" from the geometric distribution and clipping it to
336
+ # the max_ngrams. Using p=0.2 default from the SpanBERT paper
337
+ # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1)
338
+ n = min(np_rng.geometric(0.2), max_ngrams)
339
+
340
+ index_set = sum(cand_index_set[n - 1], [])
341
+ n -= 1
342
+ # Note(mingdachen):
343
+ # Repeatedly looking for a candidate that does not exceed the
344
+ # maximum number of predictions by trying shorter ngrams.
345
+ while len(masked_lms) + len(index_set) > num_to_predict:
346
+ if n == 0:
347
+ break
348
+ index_set = sum(cand_index_set[n - 1], [])
349
+ n -= 1
350
+ # If adding a whole-word mask would exceed the maximum number of
351
+ # predictions, then just skip this candidate.
352
+ if len(masked_lms) + len(index_set) > num_to_predict:
353
+ continue
354
+ is_any_index_covered = False
355
+ for index in index_set:
356
+ if index in covered_indexes:
357
+ is_any_index_covered = True
358
+ break
359
+ if is_any_index_covered:
360
+ continue
361
+ for index in index_set:
362
+ covered_indexes.add(index)
363
+ masked_token = None
364
+ if masking_style == "bert":
365
+ # 80% of the time, replace with [MASK]
366
+ if np_rng.random() < 0.8:
367
+ masked_token = mask_id
368
+ else:
369
+ # 10% of the time, keep original
370
+ if np_rng.random() < 0.5:
371
+ masked_token = tokens[index]
372
+ # 10% of the time, replace with random word
373
+ else:
374
+ masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))]
375
+ elif masking_style == 'bert-cn-wwm':
376
+ # 80% of the time, replace with [MASK]
377
+ if np_rng.random() < 0.8:
378
+ masked_token = mask_id
379
+ else:
380
+ # 10% of the time, keep original
381
+ if np_rng.random() < 0.5:
382
+ # for Chinese whole word masking, strip the ## prefix from the token
383
+ token_id = tokens[index]
384
+ token = tokenizer.convert_ids_to_tokens([token_id])[
385
+ 0]
386
+ if len(re.findall('##[\u4E00-\u9FA5]', token)) > 0:
387
+ token = token[2:]
388
+ new_token_id = tokenizer.convert_tokens_to_ids([token])[
389
+ 0]
390
+ masked_token = new_token_id
391
+ # 10% of the time, replace with random word
392
+ else:
393
+ masked_token = vocab_id_list[np_rng.randint(
394
+ 0, len(vocab_id_list))]
395
+ elif masking_style == "t5":
396
+ masked_token = mask_id
397
+ else:
398
+ raise ValueError("invalid value of masking style")
399
+
400
+ output_tokens[index] = masked_token
401
+ masked_lms.append(MaskedLmInstance(
402
+ index=index, label=tokens[index]))
403
+
404
+ masked_spans.append(MaskedLmInstance(
405
+ index=index_set,
406
+ label=[tokens[index] for index in index_set]))
407
+
408
+ assert len(masked_lms) <= num_to_predict
409
+ np_rng.shuffle(ngram_indexes)
410
+
411
+ select_indexes = set()
412
+ if do_permutation:
413
+ for cand_index_set in ngram_indexes:
414
+ if len(select_indexes) >= num_to_predict:
415
+ break
416
+ if not cand_index_set:
417
+ continue
418
+ # Note(mingdachen):
419
+ # Skip current piece if they are covered in lm masking or previous ngrams.
420
+ for index_set in cand_index_set[0]:
421
+ for index in index_set:
422
+ if index in covered_indexes or index in select_indexes:
423
+ continue
424
+
425
+ n = np.random.choice(ngrams[:len(cand_index_set)],
426
+ p=pvals[:len(cand_index_set)] /
427
+ pvals[:len(cand_index_set)].sum(keepdims=True))
428
+ index_set = sum(cand_index_set[n - 1], [])
429
+ n -= 1
430
+
431
+ while len(select_indexes) + len(index_set) > num_to_predict:
432
+ if n == 0:
433
+ break
434
+ index_set = sum(cand_index_set[n - 1], [])
435
+ n -= 1
436
+ # If adding a whole-word mask would exceed the maximum number of
437
+ # predictions, then just skip this candidate.
438
+ if len(select_indexes) + len(index_set) > num_to_predict:
439
+ continue
440
+ is_any_index_covered = False
441
+ for index in index_set:
442
+ if index in covered_indexes or index in select_indexes:
443
+ is_any_index_covered = True
444
+ break
445
+ if is_any_index_covered:
446
+ continue
447
+ for index in index_set:
448
+ select_indexes.add(index)
449
+ assert len(select_indexes) <= num_to_predict
450
+
451
+ select_indexes = sorted(select_indexes)
452
+ permute_indexes = list(select_indexes)
453
+ np_rng.shuffle(permute_indexes)
454
+ orig_token = list(output_tokens)
455
+
456
+ for src_i, tgt_i in zip(select_indexes, permute_indexes):
457
+ output_tokens[src_i] = orig_token[tgt_i]
458
+ masked_lms.append(MaskedLmInstance(
459
+ index=src_i, label=orig_token[src_i]))
460
+
461
+ masked_lms = sorted(masked_lms, key=lambda x: x.index)
462
+ # Sort the spans by the index of the first span
463
+ masked_spans = sorted(masked_spans, key=lambda x: x.index[0])
464
+
465
+ for p in masked_lms:
466
+ masked_lm_positions.append(p.index)
467
+ masked_lm_labels.append(p.label)
468
+ return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans)
469
+
470
+
471
+ def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
472
+ masked_labels, pad_id, max_seq_length):
473
+ """Pad sequences and convert them to numpy."""
474
+
475
+ # Some checks.
476
+ num_tokens = len(tokens)
477
+ padding_length = max_seq_length - num_tokens
478
+ assert padding_length >= 0
479
+ assert len(tokentypes) == num_tokens
480
+ assert len(masked_positions) == len(masked_labels)
481
+
482
+ # Tokens and token types.
483
+ filler = [pad_id] * padding_length
484
+ tokens_np = np.array(tokens + filler, dtype=np.int64)
485
+ tokentypes_np = np.array(tokentypes + filler, dtype=np.int64)
486
+
487
+ # Padding mask.
488
+ padding_mask_np = np.array([1] * num_tokens + [0] * padding_length,
489
+ dtype=np.int64)
490
+
491
+ # Labels and loss mask.
492
+ labels = [-1] * max_seq_length
493
+ loss_mask = [0] * max_seq_length
494
+ for i in range(len(masked_positions)):
495
+ assert masked_positions[i] < num_tokens
496
+ labels[masked_positions[i]] = masked_labels[i]
497
+ loss_mask[masked_positions[i]] = 1
498
+ labels_np = np.array(labels, dtype=np.int64)
499
+ loss_mask_np = np.array(loss_mask, dtype=np.int64)
500
+
501
+ return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np
502
+
503
+
504
+ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
505
+ train_valid_test_num_samples,
506
+ max_seq_length,
507
+ masked_lm_prob, short_seq_prob, seed,
508
+ tokenizer,
509
+ skip_warmup, binary_head=False,
510
+ max_seq_length_dec=None,
511
+ dataset_type='standard_bert',
512
+ zh_tokenizer=None,
513
+ span=None):
514
+
515
+ if len(data_prefix) == 1:
516
+ return _build_train_valid_test_datasets(data_prefix[0],
517
+ data_impl, splits_string,
518
+ train_valid_test_num_samples,
519
+ max_seq_length, masked_lm_prob,
520
+ short_seq_prob, seed,
521
+ skip_warmup,
522
+ binary_head,
523
+ max_seq_length_dec,
524
+ tokenizer,
525
+ dataset_type=dataset_type,
526
+ zh_tokenizer=zh_tokenizer,
527
+ span=span)
528
+ # Blending dataset.
529
+ # Parse the values.
530
+ output = get_datasets_weights_and_num_samples(data_prefix,
531
+ train_valid_test_num_samples)
532
+ prefixes, weights, datasets_train_valid_test_num_samples = output
533
+
534
+ # Build individual datasets.
535
+ train_datasets = []
536
+ valid_datasets = []
537
+ test_datasets = []
538
+ for i in range(len(prefixes)):
539
+ train_ds, valid_ds, test_ds = _build_train_valid_test_datasets(
540
+ prefixes[i], data_impl, splits_string,
541
+ datasets_train_valid_test_num_samples[i],
542
+ max_seq_length, masked_lm_prob, short_seq_prob,
543
+ seed, skip_warmup, binary_head, max_seq_length_dec,
544
+ tokenizer, dataset_type=dataset_type, zh_tokenizer=zh_tokenizer)
545
+ if train_ds:
546
+ train_datasets.append(train_ds)
547
+ if valid_ds:
548
+ valid_datasets.append(valid_ds)
549
+ if test_ds:
550
+ test_datasets.append(test_ds)
551
+
552
+ # Blend.
553
+ blending_train_dataset = None
554
+ if train_datasets:
555
+ blending_train_dataset = BlendableDataset(train_datasets, weights)
556
+ blending_valid_dataset = None
557
+ if valid_datasets:
558
+ blending_valid_dataset = BlendableDataset(valid_datasets, weights)
559
+ blending_test_dataset = None
560
+ if test_datasets:
561
+ blending_test_dataset = BlendableDataset(test_datasets, weights)
562
+
563
+ return (blending_train_dataset, blending_valid_dataset,
564
+ blending_test_dataset)
565
+
566
+
567
+ def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
568
+ train_valid_test_num_samples,
569
+ max_seq_length,
570
+ masked_lm_prob, short_seq_prob, seed,
571
+ skip_warmup, binary_head,
572
+ max_seq_length_dec,
573
+ tokenizer,
574
+ dataset_type='standard_bert',
575
+ zh_tokenizer=None,
576
+ span=None):
577
+
578
+ if dataset_type not in DSET_TYPES:
579
+ raise ValueError("Invalid dataset_type: ", dataset_type)
580
+
581
+ # Indexed dataset.
582
+ indexed_dataset = get_indexed_dataset_(data_prefix,
583
+ data_impl,
584
+ skip_warmup)
585
+
586
+ # Get start and end indices of train/valid/test into doc-idx
587
+ # Note that doc-idx is designed to be num-docs + 1 so we can
588
+ # easily iterate over it.
589
+ total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1
590
+ splits = get_train_valid_test_split_(splits_string, total_num_of_documents)
591
+
592
+ # Print stats about the splits.
593
+ print_rank_0(' > dataset split:')
594
+
595
+ def print_split_stats(name, index):
596
+ print_rank_0(' {}:'.format(name))
597
+ print_rank_0(' document indices in [{}, {}) total of {} '
598
+ 'documents'.format(splits[index], splits[index + 1],
599
+ splits[index + 1] - splits[index]))
600
+ start_index = indexed_dataset.doc_idx[splits[index]]
601
+ end_index = indexed_dataset.doc_idx[splits[index + 1]]
602
+ print_rank_0(' sentence indices in [{}, {}) total of {} '
603
+ 'sentences'.format(start_index, end_index,
604
+ end_index - start_index))
605
+ print_split_stats('train', 0)
606
+ print_split_stats('validation', 1)
607
+ print_split_stats('test', 2)
608
+
609
+ def build_dataset(index, name):
610
+ from fengshen.data.megatron_dataloader.bert_dataset import BertDataset
611
+ from fengshen.data.megatron_dataloader.bart_dataset import BartDataset
612
+ from fengshen.data.megatron_dataloader.cocolm_dataset import COCOLMDataset
613
+ dataset = None
614
+ if splits[index + 1] > splits[index]:
615
+ # Get the pointer to the original doc-idx so we can set it later.
616
+ doc_idx_ptr = indexed_dataset.get_doc_idx()
617
+ # Slice the doc-idx
618
+ start_index = splits[index]
619
+ # Add +1 so we can index into the dataset to get the upper bound.
620
+ end_index = splits[index + 1] + 1
621
+ # New doc_idx view.
622
+ indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index])
623
+ # Build the dataset accordingly.
624
+ kwargs = dict(
625
+ name=name,
626
+ data_prefix=data_prefix,
627
+ num_epochs=None,
628
+ max_num_samples=train_valid_test_num_samples[index],
629
+ max_seq_length=max_seq_length,
630
+ seed=seed,
631
+ )
632
+
633
+ if dataset_type == DSET_TYPE_BERT or dataset_type == DSET_TYPE_BERT_CN_WWM:
634
+ dataset = BertDataset(
635
+ indexed_dataset=indexed_dataset,
636
+ masked_lm_prob=masked_lm_prob,
637
+ short_seq_prob=short_seq_prob,
638
+ binary_head=binary_head,
639
+ # extra argument to distinguish bert from bert-cn-wwm
640
+ tokenizer=tokenizer,
641
+ masking_style='bert' if dataset_type == DSET_TYPE_BERT else 'bert-cn-wwm',
642
+ **kwargs
643
+ )
644
+ elif dataset_type == DSET_TYPE_BART:
645
+ dataset = BartDataset(
646
+ indexed_dataset=indexed_dataset,
647
+ masked_lm_prob=masked_lm_prob,
648
+ short_seq_prob=short_seq_prob,
649
+ tokenizer=tokenizer,
650
+ zh_tokenizer=zh_tokenizer,
651
+ **kwargs
652
+ )
653
+ elif dataset_type == DSET_TYPE_COCOLM:
654
+ dataset = COCOLMDataset(
655
+ indexed_dataset=indexed_dataset,
656
+ masked_lm_prob=masked_lm_prob,
657
+ short_seq_prob=short_seq_prob,
658
+ tokenizer=tokenizer,
659
+ masking_style='bert',
660
+ span=span,
661
+ **kwargs
662
+ )
663
+ else:
664
+ raise NotImplementedError(
665
+ "Dataset type not fully implemented.")
666
+
667
+ # Set the original pointer so dataset remains the main dataset.
668
+ indexed_dataset.set_doc_idx(doc_idx_ptr)
669
+ # Checks.
670
+ assert indexed_dataset.doc_idx[0] == 0
671
+ assert indexed_dataset.doc_idx.shape[0] == \
672
+ (total_num_of_documents + 1)
673
+ return dataset
674
+
675
+ train_dataset = build_dataset(0, 'train')
676
+ valid_dataset = build_dataset(1, 'valid')
677
+ test_dataset = build_dataset(2, 'test')
678
+
679
+ return (train_dataset, valid_dataset, test_dataset)
680
+
681
+
682
+ def get_indexed_dataset_(data_prefix, data_impl, skip_warmup):
683
+
684
+ print_rank_0(' > building dataset index ...')
685
+
686
+ start_time = time.time()
687
+ indexed_dataset = make_indexed_dataset(data_prefix,
688
+ data_impl,
689
+ skip_warmup)
690
+ assert indexed_dataset.sizes.shape[0] == indexed_dataset.doc_idx[-1]
691
+ print_rank_0(' > finished creating indexed dataset in {:4f} '
692
+ 'seconds'.format(time.time() - start_time))
693
+
694
+ print_rank_0(' > indexed dataset stats:')
695
+ print_rank_0(' number of documents: {}'.format(
696
+ indexed_dataset.doc_idx.shape[0] - 1))
697
+ print_rank_0(' number of sentences: {}'.format(
698
+ indexed_dataset.sizes.shape[0]))
699
+
700
+ return indexed_dataset
701
+
702
+
703
+ def get_train_valid_test_split_(splits_string, size):
704
+ """ Get dataset splits from comma or '/' separated string list."""
705
+
706
+ splits = []
707
+ if splits_string.find(',') != -1:
708
+ splits = [float(s) for s in splits_string.split(',')]
709
+ elif splits_string.find('/') != -1:
710
+ splits = [float(s) for s in splits_string.split('/')]
711
+ else:
712
+ splits = [float(splits_string)]
713
+ while len(splits) < 3:
714
+ splits.append(0.)
715
+ splits = splits[:3]
716
+ splits_sum = sum(splits)
717
+ assert splits_sum > 0.0
718
+ splits = [split / splits_sum for split in splits]
719
+ splits_index = [0]
720
+ for index, split in enumerate(splits):
721
+ splits_index.append(splits_index[index] +
722
+ int(round(split * float(size))))
723
+ diff = splits_index[-1] - size
724
+ for index in range(1, len(splits_index)):
725
+ splits_index[index] -= diff
726
+ assert len(splits_index) == 4
727
+ assert splits_index[-1] == size
728
+ return splits_index
729
+
730
+
731
+ def get_samples_mapping(indexed_dataset,
732
+ data_prefix,
733
+ num_epochs,
734
+ max_num_samples,
735
+ max_seq_length,
736
+ short_seq_prob,
737
+ seed,
738
+ name,
739
+ binary_head):
740
+ """Get a list that maps a sample index to a starting
741
+ sentence index, end sentence index, and length"""
742
+
743
+ if not num_epochs:
744
+ if not max_num_samples:
745
+ raise ValueError("Need to specify either max_num_samples "
746
+ "or num_epochs")
747
+ num_epochs = np.iinfo(np.int32).max - 1
748
+ if not max_num_samples:
749
+ max_num_samples = np.iinfo(np.int64).max - 1
750
+
751
+ # Filename of the index mapping
752
+ indexmap_filename = data_prefix
753
+ indexmap_filename += '_{}_indexmap'.format(name)
754
+ if num_epochs != (np.iinfo(np.int32).max - 1):
755
+ indexmap_filename += '_{}ep'.format(num_epochs)
756
+ if max_num_samples != (np.iinfo(np.int64).max - 1):
757
+ indexmap_filename += '_{}mns'.format(max_num_samples)
758
+ indexmap_filename += '_{}msl'.format(max_seq_length)
759
+ indexmap_filename += '_{:0.2f}ssp'.format(short_seq_prob)
760
+ indexmap_filename += '_{}s'.format(seed)
761
+ indexmap_filename += '.npy'
762
+
763
+ # This should be a barrier but nccl barrier assumes
764
+ # device_index=rank which is not the case for model
765
+ # parallel case
766
+ # ganruyi comment
767
+ # counts = torch.cuda.LongTensor([1])
768
+ # torch.distributed.all_reduce(
769
+ # counts, group=mpu.get_data_parallel_group())
770
+ # torch.distributed.all_reduce(
771
+ # counts, group=mpu.get_pipeline_model_parallel_group())
772
+ # assert counts[0].item() == (
773
+ # torch.distributed.get_world_size() //
774
+ # torch.distributed.get_world_size(
775
+ # group=mpu.get_tensor_model_parallel_group()))
776
+
777
+ # Load indexed dataset.
778
+ print_rank_0(' > loading indexed mapping from {}'.format(
779
+ indexmap_filename))
780
+ start_time = time.time()
781
+ samples_mapping = np.load(
782
+ indexmap_filename, allow_pickle=True, mmap_mode='r')
783
+ print_rank_0(' loaded indexed file in {:3.3f} seconds'.format(
784
+ time.time() - start_time))
785
+ print_rank_0(' total number of samples: {}'.format(
786
+ samples_mapping.shape[0]))
787
+
788
+ return samples_mapping
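A hedged usage sketch of `get_train_valid_test_split_` above (assuming the module is importable as `fengshen.data.megatron_dataloader.dataset_utils`): the splits string is a comma- or slash-separated weight list that is normalized over the document count.

```python
from fengshen.data.megatron_dataloader.dataset_utils import get_train_valid_test_split_

# "949,50,1" means 94.9% train, 5% validation, 0.1% test over 10000 documents.
print(get_train_valid_test_split_('949,50,1', 10000))  # [0, 9490, 9990, 10000]
```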
fengshen/data/megatron_dataloader/helpers.cpp ADDED
@@ -0,0 +1,794 @@
1
+ /*
2
+ coding=utf-8
3
+ Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
4
+
5
+ Licensed under the Apache License, Version 2.0 (the "License");
6
+ you may not use this file except in compliance with the License.
7
+ You may obtain a copy of the License at
8
+
9
+ http://www.apache.org/licenses/LICENSE-2.0
10
+
11
+ Unless required by applicable law or agreed to in writing, software
12
+ distributed under the License is distributed on an "AS IS" BASIS,
13
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ See the License for the specific language governing permissions and
15
+ limitations under the License.
16
+ */
17
+
18
+ /* Helper methods for fast index mapping builds */
19
+
20
+ #include <algorithm>
21
+ #include <iostream>
22
+ #include <limits>
23
+ #include <math.h>
24
+ #include <stdexcept>
25
+ #include <pybind11/pybind11.h>
26
+ #include <pybind11/numpy.h>
27
+ #include <random>
28
+
29
+ namespace py = pybind11;
30
+ using namespace std;
31
+
32
+ const int32_t LONG_SENTENCE_LEN = 512;
33
+
34
+ void build_blending_indices(py::array_t<uint8_t> &dataset_index,
35
+ py::array_t<int64_t> &dataset_sample_index,
36
+ const py::array_t<double> &weights,
37
+ const int32_t num_datasets,
38
+ const int64_t size, const bool verbose)
39
+ {
40
+ /* Given multiple datasets and a weighting array, build samples
41
+ such that it follows those wieghts.*/
42
+
43
+ if (verbose)
44
+ {
45
+ std::cout << "> building indices for blendable datasets ..." << std::endl;
46
+ }
47
+
48
+ // Get the pointer access without the checks.
49
+ auto dataset_index_ptr = dataset_index.mutable_unchecked<1>();
50
+ auto dataset_sample_index_ptr = dataset_sample_index.mutable_unchecked<1>();
51
+ auto weights_ptr = weights.unchecked<1>();
52
+
53
+ // Initialize buffer for number of samples used for each dataset.
54
+ int64_t current_samples[num_datasets];
55
+ for (int64_t i = 0; i < num_datasets; ++i)
56
+ {
57
+ current_samples[i] = 0;
58
+ }
59
+
60
+ // For each sample:
61
+ for (int64_t sample_idx = 0; sample_idx < size; ++sample_idx)
62
+ {
63
+
64
+ // Determine where the max error in sampling is happening.
65
+ auto sample_idx_double = std::max(static_cast<double>(sample_idx), 1.0);
66
+ int64_t max_error_index = 0;
67
+ double max_error = weights_ptr[0] * sample_idx_double -
68
+ static_cast<double>(current_samples[0]);
69
+ for (int64_t dataset_idx = 1; dataset_idx < num_datasets; ++dataset_idx)
70
+ {
71
+ double error = weights_ptr[dataset_idx] * sample_idx_double -
72
+ static_cast<double>(current_samples[dataset_idx]);
73
+ if (error > max_error)
74
+ {
75
+ max_error = error;
76
+ max_error_index = dataset_idx;
77
+ }
78
+ }
79
+
80
+ // Populate the indices.
81
+ dataset_index_ptr[sample_idx] = static_cast<uint8_t>(max_error_index);
82
+ dataset_sample_index_ptr[sample_idx] = current_samples[max_error_index];
83
+
84
+ // Update the total samples.
85
+ current_samples[max_error_index] += 1;
86
+ }
87
+
88
+ // print info
89
+ if (verbose)
90
+ {
91
+ std::cout << " > sample ratios:" << std::endl;
92
+ for (int64_t dataset_idx = 0; dataset_idx < num_datasets; ++dataset_idx)
93
+ {
94
+ auto ratio = static_cast<double>(current_samples[dataset_idx]) /
95
+ static_cast<double>(size);
96
+ std::cout << " dataset " << dataset_idx << ", input: " << weights_ptr[dataset_idx] << ", achieved: " << ratio << std::endl;
97
+ }
98
+ }
99
+ }
100
+
101
+ py::array build_sample_idx(const py::array_t<int32_t> &sizes_,
102
+ const py::array_t<int32_t> &doc_idx_,
103
+ const int32_t seq_length,
104
+ const int32_t num_epochs,
105
+ const int64_t tokens_per_epoch)
106
+ {
107
+ /* Sample index (sample_idx) is used for GPT-2-like datasets for which
108
+ the documents are flattened and the samples are built based on this
109
+ 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2]
110
+ where [..., 0] contains the index into `doc_idx` and [..., 1] is the
111
+ starting offset in that document.*/
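+ /* A consumer can then rebuild sample i by reading tokens from document
+    doc_idx[sample_idx[i][0]] starting at offset sample_idx[i][1] through
+    document doc_idx[sample_idx[i+1][0]] ending at offset sample_idx[i+1][1],
+    which yields seq_length + 1 tokens (inputs plus shifted labels). */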
112
+
113
+ // Consistency checks.
114
+ assert(seq_length > 1);
115
+ assert(num_epochs > 0);
116
+ assert(tokens_per_epoch > 1);
117
+
118
+ // Remove bound checks.
119
+ auto sizes = sizes_.unchecked<1>();
120
+ auto doc_idx = doc_idx_.unchecked<1>();
121
+
122
+ // Mapping and its length (1D).
123
+ int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length;
124
+ int32_t *sample_idx = new int32_t[2 * (num_samples + 1)];
125
+
126
+ cout << " using:" << endl
127
+ << std::flush;
128
+ cout << " number of documents: " << doc_idx_.shape(0) / num_epochs << endl
129
+ << std::flush;
130
+ cout << " number of epochs: " << num_epochs << endl
131
+ << std::flush;
132
+ cout << " sequence length: " << seq_length << endl
133
+ << std::flush;
134
+ cout << " total number of samples: " << num_samples << endl
135
+ << std::flush;
136
+
137
+ // Index into sample_idx.
138
+ int64_t sample_index = 0;
139
+ // Index into doc_idx.
140
+ int64_t doc_idx_index = 0;
141
+ // Beginning offset for each document.
142
+ int32_t doc_offset = 0;
143
+ // Start with first document and no offset.
144
+ sample_idx[2 * sample_index] = doc_idx_index;
145
+ sample_idx[2 * sample_index + 1] = doc_offset;
146
+ ++sample_index;
147
+
148
+ while (sample_index <= num_samples)
149
+ {
150
+ // Start with a fresh sequence.
151
+ int32_t remaining_seq_length = seq_length + 1;
152
+ while (remaining_seq_length != 0)
153
+ {
154
+ // Get the document length.
155
+ auto doc_id = doc_idx[doc_idx_index];
156
+ auto doc_length = sizes[doc_id] - doc_offset;
157
+ // And add it to the current sequence.
158
+ remaining_seq_length -= doc_length;
159
+ // If we have more than a full sequence, adjust offset and set
160
+ // remaining length to zero so we return from the while loop.
161
+ // Note that -1 here is for the same reason we have -1 in
162
+ // `_num_epochs` calculations.
163
+ if (remaining_seq_length <= 0)
164
+ {
165
+ doc_offset += (remaining_seq_length + doc_length - 1);
166
+ remaining_seq_length = 0;
167
+ }
168
+ else
169
+ {
170
+ // Otherwise, start from the beginning of the next document.
171
+ ++doc_idx_index;
172
+ doc_offset = 0;
173
+ }
174
+ }
175
+ // Record the sequence.
176
+ sample_idx[2 * sample_index] = doc_idx_index;
177
+ sample_idx[2 * sample_index + 1] = doc_offset;
178
+ ++sample_index;
179
+ }
180
+
181
+ // Method to deallocate memory.
182
+ py::capsule free_when_done(sample_idx, [](void *mem_)
183
+ {
184
+ int32_t *mem = reinterpret_cast<int32_t *>(mem_);
185
+ delete[] mem;
186
+ });
187
+
188
+ // Return the numpy array.
189
+ const auto byte_size = sizeof(int32_t);
190
+ return py::array(std::vector<int64_t>{num_samples + 1, 2}, // shape
191
+ {2 * byte_size, byte_size}, // C-style contiguous strides
192
+ sample_idx, // the data pointer
193
+ free_when_done); // numpy array references
194
+ }
195
+
196
+ inline int32_t get_target_sample_len(const int32_t short_seq_ratio,
197
+ const int32_t max_length,
198
+ std::mt19937 &rand32_gen)
199
+ {
200
+ /* Training sample length. */
201
+ if (short_seq_ratio == 0)
202
+ {
203
+ return max_length;
204
+ }
205
+ const auto random_number = rand32_gen();
206
+ if ((random_number % short_seq_ratio) == 0)
207
+ {
208
+ return 2 + random_number % (max_length - 1);
209
+ }
210
+ return max_length;
211
+ }
212
+
213
+ template <typename DocIdx>
214
+ py::array build_mapping_impl(const py::array_t<int64_t> &docs_,
215
+ const py::array_t<int32_t> &sizes_,
216
+ const int32_t num_epochs,
217
+ const uint64_t max_num_samples,
218
+ const int32_t max_seq_length,
219
+ const double short_seq_prob,
220
+ const int32_t seed,
221
+ const bool verbose,
222
+ const int32_t min_num_sent)
223
+ {
224
+ /* Build a mapping of (start-index, end-index, sequence-length) where
225
+ start and end index are the indices of the sentences in the sample
226
+ and sequence-length is the target sequence length.
227
+ */
228
+
229
+ // Consistency checks.
230
+ assert(num_epochs > 0);
231
+ assert(max_seq_length > 1);
232
+ assert(short_seq_prob >= 0.0);
233
+ assert(short_seq_prob <= 1.0);
234
+ assert(seed > 0);
235
+
236
+ // Remove bound checks.
237
+ auto docs = docs_.unchecked<1>();
238
+ auto sizes = sizes_.unchecked<1>();
239
+
240
+ // For efficiency, convert probability to ratio. Note: rand() generates int.
241
+ int32_t short_seq_ratio = 0;
242
+ if (short_seq_prob > 0)
243
+ {
244
+ short_seq_ratio = static_cast<int32_t>(round(1.0 / short_seq_prob));
245
+ }
246
+
247
+ if (verbose)
248
+ {
249
+ const auto sent_start_index = docs[0];
250
+ const auto sent_end_index = docs[docs_.shape(0) - 1];
251
+ const auto num_sentences = sent_end_index - sent_start_index;
252
+ cout << " using:" << endl
253
+ << std::flush;
254
+ cout << " number of documents: " << docs_.shape(0) - 1 << endl
255
+ << std::flush;
256
+ cout << " sentences range: [" << sent_start_index << ", " << sent_end_index << ")" << endl
257
+ << std::flush;
258
+ cout << " total number of sentences: " << num_sentences << endl
259
+ << std::flush;
260
+ cout << " number of epochs: " << num_epochs << endl
261
+ << std::flush;
262
+ cout << " maximum number of samples: " << max_num_samples << endl
263
+ << std::flush;
264
+ cout << " maximum sequence length: " << max_seq_length << endl
265
+ << std::flush;
266
+ cout << " short sequence probability: " << short_seq_prob << endl
267
+ << std::flush;
268
+ cout << " short sequence ratio (1/prob): " << short_seq_ratio << endl
269
+ << std::flush;
270
+ cout << " seed: " << seed << endl
271
+ << std::flush;
272
+ }
273
+
274
+ // Mapping and its length (1D).
275
+ int64_t num_samples = -1;
276
+ DocIdx *maps = NULL;
277
+
278
+ // Perform two iterations, in the first iteration get the size
279
+ // and allocate memory and in the second iteration populate the map.
280
+ bool second = false;
281
+ for (int32_t iteration = 0; iteration < 2; ++iteration)
282
+ {
283
+
284
+ // Set the seed so both iterations produce the same results.
285
+ std::mt19937 rand32_gen(seed);
286
+
287
+ // Set the flag on second iteration.
288
+ second = (iteration == 1);
289
+
290
+ // Counters:
291
+ uint64_t empty_docs = 0;
292
+ uint64_t one_sent_docs = 0;
293
+ uint64_t long_sent_docs = 0;
294
+
295
+ // Current map index.
296
+ uint64_t map_index = 0;
297
+
298
+ // For each epoch:
299
+ for (int32_t epoch = 0; epoch < num_epochs; ++epoch)
300
+ {
301
+ if (map_index >= max_num_samples)
302
+ {
303
+ if (verbose && (!second))
304
+ {
305
+ cout << " reached " << max_num_samples << " samples after "
306
+ << epoch << " epochs ..." << endl
307
+ << std::flush;
308
+ }
309
+ break;
310
+ }
311
+ // For each document:
312
+ for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc)
313
+ {
314
+
315
+ // Document sentences are in [sent_index_first, sent_index_last)
316
+ const auto sent_index_first = docs[doc];
317
+ const auto sent_index_last = docs[doc + 1];
318
+
319
+ // At the beginning of the document the previous index is the
320
+ // start index.
321
+ auto prev_start_index = sent_index_first;
322
+
323
+ // Remaining sentences in this document.
324
+ auto num_remain_sent = sent_index_last - sent_index_first;
325
+
326
+ // Some bookkeeping
327
+ if ((epoch == 0) && (!second))
328
+ {
329
+ if (num_remain_sent == 0)
330
+ {
331
+ ++empty_docs;
332
+ }
333
+ if (num_remain_sent == 1)
334
+ {
335
+ ++one_sent_docs;
336
+ }
337
+ }
338
+
339
+ // Detect documents with long sentences.
340
+ bool contains_long_sentence = false;
341
+ if (num_remain_sent > 1)
342
+ {
343
+ for (auto sent_index = sent_index_first;
344
+ sent_index < sent_index_last; ++sent_index)
345
+ {
346
+ if (sizes[sent_index] > LONG_SENTENCE_LEN)
347
+ {
348
+ if ((epoch == 0) && (!second))
349
+ {
350
+ ++long_sent_docs;
351
+ }
352
+ contains_long_sentence = true;
353
+ break;
354
+ }
355
+ }
356
+ }
357
+
358
+ // If we have at least the minimum number of sentences and no long sentences.
359
+ if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence))
360
+ {
361
+
362
+ // Set values.
363
+ auto seq_len = int32_t{0};
364
+ auto num_sent = int32_t{0};
365
+ auto target_seq_len = get_target_sample_len(short_seq_ratio,
366
+ max_seq_length,
367
+ rand32_gen);
368
+
369
+ // Loop through sentences.
370
+ for (auto sent_index = sent_index_first;
371
+ sent_index < sent_index_last; ++sent_index)
372
+ {
373
+
374
+ // Add the size and number of sentences.
375
+ seq_len += sizes[sent_index];
376
+ ++num_sent;
377
+ --num_remain_sent;
378
+
379
+ // If we have reached the target length.
380
+ // and if not only one sentence is left in the document.
381
+ // and if we have at least the minimum number of sentences.
382
+ // or if we have reached the end of the document.
383
+ if (((seq_len >= target_seq_len) &&
384
+ (num_remain_sent > 1) &&
385
+ (num_sent >= min_num_sent)) ||
386
+ (num_remain_sent == 0))
387
+ {
388
+
389
+ // Check for overflow.
390
+ if ((3 * map_index + 2) >
391
+ std::numeric_limits<int64_t>::max())
392
+ {
393
+ cout << "number of samples exceeded maximum "
394
+ << "allowed by type int64: "
395
+ << std::numeric_limits<int64_t>::max()
396
+ << endl;
397
+ throw std::overflow_error("Number of samples");
398
+ }
399
+
400
+ // Populate the map.
401
+ if (second)
402
+ {
403
+ const auto map_index_0 = 3 * map_index;
404
+ maps[map_index_0] = static_cast<DocIdx>(prev_start_index);
405
+ maps[map_index_0 + 1] = static_cast<DocIdx>(sent_index + 1);
406
+ maps[map_index_0 + 2] = static_cast<DocIdx>(target_seq_len);
407
+ }
408
+
409
+ // Update indices / counters.
410
+ ++map_index;
411
+ prev_start_index = sent_index + 1;
412
+ target_seq_len = get_target_sample_len(short_seq_ratio,
413
+ max_seq_length,
414
+ rand32_gen);
415
+ seq_len = 0;
416
+ num_sent = 0;
417
+ }
418
+
419
+ } // for (auto sent_index=sent_index_first; ...
420
+ } // if (num_remain_sent > 1) {
421
+ } // for (int doc=0; doc < num_docs; ++doc) {
422
+ } // for (int epoch=0; epoch < num_epochs; ++epoch) {
423
+
424
+ if (!second)
425
+ {
426
+ if (verbose)
427
+ {
428
+ cout << " number of empty documents: " << empty_docs << endl
429
+ << std::flush;
430
+ cout << " number of documents with one sentence: " << one_sent_docs << endl
431
+ << std::flush;
432
+ cout << " number of documents with long sentences: " << long_sent_docs << endl
433
+ << std::flush;
434
+ cout << " will create mapping for " << map_index << " samples" << endl
435
+ << std::flush;
436
+ }
437
+ assert(maps == NULL);
438
+ assert(num_samples < 0);
439
+ maps = new DocIdx[3 * map_index];
440
+ num_samples = static_cast<int64_t>(map_index);
441
+ }
442
+
443
+ } // for (int iteration=0; iteration < 2; ++iteration) {
444
+
445
+ // Shuffle.
446
+ // We need a 64 bit random number generator as we might have more
447
+ // than 2 billion samples.
448
+ std::mt19937_64 rand64_gen(seed + 1);
449
+ for (auto i = (num_samples - 1); i > 0; --i)
450
+ {
451
+ const auto j = static_cast<int64_t>(rand64_gen() % (i + 1));
452
+ const auto i0 = 3 * i;
453
+ const auto j0 = 3 * j;
454
+ // Swap values.
455
+ swap(maps[i0], maps[j0]);
456
+ swap(maps[i0 + 1], maps[j0 + 1]);
457
+ swap(maps[i0 + 2], maps[j0 + 2]);
458
+ }
459
+
460
+ // Method to deallocate memory.
461
+ py::capsule free_when_done(maps, [](void *mem_)
462
+ {
463
+ DocIdx *mem = reinterpret_cast<DocIdx *>(mem_);
464
+ delete[] mem;
465
+ });
466
+
467
+ // Return the numpy array.
468
+ const auto byte_size = sizeof(DocIdx);
469
+ return py::array(std::vector<int64_t>{num_samples, 3}, // shape
470
+ {3 * byte_size, byte_size}, // C-style contiguous strides
471
+ maps, // the data pointer
472
+ free_when_done); // numpy array references
473
+ }
474
+
475
+ py::array build_mapping(const py::array_t<int64_t> &docs_,
476
+ const py::array_t<int> &sizes_,
477
+ const int num_epochs,
478
+ const uint64_t max_num_samples,
479
+ const int max_seq_length,
480
+ const double short_seq_prob,
481
+ const int seed,
482
+ const bool verbose,
483
+ const int32_t min_num_sent)
484
+ {
485
+
486
+ if (sizes_.size() > std::numeric_limits<uint32_t>::max())
487
+ {
488
+ if (verbose)
489
+ {
490
+ cout << " using uint64 for data mapping..." << endl
491
+ << std::flush;
492
+ }
493
+ return build_mapping_impl<uint64_t>(docs_, sizes_, num_epochs,
494
+ max_num_samples, max_seq_length,
495
+ short_seq_prob, seed, verbose,
496
+ min_num_sent);
497
+ }
498
+ else
499
+ {
500
+ if (verbose)
501
+ {
502
+ cout << " using uint32 for data mapping..." << endl
503
+ << std::flush;
504
+ }
505
+ return build_mapping_impl<uint32_t>(docs_, sizes_, num_epochs,
506
+ max_num_samples, max_seq_length,
507
+ short_seq_prob, seed, verbose,
508
+ min_num_sent);
509
+ }
510
+ }
511
+
512
+ template <typename DocIdx>
513
+ py::array build_blocks_mapping_impl(const py::array_t<int64_t> &docs_,
514
+ const py::array_t<int32_t> &sizes_,
515
+ const py::array_t<int32_t> &titles_sizes_,
516
+ const int32_t num_epochs,
517
+ const uint64_t max_num_samples,
518
+ const int32_t max_seq_length,
519
+ const int32_t seed,
520
+ const bool verbose,
521
+ const bool use_one_sent_blocks)
522
+ {
523
+ /* Build a mapping of (start-index, end-index, sequence-length) where
524
+ start and end index are the indices of the sentences in the sample
525
+ and sequence-length is the target sequence length.
526
+ */
527
+
528
+ // Consistency checks.
529
+ assert(num_epochs > 0);
530
+ assert(max_seq_length > 1);
531
+ assert(seed > 0);
532
+
533
+ // Remove bound checks.
534
+ auto docs = docs_.unchecked<1>();
535
+ auto sizes = sizes_.unchecked<1>();
536
+ auto titles_sizes = titles_sizes_.unchecked<1>();
537
+
538
+ if (verbose)
539
+ {
540
+ const auto sent_start_index = docs[0];
541
+ const auto sent_end_index = docs[docs_.shape(0) - 1];
542
+ const auto num_sentences = sent_end_index - sent_start_index;
543
+ cout << " using:" << endl
544
+ << std::flush;
545
+ cout << " number of documents: " << docs_.shape(0) - 1 << endl
546
+ << std::flush;
547
+ cout << " sentences range: [" << sent_start_index << ", " << sent_end_index << ")" << endl
548
+ << std::flush;
549
+ cout << " total number of sentences: " << num_sentences << endl
550
+ << std::flush;
551
+ cout << " number of epochs: " << num_epochs << endl
552
+ << std::flush;
553
+ cout << " maximum number of samples: " << max_num_samples << endl
554
+ << std::flush;
555
+ cout << " maximum sequence length: " << max_seq_length << endl
556
+ << std::flush;
557
+ cout << " seed: " << seed << endl
558
+ << std::flush;
559
+ }
560
+
561
+ // Mapping and its length (1D).
562
+ int64_t num_samples = -1;
563
+ DocIdx *maps = NULL;
564
+
565
+ // Acceptable number of sentences per block.
566
+ int min_num_sent = 2;
567
+ if (use_one_sent_blocks)
568
+ {
569
+ min_num_sent = 1;
570
+ }
571
+
572
+ // Perform two iterations, in the first iteration get the size
573
+ // and allocate memory and in the second iteration populate the map.
574
+ bool second = false;
575
+ for (int32_t iteration = 0; iteration < 2; ++iteration)
576
+ {
577
+
578
+ // Set the flag on second iteration.
579
+ second = (iteration == 1);
580
+
581
+ // Current map index.
582
+ uint64_t map_index = 0;
583
+
584
+ uint64_t empty_docs = 0;
585
+ uint64_t one_sent_docs = 0;
586
+ uint64_t long_sent_docs = 0;
587
+ // For each epoch:
588
+ for (int32_t epoch = 0; epoch < num_epochs; ++epoch)
589
+ {
590
+ // assign every block a unique id
591
+ int32_t block_id = 0;
592
+
593
+ if (map_index >= max_num_samples)
594
+ {
595
+ if (verbose && (!second))
596
+ {
597
+ cout << " reached " << max_num_samples << " samples after "
598
+ << epoch << " epochs ..." << endl
599
+ << std::flush;
600
+ }
601
+ break;
602
+ }
603
+ // For each document:
604
+ for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc)
605
+ {
606
+
607
+ // Document sentences are in [sent_index_first, sent_index_last)
608
+ const auto sent_index_first = docs[doc];
609
+ const auto sent_index_last = docs[doc + 1];
610
+ const auto target_seq_len = max_seq_length - titles_sizes[doc];
611
+
612
+ // At the beginning of the document the previous index is the
613
+ // start index.
614
+ auto prev_start_index = sent_index_first;
615
+
616
+ // Remaining sentences in this document.
617
+ auto num_remain_sent = sent_index_last - sent_index_first;
618
+
619
+ // Some bookkeeping
620
+ if ((epoch == 0) && (!second))
621
+ {
622
+ if (num_remain_sent == 0)
623
+ {
624
+ ++empty_docs;
625
+ }
626
+ if (num_remain_sent == 1)
627
+ {
628
+ ++one_sent_docs;
629
+ }
630
+ }
631
+ // Detect documents with long sentences.
632
+ bool contains_long_sentence = false;
633
+ if (num_remain_sent >= min_num_sent)
634
+ {
635
+ for (auto sent_index = sent_index_first;
636
+ sent_index < sent_index_last; ++sent_index)
637
+ {
638
+ if (sizes[sent_index] > LONG_SENTENCE_LEN)
639
+ {
640
+ if ((epoch == 0) && (!second))
641
+ {
642
+ ++long_sent_docs;
643
+ }
644
+ contains_long_sentence = true;
645
+ break;
646
+ }
647
+ }
648
+ }
649
+ // If we have enough sentences and no long sentences.
650
+ if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence))
651
+ {
652
+
653
+ // Set values.
654
+ auto seq_len = int32_t{0};
655
+ auto num_sent = int32_t{0};
656
+
657
+ // Loop through sentences.
658
+ for (auto sent_index = sent_index_first;
659
+ sent_index < sent_index_last; ++sent_index)
660
+ {
661
+
662
+ // Add the size and number of sentences.
663
+ seq_len += sizes[sent_index];
664
+ ++num_sent;
665
+ --num_remain_sent;
666
+
667
+ // If we have reached the target length.
668
+ // and there is an acceptable number of sentences left
669
+ // and if we have at least the minimum number of sentences.
670
+ // or if we have reached the end of the document.
671
+ if (((seq_len >= target_seq_len) &&
672
+ (num_remain_sent >= min_num_sent) &&
673
+ (num_sent >= min_num_sent)) ||
674
+ (num_remain_sent == 0))
675
+ {
676
+
677
+ // Populate the map.
678
+ if (second)
679
+ {
680
+ const auto map_index_0 = 4 * map_index;
681
+ // Each sample has 4 items: the starting sentence index, ending sentence index,
682
+ // the index of the document from which the block comes (used for fetching titles)
683
+ // and the unique id of the block (used for creating block indexes)
684
+
685
+ maps[map_index_0] = static_cast<DocIdx>(prev_start_index);
686
+ maps[map_index_0 + 1] = static_cast<DocIdx>(sent_index + 1);
687
+ maps[map_index_0 + 2] = static_cast<DocIdx>(doc);
688
+ maps[map_index_0 + 3] = static_cast<DocIdx>(block_id);
689
+ }
690
+
691
+ // Update indices / counters.
692
+ ++map_index;
693
+ ++block_id;
694
+ prev_start_index = sent_index + 1;
695
+ seq_len = 0;
696
+ num_sent = 0;
697
+ }
698
+ } // for (auto sent_index=sent_index_first; ...
699
+ } // if (num_remain_sent > 1) {
700
+ } // for (int doc=0; doc < num_docs; ++doc) {
701
+ } // for (int epoch=0; epoch < num_epochs; ++epoch) {
702
+
703
+ if (!second)
704
+ {
705
+ if (verbose)
706
+ {
707
+ cout << " number of empty documents: " << empty_docs << endl
708
+ << std::flush;
709
+ cout << " number of documents with one sentence: " << one_sent_docs << endl
710
+ << std::flush;
711
+ cout << " number of documents with long sentences: " << long_sent_docs << endl
712
+ << std::flush;
713
+ cout << " will create mapping for " << map_index << " samples" << endl
714
+ << std::flush;
715
+ }
716
+ assert(maps == NULL);
717
+ assert(num_samples < 0);
718
+ maps = new DocIdx[4 * map_index];
719
+ num_samples = static_cast<int64_t>(map_index);
720
+ }
721
+
722
+ } // for (int iteration=0; iteration < 2; ++iteration) {
723
+
724
+ // Shuffle.
725
+ // We need a 64 bit random number generator as we might have more
726
+ // than 2 billion samples.
727
+ std::mt19937_64 rand64_gen(seed + 1);
728
+ for (auto i = (num_samples - 1); i > 0; --i)
729
+ {
730
+ const auto j = static_cast<int64_t>(rand64_gen() % (i + 1));
731
+ const auto i0 = 4 * i;
732
+ const auto j0 = 4 * j;
733
+ // Swap values.
734
+ swap(maps[i0], maps[j0]);
735
+ swap(maps[i0 + 1], maps[j0 + 1]);
736
+ swap(maps[i0 + 2], maps[j0 + 2]);
737
+ swap(maps[i0 + 3], maps[j0 + 3]);
738
+ }
739
+
740
+ // Method to deallocate memory.
741
+ py::capsule free_when_done(maps, [](void *mem_)
742
+ {
743
+ DocIdx *mem = reinterpret_cast<DocIdx *>(mem_);
744
+ delete[] mem;
745
+ });
746
+
747
+ // Return the numpy array.
748
+ const auto byte_size = sizeof(DocIdx);
749
+ return py::array(std::vector<int64_t>{num_samples, 4}, // shape
750
+ {4 * byte_size, byte_size}, // C-style contiguous strides
751
+ maps, // the data pointer
752
+ free_when_done); // numpy array references
753
+ }
754
+
755
+ py::array build_blocks_mapping(const py::array_t<int64_t> &docs_,
756
+ const py::array_t<int> &sizes_,
757
+ const py::array_t<int> &titles_sizes_,
758
+ const int num_epochs,
759
+ const uint64_t max_num_samples,
760
+ const int max_seq_length,
761
+ const int seed,
762
+ const bool verbose,
763
+ const bool use_one_sent_blocks)
764
+ {
765
+
766
+ if (sizes_.size() > std::numeric_limits<uint32_t>::max())
767
+ {
768
+ if (verbose)
769
+ {
770
+ cout << " using uint64 for data mapping..." << endl
771
+ << std::flush;
772
+ }
773
+ return build_blocks_mapping_impl<uint64_t>(docs_, sizes_, titles_sizes_,
774
+ num_epochs, max_num_samples, max_seq_length, seed, verbose, use_one_sent_blocks);
775
+ }
776
+ else
777
+ {
778
+ if (verbose)
779
+ {
780
+ cout << " using uint32 for data mapping..." << endl
781
+ << std::flush;
782
+ }
783
+ return build_blocks_mapping_impl<uint32_t>(docs_, sizes_, titles_sizes_,
784
+ num_epochs, max_num_samples, max_seq_length, seed, verbose, use_one_sent_blocks);
785
+ }
786
+ }
787
+
788
+ PYBIND11_MODULE(helpers, m)
789
+ {
790
+ m.def("build_mapping", &build_mapping);
791
+ m.def("build_blocks_mapping", &build_blocks_mapping);
792
+ m.def("build_sample_idx", &build_sample_idx);
793
+ m.def("build_blending_indices", &build_blending_indices);
794
+ }
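For reference, a rough usage sketch of the bindings above, assuming the extension has been compiled into a `helpers` module importable from this package (the toy sizes/doc_idx arrays are illustrative, and the arguments are positional because the bindings declare no keyword names):

    import numpy as np
    from fengshen.data.megatron_dataloader import helpers  # assumes the compiled extension is importable here

    # Toy corpus: three documents with 7, 5 and 10 tokens, visited twice (two epochs).
    sizes = np.array([7, 5, 10], dtype=np.int32)
    doc_idx = np.array([0, 1, 2, 0, 1, 2], dtype=np.int32)
    tokens_per_epoch = int(sizes.sum())

    # build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch)
    sample_idx = helpers.build_sample_idx(sizes, doc_idx, 8, 2, tokens_per_epoch)
    # Shape [num_samples + 1, 2]: row i is (index into doc_idx, token offset in that document).
    print(sample_idx.shape, sample_idx[0], sample_idx[-1])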
fengshen/data/megatron_dataloader/indexed_dataset.py ADDED
@@ -0,0 +1,585 @@
1
+ # Copyright (c) Facebook, Inc. and its affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+
7
+ # copied from fairseq/fairseq/data/indexed_dataset.py
8
+ # Removed IndexedRawTextDataset since it relied on Fairseq dictionary
9
+ # other slight modifications to remove fairseq dependencies
10
+ # Added document index to index file and made it accessible.
11
+ # An empty sentence no longer separates documents.
12
+
13
+ from functools import lru_cache
14
+ import os
15
+ import shutil
16
+ import struct
17
+ from itertools import accumulate
18
+
19
+ import numpy as np
20
+ import torch
21
+ from fengshen.data.megatron_dataloader.utils import print_rank_0
22
+
23
+
24
+ def __best_fitting_dtype(vocab_size=None):
25
+ if vocab_size is not None and vocab_size < 65500:
26
+ return np.uint16
27
+ else:
28
+ return np.int32
29
+
30
+
31
+ def get_available_dataset_impl():
32
+ return ['lazy', 'cached', 'mmap']
33
+
34
+
35
+ def infer_dataset_impl(path):
36
+ if IndexedDataset.exists(path):
37
+ with open(index_file_path(path), 'rb') as f:
38
+ magic = f.read(8)
39
+ if magic == IndexedDataset._HDR_MAGIC:
40
+ return 'cached'
41
+ elif magic == MMapIndexedDataset.Index._HDR_MAGIC[:8]:
42
+ return 'mmap'
43
+ else:
44
+ return None
45
+ else:
46
+ print(f"Dataset does not exist: {path}")
47
+ print("Path should be a basename that both .idx and "
48
+ ".bin can be appended to get full filenames.")
49
+ return None
50
+
51
+
52
+ def make_builder(out_file, impl, vocab_size=None):
53
+ if impl == 'mmap':
54
+ return MMapIndexedDatasetBuilder(out_file,
55
+ dtype=__best_fitting_dtype(vocab_size))
56
+ else:
57
+ return IndexedDatasetBuilder(out_file)
58
+
59
+
60
+ def make_dataset(path, impl, skip_warmup=False):
61
+ if not IndexedDataset.exists(path):
62
+ print(f"Dataset does not exist: {path}")
63
+ print("Path should be a basename that both .idx "
64
+ "and .bin can be appended to get full filenames.")
65
+ return None
66
+ if impl == 'infer':
67
+ impl = infer_dataset_impl(path)
68
+ if impl == 'lazy' and IndexedDataset.exists(path):
69
+ return IndexedDataset(path)
70
+ elif impl == 'cached' and IndexedDataset.exists(path):
71
+ return IndexedCachedDataset(path)
72
+ elif impl == 'mmap' and MMapIndexedDataset.exists(path):
73
+ return MMapIndexedDataset(path, skip_warmup)
74
+ print(f"Unknown dataset implementation: {impl}")
75
+ return None
76
+
77
+
78
+ def dataset_exists(path, impl):
79
+ if impl == 'mmap':
80
+ return MMapIndexedDataset.exists(path)
81
+ else:
82
+ return IndexedDataset.exists(path)
83
+
84
+
85
+ def read_longs(f, n):
86
+ a = np.empty(n, dtype=np.int64)
87
+ f.readinto(a)
88
+ return a
89
+
90
+
91
+ def write_longs(f, a):
92
+ f.write(np.array(a, dtype=np.int64))
93
+
94
+
95
+ dtypes = {
96
+ 1: np.uint8,
97
+ 2: np.int8,
98
+ 3: np.int16,
99
+ 4: np.int32,
100
+ 5: np.int64,
101
+ 6: np.float,
102
+ 7: np.double,
103
+ 8: np.uint16
104
+ }
105
+
106
+
107
+ def code(dtype):
108
+ for k in dtypes.keys():
109
+ if dtypes[k] == dtype:
110
+ return k
111
+ raise ValueError(dtype)
112
+
113
+
114
+ def index_file_path(prefix_path):
115
+ return prefix_path + '.idx'
116
+
117
+
118
+ def data_file_path(prefix_path):
119
+ return prefix_path + '.bin'
120
+
121
+
122
+ def create_doc_idx(sizes):
123
+ doc_idx = [0]
124
+ for i, s in enumerate(sizes):
125
+ if s == 0:
126
+ doc_idx.append(i + 1)
127
+ return doc_idx
128
+
129
+
130
+ class IndexedDataset(torch.utils.data.Dataset):
131
+ """Loader for IndexedDataset"""
132
+ _HDR_MAGIC = b'TNTIDX\x00\x00'
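+ # .idx layout written by IndexedDatasetBuilder.finalize() and parsed by read_index():
+ # magic (8 bytes) | version '<Q' | dtype code, element size '<QQ' |
+ # item count, size count '<QQ' | doc count '<Q' |
+ # dim_offsets, data_offsets, sizes, doc_idx (all int64 arrays).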
133
+
134
+ def __init__(self, path):
135
+ super().__init__()
136
+ self.path = path
137
+ self.data_file = None
138
+ self.read_index(path)
139
+
140
+ def read_index(self, path):
141
+ with open(index_file_path(path), 'rb') as f:
142
+ magic = f.read(8)
143
+ assert magic == self._HDR_MAGIC, (
144
+ 'Index file doesn\'t match expected format. '
145
+ 'Make sure that --dataset-impl is configured properly.'
146
+ )
147
+ version = f.read(8)
148
+ assert struct.unpack('<Q', version) == (1,)
149
+ code, self.element_size = struct.unpack('<QQ', f.read(16))
150
+ self.dtype = dtypes[code]
151
+ self._len, self.s = struct.unpack('<QQ', f.read(16))
152
+ self.doc_count = struct.unpack('<Q', f.read(8))
153
+ self.dim_offsets = read_longs(f, self._len + 1)
154
+ self.data_offsets = read_longs(f, self._len + 1)
155
+ self.sizes = read_longs(f, self.s)
156
+ self.doc_idx = read_longs(f, self.doc_count)
157
+
158
+ def read_data(self, path):
159
+ self.data_file = open(data_file_path(path), 'rb', buffering=0)
160
+
161
+ def check_index(self, i):
162
+ if i < 0 or i >= self._len:
163
+ raise IndexError('index out of range')
164
+
165
+ def __del__(self):
166
+ if self.data_file:
167
+ self.data_file.close()
168
+
169
+ # @lru_cache(maxsize=8)
170
+ def __getitem__(self, idx):
171
+ if not self.data_file:
172
+ self.read_data(self.path)
173
+ if isinstance(idx, int):
174
+ i = idx
175
+ self.check_index(i)
176
+ tensor_size = self.sizes[
177
+ self.dim_offsets[i]:self.dim_offsets[i + 1]]
178
+ a = np.empty(tensor_size, dtype=self.dtype)
179
+ self.data_file.seek(self.data_offsets[i] * self.element_size)
180
+ self.data_file.readinto(a)
181
+ return a
182
+ elif isinstance(idx, slice):
183
+ start, stop, step = idx.indices(len(self))
184
+ if step != 1:
185
+ raise ValueError(
186
+ "Slices into indexed_dataset must be contiguous")
187
+ sizes = self.sizes[self.dim_offsets[start]:self.dim_offsets[stop]]
188
+ size = sum(sizes)
189
+ a = np.empty(size, dtype=self.dtype)
190
+ self.data_file.seek(self.data_offsets[start] * self.element_size)
191
+ self.data_file.readinto(a)
192
+ offsets = list(accumulate(sizes))
193
+ sents = np.split(a, offsets[:-1])
194
+ return sents
195
+
196
+ def __len__(self):
197
+ return self._len
198
+
199
+ def num_tokens(self, index):
200
+ return self.sizes[index]
201
+
202
+ def size(self, index):
203
+ return self.sizes[index]
204
+
205
+ @staticmethod
206
+ def exists(path):
207
+ return (
208
+ os.path.exists(index_file_path(path)) and os.path.exists(
209
+ data_file_path(path))
210
+ )
211
+
212
+ @property
213
+ def supports_prefetch(self):
214
+ return False # avoid prefetching to save memory
215
+
216
+
217
+ class IndexedCachedDataset(IndexedDataset):
218
+
219
+ def __init__(self, path):
220
+ super().__init__(path)
221
+ self.cache = None
222
+ self.cache_index = {}
223
+
224
+ @property
225
+ def supports_prefetch(self):
226
+ return True
227
+
228
+ def prefetch(self, indices):
229
+ if all(i in self.cache_index for i in indices):
230
+ return
231
+ if not self.data_file:
232
+ self.read_data(self.path)
233
+ indices = sorted(set(indices))
234
+ total_size = 0
235
+ for i in indices:
236
+ total_size += self.data_offsets[i + 1] - self.data_offsets[i]
237
+ self.cache = np.empty(total_size, dtype=self.dtype)
238
+ ptx = 0
239
+ self.cache_index.clear()
240
+ for i in indices:
241
+ self.cache_index[i] = ptx
242
+ size = self.data_offsets[i + 1] - self.data_offsets[i]
243
+ a = self.cache[ptx: ptx + size]
244
+ self.data_file.seek(self.data_offsets[i] * self.element_size)
245
+ self.data_file.readinto(a)
246
+ ptx += size
247
+ if self.data_file:
248
+ # close and delete data file after prefetch so we can pickle
249
+ self.data_file.close()
250
+ self.data_file = None
251
+
252
+ # @lru_cache(maxsize=8)
253
+ def __getitem__(self, idx):
254
+ if isinstance(idx, int):
255
+ i = idx
256
+ self.check_index(i)
257
+ tensor_size = self.sizes[
258
+ self.dim_offsets[i]:self.dim_offsets[i + 1]]
259
+ a = np.empty(tensor_size, dtype=self.dtype)
260
+ ptx = self.cache_index[i]
261
+ np.copyto(a, self.cache[ptx: ptx + a.size])
262
+ return a
263
+ elif isinstance(idx, slice):
264
+ # Hack just to make this work; can optimize later if necessary
265
+ sents = []
266
+ for i in range(*idx.indices(len(self))):
267
+ sents.append(self[i])
268
+ return sents
269
+
270
+
271
+ class IndexedDatasetBuilder(object):
272
+ element_sizes = {
273
+ np.uint8: 1,
274
+ np.int8: 1,
275
+ np.int16: 2,
276
+ np.int32: 4,
277
+ np.int64: 8,
278
+ np.float: 4,
279
+ np.double: 8
280
+ }
281
+
282
+ def __init__(self, out_file, dtype=np.int32):
283
+ self.out_file = open(out_file, 'wb')
284
+ self.dtype = dtype
285
+ self.data_offsets = [0]
286
+ self.dim_offsets = [0]
287
+ self.sizes = []
288
+ self.element_size = self.element_sizes[self.dtype]
289
+ self.doc_idx = [0]
290
+
291
+ def add_item(self, tensor):
292
+ bytes = self.out_file.write(np.array(tensor.numpy(), dtype=self.dtype))
293
+ self.data_offsets.append(
294
+ self.data_offsets[-1] + bytes / self.element_size)
295
+ for s in tensor.size():
296
+ self.sizes.append(s)
297
+ self.dim_offsets.append(self.dim_offsets[-1] + len(tensor.size()))
298
+
299
+ def end_document(self):
300
+ self.doc_idx.append(len(self.sizes))
301
+
302
+ def merge_file_(self, another_file):
303
+ index = IndexedDataset(another_file)
304
+ assert index.dtype == self.dtype
305
+
306
+ begin = self.data_offsets[-1]
307
+ for offset in index.data_offsets[1:]:
308
+ self.data_offsets.append(begin + offset)
309
+ self.sizes.extend(index.sizes)
310
+ begin = self.dim_offsets[-1]
311
+ for dim_offset in index.dim_offsets[1:]:
312
+ self.dim_offsets.append(begin + dim_offset)
313
+
314
+ with open(data_file_path(another_file), 'rb') as f:
315
+ while True:
316
+ data = f.read(1024)
317
+ if data:
318
+ self.out_file.write(data)
319
+ else:
320
+ break
321
+
322
+ def finalize(self, index_file):
323
+ self.out_file.close()
324
+ index = open(index_file, 'wb')
325
+ index.write(b'TNTIDX\x00\x00')
326
+ index.write(struct.pack('<Q', 1))
327
+ index.write(struct.pack('<QQ', code(self.dtype), self.element_size))
328
+ index.write(struct.pack('<QQ', len(
329
+ self.data_offsets) - 1, len(self.sizes)))
330
+ index.write(struct.pack('<Q', len(self.doc_idx)))
331
+ write_longs(index, self.dim_offsets)
332
+ write_longs(index, self.data_offsets)
333
+ write_longs(index, self.sizes)
334
+ write_longs(index, self.doc_idx)
335
+ index.close()
336
+
337
+
338
+ def _warmup_mmap_file(path):
339
+ with open(path, 'rb') as stream:
340
+ while stream.read(100 * 1024 * 1024):
341
+ pass
342
+
343
+
344
+ class MMapIndexedDataset(torch.utils.data.Dataset):
345
+ class Index(object):
346
+ _HDR_MAGIC = b'MMIDIDX\x00\x00'
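+ # .idx layout written by the _Writer below:
+ # magic (9 bytes) | version '<Q' | dtype code '<B' | size count '<Q' | doc count '<Q' |
+ # sizes (int32) | pointers, i.e. byte offsets into the .bin file (int64) | doc_idx (int64).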
347
+
348
+ @classmethod
349
+ def writer(cls, path, dtype):
350
+ class _Writer(object):
351
+ def __enter__(self):
352
+ self._file = open(path, 'wb')
353
+
354
+ self._file.write(cls._HDR_MAGIC)
355
+ self._file.write(struct.pack('<Q', 1))
356
+ self._file.write(struct.pack('<B', code(dtype)))
357
+
358
+ return self
359
+
360
+ @staticmethod
361
+ def _get_pointers(sizes):
362
+ dtype_size = dtype().itemsize
363
+ address = 0
364
+ pointers = []
365
+
366
+ for size in sizes:
367
+ pointers.append(address)
368
+ address += size * dtype_size
369
+
370
+ return pointers
371
+
372
+ def write(self, sizes, doc_idx):
373
+ pointers = self._get_pointers(sizes)
374
+
375
+ self._file.write(struct.pack('<Q', len(sizes)))
376
+ self._file.write(struct.pack('<Q', len(doc_idx)))
377
+
378
+ sizes = np.array(sizes, dtype=np.int32)
379
+ self._file.write(sizes.tobytes(order='C'))
380
+ del sizes
381
+
382
+ pointers = np.array(pointers, dtype=np.int64)
383
+ self._file.write(pointers.tobytes(order='C'))
384
+ del pointers
385
+
386
+ doc_idx = np.array(doc_idx, dtype=np.int64)
387
+ self._file.write(doc_idx.tobytes(order='C'))
388
+
389
+ def __exit__(self, exc_type, exc_val, exc_tb):
390
+ self._file.close()
391
+
392
+ return _Writer()
393
+
394
+ def __init__(self, path, skip_warmup=False):
395
+ with open(path, 'rb') as stream:
396
+ magic_test = stream.read(9)
397
+ assert self._HDR_MAGIC == magic_test, (
398
+ 'Index file doesn\'t match expected format. '
399
+ 'Make sure that --dataset-impl is configured properly.'
400
+ )
401
+ version = struct.unpack('<Q', stream.read(8))
402
+ assert (1,) == version
403
+
404
+ dtype_code, = struct.unpack('<B', stream.read(1))
405
+ self._dtype = dtypes[dtype_code]
406
+ self._dtype_size = self._dtype().itemsize
407
+
408
+ self._len = struct.unpack('<Q', stream.read(8))[0]
409
+ self._doc_count = struct.unpack('<Q', stream.read(8))[0]
410
+ offset = stream.tell()
411
+
412
+ if not skip_warmup:
413
+ print_rank_0(" warming up index mmap file...")
414
+ _warmup_mmap_file(path)
415
+
416
+ self._bin_buffer_mmap = np.memmap(path, mode='r', order='C')
417
+ self._bin_buffer = memoryview(self._bin_buffer_mmap)
418
+ print_rank_0(" reading sizes...")
419
+ self._sizes = np.frombuffer(
420
+ self._bin_buffer,
421
+ dtype=np.int32,
422
+ count=self._len,
423
+ offset=offset)
424
+ print_rank_0(" reading pointers...")
425
+ self._pointers = np.frombuffer(self._bin_buffer,
426
+ dtype=np.int64, count=self._len,
427
+ offset=offset + self._sizes.nbytes)
428
+ print_rank_0(" reading document index...")
429
+ self._doc_idx = np.frombuffer(
430
+ self._bin_buffer,
431
+ dtype=np.int64, count=self._doc_count,
432
+ offset=offset + self._sizes.nbytes + self._pointers.nbytes)
433
+
434
+ def __del__(self):
435
+ self._bin_buffer_mmap._mmap.close()
436
+ del self._bin_buffer_mmap
437
+
438
+ @property
439
+ def dtype(self):
440
+ return self._dtype
441
+
442
+ @property
443
+ def sizes(self):
444
+ return self._sizes
445
+
446
+ @property
447
+ def doc_idx(self):
448
+ return self._doc_idx
449
+
450
+ @lru_cache(maxsize=8)
451
+ def __getitem__(self, i):
452
+ return self._pointers[i], self._sizes[i]
453
+
454
+ def __len__(self):
455
+ return self._len
456
+
457
+ def __init__(self, path, skip_warmup=False):
458
+ super().__init__()
459
+
460
+ self._path = None
461
+ self._index = None
462
+ self._bin_buffer = None
463
+
464
+ self._do_init(path, skip_warmup)
465
+
466
+ def __getstate__(self):
467
+ return self._path
468
+
469
+ def __setstate__(self, state):
470
+ self._do_init(state)
471
+
472
+ def _do_init(self, path, skip_warmup):
473
+ self._path = path
474
+ self._index = self.Index(index_file_path(self._path), skip_warmup)
475
+
476
+ if not skip_warmup:
477
+ print_rank_0(" warming up data mmap file...")
478
+ _warmup_mmap_file(data_file_path(self._path))
479
+ print_rank_0(" creating numpy buffer of mmap...")
480
+ self._bin_buffer_mmap = np.memmap(
481
+ data_file_path(self._path), mode='r', order='C')
482
+ print_rank_0(" creating memory view of numpy buffer...")
483
+ self._bin_buffer = memoryview(self._bin_buffer_mmap)
484
+
485
+ def __del__(self):
486
+ self._bin_buffer_mmap._mmap.close()
487
+ del self._bin_buffer_mmap
488
+ del self._index
489
+
490
+ def __len__(self):
491
+ return len(self._index)
492
+
493
+ # @lru_cache(maxsize=8)
494
+ def __getitem__(self, idx):
495
+ if isinstance(idx, int):
496
+ ptr, size = self._index[idx]
497
+ np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype,
498
+ count=size, offset=ptr)
499
+ return np_array
500
+ elif isinstance(idx, slice):
501
+ start, stop, step = idx.indices(len(self))
502
+ if step != 1:
503
+ raise ValueError(
504
+ "Slices into indexed_dataset must be contiguous")
505
+ ptr = self._index._pointers[start]
506
+ sizes = self._index._sizes[idx]
507
+ offsets = list(accumulate(sizes))
508
+ total_size = sum(sizes)
509
+ np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype,
510
+ count=total_size, offset=ptr)
511
+ sents = np.split(np_array, offsets[:-1])
512
+ return sents
513
+
514
+ def get(self, idx, offset=0, length=None):
515
+ """ Retrieves a single item from the dataset with the option to only
516
+ return a portion of the item.
517
+
518
+ get(idx) is the same as [idx] but get() does not support slicing.
519
+ """
520
+ ptr, size = self._index[idx]
521
+ if length is None:
522
+ length = size - offset
523
+ ptr += offset * np.dtype(self._index.dtype).itemsize
524
+ np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype,
525
+ count=length, offset=ptr)
526
+ return np_array
527
+
528
+ @property
529
+ def sizes(self):
530
+ return self._index.sizes
531
+
532
+ @property
533
+ def doc_idx(self):
534
+ return self._index.doc_idx
535
+
536
+ def get_doc_idx(self):
537
+ return self._index._doc_idx
538
+
539
+ def set_doc_idx(self, doc_idx_):
540
+ self._index._doc_idx = doc_idx_
541
+
542
+ @property
543
+ def supports_prefetch(self):
544
+ return False
545
+
546
+ @staticmethod
547
+ def exists(path):
548
+ return (
549
+ os.path.exists(index_file_path(path)) and os.path.exists(
550
+ data_file_path(path))
551
+ )
552
+
553
+
554
+ class MMapIndexedDatasetBuilder(object):
555
+ def __init__(self, out_file, dtype=np.int64):
556
+ self._data_file = open(out_file, 'wb', buffering=5000000)
557
+ self._dtype = dtype
558
+ self._sizes = []
559
+ self._doc_idx = [0]
560
+
561
+ def add_item(self, tensor):
562
+ np_array = np.array(tensor.numpy(), dtype=self._dtype)
563
+ self._data_file.write(np_array.tobytes(order='C'))
564
+ self._sizes.append(np_array.size)
565
+
566
+ def end_document(self):
567
+ self._doc_idx.append(len(self._sizes))
568
+
569
+ def merge_file_(self, another_file):
570
+ # Concatenate index
571
+ index = MMapIndexedDataset.Index(index_file_path(another_file))
572
+ assert index.dtype == self._dtype
573
+
574
+ for size in index.sizes:
575
+ self._sizes.append(size)
576
+
577
+ # Concatenate data
578
+ with open(data_file_path(another_file), 'rb') as f:
579
+ shutil.copyfileobj(f, self._data_file)
580
+
581
+ def finalize(self, index_file):
582
+ self._data_file.close()
583
+
584
+ with MMapIndexedDataset.Index.writer(index_file, self._dtype) as index:
585
+ index.write(self._sizes, self._doc_idx)
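A small round-trip sketch for the mmap variant above, assuming the module is importable as `fengshen.data.megatron_dataloader.indexed_dataset` (file prefix and token ids are made up):

    import torch
    from fengshen.data.megatron_dataloader import indexed_dataset

    prefix = 'toy_corpus'  # produces toy_corpus.bin / toy_corpus.idx
    builder = indexed_dataset.make_builder(indexed_dataset.data_file_path(prefix),
                                           impl='mmap', vocab_size=50000)
    for ids in ([101, 5, 7, 102], [101, 9, 9, 3, 102]):  # already-tokenized sentences
        builder.add_item(torch.tensor(ids, dtype=torch.int64))
    builder.end_document()  # both sentences belong to the same document
    builder.finalize(indexed_dataset.index_file_path(prefix))

    ds = indexed_dataset.make_dataset(prefix, impl='mmap', skip_warmup=True)
    print(len(ds), ds[0], ds.doc_idx)  # 2 sentences, ids of the first one, doc boundaries [0 2]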
fengshen/data/megatron_dataloader/utils.py ADDED
@@ -0,0 +1,24 @@
1
+ # coding=utf-8
2
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ import torch
16
+
17
+
18
+ def print_rank_0(message):
19
+ """If distributed is initialized, print only on rank 0."""
20
+ if torch.distributed.is_initialized():
21
+ if torch.distributed.get_rank() == 0:
22
+ print(message, flush=True)
23
+ else:
24
+ print(message, flush=True)
fengshen/data/mmap_dataloader/mmap_datamodule.py ADDED
@@ -0,0 +1,68 @@
1
+ from typing import Optional
2
+ from pytorch_lightning import LightningDataModule
3
+ from torch.utils.data import DataLoader
4
+ from fengshen.data.mmap_dataloader.mmap_index_dataset import MMapIndexDataset
5
+
6
+
7
+ class MMapDataModule(LightningDataModule):
8
+ @staticmethod
9
+ def add_data_specific_args(parent_args):
10
+ parser = parent_args.add_argument_group('MMAP DataModule')
11
+ parser.add_argument('--num_workers', default=8, type=int)
12
+ parser.add_argument('--train_batchsize', default=32, type=int)
13
+ parser.add_argument('--eval_batchsize', default=32, type=int)
14
+ parser.add_argument('--test_batchsize', default=32, type=int)
15
+ parser.add_argument('--train_datas', default=[
16
+ './train_datas'
17
+ ], type=str, nargs='+')
18
+ parser.add_argument('--valid_datas', default=[
19
+ './valid_datas'
20
+ ], type=str, nargs='+')
21
+ parser.add_argument('--test_datas', default=[
22
+ './test_datas'],
23
+ type=str, nargs='+')
24
+ parser.add_argument('--input_tensor_name', default=['input_ids'], type=str, nargs='+')
25
+ return parent_args
26
+
27
+ def __init__(
28
+ self,
29
+ collate_fn,
30
+ args,
31
+ **kwargs,
32
+ ):
33
+ super().__init__()
34
+ self.collate_fn = collate_fn
35
+ self.train_dataset = MMapIndexDataset(args.train_datas, args.input_tensor_name)
36
+ self.valid_dataset = MMapIndexDataset(args.valid_datas, args.input_tensor_name)
37
+ self.test_dataset = MMapIndexDataset(args.test_datas, args.input_tensor_name)
38
+ self.save_hyperparameters(args)
39
+
40
+ def setup(self, stage: Optional[str] = None) -> None:
41
+ return super().setup(stage)
42
+
43
+ def train_dataloader(self):
44
+ return DataLoader(
45
+ self.train_dataset,
46
+ batch_size=self.hparams.train_batchsize,
47
+ shuffle=True,
48
+ num_workers=self.hparams.num_workers,
49
+ collate_fn=self.collate_fn,
50
+ )
51
+
52
+ def val_dataloader(self):
53
+ return DataLoader(
54
+ self.valid_dataset,
55
+ batch_size=self.hparams.eval_batchsize,
56
+ shuffle=True,
57
+ num_workers=self.hparams.num_workers,
58
+ collate_fn=self.collate_fn,
59
+ )
60
+
61
+ def test_dataloader(self):
62
+ return DataLoader(
63
+ self.test_dataset,
64
+ batch_size=self.hparams.test_batchsize,
65
+ shuffle=True,
66
+ num_workers=self.hparams.num_workers,
67
+ collate_fn=self.collate_fn,
68
+ )
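A minimal wiring sketch, assuming the package exposes this module as `fengshen.data.mmap_dataloader.mmap_datamodule` and that the .npy/.bin files behind the --*_datas prefixes already exist (paths and the collate function are placeholders):

    import argparse
    import torch
    from fengshen.data.mmap_dataloader.mmap_datamodule import MMapDataModule

    parser = argparse.ArgumentParser()
    parser = MMapDataModule.add_data_specific_args(parser)
    args = parser.parse_args(['--train_datas', '/path/to/train_shard',
                              '--valid_datas', '/path/to/valid_shard',
                              '--test_datas', '/path/to/test_shard',
                              '--input_tensor_name', 'input_ids'])

    def collate_fn(batch):
        # pad the variable-length samples produced by MMapIndexDataset into one batch tensor
        return torch.nn.utils.rnn.pad_sequence([b['input_ids'] for b in batch],
                                               batch_first=True, padding_value=0)

    dm = MMapDataModule(collate_fn=collate_fn, args=args)
    train_loader = dm.train_dataloader()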
fengshen/data/mmap_dataloader/mmap_index_dataset.py ADDED
@@ -0,0 +1,53 @@
1
+ import numpy as np
2
+ import torch
3
+ from typing import List
4
+ from torch.utils.data import Dataset
5
+
6
+
7
+ class MMapIndexDataset(Dataset):
8
+ # datapaths: paths (prefixes) of all the memory-mapped data files
9
+ # input_tensor_name: names of the input tensors, e.g. ['input_ids']; each tensor is stored in its own pair of files
10
+ def __init__(self, datapaths: List[str], input_tensor_name: List[str]):
11
+ dict_idx_fp = {}
12
+ dict_bin_fp = {}
13
+ idx_len = []
14
+ for tensor_name in input_tensor_name:
15
+ idx_fp = []
16
+ bin_fp = []
17
+ total_len = 0
18
+ for data_path in datapaths:
19
+ idx_fp += [np.load(
20
+ data_path + '_' + tensor_name + '.npy', mmap_mode='r')]
21
+ bin_fp += [np.memmap(
22
+ data_path + '_' + tensor_name + '.bin',
23
+ dtype='long',
24
+ mode='r')]
25
+ total_len += idx_fp[-1].shape[0]
26
+ idx_len += [idx_fp[-1].shape[0]]
27
+ dict_idx_fp[tensor_name] = idx_fp
28
+ dict_bin_fp[tensor_name] = bin_fp
29
+ # Normally the different tensors all have the same length
30
+ self._len = total_len
31
+
32
+ self._input_tensor_name = input_tensor_name
33
+ self._dict_idx_fp = dict_idx_fp
34
+ self._dict_bin_fp = dict_bin_fp
35
+ self._idx_len = idx_len
36
+
37
+ def __len__(self):
38
+ return self._len
39
+
40
+ def __getitem__(self, idx):
41
+ sample = {}
42
+ for i in range(len(self._idx_len)):
43
+ if idx >= self._idx_len[i]:
44
+ idx -= self._idx_len[i]
45
+ else:
46
+ break
47
+ for tensor_name in self._input_tensor_name:
48
+ sample[tensor_name] = torch.tensor(self._dict_bin_fp[tensor_name][i][
49
+ self._dict_idx_fp[tensor_name][i][idx, 0]:
50
+ self._dict_idx_fp[tensor_name][i][idx, 1]
51
+ ], dtype=torch.long)
52
+ # print(sample)
53
+ return sample
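A sketch of the on-disk layout this class expects, inferred from __getitem__ above (file names and token ids are illustrative; 'long' maps to int64 on 64-bit Linux):

    import numpy as np

    tokens = np.array([101, 7, 8, 102, 101, 9, 102], dtype=np.int64)  # flat token stream
    index = np.array([[0, 4], [4, 7]], dtype=np.int64)                # per-sample [start, end) offsets

    tokens.tofile('shard0_input_ids.bin')   # <data_path>_<tensor_name>.bin
    np.save('shard0_input_ids.npy', index)  # <data_path>_<tensor_name>.npy

    # MMapIndexDataset(['shard0'], ['input_ids'])[1] then yields tokens[4:7] as a torch tensor.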
fengshen/data/preprocess.py ADDED
@@ -0,0 +1 @@
1
+ # coding=utf-8
fengshen/data/t5_dataloader/t5_datasets.py ADDED
@@ -0,0 +1,562 @@
1
+ # coding=utf8
2
+ import json
3
+ from torch.utils.data import Dataset, DataLoader
4
+ from tqdm import tqdm
5
+ from transformers import BertTokenizer, MT5Config, MT5Tokenizer, BatchEncoding
6
+ import torch
7
+ import pytorch_lightning as pl
8
+ import numpy as np
9
+ from itertools import chain
10
+ import sys
11
+ sys.path.append('../../')
12
+
13
+
14
+ def compute_input_and_target_lengths(inputs_length, noise_density, mean_noise_span_length):
15
+ """This function is copy of `random_spans_helper <https://github.com/google-research/
16
+ text-to-text-transfer-transformer/blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2466>`__ .
17
+ Training parameters to avoid padding with random_spans_noise_mask.
18
+ When training a model with random_spans_noise_mask, we would like to set the other
19
+ training hyperparameters in a way that avoids padding.
20
+ This function helps us compute these hyperparameters.
21
+ We assume that each noise span in the input is replaced by extra_tokens_per_span_inputs sentinel tokens,
22
+ and each non-noise span in the targets is replaced by extra_tokens_per_span_targets sentinel tokens.
23
+ This function tells us the required number of tokens in the raw example (for split_tokens())
24
+ as well as the length of the encoded targets. Note that this function assumes
25
+ the inputs and targets will have EOS appended and includes that in the reported length.
26
+ Args:
27
+ inputs_length: an integer - desired length of the tokenized inputs sequence
28
+ noise_density: a float
29
+ mean_noise_span_length: a float
30
+ Returns:
31
+ tokens_length: length of original text in tokens
32
+ targets_length: an integer - length in tokens of encoded targets sequence
33
+ """
34
+
35
+ def _tokens_length_to_inputs_length_targets_length(tokens_length):
36
+ num_noise_tokens = int(round(tokens_length * noise_density))
37
+ num_nonnoise_tokens = tokens_length - num_noise_tokens
38
+ num_noise_spans = int(round(num_noise_tokens / mean_noise_span_length))
39
+ # inputs contain all nonnoise tokens, sentinels for all noise spans
40
+ # and one EOS token.
41
+ _input_length = num_nonnoise_tokens + num_noise_spans + 1
42
+ _output_length = num_noise_tokens + num_noise_spans + 1
43
+ return _input_length, _output_length
44
+
45
+ tokens_length = inputs_length
46
+
47
+ while _tokens_length_to_inputs_length_targets_length(tokens_length + 1)[0] <= inputs_length:
48
+ tokens_length += 1
49
+
50
+ inputs_length, targets_length = _tokens_length_to_inputs_length_targets_length(
51
+ tokens_length)
52
+
53
+ # minor hack to get the targets length to be equal to inputs length
54
+ # which is more likely to have been set to a nice round number.
55
+ if noise_density == 0.5 and targets_length > inputs_length:
56
+ tokens_length -= 1
57
+ targets_length -= 1
58
+ return tokens_length, targets_length
59
+
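+ # Worked example: with inputs_length=512, noise_density=0.15 and mean_noise_span_length=3
+ # this returns (568, 114) -- raw text is grouped into 568-token chunks which, after span
+ # corruption, become exactly 512 input tokens and 114 target tokens (EOS included).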
60
+
61
+ class UnsuperviseT5Dataset(Dataset):
62
+ '''
63
+ Dataset used for T5 unsupervised pretraining.
64
+ load_data_type = 0: load raw data from data path and save tokenized data, call function load_data
65
+ load_data_type = 1: load tokenized data from path, call function load_tokenized_data
66
+ load_data_type = 2: load tokenized data from memory, call function load_tokenized_memory_data
67
+ '''
68
+
69
+ def __init__(self, data_path, args, load_data_type=0, data=None):
70
+ super().__init__()
71
+
72
+ if args.tokenizer_type == 't5_tokenizer':
73
+ if args.new_vocab_path is not None:
74
+ self.tokenizer = MT5Tokenizer.from_pretrained(args.new_vocab_path)
75
+ else:
76
+ self.tokenizer = MT5Tokenizer.from_pretrained(args.pretrained_model_path)
77
+ else:
78
+ self.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path)
79
+ self.noise_density = 0.15
80
+ self.mean_noise_span_length = 3
81
+ self.text_column_name = args.text_column_name
82
+ self.dataset_num_workers = args.dataset_num_workers
83
+ self.max_seq_length = args.max_seq_length
84
+ self.remove_columns = args.remove_columns
85
+ # whether to load tokenized data
86
+ self.load_data_type = load_data_type
87
+
88
+ if self.load_data_type == 0:
89
+ # T5-like span masked language modeling will fuse consecutively masked tokens to a single sentinel token.
90
+ # To ensure that the input length is `max_seq_length`, we need to increase the maximum length
91
+ # according to `mlm_probability` and `mean_noise_span_length`.
92
+ # We can also define the label length accordingly.
93
+ self.expanded_inputs_length, self.targets_length = compute_input_and_target_lengths(
94
+ inputs_length=self.max_seq_length,
95
+ noise_density=self.noise_density,
96
+ mean_noise_span_length=self.mean_noise_span_length,
97
+ )
98
+ print('self.expanded_inputs_length, self.targets_length:{},{}'.format(
99
+ self.expanded_inputs_length, self.targets_length))
100
+ self.data = self.load_data(data_path)
101
+ elif self.load_data_type == 1:
102
+ self.data = self.load_tokenized_data(data_path)
103
+ else:
104
+ assert data is not None
105
+ self.data = self.load_tokenized_memory_data(data)
106
+
107
+ def __len__(self):
108
+ return len(self.data)
109
+
110
+ def __getitem__(self, index):
111
+ return self.data[index]
112
+
113
+ def load_data(self, data_path):
114
+ # TODO: large data process
115
+ from data.fs_datasets import load_dataset
116
+ samples = load_dataset(
117
+ # samples = datasets.load_from_disk(data_path)['train']
118
+ data_path, num_proc=self.dataset_num_workers)['train']
119
+ # print(samples)
120
+ tokenized_datasets = samples.map(
121
+ self.tokenize_function,
122
+ batched=True,
123
+ num_proc=self.dataset_num_workers,
124
+ # load_from_cache_file=not data_args.overwrite_cache,
125
+ ).map(
126
+ batched=True,
127
+ num_proc=self.dataset_num_workers,
128
+ remove_columns=self.remove_columns)
129
+ # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a
130
+ # remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value
131
+ # might be slower to preprocess.
132
+ #
133
+ # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
134
+ # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
135
+ tokenized_datasets = tokenized_datasets.map(
136
+ self.group_texts,
137
+ batched=True,
138
+ num_proc=self.dataset_num_workers,
139
+ # load_from_cache_file=not data_args.overwrite_cache,
140
+ )
141
+ return tokenized_datasets
142
+ '''
143
+ Loads the tokenized data saved by the load_data function.
144
+ '''
145
+
146
+ def load_tokenized_data(self, data_path):
147
+ from data.fs_datasets import load_dataset
148
+ samples = load_dataset(data_path)['train']
149
+ return samples
150
+
151
+ def load_tokenized_memory_data(self, data):
152
+ return data
153
+
154
+ # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts.
155
+ # Since we make sure that all sequences are of the same length, no attention_mask is needed.
156
+ def tokenize_function(self, examples):
157
+ # add_special_tokens=False here so that no EOS token is inserted in the middle of a sentence
158
+ return self.tokenizer(examples[self.text_column_name],
159
+ add_special_tokens=False,
160
+ return_attention_mask=False)
161
+
162
+ # Main data processing function that will concatenate all texts from our dataset
163
+ # and generate chunks of expanded_inputs_length.
164
+ def group_texts(self, examples):
165
+ # Concatenate all texts.
166
+ concatenated_examples = {
167
+ k: list(chain(*examples[k])) for k in examples.keys()}
168
+ total_length = len(concatenated_examples[list(examples.keys())[0]])
169
+ # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
170
+ # customize this part to your needs.
171
+ if total_length >= self.expanded_inputs_length:
172
+ total_length = (
173
+ total_length // self.expanded_inputs_length) * self.expanded_inputs_length
174
+ # Split by chunks of max_len.
175
+ result = {
176
+ k: [t[i: i + self.expanded_inputs_length]
177
+ for i in range(0, total_length, self.expanded_inputs_length)]
178
+ for k, t in concatenated_examples.items()
179
+ }
180
+ return result
181
+
182
+
183
+ class UnsuperviseT5DataModel(pl.LightningDataModule):
184
+ @staticmethod
185
+ def add_data_specific_args(parent_args):
186
+ parser = parent_args.add_argument_group('UnsuperviseT5DataModel')
187
+ parser.add_argument('--dataset_num_workers', default=8, type=int)
188
+ parser.add_argument('--dataloader_num_workers', default=4, type=int)
189
+ parser.add_argument(
190
+ '--train_data_path', default='wudao_180g_mt5_tokenized', type=str)
191
+ parser.add_argument('--train_batchsize', default=2, type=int)
192
+ parser.add_argument('--valid_batchsize', default=2, type=int)
193
+ parser.add_argument('--train_split_size', default=None, type=float)
194
+ parser.add_argument('--tokenizer_type', default='t5_tokenizer', choices=['t5_tokenizer', 'bert_tokenizer'])
195
+ parser.add_argument('--text_column_name', default='text')
196
+ parser.add_argument('--remove_columns', nargs='+', default=[])
197
+ return parent_args
198
+
199
+ def __init__(self, args):
200
+ super().__init__()
201
+ self.save_hyperparameters(args)
202
+ if args.train_split_size is not None:
203
+ from data.fs_datasets import load_dataset
204
+ data_splits = load_dataset(args.train_data_path, num_proc=args.dataset_num_workers)
205
+ train_split = data_splits['train']
206
+ test_split = data_splits['test']
207
+ print('train:', train_split, '\ntest_data:', test_split)
208
+ self.train_dataset = UnsuperviseT5Dataset('', args, load_data_type=2, data=train_split)
209
+ self.test_dataset = UnsuperviseT5Dataset('', args, load_data_type=2, data=test_split)
210
+ else:
211
+ self.train_dataset = UnsuperviseT5Dataset(args.train_data_path, args, load_data_type=1)
212
+
213
+ self.config = MT5Config.from_pretrained(args.pretrained_model_path)
214
+ self.noise_density = 0.15
215
+ self.mean_noise_span_length = 3
216
+ self.pad_token_id = self.config.pad_token_id
217
+ self.decoder_start_token_id = self.config.decoder_start_token_id
218
+ self.eos_token_id = self.config.eos_token_id
219
+ self.vocab_size = self.config.vocab_size
220
+ self.max_seq_length = args.max_seq_length
221
+ # The legacy sentencepiece model already contains the extra_ids, while MT5Tokenizer would add another 100 extra_ids on top of it, so extra_ids=0 has to be passed explicitly
+ if args.tokenizer_type == 't5_tokenizer' and args.new_vocab_path is not None:
+ self.tokenizer = MT5Tokenizer.from_pretrained(args.new_vocab_path, extra_ids=0)
+ # When mt5 is loaded for the first time, vocab_size has to be updated to the new vocabulary size obtained after extracting the Chinese/English tokens
+ self.vocab_size = len(self.tokenizer)
226
+
227
+ # T5-like span masked language modeling will fuse consecutively masked tokens to a single sentinel token.
228
+ # To ensure that the input length is `max_seq_length`, we need to increase the maximum length
229
+ # according to `mlm_probability` and `mean_noise_span_length`. We can also define the label length accordingly.
230
+ self.expanded_inputs_length, self.targets_length = compute_input_and_target_lengths(
231
+ inputs_length=self.max_seq_length,
232
+ noise_density=self.noise_density,
233
+ mean_noise_span_length=self.mean_noise_span_length,
234
+ )
235
+
236
+ def train_dataloader(self):
237
+ from fengshen.data.universal_datamodule.universal_sampler import PretrainingSampler
238
+ from fengshen.data.universal_datamodule.universal_datamodule import get_consume_samples
239
+ # use a custom sampler so that resumed training continues from the correct sample
240
+ consumed_samples = get_consume_samples(self)
241
+ batch_sampler = PretrainingSampler(
242
+ total_samples=len(self.train_dataset),
243
+ consumed_samples=consumed_samples,
244
+ micro_batch_size=self.hparams.train_batchsize,
245
+ data_parallel_rank=self.trainer.global_rank,
246
+ data_parallel_size=self.trainer.world_size,
247
+ )
248
+ return DataLoader(
249
+ self.train_dataset,
250
+ batch_sampler=batch_sampler,
251
+ pin_memory=True,
252
+ num_workers=self.hparams.dataloader_num_workers,
253
+ collate_fn=self.collate_fn,
254
+ )
255
+
256
+ def val_dataloader(self):
257
+ sampler = torch.utils.data.distributed.DistributedSampler(
258
+ self.test_dataset, shuffle=False)
259
+ return DataLoader(
260
+ self.test_dataset,
261
+ sampler=sampler,
262
+ shuffle=False,
263
+ batch_size=self.hparams.valid_batchsize,
264
+ pin_memory=True,
265
+ num_workers=self.hparams.dataloader_num_workers,
266
+ collate_fn=self.collate_fn,
267
+ )
268
+
269
+ def predict_dataloader(self):
270
+ sampler = torch.utils.data.distributed.DistributedSampler(
271
+ self.test_dataset, shuffle=False)
272
+ return DataLoader(
273
+ self.test_dataset,
274
+ sampler=sampler,
275
+ shuffle=False,
276
+ batch_size=self.hparams.valid_batchsize,
277
+ pin_memory=True,
278
+ num_workers=self.hparams.dataloader_num_workers,
279
+ collate_fn=self.collate_fn,
280
+ )
281
+
282
+ def collate_fn(self, examples):
283
+ # convert list to dict and tensorize input
284
+ batch = BatchEncoding(
285
+ {k: np.array([examples[i][k] for i in range(len(examples))])
286
+ for k, v in examples[0].items()}
287
+ )
288
+
289
+ input_ids = np.array(batch['input_ids'])
290
+ batch_size, expanded_input_length = input_ids.shape
291
+ mask_indices = np.asarray([self.random_spans_noise_mask(
292
+ expanded_input_length) for i in range(batch_size)])
293
+ labels_mask = ~mask_indices
294
+
295
+ input_ids_sentinel = self.create_sentinel_ids(
296
+ mask_indices.astype(np.int8))
297
+ labels_sentinel = self.create_sentinel_ids(labels_mask.astype(np.int8))
298
+
299
+ batch["input_ids"] = self.filter_input_ids(
300
+ input_ids, input_ids_sentinel)
301
+ batch["labels"] = self.filter_input_ids(input_ids, labels_sentinel)
302
+
303
+ if batch["input_ids"].shape[-1] != self.max_seq_length:
+ raise ValueError(
+ f"`input_ids` are incorrectly preprocessed. `input_ids` length is \
+ {batch['input_ids'].shape[-1]}, but should be {self.max_seq_length}."
+ )
308
+
309
+ if batch["labels"].shape[-1] != self.targets_length:
310
+ raise ValueError(
311
+ f"`labels` are incorrectly preprocessed. `labels` length is \
312
+ {batch['labels'].shape[-1]}, but should be {self.targets_length}."
313
+ )
314
+
315
+ batch["decoder_input_ids"] = self.shift_tokens_right(
316
+ batch["labels"], self.pad_token_id, self.decoder_start_token_id
317
+ )
318
+
319
+ for k, v in batch.items():
320
+ batch[k] = torch.tensor(v)
321
+ # print(k, batch[k], self.tokenizer.batch_decode(batch[k]), '\n', flush=True)
322
+ return batch
323
+
324
+ def create_sentinel_ids(self, mask_indices):
325
+ """
326
+ Sentinel ids creation given the indices that should be masked.
327
+ The start indices of each mask are replaced by the sentinel ids in increasing
328
+ order. Consecutive mask indices to be deleted are replaced with `-1`.
329
+ """
330
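+ # Worked example (illustrative): a row mask_indices = [0, 1, 1, 0, 1] becomes
+ # [0, vocab_size-1, -1, 0, vocab_size-2]: each span start gets a descending sentinel id,
+ # and the continuation of a span gets -1 so filter_input_ids can drop it later.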
+ start_indices = mask_indices - \
331
+ np.roll(mask_indices, 1, axis=-1) * mask_indices
332
+ start_indices[:, 0] = mask_indices[:, 0]
333
+
334
+ sentinel_ids = np.where(start_indices != 0, np.cumsum(
335
+ start_indices, axis=-1), start_indices)
336
+ sentinel_ids = np.where(
337
+ sentinel_ids != 0, (self.vocab_size - sentinel_ids), 0)
338
+ sentinel_ids -= mask_indices - start_indices
339
+
340
+ return sentinel_ids
341
+
342
+ def filter_input_ids(self, input_ids, sentinel_ids):
343
+ """
344
+ Puts sentinel mask on `input_ids` and fuse consecutive mask tokens into a single mask token by deleting.
345
+ This will reduce the sequence length from `expanded_inputs_length` to `input_length`.
346
+ """
347
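+ # e.g. (illustrative) a row [t0, vocab_size-1, -1, t3, vocab_size-2] keeps only the
+ # entries >= 0, giving [t0, vocab_size-1, t3, vocab_size-2], and an eos id is appended.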
+ batch_size = input_ids.shape[0]
348
+
349
+ input_ids_full = np.where(sentinel_ids != 0, sentinel_ids, input_ids)
350
+ # input_ids tokens and sentinel tokens are >= 0, tokens < 0 are
351
+ # masked tokens coming after sentinel tokens and should be removed
352
+ input_ids = input_ids_full[input_ids_full >=
353
+ 0].reshape((batch_size, -1))
354
+ input_ids = np.concatenate(
355
+ [input_ids, np.full((batch_size, 1), self.eos_token_id, dtype=np.int32)], axis=-1
356
+ )
357
+ return input_ids
358
+
359
+ # Copied from transformers.models.bart.modeling_flax_bart.shift_tokens_right
360
+ def shift_tokens_right(self, input_ids: np.array, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
361
+ """
362
+ Shift input ids one token to the right.
363
+ """
364
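+ # e.g. (illustrative) [a, b, c] becomes [decoder_start_token_id, a, b]; any -100 entry
+ # is then replaced by pad_token_id.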
+ shifted_input_ids = np.zeros_like(input_ids)
365
+ shifted_input_ids[:, 1:] = input_ids[:, :-1]
366
+ shifted_input_ids[:, 0] = decoder_start_token_id
367
+
368
+ shifted_input_ids = np.where(
369
+ shifted_input_ids == -100, pad_token_id, shifted_input_ids)
370
+ return shifted_input_ids
371
+
372
+ def random_spans_noise_mask(self, length):
373
+ """This function is copy of `random_spans_helper <https://github.com/google-research/text-to-text-transfer-transformer/
374
+ blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2682>`__ .
375
+ Noise mask consisting of random spans of noise tokens.
376
+ The number of noise tokens and the number of noise spans and non-noise spans
377
+ are determined deterministically as follows:
378
+ num_noise_tokens = round(length * noise_density)
379
+ num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length)
380
+ Spans alternate between non-noise and noise, beginning with non-noise.
381
+ Subject to the above restrictions, all masks are equally likely.
382
+ Args:
383
+ length: an int32 scalar (length of the incoming token sequence)
384
+ noise_density: a float - approximate density of output mask
385
+ mean_noise_span_length: a number
386
+ Returns:
387
+ a boolean tensor with shape [length]
388
+ """
389
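+ # Illustrative numbers: for length=568 with noise_density=0.15 and mean_noise_span_length=3,
+ # roughly 85 noise tokens are split into about 28 noise spans interleaved with 28 non-noise spans.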
+
390
+ orig_length = length
391
+
392
+ num_noise_tokens = int(np.round(length * self.noise_density))
393
+ # avoid degeneracy by ensuring positive numbers of noise and nonnoise tokens.
394
+ num_noise_tokens = min(max(num_noise_tokens, 1), length - 1)
395
+ num_noise_spans = int(
396
+ np.round(num_noise_tokens / self.mean_noise_span_length))
397
+
398
+ # avoid degeneracy by ensuring positive number of noise spans
399
+ num_noise_spans = max(num_noise_spans, 1)
400
+ num_nonnoise_tokens = length - num_noise_tokens
401
+
402
+ # pick the lengths of the noise spans and the non-noise spans
403
+ def _random_segmentation(num_items, num_segments):
404
+ """Partition a sequence of items randomly into non-empty segments.
405
+ Args:
406
+ num_items: an integer scalar > 0
407
+ num_segments: an integer scalar in [1, num_items]
408
+ Returns:
409
+ a Tensor with shape [num_segments] containing positive integers that add
410
+ up to num_items
411
+ """
412
+ mask_indices = np.arange(num_items - 1) < (num_segments - 1)
413
+ np.random.shuffle(mask_indices)
414
+ first_in_segment = np.pad(mask_indices, [[1, 0]])
415
+ segment_id = np.cumsum(first_in_segment)
416
+ # count length of sub segments assuming that list is sorted
417
+ _, segment_length = np.unique(segment_id, return_counts=True)
418
+ return segment_length
419
+
420
+ noise_span_lengths = _random_segmentation(
421
+ num_noise_tokens, num_noise_spans)
422
+ nonnoise_span_lengths = _random_segmentation(
423
+ num_nonnoise_tokens, num_noise_spans)
424
+
425
+ interleaved_span_lengths = np.reshape(
426
+ np.stack([nonnoise_span_lengths, noise_span_lengths],
427
+ axis=1), [num_noise_spans * 2]
428
+ )
429
+ span_starts = np.cumsum(interleaved_span_lengths)[:-1]
430
+ span_start_indicator = np.zeros((length,), dtype=np.int8)
431
+ span_start_indicator[span_starts] = True
432
+ span_num = np.cumsum(span_start_indicator)
433
+ is_noise = np.equal(span_num % 2, 1)
434
+
435
+ return is_noise[:orig_length]
436
+
437
+
438
+ class TaskT5Dataset(Dataset):
439
+ def __init__(self, data_path, args):
440
+ super().__init__()
441
+ self.max_length = args.max_seq_length
442
+ if args.tokenizer_type == 't5_tokenizer':
443
+ self.tokenizer = MT5Tokenizer.from_pretrained(args.pretrained_model_path)
444
+ else:
445
+ self.tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_path)
446
+ self.data = self.load_data(data_path)
447
+
448
+ def __len__(self):
449
+ return len(self.data)
450
+
451
+ def __getitem__(self, index):
452
+ return self.encode(self.data[index])
453
+
454
+ def load_data(self, data_path):
455
+ samples = []
456
+ with open(data_path, 'r', encoding='utf8') as f:
457
+ lines = f.readlines()
458
+ for line in tqdm(lines):
459
+ samples.append(json.loads(line))
460
+ return samples
461
+
462
+ def encode(self, item):
463
+ if item["textb"] != "":
464
+ text = item['question'] + ','.join(item['choice'])+'。' + f"""{item["texta"]}""" + f"""{item["textb"]}"""
465
+ else:
466
+ text = f"""{item["question"]}""" + ",".join(item["choice"]) + "。" + f"""{item["texta"]}"""
467
+ label = item['answer']
468
+ encode_dict = self.tokenizer.encode_plus(text, max_length=self.max_length, padding='max_length',
469
+ truncation=True, return_tensors='pt')
470
+ decode_dict = self.tokenizer.encode_plus(label, max_length=16, padding='max_length',
471
+ truncation=True)
472
+
473
+ answer_token = []
474
+ max_label_len = 0
475
+ choice_encode = []  # used to determine the maximum length the model should generate
476
+ for a in item['choice']:
477
+ answer_encode = self.tokenizer.encode(a)
478
+ choice_encode.append(answer_encode)
479
+ if len(answer_encode) > max_label_len:
480
+ max_label_len = len(answer_encode)
481
+ for an in answer_encode:
482
+ if an not in answer_token:
483
+ answer_token.append(an)
484
+
485
+ # bad_words_ids = [[i] for i in range(self.tokenizer.vocab_size) if i not in answer_token] #不生成这些token
486
+
487
+ # while len(bad_words_ids)<self.tokenizer.vocab_size:
488
+ # bad_words_ids.append(bad_words_ids[0])
489
+
490
+ # bad_words_ids = [[423],[67],[878]]
491
+
492
+ encode_sent = encode_dict['input_ids'].squeeze()
+ attention_mask = encode_dict['attention_mask'].squeeze()
+ target = decode_dict['input_ids']
+ labels = torch.tensor(target)
+ # mask out padding so it is ignored by the loss
+ labels[labels == self.tokenizer.pad_token_id] = -100
+ 
+ return {
+ "input_ids": encode_sent.long(),
+ "attention_mask": attention_mask.float(),
+ "labels": labels.long(),
+ "force_words_ids": answer_token,
+ }
504
+
505
+
506
+ class TaskT5DataModel(pl.LightningDataModule):
507
+ @staticmethod
508
+ def add_data_specific_args(parent_args):
509
+ parser = parent_args.add_argument_group('TaskT5DataModel')
510
+ parser.add_argument('--dataset_num_workers', default=8, type=int)
511
+ parser.add_argument('--dataloader_num_workers', default=4, type=int)
512
+ parser.add_argument(
513
+ '--train_data_path', default='wudao_180g_mt5_tokenized', type=str)
514
+ parser.add_argument(
515
+ '--valid_data_path', default='wudao_180g_mt5_tokenized', type=str)
516
+ parser.add_argument('--train_batchsize', default=2, type=int)
517
+ parser.add_argument('--valid_batchsize', default=2, type=int)
518
+ parser.add_argument('--train_split_size', default=None, type=float)
519
+ parser.add_argument('--tokenizer_type', default='t5_tokenizer', choices=['t5_tokenizer', 'bert_tokenizer'])
520
+ parser.add_argument('--text_column_name', default='text')
521
+ parser.add_argument('--remove_columns', nargs='+', default=[])
522
+ return parent_args
523
+
524
+ def __init__(self, args):
525
+ super().__init__()
526
+ self.save_hyperparameters(args)
527
+ self.train_dataset = TaskT5Dataset(args.train_data_path, args)
528
+ self.valid_dataset = TaskT5Dataset(args.valid_data_path, args)
529
+
530
+ def train_dataloader(self):
531
+ from fengshen.data.universal_datamodule.universal_sampler import PretrainingSampler
532
+ from fengshen.data.universal_datamodule.universal_datamodule import get_consume_samples
533
+ # use a custom sampler so that resumed training continues from the correct sample
534
+ consumed_samples = get_consume_samples(self)
535
+ # batch_sampler = PretrainingRandomSampler(
536
+ batch_sampler = PretrainingSampler(
537
+ total_samples=len(self.train_dataset),
538
+ consumed_samples=consumed_samples,
539
+ micro_batch_size=self.hparams.train_batchsize,
540
+ data_parallel_rank=self.trainer.global_rank,
541
+ data_parallel_size=self.trainer.world_size,
542
+ )
543
+ # epoch=self.trainer.current_epoch
544
+ # )
545
+ return DataLoader(
546
+ self.train_dataset,
547
+ batch_sampler=batch_sampler,
548
+ pin_memory=True,
549
+ num_workers=self.hparams.dataloader_num_workers
550
+ )
551
+
552
+ def val_dataloader(self):
553
+ sampler = torch.utils.data.distributed.DistributedSampler(
554
+ self.valid_dataset, shuffle=False)
555
+ return DataLoader(
556
+ self.valid_dataset,
557
+ sampler=sampler,
558
+ shuffle=False,
559
+ batch_size=self.hparams.valid_batchsize,
560
+ pin_memory=True,
561
+ num_workers=self.hparams.dataloader_num_workers
562
+ )
fengshen/data/task_dataloader/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ # coding=utf-8
2
+ from .task_datasets import LCSTSDataModel, LCSTSDataset
3
+ __all__ = ['LCSTSDataModel', 'LCSTSDataset']
fengshen/data/task_dataloader/medicalQADataset.py ADDED
@@ -0,0 +1,137 @@
1
+ # coding=utf8
2
+ import ast
+ import os
3
+ import pytorch_lightning as pl
4
+ from torch.utils.data import DataLoader, Dataset
5
+ from tqdm import tqdm
6
+ from transformers import AutoTokenizer
7
+
8
+
9
+ class GPT2QADataset(Dataset):
10
+ '''
+ Dataset used for the Yuyuan medical QA task.
+ Only suitable for small datasets; loading large datasets may be slow.
+ For large datasets please use mmap datasets (work in progress).
+ '''
15
+
16
+ def __init__(self, data_path, name, args):
17
+ super().__init__()
18
+ self.tokenizer = AutoTokenizer.from_pretrained(
19
+ args.pretrained_model_path)
20
+ if self.tokenizer.pad_token is None:
21
+ self.tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
22
+ self.data_size = os.path.getsize(data_path)/1024/1024/1024
23
+ self.data_type_name = name
24
+ self.data = self.load_data(data_path)
25
+ self.max_seq_length = args.max_seq_length
26
+
27
+ def __len__(self):
28
+ return len(self.data)
29
+
30
+ def __getitem__(self, index):
31
+ return self.encode(self.data[index])
32
+
33
+ def load_data(self, data_path):
34
+ # load with a progress bar
35
+ if self.data_size <= 5:
36
+ with open(data_path, "rt", encoding='utf8') as f:
37
+ lines = f.readlines()
38
+ total_num = len(lines)
39
+ data_gen = lines
40
+ else:
41
+ data_gen = open(data_path, "rt", encoding='utf8')
42
+ total_num = None
43
+
44
+ data = []
45
+ with tqdm(total=total_num, desc=f'{self.data_type_name}处理进度', mininterval=0.3) as bar:
46
+ for idx, line in enumerate(data_gen):
47
+ data.append(self.data_parse(line))
48
+ bar.update()
49
+
50
+ if self.data_size > 5:
51
+ data_gen.close()
52
+ return data
53
+
54
+ def data_parse(self, line):
+ """
+ Parse one line of data (one dict literal per line).
+ """
+ # ast.literal_eval only evaluates literals, a safer drop-in for eval here
+ dic = ast.literal_eval(line.strip())
+ return dic
60
+
61
+ def encode(self, item):
+ """
+ Convert a raw sample into model training inputs.
+ """
65
+ inputs_dict = self.tokenizer.encode_plus(item['Question']+item['answer'],
66
+ max_length=self.max_seq_length, padding='max_length',
67
+ truncation=True, return_tensors='pt')
68
+ target = inputs_dict['input_ids']
69
+ labels = target.clone().detach()
70
+ labels[target == self.tokenizer.pad_token_id] = -100
71
+ return {
72
+ "input_ids": inputs_dict['input_ids'].squeeze(),
73
+ "attention_mask": inputs_dict['attention_mask'].squeeze(),
74
+ "labels": labels.squeeze(),
75
+ "question": item['Question'],
76
+ "answer": item['answer']
77
+ }
78
+
79
+
80
+ class GPT2QADataModel(pl.LightningDataModule):
81
+ @staticmethod
82
+ def add_data_specific_args(parent_args):
83
+ parser = parent_args.add_argument_group('GPT2QADataModel')
84
+ parser.add_argument('--data_dir', type=str, required=True)
85
+ parser.add_argument('--num_workers', default=2, type=int)
86
+ parser.add_argument('--train_data', default='train.txt', type=str)
87
+ parser.add_argument('--valid_data', default='valid.txt', type=str)
88
+ parser.add_argument('--test_data', default='test.txt', type=str)
89
+ parser.add_argument('--train_batchsize', type=int, required=True)
90
+ parser.add_argument('--valid_batchsize', type=int, required=True)
91
+ parser.add_argument('--max_seq_length', default=1024, type=int)
92
+ return parent_args
93
+
94
+ def __init__(self, args):
95
+ super().__init__()
96
+ self.args = args
97
+ self.train_batchsize = args.train_batchsize
98
+ self.valid_batchsize = args.valid_batchsize
99
+ if not args.do_eval_only:
100
+ self.train_data = GPT2QADataset(os.path.join(
101
+ args.data_dir, args.train_data), '训练集', args)
102
+ self.valid_data = GPT2QADataset(os.path.join(
103
+ args.data_dir, args.valid_data), '验证集', args)
104
+ self.test_data = GPT2QADataset(os.path.join(
105
+ args.data_dir, args.test_data), '测试集', args)
106
+
107
+ def train_dataloader(self):
108
+ return DataLoader(
109
+ self.train_data, shuffle=True,
110
+ batch_size=self.train_batchsize,
111
+ pin_memory=False, num_workers=self.args.num_workers)
112
+
113
+ def val_dataloader(self):
114
+ return DataLoader(self.valid_data, shuffle=False,
115
+ batch_size=self.valid_batchsize,
116
+ pin_memory=False, num_workers=self.args.num_workers)
117
+
118
+ def predict_dataloader(self):
119
+ return DataLoader(self.test_data, shuffle=False,
120
+ batch_size=self.valid_batchsize, pin_memory=False,
121
+ num_workers=self.args.num_workers)
122
+
123
+
124
+ if __name__ == '__main__':
125
+ import argparse
126
+ modelfile = '/cognitive_comp/wuziwei/pretrained_model_hf/medical_v2'
127
+ datafile = '/cognitive_comp/wuziwei/task-data/medical_qa/medical_qa_train.txt'
128
+ parser = argparse.ArgumentParser(description='hf test', allow_abbrev=False)
129
+ group = parser.add_argument_group(title='test args')
130
+ group.add_argument('--pretrained-model-path', type=str, default=modelfile,
131
+ help='Number of transformer layers.')
132
+ group.add_argument('--max-seq-length', type=int, default=1024)
133
+ args = parser.parse_args()
134
+
135
+ testml = GPT2QADataset(datafile, 'medical_qa', args=args)
136
+
137
+ print(testml[10])
fengshen/data/task_dataloader/task_datasets.py ADDED
@@ -0,0 +1,206 @@
1
+ # coding=utf8
2
+ from torch.utils.data import Dataset, DataLoader
3
+ from tqdm import tqdm
4
+ from transformers import AutoTokenizer
5
+ import json
6
+ import torch
7
+ import pytorch_lightning as pl
8
+ import os
9
+
10
+
11
+ class AbstractCollator:
12
+ """
13
+ collector for summary task
14
+ """
15
+
16
+ def __init__(self, tokenizer, max_enc_length, max_dec_length, prompt):
17
+ self.tokenizer = tokenizer
18
+ self.max_enc_length = max_enc_length
19
+ self.max_dec_length = max_dec_length
20
+ self.prompt = prompt
21
+
22
+ def __call__(self, samples):
23
+
24
+ labels = []
25
+ attn_mask = []
26
+ # decoder_attn_mask = []
27
+ source_inputs = []
28
+ for sample in samples:
29
+ encode_dict = self.tokenizer.encode_plus(
30
+ self.prompt + sample['text'],
31
+ max_length=self.max_enc_length,
32
+ padding='max_length',
33
+ truncation=True,
34
+ return_tensors='pt')
35
+ decode_dict = self.tokenizer.encode_plus(
36
+ sample['summary'],
37
+ max_length=self.max_dec_length,
38
+ padding='max_length',
39
+ truncation=True,
40
+ return_tensors='pt')
41
+ source_inputs.append(encode_dict['input_ids'].squeeze())
42
+ labels.append(decode_dict['input_ids'].squeeze())
43
+ attn_mask.append(encode_dict['attention_mask'].squeeze())
44
+ # decoder_attn_mask.append(decode_dict['attention_mask'].squeeze())
45
+ # labels = torch.tensor(decode_dict['input'])
46
+
47
+ source_inputs = torch.stack(source_inputs)
48
+ labels = torch.stack(labels)
49
+ attn_mask = torch.stack(attn_mask)
50
+ # decoder_attn_mask = torch.stack(decoder_attn_mask)
51
+ # decode_input_idxs = shift_tokens_right(labels, self.tokenizer.pad_token_id, self.tokenizer.pad_token_id)
52
+ end_token_index = torch.where(labels == self.tokenizer.eos_token_id)[1]
53
+ for idx, end_idx in enumerate(end_token_index):
54
+ labels[idx][end_idx + 1:] = -100
55
+
56
+ return {
57
+ "input_ids": source_inputs,
58
+ "attention_mask": attn_mask,
59
+ "labels": labels,
60
+ "text": [sample['text'] for sample in samples],
61
+ "summary": [sample['summary'] for sample in samples]
62
+ }
63
+
64
+
65
+ class LCSTSDataset(Dataset):
66
+ '''
67
+ Dataset Used for LCSTS summary task.
68
+ '''
69
+
70
+ def __init__(self, data_path, args):
71
+ super().__init__()
72
+ self.tokenizer = AutoTokenizer.from_pretrained(
73
+ args.pretrained_model_path, use_fast=False)
74
+ self.data = self.load_data(data_path)
75
+ self.prompt = args.prompt
76
+ self.max_enc_length = args.max_enc_length
77
+ self.max_dec_length = args.max_dec_length
78
+
79
+ def __len__(self):
80
+ return len(self.data)
81
+
82
+ def __getitem__(self, index):
83
+ return self.encode(self.data[index])
84
+
85
+ def load_data(self, data_path):
86
+ with open(data_path, "r", encoding='utf8') as f:
87
+ lines = f.readlines()
88
+ samples = []
89
+ for line in tqdm(lines):
90
+ obj = json.loads(line)
91
+ source = obj['text']
92
+ target = obj['summary']
93
+ samples.append({
94
+ "text": source,
95
+ "summary": target
96
+ })
97
+ return samples
98
+
99
+ def cal_data(self, data_path):
100
+ with open(data_path, "r", encoding='utf8') as f:
101
+ lines = f.readlines()
102
+ samples = []
103
+ enc_sizes = []
104
+ dec_sizes = []
105
+ for line in tqdm(lines):
106
+ obj = json.loads(line.strip())
107
+ source = obj['text']
108
+ target = obj['summary']
109
+ enc_input_ids = self.tokenizer.encode(source)
110
+ target = self.tokenizer.encode(target)
111
+ enc_sizes.append(len(enc_input_ids))
112
+ dec_sizes.append(len(target)-1)
113
+ samples.append({
114
+ "enc_input_ids": enc_input_ids,
115
+ "dec_input_ids": target[:-1],
116
+ "label_ids": target[1:]
117
+ })
118
+ max_enc_len = max(enc_sizes)
119
+ max_dec_len = max(dec_sizes)
120
+ import numpy as np
121
+ # mean of len(enc_input_ids): 74.68041911345998
122
+ # mean of len(dec_input_ids): 14.02265483791283
123
+ # max of len(enc_input_ids): 132
124
+ # max of len(dec_input_ids): 31
125
+ print('mean of len(enc_input_ids):', np.mean(enc_sizes),
126
+ 'mean of len(dec_input_ids):', np.mean(dec_sizes),
127
+ 'max of len(enc_input_ids):', max_enc_len,
128
+ 'max of len(dec_input_ids):', max_dec_len)
129
+ return samples
130
+
131
+ def encode(self, item):
132
+ encode_dict = self.tokenizer.encode_plus(
133
+ self.prompt + item['text'],
134
+ max_length=self.max_enc_length,
135
+ padding='max_length',
136
+ truncation=True,
137
+ return_tensors='pt')
138
+ decode_dict = self.tokenizer.encode_plus(
139
+ item['summary'],
140
+ max_length=self.max_dec_length,
141
+ padding='max_length',
142
+ truncation=True)
143
+
144
+ target = decode_dict['input_ids']
145
+ # print('encode_dict shape:', encode_dict['input_ids'].shape)
146
+ labels = torch.tensor(target)
147
+ labels[target == self.tokenizer.pad_token_id] = -100
148
+ return {
149
+ "input_ids": encode_dict['input_ids'].squeeze(),
150
+ "attention_mask": encode_dict['attention_mask'].squeeze(),
151
+ "labels": labels.squeeze(),
152
+ "text": item['text'],
153
+ "summary": item['summary']
154
+ }
155
+
156
+
157
+ class LCSTSDataModel(pl.LightningDataModule):
158
+ @staticmethod
159
+ def add_data_specific_args(parent_args):
160
+ parser = parent_args.add_argument_group('LCSTSDataModel')
161
+ parser.add_argument(
162
+ '--data_dir', default='/cognitive_comp/ganruyi/data_datasets_LCSTS_LCSTS/', type=str)
163
+ parser.add_argument('--num_workers', default=8, type=int)
164
+ parser.add_argument('--train_data', default='train.jsonl', type=str)
165
+ parser.add_argument('--valid_data', default='valid.jsonl', type=str)
166
+ parser.add_argument('--test_data', default='test_public.jsonl', type=str)
167
+ parser.add_argument('--train_batchsize', default=128, type=int)
168
+ parser.add_argument('--valid_batchsize', default=128, type=int)
169
+ parser.add_argument('--max_enc_length', default=128, type=int)
170
+ parser.add_argument('--max_dec_length', default=30, type=int)
171
+ parser.add_argument('--prompt', default='summarize:', type=str)
172
+ return parent_args
173
+
174
+ def __init__(self, args):
175
+ super().__init__()
176
+ self.args = args
177
+ self.train_batchsize = args.train_batchsize
178
+ self.valid_batchsize = args.valid_batchsize
179
+ if not args.do_eval_only:
180
+ self.train_data = LCSTSDataset(os.path.join(
181
+ args.data_dir, args.train_data), args)
182
+ self.valid_data = LCSTSDataset(os.path.join(
183
+ args.data_dir, args.valid_data), args)
184
+ self.test_data = LCSTSDataset(os.path.join(
185
+ args.data_dir, args.test_data), args)
186
+
187
+ def train_dataloader(self):
188
+ return DataLoader(self.train_data,
189
+ shuffle=True,
190
+ batch_size=self.train_batchsize,
191
+ pin_memory=False,
192
+ num_workers=self.args.num_workers)
193
+
194
+ def val_dataloader(self):
195
+ return DataLoader(self.valid_data,
196
+ shuffle=False,
197
+ batch_size=self.valid_batchsize,
198
+ pin_memory=False,
199
+ num_workers=self.args.num_workers)
200
+
201
+ def predict_dataloader(self):
202
+ return DataLoader(self.test_data,
203
+ shuffle=False,
204
+ batch_size=self.valid_batchsize,
205
+ pin_memory=False,
206
+ num_workers=self.args.num_workers)
fengshen/data/universal_datamodule/__init__.py ADDED
@@ -0,0 +1,4 @@
1
+ from .universal_datamodule import UniversalDataModule
2
+ from .universal_sampler import PretrainingSampler, PretrainingRandomSampler
3
+
4
+ __all__ = ['UniversalDataModule', 'PretrainingSampler', 'PretrainingRandomSampler']
fengshen/data/universal_datamodule/universal_datamodule.py ADDED
@@ -0,0 +1,161 @@
1
+ from pytorch_lightning import LightningDataModule
2
+ from typing import Optional
3
+
4
+ from torch.utils.data import DataLoader, DistributedSampler
5
+
6
+
7
+ def get_consume_samples(data_model: LightningDataModule) -> int:
8
+ if hasattr(data_model.trainer.lightning_module, 'consumed_samples'):
9
+ consumed_samples = data_model.trainer.lightning_module.consumed_samples
10
+ print('get consumed samples from model: {}'.format(consumed_samples))
11
+ else:
12
+ world_size = data_model.trainer.world_size
13
+ consumed_samples = max(0, data_model.trainer.global_step - 1) * \
14
+ data_model.hparams.train_batchsize * world_size * data_model.trainer.accumulate_grad_batches
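+ # e.g. (illustrative) global_step=101, train_batchsize=32, world_size=8,
+ # accumulate_grad_batches=1 -> consumed_samples = 100 * 32 * 8 * 1 = 25600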
15
+ print('calculate consumed samples: {}'.format(consumed_samples))
16
+ return consumed_samples
17
+
18
+
19
+ class UniversalDataModule(LightningDataModule):
20
+ @staticmethod
21
+ def add_data_specific_args(parent_args):
22
+ parser = parent_args.add_argument_group('Universal DataModule')
23
+ parser.add_argument('--num_workers', default=8, type=int)
24
+ parser.add_argument('--dataloader_workers', default=2, type=int)
25
+ parser.add_argument('--train_batchsize', default=32, type=int)
26
+ parser.add_argument('--val_batchsize', default=32, type=int)
27
+ parser.add_argument('--test_batchsize', default=32, type=int)
28
+ parser.add_argument('--datasets_name', type=str, default=None)
29
+ parser.add_argument('--train_datasets_field', type=str, default='train')
30
+ parser.add_argument('--val_datasets_field', type=str, default='validation')
31
+ parser.add_argument('--test_datasets_field', type=str, default='test')
32
+ parser.add_argument('--train_file', type=str, default=None)
33
+ parser.add_argument('--val_file', type=str, default=None)
34
+ parser.add_argument('--test_file', type=str, default=None)
35
+ parser.add_argument('--raw_file_type', type=str, default='json')
36
+ parser.add_argument('--sampler_type', type=str,
37
+ choices=['single',
38
+ 'random'],
39
+ default='random')
40
+ return parent_args
41
+
42
+ def __init__(
43
+ self,
44
+ tokenizer,
45
+ collate_fn,
46
+ args,
47
+ datasets=None,
48
+ **kwargs,
49
+ ):
50
+ super().__init__()
51
+ # if no datasets name is passed in, self.datasets can be replaced from outside the object with whatever the model needs
52
+ if datasets is not None:
53
+ self.datasets = datasets
54
+ elif args.datasets_name is not None:
55
+ from fengshen.data.fs_datasets import load_dataset
56
+ print('---------begin to load datasets {}'.format(args.datasets_name))
57
+ self.datasets = load_dataset(
58
+ args.datasets_name, num_proc=args.num_workers)
59
+ print('---------ending load datasets {}'.format(args.datasets_name))
60
+ else:
61
+ print('---------begin to load datasets from local file')
62
+ from datasets import load_dataset
63
+ self.datasets = load_dataset(args.raw_file_type,
64
+ data_files={
65
+ args.train_datasets_field: args.train_file,
66
+ args.val_datasets_field: args.val_file,
67
+ args.test_datasets_field: args.test_file})
68
+ print('---------end to load datasets from local file')
69
+
70
+ self.tokenizer = tokenizer
71
+ self.collate_fn = collate_fn
72
+ self.save_hyperparameters(args)
73
+
74
+ def get_custom_sampler(self, ds):
75
+ from .universal_sampler import PretrainingRandomSampler
76
+ from .universal_sampler import PretrainingSampler
77
+ world_size = self.trainer.world_size
78
+ consumed_samples = get_consume_samples(self)
79
+ # use the user default sampler
80
+ if self.hparams.sampler_type == 'random':
81
+ return PretrainingRandomSampler(
82
+ total_samples=len(ds),
83
+ # consumed_samples cal by global steps
84
+ consumed_samples=consumed_samples,
85
+ micro_batch_size=self.hparams.train_batchsize,
86
+ data_parallel_rank=self.trainer.global_rank,
87
+ data_parallel_size=world_size,
88
+ epoch=self.trainer.current_epoch,
89
+ )
90
+ elif self.hparams.sampler_type == 'single':
91
+ return PretrainingSampler(
92
+ total_samples=len(ds),
93
+ # consumed_samples cal by global steps
94
+ consumed_samples=consumed_samples,
95
+ micro_batch_size=self.hparams.train_batchsize,
96
+ data_parallel_rank=self.trainer.global_rank,
97
+ data_parallel_size=world_size,
98
+ )
99
+ else:
100
+ raise Exception('Unknown sampler type: {}'.format(self.hparams.sampler_type))
101
+
102
+ def setup(self, stage: Optional[str] = None) -> None:
103
+ return
104
+
105
+ def train_dataloader(self):
106
+ ds = self.datasets[self.hparams.train_datasets_field]
107
+
108
+ collate_fn = self.collate_fn
109
+ if collate_fn is None and hasattr(ds, 'collater'):
110
+ collate_fn = ds.collater
111
+
112
+ if self.hparams.replace_sampler_ddp is False:
113
+ return DataLoader(
114
+ ds,
115
+ batch_sampler=self.get_custom_sampler(ds),
116
+ num_workers=self.hparams.dataloader_workers,
117
+ collate_fn=collate_fn,
118
+ pin_memory=True,
119
+ )
120
+ return DataLoader(
121
+ ds,
122
+ batch_size=self.hparams.train_batchsize,
123
+ num_workers=self.hparams.dataloader_workers,
124
+ collate_fn=collate_fn,
125
+ pin_memory=True,
126
+ )
127
+
128
+ def val_dataloader(self):
129
+ ds = self.datasets[self.hparams.val_datasets_field]
130
+ collate_fn = self.collate_fn
131
+ if collate_fn is None and hasattr(ds, 'collater'):
132
+ collate_fn = ds.collater
133
+
134
+ return DataLoader(
135
+ ds,
136
+ batch_size=self.hparams.val_batchsize,
137
+ shuffle=False,
138
+ num_workers=self.hparams.dataloader_workers,
139
+ collate_fn=collate_fn,
140
+ sampler=DistributedSampler(
141
+ ds, shuffle=False),
142
+ pin_memory=True,
143
+ )
144
+
145
+ def test_dataloader(self):
146
+ ds = self.datasets[self.hparams.test_datasets_field]
147
+
148
+ collate_fn = self.collate_fn
149
+ if collate_fn is None and hasattr(ds, 'collater'):
150
+ collate_fn = ds.collater
151
+
152
+ return DataLoader(
153
+ ds,
154
+ batch_size=self.hparams.test_batchsize,
155
+ shuffle=False,
156
+ num_workers=self.hparams.dataloader_workers,
157
+ collate_fn=collate_fn,
158
+ sampler=DistributedSampler(
159
+ ds, shuffle=False),
160
+ pin_memory=True,
161
+ )
fengshen/data/universal_datamodule/universal_sampler.py ADDED
@@ -0,0 +1,125 @@
1
+ # coding=utf-8
2
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ """Dataloaders."""
17
+
18
+
19
+ import torch
20
+
21
+
22
+ class PretrainingSampler:
23
+
24
+ def __init__(self, total_samples, consumed_samples, micro_batch_size,
25
+ data_parallel_rank, data_parallel_size, drop_last=True):
26
+ # Keep a copy of input params for later use.
27
+ self.total_samples = total_samples
28
+ self.consumed_samples = consumed_samples
29
+ self.micro_batch_size = micro_batch_size
30
+ self.data_parallel_rank = data_parallel_rank
31
+ self.micro_batch_times_data_parallel_size = \
32
+ self.micro_batch_size * data_parallel_size
33
+ self.drop_last = drop_last
34
+
35
+ # Sanity checks.
36
+ assert self.total_samples > 0, \
37
+ 'no sample to consume: {}'.format(self.total_samples)
38
+ assert self.consumed_samples < self.total_samples, \
39
+ 'no samples left to consume: {}, {}'.format(self.consumed_samples,
40
+ self.total_samples)
41
+ assert self.micro_batch_size > 0
42
+ assert data_parallel_size > 0
43
+ assert self.data_parallel_rank < data_parallel_size, \
44
+ 'data_parallel_rank should be smaller than data size: {}, ' \
45
+ '{}'.format(self.data_parallel_rank, data_parallel_size)
46
+
47
+ def __len__(self):
48
+ return self.total_samples // self.micro_batch_times_data_parallel_size
49
+
50
+ def get_start_end_idx(self):
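+ # e.g. (illustrative) with micro_batch_size=4 and data_parallel_rank=1 this rank takes
+ # indices [4:8] out of every global batch of micro_batch_size * data_parallel_size samples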
51
+ start_idx = self.data_parallel_rank * self.micro_batch_size
52
+ end_idx = start_idx + self.micro_batch_size
53
+ return start_idx, end_idx
54
+
55
+ def __iter__(self):
56
+ batch = []
57
+ # Last batch will be dropped if drop_last is not set False
58
+ for idx in range(self.consumed_samples, self.total_samples):
59
+ batch.append(idx)
60
+ if len(batch) == self.micro_batch_times_data_parallel_size:
61
+ start_idx, end_idx = self.get_start_end_idx()
62
+ yield batch[start_idx:end_idx]
63
+ batch = []
64
+
65
+ # Check the last partial batch and see drop_last is set
66
+ if len(batch) > 0 and not self.drop_last:
67
+ start_idx, end_idx = self.get_start_end_idx()
68
+ yield batch[start_idx:end_idx]
69
+
70
+
71
+ class PretrainingRandomSampler:
72
+
73
+ def __init__(self, total_samples, consumed_samples, micro_batch_size,
74
+ data_parallel_rank, data_parallel_size, epoch):
75
+ # Keep a copy of input params for later use.
76
+ self.total_samples = total_samples
77
+ self.consumed_samples = consumed_samples
78
+ self.micro_batch_size = micro_batch_size
79
+ self.data_parallel_rank = data_parallel_rank
80
+ self.data_parallel_size = data_parallel_size
81
+ self.micro_batch_times_data_parallel_size = \
82
+ self.micro_batch_size * data_parallel_size
83
+ self.last_batch_size = \
84
+ self.total_samples % self.micro_batch_times_data_parallel_size
85
+ self.epoch = epoch
86
+
87
+ # Sanity checks.
88
+ assert self.total_samples > 0, \
89
+ 'no sample to consume: {}'.format(self.total_samples)
90
+ assert self.micro_batch_size > 0
91
+ assert data_parallel_size > 0
92
+ assert self.data_parallel_rank < data_parallel_size, \
93
+ 'data_parallel_rank should be smaller than data size: {}, ' \
94
+ '{}'.format(self.data_parallel_rank, data_parallel_size)
95
+
96
+ def __len__(self):
97
+ return self.total_samples // self.micro_batch_times_data_parallel_size
98
+
99
+ def __iter__(self):
100
+ active_total_samples = self.total_samples - self.last_batch_size
101
+ current_epoch_samples = self.consumed_samples % active_total_samples
102
+ assert current_epoch_samples % self.micro_batch_times_data_parallel_size == 0
103
+
104
+ # data sharding and random sampling
105
+ bucket_size = (self.total_samples // self.micro_batch_times_data_parallel_size) \
106
+ * self.micro_batch_size
107
+ bucket_offset = current_epoch_samples // self.data_parallel_size
108
+ start_idx = self.data_parallel_rank * bucket_size
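+ # e.g. (illustrative) total_samples=10000, micro_batch_size=4, data_parallel_size=8:
+ # 10000 // 32 = 312 global batches, so each rank shuffles a bucket of 312 * 4 = 1248 indices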
109
+
110
+ g = torch.Generator()
111
+ g.manual_seed(self.epoch)
112
+ random_idx = torch.randperm(bucket_size, generator=g).tolist()
113
+ idx_range = [start_idx + x for x in random_idx[bucket_offset:]]
114
+
115
+ batch = []
116
+ # Last batch if not complete will be dropped.
117
+ for idx in idx_range:
118
+ batch.append(idx)
119
+ if len(batch) == self.micro_batch_size:
120
+ self.consumed_samples += self.micro_batch_times_data_parallel_size
121
+ yield batch
122
+ batch = []
123
+
124
+ def set_epoch(self, epoch):
125
+ self.epoch = epoch
fengshen/examples/FastDemo/README.md ADDED
@@ -0,0 +1,105 @@
+ # Build a demo for your algorithm quickly with "streamlit"
+ Before building the demo, make sure the following preparations are done:
+ - the model has finished training
+ - the model's input arguments are settled
+ - the streamlit library is installed; `pip install streamlit` is all it takes.
+
+ A streamlit script is launched with `streamlit run demo.py`, which immediately serves a demo page, and the page refreshes in real time as the script changes. If you have no prior experience, create a demo.py file, follow the tutorial below step by step, and watch how the page renders. The details are explained in the code comments!
+
+ ### Step 1: imports
+ ```python
+ import streamlit as st
+ # import other packages as you need them
+ ```
+ [streamlit](https://streamlit.io) is a Python framework for building machine learning, deep learning and data visualization demos. It requires no web development experience: if you can write Python, you can build a demo efficiently.
+
+ ### Step 2: page metadata and layout configuration
+
+ ```python
+ st.set_page_config(
+     page_title="余元医疗问答",  # browser tab title
+     page_icon=":shark:",  # browser tab icon
+     layout="wide",  # page layout
+     initial_sidebar_state="expanded",  # layout of the left sidebar
+     # menu button configuration
+     menu_items={
+         'Get Help': 'https://www.extremelycoolapp.com/help',
+         'Report a bug': "https://www.extremelycoolapp.com/bug",
+         'About': "# This is a header. This is an *extremely* cool app!"
+     }
+ )
+ ```
+ This step can be skipped; add these settings if you want a more personalized app.
+
+ ### Step 3: set the demo title
+ ```python
+ st.title('Demo for MedicalQA')
+ ```
+ Every streamlit widget comes with a sensible default style on the page.
+
+ ### Step 4: configure the demo parameters
+
+ ```python
+ # the sidebar is used here as the parameter configuration panel
+ st.sidebar.header("参数配置")
+ # a form is created inside the sidebar; every form needs a title and a submit button
+ sbform = st.sidebar.form("固定参数设置")
+ # slider is a slider widget for numeric parameters
+ n_sample = sbform.slider("设置返回条数",min_value=1,max_value=10,value=3)
+ text_length = sbform.slider('生成长度:',min_value=32,max_value=512,value=64,step=32)
+ text_level = sbform.slider('文本多样性:',min_value=0.1,max_value=1.0,value=0.9,step=0.1)
+ # number_input also configures numeric parameters
+ model_id = sbform.number_input('选择模型号:',min_value=0,max_value=13,value=13,step=1)
+ # selectbox is a selection widget restricted to the configured options
+ trans = sbform.selectbox('选择翻译内核',['百度通用','医疗生物'])
+ # submit the form; only then do these parameter values take effect
+ sbform.form_submit_button("提交配置")
+
+ # parameter configuration in the main page, the other main part of the demo
+ form = st.form("参数设置")
+ # this is a QA demo, so the user's question is collected with the text_input widget
+ input_text = form.text_input('请输入你的问题:',value='',placeholder='例如:糖尿病的症状有哪些?')
+ form.form_submit_button("提交")
+ ```
+ With that, the demo's parameters are basically configured.
+
+ ### Step 5: model prediction
+ ```python
+ # define a forward prediction function
+ # @st.cache(suppress_st_warning=True)
+ def generate_qa(input_text,n_sample,model_id='7',length=64,translator='baidu',level=0.7):
+     # here the model is served behind an API built with fastapi
+     URL = 'http://192.168.190.63:6605/qa'
+     data = {
+         "text":input_text,"n_sample":n_sample,
+         "model_id":model_id,"length":length,
+         'translator':translator,'level':level
+     }
+     r = requests.get(URL,params=data)
+     return r.text
+ # model prediction results
+ results = generate_qa(input_text,n_sample,model_id=str(model_id),
+                       translator=translator,length=text_length,level=text_level)
+ ```
+ A note on deployment: the machine hosting this demo has no GPU, so the model is served in the background with FastAPI. If the demo machine can host the model directly, the prediction code can simply live here instead of being deployed separately and called through an API. One thing to watch out for in that case: streamlit re-executes the whole script from top to bottom on every interaction, so the model could be reloaded over and over. This is what the st.cache component is for: when the inputs have not changed, the cached result of that step is reused instead of re-executing it, so efficiency does not suffer.
+
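+ As a concrete illustration of that caching point, here is a minimal sketch (the resource name `load_model` and the checkpoint path are placeholders made up for this example, and it assumes the classic `st.cache` decorator) that caches a locally loaded model so reruns of the script do not reload it:
+
+ ```python
+ import streamlit as st
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ @st.cache(allow_output_mutation=True)  # reuse the returned objects across script reruns
+ def load_model(model_path="path/to/your/checkpoint"):  # placeholder path
+     tokenizer = AutoTokenizer.from_pretrained(model_path)
+     model = AutoModelForCausalLM.from_pretrained(model_path)
+     return tokenizer, model
+
+ tokenizer, model = load_model()  # loaded once, then served from the cache
+ ```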
+ ### Step 6: display the results
+ ```python
+ with st.spinner('老夫正在思考中🤔...'):
+     if input_text:
+         results = generate_qa(input_text,n_sample,model_id=str(model_id),
+                               translator=translator,length=text_length,level=text_level)
+         for idx,item in enumerate(eval(results),start=1):
+             st.markdown(f"""
+             **候选回答「{idx}」:**\n
+             """)
+             st.info('中文:%s'%item['fy_next_sentence'])
+             st.info('英文:%s'%item['next_sentence'])
+ ```
+ streamlit ships rich components for displaying different kinds of content; text can be shown with the `st.markdown` component as well as `st.text` and `st.write`. More components and features are documented at https://docs.streamlit.io
+
+ With that, a complete demo is done. It looks like this:
+
+ ![](./image/demo.png)
+
+ The complete code is available at: `Fengshenbang-LM/fengshen/examples/FastDemo/YuyuanQA.py`
fengshen/examples/FastDemo/YuyuanQA.py ADDED
@@ -0,0 +1,71 @@
1
+ import requests
2
+ import langid
3
+ import streamlit as st
4
+ from translate import baiduTranslatorMedical
5
+ from translate import baiduTranslator
6
+
7
+ langid.set_languages(['en', 'zh'])
8
+ lang_dic = {'zh': 'en', 'en': 'zh'}
9
+
10
+ st.set_page_config(
11
+ page_title="余元医疗问答",
12
+ page_icon=":shark:",
13
+ # layout="wide",
14
+ initial_sidebar_state="expanded",
15
+ menu_items={
16
+ 'Get Help': 'https://www.extremelycoolapp.com/help',
17
+ 'Report a bug': "https://www.extremelycoolapp.com/bug",
18
+ 'About': "# This is a header. This is an *extremely* cool app!"
19
+ }
20
+ )
21
+ st.title('Demo for MedicalQA')
22
+
23
+
24
+ st.sidebar.header("参数配置")
25
+ sbform = st.sidebar.form("固定参数设置")
26
+ n_sample = sbform.slider("设置返回条数", min_value=1, max_value=10, value=3)
27
+ text_length = sbform.slider('生成长度:', min_value=32, max_value=512, value=64, step=32)
28
+ text_level = sbform.slider('文本多样性:', min_value=0.1, max_value=1.0, value=0.9, step=0.1)
29
+ model_id = sbform.number_input('选择模型号:', min_value=0, max_value=13, value=13, step=1)
30
+ trans = sbform.selectbox('选择翻译内核', ['百度通用', '医疗生物'])
31
+ sbform.form_submit_button("配置")
32
+
33
+
34
+ form = st.form("参数设置")
35
+ input_text = form.text_input('请输入你的问题:', value='', placeholder='例如:糖尿病的症状有哪些?')
36
+ if trans == '百度通用':
37
+ translator = 'baidu_common'
38
+ else:
39
+ translator = 'baidu'
40
+ if input_text:
41
+ lang = langid.classify(input_text)[0]
42
+ if translator == 'baidu':
43
+ st.write('**你的问题是:**', baiduTranslatorMedical(input_text, src=lang, dest=lang_dic[lang]).text)
44
+ else:
45
+ st.write('**你的问题是:**', baiduTranslator(input_text, src=lang, dest=lang_dic[lang]).text)
46
+
47
+ form.form_submit_button("提交")
48
+
49
+ # @st.cache(suppress_st_warning=True)
50
+
51
+
52
+ def generate_qa(input_text, n_sample, model_id='7', length=64, translator='baidu', level=0.7):
53
+ # st.write('调用了generate函数')
54
+ URL = 'http://192.168.190.63:6605/qa'
55
+ data = {"text": input_text, "n_sample": n_sample, "model_id": model_id,
56
+ "length": length, 'translator': translator, 'level': level}
57
+ r = requests.get(URL, params=data)
58
+ return r.text
59
+ # my_bar = st.progress(80)
60
+
61
+
62
+ with st.spinner('老夫正在思考中🤔...'):
63
+ if input_text:
64
+ results = generate_qa(input_text, n_sample, model_id=str(model_id),
65
+ translator=translator, length=text_length, level=text_level)
66
+ for idx, item in enumerate(eval(results), start=1):
67
+ st.markdown(f"""
68
+ **候选回答「{idx}」:**\n
69
+ """)
70
+ st.info('中文:%s' % item['fy_next_sentence'])
71
+ st.info('英文:%s' % item['next_sentence'])
fengshen/examples/FastDemo/image/demo.png ADDED
fengshen/examples/classification/demo_classification_afqmc_erlangshen_offload.sh ADDED
@@ -0,0 +1,103 @@
1
+ MODEL_NAME="IDEA-CCNL/Erlangshen-MegatronBert-1.3B"
2
+
3
+ TEXTA_NAME=sentence1
4
+ TEXTB_NAME=sentence2
5
+ LABEL_NAME=label
6
+ ID_NAME=id
7
+
8
+ BATCH_SIZE=1
9
+ VAL_BATCH_SIZE=1
10
+ ZERO_STAGE=3
11
+ config_json="./ds_config.json"
12
+
13
+ cat <<EOT > $config_json
14
+ {
15
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
16
+ "steps_per_print": 1000,
17
+ "gradient_clipping": 1,
18
+ "zero_optimization": {
19
+ "stage": ${ZERO_STAGE},
20
+ "offload_optimizer": {
21
+ "device": "cpu",
22
+ "pin_memory": true
23
+ },
24
+ "offload_param": {
25
+ "device": "cpu",
26
+ "pin_memory": true
27
+ },
28
+ "overlap_comm": true,
29
+ "contiguous_gradients": true,
30
+ "sub_group_size": 1e9,
31
+ "stage3_max_live_parameters": 1e9,
32
+ "stage3_max_reuse_distance": 1e9
33
+ },
34
+ "zero_allow_untested_optimizer": false,
35
+ "fp16": {
36
+ "enabled": true,
37
+ "loss_scale": 0,
38
+ "loss_scale_window": 1000,
39
+ "hysteresis": 2,
40
+ "min_loss_scale": 1
41
+ },
42
+ "activation_checkpointing": {
43
+ "partition_activations": false,
44
+ "contiguous_memory_optimization": false
45
+ },
46
+ "wall_clock_breakdown": false
47
+ }
48
+ EOT
49
+
50
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
51
+
52
+ DATA_ARGS="\
53
+ --dataset_name IDEA-CCNL/AFQMC \
54
+ --train_batchsize $BATCH_SIZE \
55
+ --valid_batchsize $VAL_BATCH_SIZE \
56
+ --max_length 128 \
57
+ --texta_name $TEXTA_NAME \
58
+ --textb_name $TEXTB_NAME \
59
+ --label_name $LABEL_NAME \
60
+ --id_name $ID_NAME \
61
+ "
62
+
63
+ MODEL_ARGS="\
64
+ --learning_rate 1e-5 \
65
+ --weight_decay 1e-1 \
66
+ --warmup_ratio 0.01 \
67
+ --num_labels 2 \
68
+ --model_type huggingface-auto \
69
+ "
70
+
71
+ MODEL_CHECKPOINT_ARGS="\
72
+ --monitor val_acc \
73
+ --save_top_k 3 \
74
+ --mode max \
75
+ --every_n_train_steps 0 \
76
+ --save_weights_only True \
77
+ --dirpath . \
78
+ --filename model-{epoch:02d}-{val_acc:.4f} \
79
+ "
80
+
81
+
82
+ TRAINER_ARGS="\
83
+ --max_epochs 67 \
84
+ --gpus 1 \
85
+ --num_nodes 1 \
86
+ --strategy deepspeed_stage_${ZERO_STAGE}_offload \
87
+ --gradient_clip_val 1.0 \
88
+ --check_val_every_n_epoch 1 \
89
+ --val_check_interval 1.0 \
90
+ --precision 16 \
91
+ --default_root_dir . \
92
+ "
93
+
94
+ options=" \
95
+ --pretrained_model_path $MODEL_NAME \
96
+ $DATA_ARGS \
97
+ $MODEL_ARGS \
98
+ $MODEL_CHECKPOINT_ARGS \
99
+ $TRAINER_ARGS \
100
+ "
101
+
102
+ python3 finetune_classification.py $options
103
+
fengshen/examples/classification/demo_classification_afqmc_roberta.sh ADDED
@@ -0,0 +1,62 @@
1
+ MODEL_NAME="IDEA-CCNL/Erlangshen-Roberta-110M-NLI"
2
+
3
+ TEXTA_NAME=sentence1
4
+ TEXTB_NAME=sentence2
5
+ LABEL_NAME=label
6
+ ID_NAME=id
7
+
8
+ BATCH_SIZE=1
9
+ VAL_BATCH_SIZE=1
10
+
11
+ DATA_ARGS="\
12
+ --dataset_name IDEA-CCNL/AFQMC \
13
+ --train_batchsize $BATCH_SIZE \
14
+ --valid_batchsize $VAL_BATCH_SIZE \
15
+ --max_length 128 \
16
+ --texta_name $TEXTA_NAME \
17
+ --textb_name $TEXTB_NAME \
18
+ --label_name $LABEL_NAME \
19
+ --id_name $ID_NAME \
20
+ "
21
+
22
+ MODEL_ARGS="\
23
+ --learning_rate 1e-5 \
24
+ --weight_decay 1e-2 \
25
+ --warmup_ratio 0.01 \
26
+ --num_labels 2 \
27
+ --model_type huggingface-auto \
28
+ "
29
+
30
+ MODEL_CHECKPOINT_ARGS="\
31
+ --monitor val_acc \
32
+ --save_top_k 3 \
33
+ --mode max \
34
+ --every_n_train_steps 0 \
35
+ --save_weights_only True \
36
+ --dirpath . \
37
+ --filename model-{epoch:02d}-{val_acc:.4f} \
38
+ "
39
+
40
+
41
+ TRAINER_ARGS="\
42
+ --max_epochs 67 \
43
+ --gpus 1 \
44
+ --num_nodes 1 \
45
+ --strategy ddp \
46
+ --gradient_clip_val 1.0 \
47
+ --check_val_every_n_epoch 1 \
48
+ --val_check_interval 1.0 \
49
+ --precision 16 \
50
+ --default_root_dir . \
51
+ "
52
+
53
+ options=" \
54
+ --pretrained_model_path $MODEL_NAME \
55
+ $DATA_ARGS \
56
+ $MODEL_ARGS \
57
+ $MODEL_CHECKPOINT_ARGS \
58
+ $TRAINER_ARGS \
59
+ "
60
+
61
+ python3 finetune_classification.py $options
62
+
fengshen/examples/classification/demo_classification_afqmc_roberta_deepspeed.sh ADDED
@@ -0,0 +1,90 @@
1
+ MODEL_NAME="IDEA-CCNL/Erlangshen-Roberta-110M-NLI"
2
+
3
+ TEXTA_NAME=sentence1
4
+ TEXTB_NAME=sentence2
5
+ LABEL_NAME=label
6
+ ID_NAME=id
7
+
8
+ BATCH_SIZE=32
9
+ VAL_BATCH_SIZE=32
10
+ ZERO_STAGE=1
11
+ config_json="./ds_config.json"
12
+
13
+ cat <<EOT > $config_json
14
+ {
15
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
16
+ "steps_per_print": 1000,
17
+ "gradient_clipping": 0.1,
18
+ "zero_optimization": {
19
+ "stage": ${ZERO_STAGE}
20
+ },
21
+ "zero_allow_untested_optimizer": false,
22
+ "fp16": {
23
+ "enabled": true,
24
+ "loss_scale": 0,
25
+ "loss_scale_window": 1000,
26
+ "hysteresis": 2,
27
+ "min_loss_scale": 1
28
+ },
29
+ "activation_checkpointing": {
30
+ "partition_activations": false,
31
+ "contiguous_memory_optimization": false
32
+ },
33
+ "wall_clock_breakdown": false
34
+ }
35
+ EOT
36
+
37
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
38
+
39
+ DATA_ARGS="\
40
+ --dataset_name IDEA-CCNL/AFQMC \
41
+ --train_batchsize $BATCH_SIZE \
42
+ --valid_batchsize $VAL_BATCH_SIZE \
43
+ --max_length 128 \
44
+ --texta_name $TEXTA_NAME \
45
+ --textb_name $TEXTB_NAME \
46
+ --label_name $LABEL_NAME \
47
+ --id_name $ID_NAME \
48
+ "
49
+
50
+ MODEL_ARGS="\
51
+ --learning_rate 1e-5 \
52
+ --weight_decay 1e-2 \
53
+ --warmup_ratio 0.01 \
54
+ --num_labels 2 \
55
+ --model_type huggingface-auto \
56
+ "
57
+
58
+ MODEL_CHECKPOINT_ARGS="\
59
+ --monitor val_acc \
60
+ --save_top_k 3 \
61
+ --mode max \
62
+ --every_n_train_steps 0 \
63
+ --save_weights_only True \
64
+ --dirpath . \
65
+ --filename model-{epoch:02d}-{val_acc:.4f} \
66
+ "
67
+
68
+
69
+ TRAINER_ARGS="\
70
+ --max_epochs 67 \
71
+ --gpus 1 \
72
+ --num_nodes 1 \
73
+ --strategy deepspeed_stage_${ZERO_STAGE} \
74
+ --gradient_clip_val 1.0 \
75
+ --check_val_every_n_epoch 1 \
76
+ --val_check_interval 1.0 \
77
+ --precision 16 \
78
+ --default_root_dir . \
79
+ "
80
+
81
+ options=" \
82
+ --pretrained_model_path $MODEL_NAME \
83
+ $DATA_ARGS \
84
+ $MODEL_ARGS \
85
+ $MODEL_CHECKPOINT_ARGS \
86
+ $TRAINER_ARGS \
87
+ "
88
+
89
+ python3 finetune_classification.py $options
90
+
fengshen/examples/classification/finetune_classification.py ADDED
@@ -0,0 +1,389 @@
1
+ # coding=utf-8
2
+ # Copyright 2021 The IDEA Authors. All rights reserved.
3
+
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ # from fengshen.models.zen1 import ZenModel
16
+ from dataclasses import dataclass
17
+ from fengshen.models.megatron_t5 import T5EncoderModel
18
+ from fengshen.models.roformer import RoFormerModel
19
+ from fengshen.models.longformer import LongformerModel
20
+ # from fengshen.models.cocolm.modeling_cocolm import COCOLMForSequenceClassification
21
+ import numpy as np
22
+ import os
23
+ from tqdm import tqdm
24
+ import json
25
+ import torch
26
+ import pytorch_lightning as pl
27
+ import argparse
28
+ from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
29
+ from torch.utils.data import Dataset, DataLoader
30
+ from torch.utils.data._utils.collate import default_collate
31
+ from transformers import (
32
+ BertModel,
33
+ BertConfig,
34
+ MegatronBertModel,
35
+ MegatronBertConfig,
36
+ AutoModel,
37
+ AutoConfig,
38
+ AutoTokenizer,
39
+ AutoModelForSequenceClassification,
40
+ )
41
+ # os.environ["CUDA_VISIBLE_DEVICES"] = '6'
42
+
43
+
44
+ model_dict = {'huggingface-bert': BertModel,
45
+ 'fengshen-roformer': RoFormerModel,
46
+ 'huggingface-megatron_bert': MegatronBertModel,
47
+ 'fengshen-megatron_t5': T5EncoderModel,
48
+ 'fengshen-longformer': LongformerModel,
49
+ # 'fengshen-zen1': ZenModel,
50
+ 'huggingface-auto': AutoModelForSequenceClassification,
51
+ }
52
+
53
+
54
+ class TaskDataset(Dataset):
55
+ def __init__(self, data_path, args, label2id):
56
+ super().__init__()
57
+ self.args = args
58
+ self.label2id = label2id
59
+ self.max_length = args.max_length
60
+ self.data = self.load_data(data_path, args)
61
+
62
+ def __len__(self):
63
+ return len(self.data)
64
+
65
+ def __getitem__(self, index):
66
+ return self.data[index]
67
+
68
+ def load_data(self, data_path, args):
69
+ with open(data_path, 'r', encoding='utf8') as f:
70
+ lines = f.readlines()
71
+ samples = []
72
+ for line in tqdm(lines):
73
+ data = json.loads(line)
74
+ text_id = int(data[args.id_name]
75
+ ) if args.id_name in data.keys() else 0
76
+ texta = data[args.texta_name] if args.texta_name in data.keys(
77
+ ) else ''
78
+ textb = data[args.textb_name] if args.textb_name in data.keys(
79
+ ) else ''
80
+ labels = self.label2id[data[args.label_name]
81
+ ] if args.label_name in data.keys() else 0
82
+ samples.append({args.texta_name: texta, args.textb_name: textb,
83
+ args.label_name: labels, 'id': text_id})
84
+ return samples
85
+
86
+
87
+ @dataclass
88
+ class TaskCollator:
89
+ args = None
90
+ tokenizer = None
91
+
92
+ def __call__(self, samples):
93
+ sample_list = []
94
+ for item in samples:
95
+ if item[self.args.texta_name] != '' and item[self.args.textb_name] != '':
96
+ if self.args.model_type != 'fengshen-roformer':
97
+ encode_dict = self.tokenizer.encode_plus(
98
+ [item[self.args.texta_name], item[self.args.textb_name]],
99
+ max_length=self.args.max_length,
100
+ padding='max_length',
101
+ truncation='longest_first')
102
+ else:
103
+ encode_dict = self.tokenizer.encode_plus(
104
+ [item[self.args.texta_name] +
105
+ self.tokenizer.eos_token+item[self.args.textb_name]],
106
+ max_length=self.args.max_length,
107
+ padding='max_length',
108
+ truncation='longest_first')
109
+ else:
110
+ encode_dict = self.tokenizer.encode_plus(
111
+ item[self.args.texta_name],
112
+ max_length=self.args.max_length,
113
+ padding='max_length',
114
+ truncation='longest_first')
115
+ sample = {}
116
+ for k, v in encode_dict.items():
117
+ sample[k] = torch.tensor(v)
118
+ sample['labels'] = torch.tensor(item[self.args.label_name]).long()
119
+ sample['id'] = item['id']
120
+ sample_list.append(sample)
121
+ return default_collate(sample_list)
122
+
123
+
124
+ class TaskDataModel(pl.LightningDataModule):
125
+ @staticmethod
126
+ def add_data_specific_args(parent_args):
127
+ parser = parent_args.add_argument_group('TASK NAME DataModel')
128
+ parser.add_argument('--data_dir', default='./data', type=str)
129
+ parser.add_argument('--num_workers', default=8, type=int)
130
+ parser.add_argument('--train_data', default='train.json', type=str)
131
+ parser.add_argument('--valid_data', default='dev.json', type=str)
132
+ parser.add_argument('--test_data', default='test.json', type=str)
133
+ parser.add_argument('--train_batchsize', default=16, type=int)
134
+ parser.add_argument('--valid_batchsize', default=32, type=int)
135
+ parser.add_argument('--max_length', default=128, type=int)
136
+
137
+ parser.add_argument('--texta_name', default='text', type=str)
138
+ parser.add_argument('--textb_name', default='sentence2', type=str)
139
+ parser.add_argument('--label_name', default='label', type=str)
140
+ parser.add_argument('--id_name', default='id', type=str)
141
+
142
+ parser.add_argument('--dataset_name', default=None, type=str)
143
+
144
+ return parent_args
145
+
146
+ def __init__(self, args):
147
+ super().__init__()
148
+ self.train_batchsize = args.train_batchsize
149
+ self.valid_batchsize = args.valid_batchsize
150
+ self.tokenizer = AutoTokenizer.from_pretrained(
151
+ args.pretrained_model_path)
152
+ self.collator = TaskCollator()
153
+ self.collator.args = args
154
+ self.collator.tokenizer = self.tokenizer
155
+ if args.dataset_name is None:
156
+ self.label2id, self.id2label = self.load_schema(os.path.join(
157
+ args.data_dir, args.train_data), args)
158
+ self.train_data = TaskDataset(os.path.join(
159
+ args.data_dir, args.train_data), args, self.label2id)
160
+ self.valid_data = TaskDataset(os.path.join(
161
+ args.data_dir, args.valid_data), args, self.label2id)
162
+ self.test_data = TaskDataset(os.path.join(
163
+ args.data_dir, args.test_data), args, self.label2id)
164
+ else:
165
+ import datasets
166
+ ds = datasets.load_dataset(args.dataset_name)
167
+ self.train_data = ds['train']
168
+ self.valid_data = ds['validation']
169
+ self.test_data = ds['test']
170
+ self.save_hyperparameters(args)
171
+
172
+ def train_dataloader(self):
173
+ return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, pin_memory=False,
174
+ collate_fn=self.collator)
175
+
176
+ def val_dataloader(self):
177
+ return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False,
178
+ collate_fn=self.collator)
179
+
180
+ def predict_dataloader(self):
181
+ return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, pin_memory=False,
182
+ collate_fn=self.collator)
183
+
184
+ def load_schema(self, data_path, args):
185
+ with open(data_path, 'r', encoding='utf8') as f:
186
+ lines = f.readlines()
187
+ label_list = []
188
+ for line in tqdm(lines):
189
+ data = json.loads(line)
190
+ labels = data[args.label_name] if args.label_name in data.keys(
191
+ ) else 0
192
+ if labels not in label_list:
193
+ label_list.append(labels)
194
+
195
+ label2id, id2label = {}, {}
196
+ for i, k in enumerate(label_list):
197
+ label2id[k] = i
198
+ id2label[i] = k
199
+ return label2id, id2label
200
+
201
+
202
+ class taskModel(torch.nn.Module):
203
+ def __init__(self, args):
204
+ super().__init__()
205
+ self.args = args
206
+ print('args model type:', args.model_type)
207
+ self.bert_encoder = model_dict[args.model_type].from_pretrained(
208
+ args.pretrained_model_path)
209
+ self.config = self.bert_encoder.config
210
+ self.cls_layer = torch.nn.Linear(
211
+ in_features=self.config.hidden_size, out_features=self.args.num_labels)
212
+ self.loss_func = torch.nn.CrossEntropyLoss()
213
+
214
+ def forward(self, input_ids, attention_mask, token_type_ids, labels=None):
215
+ if self.args.model_type == 'fengshen-megatron_t5':
216
+ bert_output = self.bert_encoder(
217
+ input_ids=input_ids, attention_mask=attention_mask) # (bsz, seq, dim)
218
+ encode = bert_output.last_hidden_state[:, 0, :]
219
+ else:
220
+ bert_output = self.bert_encoder(
221
+ input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids) # (bsz, seq, dim)
222
+ encode = bert_output[1]
223
+ logits = self.cls_layer(encode)
224
+ if labels is not None:
225
+ loss = self.loss_func(logits, labels.view(-1,))
226
+ return loss, logits
227
+ else:
228
+ return 0, logits
229
+
230
+
231
+ class LitModel(pl.LightningModule):
232
+
233
+ @staticmethod
234
+ def add_model_specific_args(parent_args):
235
+ parser = parent_args.add_argument_group('BaseModel')
236
+ parser.add_argument('--num_labels', default=2, type=int)
237
+
238
+ return parent_args
239
+
240
+ def __init__(self, args, num_data):
241
+ super().__init__()
242
+ self.args = args
243
+ self.num_data = num_data
244
+ self.model = model_dict[args.model_type].from_pretrained(
245
+ args.pretrained_model_path)
246
+ self.save_hyperparameters(args)
247
+
248
+ def setup(self, stage) -> None:
249
+ train_loader = self.trainer._data_connector._train_dataloader_source.dataloader()
250
+
251
+ # Calculate total steps
252
+ if self.trainer.max_epochs > 0:
253
+ world_size = self.trainer.world_size
254
+ tb_size = self.hparams.train_batchsize * max(1, world_size)
255
+ ab_size = self.trainer.accumulate_grad_batches
256
+ self.total_steps = (len(train_loader.dataset) *
257
+ self.trainer.max_epochs // tb_size) // ab_size
258
+ else:
259
+ self.total_steps = self.trainer.max_steps // self.trainer.accumulate_grad_batches
260
+
261
+ print('Total steps: {}'.format(self.total_steps))
262
+
263
+ def training_step(self, batch, batch_idx):
264
+ del batch['id']
265
+ output = self.model(**batch)
266
+ loss, logits = output[0], output[1]
267
+ acc = self.comput_metrix(logits, batch['labels'])
268
+ self.log('train_loss', loss)
269
+ self.log('train_acc', acc)
270
+ return loss
271
+
272
+ def comput_metrix(self, logits, labels):
273
+ y_pred = torch.argmax(logits, dim=-1)
274
+ y_pred = y_pred.view(size=(-1,))
275
+ y_true = labels.view(size=(-1,)).float()
276
+ corr = torch.eq(y_pred, y_true)
277
+ acc = torch.sum(corr.float())/labels.size()[0]
278
+ return acc
279
+
280
+ def validation_step(self, batch, batch_idx):
281
+ del batch['id']
282
+ output = self.model(**batch)
283
+ loss, logits = output[0], output[1]
284
+ acc = self.comput_metrix(logits, batch['labels'])
285
+ self.log('val_loss', loss)
286
+ self.log('val_acc', acc, sync_dist=True)
287
+
288
+ def predict_step(self, batch, batch_idx):
289
+ ids = batch['id']
290
+ del batch['id']
291
+ output = self.model(**batch)
292
+ return ids, output.logits
293
+
294
+ def configure_optimizers(self):
295
+ from fengshen.models.model_utils import configure_optimizers
296
+ return configure_optimizers(self)
297
+
298
+
299
+ class TaskModelCheckpoint:
300
+ @staticmethod
301
+ def add_argparse_args(parent_args):
302
+ parser = parent_args.add_argument_group('BaseModel')
303
+
304
+ parser.add_argument('--monitor', default='train_loss', type=str)
305
+ parser.add_argument('--mode', default='min', type=str)
306
+ parser.add_argument('--dirpath', default='./log/', type=str)
307
+ parser.add_argument(
308
+ '--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str)
309
+
310
+ parser.add_argument('--save_top_k', default=3, type=float)
311
+ parser.add_argument('--every_n_train_steps', default=100, type=float)
312
+ parser.add_argument('--save_weights_only', default=True, type=bool)
313
+
314
+ return parent_args
315
+
316
+ def __init__(self, args):
317
+ self.callbacks = ModelCheckpoint(monitor=args.monitor,
318
+ save_top_k=args.save_top_k,
319
+ mode=args.mode,
320
+ every_n_train_steps=args.every_n_train_steps,
321
+ save_weights_only=args.save_weights_only,
322
+ dirpath=args.dirpath,
323
+ every_n_epochs=1,
324
+ filename=args.filename)
325
+
326
+
327
+ def save_test(data, args, data_model, rank):
328
+ file_name = args.output_save_path + f'.{rank}'
329
+ with open(file_name, 'w', encoding='utf-8') as f:
330
+ idx = 0
331
+ for i in range(len(data)):
332
+ ids, batch = data[i]
333
+ for id, sample in zip(ids, batch):
334
+ tmp_result = dict()
335
+ label_id = np.argmax(sample.cpu().numpy())
336
+ tmp_result['id'] = id.item()
337
+ tmp_result['label'] = data_model.id2label[label_id]
338
+ json_data = json.dumps(tmp_result, ensure_ascii=False)
339
+ f.write(json_data+'\n')
340
+ idx += 1
341
+ print('save the result to '+file_name)
342
+
343
+
344
+ def main():
345
+ pl.seed_everything(42)
346
+
347
+ total_parser = argparse.ArgumentParser("TASK NAME")
348
+ total_parser.add_argument('--pretrained_model_path', default='', type=str)
349
+ total_parser.add_argument('--output_save_path',
350
+ default='./predict.json', type=str)
351
+ total_parser.add_argument('--model_type',
352
+ default='huggingface-bert', type=str)
353
+
354
+ # * Args for data preprocessing
355
+ total_parser = TaskDataModel.add_data_specific_args(total_parser)
356
+ # * Args for training
357
+ total_parser = pl.Trainer.add_argparse_args(total_parser)
358
+ total_parser = TaskModelCheckpoint.add_argparse_args(total_parser)
359
+
360
+ # * Args for base model
361
+ from fengshen.models.model_utils import add_module_args
362
+ total_parser = add_module_args(total_parser)
363
+ total_parser = LitModel.add_model_specific_args(total_parser)
364
+
365
+ args = total_parser.parse_args()
366
+ print(args.pretrained_model_path)
367
+
368
+ checkpoint_callback = TaskModelCheckpoint(args).callbacks
369
+ early_stop_callback = EarlyStopping(
370
+ monitor="val_acc", min_delta=0.00, patience=5, verbose=False, mode="max")
371
+ lr_monitor = LearningRateMonitor(logging_interval='step')
372
+ trainer = pl.Trainer.from_argparse_args(args,
373
+ callbacks=[
374
+ checkpoint_callback,
375
+ lr_monitor,
376
+ early_stop_callback]
377
+ )
378
+
379
+ data_model = TaskDataModel(args)
380
+ model = LitModel(args, len(data_model.train_dataloader()))
381
+
382
+ trainer.fit(model, data_model)
383
+ result = trainer.predict(
384
+ model, data_model, ckpt_path=trainer.checkpoint_callback.best_model_path)
385
+ save_test(result, args, data_model, trainer.global_rank)
386
+
387
+
388
+ if __name__ == "__main__":
389
+ main()
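When --dataset_name is not set, TaskDataset reads JSON-lines files from --data_dir. A minimal sketch of the expected record layout, using the field names the classification scripts pass via --texta_name/--textb_name/--label_name/--id_name (the sample values are made up for illustration):

import json

# Each line is one example; load_schema() later maps the distinct label
# values found in the training file to integer ids.
samples = [
    {"id": 0, "sentence1": "今天天气不错", "sentence2": "今天是个好天气", "label": "1"},
    {"id": 1, "sentence1": "我想申请退款", "sentence2": "怎么开通花呗", "label": "0"},
]
with open("train.json", "w", encoding="utf8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")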
fengshen/examples/classification/finetune_classification.sh ADDED
@@ -0,0 +1,75 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=1 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=2 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=16G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:1 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
9
+
10
+
11
+
12
+ MODEL_TYPE=fengshen-roformer
13
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Zhouwenwang-Unified-110M
14
+
15
+ ROOT_PATH=cognitive_comp
16
+ TASK=tnews
17
+
18
+ DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/
19
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/modelevaluation/tnews/
20
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/predict.json
21
+
22
+ DATA_ARGS="\
23
+ --data_dir $DATA_DIR \
24
+ --train_data train.json \
25
+ --valid_data dev.json \
26
+ --test_data test1.1.json \
27
+ --train_batchsize 32 \
28
+ --valid_batchsize 128 \
29
+ --max_length 128 \
30
+ --texta_name sentence \
31
+ --label_name label \
32
+ --id_name id \
33
+ "
34
+
35
+ MODEL_ARGS="\
36
+ --learning_rate 0.00002 \
37
+ --weight_decay 0.1 \
38
+ --num_labels 15 \
39
+ "
40
+
41
+ MODEL_CHECKPOINT_ARGS="\
42
+ --monitor val_acc \
43
+ --save_top_k 3 \
44
+ --mode max \
45
+ --every_n_train_steps 100 \
46
+ --save_weights_only True \
47
+ --dirpath $CHECKPOINT_PATH \
48
+ --filename model-{epoch:02d}-{val_acc:.4f} \
49
+ "
50
+
51
+ TRAINER_ARGS="\
52
+ --max_epochs 7 \
53
+ --gpus 1 \
54
+ --check_val_every_n_epoch 1 \
55
+ --val_check_interval 100 \
56
+ --default_root_dir ./log/ \
57
+ "
58
+
59
+
60
+ options=" \
61
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
62
+ --output_save_path $OUTPUT_PATH \
63
+ --model_type $MODEL_TYPE \
64
+ $DATA_ARGS \
65
+ $MODEL_ARGS \
66
+ $MODEL_CHECKPOINT_ARGS \
67
+ $TRAINER_ARGS \
68
+ "
69
+
70
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
71
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py
72
+
73
+ python3 $SCRIPT_PATH $options
74
+ # singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
75
+
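The unquoted $options expansion above relies on shell word splitting to break the multi-line string into separate argv tokens before argparse sees them. A rough Python-side equivalent, handy for exercising the parser without the shell wrapper (the flag subset here is illustrative only):

import shlex

options = """
--pretrained_model_path IDEA-CCNL/Zhouwenwang-Unified-110M
--model_type fengshen-roformer
--data_dir ./data --train_batchsize 32 --max_length 128
"""
argv = shlex.split(options)  # same token list the shell would pass to python3
# total_parser.parse_args(argv) in finetune_classification.py would then
# reproduce the script's argument parsing.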
fengshen/examples/classification/finetune_classification_bart-base_afqmc.sh ADDED
@@ -0,0 +1,143 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=afqmc-bart-base # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --gres=gpu:2 # number of gpus per node
7
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
8
+ #SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id)
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions
12
+
13
+ MODEL_NAME=bart-base
14
+
15
+ TASK=afqmc
16
+ TEXTA_NAME=sentence1
17
+ TEXTB_NAME=sentence2
18
+ LABEL_NAME=label
19
+ ID_NAME=id
20
+
21
+
22
+ BATCH_SIZE=8
23
+ VAL_BATCH_SIZE=32
24
+ ZERO_STAGE=1
25
+ STRATEGY=deepspeed_stage_${ZERO_STAGE}
26
+
27
+ DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/
28
+ PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$MODEL_NAME/
29
+
30
+
31
+ CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/ckpt/$TASK/
32
+ DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}
33
+ OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}/predict.json
34
+
35
+
36
+ config_json="./ds_config.${MODEL_NAME}.json"
37
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
38
+ # reduce_bucket_size: hidden_size*hidden_size
39
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
40
+ # stage3_param_persistence_threshold: 10 * hidden_size
41
+
42
+ cat <<EOT > $config_json
43
+ {
44
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
45
+ "steps_per_print": 100,
46
+ "gradient_clipping": 0.1,
47
+ "zero_optimization": {
48
+ "stage": ${ZERO_STAGE}
49
+ },
50
+ "optimizer": {
51
+ "type": "Adam",
52
+ "params": {
53
+ "lr": 1e-7,
54
+ "eps": 1e-12,
55
+ "weight_decay": 1e-2
56
+ }
57
+ },
58
+ "scheduler": {
59
+ "type": "WarmupLR",
60
+ "params":{
61
+ "warmup_min_lr": 1e-5,
62
+ "warmup_max_lr": 1e-4,
63
+ "warmup_num_steps": 400,
64
+ "warmup_type": "linear"
65
+ }
66
+ },
67
+ "zero_allow_untested_optimizer": false,
68
+ "fp16": {
69
+ "enabled": false,
70
+ "loss_scale": 0,
71
+ "loss_scale_window": 1000,
72
+ "hysteresis": 2,
73
+ "min_loss_scale": 1
74
+ },
75
+ "activation_checkpointing": {
76
+ "partition_activations": false,
77
+ "contiguous_memory_optimization": false
78
+ },
79
+ "wall_clock_breakdown": false
80
+ }
81
+ EOT
82
+
83
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
84
+
85
+
86
+ DATA_ARGS="\
87
+ --data_dir $DATA_DIR \
88
+ --train_data train.json \
89
+ --valid_data dev.json \
90
+ --test_data test.json \
91
+ --train_batchsize $BATCH_SIZE \
92
+ --valid_batchsize $VAL_BATCH_SIZE \
93
+ --max_length 64 \
94
+ --texta_name $TEXTA_NAME \
95
+ --textb_name $TEXTB_NAME \
96
+ --label_name $LABEL_NAME \
97
+ --id_name $ID_NAME \
98
+ "
99
+
100
+ MODEL_ARGS="\
101
+ --learning_rate 1e-6 \
102
+ --weight_decay 1e-2 \
103
+ --warmup 0.01 \
104
+ --num_labels 2 \
105
+ "
106
+
107
+ MODEL_CHECKPOINT_ARGS="\
108
+ --monitor val_acc \
109
+ --save_top_k 3 \
110
+ --mode max \
111
+ --every_n_train_steps 200 \
112
+ --save_weights_only True \
113
+ --dirpath $CHECKPOINT_PATH \
114
+ --filename model-{epoch:02d}-{val_acc:.4f} \
115
+ "
116
+
117
+
118
+ TRAINER_ARGS="\
119
+ --max_epochs 67 \
120
+ --gpus 2 \
121
+ --num_nodes 1 \
122
+ --strategy $STRATEGY \
123
+ --gradient_clip_val 1.0 \
124
+ --check_val_every_n_epoch 1 \
125
+ --val_check_interval 1.0 \
126
+ --default_root_dir $DEFAULT_ROOT_DIR \
127
+ "
128
+
129
+ options=" \
130
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
131
+ --output_save_path $OUTPUT_PATH \
132
+ $DATA_ARGS \
133
+ $MODEL_ARGS \
134
+ $MODEL_CHECKPOINT_ARGS \
135
+ $TRAINER_ARGS \
136
+ "
137
+
138
+ DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif
139
+ SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py
140
+
141
+ # python3 $SCRIPT_PATH $options
142
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
143
+
fengshen/examples/classification/finetune_classification_bart-base_ocnli.sh ADDED
@@ -0,0 +1,143 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=ocnli-bart-base # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=30 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --gres=gpu:2 # number of gpus per node
7
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
8
+ #SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id)
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions
12
+
13
+ MODEL_NAME=bart-base
14
+
15
+ TASK=ocnli
16
+ TEXTA_NAME=sentence1
17
+ TEXTB_NAME=sentence2
18
+ LABEL_NAME=label
19
+ ID_NAME=id
20
+
21
+
22
+ BATCH_SIZE=8
23
+ VAL_BATCH_SIZE=32
24
+ ZERO_STAGE=1
25
+ STRATEGY=deepspeed_stage_${ZERO_STAGE}
26
+
27
+ DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/
28
+ PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$MODEL_NAME/
29
+
30
+
31
+ CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/ckpt/$TASK/
32
+ DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}
33
+ OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${MODEL_NAME}-${TASK}/predict.json
34
+
35
+
36
+ config_json="./ds_config.${MODEL_NAME}.json"
37
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
38
+ # reduce_bucket_size: hidden_size*hidden_size
39
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
40
+ # stage3_param_persistence_threshold: 10 * hidden_size
41
+
42
+ cat <<EOT > $config_json
43
+ {
44
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
45
+ "steps_per_print": 100,
46
+ "gradient_clipping": 0.1,
47
+ "zero_optimization": {
48
+ "stage": ${ZERO_STAGE}
49
+ },
50
+ "optimizer": {
51
+ "type": "Adam",
52
+ "params": {
53
+ "lr": 1e-7,
54
+ "eps": 1e-12,
55
+ "weight_decay": 1e-2
56
+ }
57
+ },
58
+ "scheduler": {
59
+ "type": "WarmupLR",
60
+ "params":{
61
+ "warmup_min_lr": 1e-8,
62
+ "warmup_max_lr": 1e-6,
63
+ "warmup_num_steps": 400,
64
+ "warmup_type": "linear"
65
+ }
66
+ },
67
+ "zero_allow_untested_optimizer": false,
68
+ "fp16": {
69
+ "enabled": false,
70
+ "loss_scale": 0,
71
+ "loss_scale_window": 1000,
72
+ "hysteresis": 2,
73
+ "min_loss_scale": 1
74
+ },
75
+ "activation_checkpointing": {
76
+ "partition_activations": false,
77
+ "contiguous_memory_optimization": false
78
+ },
79
+ "wall_clock_breakdown": false
80
+ }
81
+ EOT
82
+
83
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
84
+
85
+
86
+ DATA_ARGS="\
87
+ --data_dir $DATA_DIR \
88
+ --train_data train.json \
89
+ --valid_data dev.json \
90
+ --test_data test.json \
91
+ --train_batchsize $BATCH_SIZE \
92
+ --valid_batchsize $VAL_BATCH_SIZE \
93
+ --max_length 128 \
94
+ --texta_name $TEXTA_NAME \
95
+ --textb_name $TEXTB_NAME \
96
+ --label_name $LABEL_NAME \
97
+ --id_name $ID_NAME \
98
+ "
99
+
100
+ MODEL_ARGS="\
101
+ --learning_rate 1e-6 \
102
+ --weight_decay 1e-2 \
103
+ --warmup 0.01 \
104
+ --num_labels 3 \
105
+ "
106
+
107
+ MODEL_CHECKPOINT_ARGS="\
108
+ --monitor val_acc \
109
+ --save_top_k 3 \
110
+ --mode max \
111
+ --every_n_train_steps 200 \
112
+ --save_weights_only True \
113
+ --dirpath $CHECKPOINT_PATH \
114
+ --filename model-{epoch:02d}-{val_acc:.4f} \
115
+ "
116
+
117
+
118
+ TRAINER_ARGS="\
119
+ --max_epochs 67 \
120
+ --gpus 2 \
121
+ --num_nodes 1 \
122
+ --strategy $STRATEGY \
123
+ --gradient_clip_val 1.0 \
124
+ --check_val_every_n_epoch 1 \
125
+ --val_check_interval 1.0 \
126
+ --default_root_dir $DEFAULT_ROOT_DIR \
127
+ "
128
+
129
+ options=" \
130
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
131
+ --output_save_path $OUTPUT_PATH \
132
+ $DATA_ARGS \
133
+ $MODEL_ARGS \
134
+ $MODEL_CHECKPOINT_ARGS \
135
+ $TRAINER_ARGS \
136
+ "
137
+
138
+ DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif
139
+ SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py
140
+
141
+ # python3 $SCRIPT_PATH $options
142
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
143
+
fengshen/examples/classification/finetune_classification_bert-3.9B_afqmc.sh ADDED
@@ -0,0 +1,146 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=afqmc # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=4 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=20 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --gres=gpu:4 # number of gpus per node
7
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
8
+ #SBATCH -o %x-%j.log # output and error file name (%x=job name, %j=job id)
9
+
10
+ set -x -e
11
+ echo "START TIME: $(date)"
12
+
13
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/gaoxinyu/cache/torch_extendsions
14
+
15
+ BERT_NAME=bert-3.9B
16
+
17
+ TASK=afqmc
18
+ TEXTA_NAME=sentence1
19
+ TEXTB_NAME=sentence2
20
+ LABEL_NAME=label
21
+ ID_NAME=id
22
+
23
+
24
+ BATCH_SIZE=8
25
+ VAL_BATCH_SIZE=32
26
+ ZERO_STAGE=2
27
+ STRATEGY=deepspeed_stage_${ZERO_STAGE}
28
+
29
+ DATA_DIR=/cognitive_comp/yangping/data/ChineseCLUE_DATA/${TASK}_public/
30
+ PRETRAINED_MODEL_PATH=/cognitive_comp/gaoxinyu/pretrained_model/$BERT_NAME/
31
+
32
+
33
+ CHECKPOINT_PATH=/cognitive_comp/gaoxinyu/ln_model/fintune/ckpt/fengshen-finetune/$TASK/
34
+ DEFAULT_ROOT_DIR=/cognitive_comp/gaoxinyu/ln_model/finetune/${BERT_NAME}-${TASK}
35
+ OUTPUT_PATH=/cognitive_comp/gaoxinyu/ln_model/finetune/${BERT_NAME}-${TASK}/predict.json
36
+
37
+
38
+ config_json="./ds_config.json"
39
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
40
+ # reduce_bucket_size: hidden_size*hidden_size
41
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
42
+ # stage3_param_persistence_threshold: 10 * hidden_size
43
+
44
+ cat <<EOT > $config_json
45
+ {
46
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
47
+ "steps_per_print": 1000,
48
+ "gradient_clipping": 0.1,
49
+ "zero_optimization": {
50
+ "stage": 2
51
+ },
52
+ "optimizer": {
53
+ "type": "Adam",
54
+ "params": {
55
+ "lr": 1e-7,
56
+ "eps": 1e-12,
57
+ "weight_decay": 1e-1
58
+ }
59
+ },
60
+ "scheduler": {
61
+ "type": "WarmupLR",
62
+ "params":{
63
+ "warmup_min_lr": 1e-8,
64
+ "warmup_max_lr": 1e-6,
65
+ "warmup_num_steps": 400,
66
+ "warmup_type": "linear"
67
+ }
68
+ },
69
+ "zero_allow_untested_optimizer": false,
70
+ "fp16": {
71
+ "enabled": true,
72
+ "loss_scale": 0,
73
+ "loss_scale_window": 1000,
74
+ "hysteresis": 2,
75
+ "min_loss_scale": 1
76
+ },
77
+ "activation_checkpointing": {
78
+ "partition_activations": false,
79
+ "contiguous_memory_optimization": false
80
+ },
81
+ "wall_clock_breakdown": false
82
+ }
83
+ EOT
84
+
85
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
86
+
87
+
88
+ DATA_ARGS="\
89
+ --data_dir $DATA_DIR \
90
+ --train_data train.json \
91
+ --valid_data dev.json \
92
+ --test_data test.json \
93
+ --train_batchsize $BATCH_SIZE \
94
+ --valid_batchsize $VAL_BATCH_SIZE \
95
+ --max_length 128 \
96
+ --texta_name $TEXTA_NAME \
97
+ --textb_name $TEXTB_NAME \
98
+ --label_name $LABEL_NAME \
99
+ --id_name $ID_NAME \
100
+ "
101
+
102
+ MODEL_ARGS="\
103
+ --learning_rate 1e-5 \
104
+ --weight_decay 1e-2 \
105
+ --warmup 0.01 \
106
+ --num_labels 2 \
107
+ "
108
+
109
+ MODEL_CHECKPOINT_ARGS="\
110
+ --monitor val_acc \
111
+ --save_top_k 3 \
112
+ --mode max \
113
+ --every_n_train_steps 0 \
114
+ --save_weights_only True \
115
+ --dirpath $CHECKPOINT_PATH \
116
+ --filename model-{epoch:02d}-{val_acc:.4f} \
117
+ "
118
+
119
+
120
+ TRAINER_ARGS="\
121
+ --max_epochs 67 \
122
+ --gpus 4 \
123
+ --num_nodes 1 \
124
+ --strategy $STRATEGY \
125
+ --gradient_clip_val 1.0 \
126
+ --check_val_every_n_epoch 1 \
127
+ --val_check_interval 100 \
128
+ --precision 16 \
129
+ --default_root_dir $DEFAULT_ROOT_DIR \
130
+ "
131
+
132
+ options=" \
133
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
134
+ --output_save_path $OUTPUT_PATH \
135
+ $DATA_ARGS \
136
+ $MODEL_ARGS \
137
+ $MODEL_CHECKPOINT_ARGS \
138
+ $TRAINER_ARGS \
139
+ "
140
+
141
+ DOCKER_PATH=/cognitive_comp/gaoxinyu/docker/pytorch21_06_py3_docker_image_v2.sif
142
+ SCRIPT_PATH=/cognitive_comp/gaoxinyu/github/Fengshenbang-LM/fengshen/examples/classification/finetune_classification.py
143
+
144
+ # python3 $SCRIPT_PATH $options
145
+ srun -N 1 --job-name=afqmc --jobid=151522 --ntasks=4 --cpus-per-task=15 --gres=gpu:4 -o %x-%j.log singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
146
+
fengshen/examples/classification/finetune_classification_bert-3.9B_cmnli.sh ADDED
@@ -0,0 +1,161 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:2 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions
12
+
13
+ BERT_NAME=bert-3.9B
14
+
15
+ TASK=cmnli
16
+ TEXTA_NAME=sentence1
17
+ TEXTB_NAME=sentence2
18
+ LABEL_NAME=label
19
+ ID_NAME=id
20
+
21
+
22
+ BATCH_SIZE=16
23
+ VAL_BATCH_SIZE=56
24
+ ZERO_STAGE=2
25
+
26
+
27
+ ROOT_PATH=cognitive_comp
28
+ DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/
29
+ PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/
30
+
31
+
32
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/
33
+ DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME/
34
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json
35
+
36
+
37
+ config_json="./ds_config.json"
38
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
39
+ # reduce_bucket_size: hidden_size*hidden_size
40
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
41
+ # stage3_param_persistence_threshold: 10 * hidden_size
42
+
43
+ cat <<EOT > $config_json
44
+ {
45
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
46
+ "steps_per_print": 100,
47
+ "gradient_clipping": 1.0,
48
+ "zero_optimization": {
49
+ "stage": 3,
50
+ "offload_optimizer": {
51
+ "device": "cpu",
52
+ "pin_memory": true
53
+ },
54
+ "offload_param": {
55
+ "device": "cpu",
56
+ "pin_memory": true
57
+ },
58
+ "overlap_comm": true,
59
+ "contiguous_gradients": true,
60
+ "sub_group_size": 1e9,
61
+ "reduce_bucket_size": 6553600,
62
+ "stage3_prefetch_bucket_size": 5898240,
63
+ "stage3_param_persistence_threshold": 25600,
64
+ "stage3_max_live_parameters": 1e9,
65
+ "stage3_max_reuse_distance": 1e9,
66
+ "stage3_gather_fp16_weights_on_model_save": true
67
+ },
68
+ "optimizer": {
69
+ "type": "Adam",
70
+ "params": {
71
+ "lr": 1e-6,
72
+ "betas": [
73
+ 0.9,
74
+ 0.95
75
+ ],
76
+ "eps": 1e-8,
77
+ "weight_decay": 1e-3
78
+ }
79
+ },
80
+ "scheduler": {
81
+ "type": "WarmupLR",
82
+ "params":{
83
+ "warmup_min_lr": 5e-8,
84
+ "warmup_max_lr": 1e-6
85
+ }
86
+ },
87
+ "zero_allow_untested_optimizer": false,
88
+ "fp16": {
89
+ "enabled": true,
90
+ "loss_scale": 0,
91
+ "loss_scale_window": 1000,
92
+ "hysteresis": 2,
93
+ "min_loss_scale": 1
94
+ },
95
+ "activation_checkpointing": {
96
+ "partition_activations": false,
97
+ "contiguous_memory_optimization": false
98
+ },
99
+ "wall_clock_breakdown": false
100
+ }
101
+ EOT
102
+
103
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
104
+
105
+
106
+ DATA_ARGS="\
107
+ --data_dir $DATA_DIR \
108
+ --train_data train.json \
109
+ --valid_data dev.json \
110
+ --test_data test.json \
111
+ --train_batchsize $BATCH_SIZE \
112
+ --valid_batchsize $VAL_BATCH_SIZE \
113
+ --max_length 128 \
114
+ --texta_name $TEXTA_NAME \
115
+ --textb_name $TEXTB_NAME \
116
+ --label_name $LABEL_NAME \
117
+ --id_name $ID_NAME \
118
+ "
119
+
120
+ MODEL_ARGS="\
121
+ --learning_rate 0.000001 \
122
+ --weight_decay 0.001 \
123
+ --warmup 0.001 \
124
+ --num_labels 3 \
125
+ "
126
+
127
+ MODEL_CHECKPOINT_ARGS="\
128
+ --monitor val_acc \
129
+ --save_top_k 3 \
130
+ --mode max \
131
+ --every_n_train_steps 100 \
132
+ --save_weights_only True \
133
+ --dirpath $CHECKPOINT_PATH \
134
+ --filename model-{epoch:02d}-{val_acc:.4f} \
135
+ "
136
+ TRAINER_ARGS="\
137
+ --max_epochs 7 \
138
+ --gpus 2 \
139
+ --strategy deepspeed_stage_3 \
140
+ --precision 16 \
141
+ --gradient_clip_val 0.1 \
142
+ --check_val_every_n_epoch 1 \
143
+ --val_check_interval 100 \
144
+ --default_root_dir $DEFAULT_ROOT_DIR \
145
+ "
146
+
147
+ options=" \
148
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
149
+ --output_save_path $OUTPUT_PATH \
150
+ $DATA_ARGS \
151
+ $MODEL_ARGS \
152
+ $MODEL_CHECKPOINT_ARGS \
153
+ $TRAINER_ARGS \
154
+ "
155
+
156
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
157
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py
158
+
159
+ # python3 $SCRIPT_PATH $options
160
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
161
+
fengshen/examples/classification/finetune_classification_bert-3.9B_iflytek.sh ADDED
@@ -0,0 +1,158 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:2 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions
12
+
13
+ BERT_NAME=bert-3.9B
14
+
15
+ TASK=iflytek
16
+ TEXTA_NAME=sentence
17
+ LABEL_NAME=label
18
+ ID_NAME=id
19
+
20
+
21
+ BATCH_SIZE=16
22
+ VAL_BATCH_SIZE=56
23
+ ZERO_STAGE=2
24
+
25
+
26
+ ROOT_PATH=cognitive_comp
27
+ DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/
28
+ PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/
29
+
30
+
31
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/
32
+ DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/Fengshenbang-LM/fengshen/scripts/log/$TASK
33
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json
34
+
35
+
36
+ config_json="./ds_config.$SLURM_JOBID.json"
37
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
38
+ # reduce_bucket_size: hidden_size*hidden_size
39
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
40
+ # stage3_param_persistence_threshold: 10 * hidden_size
41
+
42
+ cat <<EOT > $config_json
43
+ {
44
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
45
+ "steps_per_print": 100,
46
+ "gradient_clipping": 1.0,
47
+ "zero_optimization": {
48
+ "stage": 3,
49
+ "offload_optimizer": {
50
+ "device": "cpu",
51
+ "pin_memory": true
52
+ },
53
+ "offload_param": {
54
+ "device": "cpu",
55
+ "pin_memory": true
56
+ },
57
+ "overlap_comm": true,
58
+ "contiguous_gradients": true,
59
+ "sub_group_size": 1e9,
60
+ "reduce_bucket_size": 6553600,
61
+ "stage3_prefetch_bucket_size": 5898240,
62
+ "stage3_param_persistence_threshold": 25600,
63
+ "stage3_max_live_parameters": 1e9,
64
+ "stage3_max_reuse_distance": 1e9,
65
+ "stage3_gather_fp16_weights_on_model_save": true
66
+ },
67
+ "optimizer": {
68
+ "type": "Adam",
69
+ "params": {
70
+ "lr": 1e-5,
71
+ "betas": [
72
+ 0.9,
73
+ 0.95
74
+ ],
75
+ "eps": 1e-8,
76
+ "weight_decay": 1e-2
77
+ }
78
+ },
79
+ "scheduler": {
80
+ "type": "WarmupLR",
81
+ "params":{
82
+ "warmup_min_lr": 5e-6,
83
+ "warmup_max_lr": 1e-5
84
+ }
85
+ },
86
+ "zero_allow_untested_optimizer": false,
87
+ "fp16": {
88
+ "enabled": true,
89
+ "loss_scale": 0,
90
+ "loss_scale_window": 1000,
91
+ "hysteresis": 2,
92
+ "min_loss_scale": 1
93
+ },
94
+ "activation_checkpointing": {
95
+ "partition_activations": false,
96
+ "contiguous_memory_optimization": false
97
+ },
98
+ "wall_clock_breakdown": false
99
+ }
100
+ EOT
101
+
102
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
103
+
104
+
105
+ DATA_ARGS="\
106
+ --data_dir $DATA_DIR \
107
+ --train_data train.json \
108
+ --valid_data dev.json \
109
+ --test_data test.json \
110
+ --train_batchsize $BATCH_SIZE \
111
+ --valid_batchsize $VAL_BATCH_SIZE \
112
+ --max_length 128 \
113
+ --texta_name $TEXTA_NAME \
114
+ --label_name $LABEL_NAME \
115
+ --id_name $ID_NAME \
116
+ "
117
+
118
+ MODEL_ARGS="\
119
+ --learning_rate 0.00001 \
120
+ --weight_decay 0.01 \
121
+ --warmup 0.001 \
122
+ --num_labels 119 \
123
+ "
124
+
125
+ MODEL_CHECKPOINT_ARGS="\
126
+ --monitor val_acc \
127
+ --save_top_k 3 \
128
+ --mode max \
129
+ --every_n_train_steps 100 \
130
+ --save_weights_only True \
131
+ --dirpath $CHECKPOINT_PATH \
132
+ --filename model-{epoch:02d}-{val_acc:.4f} \
133
+ "
134
+ TRAINER_ARGS="\
135
+ --max_epochs 7 \
136
+ --gpus 2 \
137
+ --strategy deepspeed_stage_3 \
138
+ --precision 16 \
139
+ --check_val_every_n_epoch 1 \
140
+ --val_check_interval 100 \
141
+ --default_root_dir $DEFAULT_ROOT_DIR \
142
+ "
143
+
144
+ options=" \
145
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
146
+ --output_save_path $OUTPUT_PATH \
147
+ $DATA_ARGS \
148
+ $MODEL_ARGS \
149
+ $MODEL_CHECKPOINT_ARGS \
150
+ $TRAINER_ARGS \
151
+ "
152
+
153
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
154
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py
155
+
156
+ # python3 $SCRIPT_PATH $options
157
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
158
+
fengshen/examples/classification/finetune_classification_bert-3.9B_ocnli.sh ADDED
@@ -0,0 +1,163 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:2 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions
12
+
13
+ BERT_NAME=bert-1.3B
14
+
15
+ TASK=ocnli
16
+ TEXTA_NAME=sentence1
17
+ TEXTB_NAME=sentence2
18
+ LABEL_NAME=label
19
+ ID_NAME=id
20
+
21
+
22
+ BATCH_SIZE=16
23
+ VAL_BATCH_SIZE=56
24
+ ZERO_STAGE=2
25
+
26
+
27
+ ROOT_PATH=cognitive_comp
28
+ DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/
29
+ PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/
30
+
31
+
32
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/
33
+ DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME
34
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json
35
+
36
+
37
+ config_json="./ds_config.$SLURM_JOBID.json"
38
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
39
+ # reduce_bucket_size: hidden_size*hidden_size
40
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
41
+ # stage3_param_persistence_threshold: 10 * hidden_size
42
+
43
+ cat <<EOT > $config_json
44
+ {
45
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
46
+ "steps_per_print": 100,
47
+ "gradient_clipping": 0.1,
48
+ "zero_optimization": {
49
+ "stage": 3,
50
+ "offload_optimizer": {
51
+ "device": "cpu",
52
+ "pin_memory": true
53
+ },
54
+ "offload_param": {
55
+ "device": "cpu",
56
+ "pin_memory": true
57
+ },
58
+ "overlap_comm": true,
59
+ "contiguous_gradients": true,
60
+ "sub_group_size": 1e9,
61
+ "reduce_bucket_size": 6553600,
62
+ "stage3_prefetch_bucket_size": 5898240,
63
+ "stage3_param_persistence_threshold": 25600,
64
+ "stage3_max_live_parameters": 1e9,
65
+ "stage3_max_reuse_distance": 1e9,
66
+ "stage3_gather_fp16_weights_on_model_save": true
67
+ },
68
+ "optimizer": {
69
+ "type": "Adam",
70
+ "params": {
71
+ "lr": 1e-6,
72
+ "betas": [
73
+ 0.9,
74
+ 0.95
75
+ ],
76
+ "eps": 1e-8,
77
+ "weight_decay": 1e-6
78
+ }
79
+ },
80
+ "scheduler": {
81
+ "type": "WarmupLR",
82
+ "params":{
83
+ "warmup_min_lr": 5e-8,
84
+ "warmup_max_lr": 1e-6,
85
+ "warmup_num_steps": 400,
86
+ "warmup_type": "linear"
87
+ }
88
+ },
89
+ "zero_allow_untested_optimizer": false,
90
+ "fp16": {
91
+ "enabled": true,
92
+ "loss_scale": 0,
93
+ "loss_scale_window": 1000,
94
+ "hysteresis": 2,
95
+ "min_loss_scale": 1
96
+ },
97
+ "activation_checkpointing": {
98
+ "partition_activations": false,
99
+ "contiguous_memory_optimization": false
100
+ },
101
+ "wall_clock_breakdown": false
102
+ }
103
+ EOT
104
+
105
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
106
+
107
+
108
+ DATA_ARGS="\
109
+ --data_dir $DATA_DIR \
110
+ --train_data train.json \
111
+ --valid_data dev.json \
112
+ --test_data test.json \
113
+ --train_batchsize $BATCH_SIZE \
114
+ --valid_batchsize $VAL_BATCH_SIZE \
115
+ --max_length 128 \
116
+ --texta_name $TEXTA_NAME \
117
+ --textb_name $TEXTB_NAME \
118
+ --label_name $LABEL_NAME \
119
+ --id_name $ID_NAME \
120
+ "
121
+
122
+ MODEL_ARGS="\
123
+ --learning_rate 0.000001 \
124
+ --weight_decay 0.001 \
125
+ --warmup 0.001 \
126
+ --num_labels 3 \
127
+ "
128
+
129
+ MODEL_CHECKPOINT_ARGS="\
130
+ --monitor val_acc \
131
+ --save_top_k 3 \
132
+ --mode max \
133
+ --every_n_train_steps 100 \
134
+ --save_weights_only True \
135
+ --dirpath $CHECKPOINT_PATH \
136
+ --filename model-{epoch:02d}-{val_acc:.4f} \
137
+ "
138
+ TRAINER_ARGS="\
139
+ --max_epochs 7 \
140
+ --gpus 2 \
141
+ --strategy deepspeed_stage_3 \
142
+ --precision 16 \
143
+ --gradient_clip_val 0.1 \
144
+ --check_val_every_n_epoch 1 \
145
+ --val_check_interval 100 \
146
+ --default_root_dir $DEFAULT_ROOT_DIR \
147
+ "
148
+
149
+ options=" \
150
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
151
+ --output_save_path $OUTPUT_PATH \
152
+ $DATA_ARGS \
153
+ $MODEL_ARGS \
154
+ $MODEL_CHECKPOINT_ARGS \
155
+ $TRAINER_ARGS \
156
+ "
157
+
158
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
159
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py
160
+
161
+ # python3 $SCRIPT_PATH $options
162
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
163
+
fengshen/examples/classification/finetune_classification_bert-3.9B_tnews.sh ADDED
@@ -0,0 +1,161 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=4 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:4 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when the job begins, ends, or fails
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions
12
+
13
+ BERT_NAME=bert-3.9B
14
+
15
+ TASK=tnews
16
+ TEXTA_NAME=sentence
17
+ LABEL_NAME=label
18
+ ID_NAME=id
19
+
20
+
21
+ BATCH_SIZE=16
22
+ VAL_BATCH_SIZE=56
23
+ ZERO_STAGE=2
24
+
25
+
26
+ ROOT_PATH=cognitive_comp
27
+ DATA_DIR=/$ROOT_PATH/yangping/data/ChineseCLUE_DATA/${TASK}_public/
28
+ PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/
29
+
30
+
31
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/
32
+ DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/fengshen/fengshen/scripts/log/$TASK/$BERT_NAME/nograd
33
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json
34
+
35
+
36
+ config_json="./ds_config.$SLURM_JOBID.json"
37
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
38
+ # reduce_bucket_size: hidden_size*hidden_size
39
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
40
+ # stage3_param_persistence_threshold: 10 * hidden_size
41
+
42
+ cat <<EOT > $config_json
43
+ {
44
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
45
+ "steps_per_print": 100,
46
+ "gradient_clipping": 1.0,
47
+ "zero_optimization": {
48
+ "stage": 3,
49
+ "offload_optimizer": {
50
+ "device": "cpu",
51
+ "pin_memory": true
52
+ },
53
+ "offload_param": {
54
+ "device": "cpu",
55
+ "pin_memory": true
56
+ },
57
+ "overlap_comm": true,
58
+ "contiguous_gradients": true,
59
+ "sub_group_size": 1e9,
60
+ "reduce_bucket_size": 6553600,
61
+ "stage3_prefetch_bucket_size": 5898240,
62
+ "stage3_param_persistence_threshold": 25600,
63
+ "stage3_max_live_parameters": 1e9,
64
+ "stage3_max_reuse_distance": 1e9,
65
+ "stage3_gather_fp16_weights_on_model_save": true
66
+ },
67
+ "optimizer": {
68
+ "type": "Adam",
69
+ "params": {
70
+ "lr": 1e-5,
71
+ "betas": [
72
+ 0.9,
73
+ 0.95
74
+ ],
75
+ "eps": 1e-8,
76
+ "weight_decay": 1e-2
77
+ }
78
+ },
79
+ "scheduler": {
80
+ "type": "WarmupLR",
81
+ "params":{
82
+ "warmup_min_lr": 5e-8,
83
+ "warmup_max_lr": 1e-5,
84
+ "warmup_num_steps": 400,
85
+ "warmup_type": "linear"
86
+ }
87
+ },
88
+ "zero_allow_untested_optimizer": false,
89
+ "fp16": {
90
+ "enabled": true,
91
+ "loss_scale": 0,
92
+ "loss_scale_window": 1000,
93
+ "hysteresis": 2,
94
+ "min_loss_scale": 1
95
+ },
96
+ "activation_checkpointing": {
97
+ "partition_activations": false,
98
+ "contiguous_memory_optimization": false
99
+ },
100
+ "wall_clock_breakdown": false
101
+ }
102
+ EOT
103
+
104
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
105
+
106
+
107
+ DATA_ARGS="\
108
+ --data_dir $DATA_DIR \
109
+ --train_data train.json \
110
+ --valid_data dev.json \
111
+ --test_data test.json \
112
+ --train_batchsize $BATCH_SIZE \
113
+ --valid_batchsize $VAL_BATCH_SIZE \
114
+ --max_length 128 \
115
+ --texta_name $TEXTA_NAME \
116
+ --label_name $LABEL_NAME \
117
+ --id_name $ID_NAME \
118
+ "
119
+
120
+ MODEL_ARGS="\
121
+ --learning_rate 0.00001 \
122
+ --weight_decay 0.01 \
123
+ --warmup 0.001 \
124
+ --num_labels 15 \
125
+ "
126
+
127
+ MODEL_CHECKPOINT_ARGS="\
128
+ --monitor val_acc \
129
+ --save_top_k 3 \
130
+ --mode max \
131
+ --every_n_train_steps 200 \
132
+ --save_weights_only True \
133
+ --dirpath $CHECKPOINT_PATH \
134
+ --filename model-{epoch:02d}-{val_acc:.4f} \
135
+ "
136
+ TRAINER_ARGS="\
137
+ --max_epochs 7 \
138
+ --gpus 4 \
139
+ --strategy deepspeed_stage_3 \
140
+ --precision 16 \
141
+ --gradient_clip_val 0.1 \
142
+ --check_val_every_n_epoch 1 \
143
+ --val_check_interval 100 \
144
+ --default_root_dir $DEFAULT_ROOT_DIR \
145
+ "
146
+
147
+ options=" \
148
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
149
+ --output_save_path $OUTPUT_PATH \
150
+ $DATA_ARGS \
151
+ $MODEL_ARGS \
152
+ $MODEL_CHECKPOINT_ARGS \
153
+ $TRAINER_ARGS \
154
+ "
155
+
156
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
157
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py
158
+
159
+ # python3 $SCRIPT_PATH $options
160
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
161
+
fengshen/examples/classification/finetune_classification_bert-3.9B_wsc.sh ADDED
@@ -0,0 +1,158 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=slurm-test # create a short name for your job
3
+ #SBATCH --nodes=1 # node count
4
+ #SBATCH --ntasks=2 # total number of tasks across all nodes
5
+ #SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
6
+ #SBATCH --mem-per-cpu=8G # memory per cpu-core (4G is default)
7
+ #SBATCH --gres=gpu:2 # number of gpus per node
8
+ #SBATCH --mail-type=ALL # send email when job begins, ends or failed etc.
9
+
10
+
11
+ export TORCH_EXTENSIONS_DIR=/cognitive_comp/yangping/cache/torch_extendsions
12
+
13
+ BERT_NAME=bert-3.9B
14
+
15
+ TASK=wsc
16
+ TEXTA_NAME=texta
17
+ LABEL_NAME=label
18
+ ID_NAME=id
19
+
20
+
21
+ BATCH_SIZE=16
22
+ VAL_BATCH_SIZE=56
23
+ ZERO_STAGE=2
24
+
25
+
26
+ ROOT_PATH=cognitive_comp
27
+ DATA_DIR=/cognitive_comp/yangping/data/unidata/multichoice/mrc_multichoice_data/other/cluewsc2020/
28
+ PRETRAINED_MODEL_PATH=/$ROOT_PATH/yangping/pretrained_model/$BERT_NAME/
29
+
30
+
31
+ CHECKPOINT_PATH=/$ROOT_PATH/yangping/checkpoints/fengshen-finetune/$TASK/
32
+ DEFAULT_ROOT_DIR=/cognitive_comp/yangping/nlp/Fengshenbang-LM/fengshen/scripts/log/$TASK
33
+ OUTPUT_PATH=/$ROOT_PATH/yangping/nlp/modelevaluation/output/${TASK}_predict.json
34
+
35
+
36
+ config_json="./ds_config.$SLURM_JOBID.json"
37
+ # Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
38
+ # reduce_bucket_size: hidden_size*hidden_size
39
+ # stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
40
+ # stage3_param_persistence_threshold: 10 * hidden_size
41
+
42
+ cat <<EOT > $config_json
43
+ {
44
+ "train_micro_batch_size_per_gpu": $BATCH_SIZE,
45
+ "steps_per_print": 100,
46
+ "gradient_clipping": 1.0,
47
+ "zero_optimization": {
48
+ "stage": 3,
49
+ "offload_optimizer": {
50
+ "device": "cpu",
51
+ "pin_memory": true
52
+ },
53
+ "offload_param": {
54
+ "device": "cpu",
55
+ "pin_memory": true
56
+ },
57
+ "overlap_comm": true,
58
+ "contiguous_gradients": true,
59
+ "sub_group_size": 1e9,
60
+ "reduce_bucket_size": 6553600,
61
+ "stage3_prefetch_bucket_size": 5898240,
62
+ "stage3_param_persistence_threshold": 25600,
63
+ "stage3_max_live_parameters": 1e9,
64
+ "stage3_max_reuse_distance": 1e9,
65
+ "stage3_gather_fp16_weights_on_model_save": true
66
+ },
67
+ "optimizer": {
68
+ "type": "Adam",
69
+ "params": {
70
+ "lr": 1e-5,
71
+ "betas": [
72
+ 0.9,
73
+ 0.95
74
+ ],
75
+ "eps": 1e-8,
76
+ "weight_decay": 1e-2
77
+ }
78
+ },
79
+ "scheduler": {
80
+ "type": "WarmupLR",
81
+ "params":{
82
+ "warmup_min_lr": 5e-6,
83
+ "warmup_max_lr": 1e-5
84
+ }
85
+ },
86
+ "zero_allow_untested_optimizer": false,
87
+ "fp16": {
88
+ "enabled": true,
89
+ "loss_scale": 0,
90
+ "loss_scale_window": 1000,
91
+ "hysteresis": 2,
92
+ "min_loss_scale": 1
93
+ },
94
+ "activation_checkpointing": {
95
+ "partition_activations": false,
96
+ "contiguous_memory_optimization": false
97
+ },
98
+ "wall_clock_breakdown": false
99
+ }
100
+ EOT
101
+
102
+ export PL_DEEPSPEED_CONFIG_PATH=$config_json
103
+
104
+
105
+ DATA_ARGS="\
106
+ --data_dir $DATA_DIR \
107
+ --train_data train.json \
108
+ --valid_data dev.json \
109
+ --test_data test.json \
110
+ --train_batchsize $BATCH_SIZE \
111
+ --valid_batchsize $VAL_BATCH_SIZE \
112
+ --max_length 128 \
113
+ --texta_name $TEXTA_NAME \
114
+ --label_name $LABEL_NAME \
115
+ --id_name $ID_NAME \
116
+ "
117
+
118
+ MODEL_ARGS="\
119
+ --learning_rate 0.00001 \
120
+ --weight_decay 0.01 \
121
+ --warmup 0.001 \
122
+ --num_labels 2 \
123
+ "
124
+
125
+ MODEL_CHECKPOINT_ARGS="\
126
+ --monitor val_acc \
127
+ --save_top_k 3 \
128
+ --mode max \
129
+ --every_n_train_steps 10 \
130
+ --save_weights_only True \
131
+ --dirpath $CHECKPOINT_PATH \
132
+ --filename model-{epoch:02d}-{val_acc:.4f} \
133
+ "
134
+ TRAINER_ARGS="\
135
+ --max_epochs 7 \
136
+ --gpus 2 \
137
+ --strategy deepspeed_stage_3 \
138
+ --precision 16 \
139
+ --check_val_every_n_epoch 1 \
140
+ --val_check_interval 10 \
141
+ --default_root_dir $DEFAULT_ROOT_DIR \
142
+ "
143
+
144
+ options=" \
145
+ --pretrained_model_path $PRETRAINED_MODEL_PATH \
146
+ --output_save_path $OUTPUT_PATH \
147
+ $DATA_ARGS \
148
+ $MODEL_ARGS \
149
+ $MODEL_CHECKPOINT_ARGS \
150
+ $TRAINER_ARGS \
151
+ "
152
+
153
+ DOCKER_PATH=/$ROOT_PATH/yangping/containers/pytorch21_06_py3_docker_image.sif
154
+ SCRIPT_PATH=/$ROOT_PATH/yangping/nlp/fengshen/fengshen/examples/finetune_classification.py
155
+
156
+ # python3 $SCRIPT_PATH $options
157
+ srun singularity exec --nv -B /cognitive_comp/:/cognitive_comp/ $DOCKER_PATH python3 $SCRIPT_PATH $options
158
+