hhou435 committed
Commit 629fe90 (1 parent: 530746d)
README.md CHANGED
@@ -1,77 +1,157 @@
  ---
- language: All languages
- datasets: ISML datasets (80 thousand hours of unlabeled data) + Babel datasets (2 thousand hours of unlabeled data)

- # Chinese W2v-conformer

  ## Model description
- This is the set of speech W2v-conformer models pre-trained by UER-py. You can download the model from the [UER-py Github page](https://github.com/dbiir/UER-py/):

  ## How to use
- You can use the model for speech recognition with the WeNet toolkit:

  ```python
- >>> import yaml
- >>> from wenet.dataset.dataset import CollateFunc, AudioDataset
- >>> from wenet.transformer.asr_model import ASRModel
- >>> from wenet.transformer.encoder import ConformerEncoder
- >>> from wenet.transformer.decoder import TransformerDecoder
- >>> from wenet.transformer.ctc import CTC
- >>> from wenet.utils.executor import Executor
- >>> from wenet.utils.checkpoint import save_checkpoint, load_checkpoint
- >>> # Load the training config first; the encoder/decoder settings below are read from it.
- >>> with open(args.config, 'r') as fin:
- ...     configs = yaml.load(fin, Loader=yaml.FullLoader)
- >>> # input_dim comes from the acoustic features, vocab_size from the vocabulary file.
- >>> encoder = ConformerEncoder(input_dim, **configs['encoder_conf'])
- >>> decoder = TransformerDecoder(vocab_size, encoder.output_size(), **configs['decoder_conf'])
- >>> ctc = CTC(vocab_size, encoder.output_size())
- >>> model = ASRModel(
- ...     vocab_size=vocab_size,
- ...     encoder=encoder,
- ...     decoder=decoder,
- ...     ctc=ctc,
- ...     **configs['model_conf'],
- ... )
- >>> infos = load_checkpoint(model, args.checkpoint)
  ```

  ## Training data
- ISML datasets (80 thousand hours of unlabeled data) and Babel datasets (2 thousand hours of unlabeled data) are used as training data.

  ## Training procedure
- The model is pre-trained with the wav2vec 2.0 objective by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 70 epochs with a batch size of 128, and use the same hyper-parameters on different model sizes.
- The downstream models are then fine-tuned:

- Stage 1:

  ```
- python wenet/bin/train.py --gpu 0,1,2,3,4,5,6,7 \
-     --config $train_config \
-     --train_data train.data \
-     --cv_data dev.data \
-     ${checkpoint:+--checkpoint $checkpoint} \
-     --model_dir $dir \
-     --ddp.init_method $init_method \
-     --ddp.world_size 7 \
-     --ddp.dist_backend nccl \
-     --num_workers 2
  ```

- ### BibTeX entry and citation info
  ```
- @article{baevski2020wav2vec,
-   title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
-   author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
-   journal={arXiv preprint arXiv:2006.11477},
-   year={2020}
- }

- @article{zhang2020pushing,
-   title={Pushing the limits of semi-supervised learning for automatic speech recognition},
-   author={Zhang, Yu and Qin, James and Park, Daniel S and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V and Wu, Yonghui},
-   journal={arXiv preprint arXiv:2010.10504},
-   year={2020}
  }

- @article{zhang2021wenet,
-   title={WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit},
-   author={Zhang, Binbin and Wu, Di and Yang, Chao and Chen, Xiaoyu and Peng, Zhendong and Wang, Xiangming and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Xie, Lei and others},
-   journal={arXiv preprint arXiv:2102.01547},
-   year={2021}
  }
  ```
  [base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
  ---
+ language: Chinese
+ datasets: CLUECorpusSmall
+ widget:
+ - text: "中国的首都是[MASK]京"
+ ---
+
+ # Chinese ALBERT

  ## Model description
+
+ This is the set of Chinese ALBERT models pre-trained by UER-py. You can download the model either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:
+
+ |                  |               Link               |
+ | ---------------- | :------------------------------: |
+ | **ALBERT-Base**  |  [**L=12/H=768 (Base)**][base]   |
+ | **ALBERT-Large** | [**L=24/H=1024 (Large)**][large] |

  ## How to use
+
+ You can use the model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import BertTokenizer, AlbertForMaskedLM, FillMaskPipeline
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ >>> model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ >>> unmasker = FillMaskPipeline(model, tokenizer)
+ >>> unmasker("中国的首都是[MASK]京。")
+ [
+     {'sequence': '中 国 的 首 都 是 北 京 。',
+      'score': 0.8528032898902893,
+      'token': 1266,
+      'token_str': '北'},
+     {'sequence': '中 国 的 首 都 是 南 京 。',
+      'score': 0.07667620480060577,
+      'token': 1298,
+      'token_str': '南'},
+     {'sequence': '中 国 的 首 都 是 东 京 。',
+      'score': 0.020440367981791496,
+      'token': 691,
+      'token_str': '东'},
+     {'sequence': '中 国 的 首 都 是 维 京 。',
+      'score': 0.010197942145168781,
+      'token': 5335,
+      'token_str': '维'},
+     {'sequence': '中 国 的 首 都 是 汴 京 。',
+      'score': 0.0075391442514956,
+      'token': 3745,
+      'token_str': '汴'}
+ ]
+ ```

+ Here is how to use this model to get the features of a given text in PyTorch:
+
  ```python
+ from transformers import BertTokenizer, AlbertModel
+ tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFAlbertModel
+ tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ model = TFAlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
  ```

  ## Training data
+
+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
+
  ## Training procedure
+
+ The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128, and then for an additional 250,000 steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
+
+ Taking the case of ALBERT-Base:
+
+ Stage 1:
+
  ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
+                       --seq_length 128 --processes_num 32 --target albert
  ```

  ```
+ python3 pretrain.py --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/albert/base_config.json \
+                     --output_model_path models/cluecorpussmall_albert_base_seq128_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --factorized_embedding_parameterization --parameter_sharing \
+                     --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
+ ```

+ Stage 2:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
+                       --seq_length 512 --processes_num 32 --target albert
+ ```
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
+                     --pretrained_model_path models/cluecorpussmall_albert_base_seq128_model.bin-1000000 \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/albert/base_config.json \
+                     --output_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --factorized_embedding_parameterization --parameter_sharing \
+                     --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
+ ```

+ Finally, we convert the pre-trained model into Huggingface's format:
+
+ ```
+ python3 scripts/convert_albert_from_uer_to_huggingface.py --input_model_path cluecorpussmall_albert_base_seq512_model.bin-250000 \
+                                                           --output_model_path pytorch_model.bin
+ ```
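
As a quick sanity check, the converted weights can be loaded back with `transformers` (a minimal sketch, assuming the converted `pytorch_model.bin` and the repository's `config.json` sit in the current working directory):

```python
from transformers import AlbertForMaskedLM

# Load the converted checkpoint together with config.json from the local directory.
model = AlbertForMaskedLM.from_pretrained(".")

# Roughly 10M parameters for this ALBERT-Base checkpoint (~40 MB of fp32 weights).
print(sum(p.numel() for p in model.parameters()))
```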

+ ### BibTeX entry and citation info
+
+ ```
+ @article{lan2019albert,
+   title={ALBERT: A lite BERT for self-supervised learning of language representations},
+   author={Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu},
+   journal={arXiv preprint arXiv:1909.11942},
+   year={2019}
  }

+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
  }
  ```
  [base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
  [large]:https://huggingface.co/uer/albert-large-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_name_or_path": "albert",
+   "architectures": [
+     "AlbertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0,
+   "bos_token_id": 2,
+   "classifier_dropout_prob": 0.1,
+   "embedding_size": 128,
+   "eos_token_id": 3,
+   "hidden_act": "relu",
+   "hidden_dropout_prob": 0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "inner_group_num": 1,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "albert",
+   "num_attention_heads": 12,
+   "num_hidden_groups": 1,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "BertTokenizer",
+   "transformers_version": "4.6.0",
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
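
A minimal sketch (assuming a local clone of this repository) of how the configuration above maps onto `transformers`' `AlbertConfig`, highlighting the factorized embedding and cross-layer parameter sharing that keep the checkpoint small:

```python
from transformers import AlbertConfig

# Read the config.json shown above from a local clone of the repository.
config = AlbertConfig.from_json_file("config.json")

# ALBERT factorizes the embeddings: 128-dim token embeddings are projected
# up to the 768-dim hidden size, and all 12 layers share one parameter group.
print(config.embedding_size, config.hidden_size)           # 128 768
print(config.num_hidden_layers, config.num_hidden_groups)  # 12 1
```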
69.pt → pytorch_model.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:eee2e05f3ca00624ab4a5bac31ca538f05d1c2ccd975f85074dc4c3ad13793b4
- size 562887224
  version https://git-lfs.github.com/spec/v1
+ oid sha256:4e90c5f6b64fda667d9a10a8065878a4790515a0df171e361787354b25526141
+ size 40325143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00b2f0b8fa2b513f5dde4fe14f25978c459e1381cb7ff0fd259fc98c4a6b4d61
+ size 51528256
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
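
A short sketch of how the tokenizer files added in this commit behave in practice (loading by the Hub name used in the README above): the cased BERT-style tokenizer splits Chinese text into single characters, and its special tokens mirror `special_tokens_map.json`.

```python
from transformers import BertTokenizer

# Load the tokenizer defined by the files added in this commit.
tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")

# tokenize_chinese_chars=true: Chinese text is split character by character.
print(tokenizer.tokenize("中国的首都是北京。"))
# ['中', '国', '的', '首', '都', '是', '北', '京', '。']

print(tokenizer.special_tokens_map)  # mirrors special_tokens_map.json above
```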
train_conformer_large_w2v.yaml DELETED
@@ -1,119 +0,0 @@
- # network architecture
- # encoder related
- encoder: conformer
- encoder_conf:
-   output_size: 512    # dimension of attention
-   attention_heads: 8
-   linear_units: 2048  # the number of units of position-wise feed forward
-   num_blocks: 18      # the number of encoder blocks
-   dropout_rate: 0.1
-   positional_dropout_rate: 0.0
-   attention_dropout_rate: 0.0
-   input_layer: conv2d6 # encoder input type, you can choose conv2d, conv2d6 and conv2d8
-   normalize_before: true
-   cnn_module_kernel: 15
-   use_cnn_module: True
-   activation_type: 'swish'
-   macaron_style: True
-   pos_enc_layer_type: 'rel_pos'
-   selfattention_layer_type: 'abs_selfattn'
-   nonorm: False
-   cnn_prev: True
-   cnn_after: False
-
- # decoder related
- decoder: transformer
- decoder_conf:
-   attention_heads: 4
-   linear_units: 2048
-   num_blocks: 1
-   dropout_rate: 0.0
-   positional_dropout_rate: 0.0
-   self_attention_dropout_rate: 0.0
-   src_attention_dropout_rate: 0.0
-
- # hybrid CTC/attention
- model_conf:
-   ctc_weight: 1.0
-   lsm_weight: 0.1     # label smoothing option
-   length_normalized_loss: false
-
- raw_wav: False
- data_save: True
- use_gc: True
-
- w2v_encoder: True
- pretrain: True
- random_pretrain: False
- wav2vec: True
- w2v_coef: 1.0
-
- mpc_didi_ver: False
- wav2mpc: False
- wav2mpc_reduction: False
- mpc_mask_loss: False
- mpc_coef: 0.0
-
- mask: True
- quantize_targets: True
- project_targets: True
- latent_vars: 320
- w2v_reduct: True
- w2v_ext_loss: True
- w2v_loss_weights: [0.1,0]
-
- w2v_mask_prob: 0.65
- mpc_prob: 0.5
-
- remove_valbest: False
-
- model:
-   method: 'npc'        # Accepts npc/apc/vqapc
-   paras:
-     kernel_size: 15    # Receptive field size (R) = kernel_size + 2*(n_blocks)
-     mask_size: 5       # Desired input mask size (M_in) as described in NPC paper
-     n_blocks: 4        # Number of ConvBlocks stacked in NPC model
-     hidden_size: 512   # Dimension of feature of all layers
-     dropout: 0.1       # Dropout in ConvBlock
-     residual: True     # Residual connection in ConvBlock
-     batch_norm: True   # Apply BatchNorm in ConvBlock
-     activate: 'relu'   # Activation function of ConvBlock
-     disable_cross_layer: False  # Apply Masked ConvBlock at last layer only
-     vq:
-       codebook_size: [64,64,64,64]  # Codebook size of each group in VQ-layer
-       code_dim: [128,128,128,128]   # Dim of each group summing up to hidden_size
-       gumbel_temperature: 1.0       # Temperature of Gumbel Softmax in VQ-layer
-
- collate_conf:
-   spec_aug: false
-
- # specaugmentation related
- spec_aug_conf:
-   num_time_mask: 2
-   num_freq_mask: 2
-   max_time_mask: 50
-   max_freq_mask: 10
-   max_time_warp: 80
-   gauss_mask_for_time: False
-   warp_for_time: False
-
- # dataset related
- dataset_conf:
-   max_length: 4500
-   min_length: 80
-   max_frames_in_batch: 16000
-   batch_type: 'dynamic' # static or dynamic
-   batch_size: 20
-   sort: true
-
- grad_clip: 10
- accum_grad: 2
- max_epoch: 180
- log_interval: 100
-
- optim: adam
- optim_conf:
-   lr: 0.001
- scheduler: warmuplr     # pytorch v1.1.0+ required
- scheduler_conf:
-   warmup_steps: 10000
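
For reference, a config in this format is what the model-construction snippet in the old README reads; a minimal sketch of loading it (assuming the file is still available locally, e.g. checked out from an earlier revision):

```python
import yaml

# Read the (now removed) training config and inspect the sections that the
# encoder/decoder construction in the old README pulls out of it.
with open("train_conformer_large_w2v.yaml", "r") as fin:
    configs = yaml.safe_load(fin)

print(configs["encoder"])                      # 'conformer'
print(configs["encoder_conf"]["output_size"])  # 512
print(configs["model_conf"]["ctc_weight"])     # 1.0
```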
vocab.txt ADDED
The diff for this file is too large to render.