# Incremental BPE builder
<strong>A modified, simplified version of text_encoder_build_subword.py and its dependencies from the [tensor2tensor library](https://github.com/tensorflow/tensor2tensor), with its output adapted to fit [Google Research's open-sourced BERT project](https://github.com/google-research/bert).</strong>
## Requirements
This project was developed in an environment consisting of:
- python 3.6
- tensorflow 1.11
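A minimal setup sketch for reproducing that environment, assuming conda and pip are available (any environment manager works):
```bash
# Create an isolated Python 3.6 environment
conda create -n bpe-builder python=3.6
conda activate bpe-builder
# Install the TensorFlow version this project was developed against
pip install tensorflow==1.11
```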
## Basic usage
#### Build domain-specific vocabulary automatically
If you want to build an appropriately sized vocabulary for a specific domain using the incremental algorithm in our paper, run the following:
```bash
python vocab_extend.py \
--corpus {file for the domain corpus} \
--raw_vocab {bert_raw_vocab_file} \
--output_file {the output file for the final vocabulary} \
--interval {vocab size interval} \
--threshold {threshold for P(D)}
# Example using sample data
python vocab_extend.py --corpus test_data/chem.txt \
--raw_vocab test_data/vocab.txt \
--output_file test_data/chem.vocab \
--interval 1000 --threshold 1
```
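To sanity-check the result, you can compare the extended vocabulary against the original BERT vocabulary with standard shell tools; this check is only illustrative and not part of the tool itself:
```bash
# Count entries in the original and extended vocabularies
wc -l test_data/vocab.txt test_data/chem.vocab
# Show tokens that appear only in the extended vocabulary
comm -13 <(sort test_data/vocab.txt) <(sort test_data/chem.vocab) | head
```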
If you simply want to build a vocabulary of a specific size, run the following:
```bash
python subword_builder.py \
--corpus_filepattern {corpus_for_vocab} \
--raw_vocab {bert_raw_vocab_file} \
--output_filename {name_of_vocab} \
--vocab_size {final vocab size} \
--do_lower_case
```
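Since the output vocabulary is formatted for the BERT project, it can be supplied wherever BERT expects a vocab file. A hedged sketch, assuming the google-research/bert repository is checked out alongside this one and using placeholder paths:
```bash
# Generate BERT pre-training data with the extended vocabulary
python create_pretraining_data.py \
--input_file=test_data/chem.txt \
--output_file=/tmp/chem_pretrain.tfrecord \
--vocab_file=test_data/chem.vocab \
--do_lower_case=True
```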