# Incremental BPE builder
<strong>A modified, simplified version of text_encoder_build_subword.py and its dependencies from the [tensor2tensor library](https://github.com/tensorflow/tensor2tensor), with its output adapted to fit [Google Research's open-sourced BERT project](https://github.com/google-research/bert).</strong>
## Requirements
This project was developed in an environment consisting of:
- python 3.6
- tensorflow 1.11
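A minimal setup sketch for reproducing that environment, assuming conda and pip are available (any environment manager works):
```bash
# Create an isolated Python 3.6 environment
conda create -n bpe-builder python=3.6
conda activate bpe-builder
# Install the TensorFlow version this project was developed against
pip install tensorflow==1.11
```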
## Basic usage
#### Build domain-specific vocabulary automatically
If you want to build an appropriately sized vocabulary for a specific domain using the incremental algorithm in our paper, run the following:
```bash
python vocab_extend.py \
--corpus {file for the domain corpus} \
--raw_vocab {bert_raw_vocab_file} \
--output_file {the output file for the final vocabulary} \
--interval {vocab size interval} \
--threshold {threshold for P(D)}
# Example using sample data
python vocab_extend.py --corpus test_data/chem.txt \
--raw_vocab test_data/vocab.txt \
--output_file test_data/chem.vocab \
--interval 1000 --threshold 1
```
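To sanity-check the result, you can compare the extended vocabulary against the original BERT vocabulary with standard shell tools; this check is only illustrative and not part of the tool itself:
```bash
# Count entries in the original and extended vocabularies
wc -l test_data/vocab.txt test_data/chem.vocab
# Show tokens that appear only in the extended vocabulary
comm -13 <(sort test_data/vocab.txt) <(sort test_data/chem.vocab) | head
```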
If you simply want to build a vocabulary of a specific size, run the following:
```bash
python subword_builder.py \
--corpus_filepattern {corpus_for_vocab} \
--raw_vocab {bert_raw_vocab_file} \
--output_filename {name_of_vocab} \
--vocab_size {final vocab size} \
--do_lower_case
```
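Since the output vocabulary is formatted for the BERT project, it can be supplied wherever BERT expects a vocab file. A hedged sketch, assuming the google-research/bert repository is checked out alongside this one and using placeholder paths:
```bash
# Generate BERT pre-training data with the extended vocabulary
python create_pretraining_data.py \
--input_file=test_data/chem.txt \
--output_file=/tmp/chem_pretrain.tfrecord \
--vocab_file=test_data/chem.vocab \
--do_lower_case=True
```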