## Training a pointer-generator model on the Extreme Summarization dataset

##### 1. Download the Extreme Summarization data and preprocess it

Follow the instructions [here](https://github.com/EdinburghNLP/XSum) to obtain
the original Extreme Summarization dataset. You should have six files,
{train,validation,test}.{document,summary}.
##### 2. Create a vocabulary and extend it with source position markers

```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
cat train.document train.summary |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```
This creates the file `dict.pg.txt` that contains the 10k most frequent words,
followed by 1k source position markers:

```
the 4954867
. 4157552
, 3439668
to 2212159
a 1916857
of 1916820
and 1823350
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
<unk-3> 0
<unk-4> 0
...
```
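If you prefer to do this step in Python, the sketch below builds an equivalent
`dict.pg.txt` (an illustrative rewrite of the shell pipeline, assuming
whitespace tokenization; ties in the frequency counts may come out in a
different order than `sort -k1,1bnr -k2` produces):

```python
from collections import Counter

vocab_size = 10000
position_markers = 1000

# Count whitespace-separated tokens over both training files, mirroring
# the `tr | sort | uniq -c` pipeline above.
counts = Counter()
for path in ("train.document", "train.summary"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

with open("dict.pg.txt", "w", encoding="utf-8") as out:
    # Keep vocab_size - 4 words: fairseq reserves four slots for its
    # built-in special symbols (<s>, <pad>, </s>, <unk>).
    for word, count in counts.most_common(vocab_size - 4):
        print(word, count, file=out)
    # Append the source position markers with a dummy count of 0.
    for n in range(position_markers):
        print(f"<unk-{n}> 0", file=out)
```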
##### 3. Preprocess the text data

```bash
./preprocess.py --source train.document --target train.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out train.pg.src --target-out train.pg.tgt
./preprocess.py --source validation.document --target validation.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out valid.pg.src --target-out valid.pg.tgt
./preprocess.py --source test.document --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out test.pg.src
```
The data should now contain `<unk-N>` tokens in place of out-of-vocabulary words.
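Conceptually, the substitution works as in the sketch below. This is a
simplified illustration inferred from the example at the end of this page, not
the actual `preprocess.py`; `replace_oov` is a hypothetical helper, and the
real script may handle edge cases differently:

```python
def replace_oov(source_words, target_words, vocab, max_markers=1000):
    """Illustrative sketch: replace out-of-vocabulary words with markers.

    Source OOVs become <unk-N>, where N is the word's position in the source.
    Target OOVs copy the marker of their first source occurrence, falling
    back to plain <unk> when the word never appears in the source.
    """
    src_out, first_position = [], {}
    for i, word in enumerate(source_words):
        if word in vocab or i >= max_markers:
            # Only max_markers positions exist; later OOVs stay plain <unk>.
            src_out.append(word if word in vocab else "<unk>")
        else:
            first_position.setdefault(word, i)
            src_out.append(f"<unk-{i}>")
    tgt_out = [
        word if word in vocab
        else f"<unk-{first_position[word]}>" if word in first_position
        else "<unk>"
        for word in target_words
    ]
    return src_out, tgt_out

vocab = {"de", "moved", "to", "in", "june", "2016"}
src = "de roon moved to teesside in june 2016".split()
tgt = "roon joined in 2016".split()
print(replace_oov(src, tgt, vocab))
# (['de', '<unk-1>', 'moved', 'to', '<unk-4>', 'in', 'june', '2016'],
#  ['<unk-1>', '<unk>', 'in', '2016'])
```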
##### 4. Binarize the dataset

```bash
fairseq-preprocess \
    --source-lang src \
    --target-lang tgt \
    --trainpref train.pg \
    --validpref valid.pg \
    --destdir bin \
    --workers 60 \
    --srcdict dict.pg.txt \
    --joined-dictionary
```
##### 5. Train a model

```bash
total_updates=20000
warmup_updates=500
lr=0.001
max_tokens=4096
update_freq=4
pointer_layer=-2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --max-tokens "$max_tokens" \
    --task translation \
    --source-lang src --target-lang tgt \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --required-batch-size-multiple 1 \
    --arch transformer_pointer_generator \
    --alignment-layer "$pointer_layer" \
    --alignment-heads 1 \
    --source-position-markers 1000 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr "$lr" --max-update "$total_updates" --warmup-updates "$warmup_updates" \
    --update-freq "$update_freq" \
    --skip-invalid-size-inputs-valid-test
```
Above we specify that our dictionary contains 1000 source position markers, and
that we want to use one attention head from the penultimate decoder layer for
pointing. Training should take about 5.5 hours on one node with eight 32GB V100
GPUs. The logged messages confirm that dictionary indices from 10000 to 10999
will be mapped to the `<unk>` embedding:
```
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [src] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [tgt] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.src
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.tgt
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | bin valid src-tgt 11332 examples
2020-09-24 20:43:53 | INFO | fairseq.models.transformer_pg | dictionary indices from 10000 to 10999 will be mapped to 3
```
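The last log line explains how the model keeps its embedding matrix small: all
1000 position markers share the single `<unk>` embedding (index 3 in fairseq
dictionaries). A rough PyTorch sketch of that remapping, where
`map_markers_to_unk` is a hypothetical helper, not fairseq's actual code:

```python
import torch

vocab_size = 10000  # indices >= this are markers <unk-0> ... <unk-999>
unk_idx = 3         # fairseq's default <unk> index

def map_markers_to_unk(tokens: torch.Tensor) -> torch.Tensor:
    # Replace every position-marker index with the shared <unk> index
    # before the embedding lookup, as the log message above describes.
    return torch.where(tokens >= vocab_size,
                       torch.full_like(tokens, unk_idx),
                       tokens)

tokens = torch.tensor([15, 10001, 42, 10999])
print(map_markers_to_unk(tokens))  # tensor([15,  3, 42,  3])
```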
##### 6. Summarize the test sequences

```bash
batch_size=32
beam_size=6
max_length=60
length_penalty=1.0

fairseq-interactive bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --batch-size "$batch_size" \
    --task translation \
    --source-lang src --target-lang tgt \
    --path checkpoints/checkpoint_last.pt \
    --input test.pg.src \
    --buffer-size 200 \
    --max-len-a 0 \
    --max-len-b "$max_length" \
    --lenpen "$length_penalty" \
    --beam "$beam_size" \
    --skip-invalid-size-inputs-valid-test |
    tee generate.out
grep ^H generate.out | cut -f 3- >generate.hyp
```
Now you should have the generated sequences in `generate.hyp` (each hypothesis
line in `generate.out` starts with an `H-<id>` tag and a model score, which the
`grep` and `cut` commands above strip away). The sequences contain `<unk-N>`
tokens that the model has copied from the source sequence. In order to retrieve
the original words, we need the unprocessed source sequences from
`test.document`.
##### 7. Process the generated output

Since we skipped overly long inputs when producing `generate.hyp`, we have to
skip the same inputs when reading `test.document`.

```bash
./postprocess.py \
    --source <(awk 'NF<1024' test.document) \
    --target generate.hyp \
    --target-out generate.hyp.processed
```
Now you'll find the final sequences in `generate.hyp.processed`, with each
`<unk-N>` replaced by the original word from the source sequence.
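The heart of this step is a simple positional lookup. Here is a minimal sketch
of the substitution (assuming whitespace tokenization; `restore_words` is a
hypothetical helper, and the actual `postprocess.py` may differ in details):

```python
import re

def restore_words(source_words, hypothesis):
    """Replace each <unk-N> in the hypothesis with the N-th source word."""
    def lookup(match):
        n = int(match.group(1))
        # Fall back to plain <unk> if the position is out of range.
        return source_words[n] if n < len(source_words) else "<unk>"
    # Plain <unk> tokens carry no position and are left untouched.
    return re.sub(r"<unk-(\d+)>", lookup, hypothesis)

source = "de roon moved to teesside in june 2016".split()
hypothesis = "middlesbrough striker <unk> de <unk-1> has joined ."
print(restore_words(source, hypothesis))
# middlesbrough striker <unk> de roon has joined .
```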
##### An example of a summarized sequence

The original source document in `test.document`:

> de roon moved to teesside in june 2016 for an initial # 8.8 m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .
The preprocessed source document in `test.pg.src`:

> de \<unk-1> moved to \<unk-4> in june 2016 for an initial # \<unk-12> m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .
The generated summary in `generate.hyp`:

> middlesbrough striker \<unk> de \<unk-1> has joined spanish side \<unk> on a season-long loan .

The generated summary after postprocessing in `generate.hyp.processed`:
> middlesbrough striker \<unk> de roon has joined spanish side \<unk> on a season-long loan . | |