## Training a pointer-generator model on the Extreme Summarization dataset

##### 1. Download the Extreme Summarization data and preprocess it

Follow the instructions [here](https://github.com/EdinburghNLP/XSum) to obtain
the original Extreme Summarization dataset. You should have six files,
{train,validation,test}.{document,summary}.

##### 2. Create a vocabulary and extend it with source position markers

```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
cat train.document train.summary |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```

This creates the file dict.pg.txt, which contains the most frequent words
(`vocab_size - 4` entries, so that the vocabulary reaches 10k once fairseq adds
its four special symbols), followed by 1k source position markers:

```
the 4954867
. 4157552
, 3439668
to 2212159
a 1916857
of 1916820
and 1823350
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
<unk-3> 0
<unk-4> 0
...
```

##### 3. Preprocess the text data

```bash
./preprocess.py --source train.document --target train.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out train.pg.src --target-out train.pg.tgt
./preprocess.py --source validation.document --target validation.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out valid.pg.src --target-out valid.pg.tgt
./preprocess.py --source test.document --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out test.pg.src
```

The data should now contain `<unk-N>` tokens in place of out-of-vocabulary
words.

##### 4. Binarize the dataset

```bash
fairseq-preprocess \
  --source-lang src \
  --target-lang tgt \
  --trainpref train.pg \
  --validpref valid.pg \
  --destdir bin \
  --workers 60 \
  --srcdict dict.pg.txt \
  --joined-dictionary
```

##### 5. Train a model

```bash
total_updates=20000
warmup_updates=500
lr=0.001
max_tokens=4096
update_freq=4
pointer_layer=-2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --max-tokens "$max_tokens" \
    --task translation \
    --source-lang src --target-lang tgt \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --required-batch-size-multiple 1 \
    --arch transformer_pointer_generator \
    --alignment-layer "$pointer_layer" \
    --alignment-heads 1 \
    --source-position-markers 1000 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr "$lr" --max-update "$total_updates" --warmup-updates "$warmup_updates" \
    --update-freq "$update_freq" \
    --skip-invalid-size-inputs-valid-test
```

Above we specify that our dictionary contains 1000 source position markers, and
that we want to use one attention head from the penultimate decoder layer for
pointing. Training should take about 5.5 hours on one node with eight 32GB V100
GPUs. The logged messages confirm that dictionary indices from 10000 upward
will be mapped to the `<unk>` embedding:

```
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [src] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [tgt] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.src
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.tgt
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | bin valid src-tgt 11332 examples
2020-09-24 20:43:53 | INFO | fairseq.models.transformer_pg | dictionary indices from 10000 to 10999 will be mapped to 3
```

##### 6. Summarize the test sequences

```bash
batch_size=32
beam_size=6
max_length=60
length_penalty=1.0

fairseq-interactive bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --batch-size "$batch_size" \
    --task translation \
    --source-lang src --target-lang tgt \
    --path checkpoints/checkpoint_last.pt \
    --input test.pg.src \
    --buffer-size 200 \
    --max-len-a 0 \
    --max-len-b "$max_length" \
    --lenpen "$length_penalty" \
    --beam "$beam_size" \
    --skip-invalid-size-inputs-valid-test |
    tee generate.out
grep ^H generate.out | cut -f 3- >generate.hyp
```

You should now have the generated sequences in `generate.hyp`. They contain
`<unk-N>` tokens that the model has copied from the source sequence. To
retrieve the original words, we need the unprocessed source sequences from
`test.document`.

##### 7. Process the generated output

Because inputs that were too long were skipped when producing `generate.hyp`,
the same sequences have to be skipped when reading `test.document`.

```bash
./postprocess.py \
    --source <(awk 'NF<1024' test.document) \
    --target generate.hyp \
    --target-out generate.hyp.processed
```

The final sequences are now in `generate.hyp.processed`, with each `<unk-N>`
replaced by the original word from the source sequence.

##### An example of a summarized sequence

The original source document in `test.document`:

> de roon moved to teesside in june 2016 for an initial # 8.8 m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .

The preprocessed source document in `test.pg.src`:

> de \<unk-1> moved to \<unk-4> in june 2016 for an initial # \<unk-12> m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .

The generated summary in `generate.hyp`:

> middlesbrough striker \<unk> de \<unk-1> has joined spanish side \<unk> on a season-long loan .

The generated summary after postprocessing in `generate.hyp.processed`:

> middlesbrough striker \<unk> de roon has joined spanish side \<unk> on a season-long loan .
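
The round trip in this example can be illustrated with a minimal Python sketch. It is not the code in `preprocess.py` or `postprocess.py`: the function names `mark_oovs` and `restore_oovs`, the toy vocabulary, and the choice to reuse the position of a word's first occurrence are all assumptions made for illustration. The sketch only shows the idea that an `<unk-N>` marker stands for the source token at position N, while a plain `<unk>` carries no position and is left unchanged.

```python
import re

# Matches a source position marker such as "<unk-12>" (hypothetical helper).
UNK_MARKER = re.compile(r"^<unk-(\d+)>$")


def mark_oovs(source_tokens, vocabulary):
    """Replace each out-of-vocabulary source token with <unk-N>.

    N is the position of the token's first occurrence in the source
    (an assumption of this sketch).
    """
    first_position = {}
    marked = []
    for position, token in enumerate(source_tokens):
        if token in vocabulary:
            marked.append(token)
        else:
            n = first_position.setdefault(token, position)
            marked.append("<unk-{}>".format(n))
    return marked


def restore_oovs(summary_tokens, source_tokens):
    """Replace each <unk-N> in a summary with the source token at position N."""
    restored = []
    for token in summary_tokens:
        match = UNK_MARKER.match(token)
        if match and int(match.group(1)) < len(source_tokens):
            restored.append(source_tokens[int(match.group(1))])
        else:
            # Ordinary words and plain <unk> tokens are kept as they are.
            restored.append(token)
    return restored


# Toy data: the vocabulary below is made up for this example.
source = "de roon moved to teesside in june 2016".split()
vocabulary = {"de", "moved", "to", "in", "june", "2016"}

marked = mark_oovs(source, vocabulary)
print(" ".join(marked))
# de <unk-1> moved to <unk-4> in june 2016

summary = "<unk> de <unk-1> has joined".split()
restored = restore_oovs(summary, source)
print(" ".join(restored))
# <unk> de roon has joined
```

Because the marker encodes a position rather than a word, restoring it requires the unprocessed source line that corresponds to each hypothesis, which is why the source file passed to `postprocess.py` must be filtered in the same way as the generation input.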