# Discourse Mutual Information (DMI)

This repository hosts the PyTorch-based implementation of the DMI model proposed in *Representation Learning for Conversational Data using Discourse Mutual Information Maximization*.
## Requirements
- wandb
- transformers
- datasets
- torch 1.8.2 (lts)
## Getting Access to the Source Code or Pretrained Models

To get access to the source code or pretrained-model checkpoints, please send a request to AcadGrants@service.microsoft.com and cc pawang.iitk [at] iitkgp.ac.in and bsantraigi [at] gmail.com.

**Note:** The requesting third party (1) can download and use these deliverables for research as well as commercial use, (2) may modify them as they like but must cite our work and include this README, and (3) strictly cannot redistribute them to any other organization.
## Cite As

    @inproceedings{santra2022representation,
      title={Representation Learning for Conversational Data using Discourse Mutual Information Maximization},
      author={Santra, Bishal and Roychowdhury, Sumegh and Mandal, Aishik and Gurram, Vasu and Naik, Atharva and Gupta, Manish and Goyal, Pawan},
      booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
      year={2022}
    }
## How to run?

### Loading and Finetuning the model for a task

For finetuning the model on the tasks mentioned in the paper, or on a new task, use the `run_finetune.py` script or modify it according to your requirements. Example commands for launching finetuning based on some DMI checkpoints can be found in the `auto_eval` directory.

For example, if you have downloaded the checkpoint `DMI_small_model.pth` into the `checkpoints/` directory, you can launch one of these `auto_eval` scripts as:

    MODEL_NAME_PATH="checkpoints/DMI_small_model.pth" bash long_eval/probe_part1_rob.sh
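If you just want to inspect a downloaded checkpoint before wiring it into `run_finetune.py`, a plain `torch.load` is usually enough. The sketch below is generic PyTorch, not the repository's own loading code; it assumes the `.pth` file is an ordinary checkpoint (a raw state dict, or a dict wrapping one), and the actual key names depend on how the checkpoint was saved.

```python
import torch

# Generic inspection of a downloaded DMI checkpoint (not the repo's loader).
# Assumption: the .pth file is a standard PyTorch checkpoint -- either a raw
# state dict or a dict that wraps one; key names below are illustrative only.
ckpt = torch.load("checkpoints/DMI_small_model.pth", map_location="cpu")

if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])
else:
    print("loaded object of type:", type(ckpt).__name__)
```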
### Special Note

For running experiments with the checkpoints of different sizes, use the corresponding code branches as directed below:

- DMI_Base: `master` branch
  - Finetuning scripts in `auto_eval/`
- DMI_Medium: `berty` branch
  - Finetuning scripts in `auto_eval/8L/`
- DMI_Small: `berty` branch
  - Finetuning scripts in `auto_eval/`
The main difference among these models is that DMI_Base uses the `roberta-base` architecture as the core of its encoder, whereas the other two use `bert-8L` and `bert-6L`, respectively.
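For reference, these backbone encoders can be instantiated directly from Hugging Face `transformers`. The sketch below is illustrative only: `roberta-base` and the 8-layer BERT checkpoint name match what appears elsewhere in this README, while the 6-layer checkpoint name is an assumption based on Google's released BERT miniatures.

```python
from transformers import AutoModel

# Illustrative mapping from model size to its assumed backbone encoder.
# roberta-base (DMI_Base) and the 8-layer BERT (DMI_Medium) appear in this
# README; the 6-layer BERT checkpoint name for DMI_Small is an assumption.
backbones = {
    "DMI_Base": "roberta-base",
    "DMI_Medium": "google/bert_uncased_L-8_H-768_A-12",
    "DMI_Small": "google/bert_uncased_L-6_H-768_A-12",
}

for size, hf_name in backbones.items():
    model = AutoModel.from_pretrained(hf_name)
    print(size, hf_name, "->", model.config.num_hidden_layers, "layers")
```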
### Pretraining Dataset

Two types of dataset structure are available for model pretraining. In the case of smaller or "Normal" datasets, a single train-dialogues file contains all the training data and is consumed fully during each epoch. In the case of "Large" datasets, the data is split into smaller shards and saved as `.json` files.
- Normal Datasets: For an example of this, check the `data/dailydialog` or `data/reddit_1M` directories (a minimal reader sketch for both dataset formats follows this list).

      data/reddit_1M
      ├── test_dialogues.txt
      ├── train_dialogues.txt
      └── val_dialogues.txt
- Large Datasets: This mode can be activated by setting the `--dataset` argument to `rMax`, i.e., `--dataset rMax` or `-dd rMax`. This also requires providing the `-rmp` argument with the path to the directory containing the `.json` files. For validation during pretraining, this mode uses the DailyDialog validation set by default.

      data/rMax-subset
      ├── test-00000-of-01000.json
      ├── test-00001-of-01000.json
      ├── test-00002-of-01000.json
      ├── test-00003-of-01000.json
      ├── ...
      ├── train-00000-of-01000.json
      ├── train-00001-of-01000.json
      ├── train-00002-of-01000.json
      ├── train-00003-of-01000.json
      └── ...
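If you want to sanity-check these files outside the training pipeline, the sketch below shows one way to do it. It is only an illustration: the one-dialogue-per-line layout of the `.txt` files and the list-of-records layout of the `.json` shards are assumptions, and the repository's own loaders (selected via `-dd`) remain the authoritative readers.

```python
import glob
import json
from pathlib import Path

# Hedged sketch for peeking at the two dataset layouts described above.
# Assumptions (not verified against the repo's loaders):
#   * "Normal" .txt files hold one dialogue per line.
#   * "Large" .json shards each contain a JSON list of dialogue records.

def read_normal_dataset(path):
    """Yield raw dialogue lines from a single train/val/test .txt file."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            yield line.strip()

def iter_rmax_shards(rmax_dir, split="train"):
    """Yield (shard_path, records) for each sharded .json file of a split."""
    for shard in sorted(glob.glob(f"{rmax_dir}/{split}-*.json")):
        with open(shard, encoding="utf-8") as f:
            yield shard, json.load(f)

if __name__ == "__main__":
    # Normal dataset: count dialogues in DailyDialog's training file.
    n = sum(1 for _ in read_normal_dataset("data/dailydialog/train_dialogues.txt"))
    print("dailydialog train dialogues:", n)

    # Large dataset: report the size of each shard (same directory as -rmp).
    for shard, records in iter_rmax_shards("data/rMax-subset"):
        print(shard, "->", len(records), "records")
```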
### For training a model

A new model can be trained using the `pretrain.py` script.

Example:

- For training from scratch:

      python pretrain.py \
          -dd rMax -voc roberta \
          --roberta_init \
          -sym \
          -bs 64 -ep 1000 -vi 400 -li 50 -lr 5e-5 -scdl \
          --data_path ./data \
          -rmp /disk2/infonce-dialog/data/r727m/ \
          -t 1 \
          -ddp --world_size 6 \
          -ntq
- To resume training from an existing checkpoint: This example shows resuming training from a checkpoint saved under `checkpoints/DMI-Small_BERT-26Jan/`. Also note how we specify the name of an existing BERT/RoBERTa model, which defines the architecture and the original initialization of the model weights.

      python pretrain.py \
          -dd rMax -voc bert \
          --roberta_init \
          -robname google/bert_uncased_L-8_H-768_A-12 \
          -sym -bs 130 -lr 1e-5 -scdl -ep 1000 -vi 400 -li 50 \
          --data_path ./data \
          -rmp /disk2/infonce-dialog/data/r727m/ \
          -ddp --world_size 4 \
          -ntq -t 1 \
          -re -rept checkpoints/DMI-Small_BERT-26Jan/model_current.pth
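For readers unfamiliar with the `-ddp` / `--world_size` flags used in both commands above, the snippet below sketches what distributed data-parallel training amounts to in plain PyTorch: one process per worker, each wrapping its model in `DistributedDataParallel`. This is not `pretrain.py`'s code; the tiny linear layer is only a stand-in for the DMI encoder.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

# Hedged sketch of the mechanism behind -ddp / --world_size; not the repo's code.

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(16, 16).to(device)  # placeholder for the real encoder
    model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    # ... the training loop would run here, with a DistributedSampler ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # corresponds to --world_size in the command above
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```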
The `pretrain.py` script accepts the following arguments:

    -h, --help            show this help message and exit
    -dd {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}, --dataset {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}
                          which dataset to use for pretraining
    -rf, --reddit_filter_enabled
                          enable the reddit data filter for removing low-quality dialogs
    -rmp RMAX_PATH, --rmax_path RMAX_PATH
                          path to the directory of r727m (.json) data files
    -dp DATA_PATH, --data_path DATA_PATH
                          path to the root data folder
    -op OUTPUT_PATH, --output_path OUTPUT_PATH
                          path to store the output "model.pth" files
    -voc {bert,blender,roberta,dgpt-m}, --vocab {bert,blender,roberta,dgpt-m}
                          which tokenizer to use for pretraining
    -rob, --roberta_init  initialize the transformer encoder with RoBERTa weights
    -robname ROBERTA_NAME, --roberta_name ROBERTA_NAME
                          name of the checkpoint from Hugging Face
    -d D_MODEL, --d_model D_MODEL
                          size of the transformer encoder's hidden representation
    -d_ff DIM_FEEDFORWARD, --dim_feedforward DIM_FEEDFORWARD
                          dim_feedforward of the transformer encoder
    -p PROJECTION, --projection PROJECTION
                          size of the projection layer output
    -el ENCODER_LAYERS, --encoder_layers ENCODER_LAYERS
                          number of layers in the transformer encoder
    -eh ENCODER_HEADS, --encoder_heads ENCODER_HEADS
                          number of heads in the transformer encoder
    -sym, --symmetric_loss
                          whether to train using the symmetric InfoNCE loss
    -udrl, --unsupervised_discourse_losses
                          additional unsupervised discourse-relation loss components
    -sdrl, --supervised_discourse_losses
                          additional supervised discourse-relation loss components
    -es {infonce,jsd,nwj,tuba,dv,smile,infonce/td}, --estimator {infonce,jsd,nwj,tuba,dv,smile,infonce/td}
                          which MI estimator to use as the loss function
    -bs BATCH_SIZE, --batch_size BATCH_SIZE
                          batch size during pretraining
    -ep EPOCHS, --epochs EPOCHS
                          number of epochs for pretraining
    -vi VAL_INTERVAL, --val_interval VAL_INTERVAL
                          validation interval during training
    -li LOG_INTERVAL, --log_interval LOG_INTERVAL
                          logging interval during training
    -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                          learning rate
    -lrc, --learning_rate_control
                          LRC: the outer and projection layers get a faster LR; the rest use LR/10
    -t {0,1}, --tracking {0,1}
                          whether to track training and validation loss with wandb
    -scdl, --use_scheduler
                          whether to use a warmup+decay schedule for the LR
    -ntq, --no_tqdm       disable tqdm to create concise log files
    -ddp, --distdp        whether to use PyTorch DistributedDataParallel
    -ws WORLD_SIZE, --world_size WORLD_SIZE
                          world size when using DDP with PyTorch
    -re, --resume         2-stage pretraining: resume training from a previous checkpoint
    -rept RESUME_MODEL_PATH, --resume_model_path RESUME_MODEL_PATH
                          if resuming, path to the checkpoint file
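As a point of reference for the `-sym` / `--symmetric_loss` and `--estimator infonce` options above, the snippet below sketches a symmetric InfoNCE objective over context/response projections with in-batch negatives. It is a hedged illustration only: the normalized dot-product critic with a temperature and the 0.5 weighting are assumptions, not the repository's actual estimator implementation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a symmetric InfoNCE objective (cf. -sym / --symmetric_loss
# and --estimator infonce). The dot-product critic with temperature and the
# 0.5 weighting are assumptions, not the repository's actual implementation.
def symmetric_infonce(ctx_proj, rsp_proj, temperature=0.1):
    # ctx_proj, rsp_proj: [batch, dim] projections of contexts and responses;
    # matched pairs share the same row index, other rows act as negatives.
    ctx = F.normalize(ctx_proj, dim=-1)
    rsp = F.normalize(rsp_proj, dim=-1)
    logits = ctx @ rsp.t() / temperature            # [batch, batch] similarities
    labels = torch.arange(ctx.size(0), device=ctx.device)
    loss_c2r = F.cross_entropy(logits, labels)      # context -> response
    loss_r2c = F.cross_entropy(logits.t(), labels)  # response -> context
    return 0.5 * (loss_c2r + loss_r2c)

# Toy usage with random projections:
print(symmetric_infonce(torch.randn(8, 256), torch.randn(8, 256)).item())
```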