Video to Text
=============
---------
**WARNING**: This example is based on the
`legacy version of OpenNMT-py <https://github.com/OpenNMT/OpenNMT-py/tree/legacy>`_!
---------
Recurrent
---------
This tutorial shows how to replicate the results from
`"Describing Videos by Exploiting Temporal Structure" <https://arxiv.org/pdf/1502.08029.pdf>`_
[`code <https://github.com/yaoli/arctic-capgen-vid>`_]
using OpenNMT-py.
Get `YouTubeClips.tar` from `here <http://www.cs.utexas.edu/users/ml/clamp/videoDescription/>`_.
Use ``tar -xvf YouTubeClips.tar`` to decompress the archive.
Now, visit `this repo <https://github.com/yaoli/arctic-capgen-vid>`_.
Follow the "preprocessed YouTube2Text download link."
We'll be throwing away the GoogLeNet features; we just need the captions.
Use ``unzip youtube2text_iccv15.zip`` to decompress the files.
Get to the following directory structure: ::
yt2t
|-YouTubeClips
|-youtube2text_iccv15
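For example, if you extracted both archives in your current directory, one way to arrive at this
layout is (a sketch; adjust the paths to wherever you extracted the files):

.. code-block:: bash

    mkdir -p yt2t
    # tar -xvf and unzip produce the YouTubeClips and youtube2text_iccv15 folders described above.
    mv YouTubeClips youtube2text_iccv15 yt2t/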
Change directories to `yt2t`. We'll rename the videos to follow the "vid#.avi" format:
.. code-block:: python
import pickle
import os
YT = "youtube2text_iccv15"
YTC = "YouTubeClips"
# load the YouTube hash -> vid### map.
with open(os.path.join(YT, "dict_youtube_mapping.pkl"), "rb") as f:
yt2vid = pickle.load(f, encoding="latin-1")
for f in os.listdir(YTC):
hashy, ext = os.path.splitext(f)
vid = yt2vid[hashy]
fpath_old = os.path.join(YTC, f)
f_new = vid + ext
fpath_new = os.path.join(YTC, f_new)
os.rename(fpath_old, fpath_new)
Make sure all the videos have the same (low) framerate by changing to the YouTubeClips directory and using
.. code-block:: bash
for fi in $( ls ); do ffmpeg -y -i $fi -r 2 $fi; done
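Note that this re-encodes each clip over itself. Depending on your ffmpeg build, writing the output
over the input file can fail or truncate the clip; a more conservative variant (an alternative
sketch, not part of the original recipe) writes to a temporary file first:

.. code-block:: bash

    for fi in *.avi; do
        # Re-encode at 2 fps to a temporary file, then replace the original.
        ffmpeg -y -i "$fi" -r 2 "tmp_$fi" && mv "tmp_$fi" "$fi"
    done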
Now we want to convert the frames into sequences of CNN feature vectors.
(We'll use the environment variable ``YT2T`` to refer to the `yt2t` directory, so change directories back and use)
.. code-block:: bash
export YT2T=`pwd`
Then change directories back to the `OpenNMT-py` directory.
Use `tools/vid_feature_extractor.py`.
Set the ``--world_size`` argument to the number of GPUs you have available
(you can use the environment variable ``CUDA_VISIBLE_DEVICES`` to restrict which GPUs are used).
.. code-block:: bash
PYTHONPATH=$PWD:$PYTHONPATH python tools/vid_feature_extractor.py --root_dir $YT2T/YouTubeClips --out_dir $YT2T/r152
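For example, to run the extraction on two GPUs (this assumes the ``--world_size`` flag described
above; adjust the device list to your machine):

.. code-block:: bash

    # Restrict extraction to GPUs 0 and 1 and tell the script to use both of them.
    CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=$PWD:$PYTHONPATH python tools/vid_feature_extractor.py --root_dir $YT2T/YouTubeClips --out_dir $YT2T/r152 --world_size 2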
Ensure that 1970 feature files were produced; you can check with ``ls -1 $YT2T/r152 | wc -l``.
If some are missing, rerun the script: it will only process the missing feature vectors.
(Needing to rerun is unexpected behavior, so consider opening an issue if it happens.)
Now we turn our attention to the annotations. Each video has multiple associated captions. We want to
train the model on each video + single caption pair. We'll collect all the captions per video, then we'll
flatten them into files listing the feature vector sequence filenames (repeating for each caption) and the
annotations. We skip the test videos since they are handled separately at translation time.
Change directories back to ``$YT2T``:
.. code-block:: bash
cd $YT2T
.. code-block:: python
import pickle
import os
from random import shuffle
YT = "youtube2text_iccv15"
SHUFFLE = True
with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
ann = pickle.load(f, encoding="latin-1")
vid2anns = {}
for vid_name, data in ann.items():
for d in data:
try:
vid2anns[vid_name].append(d["tokenized"])
except KeyError:
vid2anns[vid_name] = [d["tokenized"]]
with open(os.path.join(YT, "train.pkl"), "rb") as f:
train = pickle.load(f, encoding="latin-1")
with open(os.path.join(YT, "valid.pkl"), "rb") as f:
val = pickle.load(f, encoding="latin-1")
with open(os.path.join(YT, "test.pkl"), "rb") as f:
test = pickle.load(f, encoding="latin-1")
train_files = open("yt2t_train_files.txt", "w")
val_files = open("yt2t_val_files.txt", "w")
val_folded = open("yt2t_val_folded_files.txt", "w")
test_files = open("yt2t_test_files.txt", "w")
train_cap = open("yt2t_train_cap.txt", "w")
val_cap = open("yt2t_val_cap.txt", "w")
vid_names = vid2anns.keys()
if SHUFFLE:
vid_names = list(vid_names)
shuffle(vid_names)
for vid_name in vid_names:
anns = vid2anns[vid_name]
vid_path = vid_name + ".npy"
for i, an in enumerate(anns):
an = an.replace("\n", " ") # some caps have newlines
split_name = vid_name + "_" + str(i)
if split_name in train:
train_files.write(vid_path + "\n")
train_cap.write(an + "\n")
elif split_name in val:
if i == 0:
val_folded.write(vid_path + "\n")
val_files.write(vid_path + "\n")
val_cap.write(an + "\n")
else:
# Don't need to save out the test captions,
# just the files. And, don't need to repeat
# it for each caption
assert split_name in test
if i == 0:
test_files.write(vid_path + "\n")
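The script writes parallel files, so line *i* of ``yt2t_train_files.txt`` must pair with line *i* of
``yt2t_train_cap.txt`` (and likewise for the validation files). A quick optional sanity check:

.. code-block:: bash

    # The train pair and the val pair should each report matching line counts.
    wc -l yt2t_train_files.txt yt2t_train_cap.txt yt2t_val_files.txt yt2t_val_cap.txt yt2t_test_files.txt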
Return to the `OpenNMT-py` directory. Now we preprocess the data for training.
We preprocess with a small shard size of 1000. This keeps the amount of data in memory (RAM) to a
manageable 10 GB. If you have more RAM, you can increase the shard size.
Preprocess the data with
.. code-block:: bash
onmt_preprocess -data_type vec -train_src $YT2T/yt2t_train_files.txt -src_dir $YT2T/r152/ -train_tgt $YT2T/yt2t_train_cap.txt -valid_src $YT2T/yt2t_val_files.txt -valid_tgt $YT2T/yt2t_val_cap.txt -save_data data/yt2t --shard_size 1000
Train with
.. code-block:: bash
onmt_train -data data/yt2t -save_model yt2t-model -world_size 2 -gpu_ranks 0 1 -model_type vec -batch_size 64 -train_steps 10000 -valid_steps 500 -save_checkpoint_steps 500 -encoder_type brnn -optim adam -learning_rate .0001 -feat_vec_size 2048
Translate with
.. code-block:: bash
onmt_translate -model yt2t-model_step_7200.pt -src $YT2T/yt2t_test_files.txt -output pred.txt -verbose -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 10
.. note::
Generally, you want to keep the model that has the lowest validation perplexity. That turned out to be
at step 7200, but a different validation frequency or random seed could yield a different best checkpoint.
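One way to locate that checkpoint is to scan the training log (a minimal sketch; it assumes you
redirected the ``onmt_train`` output to ``train.log`` and validated every 500 steps as in the
command above, and that the log contains lines of the form ``Validation perplexity: <value>``):

.. code-block:: python

    import re

    VALID_STEPS = 500  # must match -valid_steps used for training

    # Collect the reported validation perplexities in order of appearance.
    ppls = []
    with open("train.log") as f:
        for line in f:
            m = re.search(r"Validation perplexity: ([0-9.eE+-]+)", line)
            if m:
                ppls.append(float(m.group(1)))

    best = min(range(len(ppls)), key=lambda i: ppls[i])
    print("lowest validation perplexity %.3f at step %d"
          % (ppls[best], (best + 1) * VALID_STEPS))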
Then you can use `coco-caption <https://github.com/tylin/coco-caption/tree/master/pycocoevalcap>`_ to evaluate the predictions.
(Note that the `flauted fork <https://github.com/flauted/coco-caption>`_ can be used for Python 3 compatibility.)
Install the git repository with pip using
.. code-block:: bash
pip install git+<clone URL>
Then use the following Python code to evaluate:
.. code-block:: python
import os
from pprint import pprint
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice
if __name__ == "__main__":
pred = open("pred.txt")
import pickle
import os
YT = os.path.join(os.environ["YT2T"], "youtube2text_iccv15")
with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
ann = pickle.load(f, encoding="latin-1")
vid2anns = {}
for vid_name, data in ann.items():
for d in data:
try:
vid2anns[vid_name].append(d["tokenized"])
except KeyError:
vid2anns[vid_name] = [d["tokenized"]]
test_files = open(os.path.join(os.environ["YT2T"], "yt2t_test_files.txt"))
scorers = {
"Bleu": Bleu(4),
"Meteor": Meteor(),
"Rouge": Rouge(),
"Cider": Cider(),
"Spice": Spice()
}
gts = {}
res = {}
for outp, filename in zip(pred, test_files):
filename = filename.strip("\n")
outp = outp.strip("\n")
vid_id = os.path.splitext(filename)[0]
anns = vid2anns[vid_id]
gts[vid_id] = anns
res[vid_id] = [outp]
scores = {}
for name, scorer in scorers.items():
score, all_scores = scorer.compute_score(gts, res)
if isinstance(score, list):
for i, sc in enumerate(score, 1):
scores[name + str(i)] = sc
else:
scores[name] = score
pprint(scores)
Here are our results ::
{'Bleu1': 0.7888553878084233,
'Bleu2': 0.6729376621109295,
'Bleu3': 0.5778428507344473,
'Bleu4': 0.47633625833397897,
'Cider': 0.7122415518428051,
'Meteor': 0.31829562714082704,
'Rouge': 0.6811305229481235,
'Spice': 0.044147089472463576}
So how does this stack up against the paper? These results should be compared to the "Global (Temporal Attention)"
row in Table 1. The authors report BLEU4 0.4028, METEOR 0.2900, and CIDEr 0.4801. So, our results are a significant
improvement. Our architecture follows the general encoder + attentional decoder described in the paper, but the
actual attention implementation is slightly different. The paper downsamples by choosing 26 equally spaced frames from
the first 240, while we downsample the video to 2 fps. Also, we use ResNet features instead of GoogLeNet, and we
lowercase while the paper does not, so some improvement is expected.
Transformer
-----------
Now we will try to replicate the baseline transformer results from
`"TVT: Two-View Transformer Network for Video Captioning" <http://proceedings.mlr.press/v95/chen18b.html>`_
on the MSVD (YouTube2Text) dataset. See Table 3, Base model(R).
In Section 4.3, the authors report most of their preprocessing and hyperparameters.
Create a folder called *yt2t_2*. Copy the *youtube2text_iccv15* directory and *YouTubeClips.tar* into
the new directory and untar *YouTubeClips.tar*. Rerun the renaming code. Subsample at 5 fps using
.. code-block:: bash
for fi in $( ls ); do ffmpeg -y -i $fi -r 5 $fi; done
Set the environment variable ``YT2T`` to this new directory and change back to the `OpenNMT-py` directory.
Run the feature extraction command again to extract ResNet features from the frames.
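For example (the paths below are placeholders for wherever you created the new directory and
cloned OpenNMT-py):

.. code-block:: bash

    export YT2T=/path/to/yt2t_2   # placeholder path
    cd /path/to/OpenNMT-py        # placeholder path
    PYTHONPATH=$PWD:$PYTHONPATH python tools/vid_feature_extractor.py --root_dir $YT2T/YouTubeClips --out_dir $YT2T/r152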
Then rebuild the caption and file lists with the following script. Note that it shuffles the data
differently, and it tokenizes the captions similarly to what the authors report.
.. code-block:: python
import pickle
import os
import random
import string
seed = 2345
random.seed(seed)
YT = "youtube2text_iccv15"
SHUFFLE = True
with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
ann = pickle.load(f, encoding="latin-1")
def clean(caption):
caption = caption.lower()
caption = caption.replace("\n", " ").replace("\t", " ").replace("\r", " ")
# remove punctuation
caption = caption.translate(str.maketrans("", "", string.punctuation))
# multiple whitespace
caption = " ".join(caption.split())
return caption
with open(os.path.join(YT, "train.pkl"), "rb") as f:
train = pickle.load(f, encoding="latin-1")
with open(os.path.join(YT, "valid.pkl"), "rb") as f:
val = pickle.load(f, encoding="latin-1")
with open(os.path.join(YT, "test.pkl"), "rb") as f:
test = pickle.load(f, encoding="latin-1")
train_data = []
val_data = []
test_data = []
for vid_name, data in ann.items():
vid_path = vid_name + ".npy"
for i, d in enumerate(data):
split_name = vid_name + "_" + str(i)
datum = (vid_path, i, clean(d["caption"]))
if split_name in train:
train_data.append(datum)
elif split_name in val:
val_data.append(datum)
elif split_name in test:
test_data.append(datum)
else:
assert False
if SHUFFLE:
random.shuffle(train_data)
train_files = open("yt2t_train_files.txt", "w")
train_cap = open("yt2t_train_cap.txt", "w")
for vid_path, _, an in train_data:
train_files.write(vid_path + "\n")
train_cap.write(an + "\n")
train_files.close()
train_cap.close()
val_files = open("yt2t_val_files.txt", "w")
val_folded = open("yt2t_val_folded_files.txt", "w")
val_cap = open("yt2t_val_cap.txt", "w")
for vid_path, i, an in val_data:
if i == 0:
val_folded.write(vid_path + "\n")
val_files.write(vid_path + "\n")
val_cap.write(an + "\n")
val_files.close()
val_folded.close()
val_cap.close()
test_files = open("yt2t_test_files.txt", "w")
for vid_path, i, an in test_data:
# Don't need to save out the test captions,
# just the files. And, don't need to repeat
# it for each caption
if i == 0:
test_files.write(vid_path + "\n")
test_files.close()
Then preprocess the data with max-length filtering. (Note that you will be prompted to remove the
old data; do so, e.g. ``rm data/yt2t.*.pt``.)
.. code-block:: bash
onmt_preprocess -data_type vec -train_src $YT2T/yt2t_train_files.txt -src_dir $YT2T/r152/ -train_tgt $YT2T/yt2t_train_cap.txt -valid_src $YT2T/yt2t_val_files.txt -valid_tgt $YT2T/yt2t_val_cap.txt -save_data data/yt2t --shard_size 1000 --src_seq_length 50 --tgt_seq_length 20
Delete the old checkpoints and train a transformer model on this data.
.. code-block:: bash
rm -r yt2t-model_step_*.pt; onmt_train -data data/yt2t -save_model yt2t-model -world_size 2 -gpu_ranks 0 1 -model_type vec -batch_size 64 -train_steps 8000 -valid_steps 400 -save_checkpoint_steps 400 -optim adam -learning_rate .0001 -feat_vec_size 2048 -layers 4 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -dropout 0.3 -param_init 0 -param_init_glorot -report_every 400 --share_decoder_embedding --seed 7000
Note that we use the hyperparameters described in the paper.
We set ``-train_steps`` to approximate 20 epochs of training. Note that this estimate assumes
a world size of 2; if you use a different world size, scale ``-train_steps`` (and
``-save_checkpoint_steps``, along with the other step-based parameters) accordingly.
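As a rough sketch of that rescaling (using the numbers from the command above; the per-epoch
estimate simply assumes each training step consumes ``batch_size`` captions on each of the
``world_size`` GPUs):

.. code-block:: python

    # Numbers from the training command above.
    batch_size = 64
    world_size = 2
    train_steps = 8000
    checkpoint_steps = 400

    captions_per_step = batch_size * world_size                  # 128
    captions_per_epoch = train_steps * captions_per_step // 20   # ~51,200, roughly one pass over yt2t_train_cap.txt

    # Rescale the step-based settings for a different number of GPUs, e.g. a single GPU.
    new_world_size = 1
    scale = world_size / new_world_size
    print(int(train_steps * scale), int(checkpoint_steps * scale))  # 16000 800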
The batch size is not specified in the paper, so we pick one and save a checkpoint
once per estimated epoch. Sharing
the decoder embeddings is not mentioned either, although we find that it helps performance. Like the paper, we perform
"early stopping" on the COCO scores. We use beam search for the early-stopping evaluation,
although this too is not mentioned. You can reproduce our early stops with the scripts below
(run `find_val_stops.sh` and then `test_early_stops.sh`;
`process_results.py` is a dependency of `find_val_stops.sh`):
.. code-block:: python
:caption: `process_results.py`
import argparse
from collections import defaultdict
import pandas as pd
def load_results(fname="results.txt"):
index = []
data = []
with open(fname, "r") as f:
while True:
try:
filename = next(f).strip()
except:
break
step = int(filename.split("_")[-1].split(".")[0])
next(f) # blank
next(f) # spice junk
next(f) # length stats
next(f) # ratios
scores = {}
while True:
score_line = next(f).strip().strip("{").strip(",")
metric, score = score_line.split(": ")
metric = metric.strip("'")
score_num = float(score.strip("}").strip(","))
scores[metric] = float(score_num)
if score.endswith("}"):
break
next(f) # blank
next(f) # blank
next(f) # blank
index.append(step)
data.append(scores)
df = pd.DataFrame(data, index=index)
return df
def find_absolute_stops(df):
return df.idxmax()
def find_early_stops(df, stop_count):
maxes = defaultdict(lambda: 0)
argmaxes = {}
count_since_max = {}
ended_metrics = set()
for index, row in df.iterrows():
for metric, score in row.items():
if metric in ended_metrics:
continue
if score >= maxes[metric]:
maxes[metric] = score
argmaxes[metric] = index
count_since_max[metric] = 0
else:
count_since_max[metric] += 1
if count_since_max[metric] == stop_count:
ended_metrics.add(metric)
if len(ended_metrics) == len(row):
break
return pd.Series(argmaxes)
def find_stops(df, stop_count):
if stop_count > 0:
return find_early_stops(df, stop_count)
else:
return find_absolute_stops(df)
if __name__ == "__main__":
parser = argparse.ArgumentParser("Find locations of best scores")
parser.add_argument(
"-s", "--stop_count", type=int, default=0,
help="Stop after this many scores worse than running max (0 to disable).")
args = parser.parse_args()
df = load_results()
maxes = find_stops(df, args.stop_count)
for metric, idx in maxes.iteritems():
print(f"{metric} maxed @ {idx}")
print(df.loc[idx])
print()
.. code-block:: bash
:caption: `find_val_stops.sh`
rm results.txt
touch results.txt
for file in $( ls -1v yt2t-model_step*.pt )
do
echo $file
onmt_translate -model $file -src $YT2T/yt2t_val_folded_files.txt -output pred.txt -verbose -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 16 -max_length 20 >/dev/null 2>/dev/null
echo -e "$file\n" >> results.txt
python coco.py -s val >> results.txt
echo -e "\n\n" >> results.txt
done
python process_results.py -s 10 > val_stops.txt
.. code-block:: bash
:caption: `test_early_stops.sh`
rm test_results.txt
touch test_results.txt
while IFS='' read -r line || [[ -n "$line" ]]; do
if [[ $line == *"maxed"* ]]; then
metric=$(echo $line | awk '{print $1}')
step=$(echo $line | awk '{print $NF}')
echo $metric early stopped @ $step | tee -a test_results.txt
onmt_translate -model "yt2t-model_step_${step}.pt" -src $YT2T/yt2t_test_files.txt -output pred.txt -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 16 -max_length 20 >/dev/null 2>/dev/null
python coco.py -s 'test' >> test_results.txt
echo -e "\n\n" >> test_results.txt
fi
done < val_stops.txt
cat test_results.txt
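Both scripts call ``coco.py``, which is not shown above. It is simply the coco-caption evaluation
code from the recurrent section, wrapped so that a ``-s`` flag selects which file list ``pred.txt``
is scored against. A hypothetical version (the flag name and the split-to-file mapping are inferred
from the calls in the scripts above) could look like:

.. code-block:: python

    # coco.py -- a hypothetical wrapper around the evaluation code shown earlier.
    import argparse
    import os
    import pickle
    from pprint import pprint

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.spice.spice import Spice

    parser = argparse.ArgumentParser("Score pred.txt against a split's reference captions")
    parser.add_argument("-s", "--split", choices=["val", "test"], default="test")
    args = parser.parse_args()

    # Assumed mapping from split name to the file list that was passed to onmt_translate.
    split2files = {"val": "yt2t_val_folded_files.txt", "test": "yt2t_test_files.txt"}

    YT2T = os.environ["YT2T"]
    with open(os.path.join(YT2T, "youtube2text_iccv15", "CAP.pkl"), "rb") as f:
        ann = pickle.load(f, encoding="latin-1")
    vid2anns = {}
    for vid_name, data in ann.items():
        for d in data:
            vid2anns.setdefault(vid_name, []).append(d["tokenized"])

    gts, res = {}, {}
    with open("pred.txt") as pred, open(os.path.join(YT2T, split2files[args.split])) as files:
        for outp, filename in zip(pred, files):
            vid_id = os.path.splitext(filename.strip("\n"))[0]
            gts[vid_id] = vid2anns[vid_id]
            res[vid_id] = [outp.strip("\n")]

    scorers = {"Bleu": Bleu(4), "Meteor": Meteor(), "Rouge": Rouge(),
               "Cider": Cider(), "Spice": Spice()}
    scores = {}
    for name, scorer in scorers.items():
        score, _ = scorer.compute_score(gts, res)
        if isinstance(score, list):
            for i, sc in enumerate(score, 1):
                scores[name + str(i)] = sc
        else:
            scores[name] = score
    pprint(scores)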
Thus we test the checkpoint at step 2000 (selected by the METEOR early stop) and find the following scores::
Meteor early stopped @ 2000
SPICE evaluation took: 2.522 s
{'testlen': 3410, 'reflen': 3417, 'guess': [3410, 2740, 2070, 1400], 'correct': [2664, 1562, 887, 386]}
ratio: 0.9979514193734276
{'Bleu1': 0.7796296150773093,
'Bleu2': 0.6659837622637965,
'Bleu3': 0.5745524496015597,
'Bleu4': 0.4779574102543823,
'Cider': 0.7541600090591118,
'Meteor': 0.3259497476899707,
'Rouge': 0.6800279518634998,
'Spice': 0.046435637924854}
Note that our scores are an improvement over the recurrent approach.
The paper reports, on a 0-100 scale,
BLEU4 50.25, CIDEr 72.11, METEOR 33.41, and ROUGE 70.16.
Our CIDEr score is higher than the paper's (though, considering the sensitivity of this
metric, not by much), while the other metrics are slightly lower.
This could be indicative of an implementation difference. Note that Table 5 reports
24M parameters for a 2-layer transformer with ResNet inputs, while we find a few million fewer. This
could be due to generator or embedding differences, or perhaps to linear layers on the
residual connections. Alternatively, the difference could lie in the initial tokenization.
The paper reports 9,861 tokens, while we find fewer.
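One rough way to check the count on your side (a sketch, run from ``$YT2T``; it ignores any
frequency cutoff the paper may apply) is:

.. code-block:: python

    # Count distinct whitespace-separated tokens in the cleaned training captions.
    tokens = set()
    with open("yt2t_train_cap.txt") as f:
        for line in f:
            tokens.update(line.split())
    print(len(tokens), "unique training tokens")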
Part of this could be due to using
the annotations from the other repository, where perhaps some annotations have been
stripped. We also do not know the batch size or checkpoint frequency from the original
work.
Different random initializations could account for some of the difference, although
our random seed gives good results.
Overall, however, the paper's scores are nearly reproduced,
and ours compare favorably.