# WORD SENSE INDUCTION WITH KNOWLEDGE DISTILLATION FROM BERT
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Pre-trained contextual language models are ubiquitously employed for language
understanding tasks, but are unsuitable for resource-constrained systems. Non-contextual word embeddings are an efficient alternative in these settings. Such
methods typically use one vector to encode multiple different meanings of a word,
and incur errors due to polysemy. This paper proposes a two-stage method to distill
multiple word senses from a pre-trained language model (BERT) by using attention
over the senses of a word in a context and transferring this sense information to fit
multi-sense embeddings in a skip-gram-like framework. We demonstrate an effective approach to training the sense disambiguation mechanism in our model with a
distribution over word senses extracted from the output layer embeddings of BERT.
Experiments on the contextual word similarity and sense induction tasks show that
this method is superior to or competitive with state-of-the-art multi-sense embeddings on multiple benchmark data sets, and experiments with an embedding-based
topic model (ETM) demonstrate the benefits of using these multi-sense embeddings
in a downstream application.
1 INTRODUCTION
While modern deep contextual word embeddings have dramatically improved the state-of-the-art in
natural language understanding (NLU) tasks, shallow non-contextual representations of words are a more
practical solution in settings constrained by compute power or latency. In single-sense embeddings
such as word2vec or GloVe, the different meanings of a word are represented by the same vector,
which leads to the meaning conflation problem in the presence of polysemy. Arora et al. (2018)
showed that in the presence of polysemy, word embeddings are linear combinations of the different
sense embeddings. This leads to another limitation of word embeddings due to the triangle
inequality of vectors (Neelakantan et al., 2014). For apple, the distance between the words similar to
its two meanings, i.e. peach and software, will be less than the sum of their distances to apple. To
account for these and other issues that arise due to polysemy, several multi-sense embedding models
have been proposed to learn multiple vector representations for each word.
There are two major components for a multi-sense embedding model to be used in downstream tasks:
(1) sense selection to find the correct sense of a word in context (2) sense representation for the
different meanings of a word. Two types of tasks are used to evaluate these two distinct abilities of
a word sense embedding model: word sense induction and contextual word similarity respectively.
In word sense induction, a model has to select the correct meaning of a word given its context. For
contextual word similarity, the model must be able to embed word senses with similar meaning close
to each other in the embedding space. Prior works on multi-sense embeddings have not been shown
to be effective on these two tasks simultaneously (e.g. AdaGram (Bartunov et al., 2016) works well
on word sense induction but not on contextual word similarity). We propose a method to distill
knowledge from a contextual embedding model, BERT (Devlin et al., 2018), for learning a word
sense model that can perform both tasks effectively. Furthermore, we show the advantage of sense
embeddings over word embeddings for training topic models.
Pre-trained language models like BERT, GPT (Radford et al., 2019), RoBERTa (Liu et al., 2019)
etc. have been immensely successful in multiple natural language understanding tasks, but the
computational requirements and large latencies of these models inhibit their use in low-resource
systems. Static word embeddings or sense embeddings have significantly lower computational
requirements and lower latencies compared to these large transformer models. In this work, we
use knowledge distillation (Hinton et al., 2015) to transfer the contextual information in BERT for
improving the learning of multi-sense word embedding models. Although there are several prior
works on distilling knowledge to smaller transformer models, to the best of our knowledge there has
been no work on using knowledge distillation to improve the training of word or sense embeddings
using pre-trained language models.
The contributions of this paper are:
(i) A two-stage method for knowledge distillation from BERT to static word sense embeddings
(ii) A sense embedding model with state-of-the-art simultaneous performance on sense selection
and sense induction.
(iii) A demonstration of the increased efficacy of using these multi-sense embeddings in lieu of
single-sense embeddings in a downstream task, namely embedded topic modeling (Dieng et al.,
2020).
2 RELATED WORK
Here we describe a number of previous models trained to represent different meanings of words using
the word2vec framework. Prior works on model distillation using BERT are also discussed.
Reisinger & Mooney (2010) were the first to learn vector representations for multiple senses of a word.
They extracted feature vectors of a word in different contexts from the context words and clustered
them to generate sense clusters for a word. The cluster centroids are then used as sense embeddings
or prototypes. Huang et al. (2012) similarly clustered the context representations of a word using
spherical k-means, and learned sense embeddings by training a neural language model.
Neelakantan et al. (2014) extended the skip-gram model to learn the cluster centers and the sense
embeddings simultaneously. At each iteration, the context embedding is assigned to the nearest
cluster center and the sense embedding for that cluster is used to predict the context word. A
non-parametric method is employed to learn an appropriate number of clusters during training.
Tian et al. (2014) defined the probability of a context word given a center word as a mixture model
with word senses representing the mixtures. They developed an EM algorithm for learning the
embeddings by extending the skip-gram likelihood function. Li & Jurafsky (2015) used the Chinese
Restaurant Process to learn the number of senses while training the sense embeddings. They evaluated
their embeddings on a number of downstream tasks but did not find a consistent advantage in using
multi-sense embeddings over single-sense embeddings.
Athiwaratkun & Wilson (2017) represented words as multi-modal Gaussian distributions where the
different modes correspond to different senses of a word. To measure the similarity between two word
distributions, they used the expected likelihood kernel in place of cosine similarity for vectors.
Bartunov et al. (2016) proposed a non-parametric Bayesian extension to the skip-gram model to learn
the number of senses of words automatically. They used the Dirichlet Process for infinite mixture
modeling by assuming prior probabilities over the meanings of a word. Due to the intractability of the
likelihood function, they derived a stochastic variational inference algorithm for the model.
Grzegorczyk & Kurdziel (2018) learned sense disambiguation embeddings to model the probability
of different word senses in a context. They used the Gumbel-softmax method to approximate the
categorical distribution of sense probabilities. To learn the number of senses of a word, they added an
entropy loss and combined it with a pruning strategy.
Lee & Chen (2017) proposed the first sense embedding model using a reinforcement learning
framework. They learned only sense embeddings and an efficient sense selection module for assigning
the correct sense to a word in context.
Hinton et al. (2015) demonstrated the effectiveness of employing a large pre-trained teacher network
to guide the training of a smaller student network and named this method model distillation. Recently,
this technique has been used to train many efficient transformer models from BERT. Sun et al. (2019);
Sanh et al. (2019); Jiao et al. (2019); Sun et al. (2020) have trained smaller BERT models by distilling
knowledge from pre-trained BERT models down to fewer layers. However, there seems to be no prior
work on improving word or sense embeddings by using pre-trained language models like BERT.
Figure 1: Sense embedding model. (Diagram: the global embeddings $\mathbf{g}_{i\pm\Delta}$ of the context words are averaged into a context embedding $\mathbf{c}$; a softmax over $\langle \mathbf{d}_i^k, \mathbf{c} \rangle$ gives the sense probabilities $p(w_i = w_i^k \mid c)$ over the $K$ senses, and a sigmoid over $\langle \mathbf{g}_{i+l}, \mathbf{v}_i^k \rangle$ feeds the loss $L_{sense}$.)

3 MODEL
We first describe our word sense embedding model. Then the details of extracting knowledge from
the contextual embeddings in BERT are discussed.
3.1 SENSE EMBEDDING MODEL
We have a dataset $D = \{w_1, w_2, \ldots, w_N\}$, a sequence of $N$ words from a vocabulary $V$. For a word
$w_i$ in our vocabulary, we assume $K$ senses, and for the $k$-th sense we learn a sense embedding $\mathbf{v}_i^k$
and a disambiguation embedding $\mathbf{d}_i^k$. We also have a global embedding $\mathbf{g}_i$ for each word $w_i$.
Following the word2vec framework to train the sense embeddings, we slide a window of size
$2\Delta + 1$ (center word and $\Delta$ words to its left and right) over the dataset. In a context window
$C = \{w_{i-\Delta}, \ldots, w_i, \ldots, w_{i+\Delta}\}$, we define the conditional probability of a context word given the
center word $w_i$ as
$$p(w_{i+l} \mid w_i) = \sum_{k=1}^{K} p(w_{i+l} \mid w_i = w_i^k)\, p(w_i = w_i^k \mid w_i, C) \tag{1}$$
where $l \in [-\Delta, \Delta],\ l \neq 0$.
Here the probability of a context word $w_{i+l}$ given sense $w_i^k$ is
$$p(w_{i+l} \mid w_i = w_i^k) = \sigma(\langle \mathbf{g}_{i+l}, \mathbf{v}_i^k \rangle) \tag{2}$$
where $\sigma(x) = (1 + \exp(-x))^{-1}$ is the logistic sigmoid function.
The sense disambiguation vector selects the more relevant sense based on the context. The probability
of sense $k$ of word $w_i$ in context $C$ is defined similarly to the dot-product attention mechanism:
$$p(w_i = w_i^k \mid w_i, C) = \frac{\exp(\langle \mathbf{d}_i^k, \mathbf{c} \rangle)}{\sum_{k'=1}^{K} \exp(\langle \mathbf{d}_i^{k'}, \mathbf{c} \rangle)} \tag{3}$$
where $\mathbf{c}$ is the context embedding for context $C$, defined as the average of the global embeddings of
the context words:
$$\mathbf{c} = \frac{1}{2\Delta} \sum_{\substack{l=-\Delta \\ l \neq 0}}^{\Delta} \mathbf{g}_{i+l} \tag{4}$$
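For concreteness, the following is a minimal NumPy sketch of the sense selection mechanism (Eq. 3 and Eq. 4); the array names (`global_emb`, `disamb_emb`) and the toy shapes are illustrative assumptions, not part of our released code.

```python
import numpy as np

def context_embedding(global_emb, context_ids):
    """Eq. 4: average the global embeddings of the 2*Delta context words."""
    return global_emb[context_ids].mean(axis=0)

def sense_probs(disamb_emb, center_id, c):
    """Eq. 3: softmax over dot products between the K disambiguation
    embeddings of the center word and the context embedding c."""
    scores = disamb_emb[center_id] @ c             # shape (K,)
    scores -= scores.max()                         # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy usage: vocabulary of 1000 words, K = 3 senses, 300-dim embeddings.
V, K, D = 1000, 3, 300
rng = np.random.default_rng(0)
global_emb = rng.normal(size=(V, D))
disamb_emb = rng.normal(size=(V, K, D))
context_ids = np.array([12, 7, 99, 4])             # the 2*Delta context words
c = context_embedding(global_emb, context_ids)
print(sense_probs(disamb_emb, center_id=42, c=c))  # K sense probabilities
```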
**Iterative sense disambiguation** Instead of using the global embeddings of the context words, we
can use their sense embeddings and sense probabilities to obtain a context embedding with more
contextual information. The steps are as follows:
**Step 1** Using the context embedding from Eq. 4, compute the sense probability $p(w_{i+l} = w_{i+l}^k \mid w_{i+l}, C)$ for each word in the context ($l \in [-\Delta, \Delta],\ l \neq 0$).

**Step 2** Replace the global embeddings in Eq. 4 with the weighted average of the sense embeddings
for each context word:
$$\mathbf{c} = \frac{1}{2\Delta} \sum_{\substack{l=-\Delta \\ l \neq 0}}^{\Delta} \sum_{k=1}^{K} \mathbf{v}_{i+l}^k\, p(w_{i+l} = w_{i+l}^k \mid w_{i+l}, C) \tag{5}$$
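The two refinement steps can be sketched in the same style; `sense_emb` with shape (V, K, D) is again an illustrative assumption.

```python
import numpy as np

def refined_context_embedding(global_emb, sense_emb, disamb_emb, context_ids):
    """Iterative sense disambiguation (Eq. 5) for one context window.

    global_emb: (V, D) global embeddings g
    sense_emb:  (V, K, D) sense embeddings v
    disamb_emb: (V, K, D) disambiguation embeddings d
    context_ids: indices of the 2*Delta context words
    """
    # Step 1: sense probabilities of every context word under the Eq. 4 context vector.
    c0 = global_emb[context_ids].mean(axis=0)                       # Eq. 4
    scores = disamb_emb[context_ids] @ c0                           # (2*Delta, K)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # Step 2: probability-weighted sense embeddings, averaged over the window.
    weighted = np.einsum('lk,lkd->ld', probs, sense_emb[context_ids])
    return weighted.mean(axis=0)                                    # Eq. 5
```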
Similar to the skip-gram model (Mikolov et al., 2013), we employ the negative sampling method
to reduce the computational cost of a softmax over a huge vocabulary. The loss function for the
multi-sense embedding model is as follows
$$L_{sense} = -\sum_{w_i \in D} \sum_{\substack{l=-\Delta \\ l \neq 0}}^{\Delta} \Big\{ \log\big[p(w_{i+l} \mid w_i)\big] + \sum_{j=1}^{n} \log\big[1 - p(w_{i+l,j} \mid w_i)\big] \Big\} \tag{6}$$
where $n$ is the number of negative samples and $w_{i+l,j}$ is the $j$-th negative sample for the context
word $w_{i+l}$.
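A hedged PyTorch sketch of the per-pair negative-sampling loss in Eq. 6, marginalizing over senses as in Eq. 1 and Eq. 2; the tensor arguments are assumed to be rows of the embedding tables described above.

```python
import torch

def skipgram_sense_loss(g_pos, g_neg, v_center, sense_p):
    """Negative-sampling loss (Eq. 6) for one (center word, context word) pair.

    g_pos:    (D,)   global embedding of the observed context word
    g_neg:    (n, D) global embeddings of the n negative samples
    v_center: (K, D) sense embeddings of the center word
    sense_p:  (K,)   sense probabilities from Eq. 3 (or Eq. 5)
    """
    # p(context | center): marginalise Eq. 2 over senses as in Eq. 1.
    p_pos = (sense_p * torch.sigmoid(v_center @ g_pos)).sum()
    p_neg = (sense_p.unsqueeze(0) * torch.sigmoid(g_neg @ v_center.T)).sum(dim=1)
    return -(torch.log(p_pos + 1e-9) + torch.log(1.0 - p_neg + 1e-9).sum())
```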
3.2 TRANSFER KNOWLEDGE FROM BERT
We use the contextual embeddings from BERT to improve our sense embedding model in 2 steps.
First, we learn the sense embeddings as a linear decomposition of the BERT contextual embedding
in a context window. Second, the sense embeddings from the first step are used for knowledge
distillation to our sense embedding model.
3.2.1 BERT SENSE MODEL
Let the sense embeddings of word $w_i$ trained from BERT be $\{\mathbf{u}_i^1, \ldots, \mathbf{u}_i^K\}$ and the contextual
embedding for word $w_i$ be $\mathbf{b}_i$, the embedding from the output layer of BERT. Since BERT uses
WordPiece tokenization, some words are divided into smaller tokens, so each word may have multiple
'wordpiece' tokens and output embeddings. To get the contextual embedding $\mathbf{b}_i$ from these token
embeddings, we take the mean of all token output embeddings for a word. Similar to Eq. 3, the
probability of sense, k of word, wi in context C is
exp( **d[k]i** _[,][ c][BERT][⟩][)]_
_p(wi = wi[k]_ _K_ _⟨_ (7)
_[|][w][i][, C][) =]_ _k=1_ [exp(][⟨][d]i[k][,][ c][BERT][⟩][)]
where the context embedding is
$$\mathbf{c}_{BERT} = \frac{1}{2\Delta} \sum_{\substack{l=-\Delta \\ l \neq 0}}^{\Delta} \mathbf{b}_{i+l}$$
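As an illustration of the pooling step described above, here is a short sketch using the HuggingFace transformers library that averages the WordPiece output embeddings of each word into one contextual embedding $\mathbf{b}_i$; the model name and example sentence are illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentence = "macintosh is a family of computers from apple"
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state[0]      # (num_wordpiece_tokens, 768)

word_ids = enc.word_ids()                          # token index -> word index (None for [CLS]/[SEP])
num_words = max(w for w in word_ids if w is not None) + 1
b = torch.stack([
    hidden[[t for t, w in enumerate(word_ids) if w == i]].mean(dim=0)
    for i in range(num_words)
])                                                 # (num_words, 768): one contextual embedding b_i per word
```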
We want the weighted sum of the sense embeddings to be similar to the contextual embedding from
BERT. So, the loss function is
$$L_{BERT} = \sum_{w_i \in D} \Big\| \sum_{k=1}^{K} p(w_i = w_i^k \mid w_i, C)\, \mathbf{u}_i^k - \mathbf{b}_i \Big\|^2 \tag{8}$$
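A minimal sketch of the loss in Eq. 8 for one word occurrence, assuming the squared L2 form implied by the MSE training mentioned in Sec. 4.5.

```python
import torch

def bert_sense_loss(u, sense_p, b):
    """Eq. 8 for one word occurrence: the probability-weighted sum of the
    BERT-derived sense embeddings should reconstruct the contextual embedding.

    u:       (K, 768) sense embeddings u_i^k learned from BERT
    sense_p: (K,)     sense probabilities from Eq. 7
    b:       (768,)   contextual embedding b_i from the BERT output layer
    """
    recon = sense_p @ u                      # weighted sum over the K senses
    return ((recon - b) ** 2).sum()          # squared L2 error (assumed MSE form)
```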
3.2.2 KNOWLEDGE DISTILLATION FROM BERT
Here we combine our multi-sense embedding model and the sense embeddings from BERT to transfer
the contextual information. To be precise, we want to improve the sense disambiguation module (3)
of the multi-sense embedding model using BERT. The key idea of knowledge distillation is that the
output of a complex network contains more fine-grained information than a class label, so we
want the student network to produce probabilities similar to those of the teacher network. Hinton et al. (2015)
suggested using a "softer" version of the output from the teacher network, obtained with a temperature $T$ over
the output logits $z_i$:
$$p_i^T(x) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where $x$ is the input.
The loss for the knowledge transfer is the combination of the original loss (Eq. 6) and the cross-entropy loss between the sense disambiguation probability from the sense embedding model (Eq. 3)
and the BERT model (Eq. 7).
$$L = L_{sense} + \alpha L_{transfer} \tag{9}$$
where
$$L_{transfer} = -T^2 \sum_{k=1}^{K} p^T_{sense}(w_i = w_i^k \mid w_i, C) \log\big(p^T_{BERT}(w_i = w_i^k \mid w_i, C)\big) \tag{10}$$
is the cross-entropy loss between the distilled sense distribution and the BERT sense distribution.
We multiply this loss by $T^2$ because the magnitudes of the gradients in the distillation loss scale by
$1/T^2$.
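A minimal PyTorch sketch of Eq. 9 and Eq. 10 as stated, with the sense-model distribution weighting the log of the temperature-softened BERT distribution; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def transfer_loss(sense_logits, bert_logits, T=4.0):
    """Eq. 10 as stated: -T^2 * sum_k p^T_sense(k) * log p^T_BERT(k),
    with both distributions softened by temperature T.
    sense_logits, bert_logits: (K,) unnormalised sense scores."""
    p_sense = F.softmax(sense_logits / T, dim=-1)
    log_p_bert = F.log_softmax(bert_logits / T, dim=-1)
    return -(T ** 2) * (p_sense * log_p_bert).sum()

def total_loss(l_sense, sense_logits, bert_logits, alpha=1.0, T=4.0):
    """Eq. 9: L = L_sense + alpha * L_transfer."""
    return l_sense + alpha * transfer_loss(sense_logits, bert_logits, T)
```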
4 EXPERIMENTS
We conduct both qualitative and quantitative experiments on our trained sense embedding model.
4.1 DATASET AND TRAINING
We trained our models on the April 2010 snapshot of Wikipedia (Shaoul & Westbury, 2010) for fair
comparison with the prior multi-sense embedding models. This corpus has over 2 million documents
and about 990 million words. We also used the same vocabulary set as Neelakantan et al. (2014).
We selected the hyperparameters of our models by training them on a small part of the corpus and
using the loss metric on a validation set. For training, we set the context window size to ∆= 5. We
use 10 negative samples for each positive (word, context) pair. We do not apply subsampling like
word2vec since the BERT model was pre-trained on sentences without subsampling. We learn 3
sense embeddings for each word with a dimension of 300 for comparison with prior work. We use
the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 0.001.
The base model of BERT is used as the pre-trained language model. The PyTorch transformers (Wolf
et al., 2019) library is used for inference through the BERT-base model. With a batch size of 2048, it
takes 10 hours to train the sense embedding model on the Wikipedia corpus for one epoch on a Tesla
V100 GPU. The sense embedding models with and without knowledge distillation are trained for 2
epochs over the full Wikipedia dataset. The weight for the knowledge distillation loss (Eq. 9) α = 1
is selected from {0.1, 0.5, 1.0}. Temperature parameter T is set to 4 by tuning over the range [2,6].
4.2 NEAREST NEIGHBOURS
The nearest neighbours of words in context are shown in Table 1. Given the context words, the
context embedding is computed using Eq. 5 and sense probability distribution is computed with Eq.
3. The most probable sense is selected and then the cosine distance is computed with the global
embeddings of all words in the vocabulary. Table 1 also shows the values of the sense probability for
the word in context. The sense probabilities are close to 1 in most of the cases. This indicates that the
model learned a narrow distribution over the senses and it selects a single sense most of the time.
4.3 WORD SENSE INDUCTION
This task evaluates the model’s ability to select the appropriate sense given a context. To assign
a sense to a word in context, we take the words in its context window and compute the context
embedding using Eq. 5. Then the sense probability from Eq. 3 gives the most probable sense for that
context.
We follow Bartunov et al. (2016) for the evaluation dataset and metrics. Three WSI datasets are used
in our experiments. The SemEval-2007 dataset is from SemEval-2007 Task 2 competition (Agirre &
Soroa, 2007). It contains 27232 contexts for 100 words collected from the Wall Street Journal (WSJ)
corpus. The SemEval-2010 dataset is from the SemEval-2010 Task 14 competition (Manandhar &
|Context|Sense : Prob.|Nearest Neighbours|
|---|---|---|
|macintosh is a family of computers from apple|1 : 0.99|apple, amstrad, macintosh, commodore, dos|
|apple software includes mac os x|2 : 0.963|microsoft, ipod, announced, software, mac|
|apple is good for health|3 : 0.999|apple, candy, strawberry, cherry, blueberry|
|the power plant is closing|3 : 1.0|mill, plant, cogeneration, factory, mw|
|the science of plant life|2 : 1.0|species, plant, plants, lichens, insects|
|the seed of a plant|1 : 0.99|thistle, plant, gorse, salvia, leaves|
|the cell membranes of an organism|3 : 1.0|depolarization, contractile, hematopoietic, follicular, apoptosis|
|almost everyone has a cell phone now|2 : 0.99|phone, cell, phones, mobile, each|
|prison cell is not a pleasant place|1 : 0.99|prison, room, bomb, arrives, jail|
Table 1: Top 5 nearest neighbours of words in context based on cosine similarity. We also report the
index of the selected sense and its probability in that context.
Klapaftis, 2009) which contains 8915 contexts for 100 words. Finally, the SemEval-2013 dataset
comes from the SemEval-2013 Task 13 (Jurgens & Klapaftis, 2013) with 4664 contexts for 50 words.
For evaluation, we cluster the different contexts of a word according to the model’s predicted sense.
To evaluate the clustering of a set of points, we take pairs of points in the dataset and check if they
are in the same cluster according to the ground truth. For the SemEval competitions, two traditional
metrics were used: F-score and V-measure. But these metrics have intrinsic biases: F-score
tends to be higher for clusterings with fewer clusters, while V-measure tends to be higher for clusterings with a
large number of clusters. So Bartunov et al. (2016) used ARI (Adjusted Rand Index) as the clustering
evaluation method, since it does not suffer from these problems.
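As a usage note, ARI can be computed directly with scikit-learn; the label lists below are toy values, not drawn from the benchmark data.

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: gold sense labels vs. the model's predicted sense indices
# for several contexts of one target word (values are illustrative).
gold = [0, 0, 1, 1, 2, 2, 2]
pred = [1, 1, 0, 0, 2, 2, 0]
print(adjusted_rand_score(gold, pred))   # 1.0 = perfect agreement, ~0 = chance level
```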
We report the ARI scores for MSSG (Neelakantan et al., 2014), MPSG (Tian et al., 2014) and
AdaGram (Bartunov et al., 2016) from Bartunov et al. (2016). Scores for disambiguated skip-gram
are collected from Grzegorczyk & Kurdziel (2018). We used the trained embeddings published by
MUSE (Lee & Chen, 2017) with the Boltzmann method for this task. Single-sense models like skip-gram
cannot be used in this task since the task is to assign different senses to different contexts of a word.
Contextual models like BERT give an embedding for a word given its context, but this embedding is
not associated with a specific discrete meaning of the word. Therefore BERT models are not suitable
for comparison on this task.
MUSE shows strong performance in word similarity tasks (section 4.4) but fails to induce meaningful
senses given a context. The Disambiguated Skip-gram model was the best model on SemEval-2007
and SemEval-2010 datasets. Our sense embedding model performs better than MSSG and MPSG
models but there is a large gap with the state-of-the-art. We can see a significant improvement in ARI
score (1.25x − 2.25x) on the 3 datasets with knowledge distillation from BERT.
|Model|SE-2007|SE-2010|SE-2013|
|---|---|---|---|
|MSSG|0.048|0.085|0.033|
|MPSG|0.044|0.077|0.033|
|AdaGram|0.069|0.097|0.061|
|MUSE|0.0009|0.01|0.006|
|Disamb. SG|0.077|0.117|0.045|
|SenseEmbed|0.053|0.076|0.054|
|BERTSense|0.179|0.179|0.155|
|BERTKDEmbed|0.145|0.144|0.133|
Table 2: ARI (Adjusted Rand Index) scores for word sense induction tasks. All models have a
dimension of 300. The AdaGram and Disambiguated skip-gram models learn a variable number of
senses while all other models learn a fixed number of senses per word. All baseline results except
MUSE were collected from Grzegorczyk & Kurdziel (2018). SenseEmbed - sense embedding model
without distillation, BERTSense - BERT sense model (Sec. 3.2.1), BERTKDEmbed - sense
embedding with distillation.
4.4 CONTEXTUAL WORD SIMILARITY
Here, the task is to predict the similarity of the meanings of two words given the contexts in which they are used.
We use the Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012). This dataset
contains 2003 word pairs with their contexts, selected from Wikipedia. Spearman’s rank correlation
_ρ is used to measure the degree of agreement between the human similarity scores and the model’s_
similarity scores.
We use the following methods for predicting the similarity of two words in context for a sense embedding
model, following Reisinger & Mooney (2010): AvgSimC and MaxSimC. The AvgSimC measure
weights the cosine similarity of all combinations of sense embeddings of the two words by the probability
of those senses given their respective contexts.
$$\text{AvgSimC}(w_1, w_2) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} p(w_1 = w_1^i \mid w_1, C)\, p(w_2 = w_2^j \mid w_2, C) \cos(\mathbf{v}_1^i, \mathbf{v}_2^j) \tag{11}$$
Here, the cosine similarity between sense $i$ of word 1 and sense $j$ of word 2 is multiplied by the
probability of these two senses given their contexts.
The MaxSimC method selects the best sense for each context based on their probabilities and measures
their cosine similarity.
$$\text{MaxSimC}(w_1, w_2) = \cos(\mathbf{v}_1^i, \mathbf{v}_2^j) \tag{12}$$
where
$$i = \arg\max_{k=1,\ldots,K} p(w_1 = w_1^k \mid w_1, C) \quad \text{and} \quad j = \arg\max_{k=1,\ldots,K} p(w_2 = w_2^k \mid w_2, C) \tag{13}$$
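A NumPy sketch of the two measures, assuming (K, D) sense embedding matrices and (K,) sense probability vectors for the two words; the function names are illustrative.

```python
import numpy as np

def avg_sim_c(v1, p1, v2, p2):
    """AvgSimC (Eq. 11): probability-weighted average cosine similarity over
    all K x K sense pairs, normalised by K^2.
    v1, v2: (K, D) sense embeddings; p1, p2: (K,) sense probabilities."""
    n1 = v1 / np.linalg.norm(v1, axis=1, keepdims=True)
    n2 = v2 / np.linalg.norm(v2, axis=1, keepdims=True)
    cos = n1 @ n2.T                                  # (K, K) cosine similarities
    K = v1.shape[0]
    return float((p1[:, None] * p2[None, :] * cos).sum() / K ** 2)

def max_sim_c(v1, p1, v2, p2):
    """MaxSimC (Eq. 12-13): cosine similarity of the most probable senses."""
    i, j = int(p1.argmax()), int(p2.argmax())
    return float(v1[i] @ v2[j] / (np.linalg.norm(v1[i]) * np.linalg.norm(v2[j])))
```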
For comparison with baselines on this task, we use the single-sense skip-gram model (Mikolov et al.,
2013) with 300 and 900 dimensions since our word sense model has 3 embeddings per word with
300 dimensions each. We use the contextual models (BERT, DistilBERT) by selecting the token
embedding from the output layer for the word in a context window of 10 words to the left and right.
We report the score for the best performing model from each method in Table 3. For MUSE, the best
model (Boltzmann method) is used.[1] Our sense embedding model (SenseEmbed) is competitive on the
AvgSimC measure whereas the knowledge distillation method significantly improves the MaxSimC
score. We can infer from this result that the improvement in the sense induction mechanism helped the model
learn better embeddings for each word sense. The best performing multi-sense embedding model's
performance is very close to the 900 dimensional skip-gram embeddings (69.3 vs 68.4). The best
performing multi-sense model considering the average over the 2 measures is MUSE with an average
of 68.2. But it does not do well on the word sense induction task. Both contextual embedding models
(BERT and DistilBERT (Sanh et al., 2019)) obtained lower scores than our sense embedding model.
|Model|AvgSimC|MaxSimC|
|---|---|---|
|Skip-Gram (300D)|65.2|65.2|
|Skip-Gram (900D)|68.4|68.4|
|MSSG|69.3|57.26|
|MPSG|65.4|63.6|
|AdaGram|61.2|53.8|
|MUSE|68.8|67.6|
|BERT-base|64.26|64.26|
|DistilBERT-base|66.51|66.51|
|SenseEmbed|67.6|54.7|
|BERTSense|62.6|64|
|BERTKDEmbed|68.6|67.2|
Table 3: Spearman rank correlation score (ρ × 100) on the SCWS dataset. Skip-gram is the only
single-sense embedding model. All other embedding models are 300 dimensional. Skip-gram results
are reported in Bartunov et al. (2016), MUSE score is obtained using their published model weights
and for other models the results are from the respective papers.
4.5 EFFECTIVENESS OF KNOWLEDGE DISTILLATION
We can use the intermediate sense embeddings $\{\mathbf{u}_i^1, \ldots, \mathbf{u}_i^K\}$ from the BERT Sense model (Sec.
3.2.1) on the word sense induction and contextual word similarity tasks. Since the BERT sense embeddings are trained using the output embeddings from BERT, we can use them as the context embedding
for sense induction.
[1] https://github.com/MiuLab/MUSE; we used the trained embeddings published there.
In the knowledge distillation method, we train our sense embedding model to mimic the sense
distribution from the BERT Sense model. Since the WSI task evaluates the predicted sense of the
model, it is the ideal task to quantify the quality of knowledge transfer from BERT to the sense
embedding model. The BERT Sense model is compared with the sense embedding model trained with
knowledge distillation in Table 4. As the BERT Sense model is trained with an MSE (Mean Squared
Error) loss, its sense embeddings have the same dimensionality as BERT (768). From Table 4, we
see that the 300-dimensional sense embedding model obtains 75% to 88% of the ARI scores without
using the 12-layer transformer model with 768-dimensional embeddings.
|Model|SE-2007|SE-2010|SE-2013|
|---|---|---|---|
|BERTSense|0.195|0.163|0.155|
|BERTKDEmbed|0.145|0.144|0.133|
Table 4: Word sense induction using the BERT Sense model (768D) and knowledge distilled sense
embedding model (300D). ARI (Adjusted Rand Index) scores are reported.
|Model|AvgSimC|MaxSimC|
|---|---|---|
|BERTSense|62.6|64|
|BERTKDEmbed|68.6|67.2|
Table 5: Comparison of BERT sense model with sense embedding trained with knowledge distillation.
Spearman rank correlation (ρ × 100) scores are reported on the SCWS dataset.
We also compare the BERT sense model with the sense embedding model on the contextual word
similarity task in Table 5. The lower correlation score of the BERT Sense model indicates that its
embeddings are weaker at sense representation, although they learn to disambiguate senses. The
loss function in Eq. 8 helps the BERT Sense model to extract contextual sense information in BERT
output embeddings, but it is not optimal for training sense embeddings. The sense embedding model
combined with the knowledge gained from the BERT sense model learns to do both i) sense selection
and ii) sense representation effectively.
4.6 EMBEDDED TOPIC MODEL
As sense embeddings help discern the meaning of a word in context, prior work has shown the benefits
of using them to perform topic modeling in the embedding space (Dieng et al., 2020). We explore the
further benefits of using embeddings which model the fact that one token can have multiple meanings.
Given a word embedding matrix $\rho$ where the column $\rho_v$ is the embedding of word $v$, the Embedded
Topic Model (ETM) follows the generative process:

1. Draw topic proportions $\theta_d \sim \mathcal{LN}(0, I)$
2. For each word $n$ in the document:
   (a) Draw topic assignment $z_{dn} \sim \text{Cat}(\theta_d)$
   (b) Draw the word $w_{dn} \sim \text{softmax}(\rho^T \alpha_{z_{dn}})$
Here, αk is the topic embedding for topic k. These topic embeddings are used to obtain per-topic
distribution over the vocabulary of words.
We experimentally compare the performance of ETM models trained[2] using skip-gram word embeddings, our sense embeddings and MSSG (Neelakantan et al., 2014) embeddings. Three senses
were trained for each word in both of the multi-sense models. A 15K document training set, 4.9K
test set, and 100 document validation set are sampled from the Wikipedia data set and used for the
experiments. During training using our sense embeddings and the MSSG multi-sense embeddings,
the sentences are represented using a "bag-of-sense" representation where each word is replaced
with its most probable sense, selected using a context window of 10 words; with the skip-gram
embeddings, bag-of-words representations are used.
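A hedged sketch of this bag-of-sense preprocessing; `most_probable_sense` stands in for the argmax over Eq. 3 / Eq. 5 and is a hypothetical helper, not part of the released code.

```python
def to_bag_of_sense(tokens, most_probable_sense, window=10):
    """Replace every token with a sense-tagged identifier 'word#k', where k is
    the index of its most probable sense given a +/- `window`-word context."""
    tagged = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        k = most_probable_sense(word, context)   # hypothetical: argmax over Eq. 3 / Eq. 5
        tagged.append(f"{word}#{k}")
    return tagged
```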
For evaluation, we measured the document completion perplexity, topic coherence, and topic diversity
for different vocabulary sizes. Document perplexity measures how well the topic model can be used
[2] Using code published by the authors of the original ETM model; see https://github.com/adjidieng/ETM.
to predict the second half of a document, given its first half. Topic coherence measures the probability
of the top words in each topic to occur in the same documents, while topic diversity is a measure of
the uniqueness of the most common words in each topic. Higher topic coherence and topic diversity
indicate that a model performs better at extracting distinct topics; the product of the two measures is
used as a measure of topic quality:
topic quality = topic coherence × topic diversity
We follow (Dieng et al., 2020) in reporting the ratio of perplexity and topic quality as a measure of
the performance of the topic model: a lower ratio is better.
We set the minimum document frequency to [5, 10, 30, 100] to change the vocabulary size. ETM is
initialized with pre-trained sense embeddings and then fine-tuned along with the topic embeddings.
We followed the hyperparameters reported in Dieng et al. (2020) to train all models. Since we use a
bag-of-sense model, ETM provides a probability distribution over the senses. We sum the probability
for the different senses of a word to obtain the probability for a word. In Figure 2, we show the ratio
of the document completion perplexity (normalized by vocabulary size) and the topic quality.
[Figure 2 plots quality-normalized perplexity against vocabulary size (10,000 to 80,000) for ETM, ETM (sense), ETM (mssg), and LDA.]
Figure 2: Comparison of word embeddings and sense embeddings for topic modeling with different
vocabulary sizes. Here, ETM (sense) - ETM with BERTKDEmbed, ETM (mssg) - ETM with MSSG.
We see that both MSSG and BERTKDEmbed ETM models obtain lower perplexity-quality ratio
for higher vocabulary sizes compared to the single-sense (skip-gram) ETM model. This confirms
that multi-sense embedding models produce Embedded Topic Models with higher performance.
BERTKDEmbed ETM also outperforms the other multi-sense ETM model.
5 CONCLUSION
We present a multi-sense word embedding model for representing the different meanings of a word
using knowledge distillation from a pre-trained language model (BERT). Unlike prior work on
model distillation from contextual language models, we focus on extracting non-contextual word
senses; this is accomplished in an unsupervised manner using an intermediate attention-based sense
disambiguation model. We demonstrate state-of-the-art results on three word sense induction data sets
using fast and efficient sense embedding models. This is the first multi-sense word embedding model
with strong performance on both the word sense induction and contextual word similarity tasks. As a
downstream application, we show that the sense embedding model increases the performance of the
recent embedding-based topic model (Dieng et al., 2020). In future work, we will explore the benefits of
using pre-trained multi-sense embeddings as features for other downstream tasks, and tackle the task
of tuning the number of word meanings for each individual word.
REFERENCES
Eneko Agirre and Aitor Soroa. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations
(SemEval-2007), pp. 7–12, Prague, Czech Republic, June 2007. Association for Computational
Linguistics. URL https://www.aclweb.org/anthology/S07-1002.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic
structure of word senses, with applications to polysemy. Transactions of the Association for
Computational Linguistics, 6:483–495, 2018. doi: 10.1162/tacl_a_00034. URL
https://www.aclweb.org/anthology/Q18-1034.
Ben Athiwaratkun and Andrew Wilson. Multimodal word distributions. In Proceedings of the 55th
_Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp._
1645–1656, 2017.
Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. Breaking sticks and
ambiguities with adaptive skip-gram. In artificial intelligence and statistics, pp. 130–138, 2016.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Adji B Dieng, Francisco JR Ruiz, and David M Blei. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439–453, 2020.
Karol Grzegorczyk and Marcin Kurdziel. Disambiguated skip-gram model. In Proceedings of
_the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1445–1454,_
Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:
10.18653/v1/D18-1174. URL https://www.aclweb.org/anthology/D18-1174.
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
In NIPS Deep Learning and Representation Learning Workshop, 2015. URL
http://arxiv.org/abs/1503.02531.
Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word
representations via global context and multiple word prototypes. In Proceedings of the 50th Annual
_Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882._
Association for Computational Linguistics, 2012.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351,
2019.
David Jurgens and Ioannis Klapaftis. SemEval-2013 task 13: Word sense induction for graded and
non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM),
_Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval_
_2013), pp. 290–299, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics._
URL https://www.aclweb.org/anthology/S13-2049.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
_arXiv:1412.6980, 2014._
Guang-He Lee and Yun-Nung Chen. Muse: Modularizing unsupervised sense embeddings. In
_Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp._
327–337, 2017.
Jiwei Li and Dan Jurafsky. Do multi-sense embeddings improve natural language understanding? In
_Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp._
1722–1732, 2015.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Suresh Manandhar and Ioannis Klapaftis. SemEval-2010 task 14: Evaluation setting for word
sense induction & disambiguation systems. In Proceedings of the Workshop on Semantic Evalua_tions: Recent Achievements and Future Directions (SEW-2009), pp. 117–122, Boulder, Colorado,_
June 2009. Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/W09-2419.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations
of words and phrases and their compositionality. In Advances in neural information processing
_systems, pp. 3111–3119, 2013._
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient nonparametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014
_Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1059–1069,_
2014.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Joseph Reisinger and Raymond J Mooney. Multi-prototype vector-space models of word meaning.
In Human Language Technologies: The 2010 Annual Conference of the North American Chapter
_of the Association for Computational Linguistics, pp. 109–117. Association for Computational_
Linguistics, 2010.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
Cyrus Shaoul and Chris Westbury. The westbury lab wikipedia corpus. 2010.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model
compression. arXiv preprint arXiv:1908.09355, 2019.
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a
compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. A probabilistic
model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th
_International Conference on Computational Linguistics: Technical Papers, pp. 151–160, 2014._
Thomas Wolf, L Debut, V Sanh, J Chaumond, C Delangue, A Moi, P Cistac, T Rault, R Louf,
M Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.
_ArXiv, abs/1910.03771, 2019._