|
## YOUR AUTOREGRESSIVE GENERATIVE MODEL CAN BE BETTER IF YOU TREAT IT AS AN ENERGY-BASED ONE |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Autoregressive generative models are commonly used, especially for those tasks |
|
involving sequential data. They have, however, been plagued by a slew of inherent |
|
flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., |
|
exposure bias or lack of long-range coherence), severely limiting their ability to |
|
model distributions properly. In this paper, we propose a unique method for training the autoregressive generative model that takes advantage of a well-designed |
|
energy-based learning objective. We show that our method is capable of alleviating the exposure bias problem and increase temporal coherence by imposing a |
|
constraint which fits joint distributions at each time step. Besides, unlike former |
|
energy-based models, we estimate energy scores based on the underlying autoregressive network itself, which does not require any extra network. Finally, thanks |
|
to importance sampling, we can train the entire model efficiently without requiring |
|
an MCMC process. Extensive empirical results, covering benchmarks like language modeling, neural machine translation, and image generation, demonstrate |
|
the effectiveness of the proposed approach. |
|
|
|
1 INTRODUCTION |
|
|
|
By factorizing the joint distribution into the product of a series of conditional distributions, autoregressive generative models (abbr. ARGMs) (Vaswani et al., 2017; Dai et al., 2019; van den Oord |
|
et al., 2016a;b; Salimans et al., 2017; Chen et al., 2018) simplify the difficult challenge of modeling |
|
high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and |
|
generate samples of exceptional quality, making this technique popular for modeling distributions, |
|
especially for sequential data. Nonetheless, despite their potency and flexibility, ARGMs still have |
|
inherent weaknesses due to the intrinsic characteristics of chain-style conditional modeling. For |
|
example, ARGMs usually suffer from a discrepancy of the input context distributions between the |
|
training and inference stages, which causes consequent error propagation (i.e., Exposure Bias (Ranzato et al., 2016; Bengio et al., 2015)). Besides, due to the nature of greedy selection of beam |
|
search approximations, the decoded results from ARGMs may also lack in long-range coherence. |
|
We consider one approach by which ARGMs could be adapted to reduce these concerns. |
|
|
|
Earlier work, both heuristic and theoretical, has already been proposed with those goals. For instance, the exposure bias problem of ARGMs can be alleviated to some extent with scheduled |
|
sampling (Bengio et al., 2015; Mihaylova & Martins, 2019), by mixing input contexts from both |
|
real data and autoregressive generation, during the training stage. However, this scheme suffers |
|
from an over-correcting problem (Zhang et al., 2019). In addition, at the inference stage, beam |
|
search makes it possible to choose more diverse candidates, improving the quality of generated sequences. Nevertheless, this results in only marginal improvements in temporal coherence, since |
|
ARGMs can only leverage previous decoded contexts without consideration of the whole sequence |
|
information. Moreover, setting aside the difficulty in training them, energy-based models (EBMs) |
|
have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications (Zhao et al., 2017; Arbel et al., 2021; Gao et al., 2021), without requiring |
|
the transformation of the target distribution into a product of conditional distributions. As a result, |
|
several studies (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) attempt to combine |
|
EBMs with ARGMs, expecting to benefit from the strengths of both approaches. However, though |
|
some positive results were obtained, the existing works preferred a two-stage optimization, which |
|
first obtains a well-trained ARGM and then trains an additional EBM based on it. Such an optimiza |
|
|
|
----- |
|
|
|
tion strategy does not enable ARGMs to benefit from the properties of EBM in modeling the joint |
|
distribution in a temporally more coherent way. |
|
|
|
In this paper, we present a novel design for seamlessly integrating Energy-based models into |
|
**AutoRegressive Models (E-ARM). Our training is based on an energy-based learning objective,** |
|
which forces ARGMs training to fit the joint distribution along with the conditional one at each time |
|
step. Thanks to our well-designed energy function, the two involved models can share a single base |
|
network without additional parameters, that is, the base network not only serves as a generator that |
|
provides fake data to facilitate the training of EBMs like previous works (Che et al., 2020; Xiao |
|
et al., 2021; Durkan & Nash, 2019; Deng et al., 2020), but also plays the role of modeling the energy surface. This property makes it easy to plug E-ARM into the training of any autoregressive |
|
generative models. |
|
|
|
Intuitively, the exposure bias in ARGMs is caused by the fact that the model is trained on real data |
|
rather than data generated by the model. On the other hand, in the EBM’s optimization process |
|
for modeling joint densities, the negative phase of wake-sleep algorithms (Hinton, 2002; Kim & |
|
Bengio, 2016) requires sampling data from the EBM itself. Along with the fact that our method |
|
combines the EBM and the ARGM seamlessly as a whole, E-ARM can reduce the discrepancy |
|
between input data of the training and inference stage, which mitigates the exposure bias problem |
|
of the ARGM. On top of it, unlike ARGMs, which factor the joint distribution into a product of |
|
conditional distributions, EBMs are able to model the joint distribution directly and score each input |
|
at the sequence level instead of at the token level, which makes them capable of modeling longrange coherence. Additionally, in order to optimize the proposed energy-based learning objective |
|
efficiently via gradient-based wake-sleep algorithms (Kim & Bengio, 2016), we present a way to |
|
estimate the negative phase gradient (which is a necessary component in the gradient-based wakesleep algorithms) through those samples generated with the autoregressive view instead of the EBM |
|
view, which would require an expensive Markov Chain Monte Carlo (MCMC) process. This allows |
|
us to sidestep extremely time-consuming MCMCs, thus accelerating training. |
|
|
|
In summary, the following contributions are made with this paper: i) We introduce a novel scheme, |
|
E-ARM, to integrate the EBM view into autoregressive generative models seamlessly; ii) we attempt |
|
to reduce the intrinsic problems of autoregressive models, such as exposure bias and weak temporal |
|
coherence, by optimizing an energy-based learning objective, which uses samples autoregressively |
|
generated; iii) We demonstrate how to efficiently optimize our model constructed from a single |
|
network, using wake-sleep algorithms without MCMC; iv) In a number of applications, such as |
|
language modeling, neural machine translation, and image generation, our model can achieve better |
|
results in comparison with relevant baselines. |
|
|
|
2 BACKGROUND |
|
|
|
2.1 ENERGY-BASED MODELS |
|
|
|
Energy-based models (LeCun et al., 2006) can express any probability p(x) for x ∈ R[K] as |
|
|
|
_pθ(x) = [exp(][−][E][θ][(][x][))]_ _,_ (1) |
|
|
|
**Zθ** |
|
|
|
|
|
where Eθ : R[D] _→_ R denotes an energy function which aims to map a D-dimensional datapoint |
|
to a scalar, and Z(θ) = **x** [exp(][−][E][θ][(][x][))][ denotes the normalizing constant, also known as the] |
|
|
|
partition function. Any function can be used as an energy function to represent an EBM as long as it |
|
can generate a single scalar given some input x and the normalizing constant is finite[1]. Wake-sleep |
|
|
|
[P] |
|
|
|
algorithms are commonly used to optimize EBMs (Hinton, 2002; Kim & Bengio, 2016; Grathwohl |
|
et al., 2020) via gradient-based approximate maximum likelihood. Specifically, the gradient of the |
|
log-likelihood, which needs to be maximized, with respect to θ can be expressed as |
|
|
|
_∂_ _∂_ _∂_ |
|
Epd(x) = Epθ(x) Epd(x) _._ (2) |
|
|
|
_∂θ_ [log][ p][θ][(][x][)] _∂θ_ **[E][θ][(][x][)]** _−_ _∂θ_ **[E][θ][(][x][)]** |
|
|
|
h i h i h i |
|
|
|
1Without constraining the parametrization of Eθ, this can be achieved by bounding the region of space in |
|
which x takes its allowed values. |
|
|
|
|
|
----- |
|
|
|
The first term in the right hand side of Eq. 2 is the negative phase term while the second term is |
|
called the positive phase term. MCMC methods have been used (Hinton, 2002; Welling & Teh, |
|
2011a) to approximately sample from pθ(x), for estimating the negative phase term. |
|
|
|
2.2 MODELING DISTRIBUTIONS AUTOREGRESSIVELY |
|
|
|
Autoregressive generative models (ARGM)[2] can decompose any joint distribution p(x) into a product of conditional distributions using the product rule of probability by ordering those random variables within the joint distribution and characterizing each random variable given all variables preceding it in that order. Formally, we use x<k to denote the vector variable covering all random |
|
variables before the time step k and use xk denote the random variable at time step k. Then we have |
|
|
|
|
|
_p(xk_ **x<k).** (3) |
|
_|_ |
|
_k=1_ |
|
|
|
Y |
|
|
|
|
|
_p(x) =_ |
|
|
|
|
|
In general, modeling distributions autoregressively has achieved remarkable accomplishments in |
|
numerous areas (Vaswani et al., 2017; Radford et al., 2019; van den Oord et al., 2016c;b; Salimans |
|
et al., 2017) thanks to its ability to avoid the challenging target of modeling joint high-dimensional |
|
distributions directly. We primarily focus on autoregressive language models in this paper, but we |
|
also conduct experiments on image generation to validate the generality of our method. |
|
|
|
2.3 EXPOSURE BIAS AND INCOHERENCE PROBLEMS IN AUTOREGRESSIVE MODELS |
|
|
|
In the discussion about the defects of sequential autoregressive generative models, the exposure bias |
|
problem (Bengio et al., 2015; Ranzato et al., 2016) is an important issue, which greatly affects the |
|
model’s deployment performance. During the training stage, the autoregressive model is always |
|
conditioned on ground truth token sequences. In generation stage, however, the model has to rely |
|
on its own previously generated tokens to predict the next token, when the model is deployed. If an |
|
incorrect token is selected, this error can be amplified in following steps because the next prediction |
|
will be made using an unusual input (one unlike those in the training set). Besides, out of the |
|
consideration of efficiency, autoregressive decoding usually selects the most probable token at each |
|
time step, given the ones previously selected. Such a scheme assumes the largest joint probability of |
|
the whole sequence can be achieved by separately choosing the most probable next token (given its |
|
previous context) over all time steps, which is only the local optimum. Correspondingly, the chosen |
|
sequence can not always be the model’s optimum result. |
|
|
|
3 INTEGRATE EBMS INTO AUTOREGRESSIVE MODELS SEAMLESSLY |
|
|
|
For a long time, as a result of compromises for improving training stability and efficiency (e.g., |
|
modeling a joint distribution by decomposing it and using a teacher-forcing training strategy), conventional autoregressive generative models have suffered from flaws such as the exposure bias and |
|
the lack of long-range coherence. To tackle these issues, we attempt to seamlessly integrate Energybased models into AutoRegressive Models (E-ARM), which can be regarded as a variant of autoregressive generative models blending with an energy-based learning objective. Given a joint |
|
sequential distribution, E-ARM also addresses it autoregressively, that is, tackling tokens step by |
|
step under a specific order. However, what differs from conventional ARGMs is that we attempt |
|
to model both the conditional and the joint distributions simultaneously at each time step. In this |
|
way, E-ARM can model distributions conveniently in an autoregressive manner while avoiding those |
|
potential problems brought by ARGMs. |
|
|
|
Formally, given a sequence of random variables (x1, x2, . . ., xK) with length K, we introduce a |
|
parametric autoregressive model qθ(x<k) = _l=1_ _[q][θ][(][x][l][|][x][<l][)][ (][k][ denotes the time step) with pa-]_ |
|
rameters θ. Particularly, we define ˜qθ(x<k) = _l=m_ _[q][θ][(][x][l][|][x][<l][)][ Q]n[m]=1[−][1]_ _[q][(][x][n][|][x][<n][)][, which means]_ |
|
only those conditional distributions qθ(xl **x<l[Q]) of the most recent[k][−][1]** _k_ _m time steps are involved_ |
|
_|_ _−_ |
|
in the current update of parameters θ while those distant conditional distributions q(xn **x<n) are** |
|
|
|
[Q][k][−][1] _|_ |
|
|
|
2In this paper, the term “autoregressive model” is sometimes used to denote the autoregressive generative |
|
model for convenience. |
|
|
|
|
|
----- |
|
|
|
treated as fixed (The rationale behind such a design will be elaborated in Sec.4). Then, we define |
|
_pθ(xk, x<k) as a product of the autoregressive model and an EBM as follows,_ |
|
|
|
_pθ(xk, x<k) = ˜qθ(x<k)_ _[e][−][φ][θ][(][x][k][,][x][<k][)]_ _,_ (4) |
|
_·_ **Zθ** |
|
|
|
|
|
where the energy function φθ(xk, x<k) is defined as the xk’s negative corresponding component |
|
of the base network’s output logit with the input prefix context x<k = (x1, x2, . . ., xk 1) (e.g., |
|
_−_ |
|
given a sequence “This is Friday.” and assuming the corresponding index of the token “Friday” in |
|
the vocabulary is i, then the value of _φθ(“Friday”, “This is”) is the i-th component of the output_ |
|
_−_ |
|
logit, which is the straight input tensor of the final softmax layer), and the normalization term Zθ = |
|
Ex[′]<k[∼]q[˜]θ(x<k)[[][P]xk _[e][−][φ][θ][(][x][k][,][x]<k[′]_ [)]]. |
|
|
|
Our primary goal is to make the distribution qθ(xk **x<k) to approach the real conditional pd(xk** **x<k)** |
|
_|_ _|_ |
|
whilst maintaining pθ(xk, x<k) as close to the real joint pd(xk, x<k) as possible at each time step, |
|
which can be achieved by minimizing the Kullback-Leibler (KL) divergence between the distributions, |
|
|
|
_K_ |
|
|
|
_θ[∗]_ = arg minθ **DKL** _pd(xk|x<k)||qθ(xk|x<k)_ + λDKL _pd(xk, x<k)||pθ(xk, x<k)_ _, (5)_ |
|
|
|
_kX=1_ [] |
|
|
|
where λ adjusts the ratio between the two objectives. In Eq. 5, the first objective at each time step |
|
_k can be easily optimized by cross entropy while the second objective is optimized in the sense of_ |
|
EBMs by wake-sleep algorithms (Hinton et al., 1995; Kim & Bengio, 2016), which minimizes the |
|
objective by descending the following gradient of θ according to Eq. 2[3] |
|
|
|
_∂_ _∂_ |
|
|
|
Exk,x<k _pd(xk,x<k)_ Exk,x<k _pθ(xk,x<k)_ _,_ |
|
_∼_ _∂θ_ **[E][θ][(][x][k][,][ x][<k][)]** _−_ _∼_ _∂θ_ **[E][θ][(][x][k][,][ x][<k][)]** (6) |
|
|
|
|
|
|
|
**Positive Phase** **Negative Phase** |
|
|
|
where we have| **Eθ(xk, x<k{z) = φθ(xk, x<k)** } log ˜| _qθ(x<k). Optimization via Eq. 2 or 6 involves{z_ } |
|
_−_ |
|
sampling data from the model and can thus lead to the discovery of non-data-like samples, whose |
|
likelihood is then explicitly reduced by the energy function. E-ARM is therefore not plagued by |
|
the exposure bias problem. Besides, because we model the joint distribution throughout the training |
|
process, E-ARM can assess the entire sequence as a whole and generate more coherent data using |
|
energy sampling (Deng et al., 2020). |
|
|
|
4 OPTIMIZATION |
|
|
|
In this section, we present how to efficiently optimize E-ARM. To begin with, we optimize the first |
|
objective in Eq. 5 as with conventional autoregressive models by reducing the per time-step crossentropy loss. As for the second objective, we resort to descend the estimated gradient as shown in |
|
Eq. 6. Thanks to the importance sampling technique and our well-defined energy function, we now |
|
show that an improved version of Eq. 6 has a simple and symmetric form that can be easily estimated |
|
whilst not requiring an expensive MCMC. |
|
|
|
Specifically, by replacing Eθ(xk, x<k) with the specific form φθ(xk, x<k) log ˜qθ(x<k), the gra_−_ |
|
dient w.r.t. θ in the positive phase of Eq. 6 can be written into |
|
|
|
|
|
Ex<k _pd_ [ _[∂]_ _qθ(x<k)] + Exk,x<k_ _pd_ [ _[∂]_ (7) |
|
_−_ _∼_ _∂θ_ [log ˜] _∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]][.]_ |
|
|
|
|
|
Similarly, we can get the negative phase gradient as |
|
|
|
Ex<k _pθ_ [ _[∂]_ _qθ(x<k)] + Exk,x<k_ _pθ_ [ _[∂]_ (8) |
|
_−_ _∼_ _∂θ_ [log ˜] _∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]][.]_ |
|
|
|
|
|
The first term −Ex<k∼pd [ _∂θ[∂]_ [log ˜]qθ(x<k)] in Eq. 7 is equivalent to the log-likelihood gradient of |
|
|
|
_q˜θ(x<k), which means improvements in this direction will be automatically taken care of as a re-_ |
|
sult of steps arising from the gradient of the first KL-divergence in Eq. 5, albeit at the expense of |
|
|
|
3here, we take a minimization version of the Eq. 2. Thus the sign before each phase is converse. |
|
|
|
|
|
----- |
|
|
|
changing the weight given to the second vs. the first KL, λ. Besides, because the estimation of the |
|
expectation operator over the data distribution pd is easy, and the score φθ(xk, x<k) can be acquired |
|
simply accessing the output logit of ARGM (see the definition of φθ in Sec. 3), the second term |
|
can likewise be readily estimated and optimized. As a result, the positive phase optimization is both |
|
feasible and efficient. |
|
|
|
The negative phase gradient estimation, on the other hand, is more involved. In Eq. 8, sampling data |
|
from pθ is required for estimating the expectation Epθ, whereas pθ is a parametric joint probability |
|
involving an energy-based unnormalized probability estimator that may require time-consuming |
|
MCMC methods to generate data. However, thanks to importance sampling, we can substitute |
|
the troublesome computation of the expectation over the distribution pθ with the expectation over |
|
the distribution qθ, which can generate samples autoregressively without MCMC. Formally, the |
|
negative phase gradient Exk,x<k∼pθ [ _∂θ[∂]_ **[E][θ][(][x][k][,][ x][<k][)]][ is equivalent to the following formulation (See]** |
|
|
|
the detailed derivation in Appendix A), |
|
|
|
Ex<k _q˜θ(x<k)[[][w][(][x]<k[)][ ∂]_ _qθ(x<k)] + Exk,x<k_ _q˜θ(xk,x<k)[[][w][(][x]<k[)][ ∂]_ (9) |
|
_−_ _∼_ _∂θ_ [log ˜] _∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]][,]_ |
|
|
|
|
|
_xk_ _[e][−][φ][(][x][k][,][x][<k][)]_ |
|
|
|
**where** **w(x<k) =** _._ (10) |
|
|
|
Ex[′]<k[∼]q[˜]θP(x<k)[[][P]xk _[e][−][φ][θ][(][x][k][,][x]<k[′]_ [)]] |
|
|
|
According to Eq. 9, all the estimated expectations only need sampling from the autoregressive model |
|
rather than the joint distribution, and the reweighing weight w in Eq. 10 also does not involve |
|
expectation computation over distribution pθ. Generally, producing data from an autoregressive |
|
model is a very simple ancerstral sampling process, as compared with sampling straight from an |
|
EBM, which needs MCMC approaches (Durkan & Nash, 2019). On account of that, the optimization |
|
process can be much more efficient. |
|
|
|
Besides, the term Ex<k∼q˜θ(x<k)[[][w][(][x]<k[)][ ∂]∂θ [log ˜]qθ(x<k)] in Eq. 9 is equivalent to a re-weighted |
|
|
|
version of the gradient of qθ’s information entropy with respect to θ. This term can be optimized |
|
similarly to the teacher-forcing training of autoregressive model with the “teacher” sequence generated autoregressively by the model itself. Actually, the scheduled sampling methods (Bengio et al., |
|
2015; Ranzato et al., 2016; Mihaylova & Martins, 2019) are similar to this term but without the reweighting factor. Furthermore, it is worth noting that for a sequence with total length K, since we |
|
add a constraint to fit the joint distribution pθ at each time step k, Eq. 9 actually has K counterparts |
|
with different time steps. If we use the qθ(x<k) directly instead of ˜qθ(x<k) in the Eq. 4 to define |
|
_pθ(xk, x<k), due to the fact that the distribution qθ(x<k) modeled by an autoregressive model can_ |
|
be naturally broken up into pieces, simply summing up these K gradients results in the term |
|
|
|
|
|
_K+1−l_ |
|
|
|
Eqθ(x<k)[w(x<k) _[∂]_ (11) |
|
|
|
_∂θ_ [log][ q][θ][(][x][l][|][x][<l][)]][,] |
|
|
|
_k=1_ |
|
|
|
X |
|
|
|
|
|
_K_ |
|
|
|
Eqθ(x<k)[w(x<k) _[∂]_ |
|
|
|
_∂θ_ [log][ q][θ][(][x][<k][)] =] |
|
|
|
_k=1_ |
|
|
|
X |
|
|
|
|
|
_l=1_ |
|
|
|
|
|
where l indicates the specific index of the current token in the entire sequence. As a result, earlier |
|
time steps (smaller l) will get stronger training signals (larger K + 1 − _l, indicating more gradi-_ |
|
ent terms), giving rise to imbalanced training for different time steps. To solve this, we introduce |
|
_q˜θ(x<k) as_ _l=m_ _[q][θ][(][x][l][|][x][<l][)][ Q]n[m]=1[−][1]_ _[q][(][x][n][|][x][<n][)][ to define][ p][θ][(][x][k][,][ x][<k][)][ shown in Sec. 3, allowing]_ |
|
gradients only back propagate through conditional distributions w.r.t. a few recent tokens[4]. This |
|
explains our proposal of using[Q][k][−][1] ˜qθ(x<k) to define pθ(xk, x<k). |
|
|
|
Ultimately, combining Eq. 7 and Eq. 9, at each time step k, we can optimize pθ(xk, x<k) via |
|
descending the estimated gradient of θ as follows, |
|
|
|
|
|
Ex<k _pd_ [ _[∂]_ _qθ(x<k)]_ |
|
_−_ _∼_ _∂θ_ [log ˜] |
|
|
|
+ Exk,x<k _pd_ [ _[∂]_ |
|
_∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]]_ |
|
|
|
**Positive Phase** |
|
| {z } |
|
|
|
|
|
Ex<k _q˜θ(x<k)[[][w][(][x]<k[)][ ∂]_ _qθ(x<k)]_ |
|
_−_ _∼_ _∂θ_ [log ˜] |
|
|
|
+ Exk,x<k _q˜θ(xk,x<k)[[][w][(][x]<k[)][ ∂]_ |
|
_∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]]_ |
|
|
|
**Negative Phase** |
|
| {z } |
|
|
|
|
|
(12) |
|
|
|
|
|
4In practice, we find that using recent 2 tokens worked best. |
|
|
|
|
|
----- |
|
|
|
From Eq. 12, we can see that the only difference between two phases is that in the negative phase, |
|
the expectation over ˜qθ have a reweighing weight w for each sample. The reweighing weight w in |
|
Eq. 10 and Eq. 12 can be further refined (see the derivation in Appendix B) and we can observe that |
|
|
|
_µ(x<k)_ |
|
**w(x<k) =** (13) |
|
|
|
Ex[′]<k _[µ][(][x][<k][)]_ _[,]_ |
|
|
|
|
|
where µ(x<k) = _[p]q˜θ[θ]([(]x[x]<k[<k])[)]_ [indicates the possibility of which distribution the prefix context][ x][<k][ is] |
|
|
|
most likely to come from, the distribution pθ or the distribution ˜qθ. Correspondingly, w(x<k) reflects |
|
the context x<k’s relative magnitude of µ(x<k) compared with the average among all potential |
|
contexts—the larger the value of w(x<k), the more likely the context x<k in the data space coming |
|
from pθ, which is modeled by the product of autoregressive models and EBMs. During training, |
|
those input sequences with contexts more likely under pθ than qθ will be assigned larger weights w |
|
while others will be assigned smaller weights w. |
|
|
|
In general, E-ARM ought to be viewed as a new learning pattern for autoregressive models that ensures our base autoregressive network stays close to the real distribution pd. We found that training |
|
from scratch with the energy-based learning objective of in Eq.12 alone did not work well. The |
|
reason is that at the initial stage of the training process, what we have is just a randomly initialized autoregressive network which outputs sequences with random values given any context. This |
|
indicates disjoint supports between the real sequence’s distribution pd and distribution pθ modeled |
|
by ARGMs. If we only use the energy-based learning objective of Eq. 12, the whole gradient |
|
Epd(x)[ _∂θ[∂]_ [log][ p][θ][(][x][)]][ in Eq.2 would be 0 due to disjoint supports between][ p][d][ and][ p][θ][. As a result, in] |
|
|
|
order to make the optimization more feasible, we must maintain the cross-entropy loss low throughout training and pre-train as a pure ARGM for a few epochs before introducing the E-ARM objective. |
|
Actually, the starting epoch of E-ARM is a hyper-parameter, and we discuss it in the Sec. 5.2. |
|
|
|
Following the excellent work of Deng et al. (2020); Bakhtin et al. (2021), we also adopt Top-K |
|
energy re-sampling in the inference stage, which means that in the generative process, we first gather |
|
multiple candidate sequences generated autoregressively, and then re-sample from them based on |
|
their energy scores estimated by the network’s logit at the last time step where the entire sequence |
|
has been processed. Since we employ the EBM to model the joint distribution at each time step, |
|
such a re-sampling strategy can mitigate the undesirable impact of the greedy selection of one token |
|
at a time, and we found this variation to increase the coherence of generated samples. |
|
|
|
5 EXPERIMENTS |
|
|
|
To empirically corroborate the effectiveness of E-ARM and show its broad applicability, we conduct extensive experiments covering three machine learning applications, which are neural machine |
|
translation (NMT), language modeling, and image generation. In this section, we will introduce the |
|
three corresponding experimental setups, followed by an analysis of the obtained results. We will |
|
release the source code once upon acceptance. |
|
|
|
5.1 APPLICATION TO NEURAL MACHINE TRANSLATION |
|
|
|
E-ARM is first evaluated in the context of neural machine translation (NMT), which is a conditional |
|
generation task and is important in the natural language processing (NLP) field. We first analyze |
|
E-ARM on the IWSLT14 dataset, which includes six different language pairs ({German, Spanish, |
|
Italian} → English and English →{German, Spanish, Italian}). In addition, we test E-ARM on the |
|
WMT16 (English → German) benchmark to make sure we evaluating E-ARM on a larger dataset. |
|
Hereafter we abbreviate English, German, Spanish, Italian as “En”, “De”, “Es”, “It”. The weight |
|
_λ in Eq. 5 is set as 0.05 for all translation tasks. We use one size of transformer (“Base-IWSLT”)_ |
|
for the IWSLT14 benchmark and two sizes of transformer (“Base-WMT”, “Large-WMT”) for the |
|
WMT16 benchmark [5]. Scheduled Sampling is carried out following Mihaylova & Martins (2019). |
|
|
|
The results of IWSLT14 tasks are shown in Table 1. We test not only the pure performance of |
|
E-ARM but also the compatibility with other techniques. Specifically, we can observe that (1) |
|
without any particular engineering, E-ARM outperforms the base autoregressive translation model |
|
|
|
5The implementation is developed on Fairseq (Ott et al., 2019). |
|
|
|
|
|
----- |
|
|
|
|Model|Label Scheduled Smoothing Sampling|Beam Searching|BLEU ↑ DE→EN EN→DE EN→IT IT→EN ES→EN EN→ES|Avg.| |
|
|---|---|---|---|---| |
|
|Base|- - - |- 5 B - 5 B - 5 B|32.44±0.06 26.64±0.10 27.92±0.03 30.48±0.08 38.61±0.11 35.42±0.09 33.62±0.07 27.41±0.08 28.72±0.04 31.39±0.05 39.55±0.12 36.38±0.07 33.68±0.03 27.62±0.04 28.81±0.07 31.42±0.07 39.85±0.13 36.71±0.09 34.61±0.08 28.46±0.06 29.72±0.10 32.29±0.03 40.64±0.07 37.48±0.05 34.23±0.06 27.96±0.03 29.26±0.11 31.93±0.08 40.16±0.03 37.21±0.04 35.10±0.04 28.73±0.04 29.97±0.07 32.64±0.12 40.91±0.06 37.93±0.10|31.92 32.85 33.02 33.87 33.46 34.21| |
|
|E-ARM|- - - |- 5 B - 5 B - 5 B|32.99±0.10 27.15±0.03 28.33±0.12 31.13±0.04 39.56±0.01 36.07±0.02 34.06±0.06 27.97±0.08 29.26±0.09 31.90 ±0.13 40.30 ±0.03 36.92 ±0.09 33.97 ±0.08 28.03 ±0.04 29.13 ±0.02 31.84 ±0.11 40.32 ±0.03 36.96 ±0.07 34.93 ±0.05 28.91 ±0.12 30.04 ±0.11 32.56 ±0.04 41.01 ±0.06 37.73 ±0.12 34.58 ±0.09 28.38 ±0.12 29.56 ±0.10 32.11 ±0.03 40.93 ±0.03 37.56 ±0.07 35.36 ±0.05 29.11 ±0.04 30.25 ±0.09 32.82 ±0.11 41.58 ±0.07 38.19 ±0.03|32.54 33.40 33.38 34.20 33.85 34.55| |
|
|
|
|
|
Table 1: Comparison of BLEU scores between our approach E-ARM and the base ARGM trained just with |
|
cross-entropy loss on six translation pairs of IWSLT14 datasets. We use “-” to denote that the training trick is |
|
not used while “” indicates we use it. “5 B” represents we use beam searching with 5 beams. |
|
|
|
trained with cross-entropy singly by 0.62 (31.92 → 32.54) BLEU points in average, especially |
|
on three translation pairs—38.61 → 39.56 on Spanish-to-English, 30.48 → 31.13 on Italian-toEnglish, 35.42 → 36.07 on English-to-Spanish. (2) E-ARM is compatible with other techniques |
|
like scheduled sampling, which can help alleviate the exposure bias problem to some extent. They |
|
are not mutually exclusive and can work together to further improve the performance of the base |
|
ARGM. (3) However, since scheduled sampling can reduce exposure bias and beam search can |
|
somewhat alleviate the flaws caused by greedy selection at each time step, the performance gain of |
|
E-ARM when all these tactics are combined is only 0.34 (34.21 → 34.55), which is lower than the |
|
0.62 (31.92 → 32.54) obtained when the model is purely trained without these other techniques. |
|
|
|
**Model** **L.S.** **S.S.** **w/E-ARM** **BLEU ↑** Additionally, Table 2 shows the per |
|
- - - 27.56 formance of E-ARM on the WMT16 |
|
|
|
- - 28.04 English German task. For two |
|
|
|
**Base-WMT** - 28.36 different model sizes, enabling la- → |
|
|
|
**28.62** bel smoothing (L.S.) improves model |
|
|
|
- - - 28.70 performance by 0.52 and 0.35, re |
|
- - 29.05 spectively. The performance of the |
|
|
|
**Large-WMT** - 29.23 base transformer model further in |
|
**29.44** creases to 28.36 BLEU points when |
|
|
|
scheduled sampling (S.S.) is used, |
|
|
|
|Model|L.S. S.S. w/E-ARM|BLEU ↑| |
|
|---|---|---| |
|
|Base-WMT|- - - - - - |27.56 28.04 28.36 28.62| |
|
|Large-WMT|- - - - - - |28.70 29.05 29.23 29.44| |
|
|
|
|
|
Table 2: Translation performance of proposed E-ARM on while the larger model improves to |
|
WMT16 English→German, evaluated with BLEU. We uniformly 29.23 points. E-ARM paired with |
|
use 5 beams when applying beam search. “L.S.” denotes Label label smoothing and scheduled samSmoothing and “S.S.” denotes Scheduled Sampling. pling yields the highest scores of |
|
|
|
28.62 and 29.44, respectively. Overall, our training strategy outperforms |
|
ARGM’s vanilla teacher-forcing training and can have uniformly favorable impacts across different |
|
models and dataset sizes. |
|
|
|
5.2 APPLICATION TO LANGUAGE MODELING |
|
|
|
|
|
To further demonstrate E-ARM’s |
|
consistency in reducing flaws of |
|
autoregressive generative models, |
|
we also conduct language modeling |
|
experiments. The WikiText-103 |
|
dataset (Merity et al., 2017), which |
|
is the largest word-level language |
|
modeling benchmark with long-term |
|
dependency, was chosen as the |
|
testbed. It comprises 103 million |
|
|
|
|Model|#Params PPL ↓| |
|
|---|---| |
|
|Tr-Base Tr-Base (w/E-ARM) Standard Tr-XL Standard Tr-XL (w/E-ARM)|156M 30.56 156M 29.89 151M 24.20 151M 23.81| |
|
|
|
|
|
Table 3: Language modeling performance of different models on |
|
WikiText103. Evaluation is conducted using perplexity (PPL). |
|
|
|
|
|
----- |
|
|
|
training tokens from 28 thousand |
|
articles, with an average length of 3.6 thousand tokens per article, which allows model to evaluate |
|
the ability of modeling long-term dependency. Two network structures are mainly tested, which |
|
are Transformer-Base (Vaswani et al., 2017) and Transformer-XL (Dai et al., 2019) (Tr-Base and |
|
Tr-XL for short respectively hereafter). |
|
|
|
|
|
The final results are reported in Table 3. We can see |
|
from the results that E-ARM outperforms baselines |
|
with clear margins for different types of models. |
|
Specifically, the Transformer-Base improves performance by 0.67 PPL points (from 30.56 to 29.89), **0.00** 30.56 30.56 30.56 |
|
while the Transformer-XL improves model by 0.20 **0.01** 30.48 30.12 30.22 |
|
PPL points (from 24.20 to 23.81). Our strategy does **_λ_** **0.05** 30.43 **29.89** 30.16 |
|
not change the structure of the base network nor **0.1** 30.60 30.03 30.14 |
|
introduces any additional module or learnable pa- **0.5** 30.71 30.36 30.47 |
|
rameters, therefore we can conclude that the perfor |
|
|Col1|Col2|Start Epoch| |
|
|---|---|---| |
|
|||5 15 25| |
|
|λ|0.00 0.01 0.05 0.1 0.5|30.56 30.56 30.56 30.48 30.12 30.22 30.43 29.89 30.16 30.60 30.03 30.14 30.71 30.36 30.47| |
|
|
|
mance boost is solely from the introduced energy- Table 4: How different λ and the E-ARM start |
|
based learning objective. epoch (when we introduce the E-ARM into the |
|
|
|
training on WikiText103) affect performance eval |
|
In addition, we study the effect of hyper-parameter |
|
|
|
uated by perplexity (PPL). The Tr-Base model |
|
|
|
settings on the performance of language modeling, |
|
|
|
structure is used and is train 40 epochs in total. |
|
|
|
which can be seen in Table 4. From this, we may deduce that starting E-ARM training at the 15-th epoch |
|
yields the best results, whereas starting earlier or later yields a performance decline. It is reasonable |
|
because, if E-ARM was introduced too early, the autoregressive model may not have been optimized well at that moment. As a result, generative quality would be terrible, and make energy-based |
|
training unstable. On the other hand, the underlying autoregressive model can be modified only |
|
marginally if E-ARM is introduced when the ARGM training is virtually complete. Besides, from |
|
the vertical perspective which presents the impact of different λ, we can observe that the best λ in |
|
Eq. 5 is 0.05. The first line of the table indicates the baseline of training the autoregressive model |
|
with pure cross-entropy loss. |
|
|
|
|
|
5.3 APPLICATION TO IMAGE GENERATION |
|
|
|
In order to illustrate the effectiveness and generality of our method in processing different modality |
|
tasks, we further show the results of applying E-ARM to image generation in this section. We apply |
|
E-ARM to Pixel-CNN (Van Oord et al., 2016) and its variant Gated Pixel-CNN (Oord et al., 2016). |
|
Experiments are carried out on the MNIST and CIFAR-10 datasets. |
|
|
|
|Model|Test (Train) NLL ↓| |
|
|---|---| |
|
||MNIST CIFAR-10| |
|
|Pixel-CNN Pixel-CNN (w/E-ARM) Gated Pixel-CNN Gated Pixel-CNN (w/E-ARM)|0.17 (0.13) 3.14 (3.08) 0.15 (0.12) 3.07 (2.98) 0.14 (0.11) 3.03 (2.90) 0.12 (0.10) 2.97 (2.91)| |
|
|
|
|
|
Table 5: Performance of E-ARM with different base |
|
networks on MNIST and CIFAR-10 in bits/dim (lower |
|
is better), training performance in brackets. |
|
|
|
|
|
Figure 1: Samples of CIFAR-10 from |
|
Gated Pixel-CNN (w/E-ARM). |
|
|
|
|
|
Table 5 summarizes the quantitative results measured by per-pixel negative log-likelihood (NLL), |
|
while Figure 1 depicts some of the generated samples. We can see that with the help of our E-ARM, |
|
both the Pixel-CNN and the Gated Pixel-CNN can obtain improvements in all datasets (0.17 → 0.15 |
|
and 3.14 → 3.07 for Pixel-CNN on MNIST and CIFAR10 respectively and 0.14 → 0.12 and 3.03 |
|
_→_ 2.97 for Gated Pixel-CNN on MNIST and CIFAR10 respectively). This is further evidence in |
|
favour of the energy-based learning objective for improving autoregressive models. |
|
|
|
|
|
----- |
|
|
|
6 RELATED WORKS |
|
|
|
6.1 AUTOREGRESSIVE GENERATIVE MODELS |
|
|
|
Modeling high-dimensional data distributions directly is usually a rather challenging task due to |
|
“the curse of dimensionality” (Bellman, 1954). One alternative method is to sequentialize the random variables and then factorize the joint probability distribution into the product of conditionals |
|
based on the sequence structure, which is exactly the core idea of autoregressive generative models |
|
(ARGMs). |
|
|
|
ARGMs have been very successful, in particular for sequential data. For example, ARGMs have |
|
been widely used in language modeling (Vaswani et al., 2017; Dai et al., 2019; Radford et al., |
|
2019), audio synthesis (van den Oord et al., 2016a), and even image generation (van den Oord et al., |
|
2016c;b; Salimans et al., 2017). The advantages of ARGMs are however balanced by issues of (1) |
|
exposure bias (Ranzato et al., 2016; Bengio et al., 2015; Song et al., 2020), due to the discrepancy |
|
in input context distributions between the training and inference stages, and (2) weak long-range |
|
coherence, due to the inherent greedy selection of one token at a time without look-ahead. |
|
|
|
6.2 ENERGY-BASED MODELS |
|
|
|
In the field of generative modeling, energy-based models (EBMs) have been widely used (Zhao |
|
et al., 2017; Arbel et al., 2021; Gao et al., 2021). The primary idea behind EBMs is to decompose |
|
the dependencies between variables (e.g. images and labels) through different terms of an energy |
|
function, assigning low energies to proper configurations found in the dataset, while assigning high |
|
energies to incorrect or unseen ones (LeCun et al., 2006). |
|
|
|
Due to the challenge of sampling from EBMs, training EBMs by wake-sleep algorithms (Hinton, |
|
2002; Kim & Bengio, 2016; Grathwohl et al., 2021), which require expensive MCMC approaches, |
|
has been notoriously difficult, especially on high-dimensional data like images or texts. Stochastic |
|
Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011a) is a frequently used gradient-based |
|
MCMC approach that injects noise into parameter updates and anneals the step size during the |
|
course of training, and which has been adopted in numerous prior works (Nijkamp et al., 2019; Du |
|
& Mordatch, 2019; Grathwohl et al., 2020). However, these gradient-based MCMC methods require |
|
enormous extra computing overheads and are not applicable when the input is discrete like for text |
|
sequences (Deng et al., 2020). |
|
|
|
As a result, a variety of recent works attempt to explore the strategy of training an EBM without |
|
MCMC. In particular, Bakhtin et al. (2021); Xu et al. (2021a); Gao et al. (2020) optimize the EBMs |
|
by using noise contrastive estimation (NCE) (Gutmann & Hyv¨arinen, 2010; Ma & Collins, 2018). |
|
Durkan & Nash (2019) estimate the intractable normalization component by utilizing ARGMs and |
|
importance sampling. Che et al. (2020); Wang et al. (2021) skirt the challenge of collecting data in |
|
the high-dimensional data space by producing data in the lower-dimensional feature space, which |
|
improves sampling efficiency. |
|
|
|
7 CONCLUSIONS AND FUTURE WORK |
|
|
|
In this paper, we propose a novel method dubbed E-ARM to integrate energy-based models into autoregressive generative models seamlessly, with an energy-based training objective that exploits an |
|
underlying autoregressive model. This is achieved by defining the energy function from the output |
|
logits of the base autoregressive network, to model the unnormalized joint distribution of the subsequence up to each time step. We also found ways to improve training of E-ARM using importance |
|
sampling, avoiding the requirement of MCMC for the energy-based training. Experimental results |
|
on two language tasks and one vision task demonstrate the effectiveness of E-ARM to alleviate exposure bias and incoherence problems of ARGMs. In the future, we expect to extend E-ARM on |
|
other sequential generation tasks (e.g. text summarization, audio generation), and incorporate the |
|
proposed methodology into other advanced autoregressive architectures. |
|
|
|
|
|
----- |
|
|
|
REFERENCES |
|
|
|
Michael Arbel, Liang Zhou, and Arthur Gretton. Generalized energy based models. In 9th Inter_national Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,_ |
|
_2021, 2021._ |
|
|
|
Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc’Aurelio Ranzato, and Arthur Szlam. |
|
Residual energy-based models for text. J. Mach. Learn. Res., 22:40:1–40:41, 2021. |
|
|
|
Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and |
|
_Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, Ann_ |
|
Arbor, Michigan, jun 2005. Association for Computational Linguistics. |
|
|
|
Richard Ernest Bellman. The Theory of Dynamic Programming. RAND Corporation, Santa Monica, |
|
CA, 1954. |
|
|
|
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence |
|
prediction with recurrent neural networks. In Advances in Neural Information Processing Systems |
|
_28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015,_ |
|
_Montreal, Quebec, Canada, pp. 1171–1179, 2015._ |
|
|
|
Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and |
|
Yoshua Bengio. Your GAN is secretly an energy-based model and you should use discriminator |
|
driven latent sampling. In Advances in Neural Information Processing Systems 33: Annual Con_ference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,_ |
|
_virtual, 2020._ |
|
|
|
Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. In Proceedings of the 35th International Conference on Machine |
|
_Learning, ICML 2018, Stockholmsm¨assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of_ |
|
_Proceedings of Machine Learning Research, pp. 863–871, 2018._ |
|
|
|
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of |
|
_the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy,_ |
|
_July 28- August 2, 2019, Volume 1: Long Papers, pp. 2978–2988, 2019._ |
|
|
|
Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc’Aurelio Ranzato. Residual |
|
energy-based models for text generation. In 8th International Conference on Learning Repre_sentations, ICLR 2020. OpenReview.net, 2020._ |
|
|
|
Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Ad_vances in Neural Information Processing Systems 32: Annual Conference on Neural Information_ |
|
_Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp._ |
|
3603–3613, 2019. |
|
|
|
Conor Durkan and Charlie Nash. Autoregressive energy machines. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, |
|
_ICML 2019, volume 97, pp. 1735–1744. PMLR, 2019._ |
|
|
|
Ruiqi Gao, Erik Nijkamp, Diederik P. Kingma, Zhen Xu, Andrew M. Dai, and Ying Nian Wu. Flow |
|
contrastive estimation of energy-based models. In 2020 IEEE/CVF Conference on Computer |
|
_Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 7515–_ |
|
7525, 2020. |
|
|
|
Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. Kingma. Learning energybased models by diffusion recovery likelihood. In 9th International Conference on Learning |
|
_Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021._ |
|
|
|
Will Grathwohl, Kuan-Chieh Wang, J¨orn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, |
|
and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like |
|
one. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, |
|
_Ethiopia, April 26-30, 2020, 2020._ |
|
|
|
|
|
----- |
|
|
|
Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, |
|
and David Duvenaud. No MCMC for me: Amortized sampling for fast and stable training of |
|
energy-based models. In 9th International Conference on Learning Representations, ICLR 2021, |
|
_Virtual Event, Austria, May 3-7, 2021, 2021._ |
|
|
|
Michael Gutmann and Aapo Hyv¨arinen. Noise-contrastive estimation: A new estimation principle |
|
for unnormalized statistical models. In Proceedings of the Thirteenth International Conference |
|
_on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May_ |
|
_13-15, 2010, volume 9 of JMLR Proceedings, pp. 297–304, 2010._ |
|
|
|
G. Hinton, P. Dayan, B. Frey, and R. Neal. The “wake-sleep” algorithm for unsupervised neural |
|
networks. Science, 268 5214:1158–61, 1995. |
|
|
|
Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural |
|
_Comput., 14(8):1771–1800, 2002._ |
|
|
|
Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled |
|
text generation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual |
|
_Event, Austria, May 3-7, 2021, 2021._ |
|
|
|
Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability |
|
estimation. CoRR, abs/1606.03439, 2016. |
|
|
|
Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based |
|
learning. Predicting structured data, 1(0), 2006. |
|
|
|
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization |
|
_Branches Out, pp. 74–81, Barcelona, Spain, jul 2004. Association for Computational Linguistics._ |
|
|
|
Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In Proceedings of the 2018 Conference on |
|
_Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November_ |
|
_4, 2018, pp. 3698–3707, 2018._ |
|
|
|
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture |
|
models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, |
|
_France, April 24-26, 2017, Conference Track Proceedings, 2017._ |
|
|
|
Tsvetomila Mihaylova and Andr´e F. T. Martins. Scheduled sampling for transformers. In Fernando Emilio Alva-Manchego, Eunsol Choi, and Daniel Khashabi (eds.), Proceedings of the 57th |
|
_Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28_ |
|
|
|
_- August 2, 2019, Volume 2, pp. 351–356, 2019._ |
|
|
|
Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent nonpersistent short-run MCMC toward energy-based model. In Advances in Neural Information |
|
_Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019,_ |
|
_NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 5233–5243, 2019._ |
|
|
|
Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. _arXiv preprint_ |
|
_arXiv:1606.05328, 2016._ |
|
|
|
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, |
|
and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of |
|
_NAACL-HLT 2019: Demonstrations, 2019._ |
|
|
|
Bo Pang, Tian Han, Erik Nijkamp, Song-Chun Zhu, and Ying Nian Wu. Learning latent space |
|
energy-based prior model. In Advances in Neural Information Processing Systems 33: Annual |
|
_Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_ |
|
_2020, virtual, 2020._ |
|
|
|
|
|
----- |
|
|
|
Tetiana Parshakova, Jean-Marc Andreoli, and Marc Dymetman. Global autoregressive models for |
|
data-efficient sequence learning. In Proceedings of the 23rd Conference on Computational Nat_ural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pp. 900–909,_ |
|
2019a. |
|
|
|
Tetiana Parshakova, Jean-Marc Andreoli, and Marc Dymetman. Distributional reinforcement learning for energy-based sequential models. CoRR, abs/1912.08517, 2019b. |
|
|
|
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language |
|
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. |
|
|
|
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, |
|
_ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016._ |
|
|
|
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the |
|
pixelcnn with discretized logistic mixture likelihood and other modifications. In 5th Interna_tional Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,_ |
|
_Conference Track Proceedings, 2017._ |
|
|
|
Kaitao Song, Xu Tan, and Jianfeng Lu. Neural machine translation with error correction. In Proceed_ings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp._ |
|
3891–3897, 2020. |
|
|
|
A¨aron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, |
|
Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for |
|
raw audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September |
|
_2016, pp. 125, 2016a._ |
|
|
|
A¨aron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and |
|
Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in Neural In_formation Processing Systems 29: Annual Conference on Neural Information Processing Systems_ |
|
_2016, December 5-10, 2016, Barcelona, Spain, pp. 4790–4798, 2016b._ |
|
|
|
A¨aron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. |
|
In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York |
|
_City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings,_ |
|
pp. 1747–1756. JMLR.org, 2016c. |
|
|
|
Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In |
|
_International Conference on Machine Learning, pp. 1747–1756. PMLR, 2016._ |
|
|
|
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, |
|
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Infor_mation Processing Systems 30: Annual Conference on Neural Information Processing Systems_ |
|
_2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017._ |
|
|
|
Yezhen Wang, Bo Li, Tong Che, Kaiyang Zhou, Ziwei Liu, and Dongsheng Li. Energy-based openworld uncertainty modeling for confidence calibration. CoRR, abs/2107.12628, 2021. |
|
|
|
Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In |
|
_Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue,_ |
|
_Washington, USA, June 28 - July 2, 2011, pp. 681–688, 2011a._ |
|
|
|
Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In |
|
_Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue,_ |
|
_Washington, USA, June 28 - July 2, 2011, pp. 681–688, 2011b._ |
|
|
|
Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In 9th International Conference on Learning |
|
_Representations, ICLR 2021. OpenReview.net, 2021._ |
|
|
|
|
|
----- |
|
|
|
J. Xie, Z. Zheng, X. Fang, S. Zhu, and Y. Wu. Cooperative training of fast thinking initializer and |
|
slow thinking solver for conditional learning. IEEE Transactions on Pattern Analysis & Machine |
|
_Intelligence, (01):1–1, mar 2019. ISSN 1939-3539._ |
|
|
|
Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of |
|
descriptor and generator networks. IEEE Trans. Pattern Anal. Mach. Intell., 42(1):27–45, 2020. |
|
|
|
Minkai Xu, Shitong Luo, Yoshua Bengio, Jian Peng, and Jian Tang. Learning neural generative |
|
dynamics for molecular conformation generation. In 9th International Conference on Learning |
|
_Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021a._ |
|
|
|
Yilun Xu, Yang Song, Sahaj Garg, Linyuan Gong, Rui Shu, Aditya Grover, and Stefano Ermon. |
|
Anytime sampling for autoregressive models via ordered autoencoding. In 9th International Con_ference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021b._ |
|
|
|
Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training |
|
and inference for neural machine translation. In Anna Korhonen, David R. Traum, and Llu´ıs |
|
M`arquez (eds.), Proceedings of the 57th Conference of the Association for Computational Lin_guistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1, pp. 4334–4343, 2019._ |
|
|
|
Junbo Jake Zhao, Micha¨el Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, |
|
_April 24-26, 2017, Conference Track Proceedings, 2017._ |
|
|
|
|
|
----- |
|
|
|
A THE DERIVATION OF THE NEGATIVE PHASE GRADIENT |
|
|
|
In this section, we show the detailed derivation of Eq. 9. Formally, as shown in Sec. 3, given an |
|
autoregressive model qθ(x<k) = _l=1_ _[q][θ][(][x][l][|][x][<l][)][ (][k][ denotes the time step) with parameters][ θ][, we]_ |
|
define a product of the autoregressive model and an EBM as follows |
|
|
|
[Q][k][−][1] |
|
|
|
_pθ(xk, x<k) = ˜qθ(x<k)_ _[e][−][φ][θ][(][x][k][,][x][<k][)]_ _,_ (14) |
|
_·_ **Zθ** |
|
|
|
|
|
where ˜qθ(x<k) = _l=m_ _[q][θ][(][x][l][|][x][<l][)][ Q][m]n=1[−][1]_ _[q][(][x][n][|][x][<n][)][. Under such definition, only those con-]_ |
|
ditional distributions qθ(xl **x<l) of the most recent k** _m time steps are involved in the current_ |
|
_|_ _−_ |
|
update of parameters θ while those distant conditional distributions q(xn **x<n) are treated as fixed.** |
|
|
|
[Q][k][−][1] _|_ |
|
We have explained the rationale and intuition in Sec.4. Zθ is the normalization term and equal to |
|
Ex[′]<k[∼]q[˜]θ(x<k)[[][P]xk _[e][−][φ][θ][(][x][k][,][x]<k[′]_ [)]]. The optimization of pθ(xk, x<k) includes two phases, and the |
|
|
|
gradient w.r.t θ of negative phase is |
|
|
|
Ex<k _pθ_ [ _[∂]_ _qθ(x<k)] + Exk,x<k_ _pθ_ [ _[∂]_ (15) |
|
_−_ _∼_ _∂θ_ [log ˜] _∼_ _∂θ_ _[φ][θ][(][x][k][,][ x][<k][)]][.]_ |
|
|
|
|
|
Next, we will show the specific derivation of these two terms in Eq. 15 so that the entire Eq. 15 can |
|
be transformed into Eq. 9. |
|
|
|
A.1 THE DERIVATION OF THE FIRST TERM |
|
|
|
The first term Ex<k∼pθ [ _∂θ[∂]_ [log ˜]qθ(x<k)] can be processed as follows |
|
|
|
|
|
Ex<k _pθ_ [ _[∂]_ _qθ(x<k)] =_ |
|
_∼_ _∂θ_ [log ˜] |
|
|
|
|
|
_pθ(x<k)_ _[∂]_ _qθ(x<k)_ |
|
|
|
_∂θ_ [log ˜] |
|
|
|
**x<k** |
|
|
|
X |
|
|
|
|
|
_pθ(xk, x<k)_ _[∂]_ _qθ(x<k)_ |
|
|
|
_∂θ_ [log ˜] |
|
|
|
_xk_ |
|
|
|
X |
|
|
|
|
|
**x<k** |
|
|
|
|
|
(16) |
|
|
|
(17) |
|
|
|
|
|
_xk_ _[e][−][φ][θ][(][x][k][,][x][<k][)]_ |
|
|
|
**Zθ** |
|
|
|
|
|
_q˜θ(x<k)_ |
|
**x<k** |
|
|
|
X |
|
|
|
|
|
_qθ(x<k)_ |
|
_∂θ_ [log ˜] |
|
|
|
|
|
=Ex<k _q˜θ(x<k)[[][w][(][x]<k[)][ ∂]_ _qθ(x<k)],_ |
|
_∼_ _∂θ_ [log ˜] |
|
|
|
_xk_ _[e][−][φ][(][xk,][x][<k]_ [)] |
|
|
|
Ex′<k _[∼]qθ[˜]_ P(x<k )[[][P]xk _[e][−][φθ]_ [(][xk,][x]<k[′] [)]] [because] |
|
|
|
|
|
where we have w(x<k) = |
|
|
|
**w(x<k) =** |
|
|
|
|
|
_xk_ _[e][−][φ][(][x][k][,][x][<k][)]_ |
|
|
|
**Zθ** |
|
|
|
|
|
_xk_ _[e][−][φ][(][x][k][,][x][<k][)]_ |
|
|
|
Pxk _q[˜]θ(x<k)e[−][φ][θ][(][x][k][,][x][<k][)]_ |
|
|
|
|
|
**x<k** |
|
|
|
|
|
_xk_ _[e][−][φ][(][x][k][,][x][<k][)]_ |
|
|
|
**x<k** _q[˜]θP(x<k)_ _xk_ _[e][−][φ][θ][(][x][k][,][x][<k][)]_ |
|
|
|
P _xk_ _[e][−][φ][(][x][k][,][x][<k][)]_ |
|
|
|
[P] |
|
|
|
Ex<k∼q˜θP(x<k)[[][P]xk _[e][−][φ][θ][(][x][k][,][x][<k][)][]]_ _[.]_ |
|
|
|
|
|
----- |
|
|
|
A.2 THE DERIVATION OF THE SECOND TERM |
|
|
|
Then, we tackle the second term $\mathbb{E}_{x_k, \mathbf{x}_{<k} \sim p_\theta}\big[\frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\big]$ as follows:
$$
\begin{aligned}
\mathbb{E}_{p_\theta}\Big[\frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big]
&= \sum_{x_k, \mathbf{x}_{<k}} p_\theta(x_k, \mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k}) \\
&= \sum_{x_k, \mathbf{x}_{<k}} \tilde{q}_\theta(x_k, \mathbf{x}_{<k})\, \frac{p_\theta(x_k, \mathbf{x}_{<k})}{\tilde{q}_\theta(x_k, \mathbf{x}_{<k})}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k}) \\
&= \sum_{x_k, \mathbf{x}_{<k}} \tilde{q}_\theta(x_k, \mathbf{x}_{<k})\, \frac{\tilde{q}_\theta(\mathbf{x}_{<k})\, e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta\, \tilde{q}_\theta(x_k, \mathbf{x}_{<k})}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k}) \\
&= \mathbb{E}_{x_k, \mathbf{x}_{<k} \sim \tilde{q}_\theta(x_k, \mathbf{x}_{<k})}\Big[\frac{1}{\tilde{q}_\theta(x_k \mid \mathbf{x}_{<k})} \cdot \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big] \\
&= \sum_{\mathbf{x}_{<k}} \tilde{q}_\theta(\mathbf{x}_{<k}) \sum_{x_k} \tilde{q}_\theta(x_k \mid \mathbf{x}_{<k})\, \frac{1}{\tilde{q}_\theta(x_k \mid \mathbf{x}_{<k})}\, \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k}) \\
&= \sum_{\mathbf{x}_{<k}} \tilde{q}_\theta(\mathbf{x}_{<k}) \sum_{x_k} \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k}) \\
&= \mathbb{E}_{\tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\sum_{x_k} \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big] \\
&= \mathbb{E}_{\tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\sum_{x_k} \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}} \cdot \frac{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big] \\
&= \mathbb{E}_{\tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\sum_{x_k} \tilde{q}_\theta(x_k \mid \mathbf{x}_{<k})\, w(\mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big] \\
&= \mathbb{E}_{\tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\mathbb{E}_{x_k \sim \tilde{q}_\theta(x_k \mid \mathbf{x}_{<k})}\Big[w(\mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big]\Big] \\
&= \mathbb{E}_{x_k, \mathbf{x}_{<k} \sim \tilde{q}_\theta(x_k, \mathbf{x}_{<k})}\Big[w(\mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big],
\end{aligned}
\qquad (18)
$$
|
|
|
where $w(\mathbf{x}_{<k})$ is also equal to $\frac{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}$. Combining Eq. 16 and Eq. 18, we obtain an equivalent form of the gradient of the negative phase without any expectation over $p_\theta$:
$$-\,\mathbb{E}_{\mathbf{x}_{<k} \sim \tilde{q}_\theta(\mathbf{x}_{<k})}\Big[w(\mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\log\tilde{q}_\theta(\mathbf{x}_{<k})\Big] + \mathbb{E}_{x_k, \mathbf{x}_{<k} \sim \tilde{q}_\theta(x_k, \mathbf{x}_{<k})}\Big[w(\mathbf{x}_{<k})\, \frac{\partial}{\partial\theta}\phi_\theta(x_k, \mathbf{x}_{<k})\Big], \qquad (19)$$
where
$$w(\mathbf{x}_{<k}) = \frac{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{\mathbb{E}_{\mathbf{x}'_{<k} \sim \tilde{q}_\theta(\mathbf{x}_{<k})}\big[\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}'_{<k})}\big]}. \qquad (20)$$
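In practice, the denominator of Eq. 20 is estimated with the same batch of prefixes sampled autoregressively from $\tilde{q}_\theta$, so $w$ reduces to a self-normalized weight over the batch. Below is a minimal sketch of this estimator; the tensor layout, the function name, and the stop-gradient on $w$ are our assumptions rather than details stated in the paper.

```python
import torch

def importance_weights(energies):
    """Monte Carlo estimate of w(x_<k) in Eq. 20 from a batch of sampled prefixes.

    energies: [batch, vocab] tensor with energies[i, v] = phi_theta(v, x_<k^(i)),
              where the prefixes x_<k^(i) are sampled from q~_theta.
    Returns a [batch] tensor w with w[i] = s_i / mean_j(s_j), where
    s_i = sum_v exp(-energies[i, v]) is computed in log space for stability.
    """
    log_s = torch.logsumexp(-energies, dim=-1)                 # log sum_v e^{-phi}
    log_mean_s = torch.logsumexp(log_s, dim=0) - torch.log(
        torch.tensor(float(energies.size(0))))                 # log of the batch mean
    return torch.exp(log_s - log_mean_s).detach()              # treat w as a constant weight

# Eq. 19 is then estimated by reweighting per-sample terms from the same batch, e.g.
#   grad_surrogate = -(w * log_q_tilde).mean() + (w * phi_of_sampled_token).mean()
```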
|
|
|
|
|
B THE FURTHER REFINEMENT OF w |
|
|
|
The reweighting weight $w$ can be further simplified as
$$
w(\mathbf{x}_{<k})
= \frac{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{\mathbb{E}_{\mathbf{x}'_{<k} \sim \tilde{q}_\theta(\mathbf{x}_{<k})}\big[\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}'_{<k})}\big]}
= \frac{\sum_{x_k} \frac{p_\theta(x_k, \mathbf{x}_{<k})}{\tilde{q}_\theta(\mathbf{x}_{<k})}}{\mathbb{E}_{\mathbf{x}'_{<k} \sim \tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\sum_{x_k} \frac{p_\theta(x_k, \mathbf{x}'_{<k})}{\tilde{q}_\theta(\mathbf{x}'_{<k})}\Big]}
= \frac{\frac{p_\theta(\mathbf{x}_{<k})}{\tilde{q}_\theta(\mathbf{x}_{<k})}}{\mathbb{E}_{\mathbf{x}'_{<k} \sim \tilde{q}_\theta(\mathbf{x}_{<k})}\Big[\frac{p_\theta(\mathbf{x}'_{<k})}{\tilde{q}_\theta(\mathbf{x}'_{<k})}\Big]}
= \frac{\mu(\mathbf{x}_{<k})}{\mathbb{E}_{\mathbf{x}'_{<k}}[\mu(\mathbf{x}'_{<k})]},
\qquad (21)
$$
where $\mu(\mathbf{x}_{<k})$ is defined as $\frac{p_\theta(\mathbf{x}_{<k})}{\tilde{q}_\theta(\mathbf{x}_{<k})}$. The second equality uses Eq. 14, i.e., $e^{-\phi_\theta(x_k, \mathbf{x}_{<k})} = Z_\theta\, p_\theta(x_k, \mathbf{x}_{<k}) / \tilde{q}_\theta(\mathbf{x}_{<k})$, where the factor $Z_\theta$ cancels between the numerator and the denominator.
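Equivalently, with Eq. 21 the weight can be computed directly from the density ratio $\mu$ on a batch of sampled prefixes; below is a minimal sketch under the same assumptions as above (the function name is ours).

```python
import torch

def weights_from_density_ratio(log_p, log_q_tilde):
    """Batch estimate of Eq. 21: w_i = mu_i / mean_j(mu_j) with mu_i = p_theta / q~_theta.

    log_p, log_q_tilde: [batch] log-probabilities of the sampled prefixes
    under p_theta and q~_theta, respectively.
    """
    log_mu = log_p - log_q_tilde
    # Self-normalize over the batch in log space for numerical stability.
    log_mean_mu = torch.logsumexp(log_mu, dim=0) - torch.log(
        torch.tensor(float(log_mu.numel())))
    return torch.exp(log_mu - log_mean_mu)
```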
|
|
|
|
|
|
|
|
C EXPERIMENTAL SETTINGS |
|
|
|
In this section, we introduce the specific setup of the different benchmarks in Table 6. We uniformly use the Adam optimizer. Training is stopped once the model has not improved on the validation set for 20 epochs. For translation tasks, the length of the generated fake sentences, which are used to compute the negative phase in Eq. 12, depends on the source sequence, whereas for language modeling tasks we fix the length of the generated fake sentences to 50 during training. As for the model structures of the image generation task, we use the official architectures reported by PixelCNN (van den Oord et al., 2016c) and Gated PixelCNN (van den Oord et al., 2016b) without modification. The source code will be released upon acceptance. We use the same batch of autoregressively generated samples to approximate both the expectations in Eq. 12 and the weight w (i.e., the samples are shared), so we do not need to sample twice. The number of samples in a batch is dynamic, while the maximum number of total tokens in a batch is fixed (4096 in our experiments); for example, if the sequences in a batch have length 32, the batch contains 4096 / 32 = 128 samples in total. This is a common strategy in language generation tasks and has been used in many frameworks (e.g., Fairseq (Ott et al., 2019)). At each update iteration, we autoregressively generate as many samples as there are sequences in the current batch.
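For concreteness, the token-budget batching described above can be sketched as follows; this is a simplified illustration rather than the exact Fairseq implementation, and the function name is ours.

```python
def batch_by_token_budget(lengths, max_tokens=4096):
    """Group sequence indices so that (sequences per batch) x (longest length) <= max_tokens.

    The number of sequences per batch is dynamic while the token budget stays fixed.
    """
    batches, current, current_max = [], [], 0
    for idx, length in enumerate(lengths):
        new_max = max(current_max, length)
        if current and (len(current) + 1) * new_max > max_tokens:
            batches.append(current)
            current, new_max = [], length
        current.append(idx)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Example: 128 sequences of length 32 fill exactly one 4096-token batch.
print(len(batch_by_token_budget([32] * 128)[0]))  # -> 128
```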
|
|
|
| Hyper-Parameter | IWSLT14 (Tr-Base) | WMT16 (Tr-Base) | WMT16 (Tr-Large) | WikiText103 (Tr-Base) | WikiText103 (Tr-XL) |
|---|---|---|---|---|---|
| Number of Layers | 12 | 12 | 12 | 6 | 16 |
| Hidden Embed Size | 512 | 512 | 1024 | 512 | 410 |
| FC-Layer Embed Size | 1024 | 2048 | 4096 | 2048 | 2100 |
| Attention Heads | 4 | 8 | 16 | 8 | 10 |
| Dropout | 0.3 | 0.3 | 0.3 | 0.1 | 0.1 |
| Learning Rate | 5e-4 | 1e-3 | 1e-3 | 5e-4 | 2.5e-4 |
| LR Scheduler | inverse sqrt | inverse sqrt | inverse sqrt | inverse sqrt | cosine |
| Warm-up Updates | 4000 | 4000 | 4000 | 4000 | 10000 |
| Weight Decay | 1e-4 | 0.0 | 0.0 | 1e-2 | 0.0 |
| Coefficient λ | 0.05 | 0.05 | 0.05 | 0.05 | 0.02 |
| E-ARM Start Epoch | 15 | 15 | 10 | 15 | 10 |

Table 6: Hyper-parameters for different model structures and datasets. “Tr-Base”, “Tr-Large”, and “Tr-XL” indicate Transformer-Base, Transformer-Large, and Transformer-XL, respectively.
|
|
|
D MORE EXPERIMENTAL ANALYSIS |
|
|
|
D.1 EFFECT ON INCOHERENCE |
|
|
|
In order to validate the effectiveness of our E-ARM in improving the long-range coherence of generations, we assess the model's performance on test sets with varying sentence lengths. We divided the test sets of the IWSLT14 (German → English, Italian → English, Spanish → English) translation datasets into three subsets ([0, 25), [25, 50), and [50, ∞)) based on the target sentence lengths, as sketched in the snippet below. Then, we incrementally applied the scheduled sampling technique and our E-ARM on top of the base Transformer network and tested their performance on these three subsets. Generally, the subset with longer target sentences ([50, ∞)) should be more affected by the long-range incoherence problem (lower BLEU score). In practice, we uniformly applied label smoothing and beam search (with 5 beams) for all experiments in Table 7.
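A minimal sketch of the length-based split described above; the helper name and whitespace tokenization are our assumptions, since the paper does not specify these details.

```python
def split_by_target_length(pairs):
    """Split (source, target) sentence pairs into three buckets by target length (in tokens)."""
    buckets = {"[0, 25)": [], "[25, 50)": [], "[50, inf)": []}
    for src, tgt in pairs:
        n = len(tgt.split())
        if n < 25:
            buckets["[0, 25)"].append((src, tgt))
        elif n < 50:
            buckets["[25, 50)"].append((src, tgt))
        else:
            buckets["[50, inf)"].append((src, tgt))
    return buckets
```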
|
|
|
Specifically, Table 7 shows that the base translation model improved on all three test subsets after applying the scheduled sampling technique, especially on the two subsets with relatively short target sentences, [0, 25) and [25, 50) (e.g., on the German to English task, 38.20 - 37.72 = +0.48 points and 33.76 - 33.24 = +0.52 points for the [0, 25) and [25, 50) subsets, respectively). We attribute this boost to the alleviation of exposure bias, since scheduled sampling approaches (Ranzato et al., 2016; Zhang et al., 2019; Mihaylova & Martins, 2019) have been shown to mitigate the exposure bias problem.
|
|
|
|
|
|
| Translation Task | Scheduled Sampling | E-ARM Training | [0, 25) | [25, 50) | [50, ∞) | All Test |
|---|---|---|---|---|---|---|
| De→En | - | - | 37.72 ±0.04 | 33.24 ±0.06 | 30.86 ±0.07 | 34.61 ±0.08 |
| De→En | ✓ | - | 38.20 ±0.07 | 33.76 ±0.03 | 31.08 ±0.06 | 35.10 ±0.04 |
| De→En | ✓ | ✓ | 38.37 ±0.06 | 33.92 ±0.09 | 31.43 ±0.04 | 35.36 ±0.05 |
| It→En | - | - | 35.20 ±0.03 | 32.73 ±0.02 | 26.86 ±0.05 | 32.29 ±0.03 |
| It→En | ✓ | - | 35.52 ±0.09 | 33.25 ±0.08 | 26.95 ±0.14 | 32.64 ±0.12 |
| It→En | ✓ | ✓ | 35.56 ±0.10 | 33.33 ±0.13 | 27.21 ±0.07 | 32.82 ±0.11 |
| Es→En | - | - | 43.37 ±0.05 | 39.67 ±0.08 | 37.14 ±0.06 | 40.64 ±0.07 |
| Es→En | ✓ | - | 43.61 ±0.09 | 40.00 ±0.04 | 37.38 ±0.06 | 40.91 ±0.06 |
| Es→En | ✓ | ✓ | 43.84 ±0.10 | 40.35 ±0.05 | 38.07 ±0.04 | 41.58 ±0.07 |

Table 7: Performance comparison on the IWSLT14 test sets with respect to different target sentence lengths on three translation tasks (German to English, Italian to English, and Spanish to English). Performance is evaluated by BLEU score.
|
|
|
Besides, after applying our E-ARM together with the scheduled sampling technique, the base model obtains an additional performance gain. The improvement is most evident on longer sentences: on the German to English task, the gain on the [50, ∞) subset is 31.43 - 31.08 = +0.35 points, larger than the gains on the shorter subsets [0, 25) and [25, 50) (38.37 - 38.20 = +0.17 points and 33.92 - 33.76 = +0.16 points, respectively). This indicates that our E-ARM alleviates the incoherence problem to some extent.
|
|
|
D.2 EFFECT ON EXPOSURE BIAS |
|
|
|
| | DE→EN | EN→DE | EN→IT | IT→EN | ES→EN | EN→ES |
|---|---|---|---|---|---|---|
| N | 14203 | 14554 | 14976 | 13952 | 16021 | 15359 |
| Total | 22148 | 23057 | 23654 | 23744 | 23860 | 22775 |
| Ratio | 64.12% | 63.12% | 63.31% | 59.76% | 68.33% | 67.43% |
|
|
|
|
|
|
|
Table 8: The effect of E-ARM on the exposure bias problem. Each translation test set contains 1K sentences selected randomly. N denotes the number of ground-truth words whose probabilities under the distributions predicted by E-ARM are greater than those produced by the baseline.
|
|
|
We follow the analytic experiments of Zhang et al. (2019) to show that our E-ARM is capable of alleviating the exposure bias problem. Specifically, we randomly select 1K pairs from the training data for each translation direction and use the autoregressive model trained with E-ARM (label smoothing with a smoothing factor of 0.1 is applied during training, while scheduled sampling is not used) to decode the source sentences. We then count the ground-truth words whose probabilities under the distributions predicted by E-ARM are greater than those produced by the baseline, and denote this number as N. We also calculate the ratio of N to the total number of words tested. The detailed results are shown in Table 8. We find that the ratios for all tasks are greater than 50%, which demonstrates the ability of our E-ARM to alleviate the exposure bias problem.
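The counting protocol can be sketched as follows; the tensor layout, padding convention, and function name are our assumptions.

```python
import torch

def exposure_bias_ratio(p_earm, p_base, targets, pad_id=0):
    """Count ground-truth tokens to which the E-ARM model assigns a higher probability
    than the baseline, and return (N, total, N / total).

    p_earm, p_base: [batch, seq_len, vocab] token probabilities from each model.
    targets:        [batch, seq_len] ground-truth token ids; pad_id positions are ignored.
    """
    mask = targets.ne(pad_id)
    idx = targets.unsqueeze(-1)
    prob_earm = p_earm.gather(-1, idx).squeeze(-1)   # probability of the gold token
    prob_base = p_base.gather(-1, idx).squeeze(-1)
    n = ((prob_earm > prob_base) & mask).sum().item()
    total = mask.sum().item()
    return n, total, n / total
```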
|
|
|
D.3 ANALYSIS TO MODEL’S CONVERGENCE |
|
|
|
In this section, we investigate the convergence of our E-ARM. We first train a base Transformer model (the “Tr-Base” architecture shown in Table 6) on the IWSLT14 Spanish to English training set for the baseline and the E-ARM model respectively, and record the training loss and test loss (cross entropy) at the end of each epoch. The loss curves are plotted in Figure 2. From Figure 2, we can see that (1) at the start of training, our E-ARM converges slightly faster than the baseline; (2) as training progresses, the cross entropy of the baseline on the training set decreases at a faster rate than that of E-ARM, whereas the test loss of the baseline falls initially and then slowly rises after 50 epochs, while E-ARM maintains stable convergence. This shows that E-ARM effectively prevents over-fitting and generalizes better.
|
|
|
|
|
|
|
|
|
|
|
Figure 2: (a) Cross entropy loss curves on the IWSLT14 Spanish → English translation task (training set); the blue and orange curves represent the base model and E-ARM, respectively. (b) Cross entropy loss curves on the IWSLT14 Spanish → English translation task (test set).
|
|
|
D.4 ANALYSIS TO TOP-K RE-SAMPLING |
|
|
|
| k | DE→EN | EN→DE | EN→IT | IT→EN | ES→EN | EN→ES |
|---|---|---|---|---|---|---|
| 0 | 34.86 | 28.73 | 29.91 | 32.44 | 40.88 | 37.59 |
| 5 | **34.93** | 28.85 | **30.04** | **32.56** | **41.01** | 37.66 |
| 10 | 34.88 | **28.91** | 29.96 | 32.41 | 40.90 | **37.73** |

Table 9: The effect of Top-K correction at the inference stage. We report BLEU scores for different k on different translation pairs of the IWSLT14 dataset.
|
|
|
Top-K energy re-sampling at the inference stage was introduced by Bakhtin et al. (2021): many candidate sequences are generated autoregressively and then re-sampled according to the energy scores estimated by the network. To measure the contribution of Top-K energy re-sampling in our method, we conduct an ablation study with K ∈ {0, 5, 10}. The BLEU results are shown in Table 9. We observe that the benefit brought by Top-K re-sampling (K = {5, 10}) is minor compared with the model without it (K = 0). These results indicate that the performance improvements of our E-ARM come mainly from the joint training rather than from Top-K energy re-sampling.
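One plausible instantiation of Top-K energy re-sampling is sketched below; the candidate generation routine is omitted, and the sampling rule (a Boltzmann distribution over the k lowest-energy candidates) is our assumption rather than the exact procedure of Bakhtin et al. (2021).

```python
import math
import random

def topk_energy_resample(candidates, energies, k=5, temperature=1.0):
    """Re-sample one output from the k lowest-energy candidates (lower energy = better).

    candidates: list of autoregressively generated sequences
    energies:   list of scalar energy scores, one per candidate
    """
    if k <= 0:
        return candidates[0]  # k = 0 falls back to the original top candidate
    top = sorted(zip(candidates, energies), key=lambda ce: ce[1])[:k]
    # Boltzmann weights over the retained candidates.
    weights = [math.exp(-e / temperature) for _, e in top]
    return random.choices([c for c, _ in top], weights=weights, k=1)[0]
```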
|
|
|
D.5 EVALUATION WITH OTHER METRICS |
|
|
|
| Trans. Pairs | Scheduled Sampling | E-ARM Training | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | METEOR ↑ | BLEU ↑ |
|---|---|---|---|---|---|---|---|
| De→En | - | - | 66.51 | 43.69 | 63.69 | 64.35 | 34.61 |
| De→En | ✓ | - | 66.83 | 44.08 | 64.02 | 64.61 | 35.10 |
| De→En | ✓ | ✓ | 67.46 | 44.77 | 64.78 | 65.13 | 35.36 |
| It→En | - | - | 64.50 | 40.65 | 61.69 | 62.18 | 32.29 |
| It→En | ✓ | - | 64.73 | 40.97 | 61.94 | 62.51 | 32.64 |
| It→En | ✓ | ✓ | 65.27 | 41.51 | 62.49 | 62.80 | 32.82 |
| Es→En | - | - | 71.10 | 49.47 | 68.78 | 68.94 | 40.64 |
| Es→En | ✓ | - | 71.36 | 49.53 | 68.96 | 69.28 | 40.91 |
| Es→En | ✓ | ✓ | 71.91 | 50.17 | 69.65 | 69.63 | 41.58 |

Table 10: Comparison of ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BLEU scores between our approach E-ARM and the base ARGM trained just with the cross-entropy loss on three translation pairs of the IWSLT14 dataset. Values are expressed in percentage. We use “Tr-Base” as the network architecture.
|
|
|
To further evaluate the effectiveness of our proposed E-ARM, we also evaluate it with other metrics, such as ROUGE (Lin, 2004) and METEOR (Banerjee & Lavie, 2005), for neural machine translation. The results are shown in Table 10. The improvements of E-ARM across these metrics are consistent with the conclusions drawn from Table 1, which further demonstrates the effectiveness of our E-ARM model.
|
|
|
D.6 EFFICIENCY STUDY |
|
|
|
Our E-ARM has the advantage of being able to optimize an energy-based learning objective via maximum likelihood, without using MCMC procedures. The requirement to sample data from the autoregressive model at each update step, on the other hand, remains a potential factor that could slow down training. Nonetheless, the extra overhead is still acceptable compared to sampling data with MCMC algorithms. The reasons are provided below.

Assume that a forward pass of the Transformer with a length-n sentence as input has time cost τ. We measured the time cost of gradient back-propagation for the Transformer on a Tesla V100 GPU and found that the backward pass takes approximately twice as long as the forward pass, i.e., 2τ. Therefore, one update step costs approximately 3τ. Autoregressively generating a sequence of length n with the Transformer requires n forward passes, since each predicted token must condition on all previously generated tokens. Note that at each time step k we only need to feed the previously produced k tokens as input, and no gradient back-propagation is required during generation.
|
|
|
As a result, the time cost of generating a fake sentence of length n is roughly nτ/2, since each generation step only runs a forward pass over the current prefix. For the IWSLT14 German to English translation task, this amounts to a generation cost of roughly 9.5τ per iteration. Furthermore, the generated fake sentence is fed into the transformer and included in the overall loss computation, resulting in an extra forward and backward procedure in addition to the update on the original input. Thus, the total time cost of one E-ARM update is roughly 15.5τ (3τ for the update on real data, 9.5τ for generation, and 3τ for the extra pass on the generated data), which is about 5.2 times that of vanilla training. Table 11 shows the time cost of training a 12-layer transformer for 100 iterations; the measured time cost of our E-ARM roughly coincides with the extra cost analyzed above. For long-sequence tasks like image generation and language modeling, whose sequences usually consist of hundreds of tokens, we randomly truncate a continuous sub-sequence of length 50 for the energy-based training in Eq. 12.

| Model | S.S. | w/ E-ARM | Sec./100 iter. |
|---|---|---|---|
| Tr-Base | - | - | 27.3 |
| Tr-Base | ✓ | - | 30.1 |
| Tr-Base | - | Autoreg. | 145.8 |
| Tr-Base | ✓ | Autoreg. | 149.2 |
| Tr-Base | - | 20 steps SGLD | 630.6 |
| Tr-Base | - | 50 steps SGLD | 1452.3 |

Table 11: Training efficiency on IWSLT14 German → English. We uniformly use the 12-layer “Tr-Base” in Table 6. “S.S.” denotes Scheduled Sampling. “Autoreg.” indicates optimizing E-ARM with Eq. 12 by sampling fake data from the autoregressive model. “* steps SGLD” represents optimizing E-ARM with Eq. 6, where the fake data is sampled at the first transformer layer's output by SGLD with the given number of steps.
|
|
|
When it comes to MCMC sampling, one problematic issue is that, for sequential data like text, the intrinsically discrete nature of the data prevents applying MCMC in the data space, which forces us to apply it in a latent feature space. Here, we take the SGLD algorithm (Welling & Teh, 2011b) as an example. Assuming that we apply SGLD at the first layer of the network, the time cost of one SGLD iteration is also about 3τ, since each step requires one forward and one backward pass. Since the SGLD process requires k iterations to reach convergence, the total time cost of one E-ARM update with the MCMC process is (3k + 6)τ. In practice, k is usually set to 100 for stable training (Grathwohl et al., 2020), which results in a time cost of (3 × 100 + 6)τ / 3τ = 102 times that of vanilla training. For short-run SGLD, which sets k to 20 at the cost of some performance, the time cost is still (3 × 20 + 6)τ / 3τ = 22 times that of vanilla training.
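For reference, a generic SGLD update in a latent feature space is sketched below (our own naming, not the authors' code). Each step calls the energy network forward and backward once, which is why one SGLD iteration costs roughly 3τ in the analysis above.

```python
import torch

def sgld_sample(energy_fn, z_init, n_steps=20, step_size=1e-2):
    """Generic SGLD: z <- z - (eps / 2) * dE/dz + sqrt(eps) * noise.

    energy_fn: maps a latent tensor to a per-sample scalar energy (lower = more likely).
    Each step needs one forward and one backward pass through energy_fn.
    """
    z = z_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_fn(z).sum()
        grad, = torch.autograd.grad(energy, z)
        with torch.no_grad():
            z = z - 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()
```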
|
|
|
|
|
|
|
|
D.7 CASES STUDIES |
|
|
|
To better understand the advantages of our method in correcting erroneous tokens, we also present several translation cases from the IWSLT14 German → English test set in Table 12.
|
|
|
|Source Sentence(German)|Predicted Target Sentence(English)| |
|
|---|---| |
|
|wenn ich ihnen 600 zeitschriften zeige und sie in 10 kategorien aufteile oder GroundTruth: if i show you 600 magazines and i divide them up into 10 ich ihnen 400 zeitschriften zeige, und diese in 20 kategorien aufteile, dann categories, versus i show you 400 magazines and divide them up into 20 cat- glauben sie, dass ich ihnen mehr auswahl und eine bessere auswahlerfahrung egories, you believe that i have given you more choice and a better choosing gegeben habe, als ich ihnen die 400 gegeben ha¨tte gegenu¨ber dem, wenn ich experience if i gave you the 400 than if i gave you the 600. ihnen die 600 gegeben ha¨tte. Baseline: if i show you 600 magazines and i split them in 10 categories, or i’m showing them 400 magazines, and i’m going to split them up into 20 categories, you think i’ve given them more choices and better choice than i would have given them the 400 over the time that i gave them the 600. Baseline + S.S.: if i show you 600 magazines and i give you 400 magazines in 10 categories, and i give you 400 magazines, and i can split them up in 20 categories, then you think i’ve given you more choice and a better selection than i would have given you the 400 of which if i gave you the 600. Ours: if i show you 600 magazines and i divide them into 10 categories, or i show you 400 magazines, and i divide them into 20 categories, you think i’ve given you more choices and better selection experience than i gave you the 400 of whom if i gave you the 600.|| |
|
|und ich weiß definitiv, dass es fu¨r mich – in meiner situation – sehr gefa¨hrlich wa¨re, anzufangen, diesen dunklen pfad der vermutung sozusagen herunterzu- sickern – besonders in dem umstand, in dem ich mich in meiner karriere gerade befinde.|GroundTruth: and i definitely know that, in my case – in my situation – it would be very dangerous for me to start sort of leaking down that dark path of assumption, particularly given the circumstance that i’m in right now in my career. Baseline: and i know definitely, for me, it would be very dangerous to begin to do this dark path of suspect – especially in the circumstance that i’m in my career right now. Baseline + S.S.: and i know definitely it would be – in my situation – very dangerous to start, to kind of settle down this dark path of presumption – es- pecially in the circumstance in which i’m in my career right now. Ours: and i definitely know that it’s for me – in my situation – very danger- ous to start to sickle down this dark path of suspection, in particular, in the circumstance of where i’m in my career right now.| |
|
|wir haben das licht ausgeschaltet, legten es in ein vakuum und saugten die ganze luft aus und ku¨hlten es bis fast zum jetzt, ganz alleine im aufzug, war das stu¨ck metall frei, sich zu verhalten wie immer es wollte.|GroundTruth: we turned off the lights, and then we put it in a vacuum and sucked out all the air, and then we cooled it down now, all alone in the elevator, the little chunk of metal is free to act however it wanted. Baseline: we turned the light off, put it in a vacuum and sucked it out all the air and cooled it up until almost now, all the way alone, the piece of metal was open to behave as it was. Baseline + S.S.: we turned the lights off, we put it into a vacuum, and we sucked all the air, and we cooled it all the way up to now, all over the place, the piece of metal was free to behave whatever it wanted. Ours: we turned off the lights, we put it into a vacuum and we sucked all the air out, and we cooled it up until almost now, all alone in the elevator, the piece of metal was free to behave whatever it wanted.| |
|
|und im grunde ko¨nnen sie das betrachten, wissen sie, als eine tyrannei des erin- nernden selbst, und sie ko¨nnen sich das erinnernde selbst denken als eins, das sozusagen das erlebende selbst schleppt durch erfahrungen, die das erlebende selbst nicht braucht.|GroundTruth: and basically you can look at this, you know, as a tyranny of the remembering self, and you can think of the remembering self sort of dragging the experiencing self through experiences that the experiencing self doesn’t need. Baseline: and basically, you can think of this, you know, as a tyranny of self, and you can think of the memorable self as one that kind of weaves the living self through experiences that don’t need the life itself. Baseline + S.S.: and basically, you can look at this, you know, as a tyrannei of memorial self, and you can think of the memorial self as one that kind of sucks the living self through experiences that don’t need the living self. Ours: and basically, you can look at that, you know, as a tyranny of the re- membering self, and you can think of the memory itself as one, which is sort of dragging the living self through experiences that the living self doesn’t need.| |
|
|wir sind an der schwelle zu erstaunlichen, erstaunlichen ereignissen auf vielen gebieten. und doch denke ich wirklich, dass wir hunderte, 300 jahre vor die aufkla¨rung zuru¨ck gehen mu¨ssten, um eine zeit zu finden, in der wir fortschritt beka¨mpft haben, in der wir u¨ber diese dinge heftiger getritten haben, an mehr fronten als jetzt.|GroundTruth: we’re on the verge of amazing, amazing events in many fields, and yet i actually think we’d have to go back hundreds, 300 years, before the enlightenment, to find a time when we battled progress, when we fought about these things more vigorously, on more fronts, than we do now. Baseline: we are at the threshold of amazing, amazing events in many areas, and yet i really think that we have to go back hundreds and 300 years before the enlightenment to find a time when we have fought progress in which we have driven more of these things than now. Baseline + S.S.: we’re at the threshold of amazing, amazing events in many areas. and yet, i really think that we have to go back hundreds and hundreds of years before the enlightenment to find a time when we have struggled with progress in which we have driven on these things more powerful, more fronts than now. Ours: we’re at the threshold to amazing, amazing events in many areas, and yet i really think that we have to go back hundreds and 300 years before the en- lightenment to find a time when we fought progress, where we’ve been fighting about these things to more fronts than we have now.| |
|
|
|
|
|
|
|
Table 12: Translation cases on the IWSLT14 De→En test set, generated by the baseline, the baseline with scheduled sampling, and our E-ARM. Italics mark mismatched translations.
|
|
|
|
|
|
|
|
E MORE DISCUSSION OF RELATED WORKS |
|
|
|
The seminal idea of combining a generative model and an energy-based model has been explored by a plethora of prior works (Pang et al., 2020; Durkan & Nash, 2019; Xie et al., 2019; 2020; Xiao et al., 2021; Bakhtin et al., 2021). Our E-ARM can be considered a member of this family of models, but it has a different mechanism and goal from the others. In particular, Pang et al. (2020) aimed to learn an energy-based model (EBM) in the latent space of a generator model, so that the EBM can act as a prior on the generator model's top-down network. They believe that the energy-based correction of the prior noise distribution benefits the subsequent generation process. Furthermore, Xie et al. (2019) attempted to learn the conditional distribution of a high-dimensional output given an input by combining the efforts of a fast-thinking initializer, which generates the output and a latent vector, and a slow-thinking solver, which learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. A similar line of work is GAMs (Parshakova et al., 2019a;b; Khalifa et al., 2021), which combine an autoregressive component with a log-linear component, allowing the use of global a priori features to compensate for a lack of data. Moreover, VAEBM, a symbiotic composition of a variational auto-encoder and an EBM, was proposed by Xiao et al. (2021). It uses a state-of-the-art VAE to capture the overall mode structure of the data distribution while relying on its EBM component to explicitly eliminate non-data-like regions from the model and refine the generated samples. In addition, Bakhtin et al. (2021) designed a novel mechanism to train an unnormalized energy-based model for joint sequence modeling by working in the residual space of a pretrained, locally normalized language model and training with noise contrastive estimation. All of the above models require an additional network to learn the energy scores, which prevents the base autoregressive model itself from benefiting from the EBM's ability to model the joint distribution in a more temporally coherent manner. In contrast, by carefully constructing an energy-based learning objective and its corresponding optimization procedure, we smoothly integrate energy-surface learning into autoregressive networks without requiring additional learnable parameters. Rather than proposing a new generative model, our method is better viewed as a novel training scheme for obtaining a better autoregressive model. Recently, instead of constructing an autoregressive model in the data space, Xu et al. (2021b) proposed to apply autoregressive models in a latent space, followed by a decoder that maps the autoregressively generated latent features back to the original data space. They attempt to learn a structured representation space whose dimensions are ordered by importance, and trade off sample quality for computational efficiency by truncating the dimensions of the latent generations. Their work is orthogonal to ours; we believe that combining our E-ARM with anytime sampling is a valuable direction worth exploring in the future.
|
|
|
|
|
|
|
|
|