pradachan's picture
Upload folder using huggingface_hub
f71c233 verified
raw
history blame
75.4 kB
## HOW TO IMPROVE SAMPLE COMPLEXITY OF SGD
### OVER HIGHLY DEPENDENT DATA?
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Conventional machine learning applications typically assume that data samples
are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data
samples, which are known to heavily bias the stochastic optimization process and
slow down the convergence of learning. In this paper, we conduct a fundamental
study on how different structures of stochastic update schemes affect the sample complexity of stochastic gradient descent (SGD) over highly dependent data.
Specifically, with a φ-mixing model of data dependence, we show that SGD with
proper periodic data-subsampling achieves an improved sample complexity over
the standard SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence
of SGD over highly dependent data. Moreover, we show that mini-batch SGD
can further substantially improve the sample complexity over SGD with periodic
data-subsampling over highly dependent data. We also conduct some numerical
experiments to validate our theoretical results.
1 INTRODUCTION
Stochastic optimization algorithms have attracted great attention in the past decade due to its successful applications to a broad research areas, including deep learning (Goodfellow et al., 2016), reinforcement learning (Sutton & Barto, 2018), online learning (Bottou, 2010; Hazan, 2017), control
(Marti, 2017), etc. In the conventional analysis of stochastic optimization algorithms, it is usually
assumed that all data samples are independently and identically distributed (i.i.d.) and queried. For
example, data samples in the traditional empirical risk minimization framework are assumed to be
queried independently from the underlying data distribution, while data samples in reinforcement
learning are assumed to be queried from the stationary distribution of the underlying Markov chain.
Although the i.i.d. data assumption leads to a comprehensive understanding of the statistical limit
and computation complexity of SGD, it violates the nature of many practical data-generating
stochastic processes, which generate highly correlated samples that depend on the history. In fact,
dependent data can be found almost everywhere, e.g., daily stock price, weather/climate data, state
transitions in Markov chains, etc. To understand the impact of data dependence on the convergence
and complexity of stochastic algorithms, there is a growing number of recent works that introduce
various definitions to quantify data dependence. Specifically, to analyze the finite-time convergence
of various stochastic reinforcement learning algorithms, recent studies assume that the dependent
samples queried from the Markov decision process satisfy a geometric mixing property (Dalal et al.,
2018; Zou et al., 2019; Xu & Gu, 2020; Qu & Wierman, 2020), which requires the underlying
Markov chain to be uniformly ergodic or has a finite mixing time (Even-Dar et al., 2003). On the
other hand, to analyze the convergence of stochastic optimization algorithms over dependent data,
Karimi et al. (2019) assumed the existence of a solution to the Poisson equation associated with the
underlying Markov chain, which is a weaker condition than the uniform ergodic condition (Glynn
& Meyn, 1996). Moreover, Agarwal & Duchi (2012) introduced a φ-mixing property of the datagenerating process that quantifies how fast the distribution of future data samples (conditioned on a
fixed filtration) converges to the underlying stationary data distribution. In particular, the φ-mixing
property is more general than the previous two notions of data dependence (Douc et al., 2018).
-----
While the aforementioned works leveraged the above notions of data dependence to characterize the
sample complexity of various standard stochastic algorithms over dependent data, there still lacks
theoretical understanding of how algorithm structure affects the sample complexity of stochastic
algorithms under different levels of data dependence. In particular, a key algorithm structure is the
stochastic update scheme, which critically affects the bias and variance of the stochastic optimization process. In fact, under i.i.d. data and convex geometry, it is well known that SGD achieves
the sample complexity lower bound under various stochastic update schemes (Bottou, 2010), e.g.,
single-sample update and mini-batch update. However, these stochastic update schemes may lead
to substantially different convergence behaviors over highly dependent data, as they are no longer
unbiased. Therefore, it is of vital importance to understand the interplay among data dependence,
structure of stochastic update and convergence rate of stochastic algorithms, and we want to ask the
following fundamental question.
- Q: How does the structure of stochastic updates affect the convergence rate and sample
complexity of stochastic algorithms over dependent data?
In this paper, we provide comprehensive answers to the above fundamental question. Specifically,
we conduct a comprehensive study of the convergence rate and sample complexity of the SGD
algorithm over a wide spectrum of data dependence levels under various types of stochastic updates,
including periodic subsampling and mini-batch sampling. Our results show that SGD with both
stochastic updates achieves a substantially improved sample complexity over the standard SGD
under highly dependent data. We summarize our contributions as follows.
1.1 OUR CONTRIBUTIONS
We consider the following standard stochastic optimization problem.
min _F_ (w; ξ) _,_ (P)
_w_ _[f]_ [(][w][) :=][ E][ξ][∼][µ]
_∈W_
 
where the objective function f is convex and Lipschitz continuous, and the expectation is taken over
the stationary distribution µ of the underlying data-generating process P. To perform stochastic
optimization, we query a stream of dependent data samples from the underlying data-generating
process. Specifically, we adopt the φ-mixing model to quantify the data dependence via a decaying
mixing coefficient function φξ(k) (see Definition 2.2) (Agarwal & Duchi, 2012). We study the convergence of the stochastic gradient descent (SGD) algorithm over φ-mixing dependent data samples
under various stochastic update schemes, including data subsampling and mini-batch sampling.
We first study the convergence of SGD over φ-mixing dependent data samples under the data subsampling update scheme. In particular, the data subsampling update scheme utilizes only one data
sample per r consecutive data samples by periodically skipping r − 1 samples. With this data subsampling scheme, the subsampled data samples are less dependent for a larger subsampling period
_r. Consequently, we show that SGD with a proper data subsampling period achieves an improved_
sample complexity over that of the standard SGD in the full spectrum of the convergence rate of the
_φ-mixing coefficient. In particular, the improvement is substantial when the data is highly dependent_
with an algebraic decaying φ-mixing coefficient.
Moreover, we study the sample complexity of SGD over φ-mixing dependent data samples under
the mini-batch update scheme. Compare to the data subsampling update, mini-batch update can
substantially reduce the mini-batch data dependence without skipping data samples. Consequently,
mini-batch update leverages the sample average over a mini batch of data samples to reduce both the
bias (caused by the data dependence) and the optimization variance. Specifically, we show that SGD
with mini-batch update achieves an orderwise lower sample complexity than both the standard SGD
and the SGD with data subsampling in the full spectrum of the convergence rate of the φ-mixing
coefficient. We summarize and compare the sample complexities of these stochastic algorithms
under different φ-mixing data dependence models in Table 1.
1.2 RELATED WORK
**Stochastic Algorithms over Dependent Data** Steinwart & Christmann (2009) and Modha &
Masry (1996) established the convergence analysis of online stochastic algorithms for streaming
-----
Table 1: Comparison of sample complexities of SGD, SGD with data subsampling and Mini-batch
SGD under different levels of data dependence for achieving f (w) − _f_ (w[∗]) ≤ _ϵ. Note that θ is a_
parameter of the convergence rate of the φ-mixing coefficient.
Data dependence level _φξ(k)_ SGD SGD w/ subsampling Mini-batch SGD
Geometric φ-mixing exp( _k[θ]),_ 2 1
_−_ (ϵ[−][2](log ϵ[−][1]) _θ )_ (ϵ[−][2](log ϵ[−][1]) _θ )_ (ϵ[−][2])
(Weakly dependent) _θ > 0_ _O_ _O_ _O_
Fast algebraic φ-mixing _k[−][θ],_
(ϵ[−][2][−] _θ[2] )_ (ϵ[−][2][−] _θ[1] )_ (ϵ[−][2])
(Medium dependent) _θ_ 1 _O_ _O_ _O_
_≥_
Slow algebraic φ-mixing _k[−][θ],_ (ϵ[−][2][−] _θ[2] )_ (ϵ[−][2][−] _θ[1] )_ (eϵ[−][1][−] _θ[1] )_
(Highly dependent) 0 < θ < 1 _O_ _O_ _O_
data with geometric ergodicity. Duchi et al. (2011) proved that the stochastic subgradient method
has strong convergence guarantee if the mixing time is uniformly bounded. Agarwal & Duchi (2012)
studied the convex/strongly convex stochastic optimization problem and proved high-probability
convergence bounds for general stochastic algorithms under general stationary mixing processes.
Godichon-Baggioni et al. (2021) provided the non-asymptotic analysis of stochastic algorithms with
strongly convex objective function over streaming mini-batch data. In a more general setting, the
stochastic approximation (SA) problem was studied in (Karimi et al., 2019) by assuming the existence of solution to a Poisson equation. Recently, Debavelaere et al. (2021) developed the asymptotic
convergence analysis of SA problem for sub-geometric Markov dynamic noises.
**Finite-time convergence of reinforcement learning** Recently, a series of work studied the finitetime convergence of many stochastic reinforcement learning algorithms over Markovian dependent
samples, including TD learning (Dalal et al., 2018; Xu et al., 2019; Kaledin et al., 2020), Q-learning
(Qu & Wierman, 2020; Li et al., 2021; Melo et al., 2008; Chen et al., 2019; Xu & Gu, 2020), fitted Qiteration (Mnih et al., 2013; 2015; Agarwal et al., 2021), actor-critic algorithms (Wang et al., 2019;
Yang et al., 2019; Kumar et al., 2019; Qiu et al., 2019; Wu et al., 2020; Xu et al., 2020), etc. In these
studies, the dependent Markovian samples are assumed to satisfy the geometric φ-mixing property,
which is satisfied when the underlying Markov chain is uniformly ergodic or time-homogeneous
with finite-states.
**Regret of Stochastic Convex Optimization** There have been many known regret bounds for online convex optimization problem. Hazan (2017) has built the standard O(√T ) regret bound for on
line SGD algorithm with assuming the bounded gradient. Xiao (2009) introduces the regret bound
of online dual averaging method. To our best knowledge, there is no high-probability guaranteed
regret bound for mini-batch SGD algorithm with considering the data dependence.
2 PROBLEM FORMULATION AND ASSUMPTIONS
In this section, we introduce the problem formulation and some basic assumptions. Consider a
model with parameters w. For any data sample ξ, denote F (w; ξ) ∈ R as the sample loss of this data
sample under the model w. In this paper, we consider the following standard stochastic optimization
problem that has broad applications in machine learning.
min _F_ (w; ξ) _._ (P)
_w_ _[f]_ [(][w][) :=][ E][ξ][∼][µ]
_∈W_
Here, the expectation is taken over the randomness of the data sample  _ξ, which is drawn from an_
underlying distribution µ. In particular, we make the following standard assumptions regarding the
problem (P) (Agarwal & Duchi, 2012).
**Assumption 2.1. The stochastic optimization problem (P) satisfies**
_1. For every ξ, function F_ (·; ξ) is G-Lipschitz continuous over W, i.e., for all w, v ∈W,
_|F_ (w; ξ) − _F_ (v; ξ)| ≤ _G∥w −_ _v∥._
_2. Function f_ ( ) is convex and bounded below, i.e., f (w[∗]) := inf _w_ _f_ (w) > _._
_·_ _∈W_ _−∞_
-----
_3. W is a convex and compact set with bounded diameter R, i.e., supw,v∈W ∥w −_ _v∥≤_ _R._
To solve this stochastic optimization problem, one often needs to query a set of data samples from
the distribution µ to perform optimization. Unlike traditional stochastic optimization that usually
assumes that the data samples are i.i.d. we consider a more general and practical dependent datagenerating process as we elaborate below.
**Dependent data-generating process: We consider a stochastic process P that generates a stream**
of data samples _ξ1, ξ2, ...,_, which are not necessarily independent. In particular, the stochastic
_{_ _}_
process P has an underlying stationary distribution µ. To quantify the dependence of the data
generation process, we introduce the following standard φ-mixing model (Agarwal & Duchi, 2012),
where we denote {Ft}t as the canonical filtration generated by {ξt}t.
**Definition 2.2 (φ-mixing process). Consider a stochastic process {ξt}t with a stationary distribution**
_µas the total variation distance. Then, the process. Let P(ξt+k ∈·|Ft) be the distribution of the (t_ + {ξkt)}-th sample conditioned ont is called φ-mixing if the following mixing Ft, and denote dTV
coefficient φξ( ) converges to 0 as k tends to infinity.
_·_
_φξ(k) :=_ sup 2dTV P(ξt+k _A), µ_ _._
_t∈N,A∈Ft_ _∈·|_

Intuitively, the φ-mixing coefficient describes how fast the distribution of sample ξt+k converges to
the stationary distribution µ when conditioned on the filtration _t, as the time gap k_ . The φ_F_ _→∞_
mixing process can be found in many applications, which involve mixing coefficients that converge
to zero at different convergence rates. Below we mention some popular examples.
- Geometric φ-mixing process. Such a type of process has a geometrically diminishing
mixing coefficient, i.e., φξ(k) ≤ _φ0 exp(−ck[θ]) for some φ0, c, θ > 0. Examples include_
finite-state ergodic Markov chains and some aperiodic Harris-recurrent Markov processes
(Modha & Masry, 1996; Agarwal & Duchi, 2012; Meyn & Tweedie, 2012);
- Algebraic φ-mixing process. Such a type of process has a polynomially diminishing mixing coefficient, i.e., φξ(k) _φ0k[−][θ]_ for some φ0, θ > 0. Examples include a large class of
_≤_
Metropolis-Hastings samplers (Jarner & Roberts, 2002) and some queuing systems (Agarwal & Duchi, 2012).
3 CONVERGENCE OF SGD WITH SUBSAMPLING OVER DEPENDENT DATA
In this section, we study the convergence rate and sample complexity of SGD with data subsampling
update over φ-mixing dependent data. In Section 3.1, we recap the convergence results of the standard SGD over dependent data established in (Agarwal & Duchi, 2012). In Section 3.2, we establish
convergence results of SGD with the data subsampling update.
Throughout, we define the sample complexity as the total number of samples required for the algorithm to output a model w that achieves an ϵ convergence error, i.e., f (w) − _f_ (w[∗]) ≤ _ϵ. Also, the_
standard regret of a stochastic algorithm is defined as
_F_ (w(t); ξt) _F_ (w[∗]; ξt),
_−_
_t=1_
X
(Regret): Rn :=
where the models _w1, w2, ..., wn_ are generated using the data samples _ξ1, ξ2, ..., ξn_, respec_{_ _}_ _{_ _}_
tively, and w[∗] is the minimizer of f (w). For this sequence of models _w1, w2, ..., wn_, we make the
_{_ _}_
following mild assumption, which is satisfied by many SGD-type algorithms.
**Assumption 3.1. There is a non-increasing sequence {κ(t)}t such that ∥w(t + 1) −** _w(t)∥≤_ _κ(t)._
3.1 STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) is a popular and classical algorithm for stochastic optimization.
In every iteration t, SGD queries a sample ξt from the data-generating process and performs the
following update.
(SGD): _w(t + 1) = w(t)_ _ηt_ _F_ (w(t); ξt), (1)
_−_ _∇_
-----
where ηt is the learning rate. In Theorem 2 of (Agarwal & Duchi, 2012), the authors established a
high probability convergence error bound for a generic class of stochastic algorithms. Specifically,
under the Assumptions 2.1 and 3.1, they showed that for any τ N with probability at least 1 _δ,_
_n_ _∈_ _−_
the averaged predictor _wn :=_ _n[1]_ _t=1_ _[w][(][t][)][ satisfies]_
_n_
P 2τ
_f_ (wn) _f_ (w[∗]) b _κ(t) + [2(][τ][ −][1)][GR]_ + 2GR
_−_ _≤_ [R]n[n] [+ (][τ][ −]n[1)][G] _n_ _n_ [log][ τ]δ [+][ φ][ξ][(][τ] [)][GR.][ (2)]
_t=1_ r
X
Here, b Rn is the regret of the algorithm of interest, and τ ∈ N is an auxiliary parameter that is introduced to decouple the dependence of the data samples. From the above bound, one can see that
the optimal choice of τ depends on the convergence rate of the mixing coefficient φξ(τ ). Specifically, consider the SGD algorithm in (1). It can be shown that it achieves the regret Rn = O([√]n)
and satisfies κ(t) = O(1/√t) with a proper diminishing learning rate. Consequently, the above
high-probability convergence bound for SGD reduces to
1 _τ_ 1 _τ_
_f_ (wn) _f_ (w[∗]) _−_ _._ (3)
_−_ _≤O_ _√n + infτ_ _∈N_ _√n +_ r _n_ [log][ τ]δ [+][ φ][ξ][(][τ] [)]
 n o
Such a bound further implies the following sample complexity results of SGD under different con
b
vergence rates of the mixing coefficient φξ.
**Corollary 3.2. The sample complexity of SGD in (1) for achieving an ϵ convergence error over**
_φ-mixing dependent data is given as follows._
1
- If the data is geometric φ-mixing with parameter θ > 0, then we choose τ = (log [1]ϵ [)] _θ_ _._
2 _O_
_The resulting sample complexity is in the order of n = O_ _ϵ[−][2](log_ [1]ϵ [)] _θ_ _._
[]
- If the data is algebraic φ-mixing with parameter θ > 0, then we choose _τ =_ (ϵ[−] _θ[1] ). The_
[]
_O_
_resulting sample complexity is in the order of n = O(ϵ[−][2][−]_ _θ[2] )._
It can be seen that if the data-generating process has a fast geometrically diminishing mixing coefficient, i.e., the data samples are close to being independent from each other, then the resulting
sample complexity is almost the same as that of SGD with i.i.d. samples. On the other hand, if the
data-generating process mixes slowly with an algebraically diminishing mixing coefficient, i.e., the
data samples are highly dependent, then the data dependence increases the sample complexity by
a non-negligible factor of ϵ[−] _θ[2] . In particular, such a factor is substantially large if the mixing rate_
parameter θ is close to zero.
3.2 SGD WITH SUBSAMPLING
When apply SGD to solve stochastic optimization problems over dependent data, the key challenge is that the data dependence introduces non-negligible bias that slows down the convergence
of the algorithm. Hence, a straightforward solution is to reduce data dependence before performing
stochastic optimization. In the existing literature, a simple and useful approach is data subsampling (Nagaraj et al., 2020; Kotsalis et al., 2020). Next, we show that such an approach leads to an
improved convergence bound and sample complexity of SGD over highly dependent data.
Specifically, consider a stream of φ-mixing data samples _ξ1, ξ2, ξ3, . . ._ . Instead of utilizing the
_{_ _}_
entire stream of data, we subsample a subset of this data stream with period r ∈ N and obtain the
following subsampled data stream
_ξ1, ξr+1, ξ2r+1, . . ._ _._
_{_ _}_
In particular, let {Ft}t be the canonical filtration generated by {ξtr+1}t. Since the consecutive
subsampled samples are r time steps away from each other, it is easy to verify that the subsampled
data stream {ξtr+1}t is also a φ-mixing process with mixing coefficient given by φ[r]ξ[(][t][) =][ φ][ξ][(][rt][)][,]
where φ[r]ξ [denotes the mixing coefficient of the subsampled data stream][ {][ξ][tr][+1][}][t][. Therefore, by]
periodically subsampling the data stream, the resulting subsampled process has a faster-converging
mixing coefficient. Then, we apply SGD over such subsampled data, i.e.,
(SGD with subsampling): _w(t + 1) = w(t)_ _ηt_ _F_ (w(t); ξtr+1). (4)
_−_ _∇_
-----
In particular, the convergence error bound in eq. (2) still holds by replacing φξ(τ ) with φξ(rτ ), and
we obtain the following bound for SGD with subsampling.
1 (τ 1) _τ_
_f_ (wn) _f_ (w[∗]) _−_ + _._ (5)
_−_ _≤O_ _√n + infτ_ _∈N_ _√n_ r _n_ [log][ τ]δ [+][ φ][ξ][(][rτ] [)]
 n o
Such a bound further implies the following sample complexity results of SGD with subsampling
b
under different convergence rates of the mixing coefficient φξ.
**Corollary 3.3. The sample complexity of SGD with subsampling in (4) for achieving an ϵ conver-**
_gence error over φ-mixing dependent data is given as follows._
1
- If the data is geometric φ-mixing with parameter θ > 0, then we choose r = (log [1]ϵ [)] _θ_
_O_ 1
_and τ = O(1). The resulting sample complexity is in the order of rn = O_ _ϵ[−][2](log_ [1]ϵ [)] _θ_ _._
[]
- If the data is algebraic φ-mixing with parameter θ > 0, then we choose r = _ϵ[−]_ _θ[1]_ _and_
_O_ []
_τ =_ (1). The resulting sample complexity is in the order of rn = _ϵ[−][2][−]_ _θ[1]_ _._
_O_ _O_
[]
Compare the above sample complexity results with those of the standard SGD in Corollary 3.2,
[] 1
we conclude that data-subsampling can improve the sample complexity by a factor of (log [1]ϵ [)] _θ and_
_ϵ[−]_ _θ[1] for geometric φ-mixing and algebraic φ-mixing data, respectively. Intuitively, this is because_
with data subsampling, we can choose a sufficiently large subsampling period r to decouple the data
dependence in the term φξ(rτ ), as opposed to choosing a large τ in Corollary 3.2. In this way, the
order of the dominant term _n_ [log][ τ]δ [is reduced. Therefore, when the data is highly dependent, it is]
beneficial to subsample the dependent data before performing SGD. We also note another advantage
of using data-subsampling, i.e., it only requires computing the stochastic gradients of the subsampled
[p][ τ]
data, and therefore can substantially reduce the computation load.
4 CONVERGENCE OF MINI-BATCH SGD OVER DEPENDENT DATA
Although the data-subsampling update scheme studied in the previous section helps improve the
sample complexity of SGD, it does not leverage the full information of all the queried data. In
particular, when the data is highly dependent, we need to choose a large period r to reduce data
dependence, and this will throw away a huge amount of valuable samples. In this section, we study
SGD with another popular update scheme that leverages the full information of all the sampled
data, i.e., the mini-batch update scheme. We show that this simple and popular scheme can effectively reduce data dependence without skipping data samples, and can achieve an improved sample
complexity over SGD with subsampling.
Specifically, we consider a data stream {ξt}t with φ-mixing dependent samples. We rearrange the
data samples into a stream of mini-batches {xt}t, where each mini-batch xt contains B samples,
i.e., xt = _ξ(t_ 1)B+1, ξ(t 1)B+2, . . ., ξtB . Then, we perform mini-batch SGD update as follows.
_{_ _−_ _−_ _}_
(Mini-batch SGD): _w(t + 1) = w(t)_ _F_ (w(t); ξ). (6)
_−_ _[η]B[t]_ _∇_
_ξX∈xt_
Performing SGD updates with mini-batch data has several advantages. First, it substantially reduce
the optimization variance and allows to use a large learning rate to facilitate the convergence of the
algorithm. As a comparison, SGD with subsampling still suffers from a large optimization variance.
Second, unlike SGD with subsampling, mini-batch SGD utilizes the information of all the data
samples to improve the performance of the model. Moreover, as we show in the following lemma,
mini-batch update substantially reduces the stochastic bias caused by the data dependence. In the
sequel, we denote F (w; x) := _B1_ _ξ_ _x_ _[F]_ [(][w][;][ ξ][)][ as the average loss on a mini-batch of samples.]
_∈_
With a bit abuse of notation, we also define _t_ _t as the canonical filtration generated by the mini-_
P _{F_ _}_
batch samples _xt_ _t._
_{_ _}_
**Lemma 4.1. Let Assumption 2.1 hold and consider the mini-batch data stream {xt}t. Then, for any**
_w, v ∈W measureable with regard to Ft and any τ ∈_ N, it holds that
_F_ (w; xt+τ ) _F_ (v; xt+τ ) _t_ _f_ (w) _f_ (v)
_−_ _|F_  _−_ _−_  _≤_ _[GR]B_
_φξ(τB + i)._ (7)
_i=1_
X
-----
With dependent data, the above lemma shows that we can approximate the population risk f (w)
by the conditional expectation E[F (w; xt+τ )|Ft], which involves the sample xt+τ that is τ steps
ahead of the filtration _t. Intuitively, by the φ-mixing principle, as τ gets larger, the distribution of_
_F_
_xGRt+τ conditional onB_ _Ft gets closer to the stationary distribution µ. In general, the estimation bias_
_B_ _i=1_ _[φ][ξ][(][τB][ +][ i][)][ depends on both the batch size and the accumulated mixing coefficient over]_
the corresponding batch of samples. To provide a concrete understanding, below we calculate the
estimation bias for several different mixing models.P
- Geometric φ-mixing: In this case, _i=1_ _[φ][ξ][(][τB][ +][ i][)][ ≤]_ [P]i[∞]=1 _[φ][ξ][(][i][) =][ O][(1)][. Hence, the]_
estimation bias is in the order of O( _[GR]B_ [)][.]
[P][B]
- Fast algebraic φ-mixing (θ ≥ 1): In this case, _i=1_ _[φ][ξ][(][τB][ +][ i][)][ ≤]_ [P]i[∞]=1 _[φ][ξ][(][i][) =][ e]O(1)._
Hence, the estimation bias is in the order of _O(_ _[GR]B_ [)][, where][ e]O hides all logarithm factors.
[P][B]
- Slow algebraic φ-mixing (0 < θ < 1): In this case, _i=1_ _[φ][ξ][(][τB][ +][ i][)][ ≤O][((][τB][)][1][−][θ][)][.]_
[e]
Hence, the estimation bias is in the order of ( _[GRτ]B[ 1][θ]_ _[−][θ]_ ).
_O_
[P][B]
It can be seen that if the mixing coefficient converges fast, i.e., either geometrically or fast algebraically, then the data dependence has a negligible impact on the estimation error. Consequently,
choosing a large batch size can substantially reduce the estimation bias. On the other hand, when
the mixing coefficient converges slow algebraically, it substantially increases the estimation bias,
but it is still beneficial to use a large batch size. This result shows that mini-batch update can effectively reduce the statistical bias of stochastic approximation for a wide spectrum of dependent data
generating processes.
We obtain prove the following convergence error bound for mini-batch SGD over dependent data.
**Theorem 4.2. Let Assumption 2.1 and 3.1 hold. Apply mini-batch SGD to solve the stochastic**
_optimization problem (P) over φ-mixing dependent data and assume that it achieves regret Rn._
_Then, for any τ_ N and any minimizer w[∗] _with probability at least 1_ _δ, the averaged predictor_
_n_ _∈_ _−_
_wn :=_ _n[1]_ _t=1_ _[w][(][t][)][ satisfies]_
P _n−τ_ +1
bf (wn) _f_ (w[∗]) _κ(t) +_ _[GR][(][τ][ −]_ [1)]
_−_ _≤_ [R]n[n] [+][ G][(][τ][ −]n [1)] _n_
_t=1_
X
_B_ _B_ [1]
b 1 _τ_ 4
+ _φ(τB + i) +_ _B[−]_ 4[1] + _φ(i)_ _. (8)_
_O_ _nB_ _nB_ [log][ τ]δ [log][ n]δ
 Xi=1 r  h Xi=1 i []
To further understand the order of the above bound, a standard regret analysis shows that mini-batch
_n_
SGD achieves a regret in the order of [R]n[n] [=][ e]( _j=1nB[φ][(][j][)]_ ) and κ(t) ( _Bn_ [)][ (see Theorem][ C.3]
_O_ P _≡O_
for the proof). Consequently, the above convergence error bound reduces to the following bound,q q
where we hide all logarithm factors for simplicity of presentation.
_n_
_j=1_ _[φ][(][j][)]_
_f_ (wn) _f_ (w[∗]) + _[GR][(][τ][ −]_ [1)] (9)
_−_ _≤_ _O[e]_ _nB_ _n_
[sP]
_B_ _B_ [1]
b + [1] _φ(τB + i) +_ _τ_ _B[−]_ 4[1] + _φ(i)_ 4 _._ (10)
_nB_ _nB_
_i=1_ r _i=1_
X  h X i []
Such a bound further implies the following sample complexity results of mini-batch SGD under
different convergence rates of the mixing coefficient φξ.
**Corollary 4.3. The sample complexity of mini-batch SGD in (6) for achieving an ϵ convergence**
_error over φ-mixing dependent data is given as follows._
- If the data is geometric φ-mixing with parameter θ > 0, then we choose τ = 1, B =
_O(ϵ[−][1]), n = O(ϵ[−][1]). The overall sample complexity is nB = O(ϵ[−][2])._
- If the data is fast algebraic φ-mixing with parameter θ ≥ 1, then we choose τ = 1, B =
_O(ϵ[−][1]), n = O(ϵ[−][1]). The overall sample complexity is nB =_ _O[e](ϵ[−][2])._
-----
- If the data is slow algebraic φ-mixing with parameter 0 < θ < 1, then we choose τ =
1, B = O(ϵ[−] _θ[1] ), n = O(ϵ[−][1]). The overall sample complexity is nB = O(ϵ[−][1][−]_ _θ[1] )._
It can be seen that mini-batch SGD can achieve an order-wise lower sample complexity than the SGD
with subsampling in the full spectrum of φ-mixing convergence rate. Specifically, mini-batch SGD
1
improves the sample complexity over that of SGD with subsampling by a factor of O((log [1]ϵ [)] _θ ),_
_O(ϵ[−]_ _θ[1] ) and O(ϵ[−][1]) for geometric φ-mixing, fast algebraic φ-mixing and slow algebraic φ-mixing_
data samples, respectively. This shows that mini-batch update can effectively reduce the bias caused
by data dependence and leverage the full information of all the data samples to improve the learninge
performance.
To intuitively explain, this is because with mini-batch updates, we can choose a sufficiently large
batch size B to reduce the bias caused by the data dependence and choose a small auxiliary parameter τ = 1. As a comparison, to control the bias caused by data dependence, the standard SGD needs
to choose a very large τ and the SGD with subsampling needs to choose a large subsampling period
_r that skips a huge amount of valuable data samples, especially when the mixing coefficient con-_
verges slowly. Therefore, our result proves that it is beneficial to use mini-batch stochastic gradient
updates when the data samples are highly dependent.
We note that our proof of the tight high-probability bound in Theorem 4.2 for mini-batch SGD
involves substantial new developments compared with the proof of (Agarwal & Duchi, 2012). Next,
we elaborate on our technical novelty.
- In (Agarwal & Duchi, 2012), they defined the following random variable
_Xt[i]_ [:=][ f] _w((t_ 1)τ + i) _f_ (w[∗]) + F _w((t_ 1)τ + i); ξt+τ 1 _F_ _w[∗]; ξt+τ_ 1 _._
_−_ _−_ _−_ _−_ _−_ _−_
As this random variable involves only one sample  _ξt+τ_ _−1, they bound the bias term_ 
_Xt[i]_ _t_ _[|F]t[i]_ 1[]][ as a universal constant. As a comparison, the random variable][ X]t[i] [would]
involve a mini-batch of samples[−] [E][[][X] _[i]_ _−_ _xt+τ_ 1 in our analysis. With the mini-batch structure, the
_−_
bias Xt[i] _t_ _[|F]t[i]_ 1[]][ can be written as an average of][ B][ zero-mean dependent random vari-]
ables, which is close to zero with high probability due to the concentration phenomenon.[−][E][[][X] _[i]_ _−_
Consequently, we are able to apply a Bernstein-type inequality developed in (Delyon et al.,
2009) for dependent stochastic process to obtain an improved bias bound from O(1) to
_O(1/√B). This is critical for obtaining the improved bound._
- Second, with the improved high-probability bias bound mentioned above, the remaining
proof of (e Agarwal & Duchi, 2012) no longer holds. Specifically, we can no longer apply
the Azuma’s inequality to bound the accumulated bias _t[(][X]t[i]_ _t_ _[|F]t[i]_ 1[])][, as each bias]
term is no longer bounded with probability one. To address this issue, we developed a gen-[−] [E][[][X] _[i]_ _−_
eralized Azuma’s inequality for martingale differences in Lemma B.3 based on Proposition
[P]
34 of (Tao et al., 2015) for independent zero-mean random variables.
- Third, we develop a high-probability regret bound for mini-batch SGD over dependent data
so that it can be integrated with the high-probability convergence bound in Theorem 4.2.
To our best knowledge, the regret of SGD over dependent data has not been studied before.
5 NUMERICAL EXAMPLE
We examine our theory via a basic convex quadratic optimization problem, which is written as
min (w _ξ)[⊤]A(w_ _ξ)_ _,_
_w_ R[d][ f] [(][w][) :=][ E][ξ][∼][µ] _−_ _−_
_∈_
 
where A ⪰ 0 is a fixed positive semi-definite matrix and µ is the uniform distribution on [0, 1][d].
Then, following the construction in (Jarner & Roberts, 2002), we generate an algebraic φ-mixing
Markov chain that has the stationary distribution µ. In particular, its mixing coefficient φξ(k) converges at a sublinear convergence rate k[−] _r[1], where r > 0 is a parameter that controls the speed of_
convergence. Please refer to Appendix D for more details of the experiment setup.
We first estimate the following stochastic bias at the fixed origin point w = 0d.
(Bias): E _F_ (w; xτ )|x0 = 0d _−_ _f_ (w) _,_
 
-----
where the expectation is taken over the randomness of the mini-batch of samples queried at time
_τ ∈_ N. Such a bias is affected by several factors, including the time gap τ, the batch size B and the
convergence rate parameter r of the mixing coefficient.
In Figure 1, we investigate the impact of these factors on the stochastic bias, and we use 10k Monte
Carlo samples to estimate the stochastic bias. The left two figures fix the batch size, and it can be
seen that the bias decreases as τ increases, which matches the definition of the φ-mixing property.
Also, a faster-mixing Markov chain (i.e., smaller r) leads to a smaller bias. In particular, with batch
size B = 1 and a slow-mixing chain r = 2, it takes an unacceptably large τ to achieve a relatively
small bias. This provides an empirical justification to Corollary 3.2 and explains why the standard
SGD suffers from a high sample complexity over highly dependent data. Moreover, as the batch
size gets larger, one can achieve a numerically smaller bias, which matches our Lemma 4.1. The
right two figures fix the convergence rate parameter of the mixing coefficient, and it can be seen
that increasing the batch size significantly reduces the bias. Consequently, instead of choosing a
large τ to reduce the bias, one can simply choose a large batch size B = 100 and set τ = 1. This
observation matches and justifies our theoretical results in Corollary 4.3.
Figure 1: Impact of τ, batch size B and convergence rate of mixing coefficient on the bias.
We further compare the convergence of SGD, SGD with subsampling and mini-batch SGD. Here, we set r = 2 to generate
highly dependent data samples. We set learning rate η = 0.01
for both SGD and SGD with subsampling, and set learning rate
_η = 0.01_ _B_ _B_
_×_ _j=1_ _[φ][ξ][(][j][)][ = 0][.][01][×][100][1][/][4][ for mini-batch SGD with]_
batch size Bq = 100P, as suggested by Theorem C.3 in the appendix.
The results are plotted in Figure 2, where each curve corresponds
to the mean of 100 independent trails. It can be seen that SGD with
subsampling achieves a lower loss than the standard SGD asymptotically, due to the use of less dependent data. Moreover, mini-batch
SGD achieves the smallest asymptotic loss. All these observations
are consistent with our theoretical results.
Figure 2: Comparison of sample complexity of different
SGD algorithms.
6 CONCLUSION
In this study, we investigate the convergence property of SGD under various popular stochastic
update schemes over highly dependent data. Unlike the conventional i.i.d. data setting in which the
stochastic update schemes do not affect the sample complexity of SGD, the convergence of SGD
in the data-dependent setting critically depends on the structure of the stochastic update scheme. In
particular, we show that both data subsampling and mini-batch sampling can substantially improve
the sample complexity of SGD over highly dependent data. Our study takes one step forward toward
understanding the theoretical limits of stochastic optimization over dependent data, and it opens
many directions for future study. For example, it is interesting to further explore the impact of
algorithm structure on the sample complexity of stochastic reinforcement learning algorithms. Also,
it is important to develop advanced algorithm update schemes that can facilitate the convergence of
learning over highly dependent data.
REFERENCES
Alekh Agarwal and John C Duchi. The generalization ability of online algorithms for dependent
data. IEEE Transactions on Information Theory, 59(1):573–587, 2012.
Alekh Agarwal, Nan Jiang, Kakade Sham M, and Wen Sun. Reinforcement learning: Theory and
algorithms. https://rltheorybook.github.io/, 2021.
-----
L´eon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier
and Gilbert Saporta (eds.), Proc. COMPSTAT, pp. 177–186, 2010.
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finitesample analysis of nonlinear stochastic approximation with applications in reinforcement learning. arXiv:1905.11425, 2019.
Gal Dalal, Bal´azs Sz¨or´enyi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with
function approximation. In Proc. AAAI conference on artificial intelligence, 2018.
Vianney Debavelaere, Stanley Durrleman, and St´ephanie Allassonni`ere. On the convergence of
stochastic approximations under a subgeometric ergodic Markov dynamic. Electronic Journal of
Statistics, 15(1):1583 – 1609, 2021.
Bernard Delyon et al. Exponential inequalities for sums of weakly dependent variables. Electronic
Journal of Probability, 14:752–779, 2009.
Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov chains. Springer, 2018.
John Duchi, Alekh Agarwal, Mikael Johansson, and Michael Jordan. Ergodic subgradient descent.
In Allerton Conference, 2011.
Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for q-learning. Journal of
machine learning Research, 5(1), 2003.
Peter W. Glynn and Sean P. Meyn. A Lyapunov bound for solutions of the Poisson equation. The
Annals of Probability, 24(2):916 – 931, 1996.
Antoine Godichon-Baggioni, Nicklas Werge, and Olivier Wintenberger. Non-asymptotic analysis of
stochastic approximation algorithms for streaming data. arXiv:2109.07117, 2021.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Elad Hazan. Introduction to Online Convex Optimization. Elad Hazan, Erscheinungsort nicht ermittelbar, 2017. ISBN 1521133301.
Søren F Jarner and Gareth O Roberts. Polynomial convergence rates of markov chains. The Annals
of Applied Probability, 12(1):224–247, 2002.
Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time
analysis of linear two-timescale stochastic approximation with markovian noise. In Conference
on Learning Theory, pp. 2144–2203. PMLR, 2020.
Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of
biased stochastic approximation scheme. In Proc. Conference on Learning Theory, pp. 1944–
1974, 2019.
Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic
variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning.
arXiv:2011.02987, 11 2020.
Harshat Kumar, Alec Koppel, and Alejandro Ribeiro. On the sample complexity of actor-critic
method for reinforcement learning with function approximation. ArXiv:1910.08412, 2019.
Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Tightening the dependence on horizon in the sample complexity of q-learning. In ICML, volume 139 of Proceedings
of Machine Learning Research, pp. 6296–6306. PMLR, 2021.
Kurt Marti. Stochastic optimization of regulators. Computers and Structures, 180:40–51, February
2017.
Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with
function approximation. In Proc. International Conference on Machine Learning, pp. 664–671,
2008.
-----
Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science &
Business Media, 2012.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv:1312.5602,
2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Dharmendra S Modha and Elias Masry. Minimum complexity regression estimation with weakly
dependent observations. IEEE Transactions on Information Theory, 42(6):2133–2145, 1996.
Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. In Proc. Advances in Neural
Information Processing Systems, volume 33, 2020.
Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On the finite-time convergence of actorcritic algorithm. In NeurIPS Optimization Foundations for Reinforcement Learning Workshop,
2019.
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation
and q-learning. In Proc. Conference on Learning Theory, pp. 3185–3205, 2020.
Ingo Steinwart and Andreas Christmann. Fast learning from non-iid observations. Advances in
neural information processing systems, 22:1768–1776, 2009.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Bradford, 2018.
Terence Tao, Van Vu, et al. Random matrices: universality of local spectral statistics of nonhermitian matrices. Annals of probability, 43(2):782–874, 2015.
Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global
optimality and rates of convergence. In International Conference on Learning Representations,
2019.
Yue Frank Wu, Weitong ZHANG, Pan Xu, and Quanquan Gu. A finite-time analysis of two
time-scale actor-critic methods. In Proc. Advances in Neural Information Processing Systems
(NeurIPS), volume 33, pp. 17617–17628, 2020.
Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. Proc.
Advances in Neural Information Processing Systems, 22:2116–2124, 2009.
Pan Xu and Quanquan Gu. A finite-time analysis of q-learning with neural network function approximation. In Proc. International Conference on Machine Learning, pp. 10555–10565, 2020.
Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy TD learning: Nonasymptotic analysis over Markovian samples. In Proc. Advances in Neural Information
Processing Systems (NeurIPS), 2019.
Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural)
actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems (NeurIPS),
volume 33, 2020.
Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of
actor-critic: A case for linear quadratic regulator with ergodic cost. In Proc. Advances in Neural
Information Processing Systems (NeurIPS), 2019.
Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear
function approximation. In Proc. Advances in Neural Information Processing Systems, pp. 8665–
8675, 2019.
-----
# Appendix
### Table of Contents
**A Proof of Corollary 3.3** **12**
**B** **Proof of Theorem 4.2** **13**
B.1 Key Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
B.2 Proof of the Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
**C Regret Analysis of Mini-Batch SGD** **20**
**D Experiment Setup** **24**
**Notation: To simplify the notation, throughout the appendix, we denote ξt[(][i][)]** := ξ(t−1)B+i, which
corresponds to the i-th data sample of the t-th mini-batch data xt. With this notation, we have
_xt =_ _ξt[(1)], ξt[(2)], ..., ξt[(][B][)]_ .
_{_ _}_
A PROOF OF COROLLARY 3.3
In this section, we analyze the convergence error bound of the SGD with data-subsampling in (4).
Given a φξ-mixing data stream _ξ1, ξ2, ξ3, . . ._ _, we consider the following subsampled data stream_
_{_ _}_
_ξ1, ξr+1, ξ2r+1, . . ._ _._
_{_ _}_
Let F be the canonical filtration generated by {xt}. Then the subsampled data stream {ξtr+1}t is
_φ[r]ξ[-mixing with the mixing coefficient given by]_
_φ[r]ξ[(][t][) =][ φ][ξ][(][rt][)][.]_
With this mixing coefficient, we can apply Theorem 2 of (Agarwal & Duchi, 2012) and obtain the
following convergence error bound for any τ ∈ N.n
Rn _τ_
_f_ (wn) _f_ (w[∗]) _κ(t) +_ _[τ]_ _._
_−_ _≤O_ _n_ [+ (][τ][ −]n [1)] _n_ [+] _n_ [log][ τ]δ [+][ φ][ξ][(][rτ] [)]
_t=1_ r
 X 
Consider the standard SGD with a diminishing learning rate, we have b _κ(t) = O(_ _√[1]t_ [)][ and][ R][n][ =]
_O([√]n). Then, the convergence error bound becomes_
1 (τ 1) _τ_
_f_ (wn) _f_ (w[∗]) _−_ + _._
_−_ _≤O_ _√n + infτ_ _∈N_ _√n_ r _n_ [log][ τ]δ [+][ φ][ξ][(][rτ] [)]
 n o
The above result further implies the following sample complexity results for different convergence
b
rates of the mixing coefficient.
- Geometric φ-mixing: In this case, φξ(k) (exp( _k[θ])) for some θ > 0. Set the last_
_≤O_ _−_ 1
termτ√−n1 _φ=ξ O(rτ() =ϵ). We obtain that O(ϵ). We obtain that nτ_ _[−][2]_ _rτ= O =(ϵ O[−][2]((log). By choosing[1]ϵ_ [)] _θ ). Further set the second term τ = O(1), the sample_
complexity is in the order of
[1]
_θ_ 2[] _θ_
_ϵ-complexity = r_ _n =_ log [1] _τ 2ϵ−_ = _ϵ[−][2][]_ log ϵ[−][1][][ 1] _._
_·_ _O_ _ϵ_ _O_
- Algebraic φ-mixing: In this case, φξ(k) ≤O (k[−][θ]) for some _θ > 0. Set the last term[]_
_φξ(rτ_ ) = O(ϵ). We obtain that τr = O(ϵ[−] _θ[1] ). Set the second term_ _[τ]√[−]n[1] = O(ϵ). We_
obtain that nτ _[−][2]_ = O(ϵ[−][2]). By setting τ = O(1), the sample complexity is in the order of
_ϵ-complexity = r · n = O(ϵ[−]_ _θ[1] τ_ [2]ϵ[−][2]) = O(ϵ[−][2][−] _θ[1] )._
-----
B PROOF OF THEOREM 4.2
B.1 KEY LEMMAS
In this subsection, we present several useful preliminary results for proving Theorem 4.2. Define
N := {1, 2, 3, . . . }. Throughout this subsection, we assume that Assumption 2.1 holds. The following lemma is a generalization of the Lemma 1 in (Agarwal & Duchi, 2012).
**Lemma B.1. Let w, v be measurable with respect to Ft. Then for any τ ∈** N,
_F_ (w; xt+τ ) _F_ (v; xt+τ ) _t_
_−_ _|F_  _≤_ _[GR]B_
_φξ(τB + i) + f_ (w) _f_ (v).
_−_
_i=1_
X
_Proof. For any τ ∈_ **N, we consider the following decomposition.**
E[F (w; xt+τ ) − _F_ (v; xt+τ )|Ft]
=E[F (w; xt+τ ) − _f_ (w) + f (v) − _F_ (v; xt+τ )|Ft] + f (w) − _f_ (v)
1 _B_ 1
= h _B_ Xi=1 Z _F_ (w; ξt[(]+[i][)]τ [)d][P][(][ξ]t[(]+[i][)]τ _[∈·|F][t][)][ −]_ Z _F_ (w; ξ)dµi _−_ h _B_
(A)
+ f (w) _f_ (v).
| {z
_−_
_F_ (v; ξt[(]+[i][)]τ [)d][P][(][ξ]t[(]+[i][)]τ _F_ (v; x)dµ
_[∈·|F][t][)][ −]_
Z
_i=1_
We can further bound the term (A) using the mixing property of the dependent data stream.
1 _B_
(A) =
_B_
_i=1_
h X
_B_
= [1]
_B_
_i=1_
X
_B_
_≤_ _B[1]_
_i=1_
X
_B_ 1
X=1 Z _F_ (w; ξt[(]+[i][)]τ [)d][P][(][ξ]t[(]+[i][)]τ _[∈·|F][t][)][ −]_ Z _F_ (w; ξ)dµi _−_ h _B_
(F (w; ξ) _F_ (v; ξ))d P(ξt[(]+[i][)]τ
=1 Z _−_ _[∈]_ [d][ξ][|F][t][)][ −] _[µ][(d][ξ][)]_

_GRd_ P(ξt[(]+[i][)]τ
=1 Z _[∈]_ [d][ξ][|F][t][)][ −] _[µ][(d][ξ][)]_
_B_
_φξ(τB + i),_
_i=1_
X
_F_ (v; ξt[(]+[i][)]τ [)d][P][(][ξ]t[(]+[i][)]τ _F_ (v; x)dµ
_[∈·|F][t][)][ −]_
Z
_i=1_
_≤_ _[GR]B_
where in the first inequality we use the facts that F (·; ξ) is G-Lipschitz and the domain is bounded
by R, and the second inequality is implied by the φ-mixing property. Substituting the above upper
bound of (A) into the previous equation yields that
E[F (w; xt+τ ) _F_ (v; xt+τ ) _t]_
_−_ _|F_ _≤_ _[GR]B_
This completes the proof.
_φξ(τB + i) + f_ (w) _f_ (v).
_−_
_i=1_
X
**Proposition B.2. Let {w(t)}t∈N be the model parameter sequence generated by (6). Also suppose**
_that Assumption 3.1 holds. Then for any τ ∈_ N, we have
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
X
_n−τ_ +1
_κ(t) + GR(τ −_ 1).
_t=1_
X
[f (w(t)) _F_ (w(t); xt+τ 1) + F (w[∗]; xt+τ 1) _f_ (w[∗])] + Rn + G(τ 1)
_−_ _−_ _−_ _−_ _−_
_t=1_
X
-----
_Proof. For any τ ∈_ N, we consider the following decomposition,
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
Xn
= [f (w(t)) − _F_ (w(t); xt+τ _−1) + F_ (w[∗]; xt+τ _−1) −_ _f_ (w[∗]) + F (w(t); xt+τ _−1) −_ _F_ (w[∗]; xt+τ _−1)]_
_t=1_
Xn
= [f (w(t)) − _F_ (w(t); xt+τ _−1) + F_ (w[∗]; xt+τ _−1) −_ _f_ (w[∗])] (11)
_t=1_
X
+ _F_ (w(t); xt+τ _−1) −_ _F_ (w[∗]; xt+τ _−1)_ _._
_t=1_
X  
(B)
| {z }
We will keep the first term and bound the term (B).
_F_ (w(t); xt+τ _−1) −_ _F_ (w[∗]; xt+τ _−1)_
_t=1_
X
(B) =
_n_ _n−τ_ +1
= [F (w(t); xt) − _F_ (w[∗]; xt)] + [F (w(t); xt+τ _−1) −_ _F_ (w(t + τ − 1); xt+τ _−1)]_
_t=1_ _t=1_
X X
(B1) (B2)
| _n_ {z } _τ_ _−|1_ _τ_ _−1_ {z _n+τ_ _−1_ }
+ _F_ (w(t); xt+τ _−1) −_ _F_ (w(t); xt) + _F_ (w[∗]; xt) − _F_ (w[∗]; xt) _._
_t=nX−τ_ +2 Xt=1 Xt=1 _t=Xn+1_
(B3)
| {z }
Recall that the term (B1) is the regret Rn. We can bound the term (B2) by noting that
_F_ (w(t); xt+τ _−1) −_ _F_ (w(t + τ − 1); xt+τ _−1) ≤_ _G∥w(t + τ −_ 1) − _w(t)∥_
_τ_ _−2_
_≤_ _G_ _∥w(t + i + 1) −_ _w(t + i)∥_
_i=0_
X
_τ_ _−2_
_≤_ _G_ _κ(t + i)_
_i=0_
X
_≤_ _G(τ −_ 1)κ(t).
For the term (B3), we can bound it using the G-Lipschitzness of F ( ; ξ) and the R-bounded domain.
_·_
_τ_ _−1_
_F_ (w(t); xt) +
_t=1_
X
_τ_ _−1_
_F_ (w[∗]; xt)
_−_
_t=1_
X
_n+τ_ _−1_
_F_ (w[∗]; xt+τ _−1)_
_t=n+1_
X
_F_ (w(t); xt+τ _−1) −_
_t=nX−τ_ +2
_n_ _n+τ_ _−1_ _[τ]_ _[−][1]_
= _F_ (w(t); xt+τ _−1) −_ _F_ (w[∗]; xt+τ _−1)_ _−_ _F_ (w(t); xt) −
h _t=nX−τ_ +2 _t=Xn+1_ i h Xt=1
_n_ _[τ]_ _[−][1]_
_≤_ _G_ _∥w(t) −_ _w[∗]∥_ + G _∥w(t) −_ _w[∗]∥_
h _t=nX−τ_ +2 i h Xt=1 i
_≤_ _GR(τ −_ 1).
_τ_ _−1_
_F_ (w[∗]; xt)
_t=1_
X
-----
Combining the above bounds of (B1), (B2), and (B3), we obtain the upper bound of (B) as follows.
_n_
_F_ (w(t); xt+τ _−1) −_ _F_ (w[∗]; xt+τ _−1)_
_t=1_
X
_n_ _n−τ_ +1
= [F (w(t); xt) − _F_ (w[∗]; xt)] + [F (w(t); xt+τ _−1) −_ _F_ (w(t + τ − 1); xt+τ _−1)]_
_t=1_ _t=1_
X X
(B1) (B2)
| _n_ {z } _τ_ _−|1_ _τ_ {z _n+τ_ _−1_ }
+ _F_ (w(t); xt+τ _−1) −_ _F_ (w(t); xt) + _F_ (w[∗]; xt) − _F_ (w[∗]; xt) _._
_t=nX−τ_ +2 Xt=1 Xt=1 _t=Xn+1_
(B3)
| _n−τ_ +1 {z }
_≤_ _Rn + G(τ −_ 1) _κ(t) + GR(τ −_ 1).
_t=1_
X
Then the proof is completed by substituting the upper bound of (B) into (11).
The following generalized Azuma’s inequality generalizes the Proposition 34 of (Tao et al., 2015).
The inequality can be used to bound sum of martingale difference random variables.
**Lemma B.3 (Generalized Azuma’s Inequality). Let {Xt} be a martingale difference sequence with**
_respect to its canonical filtration F. Define Y =_ _i=1_ _[X][i][ and assume][ E][|][Y][ |][ <][ ∞][. Then for any]_
_{αt}t > 0,_
[P][T]
_T_
_αt[2]_ _≤_ 2 exp _−_ _[λ]2[2]_
_t=1_ 
X
_Y_ EY _λ_
| _−_ _| ≥_
P(|Xt| ≥ _αt)._
_t=1_
X
_Proof. Let T := min{t : |Xt| > αt}. Then the sets Bt := {ω : T (ω) = t} are disjoint. Construct_
_C_
_Y (ω)_ if ω _Tt=1_ _[B][t]_ _,_
_∈_
E[Y _Bt]_ if ω _B St for all t_ 1, 2, . . ., T _._
_|_ _∈_ _∈{_ _}_
_Y_ _[′](ω) :=_
By the above construction, the associated Doob martingale oft∧T _Y_ _[′]_ with respect to F is {Zt :=
_i=1_ _[X][i][}][. It satisfies the conditions of Azuma’s inequality, i.e.,]_
P
- _Zt_ forms a martingale with respect to (because the stopped martingale is still a mar_{_ _}_ _F_
tingale).
- |Zt − _Zt−1| ≤_ _αt._
Then we can apply Azuma’s inequality to Y _[′]._
_T_
_αt[2]_ _≤_ 2 exp _−_ _[λ]2[2]_
_t=1_ 
X
|Y _[′]_ _−_ EY _[′]| ≥_ _λ_
-----
Now we can bound P _|Y −_ EY | ≥ _λ_ _Tt=1_ _[α]t[2]_
 qP
as follows.
_Y_ EY _λ_
| _−_ _| ≥_
_Y_ EY _λ_
| _−_ _| ≥_
_αt[2]_
_t=1_
X
_T_
_αt[2][, Y][ =][ Y][ ′]_
_t=1_
X
=P
_≤P_
+ P
_Y_ EY _λ_
| _−_ _| ≥_
_αt[2][, Y][ ̸][=][ Y][ ′]_
_t=1_
X
_≤P_ |Y _[′]_ _−_ EY _[′]| ≥_ _λv_ _αt[2]_ + P (Y ̸= Y _[′])_
ut=1
uX
 _T_ t 
2 exp + P( _Xt_ _αt)._
_≤_ _−_ _[λ]2[2]_ _|_ _| ≥_
  _t=1_
X
Then the proof is completed. Here we notice the fact that EY _[′]_ = EY by our construction.
The following lemma is taken from (22), Theorem 4 of (Delyon et al., 2009).
**Lemma B.4 (Bernstein’s Inequality for Dependent Process). Let {Zt} be a centered adaptive pro-**
_cess with respect to F. Define the following quantities._
_k−1_
_Zi_ E[Zk _i]_ _,_
_∥_ _∥∞_ _· ∥_ _|F_ _∥∞_
_i=1_
X
_∥E[Zk[2][|][Z][k][−][1][, . . ., Z][1][]][∥][∞][,]_
_q =_
_v =_
_k=1_
_m = sup1≤i≤n_ _∥Zi∥∞._
_Then, it holds that_
_n_
_t[2]_
P _Zi_ _t_ exp _._
 Xi=1 _≥_  _≤_ − 2(v + 2q) + 2tm/3 
**Application of Lemma B.4 to our proof. Here we make some comments about how to apply this**
inequality in our main proof. We define the following random variable in our proof. Throughout, we
use the batch-level filtration F and the intra-batch level filtration _F. The formal definition is given_
in Section B.2.
_Xt[i]_ [=][ f] _w((t_ 1)τ + 1) _f_ (w[∗]) + F (w[∗]; xtτ +i 1) _F_ [b]w((t 1)τ + 1); xtτ +i 1 _._
_−_ _−_ _−_ _−_ _−_ _−_
We also define the filtration _t[:=][ F][tτ]_ [+][i][−][1] [for simplicity. Then, we have] 
_F_ _[i]_
E[Xt[i][|F]t[i]−1[] =][ f] _w((t −_ 1)τ + 1) _−_ _f_ (w[∗]) + E _F_ (w[∗]; xtτ +i−1) − _F_ _w((t −_ 1)τ + 1); xtτ +i−1 _|Ft[i]−1_
Then, the bias can be rewritten as   
_Xt[i]_ _t_ _[|F]t[i]_ 1[]]
_[−]_ [E][[][X] _[i]_ _−_
=F (w[∗]; xtτ +i−1) − _F_ (w((t − 1)τ + 1); xtτ +i−1) − E _F_ (w[∗]; xtτ +i−1) − _F_ _w((t −_ 1)τ + 1); xtτ +i−1 _|Ft[i]−1_
= [1] _Yt[i][(][ξ][)][,]_  
_B_
_ξ∈xXtτ_ +i−1
where Yt[i] [is defined as]
_Yt[i][(][ξ][) =][ F]_ [(][w][∗][;][ ξ][)][ −] _[F]_ _w((t_ 1)τ + 1); ξ E _F_ (w[∗]; ξ) _F_ _w((t_ 1)τ + 1); ξ _t_ 1 _._
_−_ _−_ _−_ _−_ _|F_ _[i]−_
   
-----
More specifically, we have
_Xt[i]_ _t_ _[|F]t[i]_ 1[] = 1]
_[−]_ [E][[][X] _[i]_ _−_ _B_
_Yt[i][(][ξ][) = 1]_
_B_
_ξ∈xXtτ_ +i−1
_Yt[i][(][ξ]tτ[(][j][)]+i_ 1[)][.]
_−_
_j=1_
X
Recall that _F is the canonical filtration generated from the data stream (12)._ Moreover,
_Yt[i][(][ξ]tτ[(][j][)]+i_ 1[)][}][j][=1][,][2][,...,B][ is centered and adaptive with respect to this filtration. Then we can evalu-]
_{_ _−_
ate the quantitiesb _q, v, and m in Lemma B.4 as follows._
- Bounding m is simple. By Assumption 2.1 we have _Yt[i][(][ξ]tτ[(][j][)]+i_ 1[)][∥≤] [2][GR][.]
_∥_ _−_
- The above bound of m leads to a simple bound for v, i.e., v ≤ 2nG[2]R[2].
- The quantity q can be bounded as follows.
_n_ _k−1_
_q :=_ _Yt[i][(][ξ]tτ[(][j][)]+i_ 1[)][∥][∞][∥][E][[][Y][ i]t [(][ξ]tτ[(][k]+[)] _i_ 1[)][|][ b]tτ +i 1[]][∥][∞]
_∥_ _−_ _−_ _F_ [(][j][)] _−_
_k=1_ _j=1_
X X
_n_ _k−1_
2GR E[Yt[i][(][ξ]tτ[(][k]+[)] _i_ 1[)][|][ b]tτ +i 1[]][∥][∞]
_≤_ _∥_ _−_ _F_ [(][j][)] _−_
_k=1_ _j=1_
X X
_n_ _k−1_
= 2GR E[Yt[i][(][ξ]tτ[(][k]+[)] _i_ 1[)][|][ b]tτ +i 1[]][ −] [E][ξ][∼][µ][Y][ i]t [(][ξ]tτ[(][k]+[)] _i_ 1[)][∥][∞]
_∥_ _−_ _F_ [(][j][)] _−_ _−_
_k=1_ _j=1_
X X
_n_ _k−1_
4G[2]R[2] _φξ(k_ _i)_
_≤_ _−_
_k=1_ _i=1_
X X
_≤_ 4G[2]R[2]n
_φξ(i)._
_i=1_
X
Then, by applying Lemma B.4, we obtain the following high-probability bound.
_B[2]t[2]_
P _|Xt[i]_ _[−]_ [E][[][X]t[i][|F]t[i]−1[]][| ≥] _[t]_ _≤_ 2 exp _−_ 2(v + 2q) + 2Btm/3
 

_B[2]t[2]_
2(2G[2]R[2]B + 8G[2]R[2]B _i=1_ _[φ][ξ][(][i][)) + 4][GRBt/][3]_
_Bt[2]_
[P][B] _._
2(2G[2]R[2] + 8G[2]R[2][ P][B]i=1 _[φ][ξ][(][i][)) + 4][GRt/][3]_ !
_≤_ 2 exp
= 2 exp
Simplifying yields that
_Bt[2]_
_C +_ [4]3 _[GRt][ + 16][G][2][R][2][ P]i[B]=1_ _[φ][ξ][(][i][)]_
P _Xt[i]_ _t_ _[|F]t[i]_ 1[]][| ≥] _[t]_ 2 exp
_|_ _[−]_ [E][[][X] _[i]_ _−_ _≤_

where C := 4G[2]R[2].
B.2 PROOF OF THE MAIN RESULT
Recall that we are considering a data stream divided into small mini-batches. For convenience, we
re-label the data stream _ξ1, ξ2, ξ3, . . ._ as follows to explicitly indicate its mini-batch index.
_{_ _}_
_{ξ1[(1)][, ξ]1[(2)][, . . ., ξ]1[(][B][)], ξ2[(1)][, ξ]2[(2)][, . . ., ξ]2[(][B][)], . . . }._ (12)
The canonical filtration generated by the re-labeled data stream is denoted by _F. Also, when the_
batch size is clear in the context, we denote the data in the specified mini-batch as x. For example,
[b]
-----
we use xt to represent the t-th mini-batch {ξt[(1)], ξt[(2)], . . ., ξt[(][B][)]}. Then we can re-writhe the above
data stream as
_x1, x2, x3. . . ._ _._
_{_ _}_
We denote the canonical filtration generated by the above sequence as F. Note that we have the
following relation:
_t =_ _t_ _._
_F_ _F_ [(][B][)]
In summary, when we analyze the mini-batch SGD dynamics, we use the filtration[b] _F, and when we_
need to consider intra-batch samples, we use the filtration _F._
**Theorem B.5. Let {w(t)}t∈N be the model parameter sequence generated by (6). Suppose Assump-**
_tions 2.1 and 3.1 hold. Then, for any τ ∈_ N, with probability at least[b] 1 − _δ, we have_
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
X
_GR [n]_
_≤_ _B_
_φξ(τB + i)_
_i=1_
X
2
3

_G[2]R[2]_
v
u _B_
u
u
t [2][τn]
_GR_
_B_ [log 4]δ[n] [+]
v 9
u
u
t [4]
(log [4][n] 4G[2]R[2] + 16G[2]R[2]
_δ_ [)][2][ +]
_B_
_φξ(i)_ log [4][n]
_δ_
_i=1_
X 
log [4][τ]
_·_ _δ_ [log 4]δ[n]

log [4][τ]
_·_ _δ_ [log 4]δ [n] _[.]_
_n−τ_ +1
_κ(t) + GR(τ −_ 1).
_t=1_
X
+ Rn + G(τ − 1)
_In particular, if τ = 1, then_
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
X
Rn + GR [n]
_≤_ _B_
_φξ(B + i)_
_i=1_
X
2
v
u _B_ 3
u _[·]_ 
u
t [2][n]
_G[2]R[2]_
_GR_
_B_ [log 4]δ[n] [+]
v 9
u
u
t [4]
(log [4][n] 4G[2]R[2] + 16G[2]R[2]
_δ_ [)][2][ +]
_B_
_φξ(i)_ log [4][n]
_δ_
_i=1_
X 
_Proof. From Proposition B.2, we obtain the following bound._
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
X
_n_ _n−τ_ +1
[f (w(t)) _F_ (w(t); xt+τ 1) + F (w[∗]; xt+τ 1) _f_ (w[∗])] + Rn + G(τ 1) _κ(t) + GR(τ_ 1).
_≤_ _−_ _−_ _−_ _−_ _−_ _−_
_t=1_ _t=1_
X X
To complete the proof, it suffices to bound the first term; we define this term as
_Zn :=_ [f (w(t)) _F_ (w(t); xt+τ 1) + F (w[∗]; xt+τ 1) _f_ (w[∗])].
_−_ _−_ _−_ _−_
_t=1_
X
We apply the same decomposition as the (13) of (Agarwal & Duchi, 2012). Define the index set
(i) as 1, . . ., _τ_ _τ_ _τ_
_I_ _{_ _⌊_ _[n]_ _[⌋]_ [+ 1][}][ for][ i][ ≤] _[n][ −]_ _[τ]_ _[⌊]_ _[n]_ _[⌋]_ [and][ {][1][, . . .,][ ⌊] _[n]_ _[⌋}][ otherwise. Then we have]_
[Xt[i] _t_ _[|F]t[i]_ 1[]] +]
_t∈IX(i)_ _[−]_ [E][[][X] _[i]_ _−_
E[Xt[i][|F]t[i] 1[]][,]
_−_
_t∈IX(i)_
_Zn =_
_i=1_
_i=1_
-----
where
_Xt[i]_ [=][ f] _w((t_ 1)τ + 1) _f_ (w[∗]) + F (w[∗]; xtτ +i 1) _F_ _w((t_ 1)τ + 1); xtτ +i 1 _._
_−_ _−_ _−_ _−_ _−_ _−_
Note that by Lemma 4.1, we have that E[Xt[i][|F]t[i] 1[]][ ≤] _[GR]B_ _Bi=1_ _[φ][ξ][(][τB][ +][ i][)][. Then, we have]_
_−_
_B_ _τ_
P
P _Zn > [nGR]_ _φξ(τB + i) + γ_ P( [Xt[i] _t_ _[|F]t[i]_ 1[]]][ > γ][)]
_B_ _i=1_ _≤_ _i=1_ _t_ (i) _[−]_ [E][[][X] _[i]_ _−_
 X  X _∈IX_
_τ_
P [Xt[i] _t_ _[|F]t[i]_ 1[]]][ > γ]
_≤_ _i=1_ _t_ (i) _[−]_ [E][[][X] _[i]_ _−_ _τ_
 [ n X∈I o
_τ_
P [Xt[i] _t_ _[|F]t[i]_ 1[]]][ > γ] _._
_≤_ _i=1_ _t_ (i) _[−]_ [E][[][X] _[i]_ _−_ _τ_
X  X∈I 
Define Y := _t∈I(i)[[][X]t[i]_ _[−]_ [E][[][X]t[i][|F]t[i]−1[]]][ and][ α][ :=] _√λB_ [. Notice that][ X]t[i] _[−]_ [E][[][X]t[i][|F]t[i]−1[]][ is a centered]
random variable, that is, E[Xt[i] _t_ _[|F]t[i]_ 1[]] = 0][. Then by the generalized Azuma’s inequality]
(Lemma B.3), we conclude that[P] _[−]_ [E][[][X] _[i]_ _−_
_γ[2]_
2 exp
_≤_ − 2τ [2][ n]τ _[α][2]_
_Y_
_≥_ _[γ]τ_
_t=1_ P(|Xt[i] _[−]_ [E][[][X]t[i][|F]t[i]−1[]][| ≥] _[α][)][.]_
X
The second term can be bounded by using the generalized Bernstein’s inequality. The detailed
calculation can be found in the discussion after Lemma B.4. We obtain that
_λ[2]_
P _|Xt[i]_ _[−]_ [E][[][X]t[i][|F]t[i]−1[]][| ≥] _[α]_ _≤_ 2 exp _−_ _C +_ [4]3 _[GR][ λ]√B_ [+ 16][G][2][R][2][ P]i[B]=1 _[φ][ξ][(][i][)]_ ! _,_

where C = 4G[2]R[2]. In summary, the concentration bound for Zn is
P _Zn > GR [n]_ _φξ(τB + i) + γ_
 X 
_γ[2]_
2τ exp
_≤_ − 2τ [2][ n]τ _[α][2]_
_t=1_ P(|Xt[i] _[−]_ [E][[][X]t[i][|F]t[i]−1[]][| ≥] _[α][)]_
X
+ τ
_γ[2]_ _λ[2]_
_≤_ 2τ exp _−_ 2τn _[λ]B[2]_ ! + 2n exp _−_ _C +_ [4]3 _[GR][ λ]√B_ [+ 16][G][2][R][2][ P]i[B]=1 _[φ][ξ][(][i][)]_
Then, let 2[δ] [= 2][n][ exp] _−_ _C+_ 3[4] _[GR]_ _√λB_ [+16]λ[G][2] [2][R][2][ P]i[B]=1 _[φ][ξ][(][i][)]_, and we obtain that
 _B_
_λ[2]_ = _C + [4]_ + 16G[2]R[2] _φξ(i)_ log [4][n]
3 _[GR λ]√B_ _i=1_ _·_ _δ [.]_
 X 
It is a quadratic function of λ. Solving it yields that
2
+ _C + 16G[2]R[2]_
 
_GR_ _G[2]R[2]_
_λ = [2]_ log [4][n]
3 _B_ [log 4]δ[n] [+] v 9 _B_ _δ_
u
u 
t [4]
Also, let 2[δ] [= 2][τ][ exp] _−_ 2τnγ[2][λ][2], we have that
 
_λ = [2]_
_GR_
_B_ [log 4]δ[n] [+]
_B_
_φξ(i)_ log [4][n] (13)
_δ [.]_
_i=1_
X 
_γ[2]_ = 2τn _[λ][2]_
_B_ _[·][ log 4]δ [τ]_ _[.]_
Substituting (13) into the above equation, we obtain that
2
3

2
+ _C + 16G[2]R[2]_
 
_G[2]R[2]_
v
u _B_
u
u
t [2][τn]
_GR_
_B_ [log 4]δ[n] [+] v 9
u
u
t [4]
log [4][n]
_B_
_φξ(i)_ log [4][n]
_δ_
_i=1_
X 
log [4][τ]
_·_ _δ_ [log 4]δ [n] _[.]_
_γ =_
-----
Then, we conclude that with probability at least 1 − _δ,_
_n_
[f (w(t)) − _f_ (w[∗])]
_t=1_
X
_GR [n]_
_≤_ _B_
_φξ(τB + i)_
_i=1_
X
2
3

2
+ 4G[2]R[2] + 16G[2]R[2]
 
_G[2]R[2]_
u _B_
u
u
t [2][τn]
_GR_
_B_ [log 4]δ[n] [+]
v 9
u
u
t [4]
log [4][n]
_B_
_φξ(i)_ log [4][n]
_δ_
_i=1_
X 
log [4][τ]
_·_ _δ_ [log 4]δ[n]
_n−τ_ +1
_κ(t) + GR(τ −_ 1). (14)
_t=1_
X
+ Rn + G(τ − 1)
The desired result follows by noting that _t=1_ _[f]_ [(][w][(][t][))][ ≥] _[nf]_ [(][ b]wn).
C REGRET ANALYSIS OF MINI[P]-B[n]ATCH SGD
In this section, we derive the regret bound of mini-batch SGD algorithm. Throughout, for each sample loss F (w; ξ), recall that its gradient ∥∇F (w; ξ)∥ is uniformly bounded by G (see Assumption
2.1). In particular, we assume the k-th coordinate of _F_ (w; ξ) is uniformly bounded by Gk, and
_∇_
we have G = _k_ _[G]k[2][.]_
**1. Gradient Variance Bound under Dependent DatapP**
In the i.i.d. setting, the variance of stochastic gradient decreases as the batch size increases. Specifically, we have
_B_
E∥∇F (w; ξi) −∇f (w)∥[2] _≤_ [2]B [G][2] _[.]_
_i=1_
X
E
_∥_ _B[1]_
_B_
_F_ (w; ξi) _f_ (w) = [1]
_∇_ _−∇_ _∥[2]_ _B[2]_
_i=1_
X
Therefore,the data samples are dependent. In the following lemma, we develop a similar result when the data E∥ _B[1]_ _Bi=1_ _[∇][F]_ [(][w][;][ ξ][i][)][ −∇][f] [(][w][)][∥][2][ =][ O][(][ 1]B [)][. However, this bound no longer holds if]
is collected from a dependent stochastic process. Recall thatP _F_ (w(t); xt) denotes the averaged
_∇_
gradient over the mini-batch xt, i.e.,
_F_ (w(t); xt) = [1]
_∇_ _B_
_F_ (w(t); ξt[(][i][)][)][.]
_i=1_
X
**Lemma C.1. Let {w(t)}t∈N be the model parameter sequence generated by the mini-batch SGD in**
_(6). Let Assumptions 2.1 and 3.1 hold. Then, with probability at least 1 −_ _δ,_
268 _B_ _δ_ _Bi=1_ _[φ][ξ][(][i][)]_ 2
_F_ (w(t); xt) _f_ (w(t)) _φξ(j)_ [log][ 2][d] + 2G[2] _._
_∥∇_ _−∇_ _∥[2]_ _≤_ 3 _[G][2][ + 256][G][2]_ _j=1_ _·_ _B_ P _B_ !
h X i
_Proof. Let xt = {ξt[(][i][)][}]i[B]=1_ [be the][ t][-th mini-batch samples. We consider the filtration within][ x][t][ and]
denote it as _t_
_{F_ [(][i][)][}][. Then, by the definition of canonical filtration,]
_Xi :=_ _F_ (w(t); ξt[(][i][)][)]
[b] _∇_
is measurable with respect to _Ft[(][i][)][. Define]_
_Yi,k := (Xi_ E[Xi _t_ 1])k
[b] _−_ _|F_ _−_
where (·)k denotes the k-th entry of the specified vector. And it is easy to see that {Yi,k}i is a
centered process for any k ∈{1, 2, . . ., d}. With these construction, we start from the following
-----
decomposition.
_F_ (w(t); xt) _f_ (w(t))
_∥∇_ _−∇_ _∥[2]_
= _F_ (w(t); xt) E[ _F_ (w(t); xt) _t_ 1] + E[ _F_ (w(t); xt) _t_ 1] _f_ (w(t))
_∥∇_ _−_ _∇_ _|F_ _−_ _∇_ _|F_ _−_ _−∇_ _∥[2]_
2 _F_ (w(t); xt) E[ _F_ (w(t); xt) _t_ 1] +2 E[ _F_ (w(t); xt) _t_ 1] _f_ (w(t))
_≤_ _∥∇_ _−_ _∇_ _|F_ _−_ _∥[2]_ _∥_ _∇_ _|F_ _−_ _−∇_ _∥[2]_
(A) (B)
Then we will bound the term (A) and (B), respectively.| {z } | {z }
- Bounding (A): Note that
_F_ (w(t); xt) E[ _F_ (w(t); xt) _t_ 1] = [1]
_∥∇_ _−_ _∇_ _|F_ _−_ _∥[2]_ _B[2][ ∥]_
[Xi E[Xi _t_ 1]]
_i=1_ _−_ _|F_ _−_ _∥[2]_
X
_d_ _B_
= [1] (Xi E[Xi _t_ 1])k 2
_B[2]_ _k=1_ _i=1_ _−_ _|F_ _−_
X h X i
_d_ _B_
2
= [1] _Yi,k_ _._
_B[2]_
_k=1_ _i=1_
X h X i
Then, we show that the process {Yi,k}i satisfies the conditions of Lemma B.4.
Since E[Yi,k _t_ 1] = 0, we conclude that _Yi,k_ _i is a centered process._
_◦_ _|F_ _−_ _{_ _}_
_◦_ Denote the k-th entry of Xi as Xi,k. We know that |Xi,k| ≤ _Gk. Hence, we conclude_
that 0 ≤|Yi,k| ≤ 2Gk. Then, we can set bi = 2Gk for all i.
_◦_ Lastly, we can bound the quantity q defined in Lemma B.4 as follows.
_B_ _j−1_
_q_ 2Gk E[Yj,k _t_ []][∥] [+ 4] _k[B]_
_≤_ _∥_ _|F_ [(][i][)] 3 _[G][2]_
_j=1_ _i=1_
X X
_B_ _j−1_
4G[2]k _φξ(j_ _i) + [4]_ _k[B]_
_≤_ _−_ 3 _[G][2]_
_j=1_ _i=1_
X X
_B_
_φξ(j) + [4]_ _k[B.]_
3 _[G][2]_
_j=1_
X
_≤_ 4G[2]k[B]
Now, we can apply Lemma B.4 and obtain that
_λ[2]_
134
3 _[G]k[2][B][ + 128][G][2]k[B][ P]j[B]=1_ _[φ][ξ][(][j][)]_
_Yi,k > λ_
_≤_ exp
With a union bound, we obtain that
_λ[2]_
P _Yi,k_ _> λ_ 2 exp 134 _._
_i_ ! _≤_ _−_ 3 _[G]k[2][B][ + 128][G][2]k[B][ P]j[B]=1_ _[φ][ξ][(][j][)]_ !
[X]
Further applying the union bound over k = 1, 2, . . ., d, we obtain that
_d_
P _Yi,k_ _> λ[2]k_ 2 exp 134 _λ[2]k_
_≤_ _−_
_k=1_ _i_ _k_ 3 _[G]k[2][B][ + 128][G]k[2][B][ P]j[B]=1_ _[φ][ξ][(][j][)]_
[ n [X] [2] o[!] X
Let _d[δ]_ [= 2 exp] _−_ 1343 _[G]k[2]_ _[B][+128][G]λ[2]k[2]k[B][ P]j[B]=1_ _[φ][ξ][(][j][)]_, we obtain that
 
_B_
134
_λ[2]k_ [=] _k[B][ + 128][G][2]k[B]_ _φξ(j)_ log [2][d]
3 _[G][2]_ _·_ _δ [.]_
_j=1_
h X i
-----
Then we conclude that,
_d_
134
P _Yi,k_ _k[B][ + 128][G][2]k[B]_
 _≤_ 3 _[G][2]_
_k=1_ _i_
 \ n [X] [2] h
It implies that with the probability at least 1 − _δ,_
_B_
_φξ(j)_ log [2][d]
_·_ _δ_
_j=1_
X i
1 _δ._
 _≥_ _−_
_B_
134
_Yi,k_ _≤_ 3 _[B]_ _G[2]k_ + 128B _φξ(j)_ _G[2]k_ _· log [2]δ [d]_ _[.]_
_k_ _i_ _k_ _j=1_ _k_
X [X] [2] h  X   X  X i
By definition, G = _k_ _[G]k[2][. Finally, we have the following bound for term (A): with]_
probability at least 1 −pPδ,
134
_F_ (w(t); xt) E[ _F_ (w(t); xt) _t_ 1]
_∥∇_ _−_ _∇_ _|F_ _−_ _∥[2]_ _≤_ 3 _[G][2][ + 128][G][2]_
h
_B_
_φξ(j)_ [log][ 2]δ[d]
_·_ _B_
_j=1_
X i
- Bounding (B): Note that
E[ _F_ (w(t); ξt[(][i][)][)][|F][t][−][1][]][ −∇][f] [(][w][(][t][))][∥] [=] _F_ (w(t); ξt[(][i][)][)d][P][(][ξ]t[(][i][)] _t_ 1) _F_ (w(t); ξ)dµ(ξ)
_∥_ _∇_ _∇_ _∈·|F_ _−_ _−_ _∇_
Z Z
_F_ (w(t); ξt[(][i][)][)][∥|][d][P][(][ξ]t[(][i][)] _t_ 1) dµ
_≤_ _∥∇_ _∈·|F_ _−_ _−_ _|_
Z
_G_ _φξ(i)._
_≤_ _·_
Then we bound the norm by triangle inequality,
E[ _F_ (w(t); xt) _t_ 1] _f_ (w(t))
_∥_ _∇_ _|F_ _−_ _−∇_ _∥≤_ _B[1]_
_≤_ _B[G]_
_∥E[∇F_ (w(t); ξt[(][i][)][)][|F][t][−][1][]][ −∇][f] [(][w][(][t][))][∥]
_i=1_
X
_B_
_φξ(i)._
_i=1_
X
Finally, we obtain the bound for the term (B) as
_B_
_i=1_ _[φ][ξ][(][i][)]_
E[ _F_ (w(t); xt) _t_ 1] _f_ (w(t)) _G[2]_
_∥_ _∇_ _|F_ _−_ _−∇_ _∥[2]_ _≤_ _B_
P
2
!
Combing the bounds of (A) and (B) yields that with probability at least 1 − _δ,_
268 _B_ _δ_ _Bi=1_ _[φ][ξ][(][i][)]_
_F_ (w(t); xt) _f_ (w(t)) _φξ(j)_ [log][ 2][d] + 2G[2]
_∥∇_ _−∇_ _∥[2]_ _≤_ 3 _[G][2][ + 256][G][2]_ _·_ _B_ _B_
_j=1_ P
h X i
2
!
**2. High-Probability Regret Bound**
To derive the regret bound for the mini-batch SGD algorithm, we make the following additional
mild assumption.
**Assumption C.2. The stochastic optimization problem (P) satisfies**
_1. Each sample loss F_ (·; ξ) : W → R is convex.
_2. The objective function f : W →_ R is L-smooth.
-----
**Theorem C.3 (High-probability regret bound). Let {w(t)}t∈N be the model parameter sequence**
_generated by the mini-batch SGD in (6). Suppose Assumptions C.2, 3.1 and 2.1 hold. Then, with_
_probability at least 1 −_ _δ,_
268 _B_ log 2dTδ _i=1_ _[φ][ξ][(][i][)]_ 2
RT + ηT _φξ(j)_ + 2G[2][P][B]
_≤_ _[∥][w][(1)]2[ −]η_ _[w][∗][∥][2]_ 3 _[G][2][ + 256][G][2]_ _j=1_ _B_ _B_
h X   i
_f_ (w(t)) − _f_ (w[∗])
+ 2ηL
_t=1_
_Moreover, let η = O(_
_Moreover, let η =_ ( _B_
_O_ _T ·[P]j[B]=1_ _[φ][ξ][(][j][)]_ [)][, the optimized upper bound is in the order of]
q
RT = _T ·_ _j=1_ _[φ][ξ][(][j][)]_ + 2ηL _T_ _f_ (w(t)) _f_ (w[∗]) _._
_O_ s _B_ _−_
_t=1_
 [P][B]  X 
[e]
_Proof. For convenience, we define gt =_ _B[1]_ _Bi=1_ _t_ [)][. By the algorithm update (][6][), we]
obtain that _[∇][F]_ [(][w][(][t][);][ ξ][(][i][)]
P
2 _gt, w(t)_ _w[∗]_ + η _gt_
_⟨_ _−_ _⟩≤_ _[∥][w][(][t][)][ −]_ _[w][∗][∥][2][ −∥]η[w][(][t][ + 1)][ −]_ _[w][∗][∥][2]_ _∥_ _∥[2]_
+ 2η _gt_ _f_ (w(t)) + 2η _f_ (w(t)) _._
_≤_ _[∥][w][(][t][)][ −]_ _[w][∗][∥][2][ −∥]η[w][(][t][ + 1)][ −]_ _[w][∗][∥][2]_ _∥_ _−∇_ _∥[2]_ _∥∇_ _∥[2]_
Summing the above inequality over t yields that
_T_
2 _gt, w(t)_ _w[∗]_
_⟨_ _−_ _⟩_
_t=1_
X
_≤_ _[∥][w][(1)][ −]_ _[w][∗][∥][2][ −∥]η[w][(][T][ + 1)][ −]_ _[w][∗][∥][2]_
_gt_ _f_ (w(t)) + 4ηL
_t=1_ _∥_ _−∇_ _∥[2]_
X
(f (w(t)) − _f_ (w[∗])).
_t=1_
X
+ 2η
By convexity of the function, we further obtain that
_T_
2 (F (w(t); xt) _F_ (w[∗]; xt))
_−_ _≤_ _[∥][w][(1)][ −]η_ _[w][∗][∥][2]_
_t=1_
X
_gt_ _f_ (w(t)) + 4ηL
_t=1_ _∥_ _−∇_ _∥[2]_
X
(f (w(t)) − _f_ (w[∗])).
_t=1_
X
+ 2η
Then, we apply Lemma C.1 to bound the second term _t=1_
union bound on over t. We conclude that, with probability at least[∥][g][t][ −∇] 1 − _δ[f],[(][w][(][t][))][∥][2][ and then apply a]_
_T_ [P][T]
(F (w(t); xt) _F_ (w[∗]; xt))
_−_
_t=1_
X
268
+ ηT
_≤_ _[∥][w][(1)]2[ −]η_ _[w][∗][∥][2]_ _·_ 3 _[G][2][ + 256][G][2]_
h
_B_
_i=1_ _[φ][ξ][(][i][)]_
+ 2G[2]
_B_
P
2
! i
_B_ _φξ(j)_ log 2dTδ
_B_
_j=1_
X 
(f (w(t)) − _f_ (w[∗])).
_t=1_
X
+ 2ηL
The proof is completed. Lastly, we set the learning rate η. To minimize the obtained upper bound, it
suffices to minimize the first two terms, as the last term can be combined with the left hand side of
(14) when we apply this regret bound. The optimized learning rate is achieved when
_B_
_i=1_ _[φ][ξ][(][i][)]_
+ 2G[2]
_B_
P
2
! i
_w(1)_ _w[∗]_ 268
_∥_ _−_ _∥[2]_ = ηT
2η _·_ 3 _[G][2][ + 256][G][2]_
h
_B_ _φξ(j)_ log 2dTδ
_B_
_j=1_
X 
-----
Then, η is chosen as
_T ·_ 2683 _[G][2][ + 256][G][2][ P]∥j[B]w=1(1)[φ][ξ] −[(][j][)]w[∗] log∥[2]B 2/2dTδ_
h 
_B_
_._
s _T ·_ _j=1_ _[φ][ξ][(][j][)]_ !
[P][B]
_η =_
v
u
u
t
= O
_B_ 2
_i=1_ _[φ][ξ][(][i][)]_
+ 2G[2] _B_
 P  i
D EXPERIMENT SETUP
Recall that we consider the following convex quadratic optimization problem:
min
_w_ R[d][ E][ξ][∼][µ][(][w][ −] _[ξ][)][T][ A][(][w][ −]_ _[ξ][)][,]_
_∈_
where A is a fixed positive semi-definite matrix and µ is the uniform distribution on [0, 1][d]. The data
stream admitting such a stationary distribution µ can be generated by a certain Metropolis-Hastings
sampler provided in (Jarner & Roberts, 2002). Specifically, it is described as follows.
**Step 1: Let the “proposal” distribution q(x) have the density of Beta(r + 1, 1); that is,**
(r + 1)xr _x_ [0, 1]
_q(x) =_ _∈_
0 _x /_ [0, 1] _[.]_
 _∈_
Define the acceptance probability α(x, y) = min{ _[q]q[(]([x]y)[)]_ _[,][ 1][}][.]_
**Step 2: If the current state is ξt, then we sample ζ ∼** _q. Define the next state ξt+1:_
_ξt+1 =_ _ξt_ w.p. 1 − _α(ξt, ζ),_
_ζ_ w.p. α(ξt, ζ).

**Step 3: Go back to Step 2 to generate the next state.**
We repeatedly generate d independent sequences starting from the same initial state s0 = 0 to obtain
a d-dimension Markov chain. It has been shown that the above generated Markov chain converges to
_µ in distribution with an algebraic convergence rate φξ(k)_ (k[−][1][/r]) in Proposition 5.2, (Jarner
_≤O_
& Roberts, 2002).
We consider the following bias term at the fixed point w = 0d.
(Bias): E _F_ (w; xτ )|x0 = 0d _−_ _f_ (w) _._
It can be used to approximate the left-hand side of Lemma 4.1 . Since E _F_ (w; xτ )|s0 = 0d cannot
be explicitly obtained, we use Monte Carlo method to estimate this conditional expectation. That
is, we generate n = 10, 000 independent trajectories starting from x0 = 0d. At the step _τ_, we
estimate the expected value as _n[1]_ _ni=1_ _[F]_ [(][w][;][ x]τ[(][i][)][)][, where][ x]τ[(][i][)] with the superscript (i) indicates that
it is sampled from the i-th trajectory. Then we investigate the relation between the step τ and the
mixing parameter r and the relation between the stepP _τ and the batch size B. All the results are_
presented in Section 5.
-----