
Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent

Publisher: IEEE

Abstract:

We provide a discussion of several recent results which, in certain scenarios, are able to overcome a barrier in distributed stochastic optimization for machine learning (ML). Our focus is the so-called asymptotic network independence property, which is achieved whenever a distributed method executed over a network of n nodes asymptotically converges to the optimal solution at a comparable rate to a centralized method with the same computational power as the entire network. We explain this property through an example involving the training of ML models and sketch a short mathematical analysis for comparing the performance of distributed stochastic gradient descent (DSGD) with centralized SGD.
Published in: IEEE Signal Processing Magazine ( Volume: 37, Issue: 3, May 2020)
Page(s): 114 - 122
Date of Publication: 01 May 2020

PubMed ID: 33746471


Introduction: Distributed optimization and its limitations

First-order optimization methods, ranging from vanilla gradient descent to Nesterov acceleration and its many variants, have emerged over the past decade as the principal way to train ML models. There is a great need for techniques that train such models quickly and reliably in a distributed fashion over networks where the individual processors or GPUs may be scattered across the globe and communicate over an unreliable network, which may suffer from message losses, delays, and asynchrony (see [1], [2], [29], and [33]).

Unfortunately, what often happens is that the gains achieved from having many different processors running an optimization algorithm are squandered by the cost of coordination, shared memory, message losses, and latency. This effect is especially pronounced when there are many processors and they are spread across geographically distributed data centers. As is widely recognized by the distributed systems community, “throwing” more processors at a problem will not, after a certain point, result in better performance.

This is typically reflected in the convergence time bounds obtained for distributed optimization in the literature. The problem formulation is that one must solve

z^\ast \in \arg\min_{z\in\mathbb{R}^d} \sum_{i=1}^{n} f_i(z), \tag{1}

over a network of n nodes (see Figure 1 for an example). Only node i has knowledge of the function f_i(z), and the standard assumption is that, at every step when it is awake, node i can compute the (stochastic) gradient of its own local function f_i(z). These functions f_i(z) are assumed to be convex. The problem is how to compute this minimum in a distributed manner over the network based on peer-to-peer communication, in the presence of possible message losses, delays, and asynchrony.

Figure 1. An example of a network. Two nodes are connected if there is an edge between them.


This relatively simple formulation captures a large variety of learning problems. Suppose each agent i stores training data points \mathcal{X}_i = \{(x_j, y_j)\}, where x_j \in \mathbb{R}^p are vectors of features and y_j \in \mathbb{R} are the associated responses (either discrete or continuous). We are interested in learning a predictive model h(x;\theta), parameterized by \theta \in \mathbb{R}^d, so that h(x_j;\theta) \approx y_j for all j. In other words, we are looking for a model that fits all of the data throughout the network. This can be accomplished by empirical risk minimization:

\theta^\ast \in \arg\min_{\theta\in\mathbb{R}^d} \sum_{i=1}^{n} c_i(\theta, \mathcal{X}_i), \tag{2}

where

c_i(\theta, \mathcal{X}_i) = \sum_{(x_j, y_j)\in\mathcal{X}_i} \ell(h(x_j;\theta), y_j)

measures how well the parameter \theta fits the data at node i, with \ell(h(x_j;\theta), y_j) being a loss function measuring the difference between h(x_j;\theta) and y_j. Much of modern ML is built around such a formulation, including regression, classification, and regularized variants [7].
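For concreteness, here is a minimal sketch of the local empirical risk c_i(\theta, \mathcal{X}_i) in (2) for a linear model h(x;\theta) = \theta^\top x with the squared loss; the data and function names are illustrative, not from the article.

```python
import numpy as np

def h(x, theta):
    # linear predictive model h(x; theta) = theta^T x
    return x @ theta

def local_empirical_risk(theta, X, y):
    # c_i(theta, X_i): sum of per-sample losses l(h(x_j; theta), y_j)
    # over node i's data, here with the squared loss
    return np.sum((h(X, theta) - y) ** 2)

# toy data at one node: 5 samples with p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                             # noiseless responses
risk = local_empirical_risk(theta_true, X, y)  # zero on noiseless data
```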

It is also possible that each agent i does not have a static data set, but instead collects streaming data points (x_i, y_i) \sim \mathbb{P}_i repetitively over time, where \mathbb{P}_i represents an unknown distribution of (x_i, y_i). In this case, we can find \theta^\ast through expected risk minimization:

\theta^\ast \in \arg\min_{\theta\in\mathbb{R}^d} \sum_{i=1}^{n} f_i(\theta), \tag{3}

where f_i(\theta) = \mathbb{E}_{(x_i,y_i)\sim\mathbb{P}_i} \ell(h(x_i;\theta), y_i).

This article is concerned with the current limitations of distributed optimization and how to overcome them in certain scenarios. To illustrate our main concern, let us consider the distributed subgradient method in the simplest possible setting, namely, the problem of computing the median of a collection of numbers in a distributed manner over a fixed graph. Each agent i in the network holds a value m_i > 0, and the global objective is to find the median of m_1, m_2, \ldots, m_n. This can be incorporated into the framework of (1) by choosing

f_i(z) = |z - m_i|, \quad \forall i.

The distributed subgradient method (see [18]) uses subgradients s_i(z) of f_i(z) at any point z to have agent i update as

z_i(k+1) = \sum_{j=1}^{n} w_{ij} z_j(k) - \alpha_k s_i(z_i(k)), \tag{4}

where \alpha_k > 0 denotes the step size at iteration k, and w_{ij} \in [0,1] is the weight agent i assigns to agent j's solution: two agents i and j are able to exchange information if w_{ij}, w_{ji} > 0 (w_{ij} = w_{ji} = 0 otherwise). The weights w_{ij} are assumed to be symmetric. For comparison, the centralized subgradient method updates the solution at iteration k according to

z(k+1) = z(k) - \alpha_k \frac{1}{n} \sum_{j=1}^{n} s_j(z(k)). \tag{5}
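The median example can be simulated directly. The sketch below implements update (4) on a five-node ring with running averages, mirroring the setup of Figure 2; the node count, weights, and iteration budget are illustrative choices.

```python
import numpy as np

def ring_weights(n):
    # symmetric, doubly stochastic weights on a ring:
    # self-weight 1/2, each of the two neighbors 1/4
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

def distributed_subgradient_median(m, iters=20000):
    # update (4) with f_i(z) = |z - m_i|, whose subgradient is sign(z - m_i)
    n = len(m)
    W = ring_weights(n)
    z = np.zeros(n)
    y = np.zeros(n)                  # running averages y_i(k), as in Figure 2
    for k in range(1, iters + 1):
        y += (z - y) / k             # y_i(k) = (1/k) * sum of past iterates
        alpha = 1.0 / np.sqrt(k)     # diminishing step size alpha_k
        z = W @ z - alpha * np.sign(z - m)
    return y

m = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # median is 5
y = distributed_subgradient_median(m)
# every agent's averaged iterate approaches the median
```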

In Figure 2, we show the performance of algorithm (4) as a function of the network size n, assuming the agents communicate over a ring network. As can be clearly seen, when the network size grows, it takes longer for the algorithm to reach a given performance threshold.

Figure 2. The performance of algorithm (4) as a function of the network size n. The agents communicate over a ring network [see Figure 4(b)] and choose the Metropolis weights (see the "Setup" section for the definition). Step sizes \alpha_k = 1/\sqrt{k}, and the m_i are evenly distributed in [-10, 10]. The time k to reach (1/n)\sum_{i=1}^{n} |y_i(k)| < \epsilon is plotted, where y_i(k) = (1/k)\sum_{\ell=0}^{k-1} z_i(\ell) and \epsilon = 0.1.

Clearly this is an undesirable property. Glancing at the figure, we see that distributing computation over 50 nodes can result in a convergence time on the order of 10^7 iterations. Few practitioners will be enthusiastic about distributed optimization if the final effect is vastly increased convergence time.

One might hope that this phenomenon, demonstrated for the problem of median computation (considered here because it is arguably the simplest problem to which one can apply the subgradient method), will not hold for the more sophisticated optimization problems in the ML literature. Unfortunately, most work in distributed optimization replicates this undesirable phenomenon. Next, we give an extremely brief discussion of known convergence times in the distributed setting (for a much more extended discussion, we refer the reader to the recent survey in [17]).


We confine our discussion to the following point: most known convergence times in the distributed optimization literature imply bounds of the form

\text{Time}_{n,\epsilon}(\text{decentralized}) \le p(\mathcal{G})\,\text{Time}_{n,\epsilon}(\text{centralized}), \tag{6}

where \text{Time}_{n,\epsilon}(\text{decentralized}) denotes the time for the decentralized algorithm on n nodes to reach \epsilon accuracy (error < \epsilon), and \text{Time}_{n,\epsilon}(\text{centralized}) is the time for the centralized algorithm, which can query n gradients per time step, to reach the same level of accuracy. The graph \mathcal{G} = (\mathcal{N}, \mathcal{E}) consists of the set of nodes and edges in the network, denoted by \mathcal{N} and \mathcal{E}, respectively. The function p(\mathcal{G}) can usually be bounded by some polynomial in the number of nodes n.

For instance, for subgradient methods, [17, Corollary 9] implies that

\text{Time}_{n,\epsilon}(\text{decentralized}) = \mathcal{O}\left(\frac{\max\left\{\left\Vert \frac{1}{n}\sum_{i=1}^{n} z_i(0) - z^\ast\right\Vert^2,\; G^4 h(\mathcal{G})\right\}}{\epsilon^2}\right),

\text{Time}_{n,\epsilon}(\text{centralized}) = \mathcal{O}\left(\frac{\max\left\{\left\Vert z(0) - z^\ast\right\Vert^2,\; G^4\right\}}{\epsilon^2}\right),

where z(0) and z_i(0) are initial estimates, z^\ast denotes the optimal solution, and G bounds the \ell_2-norm of the subgradients. The function h(\mathcal{G}) is the inverse of the spectral gap of the graph and will typically grow with n; hence, when n is large, p(\mathcal{G}) \simeq h(\mathcal{G}). In particular, if the communication graphs are 1) path graphs, then p(\mathcal{G}) = \mathcal{O}(n^2); 2) star graphs, then p(\mathcal{G}) = \mathcal{O}(n^2); or 3) geometric random graphs, then p(\mathcal{G}) = \mathcal{O}(n\log n). The method developed in [20] achieves p(\mathcal{G}) = n, but, typically, p(\mathcal{G}) is at least n^2.

By comparing \text{Time}_{n,\epsilon}(\text{decentralized}) and \text{Time}_{n,\epsilon}(\text{centralized}), we are keeping the computational power the same in both cases. Naturally, centralized is always better: anything that can be done in a decentralized way could be done in a centralized way. The question, though, is: How much better?

Framed in this way, the polynomial scaling in the quantity p(\mathcal{G}) is extremely disconcerting. It is, for example, hard to argue that an algorithm should be run in a distributed manner with, say, n = 100, if the quantity p(\mathcal{G}) in (6) satisfies p(\mathcal{G}) = n^2; that would imply that the distributed variant is 10,000 times slower than the centralized one with the same computational power.

Sometimes p(\mathcal{G}) is written as the inverse spectral gap 1/(1 - \lambda_2) in terms of the second eigenvalue of some matrix. Because the second-smallest eigenvalue of an undirected graph Laplacian can be as close as \sim 1/n^2 to zero, such bounds translate into at least quadratic scalings with n in the worst case. Over time-varying B-connected graphs, the best-known bounds on p(\mathcal{G}) are cubic in n, using the results in [16].
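As an illustrative numerical check of this scaling (not from the article): for the lazy Metropolis weights on a ring of n nodes (the rule is defined later, in the "Setup" section), the spectral gap 1 - \lambda_2 shrinks roughly like 1/n^2, so doubling n cuts the gap by about a factor of 4.

```python
import numpy as np

def lazy_metropolis_ring(n):
    # lazy Metropolis weights on a ring: each of the two neighbors gets
    # 1/(2*max(deg_i, deg_j)) = 1/4, and the self-weight fills the row to 1
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
        W[i, i] = 0.5
    return W

def spectral_gap(W):
    # gap = 1 - lambda, where lambda is the second-largest eigenvalue
    # magnitude of the symmetric matrix W
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]

gaps = {n: spectral_gap(lazy_metropolis_ring(n)) for n in (16, 32, 64)}
# doubling n shrinks the gap by roughly 4x, consistent with gap ~ 1/n^2
```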

There are a number of caveats to the pessimistic argument outlined previously in this section. For example, in a multiagent scenario where data sharing is not desirable or feasible, decentralized computation might be the only option available. Generally speaking, however, a fast-growing p(\mathcal{G}) will preclude the widespread applicability of distributed optimization. Indeed, returning to the back-of-the-envelope calculation mentioned previously, if a user has to pay a multiplicative factor of 10,000 in convergence speed to use an algorithm, the most likely scenario is that the algorithm will not be used.

There are some scenarios that avoid the pessimistic discussion mentioned previously: for example, when the underlying graph is an expander, the associated spectral gap is constant (see [8, Ch. 6] for a definition of these terms as well as an explanation), and likewise when the graph is a star graph. In particular, on an Erdős–Rényi random graph, the quantity p(\mathcal{G}) is constant with high probability [17, Corollary 9, Part 9]. Unfortunately, these are very special cases and may not always be realistic. A star graph requires a single node to have the ability to receive and broadcast messages to all other nodes in the system. On the other hand, an expander graph may not occur in geographically distributed systems. By way of comparison, a random graph where nodes are associated with random locations, with links between nodes close together, will not have a constant spectral gap and will thus have a p(\mathcal{G}) that grows with n [17, Corollary 9, Part 10]. The Erdős–Rényi graph escapes this because, if we again associate nodes with locations, the average link in such a graph is a "long-range" one, connecting nodes that are geographically far apart. It is a consequence of Cheeger's inequality that graphs based on connecting nearest neighbors (i.e., where nodes are regularly spaced in \mathbb{R}^d and each node is connected to a constant number of closest neighbors) will not have a constant spectral gap.

Asymptotic network independence in distributed stochastic optimization

In this article, we provide a discussion of several recent papers showing that, in a number of settings involving distributed stochastic optimization, p(\mathcal{G}) = 1 once the number of iterations is large enough. In other words, asymptotically, the distributed stochastic gradient algorithm converges to the optimal solution at a rate comparable to that of a centralized algorithm with the same computational power.

We call this property asymptotic network independence: it is as if the network is not even there. Asymptotic network independence provides an answer to the concerns raised in the previous section.

We begin by illustrating these results with a simulation from [21], shown in Figure 3. Here, the problem to be solved is classification using a smooth support vector machine (SVM) between overlapping clusters of points. The performance of the centralized algorithm is shown in orange, and the performance of the decentralized algorithm is shown in dark blue; the communication graph is a ring of 50 nodes. The figure illustrates the main result: a network of 50 nodes performs as well in the limit as a centralized method with 50 times the computational power of one node. Indeed, after ~8,000 iterations, the orange and dark blue lines are nearly indistinguishable.

Figure 3. A comparison of DSGD and centralized SGD for training an SVM. (a) A total of 1,000 data points and their labels for SVM classification. The data points are randomly generated around 50 cluster centers. (b) The squared errors and one standard-deviation band for DSGD and centralized SGD. The performance of the centralized algorithm is shown in orange, and the performance of the decentralized algorithm is shown in dark blue. A total of 1,000 Monte Carlo simulations are conducted for estimating the average performance.


We note that similar simulations are available for other ML methods (training neural networks, logistic regression, elastic net regression, and so on). The asymptotic network independence property enables us to efficiently distribute the training process for a variety of existing learning methods.

The name asymptotic network independence is a slight misnomer, as we do not actually care whether the asymptotic performance depends in some complicated way on the network. All we want is for the decentralized convergence rate to be bounded by \mathcal{O}(1) times the convergence rate of the centralized method.

The authors in [4]–[6] and [31] gave the first crisp statement of the relationship between centralized and distributed methods in the setting of distributed optimization of smooth, strongly convex functions in the presence of noise. Under constant step sizes, the authors in [4]–[6] were the first to show that, when the step size is sufficiently small, a distributed stochastic gradient method achieves a performance comparable to that of the centralized method in terms of the steady-state mean-square error. The step size has to be small enough as a function of the network topology for this to hold. In [31], the authors showed that the distributed stochastic gradient algorithm asymptotically achieves a convergence rate comparable to that of the centralized method, but assuming that all of the local functions f_i have the same minimum. This gives the first "asymptotic network independence" result.

The work in [22] approximated the distributed stochastic gradient algorithm by stochastic differential equations in continuous time, assuming a sufficiently small constant step size. It was shown that the distributed method outperforms a centralized scheme with synchronization overhead; however, this did not lead to straightforward algorithmic bounds. In our recent work [21], we generalized the results to graphs that are time varying, with delays, message losses, and asynchrony. In a parallel recent work [9], a similar result was demonstrated using a further compression technique, which allowed nodes to save on communication.

When the objective functions are not assumed to be convex, several recent works have obtained asymptotic network independence for distributed stochastic gradient methods. In [13] and [14], a general stochastic approximation setting was considered with decaying step sizes, and the convergence rates of centralized and distributed methods were shown to be asymptotically the same; the proof proceeded based on certain technical properties of stochastic approximation methods. The work in [12] was the first to show that distributed algorithms could achieve a speedup like that of a centralized method when the number of computing steps is large enough. Such a result was generalized to the setting of directed communication networks in [1] for training deep neural networks, where the push-sum technique was combined with the standard distributed stochastic gradient scheme.

We remark that in this survey, all of the previously mentioned algorithms that enjoy the asymptotic network independence property assume smooth objective functions, i.e., functions with Lipschitz continuous gradients.

In the next sections, we provide a simple and readable explanation of the asymptotic network independence phenomenon in the context of distributed stochastic optimization over smooth and strongly convex objective functions. For more information on the topic of distributed stochastic optimization, the reader is referred to [10], [15], [23], [24], [28], [30], and [32] and the references therein.

Setup

We are interested in minimizing (1) over a network of n communicating agents. Regarding the objective functions f_i, we make the following standing assumption.

Assumption 1

Each f_i : \mathbb{R}^d \rightarrow \mathbb{R} is \mu-strongly convex with L-Lipschitz continuous gradients, i.e., for any z, z' \in \mathbb{R}^d,

\langle \nabla f_i(z) - \nabla f_i(z'), z - z' \rangle \ge \mu \Vert z - z' \Vert^2,
\Vert \nabla f_i(z) - \nabla f_i(z') \Vert \le L \Vert z - z' \Vert. \tag{7}

Under Assumption 1, (1) has a unique optimal solution z^\ast, and the function f(z) = (1/n)\sum_{i=1}^{n} f_i(z) has the following contraction property [26, Lemma 10].

Lemma 1

For any z \in \mathbb{R}^d and \alpha \in (0, 1/L), we have

\Vert z - \alpha \nabla f(z) - z^\ast \Vert \le (1 - \alpha\mu) \Vert z - z^\ast \Vert.

In other words, gradient descent with a small step size reduces the distance between the current solution and z^\ast.
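As a quick sanity check on Lemma 1 (an illustrative sketch, not from the article), consider the quadratic f(z) = \frac{1}{2} z^\top A z, which is \mu-strongly convex with L-Lipschitz gradient for \mu and L the extreme eigenvalues of A, and has z^\ast = 0:

```python
import numpy as np

# f(z) = 0.5 * z^T A z has gradient A z; with A below, mu = 1 and L = 10,
# and the unique minimizer is z* = 0
A = np.diag([1.0, 4.0, 10.0])
mu, L = 1.0, 10.0
alpha = 0.05                      # any step size in (0, 1/L)

rng = np.random.default_rng(1)
z = rng.normal(size=3)
z_next = z - alpha * (A @ z)      # one gradient-descent step

# Lemma 1: ||z_next - z*|| <= (1 - alpha*mu) * ||z - z*||
lhs = np.linalg.norm(z_next)
rhs = (1 - alpha * mu) * np.linalg.norm(z)
```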

In the stochastic optimization setting, we assume that at each iteration k of the algorithm, with z_i(k) being the input for agent i, each agent is able to obtain a noisy gradient estimate g_i(z_i(k), \xi_i(k)) satisfying the following condition.

Assumption 2

For all i \in \{1, 2, \ldots, n\} and k \ge 1, each random vector \xi_i(k) \in \mathbb{R}^m is independent, and

\mathbb{E}_{\xi_i(k)}\left[ g_i(z_i(k), \xi_i(k)) \mid z_i(k) \right] = \nabla f_i(z_i(k)),
\mathbb{E}_{\xi_i(k)}\left[ \Vert g_i(z_i(k), \xi_i(k)) - \nabla f_i(z_i(k)) \Vert^2 \mid z_i(k) \right] \le \sigma^2, \quad \text{for some } \sigma > 0. \tag{8}

Stochastic gradients appear, for instance, when the gradient estimation of c_i(\theta, \mathcal{X}_i) in empirical risk minimization (2) introduces noise from various sources, such as sampling and quantization errors. For another example, when minimizing the expected risk in (3), where independent data points (x_i, y_i) are gathered over time, g_i(z, (x_i, y_i)) = \nabla_z \ell(h(x_i;z), y_i) is a stochastic, unbiased estimator of \nabla f_i(z), satisfying the first condition in (8). The second condition holds for popular problems such as smooth SVMs, logistic regression, and softmax regression, assuming the domain of (x_i, y_i) is bounded.

The algorithm we discuss is the DSGD method adapted from distributed gradient descent and the diffusion strategy [3]; note that in [3] this method was called adapt-then-combine. We let each agent i in the network hold a local copy of the decision vector, denoted by z_i \in \mathbb{R}^d, whose value at iteration/time k is written as z_i(k). Denote g_i(k) = g_i(z_i(k), \xi_i(k)) for short. At each step k \ge 0, every agent i performs the update

z_i(k+1) = \sum_{j=1}^{n} w_{ij} \left( z_j(k) - \alpha_k g_j(k) \right), \tag{9}

where \{\alpha_k\} is a sequence of nonnegative, nonincreasing step sizes. The initial vectors z_i(0) are arbitrary for all i, and W = [w_{ij}] is a mixing matrix.
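A compact sketch of update (9) on a toy problem; the quadratic local objectives, the Gaussian noise model, and the fully connected mixing matrix are illustrative choices, not from the article.

```python
import numpy as np

def dsgd(W, grads, z0, sigma, iters, rng):
    # DSGD / adapt-then-combine: z_i(k+1) = sum_j w_ij (z_j(k) - alpha_k g_j(k))
    n, d = z0.shape
    z = z0.copy()
    for k in range(1, iters + 1):
        alpha = 1.0 / k                       # diminishing step size
        g = np.array([grads[i](z[i]) for i in range(n)])
        g += sigma * rng.normal(size=(n, d))  # noisy gradient estimates
        z = W @ (z - alpha * g)               # local step, then mixing
    return z

# toy problem: f_i(z) = 0.5 * ||z - b_i||^2, so z* is the mean of the b_i
rng = np.random.default_rng(2)
n, d = 4, 2
b = rng.normal(size=(n, d))
grads = [lambda z, bi=bi: z - bi for bi in b]
W = np.full((n, n), 1.0 / n)                  # fully connected mixing
z = dsgd(W, grads, np.zeros((n, d)), sigma=0.1, iters=5000, rng=rng)
# every agent's iterate approaches z* = b.mean(axis=0)
```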

DSGD belongs to the class of so-called consensus-based distributed optimization methods, where different agents mix their estimates at each iteration so as to reach consensus on the solution, i.e., z_i(k) \approx z_j(k) for all i and j in the long run. To achieve consensus, the following condition is assumed on the mixing matrix and the communication topology among agents.

Assumption 3

The graph \mathcal{G} of agents is undirected and connected (there exists a path between any two agents). The mixing matrix W is nonnegative, symmetric, and doubly stochastic, i.e., W\mathbf{1} = \mathbf{1} and \mathbf{1}^\top W = \mathbf{1}^\top, where \mathbf{1} is the all-one vector. In addition, w_{ii} > 0 for some i \in \{1, 2, \ldots, n\}.

Some examples of undirected connected graphs are presented in Figure 4. Because of Assumption 3, the mixing matrix W has an important contraction property.

Figure 4. Examples of undirected connected graphs. (a) A fully connected graph and (b) ring, (c) star, and (d) tree networks.

Lemma 2

Let Assumption 3 hold, and let 1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n denote the eigenvalues of the matrix W. Then, \lambda = \max(|\lambda_2|, |\lambda_n|) < 1 and

\Vert W\omega - \mathbf{1}\bar{\omega} \Vert \le \lambda \Vert \omega - \mathbf{1}\bar{\omega} \Vert

for all \omega \in \mathbb{R}^{n \times d}, where \bar{\omega} = (1/n)\mathbf{1}^\top \omega. As a result, when running a consensus algorithm [which is just (9) without the gradient step]

z_i(k+1) = \sum_{j=1}^{n} w_{ij} z_j(k), \tag{10}

the speed of reaching consensus is determined by \lambda = \max(|\lambda_2|, |\lambda_n|). In particular, if we adopt the so-called lazy Metropolis rule for defining the weights, the dependency of \lambda on the network size n is upper bounded by 1 - c/n^2 for some constant c [20] (see "Lazy Metropolis Rule for Constructing W").

Despite the fact that \lambda may be very close to 1 for large n, the consensus algorithm (10) enjoys geometric convergence speed, i.e.,

\sum_{i=1}^{n} \left\Vert z_i(k) - \frac{1}{n}\sum_{j=1}^{n} z_j(k) \right\Vert^2 \le \lambda^{k} \sum_{i=1}^{n} \left\Vert z_i(0) - \frac{1}{n}\sum_{j=1}^{n} z_j(0) \right\Vert^2.
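To illustrate the geometric decay, the sketch below runs the plain consensus iteration (10) on a 10-node ring with lazy Metropolis weights; the topology and iteration count are illustrative.

```python
import numpy as np

def consensus_step(W, z):
    # one round of (10): every agent replaces its value by a weighted
    # average of its own and its neighbors' values
    return W @ z

# lazy Metropolis weights on a 10-node ring
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

rng = np.random.default_rng(3)
z = rng.normal(size=n)
avg = z.mean()                       # the average is preserved by W
dists = []
for _ in range(200):
    dists.append(np.linalg.norm(z - avg))
    z = consensus_step(W, z)
# the distance to consensus decays geometrically, at a rate set by lambda_2(W)
```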

By contrast, the optimal rate of convergence for any stochastic gradient method is sublinear, asymptotically \mathcal{O}(1/k) (see [19]). This difference suggests that a consensus-based distributed algorithm for stochastic optimization may match the centralized methods in the long run: any errors due to consensus decay at a fast-enough rate that they ultimately do not matter.

In the next sections, we discuss and compare the performance of the centralized SGD method and DSGD. We show that both methods asymptotically converge at the rate \sigma^2/(n\mu^2 k). Furthermore, the time needed for DSGD to approach the asymptotic convergence rate turns out to scale as \mathcal{O}(n/(1-\lambda)^2).

Centralized SGD

The benchmark for evaluating the performance of DSGD is the centralized SGD method, which we describe in this section. At each iteration k, the following update is executed:

z(k+1) = z(k) - \alpha_k \bar{g}(k), \tag{11}

where the step sizes satisfy \alpha_k = 1/(\mu k) and \bar{g}(k) = (1/n)\sum_{i=1}^{n} g_i(z(k), \xi_i(k)), i.e., \bar{g}(k) is the average of n noisy gradients evaluated at z(k) (by utilizing n gradients at each iteration, we are keeping the computational power the same for SGD and DSGD). As a result, the gradient estimation is more accurate than using just one gradient. Indeed, from Assumption 2 we have

\mathbb{E}\left[ \Vert \bar{g}(k) - \nabla f(z(k)) \Vert^2 \right] = \frac{1}{n^2}\sum_{i=1}^{n} \mathbb{E}\left[ \Vert g_i(z(k), \xi_i(k)) - \nabla f_i(z(k)) \Vert^2 \right] \le \frac{\sigma^2}{n}. \tag{12}
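The variance reduction in (12) can be seen numerically (an illustrative simulation, with a stand-in gradient and noise level): averaging n independent noisy gradients divides the mean-squared error by n.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, s, trials = 25, 3, 2.0, 20000
true_grad = np.ones(d)        # stand-in for the true gradient at z(k)

# each trial draws n independent noisy gradients g_i = grad + noise
# and forms their average g_bar, as used in update (11)
noise = s * rng.normal(size=(trials, n, d))
g_bar = true_grad + noise.mean(axis=1)

mse = np.mean(np.sum((g_bar - true_grad) ** 2, axis=1))
# per-gradient noise satisfies E||noise||^2 = d * s^2 = 12;
# averaging over n = 25 gradients gives roughly 12 / 25 = 0.48
```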

We measure the performance of SGD by {R}\left({k}\right) = {\Bbb{E}}\left[{\left\Vert{z}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right], the expected squared distance between the solution at time {k} and the optimal solution. “Theorem 1” characterizes the convergence rate of {R}\left({k}\right), which is optimal for such stochastic gradient methods (see [19] and [27]).

Lazy Metropolis Rule for Constructing W

\begin{equation*}{w}_{ij} = \begin{cases}\frac{1}{2\max\left\{\deg\left(i\right),\deg\left(j\right)\right\}}, & {\text{if }}{j}\in{\cal{N}}_{i},\\ {1} - \sum_{{j}\in{\cal{N}}_{i}}{w}_{ij}, & {\text{if }}{i} = {j},\\ {0}, & {\text{otherwise.}}\end{cases}\end{equation*}

Notation: \deg\left({i}\right) denotes the degree (number of “neighbors”) of node {i}. Correspondingly, {\cal{N}}_{i} is the set of “neighbors” for agent {i}.
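The rule above translates directly into code. Here is a minimal sketch; representing the graph by a 0/1 adjacency matrix is our own choice.

```python
import numpy as np

def lazy_metropolis(adj):
    """Construct the lazy Metropolis matrix W from a symmetric 0/1 adjacency matrix:
    w_ij = 1/(2*max(deg(i), deg(j))) for each neighbor j of i, and the diagonal
    entry absorbs the remaining mass so that each row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: a 4-cycle. Every node has degree 2, so each off-diagonal weight is 1/4
# and each diagonal entry is 1/2; W is symmetric and doubly stochastic.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = lazy_metropolis(adj)
```

The positive diagonal (at least 1/2 at every node) is what makes the matrix "lazy" and rules out periodic behavior.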

Theorem 1

Under SGD (11), supposing "Assumption 1," "Assumption 2," and "Assumption 3" hold, we have {R}\left({k}\right) \leq \frac{{\sigma}^{2}}{{n}{\mu}^{2}{k}} + {\cal{O}}_{k}\left(\frac{1}{{k}^{2}}\right). \tag{13}

To compare with the analysis for DSGD later, we briefly describe how to obtain (13). Note that \begin{align*}{R}\left({k} + {1}\right) = & {\Bbb{E}}\left[{\left\Vert{z}\left({k}\right) - {\alpha}_{k}{\bar{g}}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right]\\ = & {\Bbb{E}}\left[{\left\Vert{z}\left({k}\right) - {\alpha}_{k}\nabla{f}\left({z}\left({k}\right)\right) - {z}^{\ast}\right\Vert}^{2}\right] \\ &\quad + {\alpha}_{k}^{2}{\Bbb{E}}\left[{\left\Vert{\nabla}{f}\left({z}\left({k}\right)\right) - {\bar{g}}\left({k}\right)\right\Vert}^{2}\right],\end{align*} where the cross term vanishes because {\bar{g}}\left({k}\right) is an unbiased estimate of \nabla{f}\left({z}\left({k}\right)\right).

For large {k}, in light of "Lemma 1" and relation (12), we have the following inequality that relates {R}\left({k} + {1}\right) to {R}\left({k}\right): {R}\left({k} + {1}\right) \leq \left({1} - {\alpha}_{k}{\mu}\right)^{2}{R}\left({k}\right) + \frac{{\alpha}_{k}^{2}{\sigma}^{2}}{n} = {\left({1} - \frac{1}{k}\right)}^{2}{R}\left({k}\right) + \frac{{\sigma}^{2}}{{n}{\mu}^{2}}\frac{1}{{k}^{2}}. \tag{14}

A simple induction then gives (13).
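The induction is easy to check numerically. Iterating the worst case of (14), i.e., with equality and {C} = {\sigma}^{2}/\left({n}{\mu}^{2}\right), and multiplying both sides by {k}^{2} shows that \left({k} - {1}\right)^{2}{R}\left({k}\right) grows by exactly {C} per step, so {R}\left({k}\right) = {C}/\left({k} - {1}\right) for {k}\geq{2}, matching the leading term of (13). A small sketch:

```python
def sgd_recursion(C=1.0, R1=5.0, K=10000):
    """Iterate R(k+1) = (1 - 1/k)^2 R(k) + C/k^2, the worst case of (14),
    and return R(K). Since k^2 R(k+1) = (k-1)^2 R(k) + C, the quantity
    (k-1)^2 R(k) increases by exactly C per step; hence R(k) = C/(k-1)
    for k >= 2, regardless of the initial value R(1)."""
    R = R1
    for k in range(1, K):
        R = (1.0 - 1.0 / k) ** 2 * R + C / k ** 2
    return R

R_K = sgd_recursion()  # equals C/(K-1) = 1/9999 up to floating-point rounding
```

Note that the initial condition R(1) is forgotten after a single step (the factor (1 − 1/k)² vanishes at k = 1), which is why the bound (13) has no dependence on the starting error in its leading term.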

DSGD

We assume the same step-size policy for DSGD and SGD. To analyze DSGD starting from (9), define {\bar{z}}\left({k}\right) = \frac{1}{n}\sum_{i = 1}^{n}{{z}_{i}}\left({k}\right), \tag{15} the average of all of the iterates in the network. Unlike in the analysis for SGD, we are concerned with two error terms. The first term, {\Bbb{E}}\left[{\left\Vert{\bar{z}}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right], called the expected optimization error, is the expected squared distance between {\bar{z}}\left({k}\right) and {z}^{\ast}. The second term, \sum_{i = 1}^{n}{\Bbb{E}}\left[{\left\Vert{z}_{i}\left({k}\right) - {\bar{z}}\left({k}\right)\right\Vert}^{2}\right], called the expected consensus error, measures the dissimilarity of the individual estimates across agents. The average squared distance between the individual iterates {z}_{i}\left({k}\right) and the optimum {z}^{\ast} is given by \begin{align*} \frac{1}{n}\sum_{i = 1}^{n}{\Bbb{E}}\left[{\left\Vert{z}_{i}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right] = & {\Bbb{E}}\left[{\left\Vert{\bar{z}}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right]\\ & + \frac{1}{n}\sum_{i = 1}^{n}{\Bbb{E}}\left[{\left\Vert{z}_{i}\left({k}\right) - {\bar{z}}\left({k}\right)\right\Vert}^{2}\right]. \tag{16} \end{align*}
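Although (9) is not reproduced in this excerpt, DSGD in its standard form has each agent mix its neighbors' iterates through W and then take a local noisy gradient step; the sketch below assumes that form, and the mixing matrix and gradient oracles are illustrative choices of ours.

```python
import numpy as np

def dsgd_step(Z, W, grad_oracles, k, mu):
    """One DSGD iteration in the standard form (assumed to match (9)):
    z_i(k+1) = sum_j w_ij z_j(k) - alpha_k g_i(z_i(k), xi_i(k)),
    with alpha_k = 1/(mu*k). Z is an (n, d) array whose i-th row is z_i(k)."""
    G = np.stack([g(z) for g, z in zip(grad_oracles, Z)])  # local noisy gradients
    return W @ Z - G / (mu * k)

def network_average(Z):
    """The averaged iterate z_bar(k) of (15)."""
    return Z.mean(axis=0)

# Toy run: 4 fully mixing agents, f_i(z) = (mu/2)||z - z_star||^2 plus gradient noise.
rng = np.random.default_rng(0)
mu, n, d = 1.0, 4, 3
z_star = np.ones(d)
oracles = [lambda z: mu * (z - z_star) + 0.1 * rng.standard_normal(d) for _ in range(n)]
W = np.full((n, n), 1.0 / n)  # doubly stochastic mixing matrix
Z = np.zeros((n, d))
for k in range(1, 2001):
    Z = dsgd_step(Z, W, oracles, k, mu)
```

Because W is doubly stochastic, multiplying the update by {1}/{n} and summing over agents shows that {\bar{z}}\left({k}\right) evolves exactly like a centralized SGD iterate driven by the average of the {n} local gradients.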

Exploring the two terms will provide us with insights into the performance of DSGD. To simplify notation, denote {U}\left({k}\right) = {\Bbb{E}}\left[{\left\Vert{\bar{z}}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right] and {V}\left({k}\right) = \sum_{i = 1}^{n}{\Bbb{E}}\left[{\left\Vert{z}_{i}\left({k}\right) - {\bar{z}}\left({k}\right)\right\Vert}^{2}\right] for all {k}.

Inspired by the analysis for SGD, we first look for an inequality that bounds {U}\left({k}\right), which is analogous to {\Bbb{E}}\left[{\left\Vert{z}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right] in SGD. One such relation turns out to be [25]: \begin{align*}{U}\left({k} + {1}\right) \leq & {\left({1} - \frac{1}{k}\right)}^{2}{U}\left({k}\right) + \frac{2{L}}{{\sqrt{n}}{\mu}}\frac{\sqrt{{U}\left({k}\right){V}\left({k}\right)}}{k}\\ & + \frac{{L}^{2}}{{n}{\mu}^{2}}\frac{{V}\left({k}\right)}{{k}^{2}} + \frac{{\sigma}^{2}}{{n}{\mu}^{2}}\frac{1}{{k}^{2}}. \tag{17} \end{align*}

Comparing (17) to (14), we find two additional terms on the right-hand side of the inequality. Both terms involve the expected consensus error {V}\left({k}\right), thus reflecting the additional disturbances caused by the dissimilarities of solutions. Equation (17) also suggests that the convergence rate of {U}\left({k}\right) cannot be better than {R}\left({k}\right) for SGD, which is expected. Nevertheless, if {V}\left({k}\right) decays fast enough compared to {U}\left({k}\right), it is likely that the two additional terms are negligible in the long run, and we deduce that the convergence rate of {U}\left({k}\right) is comparable to that of {R}\left({k}\right) for SGD.

This indeed turns out to be the case: as shown in [25], {V}\left({k}\right) \leq {\cal{O}}\left({n}/{\left({1} - {\lambda}\right)}^{2}\right)\left({1}/{k}^{2}\right) for {k} \geq {\cal{O}}\left({1}/\left({1} - {\lambda}\right)\right). Plugging this into (17) leads to the inequality {U}\left({k}\right) \leq {\theta}^{2}{\sigma}^{2}/\left(\left({1.5}{\theta} - {1}\right){n}{\mu}^{2}{k}\right) + {\cal{O}}\left({1}/{\left({1} - {\lambda}\right)}^{2}\right)\left({1}/{k}^{2}\right), where {\theta} is a step-size-related constant from the analysis in [25]. Therefore, when {k} \geq {\cal{O}}\left({n}/{\left({1} - {\lambda}\right)}^{2}\right), we have \frac{1}{n}\sum_{i = 1}^{n}{\Bbb{E}}\left[{\left\Vert{z}_{i}\left({k}\right) - {z}^{\ast}\right\Vert}^{2}\right] \leq \frac{{\sigma}^{2}}{{n}{\mu}^{2}{k}}{\cal{O}}\left({1}\right).

In other words, we have the asymptotic network independence phenomenon: after a transient, DSGD performs comparably to a centralized SGD method with the same computational power (i.e., one that can query the same number of gradients per step as the entire network).

Numerical illustration

We provide a numerical example to illustrate the asymptotic network independence property of DSGD. Consider the online Ridge regression problem {z}^{\ast} = \arg\min_{z\in{\Bbb{R}}^{d}}\sum_{i = 1}^{n}{f_i}\left({z}\right), \quad {f}_{i}\left({z}\right) = {\Bbb{E}}_{{u}_{i},{v}_{i}}\left[{\left({u}_{i}^{\top}{z} - {v}_{i}\right)}^{2} + {\rho}{\left\Vert{z}\right\Vert}^{2}\right], \tag{18} where {\rho} > {0} is a penalty parameter. Each agent {i} collects data points in the form of \left({u}_{i}, {v}_{i}\right) continuously over time, with {u}_{i}\in{\Bbb{R}}^{d} representing the features and {v}_{i}\in{\Bbb{R}} being the observed outputs. Suppose each {u}_{i}\in\left[-{1},{1}\right]^{d} is uniformly distributed and {v_i} is drawn according to {v}_{i} = {u}_{i}^{\top}{\tilde{z}}_{i} + {\varepsilon}_{i}, where the {\tilde{z}}_{i} are predefined parameters uniformly distributed in \left[{0},{10}\right]^{d} and the {\varepsilon}_{i} are independent Gaussian random variables with mean 0 and variance 1. Given a pair \left({u}_{i}, {v}_{i}\right), agent {i} can compute an unbiased estimated gradient of {f}_{i}\left({z}\right): {g}_{i}\left({z},{u}_{i},{v}_{i}\right) = {2}\left({u}_{i}^{\top}{z} - {v}_{i}\right){u}_{i} + {2}{\rho}{z}. Equation (18) has a unique solution {z}^{\ast} given by {z}^{\ast} = {\left(\sum_{i = 1}^{n}{\Bbb{E}}_{u_i}\left[{u}_{i}{u}_{i}^{\top}\right] + {n}{\rho}{\bf{I}}\right)}^{-1}\sum_{i = 1}^{n}{\Bbb{E}}_{u_i}\left[{u}_{i}{u}_{i}^{\top}\right]{\tilde{z}}_{i}.
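This setup is easy to reproduce. Since {u}_{i} is uniform on \left[-{1},{1}\right]^{d}, we have {\Bbb{E}}\left[{u}_{i}{u}_{i}^{\top}\right] = {\bf{I}}/{3}, so the closed-form solution simplifies to {z}^{\ast} = \left({1}/\left({1} + {3}{\rho}\right)\right)\left({1}/{n}\right)\sum_{i}{\tilde{z}}_{i}. A sketch (the random seed is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, rho = 50, 10, 0.1
z_tilde = rng.uniform(0.0, 10.0, size=(n, d))  # per-agent parameters in [0,10]^d

def grad_i(z, i):
    """Unbiased stochastic gradient g_i(z, u_i, v_i) of f_i from one fresh sample."""
    u = rng.uniform(-1.0, 1.0, size=d)
    v = u @ z_tilde[i] + rng.standard_normal()
    return 2.0 * (u @ z - v) * u + 2.0 * rho * z

# With E[u_i u_i^T] = I/3, the closed-form optimum reduces to
# z* = ((n/3) I + n*rho*I)^{-1} (1/3) sum_i z_tilde_i = mean(z_tilde) / (1 + 3*rho).
z_star = z_tilde.mean(axis=0) / (1.0 + 3.0 * rho)
```

Averaging many fresh gradients at z_star over all agents should give approximately the zero vector, confirming that z_star is a stationary point of the aggregate objective.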

In the experiments, we consider two instances. In the first, {n} = {50} agents form a random network for DSGD, where each pair of agents is linked with probability 0.2. In the second, {n} = {49} agents form a 7 × 7 grid network. We use Metropolis weights in both instances. The problem dimension is {d} = {10}, and {z}_{i}\left({0}\right) = {\bf{0}}, the zero vector, for all {i}. The penalty parameter is {\rho} = {0.1}, and the step sizes are {\alpha}_{k} = {5}/{k}. For both SGD and DSGD, we run the simulations 100 times and average the results to approximate the expected errors.
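The connectivity gap {1} - {\lambda} that governs the transient can be computed explicitly for these topologies. Below is a sketch for the 7 × 7 grid of the second instance, using the standard Metropolis rule {w}_{ij} = {1}/\left({1} + \max\left\{\deg\left(i\right),\deg\left(j\right)\right\}\right) for neighbors; the article does not spell out its exact weight rule here, so this particular form is our assumption.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """0/1 adjacency matrix of a rows x cols grid graph."""
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                A[i, i + 1] = A[i + 1, i] = 1.0  # right neighbor
            if r + 1 < rows:
                A[i, i + cols] = A[i + cols, i] = 1.0  # bottom neighbor
    return A

def metropolis_weights(A):
    """Metropolis mixing matrix: w_ij = 1/(1 + max(deg(i), deg(j))) for neighbors,
    with the diagonal absorbing the remaining mass of each row."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

W = metropolis_weights(grid_adjacency(7, 7))
# lambda: second-largest eigenvalue modulus of the symmetric mixing matrix W.
lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]
```

Since the grid is connected and every diagonal entry of W is positive, we have {\lambda} < {1}; the grid's {\lambda} is closer to 1 than that of a comparably sized random network, which is consistent with its longer transient in Figure 5.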

The performance of SGD and DSGD is shown in Figure 5. In both instances, the expected consensus error for DSGD converges to 0 faster than the expected optimization error, as predicted by our previous discussion. Regarding the expected optimization error, DSGD is slower than SGD in the first ∼800 iterations for the random network (respectively, {\sim}{4}\times{10}^{4} iterations for the grid network). After that, their performance is almost indistinguishable. The difference in the transient times is due to the stronger connectivity (smaller {\lambda}) of the random network compared to the grid network.

Figure 5. The performance comparison between DSGD and SGD for online Ridge regression. For DSGD, the plots show the iterates generated by a randomly selected node {i} from the set \left\{{1},{2},\ldots,{n}\right\}. The results are averaged over 100 Monte Carlo simulations. (a) Instance 1 (a random network used for DSGD) and (b) instance 2 (a grid network used for DSGD).


Conclusions

In this article, we provided a discussion of recent results that have overcome a barrier in distributed stochastic optimization methods for ML under certain scenarios. These results established an asymptotic network independence property, that is, asymptotically, the distributed algorithm achieves a convergence rate comparable to that of a centralized algorithm with the same computational power. We explained the property using examples of training ML models and provided a short mathematical analysis.

Along the line of achieving asymptotic network independence in distributed optimization, there are various future research directions, including considering nonconvex objective functions, reducing communication costs and transient time, and using exact gradient information. In this section, we briefly describe these directions.

First, the distributed training of deep neural networks, the state-of-the-art ML approach in many application areas, involves minimizing nonconvex objective functions, which differ from the main objectives considered in this article. This area is largely unexplored, with only a few recent works [1], [12], [14], [29].

In distributed algorithms, the costs associated with communication among the agents are often nonnegligible and may become the main burden for large networks. It is therefore important to explore communication-reduction techniques that do not sacrifice the asymptotic network independence property. Recent works [1] and [9] touch on this point.

When considering asymptotic network independence for distributed optimization, an important factor is the transient time needed to reach the asymptotic convergence rate, as it may take a long time before the distributed implementation catches up with the corresponding centralized method. In fact, as we have shown in the “Setup” section, this transient time can be a function of the network topology and grows with the network size. Reducing the transient time is thus a key future objective.

Finally, although several recent works have established the asymptotic network independence property in distributed optimization, they are mainly constrained to using stochastic gradient information. If the exact gradient is available, can distributed methods compete with centralized ones? Centralized algorithms typically enjoy faster convergence with exact gradients; for example, plain gradient descent achieves linear convergence for strongly convex and smooth objective functions. To the best of the authors' knowledge, as of this writing, results on asymptotic network independence in this setting are lacking, with the exception of [11] and [29].

ACKNOWLEDGMENTS

We would like to thank Artin Spiridonoff from Boston University for his kind help in providing Figure 3. The research was partially supported by the NSF under grants ECCS 1933027, IIS 1914792, DMS 1664644, and CNS 1645681, the U.S. Office of Naval Research under grant N00014-19-1-2571, the National Institutes of Health under grant 1R01GM135930, and the Shenzhen Research Institute of Big Data Startup Fund JCYJ-SP2019090001.

References

[1] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, "Stochastic gradient push for distributed deep learning," Proc. Int. Conf. Machine Learning, pp. 344–353, 2019.
[2] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, and W. Shi, "Federated learning of predictive models from federated electronic health records," Int. J. Med. Inform., vol. 112, pp. 59–67, Apr. 2018.
[3] J. Chen and A. H. Sayed, "Diffusion adaptation strategies for distributed optimization and learning over networks," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, 2012.
[4] J. Chen and A. H. Sayed, "On the limiting behavior of distributed optimization strategies," Proc. 50th IEEE Annu. Allerton Conf. Communication, Control, and Computing (Allerton), pp. 1535–1542, 2012.
[5] J. Chen and A. H. Sayed, "On the learning behavior of adaptive networks–Part I: Transient analysis," IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3487–3517, 2015.
[6] J. Chen and A. H. Sayed, "On the learning behavior of adaptive networks–Part II: Performance analysis," IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3518–3548, 2015.
[7] R. Chen and I. C. Paschalidis, "A robust learning approach for regression models based on distributionally robust optimization," J. Mach. Learn. Res., vol. 19, no. 1, pp. 517–564, 2018.
[8] R. Durrett, Random Graph Dynamics. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[9] A. Koloskova, S. U. Stich, and M. Jaggi, "Decentralized stochastic optimization and gossip algorithms with compressed communication," Proc. Int. Conf. Machine Learning, pp. 3478–3487, 2019.
[10] G. Lan, S. Lee, and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization," Math. Program., vol. 180, no. 1–2, pp. 1–48, 2017.
[11] Z. Li, W. Shi, and M. Yan, "A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates," IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, 2019.
[12] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," Proc. Advances in Neural Information Processing Systems, pp. 5336–5346, 2017.
[13] G. Morral, P. Bianchi, and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks," Proc. 53rd IEEE Conf. Decision and Control, pp. 1476–1481, 2014.
[14] G. Morral, P. Bianchi, and G. Fort, "Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks," IEEE Trans. Signal Process., vol. 65, no. 11, pp. 2798–2813, 2017.
[15] A. Nedić and A. Olshevsky, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Trans. Autom. Control, vol. 61, no. 12, pp. 3936–3947, 2016.
[16] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, "On distributed averaging algorithms and quantization effects," IEEE Trans. Autom. Control, vol. 54, no. 11, pp. 2506–2517, 2009.
[17] A. Nedić, A. Olshevsky, and M. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proc. IEEE, vol. 106, no. 5, pp. 953–976, 2018.
[18] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
[20] A. Olshevsky, "Linear time average consensus and distributed optimization on fixed graphs," SIAM J. Control Optim., vol. 55, no. 6, pp. 3990–4014, 2017.
[21] A. Olshevsky, I. C. Paschalidis, and A. Spiridonoff, "Robust asynchronous stochastic gradient-push: Asymptotically optimal and network-independent performance for strongly convex functions," 2018. [Online]. Available: .
[22] S. Pu and A. Garcia, "A flocking-based approach for distributed stochastic optimization," Oper. Res., vol. 66, no. 1, pp. 267–281, 2017.
[23] S. Pu and A. Garcia, "Swarming for faster convergence in stochastic optimization," SIAM J. Control Optim., vol. 56, no. 4, pp. 2997–3020, 2018.
[24] S. Pu and A. Nedić, "Distributed stochastic gradient tracking methods," 2018. [Online]. Available: .
[25] S. Pu, A. Olshevsky, and I. C. Paschalidis, "A sharp estimate on the transient time of distributed stochastic gradient descent," 2019. [Online]. Available: .
[26] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, 2018.
[27] A. Rakhlin, O. Shamir, and K. Sridharan, "Making gradient descent optimal for strongly convex stochastic optimization," Proc. 29th Int. Conf. Machine Learning, pp. 1571–1578, 2012.
[28] M. O. Sayin, N. D. Vanli, S. S. Kozat, and T. Basar, "Stochastic subgradient algorithms for strongly convex optimization over distributed networks," IEEE Trans. Netw. Sci. Eng., vol. 4, no. 4, pp. 248–260, 2017.
[29] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, "Optimal convergence rates for convex distributed optimization in networks," J. Mach. Learn. Res., vol. 20, no. 159, pp. 1–31, 2019.
[30] B. Sirb and X. Ye, "Decentralized consensus algorithm with delayed and stochastic gradients," SIAM J. Optim., vol. 28, no. 2, pp. 1232–1254, 2018.