Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
SIAM J. OPTIM. © 2016 Society for Industrial and Applied Mathematics
Vol. 26, No. 3, pp. 1835–1854
ON THE CONVERGENCE OF DECENTRALIZED GRADIENT
DESCENT∗
KUN YUAN† , QING LING† , AND WOTAO YIN‡
Abstract. Consider the consensus problem of minimizing f(x) = ∑_{i=1}^n fi(x), where x ∈ R^p
and each fi is known only to the individual agent i in a connected network of n agents. To solve this
problem, all the agents collaborate with their neighbors through information exchange. This type of
decentralized computation does not need a fusion center, offers better network load balance, and
improves data privacy. This paper studies the decentralized gradient descent method [A. Nedic and
A. Ozdaglar, IEEE Trans. Automat. Control, 54 (2009), pp. 48–61], in which each agent i updates
its local variable x(i) ∈ R^p by combining the weighted average of its neighbors' variables with a
local negative-gradient step −α∇fi(x(i)). The method is described by the iteration x(i)(k + 1) ←
∑_{j=1}^n wij x(j)(k) − α∇fi(x(i)(k)), for each agent i, where wij is nonzero only if i and j are neighbors
or i = j, and the matrix W = [wij] ∈ R^{n×n} is symmetric and doubly stochastic. This paper analyzes
the convergence of this iteration and derives its rate of convergence under the assumptions that each
fi is proper, closed, convex, and lower bounded, ∇fi is Lipschitz continuous with constant Lfi > 0,
and the stepsize α is fixed. Provided that α ≤ min{(1 + λn(W))/Lh, 1/Lf̄}, where Lh = maxi{Lfi}
and Lf̄ = (1/n) ∑_{i=1}^n Lfi, the objective errors of all the local solutions and the networkwide mean
solution decrease at rates of O(1/k) until they reach a level of O(α). If the fi are strongly convex with
moduli µfi and α ≤ min{(1 + λn(W))/Lh, 1/(Lf̄ + µf̄)}, where µf̄ = (1/n) ∑_{i=1}^n µfi, then all the local
solutions and the mean solution converge to the global minimizer x* at a linear rate until reaching an
O(α)-neighborhood of x*. We also develop an iteration for decentralized basis pursuit and establish
its linear convergence to an O(α)-neighborhood of the true sparse signal. This analysis reveals how
the convergence of the iteration x(i)(k + 1) ← ∑_{j=1}^n wij x(j)(k) − α∇fi(x(i)(k)) depends on the
stepsize, function convexity, and network spectrum.
Key words. decentralized, distributed, consensus, optimization, gradient descent
AMS subject classifications. 90C25, 90C30
DOI. 10.1137/130943170
1. Introduction. Consider that n agents form a connected network and collab-
oratively solve a consensus optimization problem
(1)    minimize_{x∈R^p}   f(x) = ∑_{i=1}^n fi(x),
where each fi is only available to agent i. A pair of agents can exchange data if
and only if they are connected by a direct communication link; we say that two such
agents are neighbors of each other. Let X ∗ denote the set of solutions to (1), which
is assumed to be nonempty, and let f ∗ denote the optimal objective value.
The traditional (centralized) gradient descent iteration is
(2) x(k + 1) = x(k) − α∇f (x(k)),
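As a quick illustration, iteration (2) can be sketched in a few lines of NumPy. The least-squares objective, data, and stepsize below are hypothetical choices for illustration, not from the paper:

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * ||A x - b||^2, so grad f(x) = A^T (A x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    return A.T @ (A @ x - b)

# Centralized gradient descent, iteration (2), with a fixed stepsize
# alpha = 1/L, where L = ||A^T A||_2 is the Lipschitz constant of grad f.
L = np.linalg.norm(A.T @ A, 2)
alpha = 1.0 / L
x = np.zeros(5)
for k in range(500):
    x = x - alpha * grad_f(x)

# With a fixed stepsize and no communication constraints, x converges to the
# unique least-squares solution.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.max(np.abs(x - x_star)))
```

This is the baseline the paper contrasts with: evaluating grad_f requires all the data in one place, which is exactly what the decentralized setting forbids.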
∗Received by the editors October 28, 2013; accepted for publication (in revised form) June 22,
2016; published electronically September 8, 2016.
http://www.siam.org/journals/siopt/26-3/94317.html
Funding: The second author’s work was supported by NSF China grant 61573331 and NSF An-
hui grant 1608085QF13. The third author’s work was supported by ARL and ARO grant W911NF-
09-1-0383 and NSF grants DMS-0748839 and DMS-1317602.
†Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026,
China (kunyuan@mail.ustc.edu.cn, qingling@mail.ustc.edu.cn).
‡Department of Mathematics, University of California, Los Angeles, CA 90095 (wotaoyin@math.
ucla.edu).
1835
where α is the stepsize, either fixed or varying with k. To apply iteration (2) to
problem (1) in the decentralized setting, one has two choices of implementation:
• let a fusion center (which can be a designated agent) carry out iteration (2);
• let all the agents carry out the same iteration (2) in parallel.
In either way, fi (and thus ∇fi) is only known to agent i. Therefore, in order to ob-
tain ∇f(x(k)) = ∑_{i=1}^n ∇fi(x(k)), every agent i must have x(k), compute ∇fi(x(k)),
and then send out ∇fi(x(k)). This approach requires synchronizing x(k) and scat-
tering/collecting ∇fi(x(k)), i = 1, . . . , n, over the entire network, which incurs a
significant amount of communication traffic, especially when the network is large and
sparse. A decentralized approach will be more viable since its communication is re-
stricted to neighbors. Although there is no guarantee that decentralized algorithms
use less communication (as they tend to take more iterations), they provide better
network load balance and tolerance to the failure of individual agents. In addition,
each agent can keep its fi and ∇fi private to some extent.1
Decentralized gradient descent [20] does not rely on a fusion center or network-
wide communication. It carries out an approximate version of (2) in the following
fashion:
• let each agent i hold an approximate copy x(i) ∈ Rp of x ∈ Rp;
• let each agent i update its x(i) to the weighted average of its neighborhood;
• let each agent i apply −∇fi(x(i)) to decrease fi(x(i)).
At each iteration k, each agent i performs the following steps:
1. computes ∇fi(x(i)(k));
2. computes the neighborhood weighted average x(i)(k + 1/2) = ∑_{j=1}^n wij x(j)(k),
where wij ≠ 0 only if j is a neighbor of i or j = i;
3. applies x(i)(k + 1) = x(i)(k + 1/2) − α∇fi(x(i)(k)).
Steps 1 and 2 can be carried out in parallel, and their results are used in step 3.
Putting the three steps together, we arrive at our main iteration
(3)    x(i)(k + 1) = ∑_{j=1}^n wij x(j)(k) − α∇fi(x(i)(k)),   i = 1, 2, . . . , n.
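Iteration (3) can be simulated directly by stacking the local copies x(i) as the rows of a matrix. The toy consensus least-squares problem and the 5-agent ring network below are hypothetical choices for illustration, not from the paper:

```python
import numpy as np

# Decentralized gradient descent, iteration (3), on a hypothetical problem
# with f_i(x) = 0.5 * ||A_i x - b_i||^2 held by agent i.
rng = np.random.default_rng(1)
n, p = 5, 3
A = [rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

# Symmetric doubly stochastic mixing matrix W for a ring: each agent averages
# itself and its two neighbors with equal weight 1/3.
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

alpha = 1e-3
X = np.zeros((n, p))                      # row i holds the local copy x_(i)
for k in range(20000):
    grads = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
    X = W @ X - alpha * grads             # iteration (3) for all agents at once

# With a fixed stepsize, all local copies settle near the global minimizer,
# but only up to an O(alpha)-sized residual, as the analysis predicts.
x_star = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)[0]
print(np.max(np.abs(X - x_star)))
```

Note that each row update uses only the rows of X belonging to agent i's neighbors (the nonzero entries of row i of W), so the matrix form is just a compact way of writing n parallel local updates.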
When fi is not differentiable, replacing ∇fi with a member of the subdifferential ∂fi
yields the decentralized subgradient method [20]. Other decentralized methods are reviewed
in section 1.2.
We assume that the mixing matrix W = [wij ] is symmetric and doubly stochastic.
The eigenvalues of W are real and sorted in a nonincreasing order 1 = λ1(W ) ≥
λ2(W ) ≥ · · · ≥ λn(W ) ≥ −1. Let the second largest magnitude of the eigenvalues of
W be denoted as
(4) β = max {|λ2(W )|, |λn(W )|} .
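For a concrete mixing matrix, β in (4) is easy to compute from the eigenvalues of W. The 5-agent ring with equal 1/3 weights is a hypothetical example, not from the paper:

```python
import numpy as np

# Compute beta in (4) for a small symmetric doubly stochastic mixing matrix.
n = 5
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

assert np.allclose(W, W.T)              # symmetric
assert np.allclose(W.sum(axis=1), 1.0)  # row sums are 1; columns follow by symmetry

lam = np.sort(np.linalg.eigvalsh(W))[::-1]  # real eigenvalues, nonincreasing order
beta = max(abs(lam[1]), abs(lam[-1]))       # second largest magnitude, eq. (4)
print(lam[0], beta)
```

A smaller β means faster information mixing across the network; the design of W to minimize β is the subject of [4].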
The optimization of matrix W and, in particular, β, is not our focus; the reader is
referred to [4].
Some basic questions regarding the decentralized gradient method include the
following: (i) When does x(i)(k) converge? (ii) Does it converge to x∗ ∈ X ∗? (iii) If
x∗ is not the limit, does consensus (i.e., x(i)(k) = x(j)(k) ∀i, j) hold asymptotically?
(iv) How do the properties of fi and the network affect convergence?
1Neighbors of i may know the samples of fi and/or ∇fi at some points through data exchanges
and thus obtain an interpolation of fi.