Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
SIAM J. OPTIM. © 2015 Society for Industrial and Applied Mathematics
Vol. 25, No. 2, pp. 944–966
EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR
DECENTRALIZED CONSENSUS OPTIMIZATION
WEI SHI, QING LING, GANG WU, AND WOTAO YIN
Abstract. Recently, there has been growing interest in solving consensus optimization problems
in a multiagent network. In this paper, we develop a decentralized algorithm for the consensus optimization problem $\min_{x \in \mathbb{R}^p} \bar{f}(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, which is defined over a connected network of $n$ agents, where each function $f_i$ is held privately by agent $i$ and encodes the agent's data and objective.
tive. All the agents shall collaboratively find the minimizer while each agent can only communicate
with its neighbors. Such a computation scheme avoids a data fusion center or long-distance commu-
nication and offers better load balance to the network. This paper proposes a novel decentralized
exact first-order algorithm (abbreviated as EXTRA) to solve the consensus optimization problem.
“Exact” means that it can converge to the exact solution. EXTRA uses a fixed, large step size,
which can be determined independently of the network size or topology. The local variable of every
agent $i$ converges uniformly and consensually to an exact minimizer of $\bar{f}$. In contrast, the well-known
decentralized gradient descent (DGD) method must use diminishing step sizes in order to converge
to an exact minimizer. EXTRA and DGD have the same choice of mixing matrices and similar per-
iteration complexity. EXTRA, however, uses the gradients of the last two iterates, unlike DGD which
uses just that of the last iterate. EXTRA has the best known convergence rates among the existing synchronized first-order decentralized algorithms for minimizing convex Lipschitz-differentiable functions. Specifically, if the $f_i$'s are convex and have Lipschitz continuous gradients, EXTRA has an ergodic convergence rate of $O(\frac{1}{k})$ in terms of the first-order optimality residual. In addition, as long as $\bar{f}$ is (restricted) strongly convex (not all individual $f_i$'s need to be so), EXTRA converges to an optimal solution at a linear rate $O(C^{-k})$ for some constant $C > 1$.
Key words. consensus optimization, decentralized optimization, gradient method, linear con-
vergence
AMS subject classifications. 90C25, 90C30
DOI. 10.1137/14096668X
1. Introduction. This paper focuses on decentralized consensus optimization, a
problem defined on a connected network and solved by n agents cooperatively,
$$\underset{x \in \mathbb{R}^p}{\text{minimize}} \quad \bar{f}(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \tag{1.1}$$
over a common variable $x \in \mathbb{R}^p$, and for each agent $i$, $f_i : \mathbb{R}^p \to \mathbb{R}$ is a convex function
privately known by the agent. We assume that the fi’s are continuously differentiable
and will introduce a novel first-order algorithm to solve (1.1) in a decentralized man-
ner. We stick to the synchronous case in this paper, that is, all the agents carry out
their iterations at the same time intervals.
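As a concrete toy instance of problem (1.1), the sketch below runs synchronized decentralized gradient descent (DGD, discussed in section 1.1) on five agents arranged in a ring, each holding a private quadratic $f_i(x) = \frac{1}{2}(x - b_i)^2$, so the minimizer of $\bar{f}$ is the average of the $b_i$. The network, Metropolis mixing weights, step size, and iteration count are illustrative choices, not taken from the paper; with a fixed step size the agents agree only up to a neighborhood of the minimizer, consistent with the behavior recalled in the abstract.

```python
n = 5
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # private data; the minimizer of f_bar is mean(b) = 3.0
alpha = 0.05                     # fixed step size (illustrative)

def mix(x):
    # W: symmetric doubly stochastic mixing on a ring (Metropolis weights):
    # each agent averages itself and its two neighbors with weight 1/3.
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]

def grad(x):
    # gradient of f_i(x) = (x - b_i)^2 / 2 evaluated at the local variable x_i
    return [x[i] - b[i] for i in range(n)]

# DGD with a fixed step size: x_i^{k+1} = sum_j w_ij x_j^k - alpha * grad f_i(x_i^k)
x = [0.0] * n
for _ in range(2000):
    w, g = mix(x), grad(x)
    x = [w[i] - alpha * g[i] for i in range(n)]

mean_x = sum(x) / n
err = max(abs(xi - 3.0) for xi in x)
print(mean_x, err)  # the network average is exact, but individual agents stop short of 3.0
```

In this toy run the average of the local variables reaches the true minimizer, yet each agent's own iterate stalls at a fixed distance from it that shrinks only as the step size does, which is what motivates diminishing step sizes for DGD.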
Problems of the form (1.1) that require decentralized computation are found
widely in various scientific and engineering areas including sensor network information
processing, multiple-agent control and coordination, as well as distributed machine
learning. Examples and works include decentralized averaging [7, 15, 34], learning
Received by the editors April 28, 2014; accepted for publication (in revised form) January 21,
2015; published electronically May 7, 2015. This work was supported by Chinese Scholarship Coun-
cil (CSC) grants 201306340046 and 2011634506, NSFC grant 61004137, MOF/MIIT/MOST grant
BB2100100015, and NSF grants DMS-0748839 and DMS-1317602.
http://www.siam.org/journals/siopt/25-2/96668.html
Department of Automation, University of Science and Technology of China, Hefei, China
(shiwei00@mail.ustc.edu.cn, qingling@mail.ustc.edu.cn, wug@.ustc.edu.cn).
Department of Mathematics, UCLA, Los Angeles, CA 90095 (wotaoyin@math.ucla.edu).
[9, 22, 26], estimation [1, 2, 16, 18, 29], sparse optimization [19, 35], and low-rank
matrix completion [20] problems. Functions fi can take the form of least squares
[7, 15, 34], regularized least squares [1, 2, 9, 18, 22], as well as more general ones [26].
The solution x can represent, for example, the average temperature of a room [7, 34],
frequency-domain occupancy of spectra [1, 2], states of a smart grid system [10, 16],
sparse vectors [19, 35], a matrix factor [20], and so on. In general, decentralized
optimization fits the scenarios in which the data are collected and/or stored in a
distributed network, a fusion center is either infeasible or not economical, and/or
computing is required to be performed in a decentralized and collaborative manner
by multiple agents.
1.1. Related methods. Existing first-order decentralized methods for solving
(1.1) include the (sub)gradient method [21, 25, 36], the (sub)gradient-push method
[23, 24], the fast (sub)gradient method [5, 14], and the dual averaging method [8].
Compared to classical centralized algorithms, decentralized algorithms encounter more
restrictive assumptions and typically worse convergence rates. Most of the above
algorithms are analyzed under the assumption of bounded (sub)gradients. The work
[21] assumes a bounded Hessian for strongly convex functions. Recent work [36]
relaxes such assumptions for decentralized gradient descent (DGD). When (1.1) has
additional constraints that force x into a bounded set, which also leads to bounded
(sub)gradients and a Hessian, projected first-order algorithms are applicable [27, 37].
When using a fixed step size, these algorithms do not converge to a solution $x^*$ of problem (1.1) but to a point in its neighborhood, regardless of whether the $f_i$'s are differentiable or not [36]. This motivates the use of certain diminishing step sizes in [5, 8, 14] to guarantee convergence to $x^*$. The rates of convergence are
generally weaker than their analogues in centralized computation. For the general
convex case and under the bounded (sub)gradient (or Lipschitz–continuous objective)
assumption, [5] shows that diminishing step sizes $\alpha_k = \frac{1}{\sqrt{k}}$ lead to a convergence rate of $O(\frac{\ln k}{\sqrt{k}})$ in terms of the running best of objective error, and [8] shows that the dual averaging method has a rate of $O(\frac{\ln k}{\sqrt{k}})$ in the ergodic sense in terms of objective
error. For the general convex case, under assumptions of fixed step size and Lipschitz
continuous, bounded gradient, [14] shows an outer-loop convergence rate of $O(\frac{1}{k^2})$ in terms of objective error, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, the diminishing step sizes $\alpha_k = \frac{1}{k^{1/3}}$ lead to a reduced rate of $O(\frac{\ln k}{k})$. The (sub)gradient-push method [23] can be implemented in a dynamic digraph and, under the bounded (sub)gradient assumption and diminishing step sizes $\alpha_k = O(\frac{1}{\sqrt{k}})$, has a rate of $O(\frac{\ln k}{\sqrt{k}})$ in the ergodic sense in terms of objective error. A better rate of $O(\frac{\ln k}{k})$ is proved for the (sub)gradient-push method in [24] under the strong convexity and Lipschitz gradient assumptions, in terms of expected objective error plus squared consensus residual.
Some of the other related algorithms are as follows. For general convex functions and assuming closed and bounded feasible sets, the decentralized asynchronous alternating direction method of multipliers (ADMM) [32] is proved to have a rate of $O(\frac{1}{k})$ in terms of expected objective error and feasibility violation. The augmented
Lagrangian based primal-dual methods have linear convergence under strong convex-
ity and Lipschitz gradient assumptions [4, 30] or under the positive-definite bounded
Hessian assumption [12, 13].
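For orientation before the formal development, here is a minimal sketch of the kind of two-gradient update the abstract attributes to EXTRA, using $\tilde{W} = \frac{I+W}{2}$ and the recursion $x^{k+2} = (I+W)x^{k+1} - \tilde{W}x^k - \alpha\,[\nabla f(x^{k+1}) - \nabla f(x^k)]$, run on the same style of toy quadratic instance as above. The ring network, Metropolis weights, and step size are illustrative assumptions, not the paper's prescriptions; the point is only that a fixed step size now yields exact consensus on the minimizer.

```python
n = 5
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # private data; the minimizer of f_bar is mean(b) = 3.0
alpha = 0.05                     # fixed step size (illustrative)

def mix(x):
    # W: ring mixing with Metropolis weights 1/3 (self plus two neighbors)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]

def grad(x):
    # gradient of f_i(x) = (x - b_i)^2 / 2 at the local variable x_i
    return [x[i] - b[i] for i in range(n)]

# First step is one plain DGD update; afterwards the recursion combines the
# last two iterates and the difference of their gradients, with W_tilde = (I+W)/2:
#   x^{k+2} = (I + W) x^{k+1} - W_tilde x^k - alpha * (grad f(x^{k+1}) - grad f(x^k))
x_prev = [0.0] * n
w0, g0 = mix(x_prev), grad(x_prev)
x = [w0[i] - alpha * g0[i] for i in range(n)]
for _ in range(2000):
    wx, wxp = mix(x), mix(x_prev)
    g, gp = grad(x), grad(x_prev)
    x_next = [x[i] + wx[i] - 0.5 * (x_prev[i] + wxp[i]) - alpha * (g[i] - gp[i])
              for i in range(n)]
    x_prev, x = x, x_next

err = max(abs(xi - 3.0) for xi in x)
print(err)  # every agent reaches the exact minimizer despite the fixed step size
```

Contrast with the DGD sketch earlier: the same network, step size, and data now drive every local variable to the exact minimizer, which is the "exact" behavior the abstract claims for EXTRA.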
Our proposed algorithm is a synchronous gradient-based algorithm that has an ergodic rate of $O(\frac{1}{k})$ (in terms of optimality condition violation) for general convex