
A convergence analysis of distributed dictionary learning based on the K-SVD algorithm

Publisher: IEEE

Abstract:
This paper provides a convergence analysis of a recent distributed algorithm, termed cloud K-SVD, that solves the problem of data-adaptive representations for big, distributed data. It is assumed that a number of geographically-distributed, interconnected sites have massive local data and they are collaboratively learning a sparsifying dictionary underlying these data using cloud K-SVD. This paper provides a rigorous analysis of cloud K-SVD that gives insights into its properties as well as deviations of the dictionaries learned at individual sites from a centralized solution in terms of different measures of local/global data and topology of the interconnections.
Date of Conference: 14-19 June 2015
Date Added to IEEE Xplore: 01 October 2015
INSPEC Accession Number: 15506595
Conference Location: Hong Kong, China

SECTION I.

Introduction

Data representation using dictionary learning has gained a lot of attention in recent years. Some important contributions towards solving the dictionary learning problem include [1]–[3]. But such methods assume the data to be present at a centralized location and are therefore not suited to cases in which data are distributed across multiple locations. Distributed data sets, on the other hand, are quite prevalent in today's information processing landscape. In order to address the challenge of dictionary learning from distributed data, [4]–[7] have recently proposed a few approaches. Among these is a distributed variant of the efficient K-SVD algorithm for dictionary learning, termed cloud K-SVD [5]. Computationally, cloud K-SVD has been shown to have many of the desirable characteristics of a dictionary learning algorithm, such as fast convergence and small approximation error [5]. Our goal in this paper is to provide a convergence analysis of cloud K-SVD.

In terms of our main contribution, note that cloud K-SVD relies on the power method for computing dominant singular vectors during the dictionary update step of K-SVD, and it uses consensus averaging to carry out the power method in a distributed manner. In [5], we provided a preliminary analysis of cloud K-SVD that dealt with the convergence of its distributed power method component. In this paper we build upon our initial analysis in [5] and show that the cloud K-SVD dictionaries converge to the centralized K-SVD dictionary under certain assumptions. The implication of our analysis is that the differences between the cloud K-SVD dictionaries and the centralized K-SVD dictionary can be made arbitrarily small as long as both algorithms are initialized identically and appropriate numbers of power method and consensus iterations are performed in cloud K-SVD. Furthermore, our analysis guarantees this as long as the total number of transmissions by any given site scales as \Omega(\log^{2}S_{\max}), where S_{\max} denotes the maximum number of data samples at any one site within the network.
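
As a concrete illustration of the consensus-averaging primitive that cloud K-SVD builds on, the following minimal NumPy sketch uses a hypothetical four-site path network with a doubly-stochastic weight matrix W = I - 0.3L built from the graph Laplacian (our own illustrative choice, not the paper's specific weight design) and shows how repeated local mixing drives every site's value to the network average:

```python
import numpy as np

# Hypothetical 4-site path network 1-2-3-4. W = I - 0.3*L (L = graph
# Laplacian) is symmetric and doubly stochastic, so z <- W z repeatedly
# replaces each site's value with a local average of its neighbors'
# values and converges to the global average at every site.
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
W = np.eye(4) - 0.3 * L

z = np.array([4.0, 8.0, 0.0, 2.0])     # one local scalar per site
avg = z.mean()                         # centralized answer: 3.5

for _ in range(100):                   # consensus iterations
    z = W @ z                          # each site talks only to neighbors

print(np.allclose(z, avg, atol=1e-6))  # → True
```

In cloud K-SVD the same primitive is applied to vectors, with several consensus rounds per matrix-vector product of the distributed power method; the weight matrix W is what enters the mixing time T_{mix} in the analysis below.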

To the best of our knowledge, this is the first work showing that dictionaries learned in distributed settings can be arbitrarily close to a centralized dictionary. While related works [4], [6] provide algorithms for distributed dictionary learning, they lack convergence guarantees. Other contributions like [7] focus on learning segments of the dictionary at each site, which is different from our setup since we are learning a complete dictionary at each site.

Notation. We use lower-case letters to represent scalars and vectors, and upper-case letters to represent matrices. Given a vector v, \mathrm{supp}(v) returns the indices of the nonzero entries in v, \Vert v\Vert_{p} denotes its \ell_{p} norm, \Vert v\Vert_{0} counts the number of its nonzero entries, and the superscript (\cdot)^{\mathrm{T}} denotes the transpose operation. Given a set \mathcal{I}, v_{\vert\mathcal{I}} and A_{\vert\mathcal{I}} denote a subvector and a submatrix obtained by retaining the entries of vector v and the columns of matrix A corresponding to the indices in \mathcal{I}, respectively. Given matrices \{A_{i}\in \mathbb{R}^{n_{i}\times m_{i}}\}_{i=1}^{N}, the operation \mathrm{diag}\{A_{1}, \ldots, A_{N}\} returns a block-diagonal matrix A\in \mathbb{R}^{\Sigma n_{i}\times\Sigma m_{i}} that has the A_{i}'s on its diagonal. Finally, given a matrix A, a_{j} and a_{j, T} denote the j^{th} column and the j^{th} row of A, respectively.
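
The notation above is easy to mirror in code; the following NumPy sketch (all matrices made up for illustration) demonstrates \mathrm{supp}(v), the restriction operators, and the block-diagonal construction:

```python
import numpy as np

v = np.array([0.0, 2.0, 0.0, -1.0])
supp = np.flatnonzero(v)               # supp(v): indices of nonzeros
print(supp)                            # → [1 3]
print(np.count_nonzero(v))             # ||v||_0 → 2

A = np.arange(12.).reshape(3, 4)
I = [1, 3]
A_sub = A[:, I]                        # A_{|I}: keep columns indexed by I
v_sub = v[supp]                        # v_{|I}: keep entries indexed by I

# diag{A_1, ..., A_N}: block-diagonal stacking of matrices
def blkdiag(*mats):
    rows = sum(M.shape[0] for M in mats)
    cols = sum(M.shape[1] for M in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for M in mats:
        out[r:r + M.shape[0], c:c + M.shape[1]] = M
        r += M.shape[0]
        c += M.shape[1]
    return out

B = blkdiag(np.ones((1, 2)), 2 * np.ones((2, 1)))
print(B.shape)                         # → (3, 3)
```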

SECTION II.

Problem Formulation

Our focus in this paper is on the convergence behavior of the cloud K-SVD algorithm [5]. Here, convergence of cloud K-SVD means that after T_{d} dictionary learning iterations, the gaps between the dictionaries learned using cloud K-SVD and the one learned using centralized K-SVD can be made arbitrarily small. Our goal in this regard is to understand these gaps in terms of various problem parameters, such as the number of sites and data samples, the topology of the interconnections, and the numbers of consensus and power method iterations.

A. Collaborative Dictionary Learning Using Cloud K-SVD

Consider a collection of N geographically-distributed sites that are interconnected according to a fixed topology. Mathematically, we represent this collection and its interconnections through an undirected graph \mathcal{G}=(\mathcal{N}, \mathcal{E}), where \mathcal{N}=\{1, 2, \ldots, N\} denotes the sites and \mathcal{E} denotes the edges in the graph, with (i, i)\in \mathcal{E} and (i, j)\in \mathcal{E} whenever there is a connection between site i and site j. The only assumption we make about the topology of \mathcal{G} is that it is a connected graph. Next, we assume that each site i has a massive collection of local data, expressed as a matrix Y_{i}\in \mathbb{R}^{n\times S_{i}}, with S_{i}\gg n representing the number of data samples at the i^{th} site. We can express this distributed data as a single matrix Y=[Y_{1}\ Y_{2}\ \cdots\ Y_{N}]\in \mathbb{R}^{n\times S}, where S=\sum_{i=1}^{N}S_{i} denotes the total number of data samples distributed across the N sites. In this setting, the fundamental objective is for each site to collaboratively learn a dictionary that underlies the global (distributed) data Y.

Assuming the global data Y is available at a centralized location, the dictionary learning problem can be expressed as \begin{equation*} (D, X)=\arg\min_{D, X}\Vert Y-DX \Vert_{F}^{2}\ \mathrm{s.t.}\ \forall s, \Vert x_{s}\Vert_{0}\leq T_{0}, \tag{1} \end{equation*} where D\in \mathbb{R}^{n\times K} with K > n is an overcomplete dictionary having unit \ell_{2}-norm columns, X\in \mathbb{R}^{K\times S} corresponds to representation coefficients of the data having no more than T_{0}\ll n nonzero coefficients per sample, and x_{s}, s=1, \ldots, S, denotes the s^{th} column of X. Unlike classical dictionary learning, however, we do not have the global data Y available at a centralized location. Therefore, our goal is to have the individual sites collaboratively learn dictionaries \{\hat{D}_{i}\}_{i\in \mathcal{N}} from the global data Y such that these collaborative dictionaries are close to a dictionary D that could have been learned from Y in a centralized fashion.
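
Problem (1) is typically attacked by alternating between sparse coding (with D fixed) and a dictionary update (with the sparse codes fixed). As a small illustration of the sparse coding half under the \Vert x\Vert_{0}\leq T_{0} constraint, here is a NumPy sketch of orthogonal matching pursuit, a standard greedy solver; the choice of solver here is ours for illustration only:

```python
import numpy as np

def omp(D, y, T0):
    """Greedy sparse coding: approximately solve
    min_x ||y - D x||_2  s.t.  ||x||_0 <= T0."""
    resid, supp = y.copy(), []
    for _ in range(T0):
        j = int(np.argmax(np.abs(D.T @ resid)))    # most correlated atom
        supp.append(j)
        x_s, *_ = np.linalg.lstsq(D[:, supp], y, rcond=None)
        resid = y - D[:, supp] @ x_s               # re-fit on the support
    x = np.zeros(D.shape[1])
    x[supp] = x_s
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)            # unit l2-norm columns
y = 2.0 * D[:, 3] - 1.5 * D[:, 7]         # a 2-sparse signal in D
x = omp(D, y, T0=2)
print(np.count_nonzero(x) <= 2)           # → True (sparsity respected)
```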

Cloud K-SVD, given as Algorithm 1, was developed in [5] to accomplish the goal of collaborative dictionary learning from distributed data. This algorithm is a distributed variant of the popular K-SVD algorithm, which consists of two main steps: sparse coding and dictionary update. The sparse coding step in cloud K-SVD is performed only on the locally available data at each site, while the dictionary update step is performed in a distributed manner using a distributed power method. This makes the dictionary update the most important and challenging step in cloud K-SVD. Specifically, we recall from the description of K-SVD in [2] that the dictionary update involves a singular value decomposition (SVD) of the error matrix E_{k}^{(t)}=[E_{1, k}^{(t)}\ \ldots\ E_{N, k}^{(t)}] in iteration t, which is available to K-SVD at one location. Cloud K-SVD, on the other hand, can only compute \hat{E}_{i, k}^{(t)} at any site i due to local sparse coding (cf. Step 5 of Algorithm 1), where \hat{E}_{i, k}^{(t)} denotes a perturbed version of the centralized E_{i, k}^{(t)}. Next, define an ordered set \overline{\omega}_{i, k}^{(t)}=\{s:1\leq s\leq S_{i},\ \overline{x}_{i, k, T}^{(t)}(s)\neq 0\}, where \overline{x}_{i, k, T}^{(t)}(s) denotes the s^{th} element of \overline{x}_{i, k, T}^{(t)}, and an S_{i}\times\vert \overline{\omega}_{i, k}^{(t)}\vert binary matrix \overline{\Omega}_{i, k}^{(t)} that has ones at the (\overline{\omega}_{i, k}^{(t)}(s), s) locations and zeros everywhere else. Then each site i in cloud K-SVD only has access to \hat{E}_{i, k, R}^{(t)}=\hat{E}_{i, k}^{(t)}\overline{\Omega}_{i, k}^{(t)} and \overline{x}_{i, k, R}^{(t)}=\overline{x}_{i, k, T}^{(t)}\overline{\Omega}_{i, k}^{(t)}, whereas the centralized K-SVD has access to E_{i, k, R}^{(t)}=E_{i, k}^{(t)}\Omega_{i, k}^{(t)} and x_{i, k, R}^{(t)}=x_{i, k, T}^{(t)}\Omega_{i, k}^{(t)}.
Finally, each site i in cloud K-SVD can only rely on \overline{\Omega}_{i, k}^{(t)}, while K-SVD has access to the matrices \Omega_{k}^{(t)}=diag(\Omega_{1, k}^{(t)},\ \ldots,\ \Omega_{N, k}^{(t)}) and E_{k, R}^{(t)}=E_{k}^{(t)}\Omega_{k}^{(t)}. Steps 4–21 in Algorithm 1 are designed to address these limitations of distributed dictionary learning; we refer the reader to [8] for further details on these steps.
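
To make the selection matrices above concrete, the sketch below (NumPy; the coefficient row is made up) builds \omega_{i, k} and \Omega_{i, k} from the k^{th} coefficient row and checks that right-multiplication by \Omega keeps exactly the columns of the samples that use atom k:

```python
import numpy as np

def selection_matrix(x_row):
    """Given the k-th row of the coefficient matrix (length S), return
    omega (indices of samples that use atom k) and the S x |omega|
    binary selection matrix Omega with ones at the (omega[s], s) spots."""
    omega = np.flatnonzero(x_row)
    S = x_row.size
    Omega = np.zeros((S, omega.size))
    Omega[omega, np.arange(omega.size)] = 1.0
    return omega, Omega

x_row = np.array([0.0, 1.2, 0.0, -0.7, 3.0])
omega, Omega = selection_matrix(x_row)
print(omega)                                  # → [1 3 4]

# Right-multiplying by Omega keeps exactly those columns:
E = np.arange(10.).reshape(2, 5)
print(np.allclose(E @ Omega, E[:, omega]))    # → True
```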

B. Main Challenge

Even after K-SVD and cloud K-SVD are identically initialized, D^{(0)}=\hat{D}_{i}^{(0)}, we will have D^{(1)}\neq\hat{D}_{i}^{(1)} at the end of iteration 1 due to finite power method and consensus iterations in cloud K-SVD. This error in \hat{D}_{i}^{(1)} will then cause errors in sparse coding (Step 3 of Algorithm 1). Next, it can be seen from Steps 5–6 of Algorithm 1 that errors in sparse coding and \hat{D}_{i}^{(1)} will result in deviation of \hat{E}_{i, k, R}^{(2)} from the centralized E_{i, k, R}^{(2)}. This means that the updated k^{th} atom in iteration 2 will have an error due to perturbation of E_{i, k, R}^{(2)} and due to errors caused by finite numbers of consensus and power method iterations. All these errors will keep on accumulating in the same way for any iteration t > 2. In summary, the main sources of error in cloud K-SVD are as follows:

Algorithm 1: Cloud K-SVD for dictionary learning

Algorithm
  1. Error in sparse coding due to perturbed dictionaries at the start of any iteration t > 1.

  2. Error in \hat{E}_{k, R}^{(t)} due to errors in the dictionaries from the previous iteration and errors in the sparse codes in the current iteration. This error in \hat{E}_{k, R}^{(t)} will result in an error during the dictionary update step even if there is no error in computing the principal eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}}.

  3. Error in computing the principal eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}} due to the finite numbers of power method and consensus iterations.

Our goal in this paper is to analyze how these errors accumulate across iterations and how to control them so that the errors in the dictionaries \hat{D}_{i}^{(t)} stay below some threshold after T_{d} dictionary learning iterations.

SECTION III.

Analysis of Cloud K-SVD

Analysis of cloud K-SVD requires an understanding of the behavior of its major components, which include sparse coding, dictionary update, and the distributed power method within the dictionary update. In addition, one also expects that the closeness of the \hat{D}_{i}'s to the centralized solution will be a function of certain properties of the local/global data. We begin our analysis of cloud K-SVD by first stating some of these properties in terms of the centralized K-SVD solution.

A. Preliminaries

We will start by providing an algorithmic specification of the sparse coding steps in both algorithms. While the sparse coding step as stated in Step 3 of Algorithm 1 has combinatorial complexity, various low-complexity computational approaches can be used to solve this step in practice. Our analysis in the following will be focused on the case when sparse coding in both cloud K-SVD and centralized K-SVD is carried out using the lasso [9]. Specifically, we assume sparse coding is carried out by solving \begin{equation*} x_{i, s}=\arg\min_{x\in \mathbb{R}^{K}}\frac{1}{2}\Vert y_{i, s}-Dx\Vert_{2}^{2}+\tau\Vert x\Vert_{1} \tag{2} \end{equation*} with the regularization parameter \tau > 0 selected in a way that \Vert x_{i, s}\Vert_{0}\leq T_{0}\ll n. This can be accomplished, for example, by making use of the least angle regression algorithm [10]. Note that the lasso also has a dual, constrained form, given by \begin{equation*} x_{i, s}=\arg\min_{x\in \mathbb{R}^{K}}\frac{1}{2}\Vert y_{i, s}-Dx\Vert_{2}^{2}\ \mathrm{s.t.}\ \Vert x\Vert_{1}\leq\eta, \tag{3} \end{equation*} and the solutions of (2) and (3) are identical for an appropriate \eta_{\tau}=\eta(\tau) [11].
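
For concreteness, the lasso subproblem (2) can be solved with any proximal-gradient scheme; the NumPy sketch below uses ISTA (iterative soft-thresholding) rather than the LARS solver [10] mentioned above, as a simpler stand-in that converges to the same minimizer:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(D, y, tau, n_iter=500):
    """Solve (2): min_x 0.5||y - D x||_2^2 + tau ||x||_1 via iterative
    soft-thresholding, with step size 1/L where L is the Lipschitz
    constant of the gradient of the smooth part."""
    L = np.linalg.norm(D, 2) ** 2          # spectral norm squared
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = soft_threshold(x - grad / L, tau / L)
    return x

rng = np.random.default_rng(1)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
y = D[:, [2, 5]] @ np.array([1.5, -1.0])     # hypothetical sample
x = lasso_ista(D, y, tau=0.1)
obj = 0.5 * np.linalg.norm(y - D @ x) ** 2 + 0.1 * np.abs(x).sum()
print(obj <= 0.5 * np.linalg.norm(y) ** 2)   # → True (better than x = 0)
```

In the analysis, \tau is assumed to be chosen so that the minimizer satisfies \Vert x\Vert_{0}\leq T_{0}; in practice this corresponds to sweeping \tau until the support is small enough.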

We also assume identical centralized and distributed initializations, i.e., \hat{D}_{i}^{(0)}= D^{(0)}, i= 1, \ldots, N, where D^{(t)}, t\geq 0, in the following denotes the centralized K-SVD dictionary estimate in the t^{th} iteration. Despite identical initialization, the cloud K-SVD dictionaries get perturbed in each iteration due to imperfect power method and consensus averaging. In order to ensure these perturbations do not cause the cloud K-SVD dictionaries to diverge from the centralized solution after T_{d} iterations, we need the dictionary estimates returned by centralized K-SVD in each iteration to satisfy the following properties.

  1. [P1] Let x_{i, s}^{(t)} denote the solution of the lasso (i.e., (2)) for D=D^{(t-1)} and \tau=\tau^{(t)}, t=1, \ldots, T_{d}. Then there exists some C_{1} > 0 such that the following holds: \begin{equation*} \min_{t, i, s,\ j\not\in \mathrm{supp}(x_{i, s}^{(t)})} \tau^{(t)}-\vert \langle d_{j}^{(t)}, y_{i, s}-D^{(t-1)}x_{i, s}^{(t)}\rangle\vert > C_{1}. \end{equation*}

    For the collection \{\tau^{(t)}\}_{t=1}^{T_{d}}, we also define the smallest regularization parameter \tau_{\min} = \min_{t}\tau^{(t)}, and the largest dual parameter among the (dual) collection \{\eta_{\tau}^{(t)}=\eta(\tau^{(t)})\}_{t=1}^{T_{d}} as \eta_{\tau,\max}=\max_{t}\eta_{\tau}^{(t)}.

  2. [P2] Define \Sigma_{T_{0}}=\{\mathcal{I}\subset\{1, \ldots, K\}:\vert \mathcal{I}\vert =T_{0}\}. Then there exists some C_{2}' > \frac{C_{1}^{4}\tau_{\min}^{2}}{1936} such that the following holds: \min_{t=1,\ldots, T_{d},\ \mathcal{I}\in\Sigma_{T_{0}}}\sigma_{T_{0}}(D_{\vert \mathcal{I}}^{(t-1)})\geq\sqrt{C_{2}'}, where \sigma_{T_{0}}(\cdot) denotes the T_{0}^{th} (ordered) singular value of a matrix. In our analysis, we will be using the parameter C_{2}=(\sqrt{C_{2}'}-\frac{C_{1}^{2}\tau_{\min}}{44})^{2}.

  3. [P3] Let \lambda_{1, k}^{(t)} > \lambda_{2, k}^{(t)}\geq\cdots\geq\lambda_{n, k}^{(t)}\geq 0 denote the eigenvalues of the centralized “reduced” matrix E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}}, k\in \{1, \ldots, K\}, in the t^{th} iteration, t \in \{1, \ldots, T_{d}\}. Then there exists some C_{3}' < 1 such that the following holds: \max_{t, k}\frac{\lambda_{2, k}^{(t)}}{\lambda_{1, k}^{(t)}}\leq C_{3}'. Now define C_{3}= \max\{1, \frac{1}{\min_{t, k}\lambda_{1, k}^{(t)}(1-C_{3}')}\}, which we will use in our forthcoming analysis.

Properties P1 and P2 correspond to sufficient conditions for x_{i, s}^{(t)} to be a unique solution of (2) [12] and guarantee that the centralized K-SVD generates a unique collection of sparse codes in each dictionary learning iteration. Property P3, on the other hand, ensures that algorithms such as the power method can be used to compute the dominant eigenvector of E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}} in each dictionary learning iteration (Steps 8–18 in Algorithm 1) [13]. In addition to these properties, our final analytical result for cloud K-SVD will also be a function of a certain parameter of the centralized error matrices \{E_{k}^{(t)}\}_{k=1}^{K} generated by the centralized K-SVD in each iteration. We define this parameter as follows. Let E_{i, k}^{(t)}, i= 1, \ldots, N, denote the part of the centralized error matrix E_{k}^{(t)} associated with the data of the i^{th} site in the t^{th} iteration, i.e., E_{k}^{(t)}= [E_{1, k}^{(t)}\ E_{2, k}^{(t)}\ \cdots\ E_{N, k}^{(t)}], k=1, \ldots, K, t= 1, \ldots, T_{d}. Then C_{4}=\max\{1, \max_{t, i, k}\Vert E_{i, k}^{(t)}\Vert_{2}\}.

B. Main Result

We are now ready to state the main result of this paper. This result is given in terms of the \Vert\cdot\Vert_{2}-norm mixing time, T_{mix}, of the Markov chain associated with the doubly-stochastic weight matrix W used for consensus averaging, defined as \begin{equation*} T_{mix}=\max_{i=1,\ldots, N}\inf_{t\in \mathbb{N}}\{t: \Vert \mathrm{e}_{i}^{\mathrm{T}}W^{t}-\frac{1}{N}1^{\mathrm{T}}\Vert_{2}\leq\frac{1}{2}\}. \tag{4} \end{equation*}

Here, \mathrm{e}_{i}\in \mathbb{R}^{N} denotes the i^{th} column of the identity matrix I_{N}. In the following, main convergence results for cloud K-SVD along with brief discussions are presented. For detailed proofs and discussions we refer the reader to [8].
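
Definition (4) is straightforward to evaluate numerically for a given weight matrix; the following sketch (NumPy, with a made-up three-site lazy-cycle W) computes T_{mix} directly:

```python
import numpy as np

def mixing_time(W, t_max=10_000):
    """2-norm mixing time per (4): smallest t such that
    ||e_i^T W^t - (1/N) 1^T||_2 <= 1/2 for every site i."""
    N = W.shape[0]
    P = np.eye(N)
    for t in range(1, t_max + 1):
        P = P @ W                         # row i of P is e_i^T W^t
        dev = np.linalg.norm(P - np.ones((N, N)) / N, axis=1)
        if dev.max() <= 0.5:
            return t
    raise RuntimeError("chain did not mix within t_max steps")

# Hypothetical 3-site cycle with lazy uniform weights (doubly stochastic):
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
print(mixing_time(W))                     # → 1 (well connected, mixes fast)
```

Sparser or more poorly connected topologies yield larger T_{mix}, which in turn inflates the required number of consensus iterations T_{c} in Theorem 1.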

Theorem 1

Stability of Cloud K-SVD Dictionaries

Suppose cloud K-SVD (Algorithm 1) and (centralized) K-SVD are identically initialized and both carry out T_{d} dictionary learning iterations. In addition, assume cloud K-SVD carries out T_{p} power method iterations during the update of each atom and T_{c} consensus iterations during each power method iteration. Finally, assume the K-SVD algorithm satisfies properties P1–P3. Next, define \begin{align*}&\alpha=\max_{t, k}\sum_{i=1}^{N}\Vert\hat{E}_{i, k, R}^{(t)}\hat{E}_{i, k, R}^{(t)^{\mathrm{T}}}\Vert_{2},\quad \beta=\max_{t, t_{p}, k}\frac{1}{\Vert\hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}}q_{c, t, k}^{(t_{p})}\Vert_{2}},\\ &\gamma = \max_{t, k}\sqrt{\sum_{i=1}^{N}\Vert\hat{E}_{i, k, R}^{(t)}\hat{E}_{i, k, R}^{(t)^{\mathrm{T}}}\Vert_{F}^{2}},\quad \nu=\max_{t, k}\frac{\hat{\lambda}_{2, k}^{(t)}}{\hat{\lambda}_{1, k}^{(t)}}, \end{align*}

\hat{\theta}_{k}^{(t)} \in [0, \pi/2] as \hat{\theta}_{k}^{(t)}= \arccos(\frac{\vert \langle u_{1, k}^{(t)}, q^{init}\rangle\vert }{\Vert u_{1, k}^{(t)}\Vert_{2}\Vert q^{init}\Vert_{2}}), \mu=\max\{1, \max_{k, t}\tan(\hat{\theta}_{k}^{(t)})\}, and \zeta= K\sqrt{2S_{\max}}(\frac{6\sqrt{KT_{0}}}{\tau_{\min}C_{2}}+\eta_{\tau,\max}), where S_{\max}=\max_{i}S_{i}, u_{1, k}^{(t)} is the dominant eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}}, \hat{\lambda}_{1, k}^{(t)} and \hat{\lambda}_{2, k}^{(t)} are the first and second largest eigenvalues of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}}, respectively, and q_{c, t, k}^{(t_{p})} denotes the iterates of a centralized power method initialized with q^{init} for estimation of the dominant eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}}. Then, assuming \min_{t, k}\vert \langle u_{1, k}^{(t)}, q^{init}\rangle\vert > 0, and fixing any \epsilon\in(0, \min\{(10\alpha^{2}\beta^{2})^{-1/3T_{p}}, (\frac{1-\nu}{4})^{1/3}\}) and \delta_{d}\in(0, \min\{\frac{1}{\sqrt{2}}, \frac{C_{1}^{2}\tau_{\min}}{44\sqrt{2K}}\}), we have \begin{equation*} \max_{i=1,\ldots, N,\ k=1,\ldots, K}\Vert\hat{d}_{i, k}^{(T_{d})}\hat{d}_{i, k}^{(T_{d})^{\mathrm{T}}}-d_{k}^{(T_{d})}d_{k}^{(T_{d})^{\mathrm{T}}}\Vert_{2}\leq\delta_{d} \tag{5} \end{equation*} as long as the number of power method iterations T_{p}\geq \frac{2(T_{d}K-2)\log(8C_{3}C_{4}^{2}N+5)+(T_{d}-1)\log(1+\zeta)+\log(8C_{3}C_{4}\mu N\sqrt{n}\delta_{d}^{-1})}{\log((\nu+4\epsilon^{3})^{-1})} and the number of consensus iterations T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha\beta\epsilon^{-1})+T_{mix}\log(\alpha^{-1}\gamma\sqrt{N})).

We now comment on the major implications of Theorem 1. First, the theorem establishes that the distributed dictionaries \{\hat{D}_{i}^{(T_{d})}\} can indeed remain arbitrarily close to the centralized dictionary D^{(T_{d})} after T_{d} dictionary learning iterations (cf. (5)). Second, the theorem shows that this can happen as long as the number of distributed power method iterations T_{p} scales in a certain manner. In particular, Theorem 1 calls for this scaling to be at least linear in T_{d}K (modulo the \log N multiplicative factor), which is the total number of SVDs that K-SVD needs to perform in T_{d} dictionary learning iterations. On the other hand, T_{p} need only scale logarithmically with S_{\max}, which is significant in the context of big data problems. Other main problem parameters that affect the scaling of T_{p} include T_{0}, n, and \delta_{d}^{-1}, all of which enter the scaling relation in a logarithmic fashion. Finally, Theorem 1 dictates that the number of consensus iterations T_{c} should also scale at least linearly with T_{p}T_{mix} (modulo some log factors) for the main result to hold. In summary, Theorem 1 guarantees that the distributed dictionaries learned by cloud K-SVD can remain close to the centralized dictionary without requiring excessive numbers of power method and consensus averaging iterations.

We now provide a brief heuristic understanding of the roadmap needed to prove Theorem 1. In the first dictionary learning iteration (t=1), we have \{\hat{D}_{i}^{(t-1)}\equiv D^{(t-1)}\} due to identical initializations. While this means both K-SVD and cloud K-SVD produce identical sparse codes for t=1, the distributed dictionaries begin to deviate from the centralized dictionary after this step. The perturbations in \{\hat{d}_{i, k}^{(1)}\} happen due to the finite numbers of power method and consensus averaging iterations for k=1, whereas for k > 1 they happen due to this reason as well as due to the earlier perturbations in \{\hat{d}_{i, j}^{(1)},\ \hat{x}_{i, j, T}^{(1)}\}, j < k. In subsequent dictionary learning iterations (t > 1), therefore, cloud K-SVD starts with already perturbed distributed dictionaries \{\hat{D}_{i}^{(t-1)}\}. This in turn also results in deviations between the sparse codes computed by K-SVD and cloud K-SVD, which then adds another source of perturbations in \{\hat{d}_{i, k}^{(t)}\} during the dictionary update steps. To summarize, imperfect power method and consensus averaging in cloud K-SVD introduce errors in the top eigenvector estimates of the (centralized) E_{1, R}^{(1)}E_{1, R}^{(1)^{\mathrm{T}}} at individual sites, which then accumulate for (k, t)\neq(1, 1) to also cause errors in the estimate \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}} of the matrix E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}} available to cloud K-SVD. Collectively, these two sources of errors cause deviations of the distributed dictionaries from the centralized dictionary, and the proof of Theorem 1 mainly relies on our ability to control both of them.

C. Roadmap to Theorem 1

The first main result needed for the proof of Theorem 1 looks at the errors in the estimates of the dominant eigenvector u_{1} of an arbitrary symmetric matrix M=\sum_{i=1}^{N}M_{i} obtained at the individual sites using imperfect power method and consensus averaging when the M_{i}'s are distributed across the N sites (cf. Algorithm 1). The following result effectively helps us control the errors in the cloud K-SVD dictionaries due to Steps 8–18 in Algorithm 1.

Theorem 2

Stability of Distributed Power Method

Consider any symmetric matrix M=\sum_{i=1}^{N}M_{i} with dominant eigenvector u_{1} and eigenvalues \vert \lambda_{1}\vert > \vert \lambda_{2}\vert \geq\cdots\geq\vert \lambda_{n}\vert. Suppose each M_{i}, i=1, \ldots, N, is only available at the i^{th} site in our network, and let \hat{q}_{i} denote the estimate of u_{1} obtained at site i after T_{p} iterations of the distributed power method (Steps 8–18 in Algorithm 1). Next, define \alpha_{p} = \sum_{i=1}^{N}\Vert M_{i}\Vert_{2}, \beta_{p} = \max_{t_{p}= 1,\ldots, T_{p}}\frac{1}{\Vert Mq_{c}^{(t_{p})}\Vert _{2}}, and \gamma_{p}=\sqrt{\sum_{i=1}^{N}\Vert M_{i}\Vert_{F}^{2}}, where q_{c}^{(t_{p})} denotes the iterates of a centralized power method initialized with q^{init}. Then, fixing any \epsilon\in(0, (10\alpha_{p}^{2}\beta_{p}^{2})^{-1/3T_{p}}), we have \begin{equation*} \max_{i=1,\ldots, N}\Vert u_{1}u_{1}^{\mathrm{T}}-\hat{q}_{i}\hat{q}_{i}^{\mathrm{T}}\Vert_{2}\leq\tan(\theta)\vert \frac{\lambda_{2}}{\lambda_{1}}\vert ^{T_{p}}+4\epsilon^{3T_{p}}, \tag{6} \end{equation*} as long as \vert \langle u_{1}, q^{init}\rangle\vert > 0 and the number of consensus iterations within each iteration of the distributed power method (Steps 11–15 in Algorithm 1) satisfies T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha_{p}\beta_{p}\epsilon^{-1})+T_{mix}\log(\alpha_{p}^{-1}\gamma_{p}\sqrt{N})). Here, \theta denotes the angle between u_{1} and q^{init}, defined as \theta= \arccos(\vert \langle u_{1}, q^{init}\rangle\vert /(\Vert u_{1}\Vert_{2}\Vert q^{init}\Vert_{2})).

The proof of this theorem can be found in our earlier work [5]. Theorem 2 states that \hat{q}_{i}\rightarrow\pm u_{1} at an exponential rate at each site as T_{p}\rightarrow\infty, as long as enough consensus iterations are performed within each iteration of the distributed power method. In the case of a finite number of distributed power method iterations, (6) in Theorem 2 tells us that the maximum error in the estimates of the dominant eigenvector is bounded by the sum of two terms, with the first term due to the finite number of power method iterations and the second term due to the finite number of consensus iterations.
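
Theorem 2 is easy to exercise numerically. The sketch below (NumPy; the local matrices M_i, the spectrum, and the network are all made up, with a complete graph chosen so that a single consensus round is exact) implements the distributed power method along the lines of Steps 8–18 and compares each site's estimate against the centralized dominant eigenvector:

```python
import numpy as np

def distributed_power_method(M_local, W, q_init, T_p=60, T_c=5):
    """Each site i holds M_i (so M = sum_i M_i). Per power iteration:
    form M_i q_i locally, run T_c consensus rounds so every site holds
    approximately (1/N) M q, then renormalize at each site."""
    N = len(M_local)
    Q = np.stack([q_init.astype(float)] * N)   # row i: site i's estimate
    for _ in range(T_p):
        V = np.stack([M_local[i] @ Q[i] for i in range(N)])
        for _ in range(T_c):
            V = W @ V                          # consensus averaging
        Q = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Q

rng = np.random.default_rng(2)
N, n = 4, 6
U, _ = np.linalg.qr(rng.standard_normal((n, n)))       # orthonormal basis
M = U @ np.diag([5.0, 2.0, 1.5, 1.0, 0.5, 0.1]) @ U.T  # known spectrum
u1 = U[:, 0]                                           # dominant eigenvector

# Split M across sites so that sum_i M_i = M exactly:
R = [rng.standard_normal((n, n)) for _ in range(N)]
R = [0.5 * (r + r.T) for r in R]
Rbar = sum(R) / N
M_local = [M / N + (R[i] - Rbar) for i in range(N)]

W = np.full((N, N), 1.0 / N)       # complete graph: consensus is exact
Q = distributed_power_method(M_local, W, np.ones(n))

err = max(np.linalg.norm(np.outer(q, q) - np.outer(u1, u1)) for q in Q)
print(err < 1e-8)                  # → True: every site recovers u1 u1^T
```

On a sparser topology, W would only approximate the average in T_c rounds, producing exactly the second error term of (6).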

The second main result needed to prove Theorem 1 looks at the errors between the individual blocks of the reduced distributed error matrix \hat{E}_{k, R}^{(t)}=[\hat{E}_{1, k, R}^{(t)}\ \hat{E}_{2, k, R}^{(t)}\ \cdots\ \hat{E}_{N, k, R}^{(t)}] and the reduced centralized error matrix E_{k, R}^{(t)}= [E_{1, k, R}^{(t)}\ E_{2, k, R}^{(t)}\ \cdots\ E_{N, k, R}^{(t)}] for k \in \{1, 2, \ldots, K\} and t\in\{1, 2, \ldots, T_{d}\}. This result helps us control the error in Step 6 of Algorithm 1 and, together with Theorem 2, characterizes the major sources of errors in cloud K-SVD in relation to centralized K-SVD. The following theorem provides a bound on the error in E_{i, k, R}^{(t)}.

Theorem 3

Perturbation in the Matrix \hat{E}_{i, k, R}^{(t)}

Recall the definitions of \Omega_{k}^{(t)} and \overline{\Omega}_{i, k}^{(t)} from Section II-A. Next, express \Omega_{k}^{(t)}=\mathrm{diag}\{\Omega_{1, k}^{(t)}, \ldots, \Omega_{N, k}^{(t)}\}, where \Omega_{i, k}^{(t)} corresponds to the data samples associated with the i^{th} site, and define B_{i, k, R}^{(t)}=\hat{E}_{i, k, R}^{(t)}-E_{i, k, R}^{(t)}. Finally, let \zeta, \mu, \nu, \epsilon, and \delta_{d} be as in Theorem 1, define \varepsilon=\mu\nu^{T_{p}}+4\epsilon^{3T_{p}}, and assume \varepsilon \leq \frac{\delta_{d}}{8N\sqrt{n}C_{3}(1+\zeta)^{T_{d}-1}C_{4}^{2}(8C_{3}NC_{4}^{2}+5)^{2(T_{d}K-2)}}. Then, if we perform T_{p} power method iterations and T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha\beta\epsilon^{-1})+T_{mix}\log(\alpha^{-1}\gamma\sqrt{N})) consensus iterations in cloud K-SVD and assume P1–P3 hold, we have for i\in\{1, \ldots, N\}, t\in\{1, 2, \ldots, T_{d}\}, and k\in\{1, 2, \ldots, K\} \begin{equation*} \Vert B_{i, k, R}^{(t)}\Vert_{2}\leq\begin{cases} 0, & \text{for}\ t=1,\ k=1,\\ \varepsilon(1+\zeta)^{t-1}C_{4}(8C_{3}NC_{4}^{2}+5)^{(t-1)K+k-2}, & \text{otherwise}. \end{cases} \end{equation*}

Theorem 3 tells us that the error in the matrix E_{i, k, R}^{(t)} can be made arbitrarily small through a suitable choice of T_{p} and \epsilon as long as all the assumptions of Theorem 1 are satisfied. One of the steps in proving Theorem 1 involves showing that the assumption on \varepsilon is satisfied as long as we perform as many power method and consensus iterations as required by Theorem 1 (see [8] for the complete proof). In the following, we provide a brief sketch of the proof of Theorem 3 and refer the reader to [8] for complete details.

We prove Theorem 3 by induction over the dictionary learning iteration t. But first we need a way to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2} using bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{k}, and we also need a method to bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2} using bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}. Notice from Algorithm 1 that sparse coding is always performed before the update of the first dictionary atom, but not before the update of any other dictionary atom. Due to this distinction, we address the problem of error accumulation in the matrix E_{i, k, R}^{(t)} for the first dictionary atom (\Vert B_{i, 1, R}^{(t+1)}\Vert_{2}) differently than for any other dictionary atom (\{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=2}^{K}). The proof of Theorem 3 can then be divided into three steps.

Bound on \Vert B_{i, k+1, R}^{(t)}\Vert_{2}

Recall from Steps 5–6 in Algorithm 1 that \hat{E}_{i, k, R}^{(t)}=Y_{i}\overline{\Omega}_{i, k}^{(t)} -\sum_{j=1}^{k-1}\hat{d}_{i, j}^{(t)}\hat{x}_{i, j, T}^{(t)}\overline{\Omega}_{i, k}^{(t)}- \sum_{j=k+1}^{K}\hat{d}_{i, j}^{(t-1)}\hat{x}_{i, j, T}^{(t)}\overline{\Omega}_{i, k}^{(t)}. Now, if one assumes that \overline{\Omega}_{k}^{(t)}= \Omega_{k}^{(t)}, which can be argued to be true using results from [14] and the assumptions of Theorem 1, then the error in E_{i, k, R}^{(t)} is due to errors in \{\hat{x}_{i, j, T, R}^{(t)}\}_{j=1}^{K} and \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. In fact, to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2} we only need to know bounds on the errors in \hat{d}_{i, k}^{(t)} and \hat{x}_{i, k, T, R}^{(t)}. Next, recall from Step 20 in Algorithm 1 that \hat{x}_{i, k, R}^{(t)}=\hat{d}_{i, k}^{(t)^{\mathrm{T}}}\hat{E}_{i, k, R}^{(t)}, which means we only need a bound on the error in d_{k}^{(t)} to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2}. The challenge, then, is to bound the error in d_{k}^{(t)} from a given bound on \Vert B_{i, k, R}^{(t)}\Vert_{2}. This is accomplished by noting that there are two sources of error in d_{k}^{(t)}. The first source is the difference in the eigenvectors of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}} and E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}}, which we bound using [13, Theorem 8.1.12]. The second source of error in d_{k}^{(t)} is the error in the eigenvector computation itself, which in our case is due to the distributed power method. It follows from Theorem 2 and the statement of Theorem 3 that this error is bounded by \epsilon. Combining these two sources of error, we can bound the error in d_{k}^{(t)}, which we then use to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2}.

Bound on \Vert B_{i.1_{\backslash }R}^{(t+1)}\Vert_{2}

In order to bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2} when we know bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}, the difference from the previous case is that we can no longer write the sparse codes \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K} in terms of the dictionary atoms \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. Therefore, in addition to bounding the errors in the dictionary atoms \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}, we also need to bound the errors in the sparse codes due to perturbations in the dictionaries after iteration t. Since we know \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}, we can use these to bound the errors in \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. Next, using the error bounds on \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}, we can use [14, Theorem 4] to bound the errors in \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K}. Finally, using these error bounds on \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K} and \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K}, we can bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2}.

Bound on \Vert B_{i, k, R}^{(t)}\Vert_{2}, \forall t and k

Finally, using an induction argument over t together with the two bounds above, we can prove Theorem 3.

SECTION IV.

Conclusion

In this paper, we have provided a mathematical convergence analysis of cloud K-SVD. Our analysis shows that, under certain assumptions, if enough power method and consensus iterations are performed then the cloud K-SVD dictionaries converge to the centralized K-SVD solution.

    References

    1.
    K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee and T. J. Sejnowski, "Dictionary learning algorithms for sparse representation", Neural Computation, vol. 15, no. 2, pp. 349-396, Feb. 2003.
    2.
    M. Aharon, M. Elad and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation", IEEE Trans. Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov. 2006.
    3.
    J. Mairal, F. Bach, J. Ponce and G. Sapiro, "Online learning for matrix factorization and sparse coding", JMLR, vol. 11, pp. 19-60, 2010.
    4.
    P. Chainais and C. Richard, "Learning a common dictionary over a sensor network", Proc. IEEE 5th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2013.
    5.
    H. Raja and W. U. Bajwa, "Cloud K-SVD: Computing data-adaptive representations in the cloud", Proc. 51st Annual Allerton Conference on Communication, Control, and Computing, pp. 1474-1481, 2013.
    6.
    J. Liang, M. Zhang, X. Zeng and G. Yu, "Distributed dictionary learning for sparse representation in sensor networks", IEEE Trans. Image Processing, vol. 23, no. 6, pp. 2528-2541, 2014.
    7.
    J. Chen, Z. J. Towfic and A. H. Sayed, "Dictionary learning over distributed models", IEEE Trans. Signal Processing, vol. 63, no. 4, pp. 1001-1016, Feb. 2015.
    8.
    H. Raja and W. U. Bajwa, "Cloud K-SVD: A collaborative dictionary learning algorithm for big distributed data", arXiv preprint arXiv:1412.7839, 2014.
    9.
    R. Tibshirani, "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society. Series B (Methodological), 1996.
    10.
    B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, "Least angle regression", Ann. Statist., vol. 32, no. 2, pp. 407-451, 2004.
    11.
    M. A. T. Figueiredo, R. D. Nowak and S. J. Wright, "Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems", IEEE J. Select. Topics Signal Processing, vol. 1, no. 4, pp. 586-597, Dec. 2007.
    12.
    J.-J. Fuchs, "On sparse representations in arbitrary redundant bases", IEEE Trans. Inform. Theory, vol. 50, no. 6, pp. 1341-1344, Jun. 2004.
    13.
    G. H. Golub and C. F. Van Loan, Matrix computations, Baltimore, MD:Johns Hopkins University Press, 2012.
    14.
    N. Mehta and A. G. Gray, "Sparsity-based generalization bounds for predictive sparse coding", Proc. 30th Intl Conf. Machine Learning (ICML'13), pp. 36-44, Jun. 2013.