Analysis of cloud K-SVD requires an understanding of the behavior of its major components, which include sparse coding, dictionary update, and the distributed power method within the dictionary update. In addition, one also expects that the closeness of the \hat{D}_{i}'s to the centralized solution will be a function of certain properties of the local/global data. We begin our analysis of cloud K-SVD by first stating some of these properties in terms of the centralized K-SVD solution.
A. Preliminaries
We will start by specifying the sparse coding step in both algorithms. While the sparse coding step as stated in Step 3 of Algorithm 1 has combinatorial complexity, various low-complexity computational approaches can be used to solve it in practice. Our analysis in the following focuses on the case when sparse coding in both cloud K-SVD and centralized K-SVD is carried out using the lasso [9]. Specifically, we assume sparse coding is carried out by solving \begin{equation*}
x_{i, s}=\arg\min_{x\in \mathbb{R}^{K}}\frac{1}{2}\Vert y_{i, s}-Dx\Vert_{2}^{2}+\tau\Vert x\Vert_{1}
\tag{2}
\end{equation*} with the regularization parameter \tau > 0 selected so that \Vert x_{i, s}\Vert_{0}\leq T_{0}\ll n. This can be accomplished, for example, by using the least angle regression algorithm [10]. Note that the lasso also has a dual, constrained form, given by \begin{equation*}
x_{i, s}=\arg\min_{x\in \mathbb{R}^{K}}\frac{1}{2}\Vert y_{i, s}-Dx\Vert_{2}^{2}\quad \text{s.t.}\quad \Vert x\Vert_{1}\leq\eta,
\tag{3}
\end{equation*} and the solutions of (2) and (3) are identical for an appropriate \eta_{\tau}=\eta(\tau) [11].
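Although our analysis assumes a solver such as LARS [10], any algorithm that returns the minimizer of (2) can be used in practice. The following is a minimal numpy sketch, not the solver used in the paper, that solves (2) via iterative soft-thresholding (ISTA) on a hypothetical random unit-norm dictionary; the names `lasso_ista`, `D`, and `y` are ours, introduced only for illustration.

```python
import numpy as np

def lasso_ista(D, y, tau, n_iter=500):
    """Solve min_x 0.5*||y - D x||_2^2 + tau*||x||_1 via ISTA
    (proximal gradient); a stand-in for the LARS solver of [10]."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term's gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of 0.5*||y - D x||^2
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - tau / L, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x_true = np.zeros(50); x_true[[3, 17]] = [1.0, -0.8]
y = D @ x_true
x_hat = lasso_ista(D, y, tau=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))  # sparsity of the recovered code
```

As in the discussion above, shrinking \tau trades off sparsity of x_{i, s} against fidelity to y_{i, s}; here \tau=0.1 recovers a code supported near the true two-sparse support, with the usual lasso shrinkage of the coefficient magnitudes.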
We also assume identical centralized and distributed initializations, i.e., \hat{D}_{i}^{(0)}= D^{(0)}, i= 1, \ldots, N, where D^{(t)}, t\geq 0, in the following denotes the centralized K-SVD dictionary estimate in the t^{th} iteration. Despite identical initialization, the cloud K-SVD dictionaries get perturbed in each iteration due to imperfect power method and consensus averaging. In order to ensure these perturbations do not cause the cloud K-SVD dictionaries to diverge from the centralized solution after T_{\mathrm{d}} iterations, we need the dictionary estimates returned by centralized K-SVD in each iteration to satisfy the following properties.
[P1] Let x_{i, s}^{(t)} denote the solution of the lasso (i.e., (2)) for D=D^{(t-1)} and \tau=\tau^{(t)}, t=1, \ldots, T_{\mathrm{d}}. Then there exists some C_{1} > 0 such that the following holds:
\begin{equation*}
\min_{t, i, s, j\not\in \mathrm{supp}(x_{i, s}^{(t)})} \tau^{(t)}-\vert \langle d_{j}^{(t)}, y_{i, s}-D^{(t-1)}x_{i, s}^{(t)}\rangle\vert > C_{1}.
\end{equation*}
For the collection \{\tau^{(t)}\}_{t=1}^{T_{d}}, we also define the smallest regularization parameter \tau_{\min} = \min_{t}\tau^{(t)} and the largest dual parameter among the (dual) collection \{\eta_{\tau}^{(t)}=\eta(\tau^{(t)})\}_{t=1}^{T_{d}} as \eta_{\tau,\max}=\max_{t}\eta_{\tau}^{(t)}.
[P2] Define \Sigma_{T_{0}}=\{\mathcal{I}\subset\{1,\ \ldots,\ K\}:\vert \mathcal{I}\vert =T_{0}\}. Then there exists some C_{2}' > \frac{C_{1}^{4}\tau_{\min}^{2}}{1936} such that the following holds: \min_{t=1,\ldots, T_{d},\ \mathcal{I}\in\Sigma_{T_{0}}}\sigma_{T_{0}}(D_{\mathcal{I}}^{(t-1)})\geq\sqrt{C_{2}'}, where \sigma_{T_{0}}(\cdot) denotes the T_{0}^{th} (ordered) singular value of a matrix. In our analysis, we will use the parameter C_{2}=(\sqrt{C_{2}'}-\frac{C_{1}^{2}\tau_{\min}}{44})^{2}.
[P3] Let \lambda_{1, k}^{(t)} > \lambda_{2, k}^{(t)}\geq\cdots\geq\lambda_{n, k}^{(t)} denote the eigenvalues of the centralized "reduced" matrix E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}}, k\in \{ 1, \ldots, K\}, in the t^{th} iteration, t \in \{ 1, \ldots, T_{d}\}. Then there exists some C_{3}' < 1 such that the following holds: \max_{t, k}\frac{\lambda_{2, k}^{(t)}}{\lambda_{1, k}^{(t)}}\leq C_{3}'. Now define C_{3}= \max\{1, \frac{1}{\min_{t, k}\lambda_{1, k}^{(t)}(1-C_{3}')}\}, which we will use in our forthcoming analysis.
Properties P1 and P2 correspond to sufficient conditions for x_{i, s}^{(t)} to be a unique solution of (2) [12] and guarantee that centralized K-SVD generates a unique collection of sparse codes in each dictionary learning iteration. Property P3, on the other hand, ensures that algorithms such as the power method can be used to compute the dominant eigenvector of E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}} in each dictionary learning iteration (Steps 8–18 in Algorithm 1) [13]. In addition to these properties, our final analytical result for cloud K-SVD will also be a function of a certain parameter of the centralized error matrices \{E_{k}^{(t)}\}_{k=1}^{K} generated by centralized K-SVD in each iteration. We define this parameter as follows. Let E_{i, k}^{(t)}, i= 1, \ldots, N, denote the part of the centralized error matrix E_{k}^{(t)} associated with the data of the i^{th} site in the t^{th} iteration, i.e., E_{k}^{(t)}= [E_{1, k}^{(t)}\ E_{2, k}^{(t)}\ \cdots\ E_{N, k}^{(t)}], k=1, \ldots, K, t= 1, \ldots, T_{d}. Then C_{4}=\max\{1, \max_{t, i, k}\Vert E_{i, k}^{(t)}\Vert_{2}\}.
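Property P3 is a statement about the eigengap of E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}}. For a single (t, k) pair, the admissible constants can be checked numerically; the numpy sketch below uses a hypothetical random error matrix (not data from the paper) to illustrate computing \lambda_{2}/\lambda_{1} and the resulting C_{3}.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical reduced error matrix E_{k,R}^{(t)} for one (t, k) pair.
E_kR = rng.standard_normal((8, 100))
M = E_kR @ E_kR.T                                # matrix whose top eigenvector becomes the new atom
eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]   # eigenvalues in descending order
ratio = eigvals[1] / eigvals[0]                  # lambda_2 / lambda_1; P3 needs this bounded below 1
C3_prime = ratio                                 # tightest admissible C_3' for this single (t, k)
C3 = max(1.0, 1.0 / (eigvals[0] * (1.0 - C3_prime)))
print(ratio < 1.0)
```

In the paper's setting the maximum is taken over all t and k, so C_{3}' would be the largest such ratio across iterations and atoms; a ratio close to 1 (a small eigengap) slows the power method convergence that P3 is meant to guarantee.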
B. Main Result
We are now ready to state the main result of this paper. This result is given in terms of the \Vert\cdot\Vert_{2}-norm mixing time, T_{mix}, of the Markov chain associated with the doubly-stochastic weight matrix W used for consensus averaging, defined as \begin{equation*}
T_{mix}=\max_{i=1,\ldots, N}\inf_{t\in \mathbb{N}}\{t: \Vert \mathrm{e}_{i}^{\mathrm{T}}W^{t}-\frac{1}{N}1^{\mathrm{T}}\Vert_{2}\leq\frac{1}{2}\}.
\tag{4}
\end{equation*}
Here, \mathrm{e}_{i}\in \mathbb{R}^{N} denotes the i^{th} column of the identity matrix I_{N}. In the following, we present the main convergence results for cloud K-SVD along with brief discussions; for detailed proofs and discussions, we refer the reader to [8].
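For a concrete W, T_{mix} in (4) can be computed by direct simulation of the Markov chain. The sketch below does this for an assumed doubly-stochastic weight matrix on a 4-cycle; the matrix is illustrative, not one used in the paper.

```python
import numpy as np

def mixing_time(W, max_t=10_000):
    """Smallest t with ||e_i^T W^t - (1/N) 1^T||_2 <= 1/2 for every i,
    i.e., the l2-norm mixing time defined in (4)."""
    N = W.shape[0]
    Wt = np.eye(N)
    for t in range(1, max_t + 1):
        Wt = Wt @ W                        # rows of Wt are e_i^T W^t
        dev = np.linalg.norm(Wt - np.ones((N, N)) / N, axis=1)
        if dev.max() <= 0.5:
            return t
    return None

# Doubly-stochastic weights for a 4-cycle (assumed topology, for illustration).
W = np.array([[0.5 , 0.25, 0.  , 0.25],
              [0.25, 0.5 , 0.25, 0.  ],
              [0.  , 0.25, 0.5 , 0.25],
              [0.25, 0.  , 0.25, 0.5 ]])
print(mixing_time(W))  # → 1
```

Sparser or larger topologies have larger T_{mix}, which, per Theorem 1 below, proportionally inflates the required number of consensus iterations T_{c}.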
Theorem 1
Stability of Cloud K-SVD Dictionaries: Suppose cloud K-SVD (Algorithm 1) and (centralized) K-SVD are identically initialized and both of them carry out T_{d} dictionary learning iterations. In addition, assume cloud K-SVD carries out T_{p} power method iterations during the update of each atom and T_{c} consensus iterations during each power method iteration. Finally, assume the K-SVD algorithm satisfies properties P1–P3. Next, define \begin{align*}&\alpha=\max_{t, k}\sum_{i=1}^{N}\Vert\hat{E}_{i, k, R}^{(t)}\hat{E}_{i, k, R}^{(t)^{T}}\Vert_{2}, \quad \beta=\max_{t, t_{p}, k}\frac{1}{\Vert\hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{T}}q_{c, t, k}^{(t_{p})}\Vert_{2}},\\
&\gamma = \max_{t, k}\sqrt{\sum_{i=1}^{N}\Vert\hat{E}_{i, k, R}^{(t)}\hat{E}_{i, k, R}^{(t)^{T}}\Vert_{F}^{2}}, \quad \nu=\max_{t, k}{\frac{\hat{\lambda}_{2, k}^{(t)}}{\hat{\lambda}_{1, k}^{(t)}}},
\end{align*}
define \hat{\theta}_{k}^{(t)} \in [0,\ \pi/2] as \hat{\theta}_{k}^{(t)}= \arccos (\frac{\vert \langle u_{1, k}^{(t)}, q^{init}\rangle\vert }{\Vert u_{1, k}^{(t)}\Vert_{2}\Vert q^{init}\Vert_{2}}), \mu=\max\{1, \max_{k, t}\tan(\hat{\theta}_{k}^{(t)})\}, and \zeta= K\sqrt{2S_{\max}}(\frac{6\sqrt{KT_{0}}}{\tau_{\min}C_{2}}+\eta_{\tau,\max}), where S_{\max}=\max_{i}S_{i}, u_{1, k}^{(t)} is the dominant eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{T}}, \hat{\lambda}_{1, k}^{(t)} and \hat{\lambda}_{2, k}^{(t)} are the first and second largest eigenvalues of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{T}}, respectively, and q_{c, t, k}^{(t_{p})} denotes the iterates of a centralized power method initialized with q^{init} for estimation of the dominant eigenvector of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{T}}. Then, assuming \min_{t, k}\vert \langle u_{1, k}^{(t)}, q^{init}\rangle\vert > 0, and fixing any \epsilon\in(0, \min\{(10\alpha^{2}\beta^{2})^{-1/3T_{p}},\ (\frac{1-\nu}{4})^{1/3}\}) and \delta_{d}\in(0, \min\{\frac{1}{\sqrt{2}},\ \frac{C_{1}^{2}\tau_{\min}}{44\sqrt{2K}}\}), we have
\begin{equation*}
\max_{i=1,\ldots, N,\ k=1,\ldots, K}\Vert\hat{d}_{i, k}^{(T_{d})}\hat{d}_{i, k}^{(T_{d})^{T}}-d_{k}^{(T_{d})}d_{k}^{(T_{d})^{T}}\Vert_{2}\leq\delta_{d}
\tag{5}
\end{equation*} as long as the number of power method iterations T_{p}\geq \frac{2(T_{d}K-2)\log(8C_{3}C_{4}^{2}N+5)+(T_{d}-1)\log(1+\zeta)+\log(8C_{3}C_{4}\mu N\sqrt{n}\delta_{d}^{-1})}{\log((\nu+4\epsilon^{3})^{-1})} and the number of consensus iterations T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha\beta\epsilon^{-1})+T_{mix}\log(\alpha^{-1}\gamma\sqrt{N})).
We now comment on the major implications of Theorem 1. First, the theorem establishes that the distributed dictionaries \{\hat{D}_{i}^{(T_{d})}\} can indeed remain arbitrarily close to the centralized dictionary D^{(T_{d})} after T_{\mathrm{d}} dictionary learning iterations (cf. (5)). Second, the theorem shows that this can happen as long as the number of distributed power method iterations T_{p} scales in a certain manner. In particular, Theorem 1 calls for this scaling to be at least linear in T_{d}K (modulo the \log N multiplicative factor), which is the total number of SVDs that K-SVD needs to perform in T_{\mathrm{d}} dictionary learning iterations. On the other hand, T_{p} need only scale logarithmically with S_{\max}, which is significant in the context of big data problems. Other main problem parameters that affect the scaling of T_{p} include T_{0}, n, and \delta_{d}^{-1}, all of which enter the scaling relation in a logarithmic fashion. Finally, Theorem 1 dictates that the number of consensus iterations T_{c} should also scale at least linearly with T_{p}T_{mix} (modulo some log factors) for the main result to hold. In summary, Theorem 1 guarantees that the distributed dictionaries learned by cloud K-SVD can remain close to the centralized dictionary without requiring excessive numbers of power method and consensus averaging iterations.
We now provide a brief heuristic understanding of the roadmap needed to prove Theorem 1. In the first dictionary learning iteration (t=1), we have \{\hat{D}_{i}^{(t-1)}\equiv D^{(t-1)}\} due to identical initializations. While this means both K-SVD and cloud K-SVD produce identical sparse codes for t=1, the distributed dictionaries begin to deviate from the centralized dictionary after this step. The perturbations in \{\hat{d}_{i, k}^{(1)}\} happen for k=1 due to the finite numbers of power method and consensus averaging iterations, whereas for k > 1 they happen due to this reason as well as the earlier perturbations in \{\hat{d}_{i, j}^{(1)},\hat{x}_{i, j, T}^{(1)}\}, j < k. In subsequent dictionary learning iterations (t > 1), therefore, cloud K-SVD starts with already perturbed distributed dictionaries \{\hat{D}_{i}^{(t-1)}\}. This in turn also results in deviations between the sparse codes computed by K-SVD and cloud K-SVD, which then adds another source of perturbations in \{\hat{d}_{i, k}^{(t)}\} during the dictionary update steps. To summarize, imperfect power method and consensus averaging in cloud K-SVD introduce errors in the top eigenvector estimates of (centralized) E_{1, R}^{(1)}E_{1, R}^{(1)^{\mathrm{T}}} at individual sites, which then accumulate for (k,\ t)\neq(1, 1) to also cause errors in the estimate \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}} of the matrix E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}} available to cloud K-SVD. Collectively, these two sources of errors cause deviations of the distributed dictionaries from the centralized dictionary, and the proof of Theorem 1 mainly relies on our ability to control them.
The first main result needed for the proof of Theorem 1 looks at the errors in the estimates of the dominant eigenvector u_{1} of an arbitrary symmetric matrix M=\sum_{i=1}^{N}M_{i} obtained at individual sites using imperfect power method and consensus averaging when the M_{i}'s are distributed across the N sites (cf. Algorithm 1). The following result effectively helps us control the errors in cloud K-SVD dictionaries due to Steps 8–18 in Algorithm 1.
Theorem 2
Stability of Distributed Power Method: Consider any symmetric matrix M=\sum_{i=1}^{N}M_{i} with dominant eigenvector u_{1} and eigenvalues \vert \lambda_{1}\vert > \vert \lambda_{2}\vert \geq\cdots\geq\vert \lambda_{n}\vert. Suppose each M_{i}, i=1, \ldots, N, is only available at the i^{th} site in our network, and let \hat{q_{i}} denote the estimate of u_{1} obtained at site i after T_{p} iterations of the distributed power method (Steps 8–18 in Algorithm 1). Next, define \alpha_{p} = \sum_{i=1}^{N}\Vert M_{i}\Vert_{2}, \beta_{p} = \max_{t_{p}= 1,\ldots, T_{p}}\frac{1}{\Vert Mq_{c}^{(t_{p})}\Vert _{2}}, and \gamma_{p}=\sqrt{\sum_{i=1}^{N}\Vert M_{i}\Vert_{F}^{2}}, where q_{c}^{(t_{p})} denotes the iterates of a centralized power method initialized with q^{init}. Then, fixing any \epsilon\in(0,\ (10\alpha_{p}^{2}\beta_{p}^{2})^{-1/3T_{p}}), we have
\begin{equation*}
\max_{i=1,\ldots, N}\Vert u_{1}u_{1}^{T}-\hat{q_{i}}\hat{q_{i}}^{T}\Vert_{2}\leq\tan(\theta)\vert \frac{\lambda_{2}}{\lambda_{1}}\vert ^{T_{p}}+4\epsilon^{3T_{p}},
\tag{6}
\end{equation*} as long as \vert \langle u_{1}, q^{init}\rangle\vert > 0 and the number of consensus iterations within each iteration of the distributed power method (Steps 11–15 in Algorithm 1) satisfies T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha_{p}\beta_{p}\epsilon^{-1})+T_{mix}\log(\alpha_{p}^{-1}\gamma_{p}\sqrt{N})). Here, \theta denotes the angle between u_{1} and q^{init}, defined as \theta= \arccos (\vert \langle u_{1},\ q^{init}\rangle\vert /(\Vert u_{1}\Vert_{2}\Vert q^{init}\Vert_{2})).
The proof of this theorem can be found in our earlier work [5]. Theorem 2 states that \hat{q_{i}}\rightarrow\pm u_{1} at an exponential rate at each site as T_{p} grows, as long as enough consensus iterations are performed in each iteration of the distributed power method. In the case of a finite number of distributed power method iterations, (6) in Theorem 2 tells us that the maximum error in the estimates of the dominant eigenvector is bounded by the sum of two terms, the first term due to the finite number of power method iterations and the second due to the finite number of consensus iterations.
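To make the mechanics of Theorem 2 concrete, here is a numpy sketch of our reading of the distributed power method (Steps 8–18 of Algorithm 1), not the authors' code: each site multiplies by its local M_{i}, the sites average the local products by consensus, and each site normalizes its iterate. The toy network, the equal split of M across sites, and the complete-graph weight matrix (for which consensus is exact in one round) are all assumptions made for illustration.

```python
import numpy as np

def distributed_power_method(M_blocks, W, Tp=50, Tc=30, seed=0):
    """Sketch of the distributed power method: site i holds M_i, and the
    product (sum_i M_i) q is formed by consensus averaging of M_i q_i."""
    N, n = len(M_blocks), M_blocks[0].shape[0]
    rng = np.random.default_rng(seed)
    q_init = rng.standard_normal(n)
    q = [q_init.copy() for _ in range(N)]    # identical initialization at all sites
    for _ in range(Tp):
        v = np.stack([M_blocks[i] @ q[i] for i in range(N)])  # local products, N x n
        for _ in range(Tc):
            v = W @ v                        # one consensus averaging round
        v = N * v                            # consensus returns the average; rescale to the sum
        q = [v[i] / np.linalg.norm(v[i]) for i in range(N)]
    return q

# Toy example: split a matrix with a known spectral gap across 3 sites.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
M = Q @ np.diag([5.0, 2.0, 1.0, 0.5, 0.2, 0.1]) @ Q.T
M_blocks = [M / 3.0] * 3                     # each site holds an equal share of M
W = np.full((3, 3), 1 / 3)                   # complete graph: exact consensus in one round
q_hat = distributed_power_method(M_blocks, W)
u1 = Q[:, 0]                                 # dominant eigenvector of M by construction
err = np.linalg.norm(np.outer(u1, u1) - np.outer(q_hat[0], q_hat[0]), 2)
print(err < 1e-8)
```

With exact consensus the sketch reduces to the centralized power method, so the error is governed only by the first term of (6); with a sparser W and small T_{c}, the second term of (6) becomes visible as a residual error floor.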
The second main result needed to prove Theorem 1 looks at the errors between individual blocks of the reduced distributed error matrix \hat{E}_{k, R}^{(t)}=[\hat{E}_{1, k, R}^{(t)}\ \hat{E}_{2, k, R}^{(t)}\ \cdots\ \hat{E}_{N, k, R}^{(t)}] and the reduced centralized error matrix E_{k, R}^{(t)}= [E_{1, k, R}^{(t)}\ E_{2, k, R}^{(t)}\ \cdots\ E_{N, k, R}^{(t)}] for k \in \{ 1, 2, \cdots, K\} and t\in\{1, 2,\ \cdots,\ T_{d}\}. This result helps us control the error in Step 6 of Algorithm 1 and, together with Theorem 2, characterizes the major sources of errors in cloud K-SVD in relation to centralized K-SVD. The following theorem provides a bound on the error in \hat{E}_{i, k, R}^{(t)}.
Theorem 3
Perturbation in the Matrix \hat{E}_{i, k, R}^{(t)}: Recall the definitions of \Omega_{k}^{(t)} and \overline{\Omega}_{i, k}^{(t)} from Section II-A. Next, express \Omega_{k}^{(t)}=diag\{\Omega_{1, k}^{(t)},\ \cdots,\ \Omega_{N, k}^{(t)}\}, where \Omega_{i, k}^{(t)} corresponds to the data samples associated with the i^{th} site, and define B_{i, k, R}^{(t)}=\hat{E}_{i, k, R}^{(t)}-E_{i, k, R}^{(t)}. Finally, let \zeta, \mu, \nu, \epsilon, and \delta_{\mathrm{d}} be as in Theorem 1, define \varepsilon=\mu\nu^{T_{p}}+4\epsilon^{3T_{p}}, and assume \varepsilon \leq \frac{\delta_{d}}{8N\sqrt{n}C_{3}(1+\zeta)^{T_{d}-1}C_{4}^{2}(8C_{3}NC_{4}^{2}+5)^{2(T_{d}K-2)}}. Then, if we perform T_{p} power method iterations and T_{c}= \Omega(T_{p}T_{mix}\log(2\alpha\beta\epsilon^{-1})+T_{mix}\log(\alpha^{-1}\gamma\sqrt{N})) consensus iterations in cloud K-SVD and assume P1–P3 hold, we have for i\in\{1,\ \ldots,\ N\}, t\in\{1, 2,\ \cdots,\ T_{d}\}, and k\in\{1, 2,\ \cdots,\ K\}
\begin{equation*}
\Vert B_{i, k, R}^{(t)}\Vert_{2}\leq\begin{cases}
0, & t=1,\ k=1,\\
\epsilon(1+\zeta)^{t-1}C_{4}(8C_{3}NC_{4}^{2}+5)^{(t-1)K+k-2}, & \text{otherwise.}
\end{cases}
\end{equation*}
Theorem 3 tells us that the error in the matrix \hat{E}_{i, k, R}^{(t)} can be made arbitrarily small through a suitable choice of T_{p} and \epsilon as long as all the assumptions of Theorem 1 are satisfied. One of the steps in proving Theorem 1 involves showing that the assumption on \varepsilon is satisfied as long as the numbers of power method and consensus iterations required by Theorem 1 are performed (see [8] for the complete proof). In the following, we provide a brief sketch of the proof of Theorem 3 and refer the reader to [8] for complete details.
We prove Theorem 3 by induction over the dictionary learning iteration t. To do so, we first need a way to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2} using bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{k}, and a way to bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2} using bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}. Notice from Algorithm 1 that sparse coding is always performed before the update of the first dictionary atom, but not before updating any other dictionary atom. Due to this distinction, we address the problem of error accumulation in the matrix \hat{E}_{i, k, R}^{(t)} differently for the first dictionary atom (\Vert B_{i, 1, R}^{(t+1)}\Vert_{2}) than for any other dictionary atom (\{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=2}^{K}). The proof of Theorem 3 can then be divided into three steps.
Bound on \Vert B_{i, k+1, R}^{(t)}\Vert_{2}
Recall from Steps 5–6 in Algorithm 1 that \hat{E}_{i, k, R}^{(t)}=Y_{i}\overline{\Omega}_{i, k}^{(t)} -\sum_{j=1}^{k-1}\hat{d}_{i, j}^{(t)}\hat{x}_{i, j, T}^{(t)}\overline{\Omega}_{i, k}^{(t)}- \sum_{j=k+1}^{K}\hat{d}_{i, j}^{(t-1)}\hat{x}_{i, j, T}^{(t)}\overline{\Omega}_{i, k}^{(t)}. Now, if one assumes that \overline{\Omega}_{k}^{(t)}= \Omega_{k}^{(t)}, which can be argued to be true using results from [14] and the assumptions of Theorem 1, then the error in E_{i, k, R}^{(t)} is due to errors in \{\hat{x}_{i, j, T}^{(t)}\}_{j=1}^{K} and \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. In fact, to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2} we only need bounds on the errors in \hat{d}_{i, k}^{(t)} and \hat{x}_{i, k, T}^{(t)}. Next, recall from Step 20 in Algorithm 1 that \hat{x}_{i, k, R}^{(t)}=\hat{d}_{i, k}^{(t)^{\mathrm{T}}}\hat{E}_{i, k, R}^{(t)}, which means we only need a bound on the error in \hat{d}_{i, k}^{(t)} to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2}. The challenge, then, is to bound the error in \hat{d}_{i, k}^{(t)} from a given bound on \Vert B_{i, k, R}^{(t)}\Vert_{2}. This is accomplished by noting that there are two sources of error in \hat{d}_{i, k}^{(t)}. The first source is the difference between the dominant eigenvectors of \hat{E}_{k, R}^{(t)}\hat{E}_{k, R}^{(t)^{\mathrm{T}}} and E_{k, R}^{(t)}E_{k, R}^{(t)^{\mathrm{T}}}; we bound this difference using [13, Theorem 8.1.12]. The second source is the error in the eigenvector computation itself, which in our case is due to the distributed power method; it follows from Theorem 2 and the statement of Theorem 3 that this error is bounded by \varepsilon. Combining these two sources of error, we can bound the error in \hat{d}_{i, k}^{(t)}, which we then use to bound \Vert B_{i, k+1, R}^{(t)}\Vert_{2}.
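As an illustration of the quantities manipulated in this step, the following numpy sketch (with assumed shapes and our own helper names, not the paper's code) forms a site's reduced error matrix as in Steps 5–6 and the code-row update of Step 20. With exact dictionaries, the reduced error collapses to the rank-one term d_{k}x_{k, R}, consistent with the K-SVD update.

```python
import numpy as np

def local_reduced_error(Y_i, D_new, D_old, X_i, k, support):
    """Site i's reduced error matrix for atom k (our reading of Steps 5-6):
    residual over the samples using atom k, with atoms j < k already updated
    (D_new) and atoms j > k not yet updated (D_old)."""
    Y_R, X_R = Y_i[:, support], X_i[:, support]
    E = Y_R.copy()
    for j in range(D_new.shape[1]):
        if j != k:
            d_j = D_new[:, j] if j < k else D_old[:, j]
            E -= np.outer(d_j, X_R[j, :])      # strip every atom's contribution except atom k
    return E

def update_code_row(d_k, E_ikR):
    """Step 20: the k-th sparse-code row over the support is d_k^T E_{i,k,R}."""
    return d_k @ E_ikR

# Sanity check on synthetic data: with exact dictionaries the reduced error
# collapses to the rank-one term d_k x_{k,R}.
rng = np.random.default_rng(0)
D = rng.standard_normal((4, 5)); D /= np.linalg.norm(D, axis=0)
X = rng.standard_normal((5, 6))
Y = D @ X
support = np.arange(6)                         # here: all samples use atom k
E = local_reduced_error(Y, D, D, X, k=2, support=support)
x_row = update_code_row(D[:, 2], E)
```

In cloud K-SVD the dictionaries D_new and D_old are the perturbed local estimates, so the computed E deviates from the centralized reduced error by exactly the block B_{i, k, R}^{(t)} bounded in Theorem 3.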
Bound on \Vert B_{i, 1, R}^{(t+1)}\Vert_{2}
In order to bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2} given bounds on \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}, the difference from the previous case is that we can no longer write the sparse codes \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K} in terms of the dictionary atoms \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. Therefore, in addition to bounding the errors in the dictionary atoms \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}, we also need to bound the errors in the sparse codes due to perturbations in the dictionaries after iteration t. Since we know \{\Vert B_{i, j, R}^{(t)}\Vert_{2}\}_{j=1}^{K}, we can use these to bound the errors in \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}. Next, using the error bounds on \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K}, we can use [14, Theorem 4] to bound the errors in \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K}. Finally, using these error bounds on \{\hat{d}_{i, j}^{(t)}\}_{j=1}^{K} and \{\hat{x}_{i, j, T}^{(t+1)}\}_{j=1}^{K}, we can bound \Vert B_{i, 1, R}^{(t+1)}\Vert_{2}.
Bound on \Vert B_{i, k, R}^{(t)}\Vert_{2}, \forall t and k
Finally, using an induction argument over t that combines the two bounds above, we complete the proof of Theorem 3.