Acceleration in Optimization (Part V)
Difficulty: Advanced Undergraduate Students
In the previous set of posts, we developed two schemes for finding $\varepsilon$-solutions of $({\cal P})$ based on Nesterov’s Estimating Sequence Scheme. In this post, we will show how our previous developments can be used to obtain convergence rates for finding an $\varepsilon$-stationary point of $({\cal P})$.
Key Property of a Composite Gradient Step
Since the main iterates $\{x_{k}\}_{k\geq1}$ in both the dual gradient method and the accelerated gradient method are generated by a composite gradient step, a key result we will need is the following lemma.
Lemma V.1. Suppose $f$, $h$, and $\phi$ are as in problem $({\cal P})$. Moreover, define the function
\[T_{K}[y]:={\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;y)+h(u)+\frac{K}{2}\|u-y\|^{2}\right\}\]for every $K\geq0$ and $y\in\mathbb{R}^{n}$. Then, for every $x\in{\rm dom}\ h$, it holds that
\[(\gamma)\qquad{\rm dist}^{2}(0,\nabla f(T_{L}[x])+\partial h(T_{L}[x]))\leq8L\left[\phi(x)-\phi(T_{L}[x])\right].\]Proof. Let $x\in{\rm dom}\ h$ and denote $T=T_{L}[x]$. Moreover, in view of the optimality condition for $T$ in the problem defining $T_{L}[x]$, let $\xi\in\partial h(T)$ be such that $0=\nabla f(x)+\xi+L(T-x)$. Now, since
\[r:=\nabla f(T)-\nabla f(x)+L(x-T)\in\nabla f(T)+\partial h(T),\]it suffices to show that $\|r\|^{2}\leq8L\left[\phi(x)-\phi(T)\right]$. We begin with a technical inequality. Using the upper curvature condition $(\alpha)$, the definition of $\xi$, and the definition of the subdifferential, we have that
\[\begin{aligned} \phi(T) & \overset{(\alpha)}{\leq}f(x)+\left\langle \nabla f(x),T-x\right\rangle +\frac{L}{2}\|T-x\|^{2}+h(T)\\ & \overset{\xi}{=}f(x)+\left\langle L(T-x)+\xi,x-T\right\rangle +\frac{L}{2}\|T-x\|^{2}+h(T)\\ & =f(x)-\frac{L}{2}\|T-x\|^{2}+h(T)+\left\langle \xi,x-T\right\rangle \\ & \overset{\xi\in\partial h(T)}{\leq}\phi(x)-\frac{L}{2}\|T-x\|^{2},\end{aligned}\]and hence, we have
\[\|T-x\|\leq\sqrt{\frac{2}{L}\left[\phi(x)-\phi(T)\right]}.\]Combining the above bound with the triangle inequality, the $L$-Lipschitz continuity of $\nabla f$, and the definition of $r$, we conclude that
\[\begin{aligned} {\rm dist}(0,\nabla f(T)+\partial h(T)) & \leq\|r\|\leq\|\nabla f(T)-\nabla f(x)\|+L\|x-T\|\\ & \leq2L\|T-x\|\leq2\sqrt{2L\left[\phi(x)-\phi(T)\right]}.\end{aligned}\]Squaring both sides yields the desired bound. QED
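To make the lemma concrete, here is a small numerical sanity check of the bound $(\gamma)$. The specific choices $f(u)=\frac{1}{2}\|Au-b\|^{2}$ and $h=\lambda\|\cdot\|_{1}$ below are illustrative assumptions (not part of the lemma), chosen so that $T_{L}[x]$ has a closed-form soft-thresholding expression.

```python
# Numerical sanity check of (gamma) in Lemma V.1 for the illustrative choice
# f(u) = 0.5*||A u - b||^2 and h(u) = lam*||u||_1, for which the composite
# gradient step T_K[x] is soft-thresholding of x - grad_f(x)/K.
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 30, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
L = np.linalg.norm(A.T @ A, 2)  # Lipschitz constant of grad_f

grad_f = lambda u: A.T @ (A @ u - b)
phi = lambda u: 0.5 * np.sum((A @ u - b) ** 2) + lam * np.sum(np.abs(u))

def T(x, K):
    """Composite gradient step T_K[x]: prox of (lam/K)*||.||_1 at x - grad_f(x)/K."""
    y = x - grad_f(x) / K
    return np.sign(y) * np.maximum(np.abs(y) - lam / K, 0.0)

x = rng.standard_normal(n)
Tx = T(x, L)
# r = grad_f(T) - grad_f(x) + L*(x - T) lies in grad_f(T) + subdiff h(T), so
# ||r||^2 upper bounds the squared distance on the left-hand side of (gamma).
r = grad_f(Tx) - grad_f(x) + L * (x - Tx)
print(np.sum(r ** 2) <= 8 * L * (phi(x) - phi(Tx)))  # expected: True
```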
Obtaining $\varepsilon$-Stationary Points
The previous lemma showed a direct relationship between the stationarity of a composite gradient step and the function value decrease it achieves. Let us now use this result to derive complexity bounds for finding $\varepsilon$-stationary points. It is worth mentioning that this technique shows up in Subsection 2.2.2 of Nesterov’s convex optimization book for the special case of $h = 0$.
We first modify the original accelerated gradient method by defining $z_{0}=x_{0}$ and introducing the following additional computations at the end of each iteration:
\[\begin{aligned} \bar{z}_{k} & \gets{\rm argmin}_{u\in\{x_{k},z_{k-1}\}}\phi(u),\\ z_{k} & \gets{\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \ell_{f}(u;\bar{z}_{k})+h(u)+\frac{L}{2}\|u-\bar{z}_{k}\|^{2}\right\} .\end{aligned}\]In the notation of Lemma V.1, the second computation is simply $z_{k}=T_{L}[\bar{z}_{k}]$. A small sketch of these extra computations is given below, after which we establish an interesting property of the modified method.
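Here is a minimal sketch of how these two computations could be appended to each iteration. The callables `phi` and `composite_step` are hypothetical placeholders for an objective evaluator and for the map $T_{K}[y]$ from Lemma V.1; the accelerated iteration producing $x_{k}$ itself is not reproduced here.

```python
# Sketch of the two extra computations appended to each iteration of the
# accelerated gradient method. `phi(u)` evaluates the objective phi and
# `composite_step(y, K)` returns T_K[y] as defined in Lemma V.1; both are
# assumed to be supplied by the underlying implementation.
def refine(x_k, z_prev, phi, composite_step, L):
    """Return z_k given x_k and z_{k-1} (the 'refinement' of the iterate)."""
    z_bar = x_k if phi(x_k) <= phi(z_prev) else z_prev  # z_bar_k
    return composite_step(z_bar, L)                     # z_k = T_L[z_bar_k]
```

With $z_{0}=x_{0}$, calling this routine once after each $x_{k}$ is produced generates the sequence $\{z_{k}\}$ analyzed in the rest of this post.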
Proposition V.2. The iterates $\{(z_{k},A_{k},\Gamma_{k})\}$ of the modified accelerated gradient method satisfy $({\cal R}_{k}^{1})$ and $({\cal R}_{k}^{2})$ with $x_{k}=z_{k}$ and, hence, the modified accelerated gradient method implements the Estimating Sequence Scheme.
Proof. Condition $({\cal R}_{k}^{2})$ is immediate since we are using the same sequence $\{\Gamma_{k}\}$ from the original accelerated method. To show condition $({\cal R}_{k}^{1})$, it suffices to show that $\phi(z_{k})\leq\phi(x_{k})$ for $k\geq1$ (since the $k=0$ case is obvious). Indeed, it follows from $(\gamma)$ in Lemma V.1 and the definition of $\bar{z}_{k}$ that
\[\begin{aligned} \phi(x_{k})\geq\phi(\bar{z}_{k}) & \overset{(\gamma)}{\geq}\phi(T_{L}[\bar{z}_{k}])+\frac{ {\rm dist}^{2}(0,\nabla f(T_{L}[\bar{z}_{k}])+\partial h(T_{L}[\bar{z}_{k}]))}{8L}\\ & \geq\phi(T_{L}[\bar{z}_{k}])=\phi(z_{k}),\end{aligned}\]where $T_{K}[y]$ is as in Lemma V.1. QED
Let us now show that the new sequence $\{z_{k}\}_{k\geq0}$ converges to a stationary point at an accelerated rate. Applying Lemma V.1 with $x=\bar{z}_{k}$ and recalling that $z_{k}=T_{L}[\bar{z}_{k}]$, we first have that
\[\begin{aligned} \frac{ {\rm dist}^{2}(0,\nabla f(z_{k})+\partial h(z_{k}))}{8L} & =\frac{ {\rm dist}^{2}(0,\nabla f(T_{L}[\bar{z}_{k}])+\partial h(T_{L}[\bar{z}_{k}]))}{8L}\\ & \overset{(\gamma)}{\leq}\phi(\bar{z}_{k})-\phi(T_{L}[\bar{z}_{k}])=\phi(\bar{z}_{k})-\phi(z_{k})\\ & \leq\phi(z_{k-1})-\phi(z_{k})=\left[\phi(z_{k-1})-\phi(x^{*})\right]-\left[\phi(z_{k})-\phi(x^{*})\right],\end{aligned}\]where the second inequality follows from $\phi(\bar{z}_{k})\leq\phi(z_{k-1})$ (by the definition of $\bar{z}_{k}$), $T_{K}[y]$ is as in Lemma V.1, and $x^{*}$ is any optimal solution. Defining $\eta_{k}=\phi(z_{k})-\phi(x^{*})$, we are left with the telescoping inequality
\[{\rm dist}^{2}(0,\nabla f(z_{k})+\partial h(z_{k}))\leq8L\left[\eta_{k-1}-\eta_{k}\right].\]If we average the above inequality over the indices $k+1$ through $2k$ and bound the minimum squared distance from above by this average, we have
\[\begin{aligned} \min_{1\leq i\leq2k}{\rm dist}^{2}(0,\nabla f(z_{i})+\partial h(z_{i})) & \leq\frac{1}{k}\sum_{i=k+1}^{2k}{\rm dist}^{2}(0,\nabla f(z_{i})+\partial h(z_{i}))\\ & \leq\frac{8L}{k}\sum_{i=k+1}^{2k}\left[\eta_{i-1}-\eta_{i}\right]=\frac{8L}{k}\left[\eta_{k}-\eta_{2k}\right]\\ & \overset{\eta_{2k}\geq0}{\leq}\frac{8L}{k}\left[\phi(z_{k})-\phi(x^{*})\right].\end{aligned}\]Now, in view of Proposition V.2., we can use property $(*)$ of the Estimating Sequence Scheme to conclude that
\[(\delta)\qquad\min_{1\leq i\leq2k}{\rm dist}^{2}(0,\nabla f(z_{i})+\partial h(z_{i}))\leq\frac{8L}{k}\left[\phi(z_{k})-\phi(x^{*})\right]\leq\frac{4L\|x_{0}-x^{*}\|^{2}}{kA_{k}}.\]Using the fact that $A_{k}$ grows quadratically when $\mu=0$ and geometrically when $\mu>0$, we can now derive the following result.
Theorem V.3. The modified accelerated gradient method obtains an $\varepsilon$-stationary point of $({\cal P})$ in $O(T_{\varepsilon}')$ iterations, where
\[T_{\varepsilon}'=\begin{cases} \left(\frac{L\|x_{0}-x^{*}\|}{\varepsilon}\right)^{2/3}, & \mu=0,\\ \sqrt{\frac{L}{\mu}}\ W\left(\frac{L\|x_{0}-x^{*}\|^{2}}{\varepsilon^{2}}\right), & \mu>0,\end{cases}\]and $W(\cdot)$ is the Lambert-W (or product log) function.
Sketch Proof. Show that the number of iterations needed to ensure $4L\|x_{0}-x^{*}\|^{2}\leq\varepsilon^{2}kA_{k}$ is $O(T_{\varepsilon}')$ and combine this with relation $(\delta)$ above. A helpful fact is that for $y\geq0$ and $x\geq0$ we have $W(y)=x$ if and only if $y=xe^{x}$. Moreover, for the $\mu>0$ case, it is helpful to show the identity
\[c=y(1+\alpha)^{y}\iff c\log(1+\alpha)=\left[y\log(1+\alpha)\right]e^{y\log(1+\alpha)},\]which for $\alpha\in(0,1)$ implies that $y=W(c\log(1+\alpha))/\log(1+\alpha)\lesssim W(c)/\alpha$ (using the monotonicity of $W$ and the fact that $\log(1+x)\geq x/(1+x)$ for $x>-1$). QED
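To see how the $\mu=0$ case of the theorem falls out, suppose that $A_{k}$ grows on the order of $k^{2}/L$ when $\mu=0$; this growth rate is an assumption carried over from the analysis in the earlier posts (and is consistent with the $L$-dependence of $T_{\varepsilon}'$). The requirement $4L\|x_{0}-x^{*}\|^{2}\leq\varepsilon^{2}kA_{k}$ then becomes
\[k^{3}\gtrsim\frac{L^{2}\|x_{0}-x^{*}\|^{2}}{\varepsilon^{2}},\qquad\text{i.e.,}\qquad k\gtrsim\left(\frac{L\|x_{0}-x^{*}\|}{\varepsilon}\right)^{2/3},\]which is exactly the $\mu=0$ case of $T_{\varepsilon}'$.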
Remark V.4. For $x \geq e$, Hoorfar and Hassani showed that
\[\log x- {\log\log x} + \frac{\log \log x}{2\log x} \leq W(x) \leq \log x - \log \log x + \frac{1.6 \log \log x}{\log {x}}.\]In view of this bound, the modified accelerated gradient method asymptotically takes more iterations to find an $\varepsilon$-stationary point than an $\varepsilon$-solution when $\mu=0$ and slightly fewer iterations when $\mu>0$.
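For readers who want concrete numbers, the snippet below evaluates $T_{\varepsilon}'$ and the Hoorfar–Hassani bracket using `scipy.special.lambertw`; the problem constants are arbitrary illustrative values, not outputs of the method.

```python
# Illustrative evaluation of the Lambert-W quantities in Theorem V.3 and of the
# Hoorfar-Hassani bracket in Remark V.4. The constants L, mu, eps, and
# R = ||x0 - x*|| below are arbitrary sample values.
import numpy as np
from scipy.special import lambertw

L, mu, eps, R = 100.0, 1.0, 1e-6, 10.0

x = L * R ** 2 / eps ** 2        # argument of W in the mu > 0 case
W = lambertw(x).real             # principal branch; real for x >= 0

# Hoorfar-Hassani bounds (valid for x >= e).
lo = np.log(x) - np.log(np.log(x)) + np.log(np.log(x)) / (2 * np.log(x))
hi = np.log(x) - np.log(np.log(x)) + 1.6 * np.log(np.log(x)) / np.log(x)
print(lo <= W <= hi)             # expected: True

print((L * R / eps) ** (2 / 3))  # T'_eps up to constants, mu = 0 case
print(np.sqrt(L / mu) * W)       # T'_eps up to constants, mu > 0 case
```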
Remark V.5. The technique of introducing the auxiliary sequence $\{z_{k}\}_{k\geq1}$ does not have a specific name in the current optimization literature, but in nearly all of my recent works you will see it referred to as a refinement of the original iterate sequence. Personally, I would have liked to call it a transportation procedure, following a well-known result of Brøndsted & Rockafellar (see, for example, Theorem XI.4.2.1 in Lemaréchal’s book), but the name didn’t quite catch on with my collaborators.
Exercise V.6. Show similar (but worse) bounds for the dual gradient method.
Remark V.7. The $O(\varepsilon^{-2/3})$ complexity bound for obtaining $\varepsilon$-stationary points of $({\cal P})$ is moderately better than the $O(\varepsilon^{-1})$ complexity bound for the composite gradient method. However, it still does not reach the optimal lower complexity bound of $O(\varepsilon^{-1/2})$. Using a regularization trick, the bound can be improved to $O(\varepsilon^{-1/2} \log \varepsilon^{-1})$, but it remains an open problem whether the logarithmic term in the previous complexity can be removed. It also remains open whether the previous bound can be obtained without any regularization tricks.