Acceleration in Optimization (Part I)

12 minute read

Difficulty: Advanced Undergraduate Students

An extremely popular, but easily misunderstood, concept in optimization theory is the notion of acceleration for iterative optimization algorithms. In this series of posts, I will attempt to demystify it.

Some specifics in the definitions and proofs will be omitted to keep the discussion relatively informal. For a more rigorous treatment of the material, see my convex optimization course notes, Appendix B of my thesis, one of my recent papers on FISTA, or any of Nesterov’s papers on accelerated gradient methods.

To begin our adventure, we start by reviewing a few key optimization concepts. In the next post, we will look at the fundamentals of the acceleration technique. To avoid repetition, for a convex (defined in the next section) function $\psi:\mathbb{R}^{n}\mapsto(-\infty,+\infty]$, we denote

\[{\rm dom}\ \psi:=\{x\in\mathbb{R}^{n}:\psi(x)<+\infty\}\]

to be the domain of $\psi$ and

\[\partial\psi(x)=\{\xi\in\mathbb{R}^{n}:\psi(y)\geq\psi(x)+\langle \xi,y-x\rangle\ \forall y\in\mathbb{R}^{n}\}\]

to be the (convex) subdifferential of $\psi$.
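
For example, for the one-dimensional function $\psi(x)=|x|$, the above definition gives

\[\partial\psi(x)=\begin{cases}\{-1\}, & x<0,\\ [-1,1], & x=0,\\ \{+1\}, & x>0,\end{cases}\]

so the subdifferential reduces to the singleton $\{\psi'(x)\}$ wherever $\psi$ is differentiable.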

Convexity and Accelerated Algorithms

Crucial to our presentation of acceleration are the notions of convexity and strong convexity. A function $\phi:\mathbb{R}^{n}\mapsto(-\infty,+\infty]$ is said to be convex if for every $x,y\in\mathbb{R}^{n}$ where $\phi$ is finite we have

\[\phi(tx+[1-t]y)\leq t\phi(x)+[1-t]\phi(y)\quad\forall t\in(0,1).\]

On the other hand, $\phi$ is said to be $\mu$-strongly convex if $\phi-\mu\|\cdot\|^{2}/2$ is convex, where $\|\cdot\|$ denotes the Euclidean norm. In terms of its relevance to acceleration, we “informally” say that an iterative optimization algorithm applied to the problem

\[({\cal P}) \qquad \inf_{z\in\mathbb{R}^{n}}\ \phi(z)\]

is accelerated if it can exploit properties of convexity to obtain an approximate solution faster than any algorithm that does not use such properties.
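
To make these definitions concrete, here is a minimal numerical sketch in Python (the quadratic example and all names below are purely illustrative assumptions, not part of the formal development): it checks that $\phi(x)=\frac{1}{2}\langle Ax,x\rangle$ with $A$ symmetric positive definite is $\mu$-strongly convex with $\mu=\lambda_{\min}(A)$, by verifying the convexity inequality for $\phi-\mu\|\cdot\|^{2}/2$ at randomly drawn points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative example: phi(x) = 0.5 * <A x, x> with A symmetric positive
# definite, which is mu-strongly convex with mu = lambda_min(A).
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)            # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()   # strong convexity modulus

def phi(x):
    return 0.5 * x @ A @ x

def psi(x):
    # phi is mu-strongly convex iff psi = phi - (mu/2) * ||.||^2 is convex.
    return phi(x) - 0.5 * mu * (x @ x)

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    t = rng.uniform(0.0, 1.0)
    lhs = psi(t * x + (1 - t) * y)
    rhs = t * psi(x) + (1 - t) * psi(y)
    assert lhs <= rhs + 1e-9   # convexity inequality (up to rounding error)
```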

Composite Functions

Following the setup of modern optimization frameworks, we will look at the case where $\phi$ is a composite function, taking the form

\[\phi(x)=f(x)+h(x)\]

where $h:\mathbb{R}^{n}\mapsto(-\infty,+\infty]$ is a (possibly) nonsmooth $\mu$-strongly convex function whose prox (or resolvent) operator

\[{\rm prox}_{\lambda h}(\bar{x}):={\rm argmin}_{u\in\mathbb{R}^{n}}\left\{ \lambda h(u)+\frac{1}{2}\|u-\bar{x}\|^{2}\right\} =(\lambda\partial h+{\rm id})^{-1}(\bar{x})\]

exists for any $\lambda>0$ and $\bar{x}\in\mathbb{R}^{n}$, and $f$ is a convex and continuously differentiable function satisfying

\[(\alpha)\qquad f(x)-f(y)-\left\langle \nabla f(y),x-y\right\rangle \leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in{\rm dom}\ h,\]

for some $L>0$. The above condition is often called an upper curvature condition.
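
As a quick illustration of condition $(\alpha)$, consider the quadratic $f(x)=\frac{1}{2}\langle Qx,x\rangle$ with $Q$ symmetric positive semidefinite. A direct computation gives

\[f(x)-f(y)-\left\langle \nabla f(y),x-y\right\rangle =\frac{1}{2}\left\langle Q(x-y),x-y\right\rangle \leq\frac{\lambda_{\max}(Q)}{2}\|x-y\|^{2},\]

so $(\alpha)$ holds with $L=\lambda_{\max}(Q)$.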

Remark I.1. If $h\equiv0$, then ${\rm prox}_{\lambda h}(\bar{x})=\bar{x}$. This particular setup is popular in introductory optimization textbooks.
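
By contrast, for a nonzero $h$ the prox operator usually has to be worked out case by case. Below is a minimal Python sketch, assuming the classical example $h=\|\cdot\|_{1}$ (which is convex, i.e., $\mu$-strongly convex with $\mu=0$): its prox is the well-known soft-thresholding operator, and the code checks this numerically against the argmin definition.

```python
import numpy as np

def prox_l1(x_bar, lam):
    """prox_{lam * ||.||_1}(x_bar): the soft-thresholding operator."""
    return np.sign(x_bar) * np.maximum(np.abs(x_bar) - lam, 0.0)

# Sanity check against the argmin definition of the prox.
rng = np.random.default_rng(1)
x_bar, lam = rng.standard_normal(4), 0.3

def prox_objective(u):
    return lam * np.abs(u).sum() + 0.5 * np.sum((u - x_bar) ** 2)

u_star = prox_l1(x_bar, lam)
for _ in range(1000):
    # No perturbation of the candidate should achieve a smaller objective.
    u = u_star + 0.1 * rng.standard_normal(4)
    assert prox_objective(u_star) <= prox_objective(u) + 1e-12
```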

Remark I.2. Condition $(\alpha)$ is often replaced by the condition that $\nabla f$ is $L$-Lipschitz continuous on ${\rm dom}\ h$, which is a sufficient condition for $(\alpha)$ to hold. Interestingly enough, the condition that $\nabla f$ is $L$-Lipschitz continuous can be shown (see Proposition 2.1.55 in my thesis) to be equivalent to

\[\left|f(x)-f(y)-\left\langle \nabla f(y),x-y\right\rangle \right|\leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in{\rm dom}\ h.\]

Approximate Solutions and First-order Oracle

Given $\varepsilon>0$, we say that a point $x^{*}\in{\rm dom}\ h$ is an $\varepsilon$-solution of $({\cal P})$ if

\[\phi(x^{*})-\inf_{z\in\mathbb{R}^{n}}\ \phi(z)\leq\varepsilon\]

and is an $\varepsilon$-stationary point of $({\cal P})$ if

\[{\rm dist}(0,\partial\phi(x^{*}))\leq\varepsilon\]

where ${\rm dist}(x,X)$ is the Euclidean distance from a point $x$ to a set $X$. A straightforward exercise is to show that if $x^{*}$ is a 0-solution or 0-stationary point of $({\cal P})$, then $x^{*}$ is a global solution of $({\cal P})$.
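
For a simple one-dimensional illustration, take $\phi(x)=x^{2}/2$ and a small $\delta>0$. The point $x^{*}=\delta$ satisfies

\[\phi(x^{*})-\inf_{z\in\mathbb{R}}\ \phi(z)=\frac{\delta^{2}}{2}\quad\text{and}\quad{\rm dist}(0,\partial\phi(x^{*}))=\delta,\]

so $x^{*}$ is a $(\delta^{2}/2)$-solution and a $\delta$-stationary point of the corresponding problem; the two notions measure approximate optimality on different scales.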

In the posts that follow, we will be looking at the effort required by an algorithm ${\cal A}$ to obtain an $\varepsilon$-solution or $\varepsilon$-stationary point of $({\cal P})$, measured by the number of times that ${\cal A}$ calls a particular oracle. Specifically, we will consider the case of a first-order oracle, which only returns the values of

\[f(x),\ h(x),\ \nabla f(x),\ {\rm prox}_{\lambda h}(x)\]

for a given input point $x\in\mathbb{R}^{n}$ and scalar $\lambda>0$.
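
In code, one can picture such an oracle as a small interface that exposes exactly these four quantities and nothing else. Below is a minimal Python sketch; the class name, the example choices $f(x)=\frac{1}{2}\|x-b\|^{2}$ and $h=\|\cdot\|_{1}$, and the field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class FirstOrderOracle:
    """Bundles the only four quantities a first-order method may query."""
    f: Callable[[np.ndarray], float]
    h: Callable[[np.ndarray], float]
    grad_f: Callable[[np.ndarray], np.ndarray]
    prox_h: Callable[[np.ndarray, float], np.ndarray]  # (x, lam) -> prox_{lam h}(x)

# Example instance: f(x) = 0.5 * ||x - b||^2 and h = ||.||_1.
b = np.array([1.0, -2.0, 0.5])
oracle = FirstOrderOracle(
    f=lambda x: 0.5 * float(np.sum((x - b) ** 2)),
    h=lambda x: float(np.abs(x).sum()),
    grad_f=lambda x: x - b,
    prox_h=lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0),
)

# An algorithm's "effort" is then the number of such oracle calls it makes
# before returning an eps-solution or eps-stationary point. For instance,
# one proximal gradient step with stepsize 1/L = 1 uses two oracle calls:
x = np.zeros(3)
x_next = oracle.prox_h(x - 1.0 * oracle.grad_f(x), 1.0)
```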

Remark I.3. It can be shown (see Lemma 15 in my paper on min-max optimization) that if $x^{*}$ is an $\varepsilon$-stationary point of $({\cal P})$ then

\[\inf_{\|d\|\leq1}\left[\phi'(x^{*};d):=\lim_{t\downarrow0}\frac{\phi(x^{*}+td)-\phi(x^{*})}{t}\right]\geq-\varepsilon.\]
