Acceleration in Optimization (Part VI)
Difficulty: Advanced Undergraduate Students
Let us conclude this series of posts with some commentary about the derived results.
The final blueprint for the accelerated gradient method is quite verbose and requires, at a minimum, generating the iterates $x_{k}$, $y_{k}$, $v_{k}$, $A_{k}$, and $\Gamma_{k}$.
- However, in the very special case where $h=0$ and $\mu>0$, it can be shown (see, for example, the discussion after (2.2.19) in Nesterov’s book) that the iterates $v_{k}$, $A_{k}$, and $\Gamma_{k}$ can be eliminated and a single iteration of the accelerated method can be written as
$$x_{k+1}=y_{k}-\frac{1}{L}\nabla f(y_{k}),\qquad y_{k+1}=x_{k+1}+\frac{1-\sqrt{q_{f}}}{1+\sqrt{q_{f}}}\,(x_{k+1}-x_{k}),$$
- where $q_{f}:=\mu/L$ is the condition number of $({\cal P})$.
An important caveat of the above reformulation, though, is that acceleration will not occur when $\mu=0$ and the updates above are used.
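To make this two-sequence form concrete, here is a minimal Python sketch of the simplified method; the names `grad_f`, `x0`, and `num_iters` are placeholders of mine, and both $L$ and $\mu$ are assumed known in advance.

```python
import numpy as np

def accelerated_gradient(grad_f, x0, L, mu, num_iters=100):
    """Simplified accelerated method for the special case h = 0 and mu > 0.

    grad_f, x0, and num_iters are placeholder names; L and mu are the
    smoothness and strong convexity constants, assumed known in advance.
    """
    q_f = mu / L                                    # condition number q_f = mu/L
    beta = (1 - np.sqrt(q_f)) / (1 + np.sqrt(q_f))  # extrapolation coefficient
    x_prev, y = x0, x0
    for _ in range(num_iters):
        x = y - grad_f(y) / L        # gradient step at the extrapolated point
        y = x + beta * (x - x_prev)  # the "momentum"-looking y-update
        x_prev = x
    return x_prev
```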
This simplified form is often mischaracterized in the machine learning literature as a “momentum” step due to the extrapolation performed in the $y_{k}$ update. Even worse, many empirical machine learning studies consider methods that replace the $y_{k}$ update with the more general update
$$y_{k+1}=x_{k+1}+\beta\,(x_{k+1}-x_{k}),\qquad \beta\geq0,$$
- and assert that their method is an accelerated one because the above update generalizes the accelerated one (where $\beta=(1-\sqrt{q_{f}})/(1+\sqrt{q_{f}})$); a minimal code sketch of this generic template is given after these bullets. It is for this reason that I abhor any association between "momentum" and acceleration.
- It still remains an open problem whether a similar simplification occurs when $h\neq0$.
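For contrast with the accelerated method, here is the generic "momentum" template from the bullets above written as a one-line Python helper; the function name is hypothetical, and the docstring records which choice of $\beta$ actually recovers acceleration.

```python
def generic_momentum_update(x_new, x_old, beta):
    """The generic extrapolation template used in many empirical ML studies.

    Any beta >= 0 fits this template, but only the specific choice
    beta = (1 - sqrt(q_f)) / (1 + sqrt(q_f)) recovers the accelerated
    method above; other choices carry no acceleration guarantee.
    """
    return x_new + beta * (x_new - x_old)
```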
Every iteration of the accelerated gradient method requires at least two evaluations of ${\rm prox}_{h/L}(\cdot)$ to generate its key iterates and another one to generate the auxiliary iterates $\bar{z}_{k}$ and $z_{k}$, which are used for checking $\varepsilon$-stationarity.
An important question is: can the number of resolvent/prox evaluations be reduced? It turns out that another version of the accelerated method, which I call S-FISTA, can reduce the total number to one in both scenarios!
S-FISTA also has the nice property that it does not need to generate an auxiliary sequence for checking $\varepsilon$-stationarity.
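For intuition about where prox evaluations enter per iteration, here is a sketch of textbook FISTA specialized to $h=\lambda\|\cdot\|_{1}$, whose prox is the classical soft-thresholding operator. This is not the S-FISTA variant mentioned above, only an illustration of a one-prox-per-iteration structure; the names below are mine.

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau*||.||_1, a classical closed-form instance of prox_{h/L}."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista(grad_f, x0, L, lam, num_iters=100):
    """Textbook FISTA for f + lam*||.||_1 (not the S-FISTA variant above)."""
    x_prev, y, t = x0, x0, 1.0
    for _ in range(num_iters):
        x = soft_threshold(y - grad_f(y) / L, lam / L)  # the single prox call
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2       # FISTA t-sequence
        y = x + ((t - 1) / t_next) * (x - x_prev)       # extrapolation step
        x_prev, t = x, t_next
    return x_prev
```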
All of the iteration complexity bounds in our analysis require the existence of an optimal solution $x^{*}$.
- An important question is: can these bounds be generalized to functions that have an optimal objective value but no optimal solution (e.g., consider $f(x)=1/x$ on $[1,\infty)$)? As far as I know, there are only three methods that address this case. Two of them require the domain to be bounded, while the other one (in a recent paper of mine) does not.
While not discussed in our previous posts, there are adaptive versions of the accelerated method in which $L$ and $\mu$ do not have to be known beforehand.
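As an illustration of how $L$ can be estimated on the fly, here is a minimal sketch of a backtracking gradient step under the standard descent inequality; it is a generic scheme, not a reproduction of any particular adaptive method, and all names in it are mine.

```python
import numpy as np

def backtracking_gradient_step(f, grad_f, y, L_trial=1.0, eta=2.0):
    """Increase the trial constant until the descent inequality
    f(x) <= f(y) + <grad_f(y), x - y> + (L/2)*||x - y||^2
    holds at the gradient step x = y - grad_f(y)/L."""
    g, L = grad_f(y), L_trial
    while True:
        x = y - g / L
        diff = x - y
        if f(x) <= f(y) + g @ diff + (L / 2) * (diff @ diff):
            return x, L  # accepted step and the adapted estimate of L
        L *= eta         # backtrack: try a larger smoothness estimate
```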