Acceleration in Optimization (Part VI)

8 minute read

Published:

Difficulty: Advanced Undergraduate Students

Let us conclude this series of posts with some commentary about the derived results.

  • The final blueprint for the accelerated gradient method is quite verbose and requires, at a minimum, generating the iterates $x_{k}$, $y_{k}$, $v_{k}$, $A_{k}$, and $\Gamma_{k}$.

    • However, in the very special case of $h=0$ and $\mu>0$, it can be shown (see, for example, the discussion after (2.2.19) in Nesterov’s book) that the iterates $v_{k}$, $A_{k}$, and $\Gamma_{k}$ can be eliminated and a single iteration of the accelerated method can be written as
    \[\begin{aligned} x_{k} & \gets y_{k-1}-\frac{1}{L}\nabla f(y_{k-1}),\\ y_{k} & \gets x_{k}+\frac{1-\sqrt{q_{f}}}{1+\sqrt{q_{f}}}(x_{k}-x_{k-1}),\end{aligned}\]
      where $q_{f}:=\mu/L$ is the condition number associated with $({\cal P})$; a small code sketch of this simplified iteration is given after this list.
    • An important caveat of this reformulation, though, is that the updates above do not yield acceleration when $\mu=0$.

    • This simplified form is often mischaracterized in the machine learning literature as a “momentum” step due to the extrapolation in the $y_{k}$ update. Even worse, many empirical machine learning studies consider methods that replace the $y_{k}$ update with the more general update

    \[y_{k}\gets x_{k}+\beta(x_{k}-x_{k-1}),\quad\beta\in(0,1),\]
      and assert that their method is an accelerated one because the above update generalizes the accelerated one (which corresponds to $\beta=(1-\sqrt{q_{f}})/(1+\sqrt{q_{f}})$). It is for this reason that I abhor any association between "momentum" and acceleration.
    • It still remains an open problem whether a similar simplification occurs when $h\neq0$.
  • Every iteration of the accelerated gradient method requires at least two evaluations of ${\rm prox}_{h/L}(\cdot)$ to generate its key iterates and another one to generate the auxiliary iterates $\bar{z}_{k}$ and $z_{k}$, which are used for checking $\varepsilon$-stationarity.

    • An important question is: can the number of resolvent/prox evaluations be reduced? It turns out that another version of the accelerated method, which I call S-FISTA, can reduce the total number to one in both scenarios! (A small sketch of what a single prox evaluation involves is given after this list.)

    • S-FISTA also has the nice property that it does not need to generate an auxiliary sequence for checking $\varepsilon$-stationarity.

  • All of the iteration complexity bounds in our analysis require the existence of an optimal solution $x^{*}$.

    • An important question is: can these bounds be generalized to functions that have a finite optimal objective value but no optimal solution (e.g., $f(x)=1/x$ on $[1,\infty)$, whose infimum of $0$ is not attained)? As far as I know, only three methods consider this case. Two of them require the domain to be bounded, while the other one (in a recent paper of mine) does not.
  • While not discussed in our previous posts, there are adaptive versions of the accelerated method in which $L$ and $\mu$ do not have to be known beforehand; a rough sketch of one common strategy for estimating $L$ is given below.
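
To make the simplified iteration above concrete, here is a minimal Python sketch of the two updates for a smooth, strongly convex $f$ with $h=0$. The gradient oracle `grad_f`, the constants `L` and `mu`, and the iteration budget are assumed to be supplied by the user, and no stopping criterion is included.

```python
import numpy as np

def simplified_agm(grad_f, x0, L, mu, num_iters=100):
    """Minimal sketch of the simplified accelerated iteration for h = 0 and mu > 0."""
    q_f = mu / L                                    # condition number q_f = mu / L
    beta = (1 - np.sqrt(q_f)) / (1 + np.sqrt(q_f))  # fixed extrapolation weight
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for _ in range(num_iters):
        x = y - grad_f(y) / L                       # gradient step at the extrapolated point
        y = x + beta * (x - x_prev)                 # extrapolation ("momentum") step
        x_prev = x
    return x_prev
```

Note that setting $\mu=0$ forces $\beta=1$ in the sketch above, which is consistent with the caveat that these fixed-weight updates do not accelerate in that regime, and replacing the computed `beta` with an arbitrary constant in $(0,1)$ gives exactly the generic momentum update criticized earlier.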
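
For context on what a single prox evaluation costs, the sketch below performs one composite step $x \gets {\rm prox}_{h/L}(y-\nabla f(y)/L)$ for the illustrative choice $h=\lambda\|\cdot\|_{1}$, whose prox is the well-known soft-thresholding map; this choice of $h$ is only an example and is not tied to any particular result from the earlier posts.

```python
import numpy as np

def prox_l1(u, t):
    """Proximal map of t * ||.||_1, i.e., componentwise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def composite_prox_step(y, grad_f, L, lam):
    """One step x = prox_{h/L}(y - grad_f(y)/L) with h = lam * ||.||_1."""
    return prox_l1(y - grad_f(y) / L, lam / L)
```

An accelerated iteration calls such a map multiple times per iteration, which is precisely the per-iteration cost that S-FISTA brings down to one.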
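
Finally, as a rough illustration of how $L$ can be estimated on the fly, the sketch below uses a standard backtracking test that inflates a trial estimate until the usual quadratic upper bound holds at the prospective iterate. This is only one common adaptive strategy and is not meant to reproduce any specific adaptive accelerated method.

```python
import numpy as np

def backtrack_L(f, grad_f, y, L_trial=1.0, factor=2.0, max_tries=50):
    """Increase the trial estimate of L until the quadratic upper bound
    f(x+) <= f(y) + <grad_f(y), x+ - y> + (L/2)*||x+ - y||^2 holds at x+ = y - grad_f(y)/L."""
    g = grad_f(y)
    L = L_trial
    for _ in range(max_tries):
        x_plus = y - g / L                          # candidate gradient step
        diff = x_plus - y
        if f(x_plus) <= f(y) + np.dot(g, diff) + 0.5 * L * np.dot(diff, diff):
            return L, x_plus                        # upper bound holds; accept this estimate
        L *= factor                                 # otherwise, inflate the estimate
    return L, y - g / L
```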
