
MATH46132/66132
Numerical Optimization and Inverse Problems 2021/22
Dr Oliver Dorn
The University of Manchester
Department of Mathematics, Alan Turing Building
Week 5: Unconstrained Optimization in IR^n – alternative search directions
Figure 1: (Fig 5.1) Typical zigzagging of line search in Steepest Descent Algorithm.
Alternative search directions
The figure (Fig 5.1) shows that, due to the strong zigzagging behaviour, the steepest-descent technique usually
converges very slowly. Can we design more efficient algorithms than steepest descent by finding good choices for B_k?
We first discuss line-search methods that use a search direction of the form

p_k = −B_k g_k,    (1)

where B_k is a specifically chosen symmetric positive definite matrix which is a function of x_k and possibly of k.
If B_k is positive definite, then

g_k^T p_k = −g_k^T B_k g_k < 0,

and so we have indeed found another descent direction. For steepest descent, B_k = I for all steps k. In this course, we briefly discuss the following popular line-search methods:

The steepest-descent method
Newton's method
Quasi-Newton or secant methods (only 'BFGS')
The Gauss-Newton and Levenberg-Marquardt methods
The linear conjugate gradient technique
The nonlinear conjugate gradient technique

We have already discussed the steepest-descent method last week. Let us now consider the remaining techniques from that list. They are distinguished mainly by their choice of search directions p_k.

The Newton method

When studying the one-dimensional minimization problem we saw that the Newton method, which minimizes a quadratic approximation to the given function in each step, has very nice convergence properties. We now show that this method extends to the minimization problem in IR^n and that we obtain a line-search technique this way with a very specific choice of B_k. Let f(x) be any sufficiently often differentiable function of which we would like to find a local minimum. Taylor expansion at the iterate x_k yields the quadratic approximation

f(x_k + p) ≈ Q_k(p) = f(x_k) + p^T g_k + (1/2) p^T H_k p.    (2)

A necessary condition for a minimizer p of this quadratic approximation is that its gradient is zero:

∇Q_k(p) = g_k + H_k p = 0.    (3)

Recall from the necessary conditions that close to a local minimum of f its Hessian H is positive semi-definite. If H_k is positive definite then it is invertible and we can solve (3) for p to obtain

H_k p_k^N = −g_k, i.e. p_k^N = −H_k^{−1} g_k.    (4)

The search direction p_k^N is called the Newton direction. Accepting the minimizer of Q_k(p) as the new iterate means using the step length α_k = 1. But in the framework of line-search methods we can just as well take the Newton direction and apply an inexact line search to it, for example a back-stepping scheme with the Armijo or Wolfe condition. Obviously, the Newton direction is only well-defined in step k if H_k is invertible, and it is only a descent direction if H_k is positive definite. Either of these criteria might be violated at a given stage of a line-search strategy. In such a case the Newton step is usually replaced by a steepest-descent step, which amounts to the replacement H_k ≈ I. We formulate the resulting Newton method below.

Algorithm 5.1: The Newton Method
Choose x0 as initial estimate of the minimum of f(x).
Repeat for k = 0, 1, 2, . . .
    Set g_k = ∇f(x_k), H_k = ∇²f(x_k).
    if H_k is positive definite then
        obtain p_k by solving H_k p_k = −g_k; % Newton direction
    else
        set p_k = −g_k; % alternative steepest-descent direction
    Determine α_k by an inexact line search,
    or alternatively put α_k := 1 (full Newton step);
    Set x_{k+1} = x_k + α_k p_k
until ∥∇f(x_{k+1})∥ is sufficiently small.
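For concreteness, Algorithm 5.1 can be sketched in Python/NumPy as follows. This is a minimal illustration, not part of the printed notes: the names newton_method, f, grad, hess and the Armijo constant are assumptions, and positive definiteness is tested by attempting a Cholesky factorization.

import numpy as np

def newton_method(f, grad, hess, x0, tol=1e-8, max_iter=100):
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)
        try:
            np.linalg.cholesky(H)              # succeeds iff H is positive definite
            p = -np.linalg.solve(H, g)         # Newton direction (4)
        except np.linalg.LinAlgError:
            p = -g                             # fall back to steepest descent
        alpha = 1.0                            # try the full Newton step first,
        while alpha > 1e-14 and f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5                       # then backtrack (Armijo condition)
        x = x + alpha * p
    return x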
Quasi-Newton Methods

In one-dimensional optimization we discussed the so-called secant methods, where the often difficult-to-calculate second-order derivative of the Newton method is approximated using information based on 0th-order (i.e. function values) and 1st-order derivatives around the given point. How could we use 0th- and 1st-order derivative information in a higher-dimensional optimization scheme in order to approximate the information stored in the Hessian of the Newton method in each step?

In so-called quasi-Newton methods (sometimes also called 'secant methods') this often difficult-to-calculate true Hessian is replaced by an approximation which is updated in each iteration based on 0th- and 1st-order derivative information from previous iterations. One of the most popular of these techniques is the BFGS method, named after Broyden, Fletcher, Goldfarb and Shanno, who discovered the particular update formula (5). Let us start with any suitable initial guess for the Hessian matrix H_0 at iteration index 0 (for example the identity, H_0 = I). Given the latest approximation H_k to Hess f(x_k) from previous steps, BFGS first calculates in step k the newly available 0th- and 1st-order derivative information

s_k = x_{k+1} − x_k and y_k = grad f(x_{k+1}) − grad f(x_k).

The new approximation to the Hessian is then computed via

H_{k+1} = H_k − (H_k s_k s_k^T H_k)/⟨H_k s_k, s_k⟩ + (y_k y_k^T)/⟨y_k, s_k⟩,    (5)

where we have made use of the notation a^T b = ⟨a, b⟩ for two vectors a, b ∈ IR^n.

Algorithm 5.2: BFGS Method with line search
To minimize a general (nonquadratic) function f(x).
k := 0;
x0 := initial guess for minimizer
H0 := initial guess for Hessian
g0 := grad f(x0); % initial gradient
begin quasi-Newton iterations
    p_{k+1} := −H_k^{−1} g_k; % compute quasi-Newton step
    τ_{k+1} := argmin_{τ>0} f(x_k + τ p_{k+1}); % line search
    x_{k+1} := x_k + τ_{k+1} p_{k+1}; % update approximate solution
    g_{k+1} := grad f(x_{k+1}); % new gradient
    H_{k+1} := updated Hessian based on (5);
    k := k + 1;
end quasi-Newton iterations.
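As a small illustration, the Hessian update (5) translates almost line by line into NumPy (the helper name bfgs_update_hessian is hypothetical, not from the notes):

import numpy as np

def bfgs_update_hessian(H, s, y):
    # BFGS update (5): H_{k+1} = H_k - (H s s^T H)/<H s, s> + (y y^T)/<y, s>
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (Hs @ s) + np.outer(y, y) / (y @ s)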
More practical form of BFGS
Since not the Hessian itself but its inverse is actually needed for the calculations, the following more practical form, sometimes called the limited-memory variant of BFGS, updates from one iteration to the next not the Hessian itself but directly its inverse.

Denote the approximation to (Hess f)(x_k)^{−1} in step k of the iteration by B_k. Then an approximation B_{k+1} to (Hess f)(x_{k+1})^{−1} is calculated directly via the formula

B_{k+1} = (I − (s_k y_k^T)/⟨y_k, s_k⟩) B_k (I − (y_k s_k^T)/⟨y_k, s_k⟩) + (s_k s_k^T)/⟨y_k, s_k⟩,    (6)

without any matrix inversion being involved.
In standard implementations, the initial Hessian approximation H0 (or B0) is often taken to be a scalar multiple of
the identity. However, more sophisticated choices are possible if the general structure of the true Hessian is known.
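Working with the inverse update (6) directly leads to the following minimal Python/NumPy sketch of a complete BFGS iteration. The simple backtracking line search and all names are illustrative assumptions; in practice a Wolfe-condition line search is preferred, since it guarantees ⟨y_k, s_k⟩ > 0 and hence a well-defined update.

import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    n = x0.size
    B = np.eye(n)                         # B_0: initial guess for the inverse Hessian
    I = np.eye(n)
    x = x0.astype(float)
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                        # quasi-Newton direction
        alpha = 1.0                       # backtracking (Armijo) line search
        while alpha > 1e-14 and f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)               # update (6) with rho = 1/<y_k, s_k>
        B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x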
Quadratic objective functions
An optimisation algorithm is said to have the quadratic termination property if, when applied to the quadratic function

Q(x) = (1/2) x^T A x + b^T x + c,    (7)

where A is a symmetric positive definite (SPD) matrix and b ∈ IR^n and c ∈ IR are given, it terminates at the minimiser x* = −A^{−1} b in a finite number of iterations.
Notice that finding the minimiser of the quadratic objective function Q(x) is equivalent to solving the linear system of equations

Ax = −b.    (8)

Refer to the course unit MATHx6101. We have seen above that the Newton method has the quadratic termination property when full Newton steps are taken: it converges in just one iteration when applied to (7). However, finding the Newton direction requires us to calculate A^{−1} (or to solve a system with A), which might be prohibitive for large-scale problems.
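As a quick numerical check of this one-iteration claim, under the sign convention of (7) and (8) (the concrete A, b and starting point below are illustrative data, chosen to match the worked example further down):

import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])    # SPD matrix
b = np.array([3.0, 1.0])
x0 = np.array([1.0, 0.0])                 # arbitrary starting point
g0 = A @ x0 + b                           # gradient of Q at x0
x1 = x0 - np.linalg.solve(A, g0)          # one full Newton step (alpha = 1)
print(x1, -np.linalg.solve(A, b))         # both print the minimiser -A^{-1} b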
Conjugate directions methods
Are there other strategies with such a quadratic termination property which might be easier to implement? The
property of 'conjugacy' leads to another well-known algorithm with the quadratic termination property.
Definition
A set of non-zero vectors {p_0, p_1, . . . , p_{n−1}} is said to be conjugate with respect to an SPD matrix A ∈ IR^{n×n} if

p_i^T A p_j = 0 for all i ≠ j.
We list some properties of conjugacy:
If A = I then 'conjugacy' is the same as 'orthogonality'.
A given set of conjugate vectors is linearly independent (exercise).
The eigenvectors of an SPD matrix are always mutually conjugate (exercise).
Theorem
Let Q(x) = (1/2) x^T A x + b^T x + c where A ∈ IR^{n×n} is SPD. Consider a full set of search directions {p_0, p_1, . . . , p_{n−1}} which are conjugate w.r.t. A. For k = 0, 1, . . . , n−1 define the iterates by x_{k+1} = x_k + α_k p_k, where the α_k satisfy

Q(x_k + α_k p_k) = min_{α>0} Q(x_k + α p_k) (exact line searches).

Then:

x_k converges to the minimizer of Q in at most n steps.
x_{k+1} is the minimiser of Q in the subspace x_0 + span{p_0, p_1, . . . , p_k} (best approximation property at step k + 1).
The proof of this theorem can be found in Nocedal and Wright and is also treated in the Exercises.
Example
Let us reconsider the minimisation of the following quadratic objective function:

f(x_1, x_2) = (1/2) (x_1, x_2) A (x_1, x_2)^T + (3, 1) (x_1, x_2)^T − 1, with A = [4 2; 2 3].

We have shown in the exercises that it has a single (global) minimizer x* = (−0.875, 0.25). The above theorem tells us that if we perform exact line searches and if we can find a suitable set of conjugate search directions, then the method should converge in only two iterations.

Let the initial guess be x_0 = [1, 0]^T. Since the eigenvectors of A are conjugate w.r.t. A we can choose

p_0 = [−0.6154, 0.7882]^T, p_1 = [−0.7882, −0.6154]^T.
Iterating gives

x_1 = x_0 + α_0 p_0 = [1, 0]^T + α_0 [−0.6154, 0.7882]^T = [0.1686, 1.0648]^T,

where α_0 minimises Q(x_0 + α p_0) w.r.t. α. We have α_0 = −g_0^T p_0 / (p_0^T A p_0) = 1.3509. Then

x_2 = x_1 + α_1 p_1 = [0.1686, 1.0648]^T + α_1 [−0.7882, −0.6154]^T = [−0.8750, 0.2500]^T,

where α_1 minimises Q(x_1 + α p_1) w.r.t. α, namely α_1 = −g_1^T p_1 / (p_1^T A p_1) = 1.3240. So only two steps are indeed required.
The Conjugate Gradient (CG) Method
The above theorem and example show that it can be very efficient to use conjugate search directions in iterative
line-search techniques, keeping in mind that close to a local minimum many functions behave locally like their
quadratic Taylor approximation. In order to implement such a conjugate direction scheme we need to be able to obtain
conjugate search directions easily.
Even though eigenvectors of A might be a choice, we know that for large n, computing these is far too expensive.
We would prefer a more efficient way to compute conjugate direction vectors {p0,p1,p2, . . .} in a practical line-search
strategy.
The conjugate gradient method, invented in the 1950s by Hestenes and Stiefel and often abbreviated CG, iteratively
constructs such a sequence of search directions p_0, p_1, . . . with the property that p_{k+1} is conjugate to all previous
directions p_0, . . . , p_k by construction. At the same time it provides a sequence of approximations x_k to the solution
of the above quadratic minimization problem which has the quadratic termination property.
CG method for minimizing a convex quadratic function
We consider again the quadratic function

Q(x) = (1/2) x^T A x + b^T x + c,

where A is SPD; equivalently, we want to solve the linear system Ax = −b ('linear CG'). Different versions of the CG method can be found in the literature which are more or less equivalent mathematically, even though certain numerical situations or environments might favour one over the others.
We present three different versions in the following. In the exercises we will verify how to derive one from the
other.
Algorithm 5.3: CG method for quadratic minimization (Version 1)
Choose x0 as an initial estimate of the solution.
Calculate g_0 = A x_0 + b. Set p_0 = −g_0.
Repeat for k = 0, 1, 2, . . .
    find α_k so that p_k^T g_{k+1} = p_k^T (A(x_k + α_k p_k) + b) = 0
    set x_{k+1} = x_k + α_k p_k (exact line search)
    determine β_k and p_{k+1} using
        β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k) and p_{k+1} = −g_{k+1} + β_k p_k
until ∥g_{k+1}∥ is sufficiently small.
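A direct NumPy transcription of Algorithm 5.3 might look as follows (a minimal sketch in the g = Ax + b convention of these notes; the exact line-search condition p_k^T g_{k+1} = 0 yields α_k = −p_k^T g_k / (p_k^T A p_k)):

import numpy as np

def linear_cg(A, b, x0, tol=1e-10):
    """Minimise (1/2) x^T A x + b^T x + c, i.e. solve A x = -b (A SPD)."""
    x = x0.astype(float)
    g = A @ x + b                         # gradient of Q at x
    p = -g                                # initial search direction
    for k in range(len(b)):               # at most n steps in exact arithmetic
        if np.linalg.norm(g) < tol:
            break
        Ap = A @ p
        alpha = -(p @ g) / (p @ Ap)       # exact line search
        x = x + alpha * p
        g_new = g + alpha * Ap            # g_{k+1} = A x_{k+1} + b
        beta = (g_new @ g_new) / (g @ g)  # formula (9)
        p = -g_new + beta * p             # next conjugate direction
        g = g_new
    return x

For the 2 × 2 example above, linear_cg(A, b, np.array([1.0, 0.0])) returns the minimiser (−0.875, 0.25) after two iterations.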
Remarks
Notice that the formula

β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k)    (9)

in Algorithm 5.3 for calculating β_k is designed to make the search directions conjugate with respect to A, that is,
p_i^T A p_j = 0 when i ≠ j.

You might have seen the following Version 2 of the CG method already in the course unit MATHx6101. It is stated in the convention of solving a linear system Ax = b, where the residual r_k = b − A x_k plays the role of the negative gradient.
Algorithm 5.4: CG method for quadratic minimization (Version 2)
Given an initial guess x_0.
Let β_0 = 0, p_{−1} = 0, r_0 = b − A x_0, and k = 0.
REPEAT
    If k > 0, let β_k = (r_k^T r_k)/(r_{k−1}^T r_{k−1}).
    Let p_k = r_k + β_k p_{k−1}.
    Let α_k = (r_k^T r_k)/(p_k^T A p_k).
    Let x_{k+1} = x_k + α_k p_k.
    Let r_{k+1} = r_k − α_k A p_k.
    Let k = k + 1.
UNTIL convergence
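For completeness, here is a minimal transcription of Version 2, together with a numerical check that the generated search directions are indeed A-conjugate (the random SPD test data and all names are illustrative):

import numpy as np

def cg_v2(A, b, x0):
    """Version 2: solve A x = b; r_k = b - A x_k is the residual."""
    x = x0.astype(float)
    r = b - A @ x
    p = r.copy()
    directions = []
    for _ in range(len(b)):               # n steps suffice in exact arithmetic
        if np.linalg.norm(r) < 1e-14:
            break
        directions.append(p.copy())
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x, directions

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)               # random SPD matrix
b = rng.standard_normal(5)
x, P = cg_v2(A, b, np.zeros(5))
print(np.allclose(A @ x, b))              # True
print(max(abs(P[i] @ A @ P[j])            # ~ 0: pairwise A-conjugacy
          for i in range(len(P)) for j in range(i)))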
A slightly different presentation of the linear CG method is given in the textbook by Vogel which we present next
as Version 3.
Algorithm 5.5: CG method for quadratic minimization (Version 3)
k := 0; x0 := initial guess; g0 := A x0 + b; % initial gradient
p0 := −g0; % initial search direction
δ0 := ∥g0∥²;
begin CG iterations
    h_k := A p_k;
    τ_k := δ_k / ⟨p_k, h_k⟩; % line search parameter
    x_{k+1} := x_k + τ_k p_k; % update approximate solution
    g_{k+1} := g_k + τ_k h_k; % update gradient
    δ_{k+1} := ∥g_{k+1}∥²;
    β_k := δ_{k+1}/δ_k;
    p_{k+1} := −g_{k+1} + β_k p_k; % update search direction
    k := k + 1;
end CG iterations.
General remarks on the CG method
The CG method explained above is well-known in the framework of solving linear systems (compare the course
unit MATHx6101); such systems can be interpreted as the optimality conditions of an associated quadratic mini-
mization problem. For that reason we simply call it linear CG.
In order to stabilize calculations a preconditioned form of standard linear CG is often employed. See the
textbooks by Nocedal and Wright or Vogel for more details.
There exist various nonlinear generalizations of the standard linear CG method. They differ from the above
strategy essentially in that they can be applied to quite general non-quadratic objective functions f. Gradients,
however, need to be calculated afresh in each step, and a line search needs to be performed.
We describe some popular variants of nonlinear CG in the following where we follow closely the presentation in
the textbook by Vogel.
The Nonlinear Conjugate Gradient Method
The following Algorithm 5.6 is a nonlinear version of the conjugate gradient method, Algorithm 5.3, which uses
(9) for the calculation of β_k. It is due to Fletcher and Reeves.
Notice that this algorithm proceeds in 'cycles' of n iterations, with every n-th search direction being reset to the
steepest-descent direction. Because we cannot have more than n vectors which are mutually conjugate with respect
to a given matrix, each cycle of n steps is regarded as a search for the minimum of a local quadratic model of f. If
this does not yield a suitable estimate of the true minimum then a fresh cycle must be started.
Algorithm 5.6: Nonlinear CG method
for minimizing a non-quadratic nonlinear function f : IRn → IR.
Choose x0 as an initial estimate of the solution.
Calculate g_0 = ∇f(x_0). Set p_0 = −g_0.
Repeat for k = 0, 1, 2, . . .
    find s by an 'almost exact' line search to minimize f(x_k + s p_k).
    set x_{k+1} = x_k + s p_k, g_{k+1} = ∇f(x_{k+1})
    if k is not a multiple of n then
        find β_k and p_{k+1} from
        β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k) and p_{k+1} = −g_{k+1} + β_k p_k
    else
        set p_{k+1} = −g_{k+1}
until ∥g_{k+1}∥ is sufficiently small.
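For illustration, here is a minimal Python/NumPy sketch of Algorithm 5.6 (Fletcher-Reeves). For brevity it uses a crude backtracking line search where the notes ask for an 'almost exact' search satisfying the strong Wolfe conditions, so it should be read as a toy version; all names and constants are illustrative.

import numpy as np

def nonlinear_cg_fr(f, grad, x0, tol=1e-6, max_iter=1000):
    x = x0.astype(float)
    n = x.size
    g = grad(x)
    p = -g
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0:                        # safeguard: keep a descent direction
            p = -g
        s = 1.0                               # crude backtracking line search
        while s > 1e-14 and f(x + s * p) > f(x) + 1e-4 * s * (g @ p):
            s *= 0.5
        x = x + s * p
        g_new = grad(x)
        if (k + 1) % n == 0:
            p = -g_new                        # restart with steepest descent
        else:
            beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves formula (9)
            p = -g_new + beta * p
        g = g_new
    return x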
An alternative form has been proposed by Polak and Ribière which is identical to the Fletcher-Reeves form
except for the calculation of β_k. In the Polak-Ribière form this is done as

β_k = ((g_{k+1} − g_k)^T g_{k+1}) / (g_k^T g_k).

Yet another form for calculating β_k in the above algorithm has been proposed by Hestenes and Stiefel. It is

β_k = ((g_{k+1} − g_k)^T g_{k+1}) / ((g_{k+1} − g_k)^T p_k).
. . . and there are more variants for which we refer to Nocedal and Wright.
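Since the three update rules differ only in one line, they are conveniently written as interchangeable helpers operating on NumPy vectors (hypothetical function names, for illustration):

def beta_fr(g_new, g, p):                     # Fletcher-Reeves, formula (9)
    return (g_new @ g_new) / (g @ g)

def beta_pr(g_new, g, p):                     # Polak-Ribiere
    return ((g_new - g) @ g_new) / (g @ g)

def beta_hs(g_new, g, p):                     # Hestenes-Stiefel
    return ((g_new - g) @ g_new) / ((g_new - g) @ p)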
Notice that for non-quadratic f(x) the search directions p_k, p_{k+1} are not truly conjugate because there is no
constant Hessian ∇²f for them to be conjugate with respect to. Also, different strategies can be discussed for the line
search in each step, which for general nonlinear functions f most often has to be inexact. The Fletcher-Reeves form
of nonlinear CG usually uses the strong Wolfe conditions as step-size criterion, see Nocedal and Wright. Notice also
that it is important to perform an 'almost perfect' line search in nonlinear CG.
An alternative pseudo-code following the notation from Algorithm 5.5 for linear CG is given next for the nonlinear
CG method.
Algorithm 5.7: CG method for general nonlinear minimization
To minimize a non-quadratic nonlinear function f(x)
k := 0;
x0 := initial guess;
g0 := grad f(x0); % initial gradient
p0 := −g0; % initial search direction
δ0 := ∥g0∥²;
begin CG iterations
    τ_k ≈ argmin_{τ>0} f(x_k + τ p_k); % line search
    x_{k+1} := x_k + τ_k p_k; % update approximate solution
    g_{k+1} := grad f(x_{k+1}); % compute gradient
    δ_{k+1} := ∥g_{k+1}∥²;
    β_k := δ_{k+1}/δ_k;
    p_{k+1} := −g_{k+1} + β_k p_k; % update search direction
    k := k + 1;
end CG iterations.
Variants of Newton-type techniques
A number of variations and combinations of the techniques discussed above are available which aim to produce
efficient algorithms for solving practical minimization problems in realistic situations.
When calculating Newton directions in (4), a linear system with system matrix H_k needs to be solved. Several
modifications H_k → H̃_k have been proposed for cases where H_k is not SPD, such that the modified system H̃_k p = −g_k
has an SPD system matrix, which leads to more robust performance. These are often referred to as modified Newton methods.
Regardless of whether the actual H_k is SPD or not, solving the linear system (4) remains challenging for large-scale
problems. Applying (possibly truncated) Gaussian elimination or matrix factorizations are popular approaches, and
iterative solvers can also be used for these large-scale linear systems. However, keep in mind that
solving such a system accurately for large n in each Newton step can be prohibitive.
As part of a practical Newton iteration, the linear subproblems described above are therefore often solved only in some
simplified or truncated form in order to save computation time. The resulting so-called inexact Newton steps
often still provide good overall convergence.
In particular, if a truncated Conjugate Gradient technique is applied for finding approximate Newton directions
via (4) in each step of the Newton iteration, combined with an efficient line-search, we obtain the Newton-CG method.
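The following is a minimal sketch of the Newton-CG idea: a truncated CG iteration applied to H_k p = −g_k, stopped early once the residual is small relative to ∥g_k∥ and bailing out if negative curvature is encountered (all names and tolerances are illustrative assumptions):

import numpy as np

def newton_cg_direction(H, g, eta=0.5, max_cg=50):
    """Approximately solve H p = -g by truncated CG.

    Falls back to steepest descent if negative curvature is detected."""
    p = np.zeros_like(g)
    r = -g                                # residual of H p = -g at p = 0
    d = r.copy()
    for _ in range(max_cg):
        Hd = H @ d
        curv = d @ Hd
        if curv <= 0:                     # negative curvature: H not SPD here
            return -g if np.allclose(p, 0) else p
        alpha = (r @ r) / curv
        p = p + alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < eta * np.linalg.norm(g):
            break                         # truncation: solve only approximately
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

The resulting direction is then used inside a standard line-search loop exactly as in Algorithm 5.1.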
Also so-called trust-region variants of Newton’s method can be used where the quadratic approximation (2) is
minimized in a restricted neighbourhood of the current iterate xk, called the ’trust region’.
For more details on these and other modifications we refer to Nocedal and Wright and the other textbooks listed
below.
Recommended literature
If you want to read more about the material of this week, I recommend:
J E Dennis, Jr and R B Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations,
SIAM Classics in Applied Mathematics Series, Vol 16, 1996. (available as E-Book)
R Fletcher, Practical Methods of Optimization, Vol 1 and 2, John Wiley & Sons, 1980.
P E Gill, W Murray and M Wright, Practical Optimization, Academic Press, 1981.
D G Luenberger and Y Ye, Linear and Nonlinear Programming, Springer, 2010.
J Nocedal and S Wright, Numerical Optimization, Springer, 1999. (available as E-Book)
C R Vogel, Computational Methods for Inverse Problems, SIAM, 2002. (available as E-Book at the University
Library).