Remaining TODOs: 46
1. Derivatives and Taylor’s Theorem
Definition 1.1: Let .
is continuous at if .
is continuous if it is continuous at every .
Definition 1.2: Let .
is continuous at if .
is continuous if it is continuous at every .
Theorem 1.3: If and are continuous, so are:
- , when strictly positive
- , where strictly positive
- , where nonzero
Definition 1.4: Let .
is differentiable at if exists. Then is the derivative of at .
is differentiable if it is differentiable at every .
Definition 1.5: A secant is a line that connects two points on a curve.
A tangent can be thought of as a limit of secants as the distance between the two points tends to 0.
Theorem 1.6 (Quotient rule):
Theorem 1.7 (Derived quotient rule):
Theorem 1.8 (Taylor's Theorem (univariate)): For ,
for some .
The Taylor polynomial (of degree ) is also written , and the error term, or Lagrange remainder term is denoted by .
So, we can write for any so long as the first derivatives of all exist and are continuous on .
For small , .
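As a quick numerical check of the remainder (a sketch, not from the notes; the helper name is my own), the Lagrange remainder predicts that the degree-2 Taylor polynomial of exp about 0 has error close to h^3/3! for small h:

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of exp about 0, evaluated at x."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

h = 1e-2
err = abs(math.exp(h) - taylor_exp(h, 2))
# The remainder is exp(c) * h^3 / 3! for some c in (0, h), so for small h
# this ratio should be very close to 1.
ratio = err / (h**3 / math.factorial(3))
```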
Definition 1.9: Let .
is the partial derivative of with respect to , as if were constant.
means differentiate first with respect to , then . In general this is not equal to differentiating in the other order, but the two agree whenever both mixed partial derivatives are continuous (Schwarz’s theorem).
Definition 1.10: Let . Then the vector derivative is
It turns out that this makes sense to be a derivative, because
Remark: Some standard derivatives for vectors that are analogous to scalars:
Remark: Similarly to scalar derivatives, the vector derivative is linear and has a product and quotient rule.
The chain rule for is the same as for scalars:
For ,
where is the Jacobian - see Definition 1.14.
Remark: The type of the vector derivative of is
Also
Definition 1.11: Let . Then the Hessian of is
Remark: A vector field can be decomposed into
Definition 1.14: The derivative of a vector field is the Jacobian, , where
Remark: For ,
Remark: The Jacobian is a linear operator. Also,
- .
It has a dot-product rule:
It has a scalar-vector product rule:
Don’t need to learn these!
It also has a chain rule, called the partial chain rule:
Remark: Some standard Jacobians:
Remark: The Hessian is a Jacobian:
Luckily because of Theorem 1.12, the transpose can usually be ignored.
Note that this basically just says that the second derivative is the derivative of the derivative.
Remark: Given vectors ,
where the rows of are the s.
Moreover, given scalars ,
with as before and a diagonal matrix with the s as the diagonal entries.
These identities can come in handy when computing the Jacobian.
Theorem 1.15 (Taylor's Theorem for multivariate functions): For , if ,
Here, the brackets with partial derivatives inside are carrying out operator algebra - they are not multiplication, but follow the same distributive rules, so can be thought of as such.
The Taylor polynomials up to degree 2 can be written a bit more nicely:
For higher powers, we need tensors which are beyond the scope of this course.
The error terms (up to ) can be written in a similar fashion but evaluated at for some .
Remark: is small in the sense that
2. Optimisation
Definition 2.1: Let .
The canonical optimisation problem is to find
.
Note that is a set.
is the feasible region.
If , the optimisation is unconstrained.
If , the optimisation is constrained. The standard form for constraints is:
2.1. Optimisation in 1 dimension
Solutions to satisfy the 1st-order condition . But so do non-solutions!
Stationary points can be classified by Taylor’s theorem.
If , then for some . So if the second derivative is positive, then the function exceeds its stationary value throughout some region around the stationary point, which is therefore a minimum. Similarly, if the second derivative is negative, the function lies below its stationary value nearby, so the stationary point is a maximum.
If the second derivative is 0, this doesn’t necessarily mean that it is a point of inflection; we need to look for the first nonzero derivative. If the first nonzero derivative is of odd order, then it is a stationary point of inflection - since the function changes concavity around the stationary point - but if the first nonzero derivative is of even order, use the same rules as for the second derivative.
However, this doesn’t always work (e.g. e^(-1/x^2), extended by 0 at the origin, has all of its derivatives at 0 equal to 0).
For constrained optimisations, do the same but also consider the endpoints.
2.2. A brief detour into linear algebra
Definition 2.2.1: Given a symmetric matrix , is
- positive definite if for all
- positive semidefinite if for all
- negative definite if for all
- negative semidefinite if for all
- indefinite otherwise
Proposition 2.2.2:
These are equivalent to conditions on the eigenvalues: positive definite iff all eigenvalues are strictly positive, positive semidefinite iff all eigenvalues are nonnegative, and so on.
Proof (for positive definiteness):
A fact: If is an symmetric matrix, there exists a basis (for ) of orthogonal eigenvectors of . (spectral theorem)
TODO
Remark:
Given a symmetric matrix with eigenvalues ,
Hence
- indefinite;
- positive definite;
- positive semidefinite
- negative definite;
- negative semidefinite.
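The eigenvalue conditions translate directly into code (a sketch; the helper name and the tolerance for floating-point noise are my own):

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)  # eigvalsh assumes the matrix is symmetric
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semidefinite"
    return "indefinite"

classify(np.diag([2.0, 1.0]))    # positive definite
classify(np.diag([1.0, -1.0]))   # indefinite
```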
Remark: A symmetric matrix is
- positive definite if all its pivots are strictly positive
- positive semidefinite if all its pivots are nonnegative
- negative definite if all its pivots are strictly negative
- negative semidefinite if all its pivots are nonpositive
- indefinite otherwise
but these pivot conditions are valid only if the pivots can be found without row swaps
Remark: If is positive (semi)definite then so are
- for any
- , which is guaranteed to exist if is positive definite
- for any positive (semi)definite
- any upper-left submatrix of TODO intuition
- , if is of full rank - but if not of full rank, we can still guarantee positive semidefiniteness TODO intuition
- , if
TODO : lookup: diagonally dominant
2.3. Unconstrained optimisation over
Solutions to , if they exist, satisfy the 1st-order condition . This is in general not easy to solve as it is a system of (not necessarily linear) equations.
In higher dimensions, stationary points can be local minima, local maxima or saddle points.
We can classify these higher-dimensional stationary points using the Hessian, similarly to how we used the second derivative for scalar functions.
Hence if the Hessian at is positive definite, the stationary point at is a local minimum. If the Hessian is negative definite, the stationary point is a local maximum. If the Hessian is indefinite, the stationary point is a saddle point.
If the Hessian is semidefinite, you cannot determine anything other than that the point is not a local minimum/maximum. For better analysis we would need to use the 3rd-order Taylor approximation.
Fact: the outer product of a vector with itself is always positive semidefinite. Proof: .
2.4. Convexity
Definition 2.4.1: A set is convex if
That is, the line segment connecting any two points of the set lies wholly in the set.
Definition 2.4.2: A function for convex is convex if
That is, any point on between and lies below the secant between and .
Equivalently, every secant on lies above the surface of .
If the inequality is strict, is strictly convex; if the inequality is the other way around, is concave or strictly concave in a similar manner.
Remark:
Some common convex and concave functions:
- Linear functions are both convex and concave.
- for is strictly convex
- is strictly convex
- is strictly concave
- for convex , is concave
- is convex if is positive (semi)definite
- for convex , is convex
- for convex , is convex (but strict convexity is only preserved if has full rank)
- for convex , is convex
- for convex , is convex
Theorem 2.4.5: For ,
Proof: convex iff everywhere.
is always positive, so we require to be positive everywhere, i.e. is convex. We then require and to have the same signs everywhere, so either is increasing and is convex, or is decreasing and is concave.
The proof for concavity is similar.
Theorem 2.4.6:
Let , such that with .
Then if is (strictly) convex, and for each , either
then is (strictly) convex.
Theorem 2.4.7 (Jensen's inequality): For a random variable and convex function ,
(This is not on the syllabus but is useful)
2.5. Optimisation tricks
Theorem 2.5.1: If is strictly increasing, then
Theorem 2.5.2: If is injective/1-to-1 then
2.6. Optimisation over with equality constraints
Remark: Given a minimisation problem with an equality constraint, it can sometimes be reduced to an unconstrained problem:
- Given ,
- Given ,
But usually, this isn’t the case.
Theorem 2.6.1 (Lagrange's theorem / method of Lagrange multipliers): For a minimisation problem with one equality constraint, minimise subject to , the stationary points must satisfy
i.e. a first order condition is .
Proof:
TODO
Definition 2.6.2: The Lagrangian of such a minimisation problem is
The stationary points of where are the stationary points of .
TODO : look up bordered Hessian
Theorem 2.6.3: For ,
the Lagrangian is
Then the stationary points of the minimisation problem are the stationary points of the Lagrangian.
2.7. Inequality constrained optimisation (over )
Definition 2.7.1: Given a minimisation problem subject to , at the solution, is tight if , and slack if .
To solve such a minimisation, solve:
- for tight constraints via the previous section;
- for slack constraints, by solving the usual first-order condition, and ignoring any solutions outside of the feasible region.
Either way, , with possibly zero.
Theorem 2.7.2: The Lagrange multiplier is always nonnegative.
Corollary 2.7.3 (Complementary slackness): At a minimum of a minimisation problem with constraint and Lagrange multiplier ,
Equivalently, .
TODO KKT conditions (not on syllabus)
TODO : what to do when mix of equality and inequality constraints
3. Algorithms for numerical integration
3.1. 1-dimensional integration
Here we will consider approximating the integral of over a small interval , with .
The simplest approximation is the 0th order Taylor approximation at the midpoint; this is equal to the approximation using the 1st order Taylor approximation, since the linear term integrates to zero over the strip.
Definition 3.1.1: The midpoint rule: define
The subscript 1 indicates that it is operating over 1 strip. The notation for this (and other methods in this section) is not standard.
Theorem 3.1.2: The error of the midpoint rule,
is bounded by
where
Definition 3.1.4: The trapezium rule forms a trapezium between the axis and the chord joining the function values at the endpoints of the strip. It is generally worse than the midpoint rule, so will not be assessed in the course.
Theorem 3.1.5: The error bound for the trapezium rule is
Definition 3.1.6: Simpson’s rule interpolates a parabola through the endpoints and midpoint. The derivation is quite complicated but the resulting formula is very simple: it is just a weighted average. Although this is over a single strip, Simpson’s rule is usually considered to operate over 2 strips, for notational consistency (as we will see later).
Theorem 3.1.7: The error bound for Simpson’s rule is
Higher order Taylor approximations might not be a great choice for a better approximation because:
- you need to evaluate high derivatives
- Taylor approximations are local
- functions might not have enough derivatives/be analytic
Higher-order polynomial interpolation also might not be a good idea:
- it is not necessarily numerically unstable (as commonly believed), but it can be if you do it badly
- the Runge phenomenon - some functions get worse and worse approximations as the order of the interpolated polynomial increases because polynomial interpolation works worse near the ends of the interval
- but Chebyshev interpolations are much better! These have interpolation points closer together near to the ends of the interval
Definition 3.1.8: The Runge function is
Higher-order polynomial interpolation on the Runge function, using evenly spaced points, is really bad.
A better method is just to split the interval into smaller strips and use an easier method on each of them. This is guaranteed to converge to the right answer regardless of which method you use, so long as the function is continuous.
Definition 3.1.9: The composite midpoint rule is essentially just an average of the heights at the midpoints of the strips, multiplied by the width of the region.
Theorem 3.1.10: The error bound for the composite midpoint rule is
Definition 3.1.11: The composite Simpson’s rule is again akin to a weighted average, now with coefficients , with the 1s and 4s as before and the 2s coming from where the endpoints are repeated.
Theorem 3.1.12: The error bound for the composite Simpson’s rule is
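Both composite rules are short to implement (a sketch; the names are my own). On a smooth test integrand, halving the strip width should reduce the midpoint error by roughly a factor of 4, matching the error bound:

```python
import math

def composite_midpoint(f, a, b, n):
    """Average the heights at the n strip midpoints, times the region width."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def composite_simpson(f, a, b, n):
    """Weighted average with coefficient pattern 1, 4, 2, 4, ..., 2, 4, 1 (n even)."""
    assert n % 2 == 0
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

exact = math.e - 1  # integral of exp over [0, 1]
```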
We can make improvements on these methods:
- Pick more optimal strip endpoints (Gaussian quadrature - TODO what is this?)
- Extrapolate a sequence of or (Romberg integration - TODO what is this?)
- Recursive subdivision (reuse evaluation, do more [via thinner strips] on difficult bits) - based on 4th derivative being large, or use the a posteriori error estimate
TODO proof, how, why??
- Or just do more work where is large, as the errors have a greater effect there!
Also note that these methods can’t help with integrals with an infinity in either of the limits.
3.2. Integration in dimensions
In two dimensions, is the volume under a surface in region .
Theorem 3.2.1 (Fubini's Theorem):
for sufficiently nice and (we need to be able to express the bounds nicely). This “split” integral is called an iterated integral.
This extends to higher dimensions as expected.
As long as everything is integrable, the order doesn’t matter.
We can use the methods we already know to approximate higher-dimensional integrals, as long as we can split the integral into iterated integrals.
Remark: The midpoint rule can be extended to multiple dimensions by converting into an iterated integral via Fubini’s theorem, and then applying the 1-dimensional midpoint rule to successive inner integrals. For a -dimensional integral of the function over the unit hypercube, the midpoint rule with strips in each dimension gives
This is just taking the average of the value of at the midpoints of each hypercube in a grid over the region.
Remark: Simpson’s rule can also be extended to higher dimensions: for 2 dimensions,
where
Remark: In dimensions,
This seems quite good, but we do evaluations of the function; so relative to the work, the error becomes steadily worse.
This is called the “curse of dimensionality”… all of these methods improve extremely slowly in high dimensions.
Definition 3.2.2: Monte Carlo integration does the midpoint rule, but using random points according to some distribution.
Consider sampling random points along a scalar function:
Therefore, given i.i.d. according to , for large enough ,
We extend this to dimensions, using the notation
for large to approximate , with i.i.d. uniformly distributed across the region , and the area of the region .
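A minimal sketch of this estimator over the unit hypercube (the names and running-variance bookkeeping are my own; the standard error returned here anticipates Definition 3.2.5):

```python
import math
import random

def mc_integrate(f, d, n, seed=0):
    """Monte Carlo estimate of the integral of f over the unit hypercube
    [0, 1]^d, together with the standard error of the estimate."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n):
        y = f([rng.random() for _ in range(d)])
        total += y
        total_sq += y * y
    mean = total / n
    var = total_sq / n - mean * mean   # estimated variance of a single sample
    return mean, math.sqrt(var / n)

est, se = mc_integrate(lambda x: x[0] * x[1], 2, 100_000)  # exact value is 1/4
```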
Lemma 3.2.3:
Monte Carlo integration is unbiased (on average it is correct).
Proof: Take i.i.d. with pdf
Then
Proposition 3.2.4:
for a constant . Moreover, can be approximated by that can be expressed in terms of , , and the s.
Proof:
Definition 3.2.5: The standard error is the square root of the estimated variance,
Lemma 3.2.6:
Monte Carlo integration is consistent:
Proof: Let .
Then
Then by Chebyshev’s inequality,
Theorem 3.2.7 (Central Limit Theorem): If i.i.d., then
(This is not a precise statement.)
That is, a large sample of i.i.d. random variables tends to a normal distribution.
Lemma 3.2.8:
For large , the error is distributed approximately according to .
Proof:
Then and .
Then
Remark:
- There is no guarantee on the error (you could get really unlucky)
- Sampling even from strangely shaped regions is actually easy - just sample from a tight box around it, and reject any samples that are outside the region
- This sampling technique could be quite time consuming in higher dimensions
- Finding the area of the region could be quite difficult - but we can estimate it by considering the size of the sampling box and also how many samples were rejected. (Note that this is also a Monte Carlo integral, of the indicator function of the region!) We therefore need to adjust the standard error.
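The box-plus-rejection idea in code (a hypothetical example: estimating the area of the unit disc; the acceptance fraction is itself a Monte Carlo integral of the disc’s indicator function):

```python
import random

def estimate_disc_area(n, seed=1):
    """Sample from the bounding box [-1, 1]^2 and count the fraction of points
    that land inside the unit disc; scale by the box area, 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return 4 * inside / n
```

The estimate tends to the area of the disc, pi, as n grows.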
Remark: Some improvements we could make:
- Divide into subregions, perhaps recursively (stratified sampling)
- Sample some parts of more densely, using the pdf of as , then
(importance sampling)
- Subtract an easy integral (a control variate) to make the Monte Carlo integral smaller, so that the error is smaller
These don’t change the asymptotic error, but can give a better constant term.
4. Accuracy
Definition 4.1: Approximating by ,
- is the error
- is the absolute error
- is the relative error.
For vector , replace with the Euclidean norm .
Definition 4.2: If is an ideal function of , and is an approximate implementation of , then
- is the forward error
- is the backward error, where .
Definition 4.3: A floating-point number is a binary rational approximation to real numbers:
where is the mantissa, is the fixed precision (of the mantissa), and is the exponent.
The IEEE 754 standard specifies:
- single precision floats: 23 bit mantissa precision, 8 bit exponent, 1 sign bit
- double precision: 52 bit mantissa precision, 11 bit exponent, 1 sign bit
Definition 4.4: Machine epsilon is the such that every is representable with a relative error of no more than .
Note that is the relative error when we try to represent
Then
TODO go back over this
Machine epsilon is therefore for floats, and for doubles.
Some people define to be the smallest such that is representable. This is double the definition used here.
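The two conventions are easy to confuse, so it is worth checking in code (a sketch; the loop below finds the alternative "smallest eps with 1 + eps != 1", which for doubles is 2^-52, i.e. twice the 2^-53 used as machine epsilon here):

```python
import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
# eps is now the smallest power of two with 1 + eps != 1; for IEEE 754 doubles
# this is 2**-52, which is what Python reports as sys.float_info.epsilon.
```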
Definition 4.5: Truncation error occurs when approximating an infinite process by a finite process.
Roundoff error is due to floating-point storage and computation.
4.1. Iterative methods
Definition 4.1.1: Starting with an initial estimate approximating unknown , we repeatedly refine it with , stopping when some criterion is met, e.g. the change between successive iterates being smaller than some tolerance.
The error at step is .
If as :
- has linear convergence if for some ,
- has sublinear convergence if
  - in particular has logarithmic convergence if
- converges superlinearly if
  - in particular has order- convergence for if
Example:
- converges linearly
- converges logarithmically (a straight line on a log-log graph)
- converges superlinearly, and in fact converges quadratically. On a log plot, this is an inverted parabola.
Lemma 4.1.2: If converges with order , then converges with order .
5. Algorithms for root-finding
5.1. 1-dimensional root-finding
Definition 5.1.2: A bracket is an interval whose endpoints have function values with opposite signs.
Remark: We can effectively use binary search to find roots using brackets. However, we might not find an exact zero (because of floating point), so we terminate when the interval is small enough; the guess is the midpoint of the interval, and the error can be bounded by the width of the bracket.
This is called interval bisection.
Iterating, we get a sequence with
so this method converges linearly.
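A sketch of interval bisection (the names are my own):

```python
def bisect(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "[a, b] is not a bracket"
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:      # the root lies in [a, m]
            b, fb = m, fm
        else:                # the root lies in [m, b]
            a, fa = m, fm
    return (a + b) / 2       # guess the midpoint; error is at most (b - a) / 2

root = bisect(lambda x: x * x - 2, 0, 2)  # converges to sqrt(2)
```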
Definition 5.1.4: Newton’s method in 1 dimension approximates by its first-order Taylor polynomial.
Take a guess , then find the root of the first-order Taylor polynomial at ; this is , i.e.
TODO derive this explicitly to make super clear
This converges quadratically for starting points sufficiently close to the root. We can also get stuck at a fixed point, or hit a turning point and divide by zero, or diverge.
Remark: Some limitations of Newton’s method:
- you need to be able to evaluate the derivative
- fails if (throw an error)
- iteration can get stuck in a loop
- iteration can diverge
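A sketch of the iteration, raising errors for the failure modes above (the names and stopping rule are my own):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's method in 1 dimension: repeatedly jump to the root of the
    first-order Taylor polynomial at the current guess."""
    x = x0
    for _ in range(max_iter):
        dfx = df(x)
        if dfx == 0:
            raise ZeroDivisionError("derivative is zero at the current iterate")
        x_new = x - f(x) / dfx
        if abs(x_new - x) < tol:     # successive iterates sufficiently close
            return x_new
        x = x_new
    raise RuntimeError("did not converge (stuck in a loop, or diverging)")

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)  # converges to sqrt(2)
```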
Theorem 5.1.5: The forward error estimate for Newton’s method is
Proof:
Remark: So we might choose to terminate when successive iterations are sufficiently close, by some parameter , often or , preferably by some relative error.
We can also look at backward error, .
There’s no one best way of doing a stopping condition, you should think about it.
Remark: You also must stop at a sensible upper bound of steps so you don’t get stuck in an infinite loop (ideally with a warning that convergence didn’t happen).
You must also stop if an iteration step is poorly defined.
Lemma 5.1.6:
Let for some .
Let
Then if , then .
Proof:
Suppose .
Lemma 5.1.7:
For as before, if and then converges to at least quadratically.
Proof:
Suppose and that as before. Note that from the definition of .
Then letting , by induction, for all , and , so as .
Therefore .
Moreover,
so we have at least quadratic convergence (it might be better if it actually tends to 0; we don’t have equality).
Lemma 5.1.8: Let , and let
Then if and , then quadratically.
Proof:
Suppose that and .
From Lemma 5.1.6,
because the interval contained within with at its centre, , must have , so .
Note that although we cannot guarantee that stays within , we can guarantee that it stays in , if , by the same reasoning as Lemma 5.1.7.
TODO make this a bit nicer from lecture notes
Theorem 5.1.9: There exists an such that is a basin of convergence if .
Remark: The convergence result doesn’t work for double roots (it might still converge, but only linearly).
We can fix it by using the iteration , which will converge quadratically to double roots.
Definition 5.1.10: The secant method uses a linear interpolation to approximate .
TODO justification
TODO forward error (same as Newton)
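A sketch of the secant iteration (the names are my own); Newton’s derivative is replaced by the slope of the line through the two most recent iterates:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton's method with the derivative approximated by
    the slope between the two most recent iterates."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("secant is flat; cannot continue")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("did not converge")

root = secant(lambda x: x * x - 2, 1.0, 2.0)  # converges to sqrt(2)
```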
Remark: Some other 1-dimensional root finding methods:
- Halley’s method - higher order Taylor approximation. But difficult because we might end up with no roots or many roots. But cubic convergence in good cases.
- Muller’s method - higher order polynomial interpolation - same problems as Halley’s method, but no derivatives needed and order when it works
- Inverse quadratic interpolation, i.e. find , and the root is at (always!). Order convergence in good cases.
But the best method is Brent’s method:
- maintains a bracket
- uses inverse quadratic interpolation unless iteration undefined or outside the bracket
- uses the secant method as second choice
- falls back to bisection if these do not sufficiently shrink the bracket (only happens at at most half of iterations)
- derivative free
- always converges, at order in good cases.
5.2. -dimensional root finding
Definition 5.2.1: A root of is with .
If , the problem is underconstrained, so there are probably infinitely many solutions.
If , the problem is overconstrained, so there are probably no (exact) solutions.
We will consider .
Remark: If instead we take a hyperrectangle with a bracket for components on opposite sides, then we can guarantee a simultaneous root.
TODO diagram
But this is not very useful, because we can’t algorithmically prove that a function is positive/negative all the way along a line.
Definition 5.2.2: Given , we approximate each component by its first-order Taylor polynomial:
TODO juggle equations
We can put all of these together to get
So we get an iteration
Each iteration, this requires 1 evaluation of , derivative, and an inversion, which is .
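A sketch of the iteration (the names, and the 2-dimensional test system of a circle intersected with a line, are my own). Solving the linear system each step avoids forming the inverse explicitly:

```python
import numpy as np

def newton_nd(f, jac, x0, tol=1e-12, max_iter=50):
    """n-dimensional Newton's method: each step solves J(x) dx = -f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(x), -f(x))  # O(n^3), but cheaper than inverting
        x = x + step
        if np.linalg.norm(step) < tol:
            return x
    raise RuntimeError("did not converge")

# Test system: the circle x^2 + y^2 = 4 intersected with the line y = x.
f = lambda v: np.array([v[0]**2 + v[1]**2 - 4, v[1] - v[0]])
jac = lambda v: np.array([[2 * v[0], 2 * v[1]], [-1.0, 1.0]])
root = newton_nd(f, jac, [1.0, 0.5])  # converges to (sqrt(2), sqrt(2))
```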
Remark: We still have existence of a basin of quadratic convergence, if:
- are continuous
- TODO
Definition 5.2.3: In dimensions, the equivalent to the secant method is Broyden’s method:
TODO complexity
The convergence is order in good cases.
Theorem 5.2.4:
Remark: This means that we can start from and never need to calculate any inverses.
This gives for Broyden’s method.
6. Algorithms for optimisation
6.1. 1-dimensional minimisation
Definition 6.1.1: We can find a bracket for a minimum without evaluating the derivative: if there is s.t. , then is a bracket containing the minimum.
We can make an algorithm for reducing the bracket, by keeping track of the middle value and testing at a new point , and discarding one of the endpoints. This is called golden section search.
Let’s ignore the case of a tie between the middle values.
The placement of is quite important: we want ideally the same amount to be chopped off whether or not.
If we say that we want , then as we’re repeating using the same ratios, also .
Then
Therefore
Proposition 6.1.2: This converges linearly.
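A sketch of the resulting algorithm (the names are my own; the ratio 1/phi, about 0.618, that the derivation above produces appears as `invphi`):

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Golden section search for a minimum of a unimodal f on [a, b]."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2

xmin = golden_section(lambda x: (x - 1.3)**2, 0.0, 3.0)  # converges to 1.3
```

Each branch reuses one of the previous evaluations, so only one new function evaluation is needed per iteration.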
Definition 6.1.3: Successive parabolic interpolation proceeds like golden section search, but chooses to be the minimum of the parabola passing through .
This is order , but it can go wrong in many ways (exercise).
Definition 6.1.4: Brent’s method for minimisation:
- combines successive parabolic interpolation with golden section search, keeping track of 6 points
- is derivative-free and superlinear in good cases.
6.2. -dimensional minimisation
Given a position , choose direction and step length , and set .
Define , with standing for gradient.
We would like to be a descent direction:
We’d also like consecutive directions to be roughly orthogonal, so that they explore the space optimally.
should loosely minimise
TODO infinite travel
Definition 6.2.2: Gradient descent finds the steepest downhill direction.
So just go in the direction of the negative gradient.
To find a good step length, overshoot and backtrack (typically halving) until the step length is acceptable.
Our acceptable condition is
This is called the Armijo rule. is typically to . It says that we want the new value of the function to be below a line that is somewhere in between the 0th-order and 1st-order approximations to f at .
It can be proven that this always works if .
We could take an initial guess for to be
TODO justification
We still need a large initial step length. We could even forward-track until we get something that doesn’t meet the Armijo rule.
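A sketch of the whole loop with backtracking (the quadratic test function and the parameter values are my own):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, c=1e-4, tol=1e-8, max_iter=10_000):
    """Gradient descent with Armijo backtracking (halving the step length)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        alpha = alpha0
        # Armijo rule: accept once f drops below the line interpolating the
        # 0th- and 1st-order approximations at x.
        while f(x - alpha * g) > f(x) - c * alpha * g @ g:
            alpha /= 2
        x = x - alpha * g
    return x

f = lambda v: (v[0] - 1)**2 + 2 * (v[1] + 2)**2   # minimum at (1, -2)
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
xmin = gradient_descent(f, grad, [0.0, 0.0])
```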
TODO overall iteration
TODO termination condition.
TODO complexity
TODO convergence
Definition 6.2.3: Conjugate gradient descent uses conjugate directions that satisfy .
TODO justification, what does this mean?
Remark: We can also use Newton’s method:
Approximate by its second-order Taylor polynomial
Differentiating this gives
where .
Therefore . But remember that computing matrix inverses is slow, so it is better to solve the linear system for the step directly.
TODO expression for
This has problem-independent step length ( TODO why?).
TODO (dis)advantages.
Remark: Newton’s method doesn’t always give a descent direction.
Therefore Newton’s method can converge to maxima or saddle points just as readily as to minima.
TODO compare the two Newton’s methods (they’re the same)
Remark: We could use a quasi-Newton method whereby we use an approximation of the Hessian, e.g. by using Broyden’s method on the derivative. However, this is bad because it doesn’t necessarily give a symmetric approximation for the Hessian.
We want :
- to satisfy secant equation TODO
- to be symmetric
- to be close to
- to be positive definite.
Positive definiteness requires the curvature condition , which we can’t always guarantee.
Definition 6.2.4: BFGS update is TODO (not in spec).
If the curvature condition fails, reset .
Best to start with , and reset to the identity if the curvature condition fails. When the approximate Hessian is the identity, it is just gradient descent; so it does gradient descent until it gets to a convex section of the function, at which point it starts approximating the Hessian and goes a lot quicker.
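A sketch of the full quasi-Newton loop with the standard BFGS inverse-Hessian update (the update formula is the textbook one, included here since the notes leave it as a TODO; the quadratic test function is my own):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
    """Quasi-Newton minimisation keeping an approximation H of the inverse
    Hessian, starting from (and resetting to) the identity."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x
        d = -H @ g                       # with H = I this is plain gradient descent
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * g @ d:
            alpha /= 2                   # Armijo backtracking
        s = alpha * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s <= 0:                   # curvature condition fails: reset
            H = np.eye(n)
        else:                            # standard BFGS update of the inverse
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

x_star = bfgs(lambda v: (v[0] - 1)**2 + 2 * (v[1] + 2)**2,
              lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)]),
              [0.0, 0.0])               # converges to (1, -2)
```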
TODO advantages
TODO disadvantages