Remaining TODOs: 46
1. Derivatives and Taylor’s Theorem
Definition 1.1: Let .
is continuous at if .
is continuous if it is continuous at every .
Definition 1.2: Let .
is continuous at if .
is continuous if it is continuous at every .
Theorem 1.3: If and are continuous, so are:
- , when strictly positive
- , where strictly positive
- , where nonzero
Definition 1.4: Let .
is differentiable at if exists. Then is the derivative of at .
is differentiable if it is differentiable at every .
Definition 1.5: A secant is a line that connects two points on a curve.
A tangent can be thought of as a limit of secants as the distance between the two points tends to 0.
Theorem 1.6 (Quotient rule):
Theorem 1.7 (Derived quotient rule):
Theorem 1.8 (Taylor's Theorem (univariate)): For ,
for some .
The Taylor polynomial (of degree ) is also written , and the error term, or Lagrange remainder term is denoted by .
So, we can write for any so long as the first derivatives of all exist and are continuous on .
For small , .
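As a quick numerical check of the remainder (a sketch, not from the notes; the helper name is my own), the Lagrange remainder predicts that the degree-2 Taylor polynomial of exp about 0 has error close to h^3/3! for small h:

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of exp about 0, evaluated at x."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

h = 1e-2
err = abs(math.exp(h) - taylor_exp(h, 2))
# The remainder is exp(c) * h^3 / 3! for some c in (0, h), so for small h
# this ratio should be very close to 1.
ratio = err / (h**3 / math.factorial(3))
```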
Definition 1.9: Let .
is the partial derivative of with respect to , as if were constant.
means differentiate first with respect to , then . In general this is not equal to differentiating in the other order, but the two agree whenever both mixed partial derivatives are continuous (Schwarz’s theorem).
Definition 1.10: Let . Then the vector derivative is
It turns out that this makes sense to be a derivative, because
Remark: Some standard derivatives for vectors that are analogous to scalars:
Remark: Similarly to scalar derivatives, the vector derivative is linear and has a product and quotient rule.
The chain rule for is the same as for scalars:
For ,
where is the Jacobian - see Definition 1.14.
Remark: The type of the vector derivative of is
Also
Definition 1.11: Let . Then the Hessian of is
Remark: A vector field can be decomposed into
Definition 1.14: The derivative of a vector field is the Jacobian, , where
Remark: For ,
Remark: The Jacobian is a linear operator. Also,
- .
It has a dot-product rule:
It has a scalar-vector product rule:
Don’t need to learn these!
It also has a chain rule, called the partial chain rule:
Remark: Some standard Jacobians:
Remark: The Hessian is a Jacobian:
Luckily because of Theorem 1.12, the transpose can usually be ignored.
Note that this basically just says that the second derivative is the derivative of the derivative.
Remark: Given vectors ,
where the rows of are the s.
Moreover, given scalars ,
with as before and a diagonal matrix with the s as the diagonal entries.
These identities can come in handy when computing the Jacobian.
Theorem 1.15 (Taylor's Theorem for multivariate functions): For , if ,
Here, the brackets with partial derivatives inside are carrying out operator algebra - they are not multiplication, but follow the same distributive rules, so can be thought of as such.
The Taylor polynomials up to degree 2 can be written a bit more nicely:
For higher powers, we need tensors which are beyond the scope of this course.
The error terms (up to ) can be written in a similar fashion but evaluated at for some .
Remark: is small in the sense that
2. Optimisation
Definition 2.1: Let .
The canonical optimisation problem is to find
.
Note that is a set.
is the feasible region.
If , the optimisation is unconstrained.
If , the optimisation is constrained. The standard form for constraints is:
2.1. Optimisation in 1 dimension
Solutions to satisfy the 1st-order condition . But so do non-solutions!
Stationary points can be classified by Taylor’s theorem.
If , then for some . So if the second derivative is positive, then the function exceeds its stationary value throughout some region around the stationary point, which is therefore a minimum. Similarly, if the second derivative is negative, the function lies below its stationary value nearby, so the stationary point is a maximum.
If the second derivative is 0, this doesn’t necessarily mean that it is a point of inflection; we need to look for the first nonzero derivative. If the first nonzero derivative is of odd order, then it is a stationary point of inflection - since the function changes concavity around the stationary point - but if the first nonzero derivative is of even order, use the same rules as for the second derivative.
However, this doesn’t always work (e.g. e^(-1/x^2), extended by 0 at the origin, has all of its derivatives at 0 equal to 0).
For constrained optimisations, do the same but also consider the endpoints.
2.2. A brief detour into linear algebra
Definition 2.2.1: Given a symmetric matrix , is
- positive definite if for all
- positive semidefinite if for all
- negative definite if for all
- negative semidefinite if for all
- indefinite otherwise
Proposition 2.2.2:
These are equivalent to conditions on the eigenvalues: positive definite iff all eigenvalues are strictly positive, positive semidefinite iff all eigenvalues are nonnegative, and so on.
Proof (for positive definiteness):
A fact: If is an symmetric matrix, there exists a basis (for ) of orthogonal eigenvectors of . (spectral theorem)
TODO
Remark:
Given a symmetric matrix with eigenvalues ,
Hence
- indefinite;
- positive definite;
- positive semidefinite
- negative definite;
- negative semidefinite.
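The eigenvalue conditions translate directly into code (a sketch; the helper name and the tolerance for floating-point noise are my own):

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)  # eigvalsh assumes the matrix is symmetric
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semidefinite"
    return "indefinite"

classify(np.diag([2.0, 1.0]))    # positive definite
classify(np.diag([1.0, -1.0]))   # indefinite
```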
Remark: A symmetric matrix is
- positive definite if all its pivots are strictly positive
- positive semidefinite if all its pivots are nonnegative
- negative definite if all its pivots are strictly negative
- negative semidefinite if all its pivots are nonpositive
- indefinite otherwise
but these pivot conditions are valid only if the pivots can be found without row swaps
Remark: If is positive (semi)definite then so are
- for any
- , which is guaranteed to exist if is positive definite
- for any positive (semi)definite
- any upper-left submatrix of TODO intuition
- , if is of full rank - but if not of full rank, we can still guarantee positive semidefiniteness TODO intuition
- , if
TODO : lookup: diagonally dominant
2.3. Unconstrained optimisation over
Solutions to , if they exist, satisfy the 1st-order condition . This is in general not easy to solve as it is a system of (not necessarily linear) equations.
In higher dimensions, stationary points can be local minima, local maxima or saddle points.
We can classify these higher-dimensional stationary points using the Hessian, similarly to how we used the second derivative for scalar functions.
Hence if the Hessian at is positive definite, the stationary point at is a local minimum. If the Hessian is negative definite, the stationary point is a local maximum. If the Hessian is indefinite, the stationary point is a saddle point.
If the Hessian is semidefinite, you cannot determine anything other than that the point is not a local minimum/maximum. For better analysis we would need to use the 3rd-order Taylor approximation.
Fact: the outer product of a vector with itself is always positive semidefinite. Proof: .
2.4. Convexity
Definition 2.4.1: A set is convex if
That is, the line segment connecting any two points of the set lies wholly in the set.
Definition 2.4.2: A function for convex is convex if
That is, any point on between and lies below the secant between and .
Equivalently, every secant on lies above the surface of .
If the inequality is strict, is strictly convex; if the inequality is the other way around, is concave or strictly concave in a similar manner.
Remark:
Some common convex and concave functions:
- Linear functions are both convex and concave.
- for is strictly convex
- is strictly convex
- is strictly concave
- for convex , is concave
- is convex if is positive (semi)definite
- for convex , is convex
- for convex , is convex (but strict convexity is only preserved if has full rank)
- for convex , is convex
- for convex , is convex
Theorem 2.4.5: For ,
Proof: convex iff everywhere.
is always positive, so we require to be positive everywhere, i.e. is convex. We then require and to have the same signs everywhere, so either is increasing and is convex, or is decreasing and is concave.
The proof for concavity is similar.
Theorem 2.4.6:
Let , such that with .
Then if is (strictly) convex, and for each , either
then is (strictly) convex.
Theorem 2.4.7 (Jensen's inequality): For a random variable and convex function ,
(This is not on the syllabus but is useful)
2.5. Optimisation tricks
Theorem 2.5.1: If is strictly increasing, then
Theorem 2.5.2: If is injective/1-to-1 then
2.6. Optimisation over with equality constraints
Remark: Given a minimisation problem with an equality constraint, it can sometimes be reduced to an unconstrained problem:
- Given ,
- Given ,
But usually, this isn’t the case.
Theorem 2.6.1 (Lagrange's theorem / method of Lagrange multipliers): For a minimisation problem with one equality constraint, minimise subject to , the stationary points must satisfy
i.e. a first order condition is .
Proof:
TODO
Definition 2.6.2: The Lagrangian of such a minimisation problem is
The stationary points of where are the stationary points of .
TODO : look up bordered Hessian
Theorem 2.6.3: For ,
the Lagrangian is
Then the stationary points of the minimisation problem are the stationary points of the Lagrangian.
2.7. Inequality constrained optimisation (over )
Definition 2.7.1: Given a minimisation problem subject to , at the solution, is tight if , and slack if .
To solve such a minimisation, solve:
- for tight constraints via the previous section;
- for slack constraints, by solving the usual first-order condition, and ignoring any solutions outside of the feasible region.
Either way, , with possibly zero.
Theorem 2.7.2: The Lagrange multiplier is always nonnegative.
Corollary 2.7.3 (Complementary slackness): At a minimum of a minimisation problem with constraint and Lagrange multiplier ,
Equivalently, .
TODO KKT conditions (not on syllabus)
TODO : what to do when mix of equality and inequality constraints
3. Algorithms for numerical integration
3.1. 1-dimensional integration
Here we will consider approximating the integral of over a small interval , with .
The simplest approximation is the 0th order Taylor approximation at the midpoint; this is equal to the approximation using the 1st order Taylor approximation, since the linear term integrates to zero over the strip.
Definition 3.1.1: The midpoint rule: define
The subscript 1 indicates that it is operating over 1 strip. The notation for this (and other methods in this section) is not standard.
Theorem 3.1.2: The error of the midpoint rule,
is bounded by
where
Definition 3.1.4: The trapezium rule forms a trapezium between the axis and the chord joining the function values at the endpoints of the strip. It is generally worse than the midpoint rule, so will not be assessed in the course.
Theorem 3.1.5: The error bound for the trapezium rule is
Definition 3.1.6: Simpson’s rule interpolates a parabola through the endpoints and midpoint. The derivation is quite complicated but the resulting formula is very simple: it is just a weighted average. Although this is over a single strip, Simpson’s rule is usually considered to operate over 2 strips, for notational consistency (as we will see later).
Theorem 3.1.7: The error bound for Simpson’s rule is
Higher order Taylor approximations might not be a great choice for a better approximation because:
- you need to evaluate high derivatives
- Taylor approximations are local
- functions might not have enough derivatives/be analytic
Higher-order polynomial interpolation also might not be a good idea:
- it is not necessarily numerically unstable (as commonly believed), but it can be if you do it badly
- the Runge phenomenon - some functions get worse and worse approximations as the order of the interpolated polynomial increases because polynomial interpolation works worse near the ends of the interval
- but Chebyshev interpolations are much better! These have interpolation points closer together near to the ends of the interval
Definition 3.1.8: The Runge function is
Higher-order polynomial interpolation on the Runge function, using evenly spaced points, is really bad.
A better method is just to split the interval into smaller strips and use an easier method on each of them. This is guaranteed to converge to the right answer regardless of which method you use, so long as the function is continuous.
Definition 3.1.9: The composite midpoint rule is essentially just an average of the heights at the midpoints of the strips, multiplied by the width of the region.
Theorem 3.1.10: The error bound for the composite midpoint rule is
Definition 3.1.11: The composite Simpson’s rule is again akin to a weighted average, now with coefficients , with the 1s and 4s as before and the 2s coming from where the endpoints are repeated.
Theorem 3.1.12: The error bound for the composite Simpson’s rule is
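Both composite rules are short to implement (a sketch; the names are my own). On a smooth test integrand, halving the strip width should reduce the midpoint error by roughly a factor of 4, matching the error bound:

```python
import math

def composite_midpoint(f, a, b, n):
    """Average the heights at the n strip midpoints, times the region width."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def composite_simpson(f, a, b, n):
    """Weighted average with coefficient pattern 1, 4, 2, 4, ..., 2, 4, 1 (n even)."""
    assert n % 2 == 0
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

exact = math.e - 1  # integral of exp over [0, 1]
```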
We can make improvements on these methods:
- Pick more optimal strip endpoints (Gaussian quadrature - TODO what is this?)
- Extrapolate a sequence of or (Romberg integration - TODO what is this?)
- Recursive subdivision (reuse evaluation, do more [via thinner strips] on difficult bits) - based on 4th derivative being large, or use the a posteriori error estimate
TODO proof, how, why??
- Or just do more work where is large, as the errors have a greater effect there!
Also note that these methods can’t help with integrals with an infinity in either of the limits.
3.2. Integration in dimensions
In two dimensions, is the volume under a surface in region .
Theorem 3.2.1 (Fubini's Theorem):
for sufficiently nice and (we need to be able to express the bounds nicely). This “split” integral is called an iterated integral.
This extends to higher dimensions as expected.
As long as everything is integrable, the order doesn’t matter.
We can use the methods we already know to approximate higher-dimensional integrals, as long as we can split the integral into iterated integrals.
Remark: The midpoint rule can be extended to multiple dimensions by converting into an iterated integral via Fubini’s theorem, and then applying the 1-dimensional midpoint rule to successive inner integrals. For a -dimensional integral of the function over the unit hypercube, the midpoint rule with strips in each dimension gives
This is just taking the average of the value of at the midpoints of each hypercube in a grid over the region.
Remark: Simpson’s rule can also be extended to higher dimensions: for 2 dimensions,
where
Remark: In dimensions,
This seems quite good, but we do evaluations of the function; so relative to the work, the error becomes steadily worse.
This is called the “curse of dimensionality”… all of these methods improve extremely slowly in high dimensions.
Definition 3.2.2: Monte Carlo integration does the midpoint rule, but using random points according to some distribution.
Consider sampling random points along a scalar function:
Therefore, given i.i.d. according to , for large enough ,
We extend this to dimensions, using the notation
for large to approximate , with i.i.d. uniformly distributed across the region , and the area of the region .
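A minimal sketch of this estimator over the unit hypercube (the names and running-variance bookkeeping are my own; the standard error returned here anticipates Definition 3.2.5):

```python
import math
import random

def mc_integrate(f, d, n, seed=0):
    """Monte Carlo estimate of the integral of f over the unit hypercube
    [0, 1]^d, together with the standard error of the estimate."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n):
        y = f([rng.random() for _ in range(d)])
        total += y
        total_sq += y * y
    mean = total / n
    var = total_sq / n - mean * mean   # estimated variance of a single sample
    return mean, math.sqrt(var / n)

est, se = mc_integrate(lambda x: x[0] * x[1], 2, 100_000)  # exact value is 1/4
```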
Lemma 3.2.3:
Monte Carlo integration is unbiased (on average it is correct).
Proof: Take i.i.d. with pdf
Then
Proposition 3.2.4:
for a constant . Moreover, can be approximated by that can be expressed in terms of , , and the s.
Proof:
Definition 3.2.5: The standard error is the square root of the estimated variance,
Lemma 3.2.6:
Monte Carlo integration is consistent:
Proof: Let .
Then
Then by Chebyshev’s inequality,
Theorem 3.2.7 (Central Limit Theorem): If i.i.d., then
(This is not a precise statement.)
That is, a large sample of i.i.d. random variables tends to a normal distribution.
Lemma 3.2.8:
For large , the error is distributed approximately according to .
Proof:
Then and .
Then
Remark:
- There is no guarantee on the error (you could get really unlucky)
- Sampling even from strangely shaped regions is actually easy - just sample from a tight box around it, and reject any samples that are outside the region
- This sampling technique could be quite time consuming in higher dimensions
- Finding the area of the region could be quite difficult - but we can estimate it by considering the size of the sampling box and also how many samples were rejected. (Note that this is also a Monte Carlo integral, of the indicator function of the region!) We therefore need to adjust the standard error.
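The box-plus-rejection idea in code (a hypothetical example: estimating the area of the unit disc; the acceptance fraction is itself a Monte Carlo integral of the disc’s indicator function):

```python
import random

def estimate_disc_area(n, seed=1):
    """Sample from the bounding box [-1, 1]^2 and count the fraction of points
    that land inside the unit disc; scale by the box area, 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return 4 * inside / n
```

The estimate tends to the area of the disc, pi, as n grows.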
Remark: Some improvements we could make:
- Divide into subregions, perhaps recursively (stratified sampling)
- Sample some parts of more densely, using the pdf of as , then
(importance sampling)
- Subtract an easy integral (a control variate) to make the Monte Carlo integral smaller, so that the error is smaller
These don’t change the asymptotic error, but can give a better constant term.
4. Accuracy
Definition 4.1: Approximating by ,
- is the error
- is the absolute error
- is the relative error.
For vector , replace with the Euclidean norm .
Definition 4.2: If is an ideal function of , and is an approximate implementation of , then
- is the forward error
- is the backward error, where .
Definition 4.3: A floating-point number is a binary rational approximation to real numbers:
where is the mantissa, is the fixed precision (of the mantissa), and is the exponent.
The IEEE 754 standard specifies:
- single precision floats: 23 bit mantissa precision, 8 bit exponent, 1 sign bit
- double precision: 52 bit mantissa precision, 11 bit exponent, 1 sign bit
Definition 4.4: Machine epsilon is the such that every is representable with a relative error of no more than .
Note that is the relative error when we try to represent
Then
TODO go back over this
Machine epsilon is therefore for floats, and for doubles.
Some people define to be the smallest such that is representable. This is double the definition used here.
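The two conventions are easy to confuse, so it is worth checking in code (a sketch; the loop below finds the alternative "smallest eps with 1 + eps != 1", which for doubles is 2^-52, i.e. twice the 2^-53 used as machine epsilon here):

```python
import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
# eps is now the smallest power of two with 1 + eps != 1; for IEEE 754 doubles
# this is 2**-52, which is what Python reports as sys.float_info.epsilon.
```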
Definition 4.5: Truncation error occurs when approximating an infinite process by a finite process.
Roundoff error is due to floating-point storage and computation.
4.1. Iterative methods
Definition 4.1.1: Starting with an initial estimate approximating unknown , we repeatedly refine it with , stopping when some criterion is met, e.g. the change between successive iterates being smaller than some tolerance.
The error at step is .
If as :
- has linear convergence if for some ,
- has sublinear convergence if
  - in particular has logarithmic convergence if
- converges superlinearly if
  - in particular has order- convergence for if
Example:
- converges linearly
- converges logarithmically (a straight line on a log-log graph)
- converges superlinearly, and in fact converges quadratically. On a log plot, this is an inverted parabola.
Lemma 4.1.2: If converges with order , then converges with order .
5. Algorithms for root-finding
5.1. 1-dimensional root-finding
Definition 5.1.2: A bracket is an interval whose endpoints have function values with opposite signs.
Remark: We can effectively use binary search to find roots using brackets. However, we might not find an exact zero (because of floating point), so we terminate when the interval is small enough; the guess is the midpoint of the interval, and the error can be bounded by the width of the bracket.
This is called interval bisection.
Iterating, we get a sequence with
so this method converges linearly.
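A sketch of interval bisection (the names are my own):

```python
def bisect(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "[a, b] is not a bracket"
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:      # the root lies in [a, m]
            b, fb = m, fm
        else:                # the root lies in [m, b]
            a, fa = m, fm
    return (a + b) / 2       # guess the midpoint; error is at most (b - a) / 2

root = bisect(lambda x: x * x - 2, 0, 2)  # converges to sqrt(2)
```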
Definition 5.1.4: Newton’s method in 1 dimension approximates by its first-order Taylor polynomial.
Take a guess , then find the root of the first-order Taylor polynomial at ; this is , i.e.
TODO derive this explicitly to make super clear
This converges quadratically for starting points sufficiently close to the root. We can also get stuck at a fixed point, or hit a turning point and divide by zero, or diverge.
Remark: Some limitations of Newton’s method:
- you need to be able to evaluate the derivative
- fails if (throw an error)
- iteration can get stuck in a loop
- iteration can diverge
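A sketch of the iteration, raising errors for the failure modes above (the names and stopping rule are my own):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's method in 1 dimension: repeatedly jump to the root of the
    first-order Taylor polynomial at the current guess."""
    x = x0
    for _ in range(max_iter):
        dfx = df(x)
        if dfx == 0:
            raise ZeroDivisionError("derivative is zero at the current iterate")
        x_new = x - f(x) / dfx
        if abs(x_new - x) < tol:     # successive iterates sufficiently close
            return x_new
        x = x_new
    raise RuntimeError("did not converge (stuck in a loop, or diverging)")

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)  # converges to sqrt(2)
```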
Theorem 5.1.5: The forward error estimate for Newton’s method is
Proof:
Remark: So we might choose to terminate when successive iterations are sufficiently close, by some parameter , often or , preferably by some relative error.
We can also look at backward error, .
There’s no one best way of doing a stopping condition, you should think about it.
Remark: You also must stop at a sensible upper bound of steps so you don’t get stuck in an infinite loop (ideally with a warning that convergence didn’t happen).
You must also stop if an iteration step is poorly defined.
Lemma 5.1.6:
Let for some .
Let
Then if , then .
Proof:
Suppose .
Lemma 5.1.7:
For as before, if and then converges to at least quadratically.
Proof:
Suppose and that as before. Note that from the definition of .
Then letting , by induction, for all , and , so as .
Therefore .
Moreover,
so we have at least quadratic convergence (it might be better if it actually tends to 0; we don’t have equality).
Lemma 5.1.8: Let , and let
Then if and , then quadratically.
Proof:
Suppose that and .
From Lemma 5.1.6,
because the interval contained within with at its centre, , must have , so .
Note that although we cannot guarantee that stays within , we can guarantee that it stays in , if , by the same reasoning as Lemma 5.1.7.
TODO make this a bit nicer from lecture notes
Theorem 5.1.9: There exists an such that is a basin of convergence if .
Remark: The convergence result doesn’t work for double roots (it might still converge, but only linearly).
We can fix it by using the iteration , which will converge quadratically to double roots.
Definition 5.1.10: The secant method uses a linear interpolation to approximate .
TODO justification
TODO forward error (same as Newton)
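A sketch of the secant iteration (the names are my own); Newton’s derivative is replaced by the slope of the line through the two most recent iterates:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton's method with the derivative approximated by
    the slope between the two most recent iterates."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("secant is flat; cannot continue")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("did not converge")

root = secant(lambda x: x * x - 2, 1.0, 2.0)  # converges to sqrt(2)
```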
Remark: Some other 1-dimensional root finding methods:
- Halley’s method - higher order Taylor approximation. But difficult because we might end up with no roots or many roots. But cubic convergence in good cases.
- Muller’s method - higher order polynomial interpolation - same problems as Halley’s method, but no derivatives needed and order when it works
- Inverse quadratic interpolation, i.e. find , and the root is at (always!). Order convergence in good cases.
But the best method is Brent’s method:
- maintains a bracket
- uses inverse quadratic interpolation unless iteration undefined or outside the bracket
- uses the secant method as second choice
- falls back to bisection if these do not sufficiently shrink the bracket (only happens at at most half of iterations)
- derivative free
- always converges, at order in good cases.
5.2. -dimensional root finding
Definition 5.2.1: A root of is with .
If , the problem is underconstrained, so there are probably infinitely many solutions.
If , the problem is overconstrained, so there are probably no (exact) solutions.
We will consider .
Remark: If instead we take a hyperrectangle with a bracket for components on opposite sides, then we can guarantee a simultaneous root.
TODO diagram
But this is not very useful, because we can’t algorithmically prove that a function is positive/negative all the way along a line.
Definition 5.2.2: Given , we approximate each component by its first-order Taylor polynomial:
TODO juggle equations
We can put all of these together to get
So we get an iteration
Each iteration, this requires 1 evaluation of , derivative, and an inversion, which is .
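A sketch of the iteration (the names, and the 2-dimensional test system of a circle intersected with a line, are my own). Solving the linear system each step avoids forming the inverse explicitly:

```python
import numpy as np

def newton_nd(f, jac, x0, tol=1e-12, max_iter=50):
    """n-dimensional Newton's method: each step solves J(x) dx = -f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(x), -f(x))  # O(n^3), but cheaper than inverting
        x = x + step
        if np.linalg.norm(step) < tol:
            return x
    raise RuntimeError("did not converge")

# Test system: the circle x^2 + y^2 = 4 intersected with the line y = x.
f = lambda v: np.array([v[0]**2 + v[1]**2 - 4, v[1] - v[0]])
jac = lambda v: np.array([[2 * v[0], 2 * v[1]], [-1.0, 1.0]])
root = newton_nd(f, jac, [1.0, 0.5])  # converges to (sqrt(2), sqrt(2))
```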
Remark: We still have existence of a basin of quadratic convergence, if:
- are continuous
- TODO
Definition 5.2.3: In dimensions, the equivalent to the secant method is Broyden’s method:
TODO complexity
The convergence is order in good cases.
Theorem 5.2.4:
Remark: This means that we can start from and never need to calculate any inverses.
This gives for Broyden’s method.
6. Algorithms for optimisation
6.1. 1-dimensional minimisation
Definition 6.1.1: We can find a bracket for a minimum without evaluating the derivative: if there is s.t. , then is a bracket containing the minimum.
We can make an algorithm for reducing the bracket, by keeping track of the middle value and testing at a new point , and discarding one of the endpoints. This is called golden section search.
Let’s ignore the case of a tie between the middle values.
The placement of is quite important: we want ideally the same amount to be chopped off whether or not.
If we say that we want , then as we’re repeating using the same ratios, also .
Then
Therefore
Proposition 6.1.2: This converges linearly.
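A sketch of the resulting algorithm (the names are my own; the ratio 1/phi, about 0.618, that the derivation above produces appears as `invphi`):

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Golden section search for a minimum of a unimodal f on [a, b]."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2

xmin = golden_section(lambda x: (x - 1.3)**2, 0.0, 3.0)  # converges to 1.3
```

Each branch reuses one of the previous evaluations, so only one new function evaluation is needed per iteration.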
Definition 6.1.3: Successive parabolic interpolation proceeds like golden section search, but chooses to be the minimum of the parabola passing through .
This is order , but it can go wrong in many ways (exercise).
Definition 6.1.4: Brent’s method for minimisation:
- combines successive parabolic interpolation with golden section search, keeping track of 6 points
- is derivative-free and superlinear in good cases.
6.2. -dimensional minimisation
Given a position , choose direction and step length , and set .
Define , with standing for gradient.
We would like to be a descent direction:
We’d also like consecutive directions to be roughly orthogonal, so that they explore the space optimally.
should loosely minimise
TODO infinite travel
Definition 6.2.2: Gradient descent finds the steepest downhill direction.
So just go in the direction of the negative gradient.
To find a good step length, overshoot and backtrack (typically halving) until the step length is acceptable.
Our acceptable condition is
This is called the Armijo rule. is typically to . It says that we want the new value of the function to be below a line that is somewhere in between the 0th-order and 1st-order approximations to f at .
It can be proven that this always works if .
We could take an initial guess for to be
TODO justification
We still need a large initial step length. We could even forward-track until we get something that doesn’t meet the Armijo rule.
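A sketch of the whole loop with backtracking (the quadratic test function and the parameter values are my own):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, c=1e-4, tol=1e-8, max_iter=10_000):
    """Gradient descent with Armijo backtracking (halving the step length)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        alpha = alpha0
        # Armijo rule: accept once f drops below the line interpolating the
        # 0th- and 1st-order approximations at x.
        while f(x - alpha * g) > f(x) - c * alpha * g @ g:
            alpha /= 2
        x = x - alpha * g
    return x

f = lambda v: (v[0] - 1)**2 + 2 * (v[1] + 2)**2   # minimum at (1, -2)
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
xmin = gradient_descent(f, grad, [0.0, 0.0])
```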
TODO overall iteration
TODO termination condition.
TODO complexity
TODO convergence
Definition 6.2.3: Conjugate gradient descent uses conjugate directions that satisfy .
TODO justification, what does this mean?
Remark: We can also use Newton’s method:
Approximate by its second-order Taylor polynomial
Differentiating this gives
where .
Therefore . But remember that computing matrix inverses is slow, so it is better to solve the linear system for the step directly.
TODO expression for
This has problem-independent step length ( TODO why?).
TODO (dis)advantages.
Remark: Newton’s method doesn’t always give a descent direction.
Therefore Newton’s method can converge to maxima or saddle points just as readily as to minima.
TODO compare the two Newton’s methods (they’re the same)
Remark: We could use a quasi-Newton method whereby we use an approximation of the Hessian, e.g. by using Broyden’s method on the derivative. However, this is bad because it doesn’t necessarily give a symmetric approximation for the Hessian.
We want :
- to satisfy secant equation TODO
- to be symmetric
- to be close to
- to be positive definite.
Positive definiteness requires the curvature condition , which we can’t always guarantee.
Definition 6.2.4: BFGS update is TODO (not in spec).
If the curvature condition fails, reset .
Best to start with , and reset to the identity if the curvature condition fails. When the approximate Hessian is the identity, it is just gradient descent; so it does gradient descent until it gets to a convex section of the function, at which point it starts approximating the Hessian and goes a lot quicker.
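A sketch of the full quasi-Newton loop with the standard BFGS inverse-Hessian update (the update formula is the textbook one, included here since the notes leave it as a TODO; the quadratic test function is my own):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
    """Quasi-Newton minimisation keeping an approximation H of the inverse
    Hessian, starting from (and resetting to) the identity."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x
        d = -H @ g                       # with H = I this is plain gradient descent
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * g @ d:
            alpha /= 2                   # Armijo backtracking
        s = alpha * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s <= 0:                   # curvature condition fails: reset
            H = np.eye(n)
        else:                            # standard BFGS update of the inverse
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

x_star = bfgs(lambda v: (v[0] - 1)**2 + 2 * (v[1] + 2)**2,
              lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)]),
              [0.0, 0.0])               # converges to (1, -2)
```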
TODO advantages
TODO disadvantages