Remaining TODOs: 44
1. Derivatives and Taylor’s Theorem
1.1. Continuity and differentiation
Definition 1.1.1: Let .
is continuous at if .
is continuous if it is continuous at every .
Definition 1.1.2: Let .
is continuous at if .
is continuous if it is continuous at every .
Theorem 1.1.3: If and are continuous, so are:
- , when strictly positive
- , where strictly positive
- , where nonzero
Definition 1.1.4: Let .
is differentiable at if exists. Then is the derivative of at .
is differentiable if it is differentiable at every .
Definition 1.1.5: A secant is a line that connects two points on a surface.
A tangent can be thought of as a limit of secants as the distance between the two points tend to 0.
Theorem 1.1.6 (Quotient rule):
Theorem 1.1.7 (Derived quotient rule):
1.2. Taylor’s theorem for univariate functions
Theorem 1.2.1 (Taylor's Theorem (univariate)): For ,
for some .
The Taylor polynomial (of degree ) is also written , and the error term, or Lagrange remainder term is denoted by .
So, we can write for any so long as the first derivatives of all exist and are continuous on .
For small , .
1.3. Partial and vector derivatives
Definition 1.3.1: Let .
is the partial derivative of with respect to , as if were constant.
means differentiate first with respect to , then . In general this is not equal to , but often it is.
Definition 1.3.2: Let . Then the vector derivative is
It turns out that this makes sense to be a derivative, because
Remark: Some standard derivatives for vectors that are analogous to scalars:
Remark: Similarly to scalar derivatives, the vector derivative is linear and has a product and quotient rule.
The chain rule for is the same as for scalars:
For ,
where is the Jacobean - see Definition 1.4.2.
Remark: The type of the vector derivative of is
Also
Definition 1.3.3: Let . Then the Hessian of is
1.4. Vector fields and the Jacobian
Remark: A vector field can be decomposed into
Definition 1.4.2: The derivative of a vector field is the Jacobean, , where
Remark: For ,
Definition 1.4.3: The Jacobian has a dot-product rule:
Definition 1.4.4: The Jacobian has a scalar-vector product rule:
Definition 1.4.5: The Jacobian has a chain rule, called the partial chain rule:
Remark: Some standard Jacobeans:
Remark: The Hessian is a Jacobean:
Luckily because of Clairaut’s Theorem, the transpose can usually be ignored.
Note that this basically just says that the second derivative is the derivative of the derivative.
Remark: Given vectors ,
where the rows of are the s.
Moreover, given scalars ,
with as before and a diagonal matrix with the s as the diagonal entries.
These identities can come in handy when computing the Jacobean.
1.5. Taylor’s theorem for multivariate functions
Theorem 1.5.1 (Taylor's Theorem for multivariate functions): For , if ,
Here, the brackets with partial derivatives inside are carrying out operator algebra - they are not multiplication, but follow the same distributive rules so can be though of as so.
The Taylor polynomials up to degree 2 can be written a bit more nicely:
For higher powers, we need tensors which are beyond the scope of this course.
The error terms (up to ) can be written in a similar fashion but evaluated at for some .
Theorem 1.5.2: is small in the sense that
(Proof not needed.)
2. Optimisation
Definition 2.1: Let .
The canonical optimisation problem is to find
.
Note that is a set.
is the feasible region.
If , the optimisation is unconstrained.
If , the optimisation is constrained. The standard form for constraints is:
2.1. Optimisation in 1 dimension
Solutions to satisfy the 1st-order condition . But so do non-solutions!
Stationary points can be classified by Taylor’s theorem.
If , then for some . So if the second derivative is positive, then for some region around , around , so is a minimum. Similarly, if the second derivative is negative, then around so is a maximum.
If the second derivative is 0, this doesn’t necessarily mean that it is a point of inflection; we need to look for the first nonzero derivative. If the first nonzero derivative is odd, then it is a stationary point of inflection - since the function changes concavity around the stationary point - but if the first nonzero derivative is an even derivative, use the same rules as for the second derivative.
However, this doesn’t always work (e.g. has all of its derivatives at 0 equal to 0).
For constrained optimisations, do the same but also consider the endpoints.
2.2. Positive/negative (semi)definiteness
Definition 2.2.1: Given a symmetric matrix , is
- positive definite if for all
- positive semidefinite if for all
- negative definite if for all
- negative semidefinite if for all
- indefinite otherwise
Theorem 2.2.2 (Spectral theorem): If is an symmetric matrix, then there exists a basis (for ) of orthogonal eigenvectors of .
Proposition 2.2.3:
These are equivalent to certain conditions on the eigenvalues: positive definite iff all eigenvalues strictly positive, positive semidefinite iff all eigenvalues positive, etc.
Proof (for positive definiteness):
Let be a symmetric matrix. By the spectral theorem, there exists an orthogonal basis of ; w.l.o.g. let the s be unit vectors. Let be the corresponding eigenvalues.
First suppose that . By the spectral theorem, forms an orthogonal basis for , so for any , s.t. .
Then for any ,
Since and the s are strictly positive, and there exists at least one non-zero , the sum of the products of positive numbers must be positive, so is positive definite.
Conversely, suppose that is positive definite.
Then, for all ,
Remark:
Given a symmetric matrix with eigenvalues ,
Hence
- indefinite;
- positive definite;
- positive semidefinite
- negative definite;
- negative semidefinite.
Remark: A symmetric matrix is
- positive definite if all its pivots
- positive semidefinite if all its pivots are
- negative definite if all its pivots
- negative semidefinite if all its pivots
- indefinite otherwise
but the definiteness conditions hold only if the pivots can be found without row swaps
Remark: If is positive (semi)definite then so are
- for any
- , which is guaranteed to exist if is positive definite
- for any positive (semi)definite
- any upper-left submatrix of TODO intuition
- , if is of full rank, regardless of if is square or not - but if not of full rank, we can still guarantee positive semidefiniteness TODO intuition
- , if
TODO: lookup: diagonally dominant
2.3. Unconstrained optimisation over
Solutions to , if they exist, satisfy the 1st-order condition . This is in general not easy to solve as it is a system of (not necessarily linear) equations.
In higher dimensions, stationary points can be local minima, local maxima or saddle points.
We can classify these higher-dimensional stationary points using the Hessian, similarly to how we used the second derivative for scalar functions.
Hence if the Hessian at is positive definite, the stationary point at is a local minimum. If the Hessian is negative definite, the stationary point is a local maximum. If the Hessian is indefinite, the stationary point is a saddle point.
If the Hessian is semidefinite, you cannot determine anything other than that the point is not a local minimum/maximum. For better analysis we would need to use the 3rd-order Taylor approximation.
Fact: the outer product of a vector with itself is always positive semidefinite. Proof: .
2.4. Convexity
Definition 2.4.1: A set is convex if
That is, the line connecting any lies wholly in .
Definition 2.4.2: A function for convex is convex if
That is, any point on between and lies below the secant between and .
Equivalently, every secant on lies above the surface of .
If the inequality is strict, is strictly convex if the inequality is the other way around, is concave or strictly concave in a similar manner.
Remark:
Some common convex and concave functions:
- Linear functions are both convex and concave.
- for is strictly convex
- is strictly convex
- is strictly concave
- for convex , is concave
- is convex if is positive semidefinite, strictly convex if is positive definite
- for convex , is convex
- for convex , is convex (but strict convexity is only preserved if has full rank)
- for convex , is convex
- for convex , is convex
Theorem 2.4.5: For ,
Proof: convex iff everywhere.
is always positive, so we require to be positive everywhere, i.e. is convex. We then require and to have the same signs everywhere, so either is increasing and is convex, or is decreasing and is concave.
The proof for concavity is similar.
Theorem 2.4.6:
Let , such that
with .
Then if is (strictly) convex, and for each , either
then is (strictly) convex.
Theorem 2.4.7 (Jenson's inequality): For a random variable and convex function ,
(This is not on the syllabus but is useful)
2.5. Optimisation tricks
Theorem 2.5.1:If is strictly increasing, then
Theorem 2.5.2: If is injective/1-to-1 then
2.6. Optimisation over with equality constraints
Remark: Given a minimisation problem with an equality constraint, it can sometimes be reduced to an unconstrained problem:
-
Given ,
-
Given ,
But usually, this isn’t the case.
Theorem 2.6.1 (Lagrange's theorem / method of Lagrange multipliers): For a minimisation problem with one equality constraint, minimise subject to , the stationary points must satisfy
i.e. a first order condition is .
Proof:
TODO
Definition 2.6.2: The Lagrangian of such a minimisation problem is
The stationary points of where are the stationary points of .
TODO: look up bordered Hessian
Theorem 2.6.3: For ,
the Lagrangian is
Then the stationary points of the minimisation problem are the stationary points of the Lagrangian.
2.7. Inequality contrained optimisation (over )
Definition 2.7.1: Given a minimisation problem subject to , at the solution, is tight if , and slack if .
To solve such a minimisation, solve:
- for tight constraints via the previous section;
- for slack constraints, by solving the usual first-order condition (as usual), and ignoring any solutions outside of the feasible region.
Either way, , with possibly zero.
Theorem 2.7.2: The Lagrange multiplier is always positive.
Corollary 2.7.3 (Complementary slackness): At a minimum of an minimisation problem with constraint and Lagrange multiplier ,
Equivalently, .
TODO KKT conditions (not on syllabus)
TODO: what to do when mix of equality and inequality constraints
3. Algorithms for numerical integration
3.1. 1-dimensional integration
Here we will consider approximating the integral of over a small interval , with .
The simplest approximation is the 0th order Taylor approximation; this is equal to the approximation using the 1st order Taylor approximation.
Definition 3.1.1: The midpoint rule: define
The subscript 1 indicates that it is operating over 1 strip. The notation for this (and other methods in this section) is not standard.
Theorem 3.1.2: The error of the midpoint rule,
is bounded by
where
Proof:
We can bound by , so
Definition 3.1.4: The Trapezium rule creates a trapezium with the axis and the endpoints of the strip. It is generally worse than the midpoint rule, so will not be assessed in the course.
Theorem 3.1.5: The error bound for the trapezium rule is
Definition 3.1.6: Simpson’s rule interpolates a parabola from the endpoints and midpoints. The derivation is quite complicated but the resulting formula is very simple: it is just a weighted average. Although this is over a single strip, Simpson’s rule is usually considered to operate over 2 strips, for notational consistency (as we will see later).
Theorem 3.1.7: The error bound for Simpson’s rule is
Higher order Taylor approximations might not be a great choice for a better approximation because:
- you need to evaluate high derivatives
- Taylor approximations are local
- functions might not have enough derivatives/be analytic
Higher-order polynomial interpolation also might not be a good idea:
- it is not necessarily numerically unstable (as commonly believed), but it can be if you do it badly
- the Runge phenomenon - some functions get worse and worse approximations as the order of the interpolated polynomial increases because polynomial interpolation works worse near the ends of the interval
- but Chebyshev interpolations are much better! These have interpolation points closer together near to the ends of the interval
Definition 3.1.8: The Runge function is
Higher-order polynomial interpolation on the Runge function, using evenly spaced points, is really bad.
A better method is just to split the interval into smaller strips and use an easier method on each of them. This is guaranteed to converge to the right answer regardless of which method you use, so long as the function is continuous.
Definition 3.1.9: The composite midpoint rule is essential just an average of all the heights of the endpoints of strips, multiplied by the width of the region.
Theorem 3.1.10: The error bound for the composite midpoint rule is
Definition 3.1.11: The composite Simpson’s rule is again akin to a weighted average, now with coefficients , with the 1s and 4s as before and the 2s coming from where the endpoints are repeated.
Theorem 3.1.12: The error bound for the composite Simpson’s rule is
We can make improvements on these methods:
- Pick more optimal strip endpoints (Gaussian quadrature - TODO what is this?)
- Extrapolate a sequence of or (Romberg integration - TODO what is this?)
-
Recursive subdivision (reuse evaluation, do more [via thinner strips] on difficult bits) - based on 4th derivative being large, or use the a posteriori error estimate
TODO proof, how, why??
- Or just do more work where is large, as the errors have a greater effect there!
Also note that these methods can’t help with integrals with in either of the limits.
3.2. Integration in dimensions
In two dimensions, is the volume under a surface in region .
Theorem 3.2.1 (Fubini's Theorem):
for sufficiently nice and (we need to be able to express the bounds nicely). This “split” integral is called an iterated integral.
This extends to higher dimensions as expected.
As long as everything is integrable, the order doesn’t matter.
We can use the methods we already know to approximate higher-dimensional integrals, as long as we can split the integral into iterated integrals.
Remark: The midpoint rule can be extended to multiple dimensions by converting into an iterated integral via Fubini’s theorem, and then applying the 1-dimensional midpoint rule to successive inner integrals. For a -dimensional integral of the function over the unit hypercube, the midpoint rule with strips in each dimension gives
This is just taking the average of the value of at the midpoints of each hypercube in a grid over the region.
Remark: Simpson’s rule can also be extended to higher dimensions: for 2 dimensions,
where
Remark: In dimensions,
This seems quite good, but we do evaluations of the function; so relative to the work, the error becomes steadily worse.
This is called the “curse of dimensionality”… all of these methods improve extremely slowly in high dimensions.
Definition 3.2.2: Monte Carlo integration does the midpoint rule, but using random points according to some distribution.
Consider sampling random points along a scalar function:
Therefore, given i.i.d. according to , for large enough ,
We extend this to dimensions, using the notation
for large to approximate , with i.i.d. uniformly distributed across the region , and the area of the region .
Lemma 3.2.3:
Monte Carlo integration is unbiased (on average it is correct).
Proof: Take i.i.d. with pdf
Then
Proposition 3.2.4:
for a constant . Moreover, can be approximated by that can be expressed in terms of , , and the s.
Proof:
Definition 3.2.5: The standard error is the square root of the estimated variance,
Lemma 3.2.6:
Monte Carlo integration is consistent:
Proof: Let .
Then
Then by Chebyshev’s inequality,
Theorem 3.2.7 (Central Limit Theorem): If i.i.d., then
(This is not a precise statement.)
That is, a large sample of i.i.d. random variables tends to a normal distribution.
Lemma 3.2.8:
For large , the error is distributed approximately according to .
Proof:
Then and .
Then
Remark:
- There is no guarantee on the error (you could get really unlucky)
- Sampling even from strangely shaped regions is actually easy - just sample from a tight box around it, and reject any samples that are outside the region
- This sampling technique could be quite time consuming in higher dimensions
- Finding the area of the region could be quite difficult - but we can estimate it by considering the size of the sampling box and also how many samples were rejected. (Note that this is also a Monte Carlo integral! of the function ). We therefore need to adjust the standard error.
Remark: Some improvements we could make:
- Divide into subregions, perhaps recursively (stratified sampling)
-
Sample some parts of more densely, using the pdf of as , then
(importance sampling)
- Subtract an easy integral (a control variate) to make the Monte Carlo integral smaller, so that the error is smaller
These don’t change the asymptotic error, but can give a better constant term.
4. Accuracy
Definition 4.1: Approximating by ,
- is the error
- is the absolute error
- is the relative error.
For vector , replace with the Euclidean norm .
Definition 4.2: If is an ideal function of , and is an approximate implementation of , then
- is the forward error
- is the backward error, where .
Definition 4.3: A floating-point number is a binary rational approximation to real numbers:
where is the mantissa, is the fixed precision (of the mantissa), and is the exponent.
The IEEE 754 standard specifies:
- single precision floats: 23 bit mantissa precision, 8 bit exponent, 1 sign bit
- double precision: 52 bit mantissa precision, 11 bit exponent, 1 sign bit
Definition 4.4: Machine epsilon is the such that every is representable with a relative error of no more than .
Note that is the relative error when we try to represent
Then
TODO go back over this
Machine epsilon is therefore for floats, and for doubles.
Some people define to be the smallest such that is representable. This is double the definition used here.
Definition 4.5: Truncation error occurs when approximating an infinite process by a finite process.
Roundoff error is due to floating-point storage and computation.
4.1. Iterative methods
Definition 4.1.1: Starting with an initial estimate approximating unknown , we repeatedly refine it with , stopping when some criterion is met, e.g. < delta.
The error at step is .
If as :
-
has linear convergence if for some ,
-
has sublinear convergence if
-
-
in particular has logarithmic convergence if
-
-
converges superlinearly if
-
-
in particular has order- convergence for if
for some
-
Example:
- converges linearly
- converges logarithmically (a straight line on a log-log graph)
- converges superlinearly, and in fact converges quadratically. On a log plot, this is an inverted parabola.
Lemma 4.1.2: If converges with order , then converges with order .
5. Algorithms for root-finding
5.1. 1-dimensional root-finding
Definition 5.1.2: A bracket is an interval with
Remark: We can effectively use binary search to find roots using brackets. However, we might not find an exact zero (because floating point), so we terminate when the interval is small enough, guess is the midpoint of the interval, and can bound by the bracket.
This is called interval bisection.
Iterating, we get a sequence with