|
|
Training
Course Notes Front Page >>
Table of contents >>
Next Section >>
Previous Section >>
4 . Statistical interpolation with least-squares
estimation
In this section we present the fundamental equation for
linear analysis in a general algebraic form: the least squares estimation,
also called Best Linear Unbiased Estimator (BLUE). The following
sections will provide more explanations and illustrations, and we shall
see how the least-squares estimation can be simplified to yield the most
common algorithms used nowadays in meteorology and oceanography.
4.1 Notation and hypotheses
The dimension of the model state is and the
dimension of the observation vector is . We will
denote:
|
true model
state (dimension ) |
|
background model
state (dimension ) |
|
analysis
model state (dimension ) |
|
vector of observations
(dimension ) |
|
observation
operator (from dimension to ) |
|
covariance matrix
of the background errors (dimension ) |
|
covariance matrix
of observation errors (dimension ) |
|
covariance matrix
of the analysis errors (dimension ) |
The following hypotheses are assumed:
| |
• Linearized observation
operator: the variations of the observation operator in the
vicinity of the background state are linear: for any close enough to , where is a linear operator. |
| |
• Non-trivial errors:
and are positive definite matrices. |
| |
• Unbiased errors:
the expectation of the background and observation errors is zero i.e.
 |
| |
• Uncorrelated
errors: observation and background errors are mutually uncorrelated
i.e.  |
| |
• Linear analysis:
we look for an analysis defined by corrections to the background which
depend linearly on background observation departures. |
| |
• Optimal analysis:
we look for an analysis state which is as close as possible to the
true state in an r.m.s. sense (i.e. it is a minimum variance estimate). |
ref: Daley
1991; Lorenc
1986; Ghil 1989
4.2 Theorem: least-squares analysis equations
|
(a) The optimal least-squares
estimator, or BLUE analysis, is defined by the
following interpolation equations: |
|
where the linear operator
is called the gain, or weight matrix, of the analysis. |
|
(a) The analysis error
covariance matrix is, for any : |
|
If is the optimal
least-squares gain, the expression becomes |
|
(a) The BLUE analysis is
equivalently obtained as a solution to the variational optimization
problem: |
|
where is called
the cost function of the analysis (or misfit,
or penalty function), is the
background term, is the
observation term. |
|
(a) The analysis is optimal: it is closest in an r.m.s. sense to the
true state . |
|
(b) If the background and
observation error pdfs are Gaussian, then is also
the maximum likelihood estimator of . |
|
Proof:
|
With a translation of by , we can assume that so the observation
operator is linear for our purposes. The equation (A1)
is simply a mathematical expression of the fact that we want the analysis
to depend linearly on the observation departures. The expression of
in (A2) is well-defined
because is a positive definite matrix, and is positive. The minimization problem (A5) is well-defined because is a convex function and is a strictly
convex function (it is a quadratic form). |
|
The equivalence between items
(a) and (c) of the theorem stems from the requirement that the gradient
of is zero at the optimum : |
|
The identity with (A2) is straightforward to prove (all inverse
matrices considered are positive definite): |
|
The expressions (A3) and (A4) for are obtained
by rewriting the analysis equation (A1) in terms of the background, analysis and
observation errors: |
|
By developing the expression
of and taking its expectation, by linearity of the expectation operator
one finds the general expression (A3) (remember that and being uncorrelated, their cross-covariance is zero).
The simpler form (A4) is easy to derive by substituting the expression
for the optimal and simplifying
the terms that cancel. |
|
Finally to prove (A2) itself we take the analysis error covariance
matrix given by (A3) and we minimize its trace, i.e. the total
error variance: (note that and ) |
|
This is a continuous differentiable
scalar function of the coefficients of , so we can
express its derivative as the first- order terms in of the difference , being an arbitrary
test matrix: |
|
The last line shows that the
derivative is zero for any choice of if and only
if , which is equivalent to |
|
because is assumed to
be invertible. |
In the case of Gaussian pdfs, one can model the background,
observation and analysis pdfs as follows, respectively: ( , and are normalization
factors.)
| |
|
|
which yields the right averages and covariances for the
background and observations errors, and the analysis error pdf is simply
defined as the Bayesian product of the two known sources of information,
the background and the observation pdfs (this can be derived rigorously
by using Bayes' theorem to write as a conditional probability of given the observations
and the a priori pdf of the background). Then, by taking minus the logarithm
of , one finds
that the model state with the maximum probability (or likelihood)
is the one that minimizes the cost function expressed in the theorem.
4.3 Comments
The hypotheses of non-triviality can always been made in
well-posed analysis problems: if is non-positive,
one can restrict the control space to the orthogonal of the kernel of (the analysis will not make any correction
to background variables that are perfectly known). If is not a surjection, then some observations are redundant and
the observing network shall be restricted to the image of . If is non-positive, the expression
(A2) for still holds (then the analysis will be equal to the observed
value at the observation points ), but the variational
version of the least-squares analysis cannot be used. It is even possible
(with some algebraic precautions) to have some infinite eigenvalues in , i.e. a non-positive , which means that some observations are not used because their errors
are infinite.
The hypothesis of unbiased errors is a difficult
one in practice because there often are significant biases in the background
fields (caused by biases in the forecast model) and in the observations
(or in the observation operators). If the biases are known, they can be
subtracted from the background and observation values, and the above algebra
applies to the debiased quantities. If the biases are left in, the analysis
will not be optimal, even though it will seem to reduce the biases by interpolating
between the background and observations. It is important to monitor the
biases in an assimilation system, e.g. by looking at averages of background
departures, but it is not trivial to decide which part of these are model
or observation biases. The problem of bias monitoring and removal is the
subject of ongoing research.
The hypothesis of uncorrelated errors is usually
justified because the causes of errors in the background and in the observations
are supposed to be completely independent. However, one must be careful
about observation preprocessing practices (such as satellite retrieval procedures)
that use the background field in a way that biases the observations toward
the background. It might reduce the apparent background departures, but
it will cause the analysis to be suboptimal (too close to the background,
a condition nicknamed as the incest problem).
The tangent linear hypothesis is not trivial and
it is commented in the next section.
It is possible to rewrite the least-squares analysis equations
in terms of the inverses of the error covariance matrices, called
information matrices. It makes the algebra a bit more complicated,
but it allows one to see clearly that the information contained
in the analysis is the sum, in a simple sense, of the observations provided
by the background and by the observations. This is illustrated in the section
on the estimation of analysis quality below.
It will be shown in the section on dual algorithms (PSAS
analysis) that the equations, and in particular the cost function , can be rewritten in the space of the observations
. Also, it is easy to that least-squares analysis
is closely related to a linear regression between model state and observations.
4.4 On the tangent linear hypothesis
The hypothesis of linearized observation operator
is needed in order to derive a rigorous algebraic expression for the optimal
. In practice, may not
be linear, but it usually makes physical sense to linearize it in the vicinity
of the background state:
| |
|
|
Then, being a continuous function of , the least-squares
equations for the analysis should intuitively yield a nearly optimal .
More generally, the tangent linear hypothesis
on can be written as the first-order Taylor-Young
formula in the vicinity of an arbitrary state and for a perturbation :
| |
,
|
|
with . This hypothesis, called the tangent linear hypothesis
is only acceptable if the higher-order variations of can be neglected (in particular there should be no discontinuities)
for all perturbations of the model state which have the same order of magnitude
as the background errors. The operator is called the differential, or first
derivative, or tangent linear (TL)1 function of at point
. Although this is a desirable mathematical property of
, it is not enough for practical purposes, because
the approximation
must be satisfactory, in user-defined terms, for finite values of that depend on the application considered. In the least-squares
analysis problem, we need
| |
|
|
for all values of that will be encountered in the analysis procedure,
notably , , and
also all trial values used in the minimization of if a variational analysis is performed2. Thus the important requirement is that the difference
between and should be much smaller than the typical
observation errors (defined by ), for all
model state perturbations of size
and structure consistent with typical background errors, and also with the
amplitude of the analysis increments .
Thus the problem of linearizing is not
just related to the observation errors themselves. It must be appreciated
in terms of the errors in the background too, which in a sequential assimilation system are the previous
forecast errors, which depend on the forecast range and the quality of the
model. Ultimately the correctness of the linearization must be appreciated
in the context of the fully integrated assimilation system. It will be easier
to apply the linearization to a good system because the departures will be
smaller. Conversely, the linearization may be inapplicable to difficult
data assimilation problems. This is often the case with ocean models or
satellite data, which means that it can be wrong to use sophisticated analysis
algorithms that rely too much on the linearity of the problem.
The linearization problem can be even more acute for the
linearization of the model forecast operator which is
needed in 4D-Var and in the Kalman filter described below. As with the linearization
of , it may
or may not be licit depending on the quality of all components of the assimilation
system: data coverage, observation quality, model resolution and physics,
and forecast range. The user requirements and the physical properties of
the system must be considered.
The non-linear analysis problem
The assumption of linear analysis is a strong one. Linear
algebra is needed to derive the optimal analysis equations. One can rely
on the linearization of a weakly non-linear observation operator, at the
expense of optimality. The incremental method (described below
for the variational analysis) performs this procedure iteratively in an
empirical attempt to make the analysis more optimal. For strongly non-linear
problems, there is no general and simple way to calculate the optimal analysis.
The simulated annealing method can be useful; specific methods,
such as the simplex, deal with variables with bounded definition
domains. Finally, it is sometimes possible to make a problem more linear
simply by a clever definition of model and observation variables (see the
section on minimization methods).
4.5 The point of view of conditional probabilities
It is interesting to formalize the analysis problem using
the conditional, or Bayesian, probabilities. Let us denote the a priori pdf (probability density function) of the
model state before the observations are considered, i.e. the background
pdf. Let us denote the pdf of the
observations. The aim of the analysis is to find the maximum of , the conditional probability of the model state given the
observations. The joint pdf of and (i.e. the probability that and occur together) is
| |
|
|
i.e. it is the probability that occurs
when occurs, and vice versa. The above expression
is the Bayes theorem. In the analysis procedure we know that a measurement
has been made and we know its value , so and we obtain
which means that the analysis pdf is equal to the background pdf times the
observation pdf . The latter peaks at but it is not
a Dirac distribution because the observations are not error-free.
The virtue of the probabilistic derivation of the analysis
problem is that it can be extended to non-Gaussian probabilities (although
this spoils the equivalence with the (A2) equation for ). A practical application is done in the framework
of variational quality control, where it is assumed that observation errors
are not Gaussian but they contain some amount of "gross errors", i.e. there
is a probability that the error is not generated by the usual Gaussian physical
processes but by some more serious problem, like coding or instrumental
failure. The gross errors might be modelled using a uniform pdf over a predefined
interval of admissible gross errors, leading to a non-Gaussian observation
pdf. When the opposite of the logarithm of this pdf is taken, the resulting
observation cost function is not quadratic, but gives less weight to the
observation (i.e. there is less slope) for model states that disagree strongly
with the observed value.
ref: Lorenc 1986
4.6 Numerical cost of least-squares analysis
In current operational meteorological models, the dimension
of the model state (or, more precisely, of the control variable space) is of the order of , and the dimension
of the observation vector (the number of observed scalars) is of the order
of per analysis3. Therefore the analysis problem is
mathematically underdetermined (although in some regions it might be overdetermined
if the density of the observations is larger than the resolution of the
model). In any practical application it is essential to keep in mind the
size of the matrix operators involved in computing the analysis (Fig. 4 ). The least-squares analysis method requires in principle the specification
of covariance matrices and (or their inverses in the variational form of the algorithm) which
respectively contain of the order of and distinct coefficients, which are statistics to estimate (the estimation
of a variance or covariance statistic converges like the square root of
the number of realizations). The explicit determination of requires the inversion of a matrix of size
, which has an asymptotic complexity of the
order of . The exact minimization of the cost function
requires, in principle, evaluations of the cost function and its gradient, assuming
is quadratic and there are no numerical errors (e.g.
using a conjugate gradient method).
Figure 4 . Sketches of the shapes of the matrices
and vector dimensions involved in an usual analysis problem where there
are many fewer observations than degrees of freedom in the model: from top
to bottom, in the equations of the linear analysis, the computation of , of the term, and the computation of the cost function .
It is obvious that, except in analysis problems of very
small dimension (like one-dimensional retrievals), it is impossible to compute
exactly the least-squares analysis. Some approximations are necessary, they
are the subject of the following sections.
4.7 Conclusion
We have seen that there are two main ways of defining the
statistical analysis problem:
| |
• either assume that the
background and error covariances are known, and derive the analysis
equations by requiring that the total analysis error variances are
minimum, |
| |
• or assume that the background
and observation error pdfs are Gaussian, and derive the analysis equations
by looking for the state with the maximum probability. |
Both approaches lead to two mathematically equivalent algorithms:
| |
• the direct determination
of the analysis gain matrix , |
| |
• the minimization of a
quadratic cost function. |
These algorithms have very different numerical properties,
and their equivalence stops as soon as some underlying hypotheses are not
verified, like the linearization of the observation operator, for instance.
Training Course Notes Front Page >>
Table of contents >>
Next Section >>
Previous Section >>
1 Both qualifiers tangent and linear
are needed: obviously could be linear
without satisfying the Taylor formula. A function can also be tangent to
another without being linear, if the difference between them is an , e.g. and are tangent to each other for .
2 Qualitatively speaking they all belong to
a neighbourhood of having a shape and size which is
consistent with the and error covariances.
3 At ECMWF in winter 1998 the control variable
dimension was 512000, the number of active observations (per 6-hour interval)
was about 150000
|