
Predicting uncertainty in forecasts of weather and climate
(Also published as ECMWF Technical Memorandum No. 294)
By T.N. Palmer

Research Department, ECMWF

November 1999



 






8. Verifying forecasts of uncertainty


As discussed, the output from an ensemble forecast can be used to construct a probabilistic prediction. In this section, we discuss two basic measures of skill for assessing a probability forecast: the Brier score and the relative operating characteristic. Both of these measures are based on the skill of probabilistic forecasts of a binary event E, as discussed in Section 7 above. For example, E could be: temperatures will fall below 0°C in three days' time; average rainfall for the next three months will be at least one standard deviation below normal; seasonal-mean rainfall will be below average and temperature above average; and so on.

8.1 The Brier score and its decomposition


Consider an event E which, for a particular ensemble forecast, occurs a fraction $p$ of times within the ensemble. If E actually occurred then let $v = 1$, otherwise $v = 0$. Repeat this over a sample of $N$ different ensemble forecasts, so that $p_i$ is the probability of E in the $i$th ensemble forecast and $v_i = 1$ or $v_i = 0$, depending on whether E occurred or not in the $i$th verification ($i = 1, 2, \ldots, N$).

The Brier score (Wilks, 1995) is defined by

 
$$b = \frac{1}{N} \sum_{i=1}^{N} (p_i - v_i)^2 \tag{47}$$
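As a minimal sketch of the definition, the Brier score is the mean squared difference between the forecast probabilities and the binary outcomes. The function name and the four-forecast sample below are illustrative assumptions, not data from these notes.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - v) ** 2 for p, v in zip(probs, outcomes)) / len(probs)


# Hypothetical sample: fraction of ensemble members predicting E,
# and whether E was subsequently observed (1) or not (0).
p = [0.9, 0.1, 0.6, 0.0]
v = [1, 0, 0, 0]
print(brier_score(p, v))
```

A forecast that always assigns probability 1 to events that occur and 0 to events that do not gives a score of exactly zero.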


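From its definition $0 \le b \le 1$, equalling zero only in the ideal limit of a perfect deterministic forecast. For a large enough sample, the Brier score can be written as

$$b = \int_0^1 [p - 1]^2 \, o(p) \, \rho_{\mathrm{ens}}(p) \, dp + \int_0^1 p^2 \, [1 - o(p)] \, \rho_{\mathrm{ens}}(p) \, dp \tag{48}$$

where $\rho_{\mathrm{ens}}(p) \, dp$ is the relative frequency that E was forecast with probability between $p$ and $p + dp$, and $o(p)$ gives the proportion of such cases when E actually occurred. To see the relationship between (47) and (48), note that the first integral in (48) is the Brier score for ensembles where E actually occurred, and the second integral is the Brier score for ensembles where E did not occur.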

Simple algebra on (48) gives Murphy's (1973) decomposition

 
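$$b = b_{\mathrm{rel}} - b_{\mathrm{res}} + b_{\mathrm{unc}} \tag{49}$$

of the Brier score, where

$$b_{\mathrm{rel}} = \int_0^1 [p - o(p)]^2 \, \rho_{\mathrm{ens}}(p) \, dp \tag{50}$$

is the reliability component,

$$b_{\mathrm{res}} = \int_0^1 [\bar{o} - o(p)]^2 \, \rho_{\mathrm{ens}}(p) \, dp \tag{51}$$

is the resolution component,

$$b_{\mathrm{unc}} = \bar{o} \, [1 - \bar{o}] \tag{52}$$

is the uncertainty component, and

$$\bar{o} = \int_0^1 o(p) \, \rho_{\mathrm{ens}}(p) \, dp \tag{53}$$

is the (sample) climatological frequency of E.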
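The decomposition can be checked numerically. The sketch below is a discrete analogue of the integrals: it groups forecasts by their exact probability value, so the relative size of each group plays the role of the forecast pdf, and the observed frequency within each group plays the role of the conditional occurrence rate. The function name and the sample are assumptions made for illustration. With grouping by exact value the identity "reliability minus resolution plus uncertainty equals the Brier score" holds to rounding error; with coarser bins it holds only approximately.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for 0/1 outcomes."""
    n = len(probs)
    groups = defaultdict(list)
    for p, v in zip(probs, outcomes):
        groups[p].append(v)
    o_bar = sum(outcomes) / n              # sample climatological frequency
    rel = res = 0.0
    for p, vs in groups.items():
        weight = len(vs) / n               # how often probability p was forecast
        o_p = sum(vs) / len(vs)            # how often E occurred when p was forecast
        rel += weight * (p - o_p) ** 2
        res += weight * (o_bar - o_p) ** 2
    unc = o_bar * (1.0 - o_bar)
    return rel, res, unc


# Hypothetical sample of eight forecasts using two probability values.
p = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
v = [0, 0, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(p, v)
brier = sum((pi - vi) ** 2 for pi, vi in zip(p, v)) / len(p)
print(rel, res, unc, rel - res + unc, brier)
```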

A reliability diagram (Wilks, 1995) is one in which $o(p)$ is plotted against $p$ for some finite binning of width $\delta p$. In a perfectly reliable system $o(p) = p$ and the graph is a straight line oriented at 45° to the axes, and $b_{\mathrm{rel}} = 0$. Reliability measures the mean square distance of the graph of $o(p)$ to the diagonal line.

Resolution measures the mean square distance of the graph of $o(p)$ to the sample climate horizontal line. A system with relatively high $b_{\mathrm{res}}$ is one where the dispersion of $o(p)$ about $\bar{o}$ is as large as possible. Conversely, a forecast system has no resolution when, for all forecast probabilities, the event verifies a fraction $\bar{o}$ of the time.

The term $b_{\mathrm{unc}}$ on the right-hand side of (49) ranges from 0 to 0.25. If E was either so common, or so rare, that it either always occurred or never occurred within the sample of years studied, then $b_{\mathrm{unc}} = 0$; conversely, if E occurred 50% of the time within the sample, then $b_{\mathrm{unc}} = 0.25$. Uncertainty is a function of the climatological frequency of E, and is not dependent on the forecasting system itself. It can be shown that the resolution of a perfect deterministic system is equal to the uncertainty.

When assessing the skill of a forecast system, it is often desirable to compare it with the skill of a forecast where the climatological probability $\bar{o}$ is always predicted (so $\rho_{\mathrm{ens}}(p) = \delta(p - \bar{o})$). The Brier score of such a climatological forecast is $b_{\mathrm{cli}} = b_{\mathrm{unc}}$ (using the sample climate), since, for such a climatological forecast, $b_{\mathrm{rel}} = b_{\mathrm{res}} = 0$. In terms of this, the Brier skill score, $B$, of a given forecast system is defined by

$$B = 1 - \frac{b}{b_{\mathrm{cli}}} \tag{54}$$

$B \le 0$ for a forecast no better than climatology, and $B = 1$ for a perfect deterministic forecast.

Skill-score definitions can similarly be given for reliability and resolution, i.e.

 
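$$B_{\mathrm{rel}} = 1 - \frac{b_{\mathrm{rel}}}{b_{\mathrm{cli}}} \tag{55}$$

$$B_{\mathrm{res}} = \frac{b_{\mathrm{res}}}{b_{\mathrm{cli}}} \tag{56}$$

For a perfect deterministic forecast system, $B_{\mathrm{rel}} = B_{\mathrm{res}} = 1$. Hence, from Eqs. (49) and (54)

$$B = B_{\mathrm{res}} + B_{\mathrm{rel}} - 1 \tag{57}$$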
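The relation between the three skill scores can be verified with a few lines of arithmetic. The sketch below takes the reliability, resolution and uncertainty components as inputs and uses the uncertainty as the Brier score of the climatological reference forecast, as stated in the text. The function name and the input values are illustrative assumptions.

```python
def brier_skill_scores(b_rel, b_res, b_unc):
    """Return (B, B_rel, B_res) relative to a climatological forecast."""
    if b_unc <= 0.0:
        raise ValueError("uncertainty is zero: E always or never occurred in the sample")
    b = b_rel - b_res + b_unc          # Eq. (49)
    B = 1.0 - b / b_unc                # Eq. (54), climatological score equals b_unc
    B_rel = 1.0 - b_rel / b_unc        # Eq. (55)
    B_res = b_res / b_unc              # Eq. (56)
    return B, B_rel, B_res


# Hypothetical components (not taken from Fig. 13).
B, B_rel, B_res = brier_skill_scores(0.0025, 0.0625, 0.25)
print(B, B_rel, B_res, B_res + B_rel - 1.0)
```

The first and last printed values agree to rounding error, which is the content of Eq. (57).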


Fig. 13 shows two examples of reliability diagrams for the ECMWF EPS taken over all day-6 forecasts from December 1998 - February 1999 over Europe (cf. Fig. 8). The events are $E_{>4}$ and $E_{>8}$: lower-tropospheric temperature being at least 4°C and at least 8°C greater than normal, respectively. The Brier score, Brier skill score, and Murphy decomposition are shown on the figure.


Figure 13. Reliability diagram and related Brier score, Brier skill score and Murphy decomposition for the events: (a) 850 hPa temperature is at least 4°C above normal and (b) at least 8°C above normal, based on 6-day forecasts over Europe from the 50-member ECMWF ensemble prediction system from December 1998 - February 1999. Also shown is the pdf $\rho_{\mathrm{ens}}(p)$ for the event in question.



The reliability skill score $B_{\mathrm{rel}}$ is extremely high for both events. However, the reliability diagrams indicate some overconfidence in the forecasts. For example, on those occasions when the 4°C event $E_{>4}$ was forecast with a probability between 80% and 90%, the event verified only about 72% of the time. However, it should be remembered that the integrand in Eq. (50) is weighted by the pdf $\rho_{\mathrm{ens}}(p)$, shown in each reliability diagram. In both cases, forecasts with such high probabilities are relatively rare and hence contribute little to $b_{\mathrm{rel}}$.

To see why probability forecasts of the 4°C event $E_{>4}$ have higher Brier skill scores than probability forecasts of the 8°C event $E_{>8}$, consider Eq. (57). From Fig. 13, whilst $B_{\mathrm{rel}}$ is the same for both events, $B_{\mathrm{res}}$ is larger for $E_{>4}$ than for $E_{>8}$. This can be seen by comparing the histograms of $\rho_{\mathrm{ens}}(p)$ in Fig. 13, which are more highly peaked for $E_{>8}$ than for $E_{>4}$; there is less dispersion of the probability forecasts of the more extreme event about its climatological frequency than of the equivalent probability forecasts of the more moderate event. This is hardly surprising; the more extreme event is relatively rare and is forecast with probabilities that almost always lie in the first probability category. In order to increase the Brier skill score of this relatively extreme event, one would need to increase the ensemble size so that finer probability categories can be reliably defined. (For example, suppose an extreme event has a climatological probability of occurrence $\bar{o}$. If we want to forecast probabilities of this event that discriminate between probability categories with a band width comparable with this climatological frequency, then the ensemble size should be at least of order $1/\bar{o}$.) With finer probability categories, the resolution component of the Brier score can be expected to increase. Providing reliability is not compromised, this will lead to higher overall skill scores.
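A rough arithmetic sketch of the ensemble-size argument: an ensemble with a given number of members can only issue probabilities that are multiples of one over that number, so that fraction is the finest probability band it can distinguish. The function below returns the smallest ensemble size whose probability step does not exceed a given climatological frequency. The function name and the example frequencies are assumptions for illustration, and the result is a bare lower bound that ignores sampling error; a reliable estimate would need a considerably larger ensemble.

```python
import math


def min_ensemble_size(climatological_frequency):
    """Smallest member count whose probability step does not exceed the frequency."""
    if not 0.0 < climatological_frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    return math.ceil(1.0 / climatological_frequency)


# Hypothetical climatological frequencies of progressively rarer events.
for freq in (0.5, 0.25, 0.125):
    print(freq, min_ensemble_size(freq))
```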

However, this raises a fundamental dilemma in ensemble forecasting given current computer resources. It would be meaningless to increase ensemble size by degrading the model (e.g. in terms of "physical" resolution) to make it cheaper to run, if by doing so it could no longer simulate extreme weather events. Allocating computer resources so that, on the one hand, ensemble sizes are sufficiently large to give reliable probability forecasts of extreme but rare events and, on the other hand, the basic model has sufficient complexity to be able to simulate such events, is a very difficult balance to strike.

The Brier score and its decomposition provide powerful objective tools for comparing the performance of different probabilistic forecast systems. However, the Brier score itself does not address the issue of whether a useful level of skill has been achieved by the probabilistic forecast system. In order to prepare the ground for a diagnostic of probabilistic forecast performance which determines potential economic value, we first introduce a skill score originally derived from signal detection theory.

8.2 Relative operating characteristic


The relative operating characteristic (ROC; Swets, 1973; Mason, 1982; Harvey et al., 1992) is based on the forecast assumption that E will occur, providing E is forecast by at least a fraction $p_t$ of ensemble members, where the threshold $p_t$ is defined a priori by the user. As discussed below, the optimal $p_t$ can be determined by the parameters of a simple decision model.

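Consider first a deterministic forecast system. Over a sufficiently large sample of independent forecasts, we can form the forecast-model contingency matrix giving the frequency that E occurred or did not occur, given it was forecast or not forecast, i.e.

$$\begin{array}{l|cc} & \text{E occurs} & \text{E does not occur} \\ \hline \text{E forecast} & \alpha & \beta \\ \text{E not forecast} & \gamma & \delta \end{array}$$

Based on these values, the so-called "hit rate" (H) and "false-alarm rate" (F) for E are given by

$$H = \frac{\alpha}{\alpha + \gamma}, \qquad F = \frac{\beta}{\beta + \delta} \tag{58}$$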
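For a deterministic yes/no forecast, the two rates follow directly from counting the four cells of the contingency matrix. In the sketch below the cells are called hits, false alarms, misses and correct rejections; these names, the function name and the six-case sample are assumptions made for illustration.

```python
def hit_and_false_alarm_rates(forecast, occurred):
    """Return (H, F) from paired yes/no forecasts and yes/no outcomes."""
    pairs = list(zip(forecast, occurred))
    hits = sum(1 for f, o in pairs if f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    misses = sum(1 for f, o in pairs if not f and o)
    correct_rejections = sum(1 for f, o in pairs if not f and not o)
    if hits + misses == 0 or false_alarms + correct_rejections == 0:
        raise ValueError("sample must contain both occurrences and non-occurrences of E")
    hit_rate = hits / (hits + misses)                                    # forecast, given E occurred
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)  # forecast, given E did not occur
    return hit_rate, false_alarm_rate


# Hypothetical sample of six deterministic forecasts.
forecast = [True, True, False, False, True, False]
occurred = [True, False, True, False, True, False]
print(hit_and_false_alarm_rates(forecast, occurred))
```

Note that both rates are conditioned on the outcome, not on the forecast: the hit rate is the fraction of occurrences that were forecast, and the false-alarm rate is the fraction of non-occurrences that were nevertheless forecast.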


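Hit and false-alarm rates for an ensemble forecast can be defined as follows. It is assumed that E will occur if $p \ge p_t$ (and will not occur if $p < p_t$). By varying $p_t$ between 0 and 1 we can define $H = H(p_t)$, $F = F(p_t)$. In terms of the pdf $\rho_{\mathrm{ens}}(p)$

$$H(p_t) = \frac{\int_{p_t}^{1} o(p) \, \rho_{\mathrm{ens}}(p) \, dp}{\int_{0}^{1} o(p) \, \rho_{\mathrm{ens}}(p) \, dp}, \qquad F(p_t) = \frac{\int_{p_t}^{1} [1 - o(p)] \, \rho_{\mathrm{ens}}(p) \, dp}{\int_{0}^{1} [1 - o(p)] \, \rho_{\mathrm{ens}}(p) \, dp} \tag{59}$$

The ROC curve is a plot of $H(p_t)$ against $F(p_t)$ for $0 \le p_t \le 1$. A measure of skill is given by the area under the ROC curve (A). A perfect deterministic forecast has $A = 1$, whilst a no-skill forecast, for which the hit and false-alarm rates are equal, has $A = 0.5$.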
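The construction of the ROC curve can be sketched as follows: for each probability threshold, E is taken as forecast whenever the ensemble probability reaches the threshold, the hit and false-alarm rates are computed, and the area under the resulting curve is estimated with the trapezoidal rule. The function names, the choice of trapezoidal integration and the sample data are assumptions for illustration; other area estimates are possible.

```python
def roc_points(probs, outcomes, thresholds):
    """Return [(F, H), ...], forecasting E whenever p >= threshold."""
    n_occ = sum(outcomes)
    n_non = len(outcomes) - n_occ
    if n_occ == 0 or n_non == 0:
        raise ValueError("sample must contain both occurrences and non-occurrences of E")
    points = []
    for t in thresholds:
        hits = sum(1 for p, v in zip(probs, outcomes) if p >= t and v == 1)
        false_alarms = sum(1 for p, v in zip(probs, outcomes) if p >= t and v == 0)
        points.append((false_alarms / n_non, hits / n_occ))
    return points


def roc_area(points):
    """Trapezoidal area under the curve, with end points (0, 0) and (1, 1) added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical ensemble probabilities and outcomes; thresholds in steps of 0.1.
p = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
v = [1, 1, 0, 1, 0, 0]
thresholds = [k / 10 for k in range(11)]
print(roc_area(roc_points(p, v, thresholds)))
```

With forecasts that are always 0 or 1 and always correct, the curve passes through the top-left corner and the area is 1; with a constant forecast probability the curve collapses onto the diagonal and the area is 0.5.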



Figure 14. Relative operating characteristic curve for seasonal timescale integrations (run over the years 1979-93 with prescribed observed SST) for the event E: the seasonal-mean (December-February) 850 hPa temperature anomaly is below normal. Solid: based on a single-model 9-member ensemble. Bottom: based on a multi-model 36-member ensemble (see Palmer et al., 2000, for more details). The area A under the two curves (a measure of skill) is shown.



We illustrate in Fig. 14 the application of these measures of skill to a set of multi-model multi-initial-condition ensemble integrations made over the seasonal timescale (Palmer et al., 2000). The event being forecast is E: the seasonal-mean (December-February) 850 hPa temperature anomaly will be below normal. The global climate models used in the ensemble are the ECMWF model, the UK Meteorological Office Unified Model, and two versions of the French Arpège model; the integrations were made as part of the European Union project "Prediction of Climate Variations on Seasonal to Interannual Timescales (PROVOST)". For each of these models, 9-member ensembles were run over the boreal winter season for the period 1979-1993 using observed specified SSTs. The values $H(p_t)$ and $F(p_t)$ have been estimated from probability bins of width 0.1. The ROC curve and corresponding A value are shown for the 9-member ECMWF model ensemble, and for the 36-member multi-model ensemble. It can be seen that in both cases A is greater than the no-skill value of 0.5; however, the multi-model ensemble is more skilful than the ECMWF-model ensemble. Studies have shown that the higher skill of the multi-model ensemble arises mainly because of the larger ensemble size, but also because of a sampling of the pdf associated with model uncertainty.







Copyright © 2003, ECMWF. All rights reserved.
 
