Verification process


 
 
 

These are two documents prepared by R. Graham (atmosphere) and M. Davey (ocean) for setting up a common framework for the verification of the DEMETER forecasts. These proposals will provide the basis for further discussions during the next DEMETER meeting.


Atmosphere

Common evaluation format for DEMETER meteorological and ocean variables
Purpose

N.B. This straw man proposal applies only to the set of variables chosen for common evaluation by all groups. The purpose of the common evaluation format is to define a subset of variables from the minimum DEMETER output list, that each of the modelling partners will evaluate for their own model using the same set of regions, skill measures etc., and to allocate tasks for multiple-model evaluation. Suggested aims are:
1. Provide easy comparisons of skill between models, both before and after post-processing.

2. Provide the user community with information on the level of model skill for the key meteorological variables used in crop or disease models - post-processed multiple-model evaluations only.

3. Co-ordinated evaluation of the multiple-model output among the modellers

4. If ECMWF agree, display results on the DEMETER website.

5. Coordination of tools for extracting output from MARS, calculating scores and plotting/tabulating results (at the very least, the latter would need to be coordinated if DEMETER website display is agreed)

Common evaluation for individual models to be on just 2 variables: 2m Temp and precip (i.e. the most common variables in existing real-time systems).

Evaluation on multiple-model output to be done on a more extensive set (to be decided), with tasks distributed around the modelling centres (e.g. each centre might take on one or two variables).

Assessment of atmosphere fields

The volume of evaluation output quickly becomes unmanageable, with each variable evaluated potentially multiplied by several periods, seasons and skill diagnostics. I therefore propose just 2 variables and one period in the first instance (two for the multi-model). More variables, including those of interest to the user groups, to be incorporated in the multiple-model evaluations. (User partners: please mail me with input on this, e.g. which variables do you most need meteorological skill evaluation for? I have had some input from Liverpool already.)

1. Individual model output

Verification of all fields made against ERA-40. Model data to be interpolated on to a standard grid before evaluation (I suggest 2.5deg by 2.5deg, as in the WMO Standard Verification System (SVS), or perhaps the same as ERA-40).

Common variables:

  • 2m Temp at 12Z: 3-month averages
  • total precip: 3-month averages
  • 2 variables

More extensive set for the multi-model (see above)

Periods:

  • 3-month averages to be: Months 2-4. With the proposed start dates of Feb 1, May 1, Aug 1 & Nov 1, that gives: MAM, JJA, SON, DJF. (could also do 4-6, but this would multiply up the output by factor 2 - comments?)
  • 4 periods
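The mapping from start dates to verification seasons can be checked with a short sketch. It assumes that lead month 1 is the calendar month of the start date itself (Feb for a Feb 1 start), which is the convention the MAM/JJA/SON/DJF labels imply; the function and variable names are illustrative only.

```python
# Initials of the calendar months, Jan..Dec.
MONTH_INITIALS = "JFMAMJJASOND"

def season_label(start_month, first_lead, last_lead):
    """Return the season label (e.g. 'MAM') for lead months first_lead..last_lead.

    start_month is the calendar month (1-12) of the start date. Lead month 1
    is assumed to be the start month itself (Feb for a Feb 1 start).
    """
    letters = []
    for lead in range(first_lead, last_lead + 1):
        month_index = (start_month - 1 + lead - 1) % 12
        letters.append(MONTH_INITIALS[month_index])
    return "".join(letters)

START_MONTHS = [2, 5, 8, 11]  # Feb 1, May 1, Aug 1, Nov 1

labels_2_4 = [season_label(m, 2, 4) for m in START_MONTHS]
labels_4_6 = [season_label(m, 4, 6) for m in START_MONTHS]
print(labels_2_4)  # ['MAM', 'JJA', 'SON', 'DJF']
print(labels_4_6)  # ['MJJ', 'ASO', 'NDJ', 'FMA']
```

Adding months 4-6 would therefore introduce four further 3-month periods, doubling the number of periods from 4 to 8.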

Skill Measures

  • Model bias (mean error of ensemble-mean)
  • One probabilistic skill score and one deterministic skill score
  • Probability score = area under ROC curve, with categories = above/below normal for 9-member ensemble output
  • Deterministic skill score = RMSSS for predicted anomalies (as SVS) (reference score to be persistence in tropics and climate elsewhere?)
  • 3 diagnostics
  • Other scores may be favoured for familiarity (anomaly correlation for regions, timeseries correlations for maps) - To be discussed.
  • Information on the calculation of ROC and RMSSS may be found in the draft SVS document at http://www.wmo.int/web/www/DPS/verification_systems.html
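As a rough illustration of the two proposed scores, the sketch below computes an RMSSS and an area under the ROC curve for a single grid point or region. It is not taken from the SVS document: it assumes the common form RMSSS = 1 - RMSE(forecast)/RMSE(reference), and it builds the ROC curve by using each distinct forecast probability as a warning threshold and integrating with the trapezoidal rule. The function names and sample values are hypothetical.

```python
import math

def rmse(values, observed):
    """Root mean square error of values against observed."""
    return math.sqrt(sum((v - o) ** 2 for v, o in zip(values, observed)) / len(values))

def rmsss(forecast, observed, reference):
    """Assumed form: 1 - RMSE(forecast)/RMSE(reference).

    1 is a perfect forecast, 0 is no better than the reference (climate or
    persistence). Undefined if the reference itself has zero error.
    """
    return 1.0 - rmse(forecast, observed) / rmse(reference, observed)

def roc_area(probabilities, occurred):
    """Area under the ROC curve for one event (e.g. 'above normal').

    probabilities: forecast probability per year (e.g. k/9 for 9 members).
    occurred: 1 if the event was observed that year, else 0.
    """
    n_events = sum(occurred)
    n_non_events = len(occurred) - n_events
    points = [(0.0, 0.0)]  # (false alarm rate, hit rate)
    for threshold in sorted(set(probabilities), reverse=True):
        hits = sum(1 for p, o in zip(probabilities, occurred) if p >= threshold and o)
        false_alarms = sum(1 for p, o in zip(probabilities, occurred) if p >= threshold and not o)
        points.append((false_alarms / n_non_events, hits / n_events))
    points.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal rule
    return area

# Hypothetical 'above normal' probabilities from a 9-member ensemble.
probs = [8 / 9, 6 / 9, 5 / 9, 3 / 9, 1 / 9, 2 / 9]
events = [1, 1, 0, 1, 0, 0]
print(round(roc_area(probs, events), 4))  # 0.8889
```

An area of 0.5 corresponds to no discrimination between events and non-events; for anomalies, a climate reference forecast is simply a zero anomaly in every year.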

Stratification

  • All years
  • ENSO years (El Nino and La Nina years)
  • 2 "strata"

2x4x3x2 = 48 evaluation elements per model
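The count can be reproduced by enumerating the combinations explicitly, which also shows what a single "evaluation element" is. This is only an illustrative sketch; the labels are taken from the lists above.

```python
from itertools import product

VARIABLES = ["2m temperature", "total precipitation"]
PERIODS = ["MAM", "JJA", "SON", "DJF"]
DIAGNOSTICS = ["bias", "ROC area", "RMSSS"]
STRATA = ["all years", "ENSO years"]

# One evaluation element = one (variable, period, diagnostic, stratum) combination.
evaluation_elements = list(product(VARIABLES, PERIODS, DIAGNOSTICS, STRATA))
print(len(evaluation_elements))  # 48
print(evaluation_elements[0])    # ('2m temperature', 'MAM', 'bias', 'all years')
```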

Presentation

a. Global skill maps

2 variables x 4 periods x 3 diagnostics x 2 strata = 48 global maps per modelling group. These might be placed on the DEMETER website when sufficient years have been run, and updated as further years become available.

b. Regions

Aggregated scores over the following regions:

  • Northern extratropics (30N to 90N) - as SVS
  • Tropics (30N to 30S) - as SVS
  • Southern Hemisphere (30S to 90S) - as SVS
  • Europe (12.5W to 42.5E, 35N to 75N) - as PROVOST
  • North America (130W to 60W, 30N to 70N) - as PROVOST
  • Parts of Africa (to be specified by MALSAT)
  • NAO index (can anyone suggest a "standard" formula?)
  • SOI index (as SVS document)
  • [Include also the Indian monsoon region?]

2 variables x 8 regions x 2 diagnostics x 2 strata = 64 tables (each including 4 seasons)
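One possible way to aggregate a gridded field over a latitude-longitude box is sketched below. The cos(latitude) area weighting, the longitude convention (degrees east in [-180, 180)) and the function name are assumptions, not part of the proposal; SVS or PROVOST may prescribe a different aggregation.

```python
import math

def region_mean(field, lats, lons, south, north, west, east):
    """Area-weighted mean of field[i][j] over grid points inside a box.

    lats, lons: grid coordinates in degrees; lons are degrees east in
    [-180, 180), so 12.5W is -12.5. Weights are cos(latitude) (an assumption).
    Boxes crossing the date line are not handled in this sketch.
    """
    total = 0.0
    weight_sum = 0.0
    for i, lat in enumerate(lats):
        if not (south <= lat <= north):
            continue
        weight = math.cos(math.radians(lat))
        for j, lon in enumerate(lons):
            if west <= lon <= east:
                total += weight * field[i][j]
                weight_sum += weight
    return total / weight_sum

# 2.5deg by 2.5deg grid.
lats = [-90.0 + 2.5 * k for k in range(73)]
lons = [-180.0 + 2.5 * k for k in range(144)]

# Europe box as given in the region list (12.5W to 42.5E, 35N to 75N).
constant_field = [[2.0 for _ in lons] for _ in lats]
europe_mean = region_mean(constant_field, lats, lons, 35.0, 75.0, -12.5, 42.5)
print(europe_mean)
```

With the weighting, grid points near 35N count for more than those near 75N, so the regional score is not dominated by the small high-latitude grid boxes.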

c. Timeseries showing interannual variability

2 variables x 8 regions x 2 diagnostics = 32 timeseries plots (each including 4 seasons)

2. Multiple-model output

As for individual models, but for at least 6 output variables (1 per modelling centre), and for two periods 2-4 and 4-6. In addition probability forecasts to be evaluated for 3 equiprobable categories - with one score per tercile category.

Global maps (per centre)

1 variable x 8 periods x 5 diagnostics (recall 3 categories) x 2 strata = 32 global maps for bias and deterministic evaluations, plus 16 x 3-in-one maps for probability skill

Tables (per centre)

1 variable x 8 regions x 2 periods x 5 diagnostics x 2 strata = 64 tables (each including 4 seasons) for bias and deterministic skill, plus 32 tables (each including 4 seasons and above/normal/below categories) for probability skill

Statistical Methodology

Cross validation method to be used: i.e. model reference climate is calculated over all years except the year which is being evaluated.
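A minimal sketch of the leave-one-out calculation for a single grid point, start date and lead time follows. The function name and the sample values are hypothetical.

```python
def cross_validated_anomalies(values_by_year):
    """Anomaly for each year relative to the mean of all OTHER years.

    values_by_year: {year: ensemble-mean value} for one grid point,
    one start date and one lead time.
    """
    years = sorted(values_by_year)
    anomalies = {}
    for year in years:
        others = [values_by_year[y] for y in years if y != year]
        reference_climate = sum(others) / len(others)
        anomalies[year] = values_by_year[year] - reference_climate
    return anomalies

sample = {1990: 1.0, 1991: 2.0, 1992: 3.0, 1993: 6.0}
anoms = cross_validated_anomalies(sample)
print(anoms[1993])  # 4.0, i.e. 6.0 minus the mean of 1.0, 2.0 and 3.0
```

Excluding the target year keeps the verified value out of its own reference climate, which would otherwise pull the anomaly towards zero.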

Significance of skill should be assessed using Monte Carlo methods.
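The proposal does not specify which Monte Carlo method to use. One common choice, sketched here purely as an assumption, is a permutation test: the observed years are shuffled many times and the skill (here a timeseries correlation) is recomputed to build a no-skill distribution. All names and sample values are hypothetical, and shuffling ignores any serial correlation between years.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def monte_carlo_p_value(forecast, observed, n_trials=2000, seed=0):
    """Fraction of shuffled-year trials whose skill matches or beats the real one."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    actual = correlation(forecast, observed)
    shuffled = list(observed)
    count = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        if correlation(forecast, shuffled) >= actual:
            count += 1
    # The +1 terms count the real ordering as one of the trials.
    return (count + 1) / (n_trials + 1)

forecast = [0.2, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8, 9.1, 10.0, 11.2]
observed = [0.0, 1.0, 2.1, 3.0, 4.2, 4.9, 6.1, 6.9, 8.0, 9.0, 10.1, 11.0]
p_value = monte_carlo_p_value(forecast, observed)
print(p_value)
```

A small p-value indicates that the score is unlikely to have arisen by chance; the same resampling loop could wrap RMSSS or ROC area instead of correlation.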

Ocean

(draft version 3aug00 by Mike Davey)

The purpose of the common evaluation format is to define a subset of variables from the minimum DEMETER output list, that each of the modelling partners will evaluate for their own model using the same set of regions, skill measures etc. Suggested aims are:

1. Provide easy comparisons of skill between models, both before and after post-processing.

2. Provide the user community with information on the level of model skill.

3. Co-ordinated evaluation of the multiple-model output among the modellers

4. If ECMWF agree, display results on the DEMETER website.

5. Coordination of tools for extracting output from MARS, calculating scores and plotting/tabulating results.

Notes from WMO meeting on verification of long-range forecasts (see http://www.wmo.int/web/www/reports/ECMWF-AUG-99.html):

From key list of parameters for verification:

Sea surface temperature predictions for

  • Nino1+2, Nino3, Nino3.4, Nino4, Pacific warm pool (4S-4N,130E-150E)
  • Tropical Indian ocean (area not yet defined)
  • Tropical Atlantic ocean (area not yet defined)
  • Others (not yet defined)

Verification dataset:

  • Reynolds OI, with option for additional use of GISST

Recommended basic diagnostics:

  • Relative Operating Characteristics (ROC)
  • Root mean square skill scores (RMSSS)
  • Also for consideration as information to supply to users: a number of contingency table based diagnostics. (See the report for further details.)

Verification list

Ocean hindcasts

Note

  • '4 periods' refers to the hindcast sets using the 4 agreed start times per year (Feb 1, May 1, Aug 1, Nov 1)
  • month 1 is the first calendar month after the start (i.e. Feb for the Feb 1 start time), etc.
  • month 0 is the calendar month before the start (i.e. Jan for the Feb 1 start time)
  • where appropriate, equivalent maps/plots for simple persistence and/or climatology hindcast strategies to be produced for comparison.

1. SST

verify against Reynolds SST, and MetO GISST or HadISST: need to be aware of systematic differences between different observational SST analyses

SST systematic error: latitude-longitude maps of model systematic error for months 3 and 6 for the 4 periods (8 maps/model). Systematic error = average of all hindcasts (all members, all years) for the target month minus observed SST for the target month/year. This will provide information on model drift.

It would be useful to include equivalent maps of the initial SST. But should this be SST systematic error at day 0 (when hindcasts actually start), or month 0? I suggest month 0, as it is easier to verify and provides information on the ocean analysis component. As most (all?) models will make use of observed SST, the month 0 errors should be small (but probably not insignificant!).
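For a single grid point and target month, the systematic error defined above could be computed as in the sketch below. The data layout, function name and values are hypothetical.

```python
def systematic_error(hindcasts, observed):
    """Mean of (hindcast - observed) over all members and all years.

    hindcasts: {year: [SST of each ensemble member for the target month]}
    observed:  {year: observed SST for the same target month}
    """
    differences = [
        member - observed[year]
        for year, members in hindcasts.items()
        for member in members
    ]
    return sum(differences) / len(differences)

hindcast_sst = {1990: [27.0, 27.4], 1991: [26.0, 26.6]}
observed_sst = {1990: 26.7, 1991: 26.0}
print(round(systematic_error(hindcast_sst, observed_sst), 3))  # 0.4
```

Repeating this at every grid point for months 0, 3 and 6 gives the maps described above, with growth of the error from month 0 to month 6 indicating drift.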

SST anomaly skill: monthly SST anomalies (SST')

Model SST anomalies calculated by subtracting model climatology for the corresponding lead time and time-of-year, to remove systematic error. Model climatology calculated as average of all hindcasts for that month-of-year starting from the same month-of-year. (NB - anomalies calculated in this way make use of information beyond the forecast start time, which may inflate skill slightly?)

Latitude-longitude skill maps:

  • average months 2-4 and 4-6 (cf atmos evaluation)
  • 4 periods
  • 3 skill measures - RMSSS, area under ROC curve, correlation
  • 24 maps/model!
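The lead-dependent model climatology can be sketched as follows for one grid point. The dictionary layout and names are assumptions made for illustration; the climatology here uses all years, so it does not apply cross validation.

```python
from collections import defaultdict

def lead_dependent_anomalies(hindcasts):
    """Subtract the model climatology for each (start month, lead) pair.

    hindcasts: {(year, start_month, lead): ensemble-mean SST}
    Returns a dict with the same keys holding anomalies.
    """
    groups = defaultdict(list)
    for (year, start_month, lead), value in hindcasts.items():
        groups[(start_month, lead)].append(value)
    climatology = {key: sum(values) / len(values) for key, values in groups.items()}
    return {
        key: value - climatology[(key[1], key[2])]
        for key, value in hindcasts.items()
    }

# Hypothetical Feb 1 starts: the model cools by 2 degrees between leads 1 and 3.
sample_hindcasts = {
    (1990, 2, 1): 27.0, (1991, 2, 1): 28.0,
    (1990, 2, 3): 25.0, (1991, 2, 3): 26.0,
}
sst_anomalies = lead_dependent_anomalies(sample_hindcasts)
print(sst_anomalies[(1990, 2, 3)])  # -0.5
```

Because each lead time has its own climatology, the 2-degree drift in the sample data disappears from the anomalies, leaving only the year-to-year signal.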

SST' area averages:

  • Nino3
  • Nino4
  • North Trop Atlantic (10N-20N, 20W-60W)
  • Central Indian (0-10N, 60E-80E)
  • East and west equatorial Indian regions? (cf recent 'dipole' ideas)

(The Atlantic and Indian regions are areas that are robustly linked to the equatorial Pacific at 0-6 month lags in observations.)

For each area:

  • plot timeseries for months 2-4 and 4-6 (plot ensemble mean, and some measure of spread [+/- 1 standard deviation?], along with observed SST)
  • contingency tables for months 2-4 and 4-6: all ensemble members vs observed categories (3 or 5 categories? I suggest 3 if number of years available is 12-24, 5 if 25 or more) (NB - such tables can also be used to adjust tercile/quintile forecasts.)
  • tables of anomaly correlations and rmse vs lead time
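A 3-category contingency table of the kind suggested above might be built as in this sketch. It assumes that the model categories are defined from the distribution of all ensemble members and the observed categories from the observations, with tercile boundaries from Python's `statistics.quantiles`; none of these choices come from the proposal, and the sample data are invented.

```python
import statistics

def category(value, cuts):
    """0 = below, 1 = normal, 2 = above, given the two tercile boundaries."""
    return sum(value > cut for cut in cuts)

def contingency_table(members_by_year, observed_by_year):
    """table[forecast category][observed category], counting every member."""
    all_members = [m for members in members_by_year.values() for m in members]
    model_cuts = statistics.quantiles(all_members, n=3)
    observed_cuts = statistics.quantiles(list(observed_by_year.values()), n=3)
    table = [[0, 0, 0] for _ in range(3)]
    for year, members in members_by_year.items():
        observed_category = category(observed_by_year[year], observed_cuts)
        for member in members:
            table[category(member, model_cuts)][observed_category] += 1
    return table

# Hypothetical 12-year set with 2 members per year that bracket the observation.
observed_sst = {1980 + k: float(k + 1) for k in range(12)}
member_sst = {year: [obs - 0.1, obs + 0.1] for year, obs in observed_sst.items()}
table = contingency_table(member_sst, observed_sst)
print(table)  # [[8, 0, 0], [0, 8, 0], [0, 0, 8]]
```

Counts off the diagonal show members that fell in a different category from the observation; the column proportions are what could be used to adjust tercile forecasts.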

2. Vertical sections of temperature

Verification dataset: TAO for Pacific (1990s onward only)

Temperature systematic error:

  • longitude-depth (to 400m) maps of systematic error at months 0, 3 and 6 (cf SST above)
  • latitude-depth (to 400m) maps of systematic error at months 0, 3 and 6 along 165E and 140W

(Provides information about model drift.)

Temperature anomalies (T'): longitude-depth skill maps:

  • average months 2-4 and 4-6 (cf atmos evaluation)
  • 4 periods
  • 2 skill measures - RMSSS, correlation

3. 20C isotherm depth (D20)

Verification datasets:

  • TAO (tropical Pacific, 1990s onward only)
  • PIRATA (tropical Atlantic, end 1990s onward)
  • ocean analyses (global, all years)

D20 anomalies (D20'): monthly anomalies from model climatology

Area averages of D20':

  • Nino3
  • Nino4
  • Atlantic? (where? PIRATA area?)
  • Indian? (where?)

For each area: plot timeseries of months 2-4, 4-6 (plot ensemble mean, and some measure of spread [+/- 1 standard deviation?], along with observed (where available) and ocean analysis from corresponding model)

4. Sea level

Verification dataset: CLS (global, 1993-2000)

Monthly sea level anomalies (SLA): latitude-longitude skill maps:

  • average months 2-4 and 4-6 (cf atmos evaluation)
  • 4 periods
  • 2 skill measures - RMSSS, correlation

(limited data years - better to use all data rather than separate into 4 periods?)

Ocean analyses

SST and vertical T section diagnostics are included in the hindcast list above (month 0), with regard to systematic errors.

Sea level

independent verification dataset: tide gauges (several suitable sites were identified and used for the DUACS altimeter project, mainly in the tropical Pacific)

plot timeseries of SLA at select TG sites, along with TG and altimeter-derived SLA data

plot scatterplots of model vs TG SLA

suggested sites (from DUACS list) (data available from University of Hawaii Sea Level Center):

  • Betio (Kiribati) (1.2N, 172.5E) (equatorial west Pacific)
  • Christmas Island (1.6N, 157.3W, equatorial central Pacific)
  • Galapagos (50:50 average of Santa Cruz [0.4N,90.2W] and Baltra [0.3S, 90.0W], equatorial east Pacific)
  • Johnston (16.5N, 169.3W, tropical north Pacific)
  • Kwajalein (8.4N, 167.4E, tropical north Pacific)
  • Penrhyn (Cook Islands) (8.6S, 158.0W, tropical south Pacific)
  • Funafuti (Tuvalu) (8.3S, 179.1E, tropical south Pacific)
  • Ponta Delgada (Azores) (37.4N,25.4W, North Atlantic)
  • Bermuda (32.2N,64.4W, North Atlantic)
  • Point La Rue (Seychelles) (4.4S,55.3E, equatorial Indian ocean)

salinity??

Although verification data are very limited (and are likely to be assimilated), it is useful to monitor salinity to detect possible drift. For DUACS, 0-300m averages in Nino3 and Nino4 were used. It is also useful to monitor a particular depth.

currents??

(cf intercomparison by Fevrier et al., J. Mar. Sys., 24, 249-275 (2000))

Is it worth including verification of variables that are assimilated? (the results depend on the weight given to the observations in the various assimilation schemes)


 

07.12.2001   © ECMWF