Verification process
These are two documents prepared by R. Graham (atmosphere) and M. Davey (ocean) for setting up a common framework for the verification of the DEMETER forecasts. These proposals will provide the basis for further discussions during the next DEMETER meeting.

Atmosphere

Common evaluation format for DEMETER meteorological and ocean variables

Purpose

N.B. This straw man proposal applies only to the set of variables chosen for common evaluation by all groups. The purpose of the common evaluation format is to define a subset of variables from the minimum DEMETER output list that each of the modelling partners will evaluate for their own model using the same set of regions, skill measures etc., and to allocate tasks for multiple-model evaluation. Suggested aims are:

2. Provide the user community with information on the level of model skill for the key meteorological variables used in crop or disease models - post-processed multiple-model evaluations only.
3. Co-ordinated evaluation of the multiple-model output among the modellers.
4. If ECMWF agree, display results on the DEMETER website.
5. Coordination of tools for extracting output from MARS, calculating scores and plotting/tabulating results (at the very least, the latter would need to be coordinated if DEMETER website display is agreed).

Common evaluation for individual models to be on just 2 variables, 2m temperature and precipitation (i.e. the most common variables in existing real-time systems). Evaluation of multiple-model output to be done on a more extensive set (to be decided), with tasks distributed around the modelling centres (e.g. each centre might take on one or two variables).

Assessment of atmosphere fields

The volume of evaluation output quickly becomes unmanageable, with each variable evaluated potentially multiplied by several periods, seasons and skill diagnostics. I therefore propose just 2 variables and one period in the first instance (two for the multi-model). More variables, including those of interest to the user groups, to be incorporated in the multiple-model evaluations. (User partners, please mail me with input on this, e.g. which variables do you most need meteorological skill evaluation for? I have had some input from Liverpool already.)

1. Individual model output

Verification of all fields made against ERA-40. Model data to be interpolated on to a standard grid before evaluation (I suggest 2.5deg by 2.5deg, as in the WMO Standard Verification System (SVS), or perhaps the same as ERA-40).

Common variables:
More extensive set for the multi-model (see above)

Periods:

Skill Measures

Stratification

2 x 4 x 3 x 2 = 48 evaluation elements per model

Presentation

a. Global skill maps

2 variables x 4 periods x 3 diagnostics x 2 strata = 48 global maps per modelling group. These might be placed on the DEMETER website when sufficient years have been run, and updated as further years become available.

b. Regions

Aggregated scores over the following regions:

Northern extratropics (30N to 90N) - as SVS
Tropics (30N to 30S) - as SVS
Southern Hemisphere (30S to 90S) - as SVS
Europe (12.5W to 42.5E, 35N to 75N) - as PROVOST
North America (130W to 60W, 30N to 70N) - as PROVOST
Parts of Africa (to be specified by MALSAT)
NAO index (can anyone suggest a "standard" formula?)
SOI index (as SVS document)
[Include also the Indian monsoon region?]

2 variables x 8 regions x 2 diagnostics x 2 strata = 64 tables (each including 4 seasons)

c. Timeseries showing interannual variability

2 variables x 8 regions x 2 diagnostics = 32 timeseries plots (each including 4 seasons)

2. Multiple-model output

As for individual models, but for at least 6 output variables (1 per modelling centre), and for two periods, months 2-4 and 4-6. In addition, probability forecasts to be evaluated for 3 equiprobable categories, with one score per tercile category.

Global maps (per centre)

1 variable x 8 periods x 5 diagnostics (recall 3 categories) x 2 strata = 32 global maps for bias and deterministic evaluations, plus 16 x 3-in-one maps for probability skill

Tables (per centre)

1 variable x 8 regions x 2 periods x 5 diagnostics x 2 strata = 64 tables (each including 4 seasons) for bias and deterministic skill, plus 32 tables (each including 4 seasons and above/normal/below categories) for probability skill

Statistical Methodology

Cross validation method to be used, i.e. the model reference climate is calculated over all years except the year which is being evaluated. Significance of skill should be assessed using Monte Carlo methods.
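The cross-validation and Monte Carlo steps described under Statistical Methodology can be sketched as follows. This is only a minimal illustration under stated assumptions, not an agreed DEMETER procedure: the function names are hypothetical, anomaly correlation is used purely as an example score, and shuffling the observed years is one possible Monte Carlo scheme (the proposal does not specify one). The numbers are made up.

```python
import math
import random


def cross_validated_anomalies(values):
    """Anomaly of each year relative to the mean of all OTHER years."""
    total = sum(values)
    n = len(values)
    return [v - (total - v) / (n - 1) for v in values]


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def monte_carlo_p_value(forecast, observed, n_trials=2000, seed=0):
    """Share of random re-orderings of the observed years that score at
    least as well as the real pairing (assumed permutation scheme)."""
    rng = random.Random(seed)
    actual = correlation(forecast, observed)
    shuffled = list(observed)
    count = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        if correlation(forecast, shuffled) >= actual:
            count += 1
    # The +1 terms count the real pairing itself, so p is never exactly 0.
    return actual, (count + 1) / (n_trials + 1)


# Illustrative values only (one number per hindcast year), not DEMETER data.
model_years = [14.1, 15.3, 13.8, 16.0, 15.1, 14.4, 16.2, 13.9]
observed_years = [10.2, 11.0, 9.9, 11.9, 11.1, 10.1, 12.0, 10.3]
score, p_value = monte_carlo_p_value(
    cross_validated_anomalies(model_years),
    cross_validated_anomalies(observed_years),
)
```

Because each series is referenced to its own leave-one-out climate, the model's mean offset from the observations does not enter the score. A small `p_value` indicates that the skill is unlikely to arise from a chance pairing of years.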
Ocean

(draft version 3aug00 by Mike Davey)

The purpose of the common evaluation format is to define a subset of variables from the minimum DEMETER output list that each of the modelling partners will evaluate for their own model using the same set of regions, skill measures etc. Suggested aims are:

1. Provide easy comparisons of skill between models, both before and after post-processing.
2. Provide the user community with information on the level of model skill.
3. Co-ordinated evaluation of the multiple-model output among the modellers.
4. If ECMWF agree, display results on the DEMETER website.
5. Coordination of tools for extracting output from MARS, calculating scores and plotting/tabulating results.

Notes from WMO meeting on verification of long-range forecasts (see http://www.wmo.int/web/www/reports/ECMWF-AUG-99.html):

From key list of parameters for verification:

Sea surface temperature predictions for
Verification dataset:
Recommended basic diagnostics:
Verification list

Ocean hindcasts

Note
1. SST

verify against Reynolds SST, and MetO GISST or HadISST: need to be aware of systematic differences between different observational SST analyses

SST systematic error:
latitude-longitude maps of model systematic error for months 3 and 6 for the 4 periods (8 maps/model)
(systematic error = average of all hindcasts (all members, all years) for the target month minus observed SST for the target month/year)
This will provide information on model drift.
It would be useful to include equivalent maps of the initial SST. But should this be SST systematic error at day 0 (when hindcasts actually start), or month 0? I suggest month 0, as it is easier to verify and provides information on the ocean analysis component. As most (all?) models will make use of observed SST, the month 0 errors should be small (but probably not insignificant!).

SST anomaly skill:
monthly SST anomalies (SST'): model SST anomalies calculated by subtracting the model climatology for the corresponding lead time and time-of-year, to remove systematic error. Model climatology calculated as the average of all hindcasts for that month-of-year starting from the same month-of-year.
(NB - anomalies calculated in this way make use of information beyond the forecast start time, and may inflate skill slightly?)

latitude-longitude skill maps:
average months 2-4 and 4-6 (cf atmos evaluation)
4 periods
3 skill measures - RMSSS, area under ROC curve, correlation
(24 maps/model!)

SST' area averages:
Nino3, Nino4, North Trop Atlantic (10N-20N, 20W-60W), Central Indian (0-10N, 60E-80E)
(the Atlantic and Indian regions are areas that are robustly linked to the equatorial Pacific at 0-6 month lags in observations)
East and west equatorial Indian regions? (cf recent 'dipole' ideas)

For each area:
plot timeseries for months 2-4 and 4-6 (plot ensemble mean, and some measure of spread [+/- 1 standard deviation?], along with observed SST)
contingency tables for months 2-4 and 4-6: all ensemble members vs observed categories (3 or 5 categories? I suggest 3 if the number of years available is 12-24, 5 if 25 or more)
(NB - such tables can also be used to adjust tercile/quintile forecasts.)
tables of anomaly correlations and rmse vs lead time

2. Vertical sections of temperature

Verification dataset: TAO for Pacific (1990s onward only)

temperature systematic error:
longitude-depth (to 400m) maps of systematic error at months 0, 3 and 6 (cf SST above)
latitude-depth (to 400m) maps of systematic error at months 0, 3 and 6 along 165E and 140W
(Provides information about model drift.)

temperature anomalies (T'):
longitude-depth skill maps:
average months 2-4 and 4-6 (cf atmos evaluation)
4 periods
2 skill measures - RMSSS, correlation

3. 20C isotherm depth (D20)

verification datasets:
TAO (tropical Pacific, 1990s onward only)
PIRATA (tropical Atlantic, end 1990s onward)
ocean analyses (global, all years)

D20 anomalies (D20'): monthly anomalies from model climatology

area averages of D20':
Nino3
Nino4
Atlantic? (where? PIRATA area?)
Indian? (where?)

for each area:
plot timeseries of months 2-4, 4-6 (plot ensemble mean, and some measure of spread [+/- 1 standard deviation?], along with observed (where available) and ocean analysis from corresponding model)

4. Sea level

verification dataset: CLS (global, 1993-2000)

monthly sea level anomalies (SLA):
latitude-longitude skill maps:
average months 2-4 and 4-6 (cf atmos evaluation)
4 periods
2 skill measures - RMSSS, correlation
(limited data years - better to use all data rather than separate into 4 periods?)

Ocean analyses

SST and vertical T section diagnostics are included in the hindcast list above (month 0), with regard to systematic errors.
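The systematic-error definition and the tercile contingency tables in the hindcast list can be illustrated with a short sketch. The function names, the rank-based tercile boundaries and the sample numbers are all assumptions made for illustration. Taking the model categories from the model's own distribution is one way of removing systematic error before comparison, and the boundaries here are not cross-validated.

```python
def systematic_error(hindcasts, observed):
    """Average of all hindcasts (all members, all years) minus the average
    observed value. hindcasts holds one list of members per year."""
    members = [m for year in hindcasts for m in year]
    return sum(members) / len(members) - sum(observed) / len(observed)


def tercile_bounds(values):
    """Lower and upper tercile boundaries by simple rank (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def category(value, bounds):
    """0 = below normal, 1 = normal, 2 = above normal."""
    lower, upper = bounds
    if value < lower:
        return 0
    if value < upper:
        return 1
    return 2


def contingency_table(hindcasts, observed):
    """3x3 counts: rows are observed categories, columns are the categories
    of individual ensemble members."""
    model_bounds = tercile_bounds([m for year in hindcasts for m in year])
    obs_bounds = tercile_bounds(observed)
    table = [[0] * 3 for _ in range(3)]
    for members, obs in zip(hindcasts, observed):
        row = category(obs, obs_bounds)
        for member in members:
            table[row][category(member, model_bounds)] += 1
    return table


# Made-up example: 9 years, 2 members per year, a uniform warm bias of 0.5.
observed = [float(year) for year in range(9)]
hindcasts = [[obs + 0.5, obs + 0.5] for obs in observed]

bias = systematic_error(hindcasts, observed)
table = contingency_table(hindcasts, observed)
```

In this artificial case the bias is 0.5 and every member falls in the observed category, so all counts sit on the diagonal of `table`. Real hindcasts would spread counts off the diagonal, and the row proportions could then be used to adjust tercile forecasts, as the note on such tables suggests.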
Sea level

independent verification dataset: tide gauges (several suitable sites were identified and used for the DUACS altimeter project, mainly in the tropical Pacific)
plot timeseries of SLA at select TG sites, along with TG and altimeter-derived SLA data
plot scatterplots of model vs TG SLA
suggested sites (from DUACS list) (data available from University of Hawaii Sea Level Center)
salinity??
Although verification data are very limited (and are likely to be assimilated), it is useful to monitor salinity to detect possible drift. For DUACS, 0-300m averages in Nino3 and Nino4 were used. It is also useful to monitor a particular depth.

currents??
(cf intercomparison by Fevrier et al., J. Mar. Sys., 24, 249-275 (2000))

Is it worth including verification of variables that are assimilated? (the results depend on the weight given to the observations in the various assimilation schemes)