Historical measurements of sea-surface temperature were not made to the stringent criteria demanded for monitoring climate change. The obvious gulf between the ideal and the reality leads naturally to questions about the reliability of the surface temperature record. Often this question is couched as a yes/no dichotomy: are surface temperature records reliable? But a more scientific question is "How reliable are surface temperature records?". The fact that historical measurements were not made for climate research does not mean that it is impossible to derive a record that is useful for climate research from those observations, but it does mean that especial care must be taken in identifying and - as far as possible - quantifying uncertainties in SST and other climate records.
The following classification of uncertainties is by no means definitive, nor are the categories completely distinct, but they do provide a general framework for thinking about the uncertainties in SST data sets and how their effects might be treated or minimised. The uncertainties are tackled below in approximately ascending order of abstraction, from the random errors associated with individual observations to the generic problem of unknown unknowns.
1. General Classification of Uncertainties
Broadly speaking, there are two kinds of errors in individual SST observations: random observational errors and systematic observational errors.
Random observational errors are relatively benign. They occur for many reasons: misreading of the thermometer, rounding errors, the difficulty of reading the thermometer to an accuracy higher than the smallest marked gradation, incorrectly recorded values, errors in transcription from written to digital sources, sensor noise and many others. Although they might confound a single observation, such random errors tend to cancel out when large numbers of observations are averaged together. The cancellation occurs because the errors are not related to one another and are as likely to be positive as they are to be negative.
In any year, the global annual average SST is based on tens of thousands of observations. When averaged together, random errors tend to fall as the inverse of the square root of the number of observations that contribute to the average. Therefore, the contribution of random independent errors to the uncertainty on the global annual average SST is around 100 times smaller than the contribution of random error to the uncertainty on a single observation, even in the most sparsely observed years (e.g. there were 23000 observations contributing to the average for the year 1850, and the square root of 23000 is greater than 100).
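This 1/sqrt(N) behaviour is easy to verify numerically. The following sketch (illustrative throughout, apart from the 23000-observation count quoted above, and assuming a 1K random error on each observation) simulates annual averages built from observations carrying only independent random noise:

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs = 23000        # roughly the number of SST observations in 1850
sigma_single = 1.0   # assumed random error on a single observation (K)

# Simulate 200 hypothetical "years": each annual average is the mean of
# n_obs observations whose only error is independent random noise.
annual_errors = np.array([
    rng.normal(0.0, sigma_single, size=n_obs).mean() for _ in range(200)
])

print("spread of the annual average: %.4f K" % annual_errors.std())
print("1/sqrt(N) prediction:         %.4f K" % (sigma_single / np.sqrt(n_obs)))
```

Both numbers come out near 0.007K, more than 100 times smaller than the assumed single-observation error.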
Systematic observational errors are much more problematic because their effects become relatively more pronounced as greater numbers of observations are aggregated. Such errors might occur because a particular thermometer is miscalibrated, or poorly sited. No amount of averaging of observations from a thermometer that is miscalibrated such that it reads 1°C too high will reduce the error in the aggregate below this level. However, in many cases the systematic error will depend on the exact nature and environment of the sensor or thermometer. In this case, averaging together observations from many different ships or buoys will tend to reduce the contribution of systematic observational errors to the uncertainty of the average, because some of the biases will be positive, cancelling with others that are negative.
In early SST records, the majority of observations were made using buckets to haul a sample of water up to the deck for measurement. Although buckets were not always of a standard shape or size, they had a general tendency to lose heat via evaporation, or directly to the air when the air-sea temperature difference was large. Such pervasive systematic observational errors are of particular pertinence for climate studies because the errors are potentially common to the whole observational system and change over time as observing technology and practice change. The changes can be gradual, as old methods are slowly phased out, but they can also be abrupt, reflecting significant geopolitical events such as the Second World War. Abrupt changes also arise because the digital archives are themselves discontinuous.
Generally, systematic errors are dealt with by making adjustments based on knowledge of the systematic effects. These adjustments are themselves uncertain because we usually have imperfect knowledge concerning the size of the biases and the exact methods used to make the measurements. This uncertainty can be estimated by allowing uncertain parameters in the bias-adjustment algorithm to be varied within their plausible ranges thus generating a range of bias adjustments. This parametric uncertainty gives an idea of the uncertainties associated with poorly determined parameters conditional on a particular approach, but it does not address the more general uncertainty arising from the underlying assumptions. This uncertainty will be dealt with below as structural uncertainty.
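As a deliberately over-simplified sketch of how this parametric uncertainty can be generated, suppose a bias-adjustment scheme has a single uncertain parameter, an assumed evaporative cooling of bucket measurements, known only to lie in some plausible range (all numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_adjustment(sst, cooling):
    # Hypothetical adjustment: add back an assumed evaporative cooling
    # (in K) to every bucket measurement.
    return sst + cooling

sst_bucket = np.array([18.2, 18.5, 17.9, 18.1])  # illustrative bucket SSTs (degC)

# Sample the uncertain parameter over its plausible range (here 0.1-0.4 K)
# to generate a family of equally defensible adjusted averages.
adjusted_means = np.array([
    bias_adjustment(sst_bucket, c).mean()
    for c in rng.uniform(0.1, 0.4, size=1000)
])

print("parametric spread of the adjusted mean: %.3f K" % adjusted_means.std())
```

The spread of the resulting ensemble measures the uncertainty conditional on this particular adjustment scheme; it says nothing about whether the scheme itself is right.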
First, however, it is worth mentioning a number of other uncertainties associated with the creation of gridded data sets and SST analyses. These are closely related because they all arise in the estimation of area-averages from a finite number of often sparsely distributed observations. In HadSST3 we consider two forms of this uncertainty: sampling uncertainty and coverage uncertainty.
Sampling uncertainty was the term used in HadSST2 and HadSST3 to refer to the uncertainty accruing from estimating the area-average SST anomaly within a gridbox from a finite and often small number of observations.
Coverage uncertainty was used to refer to the uncertainty arising from estimating an area-average for a large area that encompasses grid boxes that contain no observations. Some SST data sets contain many gridboxes which are not assigned an SST value because they contain no data. Other SST data sets - sometimes referred to as SST analyses - use a variety of techniques to fill the gaps in the data. SST analyses generally use information gleaned from data-rich periods to train statistical methods for estimating SSTs in data voids in more sparsely observed epochs. There are many ways to tackle this problem, and, as no method corresponds precisely with reality, all are necessarily approximations to the truth. The correctness of the uncertainty estimates derived from these statistical methods in regions with no observations is conditional upon the correctness of the methods used to derive them. No method is correct - although some no doubt will come close - so the uncertainty estimates based on a particular method will tend to underestimate the true uncertainty, which must factor in our lack of knowledge about the correct methods to use. This brings us back to structural uncertainty.
There are many reasonable, defensible ways to produce a bias-adjusted data set, a gridded data set, an analysis, or any combination of the three. Many different approaches exist, particularly for filling gaps in the data record, and each of these will give different, although not utterly dissimilar, results. Structural uncertainty is the term used to describe the uncertainty that arises from the many choices and foundational assumptions that can be (and are) made when creating a data set. To an extent this overlaps with parametric uncertainty, in so far as the parametric uncertainty explores the range of assumptions made within a particular analysis.
The structural uncertainty is one of the more difficult uncertainties to explore efficiently because it requires multiple, independent efforts to resolve the same problem over and again. This can be misinterpreted as redundancy, but its role in uncovering, resolving, and quantifying some of the more mystifying uncertainties in climatological analyses is unquestionable. The most obvious examples are those of tropospheric temperature records made using satellites and radiosondes (Thorne et al. 2011) and sub-surface ocean temperature analyses (Lyman et al. 2010).
Which leaves unknown unknowns. On February 12th 2002, at a news briefing at the Department of Defense, US Secretary of Defense Donald Rumsfeld memorably divided the world of knowledge into three quarters:
There are known knowns. These are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.
In the context of SST uncertainty, unknown unknowns are those things that everyone has got wrong, or overlooked. By their nature unknown unknowns are unquantifiable; they represent the deeper uncertainties that beset all scientific endeavours. Unknown unknowns will only come to light with continued, diligent and sometimes imaginative investigation of the data and metadata.
2. The current state of uncertainty in in situ SST analyses
The classification of uncertainties outlined above will now be used to assess uncertainties in the global data sets based on in situ data. But first it is necessary to define exactly what is meant by sea-surface temperature.
Under conditions of low-wind speed and high insolation, a stable stratified layer of warm water can form near the surface (for a recent review see Kawai and Wada 2007). This can lead to strong temperature gradients in the upper few metres of the ocean and consequently measurements made at the same time and location but at different depths can record quite different temperatures. The simple solution to this problem of definition is to record the depth of the measurement along with the temperature, but for many historical SST measurements, the depth of the measurements is not known. Nor is it clear to what extent any warm surface layer is mixed with cooler subsurface water by the passage of the ship making the measurement. A further barrier to using this approach is that in order to produce SST data sets based on observations from many ships, some method would need to be found to convert reliably from the measurement depth to a reference depth defined for that data set.
Traditionally, in situ SST analyses have been considered representative of the upper ten or so metres of the ocean. This makes sense for a diverse fleet of ships, with differences in measurement depth appearing as unexplained variance between observations. However, for satellite analyses, which see the topmost surface layers that are most strongly affected by diurnal warming, such considerations are more important and have driven recent concerns about diurnal SST warming.
Kennedy et al. (2007) used drifting buoy data to estimate the climatological average size of the diurnal cycle at a depth of approximately 25cm. They found geographically and seasonally varying climatological-average diurnal temperature ranges from 0K to in excess of 0.5K. How the diurnal cycle interacts with individual observations and derived quantities such as monthly averages will determine how important this effect is for SST analyses. One thing to note is that measurements made by ships, usually in the engine rooms using water sucked in at depth, are typically warmer than nearby buoy observations made much nearer to the surface. This suggests that considerations of measurement depth are less important than other potential systematic errors.
As Kent et al. (2010) note "The implicit assumption is that the sampling of conditions is regular enough that no regional or time-varying bias is introduced into the datasets by neglecting such effects." Ships make SST observations at regular intervals throughout the day, typically every four or six hours. This is generally sufficient to minimise the aliasing of diurnal cycles. However, during earlier periods, there were systematic changes in the time of observation, but their effect on average sea-surface temperatures has not been quantified.
2.1 Individual observational errors
The general quality of SST measurements is not good (see for example Figure 4 from Rayner et al. 2006). Consequently, all SST analyses perform a stage of pre-screening, or quality control, in order to remove observations of low quality and minimise the number of gross errors. The effect of differences in QC between analyses has not been explicitly assessed, but the effects of different choices will be one reason for differences between analyses.
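The details of each analysis's quality control differ, but the flavour is captured by a minimal range-and-climatology screen like the sketch below (the thresholds are invented for illustration and do not correspond to any particular analysis):

```python
import numpy as np

def basic_qc(sst, climatology, tolerance=8.0):
    # Pass an observation only if it is physically plausible for seawater
    # and does not depart too far from the local climatological value.
    sst = np.asarray(sst, dtype=float)
    bad = (sst < -2.0) | (sst > 40.0)             # outside physical range
    bad |= np.abs(sst - climatology) > tolerance  # gross departure from climatology
    return ~bad                                   # True where the observation passes

obs = np.array([15.2, 99.9, 14.8, 2.0])
clim = np.full(4, 15.0)
print(basic_qc(obs, clim))   # [ True False  True False]
```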
Many estimates of random observational error uncertainty exist. These are all empirical estimates derived from considerations of the variance of the data. This recognises the impossibility of determining the errors from first principles, or by comparison to or calibration against a national standard. Most analyses have not distinguished between random observational errors and systematic observational errors, tending to combine these into one estimate. A single SST measurement from a ship has a typical uncertainty of around 1-1.5K (Kent and Challenor (2006), 1.2±0.4K or 1.3±0.3K; Kent et al. (1999), 1.5±0.1K; Kent and Berry (2005), 1.3±0.1K and 1.2±0.1K; Reynolds et al. (2002), 1.3K; Kennedy et al. (2011a), 1.0K; Kent and Berry (2008), 1.1K; Bernstein and Chelton (1985), 1.1K). These analyses are based on the more modern portion of the record. No studies have been done to see whether there are systematic changes in the size of these errors through time. It should also be noted that not all measurements are of identical quality: some ships and buoys take much higher quality measurements than others.
Drifting buoy measurements are typically somewhat more accurate: Kennedy et al. (2011a), 0.2-0.4K; Reynolds et al. (2002), 0.5K; Emery et al. (2001), 0.15K; O'Carroll et al. (2008), 0.23K; Kent and Berry (2008), 0.67K.
As noted above, random observational errors are of relatively minor importance in large-scale averages, particularly in the modern period when observations are numerous. For a single-observation uncertainty due to random observational error of 1.0K, the resulting uncertainty of a global annual average based on 10000 (or more) observations would be around 0.01K (or smaller).
Kent and Berry (2008) and Kennedy et al. (2011a, 2011b) decomposed the observational errors into random and systematic components. Brasnett (2008) implicitly used the same error model and the results are very similar to those of Kent and Berry (2008). For ships, Kent and Berry found that the random error component was around 0.7K and the systematic observational error component was around 0.8K. Kennedy et al. (2011a) found that the random error component was around 0.74K and the systematic observational error component was around 0.71K. Adding the errors in quadrature gives a combined observational uncertainty of slightly more than 1K, consistent with earlier estimates. For drifting buoys, Kennedy et al. (2011a) estimated the random error component to be around 0.26K and the systematic observational error component to be around 0.29K. The equivalent values from Kent and Berry were 0.6K and 0.3K respectively. The systematic component of the error was assumed to be different for each ship, so this model does not on its own capture the effects of pervasive systematic errors.
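Under this two-component error model the uncertainty of an average depends on both the number of observations and the number of ships they come from. A minimal sketch, assuming for simplicity that the errors are independent and each ship contributes equally:

```python
import numpy as np

def average_uncertainty(sigma_random, sigma_ship, n_obs, n_ships):
    # Uncertainty of a simple mean of n_obs observations from n_ships
    # ships: the random component falls as 1/sqrt(n_obs), but the
    # per-ship systematic component only falls as 1/sqrt(n_ships).
    return np.sqrt(sigma_random**2 / n_obs + sigma_ship**2 / n_ships)

# A single ship observation: the two components add in quadrature.
print(np.hypot(0.74, 0.71))   # ~1.03 K, consistent with the ~1-1.5 K estimates

# An average of 10000 observations from 500 ships: the ship term dominates.
print(average_uncertainty(0.74, 0.71, n_obs=10000, n_ships=500))   # ~0.03 K
```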
The addition of the systematic component has a pronounced effect on the uncertainty of large-scale averages. Kennedy et al. (2011b) estimated that the effect of these was to increase the uncertainty of the global annual average to more than 0.04K in the 19th Century (under the assumption that the errors were uncorrelated, the value was around 0.01K) and to more than 0.01K even in the well observed modern period. However, because of the assumed independence of the errors between ships, this component of the uncertainty remains relatively unimportant for the analysis of long-term trends of large-scale averages.
A difficulty with estimating uncertainties associated with systematic errors from individual ships is that not all observations in ICOADS can be associated with a ship. Some of the reports have no more information than a location, time and SST observation. Kennedy et al. (2011b) had to make estimates of how the uncertainty arising from systematic errors behaved as the number of observations increased by considering the behaviour at times when the majority of reports contained a ship name or call sign. They assumed that the observations without call signs behaved in the same manner.
Many SST products, and the analyses that depend on them, assume that the observational errors are normally distributed, but this is not necessarily the case for individual observations. Kennedy et al. (2011a) investigated the properties of observations that had been quality controlled using the procedures described in Rayner et al. (2006). They found that in comparisons with satellite observations the distributions of errors were 'fat-tailed', with positive kurtosis. In the creation of gridded products from these observations, the effects of outliers are mitigated somewhat by the use of resistant statistics such as winsorised or trimmed means. The effects of outliers are further reduced in large-scale averages, and the distribution of errors in these quantities will tend towards a normal distribution as the number of observations increases.
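A winsorised mean simply clamps the most extreme values before averaging, capping the influence any single fat-tailed outlier can have. A small sketch (the 20% clamping fraction is an illustrative choice, not that of any particular product):

```python
import numpy as np

def winsorised_mean(values, fraction=0.2):
    # Clamp the lowest and highest `fraction` of values to the nearest
    # retained value, then average: a simple resistant estimator.
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(fraction * len(v)))
    if k > 0:
        v[:k] = v[k]
        v[-k:] = v[-k - 1]
    return v.mean()

obs = [14.9, 15.1, 15.0, 15.2, 14.8, 22.5]   # one gross outlier
print(np.mean(obs))            # 16.25: the ordinary mean is pulled up
print(winsorised_mean(obs))    # 15.05: the outlier's influence is capped
```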
2.2 Pervasive Systematic Errors and Biases
Kent et al. (2010) recently conducted a review of literature on biases in SST measurements. Many studies have looked at the relative biases between different measurement methods, but fewer have attempted to adjust SST records to minimise the effects of changes in instrumentation.
The analysis of uncertainty related to pervasive systematic errors has taken two approaches: some analyses have attempted to adjust for biases in the data; others have attempted to assess the uncertainty without adjusting the data.
Folland and Parker (1995) estimated the adjustments using a simplified physical model of the buckets used to make SST measurements, together with climatological air and sea temperatures. They estimated the uncertainties of the adjustments based on considerations of the potential range of parameters in their model. Rayner et al. (2006) explored this uncertainty more systematically using a Monte Carlo method. This approach was also taken by Kennedy et al. (2011c), thus exploring the parametric uncertainty within their particular approach. These three analyses all make use of the same scheme for adjusting bucket measurements prior to 1941; therefore, in the period 1850-1941 they cannot be seen as methodologically independent. However, Kennedy et al. (2011c) extend the bias adjustments to cover the period 1850-2006.
Smith and Reynolds (2002) took an alternative approach to the bias adjustments, thus providing a quantified measure of structural uncertainty. Prior to 1941, they adjusted SST based on statistical relationships between Night Marine Air Temperature and SST. The resulting adjustments were similar, but not identical, to those produced by Folland and Parker (1995) and subsequent papers. The independence between these methods is not complete as they both rely on night marine air temperatures, which have their own particular pervasive systematic errors. In Folland and Parker (1995), Rayner et al. (2006) and Kennedy et al. (2011c) the comparison with night marine air temperature is used to estimate the fractions of different bucket types in the nineteenth and early twentieth centuries. Kennedy et al. (2011c) considered the effect of removing this constraint. They noted that "The two extreme cases would be that all buckets were considered to be canvas buckets from 1850 to 1940, or that before 1920 all buckets were wooden. The first case would increase SSTs in 1850 by around 0.2C with the increase dropping linearly to zero by 1920. The latter case, which is less likely, would lead to a decrease in estimated global average SST prior to 1920 of between 0.1 in 1850 and 0.3C in 1919." The estimate is a rough one, but it suggests that the general uncertainty in marine temperature can be reduced by considering the variables (SST and NMAT) together rather than singly.
In the post-1941 period, Smith et al. (2005) and Smith et al. (2003) opted not to adjust the data. Instead, they estimated the uncertainty due to pervasive systematic errors by considering the difference in estimated bias between measurements made in the engine rooms of ships and measurements from all ships between 1994 and 1997. They estimated a minimum 1-sigma standard error in the global average based on this of around 0.015K. This range is similar to, but slightly narrower than, that estimated by Kennedy et al. (2011c) from a more careful analysis of the metadata and biases. The difficulty with the approach taken by Smith et al. (2008) is that the quoted uncertainty range is considered to be symmetric, whereas Kennedy et al. (2011c) suggests that the true global mean is consistently higher than that estimated by Smith et al. (2008) in the period 1945-1960.
In order to minimise uncertainties in the bias adjustments for long-term analyses, it is useful to have a detailed understanding of how biases varied for individual components of the observing system. Currently there are few studies of ERI and bucket biases in ship data, particularly outside of the North Atlantic, and even fewer provide information that is (a) time resolved and (b) traceable back to ICOADS. There are no studies of hull contact measurement biases for the recent period when such measurements are plentiful. There are few studies looking at the long-term stability and calibration drifts of drifting buoys. Reverdin et al. (2010) installed 16 drifters with high-quality temperature sensors in addition to their usual temperature sensors and found that the temperatures measured by the drifters showed inaccuracies that were larger than the 0.1°C target accuracy and exhibited calibration drifts.
Currently, Kennedy et al. (2011c) is the only data set containing bias adjustments for the period 1941 onwards. Therefore, the structural uncertainties of bias adjustments for this period are currently unquantified. As Kennedy et al. (2011c) note, this remains a key weakness of historical SST analyses.
The validity of the bias adjustments and their uncertainties can be assessed by other means. SSTs adjusted using the scheme of Folland and Parker (1995) were used by Folland (2005) to drive an atmosphere-only GCM. The modelled air temperatures over land were compared to land station data, and the adjusted SST data were found to give significantly better agreement with the observed land temperatures. Folland et al. (2003) compared the adjusted SST to air temperatures on Pacific islands. Hanawa et al. (2000) showed that the Folland and Parker adjustments improved the agreement between Japanese ship data and independent SST data from Japanese coastal stations in two periods: before and after the Second World War. However, the Japanese collection of ship data used in Hanawa et al. (2000) might not have had the same bias characteristics as is assumed in the Folland and Parker adjustments.
Since the late 1940s, there is consistency between adjusted estimates of global SST change made using observations collected using buckets and observations made using the Engine Room Intake method (Kennedy et al. 2011c). From the 1990s, there are also plentiful observations from drifting and moored buoys as well as SSTs retrieved from satellite instruments (Good et al. 2007, Kennedy et al. 2011a). However, during the Second World War and in earlier periods, the majority of observations were made using a single method: engine room intake measurements during the war; buckets before it. Qualitative agreement between the long-term behaviour of different global temperature measures - including NMAT, SST and land temperatures - gives a generally consistent picture of historical global temperature change, but has little to say about the general reliability of the magnitude of the trends.
2.3 Sampling and Coverage Uncertainty
The magnitude of the sampling uncertainty depends on the correlation of SST anomalies within the grid box, the variability of SSTs within the grid box, the number of observations contributing to the grid-box average, and where those observations are located. High average correlations, low variability and large numbers of observations lead to lower uncertainty estimates. Conversely, areas of high variability or low average correlation - such as frontal regions, or western boundary currents - tend to have higher sampling uncertainties, as do grid-box averages based on smaller numbers of observations.
The estimation of uncertainties arising from the sparseness of observations, at scales from the grid box to the globe, has been approached in a number of ways. Rayner et al. (2006) estimated a combined measurement and sampling uncertainty by considering how the variance of the grid-box average changed as a function of the number of observations. The technique picked up spatial variations in sampling uncertainty associated with regions of high variability. Rayner et al. (2010) showed results from an unpublished analysis by Kaplan, in which complete satellite data were used to estimate the variability within a 1x1 grid box. The same features were seen in both analyses, allowing for differences in resolution, although the uncertainties estimated by Kaplan tend to be higher than those of Rayner et al. (2006).
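The essence of the Kaplan calculation can be sketched generically: given a complete field for a grid box (synthetic here), the sampling uncertainty for n observations can be estimated by repeatedly subsampling and measuring the spread of the subsample mean. The sketch below assumes uncorrelated values within the box; a real analysis must also account for the average correlation between observations:

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in for a complete (e.g. satellite) view of one month of SST
# anomalies within a grid box, with high within-box variability.
truth = rng.normal(0.0, 0.6, size=5000)
box_average = truth.mean()

# Empirical sampling uncertainty for n in-situ observations: draw n
# points many times and measure the spread of the subsample mean.
for n in (5, 20, 100):
    sub_means = np.array([
        rng.choice(truth, size=n, replace=False).mean() for _ in range(2000)
    ])
    print(n, "obs:", round(np.std(sub_means - box_average), 3), "K")
```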
Morrissey and Greene (2009) developed a theoretical model for estimating sampling uncertainty that accounted for non-random sampling within a grid box. This was an extension of the method used to estimate sampling uncertainties in land temperature data and global temperatures by Jones et al. (1997). Land temperatures are measured by stations at fixed locations that take measurements every day. Marine temperature measurements are taken at fixed times, but the ships and drifting buoys move during a particular month. Morrissey and Greene (2009) note that practical implementation of their theoretical model would be difficult. Kennedy et al. (2011b) extended the concept of the average correlation within a grid box developed in Jones et al. (1997) to incorporate a time dimension. An alternative to the Jones et al. (1997) method was provided by Shen et al. (2007), but this has not yet been applied in SST analyses.
As noted by Rayner et al. (2006), the sampling uncertainties are likely to be uncorrelated, or only weakly correlated between grid boxes so the effect of averaging together many grid boxes will be to reduce the combined sampling uncertainty by a factor proportional to the square root of the number of grid boxes.
Because Rayner et al. (2006) and Kennedy et al. (2011b) make no attempt to estimate temperatures in grid boxes which contain no observations, an additional uncertainty had to be computed when estimating area-averages. Rayner et al. (2006) used Optimal Averaging (OA) as described in Folland et al. (2001), which estimates the area average in a statistically optimal way and provides an estimate of the coverage uncertainty. Kennedy et al. (2011b) followed the example of Brohan et al. (2006) and subsampled globally complete fields taken from previous analyses. The uncertainties of the global average computed by Kennedy et al. (2011b) were generally larger than those estimated by Rayner et al. (2006). How comparable these two sets of numbers are is difficult to assess: Kennedy et al. (2011b) used a simple area-weighted average of the available grid boxes, with no attempt made to optimise the weights to account for the distribution of data as Rayner et al. (2006) did.
The HadSST3 coverage uncertainty is largest (with a 2-sigma uncertainty of around 0.15°C) in the 1860s when coverage is poorest. This falls to 0.03°C by 2006. That this uncertainty should be so small - particularly in the nineteenth century - is surprising. To an extent the relatively small uncertainty might simply be a reflection of the assumptions made in the analyses used by Kennedy et al. (2011b) to estimate the coverage uncertainty. Another way of assessing the coverage uncertainty is to look at the effect of reducing the coverage of well-sampled periods to that of the less well sampled nineteenth century and recomputing the global average.
This figure shows the range of global annual average SSTs obtained by reducing each year to the coverage of years in the nineteenth century. So, for example, the range indicated by the blue area in the upper panel for 2006 shows the range of global annual averages obtained by reducing the coverage of 2006 successively to be at least as bad as 1850, 1851, 1852 and so on to 1899. The red line shows the global average SST anomaly from data that has not been reduced in coverage. For most years the difference between the subsampled and more fully sampled data is smaller than 0.15°C and the largest deviations are smaller than 0.2°C. For the coverage uncertainty on the global average to be significantly larger would require the variability in the nineteenth century data gaps to be much larger than in the well observed period.
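A sketch of this coverage-reduction experiment is given below, with a synthetic anomaly field standing in for the fully sampled year and random masks standing in for the actual nineteenth-century coverage patterns:

```python
import numpy as np

def global_average(field, weights, mask):
    # Area-weighted average over only the grid boxes where mask is True.
    w = np.where(mask, weights, 0.0)
    return (field * w).sum() / w.sum()

rng = np.random.default_rng(2)
ny, nx = 36, 72
field = rng.normal(0.3, 0.4, size=(ny, nx))      # stand-in for a recent year
lats = np.linspace(-87.5, 87.5, ny)
weights = np.repeat(np.cos(np.deg2rad(lats))[:, None], nx, axis=1)

full = global_average(field, weights, np.ones((ny, nx), bool))
deviations = []
for _ in range(50):                              # stand-ins for the 1850-1899 masks
    sparse_mask = rng.random((ny, nx)) < 0.1     # ~10% coverage, like the worst years
    deviations.append(global_average(field, weights, sparse_mask) - full)

print("coverage error range: %.2f to %.2f K" % (min(deviations), max(deviations)))
```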
2.4 Analysis techniques
The structural uncertainties associated with estimating SSTs in data voids and at data sparse times are somewhat better explored. A large number of different SST data sets have been produced including ICOADS summaries (Woodruff et al. 2011), HadSST2 (Rayner et al. 2006), HadISST (Rayner et al. 2003), ERSST (Smith et al. 2008), COBE (Ishii et al. 2005), Kaplan et al. (1998), TOHOKU (Yasunaka and Hanawa 2002), NOCS (Berry and Kent 2011), Ilin and Kaplan (2009), Luttinen and Ilin (2009). A variety of different statistical techniques have been applied to both the in situ and satellite data providing a range of sometimes quite different results. The problem of creating globally complete analyses is challenging because of the relative sparseness of early observations, the non-stationarity of the changing climate and the fact that more observations are missing at colder periods than at warmer ones.
One particular concern is that patterns of variability in the modern era - which are used to train the statistical models - might not faithfully represent variability at earlier times. It is not clear to what extent this question has been resolved. Smith et al. (2008) allow for a non-stationary low frequency component in their analysis which weakens the criticism as it pertains to this analysis. They used sub-sampled climate models to optimise their algorithms in periods when data are few. Ilin and Kaplan (2009) and Luttinen and Ilin (2009) used iterative algorithms that make use of data throughout the record to estimate the covariance structures and other parameters of their statistical models. However, such methods will still tend to give a greater weight to periods with more plentiful observations.
Another concern is that methods which use Empirical Orthogonal Functions to describe the variability might inadvertently impose long-range teleconnections that do not exist in the data (Dommenget 2007). Smith et al. (2008) explicitly limit the range across which teleconnections can occur, likely mitigating this problem. A number of analysis methods have a tendency to lose variance at a range of time scales, either because they do not explicitly resolve small-scale processes (Kaplan et al. 1997, Smith et al. 2008) or because in the absence of data the method tends towards the climatological average (Ishii et al. 2005). Rayner et al. (2003) used the method of Kaplan et al. (1997), but with certain changes to explicitly resolve the long-term trend and improve small-scale variability where observations were plentiful. Karspeck et al. (submitted) analyse the residual difference between the observations and the Kaplan et al. (1997) analysis using local non-stationary covariances and then draw a range of samples from the analysis posterior distribution in order to provide consistent variance at all times and locations.
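The core of a reduced-space (EOF-based) reconstruction can be illustrated in a few lines: fit the principal components of a small set of spatial patterns to the available sparse observations by least squares, then use the fitted components to fill the whole field. This sketch omits the error covariances, priors and teleconnection constraints that a real analysis would carry:

```python
import numpy as np

def reduced_space_fit(eofs, obs, obs_idx):
    # Least-squares fit of principal components to sparse observations,
    # then reconstruction of the full field from the fitted components.
    design = eofs[:, obs_idx].T                # (n_obs, n_eofs)
    pcs, *_ = np.linalg.lstsq(design, obs, rcond=None)
    return eofs.T @ pcs

# Hypothetical setup: 3 spatial patterns over 500 grid points, observed
# (with noise) at only 40 of them.
rng = np.random.default_rng(3)
eofs = rng.normal(size=(3, 500))
full_field = eofs.T @ np.array([1.0, -0.5, 0.2])
obs_idx = rng.choice(500, size=40, replace=False)
obs = full_field[obs_idx] + rng.normal(0.0, 0.1, size=40)

recon = reduced_space_fit(eofs, obs, obs_idx)
print("rms reconstruction error: %.3f" % np.sqrt(np.mean((recon - full_field) ** 2)))
```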
Yasunaka and Hanawa (2011) examined a range of climate indices based on seven different SST products. They found that the disagreement between data sets was marked before 1880, and that the trends in large-scale averages and indices tend to diverge outside of the common climatology period. For the global average, the differences between analyses were around 0.2K before 1920 and around 0.1-0.2K in the modern period. Even for relatively well-observed events such as the 1925/26 El Nino, the detailed evolution of the event varied from analysis to analysis. The reasons for the differences are not completely clear because each data set is based on a slightly different set of observations, which have been quality controlled and processed in different ways.
The GCOS SST and sea ice working group has also made various intercomparisons between SST data sets from in situ and satellite sources. The group is also working on creating a common test data set that can be used to directly compare the effects of different analysis techniques. By using carefully defined subsets and tests it should be possible to isolate the reasons for differences between the analyses. At the MARCDAT 3 meeting in May 2011, it was suggested that a representative of the marine climatology community sit on the benchmarking panel of the surface temperatures initiative.
Currently, only a few groups provide explicit uncertainty estimates based on their analysis techniques. As noted above, the uncertainty estimates derived from a particular analysis will tend to underestimate the true uncertainty because they are conditional on the analysis method being correct. There is a further difficulty in supplying and using analysis uncertainty estimates because the traditional means of displaying uncertainties - the error bar, or error range - does not preserve the covariance structure of the uncertainties. On the other hand, storing covariance information for all but the lowest resolution data sets can be prohibitively expensive. EOF-based analyses, such as Kaplan et al. (1997), could in principle efficiently store the error covariances because only the covariances of the reduced space of principal components need to be kept. For Kaplan et al. (1997), based on a reduced space of only 80 EOFs, this is a matrix of order 80² elements as opposed to 1000² elements for the full-field covariance matrix.
Another method, noted in Rayner et al. (2010) citing Karspeck et al. (submitted), is to draw samples from the posterior probability distribution produced by the analysis. This has the added advantage that it can be combined easily with similar Monte Carlo samples from the measurement bias distributions.
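In a reduced space, drawing such samples is cheap: only the small covariance matrix of the principal components is needed, and each sample can be paired with an independently drawn bias-adjustment realisation. A sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(4)

n_eofs, n_grid, n_samples = 80, 1000, 100
eofs = rng.normal(size=(n_eofs, n_grid)) / np.sqrt(n_grid)   # stand-in patterns
pc_mean = rng.normal(size=n_eofs)                            # analysis mean (PC space)
pc_cov = np.diag(rng.uniform(0.01, 0.1, size=n_eofs))        # 80x80, not 1000x1000

# Draw analysis samples in PC space, project them onto the grid, and add
# an independently drawn bias-adjustment realisation to each sample.
pc_samples = rng.multivariate_normal(pc_mean, pc_cov, size=n_samples)
fields = pc_samples @ eofs                                   # (n_samples, n_grid)
fields += rng.normal(0.0, 0.02, size=fields.shape)           # stand-in bias draws

print("mean per-gridbox spread of the ensemble: %.3f" % fields.std(axis=0).mean())
```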
2.5 Minimising exposure to uncertainty
Alternative approaches have been attempted that use the SST data in a way that is less sensitive to biases and other data errors. Thompson et al. (2008) identified a rapid drop in global average SST in late 1945, which they attributed to a rapid change in the composition of ICOADS 2.0 from mostly US ships immediately before the drop to mostly UK ships immediately afterwards. In a follow-up paper they identified a drop in northern hemisphere SST. In order to show that the drop was not an artefact of a change in measurement method, they divided the ICOADS data into distinct subsets based on the country of the ships making the measurements, considered a range of different SST analyses, and looked at related variables such as NMAT and land surface air temperatures. The probability of a drop being due to a coincident change in the way that all countries measured SST, simultaneous with a sudden change in NMAT and land temperature bias, is small. The fact that the drop was seen in all the different data sets considered implied that it was real.
It is common in detection and attribution studies only to use data where there are observations by reducing the coverage of the models to match that of the data (see e.g. Allen et al. 2006). This reduces the exposure of the study to structural uncertainties associated with analysis techniques.
3. Improving the understanding of uncertainties in SST analyses
One of the chief difficulties in creating SST data sets that are useful for climate analyses is the impossibility of tracing individual observations back via an unbroken chain to international measurement standards. The institution of a global array of reference stations each making simultaneous redundant measurements of a variety of marine variables would solve many of the problems of SST analysis that have bedevilled the understanding of historical SST change and provide a gold standard against which the future wider observing system - incorporating observations from ships, buoys, profiling floats and satellites - can be assessed. Even without such traceability a climate record can be more easily maintained by adherence to the GCOS climate monitoring principles.
In the absence of such a network the estimation of uncertainties in SST analyses has depended heavily on redundancies in both measurement systems and in analysis techniques. Full use of the redundancies is now being made in the modern period via intercomparisons of the many satellite sources with each other and with in situ and sub-surface data. Operational analyses that ingest a variety of data sources usually produce bias statistics for each of the inputs. Such information can be used to assess their relative quality and as such reanalyses are pushed further back in time, they will help assess uncertainties through a larger part of the record.
A more systematic approach to the assessment of analysis techniques based on well-defined sets of common observations, such as those being undertaken by the GCOS SST and sea-ice working group, will help to elucidate the reasons for the differences between analyses. Benchmarking tests such as those planned by the Surface Temperature project will also help by providing an objective measure against which analysis techniques can be evaluated. It should be noted, though, that the problems associated with SST measurements are somewhat different from those faced in the analysis of land temperatures, so the benchmarks will have to be tailored appropriately.
For long-term historical analyses, there is no substitute for actual observations. Efforts to identify archives of marine observations and digitise them are ongoing (e.g. Brohan et al. 2009, Wilkinson et al. 2011). Such programmes are labour intensive: first in identifying and cataloguing the holdings in archives around the world, then in creating and storing digital images of the paper books, and finally in keying the observations. The difficulty of decoding hand-written entries in a variety of languages, formats and scripts means that OCR technologies are of relatively little use. A number of popular crowd-sourcing projects have been started to key information from ships' logs that have historical as well as meteorological interest; see, for example, oldweather.org, which is keying data from Royal Navy logs from the First World War.
As noted, a key weakness of historical SST data sets is the lack of attention paid to evaluating the effects of data biases. Multiple, independent estimates of the biases produced using as diverse a range of means as possible should be undertaken.
Within the bias framework described by Kennedy et al. (2011c) there is scope to improve estimates of the parametric uncertainty by metadata rescue, or by performing further intercomparisons of different data sources to, for example, better estimate the geographical and temporal variations of ERI biases. The period of the Second World War, when the observing fleet was small and measurement methods were changing, is a period of larger uncertainty owing to a lack of available metadata. It has been suggested that the preference for ships to travel in convoy during the war might provide an ideal opportunity to assess relative biases between ships.
Finally, the work of identifying and quantifying uncertainties will be pointless if those uncertainties are not used when the SST data sets themselves are used. Uncertainty estimates provided with data sets have sometimes been difficult to use, or easy to use inappropriately. As pointed out by Rayner et al. (2010), "more reliable and user-friendly representations of uncertainty should be provided" in order to encourage their widespread and effective use. HadSST3 has been produced as a set of interchangeable realisations that together define the bias uncertainty range of HadSST3. It is hoped that providing these 100 realisations in a form identical to the median estimate will encourage users to explore the sensitivity of their analyses to observational uncertainty with little extra effort. It is also suggested that users run their analyses on a range of different SST analyses.
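Propagating the bias uncertainty then amounts to repeating an analysis once per realisation and examining the spread of the results, as in the sketch below (with synthetic fields standing in for the real HadSST3 realisation files):

```python
import numpy as np

def my_analysis(field):
    # Whatever diagnostic the user cares about; here simply a mean that
    # ignores missing (NaN) grid boxes.
    return np.nanmean(field)

rng = np.random.default_rng(6)
realisations = [rng.normal(0.3, 0.05, size=(36, 72)) for _ in range(100)]

results = np.array([my_analysis(r) for r in realisations])
print("best estimate: %.3f; bias-uncertainty spread (2 sigma): %.3f"
      % (results.mean(), 2 * results.std()))
```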
Appendix A: Specific problems arising from Kennedy et al. (2011b, 2011c)
Appendix B: Recommendations from Kent et al. 2010
Appendix C: Recommendations from the Ocean Obs 09 white paper