2. Bayesian Chronological Modelling

For a single sample, calibration of the radiocarbon age (or of the weighted mean if there is more than one determination on the sample) is sufficient to convert the radiocarbon measurement to the calendar timescale (see section 1.5).
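Where a weighted mean is needed, the usual inverse-variance form can be sketched as follows. This is a minimal illustration: the function name and the example measurements are hypothetical, and it assumes the determinations are statistically consistent with one another, which would normally be tested before they are combined.

```python
import math

def weighted_mean(ages, errors):
    """Inverse-variance weighted mean of radiocarbon ages (BP) and its 1-sigma error."""
    weights = [1.0 / (e * e) for e in errors]      # more precise results weigh more
    total = sum(weights)
    mean = sum(w * a for w, a in zip(weights, ages)) / total
    return mean, math.sqrt(1.0 / total)

# Two hypothetical determinations on the same sample
mean_bp, error_bp = weighted_mean([3600, 3640], [30, 30])
print(round(mean_bp), round(error_bp, 1))
```

The combined error is smaller than either individual error, which is why replicate measurements on one sample are pooled before calibration.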

There is a limit to the precision that can be achieved this way, as a measurement on a cereal grain — actually harvested on one day of one particular year — typically produces a calibrated date range that spans more than a century. But the probability distribution (and range) of the calibrated date does estimate that point in time accurately to within the quoted uncertainty, and that date does provide a range-finder for the date of the deposits or objects sampled.

However, when we have a group of radiocarbon ages from samples that are in some way related, then more sophisticated statistical approaches are required.

2.1 The need for statistical analysis

Statistical analysis of groups of radiocarbon dates is needed because the simple calibration process makes a statistical assumption: that the date of the sample is equally likely to fall at any point on the calibration curve used.

For a single measurement, this assumption is usually valid. But as soon as there is a group of measurements that are related in some way (e.g. that are from the same site), then this assumption is violated. For example, if the first sample from a site is of Bronze Age date, then the chances are that the samples subsequently dated will also be Bronze Age.

The effect of ignoring this issue is illustrated in Figure 7, which shows a series of calibrated radiocarbon dates (on well-associated, short-lived samples, see section 3.2.2 and section 3.2.3) for two sites.

When faced with interpreting a graph such as Figure 7, most archaeologists inspect the probability distributions, visually assessing their widest limits (perhaps excluding low parts of the probability distributions from the edges of the graph), and estimate that activity on Site A happened between c. 2025 cal BC and c. 1750 cal BC; that this activity took place over several hundred years; and that Site B was occupied at a similar time and for a similar period.

These interpretations are importantly wrong. The calibrated dates in Figure 7 have been simulated, using a process of ‘back calibration’ from samples of known calendar date. For example, if we have a sample that actually dates to 1932 BC and produces a measurement with an error term of ±30 BP, then we can transfer the calendar date through the calibration curve to the radiocarbon age scale. Each simulation will produce a slightly different value because of the error term on the radiocarbon age. For example, 1932 BC might produce a simulated radiocarbon age of 3612±30 BP. This is then calibrated to produce a realistic estimate of the calibrated radiocarbon date that would be produced by a sample of this calendar age, in this case 2120–2090 cal BC (3% probability) or 2040–1880 cal BC (92% probability; using the probability method).
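The back-calibration procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to produce Figure 7: `toy_curve` is an invented stand-in for a real calibration curve, all function names are hypothetical, and the uncertainty on the curve itself is ignored.

```python
import math
import random

def toy_curve(cal_bc):
    """Hypothetical calibration curve: radiocarbon age (BP) for a year in cal BC.
    Invented for illustration; real work uses a published calibration curve."""
    offset = cal_bc - 1932
    return 3590.0 + 0.9 * offset + 25.0 * math.sin(offset / 35.0)

def simulate_age(true_cal_bc, error, rng):
    """Back-calibrate a known calendar date into one simulated radiocarbon age."""
    return rng.gauss(toy_curve(true_cal_bc), error)

def calibrate(age_bp, error, years):
    """Probability of each candidate calendar year, normalised to sum to 1."""
    dens = [math.exp(-0.5 * ((age_bp - toy_curve(y)) / error) ** 2) for y in years]
    total = sum(dens)
    return [d / total for d in dens]

rng = random.Random(1)
years = list(range(2300, 1599, -1))                      # 2300 to 1600 cal BC
ages = [simulate_age(1932, 30, rng) for _ in range(3)]   # each run differs slightly
probs = calibrate(ages[0], 30, years)
best = years[probs.index(max(probs))]                    # most probable year
print([round(a) for a in ages], best)
```

Repeating the simulation for every sample of known calendar date gives a set of simulated measurements whose calibrated distributions scatter around the true dates, as in Figure 7.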

So for the data in Figure 7, because we have simulated the radiocarbon dates ourselves from known calendar ages, we know that Site A was in use for 200 years between 2000 BC and 1800 BC and that Site B was in use for 40 years between 1925 BC and 1885 BC. In both cases, without formal statistical analysis, there is a very significant risk that past activity will be interpreted as starting earlier, ending later and enduring for longer than was actually the case.

Since estimating radiocarbon ages is a probabilistic process, calibrated radiocarbon dates scatter around the actual calendar dates of the samples. Given the uncertainties on most calibrated radiocarbon dates and the relative brevity of much human activity, this statistical scatter on the dates can be substantial in comparison to the actual duration and dates of the archaeological activity in question.

Proportionately, the quantity of scatter is greater when the actual period of dated activity is short and/or the number of radiocarbon dates is large (compare, for example, the scatter on the calibrated radiocarbon dates outside the actual calendar dates of the samples in Figure 7, Site A with those in Figure 7, Site B).

2.2 Bayesian Chronological Modelling

Bayesian statistics provide an explicit, probabilistic method for combining different sorts of evidence to estimate the dates of events that happened in the past and for quantifying the uncertainties of these estimates. This enables us to account for the relationships between samples during the calibration process.

The basic idea is encapsulated in Bayes’ theorem (Fig. 8), which simply states that we analyse the new data we have collected about a problem (‘the standardised likelihoods’) in the context of our existing experience and knowledge about that problem (our ‘prior beliefs’). This enables us to arrive at a new understanding (our ‘posterior belief’), which incorporates both our existing knowledge and our new data.

This is not the end of the matter, however, since today’s posterior belief becomes tomorrow’s prior belief, informing the collection of new data and their interpretation as the cycle repeats.
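In conventional notation, Bayes’ theorem can be written as below. The symbols are our own choice rather than those of Figure 8: θ stands for the unknown parameters (for example, the calendar dates of the samples) and y for the data (the scientific dates).

```latex
P(\theta \mid y) \;=\; \frac{P(y \mid \theta)}{P(y)} \times P(\theta)
```

Here P(θ) is the prior belief, the ratio P(y | θ) / P(y) is the standardised likelihood, and P(θ | y) is the posterior belief.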

Lindley (1991) provides an accessible introduction to the principles of Bayesian statistics.

2.2.1 Components of a Bayesian chronological model

When constructing a Bayesian chronological model, the scientific dates form the ‘standardised likelihoods’ component of the model (Fig. 8). They are the data to be reinterpreted in the light of archaeological prior beliefs. Most often these are calibrated radiocarbon dates, but it is also possible to include dates from coins, historical sources, dendrochronology and the results of other scientific dating methods such as luminescence and archaeomagnetic dating.

The second component in a chronological model is composed of our ‘prior beliefs’. These are no more than a formal, mathematical expression of our understanding of the archaeological context of the problem that we are modelling.

Sometimes it is clear that we have strong archaeological evidence of the relative chronology of the samples that have been dated: for example, when one dated grave cuts another. This type of clear relative sequence provided by archaeological stratigraphy often provides strong constraints on the calibration of dates from related samples in a site sequence (see section 5.7).

The tree-ring series used during wiggle-matching (see section 5.6) also provide strong prior beliefs for the relative dating of the sampled rings. At a wider scale, dates can be combined with other forms of archaeological information that provide relative sequences, such as typology (e.g. Needham et al. 1998) or seriation (e.g. Bayliss et al. 2013).

Other prior beliefs are less explicit, and sometimes seem so obvious that their importance in chronological modelling is not at first apparent. The most common information of this kind is that a group of radiocarbon dates are related. Most often this is because the samples collected relate to a single site, although other forms of relatedness, such as samples associated with particular pottery styles, can also be used (e.g. Healy 2012).

To return to the example considered in Figure 7, if we model the radiocarbon dates from each site using only the information that each group of measurements derives from a site, which began at some point in time and then was used relatively continuously until it ended, then we get the models shown in Figure 9.

These statistical models are clearly able to distinguish the scatter of radiocarbon dates that derives from the actual duration of activity in the past from the scatter that simply arises from the probabilistic process of radiocarbon dating.

Both models provide formal date estimates for the start and end of the relevant site that are compatible with the actual dates input into the simulation, and they clearly show that the activity at Site B was of much shorter duration than that at Site A.

Figure 9 also illustrates that Bayesian Chronological Modelling is not simply about refining the calibration of radiocarbon dates, although the outputs of the model (shown in black) are clearly more precise than the simple calibrated radiocarbon dates (shown in outline).

It is also possible to calculate distributions for the dates of events that have not been dated directly by radiocarbon measurements, such as the date when a site was established or abandoned. For example, the parameter ‘start A’ (Fig. 9) has been calculated using all the radiocarbon dates from the site (a–s) and the interpretation that it was occupied continuously until it was abandoned.

All of these measurements have also been used to estimate the date when the site went out of use (‘end A’; Fig. 9).

By comparing estimates such as these, it is possible to calculate new probability distributions to estimate the duration of phases of activity (e.g. ‘use A’; Fig. 10).

The posterior beliefs that are output by a Bayesian model are known as posterior density estimates (the distributions in black in Figure 9).

These probability distributions can be summarised as ranges, which are known as Highest Posterior Density intervals and are expressed in italics to distinguish them clearly from date estimates that have not been produced by modelling.
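One way of summarising a distribution as a Highest Posterior Density interval is sketched below. It is a minimal illustration that assumes the distribution has been evaluated on a regular grid of calendar years (written here as negative numbers for BC); the function name and the example probabilities are hypothetical, and modelling software may compute its ranges differently.

```python
def hpd_ranges(years, probs, mass=0.95):
    """Return the highest-density ranges holding at least `mass` of the probability.
    `years` must be in ascending order and `probs` must sum to 1."""
    # Take grid points from most to least probable until enough mass is covered
    order = sorted(range(len(years)), key=lambda i: probs[i], reverse=True)
    chosen, total = set(), 0.0
    for i in order:
        if total >= mass:
            break
        chosen.add(i)
        total += probs[i]
    # Join neighbouring chosen points into continuous ranges
    ranges, start = [], None
    for i in range(len(years)):
        if i in chosen and start is None:
            start = years[i]
        if start is not None and (i + 1 == len(years) or (i + 1) not in chosen):
            ranges.append((start, years[i]))
            start = None
    return ranges

years = list(range(-1935, -1925))            # 1935 BC to 1926 BC
probs = [0.02, 0.28, 0.30, 0.02, 0.01, 0.01, 0.02, 0.20, 0.12, 0.02]
print(hpd_ranges(years, probs, mass=0.85))
```

Because the most probable years are selected first, a distribution with two peaks is summarised as two separate ranges rather than one long one.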

2.2.2 Model calculation, validation and comparison

In theory, once the model has been defined, the posterior beliefs can be calculated using Bayes’ theorem (Fig. 8).

In practice, however, almost all chronological models have so many independent parameters that the number of possible outcomes to consider at a useful resolution makes such a calculation impractical (the exception is wiggle-matching, see section 5.6). For this reason, Markov Chain Monte Carlo (MCMC) methods are used to provide a possible solution set of all of the parameters of the model.
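The general idea of MCMC sampling can be illustrated with a deliberately simplified Metropolis sampler for the start and end of a single phase of activity. This sketch is not the algorithm used by any particular software package. It assumes that the dated events are spread uniformly through the phase, that each date has a Gaussian error directly on the calendar scale (so no calibration curve is involved), and that the prior on the start and end is flat between two limits. All names and input values are hypothetical, and calendar years are written as negative numbers for BC.

```python
import math
import random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_posterior(start, end, dates, error, lo, hi):
    """Log posterior (up to a constant) for a phase running from start to end."""
    if not (lo <= start < end <= hi):
        return -math.inf                       # outside the flat prior
    total = 0.0
    for d in dates:
        # Probability of this date if the event fell uniformly within the phase
        p = (phi((end - d) / error) - phi((start - d) / error)) / (end - start)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

def metropolis(dates, error, lo, hi, steps=20000, step_size=15.0, seed=0):
    """Random-walk Metropolis sampler returning a list of (start, end) pairs."""
    rng = random.Random(seed)
    start, end = min(dates), max(dates)        # a valid starting point
    current = log_posterior(start, end, dates, error, lo, hi)
    chain = []
    for _ in range(steps):
        new_start = start + rng.gauss(0.0, step_size)
        new_end = end + rng.gauss(0.0, step_size)
        proposed = log_posterior(new_start, new_end, dates, error, lo, hi)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            start, end, current = new_start, new_end, proposed
        chain.append((start, end))
    return chain

dates = [-2000 + 10 * i for i in range(21)]    # hypothetical dates, 2000-1800 BC
chain = metropolis(dates, 30.0, -2500.0, -1500.0, steps=5000, seed=1)
kept = chain[1000:]                            # discard the early, unsettled part
mean_start = sum(s for s, _ in kept) / len(kept)
mean_end = sum(e for _, e in kept) / len(kept)
print(round(mean_start), round(mean_end))
```

The retained pairs form the solution set from which distributions for the start, end and duration (end minus start) of the phase can be summarised. Whether the set is truly representative is the question of convergence.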

The degree to which a truly representative solution set has been generated is called ‘convergence’. A variety of diagnostic tools have been proposed to validate convergence, and all the software packages that have been developed to undertake Bayesian Chronological Modelling employ some form of convergence checking (that employed in OxCal is described by Bronk Ramsey 1995, 429).

Stability of the model outputs is not the only criterion by which models can be validated. We also need to consider whether the two components input into the model, the ‘prior beliefs’ and the ‘standardised likelihoods’, are compatible.

This compatibility can be at a particular level, for example considering whether a sample really fits into the sequence at the position where it has been placed in the model, or at a general level, for example, examining whether phase 1 is really earlier than phase 2.

At present the validation of Bayesian models is an inexact science, although several statistical approaches have been developed to assist in the identification of incorrect models and incompatible prior beliefs and standardised likelihoods.

Statistics alone cannot be relied upon to identify all the incorrect components of a model, and so archaeological critique of the character and context of the dated material, and scientific understanding of the complexities of radiocarbon dating, are key elements in model validation (see section 3.2.2 and section 3.2.3).

The first statistical method for assessing the compatibility of the components of a model is formal statistical outlier analysis (Christen 1994). In this method each measurement is given a prior probability of being an outlier (typically a low probability like 5%) and the date is further down-weighted in the model if it is inconsistent with the rest of the available information.

The output from the model is affected by this down-weighting, and in addition to the normal model outputs, a posterior probability for the sample being an outlier is also generated. Either this probability can be used to identify outliers and remove them, or the model that incorporates outlier weighting can be accepted (technically, this approach is a form of model averaging; Bronk Ramsey et al. 2010; see section 5.7).

This approach is available in several of the software packages that have been developed to undertake chronological modelling.
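The way a prior outlier probability is updated by the data can be illustrated with a simple two-component mixture. This is a sketch of the principle only, not the full outlier model of Christen (1994), in which outlier status is estimated together with every other parameter of the model. The function names, the `outlier_scale` parameter and the example values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def outlier_probability(measured, expected, error, prior=0.05, outlier_scale=300.0):
    """Posterior probability that a measurement is an outlier.
    `expected` is the value implied by the rest of the model; an outlier is
    assumed to scatter much more widely (by `outlier_scale`) around it."""
    in_model = normal_pdf(measured, expected, error)
    as_outlier = normal_pdf(measured, expected, math.hypot(error, outlier_scale))
    return prior * as_outlier / (prior * as_outlier + (1.0 - prior) * in_model)

consistent = outlier_probability(3600.0, 3600.0, 30.0)     # agrees with the model
inconsistent = outlier_probability(3750.0, 3600.0, 30.0)   # five sigma away
print(round(consistent, 3), round(inconsistent, 3))
```

A measurement that agrees with the rest of the information ends up with a posterior outlier probability below its prior value, while a strongly inconsistent one ends up well above it and is down-weighted accordingly.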

Secondly, we can consider the agreement indices provided by the OxCal software (Bronk Ramsey 1995, 429; 2009a, 356–7). These are not derived from a formal statistical approach and have the disadvantage that there is no theoretically defined cut-off applicable in all cases, but they do have the advantage that the model itself is not affected by the calculations. They are also easy to calculate and have proved useful and robust in practice for a wide range of case studies.

The individual index of agreement provides a measure of how well the posterior distribution (i.e. that incorporating the prior beliefs and shown in black in Figure 9) agrees with the standardised likelihood (i.e. the calibrated date shown in outline in Figure 9); if the posterior distribution is situated in a high-probability region of the standardised likelihood, then the index of agreement is high; if it falls in a low-probability region, it is low.
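The behaviour described above can be reproduced with a simple overlap calculation. This sketch assumes that the index takes the form of the overlap between the posterior distribution and the standardised likelihood, divided by the overlap of the likelihood with itself. The cited references should be consulted for the exact definition used by OxCal. The function name and the example distributions are hypothetical.

```python
def agreement_index(likelihood, posterior):
    """Overlap-style index: 100 * sum(likelihood * posterior) / sum(likelihood ** 2).
    Both inputs are probabilities on the same calendar grid, each summing to 1."""
    overlap = sum(l * p for l, p in zip(likelihood, posterior))
    self_overlap = sum(l * l for l in likelihood)
    return 100.0 * overlap / self_overlap

likelihood = [0.1, 0.2, 0.4, 0.2, 0.1]        # calibrated date on a five-point grid
unchanged = agreement_index(likelihood, likelihood)
at_peak = agreement_index(likelihood, [0.0, 0.0, 1.0, 0.0, 0.0])
in_tail = agreement_index(likelihood, [1.0, 0.0, 0.0, 0.0, 0.0])
print(round(unchanged), round(at_peak), round(in_tail))
```

On this definition, a posterior identical to the likelihood scores 100, one concentrated in the most probable part of the likelihood scores more than 100, and one pushed into a tail scores much lower.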

Most individual indices of agreement in a model should be above 60 (a threshold value obtained by simulation). Usually those that fall below this level are statistical outliers (see, for example, ‘9’ in Figure 9), although a very low index of agreement can also suggest that part of the model is wrong and needs further examination.

An overall index of agreement is then calculated for the model from the individual agreement indices, providing a measure of the consistency between the prior information and the scientific dates. Again, the model index of agreement generally has a threshold value of 60, and models that produce values lower than this should be subject to critical re-examination (for example, phase 1 is possibly not actually earlier than phase 2).

It should be noted that what is important statistically is that a model fails to meet the threshold (Amodel: 60), and so alarm bells are triggered. A higher model index of agreement is not necessarily ‘better’, because the agreement index is also influenced by the strength of the constraints incorporated into a model, so a model with more informative prior information will — all other things being equal — have a lower index of agreement than one with less informative prior beliefs.

While in practice outlier analysis and agreement indices almost always identify the same dates or prior constraints as problematic, these two approaches are alternatives and should not be used in the same model. They are, however, both compatible with rigorous archaeological critique of the character and context of the dated material and meticulous scientific examination of the complexities of radiocarbon dating. These are critical constituents in model validation and should be employed whichever statistical approach is chosen.

Having identified problems with particular dates, or with particular components of a model, these need to be resolved. Sometimes this involves a reassessment of the overall structure of a model — was phase 1 really earlier than phase 2, or could they have overlapped? In other cases, single dates need to be reinterpreted individually and handled appropriately.

The best way of dealing with such dates depends on our assessment of why they are problematic. The most common categories are:

  1. Misfits – dates that do not fit in the expected stratigraphic position, or that are inaccurate for some technical reason. Generally, samples that prove to be residual can be used as termini post quem for their contexts, but intrusive samples or inaccurate dates need to be excluded from the analysis. Sometimes it is possible to reinterpret the stratigraphy.
  2. Outliers – the 1 in 20 dates whose true calendar date lies outside the 2σ range. These must be retained in the model, as their exclusion would statistically bias the results; outlier analysis can be useful.
  3. Offsets – measurements that are systematically offset from the calibration data by a knowable amount. Reservoir effects can be accounted for in the calibration process (see section 1.6); if necessary, old-wood offsets can be accounted for in the modelling process (Dee and Bronk Ramsey 2014). Other types of offset will be rarely, if ever, encountered in English archaeology.

Having constructed a plausible chronological model, the next step in Bayesian modelling is to assess its sensitivity to different aspects of the model being incorrect. This construction of alternative models is called sensitivity analysis. One component of a model is changed, and it is rerun.

The posterior density estimates from the original model and its variant are then compared. When these outputs are very similar, the model can be regarded as insensitive to the component of the model that has been varied. When the outputs differ markedly, the model is sensitive to that component. Sensitivity analyses are useful not only in determining how far the outputs of a model are stable, but also help us to identify which components of a model are most critical.

This introduction to Bayesian Chronological Modelling inevitably masks many of the technical complexities of the method. It aims to provide enough understanding of the principles employed to enable archaeologists to collaborate actively with their specialist modellers.

It cannot be emphasised enough that modelling is a collaborative exercise that relies essentially on the skills, experience and understanding of participating archaeologists. The explicit expression of relevant archaeological knowledge and its appropriate inclusion in models is as critical a step in the modelling process as is the selection and dating of samples.

A general introduction to the application of the Bayesian approach to archaeological data is provided by Buck et al. (1996).

More specific introductions to building Bayesian chronologies in archaeology are provided by Bayliss et al. (2007a) and Bayliss (2007).

Bayesian Chronological Modelling

Statistical methods are required to handle relationships between dated samples. Bayesian statistics enable calibrated radiocarbon dates to be combined with other information we might have about a chronological problem, producing posterior beliefs that take account of all the evidence. Prior beliefs can be simply that all the dated samples are from a single site or associated with the same kind of pottery, but could include relative sequences provided by stratigraphy, seriation or the growth-rings in wood or charcoal.

Most Bayesian Chronological Models are calculated using Markov Chain Monte Carlo (MCMC) methods. The stability of a model is assessed by its convergence, and the compatibility of components of the model using outlier analysis or agreement indices. Most dates that are incompatible with a model are misfits, outliers or offsets. The best way to incorporate such samples in a model depends on an assessment of why they are problematic. The stability of model outputs to variations in the prior information included or the modelling approach adopted is assessed by constructing a series of alternative models as part of a sensitivity analysis.