12
Validity of Network Meta-Analysis

12.1 Introduction

Doubts have been expressed about the validity of network meta-analysis from the outset, and there has been a steadily growing literature around this that deserves to be evaluated and reviewed. Certainly, any statistical method has its limitations. In the case of network meta-analysis, the concerns have revolved mainly around its assumptions. There have been two main lines of empirical research.

Firstly, there is a series of reviews of applied literature that set out to document whether the assumptions of network meta-analysis have been checked or even mentioned. Unfortunately, there has been little consensus on what the assumptions actually are, and in a number of papers, the assumptions are stated in a form that is subtly different from how they have been stated in this book. For this reason we begin the chapter by reviewing the terminology surrounding network meta-analysis assumptions (Section 12.2) in the hope of dispelling some of the confusion.

A persistent assumption running throughout the clinical and methodological literature has been that direct evidence is superior to indirect evidence, although there seems to be no formal analysis that supports this claim. In Section 12.3 we present some ‘thought experiments’ that aim to clarify in quantitative terms the threats that unrecognised effect modifiers pose to valid inference in evidence synthesis. We ask whether indirect comparisons are more vulnerable, less vulnerable or just equally vulnerable to bias originating from unrecognised effect modifiers. The thought experiments also provide qualitative and quantitative insight into the threats to validity of inference in both pairwise and network meta-analyses.

A second line of empirical work in the literature has focussed on the validity of the ‘consistency assumption’. This consists of meta-epidemiological studies that set out to check whether, in large collections of evidence networks, the empirical data accords with this key assumption. We review this area in Section 12.4.

It will be evident from Chapter 7 on inconsistency that it is generally not possible to confirm that the evidence in any specific network is ‘consistent’. However carefully one checks for inconsistency, there is seldom sufficient data to reject the null hypothesis of consistency with any confidence. In networks without loops, the key assumptions cannot be tested at all. This has given rise to a legitimate concern with the validity of estimates from network meta-analysis, leading to attempts to assess the ‘quality’ of evidence that it generates, given the quality of input evidence. In Section 12.5 we suggest that attention should be shifted away from ‘quality’ and towards the robustness of the treatment recommendation to the quantitative assumptions about the evidence inputs.

12.2 What Are the Assumptions of Network Meta-Analysis?

12.2.1 Exchangeability

The assumptions of network meta-analysis can be stated in a variety of ways. One much-discussed question is whether or not network meta-analysis makes ‘additional’ assumptions over and above what is normally assumed in pairwise meta-analysis. This has taken on some importance because investigators want to know whether additional checks need to be made to ensure validity when conducting a network meta-analysis that were not necessary for a pairwise analysis. Specifically, does the guidance on conduct of ‘high quality’ pairwise meta-analysis, such as that contained in the Cochrane Handbook (Higgins and Green, 2008), need to be extended for a network meta-analysis? There is a premise here, of course, that the ‘standard’ methods for pairwise meta-analysis are sufficient to ensure its validity: as we shall see, this is open to question.

The issue has become further clouded because many of the assumptions of network meta-analysis can be derived from others, and it can become difficult to tell whether or not a particular assumption is ‘additional’ to assumptions that are already made for pairwise meta-analysis. We shall try to unpick which assumptions are derived from others, which are re-statements of the same assumptions in other terms and which, if any, are additional.

Random effects pairwise meta-analysis makes an assumption of exchangeability of trial-specific treatment effects (see Chapter 2). (Fixed effects models are a special case where there is no between-trial variation because every study estimates the same effect.) Note that exchangeability is a relationship between parameters in a hierarchical model. In a pairwise meta-analysis, predictive cross-validation (DuMouchel, 1996) is one way of demonstrating that the exchangeability assumption has not been met (see Chapter 3). Comparison of a random effects model with an unrelated mean effects model using DIC and related model fit statistics (Chapter 7) might also indicate a lack of exchangeability if the random effects model failed to demonstrate shrinkage. But it is probably impossible to verify that the exchangeability requirement has been met.

Moving to network meta-analysis, the only assumption required is, again, exchangeability of the true trial-specific effects. In the context of a network meta-analysis, it needs to be understood that this applies to the entire ensemble of trials, including treatment effect parameters that may have never been observed: for example, the δi,BC effects in AD trials are to be considered exchangeable with those in BC trials. One might imagine that the entire set of M trials had each included all S treatments and that subsequently some of the treatment arms went ‘missing at random’ (MAR). This is most simply interpreted as meaning that the missingness is without regard for the presence of effect modifiers. Note that there is no requirement that each treatment is missing with an equal probability. Nor is it necessary that missingness be unrelated to absolute efficacy of each treatment. The only requirement is that missingness is unrelated to relative efficacy: it is, after all, the trial-specific relative treatment effects that are assumed to be exchangeable. Arms are MAR, conditional on the original choice of trial design. Trialists are free to choose which treatments to include, without introducing bias.

It is easily shown that the consistency assumptions follow mathematically from exchangeability (Lu et al., 2011) (see Exercise 12.1). Consistency is a relationship between the true treatment effects or, in a random effects model, between the means of the random effects distributions. Technically, then, this completes everything that needs to be said about the assumptions of network meta-analysis: exchangeability is the only assumption and exchangeability implies consistency.

12.2.2 Other Terminologies and Their Relation to Exchangeability

However, other investigators have expressed different views, using different terminologies, not always in the same way. In a key paper, Song et al. (2009) refer to three separate assumptions: homogeneity, similarity and consistency. Homogeneity is said to be the standard assumption in pairwise meta-analysis. In a network meta-analysis, in this formulation, homogeneity should be fulfilled separately for each pairwise contrast. Similarity and consistency are seen as additional assumptions required for network meta-analysis. It appears that the term homogeneity is being used to stand for exchangeability, perhaps implying an additional requirement that the degree of variation is not too large. While we certainly agree that the extent of heterogeneity is a major issue, and in fact the most important one (see succeeding text), we do not believe that the assumptions required for valid evidence synthesis include specific limits on the extent of between-trial variation. Exchangeability implies variation, albeit variation with very specific properties (see Chapter 2), but it implies nothing about the degree of variation.

One of the difficulties in this literature is that definitions are somewhat informal and lacking in mathematical development. Exchangeability is clearly a property of the true trial-specific relative treatment effect parameters, δi,XY, and consistency is a property of pooled mean relative treatment effects. The view that there are three assumptions, homogeneity, similarity and consistency, and that they are all separate assumptions that must be met, has been repeated many times, but the terms are not used in precisely the same way. Similarity has been described as a property of ‘trials’ or of ‘moderators of relative treatment effects’ (Song et al., 2009). Elsewhere the consistency assumption is said to refer to ‘evidence’ (Edwards et al., 2009). But other authors using the same homogeneity, similarity and consistency terminology are clear that these are properties of parameters and also note that they are very closely related (Donegan et al., 2011).

Cipriani et al. (2013) explain that ‘the main assumption (…) is that there are no important differences between the trials making different comparisons, other than the treatments being compared’. This is an elegant and simple way of stating the exchangeability assumption and an endorsement that it is the only assumption required.

Another term that has been introduced is transitivity (Baker and Kramer, 2002; Salanti, 2012; Cipriani et al., 2013), which is, like consistency, a relationship between parameters, and it is usually accepted that the terms are equivalent (Salanti, 2012). Some experts prefer it because it reminds us that the mathematical relationships between parameters that need to be true for valid inference must also be in place even when no evidence loops are involved, for example, with indirect comparisons. Both transitivity and consistency are relationships between expectations of parameters in a loop; both are implied by exchangeability, but neither implies exchangeability (see Exercise 12.2). However, once again, there is a lack of formal development and an unclear and inconsistent use of terms. In some papers it is implied that consistency can hold when transitivity does not (Salanti, 2012). Elsewhere transitivity is supposed to be another way of referring to similarity of study characteristics (Puhan et al., 2014). Yet another usage has transitivity as a property required of estimates rather than a property of parameters (Baker and Kramer, 2002).

A series of empirical reviews of the methodological quality of applied network meta-analysis (Edwards et al., 2009; Song et al., 2009; Donegan et al., 2011) have tried to document the extent to which researchers have checked or even shown awareness of particular assumptions. This is a difficult exercise to carry out and interpret in the absence of clarity on what the assumptions actually are. We expect we are not the only authors who have submitted a network meta-analysis with a thorough analysis of consistency only to have a reviewer complain that we have not examined, or even mentioned, transitivity (when in fact this is just another name for consistency). All this gives weight to calls to regularise the way network meta-analyses are reported (Bafeta et al., 2014; Hutton et al., 2014).

In view of the overlap and redundancy in these conflicting formulations and the inconsistent use of terminology, we prefer to couch the assumptions of network meta-analysis in terms of exchangeability, a long-standing concept whose properties are well studied (Anon, 2015).

However, none of this addresses the legitimate concerns of investigators wishing to ensure that their estimates are valid. Although it is easy to criticise the literature on the assumptions of network meta-analysis for spreading terminological confusion, it is a body of literature that correctly identifies the issues that must be addressed if valid inferences are to be drawn from network meta-analysis. In Section 12.3, we explore this in a slightly more formal way through a series of thought experiments, which have two purposes. The first is to determine whether direct evidence is more likely to lead to valid inference than indirect evidence. The second is to identify the main factors that determine whether inference in evidence synthesis is likely to be valid or not and to gain a more quantitative understanding of their impact.

12.3 Direct and Indirect Comparisons: Some Thought Experiments

Many investigators have assumed that ‘direct’ evidence is superior to ‘indirect’ (Song et al., 2009; Donegan et al., 2011). Cranney et al. (2002) state

…an apparently more effective treatment may have been tested in a more responsive population. … Conclusions about the relative effectiveness of therapies must await results of head-to-head comparisons.

But if this is interpreted as a concern about the distribution of an unrecognised effect modifier, how can we be sure that any additional direct evidence will be any better? If the effect modifier is unrecognised, there will be no way of knowing whether it is present in the proposed head-to-head comparison or not.

Another claim is that indirect comparisons are ‘observational studies’ (Higgins and Green, 2008):

Indirect comparisons are not randomized comparisons…. They are essentially observational findings across trials, and may suffer the biases of observational studies, for example due to confounding…. unless there are design flaws in the head-to-head trials, the two approaches should be considered separately and the direct comparisons should take precedence as a basis for forming conclusions.

But other authorities have regarded even pairwise meta-analysis as ‘observational’ (Victor, 1995; Egger et al., 1997).

Besides the fact that indirect estimates tend to have greater variance than direct estimates, there has been little in the way of formal analysis to support any of these claims. If a collection of five trials is ‘observational’, why should a single trial be any different? And how can a meta-analysis, which is supposed to be more reliable than a single trial, be observational while a trial is not? Or is a single trial an observational study, too?

Randomisation protects each trial from confounding factors: these are covariates that affect the trial outcome (without necessarily changing the relative treatment effect) and are distributed differently in each arm because of the selection biases that can occur in observational studies. In principle, therefore, each trial delivers an unbiased estimate of the treatment effect for its trial population. Neither a single trial nor a collection of trials is an observational study in the usual sense. The difficulty with trials is not one of confounding, but of effect modification. A random effects meta-analysis model delivers a weighted average of the unbiased trial estimates, and the pooled estimate is therefore also, in a sense, unbiased. If the variation is due to patient population heterogeneity, the problem is that we no longer know what the pooled estimate is an estimate of. In the presence of unrecognised effect modifiers, it is not clear what each trial is estimating nor what the pooled estimate means. The between-study variance in the random effects model actually reflects the uncertain relevance of the trial data to our target population. This is the rationale for using the predictive distribution rather than the distribution of the mean treatment effect (Section 5.6.2) and why we emphasise in Section 12.3.2 and throughout the book (Chapters 7–9) that between-trial variation should be minimised.

In Chapter 1 we point out that the equation dBC(indirect) = dAC(direct) − dAB(direct) tells us that indirect estimates inherit their properties exclusively from the direct estimates they are made up from. In the next section we use this simple relationship to sketch out a formal investigation of the issues and provide some preliminary results.

12.3.1 Direct Comparisons

To avoid having to worry about sampling error, our ‘thought experiments’ will be conducted on the results of infinite-sized trials. Consider an indicator Q for the presence of a trial-level effect modifier. When Q = 0, the true effect of B relative to A is δAB; when Q = 1, the effect is δAB + θ. For simplicity, and without prejudice to the argument, we are going to assume that this effect modifier reflects a trial characteristic that is either 100% present or 100% absent in any trial – for example, Q = 0 is a primary care setting and Q = 1 a secondary care setting. As usual, if investigators were aware that Q was an effect modifier, separate analysis or covariate adjustment should be used to analyse data from AB trials (Chapter 8).

If Q is an unrecognised covariate, we are immediately in difficulty because it is now unclear what the target parameter for inference is. Let us introduce a new parameter π, the probability that the investigator will happen to choose a Q = 1 trial setting. If hundreds of infinite-sized trials were run, we would expect the pooled estimate to be close to δAB + πθ, as some trials would be conducted in settings where Q = 1, while others would have Q = 0. Although we probably all feel some discomfort at this definition of the target parameter, it is nevertheless what the meta-analysis delivers. The target parameter is thus dependent on the prevalence of Q in the ‘population’ of trials. This simply restates what has been known for a long time: there is no population interpretation for a random effects model (Rubin, 1990). Of course, if we knew that 80% of the target population were treated in a primary care setting and we were aware this was an effect modifier, we would adopt δAB + 0.2θ as our target parameter and the proportion of trials in which Q was 0 or 1 would become irrelevant (Rubin, 1990).

Under these circumstances no single trial can ever be ‘on target’: the estimate from every trial is biased. A trial will either estimate δAB, if Q = 0, or δAB + θ, if Q = 1; it will never estimate the target δAB + πθ. We can show that the expected error is 0:

E(error) = (1 − π)(−πθ) + π(1 − π)θ = 0     (12.1)

However, this represents the error that would be seen ‘in the long run’ if we conducted hundreds of trials. In practice, when confronted with a single trial, we are unaware of the effect modifier, and we therefore do not know whether the result is an overestimate or an underestimate of the ‘long run’ expected effect. In these circumstances, the expected absolute error is the appropriate statistic that reflects the size of the deviation of the observed estimate from its true long-term value:

E|error| = (1 − π)|−πθ| + π|(1 − π)θ| = 2π(1 − π)θ

If π = 0.5, for example, the expected absolute error is 0.5θ (Table 12.1). This seems at first sight to be a startling result, but in fact it accords with what many commentators have always noted – trials often give the ‘wrong’ results (Ioannidis, 2005), as a result of random sampling of the effect modifier.

Table 12.1 Expected error and expected absolute error in a ‘direct comparison’ meta-analysis with N = 1 RCTs in the presence of an unrecognised effect modifier, present with probability π = 0.5.

Outcome | Trial outcome | Pr(outcome) | Meta-analysis estimate | Error | Absolute error
1 | δAB | 0.50 | δAB | −θ/2 | θ/2
2 | δAB + θ | 0.50 | δAB + θ | +θ/2 | θ/2
Expectation | | | | 0 | θ/2

Pooled effect target parameter is δAB + θ/2.

Having looked at the expected absolute error as a measure of bias in a single trial in the presence of an unrecognised effect modifier, we now look at single realisations of meta-analyses consisting of two, three or more trials. Consider a meta-analysis of two trials with π = 0.5. There is a 25% chance that Q = 1 in both trials, a 25% chance that Q = 0 in both trials and a 50% chance that one trial has Q = 1 and the other Q = 0. Here the expected absolute error is θ/4 (Table 12.2). As the number of trials increases, the expected absolute error decreases (see Exercise 12.3). This is portrayed in Figure 12.1: with π = 0.5, the expected absolute error decreases to 0.125θ when the number of trials M = 10 and then to 0.08θ at M = 20 and 0.065θ at M = 40 trials.

Table 12.2 Expected error and expected absolute error in a ‘direct comparison’ meta-analysis with N = 2 RCTs in the presence of an unrecognised effect modifier, present with probability π = 0.5.

Outcome | Trial outcomes | Pr(outcome) | Meta-analysis estimate | Error | Absolute error
1 | δAB, δAB | 0.25 | δAB | −θ/2 | θ/2
2 | δAB, δAB + θ | 0.50 | δAB + θ/2 | 0 | 0
3 | δAB + θ, δAB + θ | 0.25 | δAB + θ | +θ/2 | θ/2
Expectation | | | | 0 | θ/4

Pooled effect target parameter is δAB + θ/2.

[Figure 12.1: five descending curves of expected absolute error against the number of trials, one for each of π = 0.5; π = 0.4, 0.6; π = 0.3, 0.7; π = 0.2, 0.8; and π = 0.1, 0.9.]

Figure 12.1 Direct comparisons: expected absolute error in meta-analyses, in units of θ, as a function of the number of trials and the population proportion π of trials with a trial-level effect modifier that adds θ to the treatment effect.

Note in passing that the thought experiments extend naturally to continuous trial-level covariates where we would sample Q from a continuous distribution and have a distribution of interaction terms.

We conclude that in the presence of unrecognised effect modifiers, direct comparisons are biased with respect to their ‘population’ target parameter, but that the degree of bias falls steadily as the number of trials increases (roughly in proportion to 1/√M). These calculations provide theoretical support for the widely held intuition that meta-analysis becomes more reliable as the number of trials increases. But they also show that the bias – the difference between what would be observed and the target parameter – is directly proportional to the size of the interaction effect θ. This again points to the importance of not only identifying known effect modifiers in advance but also taking deliberate action to limit the extent of heterogeneity from unknown sources.
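
The expected absolute errors quoted above, and those plotted in Figure 12.1, can be reproduced by summing over the binomial distribution of the number of Q = 1 trials. The short Python sketch below is ours, not part of the original text; it simply encodes the assumptions of the thought experiment (infinite-sized trials, a single binary trial-level effect modifier adding θ, equal weighting of trials in the pooled estimate), and the function name is only illustrative.

```python
from math import comb

def expected_abs_error(m_trials, pi, theta=1.0):
    """Expected absolute error of a 'direct' meta-analysis of m_trials
    infinite-sized trials when an unrecognised trial-level effect modifier,
    present with probability pi, adds theta to the treatment effect.
    The target parameter is delta_AB + pi * theta; trials are weighted equally."""
    total = 0.0
    for k in range(m_trials + 1):                 # k = number of trials with Q = 1
        prob = comb(m_trials, k) * pi**k * (1 - pi)**(m_trials - k)
        pooled_minus_target = (k / m_trials - pi) * theta
        total += prob * abs(pooled_minus_target)
    return total

# Reproduces Table 12.1 (M = 1), Table 12.2 (M = 2) and, approximately, the
# pi = 0.5 curve of Figure 12.1: 0.5, 0.25, 0.123, 0.088, 0.063 (in units of theta)
for m in (1, 2, 10, 20, 40):
    print(m, round(expected_abs_error(m, pi=0.5), 3))
```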

12.3.2 Indirect Comparisons

We now extend the same analysis to indirect comparisons: our estimate of the AB effect will be formed from AC and BC trials. Again, to keep it simple but without prejudice to the generality of the argument, we will assume that all trials are of infinite size and that there are always an equal number of AC and BC trials. We begin with a scenario where the effect modifier that acts on the AB effect also acts, to exactly the same extent, on the AC comparisons, but does not affect the BC comparisons. A typical scenario would be where A is placebo and B and C are active treatments in the same class. We therefore have an AC effect of δAC + θ when Q = 1, and the target parameter, of course, is unchanged: δAB + πθ.

Consider first a single AC and a single BC trial. With π = 0.5 the AC trial has a 50% chance of being run in a Q = 0 setting and a 50% chance of Q = 1. Therefore the expected pooled estimate is δAB + θ/2, and the expected bias is zero. The expected absolute error, however, is again θ/2. Because the BC trial is immune to the effect modifier, the indirect estimate of δAB inherits its expected absolute error from the AC trial. Clearly, as we go from one each of AC and BC trials to two or more each, the expected absolute error in the indirect estimate will always be exactly the same as the expected absolute error in a direct estimate based on the same number of trials.

In Chapter 1 we pointed out that because dAB(indirect) = dAC(direct) − dBC(direct) is an equation, the left-hand side can only be biased if the right-hand side is biased. The previous discussion is just a further illustration: where just one of the direct estimates is biased, the indirect estimate is biased to exactly the same degree.

This suggests that if both direct estimates are biased, then the indirect estimate might inherit a double dose of bias. This is precisely what happens if we consider a second indirect comparison scenario. Suppose we now have the same set of treatments (A placebo, B and C active and of the same class), but we now wish to make inferences about δBC from the AC and AB trials. Starting again with a single AB and a single AC trial, there are four possible outcomes, and the expected absolute error is θ/2, as before. However, as the number of trial pairs increases, the expected absolute error continues to decrease, but the degree of bias is always greater than in direct comparisons (Figure 12.2), because both direct estimates can be biased (see Exercise 12.4).


Figure 12.2 Expected absolute error, in units of the interaction term, in direct comparisons, or indirect comparisons where only one of the direct contrasts is subject to an effect modifier present with probability π = 0.5 (lower curve), and in indirect comparisons where both direct contrasts are subject to the same effect modifier (upper curve).

We can summarise these findings as follows. In the presence of unrecognised effect modifiers, both direct and indirect evidence are biased (relative to the target parameter) due to random sampling of the effect modifier. The extent of bias is identical if only one of the sources of direct evidence is subject to the effect modifier. Where both sources of direct evidence are subject to the same effect modifier, the indirect evidence is even more biased, but the notable finding, perhaps, is that in this scenario the direct estimate is unbiased in the presence of an unrecognised effect modifier, because the modifier affects the effects of both treatments equally and cancels out of their comparison. This is, in fact, exactly what we observed in Chapter 8 when exploring covariate effects using meta-regression: with a single observed covariate acting on AB and AC trials, the regression coefficients for the interaction terms in BC trials cancel out.
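
The contrast between the two indirect-comparison scenarios can be checked numerically. The sketch below is again ours, under the same stylised assumptions as before; it enumerates the joint binomial distribution of the number of Q = 1 trials on each of the two contributing contrasts and, with π = 0.5, reproduces the pattern of Figure 12.2: the ‘both contrasts affected’ error matches the direct error for a single trial per contrast and exceeds it thereafter.

```python
from math import comb

def binom_pmf(m, k, pi):
    return comb(m, k) * pi**k * (1 - pi)**(m - k)

def e_abs_error_direct(m, pi, theta=1.0):
    """Direct comparison (or indirect comparison in which only one of the two
    contributing contrasts is affected by the effect modifier)."""
    return sum(binom_pmf(m, k, pi) * abs(k / m - pi) * theta for k in range(m + 1))

def e_abs_error_indirect_both(m, pi, theta=1.0):
    """Indirect BC estimate from m AB and m AC trials when the same effect
    modifier acts on both the AB and the AC effects. The target d_BC is
    unaffected, so the error is (k_AC - k_AB) * theta / m."""
    total = 0.0
    for k_ab in range(m + 1):
        for k_ac in range(m + 1):
            prob = binom_pmf(m, k_ab, pi) * binom_pmf(m, k_ac, pi)
            total += prob * abs(k_ac - k_ab) * theta / m
    return total

for m in (1, 2, 5, 10, 20):
    print(m, round(e_abs_error_direct(m, 0.5), 3),
          round(e_abs_error_indirect_both(m, 0.5), 3))
```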

The scenarios we have explored previously are for ‘fixed’ effects modifiers, as used in meta-regression (Chapter 8). This form of analysis can be extended to random biases (Chapter 9) and to multiple effect modifiers. Under some circumstances random biases – or to be more precise their expectations – may ‘cancel out’ (Song et al., 2008), making indirect comparisons potentially less biased than direct. But in other cases, where the indirect comparison is formed by addition rather than subtraction, the expected biases add as well as their variances, making indirect comparisons substantially worse.

The prediction that direct comparisons between active treatments in the same class will be relatively unbiased is supported empirically by studies that show that active–active comparisons tend to have a lower degree of between-trial heterogeneity (Turner et al., 2012). Conversely, the finding that indirect comparisons are particularly vulnerable to bias when the comparators are in the same class is significant, as this is probably the most common kind of indirect comparison. This situation is portrayed in Figure 12.3, which shows a constant dBC effect, at all levels of the effect modifier, but highly variable results when this is estimated indirectly from AB and AC trials.

[Figure 12.3: treatment effect plotted against the covariate, with two parallel ascending lines for the AB and AC trials, a line for the constant dBC effect, and the direct and indirect estimates marked between the two parallel lines.]

Figure 12.3 Illustration of the difference between direct and indirect estimates formed from two direct comparisons that are both affected by an unrecognised effect modifier.

The thought experiments established that, under a wide range of circumstances, meta-analytic estimates based on direct and indirect evidence are equally prone to error in the presence of unrecognised fixed effects modifiers. Under some circumstances, indirect may be more biased, and direct virtually unbiased. Two further important findings are as follows: (i) the degree of bias increases in proportion to the size of the interaction effect associated with the effect modifier and (ii) the degree of bias decreases as the number of studies increases.

12.3.3 Under What Conditions Is Evidence Synthesis Likely to Be Valid?

These last two findings have interesting theoretical and practical implications. On the practical side, investigators may not have much control over how many studies are included in their evidence synthesis, but they are in a position to create conditions under which the risks of invalid inference are curtailed simply by limiting the size of potential interaction effects, which means limiting the degree of clinical heterogeneity at the study inclusion/exclusion stage. Many of the key practical consequences have already been discussed, such as the need to avoid ‘lumping’, that is, to avoid averaging relative treatment effects over different doses, different co-treatments or over patient groups on first, second or third line therapy. We have in fact argued that, in a decision-making context, it would make no sense to average over these factors in the first place. But even from a strictly evidence synthesis perspective, averaging over different doses is a strange way of pooling evidence, because different doses are, obviously, deliberately designed to have different effects. Similarly, patients who have failed on a specific class of treatments are, by definition, already known to react to them differently than patients who have not.

Thinking more broadly about effect modifiers and heterogeneity, our thought experiments suggest there would be great benefit in a more systematic exploration to assess vulnerability of inference. This could consist of a systematic review of the literature on potential effect modifiers, particularly IPD analyses of trial data. The distribution of effect modifiers across trials, and especially across contrasts, including severity at the start of trial, should be explicitly tabulated (Jansen and Naci, 2013). Covariate adjustment via meta-regression can then be considered, either as a base-case analysis or as a sensitivity analysis. Careful attention should be paid to indicators of potential heterogeneity. For example, if the absolute effect under a given treatment is similar across trials, this can give some reassurance that a degree of patient homogeneity across trials is in place. Wide variation in study-specific baseline values does not necessarily indicate heterogeneity in relative effects, but it does point to clinical heterogeneity in trial populations, which in turn indicates that estimated effects are more vulnerable to the presence of effect modifiers.

At a theoretical level, the thought experiments tell us that as long as heterogeneity can be kept at a low level, the validity of both pairwise and network meta-analysis does not really depend on whether the abstract mathematical conditions of exchangeability are met. In fact, if the degree of heterogeneity can be kept low, exchangeability is almost an irrelevance. After all, if the between-trial standard deviation is only 10% of the treatment effect, we might not be too concerned about deviations from exchangeability. Conversely, if the between-trial standard deviation was as high as 50% of the mean treatment effect, which is by no means unusual, we should be quite concerned particularly if there are only a small number of trials on influential comparisons, even if we knew that exchangeability was in place. As the number of trials increases, we could be less worried about the degree of heterogeneity, but we would then be vulnerable to failures in exchangeability.

We can now distinguish between two reasons why we might observe inconsistency between direct and indirect evidence. The first, due to failure in exchangeability, we might call intrinsic inconsistency. In the language of the thought experiments, the probability that Q = 1 in AB trials, πAB, is different from the corresponding probability πAC in AC trials. In the second case we have chance inconsistency: here the underlying probability that Q = 1 is the same in both AB and AC trials, but we obtain a different distribution of effect modifiers through random sampling.

We might summarise this as follows. From a technical point of view, exchangeability is the only assumption both in network and in pairwise meta-analyses. But the validity, or otherwise, of estimates in both cases depends on the size of interaction effects (heterogeneity) and the number of trials, whether or not the exchangeability assumption is met. Given that it is virtually impossible to ensure that exchangeability is in place, validity can only be ensured by limiting the degree of heterogeneity. Thus, while ‘homogeneity’ and ‘similarity’ (Song et al., 2009) are not strictly necessary assumptions, investigators would be well advised to pay them very close attention.

12.4 Empirical Studies of the Consistency Assumption

On the face of it, meta-epidemiological studies testing the consistency assumption using the methods set out in Chapter 7 would seem to be the obvious way of gaining some understanding of the validity of network meta-analysis. In the largest study of this type, looking at 112 triangular networks, significant (p < 0.05) inconsistency was reported in 16 (14%, 95% CI 9–22%) networks based on the Bucher method (Song et al., 2011). There are technical difficulties in interpreting this, because under the assumption of consistency there are quite strong constraints on the between-trial variances in loops of evidence (Lu and Ades, 2009), which were not taken into account. This was explored in a more thorough investigation of network inconsistency (Veroniki et al., 2013) looking at 40 networks with four or more treatments containing over 300 evidence loops. The networks were identified from the PubMed system. This paper showed that 9% of triangle loops would be inconsistent using the simple Bucher method adopted by Song et al. (2011), but only 5% when a more stable and consistent estimate of the variance of effects was used. However, using the tests for design inconsistency (Jackson et al., 2014) (see also Chapter 7), they found that eight (20%) of the networks showed evidence of global inconsistency.
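
For readers unfamiliar with it, the Bucher method referred to in these studies is, in essence, a z-test comparing the direct estimate of one contrast in a loop with the indirect estimate formed from the other two contrasts. The following is a minimal Python sketch of that idea, ours rather than code from any of the studies cited, with made-up log odds ratios and standard errors.

```python
from math import sqrt, erf

def bucher_inconsistency(d_ab, se_ab, d_ac, se_ac, d_bc, se_bc):
    """Bucher-style test of loop inconsistency in an A-B-C triangle.
    d_xy is the pooled direct estimate of the effect of Y relative to X on the
    linear predictor scale (e.g. a log odds ratio); se_xy is its standard error."""
    d_bc_indirect = d_ac - d_ab                      # consistency relation: d_BC = d_AC - d_AB
    w = d_bc - d_bc_indirect                         # inconsistency estimate
    se_w = sqrt(se_bc**2 + se_ac**2 + se_ab**2)
    z = w / se_w
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal p-value
    return w, se_w, z, p

# Hypothetical numbers, for illustration only
print(bucher_inconsistency(d_ab=-0.5, se_ab=0.15, d_ac=-0.8, se_ac=0.20, d_bc=-0.1, se_bc=0.25))
```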

Confirming the results of our thought experiments in the previous section, these studies reported a strong negative relation between the risk of inconsistency and the number of studies informing the evidence loop (Song et al., 2011), and a high risk of inconsistency in loops where one of the contrasts is informed by only a single study (Veroniki et al., 2013). As noted in Section 12.3, this does not necessarily indicate a failure in the technical exchangeability assumption: a higher probability of chance imbalances in effect modifiers introduced through random variation is exactly what we would predict from the thought experiments.

The significance of findings from such meta-epidemiological studies is far from clear. If an evidence network is inconsistent, then we know that either the exchangeability assumptions required for valid network meta-analysis were not met (intrinsic inconsistency) or they were met but chance imbalances in effect modifiers were enough to lead to ‘inconsistency’ being detected.

If the evidence networks are based on pairwise systematic reviews that do not have the same trial inclusion/exclusion criteria, the data generation process might easily deliver a systematic imbalance in effect modifiers, constituting a deviation from exchangeability. This might lead to inconsistency in even the largest networks. On the other hand, if the entire evidence network was assembled under a common inclusion/exclusion protocol, the risk of inconsistency is distinctly lower in large networks, but where small numbers of trials are involved, chance imbalances in effect modifiers are very likely.

Thus, we can only make sense of meta-epidemiological studies if we go back to examine the precise protocols that were followed and look for relationships between the protocols and the findings. The same misgivings must apply to other meta-epidemiological studies of network meta-analysis (Chaimani and Salanti, 2012).

Cochrane reviews are carried out under rigorous protocols that are open to examination. But there appears to be nothing in these protocols that is specifically intended to have the effect of guaranteeing, or helping to guarantee, the exchangeability assumptions that are required for valid evidence synthesis. Nor is there an explicit recognition that clinical heterogeneity must be kept to a minimum, whether to limit the impact of potential deviations from exchangeability or just to ensure the interpretability of the pooled mean. We could, for example, examine the literature identification and selection protocols to see how recognised effect modifiers were to be dealt with. A particular issue is whether the inclusion criteria would result in pooling estimates from different doses, different co-treatments and different formulations or for first, second and third line therapies. As we have seen, ‘lumping’ over all of these is very common in many Cochrane reviews (see Preface and Chapter 1). It may be no coincidence that inconsistency was found in 16/85 Cochrane reviews compared with 0/27 non-Cochrane reviews (p = 0.011) (Song et al., 2011), although this could also be due to the numbers of trials informing each contrast.

In a scientific context where one is asking whether a product or intervention ‘in principle’ has an effect, lumping may be appropriate given the ‘hypothesis testing’ rationale (Gotzsche, 2000). In a decision-making context, by contrast, the emphasis is on the coherent estimation of a set of specific relative effects. It would normally be inappropriate to generate effect estimates that were averaged over dose or stage of disease progression, as such estimates have no interpretation for decision makers, and it is debatable as to whether they have any clinical meaning.

It is significant that the meta-epidemiological literature on inconsistency has so far been restricted to systematic reviews and clinical studies and has not looked at the use of network meta-analysis in health technology assessment, for example, in submissions to the NICE technology appraisal process or to similar bodies in other countries. In our experience, evidence networks assembled for decision-making seldom show signs of inconsistency, which we attribute to narrower definitions of treatment and patient population. Empirical studies of inconsistency in this literature would be worthwhile.

12.5 Quality of Evidence Versus Reliability of Recommendation

While previous sections have looked at the validity question from the perspective of someone about to undertake a review and evidence synthesis, we now consider the same questions as they might occur to someone reading a report of the results. Given that a clear statement on the presence or absence of inconsistency will seldom be possible, what can be said about the validity of a network meta-analysis? Once again, a small amount of theory takes us a long way, and before reviewing the proposals that have been made, we begin by reminding ourselves of the formal properties of network estimates.

12.5.1 Theoretical Treatment of Validity of Network Meta-Analysis

In pairwise meta-analysis, whether based on fixed or random effects models, the summary pooled estimates are weighted averages of the treatment effect estimates from the original trials. The same is true in network meta-analysis: each coherent estimate is a linear combination of coefficients (weights) and contrast-specific treatment effects (Lu et al., 2011). For example, in a network with AB, AC, BC, AD and BD trials, the coherent estimate of the CD effect can be written as

dCD(network) = β1dAB(direct) + β2dAC(direct) + β3dBC(direct) + β4dAD(direct) + β5dBD(direct)     (12.2)

This finding has a number of important implications. The first is that if we are satisfied that the in-going treatment effects are unbiased estimates for the target population, we can be confident that the out-going coherent estimates are also unbiased. The network meta-analysis cannot, for example, ‘add bias’ that is not already present in the input evidence.

A second implication is that in cases where there are doubts about potential bias in the input study estimates, we should take action at the outset to manage this. Bias adjustment of individual studies is a subjective process, but if a consensus can be reached on the distributions of study-specific biases (Turner et al., 2009), investigators can be confident in a network meta-analysis based on the adjusted trial estimates. Alternatively, the methods of Chapter 9 can be used to adjust out generic biases in the network model, although we should view the results with some caution, as with any form of meta-regression.

A third implication is that, again, in the presence of potential bias, we can use the weight coefficients (β1, β2, …) in equation (12.2) to tell us exactly how much impact a potential bias in, say, the estimate of δBC could be having on a final estimate such as dCD. We return to this in Section 12.5.3.
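
As an illustration of this linear relationship, the sketch below computes the weights of equation (12.2) for the example network by a fixed effect weighted least squares fit on the direct contrast estimates. It is our own simplified illustration with hypothetical numbers: unlike the Bayesian random effects models used elsewhere in this book, it ignores heterogeneity, which would alter the weights, but the principle that each coherent estimate is a weighted combination of the direct estimates (Lu et al., 2011) is the same.

```python
import numpy as np

# Treatments A, B, C, D; basic parameters d_AB, d_AC, d_AD (all relative to A).
# One row per available direct (pooled) contrast estimate.
contrasts = ["AB", "AC", "BC", "AD", "BD"]
X = np.array([[1, 0, 0],     # AB  =  d_AB
              [0, 1, 0],     # AC  =  d_AC
              [-1, 1, 0],    # BC  =  d_AC - d_AB
              [0, 0, 1],     # AD  =  d_AD
              [-1, 0, 1]],   # BD  =  d_AD - d_AB
             dtype=float)
y = np.array([-0.5, -0.8, -0.2, -1.0, -0.6])   # hypothetical direct estimates (e.g. log odds ratios)
v = np.array([0.04, 0.06, 0.04, 0.05, 0.08])   # hypothetical variances of the direct estimates
W = np.diag(1.0 / v)

# Fixed effect weighted least squares fit of the basic parameters
XtWX_inv = np.linalg.inv(X.T @ W @ X)
beta_hat = XtWX_inv @ X.T @ W @ y              # coherent estimates of d_AB, d_AC, d_AD

# Coherent estimate of the CD effect, d_CD = d_AD - d_AC, and its weights
# (the beta_1, ..., beta_5 of equation (12.2)) on the five direct estimates
c = np.array([0.0, -1.0, 1.0])
weights = c @ XtWX_inv @ X.T @ W
print(dict(zip(contrasts, np.round(weights, 3))))
print("coherent d_CD:", round(float(weights @ y), 3))
# A bias of size b in the direct BC estimate would shift the coherent d_CD by weights[2] * b.
```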

12.5.2 GRADE Assessment of Quality of Evidence from a Network Meta-Analysis

There are two proposals (Puhan et al., 2014; Salanti et al., 2014) on how to develop quality assessments for network meta-analysis. Both start from the quality assessments of pairwise meta-analyses from the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group (Guyatt et al., 2008).

GRADE assessments of quality cover five domains: ‘study limitations’, approximately risk of internal bias (Guyatt et al., 2011e); ‘inconsistency’, the term used for heterogeneity (Guyatt et al., 2011c); ‘indirectness’, which refers to the lack of applicability due to differences in outcome or in target population (Guyatt et al., 2011b); ‘imprecision’, a high variance in the estimate (Guyatt et al., 2011a); and ‘publication bias’ (Guyatt et al., 2011d). In each domain, the process assigns an ordered quality score: high, moderate, low and very low. Finally, the five domain-level quality assessments are combined to give an overall quality assessment (Balshem et al., 2011).

For network meta-analysis, the objective of the GRADE assessment is to assign a quality rating score to each and every pairwise effect estimated by the network meta-analysis. The process starts from the five GRADE domain assessments applied to each of the ‘direct’ estimates. Then a parallel set of GRADE assessments is developed for each of the ‘indirect’ estimates. Because there are often several non-independent indirect estimates for each comparison, investigators are advised to focus attention on the indirect estimate that is formed from the two direct contrasts that are supported by the most evidence. The quality rating of the indirect evidence is then the lowest of the ratings of the constituent parts, although it may be rated down a level if there is an imbalance in effect modifiers (Puhan et al., 2014).

The final step is to assign a GRADE quality rating to each of the (coherent) network estimates themselves: this is taken to be the highest of the quality rating assigned to the direct and indirect evidence on that contrast, but this can also be downrated one step if the direct and indirect estimates are ‘incoherent’ (inconsistent in the terminology of this book). Some leeway is allowed on how this inconsistency is assessed: investigators may use node splitting (Chapter 7), or they can pick out the main source of indirect evidence and use the Bucher method. It is also possible for investigators to simply assign the final GRADE rating on the basis of the direct and indirect ratings without a formal examination of inconsistency (Johnston et al., 2014).

Readers will note that in order to assign GRADE quality ratings to a network meta-analysis, it is not necessary to actually carry one out: the entire process can be put together from the original GRADE assessments of the individual pairwise contrasts. Another property is that it does not give an overall quality assessment of the network meta-analysis results, only a set of unrelated assessments of the quality of the estimates for each separate contrast. This takes us back to evaluating a set of unrelated pairwise comparisons (see Chapter 1). There is no specific advice on how a decision maker choosing between S treatments is supposed to use the S(S − 1)/2 GRADE quality assessments.

It is suggested, however, that clinicians could ‘choose a lower ranked treatment with supporting evidence they can trust over a higher ranked treatment with supporting evidence they cannot trust’ (Puhan et al., 2014). Although decision-making bodies can always take account of numerous factors ‘outside’ any formal decision analysis, this approach does not seem to accord with any principles of rational and transparent decision-making.

Looked at in statistical terms, by treating the assessments of the pairwise summaries as independent, the GRADE process fails to recognise that the risk of bias, heterogeneity, relevance and publication bias issues that impact on each contrast in an evidence network are likely to be conceptually and quantitatively similar. Biases caused by known effect modifiers can be removed by covariate adjustment (Chapter 8). Further, as discussed in Chapter 9, random biases associated with risk of bias indicators, or publication biases, can be seen as generic biases operating throughout the network, which can be adjusted for by a suitable model. Treating them as independent generates a weak analysis when compared with explicit modelling of effect modifiers and covariates.

But the main weakness of a quality assessment system is that it is not clear how decision makers should use it. It requires considerable scholarship to understand, collate and document the various internal and external validity biases in order to implement the GRADE assessment of quality of network evidence, or the Cochrane Collaboration risk of bias tool (Higgins and Altman, 2008; Lundh and Gotzsche, 2008; Higgins et al., 2011). However, the resulting ‘evidence profiles’, quality ratings and accompanying narrative are usually consigned to an appendix and have little or no discernible impact on the final recommendation.

Another method for assigning quality ratings to network evidence has been developed by statisticians associated with the Cochrane Collaboration (Salanti et al., 2014). This starts with the GRADE assessments of the pairwise contrasts at the domain level, as described previously. These are combined into domain-specific assessments of the entire evidence network, using the ‘contributions matrix’ (König et al., 2013; Krahn et al., 2013). This is the matrix of weight coefficients (β1, β2, …) from equation (12.2) that determine the influence that each piece of evidence has on each of the estimates (Lu et al., 2011). This is an interesting, although complex, method, and readers are referred to the original papers for details. However, the quality assessments of each of the network estimates that it delivers are, according to its authors, expected to be similar to the assessments given by the GRADE approach. This method also delivers a quality rating of the rankings in the network as a whole. While this is probably a more useful output, it is still unclear how a decision maker would use it.

12.5.3 Reliability of Recommendations Versus Quality of Evidence: The Role of Sensitivity Analysis

The approach we prefer is to deliver to decision makers an evidence synthesis, or a synthesis embedded in a cost-effectiveness analysis or other decision-analytic model, that incorporates all the relevant adjustments and uncertainties, both those due to sampling error and those associated with known effect modifiers, random biases or uncertain relevance.

This is the ‘base-case’ analysis from which treatment recommendations can be directly derived. As noted in Chapter 5, it delivers a decision that is ‘optimal’, given the available evidence, but which is not necessarily the ‘correct’ decision because it is made under uncertainty. Once the base-case analysis has been generated, decision makers may wish to take into consideration factors that are not included in the formal model. This should include an analysis of the robustness of the recommendation to changes in the assumptions. This is the role of sensitivity analyses. In a decision-making context, sensitivity analysis has always been the proper way of raising, and analysing, questions relating to the reliability of conclusions. The purpose of a sensitivity analysis is not to assign a quality rating to the analysis nor to overturn the base-case analysis. Its role is to inform decision makers whether plausible changes in their original assumptions, including their assumptions about the evidence, could lead to changes in their final recommendations.

Rather than provide a quality rating to every estimate, the sensitivity analysis can focus attention on what are probably a small number of elements in the evidence ensemble, which are ‘driving’ the results. The same level of scrutiny and the same degree of scholarship required to assign GRADE ratings to evidence, or to operate the Cochrane risk of bias tool, must now be deployed but are now directed towards providing not a ‘quality assessment’ but a quantitative assessment of the likely biases and their potential impact on the treatment decision. What is required is a structured sensitivity analysis that asks questions of the form: ‘if the evidence on contrast XY is biased by an amount β, would that change the treatment decision, and what would the new recommendation be?’ A simple method for this kind of sensitivity analysis has been illustrated by Caldwell et al. (2016).
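
A minimal sketch of this kind of structured sensitivity analysis is given below. It is our own illustration, in the spirit of, but not reproducing, the method of Caldwell et al. (2016): a fixed effect approximation on the same hypothetical network as in the earlier sketch (Section 12.5.1), in which a postulated bias β is subtracted from the direct BC estimate and the recommendation (here simply the treatment with the lowest estimated effect) is recomputed.

```python
import numpy as np

def coherent_effects(X, y, v):
    """Fixed effect network estimates of the basic parameters (relative to A),
    by weighted least squares on the direct contrast estimates."""
    W = np.diag(1.0 / v)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Hypothetical network: direct estimates for contrasts AB, AC, BC, AD, BD,
# expressed in terms of basic parameters d_AB, d_AC, d_AD (reference A).
X = np.array([[1, 0, 0], [0, 1, 0], [-1, 1, 0], [0, 0, 1], [-1, 0, 1]], dtype=float)
y = np.array([-0.5, -0.8, -0.2, -1.0, -0.6])   # hypothetical pooled log odds ratios
v = np.array([0.04, 0.06, 0.04, 0.05, 0.08])   # hypothetical variances
treatments = ["B", "C", "D"]                    # effects relative to reference A
                                                # (A omitted: all effects here are negative, so A is never 'best')

base = coherent_effects(X, y, v)
print("base-case best:", treatments[int(np.argmin(base))])   # lower = better in this example

# 'If the evidence on contrast BC were biased by an amount beta,
#  would the treatment recommendation change?'
for beta in np.linspace(-0.6, 0.6, 7):
    y_adj = y.copy()
    y_adj[2] -= beta                            # subtract the postulated bias from the BC estimate
    best = treatments[int(np.argmin(coherent_effects(X, y_adj, v)))]
    print(f"bias on BC = {beta:+.1f}  ->  recommended treatment: {best}")
```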

Clearly, the results of such an investigation will depend on the precise quantitative relationships between the two or three treatments with the highest expected score on the objective function used for decision-making, whether efficacy or net economic benefit. The GRADE quality ratings contain no information on the size or direction of treatment effects, let alone the size of differences between treatments, nor on the size of potential biases. Fundamentally, GRADE ratings might tell us how much credence to give each network estimate, but they provide no information on how the treatment recommendation might change if ‘better’ information was available in different parts of the network. This would seem to be a fruitful area for further research.

12.6 Summary and Further Reading

Network meta-analysis is now routinely used in every area of clinical medicine, in academic papers (Lee, 2014), in clinical guideline development and in reimbursement decisions in several countries. Nevertheless, doubts about the method continue to be expressed. Some of these concerns relate to whether the assumptions of network meta-analysis are being met, but there has been a degree of confusion about exactly what the assumptions are, leading to alternative formulations and inconsistent use of terminology.

We argue that exchangeability of the trial-specific relative effect parameters is the only technical assumption required and that consistency of the expected treatment effects follows from exchangeability. However, it is virtually impossible to establish that exchangeability holds in any specific instance. The key strategy to ensure validity of inference, whether from direct or indirect data, is to limit the impact of deviations from exchangeability and the impact of chance imbalances in effect modifiers, by holding down clinical heterogeneity. Some suggested ‘action points’ to achieve this are listed below.

It has been widely stated, or implicitly assumed, that ‘direct’ evidence is superior to ‘indirect’ evidence. We have pointed out throughout this book that the properties of indirect estimates can only be inherited from the direct estimates of which they are composed. Our thought experiments in Section 12.3 show that, in general, direct and indirect estimates are equally vulnerable to bias, but that in certain circumstances, where two active treatments of the same class are being indirectly compared via a placebo, indirect comparisons will be more biased and direct comparisons potentially free from bias. These exercises also confirmed that the error that we can expect to observe in syntheses of small numbers of trials is directly proportional to the size of the interactions that drive effect modification. Also, meta-analytic estimates are increasingly vulnerable to bias as the number of trials diminishes.

The thought experiments of Section 12.3 led us to distinguish intrinsic inconsistency, resulting from failure of exchangeability, from chance imbalances in effect modifiers. The distinction throws a particular light on the inconsistency models (see Chapter 7), which some have proposed should be routinely used to allow for the omnipresent risk of inconsistency in networks (Lumley, 2002; Jackson et al., 2014). We can begin by observing that our standard random effects model of Chapters 2 and 4 already instantiates exactly the data generation process that gives rise to chance inconsistency. There is therefore no need to add a second layer of random variation to absorb this. But if this is the case, how should we interpret the extra variation for inconsistency? These additional parameters are also assumed to be exchangeable, but it is difficult to imagine a process that generates real, but random (and so exchangeable), variation in effect modifier distributions across the various designs. For example, one might reasonably expect that the designers of AB, AC and AD trials each consistently draw their effect modifiers from different distributions, whether intentionally or not. But it is much harder to believe that the parameters that define these different distributions are themselves generated randomly and exchangeably.

This takes us back to inconsistency models in which the inconsistency terms are treated as fixed effects rather than exchangeable. These models, however, while useful for detecting global inconsistency, cannot be used for decision-making because the estimated treatment effects and inconsistency terms depend on parameterisation (Higgins et al., 2012; White et al., 2012).

We would never claim that the problems attending network meta-analysis have been overstated. But it does appear that precisely the same problems attend pairwise meta-analysis as well and that this has been under-recognised by many investigators. The advent of network meta-analysis methods has simply brought attention to issues that were previously ignored. In particular, testing for inconsistency in networks has turned out to be a method for detecting either systematic deviations from exchangeability or the presence of chance imbalances in effect modifiers, both of which signal significant heterogeneity in the evidence base.

Of course, we should not have had to wait for network meta-analysis to raise questions about the validity of pairwise meta-analysis. An extensive earlier literature already points out that random effects estimates had no population interpretation (Rubin, 1990) and warns of the dangers of uncritical pooling of clinically and statistically heterogeneous estimates (Greenland, 1994b). Summary estimates from random effects models have even been described as ‘meaningless’ (Greenland, 1994a; Shapiro, 1994), although we might add that estimates become increasingly meaningful as the degree of between-trial variation is diminished.

In this closing section we bring together strands from the different chapters, and indeed from classic meta-analysis literature, to offer some suggestions about what can be done to help ensure that conclusions from syntheses of direct or indirect evidence are secure.

Question formulation, trial inclusion/exclusion and network connectivity

  1. Restrict attention to a clinically meaningful target population at a specific stage of their disease.
  2. Ensure that all patients included in the trials could be randomised to any of the treatments in the network (Salanti, 2012) (Chapter 1).
  3. As a starting point, keep different doses, different co-therapies, as different treatments. Consider class, dose or treatment combination models (Chapter 8) if appropriate.
  4. Where trials report multiple outcomes, consider multi-outcome models to ensure robust and coherent decisions based on all the evidence (Chapter 11).

Heterogeneity and bias management

  1. Examine the absolute event rates in different trials. If these are heterogeneous, there is a higher risk that unrecognised effect modifiers are present (Song, 1999).
  2. Review the general literature, as well as the meta-analysis literature, for potential and known effect modifiers, and consider meta-regression models (Chapter 8).
  3. Examine and report on the distribution of effect modifiers across trials and across pairwise contrasts (Jansen and Naci, 2013).
  4. Explore the potential effects of quality-related biases including publication bias, and consider bias adjustment models (Chapter 9).
  5. Check for inconsistency and report results (Chapter 7).

Reporting

  1. Show how many trials inform each contrast through a table or network diagram. Include a description of trials with more than two arms and how they influence the network structure.
  2. Report all relative treatment effects against a common reference treatment.
  3. Report model fit and methods for model choice.
  4. Report heterogeneity.
  5. Give a precise reference for the statistical methods used, and supply the computer code and datasets used to allow readers to replicate results.

Those managing or undertaking systematic reviews leading to network meta-analyses can take advantage of a literature on ‘checklists’, starting of course with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) requirements (Liberati et al., 2009). Some of these seek to assign a summary numerical assessment (Oxman, 1994). Others, which we believe are more useful, are intended to assist those whose task is to make recommendations based on meta-analysis, the technical analysts advising them, or simply journal editors and reviewers considering papers for publication (Ades et al., 2012, 2013). There is also guidance and a checklist from an ISPOR task force oriented to network meta-analyses (Jansen et al., 2014).

There is a growing literature of simulation studies, which it is beyond the scope of this book to review. Some of these studies have concluded that indirect comparisons are biased (Wells et al., 2009; Mills et al., 2011). As noted previously, this is difficult to understand, as indirect comparisons cannot be biased unless direct comparisons are also biased. It seems probable that the bias in indirect estimates has been inherited from the well-recognised biases in random effects variance estimators (Böhning et al., 2002), or that it results from the biases introduced by adding a small constant to zero cells. There are also simulation studies claiming to show that the probability (best) outcomes from Bayesian network meta-analysis are biased, because they are sensitive to the number of studies (Kibret et al., 2014). However, it is well known that the posterior distribution of ranks is expected to be sensitive to the number and size of trials in different parts of the network.

There is certainly scope for well-conducted simulations, both on the impact of between-study variance prior distributions on Bayesian meta-analysis (Gelman, 2006) and on the coverage properties of different types of inconsistency detection, which themselves depend critically on how between-study variances are estimated (Veroniki et al., 2013).

The idea of studying the geometrical properties of networks was introduced in 2008 (Salanti et al., 2008a, 2008b), with a number of suggested metrics based on ecological science. These metrics can provide insights into the processes behind the choice of comparators in new trials. Recently, though, it has been suggested that features of network geometry may be related to ‘bias’ in the evidence network (Salanti, 2012; Hutton et al., 2015; Linde et al., 2016). As noted previously (Section 12.2.1) bias is only introduced in network meta-analysis if the exchangeability assumption is violated or – saying the same thing another way – the missingness of treatments from trials is related to their relative effectiveness. Decisions about which comparators to enter into trials are often made on marketing grounds, but this does not imply that relative effects are biased. Trial designers are free to choose any treatments for inclusion in trials, even on the basis of their expected relative efficacy, without introducing bias.

We conclude with some brief thoughts about future research. In Chapter 1 we showed how a comparison of pairwise estimates and their credible intervals with the corresponding coherent network estimates could help investigators understand the ‘drivers’ in the analysis. This is also the key role for sensitivity analysis, touched on in Section 12.5, and surely an area that requires further investigation. For general texts on sensitivity analysis, we refer to relevant chapters and tutorial papers in cost-effectiveness analysis (Briggs et al., 2006, 2012). However, these techniques concern ‘forward’ Monte Carlo simulation models in which each parameter is informed by independent sources of evidence. They do not address the complex flow of evidence in Bayesian evidence networks (Madigan et al., 1997).

We noted earlier that network meta-analysis operates on a relatively simple evidence network on the scale of the linear predictor, in which each coherent network estimate can be characterised as a weighted average of the ‘input’ contrast estimates, so that the influence of each ‘input’ observation can be quantified (Lu et al., 2011). This idea has been developed further to provide an analysis of inconsistency (Krahn et al., 2013) and a general analysis of information flow in linear networks (König et al., 2013). These algebraic methods might be further adapted to drive sensitivity analyses, or analyses of how an existing evidence network would respond to additional data, or of which new data to collect to reduce uncertainty in specific parts of the network.

12.7 Exercises

  1. 12.1 Starting from the assumption that the AB and AC effects are each exchangeable, with δi,AB ~ N(dAB, σ²) and δi,AC ~ N(dAC, σ²), demonstrate that consistency holds for dBC.
  2. 12.2 Show that consistency does not imply exchangeability.
  3. 12.3 Using the same approach as in equation (12.1) and in Tables 12.1 and 12.2, what is the expected absolute error in a direct comparison based on three trials, when π = 0.5?
  4. 12.4 Extend the same methods to study the extent of bias in indirect comparisons where the BC effect is estimated from two AB and two AC trials, in the presence of an effect modifier, present with probability π = 0.5, that affects treatments B and C equally but not A.