ABSTRACT

The null hypothesis of a specified difference had been formulated as early as the 1970s (e.g., Remington and Schork 1970; Dunnett and Gent 1977; Makuch and Simon 1978). Mathematically, it appears to be not much different from testing the null hypothesis of no difference, because the t-statistic can be calculated by subtracting the specified difference from the difference in the sample means. In reality, however, it is not as simple as it looks. For noninferiority (NI) trials, the specified difference is known as the NI margin. The NI margin proposed in Chapter 2, a small fraction of the therapeutic effect of the active control in a clinical trial, is conceptually very simple. In practice, however, there are many issues in the determination of the NI margin and in the analysis of an NI trial, as seen throughout this book. Although many advances have been made over the past three decades, this chapter covers many other issues and challenges that need to be addressed.
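As a minimal sketch of the mechanics described above (the function name and data are hypothetical, not from the book), the t-statistic for a specified difference is simply the usual pooled two-sample statistic shifted by the specified difference:

```python
import math
from statistics import mean, stdev

def t_stat_specified_difference(x, y, delta):
    """Pooled two-sample t-statistic for testing at the boundary
    mean(x) - mean(y) = -delta (a specified difference), obtained by
    shifting the usual no-difference statistic by delta."""
    nx, ny = len(x), len(y)
    # Pooled variance estimate across the two samples.
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    # Subtract the specified difference (-delta) from the observed difference.
    return (mean(x) - mean(y) + delta) / se
```

With delta = 0 this reduces to the ordinary no-difference t-statistic, which illustrates why the two tests "look" alike mathematically even though, as noted above, the NI setting is far less simple in practice.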

A search on the PubMed website for articles with "noninferiority" in the title/abstract was performed. The number of publications by year of publication is shown in Figure 12.1. In spite of the limitations of such a search, this figure shows how much interest in this research area has grown during the last decade.

Five special issues on noninferiority were published by three journals in the mid-2000s: two by Statistics in Medicine (2003 and 2006), two by the Journal of Biopharmaceutical Statistics (2004 and 2007), and one by the Biometrical Journal (2005). One additional special issue on NI trials was published by the Journal of Biopharmaceutical Statistics in 2011. In addition, there were two workshops on NI trials in the 2000s: the PhRMA (Pharmaceutical Research and Manufacturers of America) Non-Inferiority Workshop in 2002 and the FDA-Industry Workshop in 2007. Finally, a short course on NI trials was offered at the FDA-Industry Workshop in 2003. These tremendous research efforts over the past 15 years have led to many advances in this area. However, many issues remain unresolved.

This chapter highlights (1) the fundamental issues (see Section 12.2), (2) advances (see Section 12.3), (3) current controversial issues (see Section 12.4), and (4) issues and challenges (see Section 12.5) in the design and analysis of NI trials.

Many issues in NI trials are not seen in superiority trials, although the test statistics (e.g., the t-test) used in NI and superiority trials do not "look" much different mathematically (see Section 12.1). The three fundamental issues are (1) lack of internal validation of assay sensitivity (see Sections 2.6 and 2.7), (2) the need for a strong, unverifiable constancy assumption (see Sections 2.5.1 and 2.7), and (3) bias toward equality.

To validate assay sensitivity, one has to rely on the historical placebo-controlled trials of the active control, which are very often inadequate. Although discounting (see Section 2.5.2 of Chapter 2) may alleviate the concern with the constancy assumption, such discounting is subjective, and there is no scientific basis for determining the discounting factor. It is extremely difficult, if not impossible, to assess the publication bias in the meta-analysis used to estimate the control effect. Under the null hypothesis of a specified difference, poorly conducted studies (e.g., mixing up treatment assignments) would bias the results toward equality. This could lead to (1) a false rejection of the null hypothesis and (2) an increase in the Type I error rate, even though the variability may also be increased. Unlike in a superiority trial, such a bias may occur without unblinding.

In two-arm NI trials, the historical data are used either (1) to derive the NI margin using the fixed-margin method (Section 5.3) or (2) to be incorporated into the test statistic using the synthesis approach (Section 5.4). The most fundamental issue in the analysis of NI trials is that patients were not "randomized" between the two studies: the current NI trial and the "super-study" of the historical studies. We have to rely on the constancy assumption or discounting in the assessment of efficacy as compared to placebo, or of the percent preservation. Therefore, efficacy assessment in NI trials is less credible than that in a placebo-controlled trial (or superiority trial) or in the three-arm, gold-standard design (see Section 8.2 of Chapter 8). On the other hand, the constancy assumption is weaker than the assumption underlying nonrandomized, historical-controlled studies, where the historical control mean is assumed to be the same as in the current trial (see Section 2.5.1 of Chapter 2).

Design and analysis of active-control trials have gone through two major transitions over the past four decades. The first transition dealt with the formulation of the hypotheses, and the second with the study objective. These transitions happened very slowly.

12.3.1 Formulation of Noninferiority Hypothesis and Noninferiority Margin

Although the null hypothesis of a specified difference was suggested in the 1970s (see Section 12.1), based on the lists in Sections 1.3 and 1.4 of Chapter 1, the formulation of the hypotheses given by Equation 1.3 in Section 1.7 of Chapter 1 was not widely recognized and accepted until two decades later, in the 1990s. With such a formulation of the hypotheses, researchers faced a great challenge in determining the NI margin (or equivalence margin, as it was called at the time; see Section 1.3 of Chapter 1). A small fraction of the effect size as the NI margin was first proposed by Ng (1993) (see Section 2.2 of Chapter 2). Such a proposal was "translated" into preservation or retention in the literature (see Section 2.4 of Chapter 2) and was used in the late 1990s. See the thrombolytic example discussed in Chapter 11.

12.3.2 Study Objective: Showing Efficacy

The hypotheses in Section 1.7 of Chapter 1 were first formulated with the objective of showing NI or equivalence in a broad sense rather than a strict sense (see Sections 1.2 and 1.6 of Chapter 1). Very often, such an objective is not practical due to the limitation of the sample size. A viable alternative is to show the efficacy of T by an indirect comparison with P. Such an objective was first discussed by Ng (1993) in determining the sample size in active-controlled trials through specification of the treatment difference. The concept of showing efficacy dates back to 1985 (see Section 5.6 of Chapter 5). This latter objective of showing efficacy became noticeable in the literature in the 2000s (e.g., Hasselblad and Kong 2001; Wang and Hung 2003; EMEA/CPMP 2005; FDA 2010). However, showing efficacy in NI trials might not be sufficient for drug approval (FDA 2010), leading to a controversial issue (see Section 12.4.1). Furthermore, Snapinn and Jiang (2014) question the formulation of the NI hypothesis given by Equation 1.3 in Section 1.7 of Chapter 1 when the study objective is to show efficacy (see Section 12.4.2).

12.3.3 From the Fixed-Margin Method to the Synthesis Method

When the NI hypotheses given in Section 1.7 of Chapter 1 were first formulated, the NI margin δ was considered a fixed constant. Therefore, it was natural to use the fixed-margin method (see Section 5.3 of Chapter 5). However, with the NI margin depending on the effect size as given by Equation 2.1 in Chapter 2, an alternative to the fixed-margin method, namely the synthesis method (see Section 5.4 of Chapter 5), emerged in the 2000s. See, for example, Ng (2001); Wang and Hung (2003); Snapinn and Jiang (2008b); and Hung, Wang, and O'Neill (2009).

The concept of the synthesis method to show efficacy of the test treatment as compared to placebo dates back to 1985 (see Section 5.6 of Chapter 5). See also Hasselblad and Kong (2001) and the references therein. Based on a search of the PubMed website, it appears that the term "synthesis method" was first introduced by Wang and Hung (2003).

12.3.4 Beyond Two Treatment Groups: Gold-Standard Design

The analyses of clinical trials with more than two treatment groups are much more complicated than those of studies with just two treatment groups because there are many pairwise comparisons to be considered. Chapter 8 focused the discussion on three treatment groups. Of particular interest is the gold-standard design (STP) (see Section 8.2 of Chapter 8). The traditional strategy is to simultaneously test the three null hypotheses H01, H02, and H03 given in Section 8.2.2 of Chapter 8 at the prespecified significance level α; the trial is considered successful only if all three null hypotheses are rejected. The modified Koch-Rohmel procedure proposed by Rohmel and Pigeot (2011) is a step-down procedure to control the family-wise error rate (see Section 8.2.2 of Chapter 8). Briefly, this procedure first tests H01 at the prespecified significance level α and, if H01 is rejected, then tests H02 and H03 simultaneously at the same significance level α. Obviously, this is a great improvement over the traditional strategy. Note that the NI margin δ in H02 is considered a fixed constant. With the NI margin given by Equation 2.1 of Chapter 2, the linearized approach is a viable alternative (see Section 8.2.3 of Chapter 8).
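The step-down logic just described can be sketched in a few lines. This is only an illustration of the decision rule, assuming one-sided p-values for H01, H02, and H03 are available; the function and argument names are hypothetical:

```python
def step_down_test(p01, p02, p03, alpha=0.025):
    """Sketch of the step-down rule described above: test H01 first at
    level alpha; only if H01 is rejected, test H02 and H03 simultaneously,
    each at the same level alpha. Returns the set of rejected hypotheses."""
    rejected = set()
    if p01 <= alpha:
        rejected.add("H01")
        # H02 and H03 are tested only after H01 is rejected, each at alpha.
        if p02 <= alpha:
            rejected.add("H02")
        if p03 <= alpha:
            rejected.add("H03")
    return rejected
```

The gain over the traditional strategy is visible in the rule itself: H02 and H03 are each tested at the full level α rather than requiring all three rejections as a single conjunction, while the gatekeeping on H01 preserves the family-wise error rate.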

12.3.5 Toward One Primary Analysis in Noninferiority Trials: Intention-to-Treat versus Per-Protocol

It was widely recognized in the 1990s and 2000s that the intention-to-treat (ITT) analysis was anticonservative in NI trials, while the exclusion of noncompliant subjects in the per-protocol (PP) analysis could undermine the prognostic balance between the two treatment arms achieved through randomization (see Section 10.5.2 of Chapter 10). Therefore, both analyses are currently required by regulatory agencies (see Section 10.5.1 of Chapter 10). However, recent literature has tilted the balance in favor of the ITT analysis, and such a movement is supported by the facts that the ITT analysis (1) preserves the value of randomization and (2) estimates real-world effectiveness (see Section 10.5.5 of Chapter 10).

The major hurdle in the ITT analysis is missing data. With (1) the 18 recommendations by the National Research Council (2010) (see Appendix 10.A of Chapter 10) and (2) the recent 10 mandatory standards for the prevention and handling of missing data in patient-centered outcomes research recommended by Li et al. (2014) (see Appendix 10.B of Chapter 10), the amount of missing data will hopefully be kept to a minimum and the study will be of high quality, so that the ITT analysis will be widely accepted as the primary analysis in NI trials.

12.4.1 One Standard for Approval: Efficacy versus Preservation of Control Effect

As noted in Chapter 11, preservation of 50% of the control effect was used in studying thrombolytics in the late 1990s. Since then, 50% preservation has often been used in the literature (e.g., Rothmann et al. 2003; Wang and Hung 2003; FDA 2004; Sorbello, Komo, and Valappil 2010). In fact, the Food and Drug Administration (FDA) draft guidance on NI (FDA 2010) states that "a typical value for M2 is often 50% of M1, …" and that "choosing M2 as 50% of M1 has become usual practice for cardiovascular (CV) outcome studies, whereas …" However, there has been push-back from the pharmaceutical industry regarding the regulatory requirement of showing greater than 50% preservation for approval (e.g., Snapinn and Jiang 2008a; Peterson et al. 2010a; Huitfeldt and Hummel 2011; Snapinn and Jiang 2014).

Peterson et al. (2010a) argued that the same standard of evidence should be applied regardless of whether a placebo-controlled or an active-control trial has been performed. The same standard of evidence refers to showing superiority to placebo (i.e., efficacy). Showing greater than 50% preservation is clearly a higher standard than showing superiority to placebo, as illustrated in Figure 2.5b in Chapter 2. On the other hand, due to the untestable assumptions in the indirect comparison with placebo in an active-control trial, Hung and Wang (2010) argued that such an indirect comparison is different from the direct comparison with placebo in a placebo-controlled trial.

Since preservation and discounting are indistinguishable mathematically (see Section 5.7 of Chapter 5), 50% preservation is subject to two different interpretations (Snapinn and Jiang 2008a). If the study objective is to show efficacy as compared to placebo, then the preservation may be "interpreted" as discounting. For example, Wang and Hung (2003, 153) stated the following:

However, given possible uncertainty on cross-trial inference, in order to be fairly certain that the new drug would have been superior to placebo had the placebo treatment been studied in the trial, it was decided that the new drug must be shown to preserve at least 50% of the control effect in this target population of the active-controlled trial.

In that case, "discounting" should have been used rather than "preservation." In other words, γ = 0.5 and ε = 1 should be used instead of γ = 1 and ε = 0.5, although the results are exactly the same. Equivalently, imposing some degree of conservativeness can also be done through some kind of discounting in the determination of M1 (Huitfeldt and Hummel 2011).
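The mathematical indistinguishability can be made concrete with a one-line sketch. This assumes (Equation 5.1 is not reproduced in this chapter) that the discounting factor γ and the preservation-related fraction ε enter the analysis only through their product with the control effect; the function name and numbers are hypothetical:

```python
def effective_margin(control_effect, gamma, epsilon):
    """Assumed form: gamma (discounting) and epsilon (the fraction tied to
    preservation) enter only through their product with the control effect,
    so the two interpretations yield identical numerical results."""
    return epsilon * gamma * control_effect

# "50% discounting" (gamma = 0.5, epsilon = 1) and "50% preservation"
# (gamma = 1, epsilon = 0.5) produce the same effective quantity.
```

Under this assumed form, the choice between the two labels is purely interpretive, which is exactly why the text argues that the wording ("discounting" versus "preservation") should match the stated study objective even though the arithmetic cannot distinguish them.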

On the other hand, if the study objective is to show greater than 50% preservation, then the constancy assumption is needed. A much higher preservation is probably needed for an NI claim. In any case, we should keep in mind the fundamental issues of the indirect comparison, as discussed in Section 12.2.

12.4.2 Efficacy Hypotheses in Noninferiority Trials

To be consistent with the study objective of showing efficacy, Snapinn and Jiang (2014) suggest that the null hypothesis T ≤ P be tested against the alternative hypothesis T > P, instead of the hypotheses defined by Equation 1.3 in Chapter 1. This makes sense logically at first glance. However, since there is no placebo arm in the current NI trial, a direct comparison to placebo cannot be made, and the formulation of such a hypothesis is questionable. A viable alternative is to make an indirect comparison to placebo through comparison to the standard therapy or active control in the current NI trial (Julious 2011; Snapinn and Jiang 2011). The formulation of the hypotheses for such an indirect comparison is given by Equation 5.1 in Chapter 5, with ε = 1 for efficacy, where γ = 1 if the constancy assumption holds. There are two approaches for testing the hypothesis given by Equation 5.1a, as discussed in Chapter 5, and no consensus has been reached as to which method should be used (see Section 12.4.3).

12.4.3 Fixed-Margin versus Synthesis Methods

The FDA draft guidance (FDA 2010) suggests that the fixed-margin method be used for showing efficacy, as it states:

We believe the fixed-margin approach is preferable for ensuring that the test drug has an effect greater than placebo (i.e., the NI margin M1 is ruled out). However, the synthesis approach, appropriately conducted, can be considered in ruling out the clinical margin M2.

On the other hand, many authors (e.g., Snapinn and Jiang 2008a; Peterson et al. 2010a; Huitfeldt and Hummel 2011; Snapinn and Jiang 2014) advocate the synthesis method. For example, Huitfeldt and Hummel (2011) state:

We propose that the most efficient method should be used for both analyses, that is, the synthesis method�

The choice of statistical approach (fixed-margin or synthesis) in the analysis of NI trials remains controversial. See, for example, Hung and Wang (2010) and Peterson et al. (2010b). As discussed in Section 5.1 of Chapter 5, the fixed-margin method (Section 5.3) is conditioned on the historical data, while the synthesis method (Section 5.4) is unconditional on the historical data. In testing the preservation hypotheses given by Equation 5.1 in Section 5.1, assuming the effect size is appropriately discounted, the fixed-margin method cannot control the unconditional Type I error rate at the α significance level exactly. More specifically, the unconditional Type I error rate will be inflated (deflated) if the effect size is overestimated (underestimated) by the one-sided (1 − α*/2)100% lower confidence limit, as discussed in Section 5.8 of Chapter 5. On the other hand, the synthesis method can control the unconditional Type I error rate at the nominal α/2 level. This is an advantage of the synthesis method. However, the results of the analysis using this method may be difficult to interpret in the sense that the clinical significance cannot be assessed (Hung, Wang, and O'Neill 2007).
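The contrast between the two approaches can be sketched for a difference-in-means endpoint under a normal approximation. All summary statistics and function names here are hypothetical, and the sketch is illustrative rather than a reproduction of the book's formulas:

```python
import math

Z_CRIT = 1.96  # two-sided 5% (one-sided 2.5%) normal critical value

def fixed_margin_z(d_tc, se_tc, d_cp, se_cp, frac=0.5):
    """Fixed-margin sketch: the margin is a fraction of the lower
    confidence limit of the historical control effect (C - P), so the
    historical data enter only through this fixed constant.
    d_tc, se_tc: estimate and SE of T - C from the current NI trial.
    d_cp, se_cp: estimate and SE of C - P from the historical data."""
    m1 = d_cp - Z_CRIT * se_cp      # conservative estimate of the control effect
    margin = frac * m1              # e.g., preserve 50% of the effect
    return (d_tc + margin) / se_tc  # compare to Z_CRIT

def synthesis_z(d_tc, se_tc, d_cp, se_cp, frac=0.5):
    """Synthesis sketch: the historical estimate and its variance are
    combined directly into one statistic, so the uncertainty in the
    control effect is propagated rather than fixed in advance."""
    num = d_tc + frac * d_cp
    den = math.sqrt(se_tc ** 2 + frac ** 2 * se_cp ** 2)
    return num / den
```

Note that when the historical standard error is zero the two statistics coincide; the methods differ precisely in how they handle the uncertainty of the historical estimate, which is the source of the conditional-versus-unconditional distinction discussed above.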

As discussed in Section 5.5 of Chapter 5, from a statistical point of view, the synthesis method rather than the fixed-margin method should be used; however, from a practical point of view, the fixed-margin method rather than the synthesis method should be used.

The most fundamental issue in two-arm NI trials is the use of historical data, because patients were not "randomized" between the two studies: the current NI trial and the "super-study" of the historical studies (see Section 12.2). For this reason, superiority testing using other designs (e.g., the add-on design) is often recommended.

Although discounting may be used if the constancy assumption is violated, such discounting is subjective. Section 5.7 makes it very clear that preservation and discounting are two different concepts, although they are indistinguishable mathematically, which is also recognized by many authors (e.g., Ng 2001; Snapinn and Jiang 2008a; Peterson et al. 2010a).

The following are highlights of the issues and challenges in the design and analysis of NI trials discussed throughout this book:

• Which metric (the difference, the ratio, or the odds ratio) should be used in formulating the NI hypotheses with a binary endpoint (see Chapter 4)?