ABSTRACT

There are two major types of equivalence in clinical research: therapeutic equivalence and bioequivalence. Therapeutic equivalence is sometimes referred to as clinical equivalence, and often arises in active-control equivalence studies in which an experimental or test treatment is compared to a standard therapy or active control based on clinical endpoints. The objective is to show that the experimental treatment produces the same benefit as the active control. The term “active-control equivalence studies” (ACES) is attributed to Makuch and Johnson (1990).

Bioequivalence arises from studies in which a test product is compared to a reference product with respect to pharmacokinetic parameters, such as the area under the concentration-time curve (AUC), the maximum concentration (Cmax), etc. The objective of bioequivalence studies is to show that the pharmacologic activity of one product is similar to that of another. These studies are often conducted with normal, healthy volunteers using the standard 2 × 2 crossover design. Chow and Liu (2009, p. 1) state, “When two formulations of the same drug or two drug products are claimed to be bioequivalent, it is believed that they will provide the same therapeutic effect or that they are therapeutically equivalent and they can be used interchangeably.” This assumption is the basis for the approval of generic drugs by the U.S. Food and Drug Administration (FDA) and by nearly all regulatory agencies in the world. Temple (1982) raised a fundamental issue with regard to ACES. He questioned whether the positive control in an ACES would have beaten a placebo group had one been present. This is the assay sensitivity of the trial as discussed in the International Conference on Harmonization (ICH) E10 (2001). ICH E10 (2001, p. 7) states, “Assay sensitivity is a property of a clinical trial defined as the ability to distinguish an effective treatment from a less effective or ineffective treatment.” Assay sensitivity cannot be validated in an ACES with no concurrent placebo. Without assay sensitivity, the results obtained from an ACES are uninterpretable. Therefore, a concurrent placebo is often recommended if placebo use is ethical in the setting of the study. Assay sensitivity is the fundamental issue in an ACES and will be discussed further in Sections 2.6 and 2.7.

Even though a placebo does not play any role in bioequivalence studies, these studies have issues similar to assay sensitivity. These issues, however, have received much less attention than those with ACES. On the other hand, in bioequivalence studies for biological products, such as immunoglobulin, the endogenous level, for example, needs to be taken into account by subtracting it from the concentration before computing the AUC (Ng 2001; EMEA/CPMP 2010), and the endogenous level, in some sense, plays the role of placebo.

“For drugs, bioequivalence studies are the basis for evaluating generic products. For biologics, these studies are conducted to show comparability of production lots when the sponsor makes significant manufacturing changes, such as scaling up pilot plant production or building new facilities that do not require efficacy studies or extensive safety data” (Ng 2001, p. 1518). The generic versions of biologic products are usually referred to as (1) biosimilars by the European Medicines Agency (EMA) of the European Union (EU) and (2) follow-on biologics by the U.S. FDA. Due to the complexity of biosimilar drug products, “the design and analysis to evaluate the equivalence between the biosimilar drug product and innovator products are substantially different from those of chemical generic products” (Chow 2011, p. 6).

Equivalence testing can arise in many other situations. One such situation is in the evaluation of product quality, such as red blood cells, plasma, and platelets. Other examples include lot-to-lot consistency and bridging studies, as well as safety studies (see also Wellek 2010). There is an extensive literature on bioequivalence. See, for example, Chow and Liu (2009), Hauschke, Steinijans and Pigeot (2007), and the references therein.

This book will focus on therapeutic equivalence. Unless otherwise noted, the discussion in this book is in the context of therapeutic equivalence, although the statistical methodology, such as the two one-sided tests procedure (Schuirmann 1987), had originally been proposed for bioequivalence.

Equivalence testing with two treatment groups can be one sided or two sided. One-sided equivalence studies are also known as NI studies. Therapeutic equivalence is often one sided; that is, we wish to know whether the experimental treatment is not worse than the active control. Bioequivalence with biologic products can also be one sided because we are not concerned if the test product is more bioavailable than the reference product; however, that does not preclude a two-sided bioequivalence testing. On the other hand, bioequivalence with drugs is two sided, since greater bioavailability may pose a safety concern (e.g., adverse events). Therapeutic equivalence can also be two sided. For example, when comparing a twice-a-day to a once-a-day regimen, a difference in either direction is worthy of note. In the remainder of this section, the hypotheses are formulated, without loss of generality, using a continuous outcome as an example.

For two-sided equivalence testing, it is not possible to show absolute equality of the two means (Ng 2001; Piaggio et al. 2006; Senn 2007, 240; Gülmezoglu et al. 2009; Suda et al. 2011) (see also Section 1.6). Therefore, hypotheses are formulated for showing δ-equivalence, a terminology introduced by Ng (1993a, 1995); that is, the absolute mean difference being less than a prespecified δ > 0. More specifically, the null hypothesis

H0(2): |μt – μs| ≥ δ

is tested against the alternative hypothesis

H1(2): |μt – μs| < δ

where μt and μs denote the mean response for the test (or experimental) treatment and the standard therapy (or active control), respectively. The superscript (2) indicates testing for two-sided equivalence. The two one-sided tests procedure (Schuirmann 1987; Ng 1993b; Hauschke 2001) and the confidence interval approach are often used for testing this null hypothesis (Ng 2001). Briefly, the previous hypotheses may be rewritten as

H0(2): μt – μs ≤ –δ or μt – μs ≥ δ

versus

H1(2): μt – μs > –δ and μt – μs < δ

which can be decomposed into two one-sided hypotheses as

H01: μt – μs ≤ –δ

versus

H11: μt – μs > –δ

and

H02: μt – μs ≥ δ

versus

H12: μt – μs < δ

If both H01 and H02 are rejected at a significance level of α, then H0(2) will be rejected at the same significance level. This amounts to rejecting H0(2) if the 100(1 – 2α)% [not 100(1 – α)%] confidence interval for the mean difference completely falls inside the interval (–δ, δ).
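To make this decision rule concrete, the following is a minimal sketch, assuming normally distributed outcomes and a pooled two-sample t-statistic; the function name, significance level, margin, and simulated data are illustrative assumptions and are not taken from the text:

```python
# A sketch of the two one-sided tests (TOST) procedure for delta-equivalence,
# together with the equivalent 100(1 - 2*alpha)% confidence interval rule.
import numpy as np
from scipy import stats

def tost_equivalence(x_t, x_s, delta, alpha=0.05):
    """Reject H0(2): |mu_t - mu_s| >= delta if both one-sided tests reject."""
    n_t, n_s = len(x_t), len(x_s)
    diff = np.mean(x_t) - np.mean(x_s)
    # pooled-variance standard error of the mean difference
    sp2 = ((n_t - 1) * np.var(x_t, ddof=1) + (n_s - 1) * np.var(x_s, ddof=1)) / (n_t + n_s - 2)
    se = np.sqrt(sp2 * (1.0 / n_t + 1.0 / n_s))
    df = n_t + n_s - 2
    p_lower = 1 - stats.t.cdf((diff + delta) / se, df)  # H01: mu_t - mu_s <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)      # H02: mu_t - mu_s >= +delta
    reject = max(p_lower, p_upper) < alpha
    # equivalently: the 100(1 - 2*alpha)% CI lies entirely within (-delta, delta)
    half_width = stats.t.ppf(1 - alpha, df) * se
    return reject, (diff - half_width, diff + half_width)

rng = np.random.default_rng(0)
x_t = rng.normal(0.1, 1.0, 60)   # test treatment (simulated data)
x_s = rng.normal(0.0, 1.0, 60)   # standard therapy (simulated data)
print(tost_equivalence(x_t, x_s, delta=0.5))
```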

For average bioequivalence (bioequivalence in average bioavailability in terms of pharmacokinetic parameters such as AUC and Cmax; Chow and Liu 2000), the hypotheses are formulated as the mean ratio with an equivalence range of 0.8 to 1.25 rather than as the mean difference. The FDA has adopted this equivalence range for a broad range of drugs based on a clinical judgment that a test product with bioavailability measures (e.g., AUC and Cmax) outside this range should be denied market access (U.S. FDA 2001a). The FDA also recommends that the analyses of bioavailability measures be performed on a log scale; that is, the data are log-transformed. Analyzing the natural log-transformed data results in testing the hypotheses based on the mean difference with an equivalence range from –0.223 to 0.223, as loge(0.80) = –0.223 and loge(1.25) = 0.223. Significance levels of 0.05 and 0.025 are typically used for average bioequivalence and therapeutic equivalence, respectively.
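As a quick arithmetic check of the log-scale translation just described (the only inputs are the 0.80 and 1.25 limits stated above):

```python
# The multiplicative limits 0.80 and 1.25 become symmetric additive limits on
# the natural-log scale, so a TOST-type procedure can be applied to
# log-transformed AUC or Cmax with delta = 0.223 at a significance level of 0.05.
import numpy as np

lower, upper = np.log(0.80), np.log(1.25)
print(round(lower, 3), round(upper, 3))   # -0.223 and 0.223
```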

For one-sided equivalence testing, assuming a larger response corresponds to a better outcome, μt – μs ≥ 0 means that the test treatment is at least as good as the standard therapy, or equivalently, the test treatment is not inferior to the standard therapy. However, it is not possible to show NI literally in the sense that μt – μs ≥ 0, as shown in the following.

To show NI literally, we test the null hypothesis

H0: μt – μs < 0 (1.1)

against the alternative hypothesis

H1: μt – μs ≥ 0

so that NI can be concluded when the null hypothesis is rejected. To show superiority, we test the null hypothesis

H0: μt – μs ≤ 0 (1.2)

against the alternative hypothesis

H1: μt – μs > 0

The only difference between these two sets of hypotheses is the boundary 0. More specifically, the boundary 0 is included in the null hypothesis for testing for superiority, but not for NI. Since the supremum of the rejection probability over the parameter space under the null hypothesis for testing NI in Equation 1.1 is equal to that over the parameter space under the null hypothesis for testing superiority in Equation 1.2 (both being attained at the boundary), testing for NI would be the same as testing for superiority. Therefore, instead of showing NI literally (i.e., μt – μs ≥ 0), hypotheses are formulated to show that the experimental treatment is δ-no-worse than the active control, a terminology introduced by Ng (1993a, 1995); that is, the experimental treatment is not worse than the active control by a prespecified δ (> 0) or more. Therefore, for NI, we test the null hypothesis

H0(1): μt – μs ≤ –δ

against the alternative hypothesis

H1(1): μt – μs > –δ

where the superscript (1) indicates testing for the one-sided equivalence (Ng 2001).
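A minimal sketch of a test of these NI hypotheses follows, assuming normal data and a pooled two-sample t-statistic; the function name and the default one-sided 0.025 level are illustrative assumptions rather than prescriptions from the text:

```python
# Noninferiority test of H0(1): mu_t - mu_s <= -delta versus
# H1(1): mu_t - mu_s > -delta, with the equivalent lower-confidence-bound rule.
import numpy as np
from scipy import stats

def noninferiority_test(x_t, x_s, delta, alpha=0.025):
    n_t, n_s = len(x_t), len(x_s)
    diff = np.mean(x_t) - np.mean(x_s)
    sp2 = ((n_t - 1) * np.var(x_t, ddof=1) + (n_s - 1) * np.var(x_s, ddof=1)) / (n_t + n_s - 2)
    se = np.sqrt(sp2 * (1.0 / n_t + 1.0 / n_s))
    df = n_t + n_s - 2
    p_value = 1 - stats.t.cdf((diff + delta) / se, df)    # test statistic shifted by the NI margin
    lower_bound = diff - stats.t.ppf(1 - alpha, df) * se  # one-sided 100(1 - alpha)% lower bound
    return p_value < alpha, lower_bound                   # NI claimed iff lower_bound > -delta
```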

What is δ? δ is the usual term for the equivalence or NI margin. The following is a list of definitions of δ found in the literature. It is by no means a complete list. Some authors used other notations, such as Θ0 and Δ, instead of δ. This list is expanded from an original 13-item list that was prepared by the author of this book for an invited talk at the Drug Information Association workshop held in Vienna, Austria, in 2001 (Ng 2001).

1. “An equivalence margin should be specified in the protocol; this margin is the largest difference which can be judged as being clinically acceptable and …” (ICH E9 1998).

2. “This margin is the degree of inferiority of the test treatments to the control that the trial will attempt to exclude statistically” (ICH E10 2001).

3. “Choice of a meaningful value for Θ0 is crucial, since it defines levels of similarity sufficient to justify use of the experimental treatment” (Blackwelder 1998).

4. “…that a test treatment is not inferior to an active treatment by more than a specified, clinically irrelevant amount (Noninferiority trials)…” (Hauschke, Schall, and Luus 2000).

5. “…but is determined from the practical aspects of the problem in such a way that the treatments can be considered for all practical purposes to be equivalent if their true difference is unlikely to exceed the specified Δ” (Dunnett and Gent 1977).

6. “In a study designed to show equivalence of the therapies, the quantity δ is sufficiently small that the therapies are considered equivalent for practical purposes if the difference is smaller than δ” (Blackwelder 1982).

7. “An objective of ACES is the selection of the new treatment when it is not worse than the active control by more than some difference judged to be acceptable by the clinical investigator” (Makuch and Johnson 1990).

8. “Hence, if a new therapy and an accepted standard therapy are not more than irrelevantly different concerning a chosen outcome measure, both therapies are called therapeutically equivalent” (Windeler and Trampisch 1996).

9. “The δ is a positive number that is a measure of how much worse B could be than A and still be acceptable” (Hauck and Anderson 1999).

10. “For regulatory submissions, the goal is to pick the allowance, δ, so that there is assurance of effectiveness of the new drug when the new drug is shown to be clinically equivalent to an old drug used as an active control. For trials of conservative therapies, the δ represents the maximum effect with respect to the primary clinical outcome that one is willing to give up in return for the other benefits of the new therapy” (Hauck and Anderson 1999).

11. “…where δ represents the smallest difference of medical importance. …These approaches depend on the specification of a minimal difference δ in efficacy that one is willing to tolerate” (Simon 1999).

12. “The noninferiority/equivalence margin, δ, is the degree of acceptable inferiority between the test and active control drugs that a trial needs to predefine at the trial design stage” (Hwang and Morikawa 1999).

13. “In general, the difference δ should represent the largest difference that a patient is willing to give up in efficacy of the standard treatment C for the secondary benefits of the experimental treatment E” (Simon 2001).

14. “A margin of clinical equivalence (Δ) is chosen by defining the largest difference that is clinically acceptable, so that a difference bigger than this would matter in practice” (EMEA/CPMP 2000).

15. “To determine whether the two treatments are equivalent, it is necessary first to identify what is the smallest difference in 30-day mortality rates that is clinically important” (Fleming 2000).

16. “The inherent issue in noninferiority and equivalence studies is the definition of what constitutes a clinically irrelevant difference in effectiveness” (Hauschke 2001).

17. “The smallest value that would represent a clinically meaningful difference, or the largest value that would represent a clinically meaningless difference” (Wiens 2002).

18. “Here, M is the non-inferiority margin, that is, how much C can exceed T with T still being considered noninferior to C (M > 0)” (D’Agostino, Massaro, and Sullivan 2003).

19. “…one sets out to arbitrarily choose this minimum clinically relevant difference, commonly called delta…” (Pocock 2003).

20. “The selection of an appropriate non-inferiority margin delta (Δ), i.e., the quantitative specification of an ‘irrelevant difference’ between the test and the standard treatment, poses a further difficulty in such trials” (Lange and Freitag 2005).

21. “Noninferiority margin is a response parameter threshold that defines an acceptable difference in the value of that response parameter between the experimental treatment and the positive control treatment as the selected comparator. This margin is completely dictated by the study objective” (Hung, Wang, and O’Neill 2005).

22. “If δ represents the degree of difference we wish to rule out, then we test H0A: τ ≤ –δ against H1A: τ > –δ and H0B: τ ≥ δ against H1B: τ < δ” (Senn 2007, 238).

23. “Because proof of exact equality is impossible, a prestated margin of noninferiority (Δ) for the treatment effect in a primary patient outcome is defined” (Piaggio et al. 2006).

24. “Because proof of exact equality is impossible, a prestated margin of noninferiority (Δ) for the difference in effectiveness has to be defined” (Gülmezoglu et al. 2009).

25. “Since it is not possible to determine that the drugs being compared are exactly equal, a margin of noninferiority is determined a priori and is used to demonstrate the relative effect of the study intervention” (Suda et al. 2011).

Most definitions relate δ to a clinical judgment; others relate δ to other benefits. The ICH E10 document (2001) refers to δ as the degree of inferiority of the test treatments to the control that the trial will attempt to exclude statistically. It says exactly what the statistical inference does. The document then states that “if the confidence interval for the difference between the test and control treatments excludes a degree of inferiority of the test treatment that is as large as, or larger than, the margin, the test treatment can be declared noninferior.” There is no problem with this statement if δ is small, but it could be misleading if δ is too large (Ng 2001).

In the 1990s, most ACES were not recognized as one-sided versions of equivalence. For example, ICH E10 (2001, p. 7) states the following:

Clinical trials designed to demonstrate efficacy of a new drug by showing that it is similar in efficacy to a standard agent have been called equivalence trials. Most of these are actually noninferiority trials, attempting to show that the new drug is not less effective than the control by more than a defined amount, generally called the margin.

ICH E9 (1998) distinguishes the two-sided equivalence as equivalence and one-sided equivalence as NI by stating that “…This type of trial is divided into two major categories according to its objective; one is an equivalence trial (see Glossary) and the other is a noninferiority trial (see Glossary).” Even so, the NI margin is not used in ICH E9 (1998), and the lower equivalence margin is used instead. This is in contrast to two more recent regulatory guidances (EMEA/CPMP 2005; U.S. FDA 2010), where NI is the primary focus. However, these documents do not define explicitly what δ is, as shown by the following statements made by the EMEA/CPMP (2005, p. 3) and U.S. FDA (2010, p. 7), respectively; neither do many authors (e.g., Ng 2001; Piaggio et al. 2006; Senn 2007, 238; Ng 2008; Gülmezoglu et al. 2009; Suda et al. 2011):

In fact a noninferiority trial aims to demonstrate that the test product is not worse than the comparator by more than a pre-specified, small amount. This amount is known as the noninferiority margin, or delta (Δ).

…the NI study seeks to show that the difference in response between the active control (C) and the test drug (T), (C-T), … is less than some pre-specified, fixed noninferiority margin (M).

How do you choose δ? The following is a list of suggestions in the literature. It is by no means a complete list. Again, this list is expanded from an original 9-item list that was prepared by the author of this book for an invited talk at the Drug Information Association workshop held in Vienna, Austria, in 2001 (Ng 2001).

1. “…should be smaller than differences observed in superiority trials of the active comparator” (ICH E9 1998).

2. “The margin chosen for a noninferiority trial cannot be greater than the smallest effect size that the active drug would be reliably expected to have compared with placebo in the setting of the planned trial. …In practice, the noninferiority margin chosen usually will be smaller than that suggested by the smallest expected effect size of the active control because of interest in ensuring that some clinically acceptable effect size (or fraction of the control drug effect) was maintained” (ICH E10 2001).

3. “Θ0 must be considered reasonable by clinicians and must be less than the corresponding value for placebo compared to standard treatment, if that is known. …The choice of Θ0 depends on the seriousness of the primary clinical outcome, as well as the relative advantages of the treatments in considerations extraneous to the primary outcome” (Blackwelder 1998).

4. “In general, the equivalence limits depend upon the response of the reference drug” (Liu 2000a).

5. “On the other hand, for one-sided therapeutic equivalence, the lower limit L may be determined from previous experience about estimated relative efficacy with respect to placebo and from the maximum allowance which clinicians consider to be therapeutically acceptable. …Therefore, the prespecified equivalence limit for therapeutic equivalence evaluated in a noninferiority trial should always be selected as a quantity smaller than the difference between the standard and placebo that a superior trial is designed to detect” (Liu 2000b).

6. “The extent of accepted difference (inferiority) may depend on the size of the difference between standard therapy and placebo” (Windeler and Trampisch 1996).

7. “A basis for choosing the δ for assurance of effectiveness is prior placebo-controlled trials of the active control in the same population to be studied in the new trial” (Hauck and Anderson 1999).

8. “This margin chosen for a noninferiority trial should be smaller (usually a fraction) than the effect size, Δ, that the active control would be reliably expected to have compared with placebo in the setting of the given trial” (Hwang and Morikawa 1999).

9. “The difference δ must be no greater than the efficacy of C relative to P and will in general be a fraction of this quantity delta δc” (Simon 2001).

10. “Under other circumstances it may be more acceptable to use a delta of one half or one third of the established superiority of the comparator to placebo, especially if the new agent has safety or compliance advantages” (EMEA/CPMP 1999).

11. “Choosing the noninferiority margin M … we need to state the noninferiority margin M, that is, how close the new treatment T must be to the active control treatment C on the efficacy variable in order for the new treatment to be considered noninferior to the active control” (D’Agostino, Massaro, and Sullivan 2003).

12. “The size of the acceptable margin depends on the smallest clinically significant difference (preferably established by independent expert consensus), expected event rates, the established efficacy advantage of the control over placebo, and regulatory requirements” (Gomberg-Maitland, Frison, and Halperin 2003).

13. “A prestated margin of noninferiority is often chosen as the smallest value that would be a clinically important effect. If relevant, Δ should be smaller than the ‘clinically relevant’ effect chosen to investigate superiority of reference treatment against placebo” (Piaggio et al. 2006).

14. “The choice of the noninferiority margin can be made using clinical assessment, which is to a certain extent arbitrary, and needs consensus among different stakeholders. … A reasonable criterion is to preserve 80% of the benefit of the full AMTSL package (considered as 100%) over expectant management (considered as 0%). …Preserving a higher percentage (say 90%) will push the sample size calculations very high while a smaller percentage (say 50%) may not be considered acceptable” (Gülmezoglu et al. 2009).

15. “The margin of noninferiority is typically selected as the smallest value that would be clinically significant …” (Suda et al. 2011).

16. “The determination of the noninferiority margin should incorporate both statistical reasoning and clinical judgment” (Suda et al. 2011).

Ng (1993b, 2001, 2008) proposed that the equivalence margin δ should be a small fraction (e.g., 0.2) of the therapeutic effect of the active control as compared to placebo (or effect size). This proposal is in line with the view of many authors that δ should depend on the effect size of the active control, but more specifically, recommends that δ should be a small fraction of the effect size with the objective of showing NI. Such a proposal and its motivation will be elaborated in Chapter 2.
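The arithmetic behind this proposal is simple; in the sketch below, the assumed effect size of the active control over placebo and the fraction 0.2 are hypothetical, illustrative values:

```python
# Setting the NI margin as a small fraction of the active control's effect size.
effect_size = 6.0   # assumed effect of S versus P (e.g., mm Hg reduction); hypothetical
fraction = 0.2      # the "small fraction" suggested by Ng (1993b, 2001, 2008)
delta = fraction * effect_size
print(delta)        # 1.2 under these assumed numbers
```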

ICH E10 (2001) and EMEA/CPMP (2005) suggested that the determination of δ be based on both statistical reasoning and clinical judgment, which was supported by many authors (e.g., Kaul and Diamond 2007; Suda et al. 2011). The “statistical reasoning” is due to the dependence of δ on the effect size, as Suda et al. (2011) stated that “statistical reasoning takes into account previous placebo-controlled trials to identify an estimate of the active control effect.” There is a subtle difference in wording with regard to “clinical judgment.” The following are two lists of such wordings:

1. Clinically acceptable difference, clinically irrelevant amount, clinically irrelevant difference, clinically meaningless difference, irrelevantly different, degree of acceptable inferiority

2. Clinically important, clinically meaningful difference, clinically relevant difference, clinically significant

The key word in the first and second list is “acceptable” and “important,” respectively. There are two approaches to set δ: (1) bottom-up and (2) top-down. The bottom-up approach starts from the bottom with an extremely small difference that is “acceptable” and works up, while the top-down approach starts from the top with a large difference that is “important” and works down. Hopefully, these two approaches stop at the same place, which becomes the δ. Treadwell et al. (2012, p. B-5) stated the following:

The two journal publications (Gomberg-Maitland, Frison, and Halperin 2003; Piaggio et al. 2006) described the threshold in terms of the smallest value that would be clinically important. Three regulatory documents (ICH E9 1998; EMEA/CPMP 2000; US FDA 2010) described it as the largest difference that is clinically acceptable.

The first sentence corresponds to the top-down approach, while the second sentence corresponds to the bottom-up approach. See Section 2.8 of Chapter 2 for further discussion of these approaches when δ is expressed in terms of a fraction of the effect size of the standard therapy as compared to placebo.

1.5.1 Gold Standard for Assessing Treatment Efficacy

A randomized, double-blind, placebo-controlled trial is the gold standard in assessing the efficacy of the test treatment. In such a trial, subjects are randomly assigned to either the test treatment or the placebo. “The purpose of randomization is to avoid selection bias and to generate groups which are comparable to each other” (Newell 1992, p. 837). Without randomization, the investigators may preferentially (intentionally or unintentionally) assign subjects to one group or the other. In addition, with randomization, unobservable covariates that affect the outcome are likely to be equally distributed between the two groups; thus, randomization minimizes allocation bias. Double-blind means both the investigator and the participant are unaware of the treatment (test treatment or placebo) the participant is receiving. Without double-blinding, the results may be subject to potential bias, especially if the outcome variable is subjective. It is critical that randomization be properly executed and that blinding be adequate, because lack of proper randomization and/or inadequate blinding may render the results uninterpretable due to various biases, which are difficult, if not impossible, to assess and account for. With proper randomization and adequate blinding, any observed difference beyond random chance may then be attributed to the test treatment.

The analysis of a placebo-controlled trial is relatively simple and straightforward as compared to an active-controlled trial. This will be elaborated in Section 1.5.3. However, when effective treatment is available, placebo-controlled trials come under attack for ethical reasons (Lasagna 1979). The Declaration of Helsinki calls for using the best currently proven intervention as the control (see Section 1.5.2). This makes a lot of sense from an ethical point of view. However, such a recommendation has been resisted, notably by the regulatory agency (see Section 1.5.2), due to inherent difficulties in the interpretation of an ACES (see Section 1.5.3; Temple 1997).

1.5.2 Declaration of Helsinki and U.S. Regulations

A brief overview of the Declaration of Helsinki is given by Wikipedia Contributors (2014) in the following:

The Declaration of Helsinki is a set of ethical principles regarding human experimentation developed for the medical community by the World Medical Association (WMA). The declaration was originally adopted in June 1964 in Helsinki, Finland, and has since undergone seven revisions (the most recent at the General Assembly in October 2013) and two clarifications, growing considerably in length from 11 paragraphs in 1964 to 37 in the 2013 version.

Article II.3 in the 3rd Revision of the Declaration (WMA 1989) stated:

In any medical study, every patient, including those of a control group, if any, should be assured of the best proven diagnostic and therapeutic method.

The following was added to the end of that statement in the 4th Revision of the Declaration (WMA 1996):

This does not exclude the use of inert placebo in studies where no proven diagnostic or therapeutic method exists.

Article 32 in the 6th Revision (WMA 2008) of the Declaration stated:

The benefits, risks, burdens and effectiveness of a new intervention must be tested against those of the best current proven intervention, except in the following circumstances:

• The use of placebo, or no treatment, is acceptable in studies where no current proven intervention exists; or

• Where for compelling and scientifically sound methodological reasons the use of placebo is necessary to determine the efficacy or safety of an intervention and the patients who receive placebo or no treatment will not be subject to any risk of serious or irreversible harm. Extreme care must be taken to avoid abuse of this option.

A minor revision of this article was made in the current 7th Revision (WMA 2013) of the Declaration as Article 33 under “Use of Placebo.”

In 1975, the U.S. FDA incorporated the 1964 Helsinki Declaration into its regulation governing investigational drug trials conducted in non-U.S. countries (U.S. FDA 2001b). The agency also issued a similar regulation applicable to devices in 1986, when the 1983 version of the declaration (2nd Revision) was in effect (21 CFR 814.15). Subsequently, the agency amended the regulation in 1981 to replace the 1964 declaration with the 1975 version (1st Revision), and again in 1991 (21 CFR 312.120) to replace the 1975 declaration with the 1989 version (3rd Revision). The regulations (21 CFR 312.120 and 21 CFR 814.15) have not been amended to incorporate the 2000 version of the declaration (5th Revision) (U.S. FDA 2001b), and they are silent with regard to the 1996 version of the declaration (4th Revision). On April 28, 2008, the regulations were amended, again abandoning the Declaration of Helsinki. Instead, trials are required to follow the ICH E6 (1996) guidance on good clinical practice (GCP), such as review and approval by an independent ethics committee (IEC) and informed consent from subjects. This requirement took effect on October 27, 2008, and was codified in 21 CFR 312.120 (U.S. FDA 2012).

1.5.3 Placebo Control versus Active Control and Sample Size Determination

For a placebo-controlled trial, the null hypothesis is that there is no difference between test treatment and placebo. This null hypothesis is usually tested at a two-sided 0.05 or one-sided 0.025 significance level. Such hypothesis testing has been referred to as the conventional approach (Ng 1995). See Section 1.6 for further discussion of using this approach in an ACES. Aside from randomization and blinding, there is an incentive to conduct high-quality studies, as poorly conducted studies (e.g., mixing up treatment assignment, poor compliance, etc.) may not detect a true difference, since there is an increase in variability and bias toward equality. Despite the many scientific merits of placebo-controlled trials, such trials have become controversial from an ethical standpoint as an increasing number of effective treatments are brought to market. For example, the Declaration of Helsinki (1989) (see Section 1.5.2) essentially called for an active-controlled trial rather than a placebo-controlled trial when effective treatments are available.

For an active-controlled trial, if the objective is to show superiority of test treatment over the active control, the statistical principle in hypothesis testing is the same as that in the placebo-controlled trial, and there will be no issues. However, many issues arise if the objective is to show that the test treatment is similar to or not inferior to the active control, the so-called equivalence and noninferiority trial. These issues have been well recognized in the literature since the early 1980s. See, for example, Temple (1982), Temple (1997), Senn (2007), U.S. FDA (2010), and Rothmann, Wiens and Chan (2012). These issues will be discussed in Sections 2.5 through 2.7 of Chapter 2.

There is a consensus that use of a placebo is unethical and should be prohibited when an intervention shown to improve survival or decrease serious morbidity is available. See, for example, Temple and Ellenberg (2000), Emanuel and Miller (2001), and Lyons (2001). Recognizing the inherent difficulties in assessing δ-equivalence/δ-no-worse-than, and hence the efficacy of the test treatment in an ACES, these authors also elaborate different scenarios where use of a placebo is ethical when there is an effective treatment, even though the Declaration of Helsinki (4th Revision 1996) essentially excludes the use of placebo as a control in all clinical trials (Section 1.5.2). These difficulties arise in an ACES, but not in placebo-controlled trials (or superiority trials), as they relate to the assay sensitivity (ICH E10, 2001). See Sections 2.6 and 2.7 of Chapter 2 for further discussion. Note that the current version of the declaration (7th Revision; WMA 2013) allows use of placebo (or no treatment) when (1) there is no effective treatment or (2) under certain scenarios even when there is an effective treatment (Section 1.5.2).

Sample size determination in the conventional null hypothesis testing in a placebo-controlled trial involves specification of the Type I (α) and Type II (β) error rates and δ0. In practice, specification of δ0 may be arbitrary, as are α and β to a lesser extent, although α = 0.05 is typically used. As Feinstein (1975) wrote:

What often happens is that the statistician and investigator decide on the size of δ0. The magnitude of the sample is then chosen to fit the two requirements (1) that the selected number of patients can actually be obtained for the trial and (2) that their recruitment and investigation can be funded.

To specify δ0, the paper continued: “In the absence of established standards, the clinical investigator picks what seems like reasonable value… If the sample size that emerges is unfeasible, δ0 gets adjusted accordingly, and so do α and β, until n comes out right.” Spiegelhalter and Freedman (1986) summarized precisely the practice in the specification of δ0 in the following:

There is very little explicit guidance as to how to set δ0, and in practice it seems likely that δ0 is juggled until it is set at a value that is reasonably plausible, and yet detectable given the available patients.

In the regulatory environment, however, an α level of one-sided 0.025 or two-sided 0.05 is the standard, and the study is typically powered at 80% or 90%. Actually, the estimate of the variability of the continuous endpoint or the background rate for the binary endpoint also plays a role in the sample size calculation. Once the sample size is determined and the study is completed, δ0 does not play any role in the statistical analyses or inferences. The null hypothesis of equality is either rejected or not. No inference may be made with regard to δ0 when the null hypothesis is rejected, although the point estimate and confidence interval provide information regarding the effect size.

It should be noted that too small a δ0 would result in a very large sample size, which is a waste of valuable resources. Furthermore, this may lead to an undesirable outcome in which the treatment difference is statistically significant but too small to be clinically meaningful. Very often δ0 is set equal to twice (or more) the minimum difference of clinical interest (Jones et al. 1996, p. 37).

On the other hand, an ACES is intended to show that the new treatment is sufficiently similar to (or not too much worse than) the standard therapy to be clinically indistinguishable (or noninferior). Therefore, the margin δ should be smaller than δ0 (Jones et al. 1996; Kaul and Diamond 2007; ICH E9 1998; Liu 2000b; Suda et al. 2011). Jones et al. (1996) suggest that δ be no larger than half of δ0, leading to sample sizes roughly four times as large as those in similar placebo-controlled trials.
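The roughly fourfold increase follows from the usual normal-approximation sample size formula, since the margin enters as a squared term in the denominator; the sketch below uses hypothetical values of σ, δ0, α, and power:

```python
# Per-group sample size for detecting a difference delta0 in a placebo-controlled
# superiority trial, versus an NI trial with margin delta = delta0 / 2 (assuming a
# true difference of zero), at the same one-sided alpha and power.
from scipy.stats import norm

def n_per_group(sigma, margin, alpha=0.025, power=0.9):
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return 2 * (sigma * z / margin) ** 2

sigma, delta0 = 10.0, 5.0                        # hypothetical values
n_sup = n_per_group(sigma, delta0)               # superiority versus placebo
n_ni = n_per_group(sigma, delta0 / 2)            # NI with delta = delta0 / 2
print(round(n_sup), round(n_ni), n_ni / n_sup)   # the ratio is 4
```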

In the 1970s, there was widespread recognition among statisticians that it is a flaw to accept the null hypothesis of no difference between two treatments (referred to as the conventional null hypothesis) when the null hypothesis is not rejected (e.g., Dunnett and Gent 1977; Makuch and Simon 1978). Consequently, many authors (e.g., Anderson and Hauck 1983; Blackwelder 1982) criticized the use of significance testing of the conventional null hypothesis (referred to here as “the conventional approach”) in situations in which the experimenter wishes to establish the equivalence of two treatments. The main criticisms are that (1) two different treatments (or regimens) are not expected to have exactly the same treatment effect and (2) two treatments cannot be shown to be literally equivalent. Although these criticisms are legitimate, they have limited practical force because (1) no statistical method can establish equivalence in the strict sense (strict equality) and (2) the confidence interval approach and other forms of hypothesis testing (e.g., Anderson and Hauck 1983; Blackwelder 1982) (referred to here as “the role-reversal approach,” e.g., the two one-sided tests procedure) were proposed to establish only δ-equivalence, not strict-sense equivalence (Ng 1995).

Can we use the conventional approach to establish that two treatments are δ-equivalent or that one treatment is δ-no-worse than the other treatment? The remainder of this section will address this question.

When the conventional null hypothesis is not rejected, what can we conclude? Although accepting the conventional null hypothesis when it is not rejected is a flaw, failing to reject the conventional null hypothesis would lead one to believe that the treatment difference is not very “far” from zero. When the conventional null hypothesis is not rejected, one would believe that the larger the sample size, the closer to zero the treatment difference is. If the sample size is such that the Type II error rate (β error) at some δ is sufficiently small (e.g., < 0.025), then we can conclude that the two treatments are δ-equivalent. With this interpretation, Ng (1993a) showed that under the assumptions of normality with common known variance, the conventional approach to establish δ-equivalence and δ-no-worse-than coincides with the role-reversal approach if the Type II error rate (β error) in the conventional approach at δ (or –δ) is equal to the Type I error rate (α error) in the role-reversal approach.

The argument for the equivalence of the two approaches for establishing δ-no-worse-than is briefly described as follows. It is obvious that the result holds if we are testing a simple null hypothesis against a simple alternative hypothesis, as opposed to a simple null against a composite alternative hypothesis. We then extend the simple alternative hypotheses in both approaches to the appropriate composite alternative hypotheses. Next, we perform the tests as if we are testing simple versus simple alternative hypotheses, except that in the conventional approach, we would conclude δ-no-worse-than instead of accepting the conventional null hypothesis if it is not rejected. The result then follows. The argument for the equivalence of the two approaches for establishing δ-equivalence is similar and is omitted here. Note that the argument for the equivalence of the two approaches assumes known variance rather than observed power calculated after the study is completed. Such a power calculation is not recommended (Hoenig and Heisey 2001).
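A numerical illustration of this coincidence is sketched below, under the known-variance normal setup described above, with the conventional approach taken (as an assumption for this sketch) to be the one-sided test of no difference in the direction of inferiority, and with hypothetical values of σ, n, and δ: when the β error of the conventional test at a mean difference of –δ is set equal to the α level of the role-reversal test, the two decision thresholds for the observed mean difference agree.

```python
# Check that the "fail to reject the conventional null" boundary equals the
# role-reversal (NI) rejection boundary when beta_conventional(-delta) = alpha_NI.
import numpy as np
from scipy.stats import norm

sigma, n, delta = 1.0, 200, 0.4
se = sigma * np.sqrt(2.0 / n)    # known-variance standard error of the mean difference

alpha_ni = 0.025                 # level of the role-reversal (NI) test
# matching condition: beta of the conventional test at -delta equals alpha_ni,
# which fixes the conventional critical value z_conv via delta/se = z_{alpha_ni} + z_conv
z_conv = delta / se - norm.ppf(1 - alpha_ni)

threshold_conventional = -z_conv * se                           # fail-to-reject boundary
threshold_role_reversal = -delta + norm.ppf(1 - alpha_ni) * se  # reject H0(1) above this
print(threshold_conventional, threshold_role_reversal)          # identical
```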

Therefore, under the normality assumption with a common known variance, the conventional approach may be used to establish that two treatments are δ-equivalent or that one treatment is δ-no-worse than the other treatment. In practice, however, the conventional approach often cannot control the β error because the variance has to be estimated and the sample size might be smaller than planned due to dropouts, low recruitment, etc. On the other hand, the role-reversal approach does not have the difficulty in controlling the α error. That does not mean that the conventional approach is inappropriate in sample size calculation in the design of active-control equivalence studies. In fact, with proper choices of the α and β errors, the sample size calculation using either approach will give the same answer, where (1) the β error in the conventional approach is calculated at an alternative that the mean difference is δ and (2) the β error in the role-reversal approach is calculated at an alternative that the mean difference is zero (Ng 1995; Ng 1996).

The following discussions focus on δ-equivalence, although they are applicable to the δ-no-worse-than as well. It is incorrect to accept the conventional null hypothesis, regardless of the sample size. In fact, we can never establish the conventional null hypothesis in the strict sense no matter how large the sample size is, unless we “exhaust” the whole population (and hence, the true means are known). However, for fixed variance and fixed δ and with the same nonsignificant p-value, the larger the sample size, the stronger the evidence in supporting δ-equivalence. Furthermore, for fixed variance with the same nonsignificant p-value, the larger the sample size, the smaller the δ for claiming δ-equivalence (Ng 1995).

One should realize that establishing δ-equivalence has little or no meaning at all if δ is too large and that any two treatments are δ-equivalent if δ is large enough. For example, in antihypertensive studies for which the reduction in supine diastolic blood pressure is the primary efficacy variable, if δ = 8 mm Hg and the therapeutic effect of the standard therapy as compared to placebo is only 6 mm Hg, then we don’t really gain anything by concluding that the test drug and the standard therapy are δ-equivalent, because the placebo is also δ-equivalent to the standard therapy (Ng 1995).

Throughout the rest of this book, T denotes the test or experimental treatment, S denotes the standard therapy or the active control, P denotes the placebo, and δ denotes the noninferiority (NI) margin. Furthermore, assume that there is no concurrent placebo control due to ethical reasons in life-threatening situations, for example. Note that T, S, and P could be the true mean responses for continuous outcomes (see Chapters 2 and 3) or the true success rates (or proportions of successes) for binary outcomes (see Chapter 4). Unless noted otherwise, we assume that a larger value corresponds to a better outcome.

The NI hypotheses (H0(1) and H1(1)) in Section 1.2 may be restated using the notations introduced in this section as follows:

H0: T – S ≤ –δ

versus

H1: T – S > –δ

or equivalently

H0: T ≤ S – δ (1.3a)

versus

H1: T > S – δ (1.3b)

Anderson S and Hauck WW (1983). A New Procedure for Testing Equivalence in Comparative Bioavailability and Other Clinical Trials. Communications in Statistics: Theory and Methods, 12:2663-2692.

Blackwelder CW (1982). “Proving the Null Hypothesis” in Clinical Trials. Controlled Clinical Trials, 3:345-353.

Blackwelder CW (1998). Equivalence Trials. In: Armitage P and Colton T, eds. Encyclopedia of Biostatistics. New York: John Wiley, 1367-1372.

Chow SC (2011). Quantitative Evaluation of Bioequivalence/Biosimilarity. J Bioequiv Availab, S1:002. doi: https://omicsonline.org/0975-0851/JBB-S1-002.php.