ABSTRACT

Switching between superiority and noninferiority (NI) is attractive in active equivalence control studies. It reduces to the simultaneous testing of both hypotheses using a one-sided confidence interval. There was considerable interest in this topic in the late 1990s from the regulatory authority in Europe (e.g., EMEA/CPMP 2000), as well as among pharmaceutical statisticians (e.g., Phillips et al. 2000). Dunnett and Gent (1996) and Morikawa and Yoshida (1995) showed that multiplicity adjustment is not necessary by using the intersection-union (IU) and the closed-testing (CT) principles, respectively.

Although there is no inflation of the Type I error rate, Ng (2003) cautioned against such simultaneous testing for NI and superiority in a confirmatory evaluation. Simultaneous testing is exploratory in nature because it may be considered as testing only one null hypothesis that depends on the outcome of the trial, such as the lower confidence limit. Therefore, a finding of superiority in simultaneous testing is less credible than a finding of superiority in a superiority trial where no switching from superiority to NI is allowed. Depending upon how the U.S. Food and Drug Administration’s (FDA’s) general requirement of at least two adequate and well-controlled clinical trials is interpreted (see Section 6.2.5), a finding of superiority in simultaneous testing may or may not be used as one of the two trials required to claim superiority (see Section 6.3.2).

In addition, there are other concerns. Simultaneous testing of both hypotheses allows a test treatment that is expected to have the same effect as an active control to claim superiority by chance alone, without losing the chance of showing NI. This would lead to a higher number of erroneous claims of superiority, compared with the situation where only one null hypothesis is to be tested, for the following reason. If only one null hypothesis is to be tested and researchers expect the test treatment to have the same effect as an active control, they will likely choose to test NI rather than superiority. However, with simultaneous testing, superiority will be tested, regardless of the expectation. Therefore, more test treatments that are expected to have the same effect as an active control would be tested for superiority with simultaneous testing than would be if only one null hypothesis were to be tested. Consequently, simultaneous testing will lead to (1) more erroneous claims of superiority, although the Type I error rate remains the same; and (2) a higher false discovery rate because the prior probability is increased (Ng 2007), which is one of the reasons why most published research findings are false (Ioannidis 2005). The details will be discussed in Section 6.3.3.

Section 6.2 presents background information, including (1) the Committee for Proprietary Medicinal Products (CPMP) points-to-consider document on switching between superiority and NI (see Section 6.2.1) and (2) simultaneous tests for NI and superiority (see Section 6.2.2). Section 6.3 presents the statistical issues with simultaneous testing, including increases in the false discovery rate. Decision-theoretic views are presented in Section 6.4, followed by controlling the Type I error rate of superiority claims conditioned on establishing the NI in Section 6.5. Discussions and concluding remarks are given in Section 6.6.

It should be noted that in this chapter, we (1) use the fixed-margin method discussed in Section 5.3 in Chapter 5, and (2) assume, for simplicity, that the same analysis set is used in the NI testing and the superiority testing, although this is most likely not the case in practice, as discussed in Chapter 10.

6.2.1 Switching between Superiority and Noninferiority

In 2000, the European Agency for the Evaluation of Medicinal Products, Committee for Proprietary Medicinal Products (EMEA/CPMP 2000) issued a document titled “Points to Consider on Switching between Superiority and Noninferiority.”

Basically, in an NI trial, if the null hypothesis is rejected, we can proceed to test for superiority. There is no multiplicity issue because the test procedure is closed. In a superiority trial, if we fail to reject the null hypothesis, we can proceed to test for NI.

The document asserts that there is no multiplicity issue. However, it does point out the issue of post hoc specification of δ, if it is not already specified.

6.2.2 Simultaneous Tests for Noninferiority and Superiority

Switching the objective between superiority and NI means simultaneously testing using a one-sided (1 – α)100% lower confidence interval (CI), as shown in Figure 6.1. The axis is the mean difference of the test treatment minus that of the standard therapy. If the lower limit of the CI is greater than 0, then superiority is shown. If it is between –δ and 0, then NI is shown. Otherwise, neither NI nor superiority is shown.
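The decision rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `classify_simultaneous_test` and the example limits are hypothetical, and the lower confidence limit is assumed to have been computed elsewhere.

```python
def classify_simultaneous_test(lower_limit: float, delta: float) -> str:
    """Classify the trial outcome from the one-sided lower confidence limit.

    lower_limit: lower limit of the one-sided (1 - alpha)100% CI for T - S.
    delta: prespecified NI margin (must be positive).
    """
    if delta <= 0:
        raise ValueError("the NI margin delta must be positive")
    if lower_limit > 0:
        return "superiority"        # the CI excludes 0
    if lower_limit > -delta:
        return "noninferiority"     # the CI excludes -delta but not 0
    return "neither"                # the CI includes -delta


# Illustrative lower limits with an assumed margin of delta = 2.
for limit in (0.4, -1.2, -2.5):
    print(limit, classify_simultaneous_test(limit, delta=2.0))
```

Note that a single interval yields all three possible conclusions, which is why the procedure amounts to choosing the hypothesis after seeing the data.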

Such simultaneous testing is discussed by Dunnett and Gent (1996) and Morikawa and Yoshida (1995). Both papers argue that a multiplicity adjustment is not necessary. The first paper uses the IU principle; that is, we conclude superiority if both hypotheses for NI and superiority are rejected. The second paper uses the CT principle, in which we test the null hypothesis for superiority when the intersection of hypotheses for NI and superiority is tested and rejected.

In October 1998, the Statisticians in the Pharmaceutical Industry (PSI) (Phillips et al. 2000) organized a discussion forum in London. One of the questions posed was “Is simultaneous testing of equivalence and superiority acceptable?” There was no consensus in the discussion. Some participants felt that in a superiority trial, if we fail to reject the null hypothesis of equality, no claim of equivalence/NI could be made. Others felt that it is permissible to proceed to test for NI, and they cited the second paper (i.e., Morikawa and Yoshida 1995).

6.2.3 Assumptions and Notation

We assume that the response variable follows a normal distribution with a common variance σ² and that a larger response variable corresponds to a better treatment. For a given d, define the null hypothesis H0(d) as the test treatment being worse than the standard therapy by d or more; that is,

$$H_0(d):\; T \le S - d$$

and the alternative hypothesis H1(d) is the complement of the null hypothesis; that is,

$$H_1(d):\; T > S - d$$

To test for NI, set d = δ; to test for superiority, set d = 0. These hypotheses are shown graphically in Figures 6.2a and 6.2b, respectively, where the axis represents the mean response. Finally, we assume that all null hypotheses are tested at the same α level.

6.2.4 Hypothesis Testing versus Interval Estimation

It is well known that we can test the null hypothesis H0(d) at significance level α by constructing a one-sided (1 – α)100% lower CI for T – S, and rejecting the null hypothesis if and only if the CI excludes –d, as shown in Figure 6.3. In this figure, the axis represents the mean difference of the test treatment minus the standard therapy.

If the lower limit of the CI is L, then H0(d) will be rejected for all –d < L. However, we should be cautious not to fall into the trap of post hoc specification of the null hypothesis in the sense of specifying “d” as the lower limit of the 95% CI after seeing the data. For example, in a premarket notification 510(k) submission to the U.S. FDA in December 2000, it was stated that when using a one-sided 95% CI, the true mean of this population is greater than 83.6%. Presumably, 83.6% is the one-sided 95% lower confidence limit. Since 83.6% is not prespecified, the conclusion that the true mean of this population is greater than 83.6% cannot be treated as confirmatory in the sense of hypothesis testing; otherwise, it would be just like testing the null hypothesis that can just be rejected (Pennello, Maurer, and Ng 2003). Such post hoc specification of the null hypothesis is exploratory and is unacceptable for confirmatory testing for the following reason. If the study is repeated with the same study design, sample size, etc., then the unconditional probability of rejecting the same null hypothesis again (in the sense of considering the threshold value being random) is only 50%.
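The 50% figure can be checked with a small simulation. The sketch below assumes a known standard error and normally distributed estimates; the function name, the default parameter values, and the seed are all illustrative choices rather than anything taken from the cited papers. The null hypothesis that "can just be rejected" is set at the first trial's lower confidence limit, and the second trial rejects it only if its own lower limit is larger.

```python
import random


def repeat_rejection_rate(theta=0.0, se=1.0, n_pairs=100_000, seed=1):
    """Estimate Pr[second-trial lower limit > first-trial lower limit].

    Each trial's estimate of T - S is drawn from Normal(theta, se). With a
    known standard error the lower limit is (estimate - z * se), so comparing
    the two limits is the same as comparing the two estimates; z cancels.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        first = rng.gauss(theta, se)    # first trial fixes the post hoc threshold
        second = rng.gauss(theta, se)   # independent replication
        if second > first:
            hits += 1
    return hits / n_pairs


rate = repeat_rejection_rate()
print(rate)
```

Because the two estimates are exchangeable, the estimated rate should sit close to 0.5 whatever values of `theta` and `se` are assumed.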

As we know, in hypothesis testing, when the null hypothesis H0(d) is rejected, we conclude that T – S > –d. We note that it is most likely that T – S is considerably larger than –d (but we never know how much larger); otherwise, the study would not have had enough power to reject the null hypothesis.

On the other hand, if the 100(1 – α)% lower confidence limit is L, we want to refrain from concluding that T – S > L because doing so is similar to post hoc specification of the null hypothesis.

6.2.5 Two Adequate and Well-Controlled Clinical Trials

The U.S. FDA’s general requirement of at least two adequate and well-controlled clinical trials may be interpreted in two different ways:

1. Two independent confirmatory trials are conducted more or less in parallel.

2. Two trials are conducted sequentially. The results of the first trial may be used to design the second trial.

These two different interpretations are discussed by Maurer (Pennello, Maurer, and Ng 2003), respectively, as (1) two “identical” trials run in parallel and (2) two trials conducted sequentially.

6.3.1 Type I Error Control and Logical Flaws

Although simultaneous testing for NI and superiority is accepted by the EMEA/CPMP (2000) and the U.S. FDA (2010), this can lead to the acceptance of testing several nested null hypotheses simultaneously, which is not desirable. Two arguments against simultaneous testing are presented in this subsection.

Instead of simultaneous testing for NI and superiority, simultaneous testing of two null hypotheses, H0(d1) and H0(d2), for any d1 > d2 may be performed as shown in Figure 6.4a. If the lower limit exceeds –d2, we reject H0(d2). If the lower limit is between –d1 and –d2, we reject H0(d1), but not H0(d2). Otherwise, we reject neither of the null hypotheses. According to Dunnett and Gent (1996) and Morikawa and Yoshida (1995), with this approach, no multiplicity adjustment is necessary. If we can simultaneously test two hypotheses without adjustment, there is no reason why we cannot simultaneously test three nested null hypotheses. In fact, we can simultaneously test any number, say k, of nested hypotheses, H0(d1), …, H0(dk), without adjustment, as shown in Figure 6.4b for d1 > d2 > … > dk. Operationally, such simultaneous testing of k nested hypotheses is like specifying one null hypothesis to be tested after seeing the data. For example, if the one-sided 100(1 – α)% lower confidence limit falls between –dj and –dj+1 for some j, then we test H0(dj). Since we can simultaneously test as many nested hypotheses as we like without adjustment, we can choose k large enough and d1 > d2 > … > dk so that (1) the one-sided 100(1 – α)% lower confidence limit exceeds –d1 almost with certainty, and (2) the difference between the two adjacent d’s is as small as we like. Therefore, simultaneous testing of these k nested hypotheses is similar to post hoc specification of the null hypothesis that can just be rejected.

To reiterate the problem, if we accept simultaneous testing for NI and superiority without some kind of adjustment (but not a multiplicity adjustment for the Type I error rate), then we have no reason not to accept simultaneous testing of three, four, or any number of k nested null hypotheses, which contradicts the fact that post hoc specification of the null hypothesis (in the sense discussed in Section 6.2.4) is unacceptable. Therefore, accepting simultaneous testing of NI and superiority on the basis of no inflation of the Type I error rate is, logically, a flaw.

There is nothing wrong with the IU and CT principles. In fact, the probability of rejecting at least one true null hypothesis is controlled at α when many nested null hypotheses are tested simultaneously. In other words, there is no inflation of the Type I error rate. This can be shown easily as follows (see also Hsu and Berger 1999). If T – S is in (–dj, –dj+1) for some j, then H0(di) is false for i ≤ j and true for all i > j. It follows that

$$\Pr[\text{Rejecting at least one true null hypothesis}] \le \Pr[\text{Rejecting } H_0(d_{j+1}) \mid T - S = -d_{j+1}] = \alpha$$

where “Pr” stands for probability. If we accept simultaneous testing of NI and superiority because the Type I error rate is controlled, why don’t we accept simultaneous testing of many nested null hypotheses as well? Here is the problem with simultaneous testing of many nested null hypotheses. If we simultaneously test many nested null hypotheses, we will have a low probability of confirming the findings of such testing. For example, in the first trial, if H0(dj) is rejected but H0(dj+1) is not and the same trial is repeated, then the unconditional probability (in the sense of considering H0(dj) being random) that H0(dj) will be rejected in the second trial could be as low as 50% as the number of nested hypotheses approaches infinity (see Section 6.2.4). Therefore, accepting simultaneous testing of NI and superiority on the basis of the Type I error rate being controlled is, logically, a flaw.

6.3.2 An Assessment of the Problems

How would we assess the problems when two nested null hypotheses are tested simultaneously, in particular, when NI and superiority are tested simultaneously? One way is to assess the probability of confirming the finding from the first trial in the presumed second independent trial relative to that of testing for NI. To do so, we assume that the variance is known and let

$$\theta = T - S$$

For a fixed d, let fd(θ) be the power function for testing the null hypothesis H0(d); that is, fd(θ) = Pr[Rejecting H0(d)|θ].

These power functions are shown graphically in Figure 6.5a, for d = δ (= 2) and 0, where α = 0.025. The sample size is such that fδ(0) is 0.8; that is, the study has 80% power to conclude NI at θ = 0.

Suppose we test one null hypothesis H0(δ) and we reject it. If the same trial is repeated independently, then the probability of rejecting the null hypothesis again in the second trial is given by the solid line in Figure 6.5a, which is denoted by fδ(θ).

Suppose we test H0(δ) and H0(0) simultaneously and H0(δ) or H0(0) is rejected. If the same trial is repeated independently, then the probability (as a function of θ, but given that H0(δ) is rejected) of rejecting the same hypothesis again in the second trial is given by

$$
\begin{aligned}
&\Pr[H_0(\delta)\text{ is rejected but not } H_0(0) \mid H_0(\delta)\text{ is rejected}] \times \Pr[H_0(\delta)\text{ is rejected in the second trial}] \\
&\quad + \Pr[H_0(0)\text{ is rejected} \mid H_0(\delta)\text{ is rejected}] \times \Pr[H_0(0)\text{ is rejected in the second trial}] \\
&= \{[f_\delta(\theta) - f_0(\theta)]/f_\delta(\theta)\} \cdot f_\delta(\theta) + [f_0(\theta)/f_\delta(\theta)] \cdot f_0(\theta) \\
&= [1 - w(\theta)] \cdot f_\delta(\theta) + w(\theta) \cdot f_0(\theta)
\end{aligned}
$$

where w(θ) = f0(θ)/fδ(θ). Note that this power function is a weighted average of the two power functions and is shown graphically by the solid line in Figure 6.5b. Taking the ratio of this power function over the power function of the second trial when only H0(δ) is tested (that is, fδ(θ)), we have

$$1 - w(\theta) + w^2(\theta)$$

This ratio is shown graphically in Figure 6.5c. It can be shown that this ratio may be as low as 0.75. In other words, there may be a 25% reduction in power in confirming the finding of simultaneous testing for NI and superiority.
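The 0.75 lower bound can be checked numerically. The sketch below assumes the setting stated earlier in this subsection (known variance, δ = 2, α = 0.025, and 80% power for NI at θ = 0) and uses the standard normal-theory form of the power function for a one-sided test; the function names and the search grid are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.025
DELTA = 2.0
Z_ALPHA = STD_NORMAL.inv_cdf(1 - ALPHA)
# Standard error of the estimated difference, chosen so that f_delta(0) = 0.8.
SE = DELTA / (Z_ALPHA + STD_NORMAL.inv_cdf(0.80))


def power(theta: float, d: float) -> float:
    """f_d(theta): probability of rejecting H0(d) when T - S = theta."""
    return STD_NORMAL.cdf((theta + d) / SE - Z_ALPHA)


def confirmation_ratio(theta: float) -> float:
    """Ratio 1 - w + w^2, with w = f_0(theta) / f_delta(theta)."""
    w = power(theta, 0.0) / power(theta, DELTA)
    return 1 - w + w * w


# Search a grid of theta values for the smallest ratio.
grid = [-2 + 0.01 * i for i in range(801)]
min_ratio = min(confirmation_ratio(t) for t in grid)
print(min_ratio)
```

Since 1 – w + w² equals 0.75 + (w – 0.5)², the minimum is reached where the superiority test has half the power of the NI test, and the ratio returns to 1 when w is near 0 or near 1.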

It should be noted that the main purpose of the second trial is to determine if the conclusion drawn from the first trial, where nested hypotheses are tested simultaneously, is credible, rather than to serve as one of the two adequate and well-controlled clinical trials. In other words, we are making an assessment on whether the first trial may be used as one of the confirmatory trials in the first interpretation (see Section 6.2.5). For example, one should not believe the conclusion based on testing the null hypothesis that can just be rejected because there is only a 50% chance (unconditional on the outcome of the first trial) of the null hypothesis being rejected again if the trial is repeated with the same sample size. Simultaneous testing for NI and superiority is exploratory in nature, in the sense of testing one null hypothesis that is data dependent [i.e., testing H0(δ) if the lower confidence limit < 0, and H0(0) otherwise]. Therefore, the study should not be used as one of the two independent confirmatory trials for the first interpretation to claim superiority. On the other hand, it may be used as one of the two well-controlled clinical trials required by the U.S. FDA to claim superiority in the second interpretation because it is not used as a confirmatory trial.

6.3.3 Simultaneous Testing of Noninferiority and Superiority Increases the False Discovery Rate

This subsection shows that simultaneous testing of NI and superiority (see Section 6.2.2) would increase the false discovery rate as shown by Ng (2007). Rejection of the null hypothesis in a one-sided test is called discovery by Soric (1989), who is concerned about the proportion of false discoveries in the set of declared discoveries. Such a proportion corresponds to the conditional probability of a false rejection for a specific null hypothesis given that this null hypothesis is rejected. This conditional probability may be considered from a Bayesian perspective as the posterior probability. This posterior probability depends on the Type I error rate (α), the statistical power (1 – β), and the prior probability (i.e., the probability that the null hypothesis is true). More specifically,

$$\Pr[H_0\text{ is true} \mid H_0\text{ is rejected}] = \frac{\alpha \Pr[H_0\text{ is true}]}{\alpha \Pr[H_0\text{ is true}] + (1 - \beta)\Pr[H_1\text{ is true}]}$$

Benjamini and Hochberg (1995) define the term false discovery rate (FDR) as the expected proportion of errors among the rejected hypotheses.

To envision the concern of simultaneous testing for NI and superiority, let us consider what would happen under the following two situations:

Situation 1: Testing only one null hypothesis
Situation 2: Simultaneous testing for NI and superiority

Suppose that there are 2000 products of which 1000 have the same efficacy as the active control (call this category A), while the other 1000 products are better than the active control (call this category B). To conduct a confirmatory trial, the sponsors make a preliminary assessment of their products. Based on the preliminary assessment, the products may be grouped into category A* (same efficacy) and category B* (better). Assume that the sponsor conducts an NI trial for a product in category A* and a superiority trial for a product in category B* in situation 1 and simultaneously tests for NI and superiority, regardless of the preliminary assessment, in situation 2.

Suppose that the preliminary assessment has a 20% error rate for both categories. Then, in situation 1, 200 products in category A will be tested for superiority, and 800 products in category B will be tested for superiority (see Table 6.1). On the other hand, in situation 2, all 2000 products will be tested for superiority.

In situation 1, we expect to falsely claim superiority for 5 products from category A, as compared to 25 products in situation 2, because 1000 category A products will be tested for superiority in situation 2, whereas only 200 will be tested for superiority in situation 1. Although the error rate is the same in both situations, more products in category A would be tested in situation 2 than in situation 1, resulting in more erroneous claims of superiority.

Note that 1000 products will be tested for superiority in situation 1 compared with 2000 products in situation 2. Furthermore, for those products tested for superiority, the proportion of products in category A (i.e., true null hypothesis) in situation 1 is 0.2 compared with 0.5 in situation 2. Therefore, testing two hypotheses would result in an increase in the proportion of true null hypotheses. Since the proportion of true null hypotheses is the prior distribution in the Bayesian paradigm, the FDR increases from 1/145 to 1/37, as shown in Table 6.1.
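The counts behind this comparison can be reproduced with exact arithmetic. The sketch below assumes a one-sided α of 0.025 and 90% power for detecting superiority in category B products; these two values are inferred from the counts quoted in this chapter and are labeled as assumptions in the code. The function and variable names are illustrative.

```python
from fractions import Fraction

ALPHA = Fraction(25, 1000)   # assumed one-sided Type I error rate
POWER = Fraction(9, 10)      # assumed power to detect superiority (category B)
ERROR_RATE = Fraction(1, 5)  # preliminary-assessment error rate, both categories
N_A = N_B = 1000             # products per category


def superiority_claims(n_a_tested, n_b_tested):
    """Return (expected false claims, expected true claims, FDR)."""
    false_claims = ALPHA * n_a_tested   # category A: null hypothesis is true
    true_claims = POWER * n_b_tested    # category B: product really is better
    fdr = false_claims / (false_claims + true_claims)
    return false_claims, true_claims, fdr


# Situation 1: only products assessed as "better" are tested for superiority.
situation_1 = superiority_claims(ERROR_RATE * N_A, (1 - ERROR_RATE) * N_B)
# Situation 2: every product is tested for superiority.
situation_2 = superiority_claims(N_A, N_B)

print(situation_1)
print(situation_2)
```

Under these assumptions the false claims are 5 versus 25 and the FDR is 1/145 versus 1/37, matching the figures quoted in the text.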

TABLE 6.1 A Comparison between the Two Situations: Testing for Superiority

Koyama and Westfall (2005) (1) summarized the issues discussed in Section 6.3.2 and (2) compared the two strategies discussed in Section 6.3.3 using a Bayesian decision-theoretic standpoint. These issues and the two strategies were originally discussed by Ng (2003). The two strategies are (1) testing one hypothesis based on the preliminary assessment (referred to as Ng), and (2) simultaneous testing of both hypotheses, regardless of the preliminary assessment, or “test both” (referred to as TB).

There are three components in their studies:

1. A “normal with an equivalence spike” model is used as the prior distribution of the effect size, which is indexed by three parameters: the proportion of spike equivalence (p) and the mean (m) and standard deviation (s) of the normal distribution.

2. The probability of selecting the NI test (ps) as a function of the effect size.

3. The loss matrix indexed by one parameter (c).

Five values are used for each of the five parameters (i.e., p, m, s, ps, c), resulting in 5⁵ = 3125 combinations of these parameter values for the Ng method and 5⁴ = 625 for the TB method, because ps is irrelevant in the TB method. For a given set of parameters, an optimal critical value may be computed with the associated minimum expected loss. For each parameter, comparisons between the two methods are based on the loss relative to this minimum expected loss. Furthermore, the comparisons are based on the median of loss for other parameter values not in the comparisons. For example, the median of 625 values for a given ps value for the Ng method is compared with the median of 625 values for the TB method.

The authors recommended always testing NI and superiority simultaneously in light of very high “winning” percentages of the TB method. However, such a recommendation is not warranted in a confirmatory trial because (1) changing the parameter values and/or the loss matrix could turn it around; (2) the error rates of 80%, 60%, and 50% (corresponding to the parameter values of 0.2, 0.4, and 0.5 for ps, respectively) in the preliminary assessment are unrealistic; and (3) the comparisons are based on the median of loss for other parameter values (Ng 2007).

To alleviate the concerns raised by Ng (2003, 2007), as discussed in Section 6.3, about switching the objective from NI to superiority, Yuan, Tong, and Ng (2011) proposed to control the conditional Type I error rate of the second-step superiority test at the nominal significance level of α. This leads to testing superiority at a significance level lower than α, thus decreasing the incidence of erroneous claims of superiority.

For simplicity, to derive the conditional Type I error rate, assume that σt² and σc² are known, and let Φ(·) and zα be the cumulative distribution function (CDF) and the upper α-quantile of a standard normal distribution, respectively. Let σ² = σt²/nt + σc²/nc, where nt and nc denote the sample sizes for the test and control arms, respectively. Let W = ∑Xi/nt – ∑Yj/nc, where Xi and Yj are the individual values of the primary endpoint for the test treatment and control, respectively, for i = 1,…, nt and j = 1,…, nc. Let the first-step NI hypothesis be tested at a significance level of α1 so that the null hypothesis for NI is rejected when $W > -\delta + z_{\alpha_1}\sigma$. In addition, let the second-step superiority hypothesis be tested at a significance level of α2 so that the null hypothesis for superiority is rejected when $W > z_{\alpha_2}\sigma$. If α2 ≤ α1, then the conditional Type I error rate Ψ for the second-step superiority test T2 is given by

$$\Psi = \sup_{\mu_t - \mu_c \le 0} P(W > z_{\alpha_2}\sigma \mid W > -\delta + z_{\alpha_1}\sigma) = \alpha_2 / \Phi(-z_{\alpha_1} + \delta/\sigma) \tag{6.1}$$

where μt – μc is the true mean difference between the test and control arms (T – S in the notation of Section 6.2.3). See Yuan, Tong, and Ng (2011) for the derivation of Ψ. To control the conditional Type I error rate of T2 at the nominal significance level of α, set Ψ = α and solve for α2. We have $\alpha_2 = \alpha\,\Phi(-z_{\alpha_1} + \delta/\sigma)$. Therefore, the testing procedure is as follows: (1) perform the first-step NI test as usual with Type I error rate α1 = α, and (2) perform the conditional superiority test with Type I error rate α2 = αΦ(–zα + δ/σ) instead of the nominal level of α. Noting that α2 < α, in some sense, you pay a price for switching from NI to superiority.
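To get a feel for how much smaller α2 is than α, the known-variance formula α2 = αΦ(–zα + δ/σ) can be evaluated directly. The sketch below is only an illustration of that formula: the function name and the δ/σ values tried are hypothetical, and σ here denotes the standard error of W as defined above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_superiority_level(alpha: float, delta_over_sigma: float) -> float:
    """alpha_2 = alpha * Phi(-z_alpha + delta/sigma), assuming known variances."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # upper alpha-quantile
    return alpha * STD_NORMAL.cdf(-z_alpha + delta_over_sigma)


# Illustrative values of delta/sigma (margin in units of the standard error of W).
for ratio in (0.0, 1.0, 2.8, 5.0):
    print(ratio, conditional_superiority_level(0.025, ratio))
```

When δ/σ is 0 the level reduces to α², and it approaches α from below as δ/σ grows, so the adjustment matters most when the margin is small relative to the standard error.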

In practice, the variance is unknown and has to be estimated. Although it is “natural” to use the formula given by Equation 6.1, with σ being replaced by its estimate when it is not known, it is not appropriate to do so because this formula is derived based on known variances. With equal and unknown variances, the formula $\alpha_2 = \alpha\,\bar{F}_{\nu,\,\delta/\sigma}(t_{\nu,\alpha})$ was derived by Yuan, Tong, and Ng (2011), where $\bar{F}_{\nu,\theta}(\cdot) = 1 - F_{\nu,\theta}(\cdot)$, and $F_{\nu,\theta}(\cdot)$ denotes the CDF of the noncentral t-distribution with ν degrees of freedom and noncentrality parameter θ. To be conservative, α2 = α² may be used, which is the limiting value as σ approaches infinity.

From a Bayesian perspective, Pr[H0 is true] among the products tested for superiority in situation 1 (see Section 6.3.3), with a 20% error rate in the preliminary assessment, is equal to 0.2, which increases to 0.5 in situation 2, so that the FDR in situation 1 is less than the FDR in situation 2. Therefore, testing two hypotheses would result in an increase in FDR as compared to testing one hypothesis. In general, it is straightforward to show that Pr[H0 is true] in situation 1 is less than Pr[H0 is true] in situation 2, so that the FDR in situation 1 is less than the FDR in situation 2, provided the sum of the error rates in the preliminary assessment is less than 1.
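The condition on the sum of the error rates can be seen from a short calculation. The symbols below are introduced only for this sketch and do not appear in the cited papers: N_A and N_B are the numbers of products in categories A and B, e_A is the probability that a category A product is assessed as better, e_B is the probability that a category B product is assessed as having the same efficacy, and π1 and π2 are the proportions of true null hypotheses among the products tested for superiority in situations 1 and 2.

```latex
\[
\pi_1 = \frac{e_A N_A}{e_A N_A + (1 - e_B) N_B},
\qquad
\pi_2 = \frac{N_A}{N_A + N_B}
\]
\[
\pi_1 < \pi_2
\iff e_A (N_A + N_B) < e_A N_A + (1 - e_B) N_B
\iff e_A N_B < (1 - e_B) N_B
\iff e_A + e_B < 1
\]
\[
\mathrm{FDR}(\pi) = \frac{\alpha \pi}{\alpha \pi + (1 - \beta)(1 - \pi)}
\quad \text{is increasing in } \pi,
\quad \text{so } \pi_1 < \pi_2 \implies \mathrm{FDR}(\pi_1) < \mathrm{FDR}(\pi_2).
\]
```

With N_A = N_B = 1000 and e_A = e_B = 0.2, this gives π1 = 0.2 and π2 = 0.5, the values used in the example of Section 6.3.3.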

Ioannidis (2005) argued that most published research findings are false and discussed many factors that influence this problem. One of the factors is the ratio of the number of “true relationships” to “no relationships” among those tested in the field. This corresponds to the ratio of the number of products in category B to the number of products in category A (see Section 6.3.3) that could be “translated” into the proportion of true null hypotheses. Frequentists control the Type I error rate, while Bayesians evaluate the FDR. Therefore, the Type I error rate should not be the sole criterion in multiple testing problems, and FDR should be taken into consideration (Ng 2007).

NI trials are conducted with the belief that the experimental treatment has the same effect as the active control. If NI and superiority are tested simultaneously, without any adjustment, there is a concern of erroneously concluding that the experimental treatment is superior to the active control. It is true that there is always a 2.5% chance of erroneously concluding the experimental treatment is superior to the active control in a superiority trial (assuming the null hypothesis is tested at the 2.5% level). However, for an experimental treatment that is expected to have the same effect as the active control, superiority testing will not be performed if no simultaneous testing is allowed (Ng 2003).

One way to alleviate (but not eliminate) the concern is to decrease the size of the test for superiority. In extreme situations (e.g., decreasing the size of the test for superiority to, say, 0.0001), the concern may be reduced to a minimal level, but the procedure then essentially reduces to testing for NI (Ng 2003).

The downside of situation 1 is that we may fail to claim superiority for more products from category B than we would miss in situation 2. For example, assuming 90% power for detecting superiority for products from category B and 20% error rate of preliminary assessment, in situation 1, we expect to fail to claim superiority for 280 products, compared with 100 products in situation 2 (see Table 6.1). Therefore, there is a tradeoff between situations 1 and 2. However, if the preliminary assessment is fairly accurate (which it should be if we are designing a confirmatory trial), then the tradeoff will not be as large (Ng 2003).

Although erroneously claiming superiority does not result in letting ineffective products onto the market, one has to be “fair” to the “active control.” It is well known that post hoc specification of the null hypothesis in the context of multiple endpoints (that is, looking for the extreme result among multiple endpoints and then testing that one) would inflate the Type I error rate. On the other hand, simultaneous testing of many nested null hypotheses can be considered post hoc specification of the null hypothesis in the sense of choosing the “right” one to test. However, there is no inflation of the Type I error rate in such testing, as shown in Section 6.3.1, because the parameter space is one-dimensional as opposed to multidimensional, which is the case with multiple endpoints (Ng 2003).

Although there is no inflation of the Type I error rate, simultaneous testing of many nested null hypotheses is problematic in a confirmatory trial, because the probability of confirming the findings in a second trial would approach 0.5 as the number of nested null hypotheses approaches infinity. There is a concern with erroneous conclusions of superiority in simultaneous testing for NI and superiority. Such a concern would diminish if only one null hypothesis is tested, because NI trials rather than superiority trials would be conducted for experimental treatments that are expected to have the same effect as the active control. This is a good example of how there might be problems other than the inflation of the Type I error rate in multiple testing (Ng 2003).

In a confirmatory trial, we usually test one and only one prespecified primary null hypothesis. Post hoc specification of the null hypothesis, in the sense of specifying “d” after seeing the data, is exploratory and therefore unacceptable. Simultaneous testing of many nested null hypotheses is problematic, although there is no inflation of the Type I error rate. Simultaneous testing for NI and superiority may be viewed as an initial step toward exploratory analysis and thus should be used cautiously in confirmatory evaluation (Ng 2003).

Benjamini Y and Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B: Methodological, 57:289-300.

Dunnett CW and Gent M (1996). An Alternative to the Use of Two-Sided Tests in Clinical Trials. Statistics in Medicine, 15:1729-1738.

European Agency for the Evaluation of Medicinal Products, Committee for Proprietary Medicinal Products (2000). Points to Consider on Switching Between Superiority and Noninferiority (https://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003658.pdf) (Accessed: August 25, 2013).

Hsu J and Berger R (1999). Stepwise Confidence Intervals without Multiplicity Adjustment for Dose-Response and Toxicity Studies. Journal of the American Statistical Association, 94:468-482.

Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8):e124, pp. 0690-0701.

Koyama T and Westfall PH (2005). Decision-Theoretic Views on Simultaneous Testing of Superiority and Noninferiority. Journal of Biopharmaceutical Statistics, 15:943-955.

Morikawa T and Yoshida M (1995). A Useful Testing Strategy in Phase III Trials: Combined Test of Superiority and Test of Equivalence. Journal of Biopharmaceutical Statistics, 5:297-306.

Ng T-H (2003). Issues of Simultaneous Tests for Non-inferiority and Superiority. Journal of Biopharmaceutical Statistics, 13:629-639.

Ng T-H (2007). Simultaneous Testing of Noninferiority and Superiority Increases the False Discovery Rate. Journal of Biopharmaceutical Statistics, 17:259-264.

Pennello G, Maurer W, and Ng T-H (2003). Comments and Rejoinder on “Issues of Simultaneous Tests for Non-inferiority and Superiority.” Journal of Biopharmaceutical Statistics, 13:641-662.

Phillips A, Ebbutt A, France L, and Morgan D (2000). The International Conference on Harmonization Guideline “Statistical Principles for Clinical Trials”: Issues in Applying the Guideline in Practice. Drug Information Journal, 34:337-348.

Soric B (1989). Statistical Discoveries and Effect Size Estimation. Journal of the American Statistical Association, 84:608-610.

U.S. Food and Drug Administration (2010). Draft Guidance for Industry: Non-inferiority Clinical Trials (https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM202140.pdf) (Accessed: August 25, 2013).

Yuan J, Tong T, and Ng T-H (2011). Conditional Type I Error Rate for Superiority Test Conditional on Establishment of Noninferiority in Clinical Trials. Drug Information Journal, 45:331-336.