方式是先计算其置信区间,然后检查其是否包含0值。当该置信区间不包括0值时,说明两方法的均值差有显著差异。但是,统计学上的显著差异对于实验室并不一定有多么重要的实际意义,因为差异的增大可能来自于高精密度的数据或者大样本量的数据。另一方面,当置信区间包括0值时,也会出现虽然结果显示两者无统计显著性差异,但也并不能排除存在具有重要实际意义的差异。比如,当数据具有较大变异性或者样本量太小时,这种情况常会发生。所以,不论t检验的结果是否显示有显著性差异,都不能充分证明是否存在有实际重要意义的差异。
Determination of Sample Size
样本量计算
Sample size determination is based on the comparison of the accuracy and precision of the two procedures3 and is similar to that for testing hypotheses about average differences in the former case and variance ratios in the latter case, but the meaning of some of the input is different. The first component to be specified is δ, the largest acceptable difference between the two procedures that, if achieved, still leads to the conclusion of equivalence. That is, if the two procedures differ by no more than δ, on the average, they are considered acceptably similar. The comparison can be two-sided as just expressed, considering a difference of δ in either direction, as would be used when comparing means. Alternatively, it can be one-sided as in the case of comparing variances where a decrease in variability is acceptable and equivalency is concluded if the ratio of the variances (new/current, as a proportion) is not more than 1.0 + δ. A researcher will need to state δ based on knowledge of the current procedure and/or its use, or it may be calculated. One consideration, when there are specifications to satisfy, is that the new procedure should not differ by so much from the current procedure as to risk generating out-of-specification results. One then chooses δ to have a low likelihood of this happening by, for example, comparing the distribution of data for the current procedure to the specification limits. This could be done graphically or by using a tolerance interval, an example of which is given in Appendix E. In general, the choice for δ must depend on the scientific requirements of the laboratory.
根据两种方法进行准确性和精密度比较的需要来确定样本量3,在准确性比较时样本量类似于均值差异检验假设所需,在精密度比较时样本量类似于方差差异检验假设所需,但是计算样本量时所需的一些输入参量的意义是不同的。第一个所需参量是δ,它代表两种方法最大可接受的差异,如果满足条件就可以给出等效性结论。如果两种方法的差异小于δ,一般认为两者等效。考虑到在两个方向上δ的差异,方法的比较可以选择均值比较时所使用的双侧检验。或者,如果可以接受变异性降低,在比较方差时也可以选择单侧比较,并且如果方差比值(新方法方差/现行方法方差的比值)不大于1.0 +δ,新方法和现行方法就被认为是等效的。研究人员需要根据现行方法和/或其应用等的相关知识来规定δ值,或者计算δ值。当合规性检测时,其中的一项考虑就是新方法不应与现行方法出现较大差异,以导致出现超标结果(OOS)的风险。这时人们应该通过选择δ值来降低发生这种情况的可能性,比如,可以通过比较质量标准限度中现行方法的分布数据来确定。这需要使用图形法或通过使用容忍区间来完成,附录E给出了一个相应的使用实例。总之,δ值的选择要根据实验室的科学需求。
3 In general, the sample size required to compare the precision of two procedures will be greater than that required to compare the accuracy of the procedures.
3 通常,比较两种方法精密度所需的样本量大于比较准确度所需的样本量。
The next two components relate to the probability of error. The data could lead to a conclusion of similarity when the procedures are unacceptably different (as defined by δ). This is called a false positive or Type I error. The error could also be in the other direction; that is, the procedures could be similar, but the data do not permit that conclusion. This is a false negative or Type II error. With statistical methods, it is not possible to completely eliminate the possibility of either error. However, by choosing the sample size appropriately, the probability of each of these errors can be made acceptably small. The acceptable maximum probability of a Type I error is commonly denoted as α and is commonly taken as 5%, but may be chosen differently. The desired maximum probability of a Type II error is commonly denoted by β. Often, β is specified indirectly by choosing a desired level of 1 − β, which is called the “power” of the test. In the context of equivalency testing, power is the probability of correctly concluding that two procedures are equivalent. Power is commonly taken to be 80% or 90% (corresponding to a β of 20% or 10%), though other values may be chosen. The protocol for the experiment should specify δ, α, and power. The sample size will depend on all of these components. An example is given in Appendix E. Although Appendix E determines only a single value, it is often useful to determine a table of sample sizes corresponding to different choices of δ, α, and power. Such a table often allows for a more informed choice of sample size to better balance the competing priorities of resources and risks (false negative and false positive conclusions).
计算样本量的另外两个成分与误差概率相关。当两种方法存在不可接受的差异(如δ所定义)时,而数据却给出了具有相似性的结论。这被称为假阳性结果或I类错误。错误也可以来自另一个方向,方法是相似的但数据却不能支持该结论。这是假阴性结果或II类错误。使用统计方法,两种错误都是无法完全避免的。然而,通过选择合适的样本量,可以将发生这些错误的可能性有效地减小到一个可以接受的很小程度。I类错误的最大可接受概率一般用α表示,并且取值通常为5%(也可以取其他值)。II类错误的最大期望概率一般用β表示。一般用一个期望水平1-β,即称为检测“效能”来间接确定β。在等效性检测情形中,效能是判定两种方法等效的正确概率。尽管也可选择其他值,通常效能取值为80%或90%(及相应的β值为20%或10%)。实验方案中应规定δ,α和效能。样本量将与所有的这三个因素直接相关。附录E给出了相应的计算实例。尽管附录E仅确定了一个值,但通常根据不同的δ,α和效能来确定一个样本量表是很有用的,这样的样本量表提供了更多的选择,以便允许更好地平衡资源与风险(假阴性和假阳性)的问题。
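The table of sample sizes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Appendix E calculation: it uses a normal approximation for a two-sided equivalence test of means, assumes the true difference between the procedures is zero, and assumes both procedures share a known standard deviation `sigma`. The function names and the example values of `sigma` and δ are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def equivalence_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per procedure for a two-sided equivalence test of means.

    Normal-approximation sketch: assumes the true difference is zero and a
    common, known standard deviation `sigma` (both are assumptions).
    """
    beta = 1.0 - power
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha)      # controls the Type I error
    z_beta = z.inv_cdf(1.0 - beta / 2.0)  # controls the Type II error
    return ceil(2.0 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)


def sample_size_table(deltas, alphas, powers, sigma):
    """Return {(delta, alpha, power): n} for every combination of inputs."""
    return {
        (d, a, p): equivalence_sample_size(d, sigma, a, p)
        for d in deltas
        for a in alphas
        for p in powers
    }


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the table.
    table = sample_size_table([1.0, 1.5, 2.0], [0.05], [0.80, 0.90], sigma=1.0)
    for (d, a, p), n in sorted(table.items()):
        print(f"delta={d:.1f} alpha={a:.2f} power={p:.2f} -> n={n}")
```

Reading across such a table shows how quickly the required sample size grows as δ shrinks or as the desired power increases, which is the trade-off between resources and risks that the protocol has to settle.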
APPENDIX A: CONTROL CHARTS
附录A:控制图
Figure 1 illustrates a control chart for individual values. There are several different methods for calculating the upper control limit (UCL) and lower control limit (LCL). One method involves the moving range, which is defined as the absolute difference between two consecutive measurements, |x_i − x_(i−1)|. These moving ranges are averaged (MR) and used in the following formulas:
图1显示了一个各单独数值的控制图。有几种不同的方法计算控制上限(UCL)和下限(LCL)。一个方法涉及到移动区间,这被定义为连续两次测量差值的绝对值 |x_i − x_(i−1)|。这些移动区间被求平均值(MR)并被用在下列公式中:
UCL = x̄ + 3 × (MR/d2)
LCL = x̄ − 3 × (MR/d2)
where x̄ is the sample mean, and d2 is a constant commonly used for this type of chart and is based on the number of observations associated with the moving range calculation. Where n = 2 (two consecutive measurements), as here, d2 = 1.128. For the example in Figure 1, the MR was 1.7:
其中x̄是样本的平均值,d2是通常用于这类图表中的一个常数,其取值取决于移动区间计算所涉及的观测数。在这里n = 2(两次连续测量),d2 = 1.128。图1所示的例子中,MR值为1.7:
UCL = 102.0 + 3 × (1.7/1.128) = 106.5
LCL = 102.0 − 3 × (1.7/1.128) = 97.5
Other methods exist that are better able to detect small shifts in the process mean, such as the cumulative sum (also known as “CUSUM”) and exponentially weighted moving average (“EWMA”).
有一些其他的方法更适于检测方法均值的微小波动,例如累加和(也被称为“CUSUM”)和指数加权的移动均值(“EWMA”)。
Figure 1. Individual X or individual measurements control chart for control samples.
In this particular example, the mean for all the samples (x̄) is 102.0, the UCL is 106.5, and the LCL is 97.5.
图1:控制样本的单一X值或单一测量控制图。
在本例中,所有样本的均值(x̄)为102.0,UCL为106.5,LCL为97.5。
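The moving-range limits for an individuals chart can be checked with a short Python sketch. The function names are illustrative, and because the raw control-sample data behind Figure 1 are not reproduced in the text, the example works from the reported summary values only (mean 102.0, average moving range 1.7, d2 = 1.128).

```python
D2_FOR_N2 = 1.128  # d2 constant for moving ranges of two consecutive values


def moving_range_average(values):
    """Average absolute difference between consecutive measurements."""
    ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(ranges) / len(ranges)


def control_limits(mean, mr_bar, d2=D2_FOR_N2):
    """Return (LCL, UCL) as mean -/+ 3 * (average moving range / d2)."""
    half_width = 3.0 * mr_bar / d2
    return mean - half_width, mean + half_width


def limits_from_data(values):
    """Compute (LCL, UCL) directly from a series of individual results."""
    mean = sum(values) / len(values)
    return control_limits(mean, moving_range_average(values))


# Summary values reported for the Figure 1 example.
lcl, ucl = control_limits(102.0, 1.7)
print(f"LCL = {lcl:.1f}, UCL = {ucl:.1f}")
```

With a real series of control-sample results, `limits_from_data` would be called on the list of measurements instead of passing the summary values.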
APPENDIX B: PRECISION STUDY
附录B:精密度研究
Table 1 displays data collected from a precision study. This study consisted of five independent runs and, within each run, results from three replicates were collected.
表1显示了一个精密度研究中采集的数据。这项研究包括5个独立的实验组,每组采集3次重复测定的结果。
Table 1. Data from a Precision Study
表1. 精密度研究数据
| Run Number | Replicate 1 | Replicate 2 | Replicate 3 | Mean | Standard deviation | %RSD^a |
|---|---|---|---|---|---|---|
| 1 | 100.70 | 101.05 | 101.15 | 100.97 | 0.236 | 0.234% |
| 2 | 99.46 | 99.37 | 99.59 | 99.47 | 0.111 | 0.111% |
| 3 | 99.96 | 100.17 | 101.01 | 100.38 | 0.556 | 0.554% |
| 4 | 101.80 | 102.16 | 102.44 | 102.13 | 0.321 | 0.314% |
| 5 | 101.91 | 102.00 | 101.67 | 101.86 | 0.171 | 0.167% |

^a %RSD (percent relative standard deviation) = 100% × (standard deviation/mean)
^a %RSD (百分相对标准偏差) = 100% × (标准偏差/均值)
Table 1A. Analysis of Variance Table for Data Presented in Table 1
表1A 表1中数据的方差分析表
| Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Squares (MS)^a | F = MSB/MSW |
|---|---|---|---|---|
| Between runs | 4 | 14.200 | 3.550 | 34.886 |
| Within runs | 10 | 1.018 | 0.102 | |
| Total | 14 | 15.217 | | |

^a The Mean Squares Between (MSB) = SSBetween/dfBetween and the Mean Squares Within (MSW) = SSWithin/dfWithin
^a 组间均方 (MSB) = SSBetween/dfBetween及组内均方 (MSW) = SSWithin/dfWithin
Performing an analysis of variance (ANOVA) on the data in Table 1 leads to the ANOVA table (Table 1A). Because there were an equal number of replicates per run in the precision study, values for VarianceRun and VarianceRep can be derived from the ANOVA table in a straightforward manner. The equations below calculate the variability associated with both the runs and the replicates, where MSwithin represents the “error” or “within-run” mean square, and MSbetween represents the “between-run” mean square.
对表1中的数据进行方差分析(ANOVA)可以得到方差分析表(表1A)。因为在精密度研究中每组重复同样次数,可以用一种直接的方式从方差分析表中得出VarianceRun值和VarianceRep值。下列公式可以计算与实验组(runs)相关的变异性和与重复性(replicates)相关的变异性,其中MSwithin表示“误差”或“组内”均方,MSbetween表示“组间”均方。
VarianceRun = (MSbetween − MSwithin)/(number of replicates per run) = (3.550 − 0.102)/3 = 1.149
VarianceRep = MSwithin = 0.102
[NOTE—It is common practice to use a value of 0 for VarianceRun when the calculated value is negative.] Estimates can still be obtained with unequal replication, but the formulas are more complex. Many statistical software packages can easily handle unequal replication. Studying the relative magnitude of the two variance components is important when designing and interpreting a precision study. The insight gained can be used to focus any ongoing procedure improvement effort and, more important, it can be used to ensure that procedures are capable of supporting their intended uses. By carefully defining what constitutes a result (i.e., reportable value), one harnesses the power of averaging to achieve virtually any desired precision. That is, by basing the reportable value on an average across replicates and/or runs, rather than on any single result, one can reduce the %RSD, and reduce it in a predictable fashion.
[注意-通常当计算值为负值时,将VarianceRun值实际设为0]。当重复次数不等时也可以得出估计值,但是公式会比较复杂。许多统计软件包可以很容易解决这种情况。在设计和解释精密度研究时,研究两个方差成分相对大小是很重要的。研究所获得的洞察力可以用于关注任何正在进行的优化方法的努力,更重要的是,也可以用于确认方法是可以满足其预期用途的。通过仔细定义结果的组成(比如,报告值),利用平均值的力量就可以事实上获得任何预期的精密度。这就是说,如果基于多次测量及/或多组测量的平均值生成报告值,而不是基于单次测量生成报告值,一个人可以降低%RSD,并且是以可预见的方式降低。
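The ANOVA quantities and the two variance components can be reproduced from the Table 1 data with a short Python sketch. It assumes a balanced design (the same number of replicates in every run), as in this study; the function and variable names are illustrative and not part of the chapter.

```python
# Table 1 data: five runs, three replicates per run.
RUNS = [
    [100.70, 101.05, 101.15],
    [99.46, 99.37, 99.59],
    [99.96, 100.17, 101.01],
    [101.80, 102.16, 102.44],
    [101.91, 102.00, 101.67],
]


def variance_components(runs):
    """One-way ANOVA for a balanced design, returning the variance components."""
    k = len(runs)     # number of runs
    r = len(runs[0])  # replicates per run (assumed equal for all runs)
    run_means = [sum(run) / r for run in runs]
    grand_mean = sum(run_means) / k

    ss_between = r * sum((m - grand_mean) ** 2 for m in run_means)
    ss_within = sum(
        (x - m) ** 2 for run, m in zip(runs, run_means) for x in run
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (r - 1))

    # A negative estimate is set to 0, following the NOTE in the text.
    variance_run = max((ms_between - ms_within) / r, 0.0)
    return {
        "grand_mean": grand_mean,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
        "variance_run": variance_run,
        "variance_rep": ms_within,
    }


result = variance_components(RUNS)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

For unbalanced data this simple decomposition no longer applies, and a statistical package would be the more appropriate tool, as the text notes.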
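Table 2 shows the computed variance and %RSD of the mean (i.e., of the reportable value) for different combinations of number of runs and number of replicates per run using the following formulas:
对于实验组数和每组重复次数的不同组合,表2给出了使用下列公式计算出的均值(即报告值)的方差及%RSD:
Variance of the mean = VarianceRun/(number of runs) + VarianceRep/[(number of runs) × (number of replicates per run)]
Standard deviation of the mean = √(Variance of the mean)
%RSD = (Standard deviation of the mean/mean of all results) × 100%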
For example, the Variance of the mean, Standard deviation of the mean, and %RSD of a test involving two runs and three replicates per each run are 0.592, 0.769, and 0.76% respectively, as shown below.
比如,当一项研究包括2个实验组,每组3次重复实验时,如下所示均值的方差,标准偏差及%RSD分别为0.592, 0.769和0.76%。
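Variance of the mean = 1.149/2 + 0.102/(2 × 3) = 0.592
Standard deviation of the mean = √0.592 = 0.769
RSD = (0.769/100.96) × 100% = 0.76%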
where 100.96 is the mean for all the data points in Table 1. As illustrated in Table 2, increasing the number of runs from one to two provides a more dramatic reduction in the variability of the reportable value than does increasing the number of replicates per run.
其中100.96是表1中所有数据的平均值。如表2所示,与增加每组实验中重复次数相比,将实验组数量从1增加到2可以将报告值的方差显著减少。
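The effect of averaging on the reportable value can be explored with a small Python sketch that applies the variance-of-the-mean relationship to the rounded estimates quoted in the text (VarianceRun = 1.149, VarianceRep = 0.102, overall mean = 100.96). The function name and the grid of run/replicate combinations are illustrative choices; the printed values are computed here and are not copied from Table 2.

```python
from math import sqrt

VARIANCE_RUN = 1.149   # between-run variance component from the text
VARIANCE_REP = 0.102   # within-run (replicate) variance component
OVERALL_MEAN = 100.96  # mean of all data points in Table 1


def reportable_value_precision(n_runs, n_reps,
                               var_run=VARIANCE_RUN,
                               var_rep=VARIANCE_REP,
                               mean=OVERALL_MEAN):
    """Return (variance, standard deviation, %RSD) of the averaged result."""
    variance = var_run / n_runs + var_rep / (n_runs * n_reps)
    sd = sqrt(variance)
    return variance, sd, 100.0 * sd / mean


# Worked example from the text: two runs with three replicates each.
var_2x3, sd_2x3, rsd_2x3 = reportable_value_precision(2, 3)
print(f"2 runs x 3 reps: variance={var_2x3:.3f}, SD={sd_2x3:.3f}, "
      f"%RSD={rsd_2x3:.2f}%")

# Compare adding runs with adding replicates per run.
for n_runs in (1, 2):
    for n_reps in (1, 2, 3):
        variance, sd, rsd = reportable_value_precision(n_runs, n_reps)
        print(f"runs={n_runs} reps={n_reps}: variance={variance:.3f}, "
              f"%RSD={rsd:.2f}%")
```

Because the between-run component dominates here, dividing it by the number of runs lowers the variance far more than adding replicates within a single run, which is the pattern the text describes.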