Department of Statistics


Some Notes on Power Calculation

The objective of this note is to aid CSU researchers in determining the appropriate number of animal subjects, field plots, or other experimental units required for an experiment. It includes a review of some approximate techniques that are available in many standard texts, a description of some SAS program code that can be used to compute power for experiments with normally distributed responses, and some information about a program named STPLAN, written at the University of Texas M.D. Anderson Cancer Center. There are several other commercial and public domain programs available that will also compute power in a variety of situations. We plan to add some information about these programs in the near future.

The usual mechanism for statistical determination of sample size is a computation of statistical "power" (although sometimes sample size is selected to achieve a given standard error of a mean or a given confidence interval width). Power is defined as the probability that a hypothesis test will reject the null hypothesis, given that the null hypothesis is false. Since the objective of many projects is to establish a difference between treatments by rejection of the hypothesis of equal treatment means, power often represents the probability of success of the experiment.

Power calculations are an important aid in determining the number of subjects that will be sufficient to meet the objective, but will not be wasteful of animals or resources. Power calculations are also important when the objective of research is to establish the lack of an effect. If a study fails to detect a difference between two treatment means, then before concluding that no difference exists, one must determine that the experiment had enough power to have detected a difference.

Researchers generally consider power calculations to be difficult. With the available software and a little bit of training, the power computation itself is not difficult for most basic analyses. The difficulty lies in the fact that the true power depends on quantities that are unknown. The probability of detecting a significant difference depends on the size of the true difference relative to the amount of variability in the individual observations. In practice, the researcher often must proceed using poor estimates, or wild guesses, of these quantities. Because the results are only as good as the inputs, we often do several calculations with hypothetical values to get an idea how power estimates vary over a reasonable range of input values.
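To illustrate how power moves with the inputs, the following Python sketch tabulates approximate power for a two-sided comparison of two group means over a grid of hypothetical differences and standard deviations. It uses a large-sample normal approximation rather than an exact t-based calculation, and the function name `approx_power_two_groups` and all input values are illustrative assumptions, not part of the SAS programs described in this note.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means.

    Normal (large-sample) approximation; an exact calculation would use
    the noncentral t distribution instead.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standardized true difference: diff divided by the SE of the difference.
    shift = diff / (sd * sqrt(2 / n_per_group))
    # Probability of rejecting in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical inputs: vary the guesses to see how sensitive power is.
for sd in (8.0, 10.0, 12.0):
    for diff in (5.0, 10.0):
        power = approx_power_two_groups(diff, sd, n_per_group=16)
        print(f"sd={sd:5.1f}  diff={diff:5.1f}  power={power:.2f}")
```

Running the grid shows how quickly the answer changes when the guessed standard deviation or the target difference changes, which is the reason for trying several hypothetical values rather than a single one.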

In this note we will review some ideas of power and sample size and describe some programs that can aid in sample size and power calculations.

The SAS programs are stored on the STSS Lab server in C141 Clark in the directory: g:\data\power . If you have any comments or corrections send them to pchapman@lamar.colostate.edu. I think the programs are correct, but some of them are first drafts, so look for updates or corrections. The SAS files are also available on the StatLab home page on the World Wide Web: http://www.stat.colostate.edu/Statlab/examples.html .
The STPLAN files are located in the C141 Clark lab in the directory: x:\stplan on the server. To run them on your own machine, copy the entire directory to a directory named STPLAN on your own machine. Type STPLAN to start the program. A complete manual (about 150 pages) is in the STPLAN.DOC file. The WWW home page of the program authors is http://utmdacc.mdacc.tmc.edu.

Sometimes the objective of an experiment is estimation of a parameter, such as a mean or a proportion, rather than testing a hypothesis about the equality of means or proportions. When that is the case, the issue of sample size is sometimes most simply handled by selecting a sample size that achieves a desired level of accuracy in the estimator, rather than by a power calculation.

  1. Sample size to achieve a given standard error of the mean (large sample estimate). If the objective of an experiment is to estimate a mean response, the estimator is usually the sample mean. The accuracy of the estimate is often described by the standard error of the mean: $SE = s/\sqrt{n}$, where s is the sample standard deviation and n is the sample size. In general, the sample mean will fall within one standard error of the true mean about 68% of the time. If a researcher has an advance estimate of the standard deviation, s, and the desire that the standard error of the mean be approximately equal to some value, say E, then the equation can be solved for n: $n = (s/E)^2$, to yield the appropriate sample size.
  2. Sample size to achieve a given confidence interval width (large sample estimate). If the objective of an experiment is to estimate a mean response, the accuracy of the estimate is often reported by a 95% confidence interval. For the case of a single sample of normal observations the confidence interval formula is:

$$\bar{y} \pm t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}$$

Thus the width of the confidence interval is $W = 2\,t_{0.025,\,n-1}\,s/\sqrt{n}$. With a prior estimate of s, and the approximate value $t_{0.025,\,n-1} \approx 2$ (good for sample sizes greater than 20), the sample size required to achieve the confidence interval width W is: $n = 16\,s^2/W^2$. The above assumes two things: (1) the sample size is large enough that the approximation $t_{0.025,\,n-1} \approx 2$ is reasonable, and (2) the actual value of s will be fairly close to the estimate. (See the "POWERCI" SAS program for a method that does not have such restrictive assumptions.)
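The two large-sample rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming W denotes the full width of the interval (upper limit minus lower limit) and that t is approximated by 2; the function names and the example inputs are hypothetical.

```python
from math import ceil


def n_for_standard_error(sd, target_se):
    """Sample size so that sd / sqrt(n) is about target_se: n = (sd / E)^2."""
    return ceil((sd / target_se) ** 2)


def n_for_ci_width(sd, width, t_value=2.0):
    """Sample size so a 95% CI for one mean has full width `width`.

    Solves width = 2 * t * sd / sqrt(n) for n, with t approximated by 2.
    """
    return ceil((2 * t_value * sd / width) ** 2)


# Hypothetical advance estimate of the standard deviation: 10 units.
print(n_for_standard_error(sd=10.0, target_se=2.0))  # standard error near 2
print(n_for_ci_width(sd=10.0, width=8.0))            # full CI width near 8
```

Results are rounded up because a fractional subject cannot be sampled. If the result is well below 20, the approximation t = 2 is too small and the answer should be treated as a lower bound.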

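For the case of a confidence interval for the difference between means from two independent samples, each of size n, the appropriate formula is:

$$(\bar{y}_1 - \bar{y}_2) \pm t_{0.025,\,2n-2}\; s\sqrt{\frac{2}{n}}$$

where s is the pooled standard deviation. The sample size per group to achieve confidence interval width W is approximately $n = 32\,s^2/W^2$ (which is double the value for the single sample case).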
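For the two-sample case, a comparable sketch follows. It assumes equal group sizes, a common standard deviation, W as the full width of the interval, and t approximated by 2; the function name is a hypothetical one chosen for illustration.

```python
from math import ceil


def n_per_group_for_diff_ci_width(sd, width, t_value=2.0):
    """Per-group n so a 95% CI for a difference of two means has full width `width`.

    Solves width = 2 * t * sd * sqrt(2 / n) for n, with t approximated by 2.
    """
    return ceil(2 * (2 * t_value * sd / width) ** 2)


# Hypothetical inputs: standard deviation 10, desired full width 8.
print(n_per_group_for_diff_ci_width(sd=10.0, width=8.0))
```

Note that the result is the number of units in each group, so the total number of units in the experiment is twice this value.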

If the objective is to estimate a proportion, similar methods can be used. To achieve a confidence interval of width W for a proportion, an approximate n required is:

$$n = \frac{16\,p(1-p)}{W^2}$$

Since p is unknown, you can substitute an educated guess, or the worst-case value p = 0.5. Again, this assumes fairly large samples, and that the true value of p is not too close to zero or one. For a confidence interval for the difference between two proportions, the above value of n is doubled.
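A short Python sketch of the proportion calculation is given below. It assumes W is the full width of the interval and uses the normal quantile (about 1.96 for 95% confidence) instead of the rounded value 2, so it returns slightly smaller sample sizes than a rule based on 2. The function name and defaults are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion_ci_width(width, p=0.5, conf=0.95):
    """n so a large-sample CI for a proportion has full width `width`.

    Solves width = 2 * z * sqrt(p * (1 - p) / n) for n.
    The default p = 0.5 is the worst case (largest required n).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(4 * z ** 2 * p * (1 - p) / width ** 2)


# Worst case versus a hypothetical educated guess of p = 0.2.
print(n_for_proportion_ci_width(width=0.10))
print(n_for_proportion_ci_width(width=0.10, p=0.2))
```

Comparing the worst-case result with the result for an educated guess shows how much sample size can be saved when a credible prior value of p is available.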

Some SAS programs for determining sample size for normally distributed data are given below. All of these programs require that the user enter an estimate of error variability, and give a set of means that identify the "true" values of the means. Sometimes the estimate of variability can be taken from the sample variances or mean squared error (MSE) of previous experiments of a similar nature. Sometimes it is just a guess. The "true" values of the means should reflect the minimum difference between means that the researcher wants to be assured of detecting.

When the power calculation is being done after the experiment (to see if the observed lack of significance indicates a real lack of difference between treatments), the estimate of variability is taken from the experiment, but the means from the experiment are not used. (You already know that those means were not significant.) Rather, several sets of hypothetical means are used to see if the experiment had sufficient power to detect those differences. Although such post hoc power calculations have their place, I have found that an easier road to the same objective is usually just to interpret the confidence intervals for the differences between the means. If a confidence interval is so wide that it includes differences that would have been of interest to detect, then the power was insufficient. If a confidence interval is narrow enough to include only trivial differences that would not have been of interest, then power was sufficient.
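The confidence interval reasoning above can be written as a small decision rule. The sketch below is only one way to encode it: the function name, the return strings, and the idea of a single symmetric threshold `smallest_diff_of_interest` are assumptions made for illustration.

```python
def interpret_ci(lower, upper, smallest_diff_of_interest):
    """Classify a confidence interval for a difference between two means.

    `smallest_diff_of_interest` is the smallest absolute difference the
    researcher would have wanted to detect (a hypothetical input).
    """
    d = abs(smallest_diff_of_interest)
    if lower > 0 or upper < 0:
        # The interval excludes zero, so the test was significant.
        return "difference detected"
    if lower > -d and upper < d:
        # Every plausible difference is smaller than the threshold.
        return "power sufficient: only trivial differences remain plausible"
    return "power insufficient: differences of interest cannot be ruled out"


# Hypothetical intervals, with differences of 5 units or more being of interest.
print(interpret_ci(-1.0, 2.0, 5.0))
print(interpret_ci(-4.0, 9.0, 5.0))
```

The first interval is narrow enough to rule out any difference of interest, while the second is not, even though neither shows a significant difference.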

  1. Sample size to achieve a given confidence interval width (small sample estimate using "powerci.sas"). This program does a calculation like the one above, but adapted to smaller sample sizes. It uses the exact value for $t_{0.025,\,n-1}$, rather than the estimate 2, and it figures in an additional "fudge factor" to allow for the chance that the s value calculated from the data will be an overestimate of the expected value.
  2. Power in the single sample t-test (powert.sas). Used for computing the power of a hypothesis test about the mean of a single group. It assumes a two-sided test, but can be easily adapted to be a one-sided test.
  3. Power in the two-sample t-test (powert2.sas). Used for computing the power of a hypothesis test comparing the means of two treatment groups. Again, this is a two-sided test.
  4. Power in the one-way completely randomized design (powercrd.sas). Used for computing the power of a test of the overall hypothesis that the means of t treatment groups are simultaneously equal.
  5. Power of a contrast in a one-way completely randomized design (powercon.sas). Used for comparing the average of one group of treatments to the average of another group of treatments in a one-way completely randomized design. This is of interest when the experiment involves groups of treatments that are similar. Some efficiency can be gained by averaging similar treatments and using a larger proportion of the subjects in each comparison.
  6. Power in a one-factor randomized complete block design (powerrcb.sas). Used for computing the power of the test of the hypothesis that all treatment means are equal in a design in which subjects are "blocked" to be more comparable. (Often a litter is a block because siblings are genetically similar. Sometimes subjects receiving inoculum from the same batch will be in the same block, because inoculum potency may vary from batch to batch.)
  7. Power in a two-factor completely randomized design (powertwo.sas). A two-factor design has subjects classified by two types of treatments. For example, subjects may be divided into eight groups that are given two different drugs and sacrificed at four different times (2 x 4 = 8 groups). There are three common hypotheses of interest: (1) Does the difference between the drugs depend on time? (The "interaction hypothesis"), (2) Averaging over time, is there a difference between the drugs? (The "main effect" of drug), and (3) Averaging over drugs, is there a difference between times? (The "main effect" of time). This program computes power for testing these three hypotheses, as well as the overall hypothesis that all eight treatments have the same mean.
  8. Power for test of a regression slope in a simple linear regression (powerreg.sas). Used to compute the power of a (one-sided) test of the hypothesis that the slope of a regression line is zero versus the alternative that it is greater than zero.
  9. Power for a repeated measures (or split plot) design (powrmeas.sas). This program calculates power for the F-tests in a two-factor experiment in which the first factor (rows) is completely randomized, and the second factor (columns) is a split plot or repeated measure. It is appropriate in the situation in which subjects are randomly assigned to r treatments, and then repeatedly measured at c time points. It is also appropriate for a split plot design where the whole plot factor has been completely randomized.
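The efficiency gain described in item 5 (power of a contrast) can be illustrated with a rough Python sketch. It uses a normal approximation with equal group sizes, not the calculation performed by powercon.sas, and the function name, means, standard deviation, and group size are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_contrast_power(means, coefs, sd, n_per_group, alpha=0.05):
    """Approximate two-sided power for the contrast sum(c_i * mean_i).

    Normal approximation with equal group sizes; the contrast estimate has
    standard error sd * sqrt(sum(c_i^2) / n_per_group).
    """
    if abs(sum(coefs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    value = sum(c * m for c, m in zip(coefs, means))
    se = sd * sqrt(sum(c * c for c in coefs) / n_per_group)
    shift = value / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical example: treatments 1-2 are similar, as are treatments 3-4.
means = [10.0, 10.0, 18.0, 18.0]
pairwise = approx_contrast_power(means, [1, 0, -1, 0], sd=10.0, n_per_group=8)
averaged = approx_contrast_power(means, [0.5, 0.5, -0.5, -0.5], sd=10.0, n_per_group=8)
print(f"one pair of treatments: {pairwise:.2f}")
print(f"averaged groups:        {averaged:.2f}")
```

With these inputs the averaged contrast has noticeably higher approximate power than the single pairwise comparison, because it uses all 32 subjects instead of 16 to estimate the same difference of 8 units.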

Using STPLAN for power and sample size calculations. STPLAN is a sample-size and power calculation program written at the University of Texas. It was written with support from the National Cancer Institute and is available for non-commercial distribution. It can be copied to your machine and used. A Macintosh version is available, as well as the Fortran source code.

STPLAN does a wide variety of single sample and multiple sample calculations. I have been using it for calculating power in chi-square tests, exact tests, and comparisons of Poisson rates. I have found its t-test calculations to be very approximate for low power situations, but accurate for high power situations. One of its quirks is that in two-sided t-tests, it counts toward power only those rejections of the null hypothesis that are in the direction of the specified alternative. This is a sensible choice, although it leads to some odd results. When the alternative is very close to the null value, the power computed is less than the value of α, which you would normally think impossible.
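The reason the directional definition can give a power below α is easy to see numerically. The sketch below illustrates the idea under a normal approximation; it is not STPLAN's algorithm, and the function name and the standardized `shift` input are assumptions made for illustration.

```python
from statistics import NormalDist


def two_sided_power_parts(shift, alpha=0.05):
    """Split two-sided power into its two rejection directions.

    `shift` is the true difference divided by its standard error
    (normal approximation). Returns (same_direction, opposite_direction).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    same = z.cdf(shift - z_crit)       # rejections in the direction of the alternative
    opposite = z.cdf(-shift - z_crit)  # rejections in the wrong direction
    return same, opposite


for shift in (0.0, 0.1, 0.5, 3.0):
    same, opposite = two_sided_power_parts(shift)
    print(f"shift={shift:3.1f}  same={same:.4f}  total={same + opposite:.4f}")
```

When the shift is near zero, each direction receives about α/2 of the rejections, so counting only one direction gives a value close to α/2. For large shifts the wrong-direction part is negligible and the two definitions agree.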

Department of Statistics, Colorado State University, Fort Collins, CO 80523-1877
Phone: (970)-491-5269      Fax:(970) 491-7895    
Dept Email: stats@lamar.colostate.edu
Email: webmaster@stat.colostate.edu                  Last Modified: April 08, 2002