Power and Sample Size Determination

Lisa Sullivan, PhD

Professor of Biostatistics

Boston University School of Public Health


Introduction

A critically important aspect of any study is determining the appropriate sample size to answer the research question. This module will focus on formulas that can be used to estimate the sample size needed to produce a confidence interval estimate with a specified margin of error (precision) or to ensure that a test of hypothesis has a high probability of detecting a meaningful difference in the parameter.

Studies should be designed to include a sufficient number of participants to adequately address the research question. Studies that have either an inadequate or an excessively large number of participants are wasteful in terms of participant and investigator time, resources to conduct the assessments, analytic efforts, and so on. Underpowered studies can also be viewed as unethical, because participants may have been put at risk as part of a study that was unable to answer an important question, while studies that are much larger than they need to be expose more participants than necessary.

The formulas presented here generate estimates of the necessary sample size(s) required based on statistical criteria. However, in many studies, the sample size is determined by financial or logistical constraints. For example, suppose a study is proposed to evaluate a new screening test for Down Syndrome.  Suppose that the screening test is based on analysis of a blood sample taken from women early in pregnancy. In order to evaluate the properties of the screening test (e.g., the sensitivity and specificity), each pregnant woman will be asked to provide a blood sample and in addition to undergo an amniocentesis. The amniocentesis is included as the gold standard and the plan is to compare the results of the screening test to the results of the amniocentesis. Suppose that the collection and processing of the blood sample costs $250 per participant and that the amniocentesis costs $900 per participant. These financial constraints alone might substantially limit the number of women that can be enrolled. Just as it is important to consider both statistical and clinical significance when interpreting results of a statistical analysis, it is also important to weigh both statistical and logistical issues in determining the sample size for a study.

Learning Objectives

After completing this module, the student will be able to:

  • Provide examples demonstrating how the margin of error, effect size and variability of the outcome affect sample size computations.
  • Compute the sample size required to estimate population parameters with precision.
  • Interpret statistical power in tests of hypothesis.
  • Compute the sample size required to ensure high power when hypothesis testing.

Issues in Estimating Sample Size for Confidence Interval Estimates

The module on confidence intervals provided methods for estimating confidence intervals for various parameters (e.g., μ, p, (μ1 - μ2), μd, (p1 - p2)). Confidence intervals for every parameter take the following general form:

Point Estimate ± Margin of Error

In the module on confidence intervals we derived the formula for the confidence interval for μ as

X̄ ± Z σ/√n

In practice we use the sample standard deviation to estimate the population standard deviation. Note that there is an alternative formula for estimating the mean of a continuous outcome in a single population, and it is used when the sample size is small (n<30). It involves a value from the t distribution, as opposed to one from the standard normal distribution, to reflect the desired level of confidence. When performing sample size computations, we use the large sample formula shown here. [Note: The resultant sample size might be small, and in the analysis stage, the appropriate confidence interval formula must be used.]

The point estimate for the population mean is the sample mean, and the margin of error is

E = Z σ/√n

In planning studies, we want to determine the sample size needed to ensure that the margin of error is sufficiently small to be informative. For example, suppose we want to estimate the mean weight of female college students. We conduct a study and generate a 95% confidence interval as follows: 125 ± 40 pounds, or 85 to 165 pounds. The margin of error is so wide that the confidence interval is uninformative. To be informative, an investigator might want the margin of error to be no more than 5 or 10 pounds (meaning that the 95% confidence interval would have a width (lower limit to upper limit) of 10 or 20 pounds). In order to determine the sample size needed, the investigator must specify the desired margin of error. It is important to note that this is not a statistical issue, but a clinical or practical one. For example, suppose we want to estimate the mean birth weight of infants born to mothers who smoke cigarettes during pregnancy. Birth weights in infants clearly have a much more restricted range than weights of female college students. Therefore, we would probably want to generate a confidence interval for the mean birth weight that has a margin of error not exceeding 1 or 2 pounds.

The margin of error in the one sample confidence interval for μ can be written as follows:

E = Z σ/√n

Our goal is to determine the sample size, n, that ensures that the margin of error, " E ," does not exceed a specified value. We can take the formula above and, with some algebra, solve for n :

First, multiply both sides of the equation by the square root of n. Then cancel out the square root of n from the numerator and denominator on the right side of the equation (since any number divided by itself is equal to 1). This leaves:

E √n = Z σ

Now divide both sides by "E" and cancel out "E" from the numerator and denominator on the left side. This leaves:

√n = Z σ / E

Finally, square both sides of the equation to get:

n = (Z σ / E)²

This formula generates the sample size, n, required to ensure that the margin of error, E, does not exceed a specified value. To solve for n, we must input "Z," "σ," and "E."

  • Z is the value from the table of probabilities of the standard normal distribution for the desired confidence level (e.g., Z = 1.96 for 95% confidence)
  • E is the margin of error that the investigator specifies as important from a clinical or practical standpoint.
  • σ is the standard deviation of the outcome of interest.

Sometimes it is difficult to estimate σ . When we use the sample size formula above (or one of the other formulas that we will present in the sections that follow), we are planning a study to estimate the unknown mean of a particular outcome variable in a population. It is unlikely that we would know the standard deviation of that variable. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study done in a different, but comparable, population. The sample size computation is not an application of statistical inference and therefore it is reasonable to use an appropriate estimate for the standard deviation. The estimate can be derived from a different study that was reported in the literature; some investigators perform a small pilot study to estimate the standard deviation. A pilot study usually involves a small number of participants (e.g., n=10) who are selected by convenience, as opposed to by random sampling. Data from the participants in the pilot study can be used to compute a sample standard deviation, which serves as a good estimate for σ in the sample size formula. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size is not too small.

Sample Size for One Sample, Continuous Outcome

In studies where the plan is to estimate the mean of a continuous outcome variable in a single population, the formula for determining sample size is given below:

n = (Z σ / E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), σ is the standard deviation of the outcome variable and E is the desired margin of error. The formula above generates the minimum number of subjects required to ensure that the margin of error in the confidence interval for μ does not exceed E .  

An investigator wants to estimate the mean systolic blood pressure in children with congenital heart disease who are between the ages of 3 and 5. How many children should be enrolled in the study? The investigator plans on using a 95% confidence interval (so Z=1.96) and wants a margin of error of 5 units. The standard deviation of systolic blood pressure is unknown, but the investigators conduct a literature search and find that the standard deviation of systolic blood pressures in children with other cardiac defects is between 15 and 20. To estimate the sample size, we consider the larger standard deviation in order to obtain the most conservative (largest) sample size:

n = (1.96 × 20 / 5)² = (7.84)² = 61.47

In order to ensure that the 95% confidence interval estimate of the mean systolic blood pressure in children between the ages of 3 and 5 with congenital heart disease is within 5 units of the true mean, a sample of size 62 is needed. [Note: We always round up; the sample size formulas always generate the minimum number of subjects needed to ensure the specified precision.] Had we assumed a standard deviation of 15, the sample size would have been n=35. Because the estimates of the standard deviation were derived from studies of children with other cardiac defects, it would be advisable to use the larger standard deviation and plan for a study with 62 children. Selecting the smaller sample size could potentially produce a confidence interval estimate with a larger margin of error.
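This calculation is easy to script. Below is a minimal Python sketch that reproduces the computation using only the standard library; the function name is ours, not part of the module:

```python
import math
from statistics import NormalDist

def n_one_mean_ci(sigma, E, confidence=0.95):
    """Minimum n so that a CI for a single mean has margin of error <= E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    return math.ceil((z * sigma / E) ** 2)              # always round up

print(n_one_mean_ci(sigma=20, E=5))  # 62 (conservative SD)
print(n_one_mean_ci(sigma=15, E=5))  # 35 (smaller SD)
```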

An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams? Try to work through the calculation before you look at the answer.

Sample Size for One Sample, Dichotomous Outcome 

In studies where the plan is to estimate the proportion of successes in a dichotomous outcome variable (yes/no) in a single population, the formula for determining sample size is:

n = p(1 - p)(Z / E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%) and E is the desired margin of error; p is the proportion of successes in the population. Here we are planning a study to generate a 95% confidence interval for the unknown population proportion, p. The equation to determine the sample size seems to require knowledge of p, but this is obviously a circular argument: if we knew the proportion of successes in the population, then a study would not be necessary! What we really need is an approximate or anticipated value of p. The range of p is 0 to 1, and therefore the range of p(1-p) is 0 to 0.25. The value of p that maximizes p(1-p) is p=0.5. Consequently, if there is no information available to approximate p, then p=0.5 can be used to generate the most conservative, or largest, sample size.

Example 2:  

An investigator wants to estimate the proportion of freshmen at his University who currently smoke cigarettes (i.e., the prevalence of smoking). How many freshmen should be involved in the study to ensure that a 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion?

Because we have no information on the proportion of freshmen who smoke, we use 0.5 to estimate the sample size as follows:

n = 0.5(1 - 0.5)(1.96 / 0.05)² = 384.16

In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 385 is needed.
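As a check, here is a minimal Python sketch of the same computation (standard library only; the function name is ours). Other anticipated values of p can simply be passed in place of 0.5:

```python
import math
from statistics import NormalDist

def n_one_proportion_ci(p, E, confidence=0.95):
    """Minimum n so that a CI for one proportion has margin of error <= E.

    Use p = 0.5 when no estimate is available (the most conservative choice).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(p * (1 - p) * (z / E) ** 2)

print(n_one_proportion_ci(p=0.5, E=0.05))  # 385
```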

Suppose that a similar study was conducted 2 years ago and found that the prevalence of smoking was 27% among freshmen. If the investigator believes that this is a reasonable estimate of prevalence 2 years later, it can be used to plan the next study. Using this estimate of p, what sample size is needed (assuming that again a 95% confidence interval will be used and we want the same level of precision)?

An investigator wants to estimate the prevalence of breast cancer among women who are between 40 and 45 years of age living in Boston. How many women must be involved in the study to ensure that the estimate is precise? National data suggest that 1 in 235 women are diagnosed with breast cancer by age 40. This translates to a proportion of 0.0043 (0.43%) or a prevalence of 43 per 10,000 women. Suppose the investigator wants the estimate to be within 10 per 10,000 women (i.e., E = 0.001) with 95% confidence. The sample size is computed as follows:

n = 0.0043(1 - 0.0043)(1.96 / 0.001)² = 16,447.9

A sample of size n=16,448 will ensure that a 95% confidence interval estimate of the prevalence of breast cancer is within 0.001 (or 10 women per 10,000) of its true value. This is a situation where investigators might decide that a sample of this size is not feasible. Suppose that the investigators thought a sample of size 5,000 would be reasonable from a practical point of view. How precisely can we estimate the prevalence with a sample of size n=5,000? Recall that the confidence interval formula to estimate prevalence is:

p̂ ± Z √(p̂(1 - p̂)/n)

Assuming that the prevalence of breast cancer in the sample will be close to that based on national data, we would expect the margin of error to be approximately equal to the following:

E = 1.96 √(0.0043(1 - 0.0043)/5000) = 0.0018

Thus, with n=5,000 women, a 95% confidence interval would be expected to have a margin of error of 0.0018 (or 18 per 10,000). The investigators must decide if this would be sufficiently precise to answer the research question. Note that the above is based on the assumption that the prevalence of breast cancer in Boston is similar to that reported nationally. This may or may not be a reasonable assumption. In fact, it is the objective of the current study to estimate the prevalence in Boston. The research team, with input from clinical investigators and biostatisticians, must carefully evaluate the implications of selecting a sample of size n = 5,000, n = 16,448 or any size in between.
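The back-calculation of the margin of error for a fixed sample size can be scripted the same way. This minimal sketch (standard library only; the function name is ours) assumes the national prevalence of 0.0043:

```python
import math
from statistics import NormalDist

def margin_of_error_proportion(p, n, confidence=0.95):
    """Expected margin of error of a CI for a proportion with sample size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error_proportion(p=0.0043, n=5000), 4))   # 0.0018
print(round(margin_of_error_proportion(p=0.0043, n=16448), 4))  # 0.001
```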

Sample Sizes for Two Independent Samples, Continuous Outcome

In studies where the plan is to estimate the difference in means between two independent populations, the formula for determining the sample sizes required in each comparison group is given below:

ni = 2(Z σ / E)²

where ni is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used, and E is the desired margin of error. σ again reflects the standard deviation of the outcome variable. Recall from the module on confidence intervals that, when we generated a confidence interval estimate for the difference in means, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome (based on pooling the data), where Sp is computed as follows:

Sp = √[((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)]

If data are available on variability of the outcome in each comparison group, then Sp can be computed and used in the sample size formula. However, it is more often the case that data on the variability of the outcome are available from only one group, often the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated below.  

Note that the formula for the sample size generates sample size estimates for samples of equal size. If a study is planned in which the comparison groups will be of different sizes, then alternative formulas can be used.

An investigator wants to plan a clinical trial to evaluate the efficacy of a new drug designed to increase HDL cholesterol (the "good" cholesterol). The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. HDL cholesterol will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study over 12 weeks. A 95% confidence interval will be estimated to quantify the difference in mean HDL levels between patients taking the new drug as compared to placebo. The investigator would like the margin of error to be no more than 3 units. How many patients should be recruited into the study?  

The sample sizes are computed as follows:

A major issue is determining the variability in the outcome of interest (σ), here the standard deviation of HDL cholesterol. To plan this study, we can use data from the Framingham Heart Study. In participants who attended the seventh examination of the Offspring Study and were not on treatment for high cholesterol, the standard deviation of HDL cholesterol is 17.1. We will use this value and the other inputs to compute the sample sizes as follows:

ni = 2(1.96 × 17.1 / 3)² = 2(11.17)² = 249.62

Samples of size n1=250 and n2=250 will ensure that the 95% confidence interval for the difference in mean HDL levels will have a margin of error of no more than 3 units. Again, these sample sizes refer to the numbers of participants with complete data. The investigators hypothesized a 10% attrition (or drop-out) rate (in both groups). In order to ensure that the total sample size of 500 is available at 12 weeks, the investigator needs to recruit more participants to allow for attrition.

N (number to enroll) * (% retained) = desired sample size

Therefore N (number to enroll) = desired sample size/(% retained)

N = 500/0.90 = 556

If they anticipate a 10% attrition rate, the investigators should enroll 556 participants. This will ensure N=500 with complete data at the end of the trial.

An investigator wants to compare two diet programs in children who are obese. One diet is a low fat diet, and the other is a low carbohydrate diet. The plan is to enroll children and weigh them at the start of the study. Each child will then be randomly assigned to either the low fat or the low carbohydrate diet. Each child will follow the assigned diet for 8 weeks, at which time they will again be weighed. The number of pounds lost will be computed for each child. Based on data reported from diet trials in adults, the investigator expects that 20% of all children will not complete the study. A 95% confidence interval will be estimated to quantify the difference in weight lost between the two diets and the investigator would like the margin of error to be no more than 3 pounds. How many children should be recruited into the study?  

Again the issue is determining the variability in the outcome of interest (σ), here the standard deviation in pounds lost over 8 weeks. To plan this study, investigators use data from a published study in adults. Suppose one such study compared the same diets in adults and involved 100 participants in each diet group. The study reported a standard deviation in weight lost over 8 weeks on a low fat diet of 8.4 pounds and a standard deviation in weight lost over 8 weeks on a low carbohydrate diet of 7.7 pounds. These data can be used to estimate the common standard deviation in weight lost as follows:

Sp = √[(99(8.4)² + 99(7.7)²) / (100 + 100 - 2)] = √64.9 = 8.06

We now use this value and the other inputs to compute the sample sizes:

ni = 2(1.96 × 8.06 / 3)² = 2(5.27)² = 55.46

Samples of size n1=56 and n2=56 will ensure that the 95% confidence interval for the difference in weight lost between diets will have a margin of error of no more than 3 pounds. Again, these sample sizes refer to the numbers of children with complete data. The investigators anticipate a 20% attrition rate. In order to ensure that the total sample size of 112 is available at 8 weeks, the investigator needs to recruit more participants to allow for attrition.

N = 112/0.80 = 140
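Both of the examples above can be reproduced with a short Python sketch (standard library only; the function names are ours), including the pooled standard deviation for the diet trial:

```python
import math
from statistics import NormalDist

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate Sp of the common standard deviation."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def n_two_means_ci(sigma, E, confidence=0.95):
    """Minimum n per group so that a CI for (mu1 - mu2) has margin <= E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(2 * (z * sigma / E) ** 2)

print(n_two_means_ci(sigma=17.1, E=3))   # 250 per group (HDL trial)
sp = pooled_sd(100, 8.4, 100, 7.7)       # about 8.06 (diet trial)
print(n_two_means_ci(sigma=sp, E=3))     # 56 per group
```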

Sample Size for Matched Samples, Continuous Outcome

In studies where the plan is to estimate the mean difference of a continuous outcome based on matched data, the formula for determining sample size is given below:

n = (Z σd / E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), E is the desired margin of error, and σd is the standard deviation of the difference scores. It is extremely important that the standard deviation of the difference scores (e.g., the difference based on measurements over time or the difference between matched pairs) is used here to appropriately estimate the sample size.
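The module gives no worked example for this scenario, so the sketch below uses hypothetical inputs (a difference-score standard deviation of 20 and a desired margin of 5); the function name is ours:

```python
import math
from statistics import NormalDist

def n_matched_ci(sigma_d, E, confidence=0.95):
    """Minimum number of pairs so that a CI for the mean difference has
    margin of error <= E. sigma_d is the SD of the DIFFERENCE scores."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma_d / E) ** 2)

print(n_matched_ci(sigma_d=20, E=5))  # 62 pairs (hypothetical inputs)
```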

Sample Sizes for Two Independent Samples, Dichotomous Outcome

In studies where the plan is to estimate the difference in proportions between two independent populations (i.e., to estimate the risk difference), the formula for determining the sample sizes required in each comparison group is:

ni = [p1(1 - p1) + p2(1 - p2)](Z / E)²

where ni is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), and E is the desired margin of error. p1 and p2 are the proportions of successes in each comparison group. Again, here we are planning a study to generate a 95% confidence interval for the difference in unknown proportions, and the formula to estimate the sample sizes needed requires p1 and p2. In order to estimate the sample size, we need approximate values of p1 and p2. The values of p1 and p2 that maximize the sample size are p1=p2=0.5. Thus, if there is no information available to approximate p1 and p2, then 0.5 can be used to generate the most conservative, or largest, sample sizes.

Similar to the situation for two independent samples and a continuous outcome at the top of this page, it may be the case that data are available on the proportion of successes in one group, usually the untreated (e.g., placebo control) or unexposed group. If so, the known proportion can be used for both p1 and p2 in the formula shown above. The formula shown above generates sample size estimates for samples of equal size; if the comparison groups will be of different sizes, then alternative formulas can be used. Interested readers can see Fleiss for more details.4

An investigator wants to estimate the impact of smoking during pregnancy on premature delivery. Normal pregnancies last approximately 40 weeks and premature deliveries are those that occur before 37 weeks. The 2005 National Vital Statistics report indicates that approximately 12% of infants are born prematurely in the United States.5 The investigator plans to collect data through medical record review and to generate a 95% confidence interval for the difference in proportions of infants born prematurely to women who smoked during pregnancy as compared to those who did not. How many women should be enrolled in the study to ensure that the 95% confidence interval for the difference in proportions has a margin of error of no more than 4%?

The sample sizes (i.e., numbers of women who smoked and did not smoke during pregnancy) can be computed using the formula shown above. National data suggest that 12% of infants are born prematurely. We will use that estimate for both groups in the sample size computation:

ni = [0.12(1 - 0.12) + 0.12(1 - 0.12)](1.96 / 0.04)² = 0.2112(2401) = 507.09

Samples of size n1=508 women who smoked during pregnancy and n2=508 women who did not smoke during pregnancy will ensure that the 95% confidence interval for the difference in proportions who deliver prematurely will have a margin of error of no more than 4%.
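A minimal Python sketch of this computation (standard library only; the function name is ours):

```python
import math
from statistics import NormalDist

def n_two_proportions_ci(p1, p2, E, confidence=0.95):
    """Minimum n per group so that a CI for (p1 - p2) has margin of error <= E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((p1 * (1 - p1) + p2 * (1 - p2)) * (z / E) ** 2)

print(n_two_proportions_ci(p1=0.12, p2=0.12, E=0.04))  # 508 per group
```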

Is attrition an issue here? 

Issues in Estimating Sample Size for Hypothesis Testing

In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H0 | H0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error and it is defined as the probability we do not reject H0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H0 | H0 is false). In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H0 when it is false, i.e., power = 1-β = P(Reject H0 | H0 is false). Power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).

Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α, the desired power of the test (equivalent to 1-β), the variability of the outcome, and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.

The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.  

Suppose we want to test the following hypotheses at α=0.05: H0: μ = 90 versus H1: μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see the module on Probability) that for large n (here n=100 is sufficiently large), the distribution of the sample means is approximately normal, with a mean of μ = 90 and a standard deviation equal to σ/√n = 20/√100 = 2.

If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H0: μ = 90.

[Figure: Normal distribution of the sample mean when μ = 90; a bell-shaped curve centered at 90.]

Rejection Region for Test H0: μ = 90 versus H1: μ ≠ 90 at α = 0.05

[Figure: Distribution of the sample mean centered at 90, with rejection regions in the two tails at the extremes above and below the mean. With α = 0.05, each tail accounts for an area of 0.025.]

The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing.  

Now, suppose that the alternative hypothesis, H1, is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.

[Figure: Two overlapping normal distributions, one depicting the null hypothesis with a mean of 90 and the other the alternative hypothesis with a mean of 94. The components of the figure are explained in the text below.]

If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H0 | H0 is false), i.e., the probability of failing to reject the null hypothesis when it is actually false. β is shown in the figure above as the area under the rightmost curve (H1) to the left of the vertical line (where we do not reject H0). Power is defined as 1-β = P(Reject H0 | H0 is false) and is shown in the figure as the area under the rightmost curve (H1) to the right of the vertical line (where we reject H0).

Note that β and power are related to α, the variability of the outcome, and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α=0.10. The upper critical value would be 92.56 instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).

β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.

[Figure: Overlapping bell-shaped distributions, one with a mean of 90 and the other with a mean of 98.]

Notice that there is much higher power when there is a larger difference between the mean under H0 as compared to H1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed, it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H0: μ = 90 and H1: μ = 94, if we observed a sample mean of 93, for example, it would not be as clear whether it came from a distribution whose mean is 90 or one whose mean is 94.

Ensuring That a Test Has High Power

In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.  

The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.

Sample Size for One Sample, Continuous Outcome

In studies where the plan is to perform a test of hypothesis comparing the mean of a continuous outcome variable in a single population to a known mean, the hypotheses of interest are:

H0: μ = μ0 versus H1: μ ≠ μ0, where μ0 is the known mean (e.g., a historical control). The formula for determining the sample size to ensure that the test has a specified power is given below:

n = ((Z1-α/2 + Z1-β) / ES)²

where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it. For example, if α = 0.05, then 1-α/2 = 0.975 and Z = 1.960. 1-β is the selected power, and Z1-β is the value from the standard normal distribution holding 1-β below it. Sample size estimates for hypothesis testing are often based on achieving 80% or 90% power. The Z1-β values for these popular scenarios are given below:

  • For 80% power, Z0.80 = 0.84
  • For 90% power, Z0.90 = 1.282

ES is the effect size, defined as follows:

ES = |μ1 - μ0| / σ

where μ0 is the mean under H0, μ1 is the mean under H1, and σ is the standard deviation of the outcome of interest. The numerator of the effect size, the absolute value of the difference in means |μ1 - μ0|, represents what is considered a clinically meaningful or practically important difference in means. Similar to the issue we faced when planning studies to estimate confidence intervals, it can sometimes be difficult to estimate the standard deviation. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study performed in a different but comparable population. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size will not be too small.

Example 7:  

An investigator hypothesizes that in people free of diabetes, fasting blood glucose, a risk factor for coronary heart disease, is higher in those who drink at least 2 cups of coffee per day. A cross-sectional study is planned to assess the mean fasting blood glucose levels in people who drink at least two cups of coffee per day. The mean fasting blood glucose level in people free of diabetes is reported as 95.0 mg/dL with a standard deviation of 9.8 mg/dL.7 If the mean blood glucose level in people who drink at least 2 cups of coffee per day is 100 mg/dL, this would be important clinically. How many patients should be enrolled in the study to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.

The effect size is computed as:

ES = |100 - 95| / 9.8 = 0.51

The effect size represents the meaningful difference in the population mean, here 95 versus 100, or 0.51 standard deviation units. We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.96 + 0.84) / 0.51)² = (5.49)² = 30.14

Therefore, a sample of size n=31 will ensure that a two-sided test with α =0.05 has 80% power to detect a 5 mg/dL difference in mean fasting blood glucose levels.
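The same calculation in Python, as a minimal standard-library sketch (the function name is ours):

```python
import math
from statistics import NormalDist

def n_one_mean_test(mu0, mu1, sigma, alpha=0.05, power=0.80):
    """Minimum n for a two-sided one-sample test of H0: mu = mu0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    es = abs(mu1 - mu0) / sigma                    # effect size
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

print(n_one_mean_test(mu0=95, mu1=100, sigma=9.8))  # 31
```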

In the planned study, participants will be asked to fast overnight and to provide a blood sample for analysis of glucose levels. Based on prior experience, the investigators hypothesize that 10% of the participants will fail to fast or will refuse to follow the study protocol. Therefore, a total of 35 participants will be enrolled in the study to ensure that 31 are available for analysis (see below).

N (number to enroll) * (% following protocol) = desired sample size

N = 31/0.90 = 35.

Sample Size for One Sample, Dichotomous Outcome

In studies where the plan is to perform a test of hypothesis comparing the proportion of successes in a dichotomous outcome variable in a single population to a known proportion, the hypotheses of interest are:

H0: p = p0 versus H1: p ≠ p0

where p0 is the known proportion (e.g., a historical control). The formula for determining the sample size to ensure that the test has a specified power is given below:

n = ((Z1-α/2 + Z1-β) / ES)²

where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it, and ES is the effect size, defined as follows:

ES = |p1 - p0| / √(p0(1 - p0))

where p0 is the proportion under H0 and p1 is the proportion under H1. The numerator of the effect size, the absolute value of the difference in proportions |p1 - p0|, again represents what is considered a clinically meaningful or practically important difference in proportions.

Example 8:  

A recent report from the Framingham Heart Study indicated that 26% of people free of cardiovascular disease had elevated LDL cholesterol levels, defined as LDL > 159 mg/dL.9 An investigator hypothesizes that a higher proportion of patients with a history of cardiovascular disease will have elevated LDL cholesterol. How many patients should be studied to ensure that the power of the test is 90% to detect a 5% difference in the proportion with elevated LDL cholesterol? A two sided test will be used with a 5% level of significance.

We first compute the effect size:

ES = |0.31 - 0.26| / √(0.26(1 - 0.26)) = 0.05/0.44 = 0.11

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.96 + 1.282) / 0.11)² = (29.47)² = 868.64

A sample of size n=869 will ensure that a two-sided test with α =0.05 has 90% power to detect a 5% difference in the proportion of patients with a history of cardiovascular disease who have an elevated LDL cholesterol level.
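To check the arithmetic in code, here is a minimal Python sketch (standard library only; the function name is ours). At full precision it returns 809 rather than 869, because the hand calculation above rounds the effect size to 0.11 before squaring; a conservative investigator would still plan with the larger figure:

```python
import math
from statistics import NormalDist

def n_one_proportion_test(p0, p1, alpha=0.05, power=0.80):
    """Minimum n for a two-sided one-sample test of H0: p = p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    es = abs(p1 - p0) / math.sqrt(p0 * (1 - p0))   # effect size
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

# 809 at full precision; the text's 869 reflects rounding ES to 0.11
print(n_one_proportion_test(p0=0.26, p1=0.31, power=0.90))
```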

A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance. (Do the computation yourself, before looking at the answer.)

Sample Sizes for Two Independent Samples, Continuous Outcome

In studies where the plan is to perform a test of hypothesis comparing the means of a continuous outcome variable in two independent populations, the hypotheses of interest are:

H0: μ1 = μ2 versus H1: μ1 ≠ μ2

where μ1 and μ2 are the means in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is:

ni = 2((Z1-α/2 + Z1-β) / ES)²

where ni is the sample size required in each group (i=1,2), α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, and 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it. ES is the effect size, defined as:

ES = |μ1 - μ2| / σ

where |μ1 - μ2| is the absolute value of the difference in means between the two groups expected under the alternative hypothesis, H1, and σ is the standard deviation of the outcome of interest. Recall from the module on Hypothesis Testing that, when we performed tests of hypothesis comparing the means of two independent groups, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome.

Sp is computed as follows:

Sp = √[((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)]

If data are available on variability of the outcome in each comparison group, then Sp can be computed and used to generate the sample sizes. However, it is more often the case that data on the variability of the outcome are available from only one group, usually the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that may have involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated.  

Note also that the formula shown above generates sample size estimates for samples of equal size. If a study is planned in which the comparison groups will be of different sizes, then alternative formulas can be used (see Howell3 for more details).

An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure. The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this would represent a clinically meaningful reduction. How many patients should be enrolled in the trial to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.  

In order to compute the effect size, an estimate of the variability in systolic blood pressures is needed. Analysis of data from the Framingham Heart Study showed that the standard deviation of systolic blood pressure was 19.0. This value can be used to plan the trial.  

The effect size is:

ES = 5 / 19 = 0.26

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample sizes:

ni = 2((1.96 + 0.84) / 0.26)² = 231.95

Samples of size n1=232 and n2=232 will ensure that the test of hypothesis will have 80% power to detect a 5 unit difference in mean systolic blood pressures in patients receiving the new drug as compared to patients receiving the placebo. However, the investigators hypothesized a 10% attrition rate (in both groups), and to ensure that 232 participants per group are available at 12 weeks they need to allow for attrition.

N = 232/0.90 = 258.

The investigator must enroll 258 participants per group, 516 in total, to be randomly assigned to receive either the new drug or placebo.
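A minimal Python sketch for this scenario (standard library only; the function name is ours). At full precision it returns 227 per group; the hand calculation above rounds the effect size to 0.26 and therefore reports 232:

```python
import math
from statistics import NormalDist

def n_two_means_test(diff, sigma, alpha=0.05, power=0.80):
    """Minimum n per group for a two-sided test of H0: mu1 = mu2.
    diff is the clinically meaningful difference |mu1 - mu2|."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    es = diff / sigma
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

print(n_two_means_test(diff=5, sigma=19))  # 227 per group (232 in the text)
```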

An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.  

Answer  

Sample Size for Matched Samples, Continuous Outcome

In studies where the plan is to perform a test of hypothesis on the mean difference in a continuous outcome variable based on matched data, the hypotheses of interest are:

H0: μd = 0 versus H1: μd ≠ 0

where μd is the mean difference in the population. The formula for determining the sample size to ensure that the test has a specified power is given below:

n = ((Z1-α/2 + Z1-β) / ES)²

where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it, and ES is the effect size, defined as follows:

ES = μd / σd

where μd is the mean difference expected under the alternative hypothesis, H1, and σd is the standard deviation of the difference in the outcome (e.g., the difference based on measurements over time or the difference between matched pairs).

   

Example 10:

An investigator wants to evaluate the efficacy of an acupuncture treatment for reducing pain in patients with chronic migraine headaches. The plan is to enroll patients who suffer from migraine headaches. Each will be asked to rate the severity of the pain they experience with their next migraine before any treatment is administered. Pain will be recorded on a scale of 1-100 with higher scores indicative of more severe pain. Each patient will then undergo the acupuncture treatment. On their next migraine (post-treatment), each patient will again be asked to rate the severity of the pain. The difference in pain will be computed for each patient. A two sided test of hypothesis will be conducted, at α =0.05, to assess whether there is a statistically significant difference in pain scores before and after treatment. How many patients should be involved in the study to ensure that the test has 80% power to detect a difference of 10 units on the pain scale? Assume that the standard deviation in the difference scores is approximately 20 units.    

First compute the effect size:

ES = 10 / 20 = 0.5

Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.96 + 0.84) / 0.5)² = (5.6)² = 31.36

A sample of size n=32 patients with migraine will ensure that a two-sided test with α =0.05 has 80% power to detect a mean difference of 10 points in pain before and after treatment, assuming that all 32 patients complete the treatment.
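The corresponding Python sketch (standard library only; the function name is ours):

```python
import math
from statistics import NormalDist

def n_matched_test(mu_d, sigma_d, alpha=0.05, power=0.80):
    """Minimum number of pairs for a two-sided test of H0: mu_d = 0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    es = abs(mu_d) / sigma_d            # effect size on the difference scores
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

print(n_matched_test(mu_d=10, sigma_d=20))  # 32 pairs
```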

Sample Sizes for Two Independent Samples, Dichotomous Outcomes

In studies where the plan is to perform a test of hypothesis comparing the proportions of successes in two independent populations, the hypotheses of interest are:

H0: p1 = p2 versus H1: p1 ≠ p2

where p1 and p2 are the proportions in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is given below:

ni = 2((Z1-α/2 + Z1-β) / ES)²

where ni is the sample size required in each group (i=1,2), α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, and 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it. ES is the effect size, defined as follows:

ES = |p1 - p2| / √(p(1 - p))

where |p1 - p2| is the absolute value of the difference in proportions between the two groups expected under the alternative hypothesis, H1, and p is the overall proportion, based on pooling the data from the two comparison groups (p can be computed by taking the mean of the proportions in the two comparison groups, assuming that the groups will be of approximately equal size).

Example 11: 

An investigator hypothesizes that there is a higher incidence of flu among students who use their athletic facility regularly than their counterparts who do not. The study will be conducted in the spring. Each student will be asked if they used the athletic facility regularly over the past 6 months and whether or not they had the flu. A test of hypothesis will be conducted to compare the proportion of students who used the athletic facility regularly and got flu with the proportion of students who did not and got flu. During a typical year, approximately 35% of the students experience flu. The investigators feel that a 30% increase in flu among those who used the athletic facility regularly would be clinically meaningful. How many students should be enrolled in the study to ensure that the power of the test is 80% to detect this difference in the proportions? A two sided test will be used with a 5% level of significance.  

We first compute the effect size by substituting the proportions of students in each group who are expected to develop flu, p1=0.46 (i.e., 0.35 × 1.30 ≈ 0.46) and p2=0.35, and the overall proportion, p=0.41 (i.e., (0.46 + 0.35)/2, rounded):

ES = |0.46 - 0.35| / √(0.41(1 - 0.41)) = 0.11/0.49 = 0.22

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

ni = 2((1.96 + 0.84) / 0.22)² = 323.97

Samples of size n1=324 and n2=324 will ensure that the test of hypothesis will have 80% power to detect a 30% difference in the proportions of students who develop flu between those who do and do not use the athletic facilities regularly.
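A minimal Python sketch (standard library only; the function name is ours). At full precision it returns 313 per group; the hand calculation above rounds the pooled proportion and effect size and therefore reports 324:

```python
import math
from statistics import NormalDist

def n_two_proportions_test(p1, p2, alpha=0.05, power=0.80):
    """Minimum n per group for a two-sided test of H0: p1 = p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p = (p1 + p2) / 2                   # pooled proportion, assuming equal n's
    es = abs(p1 - p2) / math.sqrt(p * (1 - p))
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

print(n_two_proportions_test(p1=0.46, p2=0.35))  # 313 per group
```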

Donor Feces? Really?

Clostridium difficile (also referred to as "C. difficile" or "C. diff.") is a bacterial species that can be found in the colon of humans, although its numbers are kept in check by other normal flora in the colon. Antibiotic therapy sometimes diminishes the normal flora in the colon to the point that C. difficile flourishes and causes infection with symptoms ranging from diarrhea to life-threatening inflammation of the colon. Illness from C. difficile most commonly affects older adults in hospitals or in long term care facilities and typically occurs after use of antibiotic medications. In recent years, C. difficile infections have become more frequent, more severe and more difficult to treat. Ironically, C. difficile is first treated by discontinuing antibiotics, if they are still being prescribed. If that is unsuccessful, the infection has been treated by switching to another antibiotic. However, treatment with another antibiotic frequently does not cure the C. difficile infection. There have been sporadic reports of successful treatment by infusing feces from healthy donors into the duodenum of patients suffering from C. difficile. (Yuk!) This re-establishes the normal microbiota in the colon, and counteracts the overgrowth of C. diff. The efficacy of this approach was tested in a randomized clinical trial reported in the New England Journal of Medicine (Jan. 2013). The investigators planned to randomly assign patients with recurrent C. difficile infection to either antibiotic therapy or to duodenal infusion of donor feces. In order to estimate the sample size that would be needed, the investigators assumed that the feces infusion would be successful 90% of the time, and antibiotic therapy would be successful in 60% of cases. How many subjects will be needed in each group to ensure that the power of the study is 80% with a level of significance α = 0.05?

Determining the appropriate design of a study is more important than the statistical analysis; a poorly designed study can never be salvaged, whereas a poorly analyzed study can be re-analyzed. A critical component in study design is the determination of the appropriate sample size. The sample size must be large enough to adequately answer the research question, yet not too large so as to involve too many patients when fewer would have sufficed. The determination of the appropriate sample size involves statistical criteria as well as clinical or practical considerations. Sample size determination involves teamwork; biostatisticians must work closely with clinical investigators to determine the sample size that will address the research question of interest with adequate precision or power to produce results that are clinically meaningful.

The following table summarizes the sample size formulas for each scenario described here. The formulas are organized by the proposed analysis, a confidence interval estimate or a test of hypothesis.

Continuous Outcome, One Sample (CI for μ; H0: μ = μ0)

  • Confidence interval: n = (Zσ/E)²
  • Test of hypothesis: n = ((Z1-α/2 + Z1-β)/ES)², where ES = |μ1 - μ0|/σ

Continuous Outcome, Two Independent Samples (CI for (μ1 - μ2); H0: μ1 = μ2)

  • Confidence interval: ni = 2(Zσ/E)²
  • Test of hypothesis: ni = 2((Z1-α/2 + Z1-β)/ES)², where ES = |μ1 - μ2|/σ

Continuous Outcome, Two Matched Samples (CI for μd; H0: μd = 0)

  • Confidence interval: n = (Zσd/E)²
  • Test of hypothesis: n = ((Z1-α/2 + Z1-β)/ES)², where ES = μd/σd

Dichotomous Outcome, One Sample (CI for p; H0: p = p0)

  • Confidence interval: n = p(1-p)(Z/E)²
  • Test of hypothesis: n = ((Z1-α/2 + Z1-β)/ES)², where ES = |p1 - p0|/√(p0(1-p0))

Dichotomous Outcome, Two Independent Samples (CI for (p1 - p2); H0: p1 = p2)

  • Confidence interval: ni = [p1(1-p1) + p2(1-p2)](Z/E)²
  • Test of hypothesis: ni = 2((Z1-α/2 + Z1-β)/ES)², where ES = |p1 - p2|/√(p(1-p))

References

1. Buschman NA, Foster G, Vickers P. Adolescent girls and their babies: achieving optimal birth weight. Gestational weight gain and pregnancy outcome in terms of gestation at delivery and infant birth weight: a comparison between adolescents under 16 and adult women. Child: Care, Health and Development. 2001; 27(2): 163-171.
2. Feuer EJ, Wun LM. DEVCAN: Probability of Developing or Dying of Cancer. Version 4.0. Bethesda, MD: National Cancer Institute, 1999.
3. Howell DC. Statistical Methods for Psychology. Boston, MA: Duxbury Press, 1982.
4. Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley and Sons, Inc., 1981.
5. National Center for Health Statistics. Health, United States, 2005 with Chartbook on Trends in the Health of Americans. Hyattsville, MD: US Government Printing Office; 2005.
6. Plaskon LA, Penson DF, Vaughan TL, Stanford JL. Cigarette smoking and risk of prostate cancer in middle-aged men. Cancer Epidemiology Biomarkers & Prevention. 2003; 12: 604-609.
7. Rutter MK, Meigs JB, Sullivan LM, D'Agostino RB, Wilson PW. C-reactive protein, the metabolic syndrome and prediction of cardiovascular events in the Framingham Offspring Study. Circulation. 2004; 110: 380-385.
8. Ramachandran V, Sullivan LM, Wilson PW, Sempos CT, Sundstrom J, Kannel WB, Levy D, D'Agostino RB. Relative importance of borderline and elevated levels of coronary heart disease risk factors. Annals of Internal Medicine. 2005; 142: 393-402.
9. Wechsler H, Lee JE, Kuo M, Lee H. College binge drinking in the 1990s: a continuing problem. Results of the Harvard School of Public Health 1999 College Alcohol Study. Journal of American College Health. 2000; 48: 199-210.

Answers to Selected Problems

Answer to the Birth Weight Question

An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams?

In order to ensure that the 95% confidence interval estimate of the mean birthweight is within 100 grams of the true mean, we compute:

n = (1.96 × 385 / 100)² = (7.55)² = 56.94

so a sample of size 57 is needed. In planning the study, the investigator must consider the fact that some women may deliver prematurely. If women are enrolled into the study during pregnancy, then more than 57 women will need to be enrolled so that after excluding those who deliver prematurely, 57 with outcome information will be available for analysis. For example, if 5% of the women are expected to deliver prematurely (i.e., 95% will deliver full term), then 60 women must be enrolled to ensure that 57 deliver full term. The number of women that must be enrolled, N, is computed as follows:

                                                        N (number to enroll) * (% retained) = desired sample size

                                                        N (0.95) = 57

                                                        N = 57/0.95 = 60.

Answer to the Freshmen Smoking Question

Using p=0.27:

n = 0.27(1 - 0.27)(1.96 / 0.05)² = 302.87

In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 303 is needed. Notice that this sample size is substantially smaller than the one estimated above. Having some information on the magnitude of the proportion in the population will always produce a sample size that is less than or equal to the one based on a population proportion of 0.5. However, the estimate must be realistic.

Answer to the Medical Device Problem

A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance.

First compute the effect size:

ES = |0.15 - 0.10| / √(0.10(1 - 0.10)) = 0.05/0.30 = 0.17

Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.96 + 1.282) / 0.17)² = (19.07)² = 363.69

A sample size of 364 stents will ensure that a two-sided test with α=0.05 has 90% power to detect a 0.05, or 5%, difference in the proportion of defective stents produced.

Answer to the Alcohol and GPA Question

An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.

First compute the effect size: ES = |μ1 - μ2|/σ = 0.25/0.42 ≈ 0.6.

Now substitute the effect size and the appropriate z values for alpha and power to compute the sample size:

ni = 2((z1-α/2 + z1-β)/ES)² = 2((1.960 + 0.842)/0.6)² = 43.6, which is rounded up to 44 per group.

Sample sizes of n1 = 44 heavy drinkers and n2 = 44 students who drink fewer than five drinks per typical drinking day will ensure that the test of hypothesis has 80% power to detect a 0.25 unit difference in mean grade point averages.
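The same calculation scripted in Python (a sketch; rounding the effect size to 0.6 reproduces the 44 per group above, while full precision gives 45):

```python
import math
from scipy.stats import norm

sd, diff = 0.42, 0.25        # common standard deviation and meaningful difference
alpha, power = 0.05, 0.80

es = diff / sd                                   # effect size = 0.595
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)    # 1.960 + 0.842

print(math.ceil(2 * (z / es) ** 2))    # 45 with the unrounded effect size
print(math.ceil(2 * (z / 0.6) ** 2))   # 44 with the effect size rounded to 0.6
```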

Answer to Donor Feces - Page 8

We first compute the effect size by substituting the proportions of patients expected to be cured with each treatment, p1 = 0.6 and p2 = 0.9, and the overall proportion, p = 0.75:

ES = |p1 - p2|/√(p(1 - p)) = 0.3/√(0.75(0.25)) = 0.3/0.433 ≈ 0.69.

We now substitute the effect size and the appropriate z values for the selected α and power to compute the sample size:

ni = 2((z1-α/2 + z1-β)/ES)² = 2((1.960 + 0.842)/0.69)² = 32.98, which is rounded up to 33 per group.

Samples of size n1 = 33 and n2 = 33 will ensure that the test of hypothesis will have 80% power to detect this difference in the proportions of patients who are cured of C. diff. by feces infusion versus antibiotic therapy.
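A short Python sketch of the same computation:

```python
import math
from scipy.stats import norm

p1, p2 = 0.6, 0.9
pbar = (p1 + p2) / 2                               # overall proportion = 0.75
alpha, power = 0.05, 0.80

es = abs(p1 - p2) / math.sqrt(pbar * (1 - pbar))   # effect size = 0.693
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)      # 1.960 + 0.842

print(math.ceil(2 * (z / es) ** 2))                # 33 per group
```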

In fact, the investigators enrolled 38 into each group to allow for attrition. Nevertheless, the study was stopped after an interim analysis. Of 16 patients in the infusion group, 13 (81%) had resolution of C. difficile–associated diarrhea after the first infusion. The 3 remaining patients received a second infusion with feces from a different donor, with resolution in 2 patients. Resolution of C. difficile infection occurred in only 4 of 13 patients (31%) receiving the antibiotic vancomycin.


Statistics By Jim

Making statistics intuitive

How to Calculate Sample Size Needed for Power

By Jim Frost

Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.

Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study! Learn more about Statistical Power.

When you perform hypothesis testing, there is a lot of preplanning you must do before collecting any data. This planning includes identifying the data you will gather, how you will collect it, and how you will measure it among many other details. A crucial part of the planning is determining how much data you need to collect. I’ll show you how to estimate the sample size for your study.

Before we get to estimating sample size requirements, let’s review the factors that influence statistical significance. This process will help you see the value of formally going through a power and sample size analysis rather than guessing.

Related post : 5 Steps for Conducting Scientific Studies with Statistical Analyses

Factors Involved in Statistical Significance

Look at the chart below and identify which study found a real treatment effect and which one didn’t. Within each study, the difference between the treatment group and the control group is the sample estimate of the effect size.

A bar chart that displays the treatment and control group for two studies. Study A has a larger effect size than study B.

Did either study obtain significant results? The estimated effects in both studies can represent either a real effect or random sample error. You don’t have enough information to make that determination. Hypothesis tests incorporate these considerations to determine whether the results are statistically significant.

  • Effect size : The larger the effect size, the less likely it is to be random error. It’s clear that Study A exhibits a more substantial effect in the sample—but that’s insufficient by itself.
  • Sample size : Larger sample sizes allow hypothesis tests to detect smaller effects. If Study B’s sample size is large enough, its more modest effect can be statistically significant.
  • Variability : When your sample data have greater variability, random sampling error is more likely to produce considerable differences between the experimental groups even when there is no real effect. If the sample data in Study A have sufficient variability, random error might be responsible for the large difference.

Hypothesis testing takes all of this information and uses it to calculate the p-value —which you use to determine statistical significance. The key takeaway is that the statistical significance of any effect depends collectively on the size of the effect, the sample size, and the variability present in the sample data. Consequently, you cannot determine a good sample size in a vacuum because the three factors are intertwined.

Related post : How Hypothesis Tests Work

Statistical Power of a Hypothesis Test

Because we’re talking about determining the sample size for a study that has not been performed yet, you need to learn about a fourth consideration—statistical power. Statistical power is the probability that a hypothesis test correctly infers that a sample effect exists in the population. In other words, the test correctly rejects a false null hypothesis. Consequently, power is inversely related to a Type II error . Power = 1 – β. The power of the test depends on the other three factors.

For example, if your study has 80% power, it has an 80% chance of detecting an effect that exists. Let this point be a reminder that when you work with samples, nothing is guaranteed! When an effect actually exists in the population, your study might not detect it because you are working with a sample. Samples contain sampling error, which can occasionally cause a random sample to misrepresent the population.

Related post : Types of Errors in Hypothesis Testing

Goals of a Power and Sample Size Analysis

Power analysis involves taking these three considerations, adding subject-area knowledge, and managing tradeoffs to settle on a sample size. During this process, you must rely heavily on your expertise to provide reasonable estimates of the input values.

Power analysis helps you manage an essential tradeoff. As you increase the sample size, the hypothesis test gains a greater ability to detect small effects. This situation sounds great. However, larger sample sizes cost more money. And, there is a point where an effect becomes so minuscule that it is meaningless in a practical sense.

You don’t want to collect a large and expensive sample only to be able to detect an effect that is too small to be useful! Nor do you want an underpowered study that has a low probability of detecting an important effect. Your goal is to collect a large enough sample to have sufficient power to detect a meaningful effect—but not too large to be wasteful.

As you’ll see in the upcoming examples, the analyst provides numeric values that correspond to “a good chance” and “meaningful effect.” These values allow you to tailor the analysis to your needs.

All of these details might sound complicated, but a statistical power analysis helps you manage them. In fact, going through this procedure forces you to focus on the relevant information. Typically, you specify three of the four factors discussed above and your statistical software calculates the remaining value. For instance, if you specify the smallest effect size that is practically significant, variability, and power, the software calculates the required sample size.

Let’s work through some examples in different scenarios to bring this to life.

2-Sample t-Test Power Analysis for Sample Size

Suppose we’re conducting a 2-sample t-test to determine which of two materials is stronger. If one type of material is significantly stronger than the other, we’ll use that material in our process. Furthermore, we’ve tested these materials in a pilot study, which provides background knowledge for the estimates.

In a power and sample size analysis, statistical software presents you with a dialog box something like the following:

Power and sample size analysis dialog box for 2-sample t-test.

We’ll go through these fields one-by-one. First off, we will leave Sample sizes blank because we want the software to calculate this value.

Differences

Differences is often a confusing value to enter. Do not enter your guess for the difference between the two types of material. Instead, use your expertise to identify the smallest difference that is still meaningful for your application. In other words, you consider smaller differences to be inconsequential. It would not be worthwhile to expend resources to detect them.

By choosing this value carefully, you tailor the experiment so that it has a reasonable chance of detecting useful differences while allowing smaller, non-useful differences to remain potentially undetected. This value helps prevent us from collecting an unnecessarily large sample.

For our example, we’ll enter 5 because smaller differences are unimportant for our process.

Power values

Power values is where we specify the probability that the statistical hypothesis test detects the difference in the sample if that difference exists in the population. This field is where you define the “reasonable chance” that I mentioned earlier. If you hold the other input values constant and increase the test’s power, the required sample size also increases. The proper value to enter in this field depends on norms in your study area or industry. Common power values are 0.8 and 0.9.

We’ll enter a power of 0.9 so that the 2-sample t-test has a 90% chance of detecting a difference of 5.

Standard deviation

Standard deviation is the field where we enter the data variability. We need to enter an estimate for the standard deviation of material strength. Analysts frequently base these estimates on pilot studies and historical research data. Inputting better variability estimates will produce more reliable power analysis results. Consequently, you should strive to improve these estimates over time as you perform additional studies and testing. Providing good estimates of the standard deviation is often the most difficult part of a power and sample size analysis.

For our example, we’ll assume that the two types of material have a standard deviation of 4 units of strength. After we click OK, we see the results.

Related post : Measures of Variability

Interpreting the Statistical Power Analysis and Sample Size Results

Statistical power and sample size analysis provides both numeric and graphical results, as shown below.

Statistical output for the power and sample size analysis for the 2-sample t-test.

The text output indicates that we need 15 samples per group (total of 30) to have a 90% chance of detecting a difference of 5 units.

The dot on the Power Curve corresponds to the information in the text output. However, by studying the entire graph, we can learn additional information about how statistical power varies by the difference. If we start at the dot and move down the curve to a difference of 2.5, we learn that the test has a power of approximately 0.4 (40%). This power is too low. However, we indicated that differences less than 5 were not practically significant to our process. Consequently, having low power to detect a difference of 2.5 is not problematic.

Conversely, follow the curve up from the dot and notice how power quickly increases to nearly 100% before we reach a difference of 6. This design satisfies the process requirements while using a manageable sample size of 15 per group.
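If you don’t have Minitab, the same calculation can be reproduced in, for example, Python’s statsmodels. This is a sketch, not the post’s own code; it solves the noncentral-t power equation for the number of observations per group:

```python
import math
from statsmodels.stats.power import TTestIndPower

# Standardized effect size: smallest meaningful difference / standard deviation
d = 5 / 4

n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.9,
                                alternative='two-sided')
print(math.ceil(n))   # 15 per group, matching the output above
```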

Other Power Analysis Options

Now, let’s explore a few more options that are available for power analysis. This time we’ll use a one-tailed test and have the software calculate a value other than sample size.

Suppose we are again comparing the strengths of two types of material. However, in this scenario, we are currently using one kind of material and are considering switching to another. We will change to the new material only if it is stronger than our current material. Again, the smallest difference in strength that is meaningful to our process is 5 units. The standard deviation in this study is now 7. Further, let’s assume that our company uses a standard sample size of 20, and we need approval to increase it to 40. Because the standard deviation (7) is larger than the smallest meaningful difference (5), we might need a larger sample.

In this scenario, the test needs to determine only whether the new material is stronger than the current material. Consequently, we can use a one-tailed test. This type of test provides greater statistical power to determine whether the new material is stronger than the old material, but no power to determine if the current material is stronger than the new—which is acceptable given the dictates of the new scenario.

In this analysis, we’ll enter the two potential values for Sample sizes and leave Power values blank. The software will estimate the power of the test for detecting a difference of 5 for designs with both 20 and 40 samples per group.

We fill in the dialog box as follows:

Power and sample size analysis dialog box for a one-side 2-sample t-test.

And, in Options , we choose the following one-tailed test:

Options for the power and sample size analysis dialog box.

Interpreting the Power and Sample Size Results

Statistical output for the power and sample size analysis for the one-sided 2-sample t-test.

The statistical output indicates that a design with 20 samples per group (a total of 40) has a ~72% chance of detecting a difference of 5. Generally, this power is considered to be too low. However, a design with 40 samples per group (80 total) achieves a power of ~94%, which is almost always acceptable. Hopefully, the power analysis convinces management to approve the larger sample size.
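A hedged sketch of this power calculation with Python’s statsmodels (again, not the post’s own code); it reproduces the two power values quoted above:

```python
from statsmodels.stats.power import TTestIndPower

d = 5 / 7    # smallest meaningful difference (5) / standard deviation (7)
analysis = TTestIndPower()

for n in (20, 40):
    p = analysis.power(effect_size=d, nobs1=n, alpha=0.05,
                       alternative='larger')     # one-tailed test
    print(n, round(p, 2))   # about 0.72 for n = 20 and 0.94 for n = 40
```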

Assess the Power Curve graph to see how the power varies by the difference. For example, the curve for the sample size of 20 indicates that the smaller design does not achieve 90% power until the difference is approximately 6.5. If increasing the sample size is genuinely cost prohibitive, perhaps accepting 90% power for a difference of 6.5, rather than 5, is acceptable. Use your process knowledge to make this type of determination.

Use Power Analysis for Sample Size Estimation For All Studies

Throughout this post, we’ve been looking at continuous data, and using the 2-sample t-test specifically. For continuous data, you can also use power analysis to assess sample sizes for ANOVA and DOE designs. Additionally, there are hypothesis tests for other types of data, such as proportions tests (binomial data) and rates of occurrence (Poisson data). These tests have their own corresponding power and sample size analyses.

In general, when you move away from continuous data to these other types of data, your sample size requirements increase. And, there are unique intricacies in each. For instance, in a proportions test, you need a relatively larger sample size to detect a difference when your proportion is closer to 0 or 1 than if it is in the middle (0.5). Many factors can affect the optimal sample size. Power analysis helps you navigate these concerns.

After reading this post, I hope you see how power analysis combines statistical analyses, subject-area knowledge, and your requirements to help you derive the optimal sample size for your specific needs. If you don’t perform this analysis, you risk performing a study that is either likely to miss an important effect or have an exorbitantly large sample size. I’ve written a post about a Mythbusters experiment that had no chance of detecting an effect because they guessed a sample size instead of performing a power analysis.

In this post, I’ve focused on how power affects your test’s ability to detect a real effect. However, low power tests also exaggerate effect sizes!

Finally, experimentation is an iterative process. As you conduct more studies in an area, you’ll develop better estimates to input into power and sample size analyses and gain a clearer picture of how to proceed.


Reader Interactions


July 10, 2024 at 4:22 am

Thank you for this wonderful article, your articles are always very informative & helpful!

I have 2 questions regarding sample size, I’m running an experiment and: 1) I was under the impression that you don’t need to have determined what model you will eventually be using, but rather you determine the sample size beforehand & all is well. However, I do need to assume some things… I have G*Power and am not sure what ‘statistical test’ to choose! All I know is to choose the ‘F-test’. I know I have to determine the effect size, mention the number of groups, but am not sure where to add variance (or what it means exactly), nor what stat test to choose. I am assuming it is important to mention that my dependent variable is an ordinal scale (arguably can be run using OLS); not sure whether I am ‘comparing means between more than 2 groups’ or if there is something else I should assume/consider. Also, I’m not sure the ‘linear’ regression option would work.

2) I already ran a large pilot study, how can I use that to determine the sample size (I’m afraid my pilot study is at/close to the required size anyway…)

Navigating the net for help on power analysis has strangely been quite difficult, so I appreciate your help!


May 11, 2024 at 2:18 am

Thank you Mr. Jim for such a brief explanation of power analysis for sample size. I read several explanations of this topic but none made sense to me; your explanation made it understandable.

Regards, Roopini


April 15, 2024 at 6:56 pm

Jim, Are you able to share what statistical software was used for your examples? Are there equations that can be typed into Excel to determine sample size and power? Is there free, reliable statistical analysis software you can recommend for calculating sample size and power?

Thank you! Suzann


April 16, 2024 at 3:37 pm

I used Minitab statistical software for the examples. Unfortunately, I don’t believe Excel has this feature built into it. However, there is a free Power and Sample Size analysis software that I highly recommend. It’s called G*Power . Click the link to get it for free!


May 24, 2024 at 2:13 pm

Jim, the post is really informative, but I want to know how to use power analysis to find a correlation between 2 variables.

May 25, 2024 at 5:27 pm

You can use power analysis to determine the sample size you’d need to detect a correlation of a particular strength with a specified power.

I recommend using the free power analysis tool called G*Power. Below I show an example of using it to find the sample size I’d need to detect a correlation of 0.7 with 95% power. The answer is a sample size of 20. See how I set up G*Power to get this answer below.

Power analysis for correlation.
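G*Power computes this exactly; a quick approximation using the Fisher z transformation lands on essentially the same answer. The snippet below is an illustrative sketch, not the G*Power calculation itself:

```python
import math
from scipy.stats import norm

r, alpha, power = 0.7, 0.05, 0.95

# Fisher z approximation: n = ((z_{1-alpha/2} + z_power) / atanh(r))^2 + 3
n = ((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / math.atanh(r)) ** 2 + 3
print(round(n, 1))   # about 20.3, in line with the exact G*Power answer of 20
```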

October 24, 2022 at 8:29 pm

Hi again Jim, apologies if this was posted multiple times, but I looked into the Bonferroni Correction and saw that the equation is αnew = αoriginal / n

αoriginal: the original α level; n: the total number of comparisons or tests being performed. Seeing this, would 6000 or 1000 be the n in my case? Would I also have to perform this once or more than once? Second question: after finding this out, when performing the power analysis that you mentioned, do I have to do it multiple times to account for the different combinations of states that I will match with each other?

October 24, 2022 at 10:23 pm

In this context, n is the number of comparisons between groups. If you want to compare all groups to each other (i.e., all pairwise comparisons), then with 6 groups you’ll have 15 comparisons. So, n = 15. However, you don’t necessarily need to compare all groups. It depends on your research question. If you can avoid all pairwise comparisons, it’s a good thing. Just decide on your comparisons and record it in your plans before proceeding with the project. If you wait until after analyzing the data, you might (even if subconsciously) be tempted to cherry pick the comparisons that give good results.

As an example of an alternative to all pairwise comparisons, you might compare five of the states to one reference state in your sample. That reduces the pairwise comparisons (n) from 15 to 5. That helps because you’re dividing alpha by the number of comparisons. A lower n won’t lower your Bonferroni corrected significance level as much:

0.05/15 = 0.003
0.05/5 = 0.01

You’ll need an extremely low p-value with 15 comparisons. For more on what the familywise error rate is and why you need to control it, see my post Using Post Hoc Tests with ANOVA. Of course, you’re not working with ANOVA, but the same ideas will apply to the multiple comparisons you’re making with the 2 proportions test. In your case, if you go with 15 comparisons (all pairwise for the 6 states), your familywise error rate is 0.54. Over a 50% chance of a false positive!
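These corrections are one-liners to verify (a minimal sketch):

```python
from math import comb

alpha, groups = 0.05, 6
n_comparisons = comb(groups, 2)            # 15 pairwise comparisons among 6 states

print(alpha / n_comparisons)               # Bonferroni-corrected level, about 0.0033
print(alpha / 5)                           # 0.01 with only 5 comparisons
print(1 - (1 - alpha) ** n_comparisons)    # familywise error rate, about 0.54
```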

October 21, 2022 at 8:59 pm

Hello again Jim, I looked at your other page about the margin of error and I had a few extra questions. The approach I would be taking is, as you said, using 1000 people from each state for a comparison with the surveys. I saw the formula that you had, so would my confidence level for this instance be 95%? Also, as your formula is listed, would my bottom number be 1000 as well, or would it be 6000? Or would I have to use the Finding the Margin of Error for Other Percentages formula instead?

October 23, 2022 at 4:27 pm

Typically, surveys don’t delve so deep into statistical differences between groups in the responses. At least not that I’ve seen. Usually, they’ll calculate and report the margin of error. If the margins don’t overlap, you can assume the difference is statistically significant. However, as I point out in the margin of error post, that process is conservative because the difference can be statistically significant even with a little overlap.

What you need to do for your case is perform a power analysis for a two-sample proportions test. That’s beyond what most public opinion surveys do but will get you the answers you need. In your case, one proportion is the proportion of individuals in state A who respond a particular way to a survey item, and the other is the proportion in state B who respond that way to the item.

I didn’t realize that you were performing hypothesis testing with your survey data, or I would’ve mentioned this from the start! Because you’re comparing six states, you’re also facing the problem of multiple comparisons increasing the familywise error rate for that set of comparisons. You’ll need to use something like a Bonferroni correction to appropriately lower the significance level you use, which will affect the numbers you need for a particular power.

I hope that helps!

October 20, 2022 at 4:33 pm

Hello Jim, I am hoping you can offer some guidance here. I am currently doing an assignment involving this subject, and my professor said this to me: “There’s no rationale for the six thousand surveys. How did you arrive at your sample size? You need to report the power analysis (and numbers you used in that analysis) to arrive at your chosen sample size; like everything else in scientific writing, the sample size needs justification.” My study involves six states and getting specific individuals’ opinions from each state about crime and how it has affected them. Surveys are my instrument of choice here, so my question is how I would arrive at a sample size. I had thought 6,000 was a starting point but am unsure if that’s right.

October 21, 2022 at 4:11 pm

With surveys you typically calculate the sample size to produce a specific margin of error . Click the link to learn more about that and how to tell whether there are differences. It’s a little different process than power analysis in other contexts, but it’s related. The big questions are: how precise do you want your estimates to be? And if you have groups you want to compare, that can affect the calculations.

For instance, 6,000 would generally be considered a large sample size for survey research. However, if you’re comparing subgroups within your sample, that can affect how many you need. I don’t know if you plan to do this or not, but if you wanted to compare the differences between the six states, that means you’d have about 1,000 per state. That’s still fairly decent but you’ll have a larger margin of error. You’ll need to know whether your primary interest is estimates for the total sample or differences between subgroups. If it’s differences between subgroups, that always increases your required sample size.

That’s not to say that 1000 per state isn’t enough. I don’t know. But you’d do the margin of error calculations to see if it produces sufficient precision for your needs. The process involves a combination of doing the MoE calculations and knowing the required precision (or possibly standards in your subject area).


October 15, 2022 at 2:38 am

So can a “power analysis” be done to get the sample size for a proposed survey instead of a sample size calculation? In other words, is a “power analysis” the same as calculating the sample size when doing a research study? Thank you.

October 16, 2022 at 2:01 am

Hi Ronrico,

There’s definitely a related concept. For surveys, you typically need to calculate the margin of error . Click the link to read my post about it!


August 16, 2022 at 5:59 am

Wonderful post!

I was wondering, how would I be able to determine if a sample size is large enough for a paper that I’m reading, assuming they do not give the power calculation? If they do give the power calculation, should it be 80% or over for statistically significant results?

Thank you so much 🙂

August 21, 2022 at 12:28 am

Determining whether a study’s sample size and, hence, its statistical power, are sufficient isn’t quite as straightforward as it might appear. It’s tempting to take the study’s sample size, effect size, and variability and enter them into a power analysis. However, that’s problematic. What happens is that if the study has statistically significant findings the power analysis will always indicate sufficient sample size/power. However, if the study has non-significant results, the power analysis will always indicate that the sample size/power are insufficient.

That’s a problem because it’s possible to obtain significant results with low power studies and insignificant results with high power studies. It’s important to recognize all these cases because significant low power studies will exaggerate the effect sizes, and insignificant high power studies are more likely to indicate that the effect does not exist in the population.

What you need to do instead is enter the study’s sample size, use a literature review to obtain reasonable estimates of the variability (if possible), and then enter an effect size that represents either the literature’s collective best estimate of it or a minimum effect size that is still practically meaningful. Note that you are not using the study’s estimates for these calculations for the reasons I indicated earlier!


November 13, 2021 at 1:46 am

Hi Sir Jim!

I’d like to know how I can utilize the G*Power calculator to figure out the sample size for my study. It essentially employed stratified random sampling. I’m hoping you’ll respond! Best wishes!

November 13, 2021 at 11:57 pm

It depends on how you’ve conducted your stratified sampling and what you want to test. Are you comparing the strata within your sample? If so, you’d just select the type of test, such as a t-test, and then enter your values. G*Power uses the default setting that your group sizes are equal. That’s fine if you’re using a disproportionate stratified sampling design and set all your strata to the same size. However, if your strata sizes are unequal, you’ll need to adjust the allocation ratio.


June 16, 2021 at 7:32 am

Hello Jim. I want your help in calculating sample size for my study. I have three groups, first group is control (normal), second is a clinical population group undergoing treatment 1 and third colonics group (same disease as group2) undergoing treatment 2. So here I will compare some parameters between pre-post treatment for group 2 and 3 separately first. Then compare group 2 and 3 before treatment and after treatment and then compare baseline parameters and after treatment parameters across all three groups. I hope I have not confused you. I want to know the sample size for my three groups. My hypothesis is that the two treatments will improve the parameters in group 2 and 3, what I want to check is which treatment (1 or 2) is most effective.. I request you to kindly help me in this regard


April 19, 2021 at 10:49 pm

Dear Jim, I have a question regarding calculating the sample size in this scenario: I’m doing a hospital-based study (chart review study) where I will include all patients who have a specific disease (celiac disease) in the last 5 years. How would I know that the number I get is sufficient to answer my research questions, considering that this disease is rare? Suppose, for example, I ended up with 100 patients; how would I know that I can use this sample for further analysis? Is there a way to calculate ahead of time the minimum number of patients needed to do my research?


March 8, 2021 at 10:45 pm

I am looking to determine the sample size necessary to detect differences in bird populations (composition and abundance) between forest treatment types. I assume I would use an ANOVA given that I have control units. My data will be bird occurrence data, so I imagine Poisson distribution. I have zero pilot data, though. Do you have any recommendations for reading up on ways to simulate or bootstrap data in this situation for use in making variability estimates?

Thank you!!

March 9, 2021 at 7:20 pm

Hi Lorelle,

Yes, I’d think you’d use something like Poisson regression or negative binomial regression because of the count data. I write a little bit about them in my post about choosing the correct type of regression analysis . You can include categorical variables for forest types.

I don’t have good ideas for developing variability estimates. That can be the most difficult part of a power analysis. I’d recommend reading up on the literature as much as possible. Perhaps others have conducted similar research and you can use their estimates. Unfortunately, if you don’t have any data, you can’t bootstrap or simulate it.

I wish I had some better advice, but the best I can think of is to look through the literature for comparable studies. That’s always a good idea anyway, but here it’ll help you with the power analysis too.


February 17, 2021 at 7:04 am

I am confused in some parts as I am new to this. Let’s assume I have the difference in means, the standard deviation, and 80% power, so I have the information (delta, sd, power) to get a sample size. But the question is, how would I know this is the correct sample size to get 80% power? Which type do I need to use: paired, two.sample, or one.sample? After power.t.test I get a sample size of 8.7 for two.sample and 6 for one.sample, and I am not sure which would be the correct one. How do I determine that?

February 18, 2021 at 12:34 am

The correct test depends on the nature of the data you collect. Are you comparing the means of two groups? In that case, you need to use a 2-sample t-test. If you have one group and are comparing its mean to a test value, you need a 1-sample t-test.

You can read about the purposes and interpretations the various t-tests in my post about How to do t-tests in Excel . That should be helpful even if you’re not using Excel. Also, I write more about how t-tests work , which will be helpful in showing you what each test can do.


February 7, 2021 at 6:53 pm

Hey there! What sort of test would be best to determine sample size needed for a study determining a 10% difference between two groups at a power of say 80%? Thanks!

February 7, 2021 at 10:23 pm

Hi Kristin, you’d need to perform a power and sample size analysis for a 2-sample t-test. As I indicate in this post, you’ll need to supply an estimate of the population’s standard deviation, the difference you want to detect, and the power, and the procedure will tell you the sample size per group.


January 30, 2021 at 7:48 pm

I have an essay question if anyone can help me with:

Do a calculation: write down what you think the typical power of a psychological study really is and what percentage of research hypotheses are “good” hypotheses. Assume that journals reserve 10% of their pages for publishing null results. Under these assumptions, what percentage of published psychological research is wrong? Do you agree that this analysis makes sense, or is this the wrong way to think about “right” and “wrong” research?

January 30, 2021 at 8:57 pm

I can’t do your essay for you, but I’ve written two blog posts that should be extremely helpful for your assignment.

Reproducibility in Psychology Experiments Low power tests exaggerate effect sizes

Those two should give you some good food for thought!


January 26, 2021 at 1:17 pm

Dear Jim, I have a question regarding sample size calculation for a laboratory study. The laboratory evaluation includes evaluation of the marginal integrity of 2 dental materials vs a control material. What type of test should I use?

January 26, 2021 at 9:13 pm

Hi Eman, that largely depends on the type of data you’re collecting for your outcome. If marginal integrity is continuous data and you want to compare the means between the control and two treatment groups, one-way ANOVA is a great place to start.


November 22, 2020 at 10:30 am

Hi Jim, what if I want to run mixed model ANOVAs twice (on two different dependent variables), would I have to then double the sample size that I calculated using G*Power? Thanks, Joanna


November 16, 2020 at 11:35 pm

Hi Jim. What about molecular data? For instance, I sequenced my 6 samples, 3 controls and 3 treatments, but each sample (tank replicate) consists of 500-800 individual biological replicates (larvae). The analysis after sequencing shows that there are thousands of genes that may show mean differences between the control and treatment. My concern is, does power analysis still play a fair role here, given that increasing the “sample size” (the number of tank replicates) to the 5 or more suggested by power analysis to get >0.8 is nearly impossible in a physical setting?


November 5, 2020 at 8:09 pm

I have somewhat of a basic question. I am performing some animal studies and looking at the effect of preservation solution on ischemia reperfusion injury following transplantation. I am comparing 5 different preservation solutions. What should be my sample size for each group? I want to know how exactly I can calculate that.

November 6, 2020 at 8:58 pm

You’ll need to have an estimate of the effect. Or, an estimate of the minimum effect size that is practically meaningful in a real-world sense. If you’re comparing means, you’ll also need an estimate of the variability. The nature of what and how to determine the sample size depends on the type of hypothesis test you’ll be using. That in turn depends on the nature of your outcome variable. Are you comparing means with continuous data or comparing proportions with binary data? But in all cases you’ll need that effect size estimate.

You’ll also need software to calculate that for you. I recommend a freeware program called G*Power . Although, most statistical applications can do these power calculations. I cover examples in this post that should be helpful for you.

If you have 5 solutions and you want to compare their means, you’ll need to perform power and sample size calculations for one-way ANOVA.
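As a sketch of what that looks like in Python’s statsmodels (the effect size f = 0.25 is a placeholder assumption; replace it with an estimate from your own subject-area research):

```python
import math
from statsmodels.stats.power import FTestAnovaPower

f = 0.25   # assumed Cohen's f (a "medium" effect); an illustrative placeholder

n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.8, k_groups=5)
print(math.ceil(n_total))   # total observations needed across the 5 groups
```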


September 4, 2020 at 3:48 am

Hi Jim, I’ve calculated that I need 34 pairs for a paired t-test with an alpha=0.05 and beta=0.10 with a standard deviation of 1.945 to detect a 1.0 increase in the difference. If after 5 pairs I run my hypothesis tests and I find that the difference is significant (i.e. I reject the null hypothesis), is there a need to complete the remaining 29 pairs? Thanks, Sam


August 20, 2020 at 12:13 pm

Thank you for the explanation. I am currently using G*Power to determine my sample size. But I am still confused about the effect size. Let’s say I use a medium effect size for conducting a correlation, so the suggested sample size is 138 (example), but then when I use a medium effect size for conducting a t-test to find differences between two independent groups, the suggested sample size is 300 (example). So which sample size should I take? Does the same effect size need to be used for every statistical test, or does each statistical test actually have a different effect size?


August 15, 2020 at 1:45 pm

I want to calculate the sample size for my animal studies. We have designed a novel neural probe and want to perform experiments to test the functionality of these probes in rat brain. As this is a binary study (the probe either works or doesn’t work, i.e., success or failure) and it’s a new technology, it lacks any previous literature. Can anyone please suggest which statistical analysis (test) I should use and what parameters (i.e., effect size) I should use? I am using G*Power and looking for a 95% confidence level.

Thanks in Advance Vishal

August 15, 2020 at 3:11 pm

It sounds like you need to use a 2-sample proportions test. It’s one of the many hypothesis tests that I cover in my new Hypothesis Testing ebook . You’ll find the details about how and why to use it, assumptions, interpretations and examples for it.

As for using G*Power to estimate power and sample size, under the Test family drop-down list, choose Exact . Under the Statistical test drop-down, choose Proportions: Inequality, two independent groups (Fisher’s exact test) . That assumes that your two groups have different probes. From there, you’ll need to enter estimates for your study based on whatever background subject-area research/knowledge you have.

I hope this helps!


August 15, 2020 at 10:00 am

Hi Jim, is it scientifically appropriate to use G*Power for the sample size calculation of clinical biomedical research?

August 15, 2020 at 3:24 pm

Hi, yes, G*Power should be appropriate to use for statistical analyses in any area. Did you have a specific concern about it?



July 12, 2020 at 1:56 am

Thank you, Jim, for the app reference. I am checking it out right now. #TeamNoSleep

July 12, 2020 at 5:40 pm

Hi Jamie, Ah, yes, #TeamNoSleep. I’ve unfortunately been on that team! 🙂


June 17, 2020 at 1:30 am

Hi Jim, What is the name of the software you use?

June 18, 2020 at 5:40 pm

I’m using Minitab statistical software. If you’d like free software to calculate power and sample sizes, I highly recommend G*Power .


June 10, 2020 at 4:05 pm

I would like to calculate power for a poisson regression (my DV consists of count data). Do you have any guidance on how to do so?

June 10, 2020 at 4:29 pm

Hi Veronica,

Unfortunately, I’m not familiar with an application that will calculate power for Poisson regression. If your counts are large enough (lambda greater than 10), the Poisson distribution approximates a normal distribution. You might then be able to use power analysis for linear multiple regression, which I have seen in the free application G*Power . That might give you an idea at least. I’m not sure about power analysis specifically for Poisson regression.


June 3, 2020 at 6:24 am

Dear Jim, your post looks very nice. I have just one comment: how could I calculate the sample size and power for an “Equal variances” test comparing more than 2 samples? Is it mandatory, as in t-tests? What is the test statistic used in that test? Thanks in advance for your tip.

June 3, 2020 at 8:13 pm

Hi Ciro, to be honest, I’ve never seen a power analysis for an equal variances test with more than two samples!

The test statistic depends upon which of several methods you use: the F-test, Levene’s test statistic, or Bartlett’s test statistic.

While it would be nice to estimate power for this type of test, I don’t think it’s a common practice and I haven’t seen it available in the software I have checked.


April 24, 2020 at 12:10 am

Why are the sample sizes here all so small?

April 25, 2020 at 1:37 am

For sample sizes, large and small are relative. Given the parameters entered, which include the effect size you want to detect, the properties of the data, and the desired power, the sample sizes are exactly the correct size! Of course, you’re always working with estimates for these values and there’s a chance your estimates are off. But, the proper sample size depends on the nature of all those properties.

I’m curious, was there some reason why you were expecting larger sample sizes? Sometimes you’ll see big studies, such as medical trials. In some cases with lives on the line, you’ll want very large sample sizes that go beyond just issues of statistical power. But, for many scientific studies where the stakes aren’t so high, they use the approach described here.


December 1, 2019 at 6:20 pm

Is the formula n = (z × standard deviation / margin of error)² already a power analysis? I’m looking for power analysis for just estimating a statistic (descriptive statistics) and not hypothesis testing as in many cases of inferential statistics. Does that formula suffice? Thanks in advance 😊

December 2, 2019 at 2:43 pm

You might not realize it, but you’re asking me a trick question! The answer for how you calculate power for descriptive statistics is that you don’t calculate power for descriptive statistics.

Descriptive statistics simply describe the characteristics of a particular group. You’re not making inferences about a larger population. Consequently, there is no hypothesis testing. Power relates to the probability that a hypothesis test will detect a population effect that actually exists. Consequently, if there is no hypothesis test/inferences about a population, there’s no reason to calculate power.

Relatedly, descriptive statistics do not involve a margin of error based on random sampling. The mean of a group is a specific known value without error (excluding measurement error) because you’re measuring all members of that group.

For more information about this topic, read my post about the differences between descriptive and inferential statistics .


October 22, 2019 at 3:24 am

Just wanted to understand if the confidence interval and power are the same.


September 9, 2019 at 8:25 am

Thanks for your explanation, Jim.

August 21, 2019 at 7:46 am

I would like to design a test for the following problem (under the assumption that the Poisson distribution applies):

Samples from a population can be either defective or not (e.g. some technical component from a production)

Out of a random sample of N, there should be at most k defective occurrences, with a 95% probability (e.g. N = 100’000, k = 30).

I would like to design a test for this (testing this Hypothesis) with a sample size N1 (different from N). What should my limit on k1 (defective occurrences from the sample of N1) be? Such that I can say that with a 95% confidence, there will be at most k occurrences out of N samples.

E.g. N1 = 20’000. k1 = ???

Any hints how to tackle this problem?

Many thanks in advance Tom

August 21, 2019 at 11:46 pm

To me, it sounds like you need to use the binomial distribution rather than the Poisson distribution. You use the binomial distribution when you have binary data and you know the probability of an event and the number of trials. That sounds like your scenario!

In the graph below, I illustrate a binomial distribution where we assume the defect rate is 0.001 and the sample size is 100,000. I had the software shade the upper and lower ~2.5% of the tails. 95% of the outcomes should fall within the middle.

example of binomial distribution
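The shaded interval in that graph can be reproduced with a couple of lines (a sketch; the exact endpoints may differ by a count or so from the graph):

```python
from scipy.stats import binom

n, p = 100_000, 0.001    # 100,000 trials with a 0.001 defect probability

# Middle 95% of the distribution (the unshaded region of the graph)
low, high = binom.interval(0.95, n, p)
print(low, high)         # roughly 81 to 120 defectives
```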

If you have sample data, you can use the Proportions hypothesis test, which is based on the binomial distribution. If you have a single sample, use the Proportions test to determine whether your sample is significantly different from a target probability and to construct a confidence interval.

I hope this helps!


March 17, 2019 at 6:37 pm

Thanks very much for putting together this very helpful and informative page. I just have a quick question about statistical power: it’s been surprisingly difficult for me to locate an answer to it in the literature.

I want to calculate the sample size required in order to reach a certain level of a priori statistical power in my experiment. My question is about what ‘sample size’ means in this type of calculation. Does it mean the number of participants or the number of data points? If there is one data point per participant, then these numbers will obviously be the same. However, I’m using a mixed-effects logistic regression model in which there are multiple data points nested within each participant. (Each participant produces multiple ‘yes/no’ responses.)

It would seem odd if the calculation of a priori statistical power did not differentiate between whether each participant produces one response or multiple responses.


April 8, 2018 at 4:46 am

Thank you so much sir for the lucid explanation. Really appreciate your kind help. Many Thanks!

April 1, 2018 at 4:36 am

Dear sir, when I search online for sample size determination, I predominantly see mention of the margin of error formula for its calculation.

At other places, like your website, I see use of effect size and desired power etc. for the same calculation.

I’m struggling to reconcile between these 2 approaches. Is there a link between the two?

I wish to determine sample size for testing a hypothesis with sufficient power, say 80% or 90%. Please guide me.

April 2, 2018 at 11:37 am

Hi Khalid, a margin of error (MOE) quantifies the amount of random sampling error in the estimation of a parameter, such as the mean or proportion. MOEs represent the uncertainty about how well the sample estimates from a study represent the true population value and are related to confidence intervals. In a confidence interval, the margin of error is the distance between the sample estimate and each endpoint of the CI.

Margins of error are commonly used for surveys. For example, suppose a survey result is that 75% of the respondents like the product, with a MOE of 3 percent. This result indicates that we can be 95% confident that 75% +/- 3% (or 72-78%) of the population like the product.

If you conduct a study, you can estimate the sample size that you need to achieve a specific margin of error. The narrower the MOE, the more precise the estimate. If you have requirements about the precision of the estimates, then you might need to estimate the margin of error based on different sample sizes. This is simply one form of power and sample size analysis where the focus is on how sample sizes relate to the margin of error.

However, if you need to calculate power to detect an effect, use the methods I describe in this post.

In summary, determine what your requirements are and use the corresponding analysis. Do you need to estimate a sample size that produces a level of precision that you specify for the estimates? Or, do you need to estimate a sample size that produces an amount of power to detect a specific size effect? Of course, these are related questions and it comes down to what you want to input as your criteria.


March 20, 2018 at 10:42 am

Thank you so much for this very intuitive article on sample size.

Thank you, Ashwini

March 20, 2018 at 10:53 am

Hi Ashwini, you’re very welcome! I’m glad it was helpful!


March 19, 2018 at 1:22 pm

Thank you. This was very helpful.

March 19, 2018 at 1:25 pm

You’re very welcome, Hellen! I’m glad you found it to be helpful!


March 13, 2018 at 4:27 am

Thanks for your answer Jim. I was indeed aware of this tool, which is great for demonstration. I think I’ll stick to it.


March 12, 2018 at 7:53 am

Awaiting your book!

March 12, 2018 at 2:06 pm

Thanks! If all goes well, the first one should be out in September 2018!

March 12, 2018 at 4:18 am

Once again, a nice demonstration. Thanks Jim. I was wondering which software you used in your examples. Is it, perhaps, R or G*Power? And, would you have any suggestions on an (online/offline) tool that can be used in class?

March 12, 2018 at 2:03 pm

Hi George, thank you very much! I’m glad it was helpful! I used Minitab for the examples, but I would imagine that most statistical software have similar features.

I found this interactive tool for displaying how power, alpha, effect size, etc. are related. Perhaps this is what you’re looking for?


March 12, 2018 at 1:02 am

Thanks for the information. Please explain sample size calculation for a case-control study when different studies report different prevalences for different parameters.


March 12, 2018 at 12:26 am

Thanks, sir. I want to salute you, but you are too far away. Sir, please send me some articles on probability distributions.

With most kindness


Teach yourself statistics

Power of a Hypothesis Test

The probability of not committing a Type II error is called the power of a hypothesis test.

Effect Size

To compute the power of the test, one offers an alternative view about the "true" value of the population parameter, assuming that the null hypothesis is false. The effect size is the difference between the true value and the value specified in the null hypothesis.

Effect size = True value - Hypothesized value

For example, suppose the null hypothesis states that a population mean is equal to 100. A researcher might ask: What is the probability of rejecting the null hypothesis if the true population mean is equal to 90? In this example, the effect size would be 90 - 100, which equals -10.
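To make the calculation concrete, here is a sketch of the power computation for this example; the standard deviation (20) and sample size (25) are illustrative assumptions that the passage does not specify:

```python
from scipy.stats import norm

mu0, mu_true = 100, 90            # hypothesized and "true" population means
sigma, n, alpha = 20, 25, 0.05    # assumed values, for illustration only

effect_size = mu_true - mu0       # -10, as in the example
z_crit = norm.ppf(1 - alpha / 2)
shift = abs(effect_size) / (sigma / n ** 0.5)   # |effect| / standard error = 2.5

# Power of the two-sided z test: the probability that the sample mean lands
# in the rejection region when the true mean is 90
power = norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)
print(round(power, 3))            # about 0.705
```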

Factors That Affect Power

The power of a hypothesis test is affected by three factors.

  • Sample size ( n ). Other things being equal, the greater the sample size, the greater the power of the test.
  • Significance level (α). The lower the significance level, the lower the power of the test. If you reduce the significance level (e.g., from 0.05 to 0.01), the region of acceptance gets bigger. As a result, you are less likely to reject the null hypothesis. This means you are less likely to reject the null hypothesis when it is false, so you are more likely to make a Type II error. In short, the power of the test is reduced when you reduce the significance level; and vice versa.
  • The "true" value of the parameter being tested. The greater the difference between the "true" value of a parameter and the value specified in the null hypothesis, the greater the power of the test. That is, the greater the effect size, the greater the power of the test.

Test Your Understanding

Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size. II. Changing the significance level from 0.01 to 0.05. III. Increasing beta, the probability of a Type II error.

(A) I only (B) II only (C) III only (D) All of the above (E) None of the above

The correct answer is (C). Increasing sample size makes the hypothesis test more sensitive - more likely to reject the null hypothesis when it is, in fact, false. Changing the significance level from 0.01 to 0.05 makes the region of acceptance smaller, which makes the hypothesis test more likely to reject the null hypothesis, thus increasing the power of the test. Since, by definition, power is equal to one minus beta, the power of a test will get smaller as beta gets bigger.

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test. II. The effect size of the hypothesis test. III. The probability of making a Type II error.

The correct answer is (A). Increasing sample size makes the hypothesis test more sensitive - more likely to reject the null hypothesis when it is, in fact, false. Thus, it increases the power of the test. The effect size is not affected by sample size. And the probability of making a Type II error gets smaller, not bigger, as sample size increases.


6.5 - Power

The probability of rejecting the null hypothesis, given that the null hypothesis is false, is known as power. In other words, power is the probability of correctly rejecting \(H_0\).

The power of a test can be increased in a number of ways, for example increasing the sample size, decreasing the standard error, increasing the difference between the sample statistic and the hypothesized parameter, or increasing the alpha level. Using a directional test (i.e., left- or right-tailed) as opposed to a two-tailed test would also increase power. 

When we increase the sample size, decrease the standard error, or increase the difference between the sample statistic and hypothesized parameter, the p value decreases, thus making it more likely that we reject the null hypothesis. When we increase the alpha level, there is a larger range of p values for which we would reject the null hypothesis. Going from a two-tailed to a one-tailed test cuts the p value in half. In all of these cases, we say that statistically power is increased. 

There is a relationship between \(\alpha\) and \(\beta\). If the sample size is fixed, then decreasing \(\alpha\) will increase \(\beta\). If we want both \(\alpha\) and \(\beta\) to decrease (i.e., decreasing the likelihood of both Type I and Type II errors), then we should increase the sample size.
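Each of these claims is easy to confirm numerically. The sketch below uses Python's statsmodels with an assumed standardized effect size of 0.5 for a one-sample t test; the inputs are illustrative, not taken from this lesson:

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()    # one-sample t test
d = 0.5                    # assumed standardized effect size

# Power rises with the sample size...
for n in (20, 40, 80):
    print(n, round(analysis.power(effect_size=d, nobs=n, alpha=0.05), 2))

# ...falls when alpha is lowered...
print(round(analysis.power(effect_size=d, nobs=20, alpha=0.01), 2))

# ...and rises for a directional (one-tailed) test at the same alpha
print(round(analysis.power(effect_size=d, nobs=20, alpha=0.05,
                           alternative='larger'), 2))
```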

Try it!

The probability of committing a Type II error is known as \(\beta\).

\(Power+\beta=1\)

\(Power=1-\beta\)

If power increases then \(\beta\) must decrease. So, if the power of a statistical test is increased, for example by increasing the sample size, the probability of committing a Type II error decreases.

No; we never accept the null hypothesis. When we perform a hypothesis test, we only set the Type I error rate (i.e., alpha level) and guard against it. Thus, we can only present the strength of evidence against the null hypothesis. We can sidestep the concern about Type II error if the conclusion never mentions that the null hypothesis is accepted. When the null hypothesis cannot be rejected, there are two possible cases:

1) The null hypothesis is really true.

2) The sample size is not large enough to reject the null hypothesis (i.e., statistical power is too low).

The result of the study was to fail to reject the null hypothesis. In reality, the null hypothesis was false. This is a Type II error.


13.1 Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called sampling error . (Note that the term error here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s r value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H 0 and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the alternative hypothesis (often symbolized as H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis .

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of d = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the p value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be statistically significant. If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
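
As a concrete illustration, the decision rule amounts to comparing a computed p value with α. The sketch below uses made-up data and SciPy's standard independent-samples t-test; nothing about it is specific to any particular study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(10, 2, size=50)   # hypothetical scores for group A
    group_b = rng.normal(9, 2, size=50)    # hypothetical scores for group B

    t_stat, p_value = stats.ttest_ind(group_a, group_b)   # independent-samples t-test
    alpha = 0.05
    if p_value < alpha:
        print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis (statistically significant)")
    else:
        print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")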

The Misunderstood p Value

The p value is one of the most misunderstood quantities in psychological research (Cohen, 1994). Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the p value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the p value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The p value is really the probability of a result at least as extreme as the sample result if the null hypothesis were true. So a p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is not the probability that any particular hypothesis is true or false. Instead, it is the probability of obtaining the sample result if the null hypothesis were true.

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the p value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the p value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s d is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s d is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
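
These two scenarios are easy to verify numerically. The sketch below converts Cohen's d into the t statistic it implies for two equal groups (t = d·√(n/2), a standard approximation) and reports the two-sided p value:

    import math
    from scipy import stats

    def p_from_d(d, n_per_group):
        """Two-sided p value of an independent-samples t-test implied by Cohen's d."""
        t = d * math.sqrt(n_per_group / 2)   # t statistic for two equal groups
        df = 2 * n_per_group - 2
        return 2 * stats.t.sf(abs(t), df)

    print(p_from_d(0.50, 500))   # strong relationship, large samples: p is tiny
    print(p_from_d(0.10, 3))     # weak relationship, tiny samples: p is near 1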

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Table 13.1 How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant

                        Relationship strength
Sample Size             Weak              Medium   Strong
Small (N = 20)          No                No       d = Maybe, r = Yes
Medium (N = 50)         No                Yes      Yes
Large (N = 100)         d = Yes, r = No   Yes      Yes
Extra large (N = 500)   Yes               Yes      Yes

Although Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007). The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word significant can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the p value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.

Exercises

Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.

Practice: Use Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” to decide whether each of the following results is statistically significant.

  • The correlation between two variables is r = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 ( SD = 5) and the mean score for men is 24 ( SD = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Statology

How to Write a Null Hypothesis (5 Examples)

A hypothesis test uses sample data to determine whether or not some claim about a population parameter is true.

Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms:

H 0 (Null Hypothesis): Population parameter =, ≤, or ≥ some value

H A (Alternative Hypothesis): Population parameter <, >, or ≠ some value

Note that the null hypothesis always contains the equal sign.

We interpret the hypotheses as follows:

Null hypothesis: The sample data provides no evidence to support some claim being made by an individual.

Alternative hypothesis: The sample data does provide sufficient evidence to support the claim being made by an individual.

For example, suppose it’s assumed that the average height of a certain species of plant is 20 inches. However, one botanist claims the true average height is greater than 20 inches.

To test this claim, she may go out and collect a random sample of plants. She can then use this sample data to perform a hypothesis test using the following two hypotheses:

H 0 : μ ≤ 20 (the true mean height of plants is less than or equal to 20 inches)

H A : μ > 20 (the true mean height of plants is greater than 20 inches)

If the sample data gathered by the botanist shows that the mean height of this species of plants is significantly greater than 20 inches, she can reject the null hypothesis and conclude that the mean height is greater than 20 inches.
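
In code, the botanist's right-tailed test might look like the following sketch. The height measurements are invented for illustration, and the alternative= keyword requires SciPy 1.6 or later.

    import numpy as np
    from scipy import stats

    heights = np.array([21.3, 19.8, 22.1, 20.5, 23.0, 21.7, 20.9, 22.4])  # hypothetical sample

    # H0: mu <= 20 vs. HA: mu > 20 (right-tailed one-sample t-test)
    t_stat, p_value = stats.ttest_1samp(heights, popmean=20, alternative='greater')
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # A small p value would lead her to reject H0 and conclude that mu > 20.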

Read through the following examples to gain a better understanding of how to write a null hypothesis in different situations.

Example 1: Weight of Turtles

A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles.

Here is how to write the null and alternative hypotheses for this scenario:

H 0 : μ = 300 (the true mean weight is equal to 300 pounds)

H A : μ ≠ 300 (the true mean weight is not equal to 300 pounds)

Example 2: Height of Males

It’s assumed that the mean height of males in a certain city is 68 inches. However, an independent researcher believes the true mean height is greater than 68 inches. To test this, he goes out and collects the height of 50 males in the city.

H 0 : μ ≤ 68 (the true mean height is less than or equal to 68 inches)

H A : μ > 68 (the true mean height is greater than 68 inches)

Example 3: Graduation Rates

A university states that 80% of all students graduate on time. However, an independent researcher believes that less than 80% of all students graduate on time. To test this, she collects data on the proportion of students who graduated on time last year at the university.

H 0 : p ≥ 0.80 (the true proportion of students who graduate on time is 80% or higher)

H A : p < 0.80 (the true proportion of students who graduate on time is less than 80%)
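
A left-tailed test of a single proportion like this one can be run with statsmodels. In the sketch below, the counts are hypothetical (say, 152 of 200 sampled students graduated on time):

    from statsmodels.stats.proportion import proportions_ztest

    count, nobs = 152, 200   # hypothetical: 152 of 200 students graduated on time

    # H0: p >= 0.80 vs. HA: p < 0.80 (left-tailed z-test for one proportion)
    z_stat, p_value = proportions_ztest(count, nobs, value=0.80, alternative='smaller')
    print(f"z = {z_stat:.3f}, p = {p_value:.4f}")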

Example 4: Burger Weights

A food researcher wants to test whether or not the true mean weight of a burger at a certain restaurant is 7 ounces. To test this, he goes out and measures the weight of a random sample of 20 burgers from this restaurant.

H 0 : μ = 7 (the true mean weight is equal to 7 ounces)

H A : μ ≠ 7 (the true mean weight is not equal to 7 ounces)

Example 5: Citizen Support

A politician claims that less than 30% of citizens in a certain town support a certain law. To test this, he goes out and surveys 200 citizens on whether or not they support the law.

H 0 : p ≥ 0.30 (the true proportion of citizens who support the law is greater than or equal to 30%)

H A : p < 0.30 (the true proportion of citizens who support the law is less than 30%)

Additional Resources

  • Introduction to Hypothesis Testing
  • Introduction to Confidence Intervals
  • An Explanation of P-Values and Statistical Significance


Indian J Psychol Med, 42(1), Jan–Feb 2020

Sample Size and its Importance in Research

Chittaranjan Andrade

Clinical Psychopharmacology Unit, Department of Clinical Psychopharmacology and Neurotoxicology, National Institute of Mental Health and Neurosciences, Bengaluru, Karnataka, India

The sample size for a study needs to be estimated at the time the study is proposed; too large a sample is unnecessary and unethical, and too small a sample is unscientific and also unethical. The necessary sample size can be calculated, using statistical software, based on certain assumptions. If no assumptions can be made, then an arbitrary sample size is set for a pilot study. This article discusses sample size and how it relates to matters such as ethics, statistical power, the primary and secondary hypotheses in a study, and findings from larger vs. smaller samples.

Studies are conducted on samples because it is usually impossible to study the entire population. Conclusions drawn from samples are intended to be generalized to the population, and sometimes to the future as well. The sample must therefore be representative of the population. This is best ensured by the use of proper methods of sampling. The sample must also be adequate in size – in fact, no more and no less.

SAMPLE SIZE AND ETHICS

A sample that is larger than necessary will better represent the population and will hence provide more accurate results. However, beyond a certain point, the increase in accuracy will be small and hence not worth the effort and expense involved in recruiting the extra patients. Furthermore, an overly large sample would inconvenience more patients than might be necessary for the study objectives; this is unethical. In contrast, a sample that is smaller than necessary would have insufficient statistical power to answer the primary research question, and a statistically nonsignificant result could merely be because of inadequate sample size (Type 2 or false negative error). Thus, a small sample could result in the patients in the study being inconvenienced with no benefit to future patients or to science. This is also unethical.

In this regard, inconvenience to patients refers to the time that they spend in clinical assessments and to the psychological and physical discomfort that they experience in assessments such as interviews, blood sampling, and other procedures.

ESTIMATING SAMPLE SIZE

So how large should a sample be? In hypothesis testing studies, this is mathematically calculated, conventionally, as the sample size necessary to be 80% certain of identifying a statistically significant outcome should the hypothesis be true for the population, with P for statistical significance set at 0.05. Some investigators power their studies for 90% instead of 80%, and some set the threshold for significance at 0.01 rather than 0.05. Both choices are uncommon because the necessary sample size becomes large, and the study becomes more expensive and more difficult to conduct. Many investigators increase the sample size by 10%, or by whatever proportion they can justify, to compensate for expected dropout, incomplete records, biological specimens that do not meet laboratory requirements for testing, and other study-related problems.

Sample size calculations require assumptions about expected means and standard deviations, or event risks, in different groups, or about expected effect sizes. For example, a study may be powered to detect an effect size of 0.5, or a response rate of 60% with drug vs. 40% with placebo.[ 1 ] When no guesstimates or expectations are possible, pilot studies are conducted on a sample that is arbitrary in size but considered reasonable for the field.

The sample size may need to be larger in multicenter studies because of statistical noise (due to variations in patient characteristics, nonspecific treatment characteristics, rating practices, environments, etc. between study centers).[ 2 ] Sample size calculations can be performed manually or using statistical software; online calculators that provide free service can easily be identified by search engines. G*Power is an example of a free, downloadable program for sample size estimation. The manual and tutorial for G*Power can also be downloaded.
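
As an example of such software, the two calculations mentioned above can be reproduced in Python with statsmodels; this is only a sketch of one freely available option, not the article's own method:

    from statsmodels.stats.power import TTestIndPower, NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # n per group for a two-sample t-test to detect d = 0.5 with 80% power, alpha = 0.05
    n1 = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n1))   # about 64 per group

    # n per group for a 60% response rate with drug vs. 40% with placebo
    h = proportion_effectsize(0.60, 0.40)   # Cohen's h for two proportions
    n2 = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
    print(round(n2))   # about 97 per group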

PRIMARY AND SECONDARY ANALYSES

The sample size is calculated for the primary hypothesis of the study. What is the difference between the primary hypothesis, primary outcome and primary outcome measure? As an example, the primary outcome may be a reduction in the severity of depression, the primary outcome measure may be the Montgomery-Asberg Depression Rating Scale (MADRS) and the primary hypothesis may be that reduction in MADRS scores is greater with the drug than with placebo. The primary hypothesis is tested in the primary analysis.

Studies almost always have many hypotheses; for example, that the study drug will outperform placebo on measures of depression, suicidality, anxiety, disability and quality of life. The sample size necessary for adequate statistical power to test each of these hypotheses will be different. Because a study can have only one sample size, it can be powered for only one outcome, the primary outcome. Therefore, the study would be either overpowered or underpowered for the other outcomes. These outcomes are therefore called secondary outcomes, and are associated with secondary hypotheses, and are tested in secondary analyses. Secondary analyses are generally considered exploratory because when many hypotheses in a study are each tested at a P < 0.05 level for significance, some may emerge statistically significant by chance (Type 1 or false positive errors).[ 3 ]
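
The inflation of the Type 1 error rate across secondary analyses is simple to quantify under the (idealized) assumption that the tests are independent: with k tests each at P < 0.05, the chance of at least one false positive is 1 − 0.95^k.

    # Family-wise error rate for k independent tests at alpha = 0.05
    for k in (1, 5, 10, 20):
        fwer = 1 - 0.95 ** k
        print(f"{k:2d} tests: P(at least one false positive) = {fwer:.2f}")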

INTERPRETING RESULTS

Here is an interesting question. A test of the primary hypothesis yielded a P value of 0.07. Might we conclude that our sample was underpowered for the study and that, had our sample been larger, we would have identified a significant result? No! The reason is that larger samples will more accurately represent the population value, whereas smaller samples could be off the mark in either direction – towards or away from the population value. In this context, readers should also note that no matter how small the P value for an estimate is, the population value of that estimate remains the same.[ 4 ]

On a parting note, it is unlikely that population values will be null. That is, for example, that the response rate to the drug will be exactly the same as that to placebo, or that the correlation between height and age at onset of schizophrenia will be zero. If the sample size is large enough, even such small differences between groups, or trivial correlations, would be detected as being statistically significant. This does not mean that the findings are clinically significant.
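
A quick numerical sketch (hypothetical numbers) makes the point: with a trivial standardized difference of d = 0.05 between groups, the two-sample t-test p value drops below 0.05 once the groups are large enough, even though the difference remains clinically negligible.

    import math
    from scipy import stats

    d = 0.05                        # a trivial standardized difference
    for n in (100, 1000, 10000):    # per-group sample sizes
        t = d * math.sqrt(n / 2)
        p = 2 * stats.t.sf(t, 2 * n - 2)
        print(f"n = {n:5d} per group: p = {p:.4f}")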




image

“Conditional Risk” retrieved from http://imgs.xkcd.com/comics/conditional_risk.png (CC-BY-NC 2.5)

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • The correlation between two variables is  r  = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 ( SD  = 5) and the mean score for men is 24 ( SD  = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of  r  = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.
  • Cohen, J. (1994). The world is round: p < .05. American Psychologist, 49 , 997–1003. ↵
  • Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16 , 259–263. ↵

Creative Commons License

Share This Book

  • Increase Font Size

Logo for BCcampus Open Publishing

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

Chapter 13: Inferential Statistics

Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the   null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favour of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favour of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high  p  value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to conclude that it is true. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
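
To make the decision rule concrete, here is a minimal sketch in Python, assuming a two-sided z test; the observed statistic z = 2.33 is a made-up value (it yields p of about .02, the value used as an example in the next section).

from scipy.stats import norm

alpha = 0.05
z_observed = 2.33                               # hypothetical test statistic
p_value = 2 * (1 - norm.cdf(abs(z_observed)))   # two-sided p value, about .02

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: retain the null hypothesis")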

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
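
The trade-off can be made concrete with a rough calculation. The sketch below is a back-of-the-envelope approximation, not a real significance test: it treats the two-sample comparison as a z test with known variance, approximating the two-sided p value as 2(1 − Φ(d·√(n/2))) for Cohen's d and per-group sample size n. The numbers mirror the chapter's two examples.

from math import sqrt
from scipy.stats import norm

def approx_p(d, n_per_group):
    # crude approximation: treat d / sqrt(2/n) as a standard normal z
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - norm.cdf(abs(z)))

print(approx_p(0.50, 500))   # strong effect, large samples: p is essentially 0
print(approx_p(0.10, 3))     # weak effect, tiny samples: p is about .90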

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen's d and Pearson's r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 "Some Basic Null Hypothesis Tests".

Table 13.1 How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant
Sample Size             Weak relationship     Medium-strength relationship    Strong relationship
Small (N = 20)          No                    No                              d = Maybe, r = Yes
Medium (N = 50)         No                    Yes                             Yes
Large (N = 100)         d = Yes, r = No       Yes                             Yes
Extra large (N = 500)   Yes                   Yes                             Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favour of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Use Table 13.1 to decide whether each of the following results is statistically significant:
  • The correlation between two variables is r = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 (SD = 5) and the mean score for men is 24 (SD = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students' level of stress.

Long Descriptions

“Null Hypothesis” long description: A comic depicting a man and a woman talking in the foreground. In the background is a child working at a desk. The man says to the woman, “I can’t believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it years ago.”

“Conditional Risk” long description: A comic depicting two hikers beside a tree during a thunderstorm. A bolt of lightning goes “crack” in the dark sky as thunder booms. One of the hikers says, “Whoa! We should get inside!” The other hiker says, “It’s okay! Lightning only kills about 45 Americans a year, so the chances of dying are only one in 7,000,000. Let’s go on!” The comic’s caption says, “The annual death rate among people who know that statistic is one in six.”

Media Attributions

  • Null Hypothesis by XKCD  CC BY-NC (Attribution NonCommercial)
  • Conditional Risk by XKCD  CC BY-NC (Attribution NonCommercial)
  • Cohen, J. (1994). The world is round: p < .05. American Psychologist, 49, 997–1003.
  • Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Glossary

Parameters: Values in a population that correspond to variables measured in a study.

Sampling error: The random variability in a statistic from sample to sample.

Null hypothesis testing: A formal approach to deciding between two interpretations of a statistical relationship in a sample.

Null hypothesis: The idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error.

Alternative hypothesis: The idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Reject the null hypothesis: When the relationship found in the sample would be extremely unlikely if the null hypothesis were true, the idea that the relationship occurred "by chance" is rejected.

Retain the null hypothesis: When the relationship found in the sample would be likely if the null hypothesis were true, the null hypothesis is not rejected.

p value: The probability that, if the null hypothesis were true, a result as extreme as the one found in the sample would occur.

α (alpha): The criterion for how low the p value must be before the sample result is considered unlikely enough to reject the null hypothesis.

Statistically significant: Describes a result for which, if the null hypothesis were true, there would be less than a 5% chance of a result as extreme as the sample result, so the null hypothesis is rejected.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



Test at the 0.05 level of significance whether the mean of a random sample of size n = 16 is "significantly less than 10," given that the distribution from which the sample was taken is normal, x̄ = 8.4, and σ = 3.2. What are the null and alternative hypotheses for this test?

For this situation you would use a one-sample z-test. The null and alternative hypotheses are:

Null hypothesis (H₀): the population mean equals 10, written H₀: µ = 10.

Alternative hypothesis (H₁): the population mean is less than 10, written H₁: µ < 10.

You are given the sample size (n = 16), the sample mean (x̄ = 8.4), and the population standard deviation (σ = 3.2). To test the hypotheses at the 0.05 level of significance, calculate the z-statistic:

z = (x̄ − µ₀) / (σ/√n) = (8.4 − 10) / (3.2/√16) = −1.6/0.8 = −2.0

For a left-tailed test at the 0.05 level, the critical value from the standard normal table is −1.645. Since −2.0 < −1.645, reject the null hypothesis and conclude that the population mean is significantly less than 10.
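
As a sketch, here is the same test carried out in Python, using scipy.stats.norm for the standard normal distribution; the values are the ones given in the problem.

from math import sqrt
from scipy.stats import norm

n, xbar, sigma, mu0, alpha = 16, 8.4, 3.2, 10, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))   # test statistic: -2.0
p_value = norm.cdf(z)                  # left-tailed p value: about 0.0228

print(z, p_value, p_value < alpha)     # reject H0 at the 0.05 level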


Related Questions

Graph three points for the equation x+(2y)-1=9 and determine if it is linear or nonlinear. List the points you used.

To graph the equation x + 2y - 1 = 9, we can first rearrange it to solve for y:

x + 2y - 1 = 9

2y = 10 - x

y = (10 - x) / 2

Now we can pick three different values of x and find the corresponding values of y using this equation. Let's choose x = 0, x = 2, and x = 4:

When x = 0:

y = (10 - 0) / 2 = 5

So one point on the graph is (0, 5).

When x = 2:

y = (10 - 2) / 2 = 4

So another point on the graph is (2, 4).

When x = 4:

y = (10 - 4) / 2 = 3

So the third point on the graph is (4, 3).

To check if this equation is linear or nonlinear, we can see if it satisfies the property of linearity, which is that if we draw a line between any two points on the graph, all other points on the graph should lie on that same line.

Let's check whether this holds for the three points we've chosen. The slope from (0, 5) to (2, 4) is (4 − 5)/(2 − 0) = −1/2, and the slope from (2, 4) to (4, 3) is (3 − 4)/(4 − 2) = −1/2. The slopes are equal, so all three points lie on a single straight line, and the equation is linear. (This also follows directly from the rearranged form y = (10 − x)/2 = −x/2 + 5, which is the equation of a line with slope −1/2 and y-intercept 5.)
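
A quick computational check of the collinearity argument, using nothing beyond the three points above:

points = [(0, 5), (2, 4), (4, 3)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(points, points[1:])]
print(slopes)   # [-0.5, -0.5] -> equal slopes, so the points are collinear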

The degrees of freedom for the sample variance A. are equal to the sample size B. are equal to the sample size C. can vary between −∞ and +∞ D. both B and C

The degrees of freedom for the sample variance are equal to the sample size minus 1, written n − 1; as the options are printed above, this choice does not appear among them.

To clarify the terms: 1. Degrees of freedom: the number of independent values in a statistical calculation that are free to vary. 2. Variance: a measure of dispersion equal to the average squared difference between the values in a dataset and the mean of the dataset. 3. Sample size: the number of observations in a sample.

One degree of freedom is lost because the population mean is estimated by the sample mean: once the sample mean is fixed, only n − 1 of the deviations from it are free to vary. Dividing by n − 1 rather than n also makes the sample variance an unbiased estimator of the population variance, and a larger sample size generally yields a more accurate estimate.
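
As an illustration, NumPy exposes this choice through the ddof ("delta degrees of freedom") argument: ddof=1 divides by n − 1. The data values below are made up.

import numpy as np

data = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
n = data.size
print(np.var(data, ddof=1))                         # sample variance, divides by n - 1
print(((data - data.mean()) ** 2).sum() / (n - 1))  # same thing computed by hand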


0.5 miles = 2,640 feet. A. True B. False

Step-by-step explanation:

The statement is true. One mile is equal to 5,280 feet, so half a mile (0.5 miles) is 5,280 / 2 = 2,640 feet.

Change from rectangular to cylindrical coordinates (let r ≥ 0 and 0 ≤ θ ≤ 2π): (a) (5, −5, 3) (b) (−4, −4√3, 1)

a. The cylindrical coordinates for the point (5, −5, 3) are (5√2, 7π/4, 3).

b. The cylindrical coordinates for the point (−4, −4√3, 1) are (8, 4π/3, 1).

To convert from rectangular coordinates (x, y, z) to cylindrical coordinates (r, θ, z), use:

r = √(x² + y²), θ = atan2(y, x) (shifted into [0, 2π) if necessary), z = z

(a) For (5, −5, 3): r = √(25 + 25) = √50 = 5√2. The point lies in the fourth quadrant (x > 0, y < 0), so θ = −π/4 + 2π = 7π/4, and z = 3.

(b) For (−4, −4√3, 1): r = √(16 + 48) = √64 = 8. The point lies in the third quadrant (x < 0, y < 0) with reference angle arctan(√3) = π/3, so θ = π + π/3 = 4π/3, and z = 1. (Note that 5π/3 would place the point in the fourth quadrant, where x > 0, so it cannot be correct here.)
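
A small sketch of the conversion in Python, using math.atan2 and wrapping the angle into [0, 2π):

from math import atan2, sqrt, pi

def to_cylindrical(x, y, z):
    r = sqrt(x * x + y * y)
    theta = atan2(y, x) % (2 * pi)   # atan2 returns values in (-pi, pi]; wrap to [0, 2*pi)
    return r, theta, z

print(to_cylindrical(5, -5, 3))              # (7.071..., 5.497... = 7*pi/4, 3)
print(to_cylindrical(-4, -4 * sqrt(3), 1))   # (8.0, 4.188... = 4*pi/3, 1)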


Verify the distributive property of multiplication for a = 4, b = −2, c = 1.

The given values satisfy the distributive property of multiplication: both sides evaluate to −4.

The distributive property of multiplication says that multiplying a number by a sum (or difference) gives the same result as adding (or subtracting) the individual products.

To verify, substitute a = 4, b = −2, and c = 1 into a(b + c) = a·b + a·c:

4((−2) + 1) = 4·(−2) + 4·1

4(−1) = −8 + 4

−4 = −4

Both sides equal −4, so the given values satisfy the distributive property of multiplication.


A 9-pound bag of sugar is being split into containers that hold 3/4 of a pound. How many containers of sugar will the 9-pound bag fill
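
This question appears in the source without an answer; the arithmetic is direct: 9 ÷ (3/4) = 9 × (4/3) = 12, so the 9-pound bag fills 12 containers.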

true or false: as a general rule, one can use the normal distribution to approximate a binomial distribution whenever the sample size is at least 30.

Strictly speaking, the statement is false as a general rule: a large sample size alone does not guarantee a good normal approximation to a binomial distribution.

The binomial distribution is a probability distribution that describes the number of successes in a fixed number n of independent trials, where each trial results in either a success (with probability p) or a failure.

By the Central Limit Theorem, the binomial distribution does approach a normal distribution as the number of trials grows, and n ≥ 30 is often adequate when p is moderate. But when p is close to 0 or 1 the distribution is strongly skewed, and the approximation can be poor even with n ≥ 30. For that reason the usual rule of thumb checks the expected counts rather than n alone: the normal approximation is considered reasonable when both np ≥ 5 and n(1 − p) ≥ 5 (some texts require 10). As always, it is important to consider the context and assumptions of the problem.
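
A short sketch comparing the exact binomial CDF with its normal approximation (with continuity correction) illustrates the rule of thumb; the n, p, and k values are chosen only for illustration.

from math import sqrt
from scipy.stats import binom, norm

def compare(n, p, k):
    exact = binom.cdf(k, n, p)
    approx = norm.cdf(k + 0.5, loc=n * p, scale=sqrt(n * p * (1 - p)))
    print(f"n={n}, p={p}, k={k}: exact={exact:.4f}, normal approx={approx:.4f}")

compare(30, 0.50, 12)   # np = 15: close agreement
compare(30, 0.02, 0)    # np = 0.6: the approximation misses badly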


For each of the following problems, imagine that you are on a strange and unusual island, the natives of which are either Knights or Knaves. Knights may only tell the truth, whereas Knaves may only tell falsehoods. (Consequently, no one can be both a knight and a knave.) Each native wears medieval armor, and upon the breastplate of their armor, they have a single letter emblazoned (e.g., A, B, C, ....). Thus, the natives can be identified by the letter emblazoned on their breastplate. You can earn partial credit by explaining your reasoning even if you do not arrive at the correct answer. Part 1 (10 points total). You encounter two natives of this strange and unusual island – A and B. A says to you, "At least one of us is a knave." Is A a knight or a knave? How about B? Part 2 (10 points total). Now, you encounter three natives – C, D, E – and they initiate the following dialogue: C: All of us are knaves. D: Exactly one of us is a knight. What is C? What is D? What is E? Part 3 (10 points total). After C, D, and E leave, F, G, and H arrive. F: All of us are knaves. G: Exactly one of us is a knave. What is F? What is G? What is H? Part 4 (10 points total). Tiring of talking to these strange inhabitants, and needing some funds to finance your expedition, you begin to look for gold. You encounter J, and ask, "Is there gold on this island?" J responds "There is gold on this island if and only if I am a knight." Is there gold on the island?

On this island every native is either a knight (who only tells the truth) or a knave (who only tells falsehoods). Working through the four parts: in Part 1, A is a knight and B is a knave; in Part 2, D is a knight while C and E are knaves; in Part 3, F is a knave, H is a knight, and G cannot be determined; and in Part 4, there is gold on the island.

Part 1. A says, "At least one of us is a knave." If A were a knave, the statement would be false, so neither A nor B would be a knave; but that contradicts A being a knave. Hence A is a knight and the statement is true: at least one of the two is a knave. Since A is a knight, that knave must be B. So A is a knight and B is a knave.

Part 2. C says, "All of us are knaves." If C were a knight, the statement would be true and C would be a knave, a contradiction; so C is a knave, and not all three are knaves. D says, "Exactly one of us is a knight." Suppose D were a knave. Then the number of knights is not exactly one. C and D are knaves, so if E were a knight there would be exactly one knight, contradicting the falsity of D's statement; and if E were a knave, all three would be knaves, making C's statement true, contradicting C being a knave. Both cases fail, so D is a knight and D's statement holds: D is the only knight. So C is a knave, D is a knight, and E is a knave.

Part 3. F says, "All of us are knaves." By the same argument as for C, F is a knave, so not all three are knaves. G says, "Exactly one of us is a knave." If G is a knight, the statement is true, F is the only knave, and H is a knight. If G is a knave, the statement is false, so the number of knaves is not one; F and G already make two knaves, and H must then be a knight, since three knaves would make F's statement true, a contradiction. In both cases H is a knight, so F is a knave, H is a knight, and G's type cannot be determined.

Part 4. J says, "There is gold on this island if and only if I am a knight." If J is a knight, the statement is true and "I am a knight" is true, so "there is gold" must be true as well. If J is a knave, the statement is false; since "I am a knight" is false, the biconditional can only be false if "there is gold" is true. Either way there is gold on the island, even though J's own type cannot be determined.
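
Puzzles like Part 3 can also be checked by brute force. The sketch below enumerates all knight/knave assignments and keeps those where each speaker's statement matches their type (the puzzle's only rule):

from itertools import product

consistent = []
for F, G, H in product([True, False], repeat=3):    # True = knight, False = knave
    s_F = not (F or G or H)                         # "All of us are knaves"
    s_G = [F, G, H].count(False) == 1               # "Exactly one of us is a knave"
    if F == s_F and G == s_G:                       # each statement's truth must match the speaker's type
        consistent.append((F, G, H))

print(consistent)   # [(False, True, True), (False, False, True)]
                    # F is always a knave, H is always a knight, G varies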


Guys, I need help: how can I find the area of a square?

The area of a square is side × side. For example, with a side of 7 cm: area = 7 cm × 7 cm = 49 cm².

Using the rule that cos3θ = 4(cosθ)^3 − 3 cosθ, show that cos 2π/9 is a root of the equation 8x^3 − 6x + 1 = 0

Let x = cos θ. Then:

8(cos θ)³ − 6 cos θ + 1 = 0

2(4(cos θ)³ − 3 cos θ) + 1 = 0

2 cos 3θ + 1 = 0

cos 3θ = −1/2

One solution of cos 3θ = −1/2 is 3θ = 2π/3, that is, θ = 2π/9.

Therefore x = cos(2π/9) satisfies 8x³ − 6x + 1 = 0; that is, cos(2π/9) is a root of the given equation.
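
A numeric sanity check that cos(2π/9) satisfies the cubic:

from math import cos, pi

x = cos(2 * pi / 9)
print(8 * x**3 - 6 * x + 1)   # ~1e-16, i.e. zero to floating-point precision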

emma is currently the same age as claire was when emma was born. how old is emma now if claire is currently 42 years old

Emma is currently 21 years old.

Let E be Emma's current age and C = 42 be Claire's current age. When Emma was born, E years ago, Claire's age was C − E.

Since Emma is currently the same age as Claire was when Emma was born:

E = C − E

2E = 42

E = 21

So Emma is currently 21 years old.


Find the absolute maximum and absolute minimum values of f on the given interval: f(t) = t − 3√t, [−1, 4].

The absolute maximum value of f is 0, at t = 0, and the absolute minimum value is −2.25, at t = 9/4.

First note the domain: f(t) = t − 3√t requires t ≥ 0, so within [−1, 4] the function is only defined on [0, 4], and that is where we look for extrema. Taking the derivative:

f′(t) = 1 − (3/2)t^(−1/2)

Setting f′(t) = 0 gives 1 = (3/2)/√t, so √t = 3/2 and t = 9/4. (f′ is also undefined at t = 0, which is an endpoint.) Since 9/4 lies in [0, 4], it is a valid critical point.

Now evaluate f at the critical point and the endpoints of the domain:

f(0) = 0 − 3√0 = 0

f(9/4) = 9/4 − 3√(9/4) = 2.25 − 3(1.5) = −2.25

f(4) = 4 − 3√4 = 4 − 6 = −2

So the absolute maximum value of f is 0, attained at t = 0, and the absolute minimum value is −2.25, attained at t = 9/4.
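
A numeric check by scanning a fine grid over the domain [0, 4]:

import numpy as np

t = np.linspace(0, 4, 400001)
f = t - 3 * np.sqrt(t)
print(f.max(), t[f.argmax()])   # 0.0 at t = 0
print(f.min(), t[f.argmin()])   # -2.25 at t = 2.25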


Find the area of the region that is bounded by the given curve and lies in the specified sector. r=Sqrt(sin(theta)) 0 <= theta <= pi

The area of the region bounded by the curve and lying in the sector 0 ≤ θ ≤ π is 1 square unit.

The given curve is r = √(sin θ), where 0 ≤ θ ≤ π.

To find the area of the region bounded by this curve and lying in the specified sector, use the formula for the area of a polar region:

A = (1/2) ∫ from a to b of f(θ)² dθ

where r = f(θ) is the polar equation of the curve and [a, b] is the interval of θ values that corresponds to the desired sector.

In this case, f(θ) = √(sin θ) and [a, b] = [0, π]. Therefore:

A = (1/2) ∫ from 0 to π of (√(sin θ))² dθ

= (1/2) ∫ from 0 to π of sin θ dθ

= (1/2) [−cos θ] evaluated from 0 to π

= (1/2) (−cos π + cos 0)

= (1/2)(1 + 1) = 1

Therefore, the area of the region is 1 square unit.
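
A numeric confirmation of the integral with scipy.integrate.quad:

from math import sin, pi
from scipy.integrate import quad

area, _ = quad(lambda theta: 0.5 * sin(theta), 0, pi)
print(area)   # 1.0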


Help AGAIN! Which one is cheaper, and by how much? View attachment below.

Answer: Website A is cheaper, by £0.29.

Step-by-step explanation: Total each website's cost, then find the difference between the totals.

For Website A:

Net cost = £49.95 + £4.39 = £54.34

For Website B:

Net cost = £47.68 + £6.95 = £54.63

Therefore Website A is cheaper, by £54.63 − £54.34 = £0.29.


-10.4166666667 as a fraction

Let n = −10.41666..., where the 6 repeats.

Multiply by 100 so the repeating digits line up after the decimal point:

100n = −1041.666...

Multiply the original n by 10:

10n = −104.1666...

Subtract 10n from 100n; the repeating tails cancel:

90n = −1041.666... − (−104.1666...) = −937.5

Solve for n:

n = −937.5/90 = −9375/900

Simplifying gives n = −125/12.
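
A cross-check with Python's fractions module (the float below is the rounded decimal from the question):

from fractions import Fraction

print(Fraction(-10.4166666667).limit_denominator(1000))   # -125/12
print(-125 / 12)                                          # -10.416666666666666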

Let W be the region bounded by the cylinders z = 1 − y² and y = x², and the planes z = 0 and y = 1. Calculate the volume of W as a triple integral in the three orders dz dy dx, dx dz dy, and dy dz dx. I'm having trouble figuring out the limits of integration, but I do understand that I should get the same volume in all three orders, since the order doesn't matter.

With the limits worked out below, the three orders are:

dz dy dx: V = ∫ from x = −1 to 1, ∫ from y = x² to 1, ∫ from z = 0 to 1 − y², of dz dy dx

dx dz dy: V = ∫ from y = 0 to 1, ∫ from z = 0 to 1 − y², ∫ from x = −√y to √y, of dx dz dy

dy dz dx: V = ∫ from x = −1 to 1, ∫ from z = 0 to 1 − x⁴, ∫ from y = x² to √(1 − z), of dy dz dx

To see where the limits come from, describe W by the inequalities x² ≤ y ≤ 1 and 0 ≤ z ≤ 1 − y².

1. dz dy dx: For fixed x and y, z runs from the plane z = 0 up to the cylinder z = 1 − y². For fixed x, y runs from the parabola y = x² to the plane y = 1. The curves y = x² and y = 1 meet at x = ±1, so x runs from −1 to 1.

2. dx dz dy: Solving y = x² for x gives x = ±√y, so for fixed y and z, x runs from −√y to √y. For fixed y, z runs from 0 to 1 − y², and y runs from 0 to 1.

3. dy dz dx: Here the middle (z) limits may depend only on x, so y must be eliminated from them. For fixed x and z, y must satisfy both y ≥ x² and z ≤ 1 − y², that is, y ≤ √(1 − z); so y runs from x² to √(1 − z). This range is nonempty only when x² ≤ √(1 − z), i.e. z ≤ 1 − x⁴, so z runs from 0 to 1 − x⁴, and x runs from −1 to 1.

All three orders give the same volume. Evaluating the first, for instance:

V = ∫ from −1 to 1, ∫ from x² to 1, of (1 − y²) dy dx = ∫ from −1 to 1 (2/3 − x² + x⁶/3) dx = 4/3 − 2/3 + 2/21 = 16/21.
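
A numeric check with scipy.integrate.tplquad, which integrates the innermost variable first (the integrand's arguments are ordered innermost to outermost):

from math import sqrt
from scipy.integrate import tplquad

# dz dy dx: z innermost, x outermost
V1, _ = tplquad(lambda z, y, x: 1, -1, 1,
                lambda x: x**2, lambda x: 1,
                lambda x, y: 0, lambda x, y: 1 - y**2)
# dx dz dy: x innermost, y outermost
V2, _ = tplquad(lambda x, z, y: 1, 0, 1,
                lambda y: 0, lambda y: 1 - y**2,
                lambda y, z: -sqrt(y), lambda y, z: sqrt(y))
# dy dz dx: y innermost, x outermost
V3, _ = tplquad(lambda y, z, x: 1, -1, 1,
                lambda x: 0, lambda x: 1 - x**4,
                lambda x, z: x**2, lambda x, z: sqrt(1 - z))

print(V1, V2, V3)   # all ~0.76190..., i.e. 16/21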


for what values of a and c will the graph of f(x)=ax^2+c have one x intercept?

The graph of f(x) = ax² + c has exactly one x-intercept when a ≠ 0 and c = 0.

Case 1: a = 0. Then f(x) = c for every x. If c ≠ 0, the graph never meets the x-axis (no intercepts); if c = 0, the graph is the x-axis itself (infinitely many intercepts). Either way there is not exactly one intercept.

Case 2: a ≠ 0. Setting ax² + c = 0 gives x² = −c/a. If −c/a > 0, there are two intercepts, x = ±√(−c/a); if −c/a < 0, there are none; and if c = 0, the only solution is x = 0, a single intercept at the vertex.

Answer: a ≠ 0 and c = 0.

Find the solutions of each of the following systems of linear congruences. a) 2x + 3y ≡ 5 (mod 7) b) 4x + y ≡ 5 (mod 7), x + 5y ≡ 6 (mod 7), x + 2y ≡ 4 (mod 7)

a) 2x + 3y ≡ 5 (mod 7)

This is a single congruence in two unknowns, so it has a one-parameter family of solutions modulo 7. Solve for y: 3y ≡ 5 − 2x (mod 7). Since 3 · 5 = 15 ≡ 1 (mod 7), the inverse of 3 modulo 7 is 5, so

y ≡ 5(5 − 2x) ≡ 25 − 10x ≡ 4 + 4x (mod 7).

The solutions are the seven pairs (x, y) ≡ (t, 4 + 4t) (mod 7) for t = 0, 1, ..., 6, namely (0, 4), (1, 1), (2, 5), (3, 2), (4, 6), (5, 3), (6, 0). As a check, (0, 4) gives 2·0 + 3·4 = 12 ≡ 5 (mod 7).

b) 4x + y ≡ 5 (mod 7), x + 5y ≡ 6 (mod 7), x + 2y ≡ 4 (mod 7)

From the first congruence, y ≡ 5 − 4x (mod 7). Substituting into the second:

x + 5(5 − 4x) ≡ 6 (mod 7)

−19x + 25 ≡ 6 (mod 7)

−19x ≡ −19 (mod 7), and since −19 ≡ 2 (mod 7), this is 2x ≡ 2 (mod 7)

Because gcd(2, 7) = 1, x ≡ 1 (mod 7), and then y ≡ 5 − 4·1 ≡ 1 (mod 7). So the first two congruences force (x, y) ≡ (1, 1) (mod 7). Checking the third congruence: 1 + 2·1 = 3 ≢ 4 (mod 7). As written, then, the three congruences are mutually inconsistent, and the system in (b) has no solution.
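
Since everything is modulo 7, a brute-force check over all 49 residue pairs confirms both answers:

sols_a = [(x, y) for x in range(7) for y in range(7)
          if (2*x + 3*y) % 7 == 5]
sols_b = [(x, y) for x in range(7) for y in range(7)
          if (4*x + y) % 7 == 5 and (x + 5*y) % 7 == 6 and (x + 2*y) % 7 == 4]

print(sols_a)   # [(0, 4), (1, 1), (2, 5), (3, 2), (4, 6), (5, 3), (6, 0)]
print(sols_b)   # [] -- the three congruences are inconsistent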


Let p(n) be the predicate "whenever 2n + 1 players stand at distinct pairwise distances and play arena dodgeball, there is always at least one survivor." Prove this by induction.

This is the classic dodgeball puzzle, which implicitly assumes that each player throws one ball at their nearest opponent; because all pairwise distances are distinct, each player's nearest opponent is well defined. We prove p(n) for all n ≥ 1 by induction.

Base case (n = 1): There are 3 players. Let A and B be the closest pair; each is the other's nearest opponent, so they throw at each other. The third player, C, throws at A or B, and nobody throws at C, so C survives.

Inductive step: Assume p(k): any such game with 2k + 1 players has at least one survivor. Consider a game with 2(k + 1) + 1 = 2k + 3 players, and let A and B be the closest pair; as before, A and B throw at each other. There are two cases.

Case 1: Some third player also throws at A or B. Then at least three of the 2k + 3 throws land on A and B, so the remaining 2k + 1 players receive at most 2k throws among them. By the pigeonhole principle, at least one of them is not hit, and that player survives.

Case 2: No other player throws at A or B. Then every other player's nearest opponent lies among the remaining 2k + 1 players, so removing A and B leaves a valid game of 2k + 1 players with distinct pairwise distances and exactly the same throws. By the inductive hypothesis this smaller game has a survivor, and that player also survives the full game, since the only throws removed (A's and B's) hit each other.

In both cases there is a survivor, so p(k) implies p(k + 1). Since p(1) is true, by induction p(n) is true for all positive integers n.
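
The theorem can also be spot-checked by simulation under the nearest-opponent rule; random positions have distinct pairwise distances with probability 1.

import random

def survivors(n_players):
    pts = [(random.random(), random.random()) for _ in range(n_players)]
    dist2 = lambda p, q: (p[0] - q[0])**2 + (p[1] - q[1])**2
    hit = set()
    for i, p in enumerate(pts):
        # each player throws at their nearest opponent
        target = min((j for j in range(n_players) if j != i),
                     key=lambda j: dist2(p, pts[j]))
        hit.add(target)
    return n_players - len(hit)   # players nobody threw at

print(min(survivors(2 * k + 1) for k in range(1, 50)))   # always >= 1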


Given that A is the matrix

| 2  4 -7 |
| -4  7  3 |
| -1 -5 -1 |

the cofactor expansion of the determinant of A along column 1 is det(A) = a1·|A1| + a2·|A2| + a3·|A3|, where a1 = __, a2 = __, a3 = __, and A1 = __.

The entries of column 1 are a1 = 2, a2 = −4, a3 = −1. Each minor Ai is obtained by deleting row i and column 1 of A:

A1 = | 7  3 |
     | -5 -1 |

A2 = | 4 -7 |
     | -5 -1 |

A3 = | 4 -7 |
     |  7  3 |

Along column 1 the cofactor signs are (−1)^(i+1), that is, +, −, +, so the expansion is

det(A) = 2·|A1| − (−4)·|A2| + (−1)·|A3| = 2·|A1| + 4·|A2| − |A3|.

Evaluating the 2×2 determinants: |A1| = 7(−1) − 3(−5) = 8, |A2| = 4(−1) − (−7)(−5) = −39, and |A3| = 4·3 − (−7)·7 = 61. Therefore det(A) = 2(8) + 4(−39) − 61 = 16 − 156 − 61 = −201.
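
A NumPy cross-check of the determinant:

import numpy as np

A = np.array([[2, 4, -7],
              [-4, 7, 3],
              [-1, -5, -1]])
print(np.linalg.det(A))   # -201.0 (up to floating-point rounding)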


McMullen and Mulligan, CPAs, were conducting the audit of Cusick Machine Tool Company for the year ended December 31. Jim Sigmund, senior-in-charge of the audit, plans to use MUS to audit Cusick's inventory account. The balance at December 31 was $9,000,000. Required: a. Based on the following information, compute the required MUS sample size and sampling interval using Table 8-5. (Use the tables, not IDEA, to solve these problems; round your interval answer to the nearest whole number.) Tolerable misstatement = $360,000; Expected misstatement = $90,000; Risk of incorrect acceptance = 5%.

Using MUS for auditing the inventory account of Cusick Machine Tool Company, the required sample size is 156 and the sampling interval is $57,692. The auditor can use this information to select the sample and test the account accurately.

In this scenario, Jim Sigmund, a senior in charge of the audit, plans to use MUS ( Monetary Unit Sampling ) to audit Cusick Machine Tool Company's inventory account. The MUS is a statistical sampling method that uses monetary units as a basis for selecting a sample.

The sample size is determined by considering the tolerable misstatement, expected misstatement, and the risk of incorrect acceptance. To determine the required MUS sample size and sampling interval, we need to use Table 8-5. The tolerable misstatement is the maximum amount of error that the auditor is willing to accept without modifying the opinion.

In this case, it is $360,000. The expected misstatement is the auditor's estimate of the amount of misstatement in the account. Here, it is $90,000. The risk of incorrect acceptance is the auditor's assessment of the risk that the sample will not identify a misstatement that exceeds the tolerable misstatement. It is 5%.

Using Table 8-5, we can find the sample size and sampling interval based on these factors. The sample size is 156, and the sampling interval is $57,692. This means that every $57,692 in the inventory account is a sampling interval, and the auditor needs to select 156 monetary units for testing.

The calculation for the sampling interval is determined by dividing the recorded balance of the account by the sample size. In this case, the recorded balance is $9,000,000, and the sample size is 156. Thus, the sampling interval is $57,692 ($9,000,000 / 156).



Which answer describes the transformation of f(x) = x² − 1 to g(x) = (x + 4)² − 1? A. a vertical stretch by a factor of 4 B. a horizontal translation 4 units to the left C. a vertical translation 4 units down D. a horizontal translation 4 units to the right

The transformation from f(x) = x² − 1 to g(x) = (x + 4)² − 1 is a horizontal translation 4 units to the left.

Therefore, the answer is B: a horizontal translation 4 units to the left.

We can see this by comparing the two functions: g(x) is the same as f(x) except that the argument of the squared term has been replaced by (x + 4). Replacing x with x + 4 shifts a graph horizontally 4 units to the left; for example, the vertex moves from (0, −1) to (−4, −1).

A function is a mathematical relationship between two variables, typically written f(x). It takes an input value x and produces an output value y according to a specific rule or equation. The input value x is called the independent variable, the output value y is called the dependent variable, and the rule that determines how the input is transformed into the output is called the function's formula or expression.


Find the critical points of the function f(x) = 10x^(2/3) + x^(5/3). Enter your answers in increasing order. If the number of critical points is less than the number of response areas, enter NA in the remaining response areas. x = __, x = __

Reading the exponents as 2/3 and 5/3, so that f(x) = 10x^(2/3) + x^(5/3), the critical points in increasing order are x = −4 and x = 0.

Critical points occur where the derivative is zero or undefined (while f itself is defined). Differentiating:

f′(x) = (20/3)x^(−1/3) + (5/3)x^(2/3) = (5/3)x^(−1/3)(4 + x)

f′(x) = 0 when 4 + x = 0, giving x = −4, and f′(x) is undefined at x = 0 even though f(0) = 0 is defined, so x = 0 is also a critical point. In increasing order: x = −4, x = 0.


Find a unit normal vector for the surface z = f(x, y) = x³ at the point P(−3, −1, 27). (The grading system says the z-component should be negative.)

As printed, the point P(−3, −1, 27) does not actually lie on the surface z = x³ (since (−3)³ = −27), so part of the problem statement appears garbled; the method, however, is the same either way. For a surface z = f(x, y), a normal vector with negative z-component is n = (f_x, f_y, −1). With f(x, y) = x³ we have f_x = 3x² and f_y = 0, so at x = −3, n = (27, 0, −1), and the unit normal is

N = (27, 0, −1)/√(27² + 0² + (−1)²) = (27, 0, −1)/√730 ≈ (0.9993, 0, −0.0370).


working together, evan and ellie can do the garden chores in 6 hours. it takes evan twice as long as ellie to do the work alone. how many hours does it take evan working alone?

Working together, Evan and Ellie can do the garden chores in 6 hours, and it takes Evan twice as long as Ellie to do the work alone. Thus it takes Evan 18 hours to do the work alone.

Let x be the number of hours Ellie takes to do the garden chores alone. Then Evan takes 2x hours to do the same work alone. Their work rates are: Ellie, 1/x of the job per hour; Evan, 1/(2x) of the job per hour.

Now, we know that if they work together, they can do the garden chores in 6 hours. This means that their combined work rate is 1/6 of the job per hour.

When they work together, their work rates add up: 1/x + 1/(2x) = 1/6. The left-hand side is 2/(2x) + 1/(2x) = 3/(2x), so 3/(2x) = 1/6. Cross-multiplying gives 2x = 18, so x = 9. Ellie takes 9 hours to complete the garden chores alone, and since Evan takes twice as long, he needs 2 × 9 = 18 hours.
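
The same equation can be handed to a computer algebra system; a minimal SymPy sketch:

```python
# Solve the combined work-rate equation 1/x + 1/(2x) = 1/6 for Ellie's solo time.

import sympy as sp

x = sp.symbols('x', positive=True)  # Ellie's time alone, in hours
solution = sp.solve(sp.Eq(1 / x + 1 / (2 * x), sp.Rational(1, 6)), x)
print(solution)  # [9] -> Ellie: 9 hours, so Evan: 2 * 9 = 18 hours
```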


For two programs at a university, the type of student for two majors is as follows. Find the probability a student is a science major, given they are a graduate student.

To find the probability that a student is a science major given that they are a graduate student, we need to use Bayes' theorem:

P(Science | Graduate) = P(Graduate | Science) * P(Science) / P(Graduate)

We know that P(Science) = 0.45 and P(Liberal Arts) = 0.55, and that P(Graduate | Science) = 0.35 and P(Graduate | Liberal Arts) = 0.25. We also know that the total probability of being a graduate student is:

P(Graduate) = P(Graduate | Science) * P(Science) + P(Graduate | Liberal Arts) * P(Liberal Arts)

Plugging in the values, we get:

P(Graduate) = 0.35 * 0.45 + 0.25 * 0.55 = 0.305

Now we can calculate the probability of being a science major given that the student is a graduate student:

P(Science | Graduate) = 0.35 * 0.45 / 0.305 = 0.1575 / 0.305 ≈ 0.516

Therefore, the probability that a student is a science major, given they are a graduate student, is approximately 0.516.
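
The arithmetic is easy to reproduce; this sketch just restates the computation above in code, using the probabilities given in the problem.

```python
# Bayes' theorem with the stated probabilities.

p_science = 0.45
p_liberal_arts = 0.55
p_grad_given_science = 0.35
p_grad_given_liberal_arts = 0.25

# Law of total probability for P(Graduate)
p_grad = (p_grad_given_science * p_science
          + p_grad_given_liberal_arts * p_liberal_arts)  # 0.305

# Bayes' theorem for P(Science | Graduate)
p_science_given_grad = p_grad_given_science * p_science / p_grad
print(round(p_science_given_grad, 3))  # 0.516
```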

find p(2 < x1 < 2x2 < 5). find p(x1 + 6 > 2x2).

The answers are P(2 < X1 < 2X2 < 5) = 0 and P(X1 + 6 > 2X2) = 1 under the Uniform(0, 1) assumption used below.

As posted, the question does not state the joint distribution of X1 and X2, so we follow the assumption made in the original solution: X1 and X2 are independent and uniformly distributed between 0 and 1. Note first that a probability is always a number between 0 and 1, so a value such as 15/2 cannot be a probability.

First probability: the event 2 < X1 < 2X2 < 5 requires X1 > 2. A Uniform(0, 1) random variable never exceeds 1, so the event is impossible and P(2 < X1 < 2X2 < 5) = 0.

Second probability: since X1 ≥ 0 we have X1 + 6 ≥ 6, while 2X2 ≤ 2. The inequality X1 + 6 > 2X2 therefore always holds, and P(X1 + 6 > 2X2) = 1.

If X1 and X2 instead have some other joint density, each probability would be computed by integrating that density over the region defined by the corresponding inequalities, and the answers could differ.
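
A Monte Carlo simulation under the same Uniform(0, 1) assumption confirms both values; this is a sketch, not part of the original solution.

```python
# Monte Carlo check of the two probabilities when X1, X2 ~ Uniform(0, 1).

import random

random.seed(0)
n = 100_000
hits_first = 0   # event: 2 < x1 < 2*x2 < 5
hits_second = 0  # event: x1 + 6 > 2*x2

for _ in range(n):
    x1, x2 = random.random(), random.random()
    if 2 < x1 < 2 * x2 < 5:
        hits_first += 1
    if x1 + 6 > 2 * x2:
        hits_second += 1

print(hits_first / n)   # 0.0 -> the first event is impossible
print(hits_second / n)  # 1.0 -> the second event always occurs
```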


{xyz | x, z ∈ Σ* and y ∈ Σ*1Σ*, where |x| = |z| ≥ |y|}

The expression {xyz | x, z ∈ Σ* and y ∈ Σ*1Σ*, where |x| = |z| ≥ |y|} represents a set of strings that can be formed by concatenating three substrings: x, y, and z.

The strings in the set must satisfy the following conditions:

• x and z are arbitrary strings over Σ of equal length (|x| = |z|);
• y contains at least one occurrence of the symbol 1, since y ∈ Σ*1Σ* means a (possibly empty) string, then a 1, then another (possibly empty) string;
• the middle part is no longer than the outer parts (|x| = |z| ≥ |y|).

Intuitively, this set represents all the strings that can be split into a "core" y containing a 1, flanked by two equal-length strings that are each at least as long as the core. For example, if Σ = {0, 1}, the string "00100" is in the set with x = "00", y = "1", and z = "00": here |x| = |z| = 2 ≥ |y| = 1 and y contains a 1. The split x = "00", y = "111", z = "00" of "0011100" would violate the length condition (|y| = 3 > 2), but that string still belongs to the set via the split x = "001", y = "1", z = "100".
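
A brute-force membership test makes the three conditions concrete; in_language is a hypothetical helper written for this sketch, not part of any standard library.

```python
# Brute-force membership test for { xyz | x, z in S*, y in S* 1 S*, |x| = |z| >= |y| }.

def in_language(w: str) -> bool:
    n = len(w)
    # try every split w = x y z with |x| = |z| = k and |y| = n - 2k
    for k in range(n // 2 + 1):
        x, y, z = w[:k], w[k:n - k], w[n - k:]
        if len(y) <= k and '1' in y:  # |x| = |z| >= |y| and y contains a 1
            return True
    return False

# "00100":   x = "00",  y = "1", z = "00"  -> in the language
# "0011100": x = "001", y = "1", z = "100" -> in the language
# "00000":   no 1 anywhere, so no valid y  -> not in the language
print(in_language("00100"), in_language("0011100"), in_language("00000"))
# True True False
```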


Kira's backyard has a patio and a garden. Find the area of the garden. (Sides meet at right angles.)

  18 square yards

You want the area of a garden that fills a back yard that is 4 yd by 6 yd, except for a patio that is 3 yd by 2 yd.

The area of the backyard is ...

  A = LW = (6 yd)(4 yd) = 24 yd²

The area of the patio is ...

  A = LW = (3 yd)(2 yd) = 6 yd²

Garden area

The garden area is the area of the backyard that is not taken up by the patio:

  24 yd² − 6 yd² = 18 yd²

The garden covers 18 square yards.

Additional comment

You can compute this many ways. You can divide the garden area into rectangles or trapezoids, or you can recognize that the garden is 3/4 of the area of the back yard.

(You get two trapezoids by cutting the garden along a line between the upper left corner of the yard and the upper left corner of the patio.)
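
In code, the subtraction looks like this (a trivial sketch of the arithmetic above):

```python
# Garden area = back yard area minus patio area.

yard_area = 6 * 4    # 6 yd x 4 yd = 24 yd^2
patio_area = 3 * 2   # 3 yd x 2 yd = 6 yd^2
print(yard_area - patio_area)  # 18 square yards
```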

develop a model for trend and seasonality. please clearly define your variables. how many independent variables do you have in your regression?

A standard model for trend and seasonality is the Seasonal-Trend decomposition using Loess (STL).

The variables in the model are time (t), trend (Tt), seasonality (St), and remainder (Rt), combined additively as yt = Tt + St + Rt.

The number of independent variables depends on the frequency of the data and on how the seasonality is encoded, as explained below.

STL itself is a decomposition method rather than a regression, so to count independent variables we express the same structure as a regression: a trend term plus seasonal terms.

The variables in the model are:

• yt: the observed value at time t (the dependent variable);
• t: the time index, which carries the trend;
• seasonal terms: either indicator (dummy) variables for each season or harmonic (sine/cosine) terms.

With dummy coding, one season is absorbed by the intercept, so monthly seasonality contributes 11 indicator variables plus the trend term, and day-of-year seasonality would contribute 364; counts of 12 and 365 arise only if no intercept is used. With harmonic coding, q sine/cosine pairs contribute 2q seasonal regressors (q is usually 1 or 2) plus the trend term, which is far more parsimonious for high-frequency data.
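
If one wants to fit this in software rather than by hand, statsmodels provides an STL implementation; the series below is placeholder data, and only the call pattern matters.

```python
# STL decomposition of a monthly series with statsmodels (placeholder data).

import pandas as pd
from statsmodels.tsa.seasonal import STL

y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118] * 4,
    index=pd.date_range("2020-01-31", periods=48, freq="M"),
)

result = STL(y, period=12).fit()  # Tt = result.trend, St = result.seasonal,
                                  # Rt = result.resid
print(result.trend.head())
```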

