
8.4: Small Sample Tests for a Population Mean



Learning Objectives

  • To learn how to apply the five-step test procedure for tests of hypotheses concerning a population mean when the sample size is small.

In the previous section hypothesis testing for population means was described in the case of large samples. The statistical validity of the tests was ensured by the Central Limit Theorem, with essentially no assumptions on the distribution of the population. When sample sizes are small, as is often the case in practice, the Central Limit Theorem does not apply. One must then impose stricter assumptions on the population to give statistical validity to the test procedure. One common assumption is that the population from which the sample is taken has a normal probability distribution to begin with. Under such circumstances, if the population standard deviation is known, then the test statistic

\[\frac{(\bar{x}-\mu _0)}{\sigma /\sqrt{n}} \nonumber \]

still has the standard normal distribution, as in the previous two sections. If \(\sigma\) is unknown and is approximated by the sample standard deviation \(s\), then the resulting test statistic

\[\dfrac{(\bar{x}-\mu _0)}{s/\sqrt{n}} \nonumber \]

follows Student’s \(t\)-distribution with \(n-1\) degrees of freedom.

Standardized Test Statistics for Small Sample Hypothesis Tests Concerning a Single Population Mean

 If \(\sigma\) is known: \[Z=\frac{\bar{x}-\mu _0}{\sigma /\sqrt{n}} \nonumber \]

If \(\sigma\) is unknown: \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \]

  • The first test statistic (\(\sigma\) known) has the standard normal distribution.
  • The second test statistic (\(\sigma\) unknown) has Student’s \(t\)-distribution with \(n-1\) degrees of freedom.
  • The population must be normally distributed.
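
Either statistic is a one-line computation in software. Here is a minimal Python sketch of the \(\sigma\)-unknown case, using the summary numbers from Example \(\PageIndex{1}\) below:

```python
import math

def t_statistic(xbar, mu0, s, n):
    """Standardized test statistic T = (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / math.sqrt(n))

# Summary statistics from Example 1 below: n = 5, sample mean 169,
# sample standard deviation 10.39, hypothesized mean 179.
print(round(t_statistic(169, 179, 10.39, 5), 3))  # -2.152
```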

The distribution of the second standardized test statistic (the one containing \(s\)) and the corresponding rejection region for each form of the alternative hypothesis (left-tailed, right-tailed, or two-tailed) are shown in Figure \(\PageIndex{1}\). This is just like Figure 8.2.1 except that now the critical values are from the \(t\)-distribution. Figure 8.2.1 still applies to the first standardized test statistic (the one containing \(\sigma\)) since it follows the standard normal distribution.

Figure \(\PageIndex{1}\): Distribution of the Standardized Test Statistic and the Rejection Region

The \(p\)-value of a test of hypotheses for which the test statistic has Student’s \(t\)-distribution can be computed using statistical software, but it is impractical to do so using tables, since that would require \(30\) tables analogous to Figure 7.1.5, one for each degree of freedom from \(1\) to \(30\). Figure 7.1.6 can be used to approximate the \(p\)-value of such a test, and this is typically adequate for making a decision using the \(p\)-value approach to hypothesis testing, although not always. For this reason the tests in the two examples in this section will be made following the critical value approach to hypothesis testing summarized at the end of Section 8.1, but after each one we will show how the \(p\)-value approach could have been used.

Example \(\PageIndex{1}\)

The price of a popular tennis racket at a national chain store is \(\$179\). Portia bought five of the same racket at an online auction site for the following prices:

\[155\; 179\; 175\; 175\; 161 \nonumber \]

Assuming that the auction prices of rackets are normally distributed, determine whether there is sufficient evidence in the sample, at the \(5\%\) level of significance, to conclude that the average price of the racket is less than \(\$179\) if purchased at an online auction.

  • Step 1 . The assertion for which evidence must be provided is that the average online price \(\mu\) is less than the average price in retail stores, so the hypothesis test is \[H_0: \mu =179\\ \text{vs}\\ H_a: \mu <179\; @\; \alpha =0.05 \nonumber \]
  • Step 2 . The sample is small and the population standard deviation is unknown. Thus the test statistic is \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \] and has the Student \(t\)-distribution with \(n-1=5-1=4\) degrees of freedom.
  • Step 3 . From the data we compute \(\bar{x}=169\) and \(s=10.39\). Inserting these values into the formula for the test statistic gives \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}}=\frac{169-179}{10.39/\sqrt{5}}=-2.152 \nonumber \]
  • Step 4 . Since the symbol in \(H_a\) is “\(<\)” this is a left-tailed test, so there is a single critical value, \(-t_\alpha =-t_{0.05}[df=4]\). Reading from the row labeled \(df=4\) in Figure 7.1.6 its value is \(-2.132\). The rejection region is \((-\infty ,-2.132]\).
  • Step 5 . As shown in Figure \(\PageIndex{2}\) the test statistic falls in the rejection region. The decision is to reject \(H_0\). In the context of the problem our conclusion is:

The data provide sufficient evidence, at the \(5\%\) level of significance, to conclude that the average price of such rackets purchased at online auctions is less than \(\$179\).

Figure \(\PageIndex{2}\): Rejection Region and Test Statistic

To perform the test in Example \(\PageIndex{1}\) using the \(p\)-value approach, look in the row in Figure 7.1.6 with the heading \(df=4\) and search for the two \(t\)-values that bracket the unsigned value \(2.152\) of the test statistic. They are \(2.132\) and \(2.776\), in the columns with headings \(t_{0.050}\) and \(t_{0.025}\). They cut off right tails of area \(0.050\) and \(0.025\), so because \(2.152\) is between them it must cut off a tail of area between \(0.050\) and \(0.025\). By symmetry \(-2.152\) cuts off a left tail of area between \(0.050\) and \(0.025\), hence the \(p\)-value corresponding to \(t=-2.152\) is between \(0.025\) and \(0.05\). Although its precise value is unknown, it must be less than \(\alpha =0.05\), so the decision is to reject \(H_0\).
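
Statistical software also returns the exact \(p\)-value directly. A minimal sketch using SciPy (the `alternative` keyword assumes SciPy 1.6 or later):

```python
from scipy import stats

prices = [155, 179, 175, 175, 161]   # the five auction prices
t, p = stats.ttest_1samp(prices, popmean=179, alternative='less')
print(round(t, 3), round(p, 3))      # t = -2.152, p ≈ 0.049
# p < 0.05, so the decision agrees with the critical value approach: reject H0.
```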

Example \(\PageIndex{2}\)

A small component in an electronic device has two small holes where another tiny part is fitted. In the manufacturing process the average distance between the two holes must be tightly controlled at \(0.02\) mm, else many units would be defective and wasted. Many times throughout the day quality control engineers take a small sample of the components from the production line, measure the distance between the two holes, and make adjustments if needed. Suppose at one time four units are taken and the distances are measured as

\[0.021\; 0.019\; 0.023\; 0.020 \nonumber \]

Determine, at the \(1\%\) level of significance, if there is sufficient evidence in the sample to conclude that an adjustment is needed. Assume the distances of interest are normally distributed.

  • Step 1 . The assumption is that the process is under control unless there is strong evidence to the contrary. Since a deviation of the average distance to either side is undesirable, the relevant test is \[H_0: \mu =0.02\\ \text{vs}\\ H_a: \mu \neq 0.02\; @\; \alpha =0.01 \nonumber \] where \(\mu\) denotes the mean distance between the holes.
  • Step 2 . The sample is small and the population standard deviation is unknown. Thus the test statistic is \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \] and has the Student \(t\)-distribution with \(n-1=4-1=3\) degrees of freedom.
  • Step 3 . From the data we compute \(\bar{x}=0.02075\) and \(s=0.00171\). Inserting these values into the formula for the test statistic gives \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}}=\frac{0.02075-0.02}{0.00171/\sqrt{4}}=0.877 \nonumber \]
  • Step 4 . Since the symbol in \(H_a\) is “\(\neq\)” this is a two-tailed test, so there are two critical values, \(\pm t_{\alpha/2} =\pm t_{0.005}[df=3]\). Reading from the row in Figure 7.1.6 labeled \(df=3\) their values are \(\pm 5.841\). The rejection region is \((-\infty ,-5.841]\cup [5.841,\infty )\).
  • Step 5 . As shown in Figure \(\PageIndex{3}\) the test statistic does not fall in the rejection region. The decision is not to reject \(H_0\). In the context of the problem our conclusion is:

The data do not provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the mean distance between the holes in the component differs from \(0.02\) mm.

Figure \(\PageIndex{3}\): Rejection Region and Test Statistic

To perform the test in "Example \(\PageIndex{2}\)" using the \(p\)-value approach, look in the row in Figure 7.1.6 with the heading \(df=3\) and search for the two \(t\)-values that bracket the value \(0.877\) of the test statistic. Actually \(0.877\) is smaller than the smallest number in the row, which is \(0.978\), in the column with heading \(t_{0.200}\). The value \(0.978\) cuts off a right tail of area \(0.200\), so because \(0.877\) is to its left it must cut off a tail of area greater than \(0.200\). Thus the \(p\)-value, which is the double of the area cut off (since the test is two-tailed), is greater than \(0.400\). Although its precise value is unknown, it must be greater than \(\alpha =0.01\), so the decision is not to reject \(H_0\).

Key Takeaways

  • There are two formulas for the test statistic in testing hypotheses about a population mean with small samples. One test statistic follows the standard normal distribution, the other Student’s \(t\)-distribution.
  • The population standard deviation is used if it is known, otherwise the sample standard deviation is used.
  • Either five-step procedure, critical value or \(p\)-value approach, is used with either test statistic.


Hypothesis Test of a Proportion (Small Sample)

This lesson explains how to test a hypothesis about a proportion when a simple random sample has fewer than 10 successes or 10 failures - a situation that often occurs with small samples. (In a previous lesson, we showed how to conduct a hypothesis test for a proportion when a simple random sample includes at least 10 successes and 10 failures.)

The approach described in this lesson is appropriate, as long as the sample includes at least one success and one failure. The key steps are:

  • Formulate the hypotheses to be tested. This means stating the null hypothesis and the alternative hypothesis.
  • Determine the sampling distribution of the proportion. If the sample proportion is the outcome of a binomial experiment, the sampling distribution will be binomial. If it is the outcome of a hypergeometric experiment, the sampling distribution will be hypergeometric.
  • Specify the significance level. (Researchers often set the significance level equal to 0.05 or 0.01, although other values may be used.)
  • Based on the hypotheses, the sampling distribution, and the significance level, define the region of acceptance.
  • Test the null hypothesis. If the sample proportion falls within the region of acceptance, do not reject the null hypothesis; otherwise, reject the null hypothesis.

The following examples illustrate how to test hypotheses with small samples. The first example involves a binomial experiment; the second, a hypergeometric experiment.

Example 1: Sampling With Replacement

Suppose an urn contains 30 marbles. Some marbles are red, and the rest are green. A researcher hypothesizes that the urn contains 15 or more red marbles. The researcher randomly samples five marbles, with replacement, from the urn. Two of the selected marbles are red, and three are green. Based on the sample results, should the researcher reject the null hypothesis? Use a significance level of 0.20.

Solution: There are five steps in conducting a hypothesis test, as described in the previous section. We work through each of the five steps below:

  • State the hypotheses. The null and alternative hypotheses are:

Null hypothesis: P >= 0.50

Alternative hypothesis: P < 0.50

  • Determine the sampling distribution. Given those inputs (a binomial distribution where the true population proportion is equal to 0.50), the sampling distribution of the proportion can be determined. It appears in the table below, which shows individual probabilities for single events and cumulative probabilities for multiple events. (Elsewhere on this website, we showed how to compute binomial probabilities that form the body of the table.)

Number of red marbles in sample | Sample proportion | Probability | Cumulative probability
0 | 0.0 | 0.03125 | 0.03125
1 | 0.2 | 0.15625 | 0.1875
2 | 0.4 | 0.3125 | 0.5
3 | 0.6 | 0.3125 | 0.8125
4 | 0.8 | 0.15625 | 0.96875
5 | 1.0 | 0.03125 | 1.00

  • Specify significance level. The significance level was set at 0.20. (This means that the probability of making a Type I error is 0.20, assuming that the null hypothesis is true.)

  • Define the region of acceptance. Because the binomial distribution is discrete, we cannot define a region of acceptance whose significance level is exactly 0.20. However, we can define a region of acceptance for which the significance level would be no more than 0.20. From the table, we see that if the true population proportion is equal to 0.50, we would be very unlikely to pick 0 or 1 red marble in our sample of 5 marbles. The probability of selecting 1 or 0 red marbles would be 0.1875. Therefore, if we let the significance level equal 0.1875, we can define the region of rejection as any sampled outcome that includes only 0 or 1 red marble (i.e., a sampled proportion equal to 0 or 0.20). We can define the region of acceptance as any sampled outcome that includes at least 2 red marbles. This is equivalent to a sampled proportion that is greater than or equal to 0.40.

  • Test the null hypothesis. Since the sample proportion (0.40) is within the region of acceptance, we cannot reject the null hypothesis.
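
A short sketch of the binomial computations behind this example, assuming SciPy is available:

```python
from scipy.stats import binom

n, p0 = 5, 0.50                 # marbles drawn; proportion under H0
for k in range(n + 1):          # reproduces the table above
    print(k, k / n, binom.pmf(k, n, p0), binom.cdf(k, n, p0))

print(binom.cdf(1, n, p0))      # P(0 or 1 red marbles) = 0.1875
# The observed count (2 red marbles, proportion 0.40) lies in the region of
# acceptance {2, 3, 4, 5}, so the null hypothesis P >= 0.50 is not rejected.
```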

Example 2: Sampling Without Replacement

The Acme Advertising company has 25 clients. Account executives at Acme claim that 80 percent of these clients are very satisfied with the service they receive. To test that claim, Acme's CEO commissions a survey of 10 clients. Survey participants are randomly sampled, without replacement, from the client population. Six of the ten sampled customers (i.e., 60 percent) say that they are very satisfied. Based on the sample results, should the CEO accept or reject the hypothesis that 80 percent of Acme's clients are very satisfied? Use a significance level of 0.10.

  • State the hypotheses. The null and alternative hypotheses are:

Null hypothesis: P >= 0.80

Alternative hypothesis: P < 0.80

  • Determine the sampling distribution. Given those inputs (a hypergeometric distribution where 20 of 25 clients are very satisfied), the sampling distribution of the proportion can be determined. It appears in the table below, which shows individual probabilities for single events and cumulative probabilities for multiple events. (Elsewhere on this website, we showed how to compute hypergeometric probabilities that form the body of the table.)

Number of satisfied clients in sample | Sample proportion | Probability | Cumulative probability
4 or fewer | 0.4 or less | 0.00 | 0.00
5 | 0.5 | 0.00474 | 0.00474
6 | 0.6 | 0.05929 | 0.06403
7 | 0.7 | 0.23715 | 0.30119
8 | 0.8 | 0.38538 | 0.68656
9 | 0.9 | 0.25692 | 0.94348
10 | 1.0 | 0.05652 | 1.00

  • Specify significance level. The significance level was set at 0.10. (This means that the probability of making a Type I error is 0.10, assuming that the null hypothesis is true.)

  • Define the region of acceptance. Because the hypergeometric distribution is discrete, we cannot define a region of acceptance whose significance level is exactly 0.10. However, we can define a region of acceptance for which the significance level would be no more than 0.10. From the table, we see that if the true proportion of very satisfied clients is equal to 0.80, we would be very unlikely to have fewer than 7 very satisfied clients in our sample. The probability of having 6 or fewer very satisfied clients in the sample would be 0.064. Therefore, if we let the significance level equal 0.064, we can define the region of rejection as any sampled outcome that includes 6 or fewer very satisfied customers. We can define the region of acceptance as any sampled outcome that includes 7 or more very satisfied customers. This is equivalent to a sample proportion that is greater than or equal to 0.70.

  • Test the null hypothesis. Since the sample proportion (0.60) is outside the region of acceptance, we reject the null hypothesis at the 0.064 level of significance.
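
The corresponding hypergeometric computation, as a minimal SciPy sketch:

```python
from scipy.stats import hypergeom

# 25 clients in all, 20 "very satisfied" under H0, 10 sampled without
# replacement; SciPy's argument order is (k, M, n, N).
M, K, n = 25, 20, 10
print(round(hypergeom.cdf(6, M, K, n), 5))  # P(6 or fewer satisfied) = 0.06403
# The observed count (6 satisfied clients) falls in the rejection region
# {0, ..., 6}, so H0: P >= 0.80 is rejected at the 0.064 level of significance.
```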

Exercises

Basic

Find the rejection region (for the standardized test statistic) for each hypothesis test based on the information given. The population is normally distributed.

  • \(H_0: \mu =27\) vs. \(H_a: \mu <27\; @\; \alpha =0.05\), \(n=12\), \(\sigma =2.2\).
  • \(H_0: \mu =52\) vs. \(H_a: \mu \neq 52\; @\; \alpha =0.05\), \(n=6\), \(\sigma\) unknown.
  • \(H_0: \mu =-105\) vs. \(H_a: \mu >-105\; @\; \alpha =0.10\), \(n=24\), \(\sigma\) unknown.
  • \(H_0: \mu =78.8\) vs. \(H_a: \mu \neq 78.8\; @\; \alpha =0.10\), \(n=8\), \(\sigma =1.7\).
  • \(H_0: \mu =17\) vs. \(H_a: \mu <17\; @\; \alpha =0.01\), \(n=26\), \(\sigma =0.94\).
  • \(H_0: \mu =880\) vs. \(H_a: \mu \neq 880\; @\; \alpha =0.01\), \(n=4\), \(\sigma\) unknown.
  • \(H_0: \mu =-12\) vs. \(H_a: \mu >-12\; @\; \alpha =0.05\), \(n=18\), \(\sigma =1.1\).
  • \(H_0: \mu =21.1\) vs. \(H_a: \mu \neq 21.1\; @\; \alpha =0.05\), \(n=23\), \(\sigma\) unknown.

Find the rejection region (for the standardized test statistic) for each hypothesis test based on the information given. The population is normally distributed. Identify the test as left-tailed, right-tailed, or two-tailed.

  • \(H_0: \mu =141\) vs. \(H_a: \mu <141\; @\; \alpha =0.20\), \(n=29\), \(\sigma\) unknown.
  • \(H_0: \mu =-54\) vs. \(H_a: \mu <-54\; @\; \alpha =0.05\), \(n=15\), \(\sigma =1.9\).
  • \(H_0: \mu =98.6\) vs. \(H_a: \mu \neq 98.6\; @\; \alpha =0.05\), \(n=12\), \(\sigma\) unknown.
  • \(H_0: \mu =3.8\) vs. \(H_a: \mu >3.8\; @\; \alpha =0.001\), \(n=27\), \(\sigma\) unknown.
  • \(H_0: \mu =-62\) vs. \(H_a: \mu \neq -62\; @\; \alpha =0.005\), \(n=8\), \(\sigma\) unknown.
  • \(H_0: \mu =73\) vs. \(H_a: \mu >73\; @\; \alpha =0.001\), \(n=22\), \(\sigma\) unknown.
  • \(H_0: \mu =1124\) vs. \(H_a: \mu <1124\; @\; \alpha =0.001\), \(n=21\), \(\sigma\) unknown.
  • \(H_0: \mu =0.12\) vs. \(H_a: \mu \neq 0.12\; @\; \alpha =0.001\), \(n=14\), \(\sigma =0.026\).

A random sample of size 20 drawn from a normal population yielded the following results: \(\bar{x}=49.2\), \(s=1.33\).

  • Test \(H_0: \mu =50\) vs. \(H_a: \mu \neq 50\; @\; \alpha =0.01\).
  • Estimate the observed significance of the test in part (a) and state a decision based on the \(p\)-value approach to hypothesis testing.

A random sample of size 16 drawn from a normal population yielded the following results: \(\bar{x}=-0.96\), \(s=1.07\).

  • Test \(H_0: \mu =0\) vs. \(H_a: \mu <0\; @\; \alpha =0.001\).

A random sample of size 8 drawn from a normal population yielded the following results: \(\bar{x}=289\), \(s=46\).

  • Test \(H_0: \mu =250\) vs. \(H_a: \mu >250\; @\; \alpha =0.05\).

A random sample of size 12 drawn from a normal population yielded the following results: \(\bar{x}=86.2\), \(s=0.63\).

  • Test \(H_0: \mu =85.5\) vs. \(H_a: \mu \neq 85.5\; @\; \alpha =0.01\).

Applications

Researchers wish to test the efficacy of a program intended to reduce the length of labor in childbirth. The accepted mean labor time in the birth of a first child is 15.3 hours. The mean length of the labors of 13 first-time mothers in a pilot program was 8.8 hours with standard deviation 3.1 hours. Assuming a normal distribution of times of labor, test at the 10% level of significance whether the mean labor time for all women following this program is less than 15.3 hours.

A dairy farm uses the somatic cell count (SCC) report on the milk it provides to a processor as one way to monitor the health of its herd. The mean SCC from five samples of raw milk was 250,000 cells per milliliter with standard deviation 37,500 cells/ml. Test whether these data provide sufficient evidence, at the 10% level of significance, to conclude that the mean SCC of all milk produced at the dairy exceeds that in the previous report, 210,250 cells/ml. Assume a normal distribution of SCC.

Six coins of the same type are discovered at an archaeological site. If their weights on average are significantly different from 5.25 grams then it can be assumed that their provenance is not the site itself. The coins are weighed and have mean 4.73 g with sample standard deviation 0.18 g. Perform the relevant test at the 0.1% (1/10th of 1%) level of significance, assuming a normal distribution of weights of all such coins.

An economist wishes to determine whether people are driving less than in the past. In one region of the country the number of miles driven per household per year in the past was 18.59 thousand miles. A sample of 15 households produced a sample mean of 16.23 thousand miles for the last year, with sample standard deviation 4.06 thousand miles. Assuming a normal distribution of household driving distances per year, perform the relevant test at the 5% level of significance.

The recommended daily allowance of iron for females aged 19–50 is 18 mg/day. A careful measurement of the daily iron intake of 15 women yielded a mean daily intake of 16.2 mg with sample standard deviation 4.7 mg.

  • Assuming that daily iron intake in women is normally distributed, perform the test that the actual mean daily intake for all women is different from 18 mg/day, at the 10% level of significance.
  • The sample mean is less than 18, suggesting that the actual population mean is less than 18 mg/day. Perform this test, also at the 10% level of significance. (The computation of the test statistic done in part (a) still applies here.)

The target temperature for a hot beverage the moment it is dispensed from a vending machine is 170°F. A sample of ten randomly selected servings from a new machine undergoing a pre-shipment inspection gave mean temperature 173°F with sample standard deviation 6.3°F.

  • Assuming that temperature is normally distributed, perform the test that the mean temperature of dispensed beverages is different from 170°F, at the 10% level of significance.
  • The sample mean is greater than 170, suggesting that the actual population mean is greater than 170°F. Perform this test, also at the 10% level of significance. (The computation of the test statistic done in part (a) still applies here.)

The average number of days to complete recovery from a particular type of knee operation is 123.7 days. From his experience a physician suspects that use of a topical pain medication might be lengthening the recovery time. He randomly selects the records of seven knee surgery patients who used the topical medication. The times to total recovery were:

  • Assuming a normal distribution of recovery times, perform the relevant test of hypotheses at the 10% level of significance.
  • Would the decision be the same at the 5% level of significance? Answer either by constructing a new rejection region (critical value approach) or by estimating the p -value of the test in part (a) and comparing it to α .

A 24-hour advance prediction of a day’s high temperature is “unbiased” if the long-term average of the error in prediction (true high temperature minus predicted high temperature) is zero. The errors in predictions made by one meteorological station for 20 randomly selected days were:

  • Assuming a normal distribution of errors, test the null hypothesis that the predictions are unbiased (the mean of the population of all errors is 0) versus the alternative that it is biased (the population mean is not 0), at the 1% level of significance.
  • Would the decision be the same at the 5% level of significance? The 10% level of significance? Answer either by constructing new rejection regions (critical value approach) or by estimating the p -value of the test in part (a) and comparing it to α .

Pasteurized milk may not have a standardized plate count (SPC) above 20,000 colony-forming bacteria per milliliter (cfu/ml). The mean SPC for five samples was 21,500 cfu/ml with sample standard deviation 750 cfu/ml. Test the null hypothesis that the mean SPC for this milk is 20,000 versus the alternative that it is greater than 20,000, at the 10% level of significance. Assume that the SPC follows a normal distribution.

One water quality standard for water that is discharged into a particular type of stream or pond is that the average daily water temperature be at most 18°C. Six samples taken throughout the day gave the data:

The sample mean \(\bar{x}=18.15\) exceeds 18, but perhaps this is only sampling error. Determine whether the data provide sufficient evidence, at the 10% level of significance, to conclude that the mean temperature for the entire day exceeds 18°C.

Additional Exercises

A calculator has a built-in algorithm for generating a random number according to the standard normal distribution. Twenty-five numbers thus generated have mean 0.15 and sample standard deviation 0.94. Test the null hypothesis that the mean of all numbers so generated is 0 versus the alternative that it is different from 0, at the 20% level of significance. Assume that the numbers do follow a normal distribution.

At every setting a high-speed packing machine delivers a product in amounts that vary from container to container with a normal distribution of standard deviation 0.12 ounce. To compare the amount delivered at the current setting to the desired amount 64.1 ounce, a quality inspector randomly selects five containers and measures the contents of each, obtaining sample mean 63.9 ounces and sample standard deviation 0.10 ounce. Test whether the data provide sufficient evidence, at the 5% level of significance, to conclude that the mean of all containers at the current setting is less than 64.1 ounces.

A manufacturing company receives a shipment of 1,000 bolts of nominal shear strength 4,350 lb. A quality control inspector selects five bolts at random and measures the shear strength of each. The data are:

  • Assuming a normal distribution of shear strengths, test the null hypothesis that the mean shear strength of all bolts in the shipment is 4,350 lb versus the alternative that it is less than 4,350 lb, at the 10% level of significance.
  • Estimate the p -value (observed significance) of the test of part (a).
  • Compare the p -value found in part (b) to α = 0.10 and make a decision based on the p -value approach. Explain fully.

A literary historian examines a newly discovered document possibly written by Oberon Theseus. The mean average sentence length of the surviving undisputed works of Oberon Theseus is 48.72 words. The historian counts words in sentences between five successive 101 periods in the document in question to obtain a mean average sentence length of 39.46 words with standard deviation 7.45 words. (Thus the sample size is five.)

  • Determine if these data provide sufficient evidence, at the 1% level of significance, to conclude that the mean average sentence length in the document is less than 48.72.
  • Estimate the p -value of the test.
  • Based on the answers to parts (a) and (b), state whether or not it is likely that the document was written by Oberon Theseus.
Answers

  • \(Z\leq -1.645\)
  • \(T\leq -2.571\) or \(T\geq 2.571\)
  • \(Z\leq -1.645\) or \(Z\geq 1.645\)
  • \(T\leq -0.855\)
  • \(T\leq -2.201\) or \(T\geq 2.201\)
  • \(T=-2.690\), \(df=19\), \(-t_{0.005}=-2.861\), do not reject \(H_0\).
  • \(0.01<p\text{-value}<0.02\), \(\alpha =0.01\), do not reject \(H_0\).
  • \(T=2.398\), \(df=7\), \(t_{0.05}=1.895\), reject \(H_0\).
  • \(0.01<p\text{-value}<0.025\), \(\alpha =0.05\), reject \(H_0\).

\(T=-7.560\), \(df=12\), \(-t_{0.10}=-1.356\), reject \(H_0\).

\(T=-7.076\), \(df=5\), \(-t_{0.0005}=-6.869\), reject \(H_0\).

  • \(T=-1.483\), \(df=14\), \(-t_{0.05}=-1.761\), do not reject \(H_0\);
  • \(T=-1.483\), \(df=14\), \(-t_{0.10}=-1.345\), reject \(H_0\);
  • \(T=2.069\), \(df=6\), \(t_{0.10}=1.440\), reject \(H_0\);
  • \(T=2.069\), \(df=6\), \(t_{0.05}=1.943\), reject \(H_0\).

\(T=4.472\), \(df=4\), \(t_{0.10}=1.533\), reject \(H_0\).

\(T=0.798\), \(df=24\), \(t_{0.10}=1.318\), do not reject \(H_0\).

  • \(T=-1.773\), \(df=4\), \(-t_{0.05}=-2.132\), do not reject \(H_0\).
  • \(0.05<p\text{-value}<0.10\)
  • \(\alpha =0.05\), do not reject \(H_0\)


25.3 - Calculating Sample Size

Before we learn how to calculate the sample size that is necessary to achieve a hypothesis test with a certain power, it might behoove us to understand the effect that sample size has on power. Let's investigate by returning to our IQ example.

Example 25-3

Let \(X\) denote the IQ of a randomly selected adult American. Assume, a bit unrealistically again, that \(X\) is normally distributed with unknown mean \(\mu\) and (a strangely known) standard deviation of 16. This time, instead of taking a random sample of \(n=16\) adults, let's increase the sample size to \(n=64\). And, while setting the probability of committing a Type I error to \(\alpha=0.05\), test the null hypothesis \(H_0:\mu=100\) against the alternative hypothesis that \(H_A:\mu>100\).

What is the power of the hypothesis test when \(\mu=108\), \(\mu=112\), and \(\mu=116\)?

Setting \(\alpha\), the probability of committing a Type I error, to 0.05, implies that we should reject the null hypothesis when the test statistic \(Z\ge 1.645\), or equivalently, when the observed sample mean is 103.29 or greater:

\( \bar{x} = \mu + z \left(\dfrac{\sigma}{\sqrt{n}} \right) = 100 +1.645\left(\dfrac{16}{\sqrt{64}} \right) = 103.29\)

Therefore, the power function \(K(\mu)\), when \(\mu>100\) is the true value, is:

\( K(\mu) = P(\bar{X} \ge 103.29 | \mu) = P \left(Z \ge \dfrac{103.29 - \mu}{16 / \sqrt{64}} \right) = 1 - \Phi \left(\dfrac{103.29 - \mu}{2} \right)\)

Therefore, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=108\) is 0.9907, as calculated here:

\(K(108) = 1 - \Phi \left( \dfrac{103.29-108}{2} \right) = 1- \Phi(-2.355) = 0.9907 \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=112\) is greater than 0.9999, as calculated here:

\( K(112) = 1 - \Phi \left( \dfrac{103.29-112}{2} \right) = 1- \Phi(-4.355) = 0.9999\ldots \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=116\) is greater than 0.999999, as calculated here:

\( K(116) = 1 - \Phi \left( \dfrac{103.29-116}{2} \right) = 1- \Phi(-6.355) = 0.999999... \)

In summary, in the various examples throughout this lesson, we have calculated the power of testing \(H_0:\mu=100\) against \(H_A:\mu>100\) for two sample sizes (\(n=16\) and \(n=64\)) and for three possible values of the mean (\(\mu=108\), \(\mu=112\), and \(\mu=116\)). Here's a summary of our power calculations, reproduced in the sketch below.
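
This is a minimal sketch, assuming SciPy is available; the helper function name is ours:

```python
from scipy.stats import norm

def power(mu, n, mu0=100, sigma=16, alpha=0.05):
    """Power of the one-sided Z test of H0: mu = mu0 vs. HA: mu > mu0."""
    c = mu0 + norm.ppf(1 - alpha) * sigma / n ** 0.5  # threshold for x-bar
    return 1 - norm.cdf((c - mu) / (sigma / n ** 0.5))

for n in (16, 64):
    print(n, [round(power(mu, n), 4) for mu in (108, 112, 116)])
# 16 [0.6388, 0.9123, 0.9907]
# 64 [0.9907, 1.0, 1.0]
```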

As you can see, our work suggests that for a given value of the mean \(\mu\) under the alternative hypothesis, the larger the sample size \(n\), the greater the power \(K(\mu)\). Perhaps there is no better way to see this than graphically by plotting the two power functions simultaneously, one when \(n=16\) and the other when \(n=64\).

As such a plot suggests, if we are interested in increasing our chance of rejecting the null hypothesis when the alternative hypothesis is true, we can do so by increasing our sample size \(n\). This benefit is perhaps even greatest for values of the mean that are close to the value of the mean assumed under the null hypothesis. Let's take a look at two examples that illustrate the kind of sample size calculation we can make to ensure our hypothesis test has sufficient power.

Example 25-4


Let \(X\) denote the crop yield of corn measured in the number of bushels per acre. Assume (unrealistically) that \(X\) is normally distributed with unknown mean \(\mu\) and standard deviation \(\sigma=6\). An agricultural researcher is working to increase the current average yield from 40 bushels per acre. Therefore, he is interested in testing, at the \(\alpha=0.05\) level, the null hypothesis \(H_0:\mu=40\) against the alternative hypothesis that \(H_A:\mu>40\). Find the sample size \(n\) that is necessary to achieve 0.90 power at the alternative \(\mu=45\).

As is always the case, we need to start by finding a threshold value \(c\), such that if the sample mean is larger than \(c\), we'll reject the null hypothesis.

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.05\) level, the following statement must hold (using our typical \(Z\) transformation):

\(c = 40 + 1.645 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

But, that's not the only condition that \(c\) must meet, because \(c\) also needs to be defined to ensure that our power is 0.90 or, alternatively, that the probability of a Type II error is 0.10. That would happen if there were a 10% chance that our observed sample mean fell short of \(c\) when \(\mu=45\).

That is, in order for our hypothesis test to have 0.90 power, the following statement must hold (using our usual \(Z\) transformation):

\(c = 45 - 1.28 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

Aha! We have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(40+1.645\left(\frac{6}{\sqrt{n}}\right)=45-1.28\left(\frac{6}{\sqrt{n}}\right)\) \(\Rightarrow 5=(1.645+1.28)\left(\frac{6}{\sqrt{n}}\right), \qquad \Rightarrow 5=\frac{17.55}{\sqrt{n}}, \qquad n=(3.51)^2=12.3201\approx 13\)

Now that we know we will set \(n=13\), we can solve for our threshold value \(c\):

\( c = 40 + 1.645 \left( \dfrac{6}{\sqrt{13}} \right)=42.737 \)

So, in summary, if the agricultural researcher collects data on \(n=13\) corn plots, and rejects his null hypothesis \(H_0:\mu=40\) if the average crop yield of the 13 plots is greater than 42.737 bushels per acre, he will have a 5% chance of committing a Type I error and a 10% chance of committing a Type II error if the population mean \(\mu\) were actually 45 bushels per acre.
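
The two-equation solution is easy to script. A minimal sketch (the helper is ours, not part of any library):

```python
import math

from scipy.stats import norm

def sample_size(mu0, mu1, sigma, alpha=0.05, power=0.90):
    """Smallest n for a one-sided Z test (known sigma) to reach the power."""
    z_alpha, z_beta = norm.ppf(1 - alpha), norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2)

n = sample_size(40, 45, 6)                  # 13 plots
c = 40 + norm.ppf(0.95) * 6 / math.sqrt(n)  # threshold, about 42.737
print(n, round(c, 3))
```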

Example 25-5


Consider \(p\), the true proportion of voters who favor a particular political candidate. A pollster is interested in testing at the \(\alpha=0.01\) level, the null hypothesis \(H_0:p=0.5\) against the alternative hypothesis that \(H_A:p>0.5\). Find the sample size \(n\) that is necessary to achieve 0.80 power at the alternative \(p=0.55\).

In this case, because we are interested in performing a hypothesis test about a population proportion \(p\), we use the \(Z\)-statistic:

\(Z = \dfrac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)

Again, we start by finding a threshold value \(c\), such that if the observed sample proportion is larger than \(c\), we'll reject the null hypothesis.

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.01\) level, the following statement must hold:

\(c = 0.5 + 2.326 \sqrt{ \dfrac{(0.5)(0.5)}{n}} \) (**)

But, again, that's not the only condition that \(c\) must meet, because \(c\) also needs to be defined to ensure that our power is 0.80 or, alternatively, that the probability of a Type II error is 0.20. That would happen if there were a 20% chance that our observed sample proportion fell short of \(c\) when \(p=0.55\).

That is, in order for our hypothesis test to have 0.80 power, the following statement must hold:

\(c = 0.55 - 0.842 \sqrt{ \dfrac{(0.55)(0.45)}{n}} \) (**)

Again, we have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(0.5+2.326\sqrt{\dfrac{0.5(0.5)}{n}}=0.55-0.842\sqrt{\dfrac{0.55(0.45)}{n}} \\ 2.326\dfrac{\sqrt{0.25}}{\sqrt{n}}+0.842\dfrac{\sqrt{0.2475}}{\sqrt{n}}=0.55-0.5 \\ \dfrac{1}{\sqrt{n}}(1.5818897)=0.05 \qquad \Rightarrow n\approx \left(\dfrac{1.5818897}{0.05}\right)^2 = 1000.95 \approx 1001 \)

Now that we know we will set \(n=1001\), we can solve for our threshold value \(c\):

\(c = 0.5 + 2.326 \sqrt{\dfrac{(0.5)(0.5)}{1001}}= 0.5367 \)

So, in summary, if the pollster collects data on \(n=1001\) voters, and rejects his null hypothesis \(H_0:p=0.5\) if the proportion of sampled voters who favor the political candidate is greater than 0.5367, he will have a 1% chance of committing a Type I error and a 20% chance of committing a Type II error if the population proportion \(p\) were actually 0.55.

Incidentally, we can always check our work! Conducting the survey and subsequent hypothesis test as described above, the probability of committing a Type I error is:

\(\alpha= P(\hat{p} >0.5367 \text { if } p = 0.50) = P(Z > 2.3257) = 0.01 \)

and the probability of committing a Type II error is:

\(\beta = P(\hat{p} <0.5367 \text { if } p = 0.55) = P(Z < -0.846) = 0.199 \)

just as the pollster had desired.
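
A minimal sketch of the proportion calculation, including the check of both error rates:

```python
import math

from scipy.stats import norm

p0, p1, alpha, power = 0.50, 0.55, 0.01, 0.80
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
se0, se1 = math.sqrt(p0 * (1 - p0)), math.sqrt(p1 * (1 - p1))

n = math.ceil(((z_a * se0 + z_b * se1) / (p1 - p0)) ** 2)    # 1001 voters
c = p0 + z_a * se0 / math.sqrt(n)                            # about 0.5368
alpha_check = 1 - norm.cdf((c - p0) / (se0 / math.sqrt(n)))  # about 0.010
beta_check = norm.cdf((c - p1) / (se1 / math.sqrt(n)))       # about 0.199
print(n, round(c, 4), round(alpha_check, 3), round(beta_check, 3))
```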

We've illustrated several sample size calculations. Now, let's summarize the information that goes into a sample size calculation. In order to determine a sample size for a given hypothesis test, you need to specify:

  • The desired \(\alpha\) level, that is, your willingness to commit a Type I error.
  • The desired power or, equivalently, the desired \(\beta\) level, that is, your willingness to commit a Type II error.
  • A meaningful difference from the value of the parameter that is specified in the null hypothesis.
  • The standard deviation of the sample statistic or, at least, an estimate of the standard deviation (the "standard error") of the sample statistic.


Power and Sample Size Determination

In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H 0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H 0 | H 0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error and it is defined as the probability we do not reject H 0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H 0 | H 0 is false). In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H 0 when it is false, i.e., power = 1-β = P(Reject H 0 | H 0 is false). Power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).

Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α, the desired power of the test (equivalent to 1-β), the variability of the outcome, and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.

The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.  

Suppose we want to test the following hypotheses at α=0.05: H 0 : μ = 90 versus H 1 : μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see page 11 in the module on Probability) that for large n (here n=100 is sufficiently large), the distribution of the sample means is approximately normal with a mean of μ = 90 and a standard deviation of σ/√n = 20/√100 = 2.

If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H 0 : μ = 90.  

Distribution of the sample mean when the mean is 90: a bell-shaped curve centered at 90.

Rejection Region for Test H 0 : μ = 90 versus H 1 : μ ≠ 90 at α =0.05

The sampling distribution centered at 90, with the rejection regions in the two tails at the extremes above and below the mean; with α = 0.05, each tail accounts for an area of 0.025.

The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing .  

Now, suppose that the alternative hypothesis, H 1 , is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.

Two overlapping normal distributions, one depicting the null hypothesis with a mean of 90 and the other showing the alternative hypothesis with a mean of 94.

If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H 0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H 0 | H 0 is false), i.e., the probability of not rejecting the null hypothesis when it is in fact false. β is shown in the figure above as the area under the rightmost curve (H 1 ) to the left of the vertical line (where we do not reject H 0 ). Power is defined as 1-β = P(Reject H 0 | H 0 is false) and is shown in the figure as the area under the rightmost curve (H 1 ) to the right of the vertical line (where we reject H 0 ).

Note that β and power are related to α, the variability of the outcome, and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α=0.10. The upper critical value would be 92.56 instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).

β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.

Overlapping bell-shaped distributions - one with a mean of 90 and the other with a mean of 98

Notice that there is much higher power when there is a larger difference between the mean under H 0 as compared to H 1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H 0 : μ = 90 and H 1 : μ = 94, if we observed a sample mean of 93, for example, it would not be as clear as to whether it came from a distribution whose mean is 90 or one whose mean is 94.

In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.  

The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.
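
The formulas themselves follow on the next pages of the module. As a flavor of what such a computation looks like, here is a minimal sketch of a standard calculation for a two-sided one-sample Z test of a mean, written in terms of the standardized effect size (our notation, not necessarily the module's):

```python
import math

from scipy.stats import norm

def n_one_mean(effect_size, alpha=0.05, power=0.80):
    """n for a two-sided one-sample Z test; effect_size = |mu1 - mu0| / sigma."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil((z / effect_size) ** 2)

n = n_one_mean(0.2)            # a small standardized effect -> 197 completers
print(n, math.ceil(n / 0.90))  # inflate ~10% for attrition -> 219 enrolled
```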



Sample Size and its Importance in Research

Chittaranjan Andrade

Clinical Psychopharmacology Unit, Department of Clinical Psychopharmacology and Neurotoxicology, National Institute of Mental Health and Neurosciences, Bengaluru, Karnataka, India

The sample size for a study needs to be estimated at the time the study is proposed; too large a sample is unnecessary and unethical, and too small a sample is unscientific and also unethical. The necessary sample size can be calculated, using statistical software, based on certain assumptions. If no assumptions can be made, then an arbitrary sample size is set for a pilot study. This article discusses sample size and how it relates to matters such as ethics, statistical power, the primary and secondary hypotheses in a study, and findings from larger vs. smaller samples.

Studies are conducted on samples because it is usually impossible to study the entire population. Conclusions drawn from samples are intended to be generalized to the population, and sometimes to the future as well. The sample must therefore be representative of the population. This is best ensured by the use of proper methods of sampling. The sample must also be adequate in size – in fact, no more and no less.

SAMPLE SIZE AND ETHICS

A sample that is larger than necessary will better represent the population and will hence provide more accurate results. However, beyond a certain point, the increase in accuracy will be small and hence not worth the effort and expense involved in recruiting the extra patients. Furthermore, an overly large sample would inconvenience more patients than might be necessary for the study objectives; this is unethical. In contrast, a sample that is smaller than necessary would have insufficient statistical power to answer the primary research question, and a statistically nonsignificant result could merely be because of inadequate sample size (Type 2 or false negative error). Thus, a small sample could result in the patients in the study being inconvenienced with no benefit to future patients or to science. This is also unethical.

In this regard, inconvenience to patients refers to the time that they spend in clinical assessments and to the psychological and physical discomfort that they experience in assessments such as interviews, blood sampling, and other procedures.

ESTIMATING SAMPLE SIZE

So how large should a sample be? In hypothesis testing studies, this is mathematically calculated, conventionally, as the sample size necessary to be 80% certain of identifying a statistically significant outcome should the hypothesis be true for the population, with P for statistical significance set at 0.05. Some investigators power their studies for 90% instead of 80%, and some set the threshold for significance at 0.01 rather than 0.05. Both choices are uncommon because the necessary sample size becomes large, and the study becomes more expensive and more difficult to conduct. Many investigators increase the sample size by 10%, or by whatever proportion they can justify, to compensate for expected dropout, incomplete records, biological specimens that do not meet laboratory requirements for testing, and other study-related problems.

Sample size calculations require assumptions about expected means and standard deviations, or event risks, in different groups, or about expected effect sizes. For example, a study may be powered to detect an effect size of 0.5, or a response rate of 60% with drug vs. 40% with placebo.[ 1 ] When no guesstimates or expectations are possible, pilot studies are conducted on a sample that is arbitrary in size but that might be considered reasonable for the field.

The sample size may need to be larger in multicenter studies because of statistical noise (due to variations in patient characteristics, nonspecific treatment characteristics, rating practices, environments, etc. between study centers).[ 2 ] Sample size calculations can be performed manually or using statistical software; online calculators that provide free service can easily be identified by search engines. G*Power is an example of a free, downloadable program for sample size estimation. The manual and tutorial for G*Power can also be downloaded.
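As a rough illustration of what such software computes, here is the drug-versus-placebo example above (60% vs. 40% response, two-sided α = 0.05, 80% power) worked with Python's statsmodels; this is a normal-approximation sketch, not a substitute for G*Power's exact routines:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for a 60% response rate with drug vs. 40% with placebo
h = proportion_effectsize(0.60, 0.40)

n_per_group = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, alternative='two-sided')
print(round(n_per_group))   # roughly 97 participants per group, before attrition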

PRIMARY AND SECONDARY ANALYSES

The sample size is calculated for the primary hypothesis of the study. What is the difference between the primary hypothesis, primary outcome and primary outcome measure? As an example, the primary outcome may be a reduction in the severity of depression, the primary outcome measure may be the Montgomery-Asberg Depression Rating Scale (MADRS) and the primary hypothesis may be that reduction in MADRS scores is greater with the drug than with placebo. The primary hypothesis is tested in the primary analysis.

Studies almost always have many hypotheses; for example, that the study drug will outperform placebo on measures of depression, suicidality, anxiety, disability and quality of life. The sample size necessary for adequate statistical power to test each of these hypotheses will be different. Because a study can have only one sample size, it can be powered for only one outcome, the primary outcome. Therefore, the study would be either overpowered or underpowered for the other outcomes. These outcomes are therefore called secondary outcomes, and are associated with secondary hypotheses, and are tested in secondary analyses. Secondary analyses are generally considered exploratory because when many hypotheses in a study are each tested at a P < 0.05 level for significance, some may emerge statistically significant by chance (Type 1 or false positive errors).[ 3 ]

INTERPRETING RESULTS

Here is an interesting question. A test of the primary hypothesis yielded a P value of 0.07. Might we conclude that our sample was underpowered for the study and that, had our sample been larger, we would have identified a significant result? No! The reason is that larger samples will more accurately represent the population value, whereas smaller samples could be off the mark in either direction – towards or away from the population value. In this context, readers should also note that no matter how small the P value for an estimate is, the population value of that estimate remains the same.[ 4 ]

On a parting note, it is unlikely that population values will be null. That is, for example, that the response rate to the drug will be exactly the same as that to placebo, or that the correlation between height and age at onset of schizophrenia will be zero. If the sample size is large enough, even such small differences between groups, or trivial correlations, would be detected as being statistically significant. This does not mean that the findings are clinically significant.
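To make this last point concrete, here is a small illustration with made-up numbers: a clinically trivial 0.5-percentage-point difference in response rates becomes "highly significant" once each arm contains a million participants:

```python
from statsmodels.stats.proportion import proportions_ztest

# A clinically trivial difference: 50.5% vs. 50.0% response.
stat, p = proportions_ztest(count=[505_000, 500_000],
                            nobs=[1_000_000, 1_000_000])
print(f"z = {stat:.1f}, p = {p:.1e}")   # p is astronomically small
```

Statistical significance here says nothing about whether a half-point difference matters clinically.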


Hypothesis Testing and Small Sample Sizes

Rand R. Wilcox

One of the biggest breakthroughs in the past forty years is the derivation of inferential methods that perform well when sample sizes are small. Indeed, some practical problems that seemed insurmountable only a few years ago have been solved. But to appreciate this remarkable achievement, we must first describe the shortcomings of conventional techniques developed during the first half of the twentieth century—methods that are routinely used today. At one time it was generally thought that these standard methods are insensitive to violations of assumptions, but a more accurate statement is that they seem to perform reasonably well (in terms of Type I errors) when groups have identical probability curves or when performing regression with variables that are independent. If, for example, we compare groups that happen to have different probability curves, extremely serious problems can arise. Perhaps the most striking problem is described in Chapter 7, but the problems described here are also very serious and are certainly relevant to applied work.



Source: Wilcox, R.R. (2001). Hypothesis Testing and Small Sample Sizes. In: Fundamentals of Modern Statistical Methods. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-3522-2_5


Statistics By Jim

How to Calculate Sample Size Needed for Power

By Jim Frost

Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.

Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you'll see in this post, both under-powered and over-powered studies are problematic. Let's learn how to find a good sample size for your study! Learn more about Statistical Power.

When you perform hypothesis testing, there is a lot of preplanning you must do before collecting any data. This planning includes identifying the data you will gather, how you will collect it, and how you will measure it among many other details. A crucial part of the planning is determining how much data you need to collect. I’ll show you how to estimate the sample size for your study.

Before we get to estimating sample size requirements, let’s review the factors that influence statistical significance. This process will help you see the value of formally going through a power and sample size analysis rather than guessing.

Related post : 5 Steps for Conducting Scientific Studies with Statistical Analyses

Factors Involved in Statistical Significance

Look at the chart below and identify which study found a real treatment effect and which one didn’t. Within each study, the difference between the treatment group and the control group is the sample estimate of the effect size.

[Figure: bar chart displaying the treatment and control groups for two studies; Study A has a larger apparent effect than Study B]

Did either study obtain significant results? The estimated effects in both studies can represent either a real effect or random sample error. You don’t have enough information to make that determination. Hypothesis tests incorporate these considerations to determine whether the results are statistically significant.

  • Effect size : The larger the effect size, the less likely it is to be random error. It’s clear that Study A exhibits a more substantial effect in the sample—but that’s insufficient by itself.
  • Sample size : Larger sample sizes allow hypothesis tests to detect smaller effects. If Study B’s sample size is large enough, its more modest effect can be statistically significant.
  • Variability : When your sample data have greater variability, random sampling error is more likely to produce considerable differences between the experimental groups even when there is no real effect. If the sample data in Study A have sufficient variability, random error might be responsible for the large difference.

Hypothesis testing takes all of this information and uses it to calculate the p-value, which you use to determine statistical significance. The key takeaway is that the statistical significance of any effect depends collectively on the size of the effect, the sample size, and the variability present in the sample data. Consequently, you cannot determine a good sample size in a vacuum because the three factors are intertwined.

Related post : How Hypothesis Tests Work

Statistical Power of a Hypothesis Test

Because we're talking about determining the sample size for a study that has not been performed yet, you need to learn about a fourth consideration—statistical power. Statistical power is the probability that a hypothesis test correctly infers that a sample effect exists in the population. In other words, the test correctly rejects a false null hypothesis. Consequently, power is inversely related to a Type II error. Power = 1 − β. The power of the test depends on the other three factors.

For example, if your study has 80% power, it has an 80% chance of detecting an effect that exists. Let this point be a reminder that when you work with samples, nothing is guaranteed! When an effect actually exists in the population, your study might not detect it because you are working with a sample. Samples contain sample error, which can occasionally cause a random sample to misrepresent the population.

Related post : Types of Errors in Hypothesis Testing

Goals of a Power and Sample Size Analysis

Power analysis involves taking these three considerations, adding subject-area knowledge, and managing tradeoffs to settle on a sample size. During this process, you must rely heavily on your expertise to provide reasonable estimates of the input values.

Power analysis helps you manage an essential tradeoff. As you increase the sample size, the hypothesis test gains a greater ability to detect small effects. This situation sounds great. However, larger sample sizes cost more money. And, there is a point where an effect becomes so minuscule that it is meaningless in a practical sense.

You don’t want to collect a large and expensive sample only to be able to detect an effect that is too small to be useful! Nor do you want an underpowered study that has a low probability of detecting an important effect. Your goal is to collect a large enough sample to have sufficient power to detect a meaningful effect—but not too large to be wasteful.

As you’ll see in the upcoming examples, the analyst provides numeric values that correspond to “a good chance” and “meaningful effect.” These values allow you to tailor the analysis to your needs.

All of these details might sound complicated, but a statistical power analysis helps you manage them. In fact, going through this procedure forces you to focus on the relevant information. Typically, you specify three of the four factors discussed above and your statistical software calculates the remaining value. For instance, if you specify the smallest effect size that is practically significant, variability, and power, the software calculates the required sample size.

Let’s work through some examples in different scenarios to bring this to life.

2-Sample t-Test Power Analysis for Sample Size

Suppose we’re conducting a 2-sample t-test to determine which of two materials is stronger. If one type of material is significantly stronger than the other, we’ll use that material in our process. Furthermore, we’ve tested these materials in a pilot study, which provides background knowledge for the estimates.

In a power and sample size analysis, statistical software presents you with a dialog box something like the following:

[Image: power and sample size analysis dialog box for a 2-sample t-test]

We’ll go through these fields one-by-one. First off, we will leave Sample sizes blank because we want the software to calculate this value.

Differences

Differences is often a confusing value to enter. Do not enter your guess for the difference between the two types of material. Instead, use your expertise to identify the smallest difference that is still meaningful for your application. In other words, you consider smaller differences to be inconsequential. It would not be worthwhile to expend resources to detect them.

By choosing this value carefully, you tailor the experiment so that it has a reasonable chance of detecting useful differences while allowing smaller, non-useful differences to remain potentially undetected. This value helps prevent us from collecting an unnecessarily large sample.

For our example, we’ll enter 5 because smaller differences are unimportant for our process.

Power values

Power values is where we specify the probability that the statistical hypothesis test detects the difference in the sample if that difference exists in the population. This field is where you define the “reasonable chance” that I mentioned earlier. If you hold the other input values constant and increase the test’s power, the required sample size also increases. The proper value to enter in this field depends on norms in your study area or industry. Common power values are 0.8 and 0.9.

We’ll enter a power of 0.9 so that the 2-sample t-test has a 90% chance of detecting a difference of 5.

Standard deviation

Standard deviation is the field where we enter the data variability. We need to enter an estimate for the standard deviation of material strength. Analysts frequently base these estimates on pilot studies and historical research data. Inputting better variability estimates will produce more reliable power analysis results. Consequently, you should strive to improve these estimates over time as you perform additional studies and testing. Providing good estimates of the standard deviation is often the most difficult part of a power and sample size analysis.

For our example, we’ll assume that the two types of material have a standard deviation of 4 units of strength. After we click OK, we see the results.

Related post : Measures of Variability

Interpreting the Statistical Power Analysis and Sample Size Results

Statistical power and sample size analysis provides both numeric and graphical results, as shown below.

[Image: statistical output for the power and sample size analysis for the 2-sample t-test]

The text output indicates that we need 15 samples per group (total of 30) to have a 90% chance of detecting a difference of 5 units.

The dot on the Power Curve corresponds to the information in the text output. However, by studying the entire graph, we can learn additional information about how statistical power varies by the difference. If we start at the dot and move down the curve to a difference of 2.5, we learn that the test has a power of approximately 0.4 (40%). This power is too low. However, we indicated that differences less than 5 were not practically significant to our process. Consequently, having low power to detect a difference of 2.5 is not problematic.

Conversely, follow the curve up from the dot and notice how power quickly increases to nearly 100% before we reach a difference of 6. This design satisfies the process requirements while using a manageable sample size of 15 per group.
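As a rough cross-check of that 15-per-group figure (the post's examples use Minitab; this sketch uses Python's statsmodels instead):

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

d = 5 / 4   # smallest meaningful difference / assumed standard deviation = 1.25
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.90)
print(ceil(n_per_group))   # 15 per group, matching the text output above
```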

Other Power Analysis Options

Now, let’s explore a few more options that are available for power analysis. This time we’ll use a one-tailed test and have the software calculate a value other than sample size.

Suppose we are again comparing the strengths of two types of material. However, in this scenario, we are currently using one kind of material and are considering switching to another. We will change to the new material only if it is stronger than our current material. Again, the smallest difference in strength that is meaningful to our process is 5 units. The standard deviation in this study is now 7. Further, let’s assume that our company uses a standard sample size of 20, and we need approval to increase it to 40. Because the standard deviation (7) is larger than the smallest meaningful difference (5), we might need a larger sample.

In this scenario, the test needs to determine only whether the new material is stronger than the current material. Consequently, we can use a one-tailed test. This type of test provides greater statistical power to determine whether the new material is stronger than the old material, but no power to determine if the current material is stronger than the new—which is acceptable given the dictates of the new scenario.

In this analysis, we’ll enter the two potential values for Sample sizes and leave Power values blank. The software will estimate the power of the test for detecting a difference of 5 for designs with both 20 and 40 samples per group.

We fill in the dialog box as follows:

[Image: power and sample size analysis dialog box for a one-sided 2-sample t-test]

And, in Options , we choose the following one-tailed test:

[Image: options for the power and sample size analysis dialog box]

Interpreting the Power and Sample Size Results

[Image: statistical output for the power and sample size analysis for the one-sided 2-sample t-test]

The statistical output indicates that a design with 20 samples per group (a total of 40) has a ~72% chance of detecting a difference of 5. Generally, this power is considered to be too low. However, a design with 40 samples per group (80 total) achieves a power of ~94%, which is almost always acceptable. Hopefully, the power analysis convinces management to approve the larger sample size.

Assess the Power Curve graph to see how the power varies by the difference. For example, the curve for the sample size of 20 indicates that the smaller design does not achieve 90% power until the difference is approximately 6.5. If increasing the sample size is genuinely cost prohibitive, perhaps accepting 90% power for a difference of 6.5, rather than 5, is acceptable. Use your process knowledge to make this type of determination.
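The same numbers can be reproduced in statsmodels (again a sketch, not the Minitab output shown above):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 5 / 7          # smallest meaningful difference / assumed standard deviation

for n in (20, 40):
    pw = analysis.power(effect_size=d, nobs1=n, alpha=0.05, alternative='larger')
    print(f"n = {n} per group -> power = {pw:.2f}")   # ~0.72 and ~0.94

# Power for the n = 20 design at a difference of 6.5 instead of 5:
print(analysis.power(effect_size=6.5 / 7, nobs1=20, alpha=0.05,
                     alternative='larger'))           # ~0.90
```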

Use Power Analysis for Sample Size Estimation For All Studies

Throughout this post, we’ve been looking at continuous data, and using the 2-sample t-test specifically. For continuous data, you can also use power analysis to assess sample sizes for ANOVA and DOE designs. Additionally, there are hypothesis tests for other types of data , such as proportions tests ( binomial data ) and rates of occurrence (Poisson data). These tests have their own corresponding power and sample analyses.

In general, when you move away from continuous data to these other types of data, your sample size requirements increase. And there are unique intricacies in each. For instance, in a proportions test, you need a relatively larger sample size to detect a difference when your proportion is closer to 0 or 1 than when it is in the middle (0.5). Many factors can affect the optimal sample size. Power analysis helps you navigate these concerns.

After reading this post, I hope you see how power analysis combines statistical analyses, subject-area knowledge, and your requirements to help you derive the optimal sample size for your specific needs. If you don’t perform this analysis, you risk performing a study that is either likely to miss an important effect or have an exorbitantly large sample size. I’ve written a post about a Mythbusters experiment that had no chance of detecting an effect because they guessed a sample size instead of performing a power analysis.

In this post, I've focused on how power affects your test's ability to detect a real effect. However, low power tests also exaggerate effect sizes!

Finally, experimentation is an iterative process. As you conduct more studies in an area, you’ll develop better estimates to input into power and sample size analyses and gain a clearer picture of how to proceed.

Reader Interactions

May 11, 2024 at 2:18 am

Thank you, Mr. Jim, for such a brief explanation of power analysis for sample size. I read several explanations of this topic, but none of them got through to me; your explanation made it understandable.

Regards, Roopini


April 15, 2024 at 6:56 pm

Jim, Are you able to share what statistical software was used for your examples? Are there equations that can be typed into Excel to determine sample size and power? Is there free, reliable statistical analysis software you can recommend for calculating sample size and power?

Thank you! Suzann


April 16, 2024 at 3:37 pm

I used Minitab statistical software for the examples. Unfortunately, I don’t believe Excel has this feature built into it. However, there is a free Power and Sample Size analysis software that I highly recommend. It’s called G*Power . Click the link to get it for free!


May 24, 2024 at 2:13 pm

Jim, The post is really informative, but I want to know how to use power analysis for a correlation between 2 variables.

May 25, 2024 at 5:27 pm

You can use power analysis to determine the sample size you’d need to detect a correlation of a particular strength with a specified power.

I recommend using the free power analysis tool called G*Power. Below I show an example of using it to find the sample size I’d need to detect a correlation of 0.7 with 95% power. The answer is a sample size of 20. See how I set up G*Power to get this answer below.

[Image: G*Power setup for the correlation power analysis]
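For comparison, the classical Fisher z approximation gives nearly the same answer; this sketch is an approximation, while G*Power's exact routine returns 20:

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.95):
    """Approximate n to detect a correlation r in a two-sided test of rho = 0,
    via the Fisher z transformation: n = ((z_a + z_b) / atanh(r))**2 + 3."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return math.ceil(((z_a + z_b) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.7))   # 21 by this approximation; G*Power's exact answer is 20
```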

October 24, 2022 at 8:29 pm

Hi again Jim, apologies if this was posted multiple times, but I looked into the Bonferroni correction and saw this equation: α_new = α_original / n, where α_original is the original α level and n is the total number of comparisons or tests being performed.

Seeing this, would 6,000 or 1,000 be the n in my case? Would I have to perform this once or more than once? Second question: after finding this out, when performing the power analysis that you mentioned, do I have to do it multiple times to account for the different combinations of states that I will match with each other?

October 24, 2022 at 10:23 pm

In this context, n is the number of comparisons between groups. If you want to compare all groups to each other (i.e., all pairwise comparisons), then with 6 groups you’ll have 15 comparisons. So, n = 15. However, you don’t necessarily need to compare all groups. It depends on your research question. If you can avoid all pairwise comparisons, it’s a good thing. Just decide on your comparisons and record it in your plans before proceeding with the project. If you wait until after analyzing the data, you might (even if subconsciously) be tempted to cherry pick the comparisons that give good results.

As an example of an alternative to all pairwise comparisons, you might compare five of the states to one reference state in your sample. That reduces the pairwise comparisons (n) from 15 to 5. That helps because you’re dividing alpha by the number of comparisons. A lower n won’t lower your Bonferroni corrected significance level as much:

0.05/15 = 0.003 and 0.05/5 = 0.01

You'll need an extremely low p-value to reach significance with 15 comparisons (0.003). For more on what the familywise error rate is and why you need to control it, see Using Post Hoc Tests with ANOVA. Of course, you're not working with ANOVA, but the same ideas apply to the multiple comparisons you're making with the 2-proportions test. In your case, if you go with 15 comparisons (all pairwise for the 6 states), your uncorrected familywise error rate is 0.54. Over a 50% chance of a false positive!
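The arithmetic in that reply, as a short check:

```python
from math import comb

alpha, groups = 0.05, 6
n_comp = comb(groups, 2)                 # 15 pairwise comparisons among 6 states
print(alpha / n_comp)                    # Bonferroni-corrected level ~0.0033
print(1 - (1 - alpha) ** n_comp)         # uncorrected familywise error rate ~0.54
```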

October 21, 2022 at 8:59 pm

Hello again Jim, I looked on your other page about the margin of error and I had a few extra questions. The approach I would be taking is, as you said, using 1,000 people from each state for a comparison with the surveys. I saw the formula that you had, so would my confidence level for this instance be 95%? Also, as your formula is listed, would my bottom number be 1,000 as well, or would it be 6,000? Or would I instead have to use the "Finding the Margin of Error for Other Percentages" formula?

October 23, 2022 at 4:27 pm

Typically, surveys don’t delve so deep into statistical differences between groups in the responses. At least not that I’ve seen. Usually, they’ll calculate and report the margin of error. If the margins don’t overlap, you can assume the difference is statistically significant. However, as I point out in the margin of error post, that process is conservative because the difference can be statistically significant even with a little overlap.

What you need to do for your case is perform a power analysis for a two-sample proportions test. That's beyond what most public opinion surveys do but will get you the answers you need. In your case, one proportion is the proportion of individuals in state A who respond a particular way to a survey item, and the other is the proportion in state B who respond that way to the item.

I didn't realize that you were performing hypothesis testing with your survey data, or I would've mentioned this from the start! Because you're comparing six states, you're also facing the problem of multiple comparisons increasing the familywise error rate for that set of comparisons. You'll need to use something like a Bonferroni correction to appropriately lower the significance level you use, which will affect the numbers you need for a particular power.

I hope that helps!

October 20, 2022 at 4:33 pm

Hello Jim, I am hoping you can offer some guidance here. I am currently doing an assignment involving this subject, and my professor said this to me: "There's no rationale for the six thousand surveys. How did you arrive at your sample size? You need to report the power analysis (and the numbers you used in that analysis) to arrive at your chosen sample size; like everything else in scientific writing, the sample size needs justification." My study involves six states and getting specific individuals' opinions from each state about crime and how it has affected them. Surveys are my method of choice, so my question is how I would arrive at a sample size here. I had thought 6,000 was a starting point but am unsure if that's right?

October 21, 2022 at 4:11 pm

With surveys, you typically calculate the sample size to produce a specific margin of error. Click the link to learn more about that and how to tell whether there are differences. It's a little different process than power analysis in other contexts, but it's related. The big questions are: how precise do you want your estimates to be? And if you have groups you want to compare, that can affect the calculations.

For instance, 6,000 would generally be considered a large sample size for survey research. However, if you’re comparing subgroups within your sample, that can affect how many you need. I don’t know if you plan to do this or not, but if you wanted to compare the differences between the six states, that means you’d have about 1,000 per state. That’s still fairly decent but you’ll have a larger margin of error. You’ll need to know whether your primary interest is estimates for the total sample or differences between subgroups. If it’s differences between subgroups, that always increases your required sample size.

That’s not to say that 1000 per state isn’t enough. I don’t know. But you’d do the margin of error calculations to see if it produces sufficient precision for your needs. The process involves a combination of doing the MoE calculations and knowing the required precision (or possibly standards in your subject area).


October 15, 2022 at 2:38 am

So can a "power analysis" be done to get the sample size for a proposed survey, instead of a separate sample size calculation? In other words, is a "power analysis" the same as calculating the sample size when doing a research study? Thank you.

October 16, 2022 at 2:01 am

Hi Ronrico,

There’s definitely a related concept. For surveys, you typically need to calculate the margin of error . Click the link to read my post about it!


August 16, 2022 at 5:59 am

Wonderful post!

I was wondering how I would be able to determine whether a sample size is large enough for a paper that I'm reading, assuming they do not give the power calculation. If they do give the power calculation, should it be 80% or over for statistically significant results?

Thank you so much 🙂

August 21, 2022 at 12:28 am

Determining whether a study’s sample size and, hence, its statistical power, are sufficient isn’t quite as straightforward as it might appear. It’s tempting to take the study’s sample size, effect size, and variability and enter them into a power analysis. However, that’s problematic. What happens is that if the study has statistically significant findings the power analysis will always indicate sufficient sample size/power. However, if the study has non-significant results, the power analysis will always indicate that the sample size/power are insufficient.

That's a problem because it's possible to obtain significant results with low power studies and insignificant results with high power studies. It's important to recognize all these cases because significant low-power studies will exaggerate the effect sizes, and insignificant high-power studies are more likely to indicate that the effect does not exist in the population.

What you need to do instead is enter the study's sample size, use a literature review to obtain reasonable estimates of the variability (if possible), and then enter an effect size that represents either the literature's collective best estimate of it or a minimum effect size that is still practically meaningful. Note that you are not using the study's estimates for these calculations for the reasons I indicate earlier!


November 13, 2021 at 1:46 am

Hi Sir Jim!

I'd like to know how I can utilize the G*Power calculator to figure out the sample size for my study. It essentially employs stratified random sampling. I'm hoping you'll respond! Best wishes!

November 13, 2021 at 11:57 pm

It depends on how you've conducted your stratified sampling and what you want to test. Are you comparing the strata within your sample? If so, you'd just select the type of test, such as a t-test, and then enter your values. G*Power uses the default setting that your group sizes are equal. That's fine if you're using a disproportionate stratified sampling design and set all your strata to the same size. However, if your strata sizes are unequal, you'll need to adjust the allocation ratio.


June 16, 2021 at 7:32 am

Hello Jim. I want your help in calculating the sample size for my study. I have three groups: the first group is a control (normal), the second is a clinical population group undergoing treatment 1, and the third is a clinical group (same disease as group 2) undergoing treatment 2. I will first compare some parameters between pre- and post-treatment for groups 2 and 3 separately, then compare groups 2 and 3 before treatment and after treatment, and then compare baseline parameters and after-treatment parameters across all three groups. I hope I have not confused you. I want to know the sample size for my three groups. My hypothesis is that the two treatments will improve the parameters in groups 2 and 3; what I want to check is which treatment (1 or 2) is most effective. I request you to kindly help me in this regard.


April 19, 2021 at 10:49 pm

Dear Jim, I have a question regarding calculating the sample size in this scenario: I'm doing a hospital-based study (chart review study) where I will include all patients who have a specific disease (celiac disease) from the last 5 years. How would I know that the number I will get is sufficient to answer my research questions, considering that this disease is rare? Suppose, for example, I ended up with 100 patients; how would I know that I can use this sample for further analysis? Is there a way to calculate ahead of time the minimum number of patients needed to do my research?


March 8, 2021 at 10:45 pm

I am looking to determine the sample size necessary to detect differences in bird populations (composition and abundance) between forest treatment types. I assume I would use an ANOVA given that I have control units. My data will be bird occurrence data, so I imagine Poisson distribution. I have zero pilot data, though. Do you have any recommendations for reading up on ways to simulate or bootstrap data in this situation for use in making variability estimates?

Thank you!!

March 9, 2021 at 7:20 pm

Hi Lorelle,

Yes, I'd think you'd use something like Poisson regression or negative binomial regression because of the count data. I write a little bit about them in my post about choosing the correct type of regression analysis. You can include categorical variables for forest types.

I don't have good ideas for developing variability estimates. That can be the most difficult part of a power analysis. I'd recommend reading up on the literature as much as possible. Perhaps others have conducted similar research and you can use their estimates. Unfortunately, if you don't have any data, you can't bootstrap or simulate it.

I wish I had some better advice, but the best I can think of is to look through the literature for comparable studies. That’s always a good idea anyway, but here it’ll help you with the power analysis too.


February 17, 2021 at 7:04 am

I am confused about some parts as I am new to this. Let's assume I have the difference in means, the standard deviation, and 80% power, i.e., the information needed to get a sample size (delta, sd, power). But the question is: how would I know this is the correct sample size to get 80% power? Which type do I need to use: paired, two.sample, or one.sample? After power.t.test I get a sample size of 8.7 for two-sample and 6 for one-sample, and I am not sure which would be the correct one. How do I determine that?

February 18, 2021 at 12:34 am

The correct test depends on the nature of the data you collect. Are you comparing the means of two groups? In that case, you need to use a 2-sample t-test. If you have one group and are comparing its mean to a test value, you need a 1-sample t-test.

You can read about the purposes and interpretations the various t-tests in my post about How to do t-tests in Excel . That should be helpful even if you’re not using Excel. Also, I write more about how t-tests work , which will be helpful in showing you what each test can do.


February 7, 2021 at 6:53 pm

Hey there! What sort of test would be best to determine sample size needed for a study determining a 10% difference between two groups at a power of say 80%? Thanks!

February 7, 2021 at 10:23 pm

Hi Kristin, you’d need to perform a power and sample size analysis for a 2-sample t-test. As I indicate in this post, you’ll need to supply an estimate of the population’s standard deviation, the difference you want to detect, and the power, and the procedure will tell you the sample size per group.


January 30, 2021 at 7:48 pm

I have an essay question that I hope someone can help me with:

Do a calculation: write down what you think the typical power of a psychological study really is and what percentage of research hypotheses are "good" hypotheses. Assume that journals reserve 10% of their pages for publishing null results. Under these assumptions, what percentage of published psychological research is wrong? Do you agree that this analysis makes sense, or is this the wrong way to think about "right" and "wrong" research?

January 30, 2021 at 8:57 pm

I can’t do your essay for you, but I’ve written two blog posts that should be extremely helpful for your assignment.

Reproducibility in Psychology Experiments; Low power tests exaggerate effect sizes

Those two should give you some good food for thought!


January 26, 2021 at 1:17 pm

Dear Jim, I have a question regarding sample size calculation for a laboratory study. The laboratory evaluation includes evaluation of the marginal integrity of 2 dental materials vs. a control material. What type of test should I use?

January 26, 2021 at 9:13 pm

Hi Eman, that largely depends on the type of data you’re collecting for your outcome. If marginal integrity is continuous data and you want to compare the means between the control and two treatment groups, one-way ANOVA is a great place to start.


November 22, 2020 at 10:30 am

Hi Jim, what if I want to run mixed-model ANOVAs twice (on two different dependent variables)? Would I have to then double the sample size that I calculated using G*Power? Thanks, Joanna


November 16, 2020 at 11:35 pm

Hi Jim. What about molecular data? For instance, I sequenced my 6 samples, 3 controls and 3 treatments, but each sample (tank replicate) consists of 500-800 individual biological replicates (larvae). The analysis after sequencing shows thousands of genes that may show mean differences between the control and treatment. My concern is, does power analysis still play a fair role here, given that increasing the "sample size" (the number of tank replicates) to the 5 or more suggested by power analysis to get power >0.8 is nearly impossible in a physical setting?


November 5, 2020 at 8:09 pm

I have somewhat of a basic question. I am performing some animal studies looking at the effect of preservation solution on ischemia-reperfusion injury following transplantation. I am comparing 5 different preservation solutions. What should my sample size be for each group? I want to know how exactly I can calculate that.

November 6, 2020 at 8:58 pm

You’ll need to have an estimate of the effect. Or, an estimate of the minimum effect size that is practically meaningful in a real-world sense. If you’re comparing means, you’ll also need an estimate of the variability. The nature of what and how to determine the sample size depends on the type of hypothesis test you’ll be using. That in turn depends on the nature of your outcome variable. Are you comparing means with continuous data or comparing proportions with binary data? But in all cases you’ll need that effect size estimate.

You'll also need software to calculate that for you. I recommend a freeware program called G*Power, although most statistical applications can do these power calculations. I cover examples in this post that should be helpful for you.

If you have 5 solutions and you want to compare their means, you’ll need to perform power and sample size calculations for one-way ANOVA.
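For instance, a one-way ANOVA power calculation along those lines could look like this in statsmodels; the Cohen's f = 0.25 ("medium") effect used here is purely a placeholder for whatever estimate the subject area supports:

```python
from math import ceil
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=5)
print(ceil(n_total))   # total N across the 5 solution groups (~196, ~40 per group)
```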


September 4, 2020 at 3:48 am

Hi Jim, I've calculated that I need 34 pairs for a paired t-test with alpha = 0.05 and beta = 0.10, with a standard deviation of 1.945, to detect a 1.0 increase in the difference. If after 5 pairs I run my hypothesis tests and I find that the difference is significant (i.e., I reject the null hypothesis), is there a need to complete the remaining 29 pairs? Thanks, Sam


August 20, 2020 at 12:13 pm

Thank you for the explanation. I am currently using G*Power to determine my sample size, but I am still confused about the effect size. Let's say I use a medium effect size for a correlation, so the suggested sample size is 138 (for example), but then when I use a medium effect size for a t-test to find differences between two independent groups, the suggested sample size is 300 (for example). So which sample size should I take? Does the same effect size need to be used for every statistical test, or does each statistical test have a different effect size?


August 15, 2020 at 1:45 pm

I want to calculate the sample size for my animal studies. We have designed a novel neural probe and want to perform an experiment to test the functionality of these probes in the rat brain. As this is a binary outcome (the probe either works or doesn't work; success or failure) and it's a new technology, it lacks any previous literature. Can anyone please suggest which statistical analysis (test) I should use and what parameters (i.e., effect size) I should use? I am using G*Power and looking for a 95% confidence level.

Thanks in Advance Vishal

August 15, 2020 at 3:11 pm

It sounds like you need to use a 2-sample proportions test. It’s one of the many hypothesis tests that I cover in my new Hypothesis Testing ebook . You’ll find the details about how and why to use it, assumptions, interpretations and examples for it.

As for using G*Power to estimate power and sample size, under the Test family drop-down list, choose Exact. Under the Statistical test drop-down, choose Proportions: Inequality, two independent groups (Fisher's exact test). That assumes that your two groups have different probes. From there, you'll need to enter estimates for your study based on whatever background subject-area research/knowledge you have.

I hope this helps!


August 15, 2020 at 10:00 am

Hi Jim, is it scientifically appropriate to use G*Power for sample size calculation in clinical biomedical research?

August 15, 2020 at 3:24 pm

Hi, yes, G*Power should be appropriate to use for statistical analyses in any area. Did you have a specific concern about it?



July 12, 2020 at 1:56 am

Thank you, Jim, for the app reference. I am checking it out right now. #TeamNoSleep

July 12, 2020 at 5:40 pm

Hi Jamie, Ah, yes, #TeamNoSleep. I’ve unfortunately been on that team! 🙂


June 17, 2020 at 1:30 am

Hi Jim, What is the name of the software you use?

June 18, 2020 at 5:40 pm

I’m using Minitab statistical software. If you’d like free software to calculate power and sample sizes, I highly recommend G*Power .


June 10, 2020 at 4:05 pm

I would like to calculate power for a Poisson regression (my DV consists of count data). Do you have any guidance on how to do so?

June 10, 2020 at 4:29 pm

Hi Veronica,

Unfortunately, I'm not familiar with an application that will calculate power for Poisson regression. If your counts are large enough (lambda greater than 10), the Poisson distribution approximates a normal distribution. You might then be able to use power analysis for linear multiple regression, which I have seen in the free application G*Power. That might give you an idea at least. I'm not sure about power analysis specifically for Poisson regression.


June 3, 2020 at 6:24 am

Dear Jim, your post looks very nice. I have just one comment: how could I calculate the sample size and power for an "equal variances" test comparing more than 2 samples? Is it mandatory, as in t-tests? What is the test statistic used in that test? Thanks in advance for your tip.

June 3, 2020 at 8:13 pm

Hi Ciro, to be honest, I’ve never seen a power analysis for an equal variances test with more than two samples!

The test statistic depends upon which of several methods you use: the F-test, Levene's test, or Bartlett's test.

While it would be nice to estimate power for this type of test, I don’t think it’s a common practice and I haven’t seen it available in the software I have checked.


April 24, 2020 at 12:10 am

Why are the sample sizes here all so small?

April 25, 2020 at 1:37 am

For sample sizes, large and small are relative. Given the parameters entered, which include the effect size you want to detect, the properties of the data, and the desired power, the sample sizes are exactly the correct size! Of course, you're always working with estimates for these values, and there's a chance your estimates are off. But the proper sample size depends on the nature of all those properties.

I'm curious, was there some reason why you were expecting larger sample sizes? Sometimes you'll see big studies, such as medical trials. In some cases, with lives on the line, you'll want very large sample sizes that go beyond just the issue of statistical power. But for many scientific studies where the stakes aren't so high, the approach described here is used.


December 1, 2019 at 6:20 pm

Is the formula n = (z × standard deviation / margin of error)² already a power analysis? I'm looking for a power analysis for just estimating a statistic (descriptive statistics) and not hypothesis testing, as in many cases of inferential statistics. Does that formula suffice? Thanks in advance 😊

December 2, 2019 at 2:43 pm

You might not realize it, but you’re asking me a trick question! The answer for how you calculate power for descriptive statistics is that you don’t calculate power for descriptive statistics.

Descriptive statistics simply describe the characteristics of a particular group. You’re not making inferences about a larger population. Consequently, there is no hypothesis testing. Power relates to the probability that a hypothesis test will detect a population effect that actually exists. Consequently, if there is no hypothesis test/inferences about a population, there’s no reason to calculate power.

Relatedly, descriptive statistics do not involve a margin of error based on random sampling. The mean of a group is a specific known value without error (excluding measurement error) because you’re measuring all members of that group.

For more information about this topic, read my post about the differences between descriptive and inferential statistics .


October 22, 2019 at 3:24 am

Just wanted to understand if the confidence interval and power are the same.


September 9, 2019 at 8:25 am

Thanks for your explanation, Jim.

August 21, 2019 at 7:46 am

I would like to design a test for the following problem (under the assumption that the Poisson distribution applies):

Samples from a population can be either defective or not (e.g. some technical component from a production)

Out of a random sample of N, there should be at most k defective occurrences, with a 95% probability (e.g. N = 100’000, k = 30).

I would like to design a test for this (testing this Hypothesis) with a sample size N1 (different from N). What should my limit on k1 (defective occurrences from the sample of N1) be? Such that I can say that with a 95% confidence, there will be at most k occurrences out of N samples.

E.g. N1 = 20'000. k1 = ???

Any hints how to tackle this problem?

Many thanks in advance Tom

August 21, 2019 at 11:46 pm

To me, it sounds like you need to use the binomial distribution rather than the Poisson distribution. You use the binomial distribution when you have binary data and you know the probability of an event and the number of trials. That sounds like your scenario!

In the graph below, I illustrate a binomial distribution where we assume the defect rate is 0.001 and the sample size is 100,000. I had the software shade the upper and lower ~2.5% of the tails. 95% of the outcomes should fall within the middle.

[Figure: binomial distribution with n = 100,000 and p = 0.001, with the upper and lower ~2.5% tails shaded]

If you have sample data, you can use the Proportions hypothesis test, which is based on the binomial distribution. If you have a single sample, use the Proportions test to determine whether your sample is significantly different from a target probability and to construct a confidence interval.

I hope this helps!
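A quick way to reproduce that middle-95% range (a sketch; the 0.001 defect rate is the assumed value used for the graph):

```python
from scipy.stats import binom

n, p = 100_000, 0.001                    # trials and assumed defect probability
lo, hi = binom.ppf([0.025, 0.975], n, p)
print(lo, hi)                            # ~81 to ~120 defects in 95% of samples
```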


March 17, 2019 at 6:37 pm

Thanks very much for putting together this very helpful and informative page. I just have a quick question about statistical power: it’s been surprisingly difficult for me to locate an answer to it in the literature.

I want to calculate the sample size required in order to reach a certain level of a priori statistical power in my experiment. My question is about what ‘sample size’ means in this type of calculation. Does it mean the number of participants or the number of data points? If there is one data point per participant, then these numbers will obviously be the same. However, I’m using a mixed-effects logistic regression model in which there are multiple data points nested within each participant. (Each participant produces multiple ‘yes/no’ responses.)

It would seem odd if the calculation of a priori statistical power did not differentiate between whether each participant produces one response or multiple responses.


April 8, 2018 at 4:46 am

Thank you so much sir for the lucid explanation. Really appreciate your kind help. Many Thanks!

April 1, 2018 at 4:36 am

Dear sir, when I search online for sample size determination, I predominantly see mention of the margin of error formula for its calculation.

At other places, like your website, I see the use of effect size, desired power, etc. for the same calculation.

I'm struggling to reconcile these 2 approaches. Is there a link between the two?

I wish to determine the sample size for testing a hypothesis with sufficient power, say 80% or 90%. Please guide me.

April 2, 2018 at 11:37 am

Hi Khalid, a margin of error (MOE) quantifies the amount of random sampling error in the estimation of a parameter, such as the mean or proportion. MOEs represent the uncertainty about how well the sample estimates from a study represent the true population value and are related to confidence intervals. In a confidence interval, the margin of error is the distance between the sample estimate and each endpoint of the CI.

Margins of error are commonly used for surveys. For example, suppose a survey result is that 75% of the respondents like the product, with a MOE of 3 percent. This result indicates that we can be 95% confident that 75% +/- 3% (or 72-78%) of the population like the product.

If you conduct a study, you can estimate the sample size that you need to achieve a specific margin of error. The narrower the MOE, the more precise the estimate. If you have requirements about the precision of the estimates, then you might need to estimate the margin of error based on different sample sizes. This is simply one form of power and sample size analysis where the focus is on how sample sizes relate to the margin of error.

However, if you need to calculate power to detect an effect, use the methods I describe in this post.

In summary, determine what your requirements are and use the corresponding analysis. Do you need to estimate a sample size that produces a level of precision that you specify for the estimates? Or, do you need to estimate a sample size that produces an amount of power to detect a specific size effect? Of course, these are related questions and it comes down to what you want to input in as your criteria.
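For the 75% +/- 3% example above, the usual normal-approximation formula for a proportion's margin of error can be inverted to find the required sample size (a sketch):

```python
import math
from scipy.stats import norm

def n_for_moe(p, moe, confidence=0.95):
    """n so that a proportion estimated near p has the given margin of error."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

print(n_for_moe(p=0.75, moe=0.03))   # ~801 respondents for 75% +/- 3%
```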


March 20, 2018 at 10:42 am

Thank you so much for this very intuitive article on sample size.

Thank you, Ashwini

March 20, 2018 at 10:53 am

Hi Ashwini, you’re very welcome! I’m glad it was helpful!


March 19, 2018 at 1:22 pm

Thank you. This was very helpful.

March 19, 2018 at 1:25 pm

You’re very welcome, Hellen! I’m glad you found it to be helpful!


March 13, 2018 at 4:27 am

Thanks for your answer Jim. I was indeed aware of this tool, which is great for demonstration. I think I’ll stick to it.


March 12, 2018 at 7:53 am

Awaiting your book!

March 12, 2018 at 2:06 pm

Thanks! If all goes well, the first one should be out in September 2018!

March 12, 2018 at 4:18 am

Once again, a nice demonstration. Thanks Jim. I was wondering which software you used in your examples. Is it, perhaps, R or G*Power? And, would you have any suggestions on an (online/offline) tool that can be used in class?

March 12, 2018 at 2:03 pm

Hi George, thank you very much! I’m glad it was helpful! I used Minitab for the examples, but I would imagine that most statistical software have similar features.

I found this interactive tool for displaying how power, alpha, effect size, etc. are related. Perhaps this is what you’re looking for?


March 12, 2018 at 1:02 am

Thanks for the information. Please explain sample size calculation for a case-control study when different studies report different prevalences for different parameters.


March 12, 2018 at 12:26 am

Thanks, sir. I want to salute you, but you are too far away. Sir, please send me some articles on the distributions of probability.

With most kindness



Hypothesis testing: Use t-test when sample size is small or variance is unknown, but not both

The t-test is often used in hypothesis testing when the sample size is small (less than 30) because its parameterization by degrees of freedom allows the greater uncertainty to be accounted for. Many online information sources, however, including answers on Cross Validated, say t-tests and z-tests require approximate normality in the underlying population or random variable. This Wikipedia section says that for a one-sample t-test, the underlying population or random variable does not need to be normal if the sample is large enough that the sample mean is normally distributed due to the Central Limit Theorem (CLT).

Would it be fair to say that the t-test can be used: (i) when sample sizes are small or (ii) when the underlying population or random variable is not normal, but not both (i) and (ii) at the same time? That is, the above "or" should be an "exclusive or"?

P.S. I did not mention that the t-test is used when we do not know the standard deviation of the underlying population or random variable. But the above still applies, i.e., the t-test is inappropriate if conditions (i) and (ii) both apply.

  • hypothesis-testing


  • 1 $\begingroup$ As Wikipedia says, the t-test requires the distribution of the sample means to be normal. But that is the case for sure if the random variable itself is normal. If it is not, the distribution of the means is often approximately normal for larger sample sizes as the CLT states. Thus, it is an exclusive or (presuming the conditions for the CLT hold). $\endgroup$ –  frank Aug 7, 2022 at 4:47
  • $\begingroup$ That's the way I interpret it. I just wasn't sure because so many examples online don't highlight this caveat. They just dive into the mechanics of hypothesis testing. Thanks for confirming $\endgroup$ –  user2153235 Aug 7, 2022 at 16:44

The CLT has to do with the type I assertion probability $\alpha$ being approximately correct (and even then the sample size must be huge for it to work if you have high asymmetry), but it offers no protection against high $\beta$ (low power). The $t$-test can be used any time you have confidence that its normality/equal-variance assumptions are true, but you'll be getting a lot of the power from the assumptions when $n$ is small. A Bayesian $t$-test recognizes uncertainty with regard to both normality and the variance ratio. More here. Or use a nonparametric method to test a more general hypothesis.
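As a concrete illustration of the two routes mentioned in this answer, here is a minimal sketch in Python/scipy (the data and hypothesized mean are invented): a one-sample t-test, which borrows power from the normality assumption when $n$ is small, and a Wilcoxon signed-rank test as a nonparametric alternative.

```python
# Minimal sketch (Python/scipy; the data and hypothesized mean are invented).
from scipy import stats

x = [5.1, 4.8, 5.6, 5.4, 4.7, 5.3, 5.2, 4.9]   # small sample, n = 8
mu0 = 5.0                                       # hypothesized population mean

# One-sample t-test: exact if the population is normal; with n this small,
# much of the power comes from that assumption.
t_stat, p_t = stats.ttest_1samp(x, popmean=mu0)
print(f"t = {t_stat:.3f}, p = {p_t:.3f}")

# Wilcoxon signed-rank test: a nonparametric alternative that tests a more
# general location hypothesis without assuming normality.
w_stat, p_w = stats.wilcoxon([xi - mu0 for xi in x])
print(f"W = {w_stat:.1f}, p = {p_w:.3f}")
```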


  • $\begingroup$ Thanks. I'm going to look things up to fully appreciate this... $\endgroup$ –  user2153235 Aug 7, 2022 at 16:47




Chi-Square (Χ²) Test & How To Calculate Formula Equation

Benjamin Frimodig

Science Expert

B.A., History and Science, Harvard University

Ben Frimodig is a 2021 graduate of Harvard College, where he studied the History of Science.


Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD, is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Chi-square (χ2) is used to test hypotheses about the distribution of observations into categories with no inherent ranking.

What Is a Chi-Square Statistic?

The Chi-square test (pronounced Kai) looks at the pattern of observations and will tell us if certain combinations of the categories occur more frequently than we would expect by chance, given the total number of times each category occurred.

It looks for an association between the variables. We cannot use a correlation coefficient to look for the patterns in this data because the categories often do not form a continuum.

There are three main types of Chi-square tests: the goodness-of-fit test, the test of independence, and the test for homogeneity. All three rely on the same formula to compute the test statistic.

These tests function by deciphering relationships between observed sets of data and theoretical or “expected” sets of data that align with the null hypothesis.

What is a Contingency Table?

Contingency tables (also known as two-way tables) are grids in which Chi-square data is organized and displayed. They provide a basic picture of the interrelation between two variables and can help find interactions between them.

In contingency tables, one variable and each of its categories are listed vertically, and the other variable and each of its categories are listed horizontally.

Additionally, including column and row totals, also known as “marginal frequencies,” will help facilitate the Chi-square testing process.

In order for the Chi-square test to be considered trustworthy, each cell of your expected contingency table must have a value of at least five.

Each Chi-square test will have one contingency table representing observed counts (see Fig. 1) and one contingency table representing expected counts (see Fig. 2).


Figure 1. Observed table (which contains the observed counts).

To obtain the expected frequencies for any cell in any cross-tabulation in which the two variables are assumed independent, multiply the row and column totals for that cell and divide the product by the total number of cases in the table.


Figure 2. Expected table (what we expect the two-way table to look like if the two categorical variables are independent).

To decide if our calculated value for χ2 is significant, we also need to work out the degrees of freedom for our contingency table using the following formula: df = (rows − 1) × (columns − 1).
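As a sketch of the two calculations just described, here is a short Python/numpy example (the observed table is invented for illustration):

```python
# Sketch (Python/numpy) of the expected-count and degrees-of-freedom
# calculations; the observed 2x3 table is invented for illustration.
import numpy as np

observed = np.array([[20, 30, 10],
                     [30, 20, 40]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count per cell = (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / n
print(expected)

# df = (rows - 1) * (columns - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print("df =", df)
```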

Formula Calculation

χ² = Σ (O − E)² / E

Calculate the chi-square statistic (χ2) by completing the following steps:

  • Calculate the expected frequencies and the observed frequencies.
  • For each observed number in the table, subtract the corresponding expected number (O − E).
  • Square the difference: (O − E)².
  • Divide the square obtained for each cell by the expected number for that cell: (O − E)² / E.
  • Sum all the values of (O − E)² / E. This is the chi-square statistic.
  • Calculate the degrees of freedom for the contingency table using the following formula: df = (rows − 1) × (columns − 1).

Once we have calculated the degrees of freedom (df) and the chi-squared value (χ2), we can use the χ2 table (often at the back of a statistics book) to check if our value for χ2 is higher than the critical value given in the table. If it is, then our result is significant at the level given.
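Putting the steps together in code, here is a Python sketch with numpy/scipy (the 2×2 observed table is invented; scipy's chi2 distribution stands in for the printed table of critical values):

```python
# Sketch (Python with numpy/scipy; invented 2x2 observed table).
import numpy as np
from scipy.stats import chi2

observed = np.array([[25, 15],
                     [20, 40]])
row_t, col_t, n = observed.sum(axis=1), observed.sum(axis=0), observed.sum()
expected = np.outer(row_t, col_t) / n                    # expected counts under H0

chi_sq = ((observed - expected) ** 2 / expected).sum()   # sum of (O - E)^2 / E
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

critical = chi2.ppf(0.95, df)   # critical value at the 0.05 level
p_value = chi2.sf(chi_sq, df)   # the equivalent p-value
print(f"chi2 = {chi_sq:.2f}, df = {df}, critical = {critical:.2f}, p = {p_value:.4f}")
if chi_sq > critical:
    print("Significant at the 0.05 level: reject the null hypothesis.")
```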

Interpretation

The chi-square statistic tells you how much difference exists between the observed count in each table cell and the count you would expect if there were no relationship at all in the population.

Small Chi-Square Statistic: If the chi-square statistic is small and the p-value is large (usually greater than 0.05), this often indicates that the observed frequencies in the sample are close to what would be expected under the null hypothesis.

The null hypothesis usually states no association between the variables being studied or that the observed distribution fits the expected distribution.

In theory, if the observed and expected values were equal (no difference), then the chi-square statistic would be zero — but this is unlikely to happen in real life.

Large Chi-Square Statistic: If the chi-square statistic is large and the p-value is small (usually less than 0.05), then the conclusion is often that the data does not fit the model well, i.e., the observed and expected values are significantly different. This often leads to the rejection of the null hypothesis.

How to Report

To report a chi-square output in an APA-style results section, always rely on the following template:

χ2(degrees of freedom, N = sample size) = chi-square statistic value, p = p value.

(Image: example SPSS chi-square output.)

In the case of the above example, the results would be written as follows:

A chi-square test of independence showed that there was a significant association between gender and post-graduation education plans, χ2 (4, N = 101) = 54.50, p < .001.

APA Style Rules

  • Do not use a zero before a decimal when the statistic cannot be greater than 1 (proportion, correlation, level of statistical significance).
  • Report exact p values to two or three decimals (e.g., p = .006, p = .03).
  • However, report p values less than .001 as “ p < .001.”
  • Put a space before and after a mathematical operator (e.g., minus, plus, greater than, less than, equals sign).
  • Do not repeat statistics in both the text and a table or figure.
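A small helper function can apply this template and the rules above automatically. The following is an illustrative sketch only; apa_chi_square is a hypothetical name, not a standard library function:

```python
# Illustrative sketch; apa_chi_square is a hypothetical helper, not a
# standard library function.
def apa_chi_square(chi_sq: float, df: int, n: int, p: float) -> str:
    # Report p < .001 for very small p-values; otherwise give an exact
    # p-value with no leading zero, per the APA rules above.
    p_text = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".", 1)
    return f"χ2({df}, N = {n}) = {chi_sq:.2f}, {p_text}"

print(apa_chi_square(54.50, df=4, n=101, p=0.0004))
# -> χ2(4, N = 101) = 54.50, p < .001
```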

p-value Interpretation

You test whether a given χ2 is statistically significant by testing it against a table of chi-square distributions, according to the number of degrees of freedom for your sample, which for a goodness-of-fit test is the number of categories minus 1. The chi-square test assumes that you have at least 5 expected observations per category.

If you are using SPSS, the output will report an exact p-value.

For a chi-square test, a p-value that is less than or equal to the .05 significance level indicates that the observed values are different from the expected values.

Thus, low p-values (p < .05) indicate a likely difference between the theoretical population and the collected sample. You can conclude that a relationship exists between the categorical variables.

Remember that p -values do not indicate the odds that the null hypothesis is true but rather provide the probability that one would obtain the sample distribution observed (or a more extreme distribution) if the null hypothesis was true.

The level of confidence necessary to accept the null hypothesis can never be reached. Therefore, depending on the calculated p-value, one must either fail to reject the null hypothesis or reject it in favor of the alternative.

The steps below show you how to analyze your data using a chi-square goodness-of-fit test in SPSS, whether you hypothesize equal or unequal expected proportions.

Step 1 : Analyze > Nonparametric Tests > Legacy Dialogs > Chi-square… on the top menu as shown below:

Step 2 : Move the variable indicating categories into the “Test Variable List:” box.

Step 3 : If you want to test the hypothesis that all categories are equally likely, click “OK.”

Step 4 : Otherwise, specify the expected count for each category by first clicking the “Values” button under “Expected Values.”

Step 5 : Then, in the box to the right of “Values,” enter the expected count for category one and click the “Add” button. Now enter the expected count for category two and click “Add.” Continue in this way until all expected counts have been entered.

Step 6 : Then click “OK.”

The steps below show you how to analyze your data using a chi-square test of independence in SPSS Statistics.

Step 1 : Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).

Step 2 : Select the variables you want to compare using the chi-square test. Click one variable in the left window and then click the arrow at the top to move the variable. Select the row variable and the column variable.

Step 3 : Click Statistics (a new pop-up window will appear). Check Chi-square, then click Continue.

Step 4 : (Optional) Check the box for Display clustered bar charts.

Step 5 : Click OK.

Goodness-of-Fit Test

The Chi-square goodness of fit test is used to compare a randomly collected sample containing a single, categorical variable to a larger population.

This test is most commonly used to compare a random sample to the population from which it was potentially collected.

The test begins with the creation of a null and alternative hypothesis. In this case, the hypotheses are as follows:

Null Hypothesis (Ho) : The null hypothesis (Ho) is that the observed frequencies are the same (except for chance variation) as the expected frequencies. The collected data is consistent with the population distribution.

Alternative Hypothesis (Ha) : The collected data is not consistent with the population distribution.

The next step is to create a contingency table that represents how the data would be distributed if the null hypothesis were exactly correct.

The sample’s overall deviation from this theoretical/expected data will allow us to draw a conclusion, with a more severe deviation resulting in smaller p-values.
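In practice, the whole goodness-of-fit procedure is a short computation in most statistical software. A minimal Python/scipy sketch, using invented counts from 120 rolls of a die tested against a fair-die distribution:

```python
# Sketch (Python/scipy) with invented counts: 120 rolls of a die tested
# against the equal expected frequencies of a fair die.
from scipy.stats import chisquare

observed = [15, 22, 19, 25, 17, 22]        # counts for faces 1 through 6
expected = [sum(observed) / 6] * 6         # 20 expected per face under H0

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")   # a large p gives no reason to reject H0
```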

Test for Independence

The Chi-square test for independence looks for an association between two categorical variables within the same population.

Unlike the goodness of fit test, the test for independence does not compare a single observed variable to a theoretical population but rather two variables within a sample set to one another.

The hypotheses for a Chi-square test of independence are as follows:

Null Hypothesis (Ho) : There is no association between the two categorical variables in the population of interest.

Alternative Hypothesis (Ha) : There is an association between the two categorical variables in the population of interest.

The next step is to create a contingency table of expected values that reflects how a data set that perfectly aligns the null hypothesis would appear.

The simplest way to do this is to calculate the marginal frequencies of each row and column; the expected frequency of each cell is the product of its row and column marginal frequencies divided by the total sample size.
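A minimal Python/scipy sketch of a test of independence (the observed table is invented); scipy.stats.chi2_contingency computes the expected table, the statistic, the degrees of freedom, and the p-value in one call:

```python
# Sketch (Python/scipy) of a test of independence; the observed table is invented.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10, 20],   # e.g., rows = groups, columns = categories
                     [20, 25, 15]])

chi_sq, p, df, expected = chi2_contingency(observed)
print(f"chi2 = {chi_sq:.2f}, df = {df}, p = {p:.4f}")
print(expected)   # expected counts, computed exactly as described above
```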

Test for Homogeneity

The Chi-square test for homogeneity is organized and executed exactly the same as the test for independence.

The main difference to remember between the two is that the test for independence looks for an association between two categorical variables within the same population, while the test for homogeneity determines if the distribution of a variable is the same in each of several populations (thus allocating population itself as the second categorical variable).

Null Hypothesis (Ho) : There is no difference in the distribution of a categorical variable for several populations or treatments.

Alternative Hypothesis (Ha) : There is a difference in the distribution of a categorical variable for several populations or treatments.

The difference between these two tests can be a bit tricky to determine, especially in the practical applications of a Chi-square test. A reliable rule of thumb is to determine how the data was collected.

If the data consists of only one random sample with the observations classified according to two categorical variables, it is a test for independence. If the data consists of more than one independent random sample, it is a test for homogeneity.
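Because the mechanics are identical, the same computation serves for a homogeneity test; the only difference is in how the data were collected and how the rows are interpreted. A Python/scipy sketch with invented counts, where each row is an independent sample:

```python
# Sketch (Python/scipy) with invented counts; each row is an independent
# random sample, so this is framed as a test for homogeneity.
import numpy as np
from scipy.stats import chi2_contingency

sample_a = [40, 35, 25]   # counts by category, sample from population A
sample_b = [55, 25, 20]   # counts by category, sample from population B

chi_sq, p, df, _ = chi2_contingency(np.array([sample_a, sample_b]))
print(f"chi2 = {chi_sq:.2f}, df = {df}, p = {p:.4f}")
```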

What is the chi-square test?

The Chi-square test is a non-parametric statistical test used to determine if there’s a significant association between two or more categorical variables in a sample.

It works by comparing the observed frequencies in each category of a cross-tabulation with the frequencies expected under the null hypothesis, which assumes there is no relationship between the variables.

This test is often used in fields like biology, marketing, sociology, and psychology for hypothesis testing.

What does chi-square tell you?

The Chi-square test tells you whether there is a significant association between two categorical variables. If the calculated Chi-square value is above the critical value from the Chi-square distribution, this suggests a significant relationship between the variables, and the null hypothesis of no association is rejected.

How to calculate chi-square?

To calculate the Chi-square statistic, follow these steps:

1. Create a contingency table of observed frequencies for each category.

2. Calculate expected frequencies for each category under the null hypothesis.

3. Compute the Chi-square statistic using the formula: Χ² = Σ [ (O_i – E_i)² / E_i ], where O_i is the observed frequency and E_i is the expected frequency.

4. Compare the calculated statistic with the critical value from the Chi-square distribution to draw a conclusion.


