Overview
The chi-squared test can be used when you are dealing with a nominal measurement scale, either with a single sample of categorical data or with two independent samples. With a single sample, you use the chi-squared test of goodness of fit. It tests whether a significant difference exists between the observed number of outcomes falling in each category and the expected number under the null hypothesis. With two independent samples, you use the chi-squared test of independence. It tests whether the two groups differ in the relative frequency with which their members fall into several categories. For both of these tests, R offers a built-in function called chisq.test().
In this building block, we will describe how this code works for both these tests.
Understanding chisq.test()
In R, chisq.test() performs both the chi-squared test of goodness of fit and the chi-squared test of independence. The syntax of the test is as follows:
chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE, ...)
Where:
- x is a numeric vector or matrix that provides the observed frequencies.
- y is another numeric vector of the same length as x. It is ignored for the test of goodness of fit or when x is a matrix.
- correct is a logical that indicates whether to apply Yates' continuity correction when computing the test statistic for a 2x2 table. The default is TRUE.
- p is a vector of probabilities of the same length as x.
- rescale.p is a logical; if TRUE, p is rescaled (if necessary) to sum to 1. It is FALSE by default.
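To illustrate how p and rescale.p work together, here is a small sketch with made-up counts; the weights 3:2:1 do not sum to 1, so rescale.p = TRUE rescales them to the proportions 1/2, 1/3 and 1/6 before the test is run:

```r
# Hypothetical observed counts that match a 3:2:1 expectation exactly
obs <- c(30, 20, 10)

# p need not sum to 1 when rescale.p = TRUE; 3:2:1 is rescaled to 1/2, 1/3, 1/6
chisq.test(obs, p = c(3, 2, 1), rescale.p = TRUE)
```

Because these counts match the rescaled expected frequencies exactly, the statistic is 0 and the p-value is 1; without rescale.p = TRUE, the call would stop with an error since p does not sum to 1.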
Test of goodness of fit
The chi-squared goodness of fit test is used to compare the observed distribution (Oi) to an expected distribution (Ei) in a situation where there are two or more categories of discrete data. The hypothesis is set as:
- H0: Oi = Ei
- H1: Oi ≠ Ei
As an example, suppose you gathered 40 wild tulips and observed that 20 were red, 15 were yellow, and 5 were white. You can use the goodness of fit test to analyse whether these three colours were equally common, i.e. whether the expected proportion of each colour was 1/3.
In this case, for the R code, you only need two arguments: x and p.
# Create a vector for the observed frequencies
tulip <- c(20, 15, 5)
# Run the test
chisq.test(tulip, p = c(1/3, 1/3, 1/3))
The output of this code gives the value of the chi-squared statistic, which is equal to 8.75, and the degrees of freedom (df), which equal one less than the number of categories. Since we have three colours, the df is 2.
You need the df to look up the critical chi-squared value for a certain significance level (alpha) using a distribution table or an online calculator. This is approximately 5.99 if you take alpha equal to 0.05.
Since the critical value (5.99) is smaller than the chi-squared statistic (8.75), you can reject the null hypothesis and conclude that there is a significant difference between the observed and expected frequencies at a 5% significance level. Therefore, the colours are not equally common.
The R output also gives the p-value, which you can compare with alpha to draw the same conclusion. If the p-value is less than or equal to the chosen alpha, you reject the null hypothesis. In our case, the p-value is 0.013, which is less than 0.05, confirming our previous analysis.
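If you prefer not to use a distribution table, both the critical-value and the p-value comparisons can be done directly in R; a short sketch using the tulip data from above:

```r
# Goodness of fit test on the tulip counts
tulip <- c(20, 15, 5)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))

# Critical value at alpha = 0.05 with df = 2 (~5.99)
crit <- qchisq(0.95, df = 2)

unname(res$statistic) > crit   # TRUE: reject H0
res$p.value <= 0.05            # TRUE: same conclusion via the p-value
```

The returned object also stores the expected counts in res$expected, which is handy for checking that all expected frequencies are large enough for the test to be reliable.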
Test of independence
The chi-squared test of independence is used to determine whether two categorical variables are significantly associated. A common example is checking the effectiveness of a new drug. Suppose you have 105 patients in a study, of which 50 were treated with the new drug and 55 were not. You record whether each patient showed an improvement and obtain the following contingency table; let us call it ‘drug_effectiveness’:
                  Improvement
                  Yes    No
Treatment   No     26    29
            Yes    35    15
You can read the dataset yourself by running the following code:
data_frame <- read.csv("https://goo.gl/j6lRXD")  # Read the CSV
drug_effectiveness <- table(data_frame$treatment, data_frame$improvement)
The hypothesis will be set as:
- H0: The drug treatment and improvement are independent.
- H1: Treatment and improvement are not independent; the drug is effective.
To run this test in R, you additionally need two other arguments: y and correct. x and y indicate the variables you are comparing, in our case ‘treatment’ and ‘improvement’ respectively, and correct is set to FALSE to turn off Yates' continuity correction:
chisq.test(data_frame$treatment, data_frame$improvement, correct = FALSE)
Another way of running this test is to assign the contingency table matrix, ‘drug_effectiveness’, to x, as follows:
chisq.test(drug_effectiveness, correct = FALSE)
The output of either call gives the value of the chi-squared statistic, which is equal to 5.56, and the df, which equals 1. The critical chi-squared value at alpha equal to 0.05 with df equal to 1 is 3.84; since the statistic (5.56) exceeds it, you can reject the null hypothesis. Equivalently, the p-value is 0.018, which is smaller than 0.05. Thus, there is sufficient evidence to conclude a significant association between the drug treatment and the improvement, implying that the drug is effective at a 5% significance level.
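The whole check can also be reproduced without downloading the CSV. The sketch below rebuilds the 2x2 table directly from the counts shown in the contingency table above and compares the statistic with the critical value:

```r
# Rebuild the contingency table from the counts above
drug_effectiveness <- matrix(c(26, 29, 35, 15), nrow = 2, byrow = TRUE,
                             dimnames = list(Treatment = c("No", "Yes"),
                                             Improvement = c("Yes", "No")))

res <- chisq.test(drug_effectiveness, correct = FALSE)

# Critical value at alpha = 0.05 with df = 1 (~3.84)
crit <- qchisq(0.95, df = 1)

unname(res$statistic) > crit   # TRUE: reject H0
res$p.value                    # ~0.018
```

Passing the matrix directly is equivalent to passing the two raw variables as x and y, since chisq.test() builds the same contingency table internally.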
Summary
- The chi-squared goodness of fit test is apt for nominal data from a single sample.
- The chi-squared test of independence is used when you have two independent samples.
- Both tests can be performed in R using the built-in function chisq.test().