Frequencies and the Chi-Square: Test of Dependence


[This module will NOT be covered in the Statistics Diagnostic Exam at the Nicholas School.]


As we know, data may be measured on a number of measurement scales: nominal, ordinal, interval and ratio.  We have conducted comparison of means tests on data that are either interval (the interval between data points has meaning) or ratio (the intervals have meaning and there is a true zero).  Now we are going to shift gears to talk about how we can analyze nominal data (categorical data with no order) and ordinal data (where order is important, but the interval between levels is unknown or unequal).  An example of a nominal variable is ‘undergraduate major’: there is no order implied across the majors.  Alternatively, a response to a Likert scale survey question of Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree is an example of an ordinal variable.  In this tutorial, we will work with nominal variables, which are sometimes called categorical variables.


When describing nominal data, we often use frequency tables.  Frequency tables display how many times each of the levels of a variable occurs in a sample.  For example, we could take a count of visitors to different gates of Duke Forest.  We randomly selected five hours on the weekend and five hours during the week and sent out four research assistants to each of four gates.  The total counts across these ten hours are recorded in the table below.





The table above is fairly straightforward.  It displays the counts of visitors at each of the four gates, for a total of 300 visitors.  The proportion column displays the proportion of visitors (in fraction and decimal form) at each of the gates.  Now let’s add in a second nominal variable, weekday vs. weekend, to the table.
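As a minimal sketch of how a frequency table like this is put together, the Python snippet below computes the count and the proportion (as a fraction and as a decimal) for each gate. The gate totals are assumed from the rows of the STATA command later on this page (10+40, 20+100, 5+50 and 15+60). Only Gate F can be matched to a row from the text, so the other labels are placeholders of my own.

```python
from fractions import Fraction

# Visitor counts per gate (assumed: row totals of the STATA command on this page).
# Only "Gate F" is identified in the text; the other labels are placeholders.
gate_counts = {
    "Gate F": 50,
    "Gate (row 2)": 120,
    "Gate (row 3)": 55,
    "Gate (row 4)": 75,
}

total = sum(gate_counts.values())


def proportions(counts):
    """Return {level: (fraction, decimal)} for a dict of counts."""
    n = sum(counts.values())
    return {level: (Fraction(c, n), c / n) for level, c in counts.items()}


table = proportions(gate_counts)
for gate, (frac, dec) in table.items():
    print(f"{gate:<14}{gate_counts[gate]:>5}{str(frac):>8}{dec:>8.3f}")
print(f"{'Total':<14}{total:>5}")
```

The fractions are reduced automatically, so Gate F's 50 out of 300 visitors is reported as 1/6.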




So the question we want to ask is “Is Duke Forest visitation (counts) at the various gates dependent on whether it is a weekday or weekend?”  Perhaps some gates are much more popular on the weekends (because they have a shelter, for example), or some gates might have greater visitation during the week because they are closer to the population centers of Durham or Chapel Hill.  This question is different from asking “Are there more visitors during the week vs. the weekend?”  Here we are asking whether the distribution of visits across the gates differs between weekdays and weekends.

To answer this question, we will use the chi-square distribution (χ² distribution).  The chi-square statistic compares the OBSERVED frequencies to the EXPECTED frequencies (the counts we would expect if the two variables were unrelated) to determine if there is an association between the two variables.


But before we get to the details of the chi-square distribution, we need (as always) to establish our hypotheses:


Ho:  Duke Forest visitation across the gates does NOT depend on whether it is a weekday/weekend.


Ha:  Duke Forest visitation across the gates DEPENDS on whether it is a weekday/weekend.


We cannot be more specific than this in establishing our hypotheses, as the chi-square statistic cannot test the direction of an association.  The chi-square statistic takes the following form:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ (fo) is the observed frequency (count) and $f_e$ (fe) is the expected frequency (count), and the sum runs over all of the cells in the table (in this case, every combination of gate and weekday/weekend).  Similar to the t-distribution, the chi-square distribution is a family of distributions, indexed by the degrees of freedom.  The degrees of freedom are calculated as the number of rows in the two-way table minus one, multiplied by the number of columns minus one: (r-1)(c-1).  In our case, the degrees of freedom equal (4-1)(2-1) = 3.  Unlike the t and z, however, the chi-square test gives us only a one-sided (upper-tail) p-value.  The p-value is P(χ² ≥ X²), where X² is the chi-square value calculated from our sample and χ² follows the chi-square distribution with (r-1)(c-1) degrees of freedom.
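To make the degrees of freedom and the upper-tail p-value concrete, here is a minimal Python sketch using only the standard library. The helper name chi2_sf is my own (it is not a library function), and it relies on closed-form sums that hold only for whole-number degrees of freedom. As a sanity check it is evaluated at 7.815, the usual tabled 0.05 critical value for df = 3.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with `df` degrees of freedom >= x).

    Hypothetical helper: uses the finite sums available for integer df.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df = 2m+1:
    # erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{m} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total


rows, cols = 4, 2                 # four gates, weekday vs. weekend
df = (rows - 1) * (cols - 1)      # (r-1)(c-1)
p_crit = chi2_sf(7.815, df)       # tail area beyond the tabled critical value
print(df, round(p_crit, 3))
```

Any chi-square value smaller than 7.815 gives a tail area larger than 0.05 when df = 3, because the upper-tail probability shrinks as the statistic grows.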


So now we need to calculate the expected frequency (or cell count) for each cell under the null hypothesis (independence or no association).  The expected count for a cell is its row total multiplied by its column total, divided by the total n.  So, for example, the expected frequency of Gate F on a weekday (if weekday and visitation are independent) is (50*50)/300 = 8.33, because Gate F had 50 visitors in total and 50 of the 300 visits took place on weekdays.  We need to do that calculation for each of the cells.  After doing so, we get a table of expected frequencies that looks like this:
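The row-total-times-column-total rule can be sketched in a few lines of Python. The observed counts are taken from the STATA command on this page; I am assuming rows are gates (row 1 being Gate F, the only row with a total of 50) and columns are weekday then weekend. The function name expected_counts is my own.

```python
# Observed counts from the STATA command on this page.
# Assumed layout: rows are gates (row 1 = Gate F), columns are [weekday, weekend].
observed = [
    [10, 40],
    [20, 100],
    [5, 50],
    [15, 60],
]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(cell, 2) for cell in row])
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick check that the calculation was done correctly.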



As you can see, the total row and column stay constant, but the expected frequencies (or counts) shifted a bit for Gates F, 23 and 26.  We can now use the chi-square formula stated above to calculate the X2 and its associated p-value.  The table below walks you through the calculations.  The first column denotes the cells (gate and weekend/weekday), while Observed (fo) lists the observed counts of visitors and Expected (fe) lists the expected counts under the null hypothesis (independence).  I then subtracted fe from fo and squared the difference.  The final column divides that squared difference by the Expected (fe).  We sum the final column to get the chi-square value (3.277, df=3).  We can then look at a chi-square distribution table (or use a chi-square calculator) to get a p-value of approximately 0.35.  A p-value this large provides no evidence against the null hypothesis, which states that visitation across gates does NOT depend on whether it’s a weekend or weekday.
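The same walk-through can be reproduced cell by cell in Python. This is a sketch under the same assumptions as before: counts come from the STATA command on this page, with rows as gates and columns as weekday then weekend. Because it keeps the expected counts unrounded, the sum comes out at about 3.27 rather than 3.277; a gap that small is consistent with rounding the expected counts to two decimals before summing.

```python
import math

# Assumed layout: rows are gates, columns are [weekday, weekend].
observed = [[10, 40], [20, 100], [5, 50], [15, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
print(f"{'cell':<8}{'fo':>6}{'fe':>9}{'(fo-fe)^2':>12}{'/fe':>9}")
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n   # expected count under independence
        squared_diff = (fo - fe) ** 2
        contribution = squared_diff / fe         # this cell's share of chi-square
        chi_square += contribution
        label = f"r{i + 1}c{j + 1}"
        print(f"{label:<8}{fo:>6}{fe:>9.2f}{squared_diff:>12.3f}{contribution:>9.3f}")

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail p-value. This closed form is valid for df = 3 only.
p_value = (math.erfc(math.sqrt(chi_square / 2))
           + math.sqrt(2 * chi_square / math.pi) * math.exp(-chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.2f}")
```

The two cells in the second row contribute nothing to the total, since their observed and expected counts are identical.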



Chi-Square in STATA


To run a test of association/test of dependence on nominal data in STATA use the tabi command, followed by the aggregate data of the observed frequencies.


STATA CODE:  tabi 10 40 \ 20 100 \ 5 50 \ 15 60, chi2


We separate the rows of data with a backslash, and the chi2 option asks STATA to report the Pearson chi-square statistic.  Here is the STATA output (and it matches our calculations in Excel, AWESOME!).



Transportation Survey Chi-Square Test of Dependence

For fun, I decided to run a chi-square test of dependence on the data collected in a Transportation Survey that was conducted on this website in Fall 2013.  The variable ‘driving’ is whether the individual drove yesterday.  I set up the following hypotheses:


Ho:  Driving is NOT dependent on country location.


Ha:  Driving IS dependent on country location.


The tabulated data:


[Table: observed frequencies]

I then ran the chi-square test in STATA, which calculated a chi-square value of 59.67 with df=24 and reported a p-value of 0.0000 (that is, a very small value, well below 0.001).  We have very strong evidence against the null hypothesis and can conclude with a high level of confidence that the variable driving is dependent on country location.  Ideally we would want expected frequencies greater than five in each of the cells, but for now, we won’t worry about that.  To reach that goal, we could sum counts across continents and run a chi-square test across continents instead of countries.  I will leave that for you to do!
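Here is a rough Python sketch of the two steps just mentioned: flagging cells whose expected frequency falls below five, and summing rows into broader groups (as you would when moving from countries to continents). The survey counts are not reproduced on this page, so the example runs on the Duke Forest counts instead. The function names and the group labels "A" and "B" are made up purely for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def small_expected_cells(table, threshold=5):
    """List (row, column, expected) for cells with expected count below threshold."""
    return [
        (i, j, fe)
        for i, row in enumerate(expected_counts(table))
        for j, fe in enumerate(row)
        if fe < threshold
    ]


def collapse_rows(table, groups):
    """Add together rows that share a group label (e.g. countries -> continents)."""
    merged = {}
    for row, group in zip(table, groups):
        current = merged.setdefault(group, [0] * len(row))
        merged[group] = [a + b for a, b in zip(current, row)]
    return merged


# Duke Forest counts (rows: gates; columns: weekday, weekend), used as a stand-in.
gates = [[10, 40], [20, 100], [5, 50], [15, 60]]
print(small_expected_cells(gates))   # empty list: every expected count is at least 5

# Illustrative grouping only: the labels "A" and "B" are hypothetical.
collapsed = collapse_rows(gates, ["A", "A", "B", "B"])
print(collapsed)
```

After collapsing, the chi-square test is run on the smaller table, and the degrees of freedom shrink with it because there are fewer rows.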





This page was developed by Elizabeth A. Albright, PhD of the Nicholas School of the Environment, Duke University.





Photo credit: Donna Sell, Nicholas School of the Environment

