Statistical Analysis of Categorized Frequency Data
This article discusses the analysis of frequency data that can fall into more than two categories. It also highlights the logic of the chi-square test and its limitations in inferential statistical analysis.
In social science or in physical science, data can often be categorized into two mutually exclusive groups. For example, the population of a particular area in a study can be categorized as male or female, that is, on the basis of gender. It is also common in medical studies for patients to be categorized into several groups after a treatment, say improved, no change, and worsened. In such situations, where there are more than two categories in a one-dimensional table or a two-dimensional distribution, the binomial probability distribution cannot be used to test whether the differences among the observed frequencies are due to mere chance at the accepted level of significance for rejecting the null hypothesis. In any statistical testing method, the choice of test and the requisite sampling distribution must be compatible with the characteristics of the data set. Here the chi-square test is the most appropriate test, provided its limitations are kept in mind and it is used wisely: a non-significant result should not be taken as grounds for accepting the null hypothesis.
In this paper I will discuss in detail how the test is carried out and how to interpret the results, inferring from an observed sample to the population given the specific chi-square sampling distribution and the degrees of freedom (or arbitrariness).
The Measure of Chi-square
The chi-square measure is a non-directional measure of difference: for each category, the difference between the observed and the expected frequency is squared and divided by the expected frequency, and these terms are added together. That is, chi-square measures how far the observed frequencies differ from the expected ones. Given the specific chi-square sampling distribution, it enables a researcher to determine whether the calculated chi-square is greater or less than the critical chi-square at the specific degrees of freedom (or arbitrariness).
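The definition above can be written compactly as a formula. The symbols are my own notation rather than the article's: O_i is the observed frequency in cell i, E_i is the expected frequency in that cell, and k is the number of cells in the table.

```latex
\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]
```

Because each term is a square divided by a positive expected frequency, no term can be negative, which is why the measure is non-directional.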
The process and logic of the statistical chi-square test and statistical inference
The logic of the chi-square test is to determine, for a one-dimensional or two-dimensional frequency data set, whether the differences among the observed sample frequencies are due to mere chance or are significant, that is, due to the influence of the independent variable under investigation. The chi-square measure is a test statistic that measures the size of the differences relative to the expected frequencies. It is non-directional because the square of a difference is never negative, whether the difference itself is positive or negative. That is, the chi-square test is normally a non-directional test, given the nature of its test statistic. Then, using the probability distribution of chi-square for the given degrees of freedom, one can obtain the critical chi-square at a particular level of significance, that is, the level of error accepted by the study. Normally this is the 0.05 level of significance.
It is important that the researcher understands the logic of the chi-square test, its limitations, and the influence of sample size, because the sample size considerably affects the calculated chi-square and therefore how confidently the null hypothesis can be rejected. Equally, the researcher should not accept the null hypothesis merely because the chi-square is non-significant. The absence of evidence for a difference is not, in all situations, evidence that there is no difference: even when real differences exist, the chi-square may in some scenarios come out non-significant. Treating such negative evidence as support for the null hypothesis may lead the researcher to a wrong conclusion.
The chi-square test can be demonstrated by an example as follows:
Say a researcher has categorized patients into three categories (improved, no improvement, and worsened) under a therapy compared with no therapy. In this study 300 patients were chosen at random, and the frequencies observed in each category, which are independent of one another, were recorded as follows:
The observed numbers of worsened, no-improvement, and improved patients in each group are as follows:
Under treatment
Total patients = 120: worsened = 48, no improvement = 22, improved = 50
No treatment
Total patients = 180: worsened = 60, no improvement = 62, improved = 58
The first step is to calculate the expected frequencies by applying the overall proportion of patients in each category to each group's total. Across both groups there are 108 worsened, 84 no-improvement, and 108 improved patients out of 300.
Expected frequencies under treatment:
Worsened = 120*108/300 = 43.2, no improvement = 120*84/300 = 33.6, improved = 120*108/300 = 43.2
Expected frequencies under no treatment:
Worsened = 180*108/300 = 64.8, no improvement = 180*84/300 = 50.4, improved = 180*108/300 = 64.8
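The expected frequencies can also be computed programmatically. The following is a minimal sketch in plain Python, not part of the original study. The function name `expected_frequencies` and the dictionary layout are assumptions made for illustration. Each expected count is the group total multiplied by the category total and divided by the grand total.

```python
# Observed frequencies from the example: outer keys are groups, inner keys are outcomes.
observed = {
    "treatment":    {"worsened": 48, "no improvement": 22, "improved": 50},
    "no treatment": {"worsened": 60, "no improvement": 62, "improved": 58},
}


def expected_frequencies(table):
    """Return expected counts: group total * category total / grand total.

    Hypothetical helper written for this illustration.
    """
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / grand_total for col in cells}
        for row, cells in table.items()
    }


expected = expected_frequencies(observed)
for row, cells in expected.items():
    print(row, {col: round(value, 1) for col, value in cells.items()})
```

The expected counts in each group add up to that group's observed total (120 and 180), which is a quick check that the proportions were applied correctly.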
The second step is to calculate the chi-square for the categorized frequency data set as follows:
Calculated chi-square = (48-43.2)^2/43.2 + (22-33.6)^2/33.6 + (50-43.2)^2/43.2 + (60-64.8)^2/64.8 + (62-50.4)^2/50.4 + (58-64.8)^2/64.8 = 9.35
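The same arithmetic can be checked with a short script. This is a sketch under the assumption that the table is stored as a list of rows (groups) with one column per outcome. The function name `chi_square_statistic` is hypothetical.

```python
# Observed table: rows are treatment / no treatment,
# columns are worsened / no improvement / improved.
observed = [
    [48, 22, 50],
    [60, 62, 58],
]


def chi_square_statistic(table):
    """Sum (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic


chi_square = chi_square_statistic(observed)
print(round(chi_square, 2))  # 9.35
```

The largest contributions come from the two no-improvement cells, where the observed counts (22 and 62) are furthest from the expected counts (33.6 and 50.4).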
The third step is to calculate the degrees of freedom for the frequency table, which is (number of categories - 1) * (number of groups - 1). For this example the degrees of freedom = (3-1)*(2-1) = 2.
The fourth step is to find the critical chi-square from the chi-square table, using the chi-square sampling distribution at the 0.05 level of significance for non-directional testing. In this instance, with 2 degrees of freedom, the critical chi-square at the 0.05 level is 5.99. Therefore the calculated chi-square, or test statistic, of 9.35 is greater than the critical chi-square at the 0.05 level of significance.
The fifth step is the crucial one. As mentioned above, if the evidence from the test is positive, that is, if the calculated chi-square is greater than the critical chi-square at the chosen level of significance, the null hypothesis can be rejected. However, if the result is non-significant, one must be careful about accepting the null hypothesis, because the chi-square may come out non-significant for some sets of frequencies even though real differences exist, and accepting the null hypothesis then leads to wrong conclusions. In this example the evidence is positive: the differences are significant at the 0.05 level. We therefore infer that the differences are not due to mere chance but to the influence of the specific therapy.
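The table lookup for this example can be reproduced without a statistics library. The sketch below relies on a special case, not a general method: with exactly 2 degrees of freedom, the probability of a chi-square larger than x is exp(-x/2), so the critical value is -2 * ln(alpha). The function names are hypothetical. For other degrees of freedom a chi-square table or a statistics package would be needed.

```python
import math


def degrees_of_freedom(n_groups, n_categories):
    """Degrees of freedom for a two-dimensional frequency table."""
    return (n_groups - 1) * (n_categories - 1)


def critical_value_df2(alpha):
    """Critical chi-square for exactly 2 degrees of freedom.

    Valid only for df = 2, where P(chi-square > x) = exp(-x / 2).
    """
    return -2.0 * math.log(alpha)


def p_value_df2(statistic):
    """Upper-tail probability of the statistic, again only for df = 2."""
    return math.exp(-statistic / 2.0)


df = degrees_of_freedom(n_groups=2, n_categories=3)
calculated = 9.35  # chi-square from the second step
critical = critical_value_df2(0.05)
p_value = p_value_df2(calculated)
reject_null = calculated > critical

print(df, round(critical, 2), reject_null)  # 2 5.99 True
```

Comparing the calculated value with the critical value and comparing the p-value with 0.05 are two forms of the same decision, so the two always agree.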
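The caution about non-significant results can be illustrated with the example's own numbers. The sketch below uses a hypothetical smaller study, invented purely for illustration, in which every observed count is halved so that the proportions are identical but only 150 patients are observed. The critical value uses the closed form -2 * ln(0.05), which holds only for 2 degrees of freedom.

```python
import math


def chi_square_statistic(table):
    """Sum (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / grand_total) ** 2
        / (row_totals[i] * col_totals[j] / grand_total)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


full_study = [[48, 22, 50], [60, 62, 58]]
# Hypothetical data: the same proportions with half as many patients per cell.
half_study = [[count // 2 for count in row] for row in full_study]

critical = -2.0 * math.log(0.05)  # df = 2 only, about 5.99
full_chi = chi_square_statistic(full_study)
half_chi = chi_square_statistic(half_study)

print(round(full_chi, 2), full_chi > critical)
print(round(half_chi, 2), half_chi > critical)
```

Halving every count halves the chi-square, so the smaller study falls below the critical value even though its proportions are exactly the same. A non-significant result here reflects the sample size, not the absence of a difference.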
Limitations of Chi-square test
The chi-square test is not applicable if the categories are not independent and mutually exclusive. The sample size must also be large enough for the chi-square to support meaningful inferences about the differences between observed and expected frequencies. In addition, one must not accept the null hypothesis on the basis of non-significance alone, since this can lead to misleading conclusions, as explained above.
Conclusion
As discussed above, the chi-square test is an important method of statistical inference for analyzing a categorized frequency sample data set and for deciding whether the differences between the observed and the expected frequencies in the different categories are due to chance at a particular level of significance. It is most appropriate where the data set has more than two categories, and it can be used on one-dimensional or two-dimensional categorized frequency sample data to infer whether a significant difference exists in the population. However, one must be careful about accepting the null hypothesis when the analysis shows that the differences are not significant. A real difference may still exist, and accepting the null hypothesis on such negative evidence can lead to wrong statistical inferences and may mislead the researcher. In addition, the chi-square test is not applicable when the categories are not independent and mutually exclusive, or when the observed numbers are quite small.