BIOL 3110

Biostatistics

Phil Ganter

301 Harned Hall

963-5782

Categorical Data 1

Chapter 9 (4th ed.) or 5.2, 6.6, and 10.1 (3rd ed.)

Unit Organization:

Problems:

  • Problems for homework
    • 3rd edition - 5.1, 5.4 (modified), 5.5 (modified), 6.36, 6.40, 10.3, 10.6
    • 4th edition - 9.1.1, 9.1.4, 9.1.5, 9.2.1, 9.2.5, 9.4.3, 9.4.6
  • Suggested Problems
    • 3rd edition - Try additional problems in the section where the required problems are found.
    • 4th edition - There aren't that many problems in this chapter, so all of the remaining problems are recommended.

Sampling Distribution from Categorical Data

We have discussed the sampling distribution of the mean in the previous chapters.   When we measure some attribute of an experimental or observational unit, we can use a mean to describe the central tendency of the measurements. 

  • Sample mean (ȳ) is the statistic we used to infer something about the true population mean (μ, the parameter)
  • Sample means are distributed normally if either the population from which they are drawn is distributed normally or if the samples are large (Central Limit Theorem)

However, not all data can be described by the mean.  Categorical data are described as proportions of the total sample, i.e. frequencies.

So, what then is the distribution of sample proportions? 

  • First, let's define the parameter and statistic for proportional data.  The true population proportion is p in the book.  This is a bit of a departure and I wonder why they have not stuck to the older practice of using a Greek letter for the true proportion of the category in the population.
  • We will stick to the textbook's usage and call the true population proportion p (the parameter); the proportion of a category in a sample from that population is p̂ ("p-hat")
    • We normally estimate the proportion of successes (p̂) as the number of successes (x) divided by the number of trials (n), so if there were 3 successes out of 10 trials then:

p̂  =  3/10  =  0.30

  • However, a correction, called the Wilson adjustment, is necessary
    • To distinguish between the adjusted and non-adjusted calculations of the proportion of successes, we will designate the adjusted proportion as p̃ ("p-tilde"), which is calculated from the same x and n as:

p̃  =  (x + 2)/(n + 4)

      • Now (using the example above) it's p̃ = (3 + 2)/(10 + 4) = 5/14 = 0.357
    • Why use this adjustment?  It is an outcome of the fact that proportions are bounded (p can't be smaller than 0 or larger than 1)
      • The probability distribution of p̂ "piles up" at the boundaries, i.e. the distribution of a particular p̂ is not symmetrical except for one special case
        • Consider three proportions, p = 0.05, 0.50, or 0.90.
        • The asymmetry means that, for p = 0.05, the chance that p̂ will be between 0 and 0.05 is greater than the chance that it will be between 0.05 and 0.10 - a point just as far from 0.05 as 0 is
        • At the upper end, for p = 0.90, the chance that p̂ will be between 0.90 and 1.00 is greater than the chance of it being between 0.80 and 0.90, once again a point as far from 0.90 as 1.00 is
        • The special, symmetrical case is the middle proportion mentioned above.  When p = 0.50, the distribution is symmetric
      • Thus, we need an adjustment that takes this asymmetry into account, and the Wilson adjustment will always move p̃ closer to 0.50 than p̂ is
    • Note that as n and x get larger, p̃ approaches the value of p̂, which is reasonable as less of a correction is needed (a quick numerical sketch follows)
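
A minimal Python sketch of the two calculations (the function names are mine; the counts are the 3-successes-in-10-trials example from above):

    def p_hat(x, n):
        """Ordinary sample proportion: successes / trials."""
        return x / n

    def p_tilde(x, n):
        """Wilson-adjusted proportion: add 2 successes and 4 trials."""
        return (x + 2) / (n + 4)

    print(p_hat(3, 10))        # 0.30
    print(p_tilde(3, 10))      # 0.357..., pulled toward 0.5

    # As x and n grow, the adjustment matters less and less
    print(p_hat(300, 1000))    # 0.3
    print(p_tilde(300, 1000))  # 0.3008...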

So, what is the distribution of p̂ (or p̃), given that the true proportion is p?  Once again, a metaexperiment is where to begin.

  • Each experiment involves a series of observations, each one with a set number of possible outcomes (categories), so the data are summarized as the frequency (how many) or the proportion of observations that belong in each category.
    • The results of an experiment are the frequency of each category and may be expressed as proportions (which, according to our frequency definition of probability, are also estimates of the probabilities of the outcomes)
    • Each experiment produces an estimate of the true probability (p), which is p̂ (p-hat)
  • Repeat the experiment time and time again, each time producing another p̂
  • The distribution of the p̂s is graphed as a histogram with p̂ on the x-axis and the frequency of each p̂ as the y-axis.  The question before us is this:  what shape will that histogram take?  Can we predict the distribution of the p̂s?
    • We can.  It is the binomial, a distribution we already know.  To see why the binomial is appropriate, let's look at a simple example
      • We need an experiment with categorical data.  How about sampling a population and assessing the genotype of individuals at a single locus with only two possible alleles (let's keep it simple)?  Thus, we have three categories (HH, Hh, and hh) and a proportion of the total number of individuals in each category
      • Now, consider any one of the genotypes.  How would we expect the sample proportions, the p̂s, to be distributed?
      • If there are n individuals in the sample, p̂ might be any one of n + 1 discrete values (the binomial distribution is a discrete distribution):  0, 1/n, 2/n, 3/n, up to n/n.  Each one of these proportions is one of the possible outcomes of the sample (the experiment) and these are the only possible outcomes.
      • Before, the x-axis for a graph of the binomial distribution was j, the number of successes, and the range was from 0 successes to n successes
      • We can calculate the probability of each of the possible outcomes, the possible values for p̂, if we know p (the actual proportion of that genotype in the population), because the experiment (our sample) gives us n and each possible outcome gives us a j (the j's are the numerators of the possible outcomes -- remember, to calculate the binomial, we need p, n, and j)
    • Note that the binomial applies, even though there are more than two categories of outcomes.
      • When considering any one category, an individual that belongs in that category is a success and individuals belonging to all the other categories can be lumped together as failures, which makes the outcome either a failure or a success, hence the binomial distribution can be used
  • If p̂ is the estimate of p, then how close can we expect p̂ to be to p?  How good is the estimate?  This depends on the number of events in each experiment (n)
    • Note that only certain estimates of p are possible. 
      • If n = 4, then we can estimate p as 0, 0.25, 0.5, 0.75, and 1.0 only
      • If n = 5, then we can estimate p as 0, 0.2, 0.4, 0.6, 0.8, and 1.0 only
    • If p = 0.3 and n = 4, then the p̂s nearest p are 0.25 and 0.5; we can never get 0.3 exactly.  The sampling distribution of p̂ will peak at 0.25 (because it is the possible value closest to p).
So, the most probable p̂ turns out to be the possible value closest to p, the true probability of success (see the sketch below)
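
Here is a short sketch of that sampling distribution for the n = 4, p = 0.3 example, computed from the binomial probabilities (scipy's binom.pmf; the loop variable j counts successes):

    from scipy import stats

    n, p = 4, 0.3   # sample size and true proportion from the example

    # The only possible p-hats are j/n for j = 0..n; their probabilities
    # come straight from the binomial distribution with parameters n and p
    for j in range(n + 1):
        print(f"p-hat = {j / n:.2f}   probability = {stats.binom.pmf(j, n, p):.4f}")

The probabilities peak at j = 1 (p̂ = 0.25), the possible value closest to p = 0.3.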

Confidence Interval for a population proportion

The confidence interval for p will need an estimate of p and an estimate of the standard error of that estimate, both of which are presented below, plus a decision about which distribution to take the multiplier from (the normal or the t).

  • Remember, our estimate of p is not simply the frequency of an event over the size of the sample (p̂ = x/n), but the Wilson-adjusted estimate (p̃).
    • If n is large, the correction factor does not change the outcome much
  • The SE of p̃ is based on the standard error done before (see the book about using the normal to approximate the binomial):

SE(p̃)  =  √( p̃(1 − p̃) / (n + 4) )

  • The last portion is what to do about the probability distribution of the proportion.  We will use the normal, not the t-distribution, so we will estimate the 95% interval as p̃ ± 1.96 × SE(p̃) (1.96 comes from the normal), but you can substitute the appropriate z-value if you want a different level of confidence (see the sketch below)
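
Putting the pieces together, a sketch of the Wilson-adjusted interval (the function name is mine; the formula is the one given above):

    from math import sqrt

    def wilson_ci(x, n, z=1.96):
        """CI for a population proportion using the Wilson adjustment.
        z = 1.96 gives 95%; substitute another z for a different level."""
        pt = (x + 2) / (n + 4)
        se = sqrt(pt * (1 - pt) / (n + 4))
        return pt - z * se, pt + z * se

    low, high = wilson_ci(3, 10)   # the 3-successes-in-10-trials example
    print(f"95% CI: ({low:.3f}, {high:.3f})")   # roughly (0.106, 0.608)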

Test for Goodness-of-Fit

A goodness-of-fit test asks whether or not the data conform to some prior expectation for those data.

  • The prior expectation can come from a model or from previous experience or can be a null expectation.
    • As an example of a model that we all know, Hardy-Weinberg predicts that one should find p² of AA, 2pq of Aa, and q² of aa individuals in a population, if p and q are the frequencies of A and a, respectively.
      • One might ask if the actual frequencies of genotypes from a population agree with Hardy-Weinberg expectations.
      • This is a good case of prior expectations coming from a model.
    • If you have taken Ecology, then you have used a model, the Poisson distribution, to predict the distribution of plots with a particular number of trees. This is another example of using a model to predict proportions of outcomes within each category.
    • From my own work, I found that some populations of a pillbug (Armadillidium vulgare) had 50% females and other populations had 85% females.  If I sample a new population and find 60% females, does this proportion agree with either the 50% or the 85% expectation?
      • This is an example of prior expectations coming from prior experience.
    • There are three morphs of a water flea (Daphnia). An experimenter puts equal numbers of each morph into a tank and then lets a predator (a fish) prey on them for a standard length of time. The null hypothesis of no difference in predation rate would lead to an unchanged proportion of each prey morph after being preyed upon by the fish.
      • This is an example of a null expectation.
  • The way to test for goodness-of-fit is to use a chi-square (χ²) test.
    • The test is based on measuring the deviation of the data from the expected outcome.
      • First, you must calculate the expected outcome, which is dependent on the circumstances of the test.
      • Then you subtract the expected outcomes from the observed outcomes.
        • If you were to sum these deviations, they would sum to 0 (this is an outcome of the fact that both the observed and expected columns must sum to n, the number of observations).
        • To avoid this, we have to get rid of the - sign somehow.
        • We could take the absolute value, which we will not use, or we could square all of the deviations, so that negative values become positive.
      • Square all deviations.
        • There is another problem. The size of the sum of deviations will depend on the size of n. We partially correct this by standardizing the deviation.
      • Divide each deviation by the expected value used to calculate that deviation.
      • Sum all standardized deviations
    • This sum is the χ² value:

χ²  =  Σ (oᵢ − eᵢ)² / eᵢ ,  summed from i = 1 to c

  • where oᵢ = the number of observations in category i, eᵢ = the expected number of observations in category i, i is the index, and c is the number of categories.
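
A minimal sketch of the arithmetic, using hypothetical Hardy-Weinberg counts invented only for illustration:

    # Hypothetical observed genotype counts and their expected counts
    # (expected from p = q = 0.5 and n = 100 individuals)
    observed = [30, 50, 20]   # AA, Aa, aa
    expected = [25, 50, 25]

    # chi-square: sum the squared deviations, each divided by its expected value
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(chi2)   # 25/25 + 0 + 25/25 = 2.0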

Evaluation of the χ² value

  • First, select an α-value.  This is a necessary step for any statistical test.
  • The χ² values are not normally distributed.
  • The distribution begins at 0 (the outcome when all observed are the same as expected values).
    • The right side is skewed so that the right tail extends to infinity.
    • The exact shape depends on the number of categories and the assumption that the deviations between the expected and observed values are due to random error only.
  • Think for a moment about how the χ² value is calculated and the distribution will make some sense.
    • As the χ² value increases, it means that the gap between the observed and expected values is increasing.
      • If the gap is due to random error only, then really large values of χ² must be uncommon (low probability).
    • The tail to the right represents the probability of getting a χ² value equal to or larger than the calculated χ² value.
      • Thus, the tail probability is a measure of how unlikely a χ² value as large as the calculated value is, a measure of how likely your χ² value is given that it is due to random error alone.
      • If this probability is too low, then you are forced to reject the idea that it is due to random error alone and accept that something else is contributing to the gap between observed and expected values.
      • You are forced to conclude that your expected values are not the correct values and your model is incorrect.
  • The null and alternative hypotheses are:
    • H0 : The χ² value is due to random error and the expected values are an accurate prediction of the observed values
    • HA : The χ² value is too large to be due to random error alone and the expected values are not an accurate prediction of the observed values
  • The question you must answer is "How unlikely is my χ² value?"  This means you must look up the probability of getting a χ² value as large or larger than the value for your data in the table in the back of the book (or compute it directly, as in the sketch after this list).
    • d. f. = the number of categories - 1
    • NOTE THAT THIS IS NOT THE NUMBER OF OBSERVATIONS - 1
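
In place of the table lookup, a sketch that computes the tail probability directly, continuing the hypothetical Hardy-Weinberg example:

    from scipy import stats

    chi2 = 2.0    # from the example above
    df = 3 - 1    # categories - 1, NOT observations - 1

    p_value = stats.chi2.sf(chi2, df)   # P(chi-square >= 2.0 with 2 d.f.)
    print(p_value)   # about 0.368; well above alpha = 0.05, so do not reject H0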

Nondirectional vs. Directional

  • Goodness-of-fit null hypotheses are called COMPOUND NULL HYPOTHESES because each of the expected values (except the last, which is fixed by the values of the others since the total number of observations is fixed) is an independent hypothesis.
    • This fact means that the test must be non-directional, as the deviation for each of the independent nulls can be either + or -, so no overall direction is necessarily true for the entire test. We just know the observed didn't fit the expected very well.
  • The exception to this is when there are just two outcome categories (a dichotomous variable).  Here, since only one of the expected values is free to vary, directionality is possible.  For an example, see the 2 x 2 contingency table below.
    • Reject the null if the deviation is in the direction specified by HA and your χ² is larger than the Table 9 entry for the appropriate d. f.; the directional P-value is the non-directional value divided in half.
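
A sketch of this dichotomous, directional case (the germination counts and the direction of HA are hypothetical):

    from scipy import stats

    # H0: half the seeds germinate; HA (directional): more than half germinate
    observed = [32, 18]   # germinated, not germinated (n = 50)
    expected = [25, 25]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # 3.92
    p_nondirectional = stats.chi2.sf(chi2, df=1)   # df = 2 categories - 1

    # Halve the P-value only when the deviation is in the direction HA specifies
    if observed[0] > expected[0]:
        print(p_nondirectional / 2)   # directional P-value, about 0.024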

Last updated November 12, 2012