BIOL 3110

Biostatistics

Phil

Ganter

301 Harned Hall

963-5782

Analysis of Categorical Data II

Chapter 10 (4th ed.) or Chapter 10 (all but 10.1, 3rd ed.)

Unit Organization:

Problems:

Problems for homework

  • 3rd Edition - 10.13, 10.17, 10.21, 10.29, 10.31, 10.40, 10.48, 10.52, 10.55, 10.65, 10.68, 10.70, 10.74, 10.88
  • 4th edition - 10.2.1, 10.2.4, 10.2.8, 10.3.4, 10.4.1, 10.5.2, 10.5.6, 10.6.2, 10.7.3, 10.8.3, 10.9.5

Suggested Problems

  • 3rd Edition - 40 (Fisher's), 42 (Fisher's)
  • 4th edition - 10.4.3

Go back to the first lecture and brush up on data types.

  • In this lecture, we will cover analysis of categorical data.
  • Categorical data is data that is sorted into different qualitative categories, not by a measured value.
    • Sex, color, and genotype are examples of categorical data. Each observation fits into one category (male or female; red, green or blue; AA, Aa, or aa).

2 x 2 Contingency Tables

Comparing associations among factors

  • Members of a population can often be categorized based on more than one factor (usually many factors).  Every TSU student can be categorized by matriculation status (matriculated full-time, matriculated not full-time, non-matriculated) or by age or by residency or by sex or.... etc.  One common type of question we ask is:
    • Is there an association between (or among) factors?  Note that factor here means the same as variable.
      • Are males more likely to smoke than females?
      • Does level of unsaturated fat in the diet correspond to the risk of heart attack?
  • To answer these questions, we have to be more exact when we ask the question.  We are really asking if one factor is independent of the other
    • Independence here is the independence we have already discussed - that one factor does not influence the other
    • If we couch the question in terms of probability, then we can develop methods to answer the question based on what we know about probability
      • Is the probability that males smoke greater than the probability that females smoke?
      • Is the probability that those who eat high-fat diets have heart attacks greater than the probability that those who eat low-fat diets have heart attacks?
  • CONTINGENCY TABLES are tables where one can contrast one factor/variable (the columns) versus a second factor/variable (the rows).
    • They are called contingency tables because we are investigating if one variable's outcome is contingent (= dependent) on the other variable.
  • A 2 X 2 CONTINGENCY TABLE (said 2-by-2) is the simplest contingency table where each variable has only two possible outcomes.
    • Each combination of the two variables gets a CELL of its own.
  • To test whether or not one variable is affecting the other, we need to have an idea of what to expect if there is no association between the variables. This means we need an expected value that we can calculate.
    • The null hypothesis for a contingency table is that the proportion of observations falling in each category of variable A is independent of which category of variable B we are looking at.
    • In terms of the example above, the null hypothesis is:

    H0:  Pr{smoking|male} = Pr{smoking|female}

    • In English: the null hypothesis is that the probability of smoking given that someone is a male is equal to the probability of smoking given that someone is a female
  • How do we estimate these probabilities?
  • We do this by using the MARGINAL TOTALS, the totals at the margins of the table below:
                    Variable A
Variable B     Category A1   Category A2   Margin
Category B1        21             4           25
Category B2         8            32           40
Margin             29            36           65 (Total #)
    • The marginal totals for Variable B (25 and 40) divided by the grand total (65) give us our estimate of the frequency of categories B1 and B2 independent of Variable A.
    • The marginal totals for Variable A (29 and 36) divided by the grand total (65) give us our estimate of the frequency of categories A1 and A2 independent of Variable B.
    • So we can use these marginal frequencies to get the expected proportion of observations in each of the four cells:
      • (29/65) * (25/65) = 0.17
      • (29/65) * (40/65) = 0.27
      • (36/65) * (25/65) = 0.21
      • (36/65) * (40/65) = 0.34
  • These are our expected probabilities (remember the connection between frequency and probability) for all four outcomes GIVEN THAT VARIABLE A AND B ARE INDEPENDENT!
    • Why is this a given?  Because we calculated each probability as the product of two independent probabilities - this is the null assumption
  • To do the test, we need to go beyond calculating the expected probability of being in each cell.  We need to calculate expected number of observations in each cell, given the null assumption.  This is easy to do because we already have the expected probabilities
    • The expected outcome then are (simply multiply the proportions by the total, 65):
Expected Outcomes
                    Variable A
Variable B     Category A1   Category A2
Category B1       11.15         13.85
Category B2       17.85         22.15
Total                                       65.00
  • You now have a set of observations and a set of expectations from which to calculate a χ² value (the sum over all cells of (observed - expected)^2/expected). In this case it is 25.5.
      • The χ² distribution tells us the probability of getting a χ² value as large as we actually did IF THE NULL HYPOTHESIS WERE TRUE.
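
The steps above (marginal totals, expected counts, χ² value) can be sketched in Python. This is only an illustration using the standard library; the function names `expected_counts` and `chi_square` are made up for this sketch and are not from the textbook.

```python
# Observed 2 x 2 table from the example: rows are B1, B2; columns are A1, A2.
observed = [[21, 4],
            [8, 32]]

def expected_counts(table):
    """Expected cell counts under independence:
    (row marginal total * column marginal total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)
chi2 = chi_square(observed)
print([[round(e, 2) for e in row] for row in expected])  # [[11.15, 13.85], [17.85, 22.15]]
print(round(chi2, 1))                                    # 25.5
```

Multiplying the row and column totals and dividing by the grand total gives the same expected counts as multiplying the two marginal proportions and then multiplying by 65.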

Evaluation of the χ² value

  • First, select an α-value. This is a necessary step for any statistical test.
  • The degrees of freedom are the number of rows -1 (r - 1) times the number of columns -1 (c - 1)

(r - 1) * (c - 1) = 1 * 1 = 1

  • The null and non-directional alternative hypotheses are:

H0 : Variables A and B are independent

HA : Variables A and B are dependent

    • or:

H0:  Pr{A|B1} = Pr{A|B2}

HA:  Pr{A|B1} not equal to Pr{A|B2}

  • Reject the null if your χ² is larger than the Table 9 entry for the appropriate d. f. and the α-value.
  • The directional alternative hypotheses are:

HA1:  Pr{A|B1} > Pr{A|B2}

or

HA2:  Pr{A|B1} < Pr{A|B2}

  • First you have to check to see if the alternative of interest has actually occurred.
    • If we choose HA1 above, then we would proceed because 21 of 29 A1 outcomes were in B1 but only 4 out of 36 A2 outcomes were in B1, and this is as predicted by HA1.
    • If we choose HA2 above, then we would not proceed because 21 of 29 A1 outcomes were in B1 but only 4 out of 36 A2 outcomes were in B1, and this is not as predicted by HA2.
  • Reject the null if your χ² is larger than the Table 9 entry for the appropriate d. f. and the 2α value.
    • Notice that we look up the entry for twice the α-value, which makes the cutoff χ² value smaller, so a smaller deviation will allow one to reject the null hypothesis.
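
Instead of looking up Table 9, the tail probability for 1 d. f. can be computed directly. The sketch below relies on the fact that a χ² with 1 d. f. is the square of a standard normal, so its upper tail equals erfc(sqrt(x/2)); the helper name `chi2_sf_1df` is hypothetical, and the formula is valid only for 1 d. f.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square with 1 d.f.
    (valid for 1 d.f. only: chi-square is the square of a standard normal)."""
    return math.erfc(math.sqrt(x / 2.0))

chi2 = 25.5    # chi-square value for the example table
alpha = 0.05

# Non-directional test: compare the tail probability with alpha.
p_nondirectional = chi2_sf_1df(chi2)
reject_nondirectional = p_nondirectional < alpha

# Directional test: proceed only if the data deviate in the predicted direction,
# then compare with 2 * alpha (the same as halving the tail probability).
deviation_as_predicted = True   # true for HA1 in the example
reject_directional = deviation_as_predicted and p_nondirectional < 2 * alpha

print(reject_nondirectional, reject_directional)
print(round(chi2_sf_1df(3.84), 2))   # 3.84 is the tabled cutoff for alpha = 0.05
print(round(chi2_sf_1df(2.71), 2))   # 2.71 is the tabled cutoff for 2 * alpha = 0.10
```

The last two lines show why the directional test is easier to pass: the cutoff drops from about 3.84 to about 2.71.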

What have we tested here?

  • In the non-directional case, we have asked if the two variables are independent of one another or if they are associated.
    • If we accept the null, we are saying that Variable A and B are INDEPENDENT of one another.
      • The outcome of A does not depend on B and vice-versa.
  • If we accept the non-directional alternative hypothesis, we are saying that variables A and B are ASSOCIATED.
    • Association means that the outcome of one corresponds to the outcome of the other.
      • In our example, if you get A1, then you also expect B1, but if you get A2, then you expect B2.
  • In the directional case, we have asked if the association between two variables goes in a particular direction
    • If we accept the directional alternative, we accept that the two variables have a particular association (as defined by our choice of alternative hypothesis)

FISHER'S EXACT TEST

This is a test that is an alternative to the χ² test for contingency tables.

  • Exact because it gives the exact probability of getting the cell values given the marginal totals.

H0: The probability of infection is independent of the genotype of the plant.

HA: (directional) the probability of infection is lower for aa than for other genotypes.

  • Suppose that there are three genotypes but that the A allele is completely dominant. You think the aa genotype might be useful if it shows resistance masked by the dominant allele. So you set up an experiment to test this. Plots of plants are exposed to the fungal spores and the appearance of infected individuals is noted. Plots are monocultures of plant genotypes. The results:
Table 1 (3 infected aa plots)
                Genotype
Infected    AA or Aa    aa    Frequency    Margin
Yes            13         3      0.23         16
No              7        10      0.77         17
Margin         20        13                   33 (Total #)
Ways of getting 3 out of 16:     560
Ways of getting 10 out of 17:    19448
Ways of getting 13 out of 33:    573166440
Probability:                     0.019001

Table 2 (2 infected aa plots)
                Genotype
Infected    AA or Aa    aa    Frequency    Margin
Yes            14         2      0.15         16
No              6        11      0.85         17
Margin         20        13                   33 (Total #)
Ways of getting 2 out of 16:     120
Ways of getting 11 out of 17:    12376
Ways of getting 13 out of 33:    573166440
Probability:                     0.002591

Table 3 (1 infected aa plot)
                Genotype
Infected    AA or Aa    aa    Frequency    Margin
Yes            15         1      0.08         16
No              5        12      0.92         17
Margin         20        13                   33 (Total #)
Ways of getting 1 out of 16:     16
Ways of getting 12 out of 17:    6188
Ways of getting 13 out of 33:    573166440
Probability:                     0.000173

Table 4 (0 infected aa plots)
                Genotype
Infected    AA or Aa    aa    Frequency    Margin
Yes            16         0      0.00         16
No              4        13      1.00         17
Margin         20        13                   33 (Total #)
Ways of getting 0 out of 16:     1
Ways of getting 13 out of 17:    2380
Ways of getting 13 out of 33:    573166440
Probability:                     0.000004
  • Only the first table represents the outcome of the experiment.
    • The other three tables represent hypothetical situations discussed below.
  • We need to know how likely this table is, assuming that the marginal totals are fixed.
    • Remember that, since the marginal totals are unchanging, once we know the outcome in one cell, we know the outcomes in the other cells, so we need to find the probability of the outcome in a single cell.
    • The likelihood of the table depends on the number of ways to construct the table with the given cell entries divided by the total number of ways to get the marginal totals
      • These "number of ways" are combinatorials, just like we worked with when learning the binomial.
        • nCj = n!/((j!)*(n-j)!)
      • The numerator is the product of the number of ways of getting 3 successes out of 16 trials (= 16!/((3!)*(13!)) = 560) times the number of ways to get 10 successes out of 17 trials (= 17!/((10!)*(7!)) = 19,448), so the numerator is 560*19,448 = 10,890,880
      • The denominator is the number of ways to get 13 out of 33 trials = 33!/((20!)*(13!)) = 573,166,440
        • The probability  is 10,890,880 / 573,166,440 = 0.019
  • But this ignores that there are outcomes which would give even more support for rejecting the null than the actual experiment did.
    • These are the hypothetical tables above.
    • Each one represents an outcome that supports the directional HA , so we have to add the probability of these outcomes to the probability of the actual outcome.
      • The new numerator is the product of the number of ways of getting 2 successes out of 16 trials (= 16!/((2!)*(14!)) = 120) times the number of ways to get 11 successes out of 17 trials (= 17!/((11!)*(6!)) = 12,376), so the numerator is 120*12,376 = 1,485,120
      • The denominator is the number of ways to get 13 out of 33 trials = 33!/((20!)*(13!)) = 573,166,440
        • The probability  is 1,485,120 / 573,166,440 = 0.00259
  • Once this is done for all "worse cases", we see that the probability of getting this outcome or one more in line with HA is:
    • Pr{this table or one more extreme} = 0.019 + 0.00259 + 0.000173 + 0.000004 = 0.0218
    • If we have an alpha-value of 0.05, then we reject the null and accept the alternative.
  • Notice that this is a directional alternative. We will stop here and do the non-directional only if we have time.
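
The directional Fisher's exact calculation above can be checked with a few lines of Python. This is a sketch using `math.comb` from the standard library; the variable and function names are illustrative only.

```python
from math import comb

# Fixed marginal totals from the infection example.
infected, uninfected = 16, 17     # row margins (Yes, No)
aa_plots = 13                     # column margin for the aa genotype
total = infected + uninfected     # 33 plots

def table_probability(aa_infected):
    """Probability of the table with `aa_infected` infected aa plots,
    given the fixed margins: ways to build the table / ways to get the margin."""
    ways = comb(infected, aa_infected) * comb(uninfected, aa_plots - aa_infected)
    return ways / comb(total, aa_plots)

observed_aa_infected = 3
# Directional HA (aa is infected less often): add the observed table and
# every table with fewer infected aa plots (2, 1, and 0).
p_one_tailed = sum(table_probability(k) for k in range(observed_aa_infected + 1))

print(round(table_probability(observed_aa_infected), 4))  # 0.019
print(round(p_one_tailed, 4))                             # 0.0218
```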

Confidence Intervals for Differences between Probabilities in 2 x 2 tables

  • If you see the 2 x 2 table as two samples with two levels of a variable observed in each, then you can ask if the probabilities differ between samples by constructing a confidence interval for the difference between the probability of some outcome within each sample.
              Sample 1    Sample 2
Category 1    x1          x2
Category 2    n1 - x1     n2 - x2
Totals        n1          n2
  • If we define n1 and n2 as the marginal totals of the columns (each column is a different sample), we can define p1 and p2 as the probability of getting category 1 in the two samples.
    • We can ask if this probability is the same in each sample.
    • Define the estimates p̃1 = (x1 + 1)/(n1 + 2) and p̃2 = (x2 + 1)/(n2 + 2).
      • Note that the addition of 1 to the numerator and 2 to the denominator is a correction for bias at small sample size.
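  • What is needed is a standard error of the difference and a z value (which depends on the α-level you want for the confidence interval - we will use the 95% confidence level recommended by the book, for which z = 1.96).

SE = sqrt( p̃1*(1 - p̃1)/(n1 + 2) + p̃2*(1 - p̃2)/(n2 + 2) ), where p̃1 = (x1 + 1)/(n1 + 2) and p̃2 = (x2 + 1)/(n2 + 2)

    • Then the confidence interval is:

(p̃1 - p̃2) ± 1.96 * SE
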
    • If the CI includes 0, then it is possible (at that level of confidence) that there is no difference between the samples.  Concluding this is the same as accepting the null hypothesis of no difference.
      • Notice that this is a way of doing a parametric test on categorical data.
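
As an illustration, the interval can be computed for the infection data from the Fisher's exact test section, treating the AA/Aa plots as sample 1 (13 of 20 infected) and the aa plots as sample 2 (3 of 13 infected). This is a sketch under an assumption: the standard error is taken to use the adjusted proportions with n + 2 in each denominator, which matches the "+1 / +2" correction described above, but check it against the book's formula. The function name is hypothetical.

```python
import math

def adjusted_difference_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for p1 - p2 using the small-sample adjustment
    (add 1 to each numerator and 2 to each denominator).
    Assumed SE form: sqrt(p1*(1-p1)/(n1+2) + p2*(1-p2)/(n2+2))."""
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Sample 1: AA or Aa plots (13 infected of 20); sample 2: aa plots (3 infected of 13).
low, high = adjusted_difference_ci(13, 20, 3, 13)
print(round(low, 3), round(high, 3))
print(low <= 0 <= high)   # does the interval include 0?
```

With these numbers the interval runs from roughly 0.07 to 0.67 and does not include 0.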

r x k Contingency Tables

  • There is no difference between this procedure and the 2 x 2 we have already done.
    • We just have more cells for which we need to calculate expected numbers of outcomes and we have to do the χ² with more than 4 categories (r*c categories, to be exact).
  • An example will suffice:
    • We will expand the previous example to four rows by three columns.
      • Notice that the expected proportions still total to 1, and the observed and expected totals are still equal to one another.
      • Each expected proportion cell is still the product of the row and column marginal totals divided by the square of the total number of outcomes.
Actual Outcomes
                          Variable A
Variable B     Category A1   Category A2   Category A3   Margin
Category B1        21            17            14           52
Category B2        16            11             8           35
Category B3         7             5             4           16
Category B4         3             0             1            4
Margin             47            33            27          107 (Total #)

Expected Proportions
                          Variable A
Variable B     Category A1   Category A2   Category A3
Category B1       0.21          0.15          0.12
Category B2       0.14          0.10          0.08
Category B3       0.07          0.05          0.04
Category B4       0.02          0.01          0.01
Total                                                      1.00

Expected Outcomes
                          Variable A
Variable B     Category A1   Category A2   Category A3
Category B1      22.84         16.04         13.12
Category B2      15.37         10.79          8.83
Category B3       7.03          4.93          4.04
Category B4       1.76          1.23          1.01
Total                                                    107.00

Chi Square (contribution of each cell)
Variable B     Category A1   Category A2   Category A3
Category B1        0.1           0.1           0.1
Category B2        0.0           0.0           0.1
Category B3        0.0           0.0           0.0
Category B4        0.9           1.2           0.0
Total                                                      2.49
Pr{Greater chi-square} = 0.87
df = (r-1)*(c-1) = 6
  • As you can see, there is no evidence that this table differs from the marginal expectations.
    • The χ² is only 2.49 and the probability of a χ² this large or larger is 0.87, far larger than the usual 0.05 α-level (the d. f. = (r-1)*(c-1) = (4 - 1)*(3 - 1) = 6).
    • Notice that each Variable A category declines from A1 to A3, no matter which Variable B category you are looking at.
  • The Variable A trends are INDEPENDENT of Variable B and the difference between the observed and the expected is due to random error.
    • What if that were not true? What if one of the Variable B categories bucked the trend?
    • In the table below, Variable A had the opposite trend in Category B3, increasing from A1 to A3.
      • By the way, the expected proportions have been skipped as I have used the formula (RowMarginal*ColumnMarginal)/Total to go straight to the expected number of outcomes.
Actual Outcomes
                          Variable A
Variable B     Category A1   Category A2   Category A3   Margin
Category B1        21            17            14           52
Category B2        16            11             8           35
Category B3         1             4            11           16
Category B4         3             0             1            4
Margin             41            32            34          107 (Total #)

Expected Outcomes
                          Variable A
Variable B     Category A1   Category A2   Category A3
Category B1      19.93         15.55         16.52
Category B2      13.41         10.47         11.12
Category B3       6.13          4.79          5.08
Category B4       1.53          1.20          1.27
Total                                                    107.00

Chi Square (contribution of each cell)
Variable B     Category A1   Category A2   Category A3
Category B1        0.1           0.1           0.4
Category B2        0.5           0.0           0.9
Category B3        4.3           0.1           6.9
Category B4        1.4           1.2           0.1
Total                                                     15.95
Pr{Greater chi-square} = 0.01
df = (r-1)*(c-1) = 6
  • Look at the difference this change has made in the value of χ² (once again, the d. f. = (r-1)*(c-1) = (4 - 1)*(3 - 1) = 6). Now the Pr{greater χ² value} is about 0.01, below the 0.05 α-level.
    • This means that the trend in Variable A DEPENDS on which category of Variable B.
    • Variables A and B are NOT INDEPENDENT.
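
Both 4 x 3 examples can be reproduced with a short Python sketch that goes straight from the marginal totals to the expected counts, as in the (RowMarginal*ColumnMarginal)/Total formula. The function names are illustrative. The tail-probability helper uses a closed form that holds only when the d. f. is even (as it is here, with 6 d. f.); it is not a general-purpose χ² routine.

```python
import math

def chi_square(table):
    """Return (chi-square, d.f.) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in zip(row_totals, table):
        for c, obs in zip(col_totals, row):
            exp = r * c / total               # expected count for this cell
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even d.f."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of d.f.")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

first = [[21, 17, 14], [16, 11, 8], [7, 5, 4], [3, 0, 1]]
second = [[21, 17, 14], [16, 11, 8], [1, 4, 11], [3, 0, 1]]   # B3 bucks the trend

results = []
for table in (first, second):
    stat, df = chi_square(table)
    results.append((stat, df, chi2_sf_even_df(stat, df)))
    print(round(stat, 2), df, round(results[-1][2], 3))
```

The second table gives a tail probability of about 0.014, which rounds to the 0.01 shown in the table above.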

Paired Data and 2 x 2 Tables

  • If your data is paired, it may be possible to use categorical analysis to understand the independence/dependence between outcomes for paired data. An example will illustrate.
    • A researcher wants to know about the probability of attack of a newly developed bean variety by a fungal pathogen. The data comes from plots of the beans planted by farmers throughout Tennessee. Either the plot is attacked by the fungus or it is not. Data is collected for two years from the same plots and is presented in the table below.
  • The pairing comes from the same plots being utilized each year so individual plots can affect two data points.
                              Second Year Infected?
                              yes      no
First year Infected?   yes     67     165
                       no     210      31

                              yes      no
First year Infected?   yes    n11     n12
                       no     n21     n22

(n12 - n21)^2    2025
n12 + n21        375
chi-square       5.4
d. f.            1
Probability      0.02
alpha-value      0.05
conclusion       reject null

    • n11 and n22 represent CONCORDANT pairs, those that did not have an infection either year or had it both years.
    • n12 and n21 represent DISCORDANT pairs, those that either developed the infection in the first year and lost it the second or developed it only in the second year.
  • H0 for this analysis is that the year did not make any difference in the probability of the plot of beans being attacked by the fungus.
  • H0 : a discordant pair is just as likely to be yes-no as no-yes.

H0 : Pr{yes-no} = Pr{no-yes} = 0.5

McNEMAR'S TEST

  • This test uses the χ² test against an expected 50:50 split of the discordant pairs and is calculated as

χ² = (n12 - n21)^2/(n12 + n21), with 1 d. f.

  • In the case above, we reject the null: the two kinds of discordant pairs were not equally likely. Plots that were infected in only one year were more likely to be infected in the second year (210 plots) than in the first (165 plots).
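
McNemar's calculation for the bean plots can be sketched as follows. Only the discordant counts enter the statistic; the tail probability again uses the 1 d. f. identity erfc(sqrt(x/2)), and the variable names are illustrative.

```python
import math

# Discordant pairs from the bean-plot table.
n12 = 165   # infected in the first year only (yes-no)
n21 = 210   # infected in the second year only (no-yes)

# McNemar's chi-square with 1 d.f.; the concordant pairs (n11, n22) are not used.
chi2 = (n12 - n21) ** 2 / (n12 + n21)

# Upper-tail probability for 1 d.f. (chi-square is the square of a standard normal).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

alpha = 0.05
print(round(chi2, 1), round(p_value, 2))   # 5.4 0.02
print("reject null" if p_value < alpha else "do not reject null")
```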

Relative Risk and the Odds Ratio

  • One often hears on the news some reporter saying something like:
    • "A study just published in the Journal of the American Medical Association reports that listening to pop music increases a person's risk of dermatitis three times."
    • You almost never hear scientists in areas other than clinical medicine report their findings in this same way, but clinical researchers often do.
  • How do they determine this?
    • They are reporting RELATIVE RISK, a ratio of probabilities.
    • In the example above, if 300 of 2000 participants in the study who listened to pop music suffered dermatitis during the study, then:
      • Pr{dermatitis | pop listening} = 300/2000 = 0.15
        • Note that the vertical line means "given" so that you read Pr{dermatitis | pop listening} as the probability of contracting dermatitis given that one listens to pop music.
    • Suppose that:
      • Pr{dermatitis | no pop listening} = 100/2000 = 0.05
    • If these are the probabilities, then we can calculate the relative risk of contracting dermatitis as the ratio of these probabilities, or
      • Relative Risk = 0.15 / 0.05 = 3
  • The ODDS of something happening is another ratio, the probability of something happening divided by the probability of it not happening.
    • What are the odds of contracting dermatitis for pop listeners?
      • Pr{dermatitis}/Pr{not getting dermatitis} = (300/2000)/(1700/2000) = 300/1700 = 3/17
    • For non-pop listeners?
      • Pr{dermatitis}/Pr{not getting dermatitis} = (100/2000)/(1900/2000) = 100/1900 = 1/19
  • The ODDS RATIO is the ratio of the two odds or:
    • (3/17)/(1/19) = (3*19)/(17*1) = 57/17 = 3.35
    • The book has more on the odds ratio, but we haven't the time to go further than this.
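
The relative risk and odds ratio for the (made-up) dermatitis example can be checked with a few lines of Python. The function names are illustrative only.

```python
def risk(cases, n):
    """Probability of the outcome in a group: cases / group size."""
    return cases / n

def odds(cases, n):
    """Odds of the outcome: Pr{outcome} / Pr{no outcome} = cases / non-cases."""
    return cases / (n - cases)

# Counts from the example: 300 of 2000 pop listeners and 100 of 2000 non-listeners.
pop_cases, pop_n = 300, 2000
other_cases, other_n = 100, 2000

relative_risk = risk(pop_cases, pop_n) / risk(other_cases, other_n)
odds_ratio = odds(pop_cases, pop_n) / odds(other_cases, other_n)

print(round(relative_risk, 2))   # 3.0
print(round(odds_ratio, 2))      # 3.35
```

The odds ratio (3.35) is a little larger than the relative risk (3) here; the two are close only when the outcome is rare in both groups.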

Last updated September 30, 2011