Rarefaction (From Lecture 16)
The expected number of species, E(S), in a subsample of n individuals is:

$$E(S) = \sum_{i=1}^{S}\left[1 - \frac{\binom{N - N_i}{n}}{\binom{N}{n}}\right]$$
- Look at this expression and simplify it in your mind. Each term in the sum is 1 minus a fraction, so each term is less than 1. You are summing S terms (S = number of species in the sample), so the sum must be less than S. Therefore, the expected number of species will be less than the actual number of species.
- This is because you would expect to capture fewer species in a smaller sample. The rarer species have less of a chance of being taken.
- The expressions within the innermost parentheses are not fractions; they are combinations (note that there is no horizontal bar, and see the discussion of combinations in Lecture 13 from BIOL 3110, my biostats class). These combinations are defined as:

$$\binom{N - N_i}{n} = \frac{(N - N_i)!}{n!\,(N - N_i - n)!}$$
- and

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$
- Remember that the expression on the left of the = sign is not a fraction (note: no horizontal bar), whereas the expression on the right is one. The ! means that the expression is a factorial. The expression on the left is called a combination because it gives the number of ways to take N objects n at a time. For instance, there are three ways (= 3!/(2!1!)) to group 3 objects 2 at a time (1&2, 1&3, 2&3). A factorial (indicated by !) is obtained by multiplying the number by one less, then by two less, and so on down to 1 (for example, 3! = 3 × 2 × 1 = 6).
- What you are calculating in the combinatorial expression found in the numerator of the fraction within the summation in the top equation is the number of combinations one can make (of the same size as the smaller sample size, n) without any of the species of interest present (this is why we use N - Ni and not Ni here). The total number of combinations possible is calculated by the combinatorial expression in the denominator. This fraction is then the proportion of combinations (each one represents a possible sample) that contain none of the species of interest (species i). This can be seen as the probability of not getting that species in the sample and this fraction is subtracted from 1.
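The calculation described in the bullets above can be sketched in Python. This is only a minimal sketch, not part of the original lecture: the function name `expected_richness` and the small set of counts are made up for illustration, and `math.comb` (Python 3.8 or later) is assumed for the combinations.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of species, E(S), in a subsample of n individuals.

    counts: number of individuals of each species in the full sample (the Ni values).
    n: size of the smaller sample to rarefy to (should not exceed N = sum(counts)).
    """
    N = sum(counts)
    total_combos = comb(N, n)  # every possible subsample of size n
    expected = 0.0
    for Ni in counts:
        # proportion of subsamples that contain no individual of species i
        fraction_missing = comb(N - Ni, n) / total_combos
        expected += 1 - fraction_missing
    return expected

# The worked example from the text: 3 objects taken 2 at a time.
print(comb(3, 2))  # 3

# Hypothetical counts (not from the lecture): three species, N = 6, rarefied to n = 2.
hypothetical_counts = [3, 2, 1]
print(expected_richness(hypothetical_counts, 2))
```

Each term added inside the loop is 1 minus a proportion, so the result can never exceed the number of species in `counts`.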
Without rarefaction, one cannot compare the species richness of samples that contain different numbers of individuals.
Calculating Rarefaction
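Remember that N is the total number of individuals in a sample, Ni is the number of individuals of species i in that sample, S is the number of species in the sample, and n is the size of the smaller sample to which it is rarefied.
Logically, the sum of the Ni values must be equal to N.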
# of fish from three lakes

| Species of fish | North America | Central America | Argentina |
|---|---|---|---|
| A | 12 | | |
| B | 5 | | |
| C | 4 | 33 | |
| D | 3 | 32 | |
| E | 1 | 34 | |
| F | | 33 | |
| G | | | 42 |
| H | | | 23 |
| I | | | 16 |
| J | | | 14 |
| K | | | 6 |
| L | | | 5 |
| total (= N) | 25 | 132 | 106 |
| # of species (= S) | 5 | 4 | 6 |

Each cell is an Ni.
To compare all three lakes, we need to rarefy the samples from Central America and Argentina to the smallest sample, North America
The book does not say, but n must be THE SMALLEST SAMPLE SIZE
The criterion is that N must not be smaller than n; when N < n, you will not be able to do the combinatorials
Therefore, rarefaction always adjusts down, never up.
So, we can only ask "How many species would I have gotten in this sample if it had been as small as the smallest sample?"
We will use n = 25 from the North American sample and rarefy the Central American and Argentine samples
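As a rough illustration of how n is chosen, the counts from the fish table can be stored per lake and the smallest N picked out. This is a sketch only; the dictionary layout and variable names are my own assumptions, not something the lecture prescribes.

```python
# Counts of fish per species (the Ni values) for each lake, from the table above.
lakes = {
    "North America": {"A": 12, "B": 5, "C": 4, "D": 3, "E": 1},
    "Central America": {"C": 33, "D": 32, "E": 34, "F": 33},
    "Argentina": {"G": 42, "H": 23, "I": 16, "J": 14, "K": 6, "L": 5},
}

# N is the sum of the Ni values; S is the number of species present.
sample_sizes = {lake: sum(counts.values()) for lake, counts in lakes.items()}
richness = {lake: len(counts) for lake, counts in lakes.items()}

# Rarefaction only adjusts down, so n is the smallest sample size.
smallest_lake = min(sample_sizes, key=sample_sizes.get)
n = sample_sizes[smallest_lake]

for lake in lakes:
    print(lake, "N =", sample_sizes[lake], "S =", richness[lake])
print("n =", n, "taken from", smallest_lake)
```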
Central America

| Species | N | n | Ni | N - Ni | (N - Ni) choose n | N choose n | fraction | 1 - fraction |
|---|---|---|---|---|---|---|---|---|
| C | 132 | 25 | 33 | 99 | 1.82E+23 | 6E+26 | 0.0003 | 1 |
| D | 132 | 25 | 32 | 100 | 2.43E+23 | 6E+26 | 0.0004 | 1 |
| E | 132 | 25 | 34 | 98 | 1.36E+23 | 6E+26 | 0.0002 | 1 |
| F | 132 | 25 | 33 | 99 | 1.82E+23 | 6E+26 | 0.0003 | 1 |
| Total = | | | | | | | | 4 species |
In the Central American lake sample, we do not get much of a correction (too small to show up). Why?
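To see how small the correction is before rounding, the Central American rows can be recomputed with a few lines of Python. This is a sketch under the assumption that `math.comb` (Python 3.8 or later) is available; the variable names are mine.

```python
from math import comb

N, n = 132, 25
central_america = {"C": 33, "D": 32, "E": 34, "F": 33}  # the Ni values

total = 0.0
for species, Ni in central_america.items():
    # proportion of possible 25-fish subsamples with no fish of this species
    fraction = comb(N - Ni, n) / comb(N, n)
    total += 1 - fraction
    print(f"{species}: fraction = {fraction:.4f}, 1 - fraction = {1 - fraction:.4f}")

print(f"Total = {total:.4f} species, which rounds to {total:.2f}")
```

The unrounded total sits just below 4, which is why the table shows no visible change.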
This situation is a bit different for the Argentine sample, where the sample is not so even, although the richness is greater.
Argentina

| Species | N | n | Ni | N - Ni | (N - Ni) choose n | N choose n | fraction | 1 - fraction |
|---|---|---|---|---|---|---|---|---|
| G | 106 | 25 | 42 | 64 | 4.01E+17 | 1E+24 | 3E-07 | 1 |
| H | 106 | 25 | 23 | 83 | 1.08E+21 | 1E+24 | 0.0008 | 1 |
| I | 106 | 25 | 16 | 90 | 1.16E+22 | 1E+24 | 0.0091 | 0.99 |
| J | 106 | 25 | 14 | 92 | 2.2E+22 | 1E+24 | 0.0172 | 0.98 |
| K | 106 | 25 | 6 | 100 | 2.43E+23 | 1E+24 | 0.1902 | 0.81 |
| L | 106 | 25 | 5 | 101 | 3.22E+23 | 1E+24 | 0.2528 | 0.75 |
| Total = | | | | | | | | 5.53 species |
Here, there is a noticeable correction. Why?
The last example rearranges the Argentine data, but keeps the number of species (6) and total sample size the same (106). What it does is make species G more dominant at the expense of all other species (look at the Ni column here and compare with the previous Argentina table).
Argentina - with almost all fish from species G

| Species | N | n | Ni | N - Ni | (N - Ni) choose n | N choose n | fraction | 1 - fraction |
|---|---|---|---|---|---|---|---|---|
| G | 106 | 25 | 80 | 26 | 26 | 1E+24 | 0.00 | 1.00 |
| H | 106 | 25 | 9 | 97 | 1.01E+23 | 1E+24 | 0.08 | 0.92 |
| I | 106 | 25 | 7 | 99 | 1.82E+23 | 1E+24 | 0.14 | 0.86 |
| J | 106 | 25 | 5 | 101 | 3.22E+23 | 1E+24 | 0.25 | 0.75 |
| K | 106 | 25 | 3 | 103 | 5.64E+23 | 1E+24 | 0.44 | 0.56 |
| L | 106 | 25 | 2 | 104 | 7.42E+23 | 1E+24 | 0.58 | 0.42 |
| Total = | | | | | | | | 4.50 species |
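The two Argentine arrangements can be compared side by side with a short Python sketch. The function name `expected_richness` and the dictionary names are illustrative assumptions; the counts are the Ni columns from the two Argentina tables, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def expected_richness(counts, n):
    """Sum over species of 1 - C(N - Ni, n) / C(N, n)."""
    counts = list(counts)
    N = sum(counts)
    return sum(1 - comb(N - Ni, n) / comb(N, n) for Ni in counts)

n = 25  # size of the North American sample
argentina_original = {"G": 42, "H": 23, "I": 16, "J": 14, "K": 6, "L": 5}
argentina_dominated = {"G": 80, "H": 9, "I": 7, "J": 5, "K": 3, "L": 2}

original = expected_richness(argentina_original.values(), n)
dominated = expected_richness(argentina_dominated.values(), n)

print(f"original arrangement:  {original:.2f} species")   # 5.53
print(f"dominated arrangement: {dominated:.2f} species")  # 4.50
```

Both samples have 6 species and 106 individuals, so the difference in the rarefied values comes entirely from how the individuals are spread among the species.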
I included this example for two reasons
A problem arises whenever N - Ni is smaller than n, because the ((N - Ni) - n) factorial is then the factorial of a negative number (for example, 24 - 25 gives -1). You must set this combination to 0 to do the calculations in the rarefaction table, because the combination (24 over 25) asks for the number of combinations of 25 objects one can make with only 24 objects to combine! Obviously, no such combinations are possible and the answer is 0. When doing these calculations in a spreadsheet, the program will return some sort of error code whenever you ask for the factorial of a negative number. Whenever this occurs, simply set the value of the combination to 0 and proceed with the calculations.
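The same rule can be written out in Python as a rough sketch. The helper name `combination_by_factorials` is hypothetical; it mimics the factorial-based spreadsheet calculation and applies the "set it to 0" rule explicitly. In Python, `math.factorial` raises a `ValueError` for a negative number, while `math.comb` (Python 3.8 or later) already returns 0 when asked to choose more objects than are available.

```python
from math import comb, factorial

def combination_by_factorials(N, n):
    """N! / (n! (N - n)!), with the combination set to 0 when N < n."""
    if N < n:
        # (N - n)! would be the factorial of a negative number, which is undefined;
        # no group of n objects can be formed from fewer than n objects.
        return 0
    return factorial(N) // (factorial(n) * factorial(N - n))

print(combination_by_factorials(24, 25))  # 0
print(combination_by_factorials(3, 2))    # 3
print(comb(24, 25))                       # 0
```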
So, we see that rarefaction is a bit more involved than the text makes out, but still a necessary exercise when making comparisons of species richness among samples that differ in size.
In addition, honesty makes me disclose that this is not the only way to rarefy a sample
For example, one might use a bootstrap approach (made possible by the speed of computers)
In this approach, one subsamples a larger sample repeatedly and then calculates the parameter of interest based on the subsamples
- For the above example comparing the Argentine lake with the North American lake, one would make a subsample by taking the 106 individuals in the larger sample and randomly choosing only 25 of the 106 to be in the subsample. One would then count the number of species in the subsample and that would be one bootstrap estimate for species richness.
Next, one would resample the original 106 individuals again, choosing another 25 at random (some of those in the second subsample could have been in the first) and recalculate the number of species.
Repeat the subsampling many times (this is why a computer is necessary; 10,000 is a good number of subsamples), each time getting an estimate of the species richness.
Finally, average the 10,000 richness values for the 10,000 subsamples and use that average as your rarefied estimate of richness [ = E(S) in the formula above].
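The subsampling steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `subsampled_richness`, its parameters, and the seed are my own choices, and each subsample is drawn as 25 distinct individuals out of the 106, as described above.

```python
import random

def subsampled_richness(counts, n, repeats=10_000, seed=None):
    """Average number of species found in `repeats` random subsamples of size n."""
    rng = random.Random(seed)
    # One list entry per individual fish, labelled with its species.
    individuals = [species for species, Ni in counts.items() for _ in range(Ni)]
    total_species = 0
    for _ in range(repeats):
        subsample = rng.sample(individuals, n)  # n distinct individuals chosen at random
        total_species += len(set(subsample))    # species present in this subsample
    return total_species / repeats

argentina = {"G": 42, "H": 23, "I": 16, "J": 14, "K": 6, "L": 5}
estimate = subsampled_richness(argentina, n=25, repeats=10_000, seed=1)
print(round(estimate, 2))
```

The printed value varies slightly with the seed, but with 10,000 subsamples it should land close to the 5.53 species obtained from the formula.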
The bootstrapping approach supplies an additional bit of information. One can do the bootstrap estimate for any subsample size and graph the expected number of species in the sample versus the sample size. This is a Rarefaction Curve, and it usually has a steep portion before it plateaus as the subsample size approaches the larger sample size. If your smaller sample is in the plateau region, the two samples can reasonably be compared. If not, your smaller sample is most probably deficient as a sample of the diversity (compared with the larger sample).
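A rarefaction curve for the Argentine sample can be sketched as below. Note the swap: to keep the example short and repeatable, this sketch computes each point with the combinatorial formula rather than by repeated subsampling; the function and variable names are illustrative assumptions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def expected_richness(counts, n):
    """Sum over species of 1 - C(N - Ni, n) / C(N, n)."""
    N = sum(counts)
    return sum(1 - comb(N - Ni, n) / comb(N, n) for Ni in counts)

argentina = [42, 23, 16, 14, 6, 5]  # Ni values for species G through L
N = sum(argentina)

# One point per subsample size, from a single individual up to the full sample.
curve = [(n, expected_richness(argentina, n)) for n in range(1, N + 1)]

# Print a few points along the curve; plotting all of them gives the rarefaction curve.
for n, expected in curve:
    if n in (1, 5, 10, 25, 50, 75, N):
        print(f"n = {n:3d}  E(S) = {expected:.2f}")
```

The curve starts at 1 species for a single individual and levels off at the full 6 species as n approaches 106; where n = 25 falls on that curve shows how far the smaller sample is from the plateau.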
Last Updated on July 23, 2007