BIOL 3110

Biostatistics

Phil Ganter

302 Harned Hall

963-5782

Russulaceae, probably in the genus Lactaria

Introduction & Descriptive Statistics


Unit Organization:

This unit consists mostly of definitions, as we go through some basic components of statistics: the means of summarizing data (= descriptive statistics).

Below are bookmarks to the several sections

Variable Types

Frequency Tables and Distributions

Summation Notation and Statistical Inference

Central Tendency

Dispersion

Data Transformation

Problems:

Problems for homework - 3rd Edition

  • 2.4, 2.10, 2.15, 2.20, 2.25, 2.28, 2.32, 2.41, 2.47 (in addition, use the 2.47 data to calculate the mean, s. d. and C. V.)

Problems for homework - 4th Edition

  • 2.2.1, 2.2.7, 2.3.2, 2.3.7, 2.3.12, 2.3.15, 2.4.3, 2.6.2, 2.6.8 (in addition, use the 2.6.8 data to calculate the mean, s.d. and C.V.)

Suggested Problems

  • 2.66, 2.68, 2.69, 2.73, 2.77 (the top three distributions, labelled I, II, and III, belong to question 2.26, which is a good one if you want the challenge, but ignore them for question 2.77)
  • 2.S.7, 2.S.9, 2.S.10, 2.S.13, 2.S.17 -- 4th edition


Variable Types

A variable is anything that can have different values or qualities

Categorical

Ordinal (ordered categories, like life history stages for insects - egg, larval instar #1, etc.)

Non-ordinal (unordered, like the sexes)

Quantitative

Discrete (things that are counted, like population size)

Continuous (things that are measured, like height)

Frequency Tables and Distributions

A means of compacting the data and an aid to understanding some of the statistical properties of a collection of data.

Curve shapes

Bimodal (or tri-, etc.), Unimodal

Normal (bell curve)

Leptokurtic vs. Platykurtic - sharp vs. flat; a platykurtic curve might mean two modes close together (perhaps two sets of data have been combined)

Skewed vs. Symmetric

 

Exponential (decay) -- negative monotonic

Histograms -- utility in data presentation, visual impact

Categorical data -- columns are not touching (columns should touch for continuous data)

Grouping of Continuous Data

Determination of group sizes

Area under curve = proportional contribution of category to total

Proportional Histograms (Relative Frequency in book) -- making histograms comparable

Stem and Leaf plots
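
As a rough illustration of the ideas in this section, the sketch below builds a frequency table, converts it to relative frequencies, and groups continuous data into classes. The data values and the class width of 5 are invented for the example, not taken from the book.

```python
from collections import Counter

# Hypothetical discrete data: number of eggs per nest (invented values).
eggs = [2, 3, 3, 4, 4, 4, 5, 5, 3, 4]

freq = Counter(eggs)          # frequency table: value -> count
n = len(eggs)
# Relative frequency = count / n, which makes samples of different size comparable.
rel_freq = {value: count / n for value, count in freq.items()}

for value in sorted(freq):
    # One asterisk per observation gives a crude text histogram.
    print(f"{value}: {freq[value]:2d}  {rel_freq[value]:.2f}  {'*' * freq[value]}")

# Hypothetical continuous data: heights in cm, grouped into classes 5 cm wide.
heights = [150.2, 153.9, 158.4, 161.0, 162.7, 165.5, 171.3]
width = 5  # assumed class width; choosing it is a judgment call
groups = Counter(int(h // width) * width for h in heights)  # key = lower class limit

for lower in sorted(groups):
    print(f"{lower}-{lower + width}: {groups[lower]}")
```

Changing `width` shows how the choice of group size alters the apparent shape of the distribution.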

Summation Notation and Statistical Inference

Statistical Inference

a measurement or an observation is a value for a variable taken from an observed individual

a sample is the set of measurements or observations taken from the total Population

The Population is the larger group about which you wish to draw some sort of conclusion

You use statistics done from the sample to draw an inference about the population (it is inferred because you are guessing from the particular to the general)

sample descriptive statistics are denoted by the usual English letters, but population statistics by Greek letters (mean = $\mu$, standard deviation = $\sigma$)

sample mean = $\bar{X}$, but population mean is $\mu$

estimated values are sometimes symbolized by placing a caret rather than a bar over them

Summation Notation

$\sum_{i=1}^{n} X_i$ is read as "the sum of the observations from the first one to the nth one"

i is the INDEX, which labels all of the sample observations from 1 to the total sample size (so that each observation has its own index number)

X is the variable.

n is the total number of values (= the sample size)
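
The notation can be read as a loop: the index i runs from 1 to n and each observation is added to a running total. A minimal sketch with invented counts (Python lists start at 0, so observation i is stored at position i - 1):

```python
# Hypothetical sample: number of seeds counted in each of five pods.
X = [4, 7, 5, 9, 6]
n = len(X)                 # n = sample size

total = 0
for i in range(1, n + 1):  # i = 1, 2, ..., n
    total += X[i - 1]      # add the i-th observation

print(total)               # same value as the built-in sum(X)
```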

Central Tendency

Mean

because the mean can be pulled strongly toward unusually large or small observations in a sample, it is possible to decide to Trim the mean (drop the bottom and top 5 or 10 % of the observations)

the decision to trim must be made before the data are inspected, because one cannot throw out an observation just because it is "too big"
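
A trimmed mean can be sketched as follows. The function name `trimmed_mean` and the data are made up for this example, and the number dropped from each end is rounded down, which is one convention among several.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # observations to drop per end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one unusually large observation.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

plain = sum(data) / len(data)               # ordinary mean, pulled up by 95
trimmed = trimmed_mean(data, 0.10)          # drops 12 and 95

print(plain, trimmed)
```

The 10 % trim is fixed in advance here; it is not chosen after seeing that 95 looks odd.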

Median

Mode (not in book, but an acceptable measure of central tendency in some cases)
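
The three measures can be compared on one small sample using Python's standard `statistics` module. The data are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample: clutch sizes from seven nests.
data = [3, 5, 5, 6, 8, 9, 13]

sample_mean = mean(data)      # sum of the observations divided by n
sample_median = median(data)  # middle value of the ordered observations
sample_mode = mode(data)      # most frequent value

print(sample_mean, sample_median, sample_mode)
```

Here the mean exceeds the median because the largest value (13) lies farther from the middle than the smallest (3), the pattern expected for a right-skewed sample.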

Dispersion

Range

most conservative, makes no assumptions, not estimated but not useful for statistical interpretation

Quartiles

divide the ordered observations into quarters (the median is also the division between the second and third quartiles)

Interquartile Range

distance between beginning of second quartile and end of third (so 50% of all data points lie between)
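
Several conventions exist for locating quartiles, and software packages do not all agree. The sketch below assumes one simple convention: the first and third quartile boundaries are the medians of the lower and upper halves of the ordered data, with the middle value left out when n is odd. The function name and data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Needs at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # lower half
    upper = ordered[len(ordered) - half:]    # upper half (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

# Hypothetical sample of eight measurements.
data = [2, 4, 5, 7, 8, 10, 11, 13]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1    # interquartile range: spans the middle 50% of the data

print(q1, q2, q3, iqr)
```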

Standard Deviation

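$s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)}$, where n = sample size; dividing by n-1 rather than n corrects for the fact that the deviations are measured from the sample mean rather than the population mean (n-1 is called the degrees of freedom)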

The book points out that when there is only one observation, dividing by n would give you a value of 0 for s, which is misleading because you have no information to base this on (given that you have only one measurement), but if you divide by n-1, the value is undefined (division by zero), which is consistent with the situation
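
The point about a single observation can be seen directly in code. This is a minimal sketch with invented data; `sample_sd` is a name chosen for the example.

```python
import math

def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of squared deviations
    return math.sqrt(ss / (n - 1))              # division by zero when n == 1

# Hypothetical sample of eight measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_sd(data))

# With one observation the result is undefined rather than a misleading 0.
try:
    sample_sd([4.2])
except ZeroDivisionError:
    print("s is undefined for a single observation")
```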

Normal Curve percentages -

about 68% of observations lie within ± 1 s. d. of the mean, about 95% within ± 2 s. d., and about 99.7% within ± 3 s. d.

Chebyshev's Rule - for any distribution - at least 75% are within ± 2 s. d. of the mean, and at least 89% are within ± 3 s. d.
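
Both sets of percentages can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module for the normal curve, and the general Chebyshev bound, which says that at least 1 - 1/k² of any distribution lies within k standard deviations of the mean. The function names are chosen for this example.

```python
from statistics import NormalDist

def normal_within(k):
    """Proportion of a normal distribution within k s.d. of the mean."""
    z = NormalDist()               # standard normal: mean 0, s.d. 1
    return z.cdf(k) - z.cdf(-k)

def chebyshev_bound(k):
    """Minimum proportion of ANY distribution within k s.d. of the mean."""
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(normal_within(k), 4), round(chebyshev_bound(k), 4))
```

For k = 1 the Chebyshev bound is 0, so it says nothing useful; it becomes informative only for k greater than 1.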

Computational Formula
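
Assuming the usual shortcut form, $s = \sqrt{\left(\sum X_i^2 - (\sum X_i)^2 / n\right) / (n-1)}$, the sketch below checks that it gives the same answer as the definitional formula based on deviations from the mean. The data and function names are invented for the example.

```python
import math

def sd_definitional(values):
    """s from the squared deviations about the sample mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def sd_computational(values):
    """s from the sum of X and the sum of X squared (no deviations needed)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Hypothetical sample of eight measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sd_definitional(data), sd_computational(data))
```

The shortcut form is convenient on a hand calculator, but it can lose precision on a computer when the mean is large relative to the spread.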

Variance

the square of the standard deviation (written $s^2$)

notice that the measurement units are also squared, so that you never report mean ± variance (as it makes no sense)

uses of variance -- it is used in some statistical tests

Coefficient of Variation

C. V. is just the ratio of the s. d. to the mean, usually expressed as a percentage
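
Because the C. V. is a ratio, it does not depend on the units of measurement. A minimal sketch with invented lengths, recorded first in millimetres and then converted to centimetres:

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """C.V. as a percentage: 100 * s.d. / mean."""
    return 100 * stdev(values) / mean(values)

# Hypothetical leaf lengths.
lengths_mm = [48, 52, 50, 47, 53]
lengths_cm = [x / 10 for x in lengths_mm]

cv_mm = coefficient_of_variation(lengths_mm)
cv_cm = coefficient_of_variation(lengths_cm)

print(round(cv_mm, 2), round(cv_cm, 2))   # the two values agree
```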

Data Transformation

X' is the symbol for transformed data

Linear

usually to change a scale (multiplying by a constant), or to add or subtract a constant

all are "natural", so that you can perform the same procedure on the mean to change it to the transformed value

all change the mean, but adding or subtracting a constant does not change the s. d. (multiplying by a constant multiplies the s. d. by the absolute value of that constant, so the s. d. change is also natural)
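
For a linear transformation X' = a + bX, the transformed mean is a + b times the original mean and the transformed s. d. is |b| times the original s. d. The sketch below checks this with invented temperatures, converting Celsius to Fahrenheit (a = 32, b = 1.8) and to Kelvin (a = 273.15, b = 1).

```python
from statistics import mean, stdev

# Hypothetical temperatures in degrees Celsius.
celsius = [18.0, 20.5, 22.0, 19.5, 25.0]

fahrenheit = [32 + 1.8 * c for c in celsius]   # additive and multiplicative change
kelvin = [c + 273.15 for c in celsius]         # additive change only

# The mean is transformed exactly as each observation is.
print(mean(fahrenheit), 32 + 1.8 * mean(celsius))

# The s.d. ignores the added constant and is scaled by the multiplier.
print(stdev(fahrenheit), 1.8 * stdev(celsius))
print(stdev(kelvin), stdev(celsius))
```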

Nonlinear

must recalculate means and s. d. for nonlinearly transformed data

Taking the logarithm (natural or base 10)

Taking the square root of the value
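
The need to recalculate can be seen with a small invented sample: the mean of the log-transformed values is not the log of the original mean, so the shortcut that works for linear transformations fails here.

```python
import math
from statistics import mean

# Hypothetical counts spanning several orders of magnitude.
values = [1, 10, 100]

logs = [math.log10(x) for x in values]   # X' = log10(X)

mean_of_logs = mean(logs)                # recalculated from the transformed data
log_of_mean = math.log10(mean(values))   # shortcut that does NOT work

print(mean_of_logs, round(log_of_mean, 3))
```

The same holds for the s. d. and for the square-root transformation.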

Last updated January 22, 2013