Introduction to Statistics

Descriptive Statistics

Types of data

A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation to another. For example, the units might be headache sufferers and the variate might be the time between taking an aspirin and the headache ceasing.

An observation or response is the value taken by a variate for some given unit.

There are various types of variate.

Qualitative or nominal; described by a word or phrase (e.g. blood group, colour).

Quantitative; described by a number (e.g. time till cure, number of calls arriving at a telephone exchange in 5 seconds).

Ordinal; this is an “in-between” case. Observations are not numbers but they can be ordered (e.g. much improved, improved, same, worse, much worse).

Averages etc. can sensibly be evaluated for quantitative data, but not for the other two. Qualitative data can be analysed by considering the frequencies of different categories. Ordinal data can be analysed like qualitative data, but really require special techniques called nonparametric methods.

Quantitative data can be:

discrete; the variate can only take one of a finite or countable number of values (e.g. a count)

continuous; the variate is a measurement which can take any value in an interval of the real line (e.g. a weight).

Displaying data

It is nearly always useful to use graphical methods to illustrate your data. We shall describe in this section just a few of the methods available.

Discrete data: frequency table and bar chart

Suppose that you have collected some discrete data. It will be difficult to get a “feel” for the distribution of the data just by looking at it in list form. It may be worthwhile constructing a frequency table or barchart.

The frequency of a value is the number of observations taking that value.

A frequency table is a list of possible values and their frequencies.

A barchart consists of bars corresponding to each of the possible values, whose heights are equal to the frequencies.

Example

The numbers of accidents experienced by 80 machinists in a certain industry over a period of one year were found to be as shown below. Construct a frequency table and draw a barchart.

2 0 0 1 0 3 0 6 0 0 8 0 2 0 1

5 1 0 1 1 2 1 0 0 0 2 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 1 0 1

0 0 0 5 1 0 0 0 0 0 0 0 0 1 1

0 3 0 0 1 1 0 0 0 2 0 1 0 0 0

0 0 0 0 0

Solution

Number of accidents   Tallies                                                             Frequency
0                     ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/   55
1                     ||||/ ||||/ ||||                                                    14
2                     ||||/                                                               5
3                     ||                                                                  2
4                                                                                         0
5                     ||                                                                  2
6                     |                                                                   1
7                                                                                         0
8                     |                                                                   1

(In the tallies, ||||/ denotes a struck-through group of five.)

Barchart
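The tallying can also be done by machine. The following is a minimal Python sketch, not part of the original notes: the variable names are illustrative, and the row of '#' characters is only a crude text stand-in for a properly drawn barchart.

```python
from collections import Counter

# Accident counts for the 80 machinists, copied from the example above.
accidents = [
    2, 0, 0, 1, 0, 3, 0, 6, 0, 0, 8, 0, 2, 0, 1,
    5, 1, 0, 1, 1, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
    0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
    0, 3, 0, 0, 1, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0,
    0, 0, 0, 0, 0,
]

counts = Counter(accidents)

# List every possible value from 0 to the maximum, including values
# that never occur (frequency 0), as in the hand-made table.
freq_table = {value: counts.get(value, 0) for value in range(max(accidents) + 1)}

for value, freq in freq_table.items():
    # One '#' per observation gives a sideways text barchart.
    print(f"{value:>2} {freq:>3} {'#' * freq}")
```

Listing the values with zero frequency (4 and 7 here) keeps the horizontal scale of the barchart honest.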

Continuous data: histograms

When the variate is continuous, we do not look at the frequency of each value, but group the values into intervals. The plot of frequency against interval is called a histogram. Be careful to define the interval boundaries unambiguously.

Example

The following data are the left ventricular ejection fractions (LVEF) for a group of 99 heart transplant patients. Construct a frequency table and histogram.

62 64 63 70 63 69 65 74 67 77 65 72 65

77 71 79 75 78 64 78 72 32 78 78 80 69

69 65 76 53 74 78 59 79 77 76 72 76 70

76 76 74 67 65 79 63 71 70 84 65 78 66

72 55 74 79 75 64 73 71 80 66 50 48 57

70 68 71 81 74 74 79 79 73 77 80 69 78

73 78 78 66 70 36 79 75 73 72 57 69 82

70 62 64 69 74 78 70 76

Frequency table

LVEF          Tallies                                                 Frequency
24.5 – 34.5   |                                                       1
34.5 – 44.5   |                                                       1
44.5 – 54.5   |||                                                     3
54.5 – 64.5   ||||/ ||||/ |||                                         13
64.5 – 74.5   ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/   45
74.5 – 84.5   ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ ||||/ |             36

(In the tallies, ||||/ denotes a struck-through group of five.)

Histogram

Note: if the interval lengths are unequal, the heights of the rectangles are chosen so that the area of each rectangle equals the frequency, i.e. height of rectangle = frequency / interval length.
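The grouping into intervals can be checked by machine. The sketch below is an illustration only (plain Python, no plotting library; the names are assumptions, not part of the notes). It counts the LVEF values falling in each interval and applies the height rule from the note above, so it would also cope with unequal interval lengths.

```python
# LVEF values for the 99 patients, copied from the example above.
lvef = [
    62, 64, 63, 70, 63, 69, 65, 74, 67, 77, 65, 72, 65,
    77, 71, 79, 75, 78, 64, 78, 72, 32, 78, 78, 80, 69,
    69, 65, 76, 53, 74, 78, 59, 79, 77, 76, 72, 76, 70,
    76, 76, 74, 67, 65, 79, 63, 71, 70, 84, 65, 78, 66,
    72, 55, 74, 79, 75, 64, 73, 71, 80, 66, 50, 48, 57,
    70, 68, 71, 81, 74, 74, 79, 79, 73, 77, 80, 69, 78,
    73, 78, 78, 66, 70, 36, 79, 75, 73, 72, 57, 69, 82,
    70, 62, 64, 69, 74, 78, 70, 76,
]

# Boundaries end in .5, so no whole-number observation can fall
# exactly on a boundary: the intervals are unambiguous.
boundaries = [24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5]

frequencies = [0] * (len(boundaries) - 1)
for value in lvef:
    for i in range(len(frequencies)):
        if boundaries[i] < value < boundaries[i + 1]:
            frequencies[i] += 1
            break

# Height = frequency / interval length, so that area equals frequency.
heights = [
    f / (boundaries[i + 1] - boundaries[i]) for i, f in enumerate(frequencies)
]

for i, f in enumerate(frequencies):
    print(f"{boundaries[i]} - {boundaries[i + 1]}: {f:>2} {'#' * f}")
```

With equal interval lengths, as here, the heights are simply the frequencies divided by 10, so the shape of the histogram is unchanged.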

Things to look out for

Barcharts and histograms provide an easily understood illustration of the distribution of the data. As well as showing where most observations lie and how variable the data are, they also indicate certain “danger signals” about the data.

Normally distributed data

The histogram is bell-shaped, like the probability density function of a Normal distribution. It appears, therefore, that the data can be modelled by a Normal distribution. (Other methods for checking this assumption are available.)

Similarly, the histogram can be used to see whether data look as if they are from an Exponential or Uniform distribution.

Very skew data

The relatively few large observations can have an undue influence when comparing two or more sets of data. It might be worthwhile using a transformation e.g. taking logarithms.

Bimodality

This may indicate the presence of two sub-populations with different characteristics. If the subpopulations can be identified it might be better to analyse them separately.

Outliers

The data appear to follow a pattern with the exception of one or two values. You need to decide whether the strange values are simply mistakes, are to be expected or whether they are correct but unexpected. The outliers may have the most interesting story to tell.

Summary Statistics

Measures of location

By a measure of location we mean a value which typifies the numerical level of a set of observations. (It is sometimes called a “central value”, though this can be a misleading name.) We shall look at three measures of location and then discuss their relative merits.

Sample mean

The sample mean of the values $x_1, x_2, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is just the average or arithmetic mean of the values. Sometimes the prefix “sample” is dropped, but then there is a possibility of confusion with the population mean which is defined later.

Frequency data: suppose that the frequency of the class with midpoint $x_i$ is $f_i$ (for i = 1, 2, …, m). Then

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i$$

where $n = \sum_{i=1}^{m} f_i$ = total number of observations.

Example

Accidents data: find the sample mean.

Number of accidents, $x_i$   Frequency, $f_i$   $f_i x_i$
0                            55                 0
1                            14                 14
2                            5                  10
3                            2                  6
4                            0                  0
5                            2                  10
6                            1                  6
7                            0                  0
8                            1                  8
TOTAL                        80                 54

Sample mean, $\bar{x} = 54/80 = 0.675$ accidents per machinist.
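The frequency-data calculation of the sample mean is easy to mechanise. A minimal Python sketch, assuming the accidents frequency table is held as a dictionary (the names are illustrative):

```python
# Frequency table for the accidents data: value -> frequency.
freq_table = {0: 55, 1: 14, 2: 5, 3: 2, 4: 0, 5: 2, 6: 1, 7: 0, 8: 1}

n = sum(freq_table.values())                        # total number of observations
total = sum(x * f for x, f in freq_table.items())   # sum of frequency * value
sample_mean = total / n

print(n, total, sample_mean)
```

The two totals reproduce the TOTAL row of the table (80 and 54), and their ratio is the sample mean.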

Sample median

The median is the central value in the sense that there are as many values smaller than it as there are larger than it.

All values known: if there are n observations then the median is:

the $\frac{n+1}{2}$th largest value, if n is odd;

the sample mean of the $\frac{n}{2}$th largest and the $\left(\frac{n}{2}+1\right)$th largest values, if n is even.
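The rule for the median translates directly into code. The following sketch is illustrative only; `sample_median` is a hypothetical helper name, and Python's own `statistics.median` is used purely as a cross-check.

```python
import statistics

def sample_median(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the (n+1)/2-th value, i.e. index (n-1)//2 counting from zero.
        return ordered[(n - 1) // 2]
    # n even: mean of the n/2-th and (n/2 + 1)-th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_case = sample_median([6, 4, 9, 5, 2])   # sorted: 2 4 5 6 9
even_case = sample_median([6, 4, 9, 5])     # sorted: 4 5 6 9

print(odd_case, even_case)
```

Because the median is the middle of the ordered list, it makes no difference whether the values are counted from the largest or from the smallest.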

Mode

The mode, or modal value, is the most frequently occurring value. For continuous data, the simplest definition of the mode is the midpoint of the interval with the highest rectangle in the histogram. (There is a more complicated definition involving the frequencies of neighbouring intervals.) It is only useful if there are a large number of observations.

Comparing mean, median and mode

Symmetric data: the mean, median and mode will be approximately equal.

Skew data: the median is less sensitive than the mean to extreme observations. The mode ignores them.

The mode is dependent on the choice of class intervals and is therefore not favoured for sophisticated work.

Sample mean and median: it is sometimes said that the mean is better for symmetric, well behaved data while the median is better for skewed data, or data containing outliers. The choice depends mainly on the use to which you intend to put the “central” value. If the data are very skew, bimodal or contain many outliers, it may be questionable whether any single figure can be used. For more advanced work, the median is more difficult to work with. If the data are skewed, it may be better to make a transformation (e.g. take logarithms) so that the transformed data are approximately symmetric and then use the sample mean.

Measures of dispersion

A measure of dispersion is a value which indicates the degree of variability of data. Knowledge of the variability may be of interest in itself but more often is required in order to decide how precisely the sample mean reflects the population mean (to be discussed later).

Sample variance and standard deviation

For data $x_1, x_2, \ldots, x_n$ with sample mean $\bar{x}$, the sample variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right)$$

The last formula is the most convenient when using a calculator. The first formula shows that the sample variance is approximately the average of the values $(x_i-\bar{x})^2$. $s^2$ will tend to be small for data which do not vary very much and vice versa. [Some text books, particularly for ‘A’ level, use the more natural divisor n rather than n-1. Using n-1 leads to simpler formulae later on; also it reminds you that you cannot measure variability when n = 1.]

Example Find the sample mean and standard deviation of the following: 6, 4, 9, 5, 2.
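As a check on hand calculation, this example can be worked in a few lines of Python. The sketch below is not part of the original notes; it evaluates the variance with divisor n - 1 both from the squared deviations and from the sum of squares (the calculator form), giving a mean of 5.2, a variance of 6.7 and a standard deviation of about 2.59.

```python
import math

data = [6, 4, 9, 5, 2]
n = len(data)

mean = sum(data) / n

# Definitional form: sum of squared deviations divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Calculator form: (sum of x^2 - n * mean^2) divided by n - 1.
var_calculator = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

# Standard deviation: positive square root of the variance.
sd = math.sqrt(var_definition)

print(mean, round(var_definition, 4), round(sd, 3))
```

The two forms agree up to floating-point rounding, which is the machine analogue of the advice to work to full accuracy during calculations.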

For frequency data, where $f_i$ is the frequency of the class with midpoint $x_i$ (i = 1, 2, … , m):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i(x_i-\bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{m} f_i x_i^2 - n\bar{x}^2\right)$$

Again, the last formula is the easiest to use when using a calculator.

The measure of dispersion in the same units as the data is the standard deviation, which is just the (positive) square root of the variance:

$$s = \sqrt{s^2}$$

Example Evaluate the sample mean and standard deviation, using the frequency table.

LVEF          Midpoint, $x_i$   Frequency, $f_i$   $f_i x_i$   $f_i x_i^2$
24.5 – 34.5   29.5              1                  29.5        870.25
34.5 – 44.5   39.5              1                  39.5        1560.25
44.5 – 54.5   49.5              3                  148.5       7350.75
54.5 – 64.5   59.5              13                 773.5       46023.25
64.5 – 74.5   69.5              45                 3127.5      217361.25
74.5 – 84.5   79.5              36                 2862.0      227529.00
TOTAL                           99                 6980.5      500694.75

Sample mean, $\bar{x} = 6980.5/99 = 70.51$

Sample variance, $s^2 = \frac{1}{98}\left(500694.75 - 99\bar{x}^2\right) = 86.72$

Sample standard deviation, $s = \sqrt{86.72} = 9.31$

Note: when using a calculator, work to full accuracy during calculations in order to minimise rounding errors. If your calculator has statistical functions, s is denoted by $\sigma_{n-1}$.
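The grouped-data calculation for the LVEF example can be reproduced as follows. This is a sketch under the usual assumption that every observation in a class sits at the class midpoint; the variable names are illustrative.

```python
import math

# Class midpoints and frequencies from the LVEF frequency table.
midpoints = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
freqs = [1, 1, 3, 13, 45, 36]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))        # sum of f * x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))   # sum of f * x^2

mean = sum_fx / n
variance = (sum_fx2 - n * mean ** 2) / (n - 1)   # calculator form, divisor n - 1
sd = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(sd, 2))
```

Because the mean is kept at full precision until the end, the rounding errors warned about above do not arise.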

Percentiles and the interquartile range

The kth percentile is the value corresponding to cumulative relative frequency of k/100 on the cumulative relative frequency diagram e.g. the 2nd percentile is the value corresponding to cumulative relative frequency 0.02. The 25th percentile is also known as the first quartile and the 75th percentile is also known as the third quartile. The interquartile range of a set of data is the difference between the third quartile and the first quartile, or the interval between these values. It is the range within which the “middle half” of the data lie, and so is a measure of spread which is not too sensitive to one or two outliers.
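For grouped data the percentiles can be read off numerically rather than from a diagram. The sketch below is an illustration, not the notes' own procedure: it assumes the cumulative relative frequency diagram is drawn by joining the points at the class boundaries with straight lines, and applies this to the LVEF frequency table. The function name `percentile` is hypothetical.

```python
# Class boundaries and frequencies from the LVEF frequency table.
boundaries = [24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5]
freqs = [1, 1, 3, 13, 45, 36]
n = sum(freqs)

def percentile(k):
    """Value at cumulative relative frequency k/100 (linear interpolation)."""
    target = k / 100 * n          # cumulative frequency to be reached
    cumulative = 0
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        if f > 0 and cumulative + f >= target:
            # Interpolate within the class containing the target.
            return lower + (upper - lower) * (target - cumulative) / f
        cumulative += f
    return boundaries[-1]

q1 = percentile(25)   # first quartile
q3 = percentile(75)   # third quartile
iqr = q3 - q1

print(q1, q3, iqr)
```

Under this assumption the quartiles come out at about 66.0 and 77.6, an interquartile range of roughly 11.6. Other interpolation conventions give slightly different values.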

Range

The range of a set of data is the difference between the maximum and minimum values, or the interval between these values. It is another measure of the spread of the data.

Comparing sample standard deviation, interquartile range and range

The range is simple to evaluate and understand, but is sensitive to the odd extreme value and does not make effective use of all the information of the data. The sample standard deviation is also rather sensitive to extreme values but is easier to work with mathematically than the interquartile range.

Statistical Inference

Probability theory: the probability distribution of the population is known; we want to derive results about the probability of one or more values (“random sample”) – deduction.

Statistics: the results of the random sample are known; we want to determine something about the probability distribution of the population – inference.

In order to carry out valid inference, the sample must be representative, and preferably a random sample.

Random sample: two elements: (i) no bias in the selection of the sample;

(ii) different members of the sample chosen independently.

Formal definition of a random sample: $X_1, X_2, \ldots, X_n$ are a random sample if each $X_i$ has the same distribution and the $X_i$'s are all independent.

Estimation

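We assume that we know the type of distribution, but we do not know the value of the parameter(s), $\theta$ say. We want to estimate $\theta$ on the basis of a random sample $X_1, X_2, \ldots, X_n$.

Estimates are typically denoted by: $\hat{\theta}$, $\theta^*$ etc.

Method of moments: estimate $\mu$, the mean of the distribution, by the sample mean, $\bar{X}$ (= $(X_1 + \cdots + X_n)/n$), i.e. $\hat{\mu} = \bar{X}$.

Poisson distribution: random sample $X_1, X_2, \ldots, X_n$ from Poisson, parameter $\lambda$ ($\mu = \lambda$)

$$\hat{\lambda} = \bar{X}$$

Binomial distribution: $X \sim B(n, p)$ ($\mu = np$). Here $\bar{X} = X$, since X is observed only once.

$$n\hat{p} = X \;\Rightarrow\; \hat{p} = X/n$$

Exponential distribution: random sample $X_1, X_2, \ldots, X_n$ from Exponential, parameter $\lambda$ ($\mu = 1/\lambda$)

$$1/\hat{\lambda} = \bar{X}, \text{ so that } \hat{\lambda} = 1/\bar{X}$$

Normal distribution: random sample $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$

$$\hat{\mu} = \bar{X}$$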

Although the above estimates are fine, the method of moments does not always work satisfactorily.

Maximum likelihood estimation: this is, in general, a more reliable method than the method of moments. We shall consider just the following example. (See text books.)

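Example Random sample $X_1, X_2, \ldots, X_n$ from Exponential, parameter $\lambda$; show that $1/\bar{X}$ is the maximum likelihood estimator of $\lambda$.

Solution

Exponential pdf: $f(x_i) = \lambda\exp(-\lambda x_i)$ (i = 1, 2, … , n)

Since the observations are independent, multiply these together to obtain the joint density:

$$\lambda\exp(-\lambda x_1)\,\lambda\exp(-\lambda x_2)\cdots\lambda\exp(-\lambda x_n) = \lambda^n\exp\left(-\lambda\sum x_i\right)$$

The likelihood is the joint density considered as a function of $\lambda$:

$$\text{likelihood} = L(\lambda) = \lambda^n\exp\left(-\lambda\sum x_i\right) \quad (\lambda > 0)$$

The maximum likelihood estimate of $\lambda$ is the value of $\lambda$ maximising $L(\lambda)$.

$$\frac{dL}{d\lambda} = n\lambda^{n-1}\exp\left(-\lambda\sum x_i\right) - \lambda^n\left(\sum x_i\right)\exp\left(-\lambda\sum x_i\right)$$

At a turning point:

$$0 = n\hat{\lambda}^{n-1}\exp\left(-\hat{\lambda}\sum x_i\right) - \hat{\lambda}^n\left(\sum x_i\right)\exp\left(-\hat{\lambda}\sum x_i\right)$$

Solving: $\hat{\lambda} = n/\sum x_i = 1/\bar{x}$.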

[Need to check that this does yield a maximum.]
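A numerical check can make the result more convincing. The sketch below uses a small set of made-up observations (purely hypothetical, not from the notes) and searches a grid of values of the exponential parameter for the one with the largest log-likelihood. Working with the logarithm is legitimate because log is an increasing function, so L and log L are maximised at the same point.

```python
import math

# Hypothetical observations, invented purely for illustration.
sample = [0.5, 1.25, 2.0, 0.25, 1.0]
n = len(sample)
total = sum(sample)

def log_likelihood(lam):
    """log L(lambda) = n * log(lambda) - lambda * sum(x_i), for lambda > 0."""
    return n * math.log(lam) - lam * total

# Closed-form estimate from the derivation: n / sum(x_i) = 1 / sample mean.
lam_hat = n / total

# Brute-force search over lambda = 0.001, 0.002, ..., 5.000.
grid = [k / 1000 for k in range(1, 5001)]
best_on_grid = max(grid, key=log_likelihood)

print(lam_hat, best_on_grid)
```

The grid search lands on the closed-form value, and the log-likelihood is smaller on either side of it, which is consistent with the turning point being a maximum.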

Comparing estimators

We need criteria for deciding whether an estimator is any good. We shall just look at one such criterion, unbiasedness.

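The estimator $\hat{\theta}$ is unbiased for $\theta$ if $E(\hat{\theta}) = \theta$, for all values of $\theta$.

Result: $\bar{X}$ is an unbiased estimator of $\mu$.

Proof: $E(\bar{X}) = E\left(\frac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n}\left(E(X_1) + E(X_2) + \cdots + E(X_n)\right)$

$= \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu$,

So $\bar{X}$ is unbiased for $\mu$.

Similarly, it can be shown that $\bar{X}$ is unbiased for the Poisson parameter $\lambda$ and X/n is unbiased for p.

Also, the sample variance $s^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)$ is unbiased for $\sigma^2$.

[Note, the estimate $\frac{1}{n}\sum (X_i - \bar{X})^2$ is biased.]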
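The effect of the divisor can be seen exactly in a tiny example. The sketch below is an illustration with a made-up population (the values 1, 2, 3, each equally likely) and samples of size 2. It lists every possible sample and averages the two variance estimates over all of them, using exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]   # hypothetical population, each value equally likely
n = 2                    # sample size

mu = Fraction(sum(population), len(population))
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / len(population)

# All 9 equally likely ordered samples of size 2 (independent draws).
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Expected value of each estimator = average over all possible samples.
expected_s2 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
expected_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, expected_s2, expected_div_n)
```

Here the population variance is 2/3. The estimator with divisor n - 1 averages exactly 2/3, while the one with divisor n averages only 1/3, i.e. it underestimates the variance.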

Example 4.1 (a)

Unbiased estimate of $\mu$ is $\bar{x}$ = 6011.4

$\sum x^2$ = 722957824, so unbiased estimate of $\sigma^2$ is

$s^2$ = 11620.8.

Example 4.2 (a)

Unbiased estimate of $\lambda$ is $\bar{x}$ = 9.92

Example 4.3 (a)

Unbiased estimate of p is 28/80 = 0.35

Example 4.4

Method of moments estimate of $\lambda$ is $1/\bar{x}$ = 1/1.995 = 0.501.

Confidence Intervals

Estimates are “best guesses” in some sense, but because they are based on randomly varying data they cannot be 100% accurate. We therefore need an idea of the precision of the estimate. A confidence interval for a parameter is a range within which we are “pretty sure” that the parameter lies.

Normal data, variance known

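Random sample $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ is known but $\mu$ is unknown. We want a confidence interval for $\mu$.

Recall: (i) $\bar{X} \sim N(\mu, \sigma^2/n)$

(ii) With probability 0.95, a Normal random variable lies within 1.96 standard deviations of the mean.

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Re-arranging:

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

A 95% confidence interval for $\mu$ is: $\bar{x} - 1.96\sigma/\sqrt{n}$ to $\bar{x} + 1.96\sigma/\sqrt{n}$.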

Example 4.1(b)

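If $\sigma$ = 100, find 95% confidence interval for $\mu$.

Solution

Recall $\bar{x}$ = 6011.4 and n = 20. By above, 95% confidence interval is:

$6011.4 - 1.96 \times 100/\sqrt{20}$ to $6011.4 + 1.96 \times 100/\sqrt{20}$

i.e. 5968 to 6055 kg/cm².

99% confidence interval Replace “1.96” by “2.5758”.

99% confidence interval is:

$6011.4 - 2.5758 \times 100/\sqrt{20}$ to $6011.4 + 2.5758 \times 100/\sqrt{20}$

i.e. 5954 to 6069 kg/cm². [Wider than the 95% interval.]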
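The two intervals of Example 4.1(b) can be reproduced in Python. The sketch below is illustrative: `normal_ci` is a hypothetical helper, and the multipliers are taken from the standard library's Normal distribution rather than from printed tables.

```python
import math
from statistics import NormalDist

xbar, sigma, n = 6011.4, 100, 20   # values from Example 4.1

def normal_ci(xbar, sigma, n, level):
    """Two-sided confidence interval for the mean when sigma is known."""
    # Upper-tail point of N(0, 1): about 1.96 for 95%, 2.5758 for 99%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

ci95 = normal_ci(xbar, sigma, n, 0.95)
ci99 = normal_ci(xbar, sigma, n, 0.99)

print([round(v) for v in ci95], [round(v) for v in ci99])
```

Raising the confidence level only changes the multiplier, which is why the 99% interval is wider than the 95% one.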

Binomial data

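n Bernoulli trials, X = number of successes; $X \sim B(n, p)$.

If n is large, X is approx. $N(np, np(1-p))$.

Therefore, $P\left(np - 1.96\sqrt{np(1-p)} < X < np + 1.96\sqrt{np(1-p)}\right) \approx 0.95$

$P\left(p - 1.96\sqrt{p(1-p)/n} < X/n < p + 1.96\sqrt{p(1-p)/n}\right) \approx 0.95$

Rearranging:

$P\left(X/n - 1.96\sqrt{p(1-p)/n} < p < X/n + 1.96\sqrt{p(1-p)/n}\right) \approx 0.95$

Estimate $\sqrt{p(1-p)/n}$ by $\sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p} = X/n$.

An approx. 95% confidence interval for p is:

$\hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ to $\hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n}$

Example 4.3

n = 80, X = 28. Therefore, approximate 95% confidence interval for p is:

$0.35 - 1.96\sqrt{0.35 \times 0.65/80}$ to $0.35 + 1.96\sqrt{0.35 \times 0.65/80}$

i.e. 0.245 to 0.455 [Note how wide this is.]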
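The arithmetic of Example 4.3 in Python, as a minimal sketch (variable names are illustrative, and the Normal approximation is assumed to be adequate for n = 80):

```python
import math

n, x = 80, 28                 # values from Example 4.3
p_hat = x / n                 # estimate of p

# Estimated standard deviation of X/n.
std_err = math.sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - 1.96 * std_err
upper = p_hat + 1.96 * std_err

print(round(lower, 3), round(upper, 3))
```

The half-width is about 0.105, so even with 80 trials the proportion is pinned down only to within roughly 10 percentage points.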

Poisson data

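Random sample $X_1, X_2, \ldots, X_n$ from Poisson distribution, parameter $\lambda$. Similar derivation to the above.

Approximate 95% confidence interval for $\lambda$ is:

$\bar{x} - 1.96\sqrt{\bar{x}/n}$ to $\bar{x} + 1.96\sqrt{\bar{x}/n}$

Example 4.2 (b)

$\bar{x}$ = 9.92 and n = 50, so approximate 95% confidence interval for $\lambda$ is:

$9.92 - 1.96\sqrt{9.92/50}$ to $9.92 + 1.96\sqrt{9.92/50}$

i.e. 9.05 to 10.79.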
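The Poisson interval follows the same pattern. The sketch below rests on the fact that a Poisson distribution has variance equal to its mean, so the variance of the sample mean is estimated by the sample mean divided by n. The names are illustrative.

```python
import math

xbar, n = 9.92, 50            # values from Example 4.2

# For a Poisson distribution the variance equals the mean, so the
# standard deviation of the sample mean is estimated by sqrt(xbar / n).
std_err = math.sqrt(xbar / n)

lower = xbar - 1.96 * std_err
upper = xbar + 1.96 * std_err

print(round(lower, 2), round(upper, 2))
```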

Normal data, variance unknown

Random sample $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ and $\mu$ are unknown. We want a confidence interval for $\mu$.

If $\sigma^2$ is known, confidence interval for $\mu$ is: $\bar{x} - z\sigma/\sqrt{n}$ to $\bar{x} + z\sigma/\sqrt{n}$, where z is obtained from Normal tables.

If $\sigma^2$ is unknown, we need to make two changes:

(i) estimate $\sigma^2$ by $s^2$, the sample variance;

(ii) replace z by $t_{n-1}$, the value obtained from t-tables.

The confidence interval for $\mu$ is: $\bar{x} - t_{n-1}s/\sqrt{n}$ to $\bar{x} + t_{n-1}s/\sqrt{n}$.

t-tables: these relate to the Student’s t distributions, denoted by $t_\nu$ where $\nu$ is a parameter called the number of degrees of freedom.

All Student’s t distributions are similar looking to the N(0, 1) distribution, but more spread out. [OHP slides.]

The t-tables are laid out differently from N(0,1).

For a 95% confidence interval, we want the middle 95% region, so Q = 0.975.

Similarly, for a 99% confidence interval, we would want Q = 0.995.

Example 4.1(c)

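Recall, n = 20, $\bar{x}$ = 6011.4, $s^2$ = 11620.8. From t-tables with 19 degrees of freedom and Q = 0.975, $t_{19}$ = 2.093.

95% confidence interval for $\mu$ is: $6011.4 - 2.093\sqrt{11620.8/20}$ to $6011.4 + 2.093\sqrt{11620.8/20}$;

i.e. 5961 to 6062 kg/cm².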
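Example 4.1(c) can be checked in the same way. Python's standard library has no Student's t quantile function, so this sketch simply types in the tabulated value 2.093 quoted above; the other names are illustrative.

```python
import math

n, xbar, s2 = 20, 6011.4, 11620.8   # values from Example 4.1
t_19 = 2.093   # from t-tables: 19 degrees of freedom, Q = 0.975

# Half-width of the interval: t * s / sqrt(n), written as t * sqrt(s^2 / n).
half_width = t_19 * math.sqrt(s2 / n)

lower = xbar - half_width
upper = xbar + half_width

print(round(lower), round(upper))
```

The half-width is about 50, compared with about 44 when the standard deviation was taken as known to be 100. This reflects both the larger multiplier (2.093 against 1.96) and the larger estimated standard deviation.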

Sample size

When planning an experiment or series of tests, you need to decide how many repeats to carry out to obtain a certain level of precision in your estimate. The confidence interval formula can be helpful.

For example, for Normal data, confidence interval for $\mu$ is: $\bar{x} - t_{n-1}s/\sqrt{n}$ to $\bar{x} + t_{n-1}s/\sqrt{n}$

i.e. $\bar{x} \pm t_{n-1}s/\sqrt{n}$

Suppose we want to estimate $\mu$ to within $\pm\delta$, where $\delta$ is given. We must choose the sample size, n, satisfying:

$$\delta = \frac{t_{n-1}s}{\sqrt{n}}, \quad\text{i.e.}\quad n = \frac{t_{n-1}^2 s^2}{\delta^2}$$

To use this we need: (i) an estimate of $s^2$ (e.g. results from previous experiments);

(ii) an estimate of $t_{n-1}$. This depends on n, but not very strongly. You will not go far wrong, in general, if you take $t_{n-1}$ = 2.1 for 95% confidence.

For 95% confidence, choose $n = 2.1^2 s^2/\delta^2$

Example 4.1(d)

Take estimate of variance as: $s^2$ = 11620.8, $\delta$ = 20.

Therefore, $n = (2.1^2 \times 11620.8)/20^2 \approx 128$, i.e. approximately 108 more samples (in addition to the 20 already taken).
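The sample-size rule is a one-line calculation. In the sketch below `sample_size` is a hypothetical helper, and the default multiplier 2.1 is the rough 95% value suggested above.

```python
def sample_size(s2, delta, t=2.1):
    """Approximate n needed to estimate the mean to within +/- delta."""
    return t ** 2 * s2 / delta ** 2

n_required = sample_size(11620.8, 20)   # values from Example 4.1(d)
extra = round(n_required) - 20          # 20 observations are already in hand

print(round(n_required, 1), extra)
```

The formula gives about 128.1. Strictly one should round up to 129 to guarantee the stated precision, but given the rough value used for t the difference is immaterial.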