Measures of Central Tendency

balajiv.ganesh · Feb 1, 2013

Description
Understand the role of descriptive statistics in the summarization, description and interpretation of data. Understand the importance of summary measures for describing the characteristics of a data set. Use several numerical methods belonging to measures of central tendency to describe the characteristics of a data set.

Measures of Central Tendency

Objectives
• Understand the role of descriptive statistics in summarization, description and interpretation of the data.
• Understand the importance of summary measures to describe characteristics of a data set.
• Use several numerical methods belonging to measures of central tendency to describe the characteristics of a data set.

INTRODUCTION

Although frequency distributions and corresponding graphical representations make raw data more meaningful, they fail to identify three major properties that describe a set of quantitative data. These three major properties are:
• The numerical value of an observation (also called central value) around which most numerical values of other observations in the data set show a tendency to cluster or group, called the central tendency.
• The extent to which numerical values are dispersed around the central value, called variation.
• The extent of departure of numerical values from symmetrical (normal) distribution around the central value, called skewness.

The term ‘central tendency’ was coined because observations (numerical values) in most data sets show a distinct tendency to group or cluster around a value of an observation located somewhere in the middle of all observations. It is necessary to identify or calculate this typical central value (also called average) to describe or project the characteristic of the entire data set. This descriptive value is the measure of the central tendency or location and methods of computing this central value are called measures of central tendency.

Objectives of Averaging
A few of the objectives of calculating a typical central value or average to describe the entire data set are given below:
• It is useful to extract and summarize the characteristics of the entire data set in a precise form.
• Since an ‘average’ represents the entire data set, it facilitates comparison between two or more data sets.
• It offers a base for computing various other measures such as dispersion, skewness, and kurtosis, which help in many other phases of statistical analysis.

Measures of Central Tendency
The various measures of central tendency or averages commonly used can be broadly classified in the following categories:
1. Mathematical Averages
   a) Arithmetic Mean, commonly called the mean or average
      i. Simple
      ii. Weighted
   b) Geometric Mean
   c) Harmonic Mean
2. Averages of Position
   a) Median
   b) Quartiles
   c) Deciles
   d) Percentiles
   e) Mode

Mathematical Averages
Direct Method
In this method A.M. is calculated by adding the values of all observations and dividing the total by the number of observations. Thus, if x1, x2, . . ., xN represent the values of N observations, then the A.M. for a population of N observations is:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{3-1a}$$

Alternative Formula
In general, when observations xi (i = 1, 2, . . ., n) are grouped as a frequency distribution, then A.M. formula (3-1b) should be modified as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} f_i x_i \tag{3-2}$$

where fi represents the frequency (number of observations) with which variable xi occurs in the given data set, i.e. n = Σ fi.

Direct Method
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} f_i m_i \tag{3-5}$$

where
mi = mid-value of ith class interval
fi = frequency of ith class interval
n = Σ fi, sum of all frequencies

Step-deviation Method
$$\bar{x} = A + \frac{\sum f_i d_i}{n} \times h \tag{3-6}$$

where
A = assumed value for the A.M.
n = sum of all frequencies
h = width of the class intervals
mi = mid-value of ith class interval
di = (mi − A)/h, deviation from the assumed mean

Advantages and Disadvantages of Arithmetic Mean
Advantages
(i) The calculation of arithmetic mean is simple and it is unique, that is, every data set has one and only one mean.
(ii) The calculation of arithmetic mean is based on all values given in the data set.
(iii) The arithmetic mean is a reliable single value that reflects all values in the data set.
(iv) The arithmetic mean is least affected by fluctuations in the sample size. In other words, its value, determined from various samples drawn from a population, varies by the least possible amount.
(v) It can be readily put to algebraic treatment. Some of the algebraic properties of arithmetic mean are as follows:

Disadvantages
(i) The value of A.M. cannot be calculated accurately for unequal and open-ended class intervals either at the beginning or end of the given frequency distribution.
(ii) The A.M. is reliable and reflects all the values in the data set. However, it is very much affected by extreme observations (or outliers) which are not representative of the rest of the data set. Outliers at the high end will increase the mean, while outliers at the lower end will decrease it. For example, if the monthly incomes of four persons are 50, 70, 80, and 1000, then their A.M. will be 300, which does not represent the data.
(iii) The calculation of A.M. sometimes becomes difficult because every data element is used in the calculation (unless the short cut method for grouped data is used to calculate the mean). Moreover, the value so obtained may not be among the observations included in the data.
(iv) The mean cannot be calculated for qualitative characteristics such as intelligence, honesty, beauty, or loyalty.
(v) The mean cannot be calculated for a data set that has open-ended classes at either the high or low end of the scale.
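The sensitivity of the A.M. to outliers can be checked numerically. The short Python sketch below reuses the four incomes from the example above and recomputes the mean with the extreme value left out; the variable names are illustrative only.

```python
from statistics import mean

# Monthly incomes of four persons, taken from the example above.
incomes = [50, 70, 80, 1000]

# Mean of all four values: (50 + 70 + 80 + 1000) / 4
mean_all = mean(incomes)

# Mean of the three typical values, with the outlier (1000) left out.
typical = [x for x in incomes if x != 1000]
mean_typical = mean(typical)

print(f"mean with outlier    = {mean_all}")
print(f"mean without outlier = {mean_typical:.2f}")
```

A single extreme value moves the mean from about 66.67 to 300, a figure that is far from every one of the three typical incomes.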

Weighted Arithmetic Mean
The arithmetic mean gives equal importance (or weight) to each observation in the data set. However, there are situations in which the values of individual observations in the data set are not of equal importance. If values occur with different frequencies, then computing the A.M. of values (as opposed to the A.M. of observations) may not be truly representative of the data set characteristic and thus may be misleading. Under these circumstances, we may attach to each observation value a ‘weight’ w1, w2, . . ., wN as an indicator of its importance, perhaps because of size, and compute a weighted mean or average as:

$$\mu_w \ \text{or} \ \bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$$

The weighted arithmetic mean should be used
(i) when the importance of all the numerical values in the given data set is not equal.
(ii) when the frequencies of various classes are widely varying.
(iii) where there is a change either in the proportion of numerical values or in the proportion of their frequencies.
(iv) when ratios, percentages, or rates are being averaged.
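As an illustration of case (iv), the following Python sketch averages percentage rates using group sizes as weights. The pass rates and student counts are hypothetical figures invented for this example, and `weighted_mean` is an illustrative helper, not a library function.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(x_i * w_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical data: pass rates (%) of three class sections and the
# number of students in each section, used as weights.
pass_rates = [60.0, 80.0, 90.0]
students = [50, 30, 20]

simple = sum(pass_rates) / len(pass_rates)     # ignores section sizes
weighted = weighted_mean(pass_rates, students)

print(f"simple mean   = {simple:.2f}")
print(f"weighted mean = {weighted:.2f}")
```

The simple mean treats the three sections as equally important, while the weighted mean gives the overall pass rate across all 100 students.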

The following distribution gives the pattern of overtime work done by 100 employees of a company. Calculate the average overtime work done per employee.

Overtime Hours    Number of Employees
10-15             11
15-20             20
20-25             35
25-30             20
30-35             8
35-40             6

Ans. 23.1 hours
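The stated answer can be reproduced with both grouped-data formulas, (3-5) and (3-6). The Python sketch below is one way to do so; the assumed mean A = 22.5 is an arbitrary choice made for this illustration (any class mid-value would serve).

```python
# Overtime distribution from the exercise above.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
freqs = [11, 20, 35, 20, 8, 6]

n = sum(freqs)                                  # total number of employees
mids = [(lo + hi) / 2 for lo, hi in classes]    # mid-values m_i

# Direct method (3-5): x_bar = (1/n) * sum(f_i * m_i)
direct = sum(f * m for f, m in zip(freqs, mids)) / n

# Step-deviation method (3-6): x_bar = A + (sum(f_i * d_i) / n) * h
A = 22.5                                        # assumed mean (our choice)
h = 5                                           # common class width
d = [(m - A) / h for m in mids]                 # d_i = (m_i - A) / h
step = A + sum(f * di for f, di in zip(freqs, d)) / n * h

print(f"direct method         = {direct:.1f} hours")
print(f"step-deviation method = {step:.1f} hours")
```

Both methods give 23.1 hours; the step-deviation method only rescales the mid-values to make hand calculation easier.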

Median
Median may be defined as the middle value in the data set when its elements are arranged in a sequential order, that is, in either ascending or descending order of magnitude. It is called a middle value in an ordered sequence of data in the sense that half of the observations are smaller and half are larger than this value. The median is thus a measure of the location or centrality of the observations.

Grouped Data
The following formula is used to determine the median of grouped data:

$$\text{Med} = l + \frac{(n/2) - cf}{f} \times h$$

where
l = lower class limit (or boundary) of the median class interval
cf = cumulative frequency of the class prior to the median class interval, that is, the sum of all the class frequencies up to, but not including, the median class interval
f = frequency of the median class
h = width of the median class interval
n = total number of observations in the distribution

Advantages, Disadvantages, and Applications of Median
Advantages
(i) Median is unique, i.e. like mean, there is only one median for a set of data.
(ii) The value of median is easy to understand and may be calculated from any type of data. The median in many situations can be located simply by inspection.
(iii) The sum of the absolute differences of all observations in the data set from the median value is minimum. In other words, the sum of absolute differences of observations from the median is less than from any other value in the distribution. That is, Σ | x – Med | = a minimum value.
(iv) The extreme values in the data set do not affect the calculation of the median value, and therefore it is a useful measure of central tendency when such values do occur.
(v) The median is considered the best statistical technique for studying the qualitative attribute of an observation in the data set.
(vi) The median value may be calculated for an open-end distribution of data set.
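The minimum-absolute-deviation property of the median, stated in the list of advantages, can be illustrated numerically. The Python sketch below uses a small hypothetical data set and compares the sum of absolute deviations taken from the median with the same sum taken from the mean and from every observed value.

```python
from statistics import mean, median


def sum_abs_dev(data, centre):
    """Sum of absolute deviations of the data from a chosen centre."""
    return sum(abs(x - centre) for x in data)


# Hypothetical observations, chosen only for illustration.
data = [2, 3, 5, 8, 13]

med = median(data)
am = mean(data)

print(f"about the median ({med}): {sum_abs_dev(data, med)}")
print(f"about the mean   ({am}): {sum_abs_dev(data, am):.1f}")
for x in data:
    print(f"about the value  {x}: {sum_abs_dev(data, x)}")
```

None of the other centres gives a smaller total than the median does.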

Disadvantages
(i) The median is not capable of algebraic treatment. For example, the median of two or more sets of data combined cannot be determined from their individual medians.
(ii) The value of median is affected more by sampling variations, that is, it is affected by the number of observations rather than the values of the observations. Any observation selected at random is just as likely to exceed the median as it is to be exceeded by it.
(iii) Since median is an average of position, arranging the data in ascending or descending order of magnitude is time consuming in case of a large number of observations.
(iv) The calculation of median in case of grouped data is based on the assumption that values of observations are evenly spaced over the entire class-interval.

Applications
The median is helpful in understanding the characteristic of a data set when
(i) observations are qualitative in nature
(ii) extreme values are present in the data set
(iii) a quick estimate of an average is desired.

• A survey was conducted to determine the age (in years) of 120 automobiles. The result of the survey was as follows. What is the median age of the autos?

Age of Auto    Number of Autos
0-4            13
4-8            29
8-12           48
12-16          22
16-20          8

Ans. 9.5 years
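The answer can be checked by applying the grouped-data median formula step by step. The Python sketch below locates the median class from the cumulative frequencies and then interpolates within it; the variable names are illustrative.

```python
from itertools import accumulate

# Age distribution of the 120 automobiles from the exercise above.
lower_limits = [0, 4, 8, 12, 16]   # lower limit l of each class
freqs = [13, 29, 48, 22, 8]        # class frequencies
h = 4                              # class width

n = sum(freqs)
cum = list(accumulate(freqs))      # cumulative frequencies

# The median class is the first class whose cumulative frequency reaches n/2.
k = next(i for i, c in enumerate(cum) if c >= n / 2)
cf = cum[k - 1] if k > 0 else 0    # cumulative frequency before that class

# Med = l + ((n/2 - cf) / f) * h
median_age = lower_limits[k] + (n / 2 - cf) / freqs[k] * h

print(f"median class starts at {lower_limits[k]}, cf = {cf}")
print(f"median age = {median_age} years")
```

Here n/2 = 60 falls in the class 8-12, which is preceded by 42 observations, so the median is 8 + (18/48) × 4 = 9.5 years.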

Quartiles
The values of observations in a data set, when arranged in an ordered sequence, can be divided into four equal parts, or quarters, using three quartiles namely Q1, Q2, and Q3. The first quartile Q1 divides a distribution in such a way that 25 per cent (=n/4) of observations have a value less than Q1 and 75 per cent (= 3n/4) have a value more than Q1, i.e. Q1 is the median of the ordered values that are below the median.

The generalized formula for calculating quartiles in case of grouped data is:

$$Q_i = l + \frac{i(n/4) - cf}{f} \times h; \quad i = 1, 2, 3 \tag{3-17}$$

where
cf = cumulative frequency prior to the quartile class interval
l = lower limit of the quartile class interval
f = frequency of the quartile class interval
h = width of the class interval

Deciles
The values of observations in a data set when arranged in an ordered sequence can be divided into ten equal parts, using nine deciles, Di (i = 1, 2, . . ., 9). The generalized formula for calculating deciles in case of grouped data is:

$$D_i = l + \frac{i(n/10) - cf}{f} \times h; \quad i = 1, 2, \ldots, 9 \tag{3-18}$$

where the symbols have their usual meaning and interpretation.

Percentiles
The values of observations in a data set when arranged in an ordered sequence can be divided into hundred equal parts using ninety-nine percentiles, Pi (i = 1, 2, . . ., 99). In general, the ith percentile is a number that has i% of the data values at or below it and (100 – i)% of the data values at or above it. The lower quartile (Q1), median and upper quartile (Q3) are also the 25th percentile, 50th percentile and 75th percentile, respectively. For example, if you are told that you scored at the 90th percentile in a test (like the CAT), it indicates that 90% of the scores were at or below your score, while 10% were at or above your score. The generalized formula for calculating percentiles in case of grouped data is:

$$P_i = l + \left\{ \frac{i(n/100) - cf}{f} \right\} \times h; \quad i = 1, 2, \ldots, 99 \tag{3-19}$$

where the symbols have their usual meaning and interpretation.

• The following distribution gives the pattern of overtime work done by 100 employees of a company. Calculate the median, first quartile, and seventh decile of the overtime work done.

Overtime Hours    Number of Employees
10-15             11
15-20             20
20-25             35
25-30             20
30-35             8
35-40             6

Ans. Median: 22.714; First Quartile: 18.5; Seventh Decile: 26
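Formulas (3-17), (3-18) and (3-19) share one pattern: the position i·n/k is located in the cumulative frequencies and then interpolated within its class, with k = 4, 10 or 100. The Python sketch below captures that pattern in a single helper; `grouped_quantile` and its parameter names are illustrative, not part of any library.

```python
def grouped_quantile(classes, freqs, i, parts):
    """i-th partition value when the data are split into `parts` equal parts.

    Implements l + ((i * n / parts - cf) / f) * h, where the class holding
    position i * n / parts supplies l, f and h.
    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    n = sum(freqs)
    target = i * n / parts
    cf = 0                                   # cumulative frequency so far
    for (lo, hi), f in zip(classes, freqs):
        if f > 0 and cf + f >= target:
            return lo + (target - cf) / f * (hi - lo)
        cf += f
    raise ValueError("no class contains the requested position")


# Overtime distribution from the exercise above.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
freqs = [11, 20, 35, 20, 8, 6]

med = grouped_quantile(classes, freqs, 2, 4)    # median = second quartile
q1 = grouped_quantile(classes, freqs, 1, 4)     # first quartile
d7 = grouped_quantile(classes, freqs, 7, 10)    # seventh decile

print(f"median = {med:.3f}, Q1 = {q1:.1f}, D7 = {d7:.1f}")
```

The three values agree with the stated answers of 22.714, 18.5 and 26.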

Mode
The mode is that value of an observation which occurs most frequently in the data set, that is, the point (or class mark) with the highest frequency. The concept of mode is of great use to large-scale manufacturers of consumable items such as ready-made garments and shoes. In all such cases it is important to know the size that fits most persons rather than the ‘mean’ size.

There are many practical situations in which the arithmetic mean does not provide an accurate characteristic (reflection) of the data due to the presence of extreme values. The median may not represent the characteristics of the data set completely owing to an uneven distribution of the values of observations. For example, suppose in a distribution the values in the lower half vary from 10 to 100 (say), while the same number of observations in the upper half vary from 100 to 7000 (say) with most of them close to the higher limit. In such a distribution, the median value of 100 will not provide an indication of the true nature of the data. The shortcomings stated above for mean and median are removed by the use of mode, the third measure of central tendency. The mode is a poor measure of central tendency when the most frequently occurring values of an observation do not appear close to the centre of the data.

In the case of grouped data, the following formula is used for calculating mode:

$$\text{Mode} = l + \frac{f_m - f_{m-1}}{2 f_m - f_{m-1} - f_{m+1}} \times h$$

where
l = lower limit of the modal class interval
fm = frequency of the modal class interval
fm – 1 = frequency of the class preceding the modal class interval
fm + 1 = frequency of the class following the modal class interval
h = width of the modal class interval

• Frequency Distribution of Sales per day

Sales Volume    Number of Days
53-56           2
57-60           4
61-64           5
65-68           4
69-72           4
72 and above    1

Ans. 62.5
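The stated answer is consistent with the grouped-data mode formula if the class limits are first converted to class boundaries. The Python sketch below assumes that conversion (52.5-56.5, 56.5-60.5, and so on, giving a width of 4) and assumes equal class widths; both are assumptions of this illustration, and `grouped_mode` is an illustrative helper.

```python
def grouped_mode(first_lower_boundary, h, freqs):
    """Mode = l + ((fm - f_prev) / (2*fm - f_prev - f_next)) * h.

    Assumes equal class widths h, so the lower boundary of class k is
    first_lower_boundary + k * h. If several classes share the highest
    frequency, the first one is used.
    """
    m = max(range(len(freqs)), key=lambda k: freqs[k])     # modal class index
    fm = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    denom = 2 * fm - f_prev - f_next
    if denom == 0:
        raise ValueError("formula undefined: neighbours equal the modal frequency")
    lower = first_lower_boundary + m * h
    return lower + (fm - f_prev) / denom * h


# Sales distribution from the exercise above; the last entry is the
# open-ended class, which is needed only for its frequency.
freqs = [2, 4, 5, 4, 4, 1]

# Assumption: limits 53-56 become boundaries 52.5-56.5, so h = 4.
mode = grouped_mode(first_lower_boundary=52.5, h=4, freqs=freqs)
print(f"mode = {mode}")
```

The modal class is 61-64 (boundary 60.5), so the mode is 60.5 + (1/2) × 4 = 62.5.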

Relationship Between Mean, Median, and Mode

[Figure: three distribution curves. (a) Symmetrical: Mean = Median = Mode. (b) Skewed to the Right: Mode, Median, Mean (from left to right). (c) Skewed to the Left: Mean, Median, Mode (from left to right).]

Figure 3.5 A comparison of Mean, Median, and Mode for three Distributional Shapes

A distribution that is not symmetrical, but rather has most of its values either to the right or to the left of the mode, is said to be skewed. For such asymmetrical distribution, Karl Pearson has suggested a relationship between these three measures of central tendency as:
Mean – Mode = 3 (Mean – Median) or Mode = 3 Median – 2 Mean

This implies that the value of any of these three measures can be calculated provided we know any two values out of three.
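As a small illustration, the relationship Mode = 3 Median – 2 Mean can be rearranged to recover whichever measure is missing. The Python sketch below uses hypothetical values (mean 30, median 28) chosen only for this example; the function names are illustrative.

```python
def mode_from(mean, median):
    """Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def median_from(mean, mode):
    """Rearranged: Median = (Mode + 2 * Mean) / 3."""
    return (mode + 2 * mean) / 3


def mean_from(median, mode):
    """Rearranged: Mean = (3 * Median - Mode) / 2."""
    return (3 * median - mode) / 2


# Hypothetical values for a moderately right-skewed distribution.
mean_value, median_value = 30.0, 28.0

mode_value = mode_from(mean_value, median_value)
print(f"mode   = {mode_value}")
print(f"median = {median_from(mean_value, mode_value)}")
print(f"mean   = {mean_from(median_value, mode_value)}")
```

The relationship is empirical, so values obtained this way are approximations rather than exact results.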

Comparison Between Measures of Central Tendency
1. The Presence of Outlier Data Values: The data values that differ in a big way from the other values in a data set are known as outliers (either very small or very high values). As mentioned earlier, the median is not sensitive to outlier values because its value depends only on the number of observations and always lies in the middle of the ordered set of values, whereas the mean, which is calculated using all data values, is sensitive to the outlier values in a data set. The smaller the number of observations in a data set, the greater the influence of any outliers on the mean. The median is said to be resistant to the presence of outlier data values, but the mean is not.

2. Shape of Frequency Distribution: The effect of the shape of frequency distribution on mean, median, and mode is shown in Fig. 3.5. In general, the median is preferred to the mean as a way of measuring location for single peaked, skewed distributions.
When data is multi-modal, there is no single measure of central location and the mode can vary dramatically from one sample to another, particularly when dealing with small samples.

3. The Status of Theoretical Development: Although the three measures of central tendency (Mean, Median, and Mode) satisfy different mathematical criteria, the objective of statistical analysis in inferential statistics is usually to minimize the sum of squared deviations (errors) taken from these measures to every value in the data set. The criterion of the sum of squared deviations is also called the least squares criterion. Since A.M. satisfies the least squares criterion, it is mathematically consistent with several techniques of statistical inference.
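The least squares criterion can be illustrated with a simple search. The Python sketch below evaluates the sum of squared deviations over a grid of candidate centres for a small hypothetical data set and checks which candidate gives the smallest total; the data and the grid spacing are assumptions of this illustration.

```python
from statistics import mean


def sum_sq_dev(data, centre):
    """Sum of squared deviations of the data from a chosen centre."""
    return sum((x - centre) ** 2 for x in data)


# Hypothetical observations, chosen only for illustration.
data = [2, 3, 5, 8, 12]
am = mean(data)

# Candidate centres 0.0, 0.1, ..., 15.0
candidates = [c / 10 for c in range(0, 151)]
best = min(candidates, key=lambda c: sum_sq_dev(data, c))

print(f"arithmetic mean            = {am}")
print(f"centre with the smallest SS = {best}")
print(f"smallest sum of squares     = {sum_sq_dev(data, best)}")
```

The candidate that minimizes the sum of squared deviations coincides with the arithmetic mean of the data.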
