5.4. Dispersion (Deviation) of data:
5.4.1 Mean, Median, Mode for grouped data
When the number of scores is large, it becomes difficult to calculate the
Mean, Median and Mode directly. In such cases we use class intervals to
represent the data, as studied in 5.1.1 Example 2. When scores are represented
in class intervals, we follow a slightly different method for calculating the
Mean, Median and Mode. Let us study the method using an example.
5.4.1. Example 1:
Assume that the following data about the presence of 110 people from
different age groups in a marriage function is collected.
Working:
Class Interval (CI) (Age groups) | Frequency (f) |
0-10 | 7 |
10-20 | 13 |
20-30 | 24 |
30-40 | 26 |
40-50 | 18 |
50-60 | 12 |
60-70 | 10 |
Note: In the above distribution, the upper limit of each class interval
appears again as the lower limit of the next class interval (for example, 10
appears twice: once in CI 0-10 and once in CI 10-20).
So where should a score equal to the upper limit (10) be included? By
convention, the upper limit is not included in its own class interval; it is
included in the next class interval (i.e. the score 10 is included in CI 10-20
and not in CI 0-10).
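The convention can be sketched in a few lines of Python. This is only an illustration; the function name `find_class_interval` and the list `limits` are made up for this sketch and do not come from the lesson.

```python
def find_class_interval(score, limits):
    """Return the (lower, upper) class interval that contains `score`.

    `limits` is the ordered list of class limits, e.g. [0, 10, 20, ...].
    Each interval includes its lower limit and excludes its upper limit.
    """
    for lower, upper in zip(limits, limits[1:]):
        if lower <= score < upper:
            return (lower, upper)
    return None  # the score falls outside every class interval


limits = [0, 10, 20, 30, 40, 50, 60, 70]
print(find_class_interval(10, limits))  # (10, 20)
print(find_class_interval(9, limits))   # (0, 10)
```

The comparison `lower <= score < upper` is what places the score 10 in CI 10-20 rather than in CI 0-10.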
Let us calculate the mean,
median and mode for grouped data.
To recollect, if we had ungrouped scores then
Mean = $\frac{\text{Sum of all the scores}}{\text{Number of scores}}$
Similarly, the Median would be in the interval ‘30-40’ (which contains the
55th and 56th scores).
Since
we do not have individual scores, it will not be possible for us to arrive at
the exact mode and exact median easily. In such cases we follow a different
method:
We use the following notations to arrive at the values shown below:
N = Total number of scores = 110
‘Mid point’ (or ‘Class mark’) (x) = $\frac{\text{Lower limit} + \text{Upper limit}}{2}$ of the class interval
f = frequency
f(x) = f*x
‘Cumulative frequency’ (cf) of a class interval is the sum of the frequencies
of all the class intervals up to and including this class interval.
C-I | Frequency (f) | Cumulative frequency (cf) | Mid Point (x) of CI | f(x) = f*x |
0-10 | 7 | 7 | 5 | 35 |
10-20 | 13 | 20=7+13 | 15 | 195 |
20-30 | 24 | 44=20+24 | 25 | 600 |
30-40 | 26 | 70=44+26 | 35 | 910 |
40-50 | 18 | 88=70+18 | 45 | 810 |
50-60 | 12 | 100=88+12 | 55 | 660 |
60-70 | 10 | 110=100+10 | 65 | 650 |
Total | N=110 | | | $\sum f(x)$ = 3860 |
By definition, Mean = $\frac{\sum f(x)}{N} = \frac{3860}{110} = 35.09 \approx 35.1$
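The same calculation can be checked with a short Python sketch. The names `classes` and `grouped_mean` are chosen only for this illustration; each class interval is written as (lower limit, upper limit, frequency).

```python
# (lower limit, upper limit, frequency) for each class interval of the example
classes = [(0, 10, 7), (10, 20, 13), (20, 30, 24), (30, 40, 26),
           (40, 50, 18), (50, 60, 12), (60, 70, 10)]


def grouped_mean(classes):
    """Mean of grouped data: sum of f * mid point, divided by N."""
    n = sum(freq for _, _, freq in classes)
    total_fx = sum(freq * (lower + upper) / 2 for lower, upper, freq in classes)
    return total_fx / n


print(round(grouped_mean(classes), 2))  # 35.09
```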
Since the number of scores is 110, the Median must lie between the 55th and
56th scores, which are in the class interval ‘30-40’ (because up to the class
interval 20-30 we have 44 scores (cf) and up to the class interval 30-40 we
have 70 scores (cf)).
Let
i = size of the class interval = upper limit - lower limit = 10
L = Lower limit of the class interval which includes the median score = 30
(this CI, ‘30-40’, is also called the Median class interval)
F = Cumulative frequency up to (but not including) the median class interval = 44
m = frequency of the median class interval = 26
Then
Median = $L + \left(\frac{N/2 - F}{m}\right) \times i$
= $30 + \left(\frac{55 - 44}{26}\right) \times 10 = 30 + \frac{11}{26} \times 10$ = 30+4.23 = 34.23
The Mode lies in the class interval ‘30-40’ (the interval with the highest
frequency) and the formula for the mode is
Mode = 3*median-2*mean
= 3*34.23-2*35.1
= 32.49
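A minimal Python sketch of the median and mode formulas follows. It assumes that the class size i is taken as upper limit minus lower limit (10 for these intervals); the names `grouped_median` and `empirical_mode` are illustrative only.

```python
# (lower limit, upper limit, frequency) for each class interval of the example
classes = [(0, 10, 7), (10, 20, 13), (20, 30, 24), (30, 40, 26),
           (40, 50, 18), (50, 60, 12), (60, 70, 10)]


def grouped_median(classes):
    """Median = L + ((N/2 - F) / m) * i for the median class interval."""
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("no scores")
    half = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            size = upper - lower  # i: assumed to be upper limit - lower limit
            return lower + (half - cumulative) / freq * size
        cumulative += freq


def empirical_mode(mean, median):
    """Mode = 3 * median - 2 * mean."""
    return 3 * median - 2 * mean


median = grouped_median(classes)
print(round(median, 2))
print(round(empirical_mode(35.1, round(median, 2)), 2))
```

The loop stops at the first class interval whose cumulative frequency reaches N/2, which is how the median class interval is identified.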
5.4.2 Measures of dispersion: Range, Deviations
Let us take the following
example of attendance of a class for 2 different weeks in a month.
First week : 45,44,41,10,40,60 : Mean (average) = 40
Second week: 35,45,40,45,40,35: Mean (average) = 40
In both cases, the average attendance is 40. But we also observe the
following:
1. The first week registered a very low attendance of 10 and a high
attendance of 60, both far from the average (large deviations, or
dispersion), whereas
2. In the second week, the deviations from the average are small. In simple
terms, we can say that attendance is consistent in the second week.
Thus we conclude that the average alone may not give a correct picture.
Therefore we need other measures to arrive at meaningful conclusions.
We introduce the following concepts:
The difference between the two extreme scores of a distribution is called
the ‘Range’.
Range = Highest Score - Lowest Score = H-L
Co-efficient of Range = $\frac{H - L}{H + L}$
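As an illustration, the two weeks of attendance above can be compared with a short Python sketch. The helper names `value_range` and `coefficient_of_range` are made up for this example, and the co-efficient is taken as (H-L)/(H+L).

```python
def value_range(scores):
    """Range = highest score - lowest score."""
    return max(scores) - min(scores)


def coefficient_of_range(scores):
    """Co-efficient of Range = (H - L) / (H + L)."""
    high, low = max(scores), min(scores)
    return (high - low) / (high + low)


first_week = [45, 44, 41, 10, 40, 60]
second_week = [35, 45, 40, 45, 40, 35]

print(value_range(first_week))   # 50
print(value_range(second_week))  # 10
```

Both weeks have the same mean of 40, but the range separates them clearly: 50 in the first week against 10 in the second.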
We have learnt that the median is the score that divides the distribution of
scores into two equal parts. Similarly, Quartiles are the scores that divide
the distribution into four equal parts. There are three of them:
1st Quartile (Q1), 2nd Quartile (Q2) and 3rd Quartile (Q3). They are the
scores at 1/4th, 1/2 and 3/4th of the distribution of scores.
We note that the 2nd Quartile is the Median itself.
Quartile deviation (Semi-interquartile range) is calculated as
QD = (Q3-Q1)/2
5.4.2 Example 1: Calculate the Range, Co-efficient of Range, Quartile
deviation and Co-efficient of Quartile deviation for the scores
16,40,23,25,29,24,20,30,32,34,43
Working:
By arranging the scores in ascending order, we get
16,20,23,24,25,29,30,32,34,40,43.
Note that L = 16, H = 43 and N = 11
Therefore
Range = H-L = 43-16 = 27
Co-efficient of Range = $\frac{H-L}{H+L} = \frac{27}{59}$ = 0.46
Since there are 11 elements
- for Q1 the score to be considered is the 3rd (1/4th of 11 = 2.75, rounded
to 3) score = 23.
- for Q3 the score to be considered is the 8th (3/4th of 11 = 8.25, rounded
to 8) score = 32.
QD = (Q3-Q1)/2
= $\frac{32-23}{2}$ = 4.5
Co-efficient of QD = (Q3-Q1)/(Q3+Q1)
= $\frac{32-23}{32+23} = \frac{9}{55}$ = 0.16
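A rough Python sketch of this working is given below. It assumes that "the score to be considered" is found by rounding 1/4th or 3/4th of N to the nearest whole position in the sorted list; the helper name `score_at_fraction` is hypothetical. Other textbooks use slightly different position rules, so treat this as one possible reading.

```python
def score_at_fraction(sorted_scores, fraction):
    """Pick the score at the rounded position fraction * N (1-based)."""
    n = len(sorted_scores)
    # Assumption: round to the nearest position. Python rounds exact halves
    # to the nearest even number, which does not arise for N = 11.
    position = min(max(round(fraction * n), 1), n)
    return sorted_scores[position - 1]


scores = sorted([16, 40, 23, 25, 29, 24, 20, 30, 32, 34, 43])
print(scores)  # all 11 scores in ascending order

low, high = scores[0], scores[-1]
value_range = high - low                   # H - L
coeff_range = (high - low) / (high + low)  # (H - L) / (H + L)

q1 = score_at_fraction(scores, 0.25)  # 2.75 -> 3rd score
q3 = score_at_fraction(scores, 0.75)  # 8.25 -> 8th score
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)

print(value_range, round(coeff_range, 2), q1, q3, qd, round(coeff_qd, 2))
```

Sorting in code also acts as a check that no score is dropped while arranging the list by hand.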
For grouped data, we have seen earlier that if
N = Total number of scores,
i = Size of the class interval,
L = Lower limit of the Median class interval,
F = Cumulative frequency (cf) up to (but not including) the median class
interval and
f = frequency of the median class interval (denoted m in 5.4.1)
Then
Median = $L + \left(\frac{N/2 - F}{f}\right) \times i$ = Q2
Similarly, for grouped data we calculate
Q1 = $L + \left(\frac{N/4 - F}{f}\right) \times i$
Q3 = $L + \left(\frac{3N/4 - F}{f}\right) \times i$
Where
L = Lower limit of the respective Quartile class interval
F = Cumulative frequency (cf) up to (but not including) the respective
Quartile class interval
f = frequency of the respective Quartile class interval
5.4.2 Example 2: Calculate the Range, Co-efficient of Range, Quartile
deviation and Co-efficient of Quartile deviation for the grouped data of 100
scores
CI | f |
4-8 | 6 |
9-13 | 10 |
14-18 | 18 |
19-23 | 20 |
24-28 | 15 |
29-33 | 15 |
34-38 | 9 |
39-43 | 7 |
Working:
Here we have N = 100, i = 5
and let us calculate cumulative frequency as follows:
CI | f | cf |
4-8 | 6 | 6 |
9-13 | 10 | 16 |
14-18 | 18 | 34 |
19-23 | 20 | 54 |
24-28 | 15 | 69 |
29-33 | 15 | 84 |
34-38 | 9 | 93 |
39-43 | 7 | 100 |
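For Q1 we need to find the 25th (1/4th of 100) element, which lies in the
class interval ‘14-18’. Since consecutive class intervals here are separated
by a gap of 1 (for example 13 and 14), the lower limit L is taken at the class
boundary, half-way across the gap.
L = 13.5, F = 16, f = 18
Q1 = $L + \left(\frac{N/4 - F}{f}\right) \times i$
= $13.5 + \frac{25-16}{18} \times 5$ = 13.5+2.5 = 16
For Q3 we need to find the 75th (3/4th of 100) element, which lies in the
class interval ‘29-33’
L = 28.5, F = 69, f = 15
Q3 = $L + \left(\frac{3N/4 - F}{f}\right) \times i$
= $28.5 + \frac{75-69}{15} \times 5$ = 28.5+2 = 30.5
QD = (Q3-Q1)/2
= $\frac{30.5-16}{2}$ = 7.25
Co-efficient of QD = (Q3-Q1)/(Q3+Q1)
= $\frac{30.5-16}{30.5+16} = \frac{14.5}{46.5}$ = 0.31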
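The quartile formulas for grouped data can be sketched in Python as below. This sketch assumes that L is the lower class boundary (for example 13.5 for the class interval ‘14-18’, half-way across the gap of 1 between 13 and 14) and that the class size is 5; the names `grouped_quartile` and `gap` are illustrative only.

```python
# (lower limit, upper limit, frequency); limits are inclusive, e.g. 4-8 then 9-13
classes = [(4, 8, 6), (9, 13, 10), (14, 18, 18), (19, 23, 20),
           (24, 28, 15), (29, 33, 15), (34, 38, 9), (39, 43, 7)]


def grouped_quartile(classes, fraction, gap=1):
    """L + ((fraction * N - F) / f) * i for the class holding that position."""
    n = sum(freq for _, _, freq in classes)
    target = fraction * n
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            boundary = lower - gap / 2    # L: lower class boundary (assumption)
            size = (upper - lower) + gap  # i: class size, 5 in this example
            return boundary + (target - cumulative) / freq * size
        cumulative += freq
    raise ValueError("no scores")


q1 = grouped_quartile(classes, 0.25)
q3 = grouped_quartile(classes, 0.75)
print((q3 - q1) / 2)                    # 7.25
print(round((q3 - q1) / (q3 + q1), 2))  # 0.31
```

Calling `grouped_quartile(classes, 0.5)` gives the median (Q2) from the same function.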
5.4.3 Mean Deviation for Ungrouped data:
As the name suggests, here we calculate the average of the deviations of the
scores from a central value.
Note: Mean Deviation can be found in two ways - using the Median method
(deviations from the median) or using the Mean method (deviations from the
mean).
5.4.3 Example 1. Calculate the mean deviation for the scores
given below, by BOTH methods.
90,125,115,100,110.
Working:
By rearranging the scores
in increasing order we get
90,100,110,115,125
Here we have N = 5, $\sum X$ = 90+100+110+115+125 = 540
The median (M) = 110 (3rd term)
The mean ($\bar{X}$) of the scores is $\frac{\sum X}{N} = \frac{540}{5}$ = 108
Scores (X) | I Method: Deviation from Median, D = X-M | II Method: Deviation from Mean, D = X-$\bar{X}$ |
90 | -20 (90-110) | -18 (90-108) |
100 | -10 (100-110) | -8 (100-108) |
110 | 0 (110-110) | 2 (110-108) |
115 | 5 (115-110) | 7 (115-108) |
125 | 15 (125-110) | 17 (125-108) |
$\sum X$ = 540 | $\sum \lvert D \rvert$ = 20+10+0+5+15 = 50 | $\sum \lvert D \rvert$ = 18+8+2+7+17 = 52 |
In the above calculation, |D| is the absolute value of D (we always take the
value of D as positive).
By Median method, Mean deviation = $\frac{\sum |D|}{N} = \frac{50}{5}$ = 10
By Mean method, Mean deviation = $\frac{\sum |D|}{N} = \frac{52}{5}$ = 10.4
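Both methods can be checked with a small Python sketch that uses the standard library `statistics` module. The function name `mean_deviation` is chosen only for this illustration.

```python
from statistics import mean, median


def mean_deviation(scores, center):
    """Average of the absolute deviations |X - center|."""
    return sum(abs(x - center) for x in scores) / len(scores)


scores = [90, 125, 115, 100, 110]
print(mean_deviation(scores, median(scores)))  # 10.0 (Median method)
print(mean_deviation(scores, mean(scores)))    # 10.4 (Mean method)
```

The only difference between the two methods is the central value passed as `center`.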
5.4.4 Mean Deviation for Grouped data:
Note: As in the
case of ungrouped data, Mean Deviation can be found in two ways (Using Median
method and Mean method)
5.4.4 Example 1.
Compute Mean Deviation of
C.I | f |
0-20 | 8 |
20-40 | 10 |
40-60 | 19 |
60-80 | 14 |
80-100 | 9 |
Workings:
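Here we have N = 60 and i = 20 (upper limit - lower limit of a class
interval)
Median (M) = $L + \left(\frac{N/2 - F}{f}\right) \times i$
= $40 + \frac{30-18}{19} \times 20$ = 40+12.63 = 52.63 (using the values from the table below)
Mean ($\bar{X}$) = $\frac{\sum fx}{N} = \frac{3120}{60}$ = 52 (using the values from the table below)
In the table, the columns cf, D = x-M and the first f*|D| belong to the
I Method (deviation from Median); the columns fx, D = x-$\bar{X}$ and the
second f*|D| belong to the II Method (deviation from Mean).
C.I | Mid Point (x) | f | cf | D = x-M | f*|D| | fx | D = x-$\bar{X}$ | f*|D| |
0-20 | 10 | 8 | 8 | -42.63 | 341.04 | 80 | -42 | 336 |
20-40 | 30 | 10 | 18 | -22.63 | 226.30 | 300 | -22 | 220 |
40-60 | 50 | 19 | 37 | -2.63 | 49.97 | 950 | -2 | 38 |
60-80 | 70 | 14 | 51 | 17.37 | 243.18 | 980 | 18 | 252 |
80-100 | 90 | 9 | 60 | 37.37 | 336.33 | 810 | 38 | 342 |
Total | | N=60 | | | $\sum f \lvert D \rvert$ = 1196.82 | $\sum fx$ = 3120 | | $\sum f \lvert D \rvert$ = 1188 |
By Median method, Mean Deviation = $\frac{\sum f|D|}{N} = \frac{1196.82}{60}$ = 19.95
By Mean method, Mean Deviation = $\frac{\sum f|D|}{N} = \frac{1188}{60}$ = 19.8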
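The Mean method for grouped data can be sketched in Python as follows. The names `grouped_mean` and `grouped_mean_deviation` are illustrative; the mid point of each class interval stands in for the scores inside it.

```python
# (lower limit, upper limit, frequency) for each class interval of the example
classes = [(0, 20, 8), (20, 40, 10), (40, 60, 19), (60, 80, 14), (80, 100, 9)]


def grouped_mean(classes):
    """Sum of f * mid point, divided by N."""
    n = sum(freq for _, _, freq in classes)
    return sum(freq * (lower + upper) / 2 for lower, upper, freq in classes) / n


def grouped_mean_deviation(classes, center):
    """Sum of f * |mid point - center|, divided by N."""
    n = sum(freq for _, _, freq in classes)
    total = sum(freq * abs((lower + upper) / 2 - center)
                for lower, upper, freq in classes)
    return total / n


center = grouped_mean(classes)
print(center)                                   # 52.0
print(grouped_mean_deviation(classes, center))  # 19.8
```

Passing the median of the distribution as `center` instead of the mean gives the Median method with the same function.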
5.4.5. Graphical representation of frequency distribution
In earlier sessions we have seen that a graphical representation of data is
easy to understand and interpret. Two important types of representation are
the histogram and the frequency polygon.
Histogram:
Here we represent the distribution as vertical rectangles drawn side by side.
The class intervals are represented on the x-axis and the frequencies on the
y-axis; the height of each rectangle is proportional to the frequency.
We need a graph sheet for this type of representation. Class intervals (CI)
are marked as the bases of the rectangles on the x-axis. Frequencies are
marked as the heights of the rectangles on the y-axis.
5.4.5 Example 1.
Draw histogram and frequency polygon for
C.I | f |
0-20 | 8 |
20-40 | 10 |
40-60 | 19 |
60-80 | 14 |
80-100 | 9 |
Working:
Use a suitable scale for representing the class intervals and frequencies
(in this case let 1 C.I = 1cm and 2f = 1cm).
Histogram:
Step 1: Take a graph sheet. Mark 0 and draw the x-axis and y-axis.
Step 2: On the x-axis, mark the class intervals adjacent to each other from
0. Use 1cm as the width of each class interval. (Thus the scale for C.I. is
1 C.I. = 1cm)
Step 3: Convert the frequencies to a suitable unit so that the graph fits
into one page easily. In this example use the scale 1cm = 2f. Therefore we
have: 8f = 4cm, 10f = 5cm, 19f = 9.5cm, 14f = 7cm and 9f = 4.5cm. (Thus the
scale for frequency is 2f = 1cm)
Step 4: Draw a rectangle of height 4cm representing the first CI (0-20).
Step 5: Draw a rectangle of height 5cm representing the next CI (20-40), next
to the previous one, so that these two vertical bars have a common side. Draw
the remaining rectangles for the other class intervals.
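The scale conversion in these steps can be sketched in Python. The sketch only computes where each rectangle goes on the graph sheet (it does not draw anything); the names `histogram_bars`, `freq_per_cm` and `width_cm` are made up for this illustration.

```python
def histogram_bars(classes, freq_per_cm=2, width_cm=1):
    """Return (label, left edge in cm, width in cm, height in cm) per bar."""
    bars = []
    left = 0  # the first class interval starts at 0 on the x-axis
    for label, freq in classes:
        bars.append((label, left, width_cm, freq / freq_per_cm))
        left += width_cm  # the next bar shares a side with this one
    return bars


classes = [("0-20", 8), ("20-40", 10), ("40-60", 19), ("60-80", 14), ("80-100", 9)]
for bar in histogram_bars(classes):
    print(bar)

heights = [height for _, _, _, height in histogram_bars(classes)]
print(heights)  # [4.0, 5.0, 9.5, 7.0, 4.5]
```

Because each bar starts where the previous one ends, the rectangles come out side by side with no gaps.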
Observations:
1. Class intervals are represented on the x-axis and frequencies on the
y-axis.
2. The scales chosen for the two axes need not be the same.
3. Since the sizes of the class intervals are the same, the widths of the
rectangles are also the same.
4. Since there are no gaps between the class intervals, the rectangles are
contiguous (no space in between them).
5. The height of each rectangle is proportional to the frequency of its C.I.
Note: If there are breaks in the class intervals (usually at the beginning),
a zig-zag curve is drawn between the class intervals to show the break.
Frequency Polygon (Method I):
When the mid points of the tops of adjacent rectangles are joined by straight
lines, the figure so obtained is called a ‘frequency polygon’.
Step 1: Draw the histogram as above.
Step 2: Mark a non-existing class interval (since f = 0, height = 0cm) at
each of the two extreme ends (i.e. (-20)-0 on the left side and 100-120 on
the right side).
Step 3: Identify the middle point of the top of each class interval bar (at
-0.5, 0.5, 1.5, 2.5, 3.5, 4.5 and 5.5cm on the x-axis, with y being 0, 4, 5,
9.5, 7, 4.5 and 0cm respectively).
Step 4: Join consecutive mid points by straight lines to get the required
polygon.
Frequency Polygon (Method II):
Step 1: Mark non-existing class intervals, one at each of the two extreme
ends (i.e. (-20)-0 on the left side and 100-120 on the right side).
Step 2: Identify the middle point of each class interval as per the scale
used (in this example 1 C.I. = 1cm). These points are -0.5, 0.5, 1.5, 2.5,
3.5, 4.5 and 5.5 on the x-axis.
Step 3: Identify the height of the frequency for each class interval as per
the scale used (2f = 1cm). These points are 0, 4, 5, 9.5, 7, 4.5 and 0 on the
y-axis.
Step 4: Plot and join these points.
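The points of the frequency polygon can be listed with a short Python sketch that follows Method II. The function name `polygon_points` and its parameters are hypothetical; the scale is 1 C.I. = 1cm and 2f = 1cm as in the example.

```python
def polygon_points(freqs, freq_per_cm=2, width_cm=1):
    """(x, y) positions in cm of the frequency polygon vertices."""
    # Add the two non-existing class intervals (f = 0) at the extreme ends.
    padded = [0] + list(freqs) + [0]
    points = []
    for k, freq in enumerate(padded):
        x = (k - 1) * width_cm + width_cm / 2  # mid point of the k-th interval
        points.append((x, freq / freq_per_cm))
    return points


points = polygon_points([8, 10, 19, 14, 9])
print([x for x, _ in points])  # [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
print([y for _, y in points])  # [0.0, 4.0, 5.0, 9.5, 7.0, 4.5, 0.0]
```

Joining these points in order by straight lines gives the polygon; the two zero-height end points bring it down to the x-axis.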
Note: If the mid points of the class intervals are very close, then we get a
frequency curve by joining these points by a smooth curve rather than by
straight lines.
5.4.5 Cumulative Frequency Curve (Ogive):
In this type of graph we
plot the points corresponding to cumulative frequency for the given data
(Ungrouped or grouped) and join the points by a smooth curve.
The given data (actual
score or Upper class limit in case of grouped data) is marked along the x-axis.
Cumulative frequency is marked along the y-axis.
Let us again consider the same example we have taken in 5.4.5
Example 1.
5.4.5 Example 2.
Draw Ogive for
C.I | f |
0-20 | 8 |
20-40 | 10 |
40-60 | 19 |
60-80 | 14 |
80-100 | 9 |
Working:
1. First arrive at an ‘imaginary’ class interval with 0 frequency (in this
case -20 to 0).
2. Prepare the cumulative frequency table as shown below, starting with the
imaginary class interval (-20 to 0).
3. Use a suitable scale for the x-axis for representing the upper class limit
(in this case let 1cm = 10 units of the upper class limit).
4. Use a suitable scale for the y-axis for representing the cumulative
frequency (in this case let 1cm = 10cf).
5. Plot the points corresponding to each upper class limit as shown in the
adjacent graph.
6. Join these points by a smooth curve (this curve is the Ogive).
C.I | Upper class limit | f | cf |
(-20)-0 | 0 | 0 | 0 |
0-20 | 20 | 8 | 8 |
20-40 | 40 | 10 | 18 |
40-60 | 60 | 19 | 37 |
60-80 | 80 | 14 | 51 |
80-100 | 100 | 9 | 60 |
From the cumulative frequency curve it is easy to read off the cumulative
frequency for any score, and hence the frequencies for different class
intervals.
(For example: from the above graph we can conclude that the cumulative
frequency for scores up to 30 is 13. This point is circled in red in the
graph.)
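The points of the Ogive, and the reading at a score of 30, can be sketched in Python. The sketch assumes straight-line interpolation between neighbouring points as an approximation to the smooth curve; the names `ogive_points` and `cumulative_at` are illustrative only.

```python
def ogive_points(upper_limits, freqs, start):
    """(upper class limit, cumulative frequency) points, beginning at (start, 0)."""
    points = [(start, 0)]  # upper limit of the imaginary class interval
    cumulative = 0
    for upper, freq in zip(upper_limits, freqs):
        cumulative += freq
        points.append((upper, cumulative))
    return points


def cumulative_at(points, score):
    """Read the curve at `score`, interpolating linearly between points."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= score <= x1:
            return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
    raise ValueError("score is outside the plotted range")


points = ogive_points([20, 40, 60, 80, 100], [8, 10, 19, 14, 9], start=0)
print(points)                     # [(0, 0), (20, 8), (40, 18), (60, 37), (80, 51), (100, 60)]
print(cumulative_at(points, 30))  # 13.0
```

A hand-drawn smooth curve may give a slightly different reading between the plotted points, since it does not follow straight segments exactly.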
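5.4 Summary of learning
No | Points to remember |
1 | Mean = $\frac{\sum f(x)}{N}$ (For grouped data) |
2 | Median = $L + \left(\frac{N/2 - F}{f}\right) \times i$ (For grouped data) |
3 | Mode = 3*median-2*mean (For grouped data) |
4 | Co-efficient of Range = $\frac{H - L}{H + L}$ (For ungrouped data) |
5 | Mean deviation = $\frac{\sum \lvert D \rvert}{N}$ (For ungrouped data) |
6 | Mean Deviation = $\frac{\sum f \lvert D \rvert}{N}$ (For grouped data) |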
Additional Points:
5.4.1 Assumed mean method for calculation of mean for grouped data
This method is very useful when the class intervals and their frequencies are
very large. In this method we assume one of the mid-points to be the mean and
find the deviations from that mid-point; hence this method is called the
‘assumed mean method’.
Let us take the example solved earlier (5.4.1 Example 1) to illustrate this
method.
Let A = 25 be the assumed mean (any score can be assumed to be the mean, but
we normally take a score from the middle part of the distribution).
The deviation D (D = Mid point - Assumed mean = x-A) is calculated for each
class interval.
Then
Average (mean) = $A + \frac{\sum fD}{N}$, where N is the number of scores
C-I | Frequency (f) | Mid Point (x) of CI | Deviation D = x-A | fD = f*D |
0-10 | 7 | 5 | -20 (=5-25) | -140 |
10-20 | 13 | 15 | -10 (=15-25) | -130 |
20-30 | 24 | 25 = A | 0 | 0 |
30-40 | 26 | 35 | 10 (=35-25) | 260 |
40-50 | 18 | 45 | 20 (=45-25) | 360 |
50-60 | 12 | 55 | 30 (=55-25) | 360 |
60-70 | 10 | 65 | 40 (=65-25) | 400 |
Total | N=110 | | | $\sum fD$ = 1110 |
Average (mean) = $A + \frac{\sum fD}{N}$ = 25+1110/110 = 25+10.09 = 35.09
This is the same value which we got earlier (35.09).
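The assumed mean method can be checked with a short Python sketch. The function name `assumed_mean_method` is chosen only for this illustration.

```python
# (lower limit, upper limit, frequency) for each class interval of the example
classes = [(0, 10, 7), (10, 20, 13), (20, 30, 24), (30, 40, 26),
           (40, 50, 18), (50, 60, 12), (60, 70, 10)]


def assumed_mean_method(classes, assumed):
    """Mean = A + (sum of f * D) / N, where D = mid point - A."""
    n = sum(freq for _, _, freq in classes)
    total_fd = sum(freq * ((lower + upper) / 2 - assumed)
                   for lower, upper, freq in classes)
    return assumed + total_fd / n


print(round(assumed_mean_method(classes, 25), 2))  # 35.09
print(round(assumed_mean_method(classes, 35), 2))  # 35.09
```

The second call shows that the choice of assumed mean does not change the result; it only changes the size of the numbers handled along the way.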