# The mean is affected by extremely small or large values in a data set


**Chapter 6. Univariate descriptive statistics**

1. Compute the mode, median, and mean for the following four sets of numbers:

a. 2, 7, 6, 5, 3, 8, 6, 4, 9, 7

b. 6, 2, 2, 5, 4, 2, 3, 4, 5, 6, 3

c. 2, 5, 8, 2, 8, 4, 2, 8, 1, 9, 9

d. 3, 5, 4, 8, 6, 9, 4, 43, 7, 2

To get the median, put all of the numbers in a set in increasing numerical order. Count how many numbers there are in each set. This is shown here:

a. 10 numbers: 2, 3, 4, 5, 6, 6, 7, 7, 8, 9

b. 11 numbers: 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6

c. 11 numbers: 1, 2, 2, 2, 4, 5, 8, 8, 8, 9, 9

d. 10 numbers: 2, 3, 4, 4, 5, 6, 7, 8, 9, 43

Now if there is an odd number of items in a set, add 1 to how many there are. Divide the result by 2. Now count this many items up from the bottom. The one you land on is the median. So where there are 11 items in a set, as in sets b. and c., adding 1 gives 12. Then dividing by 2 gives 6. Then count up from the bottom of the list (or down from the top of the list) and the median will be the 6th one.

The median for b. is 4. The median for c. is 5.

If the number of items in a set is even, divide it by 2. Then count this far up from the bottom of the list. The median will be halfway between the value you landed on and the next value.

For the numbers in set a. and set d., there are 10 items in the list. And 10 divided by 2 is 5, so the median is halfway between the 5th and 6th items, counting up from the bottom of the list.

For set a. the median is halfway between 6 and 6, which is 6.

For set d. the median is halfway between 5 and 6, which is 5.5.

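The results are shown in the table below.

| Set | Mode | Median | Mean |
|---|---|---|---|
| a | 6 and 7 | 6 | 5.7 |
| b | 2 | 4 | 3.82 |
| c | 2 and 8 | 5 | 5.27 |
| d | 4 | 5.5 | 9.1 |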
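The counting rules above can be sketched in Python. This is only an illustration: the helper names `median`, `modes`, and `mean` are chosen here, not taken from the chapter, and a set with tied most-frequent values is reported as having more than one mode.

```python
from collections import Counter

def median(values):
    """Median by the counting rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd count: the ((n + 1) / 2)-th item, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even count: halfway between the (n / 2)-th item and the next one.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def modes(values):
    """All values tied for the highest frequency, in increasing order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

sets = {
    "a": [2, 7, 6, 5, 3, 8, 6, 4, 9, 7],
    "b": [6, 2, 2, 5, 4, 2, 3, 4, 5, 6, 3],
    "c": [2, 5, 8, 2, 8, 4, 2, 8, 1, 9, 9],
    "d": [3, 5, 4, 8, 6, 9, 4, 43, 7, 2],
}

for name, data in sets.items():
    print(name, modes(data), median(data), round(mean(data), 2))
```

In set d the single value 43 pulls the mean well above the median, while the median and mode are unaffected by it.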

**Use this set of numbers for the following questions:**

4, 3, 5, 4, 1, 2, 5, 4, 3, 4, 1, 2, 4, 3, 5, 2, 3, 5, 7, 6, 4, 1, 2, 4

2. Assume the numbers in the data are the answers you get when you ask people "How many magazines do you subscribe to?" What are the proper measures of central tendency and dispersion for this data? Calculate their values.

Because the data would be counts of the number of magazines, it would be ratio scaled data. The strongest measures of central tendency and dispersion for ratio level data are the mean and standard deviation.
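The values themselves can be computed with a short Python sketch. It assumes the sample form of the standard deviation (dividing by n-1, as in Question 9) and uses the standard-library `statistics` module; the variable names are illustrative.

```python
import statistics

# Answers to "How many magazines do you subscribe to?" (the data set above).
magazines = [4, 3, 5, 4, 1, 2, 5, 4, 3, 4, 1, 2,
             4, 3, 5, 2, 3, 5, 7, 6, 4, 1, 2, 4]

mean = statistics.mean(magazines)   # sum of the answers divided by n
sd = statistics.stdev(magazines)    # sample standard deviation (n - 1 divisor)

print(f"n = {len(magazines)}, mean = {mean}, sd = {sd:.2f}")
```

The answers sum to 84 over 24 people, so the mean is 3.5; the sum of squared deviations is 58, which gives a standard deviation of about 1.59.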

3. Assume the numbers in the data are the answers you get when you ask people "Name your favorite television program." Then you classify each program according to its thematic content. You use a system that has seven different classes (e.g., 1=science fiction, 2=comedy, 3=romance, 4=adventure, 5=news, ...). The numbers in the data indicate which category their favorite programs fall into. What are the proper measures of central tendency and dispersion for this data?

Since the numbers stand for categories or types of television programs, the data is not likely to be scaled at the interval or ratio level. Since there is no apparent relation of order between the categories, the data isn't even ordinal. This only leaves the nominal level of scaling in which the numbers are merely stand-ins for names of categories. With this data, the only measure that can be used for central tendency is the mode. The only measure of dispersion that can be used for this data is the information theoretic uncertainty measure.
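As a rough illustration, the sketch below finds the mode and one possible uncertainty measure for the category codes. It assumes that "information theoretic uncertainty measure" refers to Shannon entropy, H = -Σ p log2 p, computed from the proportion p of answers in each category; the chapter may define the measure differently.

```python
import math
from collections import Counter

# Category codes for each person's favorite program (the data set above).
answers = [4, 3, 5, 4, 1, 2, 5, 4, 3, 4, 1, 2,
           4, 3, 5, 2, 3, 5, 7, 6, 4, 1, 2, 4]

counts = Counter(answers)
n = len(answers)

# Mode: the category chosen most often.
mode_value, mode_count = counts.most_common(1)[0]

# Shannon entropy in bits (assumed form of the uncertainty measure).
entropy_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())

print(f"mode = {mode_value} (chosen {mode_count} times)")
print(f"uncertainty = {entropy_bits:.3f} bits, maximum for 7 classes = {math.log2(7):.3f}")
```

Under this assumption the uncertainty is 0 when everyone names the same category and reaches log2(7) when all seven categories are equally popular.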

4. Assume the numbers in the data are the answers you get when you ask people "What is your household's annual income? I'm going to read a list of possible ranges, and I want you to stop me when I read the range that describes your household's income." You then read the following list and record their answers:

1) below $10,000

2) between $10,000 and $15,000

3) between $15,000 and $20,000

4) between $20,000 and $30,000

5) between $30,000 and $45,000

6) between $45,000 and $60,000

7) above $60,000

What are the proper measures of central tendency and dispersion for this data? Calculate their values.

Because the numbers indicate which category individuals fall into, and because the categories are ordered from the lowest to the highest income levels, this data would be at least ordinal scaled. Since the categories are not all the same size, it is not scaled at the interval or ratio level. Finally, since "0" does not mean "no income at all," and since "2" does not mean "twice as much income" as "1", the data can't be ratio scaled. For ordinal data you can use both the median and the mode for central tendency. You can also use both the Inter Quartile Range (IQR) and the information theoretic uncertainty measure for dispersion. Since the median and IQR are stronger measures than the mode and the information theoretic uncertainty measure, they are the ones you should use.

To calculate the median, sort the numbers so they are in increasing numerical order from smallest to highest:

1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7

Count the number in the list. There are 24.

If there are an even number of items on the list, divide the number by 2. In this example, you get "12." Count this many items up from the bottom of the list. In this example, the 12th item is the first "4". The median will be halfway between the value of this item and the value of the next item. In this example, both of those numbers are 4, so the median will also be 4.

1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4 | 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7

If there are an odd number of items on the list (for example, 25), add 1 to how many there are. Divide the result by 2. Now count this many items up from the bottom. The one you land on is the median.

To calculate the IQR, repeat the procedure again to find the median of each half of the list you used to calculate the median. The median of the first half of the original list is the first quartile of the list. The median of the second half is the third quartile. The Inter Quartile Range is the difference between the first and third quartiles.

1, 1, 1, 2, 2, 2 | 2, 3, 3, 3, 3, 4 | 4, 4, 4, 4, 4, 4 | 5, 5, 5, 5, 6, 7
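The half-splitting procedure can be sketched in Python as follows. The function names are illustrative, and the handling of odd-length lists (leaving the middle item out of both halves) is an assumption, since the text only works the even case.

```python
def median(ordered):
    """Median of an already sorted list, by the counting rule above."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median of each half of the list."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # first half
    upper = ordered[(n + 1) // 2:]     # second half (middle item skipped if n is odd)
    return median(lower), median(ordered), median(upper)

income_codes = [4, 3, 5, 4, 1, 2, 5, 4, 3, 4, 1, 2,
                4, 3, 5, 2, 3, 5, 7, 6, 4, 1, 2, 4]

q1, med, q3 = quartiles(income_codes)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

For the 24 answers, the first quartile falls between two 2s (so it is 2), the third quartile falls halfway between 4 and 5 (so it is 4.5), and the IQR is 2.5.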

5. Below are the final exam scores in percentages for students in a course on postmodernist approaches to analysis of individual differences in skiing preferences.

Males:

64.10, 76.56, 95.31, 75.00, 75.00, 53.13, 64.06, 46.88, 98.44, 78.13, 85.94, 93.75, 67.19, 87.50, 92.19, 71.88, 76.56, 70.31, 78.13, 71.88, 93.75, 32.81, 79.69, 50.00, 71.88, 57.81, 60.94, 59.38, 68.75, 75.00

Females:

69.74, 85.53, 88.16, 92.11, 76.32, 77.63, 61.84, 52.63, 96.05, 84.21, 88.16, 80.26, 61.84, 75.00, 90.79, 72.37, 73.68, 82.89, 68.42, 96.05, 88.16, 64.47, 90.79, 63.16, 80.26, 46.05, 64.47, 78.95, 76.32, 63.20

a. Which of the measures of central tendency are the most and least appropriate for this data?

The mean and median are both appropriate for this data. Since the numbers are percentages, the data is scaled at the ratio level. The mode is almost useless because these are continuous variables and there is an infinite number of different possible values. Also, the mode is the weakest of the three measures of central tendency; it doesn't take advantage of the actual size of the percentages or the relative sizes of the various numbers in the data.

b. Which tell you more about the relative performance of males and females on the exam?

The median and the mean both tell you more than the mode. For one thing, the data for the males is bimodal -- there are two modes. Second, the mode is a "central" value only in the sense that it is the one that occurred more often than any other value. It is not central in the sense that it is the middle value rather than one of the extremely high or low ones. Finally, the mean and median both contain more of the information in the data than does the mode. Since they contain more information, they tell you more about the two groups of numbers.

c. Discuss the benefits and drawbacks of each measure of central tendency for this data.

The mode is not influenced by extreme values.

The mode is sensitive only to the most frequently occurring score; it is insensitive to all other scores.

The mode is of little value for non-categorical (e.g., continuous) data; it is used almost exclusively for discrete variables.

The median can be used for discrete or continuous variables.

The median is not influenced by extreme values.

The median is sensitive only to the value of the middle point or points; it is not sensitive to the values of all other points.

The mean requires interval or ratio data.

The mean is the preferred measure for interval or ratio data.

The mean is generally not used for discrete variables.

The mean is sensitive to all scores in a sample (every number in the data affects the mean), which makes it a more "powerful" measure than the median or mode.

The mean's sensitivity to all scores also makes it sensitive to extreme values, which is why the median is used when there are extreme values.

d. Compute the range, interquartile range, and standard deviation.

See table below
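The three measures can be computed along the following lines. This is a sketch under stated assumptions: the scores are transcribed from the listing in Question 5, the quartiles are taken as the medians of the lower and upper halves (the procedure from Question 4), and the standard deviation uses the n-1 divisor. Other quartile conventions give slightly different IQR values.

```python
import statistics

males = [64.10, 76.56, 95.31, 75.00, 75.00, 53.13, 64.06, 46.88, 98.44, 78.13,
         85.94, 93.75, 67.19, 87.50, 92.19, 71.88, 76.56, 70.31, 78.13, 71.88,
         93.75, 32.81, 79.69, 50.00, 71.88, 57.81, 60.94, 59.38, 68.75, 75.00]
females = [69.74, 85.53, 88.16, 92.11, 76.32, 77.63, 61.84, 52.63, 96.05, 84.21,
           88.16, 80.26, 61.84, 75.00, 90.79, 72.37, 73.68, 82.89, 68.42, 96.05,
           88.16, 64.47, 90.79, 63.16, 80.26, 46.05, 64.47, 78.95, 76.32, 63.20]

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def dispersion(scores):
    """Range, IQR (median-of-halves rule), and sample standard deviation."""
    ordered = sorted(scores)
    n = len(ordered)
    lower, upper = ordered[: n // 2], ordered[(n + 1) // 2:]
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": median(upper) - median(lower),
        "sd": statistics.stdev(ordered),
    }

results = {"males": dispersion(males), "females": dispersion(females)}
for group, measures in results.items():
    print(group, {name: round(value, 2) for name, value in measures.items()})
```

The range for the males is driven by just two scores, 98.44 and 32.81, which illustrates the point made in part e about its dependence on the most unusual values.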

e. Discuss the benefits and drawbacks of each measure of dispersion for this data.

The range is the difference between the highest and lowest values. Because of this dependence on the two most unusual values, the range doesn't tell much about the data. For example, it tells nothing about how far from the center typical values lie.

The interquartile range (IQR) is the difference between the first and third quartiles in the data. If you remove the top 25% and the bottom 25% of all cases and then calculate the range of the remaining cases, you will get the IQR. While the IQR is more valuable than the range because it is not influenced as much by extreme values, it is more difficult to calculate, as it requires the data points to be rank ordered.

Neither the range nor the interquartile range takes all of the values in your data into account. The range is determined by the two most extreme values and the IQR is determined by the lowest and highest values in the middle 50% of your data.

The deviation score for an individual is the difference between the individual's score and the mean. The standard deviation, the square root of the variance, is the square root of the mean of the squared deviation scores. Roughly speaking, the standard deviation tells you how far away from the mean the typical person's score is. The standard deviation is the most commonly used measure of dispersion for interval or ratio level data. Like the variance and the mean, the standard deviation is sensitive to all scores.

6. Use the table of random numbers (Table 7 in Appendix B) for this question. Use the last two digits of the 5-digit numbers. Starting at the top of the second column, scan down and mark the numbers that are between 10 and 29, including 10 and 29. Do this until you get a total of 15 numbers. Write these 15 two-digit numbers on a piece of paper. Calculate the median, the mean, and the standard deviation for these numbers. Use the computational equation for standard deviation.

7. Analyze all four sets of numbers in Question 1 in terms of which of the measures of central tendency are the most and least appropriate. For each set of numbers, discuss the benefits and drawbacks of each measure of central tendency.

8. On a mid-term exam, the median score is 73 and the mean is 79. Which student's score is likely to be further away from the median: the one at the top of the class or the one at the bottom? Why?

The one at the top is likely to be further away from the median, because the mean has been distorted (upwards) by an extreme score. You know the mean has been distorted upwards because the mean is higher than the median. Since half of the scores are above the median and half are below the median, and since the mean is higher than the median, there must be some scores that are a fairly long distance above the median. These extremely high scores distort the mean -- they pull it up above the median.
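A small made-up example shows the pattern. The seven scores below are hypothetical, chosen only so that the median is 73 and the mean is 79 as in the question; they are not data from the chapter.

```python
import statistics

# Hypothetical mid-term scores with median 73 and mean 79.
scores = [68, 70, 72, 73, 80, 92, 98]

med = statistics.median(scores)
avg = statistics.mean(scores)

top_gap = max(scores) - med      # distance from the top score to the median
bottom_gap = med - min(scores)   # distance from the bottom score to the median

print(f"median = {med}, mean = {avg}")
print(f"top score is {top_gap} above the median, bottom score is {bottom_gap} below")
```

Here the top score sits 25 points above the median while the bottom score is only 5 points below it, and those high scores are what pull the mean above the median.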

9. If the standard deviation of a sample is 5.3,

a. What is the variance?

Since the standard deviation is the square root of the variance, you square the standard deviation in order to get variance:

5.3 × 5.3 = 28.09

b. What is the sum of the squares?

The equation for standard deviation is this:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

The equation for sum of the squares is this:

$$SS = \sum (x - \bar{x})^2$$

You could write the equation for standard deviation like this:

$$s = \sqrt{\frac{SS}{n - 1}}$$

You can get rid of the square root by squaring both sides:

$$s^2 = \frac{SS}{n - 1}$$

And you can get rid of the division by (n-1) by multiplying both sides by that:

$$s^2 (n - 1) = SS$$

And now you see how to get the sum of squares from the standard deviation -- square it and multiply by (n-1):

5.3 × 5.3 = 28.09

28.09 × (n-1) = SS

Although you don't know what the sample size is, you do know that the sum of squares can be had by squaring the standard deviation and multiplying it by the sample size minus one.
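The relationship can be checked numerically. In the sketch below the function name `sum_of_squares_from_sd`, the small sample, and the sample size n = 20 are all made up for illustration; the question itself does not give n.

```python
import math
import statistics

def sum_of_squares_from_sd(s, n):
    """SS = s^2 * (n - 1), where s is the sample standard deviation."""
    return s ** 2 * (n - 1)

# Check the identity on a small made-up sample.
sample = [2, 4, 4, 5, 7, 9]
s = statistics.stdev(sample)
sample_mean = statistics.mean(sample)
direct_ss = sum((x - sample_mean) ** 2 for x in sample)
assert math.isclose(sum_of_squares_from_sd(s, len(sample)), direct_ss)

# With s = 5.3 and a hypothetical sample size of 20:
print(round(sum_of_squares_from_sd(5.3, 20), 2))
```

With the hypothetical n = 20, the sum of squares would be 28.09 × 19, or about 533.71.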

c. What is the root mean square?

The root mean square is another name for standard deviation. It is sometimes called the root mean square because it is the square root of the mean of the squared deviation scores. So, if the standard deviation is 5.3, so is the root mean square.

10. Compute the standard deviation, range, and interquartile range for the following data:

a. Remove the lowest score and repeat the calculations

Here are the results:

| Measure | Original data | Lowest score removed |
|---|---|---|
| sd | 8.753 | 6.468 |
| IQR | 11.83 | 10.395 |
| range | 28.15 | 17.15 |

b. Which of the three measures changed the most? Why?

The range changed the most because it is sensitive only to extreme values, one of which is the lowest score.

c. Which of the three measures changed the least? Why?

The IQR changed the least because it is the least sensitive to extreme values; it depends only on the middle 50% of the data.

11. Multiply each of the nine numbers in Question 10 a by a constant, say 0.4, and calculate the standard deviation. What is the effect on the standard deviation of multiplying the numbers by a constant? Try it with a different constant, say 1.3. What is the effect? What is the general pattern here?

See columns a, b, c in table below. The original numbers are in column a. The numbers multiplied by 0.4 and 1.3 are in columns b and c.

[Table columns: a, a × 0.4, a × 1.3, a - 50, a - 63.89]

You can see that multiplying all of the numbers in a set of data by a constant has an effect on the dispersion. If the constant is greater than 1.0, the multiplication spreads the original set of values out over a longer range, increasing the distance between values in the process. Since the values are more spread out, the dispersion will be higher. If the constant is less than 1.0, the multiplication squeezes the original set of values together over a smaller range, decreasing the distance between values in the process. Since the values are less spread out, the dispersion will be lower. This is apparent in all three measures of dispersion -- the range, the IQR, and the standard deviation.

When all the numbers in the set are multiplied by 0.4, the range changes from 28.15 to 11.26. If you multiply 28.15 by 0.4, you get 11.26. When you multiply the numbers by 1.3, the range is also increased by a factor of 1.3 from 28.15 to 36.595. So multiplying the numbers in a set of data by a constant multiplies the range by the same amount.

The standard deviation also changes in exactly the same way the range does. When you multiply the numbers by 0.4, the standard deviation decreases by a factor of 0.4 from 8.753 to 3.501. When you multiply the numbers by 1.3, the standard deviation increases by a factor of 1.3 from 8.753 to 11.378.

The IQR also changes in exactly the same way the other two measures do. When you multiply the numbers by 0.4, the IQR decreases by a factor of 0.4 from 11.83 to 4.732. When you multiply the numbers by 1.3, the IQR increases by a factor of 1.3 from 11.83 to 15.379.
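The pattern can be checked with a short sketch. The nine numbers below are hypothetical stand-ins, not the data from the question, and the IQR uses the median-of-halves rule from Question 4 with the middle item left out of both halves (an assumption for odd-length lists).

```python
import math
import statistics

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def spread(values):
    """Return (standard deviation, IQR, range) for a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    lower, upper = ordered[: n // 2], ordered[(n + 1) // 2:]
    return (statistics.stdev(ordered),
            median(upper) - median(lower),
            ordered[-1] - ordered[0])

# Hypothetical nine scores used only for illustration.
a = [48.0, 55.5, 59.0, 61.5, 64.0, 66.5, 70.0, 72.5, 78.0]

base = spread(a)
for c in (0.4, 1.3):
    scaled = spread([c * x for x in a])
    ratios = [after / before for after, before in zip(scaled, base)]
    print(f"constant {c}: ratios (sd, IQR, range) = {[round(r, 6) for r in ratios]}")
```

Each ratio should come out equal to the constant, whatever positive constant and data set are used.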

12. Subtract a constant, say 50.0, from each of the nine numbers in Question 10, and calculate the standard deviation. What is the effect on the standard deviation of subtracting a constant? Try it with a different constant, say 63.89. What is the effect? What is the general pattern here?

Both of these subtractions have no effect on any of the measures of dispersion. The reason for this is that adding or subtracting a constant from all of the values has no effect on the distance between values and thus no effect on the dispersion. They are neither spread further apart nor squeezed closer together. See columns a, d, e in table above.
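For the standard deviation, the same point can be seen in a short derivation. The symbols are chosen here for illustration: $x_i$ are the original values, $c$ is the constant, and $y_i = x_i - c$ are the shifted values. Subtracting $c$ from every value also subtracts $c$ from the mean, so every deviation score is unchanged:

$$\bar{y} = \bar{x} - c$$

$$y_i - \bar{y} = (x_i - c) - (\bar{x} - c) = x_i - \bar{x}$$

$$s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = s_x$$

The range and the IQR are differences between two values in the data, so the constant cancels in the same way.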

13. What is the nature of the sample data if s = 0 and n = 75?

If the standard deviation is zero, all 75 values must be exactly the same. There is no spread at all. The only time the standard deviation can be zero when the sample contains more than one member is when all of the deviation scores are zero. This can only happen if all of the values are the same as the mean.