How to Calculate Outliers: 10 Steps (With Pictures)


In statistics, an outlier is a data point that differs considerably from the other data in the sample. Outliers often alert statisticians to experimental abnormalities or measurement errors, and for that reason they may be discarded from the data set. Whether outliers are kept or discarded can significantly change the conclusions drawn from a study. Knowing how to calculate and evaluate outliers is therefore important for understanding statistical data properly.

Steps


Step 1. Learn how to recognize possible outliers

Before deciding whether to discard the outliers of a data set, we must first identify the possible outliers. Broadly speaking, outliers are data points that differ greatly from the trend expressed by the other values in the data set. In other words, they lie far from the other values. This is generally easy to spot in data tables and (especially) in graphs. If the data set is plotted on a graph, the outliers sit far away from the other points. If, for example, most of the points in a data set form a straight line, the outliers could not reasonably be interpreted as being part of that line.

Consider a data set that represents the temperatures of 12 different objects in a room. If 11 of the objects have temperatures near 70 degrees Fahrenheit (21 degrees Celsius), but the twelfth object, an oven, has a temperature of 300 degrees Fahrenheit (about 150 degrees Celsius), a quick look will tell you that the oven is probably an outlier.


Step 2. Order the data from least to greatest

The first step in calculating outliers is to find the median (middle) value of the data set. This is much easier if the values are ordered from least to greatest, so sort the data set before continuing.

Let's continue with the example above. The following data set represents the temperatures of various objects in a room: {71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69}. Ordered from smallest to largest, the set becomes: {69, 69, 70, 70, 70, 70, 71, 71, 71, 72, 73, 300}.
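
If you prefer to let a computer do the ordering, the step might look like the following minimal Python sketch. The variable names are only illustrative and are not part of the method itself.

```python
# Temperatures of the 12 objects in the room, in the order they were recorded.
temperatures = [71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69]

# sorted() returns a new list ordered from least to greatest
# and leaves the original list unchanged.
ordered = sorted(temperatures)
print(ordered)
```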


Step 3. Find the median of the data set

The median is the value that has half of the data above it and the other half below it; basically, it is the value "in the middle" of the data set. If the data set contains an odd number of values, the median is easy to find: it is the value with the same number of values above and below it. If there is an even number of values, there is no single middle value, so the two middle values must be averaged to find the median. Note that when calculating outliers, the median is usually assigned the variable Q2, since it lies between Q1 and Q3, the first and third quartiles, which we will define later.

  • Don't be confused if the data set has an even number of values. The average of the two middle values is often a number that does not appear in the data set itself; this is normal. If the two middle values are the same number, the average is simply that number, and this is also normal.
  • Our example has 12 values. The two middle values are the 6th and 7th, 70 and 71 respectively. Therefore, the median of our data set is the average of these two values: (70 + 71) / 2 = 70.5.
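
The odd and even cases described above can be sketched in Python roughly as follows. The function name `median` is chosen here for illustration, and the sketch assumes the data set is not empty.

```python
def median(values):
    """Return the middle value; average the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits exactly in the middle.
        return ordered[mid]
    # Even count: average the two values on either side of the midpoint.
    return (ordered[mid - 1] + ordered[mid]) / 2

temperatures = [71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69]
q2 = median(temperatures)
print(q2)  # 70.5
```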

Step 4. Find the first quartile

This value, to which we will assign the variable Q1, is the value below which 25 percent (a quarter) of the data lie. In other words, it is the middle value of the data that fall below the median. If there is an even number of values below the median, you must again average the two middle values to find Q1, just as you may have done to find the median itself.

  • In our example, 6 values lie above the median and 6 lie below it. To find the first quartile, we therefore average the two middle values of the six smallest values. The 3rd and 4th of the six smallest values are both 70, so their average is (70 + 70) / 2 = 70. Our value for Q1 is 70.

Step 5. Find the third quartile

This value, to which we will assign the variable Q3, is the value above which 25 percent of the data lie. The method for finding Q3 is almost identical to the one used for Q1, except that the values above the median are considered instead of those below it.

  • Continuing with our example, the two middle values of the six values above the median are 71 and 72. Averaging them gives (71 + 72) / 2 = 71.5. Our value for Q3 is 71.5.
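
Here is one possible Python sketch of the "median of each half" approach used for Q1 and Q3. The function name `quartiles` is illustrative. For a data set with an odd number of values, this sketch assumes the middle value is left out of both halves, which the steps above do not specify. Statistical software often uses other quartile conventions and may return slightly different numbers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median
    upper = ordered[len(ordered) - half:]   # values above the median
    # Assumption: with an odd count, the middle value belongs to neither half.
    return median(lower), median(ordered), median(upper)

temperatures = [71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69]
q1, q2, q3 = quartiles(temperatures)
print(q1, q2, q3)
```

The sketch needs at least two values, since a single value has no lower or upper half.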

Step 6. Find the interquartile range

Now that we have defined Q1 and Q3, we need to calculate the distance between them. This distance, the interquartile range, is found by subtracting Q1 from Q3. The interquartile range is the key to determining the limits for the non-outliers of the data set.

  • In our example, the values for Q1 and Q3 are 70 and 71.5, respectively. To find the interquartile range, we calculate Q3 - Q1: 71.5 - 70 = 1.5.
  • Note that this works even if Q1, Q3, or both are negative numbers. For example, if our value for Q1 were -70, the interquartile range would be 71.5 - (-70) = 141.5.
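
The subtraction, including the negative case, can be checked with a short Python sketch. The function name `interquartile_range` is only illustrative.

```python
def interquartile_range(q1, q3):
    """Return the distance from Q1 to Q3."""
    return q3 - q1

# Values from the temperature example.
iqr = interquartile_range(70, 71.5)
print(iqr)  # 1.5

# The same subtraction still works when Q1 is negative.
iqr_negative_q1 = interquartile_range(-70, 71.5)
print(iqr_negative_q1)  # 141.5
```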

Step 7. Find the "inner limits" of the data set

Outliers are identified by evaluating whether or not they fall within numerical limits called "inner limits" and "outer limits." A value that is outside the inner limits of the data set is called a mild outlier, and one that is outside the outer limits is called an extreme outlier. To find the inner limits of the data set, first multiply the interquartile range by 1.5. Then add the result to Q3 and subtract it from Q1. The two resulting values are the inner limits of the data set.

  • In our example, the interquartile range is (71.5 - 70), or 1.5. Multiplying this by 1.5 gives 2.25. We add this number to Q3 and subtract it from Q1 to find the inner limits, as shown below:

    • 71.5 + 2.25 = 73.75
    • 70 - 2.25 = 67.75
    • Therefore, the inner limits are 67.75 and 73.75.
  • In our data set, only the oven temperature (300 degrees) is outside this range, so it could be a mild outlier. However, we still have to determine whether this temperature is an extreme outlier, so let's not jump to conclusions until we have.
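
The inner-limit calculation might be sketched in Python as follows, using the Q1 and Q3 values from the example. The function name `inner_limits` is illustrative only.

```python
def inner_limits(q1, q3):
    """Return the (lower, upper) inner limits: 1.5 interquartile ranges beyond Q1 and Q3."""
    margin = 1.5 * (q3 - q1)
    return q1 - margin, q3 + margin

ordered = [69, 69, 70, 70, 70, 70, 71, 71, 71, 72, 73, 300]
lower, upper = inner_limits(70, 71.5)
print(lower, upper)  # 67.75 73.75

# Values outside the inner limits are at least mild outliers.
outside_inner = [x for x in ordered if x < lower or x > upper]
print(outside_inner)  # [300]
```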


Step 8. Find the “outer limits” of the data set

These are calculated in the same way as the inner limits, except that the interquartile range is multiplied by 3 instead of 1.5. The result is then added to Q3 and subtracted from Q1 to find the upper and lower outer limits.

  • In our example, multiplying the interquartile range by 3 gives (1.5 * 3), or 4.5. We find the upper and lower outer limits as before:

    • 71.5 + 4.5 = 76
    • 70 - 4.5 = 65.5
    • The outer limits are 65.5 and 76.
  • Any value that lies outside the outer limits is considered an extreme outlier. In this example the oven temperature, 300 degrees, is well outside the outer limits, so it is definitely an extreme outlier.
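
Putting the inner and outer limits together, a value can be labeled as a mild outlier, an extreme outlier, or neither. The following Python sketch shows one way to do this; the function name `classify` and the label strings are made up for illustration.

```python
def classify(value, q1, q3):
    """Label a value as 'extreme', 'mild', or 'none' using the inner and outer limits."""
    iqr = q3 - q1
    # Check the outer limits first: 3 interquartile ranges beyond Q1 and Q3.
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    # Then the inner limits: 1.5 interquartile ranges beyond Q1 and Q3.
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "none"

ordered = [69, 69, 70, 70, 70, 70, 71, 71, 71, 72, 73, 300]
labels = {x: classify(x, 70, 71.5) for x in ordered}
print(labels[300])  # extreme
print(labels[73])   # none
```

A hypothetical reading of 75 degrees, which is not in our data set, would fall between the inner limit (73.75) and the outer limit (76) and so would be labeled a mild outlier.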


Step 9. Use a qualitative assessment to determine whether or not you should "rule out" outliers

Using the method described above, it is possible to determine whether a given value is a mild outlier, an extreme outlier, or not an outlier at all. However, identifying a value as an outlier only marks it as a candidate for removal from the data set, not as a value that must be removed. The reason why an outlier differs from the rest of the values is crucial in deciding whether or not to discard it. Generally, outliers that can be attributed to an error of some kind, such as an error in measurement, recording, or experimental design, are discarded. On the other hand, outliers that cannot be attributed to an error and that reveal new information or trends that had not been predicted are generally kept.

  • Another criterion to consider is whether the outlier significantly affects the mean (average) of the data set by skewing it or making it misleading. This is particularly important if you plan to draw conclusions from the mean of the data set.
  • Let's evaluate our example. Since it is highly unlikely that the oven reached a temperature of 300 degrees due to some unforeseen natural force, we can almost certainly conclude that the oven was accidentally left on, resulting in an abnormally high temperature reading. Also, if we keep the outlier, the mean of our data set is (69 + 69 + 70 + 70 + 70 + 70 + 71 + 71 + 71 + 72 + 73 + 300) / 12 = 89.67 degrees, while the mean without the outlier is (69 + 69 + 70 + 70 + 70 + 70 + 71 + 71 + 71 + 72 + 73) / 11 = 70.55 degrees.

    • Since the outlier can be attributed to human error, and because it would not be correct to say that the average temperature of this room was almost 90 degrees, we should choose to discard our outlier.
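
To see how much a single outlier pulls the mean, the two averages can be compared with a short Python sketch. It uses the standard library's `statistics.mean`; the variable names are illustrative.

```python
from statistics import mean

ordered = [69, 69, 70, 70, 70, 70, 71, 71, 71, 72, 73, 300]

# Mean of all 12 readings, including the oven.
mean_with = mean(ordered)

# Mean of the 11 readings that remain once the oven (300) is removed.
mean_without = mean([x for x in ordered if x != 300])

print(round(mean_with, 2))     # 89.67
print(round(mean_without, 2))  # 70.55
```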

Step 10. Understand the importance of (sometimes) counting outliers

While some outliers should be removed from a data set because they result from an error or skew the results so that they become incorrect or misleading, other outliers should be kept. If, for example, an outlier appears to have been validly obtained (that is, not as a result of an error) or gives you new insight into the phenomenon you are measuring, it should not be discarded. Scientific experiments are particularly sensitive to this issue. Discarding an outlier by mistake can mean throwing away information that points to new trends or discoveries.

For example, let's say we are designing a new drug to increase the size of fish in a fish farm. We will use the same data set as before ({71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69}), except that this time each value represents the mass of a fish (in grams) after being treated with a different experimental drug since birth. In other words, the first drug gave one fish a mass of 71 grams, the second drug gave a different fish a mass of 70 grams, and so on. In this situation, 300 is still an extreme outlier, but we should not discard it because, assuming it is not due to an error, it represents a significant success in our experiment. The drug that produced a 300-gram fish worked better than all the others; therefore, this value is actually the most important one in our data set, rather than the least important.
