With all the focus on data manipulation and calculation, it can be really easy to get caught up in those beautiful, little digits. The key is to make those numbers come to life and tell you a story that doesn’t look like the green code from The Matrix.
With the following data set, you now have the skills to make many useful calculations and manipulations: mean, median, filtering sub-populations, and even graphing.
You decide to simply calculate the mean and find a class average of 74% on the most recent writing test. You’re down. You’re dejected. “But we had great class discussions on imagery! I thought we were going to average a low B at least!!”
With our discussions from previous lessons, this is where we would stop. We averaged a low C as a class, and as the teacher, you would just try harder in the future. However, upon further inspection, you notice that little Ricky Marx made a 14% on this assessment! Whew, he must have had a bad day.
In data terminology, Ricky Marx’s grade is what you would call an “outlier.” True to their name, outliers lie far outside of peer data points and move means into new zip codes. At just 14%, Ricky’s grade will have a tremendous impact on the class average.
In order to accurately gauge the true performance of a class, extreme outliers often need to be set aside. In this case, Ricky’s test score is 48 percentage points away from the next closest grade: quite far, indeed. As data scientists, ahem, teachers, it’s our job to now make sense of the data. If we remove Ricky’s test grade, we’ll get a more accurate picture of overall class performance. Delete the cell containing Ricky’s test score in order to see this new view.
As you can see, Ricky’s test score dragged the class average down by 7 percentage points! This is magnified by our fairly small class size as well. In general, smaller samples of data are impacted more dramatically by outliers. In the context of hundreds of students, a single outlier may not have that large of an impact. But in order to reclaim your classroom, it’s important for you to be on the lookout for outliers.
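If you would rather script this than delete a cell, here is a minimal Python sketch of the same comparison. Fair warning: the ten scores below are made up for illustration, since the real grade book isn't reproduced here. They were picked only so that the class mean lands on 74 and the lowest score is Ricky's 14. The helper `shift_from_one_outlier` and its default of 81 for a "typical" score are also assumptions, not part of the lesson.

```python
from statistics import mean

# Hypothetical scores, invented for this sketch: ten students, Ricky's 14 first.
scores = [14, 62, 70, 75, 78, 82, 85, 88, 91, 95]

mean_with = mean(scores)          # class average including the outlier
mean_without = mean(scores[1:])   # same class with Ricky's score set aside


def shift_from_one_outlier(class_size, typical=81, outlier=14):
    """How far one outlier pulls down the mean of a class
    in which every other student scores `typical`."""
    with_outlier = (typical * (class_size - 1) + outlier) / class_size
    return typical - with_outlier


print(f"Mean with Ricky:    {mean_with:.1f}")
print(f"Mean without Ricky: {mean_without:.1f}")

# The same single low score matters less and less as the class grows.
for size in (10, 30, 300):
    drop = shift_from_one_outlier(size)
    print(f"Class of {size}: one 14 lowers the mean by {drop:.1f} points")
```

With these invented numbers, the drop works out to the gap between the typical score and the outlier divided by the class size, which is why the same 14 barely registers in a class of 300.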
Another way to soften the meteor-like impact an outlier can have on the mean of your data is to use another method to measure the center of the data. The “median” is the middle number when we order our data set from least to greatest (or the average of the two middle numbers when the count is even). Because the highest and lowest numbers won’t usually be near the middle, the median acts as a polygraph test by getting those outliers to start telling the truth!
Simply type “=MEDIAN(” in a cell, drag across your data set, and hit Enter. Like magic, we have another measure of the center of our data without those pesky outliers to cause the massive swings we saw earlier. Measures of the data’s center are also referred to as measures of central tendency in fancy-speak.
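To see how much steadier the median is, here is a small Python sketch that sets the two measures side by side. As before, the scores are hypothetical, invented so that the mean is 74 and Ricky's 14 is the lowest grade. The variable names are assumptions too.

```python
from statistics import mean, median

# Hypothetical scores, invented for this sketch; Ricky's 14 is the first entry.
scores = [14, 62, 70, 75, 78, 82, 85, 88, 91, 95]
without_ricky = scores[1:]

# How far each measure of center moves when the outlier is set aside.
mean_shift = mean(without_ricky) - mean(scores)
median_shift = median(without_ricky) - median(scores)

print(f"Mean:   {mean(scores):.1f} -> {mean(without_ricky):.1f} (moves {mean_shift:.1f})")
print(f"Median: {median(scores):.1f} -> {median(without_ricky):.1f} (moves {median_shift:.1f})")
```

In this made-up class the median barely budges when Ricky's score is removed, while the mean jumps by several points.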
We would now recommend logging into your grade book and looking at your most recent assessment. Calculate the mean for this assessment. Now calculate the median and investigate the difference. Were there any outliers in your last assessment?
- Say, “Out, liar!” and recognize the effect that outliers have on your data set.
- Don’t give a cold shoulder to all outliers all the time! They have needs too.
- Removing outliers is more important when your data set is small.
- In the face of outliers, use the median to measure the center of your data.
Graphing outliers shows their Pluto-like status among the other data points in more drastic fashion. Graph the data from the previous exercise to see something that will probably look like the crevasse of an arctic glacier.
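If a spreadsheet chart isn't handy, a rough text-only version of the graph can be sketched in Python. This is just one way to do it: the scores are hypothetical (only Ricky's 14 comes from the lesson), and `text_bars` is a made-up helper name, not a library function.

```python
def text_bars(scores, width=50):
    """Return one text bar per score, lowest first,
    scaled so that 100% would fill `width` characters."""
    lines = []
    for score in sorted(scores):
        bar = "#" * round(score * width / 100)
        lines.append(f"{score:3d}% | {bar}")
    return lines


# Hypothetical scores, invented for this sketch; the 14 is Ricky's.
scores = [14, 62, 70, 75, 78, 82, 85, 88, 91, 95]

for line in text_bars(scores):
    print(line)
```

The first bar comes out far shorter than all the rest, which is the crevasse the graph is meant to show.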