How to Avoid Being Misled By Data

Everybody uses data on a daily basis. Whether you’re comparing prices at several shops or using Google Maps to plan a trip, you are using data to make a decision. Data is everywhere and part of almost any good decision, but just because numbers or graphs are in front of you doesn’t mean they’re accurate - and even if they’re accurate, it doesn’t mean they’re helpful. With enough data, you can make almost anything look good - so how can you tell what actually matters?

Here are a few points to consider when trying to figure that out and avoid being misled by the numbers:

Sample size

Small samples are generally much more affected by “noise” and outliers, whereas in larger samples these tend to average out, giving a more reliable picture.

In his excellent and highly recommended book “Thinking, Fast and Slow”, Daniel Kahneman discusses the “law of small numbers”, which describes how small samples can mislead our thinking. He cites a story about Bill Gates’ foundation’s study of what makes schools successful. The study found that the highest-performing schools were generally those with fewer students, which led the foundation to invest in building more small schools. However, Kahneman points out that the schools at the lower end of the performance spectrum were also mainly small schools. This is because a school with fewer students is much more likely to show extreme results, in either direction, due to the effect of noise and outliers. In contrast, a school with a large student body will generally have any outliers or extreme results smoothed out by the large sample. The moral of the story is to be cautious, and to look closely, before acting on data from small samples.
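The small-schools effect is easy to reproduce with made-up numbers. The following is a minimal sketch, not Kahneman's data: it assumes every student's score is drawn from the same distribution (so no school is genuinely better), and the school sizes, score scale and names such as `school_mean` are all hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def school_mean(n_students):
    # Every score comes from the same distribution (assumed: mean 70, sd 15),
    # so any difference between schools is pure noise.
    return statistics.fmean([random.gauss(70, 15) for _ in range(n_students)])

# 200 hypothetical small schools (20 students) and 200 large ones (500 students)
schools = [("small", school_mean(20)) for _ in range(200)]
schools += [("large", school_mean(500)) for _ in range(200)]
schools.sort(key=lambda s: s[1])

bottom, top = schools[:40], schools[-40:]  # worst and best 10% by average score

def share_small(group):
    return sum(1 for size, _ in group if size == "small") / len(group)

print("small schools in top 10%:   ", share_small(top))
print("small schools in bottom 10%:", share_small(bottom))
```

Under these assumptions, small schools should dominate both the top and the bottom of the ranking, even though school size has no effect on any individual score.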

Outliers

Discussing outliers is really an extension of the sample size point, but it’s important to identify specific days, weeks, or months that produced extreme results. For example, you may find that November 2022 was a huge month for sales, but a deeper dive into the data may show that there was just one day on which a buyer made a historic purchase, while the rest of the month was standard.

What are some ways to counteract this on a larger scale? One simple method is to look at the “median” or “mode” as well as the “mean”. While the mean is by far the most common average used to report things like time series data, the other two can give a sounder summary of a dataset that contains outliers. The median is the middle value when the results are sorted, so a handful of extreme values barely moves it. The mode is simply the most common value of whatever you are measuring, which you can use as a baseline figure.
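To see how differently the three averages react to a single outlier, here is a small sketch using Python's standard `statistics` module. The daily sales figures are invented for illustration, with one day containing an unusually large order.

```python
import statistics

# Hypothetical daily sales for a 10-day stretch; day 7 has one huge order.
daily_sales = [100, 110, 100, 95, 105, 100, 5000, 90, 100, 110]

mean = statistics.mean(daily_sales)      # pulled far upwards by the outlier
median = statistics.median(daily_sales)  # middle of the sorted values
mode = statistics.mode(daily_sales)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

Here the mean comes out at 591, while the median and mode are both 100, which is much closer to what a typical day looked like.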

Data Visualisation

This is easy to explain and easy to understand, but it can still catch us off-guard. Anyone presenting data with an agenda may try to bend the results to fit it. For example, a cropped or poorly-labelled axis can vastly exaggerate the difference in a visual comparison:

First graph

The above chart makes each month look drastically different, because the vertical axis has been cropped to start at 10,400 instead of 0. Now compare it with the chart below, where the axis starts at 0:

Second graph
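The distortion in the first chart can be put into numbers. This is a rough sketch with hypothetical monthly totals (only the 10,400 axis start comes from the example above); `bar_height_ratio` is an illustrative helper, not part of any charting library.

```python
def bar_height_ratio(a, b, axis_start=0):
    """How many times taller bar b is drawn than bar a on a given axis."""
    return (b - axis_start) / (a - axis_start)

jan, feb = 10_450, 10_550  # hypothetical monthly totals

cropped = bar_height_ratio(jan, feb, axis_start=10_400)  # axis starts at 10,400
honest = bar_height_ratio(jan, feb)                      # axis starts at 0

print(f"cropped axis: February bar is {cropped:.1f}x the height of January")
print(f"zero-based axis: February bar is {honest:.3f}x the height of January")
```

With these figures, the cropped chart draws February three times as tall as January, even though the real difference is under 1%.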

Other examples can include confusing colour schemes that make comparison difficult, graphs without any legend or labels, and pie charts that don’t add up to 100%: 

Third graph

Correlation, not causation

Two factors may correlate (i.e. move in the same direction over a certain time period), but that doesn’t necessarily mean that one causes the movement of the other. Mistakes in this area often come from failing to take other factors into account. In other words, they come from failing to make the analysis as close to a “fair test” as possible.

Take two metrics, say website views and online sales. The two may have moved in roughly the same direction over the past six months or so, but that is just correlation and not necessarily causation. There may have been other contextual factors at play in causing one or the other to move the way it did: for example, a seasonal spike in purchase intent (e.g. Christmas) or a strong referral source on social media around a certain product. It can seem like an endless fight, but the challenge in data-based problem solving is figuring out which data points are causing the movements, and which are merely coinciding with them.
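A quick simulation shows how a hidden factor can create a strong correlation on its own. This is a sketch under stated assumptions: the data is invented, a hypothetical "holiday week" flag lifts both views and sales, and by construction views have no direct effect on sales at all.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Each simulated week: the season drives views and sales independently.
weeks = []
for _ in range(400):
    holiday = random.random() < 0.5
    views = 1000 + 500 * holiday + random.gauss(0, 50)
    sales = 50 + 30 * holiday + random.gauss(0, 5)
    weeks.append((holiday, views, sales))

def corr(rows):
    return pearson([v for _, v, _ in rows], [s for _, _, s in rows])

overall = corr(weeks)
ordinary = corr([w for w in weeks if not w[0]])  # non-holiday weeks only
festive = corr([w for w in weeks if w[0]])       # holiday weeks only

print(f"all weeks: {overall:.2f}")
print(f"ordinary weeks only: {ordinary:.2f}")
print(f"holiday weeks only: {festive:.2f}")
```

Across all weeks the two metrics look tightly linked, but once the comparison is made like-for-like (holiday weeks against holiday weeks, ordinary against ordinary) the relationship should largely disappear. Splitting by a suspected factor like this is one simple way to get closer to a fair test.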

by Frankly

© 2024 Frankly Analytics. All rights reserved.