Is your data lying to you?

The Scenario:

Say we have a large barrel of candy, and the goal is to know how many pieces of each kind of candy (chocolates, mints and caramels) are in the barrel. An obvious way to solve this problem would be to tip over the barrel, sort the candy into separate piles and count each pile. Depending on the size of the barrel, this may take a considerable amount of time. If we think of the barrel as a ‘population’ of candy (i.e. all of the candy in the barrel), then, after counting for several exhausting hours with mountains of candy around us, we may be inclined to exclaim in exasperation: “THERE MUST BE A BETTER WAY OF DOING THIS!”

We may decide to sacrifice the unequivocal truth of the grand (and hand-counted) total in exchange for something that comes close to the truth. We are now operating in the realm of statistics. The practice of statistical analysis helps us make sense of the data that is generated by, from and because of us every day.

At this point, our thoughts may return to college courses filled with never-ending formulae, rules, axioms and theorems that we quickly dismissed as an academic exercise, never to be used again and certainly not applicable in real life.

For the nay-sayers described above, I offer only a silent chuckle and ask that you please continue counting candy from the barrel.


The (approximate) path to the truth

In the field of statistics, our goal is to draw an inference about a parameter (a number that describes or quantifies a population) by calculating a statistic (a number that describes or quantifies a sample, which is typically smaller than, and representative of, the population).

In our candy example, if our goal is to find the total number of each type of candy in the barrel, we define the count of each type as an individual parameter, because each count describes the population.

A key driver of the application of statistical analysis in the real world is the prospect of getting an answer to a problem while the problem is still relevant. We are racing to find a solution before the problem itself becomes moot. Our task is to approximate reality, shortcut the counting process and settle for something less than perfect, yet meaningful.

In our example, if we had only 2 hours to count the contents of the entire barrel (a task that would otherwise take an entire day), we might fill a small bucket with candy from the barrel (our sample), then count the pieces of each type in the bucket. We could then assume that the mix of candy in the bucket is a reasonably good representation of the mix in the entire barrel.

With this assumption, we can make a prediction about the count of each type of candy, as long as we know, or can approximate, the total count of candy contained in the barrel.
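
The bucket approach can be sketched in a few lines of Python. This is only an illustration: the barrel composition in `TRUE_COUNTS`, the bucket size of 200 and the helper names `build_barrel` and `estimate_counts` are all assumptions made up for the example. In practice the true counts are exactly what we do not know.

```python
import random
from collections import Counter

# Hypothetical barrel: the true counts are normally unknown. They are
# fixed here only so that the estimates can be compared with the truth.
TRUE_COUNTS = {"chocolate": 5000, "mint": 3000, "caramel": 2000}


def build_barrel(counts):
    """Return a list with one entry per piece of candy."""
    return [kind for kind, n in counts.items() for _ in range(n)]


def estimate_counts(sample, barrel_size):
    """Scale the sample proportions up to the size of the whole barrel."""
    tally = Counter(sample)
    return {kind: round(barrel_size * n / len(sample)) for kind, n in tally.items()}


rng = random.Random(42)  # fixed seed so the sketch is repeatable
barrel = build_barrel(TRUE_COUNTS)
bucket = rng.sample(barrel, 200)  # simple random sample, without replacement
estimates = estimate_counts(bucket, len(barrel))
print(estimates)
```

The estimates will usually land near the true counts but will rarely match them exactly, which is the trade-off described above: a much faster answer that is close to the truth rather than the truth itself.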

Two immediate questions come to mind:

• How can we be certain that the sample we drew is a true representation?
• How do we know for certain that our sample does not have a disproportionately large number of chocolates or mints?

Since we can only model the sample and not the entire population of candy, the results we obtain may indeed lead us to completely different (and possibly wrong) conclusions, based entirely on the composition of the sample that was drawn.
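
To see how much the conclusion can depend on the particular sample drawn, here is a minimal simulation sketch. It assumes a hypothetical barrel in which exactly half of the candy is chocolate, draws many buckets of a given size, and reports the lowest and highest chocolate share observed. The barrel contents, bucket sizes and function names are illustrative assumptions.

```python
import random

# Hypothetical barrel in which exactly 50% of the candy is chocolate.
barrel = ["chocolate"] * 5000 + ["mint"] * 3000 + ["caramel"] * 2000
rng = random.Random(0)  # fixed seed so the sketch is repeatable


def chocolate_share(sample):
    """Fraction of the sample that is chocolate."""
    return sample.count("chocolate") / len(sample)


def spread_of_estimates(bucket_size, draws=1000):
    """Return the lowest and highest chocolate share seen across many buckets."""
    shares = [chocolate_share(rng.sample(barrel, bucket_size)) for _ in range(draws)]
    return min(shares), max(shares)


for size in (20, 200, 2000):
    low, high = spread_of_estimates(size)
    print(f"bucket of {size:>4}: chocolate share ranged from {low:.2f} to {high:.2f}")
```

Small buckets should show a much wider range of chocolate shares than large ones, so a single small sample can easily suggest a barrel that is mostly chocolate, or hardly chocolate at all.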


Sampling with Purpose

Statistical analysis rests on the simple concept of taking a representative sample in order to derive a reasonably accurate conclusion about a bigger reality.

The method of constructing an appropriate sample is an important step and one that is often taken for granted.

Before starting any predictive analytics or statistical exploration, we suggest having justifiable answers to a few key questions.

Consider your sample data, then ask the following:

  • How large a sample do we really need? The sample size determines the predictive power.
  • Is the sample we have too big or too small? The sample size determines the precision of results.
  • Did we include the right mix of observations? As in the candy example above.
  • Does it contain variables that have adequate predictive power?
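
For the sample-size questions, one common rule of thumb is the margin of error of a sample proportion. The sketch below assumes a simple random sample, a normal approximation at a 95% confidence level, and a sample that is small relative to the population (no finite population correction). The function names are illustrative, not taken from any particular library.

```python
import math

Z_95 = 1.96  # normal critical value for a 95% confidence level


def margin_of_error(p, n, z=Z_95):
    """Approximate margin of error for a sample proportion p based on n draws."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5, z=Z_95):
    """Smallest n whose margin of error is at most `margin` (p=0.5 is the worst case)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


# Precision improves slowly: quadrupling the sample halves the margin of error.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 4))

# Sample size needed for a margin of error of 3 percentage points.
print(required_sample_size(0.03))  # → 1068
```

Under these assumptions, a bigger sample always gives a tighter estimate, but with diminishing returns, which is why "how large is large enough" deserves an explicit answer before the analysis starts.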

Knowing when your data is lying to you depends not only on how the analysis is implemented but also on how the underlying experiment was designed.

The success or failure of a statistical analysis starts with the sample. The quality of the sample can have an amplifying effect on the end result. How can you prevent a small sampling error from leading you to disastrous results?


Read our next article, in which we will discuss and demonstrate a few advanced sampling techniques.

