The Battle between Machine Learning & Statistics over Consumer Insights

With consumers providing so many data points through any number of information-gathering techniques, it is imperative that companies take a strategic approach to analysis, especially now that demographics alone no longer suffice. Furthermore, effective consumer research should get at the “why” behind consumer behaviors and preferences if a company is to survive a competitive environment and lead in the future.

All of this raises the question: how? Researchers have long debated the effectiveness of two approaches: machine learning and classic statistics. The relationship between them has not been without its hardships, with each side making the case that it is the proper strategy for maximizing your ROI from the data collected from consumers.

Over a series of blog posts, we will help dispel some myths about the many buzzwords in the field. The first topic we’re tackling is machine learning vs. statistics. What is machine learning? What is classical statistics? Are they different? If so, how? When do I use them? And which one is more effective in helping me understand my consumers?

First things first, let us cover some working definitions for both. Machine learning and statistics are fields that employ various analysis techniques for the purpose of understanding data. Machine learning is a type of artificial intelligence (A.I.) that allows software applications to learn and predict outcomes without being explicitly programmed. You would mainly use machine learning to generate a prediction about your whole customer base from existing datasets.

Statistics, on the other hand, is defined as a branch of mathematics dealing with the collection, classification, analysis, and interpretation of data. It is powerful for drawing inferences about your customers from a sample of a larger population. While machine learning is concerned with identifying patterns in existing datasets, the primary goals of classic statistics are to describe the data by reducing it to its most meaningful level and to draw inferences about the larger population from only a portion of your customers.

For these reasons, the two fields tend to focus on solving slightly different business needs. Machine learning rules when there is a need for an individualized prediction about a certain consumer behavior or trend. Statistics wins the day when there is a need to answer a big strategic question such as “why,” “how,” and “for whom.” For example, machine learning is deployed when you’re interested in generating a list of recommended items for consumers based on past behavior. Statistics is optimal when you want to test a hypothesis about why consumers are buying specific products, or why behaviors are trending a certain way.
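
To make the recommendation use case concrete, here is a minimal sketch in Python of one very simple approach: counting which items tend to be bought together. The customer names, the baskets, and the helper functions `build_cooccurrence` and `recommend` are all hypothetical and invented for illustration; real recommendation systems are considerably more sophisticated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, invented purely for illustration.
baskets = {
    "ana": {"coffee", "oat milk", "granola"},
    "ben": {"coffee", "oat milk"},
    "chloe": {"coffee", "granola", "tea"},
    "dev": {"tea", "honey"},
}


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(customer, baskets, top_n=2):
    """Suggest items the customer has not bought, ranked by co-purchase counts."""
    owned = baskets[customer]
    scores = Counter()
    for (a, b), n in build_cooccurrence(baskets).items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("ben", baskets))
```

Note that the sketch only looks for patterns in the purchases it has seen. It says nothing about why those items are bought together, which is exactly the kind of question the statistical approach is better suited to.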

What makes one technique more effective than the other? The answer is that it depends on what you are hoping to achieve. While a deep academic analysis is beyond the scope of this blog, here are three key differentiators.

Assumptions, Assumptions, Assumptions

The bell curve. We all saw it by day 3 of Statistics 101 class. It takes many of us back to that unpleasant time in an introductory statistics class where the lecturer talked about things that we’ve just as soon forgotten. Do you remember what a t-test is, what a p-value means, or what significance testing is? At the heart of it all is the ability to infer something about the population from only a sample. To do that, we make assumptions about things such as the independence of observations and the distribution of the population.

In our case, the sample might be the group of customers who responded to last month’s satisfaction survey or to the brand health tracker from last quarter. The soundness of those assumptions, and how well this sample represents the larger population, will greatly affect how accurate your predictions about the larger consumer base actually are.
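
As a rough illustration of inferring something about all customers from a sample, the sketch below estimates a 95% confidence interval for the average satisfaction score. The scores and the function name `mean_confidence_interval` are hypothetical. The calculation assumes independent responses and uses a normal approximation; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical satisfaction scores (1-10) from 20 survey respondents.
scores = [8, 7, 9, 6, 8, 7, 10, 5, 8, 9, 7, 6, 8, 9, 7, 8, 6, 9, 8, 7]


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumes independent observations and a roughly normal sampling
    distribution of the mean (normal approximation, not a t-interval).
    """
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


low, high = mean_confidence_interval(scores)
print(f"Sample mean: {mean(scores):.2f}")
print(f"Approximate 95% interval for all customers: {low:.2f} to {high:.2f}")
```

The interval is only as trustworthy as the assumptions behind it. If the respondents are not representative of the wider customer base, the arithmetic still runs, but the conclusion does not hold.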

Machine learning, on the other hand, is largely free from those assumptions. The focus is on the existing dataset at hand, such as recent purchase behavior or brand perceptions, and the patterns it can reveal. Distributional assumptions are not needed because machine learning users are generally not interested in inferring something about the population from the sample. The population of interest is effectively the sample. The idea is that the more data you have, the more patterns will be revealed. Over time, with more data, the predictive models will improve.

Data Quantity vs. Data Quality

The second big differentiator between machine learning and statistics is the importance of sampling techniques. Statistics is concerned with inferring something about all of your customers based on data from a survey of only a sample of the entire customer base. This is why you may hear statisticians discussing how important proper sampling is to the final outcome (see, for example, literally anything about political polling).

Machine learning typically assumes that the samples are independent and identically distributed draws from the population and that they are already representative of that entire population. The result is that machine learning techniques end up being far more pragmatic and cheaper to conduct at scale.

Keep in mind, however, that what you gain in scalability you may lose in accuracy. Google’s epic failure to predict the number of flu cases based on Google search terms in 2013 is a classic example. While the underlying machine learning algorithms were relatively sound, ignoring factors such as uncertainty and sampling technique led to spectacularly inaccurate estimates over time.
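
The trade-off between quantity and quality can be illustrated with a small simulation. This is not a reconstruction of the Google example; it is a toy sketch in which every number (population size, share of heavy online users, purchase rates, sample sizes) is invented. It compares a small random sample with a much larger sample drawn only from one convenient segment.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 customers. About 30% are heavy online
# users, who buy the product 60% of the time; everyone else buys 20% of the time.
population = []
for _ in range(100_000):
    heavy = rng.random() < 0.30
    buys = rng.random() < (0.60 if heavy else 0.20)
    population.append((heavy, buys))


def purchase_rate(customers):
    """Share of customers in the list who bought the product."""
    return sum(buys for _, buys in customers) / len(customers)


true_rate = purchase_rate(population)

# Small but properly random sample of the whole customer base.
small_random = rng.sample(population, 500)

# Large convenience sample drawn only from heavy online users.
heavy_users = [c for c in population if c[0]]
large_biased = rng.sample(heavy_users, 20_000)

random_error = abs(purchase_rate(small_random) - true_rate)
biased_error = abs(purchase_rate(large_biased) - true_rate)

print(f"True purchase rate:          {true_rate:.3f}")
print(f"Random sample (n=500) error: {random_error:.3f}")
print(f"Biased sample (n=20,000) error: {biased_error:.3f}")
```

In this setup the large sample misses badly despite being 40 times bigger, because more data from the wrong slice of customers does not correct the bias.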

Exploring vs. Confirming: Different Ways of Learning

Data analysis techniques are classified as either exploratory or confirmatory. As the labels imply, exploratory analysis seeks to identify interesting or useful patterns, whereas confirmatory analysis tests specific hypotheses in the dataset that can either be confirmed or refuted.

You’re either looking for new trends in consumer data that you aren’t aware of or checking to see if customers are engaging with your products the way that you intended.

Machine learning algorithms are mainly exploratory and attempt to generalize decision making. Again, this is because machine learning practitioners are less concerned with hypothesis testing.

Statisticians focus primarily on hypothesis testing. They ask questions such as: Are women more likely to purchase organic food than men? Are millennials more conscious about environmentally friendly products than other generations?
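
A confirmatory question like the organic food one can be framed as a test comparing two proportions. The sketch below uses a standard two-proportion z-test with a one-sided alternative. The survey counts are hypothetical, and the function name `two_proportion_z_test` is our own; the test assumes independent respondents and samples large enough for the normal approximation to be reasonable.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of H0: p_a == p_b against H1: p_a > p_b.

    Returns the z statistic and the one-sided p-value.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical survey: 180 of 400 women and 140 of 400 men bought organic food.
z, p_value = two_proportion_z_test(180, 400, 140, 400)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value suggests the difference in the sample would be unlikely if women and men purchased organic food at the same rate in the wider population. It does not, on its own, explain why the difference exists.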

Both have their place in solving business challenges, depending on the context. Companies need to take a step back to evaluate which method is the best for that particular problem before getting caught up in the buzzwords of the moment. Or feel free to just reach out to us!

So What?

Given the choice between machine learning and classic statistics, which should be used? Of course, the answer is that it depends. It is becoming clear that the two fields can benefit from each other and that both can assist in better understanding consumers.

The team at Frontier7 has extensive experience in data analytics and has helped companies of all sizes make data-driven, consumer-focused decisions. We are excited about the potential for big, meaningful impact that we can have in the world of consumer research.

We admit, “machine learning” has a sexy ring to it, but trendy buzzwords do not a smart business decision make. Blindly following trends won’t benefit anyone. Big data doesn’t mean smart data. We want to contribute intelligent tools to the consumer research space to help free up time for thinking within companies.