I regularly conduct both qualitative and quantitative research, and whichever method I’m using I’m often asked about sample size.
An economist might balk at the idea that you can get value from 5–10 one-hour interviews. Yet there is a lot of value in qualitative research that can’t be achieved with quant. And quantitative research has its own perils, including sample size issues.
Questions about sample size are more complex than they appear. A proper answer requires nuance. It depends on the theoretical justification for the results, the effect size observed, the number of hypotheses, and more. Is it possible that a sample size of 1 is ever sufficient? The answer may surprise you.
A brief case study
Some years ago I conducted a piece of research for an online grocery chain. It’s a useful project to explore because it combined qualitative and quantitative methods. My client wanted to understand which consumer attributes most affected user behaviours.
This was not a typical market segmentation exercise. They already knew that users in certain categories, such as age ranges and life-stage categories, were more likely to purchase different kinds of products. That information was mildly useful, though the correlations were often small. For this project they wanted to drive product decisions based on user needs and behaviours.
I first ran an analysis over a year of spending data for their entire customer base. Attributes included: demographics (age, gender, etc), when they placed their orders, what device they used, how much they spent in different product categories, and much more.
This data was amazingly noisy. There were almost no (statistically significant) correlations between any attributes or behaviours. One interesting segmentation split customers by which categories of fresh food they were willing to buy online. The groups were (1) Not-picky, (2) Wouldn’t buy Fruit & Veg, (3) Wouldn’t buy Meat & Fish. Other than this segmentation the results were largely uninteresting.
A survey of a sample of customers also failed to reveal any interesting patterns or clusters. Plenty of obvious tidbits came out but nothing insightful. For example users who opted for pick-up over delivery were much more likely to have a car and care about the delivery fee. It makes sense but it doesn’t provide insight that could drive product decisions.
The most interesting quantitative data was actually the free-text responses from their regular customer feedback surveys. You would think that I would class this as qualitative data but there were over 60,000 responses spanning a 6 month period. Analysing this kind of data required plenty of finesse. Ultimately I needed a combination of machine learning algorithms to find patterns.
For example: there are older (or disabled) users who have trouble leaving the house and are usually on a small fixed income. If a significant part of their order is missing (e.g. 1kg of a particular meat they ordered was out of stock) they worry about going hungry.
The final piece of research was a series of 20 video interviews with a range of customers. I chose participants based on factors from the research mentioned above as well as existing segmentation.
This qualitative research revealed rich insights about user behaviour. Some highlights included:
- Many users do in-store “top-up shops” for quickly consumed goods (milk, bread)
- People use shopping lists in all sorts of unexpected ways, including some users who use their shopping cart as an ongoing list
- People on a low income value online grocery shopping because it allows them to see their cart total before going to the checkout
- The reason some people don’t buy their fruit & vegetables online? They don’t trust the pickers to choose the freshest ones (for too many reasons to go into in this short article)
Granted, we cannot use these insights in the same way as some of the quant data. For example, using quant data the business can predict what kinds of products people want to buy. This can drive effective and automated marketing campaigns.
Similarly, we can’t be sure from the qualitative data how widespread any insight is (even within a particular subgroup). Yet this qualitative data did provide valuable insight into behaviour. How? The client combined it with further research, designed experiments to test hypotheses, and conducted ongoing quant analysis.
What’s the point of a small sample?
I mentioned the small sample size (N=20) used for my qualitative research. In fact this is considered a large sample size for a lot of projects. Typical customer interviews involve sample sizes of 5-10.
The point of the case study was to illustrate that different kinds of data are useful for different purposes. If I had been asked to build a model that predicted what users would buy in their next shop I would have used a different method. In fact I would start with a naive approach and then measure all other models against it. What would this involve? Assume that the customer will buy exactly the same items as the last time they shopped. This turns out to be a powerful predictor and many “sophisticated” algorithms fail to beat it.
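That naive baseline can be sketched in a few lines of Python. The basket data model here (each order as a list of product names) and the recall metric are invented for illustration, not taken from the actual project:

```python
from collections import Counter

def naive_predict(order_history):
    """Baseline model: assume the next basket is identical to the
    most recent one. Each order is a list of product names
    (hypothetical data model for illustration)."""
    if not order_history:
        return []
    return list(order_history[-1])

def basket_recall(predicted, actual):
    """Fraction of the actual basket the prediction recovered."""
    hits = Counter(predicted) & Counter(actual)
    return sum(hits.values()) / max(len(actual), 1)

history = [["milk", "bread", "eggs"], ["milk", "bread", "apples"]]
prediction = naive_predict(history)
print(prediction)
print(basket_recall(prediction, ["milk", "bread", "bananas"]))
```

Any more sophisticated model should be benchmarked against a trivial baseline like this before being trusted.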
Suppose I wanted to work out what kinds of features would drive the most engagement? To do that I need to understand what kinds of problems users face and what kinds of goals they are trying to achieve. I would do exactly the deep dive I described above. Then I would combine my findings into a task analysis. And I would use that to create possible product directions. Once we have ideas, generated from customer research, we can look at quant methods to develop a product roadmap.
Small sample qual research is most useful for discovering new things. Charles Darwin’s Theory of Evolution by Natural Selection was not born of rigorous quantitative analysis. Instead he developed his theory from qualitative observations while on the HMS Beagle. His qualitative catalogue of information led to a brilliant insight: random changes in new generations will naturally result in adaptation to their environment.
There is a second way that small sample sizes are useful: when there is a strong theoretical justification for the results. The most famous example comes from Australian scientist Barry Marshall. He had some evidence that the bacteria Helicobacter pylori caused stomach ulcers. Yet the evidence was not considered conclusive. So he drank a beaker of H. pylori bacteria and an exam later revealed signs of gastritis. He followed this up with a course of antibiotics which both removed the bacteria and resolved the symptoms. This experiment had a sample size of one yet it was strong evidence.
Why was this strong evidence? The reason is “effect size”. Marshall’s experiment was so compelling because he had no gastritis symptoms before ingesting a beaker of H. pylori. Afterwards he developed symptoms. Then, after administering a treatment, the symptoms went away. And these were not subjective symptoms such as pain (which is more readily reduced by placebo).
This sample size of one was sufficient evidence to spur further research. There are two important takeaways here: (1) there was a strong theoretical justification before the experiment was conducted, and (2) it was followed up with further research to build conclusive evidence and reduce the possibility of coincidence or deception.
When quantitative research fails
Nate Silver, in his book “The Signal and the Noise”, highlights an interesting failure of forecasting. In the period 1986–2006 economists were incredibly successful in forecasting GDP, with an average error of just 1.1 points. By contrast, in the previous period, 1968–1985, economists’ forecasts were off by an average of 2.3 points. Does this mean that economists got better at forecasting changes in the economy?
The fact is that between 1986 and 2006 the US economy was relatively stable. Actual GDP growth tended to stay in a narrow range, between 2 percent and 5 percent. It was a safe bet to predict that the economy would continue to grow the same way it had for the previous 20 years.
In fact, when looking at these forecasts in more detail, it becomes obvious that the economists actually got worse at predicting GDP growth. Why? Economists failed to predict the two recessions that happened in that period. They continued to predict growth because most of the data showed growth. This failure to anticipate an upcoming problem was partly to blame for the slow (and lacklustre) response to the 2008 recession and ultimately the global financial crisis.
This article is not about throwing shade on quantitative research in an attempt to promote qualitative methods. The point is to say that analysing data to draw conclusions is difficult. It’s easy to make mistakes in both qualitative and quantitative methods.
The data these economists had on-hand wasn’t a small sample. They had historical trends going back as far as they wanted. Bad quantitative models made for poor predictive power. Sample size is one question but it is not the most important factor in designing research.
When qualitative research fails
I’ve also seen many projects where a company made bad product decisions based on poor qualitative evidence. Situations where a decision was made simply because a market research team conducted a focus group or a customer research team conducted a handful of interviews.
One (purposely vague) example is of a company who, after a demo, asked participants if they would use a new feature. Not only did they say yes but the participants were enthusiastic about the feature. When the feature launched, uptake was zero. Another example involved security settings on a financial services product. Researchers asked participants what level of security they would choose. During the sessions customers all selected the strongest (and most inconvenient) security measures. In the real world customers all chose the weakest (and most convenient) option instead.
These research methods involved two major flaws: (1) the theoretical justification was weak, because we know that people are bad at predicting their own future behaviour, and (2) there was no follow-up research to see if this behaviour would generalise to the broader customer base.
These problems would not have been fixed by a larger sample size. The product decisions were caused by fundamental problems in the research design.
The multiple comparison problem
Before discussing a solution to the sample size challenge I want to touch on one more problem area: multiple comparisons.
A quick (but important) detour: there is a website that describes a range of “spurious correlations” of completely unrelated phenomena. For example: Age of Miss America winner correlates with murders by steam, hot vapours, and hot objects. Why does this happen? There are so many pieces of data out there that coincidences are almost guaranteed.
You may be familiar with the term ‘p-value’. A simplified (though slightly inaccurate) definition of a p-value is: the probability that a result at least this extreme would have occurred if the mechanism for the effect did not exist. Suppose you read a study which concludes that people who drive white cars are 14% more likely to speed, with a p-value of 0.05. Now suppose that, in reality, people who drive white cars are no more likely to speed than anyone else (the null hypothesis). Given that the null hypothesis is true, what are the odds that a given sample of data would show white cars as being 14% more likely to speed? A p-value tells you that probability.
A p-value of 0.05 is sometimes written as 95% confidence or with a 95% confidence interval. This means that, based on similar data, we would expect to see a coincidence 5% of the time.
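To make this concrete, here is a small Monte-Carlo sketch. All the numbers (a 20% base rate of speeding, 1,300 drivers per group) are invented for illustration; the point is that even when the null hypothesis is true, a sample will still show a “14% more likely” result a small but predictable fraction of the time:

```python
import random

random.seed(42)  # reproducible illustration

def one_null_study(n=1300, base_rate=0.20):
    """Simulate one study in which white-car drivers genuinely speed
    at the same rate as everyone else (the null hypothesis is true).
    Returns the observed speeding rate in each group."""
    white = sum(random.random() < base_rate for _ in range(n)) / n
    other = sum(random.random() < base_rate for _ in range(n)) / n
    return white, other

# Run many studies under the null and count how often white-car
# drivers *appear* at least 14% more likely to speed by chance.
trials = 2000
false_hits = 0
for _ in range(trials):
    white, other = one_null_study()
    if other and white / other >= 1.14:
        false_hits += 1

print(f"Spurious 14%+ effects under the null: {false_hits / trials:.1%}")
```

With these particular numbers the spurious-effect rate lands close to 5%, which is exactly what a p-value of 0.05 is describing.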
As we’ve seen with the “spurious correlations” website it’s possible to find random correlations if you dredge through enough data. The same problem can arise when conducting research without a significant direction.
Suppose you conduct a survey of 300 customers. You use an online calculator that tells you the results will be “statistically significant” with a 95% confidence interval and a small margin of error. You would expect to find a coincidence only 5% of the time. Great, right?
That is true if your survey is measuring a specific hypothesis (e.g. do users who own dogs buy more of your product?). But suppose you don’t know what you’re looking for. Then you might ask many survey questions about seemingly unrelated topics. Each time you compare any two aspects of the survey you are performing one comparison, and the more comparisons you make the more opportunities there are to find a coincidence. In statistics this is called the “multiple comparison problem”.
Choosing a sample size
Regardless of what kind of research you are conducting the first thing you need to do is understand the nature of the question you are trying to answer. Here are some basic considerations:
1. What are you trying to accomplish? For example: discover something new, make a prediction, make a business decision that will result in success, etc.
2. Who will use your findings and how? For example: make a product decision, make an algorithm that acts automatically, provide evidence to the CEO, etc.
3. Under what circumstances would it be most risky for your findings to be wrong?
4. Will you have an opportunity to conduct follow-up research to validate the findings?
5. How much of an opportunity will you have to course correct if it turns out the findings were spurious?
These factors will influence your research design. In an ideal world you would (1) conduct exploratory research, which you then (2) use to run a follow-up experiment to validate the findings, and (3) you implement gradually to validate that the outcome works in the real world.
The worst case scenario is that you make a major, irreversible, decision based on a small sample size and no follow-up research.
Sample size in quant research
As it turns out sample size is not the only consideration.
Before concluding that your single research study is statistically significant try to conduct follow-up research. If you are making multiple comparisons then you need to be particularly suspicious of spurious correlations and coincidences.
Make sure you have a theoretical justification for any findings. See if you can back this up with desk research or qualitative research.
Before deciding on sample size we also need to consider research design. Recall the example of economists working with data during a period of relative economic stability. They may have considered it a safe assumption that “things had changed” and that the US economy had simply become more stable over time. They turned out to be catastrophically wrong.
Suppose you are doing research on customer spending behaviour. Does the sample include customers with varying levels of spend? Do most sales come from the “long-tail” of small spenders? Or do you have a Pareto distribution where most income comes from a handful of large spenders? If most of your revenue comes from a handful of large customers then a sample of low-spenders won’t actually help at all. These extreme examples may seem obvious but I’ve seen plenty of research that makes these kinds of mistakes.
One thing that you won’t know ahead of time is potential “effect size”. The stronger the observed effect the smaller the sample that you need. You may try to calculate the sample size in advance by assuming a worst case (small) effect size. But perhaps an effect size that small wouldn’t be worth drawing conclusions on anyway. And if you are trawling data for large effects then you need to worry about multiple comparisons anyway.
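The effect-size/sample-size trade-off can be made concrete with a standard power calculation. The sketch below uses the usual normal approximation for comparing two group means (α = 0.05 two-sided, 80% power); treat it as a back-of-the-envelope illustration, not a substitute for a proper power analysis:

```python
import math

def required_n_per_group(effect, sd=1.0, z_alpha=1.96, z_power=0.84):
    """Approximate per-group sample size needed to detect a
    difference of `effect` between two group means, using the
    normal approximation (alpha=0.05 two-sided, 80% power)."""
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# The stronger the expected effect (here in standard-deviation
# units), the smaller the sample you need:
for effect in (0.2, 0.5, 1.0):
    print(f"effect {effect:.1f} sd -> ~{required_n_per_group(effect)} per group")
```

A small effect (0.2 sd) needs hundreds of participants per group, while a large effect (1.0 sd) needs fewer than twenty, which is why Marshall’s sample of one could still be informative.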
The sample size question ultimately boils down to the question of risk. Under what circumstances would it be most risky for this research to be wrong? Do you have a large enough sample size for this circumstance?
If you are able to, you should save enough budget for follow-up research. Use your initial findings to develop a hypothesis and design an experiment which you can use to validate it properly. At that point you already have evidence that your hypothesis is true. Now you can validate it with a statistically significant sample.
Sample size in qual research
Qualitative research methods are time intensive: interviewing people, making observations, conducting diary studies, and experience sampling all take a long time. Planning, admin, running sessions, taking notes, and conducting analysis all take time too. This means that your largest considerations are time and budget.
So when you’re thinking about sample size you need to ask yourself a lot of questions. Are you speaking to enough of the right kinds of people to reduce the risk of making a mistake? Do you have a theoretical justification to back up the findings? Do you have the right research design to avoid bias?
You can avoid a lot of issues by properly mapping out research objectives in advance, considering different methods for analysing qualitative research, or synthesising your findings into a task analysis.
If your research is well planned (focusing on the kind of output you want) then you should have a good idea of who you will need to speak to. Look at any previous research you have (qual or quant) to identify participant groups. Make sure to recruit participants based on potential behaviours more than their demographics. Though you should make sure not to completely ignore demographic factors. A bias in participant age, race, and gender could lead to findings that don’t represent the diversity of your customer base. Try to cast your net as wide as possible.
Then how do you decide on the sample size? Talk to as many (different kinds of) people as you can until you (1) stop hearing anything new, or (2) run out of time or budget.
That might seem like a cop-out. But remember: the biggest constraint on qualitative research is not statistical significance but how time intensive the work is.
There are plenty of articles and calculators out there for calculating sample size. I want the most important takeaway from this article to be this: sample size is not the most important factor to consider when determining sample size. Sample size should be a single consideration in overall research design.