Challenging “Best Practices” for CX/UX Qualitative Sample Sizes

Debbie Levitt
Published in R Before D · 10 min read · Nov 8, 2023

If we’re doing qualitative research such as observations or interviews, how many people should we include in the study? We typically find from Googling or expert articles that we want five users.

Nielsen Norman Group aka NNg says that it only takes five users to find around 85% of your usability issues. The advice is typically that you don’t need to run a usability test with more than five humans because you’ll find enough UX bugs, and that’s “good enough.”
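
For context, the ~85% figure traces back to a simple probability model: if every issue has the same chance L of being noticed by any single participant, then n participants find roughly 1 − (1 − L)^n of the issues, and with the often-cited average of L ≈ 31% from the 1993 data, five users land at about 84–85%. Here's a minimal sketch of that arithmetic (the 31% is an average from that old study, not a law of nature):

```python
# Sketch of the probability model behind "five users find ~85% of issues."
# It assumes every issue is equally likely (probability L) to be noticed by
# any single participant -- exactly the assumption questioned below.

def share_of_issues_found(n_participants: int, per_user_probability: float) -> float:
    """Expected share of issues found by n participants under this model."""
    return 1 - (1 - per_user_probability) ** n_participants

L = 0.31  # the per-participant discovery rate usually quoted from the 1993 paper
for n in range(1, 11):
    print(f"{n:2d} participants -> {share_of_issues_found(n, L):.0%} of issues")
# 5 participants -> ~84%, which is where "five users find 85%" comes from.
```

Even taken at face value, the model says nothing about which 85% you find, or whether the people you recruited could have hit the issues that matter most.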

Considering some of these “best practices” date back to recommendations made decades ago, it’s time to take a fresh, critical look at them.

This advice comes from a bygone era.

NNg’s “sample size” article is from the year 2000, but references an academic paper from 1993 as its source. 1993 was not a fantastic time for companies to really care about the experiences of People of Color, gay users, transgender users, or immigrant or refugee users. It wasn’t a great time to study the usage of digital devices like computers or mobile phones, considering that most people worldwide didn’t have any of those.

This paper is not about modern technology usage… or modern users.

Who were the people who found 100% of issues, and who were the people who found 85% of issues?

NNg said I only need five people to find 85% of usability issues. But to claim that a group of participants found 85% of the issues, you have to be sure you know what 100% of the issues are. I’ve often been surprised by some of the issues testing participants found. Could NNg have been wrong about the true number of usability issues present in the tested designs? How can we really know the full set of 100% of usability issues?

Did they test with people with accessibility needs? Did they test with people with different knowledge levels, skill levels, reading levels, and so on?

Skimming their original 1993 paper, I noticed:

  • The conclusion was based on 11 studies, one of each of 11 different types. Is a single study per type enough to be representative?
  • Five of the studies were with users. Six studies were expert heuristic evaluations. The advice we’ve been told for decades is about users as participants, but the paper only covers five studies where users were included.
  • The five user studies were around a spreadsheet, a calendar program, a word processor, an outliner, and a bibliography database. In 1993, y’all! Who used those in 1993? Members of marginalized communities? People with disabilities or accessibility needs? And how simple were these software applications in 1993? They didn’t have anywhere near the complexity of our current word processors (MS Word, Google Docs, etc.), spreadsheets, or calendar programs.
  • Who was recruited for these studies? People savvy at these tasks? People new to these tasks or systems? Someone else? Perhaps people who were used to these systems and had workarounds didn’t report as many usability issues since they knew how to complete their task no matter what.

I didn’t read the whole paper, but I’m concerned. About a lot of things.

I also wonder about this chart:

[Chart from the 1993 paper: number of test subjects vs. percentage of usability issues found.]

Does this mean that if you don’t recruit very well, maybe you only find 50% of usability issues with five people? I’d imagine that to be possible. Yet when we see the modern version of this graph, it just has a high line showing that 5 subjects will find nearly all of the possible issues. That graph doesn’t tell the same story as this one.

And “all of the usability issues of a 1993 word processor” and “all of the usability issues of what your team has created” might be very different.

Also, this has been diagrammed to create a nice curve. But what if users 1–10 found very few usability issues, and the main issues were found by users 11–15? Does the chart still look like this? If we only recruited 5 people, did we miss participants who might have found more issues?

So many questions.
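
To make those questions concrete, here’s a small, hypothetical simulation. The issue counts and probabilities below are invented for illustration, not taken from the paper: most issues are easy for anyone to hit, but a handful only surface for a small subgroup (say, people using assistive technology). In that setup, the first five participants find well under 85% of the issues on average, and the rare-but-important ones tend to show up only with more, and better-recruited, participants.

```python
# Hypothetical illustration: the "issues found" curve when issues are NOT
# equally easy to find. All numbers below are invented for this sketch.
import random

random.seed(7)

# 10 "mainstream" issues most participants would hit, plus 5 issues that only
# surface for a small subgroup (e.g., assistive-technology users who may not
# have been recruited at all).
ISSUE_PROBABILITIES = [0.40] * 10 + [0.05] * 5

def average_share_found(n_participants: int, trials: int = 10_000) -> float:
    """Average share of the 15 issues found across many simulated studies."""
    total = 0.0
    for _ in range(trials):
        found = sum(
            any(random.random() < p for _ in range(n_participants))
            for p in ISSUE_PROBABILITIES
        )
        total += found / len(ISSUE_PROBABILITIES)
    return total / trials

for n in (5, 10, 15):
    print(f"{n:2d} participants -> ~{average_share_found(n):.0%} of issues found")
```

Under these made-up numbers, five participants average only about 70% of the issues, and the curve only approaches the famous 85% once the study is three times larger.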

We usability test to do more than find Human-Centered Design bugs.

We usability test to ensure that our flow, interface, and experience match our target users’ needs, habits, and mental models. We usability test to ensure we have created a great solution to a well-understood and well-defined problem.

We could just test for design bugs, which seems like Minimum Viable Usability Testing. But if we find 100% of the bugs and deliver a feature with no UX debt, yet it doesn’t solve the right problem well (or at all), congratulations on your bug-free design. Have we really succeeded?

The benefits of usability testing go well beyond the number of bugs found.

Including users with accessibility needs and from marginalized communities.

When I first heard some of these early recommendations, like you only need to test with five users, I wondered if that included anybody with accessibility needs. Did it include a dyslexic person? A low-vision person? Someone who doesn’t use a mouse but moves through a digital interface by tabbing or using a foot switch?

We probably didn’t include any of those people. If we included someone who is dyslexic or neurodivergent, or who has other conditions or diagnoses, did we recruit specifically to include them, or were they included by accident? Did we recruit to exclude them?

Similarly, when this paper was written in 1993, we weren’t having the same conversations we have today about gender, sexuality, identities, marginalized communities, immigrant or refugee communities, and others who are probably among your target audiences.

You might not have meant to leave them out, but did you recruit to include them?

In 2023, it’s no longer good enough to talk to some men, some women, some younger people, some older people, and call that a diverse audience.

If we consider the permutations and combinations of gender, age, ethnicity, language proficiency, disabilities, and belonging to or identifying with marginalized communities, a total of five or eight people is unlikely to represent a diverse population.

A sample size of five, even per segment or typology, might be too few if we want to ensure the inclusion of more diverse populations.
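
As a rough back-of-the-envelope illustration (the dimensions and category counts below are my own invented screener, and real people are more than checkboxes), even a few recruiting dimensions multiply into far more combinations than five or eight participants can cover:

```python
# Back-of-the-envelope: how many combinations a handful of recruiting
# dimensions create. Category counts are illustrative assumptions only.
from math import prod

recruiting_dimensions = {
    "gender identity": 3,
    "age band": 4,
    "ethnicity": 5,
    "language proficiency": 3,
    "disability / assistive tech use": 3,
}

combinations = prod(recruiting_dimensions.values())
print(f"{combinations} distinct combinations")                 # 540
print(f"5 participants cover at most {5 / combinations:.1%}")  # ~0.9%
print(f"8 participants cover at most {8 / combinations:.1%}")  # ~1.5%
```

The point isn’t that you need 540 participants; it’s that a single-digit sample can’t touch most of these combinations, so diversity has to be recruited deliberately rather than assumed.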

This also affects “fast discovery research” or “just call a few customers.”

Some authors and trainers might tell you that for great user research and customer feedback, you only need to talk to a few customers each week or each sprint. Is this based on the old, misunderstood NNg claim that five users are all you need to get plenty of insights? Is this sample size too small?

We must be careful that any research we are doing includes a true variety of people, matching the rainbow of current and potential users and customers we could have. Don’t just call three white guys, three happy customers, three current customers, etc. You need to talk to and observe a larger and more diverse participant group.

If we include more people in qualitative studies, that’s too slow. It’s not Agile or Lean!

Please show me anything in Agile, Lean, product management, or the books about getting away from feature factories that wants you to:

  • Skip getting to know target audiences better than you know them now.
  • Ignore the realities and needs of diverse audiences.
  • Have empathy, but only fake #empathy. Just focus on our whitest, richest, happiest, or smartest audiences, and give the middle finger to everybody else.
  • Do a mediocre or bad job now so that you can fail, delay things, and have to fix this later. No time to make it better now, but we’ll take time to make it better later!

Examples from Delta CX projects

Let’s look at this from another angle. My company has had some recent projects where our clients weren’t totally sure who their target audience was. They didn’t yet have personas.

Example #1: A client told us they only knew that some of their customers are B2B and some are B2C. For our generative qualitative research, we recruited B2B and B2C people who tend to shop for and purchase the type of item our client sells. We separated the people we met into six very clear personas/typologies based on their roles and behaviors.

If we had gone with the often quoted (or misquoted) best practice that you only need to speak to a few customers, or just find five people, we would have been unlikely to achieve the results that we did. There is a nearly zero chance that we would have discovered six distinct behavioral personas by talking to five people.

Example #2: Our client knew their target audience was American millennials. For our generative qualitative research, we recruited American millennials who have (at least) a little money left over at the end of each month. The target audience didn’t include people experiencing poverty. After 32 completed sessions, we broke this audience into five behavioral typologies.

Typology 1 was unmotivated. We knew why, and that they probably can’t be motivated around this particular topic. Typologies 2 through 5 represent a process of leveling up; as people gain knowledge and skill, they move into a different typology. Again, we would have been unlikely to notice these patterns across people if we had only interviewed five people. Maybe we would have met five people from typology 3, assumed there’s just one persona here, and given our client some pretty poor advice.

Interestingly, we were the second company to be hired for this project. Before this client hired us, they hired a company to do “user research.” That company spoke to eight New York City millennials making around the same amount of money. The agency gave the client a report that was a fire hose of everything everybody said. On the surface, this report appeared to be a lot of data, but it wasn’t actionable, which made it worthless to the client.

Example #3: A few years ago, we did a research project for a well-known online selling platform. This was planned around six target segments the client provided. Recruiting 8–12 people per segment — and accounting for some no-shows — we ultimately met 71 people.

Each group had some things in common with the other groups but also expressed very different experiences and unique struggles only found within their group. If we went for five people total, or maybe 20–30 people across all segments, we would have missed a lot of the interesting insights we found, as well as the patterns we saw within each segment.

What would be an exception to broadening the sample size for qualitative research?

An exception would be when our research has already shown us that people behave similarly, and we don’t need to meet 12, 20, or 40 people to understand behaviors, tasks, and perspectives.

For example, I recently conducted a study at my current job that included interviews and some evaluative research. We aimed for 40 participants to include more varied populations, including members of marginalized communities.

As these were all current students at the same university, many behaved quite similarly. They used the same tools. They had similar experiences in the area we were discussing. The main behavioral differentiator for this particular research was which degree they were pursuing. For example, students in the nursing program had different behaviors than nearly everybody else we spoke to.

Although everybody had their own stories, and we learned something from each participant, behaviors and perceptions were similar across most of the non-nursing students. Behaviors were similar whether you started at this school or transferred in later. Behaviors were similar whether this was your second year at the school or your last. Behaviors and needs were similar across genders, nationalities, and current locations.

Therefore, in the future, my team agreed that we probably don’t need to meet 40 people to get a representative sample of behaviors, needs, and tasks. We will still recruit for diversity, including people with disabilities and members of marginalized communities.

What should the new best practice for qualitative sample size be?

Here’s where we must balance scope, scale, time, budget, and knowledge. It’s not an easy balance, and I can’t say there’s one right way for everybody.

  • I do not believe that only five people total will find 85% of usability issues. I never believed that we should only meet 3–5 people for a study or “to find out what people want.” That’s not good research.
  • I research and test for more than usability issues. I want to know if we solved the original problem well, and if we created any new problems. That means we have to put an interactively-realistic prototype in front of a larger variety of people.
  • I recruit five or six people per persona, segment, or typology for evaluative research unless or until I know that that is not the right number. Perhaps you need a more diverse population than six people will provide.
  • I believe you can find some usability bugs by showing an interactively-realistic prototype, dev build, or live product to only five people. But that shouldn’t be the only time we test. We should iterate based on what we learn, find new people, and test again. If I am doing a smaller or faster evaluative study, especially unmoderated, I will start with 10–12 well-recruited participants to make sure I have a good variety of experiences.
  • Make sure that people with accessibility needs are testing prototypes, dev builds, or live products that have already been verified to work well with assistive devices. Don’t give people with accessibility needs something that might not function for them at all because nobody checked yet. This might mean more dev build testing than UX prototype testing, but that’s fine; plan it and do it.
  • I recruit eight to twelve people per persona, segment, or typology for generative research unless or until I know that’s not the right number.
  • When a client or project doesn’t have personas, segments, or typologies, or needs to better understand a broad audience, I aim to meet 30 people. Recruit people who behaviorally fit into the target audience, and recruit for diversity. I find that 30 people (give or take) is normally enough that we can see patterns across people and behaviors. We can then create behavioral personas or typologies. (A small planning sketch of these numbers follows this list.)
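
Pulling those rules of thumb together, here’s a small planning sketch. The helper function and the 20% no-show buffer are my own illustrative assumptions; the per-segment numbers come straight from the list above, and they should change as soon as your own research tells you something different.

```python
# Rough recruiting-target calculator based on the rules of thumb above.
# The function, its name, and the 20% no-show buffer are illustrative only.
import math

PER_SEGMENT = {
    "evaluative": (5, 6),    # 5-6 people per persona/segment/typology
    "generative": (8, 12),   # 8-12 people per persona/segment/typology
}
NO_PERSONAS_YET = 30         # broad study when no segments are defined yet

def recruiting_target(research_type: str, segments: int, no_show_rate: float = 0.2):
    """Return a (low, high) range of people to recruit, padded for no-shows."""
    if segments == 0:
        low = high = NO_PERSONAS_YET
    else:
        per_low, per_high = PER_SEGMENT[research_type]
        low, high = per_low * segments, per_high * segments

    def pad(n: int) -> int:
        return math.ceil(n / (1 - no_show_rate))

    return pad(low), pad(high)

# Generative research across 6 segments, like the selling-platform project
# above (which ultimately met 71 people), lands in roughly this range:
print(recruiting_target("generative", segments=6))   # (60, 90)
print(recruiting_target("evaluative", segments=4))   # (25, 30)
print(recruiting_target("generative", segments=0))   # (38, 38)
```

Treat the output as a starting point for the scope, time, and budget conversation, not a quota.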


“The Mary Poppins of CX & UX.” CX and UX Strategist, Researcher, Architect, Speaker, Trainer. Algorithms suck, so pls follow me on Patreon.com/cxcc