
An Introduction to Sample Size and Representativity in Tourism Research


Written by Alexandra V. Pasi, PhD


Alexandra V. Pasi holds a PhD in Mathematics, and has broad industry and academic experience in machine learning, statistics, data science, and quantitative and qualitative modeling and forecasting with applications to a variety of fields including finance, genomics, and geopolitics. She is passionate about communicating the fundamentals of good mathematical and scientific practice to decision-makers looking to make data-informed decisions. 



The modern world of big data has brought the tourism industry a vast array of large, varied, and dynamic data sets. These differ fundamentally in scope, utility, and kind from traditional survey-based data sets, which are much smaller and usually include data from only a couple thousand individuals. By contrast, the modern DMO now has access to over one billion observations per day from more than 25 million mobile devices, representing approximately 5% of total devices. Using geolocation data to understand visitor movement allows organizations to better answer questions about where, when, and how people move throughout a destination.

Understanding the differences between the available sources of data, and how each is best used, is critical to making informed, data-driven decisions. A deeper understanding of the data begins with a discussion of how the data comes to exist, and how it actually relates to what it is presumed to reflect.


Sample Size in the Real World

When a study is conducted and data is gathered on a certain phenomenon, it is usually impossible or impractical to observe every instance of that event or trend. For example, if scientists want to study the safety of a new medicine, it would be prohibitively impractical, not to mention highly unethical, to administer that medicine to the more than 7 billion people on the planet and then monitor them for adverse effects. Instead, researchers, pollsters, and scientists choose a specific sample, or subset, of the population to study, on the underlying assumption that this sample will tend to represent the population overall (the degree to which it does is called the sample representativity).

Having a large enough sample size is key to ensuring that the sample is actually representative of the population it is meant to represent. If the data-collection methodology is well-designed (a key consideration which cannot be taken for granted), then a large sample can be safely assumed to generally reflect the same trends as the overall population. This principle is referred to as the Law of Large Numbers. In a carefully designed and controlled Phase 4 clinical trial, for example, you can expect to see several thousand enrolled participants, accounting for about 0.001% of the US population. And when it comes to car crash tests, the sample is even smaller, at approximately 0.0001% of all cars on the road. The key observation is that we routinely put our lives in the hands of sample sizes which account for roughly 0.001% of the population or less. By comparison, modern geolocation data sets represent a sample approximately 5,000 times larger than the sample sizes routinely used in medicine and car crash testing. In the case of clinical trials and crash tests, of course, there is a great deal of risk and cost involved in performing the studies, which constrains the sample size significantly; this constraint is compensated for by careful and controlled experimental design.
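The comparison above can be checked with simple arithmetic. The short Python sketch below is one possible reading of the figures quoted in this article, and it assumes the "5,000 times larger" figure compares the roughly 5% device coverage with the roughly 0.001% clinical-trial fraction. The variable and function names are illustrative only.

```python
# Approximate sampling fractions quoted in the article (illustrative names).
clinical_trial_fraction = 0.001 / 100   # about 0.001% of the US population
crash_test_fraction = 0.0001 / 100      # about 0.0001% of cars on the road
geolocation_fraction = 5 / 100          # about 5% of total devices


def times_larger(big: float, small: float) -> float:
    """Return how many times larger one sampling fraction is than another."""
    return big / small


ratio_vs_trials = times_larger(geolocation_fraction, clinical_trial_fraction)
ratio_vs_crash = times_larger(geolocation_fraction, crash_test_fraction)

print(f"vs. clinical trials: about {ratio_vs_trials:,.0f} times larger")
print(f"vs. crash tests:     about {ratio_vs_crash:,.0f} times larger")
```

Under these assumptions the ratio against the clinical-trial fraction is about 5,000, and the same arithmetic against the smaller crash-test fraction gives an even larger ratio.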

Such a controlled experimental design is not always possible when trying to observe the actual behavior of individuals in real-world settings. In these cases, having a large and robust sample is critical. Within the travel industry, survey-based research is commonplace, with sample sizes ranging from hundreds to thousands. However, the advent of modern data sets such as mobile geolocation data and credit card transaction data makes it possible to use sample sizes in the tens of millions to analyze the movement and spending habits of travelers within destinations across the US. These large, granular data sets are critical to our ability to answer a broad variety of questions and offer specific, time-sensitive insights into the actual behavior of tourists within particular locations.
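To give a rough sense of what these differences in sample size mean, the sketch below computes the textbook 95% margin of error for an estimated proportion at a few sample sizes. It assumes simple random sampling, which real surveys and geolocation panels do not strictly satisfy, so the numbers describe sampling noise only and say nothing about methodological bias. The function name and the example sample sizes are illustrative.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion.

    Assumes simple random sampling; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes: two survey-scale samples and one
# sample in the tens of millions.
for n in (500, 2_000, 25_000_000):
    moe_points = margin_of_error(n) * 100
    print(f"n = {n:>10,}: +/- {moe_points:.3f} percentage points")
```

Because the margin of error shrinks with the square root of the sample size, moving from thousands to tens of millions of observations reduces sampling noise by roughly two orders of magnitude, provided the methodology itself is sound.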

When dealing with data sets with a smaller sample size, the questions of sample representativity and methodological design become increasingly important, especially in cases in which a controlled experiment is not possible or appropriate. A data set with a small, unrepresentative sample yields ambiguity at best, and misdirection at worst. That is why working with a reputable firm which uses a carefully crafted methodology appropriate to your desired use-case is particularly important.

Sample Representativity

Sample size is not the only factor influencing sample representativity. The methodology by which data is collected is crucial to selecting a representative sample. Every data-collection methodology is prone to its own limitations and introduces its own biases. Understanding the specific bias introduced by a given methodology is therefore the key to choosing the data and tools to solve a particular problem. Data is only as reliable as the methodology through which it was collected, which underscores the importance of working with researchers who are transparent and thorough in communicating their methodology and thoughtful about compensating for the limitations and biases inherent in their data.

Historically, much research within the travel industry has been conducted by survey, and it is critical for sample representativity that the pool of survey participants is representative of the actual population of travelers. Importantly, even if attempts are made to ensure that the sample of people who are eligible to take a survey is as random and representative as possible, there is still a potential for bias based on who actually responds to the survey. The lower the response rate, the stronger this bias tends to be. This is referred to as non-participation bias (also commonly called non-response bias), and it can be statistically corrected to a certain degree by giving more weight to responses from populations that are underrepresented within the sample of survey responses. But if you are starting with a small sample size, the data collected from a small underrepresented population within that sample will be liable to contain an untenable amount of volatility due to its even smaller size, further complicating efforts to correct the non-participation bias.
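The correction described above can be sketched with post-stratification weighting, which is one common technique and not necessarily the method any particular researcher uses. All figures below (population shares, respondent counts, and mean spend per group) are hypothetical and chosen only to illustrate the mechanics. The sketch also computes Kish's approximate effective sample size to show how heavily weighting a small group reduces the precision of the overall estimate.

```python
# Hypothetical population shares and survey respondent counts by age group.
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
respondents = {"18-34": 450, "35-64": 500, "65+": 50}
# Hypothetical mean trip spend (USD) reported by respondents in each group.
mean_spend = {"18-34": 800.0, "35-64": 1200.0, "65+": 1500.0}

total = sum(respondents.values())

# Post-stratification weight: population share divided by sample share.
weights = {
    group: population_share[group] / (respondents[group] / total)
    for group in respondents
}

# Unweighted mean reflects the respondents, not the population.
unweighted_mean = sum(respondents[g] * mean_spend[g] for g in respondents) / total

# Weighted mean re-balances each group to its population share.
weighted_mean = sum(population_share[g] * mean_spend[g] for g in respondents)

# Kish's approximate effective sample size: (sum of weights)^2 / sum of squared weights.
sum_w = sum(respondents[g] * weights[g] for g in respondents)
sum_w_sq = sum(respondents[g] * weights[g] ** 2 for g in respondents)
effective_n = sum_w ** 2 / sum_w_sq

print(f"weights: { {g: round(w, 2) for g, w in weights.items()} }")
print(f"unweighted mean spend: {unweighted_mean:.0f}")
print(f"weighted mean spend:   {weighted_mean:.0f}")
print(f"effective sample size: {effective_n:.0f} of {total}")
```

In this hypothetical example, the 50 respondents aged 65+ each count four times over, so any noise in their answers is amplified and the effective sample size falls well below the nominal 1,000 responses. This is the volatility problem described above.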

A particularly relevant example of non-participation bias is presented in the case of online polling. As phone-survey response rates have plummeted in the past decades, more survey-based research is being conducted through online polling. But online polling, like phone polling, is subject to its own biases. A Pew Research study into online polling found that online polling samples can tend to overrepresent low-income individuals, those who live alone, those without children, and those who receive unemployment benefits. This is of particular note within the context of travel industry research, as low-income adults tend to have very different travel and spending habits than the typical tourist. It is also worth noting that another survey by Pew Research found that 25% of those aged 65+ did not use the internet, and thus this significant portion of older adults (an age group which tends to spend more on tourism than its younger counterparts) is liable to be left out of online polling altogether. Online polling methodologies must be carefully designed in order to compensate for these limitations.

Despite the biases inherent in any data-collection methodology, the combination of self-reported survey-based data with larger behavior-based data sets offers powerful insights into the movement, habits, and beliefs of individuals, creating the opportunity for increasingly targeted and robust insights. 

Matching Methodology with Use-Case in the Tourism Industry

Given that every method of collecting data has its own limitations, matching a given use-case with a particular data set or methodology is critical to gaining meaningful and reliable insight from the data. 

Dr. Sarah Dolnicar of the University of Queensland, Australia, published a 2020 article titled “Survey Research in Tourism: a Perspective Paper” in the journal Tourism Review detailing the methodological limitations of the kind of surveys typically deployed in tourism research. Echoing the existing research on the matter, she identifies several significant challenges faced by survey-based research, including declining response rates, participants not reading and/or understanding the survey questions, fraudulent responses, and the lack of a representative sample. Many of these factors are extrinsic to, and thus cannot be significantly corrected by, improvements in survey design (although it bears mentioning that a lack of careful survey design can easily add an independent and impenetrable layer of noise, bias, and ambiguity on top of the aforementioned general limitations of survey research).

Importantly, she also points out that while well-crafted and well-deployed surveys can be useful tools in helping to answer questions about perception, direct observation is superior when it comes to answering questions of behavior. Of key note here is that what people say and what they do are not as closely linked as one might assume. Modern data sets such as mobile app geolocation data and credit card transaction data allow for observational insight into how tourists actually move through a space and spend money within it.

Another limitation of self-reported perception-based data is that it tends to correlate with more dispersed effects over a much longer timeline, and is less well-suited to identifying and explaining acute, time-sensitive changes in the data. While such data can play a role in understanding general aspects of long-term brand-building, it has extremely limited utility in making real-time decisions or informing and guiding periods of rapid change. The challenge of navigating the ever-evolving tourism market presents a clear example of the necessity of having more granular insights into behavior than perception-based survey data can offer. Such a need can be better met by a careful synthesis of faster-moving, more granular data sources such as geolocation and credit card data with more stable, yet slower and coarser, sources like tax data.

Understanding the nature of the data and the methodology used to generate it is critical to choosing the right tools to answer the right questions. The distinct advantage of using these large, modern data sets is their ability to answer questions about actual behavior, and achieve more granular, generalizable, and dynamic views. In an ever-changing world that seems to be constantly challenging preconceived notions, this kind of insight can help provide an invaluable map by which to successfully navigate new terrain.