Do Not Trust Online Survey Sample Vendors to Give You Clean Data – Clean it Yourself

This information is useful for people who purchase sample for online surveys, and who want to make sure their survey data is truly clean.

Online Survey Panels Tell Us Their Panelists Are Clean

It’s hard to open a marketing magazine without seeing an ad from an online survey panel company proclaiming how clean and high quality their panel is.  Some years ago, this claim was a big deal – it was the Wild West of online survey panels, and buyers of sample had to be very careful as to who they worked with.  Today, however, most major online survey sample companies have adopted measures to get rid of professional respondents, prevent over-surveying, and make sure that respondents are who they say they are.  So, whether the sample is “true” or “pure”, or there’s “attention to detail”, most reputable panel companies are doing a decent job of giving those of us who field surveys a good product.

But Survey Data is Still Dirty

However, and here’s a big however, the data from most online surveys using panel sample still comes in with some dirty responses.  Our research shows that between 5% and 40% of survey data from panel sample is garbage.  Garbage – throw it out; don’t bring it into your final dataset to analyze.  Sure, one can blame some of these dirty responses on frustrated respondents dealing with poor survey writing (bad questions, too long, etc.), but the fact remains that you had better clean that survey data before it goes in for analysis.

So, How Do I Clean the Data?

Here’s a plan you can use to clean your data.

When we say “flag” below, we mean that you create a new variable in your dataset next to the variable you are examining, and you place a “1” in a cell if the respondent’s case is flagged.

  1. Flag speeders. Look at time to completion and flag those respondents who took the survey in an unrealistically short time.  First, trim the top 5% of the distribution on time to complete (the slowest respondents), so that a few very long completion times do not distort the statistics.  Then find the mean and standard deviation (SD), and look at respondents who are more than 2 SDs below the mean.  Now, do a reality check: how long did you think the survey was going to take?  Say it was 10 minutes.  Look at the distribution and see where 5 minutes fits in.  Did a lot of people finish in 5 minutes?  How many people finished in 3 minutes or less?  How do these numbers relate to your cut-off of 2 SDs below the mean?  You should now have a sense of which respondents were so fast that they look suspicious, and which were ridiculously fast.  Create a new variable in your dataset and flag the “fast” respondents with a value of “1” and the “super-fast” respondents with a “2”.
  2. Flag straightliners. If you have any grid/matrix questions, flag those respondents who gave the same response to every item.
  3. Flag gibberish or garbage responses. If you have any open-ended responses, look for text such as “asdf” or “…..”; flag these responses, and any other “colorful, yet meaningless” responses you find.
  4. Flag incongruent combinations. If a respondent says their company size is 1000 and the number of PCs in the company is 5, something’s wrong here.  Flag it.
  5. Check trap questions. Did you include any questions such as “Please choose the third response below”, or “Please type the word ‘attention’ below”?  If you did, check them, and flag those respondents who didn’t follow the directions.
  6. Sum up your flags. Compute a new variable that sums all the flags.
  7. Sort your dataset by the computed variable. Bring the cases with suspicious answers on several of your checks to the top.
  8. Inspect and delete cases with flags. Delete those cases that are too “dirty” to be included.  Review with key stakeholders to agree on deletions.
  9. Notify your vendor of any bogus respondents. None of the vendors we work with charge for respondents we have flagged for deletion.  Show them the IDs of the respondents you threw out, and they’ll take action on their side to warn and/or remove these panelists from their database.
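
To make step 1 concrete, here is a minimal Python sketch of the speeder check. It is an illustration, not a prescribed implementation: the completion times are made up, the function names `speed_summary` and `flag_speeders` are hypothetical, and the 5-minute and 3-minute cut-offs simply reuse the 10-minute example from step 1. In practice you would set those cut-offs by looking at your own distribution.

```python
import statistics


def speed_summary(seconds, trim_top_pct=5):
    """Drop the slowest trim_top_pct% of completes, then return mean, SD, and mean - 2 SD."""
    ordered = sorted(seconds)
    n_trim = len(ordered) * trim_top_pct // 100
    trimmed = ordered[:len(ordered) - n_trim]
    mean = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)  # sample standard deviation
    return mean, sd, mean - 2 * sd


def flag_speeders(seconds, fast_cutoff, super_fast_cutoff):
    """Return one flag per respondent: 2 = super-fast, 1 = fast, 0 = not flagged."""
    flags = []
    for s in seconds:
        if s <= super_fast_cutoff:
            flags.append(2)
        elif s <= fast_cutoff:
            flags.append(1)
        else:
            flags.append(0)
    return flags


# Made-up completion times in seconds for a survey expected to take 10 minutes.
times = [610, 575, 640, 590, 170, 620, 280, 555, 600, 630,
         585, 615, 3600, 560, 605, 595, 625, 570, 580, 650]

mean, sd, stat_cutoff = speed_summary(times)

# Reality check: how many finished in 5 minutes or less, and in 3 minutes or less?
n_within_5_min = sum(1 for s in times if s <= 300)
n_within_3_min = sum(1 for s in times if s <= 180)
print(f"mean={mean:.0f}s  sd={sd:.0f}s  mean-2sd={stat_cutoff:.0f}s")
print(f"<=5 min: {n_within_5_min}   <=3 min: {n_within_3_min}")

# Analyst's judgment call (assumed here): 5 minutes is "fast", 3 minutes is "super-fast".
speed_flag = flag_speeders(times, fast_cutoff=300, super_fast_cutoff=180)
```

Note that on a skewed distribution the value of the mean minus 2 SDs can land very low, or even below zero, which is one reason the reality check against the expected survey length matters.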

Following the steps above will ensure that the data you analyze is as clean as possible.  Yes, it takes a bit of time, but the effort is clearly worth it when compared to making decisions based on the analysis of data that includes bogus responses.
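
As a rough illustration of steps 2 through 7, the sketch below flags three made-up respondents, sums their flags, and sorts the worst cases to the top. Everything specific in it is an assumption for the sake of the example: the field names, the short list of "keyboard mash" strings, the rule that fewer than one PC per 100 employees is incongruent, and the trap word. A speeder flag from step 1 would be summed in the same way.

```python
def is_straightliner(grid_answers):
    """True when every item in a grid/matrix question got the same answer."""
    return len(grid_answers) > 1 and len(set(grid_answers)) == 1


def is_gibberish(text):
    """Crude check for punctuation-only or keyboard-mash open ends."""
    cleaned = text.strip().lower()
    if not cleaned:
        return False  # assumption: a blank is a skip, not gibberish
    if not any(ch.isalnum() for ch in cleaned):
        return True  # nothing but punctuation, e.g. "....."
    return cleaned in {"asdf", "asdfasdf", "qwerty", "xxx"}  # hypothetical list


def is_incongruent(employees, pcs):
    """Hypothetical rule: fewer than 1 PC per 100 employees looks wrong."""
    return pcs < employees / 100


def failed_trap(answer, expected="attention"):
    """True when the respondent did not type the requested word."""
    return answer.strip().lower() != expected


# Made-up respondents; the field names are illustrative only.
respondents = [
    {"id": "R1", "grid": [4, 2, 5, 3], "open_end": "Easy to set up",
     "employees": 1000, "pcs": 800, "trap": "attention"},
    {"id": "R2", "grid": [3, 3, 3, 3], "open_end": "asdf",
     "employees": 1000, "pcs": 5, "trap": "yes"},
    {"id": "R3", "grid": [5, 5, 5, 5], "open_end": "Great support team",
     "employees": 50, "pcs": 40, "trap": "Attention"},
]

for r in respondents:
    r["flag_straight"] = int(is_straightliner(r["grid"]))
    r["flag_gibberish"] = int(is_gibberish(r["open_end"]))
    r["flag_incongruent"] = int(is_incongruent(r["employees"], r["pcs"]))
    r["flag_trap"] = int(failed_trap(r["trap"]))
    # Step 6: sum up the flags.
    r["flag_total"] = (r["flag_straight"] + r["flag_gibberish"]
                       + r["flag_incongruent"] + r["flag_trap"])

# Step 7: sort so the most heavily flagged cases come first.
ranked = sorted(respondents, key=lambda r: r["flag_total"], reverse=True)
for r in ranked:
    print(r["id"], r["flag_total"])
```

The sketch only ranks cases; it deliberately deletes nothing. A respondent with a single flag, such as one straightlined grid, may still be a legitimate case, which is why the inspection and stakeholder review in step 8 stay manual.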

One last note: if you really need your final sample size to hit a specific number, and you can’t go below that number, you can over-sample, in anticipation of throwing out some respondents.
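
The over-sampling arithmetic is simple enough to sketch. Assuming you can estimate the share of completes you expect to discard, the number to field is the target divided by one minus that share, rounded up. The function name `completes_to_field` and the target of 400 are made up for illustration; the two rates are the low and high ends of the 5% to 40% range mentioned earlier.

```python
import math


def completes_to_field(target_clean, expected_discard_rate):
    """Completes to collect so that about target_clean remain after cleaning."""
    if not 0 <= expected_discard_rate < 1:
        raise ValueError("expected_discard_rate must be in [0, 1)")
    return math.ceil(target_clean / (1 - expected_discard_rate))


# Hypothetical target of 400 clean completes.
low_estimate = completes_to_field(400, 0.05)   # if about 5% is discarded
high_estimate = completes_to_field(400, 0.40)  # if about 40% is discarded
print(low_estimate, high_estimate)
```

The actual discard rate is only known after cleaning, so treat the result as a planning figure rather than a guarantee.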

We use the protocol above to ensure that all of our data is as clean as possible.  Contact us to discuss how we can help with your next project.