My online #researchstudy was recently infiltrated by bots. I haven't shared this story publicly because I felt a bit like it was my fault. I'm putting my pride aside because I think #dataintegrity is and will be a growing issue in survey data and is not discussed enough (1/n)
Bots have been a huge threat to data integrity in recent years, and I can't believe that bot protection is not yet a standard part of the data integrity section of IRB submissions. Gone are the days when "checking the quality of the data every few days" will suffice (2/n)
Adding protections INTO your survey may take time and energy (coding, creating advanced branch logic), but it will save you hundreds of hours (and LOTS of money) if you do it right! (3/n)
A bit of my story: within 12 hrs of going live I had over 350 false respondents in my study. I can tell you this now, but it took hundreds of hours and more than a week's worth of work to identify these bots. I'm lucky enough to have a quant background, which made this easier (4/n)
It took 10 coding schemes to reveal the bots. If I hadn't had open-ended questions, I am confident that I would not have identified the bots in my study. So, here are my lessons learned. Lesson 1: REQUIRE open-ended responses (5/n)
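If you want to triage open-ended answers before hand-coding them, a minimal pandas sketch (column names are placeholders, not from any particular platform):

```python
import pandas as pd

# Hypothetical export from the survey platform; column names are placeholders.
df = pd.read_csv("responses.csv")

# Normalize the open-ended text, then flag exact duplicates: bots often paste
# identical or templated answers across "different" respondents.
norm = df["open_response"].fillna("").str.lower().str.strip()
df["flag_duplicate_text"] = norm.duplicated(keep=False) & (norm != "")

# Very short answers to a question that asks for a sentence or two are another
# tell worth reviewing by hand.
df["flag_too_short"] = norm.str.split().str.len() < 3

print(df[df["flag_duplicate_text"] | df["flag_too_short"]])
```

Duplicate and suspiciously short answers are only a starting point; nothing replaces actually reading the responses.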
Lesson 2: Everyone doing online data collection needs to build ***complex and advanced*** logic/attention checks throughout the first sets of surveys (and do NOT cluster them) (6/n)
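The checks themselves live inside the survey (display/branch logic), but scoring them afterward can be this simple. The item names and answer key below are made up:

```python
import pandas as pd

# Placeholder item names and answer key; yours will differ.
ATTENTION_KEY = {"attn_1": "Strongly agree", "attn_2": "Purple", "attn_3": "7"}

df = pd.read_csv("responses.csv")

# Count how many of the scattered attention checks each respondent missed.
df["attn_failures"] = sum(
    (df[item] != correct).astype(int) for item, correct in ATTENTION_KEY.items()
)
df["flag_attention"] = df["attn_failures"] >= 1
```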
Lesson 3: Add "honeypot" items to your survey. These are fields that are hidden from your average participant but visible to bots. Name them in a fashion identical to your other fields so the bots don't catch on (7/n)
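If your survey builder lets you hide a question (display logic or custom CSS), scoring the honeypot afterward is the easy part. A sketch with a made-up field name:

```python
import pandas as pd

df = pd.read_csv("responses.csv")

# "contact_phone_2" is a hypothetical hidden honeypot field, named to blend in
# with the real items. Humans never see it, so any non-empty value is suspect.
honeypot = df["contact_phone_2"].fillna("").astype(str).str.strip()
df["flag_honeypot"] = honeypot != ""

print(f"{df['flag_honeypot'].sum()} respondents filled the hidden field")
```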
Lesson 4: Captchas are not enough. But add them in anyway
Lesson 5: Screen participants and then email those who passed the screener with a unique survey link. This takes more time, but you have to do it. NO PUBLIC LINKS. EVER. DON'T DO IT.
(8/n)
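Many platforms will generate individualized links for you; if yours won't, the idea is roughly this (URL, emails, and token handling are all illustrative):

```python
import csv
import secrets

# Illustrative only: mint a one-time token per screened participant and append
# it to the survey URL you email them. URL, emails, and filenames are made up.
BASE_URL = "https://survey.example.edu/study?token="

screened = ["p001@example.com", "p002@example.com"]  # passed the screener

with open("invites.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["email", "token", "link"])
    for email in screened:
        token = secrets.token_urlsafe(16)  # hard-to-guess, single-use token
        writer.writerow([email, token, BASE_URL + token])

# On the back end, reject any submission whose token is missing, already used,
# or not in invites.csv.
```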
Lesson 6: Flag/prompt participants who are "speeding" through materials
Lesson 7: Ask similar questions at different points in your study to check for inconsistencies (e.g., ask gender twice). A quick sketch covering both checks follows below.
(9/n)
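Both of these are easy to score after the fact. A minimal sketch; the duration field, the repeated items, and the speed cutoff are all placeholders you'd tune to your own survey:

```python
import pandas as pd

df = pd.read_csv("responses.csv")  # column names below are placeholders

# Lesson 6: flag "speeders". duration_seconds is whatever your platform records;
# a cutoff relative to the median is safer than one absolute number, and the
# 0.4 here is only an example.
median_time = df["duration_seconds"].median()
df["flag_speeding"] = df["duration_seconds"] < 0.4 * median_time

# Lesson 7: flag inconsistent answers to the same question asked twice.
df["flag_inconsistent"] = (
    df["gender_t1"].str.lower().str.strip()
    != df["gender_t2"].str.lower().str.strip()
)

print(df[["flag_speeding", "flag_inconsistent"]].sum())
```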
Lessons 8-10: Your study will still have bots.

Check your data. A LOT. Don't blame yourself. Acknowledge bots as a fact of this era of online data collection that affects data integrity, and prepare for them.
That's it for now. Thank you for coming to my #TedTalk.

(10/10)
#AcademicTwitter #DataScience
One last tip - don’t automate participant payment. Thankfully I did not, which allowed me to review my data before paying and avoid compensating bots. Definitely check all of your data integrity markers before compensating anyone, and have an IRB-approved protocol to determine whom to pay
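One way to turn the flags from the earlier sketches into a payment-review step (the two-flag threshold is only an example; the actual rule should be whatever your IRB-approved protocol specifies):

```python
import pandas as pd

# Builds on the flag columns from the earlier sketches; file names are made up.
df = pd.read_csv("responses_with_flags.csv")

flag_cols = ["flag_duplicate_text", "flag_too_short", "flag_attention",
             "flag_honeypot", "flag_speeding", "flag_inconsistent"]

# Example rule only: hold payment for anyone with two or more flags and review
# them manually before compensating.
df["n_flags"] = df[flag_cols].sum(axis=1)
df["hold_for_review"] = df["n_flags"] >= 2

df[df["hold_for_review"]].to_csv("manual_review_before_payment.csv", index=False)
```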
Clarification: this study did not use MTurk
In another update: it seems like Twitter posts about research studies are the main way bot programmers identify studies to spam. I would think twice about posting a study with a large compensation amount without including a phone screen