Anonymizing MTurk WorkerIDs

It may be the case that Amazon Mechanical Turk WorkerIDs are not anonymous. Lease et al., 2013 describe at length how personally identifying information may be exposed when a researcher shares WorkerIDs. It is unclear to me the extent to which Amazon constructs their WorkerIDs at present, given that one of their striking demonstrations did not apply to my WorkerID. That is, they describe simply googling the WorkerID and receiving a picture of the participant, along with their full name. My WorkerID turn up nothing. Though, I have only been a worker on MTurk for a short while, so maybe I’ve been lucky and my ID has just not yet been shared widely.

Regardless, providing extra anonymity to participants isn’t too much trouble. This post serves as documentation for a brief script that takes a sqlite database produced by running an experiment in jspsych + psiturk and replaces all instances of the WorkerID with a more secure code.

The script relies on five R libraries

  1. magrittr, for ease of writing
  2. dplyr, through (dbplyr)[cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html], serves as the way to interface with the sqlite database
  3. openssl constructs a more secure identifier for each participant that can still be used to cross reference them across studies
  4. stringr does the work of replacing instances of the WorkerID with the more secure code
  5. docopt, wraps up the Rscript such that it can be called from the command line (in an environment in which Rscript is the name of a function. i.e., may be Rscript.exe in Windows powershell)
#!/usr/bin/env Rscript
# Anonymize participants database NOTE: always overrides the file --outfile

library(docopt)

doc <- "Usage: 
  anonymize_db.R [-i DBNAME] KEY OUTFILE
  anonymize_db.R -h

Options:
  -i --infile DBNAME     sqlite database filename from which to read [default: participants_raw.db]
  -t --table TABLE       name of table within database to anonymize [default: participants]
  -h --help              show this help text

Arguments:
  KEY                   key to salt WorkerIDs for extra security
  OUTFILE               sqlite database filename to write"


opt <- docopt(doc)

library(magrittr)
library(dplyr)
library(stringr)
library(openssl)

db <- dplyr::src_sqlite(opt$infile) %>%
    dplyr::tbl(opt$table) %>%
    dplyr::collect() %>%
    dplyr::mutate(uniqueid = stringr::str_replace(uniqueid, workerid, openssl::sha256(workerid, 
        key = opt$KEY)), datastring = dplyr::case_when(is.na(datastring) ~ datastring, 
        TRUE ~ stringr::str_replace_all(datastring, workerid, openssl::sha256(workerid, 
            key = opt$KEY))), workerid = openssl::sha256(workerid, key = opt$KEY))
message(paste0("read raw database: ", opt$infile))

con <- DBI::dbConnect(RSQLite::SQLite(), opt$OUTFILE)
dplyr::copy_to(con, db, opt$table, temporary = FALSE, indexes = list("uniqueid"), 
    overwrite = TRUE)

DBI::dbDisconnect(con)
message(paste0("wrote anonymized database: ", opt$OUTFILE))
message(paste0("Store your KEY securely if you want the same WorkerIDs to create the same HMACs!"))

As stated in the initial string of this script, a typical call might be

anonymize_db.R longandsecurelystoredsalt participants.db

which will read in the sqlite database participants_raw.db (default for –infile), convert all instances of WorkerID into a hash-digest with the sha256 algorithm, and store the result in a new sqlite database called participants.db.

The general workflow would be to include in your .gitignore the raw database output by psiturk. That way, the raw database is never uploaded into any repository. Then, when you are ready to host the experiment, you pull your repository as usual. As an extra step, you will now need to separately move around your raw database such that when you run the next experiment psiturk will know which workers have already participated. After collecting data, retrieve the database and run this anonymization script on it. The newly created database can then be bundled with your repository.

This is a bit of extra work (i.e., you must manually send the database, retrieve the database, then anonymize it). However, the whole point is to avoid making it easy to download something with potentially identifying information.

Gotchas

  1. This function will overwrite any database of the same name as OUTFILE. Though, that’s often not an issue. If you’ve anonymized a database (call the result participants.db), added new participants to the same raw database, and then anonymize the raw database again, those participants that were anonymized in the first round will be re-anonymized and included in the new result.

  2. If you want this function to convert a given WorkerID into a consistent code, you’ll need to call it with the same value for KEY.

  3. It would be more secure to use a salt of random length for each participant separately.

View Page Source

Patrick Sadil
Patrick Sadil
Research Associate; Biostatistics Faculty