Help Center

Topic: Recognition

How are duplicate uploads detected?

Last updated: 18 February, 2020


papersurvey.io checks uploads in several ways so that you do not accidentally upload duplicate responses and contaminate your dataset.

Comparing unique/page identifiers

If you choose to use unique page identifiers in your paper surveys, this duplication check is activated by default. If your survey does not use unique identifiers (e.g. a single-page survey), this section does not apply to you.

Before an uploaded page is processed, its unique identifier, page number and survey id are read (e.g. page 1, unique id: 91, survey id: 991). If a page with the same combination has already been processed, the page is marked as a duplicate and its data is not processed twice.
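As a rough illustration of the check described above, here is a minimal Python sketch. It is not the papersurvey.io implementation: the names `PageKey` and `IdentifierDuplicateCheck` are hypothetical, and an in-memory set stands in for whatever storage the service actually uses.

```python
from typing import NamedTuple, Set


class PageKey(NamedTuple):
    """Hypothetical key made of the three values read from a page."""
    survey_id: int
    unique_id: int
    page: int


class IdentifierDuplicateCheck:
    """Remembers which (survey id, unique id, page) combinations were seen."""

    def __init__(self, allow_non_unique: bool = False) -> None:
        # Mirrors the idea of a per-survey setting that switches the check off.
        self.allow_non_unique = allow_non_unique
        self._seen: Set[PageKey] = set()

    def is_duplicate(self, key: PageKey) -> bool:
        if self.allow_non_unique:
            return False  # check disabled: every page is treated as new
        if key in self._seen:
            return True   # same combination was already processed
        self._seen.add(key)
        return False


check = IdentifierDuplicateCheck()
first = check.is_duplicate(PageKey(survey_id=991, unique_id=91, page=1))
second = check.is_duplicate(PageKey(survey_id=991, unique_id=91, page=1))
other_page = check.is_duplicate(PageKey(survey_id=991, unique_id=91, page=2))
print(first, second, other_page)  # False True False
```

In this sketch, the second upload of page 1 is flagged, while page 2 of the same response is still treated as new.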

If you are uploading surveys with unique identifiers but would like to skip this additional check, you can disable it in your survey settings by turning on the 'Allow non-unique page marking identifiers' toggle.

Where it falls through

  • If you accidentally print the same survey copies twice, both print runs carry the same identifiers, so the second set of responses is flagged as duplicates and not processed.
    • To work around this issue, you can:
      • Retry: click the "Retry" button on the uploads page to process the detected duplicates as new responses.
      • Disable unique page identifiers and upload again; a random incremental identifier will be generated instead.
      • Allow non-unique page marking identifiers and upload again; the duplication check will not be active for the survey.

Comparing file hashes

Before processing a page, we calculate the SHA-1* hash of the uploaded file and check our database for a previously processed file with the same hash.

If you uploaded this file previously, it won't be processed again.
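The idea can be sketched in a few lines of Python using the standard `hashlib` module. This is only an illustration under stated assumptions: `HashDuplicateCheck` is a hypothetical name, and an in-memory set stands in for the database lookup.

```python
import hashlib
from typing import Set


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the file contents as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()


class HashDuplicateCheck:
    """Remembers the hashes of files that were already processed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()  # stand-in for a database table

    def is_duplicate(self, data: bytes) -> bool:
        digest = sha1_hex(data)
        if digest in self._seen:
            return True   # identical bytes were uploaded before
        self._seen.add(digest)
        return False


checker = HashDuplicateCheck()
scan = b"stand-in bytes for a scanned page"
first_upload = checker.is_duplicate(scan)
same_file_again = checker.is_duplicate(scan)
rescanned = checker.is_duplicate(scan + b"\x00")  # any byte difference counts
print(first_upload, same_file_again, rescanned)  # False True False
```

Only byte-for-byte identical files match, which is why a second scan of the same sheet is not caught by this check.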

Where it falls through

  • This duplication check will not work if you scan the same page twice: each scan produces a different file and therefore a different hash, even if the images look nearly identical.
  • Modifying the scanned page with image editing software will change its hash.
  • But isn't the SHA-1 algorithm no longer safe?
    • Yes, we are aware that SHA-1 collisions are possible, but this does not compromise our threat model, and SHA-1 remains a suitable method for detecting duplicate uploads.

* SHA-1 (Secure Hash Algorithm 1) produces a 160-bit hash, usually written as 40 hexadecimal digits, that for practical purposes uniquely identifies the file's contents. The hash stays the same even if you rename the file. However, if you open the file in image editing software and draw a line somewhere, a completely different hash will be produced.
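If you want to check this behaviour yourself, the following Python sketch hashes a file, renames it, and then hashes a slightly altered copy. The file names and contents are made up for the example; any real scan would behave the same way.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha1(path: Path) -> str:
    """Return the SHA-1 digest of a file's contents as 40 hex characters."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    scan_bytes = b"stand-in bytes for a scanned page"

    # Hash the original file.
    original = folder / "scan.png"
    original.write_bytes(scan_bytes)
    original_hash = file_sha1(original)

    # Rename the file: the contents, and so the hash, stay the same.
    renamed = folder / "renamed-scan.png"
    original.rename(renamed)
    renamed_hash = file_sha1(renamed)

    # Change a single byte: the hash is completely different.
    edited = folder / "edited-scan.png"
    edited.write_bytes(scan_bytes + b"!")
    edited_hash = file_sha1(edited)

print(len(original_hash))             # 40
print(original_hash == renamed_hash)  # True
print(original_hash == edited_hash)   # False
```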

