
Research Data Management

Pseudonymisation and Anonymisation

Researchers have an ethical and legal duty towards human subjects: participants and their personal data must be protected from any potential harm.

Sensitive data you collect and share must be anonymised to a level sufficient to ensure that participants cannot be identified, even by cross-checking with other datasets. For non-sensitive personal data, simple pseudonymisation (replacing names and other identifiers with codes while keeping a secure codebook) can be sufficient, provided the subject has given informed consent.
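As a rough illustration of the codebook approach, the Python sketch below replaces a name field with a random code and returns the codebook separately. The function name pseudonymise, the field names, the "P-" code format and the sample records are all hypothetical choices made for this example, not a prescribed method.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the identifier in each record with a random code.

    Returns (pseudonymised_records, codebook). The codebook maps each code
    back to the original identifier and must be stored separately and securely.
    """
    code_for = {}   # identifier -> code, so a repeat participant keeps one code
    codebook = {}   # code -> identifier: this is the sensitive part
    output = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = "P-" + secrets.token_hex(4)
            while code in codebook:  # very unlikely collision: draw again
                code = "P-" + secrets.token_hex(4)
            code_for[identifier] = code
            codebook[code] = identifier
        # Copy every field except the direct identifier, then add the code.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant_code"] = code_for[identifier]
        output.append(cleaned)
    return output, codebook


# Made-up sample data for illustration only.
records = [
    {"name": "Alice Example", "answer": 4},
    {"name": "Bob Example", "answer": 2},
    {"name": "Alice Example", "answer": 5},
]
pseudonymised, codebook = pseudonymise(records)
```

The codes are drawn at random rather than derived from the name, so they reveal nothing by themselves. Anyone holding the codebook can reverse the process, which is why the result is pseudonymised data and not anonymous data.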

For information on the legal side, you can check our guide on data protection (in French), which covers aspects of the GDPR (EU law) and the LPD (Swiss law, known in English as the FADP). For information on the ethical side, please check the Research ethics page created by the Research office.

Direct vs Indirect Identifiers

In general, direct identifiers should be removed from your dataset as soon as practical and stored separately. In many cases they should be deleted once your research no longer requires them. Direct identifiers include:

  • Names, pictures
  • Phone numbers, social media ID
  • E-mail or physical addresses
  • ID card or passport numbers
  • Medical file ID, insurance numbers
  • IP addresses
  • and more

You must also check that indirect identifiers do not collectively allow the identification of a research participant, especially for a small sample size. Indirect identifiers include:

  • Age and birth date
  • Gender and sexual orientation
  • Employment and salary
  • Nationality, ethnic group
  • Religion, political affiliation
  • Location (town, zip code)
  • Medical condition
  • and many, many more

Both types of identifiers can appear in qualitative and quantitative data alike, in many different forms.

How to Anonymise Indirect Identifiers

The risk stemming from indirect identifiers is higher with a small target group or sample size. Re-identification risk factors in general include:

  • Individualisation: combining multiple indirect identifiers until only one possible person remains - think of the game "Guess Who?".
  • Correlation: cross-referencing the dataset with other datasets, such as those provided by data brokers.
  • Inference: when a supposedly indirect identifier is actually unique, such as the age of a 110-year-old participant - there are not many of these around.
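One simple way to look for individualisation risk is to count how many participants share each combination of indirect identifiers. The Python sketch below does this. The function name small_groups, the threshold k=3 and the sample records are assumptions made for illustration; the appropriate threshold depends on your data and context.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=3):
    """Return each combination of indirect identifiers shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up sample data for illustration only.
participants = [
    {"age_range": "30-39", "town": "Lausanne", "job": "nurse"},
    {"age_range": "30-39", "town": "Lausanne", "job": "nurse"},
    {"age_range": "30-39", "town": "Lausanne", "job": "nurse"},
    {"age_range": "60-69", "town": "Lausanne", "job": "pilot"},
]

risky = small_groups(participants, ["age_range", "town", "job"], k=3)
for combo, n in risky.items():
    print(combo, "is shared by only", n, "participant(s)")
```

Here the fourth participant is the only one with that combination of age range, town and job, so the combination singles them out even though no field is identifying by itself. This check only covers the dataset itself; it says nothing about correlation with external datasets.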

Methods to reduce the risk include:

  • Data minimisation: at the time of collection, do not collect direct/indirect identifiers you will not actually need.
  • Transformation: rather than keeping dates, set a reference day D=0 and record later events as offsets such as "D+11" to show evolution over time.
  • Encryption (coding): replacing identifiers with a code, with the key being kept separately and/or destroyed later.
  • Generalisation: keeping information about the year rather than exact date, or a wider zip code instead of an address.
  • Aggregation: grouping in a way that remains informative while making identification harder. This can be applied in two ways:
    • Grouping values: age ranges work well, especially "90+" since the number of participants beyond that age will be small.
    • Grouping individuals: you could only keep information for groups rather than individual answers, especially for open data.
  • Randomisation: altering data by
    • Adding random +/- offsets to dates
    • Shuffling values between participants
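Transformation, generalisation and value grouping from the list above can be sketched in a few lines of Python. The function names, the 10-year range width, the two-digit zip prefix and the sample values are hypothetical choices for this example; adapt them to what your analysis actually needs.

```python
from datetime import date


def relative_day(event, day_zero):
    """Transformation: express a date as an offset from day zero, e.g. 'D+11'."""
    return f"D{(event - day_zero).days:+d}"


def generalise_zip(zip_code, keep=2):
    """Generalisation: keep only the leading characters of a postal code."""
    return zip_code[:keep] + "x" * (len(zip_code) - keep)


def age_range(age, width=10, top=90):
    """Aggregation: group ages into fixed ranges, with one open-ended top group."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Made-up values for illustration only.
enrolment = date(2023, 3, 1)
visit = date(2023, 3, 12)

print(relative_day(visit, enrolment))  # D+11
print(visit.year)                      # 2023 (year instead of exact date)
print(generalise_zip("1015"))          # 10xx
print(age_range(34))                   # 30-39
print(age_range(93))                   # 90+
```

Each function discards detail on purpose, so it is worth checking that the coarser values still answer your research question before applying them to the full dataset.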

These operations must be properly described in your documentation, especially if you choose to apply randomisation, since it alters the data itself and affects any analysis based on it.
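The two randomisation operations can be sketched in Python as follows, together with a short note recording what was done. The function names, the 7-day maximum offset, the field names and the layout of the note are all assumptions made for this example, not a required format.

```python
import random
from datetime import date, timedelta


def jitter_dates(dates, max_days, rng):
    """Add a random offset between -max_days and +max_days to each date."""
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]


def shuffle_column(records, field, rng):
    """Shuffle one field's values between participants; other fields are unchanged."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# SystemRandom draws from the operating system's entropy source and cannot be
# seeded, so the offsets cannot be replayed and undone later.
rng = random.SystemRandom()

# Made-up sample data for illustration only.
visits = [date(2023, 3, 12), date(2023, 4, 2), date(2023, 5, 20)]
records = [
    {"id": "P-01", "salary": 52000},
    {"id": "P-02", "salary": 61000},
    {"id": "P-03", "salary": 48000},
]

jittered = jitter_dates(visits, max_days=7, rng=rng)
shuffled = shuffle_column(records, "salary", rng)

# Hypothetical documentation note to ship alongside the dataset.
processing_note = {
    "visit_date": "random offset of up to +/- 7 days applied to each value",
    "salary": "values shuffled between participants",
}
```

Shuffling preserves the overall distribution of a variable but breaks its link to every other variable, so any analysis relating salary to another field would no longer be meaningful. That is the kind of consequence the documentation should state.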

Data Anonymisation Tools

The following tools might be of interest to you:

Tabular data

Qualitative/textual data

Other Guides

Anonymisation, UK Data Service: (with tool for Word)

Deidentification, La Trobe University:

Data anonymisation, Nanyang TU Singapore:

Removing identifiers from data, USYD: