LibGuides: Research Data Management: Anonymising Data

Pseudonymisation and Anonymisation

Researchers have an ethical and legal duty towards human subjects: participants and their personal data must be protected from potential harm.

Sensitive data you collect and share must be anonymised to a level sufficient to ensure that participants cannot be identified, even by cross-checking against other datasets. Since perfect anonymisation is generally impossible, the process is usually referred to as de-identification.

In the case of non-sensitive personal data, simple pseudonymisation (replacing names and other direct identifiers with codes while keeping a secure codebook) can be sufficient with the subject's informed consent.
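As an illustration of the pseudonymisation described above, here is a minimal Python sketch. The function name `pseudonymise`, the field names and the sample records are hypothetical, and the code format (a prefix followed by random hexadecimal characters) is only one possible choice. The sketch returns the cleaned data and the codebook separately so that the codebook can be stored securely, apart from the dataset.

```python
import secrets

def pseudonymise(records, id_field="name", prefix="P"):
    """Replace a direct identifier with a random code.

    Returns (cleaned_records, codebook). The codebook maps each real
    identifier to its code and must be stored securely, apart from the data.
    """
    codebook = {}
    used_codes = set()
    cleaned = []
    for record in records:
        identifier = record[id_field]
        if identifier not in codebook:
            code = f"{prefix}{secrets.token_hex(4)}"
            while code in used_codes:  # collisions are very unlikely, but cheap to rule out
                code = f"{prefix}{secrets.token_hex(4)}"
            used_codes.add(code)
            codebook[identifier] = code
        # Copy every field except the direct identifier, then attach the code.
        row = {key: value for key, value in record.items() if key != id_field}
        row["participant_code"] = codebook[identifier]
        cleaned.append(row)
    return cleaned, codebook

# Hypothetical sample data: the same participant answers twice.
records = [
    {"name": "Alice Martin", "answer": "yes"},
    {"name": "Bob Keller", "answer": "no"},
    {"name": "Alice Martin", "answer": "maybe"},
]
data, codebook = pseudonymise(records)
```

Random codes are used here rather than codes derived from the name (for example a hash), because derived codes could be recomputed by anyone who guesses the name.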

For information on the legal side, you can check our guide on data protection (in French), which covers aspects of the GDPR (EU law) and the LPD/FADP (Swiss law). For information on the ethical side, please check the Research ethics page created by the Research office.

Direct vs Indirect Identifiers

In general, direct identifiers need to be removed from your dataset as soon as practicable and stored separately. They are often deleted once your research no longer requires them. Direct identifiers include:

Names, pictures
Phone numbers, social media ID
E-mail or physical addresses
ID card or passport numbers
Medical file ID, insurance numbers
IP addresses
and more
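Some direct identifiers, such as e-mail and IP addresses, follow recognisable patterns and can be searched for in free text. The Python sketch below is a rough first pass under simplifying assumptions: the two regular expressions are deliberately simple (they will miss unusual e-mail formats and only cover IPv4), and the `redact` function and sample sentence are hypothetical. It does not replace a manual review, since names and other identifiers cannot be caught this way.

```python
import re

# Deliberately simple patterns: a first pass, not an exhaustive detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text):
    """Replace e-mail addresses and IPv4 addresses with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = IPV4.sub("[IP]", text)
    return text

sample = "Contact jane.doe@example.org, last login from 192.168.0.12."
redacted = redact(sample)
print(redacted)  # Contact [EMAIL], last login from [IP].
```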

You must also check that indirect identifiers do not collectively allow the identification of a research participant, especially when the sample size is small. Indirect identifiers include:

Age and birth date
Gender and sexual orientation
Employment and salary
Nationality, ethnic group
Religion, political affiliation
Location (town, zip code)
Medical condition
and many, many more

Both types of identifier can appear, in various forms, in qualitative as well as quantitative data.

How to De-Identify Indirect Identifiers

The risk stemming from indirect identifiers is higher with a small target group or sample size. Re-identification risk factors in general include:

Individualisation: combining multiple indirect identifiers until only one possible person remains - think of the game "Guess Who?"
Correlation: cross-referencing the dataset with other datasets, such as those provided by data brokers.
Inference: when a supposedly indirect identifier is actually unique, such as the age of a 110-year-old participant - there are not many of these around.
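The individualisation risk described above can be estimated by counting how many participants share each combination of indirect identifiers. A combination that appears only once singles out one person. The size of the smallest group is often called k, as in k-anonymity. The Python sketch below assumes hypothetical column names and sample records, and the function name `smallest_group` is illustrative.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Count participants per combination of indirect identifiers.

    Returns (k, unique_combinations), where k is the size of the rarest
    combination and unique_combinations lists those shared by nobody else.
    """
    combos = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    unique = [combo for combo, count in combos.items() if count == 1]
    return min(combos.values()), unique

# Hypothetical sample: four participants, three indirect identifiers.
participants = [
    {"age_band": "30-39", "gender": "F", "town": "Lausanne"},
    {"age_band": "30-39", "gender": "F", "town": "Lausanne"},
    {"age_band": "30-39", "gender": "M", "town": "Lausanne"},
    {"age_band": "90+", "gender": "F", "town": "Morges"},
]
k, singled_out = smallest_group(participants, ["age_band", "gender", "town"])
print(k)            # 1
print(singled_out)  # [('30-39', 'M', 'Lausanne'), ('90+', 'F', 'Morges')]
```

A result of k = 1 means at least one participant can be singled out with these fields alone. This check does not cover the correlation risk, which depends on what other datasets exist.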

Methods to reduce the risk include:

Data minimisation: at the time of collection, do not collect direct/indirect identifiers you will not actually need.
Transformation: rather than keeping dates, set a day D=0 and record later events as offsets (e.g. "D+11") to show evolution over time.
Encryption: replacing identifiers with a code, with the key kept separately and/or (later) destroyed.
Generalisation: keeping information about the year rather than exact date, or a wider zip code instead of an address.
Aggregation: grouping in a way that remains informative while making identification harder. This can be applied in two ways:
- Grouping values: age ranges work well, especially "90+" since the number of participants beyond that age will be small.
- Grouping individuals: you could keep information only for groups rather than for individual answers, especially for open data.
Randomisation: altering data by
- Adding random +/- offsets to dates
- Shuffling values between participants

These operations must be properly described in your documentation, especially if you choose to apply randomisation, which severely affects a dataset.
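Several of the methods listed above can be sketched in a few lines of Python. The function names, the 10-year band width, the 7-day maximum offset and the sample dates are all assumptions chosen for illustration. Suitable values depend on your data and on what your analysis needs to preserve.

```python
import random
from datetime import date, timedelta

def relative_day(event, day_zero):
    """Transformation: express a date as an offset from day D=0, e.g. 'D+11'."""
    return f"D{(event - day_zero).days:+d}"

def generalise_date(event):
    """Generalisation: keep only the year of a date."""
    return event.year

def age_band(age, width=10, top=90):
    """Aggregation of values: fixed-width age bands with an open-ended top band."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def jitter_date(event, rng, max_days=7):
    """Randomisation: shift a date by a random offset of up to +/- max_days."""
    return event + timedelta(days=rng.randint(-max_days, max_days))

def shuffle_column(values, rng):
    """Randomisation: shuffle values between participants (returns a new list)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

day_zero = date(2023, 3, 1)
visit = date(2023, 3, 12)
# Fixed seed only so that this example is repeatable; for real data,
# a publicly known seed would make the offsets reproducible by others.
rng = random.Random(42)

print(relative_day(visit, day_zero))  # D+11
print(generalise_date(visit))         # 2023
print(age_band(34), age_band(93))     # 30-39 90+
print(jitter_date(visit, rng))        # a date within 7 days of the original
print(shuffle_column(["a", "b", "c"], rng))
```

Whichever parameters you choose (band width, maximum offset, which columns were shuffled), recording them in your documentation lets others interpret the dataset correctly.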

Data Anonymisation Tools

The following tools might be of interest to you:

Tabular data

Amnesia (OpenAIRE)
ARX Data Anonymization Tool
Cornell Anonymization Toolkit
sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation, an R package.
- sdcApp: graphical user interface for sdcMicro, for those less comfortable with R programming

Qualitative/textual data

Other Guides

Anonymisation, UK Data Service: https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation.aspx (with tool for Word)

De-identification, La Trobe University: https://latrobe.libguides.com/sensitivedata/deidentification

Data anonymisation, Nanyang TU Singapore: https://libguides.ntu.edu.sg/anon

Removing identifiers from data, USYD: https://libguides.library.usyd.edu.au/datapublication/desensitise-data