TP Transcription Limited and University Transcriptions are expert academic transcribers and preferred suppliers to a large number of universities in the UK, Ireland and around the world. This article explains the differences between anonymisation, de-identification and pseudonymisation as applied to research interviews, focus groups, patient notes and other free-text data containing personal information.
Anonymisation is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified.
An individual may be directly identified from their name, address, postcode, telephone number, photograph, image, or other unique personal characteristic.
An individual may be indirectly identified when particular information is linked together with other sources of information. This can include their place of work, job title, salary, a particular diagnosis or condition, an event (eg a disaster) or presence at a location at a specific time.
The main reason given for most anonymisation relates to GDPR (data protection regulations). Once data is completely anonymised and individuals are no longer identifiable, the data will not fall within the scope of the GDPR and it becomes easier to use, hence the regular request by our academic clients for some element of anonymisation, depending on the project in question. There are of course plenty of other reasons for it, including a promise by researchers to interviewees that their identity will not be revealed at any point.
While there may be incentives for some organisations to process data in anonymised form, this technique may devalue the data so that it is no longer useful for some purposes. Therefore, before anonymisation, consideration should be given to the purposes for which the data is to be used.
Free Anonymisation Example
We can remove names and/or places, highlight them so you can decide which to remove and which to leave in, or anonymise them for you.
“Hello, my name is <Anna> and I live in <Manchester>”
“Hello, my name is <Name> and I live in <Place>”
Please contact us to discuss your requirements – we are very experienced at all forms of anonymisation. Usually we do not charge for the service, provided we do it as we transcribe, but other anonymisation will require an extra charge due to the time it takes for us to filter the text once we have transcribed the recording.
Very often if researchers need to share participant notes or interview transcripts the data will need anonymising.
The UK Information Commissioner’s Office lists the following reasons for considering anonymisation:
- developing greater public trust and confidence that data is being used for the public good, while privacy is protected
- incentivising researchers and others to use anonymous information instead of personal data, where possible
- economic and societal benefits deriving from the availability of rich data sources
The best way to protect your participants’ privacy may be not to collect certain identifiable information at all – easier said than done when interviewing of course! The second best is anonymisation, which allows data to be shared whilst protecting participants’ personal information.
Anonymisation should be considered in the context of the whole project and how it can be utilised alongside informed consent and control of access to data. Of course if a participant consents to their data being shared then the use of anonymisation may not be required. We strongly recommend asking the question – a blanket use of anonymisation without any reason can be time consuming for all concerned.
The Consortium of European Social Science Data Archives has produced a best practice guide for anonymising quantitative and qualitative data. They have also generated a guide to other sources of guidance.
Summary of best practices for anonymising quantitative data (CESSDA)
- Remove or aggregate variables, or reduce the precision or detailed textual meaning of a variable;
- Aggregate or reduce the precision of a variable such as age or place of residence;
- Generalise the meaning of a detailed text variable by replacing potentially disclosive free-text responses with more general text;
- Restrict the upper or lower ranges of a continuous variable to hide outliers if the values for certain individuals are unusual or atypical within the wider group researched.
Summary of best practices for anonymising qualitative data (CESSDA)
- Use pseudonyms or generic descriptors to edit identifying information, rather than blanking out that information;
- Plan anonymisation at the time of transcription or initial write-up;
- Use pseudonyms or replacements that are consistent throughout the research team and the project;
- Use ‘search and replace’ techniques carefully so that unintended changes are not made, and misspelt words are not missed;
- Identify replacements in text clearly, for example with [brackets] or using XML tags such as <seg>word to be anonymised</seg>;
- Create an anonymisation log (also known as a de-anonymisation key) of all replacements, aggregations or removals made and store such a log securely and separately from the anonymised data files.
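The ‘search and replace’ and logging advice above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the names, pseudonyms and [bracket] convention are invented for the example, and it will not catch misspelt identifiers, which still need manual review.

```python
import re

# Hypothetical replacement map agreed across the research team
# (the names, places and pseudonyms here are invented for illustration).
replacements = {
    "Anna": "[Participant 1]",
    "Manchester": "[City A]",
}

def anonymise(text, mapping):
    """Replace each identifier with its bracketed pseudonym, using word
    boundaries so partial matches are left untouched. Returns the edited
    text plus a log of replacements, to be stored securely and separately
    from the anonymised file (the 'anonymisation log' described above)."""
    log = []
    for original, pseudonym in mapping.items():
        pattern = re.compile(r"\b" + re.escape(original) + r"\b")
        text, count = pattern.subn(pseudonym, text)
        if count:
            log.append((original, pseudonym, count))
    return text, log

edited, log = anonymise("Hello, my name is Anna and I live in Manchester.", replacements)
print(edited)  # Hello, my name is [Participant 1] and I live in [City A].
```

Keeping the whole mapping in one shared dictionary is what makes replacements consistent across the team and across transcripts.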
GDPR – EU and UK
The EU regulation’s Recital 26 defines anonymous information as ‘…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable’.
The key point is that the GDPR does not apply to anonymised information, which is why anonymisation is such an important function. If the GDPR does not apply, the data can be utilised in less restrictive ways.
The ICO’s Code of Conduct on Anonymisation provides further guidance on anonymisation techniques – including the suggestion of applying a ‘motivated intruder’ test to ensure the adequacy of de-identification techniques.
- The UK Information Commissioner’s Office has a very useful introductory guide to anonymisation.
- The Text Anonymization Helper Tool – developed by the UK Data Archive – runs using Microsoft Word Macros.
- The UK Anonymisation Network (UKAN) – a network coordinated by Manchester University and Southampton University to promote best practice in anonymising data.
- UK Data Archive – https://dam.ukdataservice.ac.uk/media/622417/managingsharing.pdf
De-identification is the process of removing or obscuring personally identifiable information (PII) from a text or dataset. This data tends to include names, locations and contact details. The process can be approached in a number of ways, but the output is often along the lines of:
a. the masking of PII with labels (“my name is Anna” becomes “my name is <NAME>”)
b. the replacement of PII with dummy data (“my name is Anna” becomes “my name is Alan”)
Example of de-identification:
Speaker 1: hi John, good to see you again. How was your weekend? Did you get up to much?
Speaker 2: Yes, all good thanks. I was with my mum in Chelmsford. We saw that Harry Potter film. What’s it called? Then got a couple of drinks at the Slug & Lettuce in Ilford.
Speaker 1: That’s close to your flat, right?
Speaker 2: Yes, about ten minutes away from my flat in James Street. It was my mum’s birthday on Sunday. She’s got a new job at Aldi in Romford.
Speaker 1: hi PER, good to see you again. How was your weekend? Did you get up to much?
Speaker 2: Yes, all good thanks. I was with my mum in LOC. We saw that Harry Potter film. What’s it called? Then got a couple of drinks at the Slug & Lettuce in LOC.
Speaker 1: That’s close to your flat, right?
Speaker 2: Yes, about ten minutes away from my flat in ADD. It was my mum’s birthday on Sunday. She’s got a new job at PLA in LOC.
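The masking in the example above can be sketched as a dictionary-driven replacement in Python. This is a toy illustration only – real de-identification typically combines a trained model (such as NLM-Scrubber, below) with manual checking – and the terms, category labels and dummy values are assumptions drawn from the example:

```python
import re

# Toy PII dictionary mapping each identifier to a category label,
# using the terms and labels from the example transcript above.
pii = {
    "John": "PER",
    "Chelmsford": "LOC",
    "Ilford": "LOC",
    "James Street": "ADD",
    "Aldi": "PLA",
    "Romford": "LOC",
}

# Fictitious stand-in values per category, for option (b) dummy replacement.
dummies = {"PER": "Alan", "LOC": "Townsville", "ADD": "High Street", "PLA": "ShopCo"}

def deidentify(text, mode="label"):
    """Replace each known identifier with its category label ('label'),
    as in option (a), or with a dummy value of the same category
    ('dummy'), as in option (b)."""
    for term, category in pii.items():
        replacement = category if mode == "label" else dummies[category]
        text = re.sub(r"\b" + re.escape(term) + r"\b", replacement, text)
    return text

line = "I was with my mum in Chelmsford."
print(deidentify(line))           # I was with my mum in LOC.
print(deidentify(line, "dummy"))  # I was with my mum in Townsville.
```

A dictionary approach only catches identifiers you have already listed; unexpected names, nicknames and misspellings are why model output and manual review are still needed.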
NLM-Scrubber – Free De-identification Tool
The NLM-Scrubber is a free clinical text de-identification tool designed and developed at the National Library of Medicine in the US. The aim of the tool is to enable clinical scientists in the US to access clinical health information that is not associated with the patient, by following the Safe Harbor principles outlined in the HIPAA Privacy Rule. HIPAA stands for the Health Insurance Portability and Accountability Act of 1996, essentially the US equivalent of the GDPR for clinical data.
The tool can be used by all researchers – link is here – https://lhncbc.nlm.nih.gov/scrubber/download.html
NB: outputs reliant on pre-trained models should always be checked for errors – and you may well need to apply de-identification manually to your texts. We can assist – please ask for de-identification when placing your order.
Pseudonymisation is not the same as anonymisation. The definition is given below, but essentially it involves removing personal data and replacing it with a code that can be re-attached at any point by someone holding both the original data and the replacement key.
Pseudonymisation is defined within EU GDPR as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual” (Article 4(5)).
Example of Pseudonymisation of Data (taken from the Irish Data Protection Commission website):
| | Student Name | Student Number | Course of Study |
| --- | --- | --- | --- |
| Original Data | Joe Smith | 12345678 | History |
| Pseudonymised Data | Candidate 1 | XXXXXXXX | History |
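The transformation in this table can be sketched in Python. The field names are illustrative rather than a real schema; the point is that the re-identification key is generated alongside the pseudonymised output but must be stored securely and separately from it:

```python
# Sketch of pseudonymising student records while keeping the
# re-identification key separate (field names are invented for illustration).
records = [
    {"name": "Joe Smith", "number": "12345678", "course": "History"},
]

def pseudonymise(rows):
    key = {}      # the re-identification key: store securely, apart from the data
    output = []
    for i, row in enumerate(rows, start=1):
        pseudonym = f"Candidate {i}"
        # The key links each pseudonym back to the original identifiers.
        key[pseudonym] = {"name": row["name"], "number": row["number"]}
        # The shared dataset keeps only the pseudonym and masked number.
        output.append({"name": pseudonym, "number": "X" * 8, "course": row["course"]})
    return output, key

data, key = pseudonymise(records)
print(data)  # [{'name': 'Candidate 1', 'number': 'XXXXXXXX', 'course': 'History'}]
print(key["Candidate 1"]["name"])  # Joe Smith
```

Only someone holding `key` can reverse the process, which is exactly why pseudonymised data remains personal data under the GDPR, as discussed below.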
Pseudonymisation is not GDPR Exempt
Pseudonymisation essentially means that anyone with access to the additional key data is able to identify the data subject by cross-referencing. Unsurprisingly, unlike anonymisation, pseudonymisation techniques will not exempt data controllers from the scope of GDPR. However, the process does help academic institutions meet their data protection obligations under UK and EU GDPR, particularly the principles of ‘data minimisation’ and ‘storage limitation’ (Articles 5(1)(c) and 5(1)(e)), and processing for research purposes for which ‘appropriate safeguards’ are required.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments (EU Recital 26).
Recital 26 provides that “Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
Both the above sections of Recital 26 mean that pseudonymised personal data can still fall within scope of the GDPR. The UK Information Commissioner’s Office has given an example of pseudonymisation and refers to its use in these circumstances as good practice for the purposes of data protection.
A delivery/courier firm processes personal data about its drivers’ mileage, journeys and driving frequency. It holds this personal data for two purposes:
- to process expenses claims for mileage; and
- to charge their customers for the service.
For both of these, identifying the individual couriers is crucial.
However, a second team within the organisation also uses the data to optimise the efficiency of the courier fleet. For this, the identification of the individual is unnecessary.
Therefore, the firm pseudonymises the data by replacing identifiers (drivers’ names, job titles, location data and driving history) with a non-identifying equivalent such as a reference number which, on its own, has no meaning.
The members of this second team can only access this pseudonymised information. The delivery firm can of course, as the data controller, link the material back to the identified individuals.
The motivated intruder test
Where ‘de-identified’ or pseudonymised data is in use there is always a residual risk of re-identification, hence the GDPR regulations still being applicable. The motivated intruder test can be used to assess the likelihood of this. Once assessed, a decision can be made on whether further steps to de-identify the data are necessary. By applying this test and documenting the decisions, the study will have evidence that the risk of disclosure has been properly considered; this may be a requirement if the study is audited.
Advice on applying the motivated intruder test (MIT) involves researchers thinking about who an intruder might be (internal or external) and what their motivations might be: for example, a disgruntled employee, an individual attempting to discredit the research team, or an investigative journalist. Researchers should then look at what measures are being taken to protect the data from these threats.