Data Research

Audience

This page is intended for use by students and researchers in the University of Cambridge Schools of Technology and Physical Sciences who do research with data relating to living identifiable individuals.

There is a separate page describing survey methods such as questionnaires and interviews. It is part of a larger set of research guidance pages on work with human participants.

Checklist of risks to be addressed in ethics applications:

How could data subjects be identified by the researchers or others?
What is the basis for direct or presumed consent?
Has the dataset been acquired from previous research or elsewhere?
How sensitive is the data being collected, and what impact could it have on the data subjects if its security was compromised?
Will consent be requested for publication or reuse of the data?
Will the research comply with (local) regulatory constraints beyond UK legislation?

This guidance page is about research with data relating to living identifiable individuals. Use of personal data is governed by UK law under the UK General Data Protection Regulation (UK GDPR) and the accompanying Data Protection Act 2018. All research must be legal, but compliance with UK GDPR is not in itself sufficient to define the scope of ethical data research.

This page does not include guidance on how to do research surveys, experiments, fieldwork or software releases. These often do collect and/or produce personal data, but also involve other ethical considerations that are discussed in more detail on other guidance pages. However, some guidance is given later in this document on the construction of new research datasets for reuse or publication, which can be read in conjunction with the specific guidance page on the research method being used to collect the data.

Identifiable data subjects

How could data subjects be identified by the researchers or others?

Any piece of data that can be related to a specific individual makes that person a participant in your research (or “data subject”), bringing ethical obligations as well as legal constraints under data protection legislation. In addition to directly identifying information like someone’s name or a photo of their face, individuals could be identified by their street address, IP address, email, social media account, MAC address, phone number, GPS location, set of friends, employer or even a distinctive phrase from a searchable web page, or some set of data correlations unique to that person.

If personal information is not included in the research data set (for example collections of behavioural measurements), the data may be treated as “pseudonymous” (or “pseudo-anonymous”), by keeping identifying information (e.g. consent forms, payments, contact emails) secure and separate from the main data set. Don’t forget that service providers and other agencies may also collect or retain data that could be used to identify your data subjects or compromise anonymity if you are working with data about individuals. It is difficult to guarantee that data related to individuals could never be related to that person by anyone, meaning that many data sets must be considered pseudonymous. Detailed advice on anonymisation is available in a code of practice from the UK Information Commissioner’s Office (ICO).
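The separation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the field names, the `split_record` helper and the `key_table` structure are all hypothetical, and a real study would keep the key table in separately secured storage.

```python
import secrets

# Hypothetical field names; adjust to the identifying fields in your own study.
IDENTIFYING_FIELDS = {"name", "email"}


def split_record(record, key_table):
    """Split one raw record into a pseudonymous row, storing identity separately.

    `key_table` maps a random participant key to the identifying fields and
    should be stored securely, apart from the research data.
    """
    key = secrets.token_hex(8)  # random key, not derived from the identity
    key_table[key] = {f: v for f, v in record.items() if f in IDENTIFYING_FIELDS}
    row = {f: v for f, v in record.items() if f not in IDENTIFYING_FIELDS}
    row["participant_key"] = key
    return row


key_table = {}
raw = {"name": "A. Participant", "email": "a@example.org", "reaction_ms": 412}
row = split_record(raw, key_table)
```

The resulting `row` is pseudonymous rather than anonymous: the remaining measurements, or data held by third parties, may still allow someone to relate it to an individual.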

Consent

What is the basis for direct or presumed consent?

As with any other research participants, data subjects must consent to have their data used in research. Wherever possible, this should involve direct consent, where the person has explicitly agreed that their data can be used for this research purpose. Notes later on this page describe the kinds of purpose for which consent might be requested if data is going to be published or reused. It is also a core principle of UK data protection legislation that personal data can legally be used for the purposes agreed by the data subject, but not for other purposes.

Obtaining personal data without direct consent

Scraping data from the Internet: Many researchers “scrape” personal data from the Internet, e.g. social media posts or photographs containing identifiable individuals. The starting point for such research must be the terms and conditions of the sites that are being scraped. Some sites have terms of use that prohibit web scraping, while others (e.g. Twitter) require researchers to use a specific API. However, the fact that data may be publicly and legally available on the Internet does not automatically make its use ethical. The key question is whether the person concerned would reasonably expect their data to be used in this way. Was the information intended for the general public, or was it ‘private’ information, or intended only for friends or family, that has become publicly accessible - perhaps because the user made an error in their account settings, or because the end-user licence agreement (EULA) was not clear? Does the data relate to sensitive matters? Were photographs captured in public places or in private? It may not be easy to answer these questions, but they are ethically important. Web scraping research should be designed with care, choosing which sites are scraped to be consistent with the principle of expectation and intended use.

Obtaining data from an organisation: Alternatively, personal data for research might be obtained from an organisation, e.g. data from customers. In that case, the company supplying the data is responsible for checking that it can legitimately be supplied for academic purposes. This may often be done by reference to their EULA terms, for example using data for "product improvement" which might include research toward future product development. The company will also be responsible for its own data protection compliance, so may prefer to remove directly identifying information before sharing the data set with a researcher. As noted, such data is usually only pseudo-anonymous, and may even directly identify the subjects through information such as location, user ids, photographs etc. Note that data obtained from the NHS will always be subject to NHS ethics approval.

Use of existing datasets

Has the dataset been acquired from previous research (or elsewhere)?

Many researchers work with datasets that have been created elsewhere, which may even be very widely used (for example in vision research competitions or student training sets for machine learning), but still include personal data such as photographs and are thus subject to ethical review. There have been several high-profile cases where major datasets of this kind have been withdrawn, at great inconvenience, after researchers realised that they raised serious ethical problems, so it is worth taking care at the start of a project.

Where a particular dataset is used repeatedly in Cambridge, a streamlined ethical review process is appropriate. It is suggested that groups who repeatedly conduct studies with the same dataset keep records relating to each dataset, recording the following metadata:

Source: How was the dataset obtained? Is it explicitly open? Which individuals / institutions are responsible for the publication or supply to Cambridge?
Origin: Where did the dataset creators originally get the personal data? (Note as above that “on the internet” is not adequate to determine ethical provenance).
Terms: Is the permitted use defined by a signed agreement or published terms of use? (Provide a link to the agreement or terms applied).
Consent: Was informed consent given by the data subjects for research uses of their data, and if not, what is the presumed basis for consent? Does that original consent cover the intended re-use?
Complaint: Could the data subjects (whether anonymised or not) conceivably object to the new use of their data? By what mechanism could they request withdrawal / opt-out?
Bias: What population does this dataset represent? Who might be disadvantaged by this new application of the existing data?
Review: When was the above information most recently reviewed, for example to confirm that the dataset has not been withdrawn or modified?

Where the same dataset is used multiple times for the same purpose, it should be adequate for an ethics review body to review answers to the above questions once, with subsequent uses of the same dataset simply reported to the committee if there has been no change to the dataset and no change in the way it is being used. In cases where the publisher of an externally sourced dataset has not provided necessary information (e.g. regarding origin, consent, bias), they should be contacted to request further information. That request letter, and response actions to it, should be lodged with Cambridge records of dataset usage.
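One possible way for a research group to keep such records is sketched below in Python. The `DatasetRecord` class and `missing_information` helper are illustrative assumptions rather than a University-defined format. The field names simply follow the headings listed above, and the example values are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetRecord:
    """One record per reused dataset; field names follow the headings above."""
    name: str
    source: Optional[str] = None
    origin: Optional[str] = None
    terms: Optional[str] = None
    consent: Optional[str] = None
    complaint: Optional[str] = None
    bias: Optional[str] = None
    review: Optional[str] = None  # date of most recent review (placeholder format)


def missing_information(record):
    """Return the names of headings that have no answer recorded yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DatasetRecord(
    name="example-faces",  # hypothetical dataset
    source="Downloaded from the publisher's site under published terms",
    terms="https://example.org/terms",  # placeholder link
    review="2024-01-31",  # placeholder date
)
print(missing_information(record))  # ['origin', 'consent', 'complaint', 'bias']
```

Headings reported as missing are the ones for which the dataset publisher would need to be contacted before the record is submitted for review.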

Creating new research datasets for reuse or publication

Will consent be requested for publication or reuse of the data?

This section should be read alongside the other ethics guidance pages for the School of Technology, covering ethical questions in the data collection process - e.g. data might be collected through surveys, interviews, field recordings, controlled experiments etc.

University policy for data management is that all research projects should include a Data Management Plan (DMP): research staff and students are responsible for ensuring that legal, ethical and commercial constraints on release of research data are considered at the initiation of the research process and throughout both the research and data life cycles. Where the research involves personal data and identifiable data subjects, the DMP should document arrangements to be made with participants (consent relating to data storage and sharing) and collaborators (e.g. over IP).

Where data relating to individuals is being collected with the intention of reuse or publication, for example where the funder requires data to be uploaded to a public repository, consent should be obtained from the data subjects.

Considerations for consent might include the following, among other questions relevant to your particular area of research:

I consent to my [ name / profile / contacts / location / photograph / ... ] being recorded in a dataset for use in future research [ by the same research group / named researchers at other institutions / any researcher agreeing to the terms / any person who needs it ]
I consent to my [ name / profile / contacts / location / photograph / ... ] being used to illustrate [ student reports / academic presentations / publications / blogs and websites / commercial use / popular media ]
I consent to the [ image / text / audio ] I have contributed being [ reused / remixed ] [ with / without ] attribution for [ noncommercial / any ] purposes
I consent to my [ anonymised / identifying ] data being published in a data repository where it will be publicly available.
I understand that I will only be able to request deletion of data [ within timescale / circumstances / before aggregation / … ]
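Because consent is tied to specific purposes, it can help to record each participant's choices in a form that can be checked before data is reused or published. The following Python sketch assumes hypothetical purpose labels and a hypothetical `ConsentRecord` structure. It is one way to do this, not a required format.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    participant_key: str
    # Purposes the participant agreed to; the labels here are hypothetical.
    purposes: set = field(default_factory=set)


def usable_for(records, purpose):
    """Return keys of participants whose consent covers `purpose`."""
    return [r.participant_key for r in records if purpose in r.purposes]


records = [
    ConsentRecord("p01", {"future_research_same_group", "public_repository"}),
    ConsentRecord("p02", {"future_research_same_group"}),
]
print(usable_for(records, "public_repository"))  # ['p01']
```

Here only participant `p01` agreed to publication in a public repository, so only their data would be included in a deposit.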

Note that deposit in the Cambridge Apollo repository requires evidence of consent for datasets that include personal data.

Transfer of data to other institutions outside of open data publication may be defined by a Materials Transfer Agreement (MTA). The Research Operations Office Contracts Team can advise on preparation of an MTA.

Sensitive data and security

How sensitive is the data being collected, and what impact could it have on the data subjects if its security was compromised?

Where data subjects do not expect the data to be published, there is a risk of harm from data being revealed (whether accidentally or maliciously). Greater harm may result if the identifying records linked to pseudo-anonymous data are compromised.

Special precautions are likely to be necessary in compiling commercially sensitive data, data revealing illegal activity, or data revealing unethical activity. Specialist advice can be sought as appropriate from Cambridge Enterprise, legal staff in the Cambridge Research Office, or the Cambridge Cybercrime Centre.

Note that UK GDPR (Article 9) defines 8 special categories of sensitive data: racial or ethnic origin; political opinions; religious or philosophical beliefs; trade union membership; genetic data; biometric data; data concerning health; and data concerning sex life or sexual orientation. Research that collects any of these will require special ethical scrutiny and justification.
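When planning a dataset, it may help to tag each field with any special category it falls under, so that the fields needing extra justification are easy to list in an ethics application. The Python sketch below is only an illustration: the category labels abbreviate the list above, and the field names and the `fields_needing_scrutiny` helper are hypothetical.

```python
# Labels abbreviate the special categories listed above.
SPECIAL_CATEGORIES = frozenset({
    "racial or ethnic origin",
    "political opinions",
    "religious or philosophical beliefs",
    "trade union membership",
    "genetic data",
    "biometric data",
    "health",
    "sex life or sexual orientation",
})


def fields_needing_scrutiny(field_categories):
    """Return sorted names of fields tagged with a special category.

    `field_categories` maps each field name to a category label, or None.
    """
    return sorted(
        name for name, category in field_categories.items()
        if category in SPECIAL_CATEGORIES
    )


# Hypothetical fields for a planned study.
tags = {"heart_rate": "health", "reaction_ms": None, "fingerprint": "biometric data"}
print(fields_needing_scrutiny(tags))  # ['fingerprint', 'heart_rate']
```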

The University Information Security team offers a Data Protection Impact Assessment, in which risks are considered in terms of the impact on data subjects should data be compromised. This leads to a classification in terms of the number of people affected and the severity of the impact, e.g. Levels 0 and 1 for minor impacts on a single individual or on fewer than 100 individuals, while Levels 2 and 3 relate to minor impacts on more than 100 or 1000 individuals, or to severe impacts on one person or on larger numbers.

If risks to data subjects are at the higher levels, you should seek advice from the University Information Security team. Their guidance notes, for example, that platforms such as Moodle and Qualtrics, and storage of data on USB drives, are not recommended for risk classification levels 2 and 3, while University cloud storage and encrypted USB drives should be adequate for levels 0 and 1.

Legal considerations

Will the research comply with regulatory constraints?

Presumed consent: In situations where it is not possible to obtain direct consent, UK legislation allows some basis for presumed consent, where academic research is carried out in the public interest. However, the provisions are often quite complicated.

The University’s Research and Information Compliance Offices have provided more detailed advice on academic research involving personal data, aimed at researchers in all disciplines.

Research in other countries: If your research involves participants, activity or data processing in places other than the UK, UK data protection legislation may apply alongside other local legal requirements. For example, some countries do not allow any personal data to leave the country, and there are others that do not allow foreign researchers to collect data without permission.

The University does not currently maintain detailed records of data protection legislation or legal constraints on research in other countries, so it will be the responsibility of the researcher, in the first instance, to determine legal constraints in other countries where they are working. In addition to questions of legality, you may need to consider whether researchers or data subjects may be exposed to potential harm in other countries, for example as a result of local laws on libel, IP, national security etc.

Collecting network data: Monitoring of network traffic is subject to legal constraint under the Telecommunications (Interception of Communications) Regulations 2000. Researchers should also consider the rules on Authorization of the Use of the Cambridge University Data Network (CUDN), including the terms of use for the UK Joint Academic Network (JANET) and its Acceptable Use Policy.
