The Information Commissioner’s Office published draft guidance on privacy-enhancing technologies that can be used to comply with privacy-by-design requirements.

By Gail Crawford, Fiona Maclean, Irina Vasile, and Amy Smyth

On 7 September 2022, the Information Commissioner’s Office (ICO) published draft guidance on privacy-enhancing technologies (Draft Guidance), in which it explains what privacy-enhancing technologies (PETs) are and how organizations can use them to meet privacy-by-design requirements. PETs incorporate data protection principles by (amongst other things) minimizing the use of personal data, ensuring security, and facilitating data subject rights. Organizations that want to use PETs should first conduct a data protection impact assessment to determine whether such technologies are indeed adequate for their processing activities.

According to the Draft Guidance, PETs are particularly suitable in contexts that involve large-scale collection and analysis of personal data, such as artificial intelligence applications, Internet of Things, and cloud computing services. The ICO specifically states that PETs are not a “silver bullet”, and does not impose a specific obligation regarding their use. Further, the ICO flags certain downsides associated with these technologies, such as lack of scalability, lack of sufficient information/research regarding some PETs, and the potential for inadequate or erroneous implementation.

The ICO classifies PETs into three categories:

  • PETs that derive or generate data that reduces or removes the identifiability of individuals, which aim to weaken the connection between an individual in the original personal data and the derived data
  • PETs that focus on hiding or shielding data, which aim to protect individuals’ privacy while not affecting data utility and accuracy
  • PETs that split datasets or control access to certain parts of the data, which aim to minimize the volume of shared data and ensure security whilst not affecting the utility and data accuracy

The Draft Guidance sets out various types of PETs (without providing an exhaustive list), their associated advantages/disadvantages, and some example use cases. Below is a brief summary of the PETs analysed by the ICO.

Homomorphic encryption (HE)

HE allows entities to perform computations on encrypted data without first decrypting it; the results of those computations remain in encrypted form. Once decrypted, the result is an output identical to what would have been produced if the computation had been performed on the original plaintext data.
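
To make the property concrete, below is a minimal Python sketch of an additively homomorphic scheme in the style of Paillier. It is not drawn from the Draft Guidance, and the hardcoded primes are toy values far too small for real use; production systems rely on vetted cryptographic libraries. The key point it demonstrates is that multiplying two ciphertexts produces a ciphertext of the sum of the plaintexts.

```python
import math
import random

# Toy Paillier keypair with small hardcoded primes (illustration only;
# real deployments use 2048-bit or larger primes and a vetted library).
p, q = 1_000_003, 1_000_033           # small primes, NOT secure
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)          # lambda = lcm(p - 1, q - 1)
g = n + 1                             # standard choice of generator
mu = pow(lam, -1, n)                  # modular inverse of lambda mod n

def encrypt(m: int) -> int:
    """Enc(m) = g^m * r^n mod n^2, with fresh randomness r."""
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """Dec(c) = L(c^lambda mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts,
# so a party holding only encrypted values can compute their sum.
c1, c2 = encrypt(42), encrypt(58)
assert decrypt((c1 * c2) % n2) == 100  # 42 + 58, computed on ciphertexts
```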

Advantages:

  • May mitigate the risks arising from a personal data breach: because HE renders the data unintelligible to an attacker, the risks to individuals are reduced, and notifying affected individuals may not be necessary.
  • Provides a level of assurance that the computation results are the same as if performed on unencrypted data (subject to providing correct inputs prior to encryption).

Disadvantages:

  • Some types of HE are significantly slower than processing plaintext and can increase communications costs, making HE less useful for large-volume processing.
  • HE is subject to the general issues associated with encryption, e.g., (i) it is necessary to implement appropriate technical and organizational measures to keep data secure; and (ii) processes must be in place to generate a new key in case the original one is compromised.

Secure multiparty computation (SMPC)

SMPC is a protocol (a set of rules for transmitting data between computers) that allows at least two different parties to jointly perform processing on their combined data, without any party needing to share all of its data with each of the other parties. SMPC uses a cryptographic technique called “secret sharing”, which refers to the division of a secret and its distribution amongst the parties; this means that each party’s data is split into fragments to be shared with the other parties.

Example use case: multiple organizations use SMPC to calculate their average expenditure without revealing their individual figures to one another.
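
As a rough illustration of the secret-sharing idea behind that use case (a sketch using invented figures, not an example from the Draft Guidance), the Python snippet below uses additive secret sharing: each organization splits its expenditure into random shares that individually reveal nothing, so no party ever sees another’s full value, yet the parties can jointly reconstruct the total and hence the average.

```python
import random

PRIME = 2**61 - 1  # public modulus; all arithmetic is done mod this prime

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three organizations' confidential annual expenditures (invented figures).
expenditures = [120_000, 95_000, 310_000]
n = len(expenditures)

# Each party splits its input and sends one share to every other party;
# an individual share is just a random number and reveals nothing alone.
all_shares = [share(x, n) for x in expenditures]

# Party j locally sums the shares it received (column j)...
partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]

# ...and only these partial sums are published and combined.
total = sum(partial_sums) % PRIME
print("average expenditure:", total / n)  # 175000.0
```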

Advantages:

  • Helps ensure that the amount of shared data is limited to what is necessary for the specific processing purposes.
  • Helps minimize the risk from personal data breaches when processing with other parties, as the shared data is not stored collectively.

Disadvantages:

  • An evolving and maturing concept; may not be suitable for large-scale processing activities in real time (as it can be computationally expensive).
  • Effective use requires technical expertise and resources (so organizations may not be able to implement it by themselves, though this may depend on the specific SMPC model).
  • SMPC protocols can be compromised, which can lead to reconstruction of the input data or incorrect computation results.
  • By design, SMPC means that data inputs are not visible during the computation, so accuracy checks must be conducted before the data is input.
  • SMPC protects data during the computation but does not protect the output (if the output is personal data, appropriate measures should be implemented, such as encryption).

Private set intersection (PSI)

PSI is a specific type of SMPC (see above) which allows two parties, each with their own dataset, to find the “intersection” between them (i.e., the elements the two datasets have in common), without revealing or sharing those datasets. PSI can also be used to compute the size of the intersection or aggregate statistics on it.

Example use case: Two health organizations (A and B) process personal data about individuals’ health. A processes data about individuals’ vaccination status, while B processes data about individuals’ specific health conditions. B needs to determine the percentage of individuals with underlying health conditions who have not been vaccinated. Ordinarily, this may require A to disclose its entire dataset to B so the latter can compare with its own. By using PSI, it does not need to do so — while the computation involves processing of the personal data that both organizations hold, the output of that computation is the number of unvaccinated individuals who have underlying health conditions. B therefore only learns this output, and does not otherwise process A’s dataset directly.
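
One simple way to build PSI is a Diffie-Hellman-style protocol in which each party blinds hashed identifiers under its own private key; because exponentiation commutes, items held by both parties collide after double blinding, while everything else remains opaque. The Python sketch below illustrates this construction with toy parameters and invented identifiers; the Draft Guidance does not prescribe any particular PSI construction, and real systems use elliptic-curve groups and vetted implementations.

```python
import hashlib
import random

P = 2**127 - 1  # a Mersenne prime; toy group for illustration only

def hash_to_group(item: str) -> int:
    """Hash an identifier onto a group element mod P."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(h % P, 2, P)

# Each organization holds a private key and a private set of identifiers.
key_a = random.randrange(2, P - 1)
key_b = random.randrange(2, P - 1)
set_a = {"alice", "bob", "carol"}  # e.g., unvaccinated individuals
set_b = {"bob", "carol", "dave"}   # e.g., individuals with a condition

# Round 1: each party blinds its own items with its key and sends them over.
blinded_a = {pow(hash_to_group(x), key_a, P) for x in set_a}
blinded_b = {pow(hash_to_group(x), key_b, P) for x in set_b}

# Round 2: each party re-blinds the other's values with its own key.
# Exponentiation commutes, so common items collide: (h^a)^b == (h^b)^a.
double_a = {pow(v, key_b, P) for v in blinded_a}
double_b = {pow(v, key_a, P) for v in blinded_b}

# Only the size of the intersection is learned, not the items themselves.
print("intersection size:", len(double_a & double_b))  # 2
```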

Advantages:

  • Helps achieve data minimization, as no data is shared beyond what the parties have in common.
  • Helps prevent purpose creep, as the parties involved receive the minimum amount of information.

Disadvantages:

  • Risk of re-identification from an inappropriate intersection size or from over-analysis of the output.
  • Potential for one or more parties to use fictional data in an attempt to reveal information about individuals (choosing an appropriate intersection size is therefore necessary).
  • A low intersection size may allow the party performing the computation to single out individuals within that intersection where an individual’s record has additional information associated with it.

Federated learning (FL)

FL is a technique which allows multiple parties to train artificial intelligence models on their own data (“local” models). The parties can then combine some of the patterns that those models have identified (known as “gradients”) into a single, more accurate “global” model, without having to share any training data with each other.
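
A minimal Python sketch of the idea (invented data and a deliberately simple one-parameter linear model, not an example from the Draft Guidance): each party computes a gradient on its own data, and only the gradients are shared and averaged into the global model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two parties each hold local data drawn from the same underlying
# relationship y = 3x + noise (synthetic figures for illustration).
def make_local_data():
    x = rng.uniform(-1, 1, size=50)
    return x, 3.0 * x + rng.normal(0, 0.1, size=50)

parties = [make_local_data() for _ in range(2)]
w_global = 0.0  # shared model parameter, initialized at zero
lr = 0.5        # learning rate

for _ in range(20):
    gradients = []
    for x, y in parties:
        # Each party computes the gradient of squared error on its OWN
        # data; the raw training data never leaves the party.
        gradients.append(np.mean(2 * (w_global * x - y) * x))
    # Only the gradients are shared and averaged into the global model.
    w_global -= lr * np.mean(gradients)

print(f"global weight after training: {w_global:.2f}")  # close to 3.00
```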

Advantages:

  • Helps minimize the personal data processed during a model’s training phase.
  • Provides an appropriate level of security (in combination with other PETs).
  • Minimizes the risk from data breaches, as no data is held in one place (where it could be more valuable to an attacker).

Disadvantages:

  • May lead to significant computational costs, making it unusable for large-scale processing operations.
  • Information shared as part of FL may indirectly expose the private data used for local training of machine learning models (e.g., through model inversion of the model updates), and exposing such information to multiple parties can increase the risk of leakage.

Differential privacy

Differential privacy measures how much information is revealed about an individual by virtue of a computation output. It is based on the randomized injection of “noise” (i.e., a random alteration of data in a dataset so that values such as direct or indirect identifiers of individuals are harder to reveal).

Example use case: Differential privacy was used by the US Census Bureau when collecting personal data from individuals for the 2020 US Census to prevent matching between an individual’s identity, their data, and a specific data release.
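
The textbook way to achieve differential privacy for numeric queries is the Laplace mechanism: add noise drawn from a Laplace distribution scaled to the query’s sensitivity divided by the privacy parameter epsilon. The Python sketch below applies it to a simple count over invented data; it illustrates the general technique, not the Census Bureau’s actual system.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(records: list[bool], epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism. A count query has sensitivity 1: adding or removing one
    individual changes the true answer by at most 1."""
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return sum(records) + noise

# 1,000 individuals; True marks those with some attribute (invented data).
population = [i % 4 == 0 for i in range(1000)]
print("true count:", sum(population))  # 250
print("DP count (epsilon=0.5):", round(dp_count(population, 0.5), 1))
```

A smaller epsilon means more noise and stronger privacy at the cost of utility, which is the trade-off noted in the disadvantages below.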

Advantages:

  • May be used to anonymize personal data, subject to sufficient noise being added.
  • Anonymous aggregates can be generated from personal data, or personal data can be used to query a database to provide anonymized statistics.

Disadvantages:

  • May result in poor utility due to noise addition.
  • Does not necessarily result in anonymous information.
  • Improper configuration may lead to data leakage.

Trusted execution environment (TEE)

A TEE is a secure area inside a computing device’s central processing unit (CPU). It allows code to be run, and data to be accessed, in a way that is isolated from the rest of the system. The operating system or hypervisor (a process that separates a computer’s operating system and applications from the underlying physical hardware) cannot read the code in the TEE. Applications running in the TEE can only directly access their own data.

Advantages:

  • Provides security services, including (i) integrity of execution; (ii) secure communication with the applications running in the main operating system; (iii) trusted storage; (iv) key management; and (v) cryptographic algorithms.
  • Processing is limited to a specific part of the CPU with no access available to external code, which means data is protected from disclosure and provides assurances of integrity and confidentiality.
  • May help with data governance, as TEEs evidence the steps taken to mitigate risks and demonstrate appropriateness.

Disadvantages:

  • Lack of available memory may create issues for large-scale processing (only limited data can be processed at any one time).
  • Published security flaws, such as: (i) side-channel attacks (based on extra information gathered from the way the TEE communicates with other parts of the computer); and (ii) timing attacks, which can leak cryptographic keys or allow inferences about the underlying operation of the TEE.

Zero-knowledge proof (ZKP)

ZKP refers to any protocol in which a prover, usually an individual, is able to prove to another party (the verifier) that they are in possession of a secret, without revealing the secret itself.

Example applications of ZKP: Confirmation that someone is a certain age without revealing their birth date; proving someone is financially solvent, without revealing any further information regarding their financial status; demonstrating ownership of an asset, without revealing or linking to past transactions; supporting biometric authentication such as facial recognition on mobile devices.
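
To show the mechanics, below is a Python sketch of the classic Schnorr identification protocol with toy parameters (not an example from the Draft Guidance; real systems use much larger, typically elliptic-curve, groups). The prover convinces the verifier that it knows the secret exponent x behind the public value y = g^x mod p, without revealing x.

```python
import random

# Toy group parameters: p is a safe prime (p = 2q + 1) and g generates
# the subgroup of prime order q. Illustrative sizes only, NOT secure.
p = 3023
q = 1511
g = 4

x = random.randrange(1, q)  # prover's secret
y = pow(g, x, p)            # prover's public key

# 1. Prover commits to a fresh random nonce.
r = random.randrange(1, q)
t = pow(g, r, p)

# 2. Verifier sends a random challenge.
c = random.randrange(1, q)

# 3. Prover responds; s reveals nothing about x because r is random.
s = (r + c * x) % q

# 4. Verifier checks g^s == t * y^c (mod p) without ever learning x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("proof accepted; the secret x was never revealed")
```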

Advantages:

  • Helps with data minimization, as it limits the amount of personal data to what is required.
  • Helps with security, as confidential data such as a person’s actual age does not have to be shared with other parties.

Disadvantages:

  • Poor implementation can cause weaknesses, such as code bugs, compromise during deployment, attacks based on extra information that can be gathered while the ZKP protocol runs, and tampering attacks.

Synthetic data

Synthetic data is “artificial” data generated by data synthesis algorithms, which replicate the patterns and statistical properties of real data (which may be personal data). It is generated from real data using a model trained to reproduce its characteristics and structure, which means analysis of the synthetic data should yield results very similar to analysis of the original real data.
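
As a minimal illustration of the generate-from-a-fitted-model idea (invented figures, and a deliberately simple multivariate Gaussian standing in for the synthesis model; real generators are far more sophisticated), the Python sketch below fits the statistical structure of a dataset and samples artificial records that preserve its aggregate properties.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a real dataset: 500 correlated (age, income) records
# (entirely invented figures used for illustration).
ages = rng.normal(45, 12, size=500)
incomes = 800 * ages + rng.normal(0, 5000, size=500)
real = np.column_stack([ages, incomes])

# Fit a simple model of the data's statistical structure (here, a
# multivariate Gaussian) and sample new, artificial records from it.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=500)

# The synthetic records describe no real individual, but aggregate
# analysis gives similar results on either dataset.
print("real correlation:     ", np.corrcoef(real.T)[0, 1].round(2))
print("synthetic correlation:", np.corrcoef(synthetic.T)[0, 1].round(2))
```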

Advantages:

  • Helps comply with the data minimization principle, as it reduces or eliminates the processing of personal data. Note, however, that it is necessary to ensure that (i) bias in generating synthetic data is detected and corrected, and (ii) adequate measures are taken where synthetic data is used to make decisions that have consequences (e.g., legal or health consequences) for individuals.

Disadvantages:

  • The use of synthetic data is still being actively researched, so it may not be a viable solution for many data processing scenarios.
  • The more closely synthetic data mimics real data, the more likely it is to reveal individuals’ personal data.
  • Some synthetic data generation methods have been shown to be vulnerable to model inversion attacks (i.e., attacks in which an attacker who already has access to some personal data belonging to specific individuals in the training data can infer further personal information about those same individuals by observing the inputs and outputs of the machine learning model).