EU – Ten key points provide insight into regulators’ views on artificial intelligence
The Spanish Data Protection Agency (AEPD) and the European Data Protection Supervisor (EDPS) have issued a joint paper on machine learning. It sets out 10 (mis)conceptions about this technology.
The GDPR as the “law of everything”
Despite the increasing number of proposals to specifically regulate artificial intelligence, such as the AI Act and the AI Liability Directive, the GDPR is still a key instrument in this area.
The GDPR’s broad scope and its flexible, technology-neutral rules mean it will have a central role in regulating emerging technologies such as artificial intelligence for some time to come. However, as a “law of everything”, it is often difficult to predict how the GDPR’s subtle, principle-based obligations apply in practice. Rather than a series of strict rules, the GDPR provides a series of questions about how subjective concepts such as fairness and proportionality should be applied.
It is therefore important to understand the regulators’ attitudes to this technology. How do they see the principles of the GDPR mapping on to artificial intelligence? Which principles are they focusing on? How do they assess the capabilities and limitations of this technology?
Ten tenets from AEPD and EDPS
The joint paper issued by the AEPD and the EDPS sets out their view on machine learning systems. Machine learning is a very common form of artificial intelligence. Rather than having their behaviour exhaustively set out by a human in code, these systems learn from data.
The paper challenges 10 (mis)conceptions (each shown in brackets after the corresponding heading below):
1. Causality requires more than finding correlations (v. “Correlation implies causality”)
A “correlation” is the connection that exists between two or more factors that tend to evolve or vary together. “Causality” refers to the relationship between cause and effect.
The paper notes that machine learning systems find correlations but do not have the ability to establish causal relationships. This is a well-known statistical concern given, for example, the correlation between US per capita consumption of margarine and the divorce rate in Maine, or between total US crude oil imports and US per capita consumption of chicken.
The paper suggests that where a machine learning tool is seeking to identify causation, it should be subject to expert human supervision and interpretation.
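As a rough illustration of how easily a strong but meaningless correlation can arise, the short Python sketch below generates two synthetic series that merely share an upward trend. The figures are invented stand-ins for the margarine and oil examples; nothing here comes from the AEPD/EDPS paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic series that are unrelated to each other but both drift
# upwards over time (standing in for the margarine/divorce-rate examples).
t = np.arange(100)
series_a = 0.5 * t + rng.normal(0, 5, size=t.size)
series_b = 0.3 * t + rng.normal(0, 5, size=t.size)

# A strong Pearson correlation appears simply because both series trend
# upwards over time, even though neither causes the other.
corr = np.corrcoef(series_a, series_b)[0, 1]
print(f"correlation: {corr:.2f}")
```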
2. Machine learning training datasets must meet accuracy and representativeness thresholds (v. “The greater the variety of data, the better”)
The training of machine learning systems requires large amounts of data. This leads to a greater demand for the sharing of data (personal and non-personal). However, the paper suggests that more data will not necessarily improve the performance of a machine learning model and, if it is the “wrong” sort of data, could even aggravate biases in the system.
The GDPR requires that the processing of personal data must be proportionate to its purpose. There is a trade-off between the amount of personal data in a training dataset and the potential interference with individual rights. The paper cautions against substantially increasing the size of the dataset if doing so will only provide a minor improvement in the performance of the model.
3. Machine learning systems require training datasets above a certain quality threshold to perform well (v. “Machine learning needs completely error-free training datasets”)
Machine learning models typically require large training datasets to make accurate predictions. The quality of the training, testing, and validation datasets influences the performance of machine learning systems.
The paper notes, however, that when processing large amounts of information, it is possible to obtain accurate results in spite of individual errors in the data. Introducing noise into the training datasets is a technique that can be used to preserve the privacy of data subjects. This is known as “differential privacy”. Machine learning models can obtain good results despite the inaccuracies caused by differential privacy.
This is, of course, true. However, degrading the quality of the data used to train a machine learning system is something that should be approached with caution.
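For readers unfamiliar with the technique, the sketch below shows the classic Laplace mechanism, one common way of adding calibrated noise to protect individuals, applied to a hypothetical aggregate statistic. It is a minimal illustration of the general idea rather than the specific approach discussed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value: float, sensitivity: float, epsilon: float) -> float:
    """Return a noisy value using the classic Laplace mechanism, calibrated to
    the query's sensitivity and the privacy budget epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale)

# Hypothetical example: release a noisy average age from a small dataset.
ages = np.array([34, 29, 51, 42, 38, 45])
true_mean = ages.mean()

# If ages are clipped to [0, 100], the sensitivity of the mean is 100 / n.
sensitivity = 100 / len(ages)
noisy_mean = laplace_mechanism(true_mean, sensitivity, epsilon=1.0)

print(f"true mean: {true_mean:.1f}, noisy mean: {noisy_mean:.1f}")
```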
4. Federated learning allows the development of machine learning systems without sharing training datasets (v. “Large repositories of data or the sharing of datasets from different sources are needed”)
The traditional approach to machine learning used centralised models, i.e. the pooling of data into a single environment controlled by the machine learning developer.
However, the processing of personal data requires the controller and the recipient of the data to comply with the principles of accountability, security, and purpose limitation under the GDPR. Also, the larger the dataset, the greater the risk of unauthorised access and the greater the impact in case of a security breach.
Accordingly, the paper recommends “federated learning” as an alternative development technique where each controller trains a model with its own data and only shares the parameters derived from this analysis. This way, the “raw” data remains with each controller.
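The toy sketch below illustrates the pattern the paper describes, in the style of federated averaging: three hypothetical controllers each train on their own synthetic data and share only model parameters with a coordinator. It is a simplified sketch, not a production federated learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

# Hypothetical: three controllers, each holding its own dataset locally.
datasets = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + rng.normal(0, 0.1, size=100)
    datasets.append((X, y))

def train_locally(X, y, w, lr=0.1, epochs=20):
    """One controller updates a simple linear model on its own data and
    returns only the resulting weights; the raw data never leaves it."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Federated averaging: the coordinator combines parameters, not data.
global_w = np.zeros(3)
for _ in range(5):  # five federated rounds
    local_weights = [train_locally(X, y, global_w) for X, y in datasets]
    global_w = np.mean(local_weights, axis=0)

print(global_w)  # approaches true_w without any raw data being pooled
```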
5. The performance of deployed machine learning models may deteriorate and will not improve unless further training is conducted (v. “Machine learning models automatically improve over time”)
All machine learning systems are at risk of “model drift” – i.e. they become inaccurate either because the data input into the system changes over time or because the relationships between the data change. If the system is not updated there is a risk to the accuracy of the system and its ability to produce fair and adequate predictions.
Machine learning systems should therefore be monitored to detect and correct “model drift”.
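One simple (and admittedly partial) way to monitor for drift is to compare the distribution of live inputs against the distribution seen at training time. The sketch below does this with a two-sample Kolmogorov-Smirnov test on synthetic data; real monitoring pipelines would typically track several features and performance metrics.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical monitoring job: compare the distribution of one live input
# feature against the distribution of that feature at training time.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # population has shifted

statistic, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:
    print(f"possible data drift detected (KS statistic {statistic:.3f}); "
          "consider reviewing or retraining the model")
```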
6. If a machine learning model is well designed, it can produce decisions understandable to all relevant stakeholders (v. “Automatic decisions taken by machine learning algorithms cannot be explained”)
The paper suggests there are techniques that can be used to “explain” the decisions made by machine learning models. It also notes the degree of detail needed in the explanations depends on the individuals and the context.
This appears to be unduly optimistic. For example, the UK Information Commissioner’s paper Explaining decisions made with AI, created in conjunction with the Alan Turing Institute, sets out in detail the various techniques that can be used to help with explainability (such as surrogate models, LIME and SHAP). That paper concludes that while these can be helpful, they are very unlikely to provide the full picture for “black box” systems such as artificial neural networks. While there has been significant progress in algorithmic interpretability in recent years, these tools will not provide a complete solution and complex, multi-dimensional artificial intelligence models are likely to remain largely unknowable.
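By way of illustration, the sketch below implements one of the simpler techniques referred to above, a global surrogate model: an interpretable decision tree is trained to mimic the predictions of a more opaque model. Its “fidelity” score shows how much, or how little, of the black box’s behaviour the explanation actually captures. The data and models here are synthetic stand-ins, not anything from the regulators’ paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical "black box": a random forest trained on synthetic data.
X, y = make_classification(n_samples=2_000, n_features=5, random_state=0)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Global surrogate: a shallow decision tree trained to mimic the black box's
# predictions rather than the original labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate reproduces the black box's decisions.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"fidelity: {fidelity:.2%}")
print(export_text(surrogate, feature_names=[f"feature_{i}" for i in range(5)]))
```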
7. It is possible to comply with the principle of transparency without violating intellectual property (v. “Transparency in machine learning violates intellectual property and is not understood by the user”)
The paper argues that providing adequate and sufficient information to data subjects does not necessarily require the disclosure of technical information protected by intellectual property rights.
Data subjects must be given meaningful information on the processing of their personal data using machine learning systems (e.g. limitations of the system, the performance metrics of the system, and personal data used as input).
Again, the paper could do more to explore how difficult it can be to communicate complicated technical issues to individuals in a way that is meaningful and does not overload them with information. The UK Information Commissioner’s Explaining decisions made with AI provides more detail on this issue and the different types of explanation that could be provided, e.g. depending on the context it might be appropriate to explain the data in the model, the steps taken to ensure fairness, or the safety of the system.
8. Machine learning systems are subject to human bias as well as other types of biases (v. “Machine learning systems are less subject to human biases”)
The paper notes the risk of bias and highlights a study that shows that machine learning systems may be subject to more than 20 types of bias. For example, since machine learning systems are generally trained on data created by humans, there is a risk they can replicate human biases.
Bias is a feature in both human and machine decision-making. However, bias in machine decision-making is perceived as a greater risk because most people assume a human will have some degree of empathy and so try to identify and self-correct their bias and will apply a final common-sense check to their decision. Machine decision-making does not have either safeguard.
The paper does not, however, provide much explanation of how to address bias. While there are various tools and techniques to identify bias, they often involve the processing of special category personal data. For example, detecting racial bias will necessarily involve the processing of information about racial and ethnic origin, which can be challenging given the strict rules in the GDPR on the use of this data. No explanation is provided of what the legal basis for that processing would be.
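To make the tension concrete, the sketch below computes a basic demographic parity check on hypothetical decisions. The calculation cannot be performed without the protected attribute itself, which is exactly the special category data issue described above.

```python
import numpy as np

# Hypothetical: model decisions and a protected attribute for the same individuals.
# Running this check necessarily requires processing the protected attribute.
decisions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])          # 1 = favourable outcome
protected = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"])

rate_a = decisions[protected == "A"].mean()
rate_b = decisions[protected == "B"].mean()

# Demographic parity difference: gap in favourable-outcome rates between groups.
print(f"group A rate: {rate_a:.2f}, group B rate: {rate_b:.2f}, "
      f"parity difference: {abs(rate_a - rate_b):.2f}")
```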
9. Machine learning system predictions are only accurate when future events reproduce past trends (v. “Machine learning can accurately predict the future”)
Like anyone trying to predict the future, machine learning systems look in the rear-view mirror to predict the curve of the road ahead.
They cannot adapt to completely new scenarios or rapidly changing events, and they are likely to be much less capable than a human of assessing the risks of these eventualities.
10. The ability of machine learning to find unpredicted correlations in data can result in the discovery of new data, unknown to the data subject (v. “Individuals are able to anticipate the possible outcomes that machine learning systems can make of their data”)
The paper notes that, when processing personal data, machine learning systems can generate novel inferences that raise several concerns from a privacy perspective. Data subjects can be affected by these novel inferences in ways they had no means of knowing or anticipating.
Controllers must consider carefully if these novel inferences comply with data protection principles, particularly the risk they breach the purpose limitation principle or result in personal data being processed in a way that is not lawful and transparent.
Context is king
Perhaps the most striking part of the paper is its treatment of machine learning as a monolith, requiring the same approach regardless of its purpose or context.
The impact of machine learning models on individual rights will vary dramatically. In some cases, they have no effect because limited or no personal data is processed, e.g. a system trained to analyse share prices. In other cases, the context in which the model is used will significantly impact its risk. For example, a machine learning tool that automatically dispenses drug prescriptions to patients will be significantly higher risk than one that simply makes recommendations to a doctor.
In practice, what is often most important is a clear understanding of the limitations and uncertainty of the output of these models. Where fundamental rights are involved, that output should be treated as a Delphic pronouncement that needs to be carefully parsed and applied, rather than as a binary statement of fact.