Benefits and challenges of federated learning under the GDPR

Musketeer
7 min read · Sep 28, 2021
Image by Dmitri Posudin from Pixabay

Federated learning, i.e. a “distributed machine learning approach which enables training on a large corpus of decentralized data residing on devices like mobile phones”, has over the past couple of years emerged as one of ‘THE’ technological solutions to some of the data protection challenges raised by AI. From digital advertising to healthcare, this technique appears to be a popular approach to training machine learning models in many domains. In this post, which summarizes the findings presented in this article, we briefly examine whether this apparent enthusiasm for federated learning is justified under the General Data Protection Regulation (“GDPR”). More specifically, we inquire (i) whether federated learning can help comply with the GDPR and (ii) which challenges this technique may raise under this Regulation. To do so, it is helpful to know the main differences between federated learning and traditional centralized machine learning (see our earlier blogpost on this topic) and the security challenges federated learning raises.

Security aspects of federated learning

A significant amount of research has been invested into improving the security of federated learning algorithms against poisoning attacks. These are attacks through which an adversary aims at “subverting the entire learning process in a nondiscriminatory or targeted way, i.e. aiming to decrease the overall performance of a system or to produce particular kinds of errors”. Such attacks can manipulate either the data provided by one or more of the local participants (data poisoning attacks) or the information exchanged between the local participants and the central node (model poisoning attacks). Both data and model poisoning attacks can be performed by internal adversaries (e.g. one of the participants in the federated learning scheme) as well as external ones. Data poisoning attacks can also occur in centralized machine learning scenarios; in many cases, however, their adverse effect can there be limited, for instance because the entity training the model can inspect and sanitize the collected data. Model poisoning attacks, by contrast, are unique to federated learning and can, when standard federated learning aggregation schemes are used, have a significant negative impact on the performance of the final federated learning model.
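
To make this more concrete, the minimal Python sketch below (purely illustrative; the function names and numbers are our own and not taken from any particular federated learning framework) simulates one aggregation round with ten participants, one of which submits a poisoned model update. Plain federated averaging is dragged far away from the honest consensus, while a simple robust aggregator such as a coordinate-wise median is much less affected.

```python
# Illustrative sketch only: shows why standard federated averaging is
# sensitive to a single malicious ("model poisoning") update, and how a
# robust aggregation rule limits its impact. Not production code.
import numpy as np

def fed_avg(updates):
    """Standard aggregation: element-wise mean of the clients' updates."""
    return np.mean(updates, axis=0)

def coordinate_median(updates):
    """A simple robust alternative: element-wise median of the updates."""
    return np.median(updates, axis=0)

rng = np.random.default_rng(seed=0)
honest_updates = [rng.normal(loc=1.0, scale=0.1, size=5) for _ in range(9)]
poisoned_update = [np.full(5, -100.0)]  # one adversarial participant

print(fed_avg(honest_updates + poisoned_update))            # pulled far from ~1.0
print(coordinate_median(honest_updates + poisoned_update))  # stays close to ~1.0
```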

Compared to centralized machine learning models, federated learning models are arguably less vulnerable to certain privacy attacks, such as membership inference attacks. Membership inference attacks aim at inferring whether a specific data point (i.e. a record of an individual with some characteristics, e.g. a picture of Alice) has been used to train the model. Since federated learning makes it possible to train models on a larger dataset than the one typically used in centralized machine learning, it is arguably more difficult for an adversary to gather all the data needed to perform a successful membership inference attack. By contrast, federated learning models are uniquely vulnerable to property inference attacks, which can typically be performed by malicious training participants. These attacks target the updates exchanged between the local participants and the central node and aim at inferring broader properties (e.g. gender or race) of the training data. If these properties are unique to a specific individual, this could allow the attacker to identify that individual. Research indicates that these attacks are theoretically possible, although we consider that the likelihood of these attacks occurring in real life is currently limited.
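
As a toy illustration of the intuition behind a membership inference attack, the sketch below (our own simplified example, with made-up loss values) assumes the attacker can query the model's per-example loss and simply guesses “member” when the loss is low, exploiting the fact that models tend to fit their training data better than unseen data; real attacks are considerably more sophisticated.

```python
# Toy loss-threshold membership inference: purely illustrative, with
# invented numbers; real attacks (e.g. shadow-model attacks) are far
# more elaborate than this single-threshold rule.
import numpy as np

def membership_guess(per_example_loss, threshold=0.5):
    """Guess 'member of the training set' when the loss is below the threshold."""
    return per_example_loss < threshold

train_losses = np.array([0.05, 0.10, 0.08])   # examples the model was trained on
unseen_losses = np.array([0.90, 1.20, 0.75])  # examples it has never seen

print(membership_guess(train_losses))   # [ True  True  True ] -> flagged as members
print(membership_guess(unseen_losses))  # [False False False] -> flagged as non-members
```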

In the following two sections we briefly touch upon what the above means in terms of compliance with the GDPR.

Potential GDPR benefits

As already suggested by several data protection authorities, federated learning has the potential to facilitate compliance with the principle of data minimization, contained in article 5.1(c) GDPR. This principle requires that “personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”. When applied to the training phase of a machine learning model, this essentially means that one should use only as much personal data as is strictly necessary to properly train the model. Assuming that the raw training data qualify as personal data under the GDPR, it is easy to see how federated learning can better support compliance with this principle than centralized learning: it avoids transferring the raw training data to a central server and thus avoids (unnecessarily) duplicating such data. By avoiding such duplication, and the consequent centralization of data in the hands of a central server, federated learning can also help limit the risk that these data are re-used for a purpose that is incompatible with the original purpose of collection. This means that, indirectly, federated learning could also facilitate compliance with the principle of purpose limitation, contained in article 5.1(b) GDPR. This principle requires that “personal data shall be collected for specified, explicit and legitimate purposes and not further processed in a way that is incompatible with those purposes […]”. Finally, the fact that federated learning models are arguably less vulnerable to membership inference attacks can also present an advantage under the GDPR. Indeed, although this is disputed in legal scholarship, some authors are of the opinion that models amenable to membership inference attacks could qualify as personal data. If one agrees with this position, federated learning models could present the advantage of being less likely to qualify as personal data, because they are arguably less vulnerable to membership inference attacks.

Potential GDPR challenges

Below we briefly highlight some of the main questions that using federated learning may raise under the GDPR.

Which data qualify as personal data?

A first thorny task for organizations adopting federated learning is to discern which processing operations occurring in the context of such learning are covered by the GDPR and which ones are not. In fact, much of the difficulty lies in determining which data qualify as “personal data” under article 4(1) GDPR. This article defines personal data as “any information relating to an identified or identifiable natural person”. The concept of “identifiability” under the GDPR is highly ambiguous and can potentially be subject to a broad interpretation. In the case of federated learning, it is clear that if the raw training data provided by the local participants qualify as personal data, the operations performed on these data during the learning process will be covered by the GDPR. If, for example, a group of hospitals provide MRI data containing a patient’s unique identifier number within the hospital to train a federated learning model, these data are likely to qualify as data relating to an identifiable individual (i.e. the patient) and be considered personal data. Consequently, (wholly or partly) automated processing operations performed on these data in the context of federated learning (such as data pre-processing operations, e.g. data normalization or data alignment) would likely fall within the scope of application of the GDPR. What is less clear is whether the operations performed on the model updates exchanged between the local participants and the central node are covered by the GDPR. Here too, the key question is whether these updates can qualify as personal data. Our view is that this cannot be excluded, especially in light of (i) the potentially broad interpretation of the concept of personal data in general, and identifiability in particular, and (ii) the possibility of updates leaking information or being amenable to property inference attacks.

Who is responsible for GDPR compliance?

A second challenging task will be to determine who is responsible for compliance with the GDPR when processing personal data through federated learning. The GDPR relies on the binary controller-processor distinction, with the bulk of the obligations resting on the controller, i.e. the (legal or natural) person that determines “alone or jointly with others, the purposes and the means of the processing” (article 4(7) GDPR). The processor is the (legal or natural) person that “processes personal data on behalf of the controller”. As remarked by Van Alsenoy, there is, however, a great deal of uncertainty on how to identify controllers and processors in practice. In our opinion, this uncertainty is amplified in the context of complex decentralized ecosystems such as federated learning, due to the multiplicity of actors (potentially, thousands to millions) that can participate in such learning. Critically, if the allocation of roles is not clearly settled before the processing occurs, this risks resulting in a lack of fair and transparent processing vis-à-vis the person whose personal data are processed, contrary to article 5.1(a) GDPR.

How to ensure accurate predictions?

This question ties into a peculiarity of federated learning, namely that a global model is trained with potentially millions of participants, where the raw training data provided by each participant can “by design” not be inspected by the other participants. As stated above, this can render federated learning vulnerable to poisoning attacks, which may result in a model that performs poorly. If the final federated learning model is then used to infer new personal data, those data risk being inaccurate. Depending on the purpose for which the data are processed, this could result in an infringement of the principle of accuracy (article 5.1(d) GDPR), which requires personal data to “be accurate and, where necessary, kept up to date […]”.

Conclusion

What appears from the above is that, under the GDPR, federated learning can definitely present some advantages compared to centralized machine learning. It should, however, not be considered a silver bullet. In light of the increasing attention to decentralized ecosystems such as federated learning, we believe it is crucial to devote future research efforts to understanding how the GDPR applies to these ecosystems and which tools can be adopted to further facilitate compliance in this type of environment.


To know more about MUSKETEER, you can visit musketeer.eu

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824988.
