Image by Barbara A Lane from Pixabay

Tackling the privacy issue: the solution of federated learning


Data, it’s always about the data!

In recent years, advances in Big Data technologies have fostered the penetration of AI and machine learning into many application domains, producing a disruptive change in society and in the way many systems and services are organized. Machine learning technologies provide significant advantages for improving the efficiency, automation, functionality and usability of many services and applications. In some cases, machine learning algorithms are even capable of outperforming humans. It is also clear that machine learning is one of the key pillars of the fourth industrial revolution and that it will have a very significant impact on the future economy.

Machine learning algorithms learn patterns and are capable of extracting useful information from data. In some cases, however, the amount of data needed to achieve a high level of performance is significant. A few years ago, Peter Norvig, Director of Research at Google, put it this way: “we don’t have better algorithms, we just have more data.” Thus, apart from the expertise and the computational resources needed to develop and deploy machine learning systems, for many companies the lack of data can be an important obstacle to participating in this new technological revolution. This can have a profound negative effect on SMEs, which, in many cases, will not be able to compete with the largest companies in this sector.

In recent years this problem has been alleviated by the growth and development of data markets, which give companies access to datasets for developing AI and machine learning models. However, in many sectors access to these data markets is very difficult because of the sensitivity of the data or privacy restrictions that prevent companies from sharing or commercializing it. This is the case, for example, in healthcare applications, where the privacy of the patients must be preserved and where the data often cannot leave the data owner’s facilities.

Then, what can we do? Is it possible to design high-performance machine learning systems given such privacy constraints? The answer to these questions is provided by federated learning.

Building collaborative models with federated learning

Federated learning is a machine learning technique that allows building collaborative learning models at scale with many participants while preserving the privacy of their datasets. Federated learning aims to train a machine learning algorithm using multiple datasets stored locally in the facilities or devices of the clients participating in the collaborative learning task. Thus, the data is never exchanged between the clients and always remains in their own facilities. In federated learning, a central node (or server) orchestrates the learning process: it aggregates and distributes the information provided by all the participants (or clients). Typically, the server first collects and aggregates the information provided by the different clients, updating the global machine learning model. Then, the server sends the parameters of the updated global model back to the clients. On their side, the clients receive the new parameters sent by the server and train the model locally using their own datasets. This communication process between the server and the clients is repeated for a given number of iterations to produce a high-performance collaborative model.
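To make this communication loop concrete, here is a minimal sketch of a federated averaging round in Python with NumPy. It assumes a simple linear model trained with local gradient steps and plain (unencrypted) parameter exchange; the names `local_update` and `fed_avg` are illustrative and are not part of the MUSKETEER platform.

```python
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=5):
    """Client-side step: refine the global weights on the local dataset."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
        w -= lr * grad
    return w

def fed_avg(client_weights, client_sizes):
    """Server-side step: average client updates, weighted by dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Toy setup: three clients, each holding its own private dataset.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

global_w = np.zeros(3)
for _ in range(20):                        # communication rounds
    updates, sizes = [], []
    for X, y in clients:                   # the raw data never leaves the client
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    global_w = fed_avg(updates, sizes)     # aggregation on the server

print("Collaborative model parameters:", global_w)
```

Note that in each round only model parameters travel between the clients and the server; the raw datasets stay where they were collected.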

Depending on the application, there are different variants of federated machine learning. In some cases, a single global collaborative model is learned from all the clients’ data. In other scenarios, only part of the parameters of the machine learning model are shared between participants, whereas the rest remain local (specific to each client). This makes it possible to provide some level of personalization for the clients. This approach is also useful in cases where the datasets of the different participants are not completely aligned, i.e. they do not have exactly the same features. The sketch below illustrates this idea.
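As an illustration of this partial-sharing variant, the following sketch (again in plain NumPy, with hypothetical names such as `ClientModel` and `aggregate_shared`) splits each client’s parameters into a shared block that the server averages and a local block that never leaves the client.

```python
import numpy as np

class ClientModel:
    """Client parameters split into a shared block and a personal block."""
    def __init__(self, n_shared, n_local, seed):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(size=n_shared)  # aggregated across clients
        self.local = rng.normal(size=n_local)    # personalized, never exchanged

def aggregate_shared(models):
    """Server averages only the shared part of the parameters."""
    return np.mean([m.shared for m in models], axis=0)

models = [ClientModel(n_shared=4, n_local=2, seed=i) for i in range(3)]

# One communication round: only the shared block is sent to the server...
new_shared = aggregate_shared(models)
# ...and broadcast back; the local (personalized) blocks stay on the clients.
for m in models:
    m.shared = new_shared.copy()
```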

Where can we use federated machine learning?

As the computational power of small devices such as smartphones and other IoT devices keeps increasing, federated machine learning offers new opportunities to build more efficient machine learning models at scale: the data is kept locally on these devices, their computational power is used to train the model, and the amount of data that needs to be stored in the cloud is reduced. This approach was used in one of the first successful use cases of federated learning: “Gboard” for Android, a predictive keyboard developed by Google that leverages the smartphones of millions of users to improve the precision of the machine learning model that makes the predictions. In this case Google did not need to upload the text messages of millions of users to the cloud, which can be troublesome from a regulatory perspective, thus helping to protect the privacy of the users’ data. This use case showed that federated machine learning can be applied in settings with millions of participants, each holding a small dataset.

Federated learning is also a suitable solution in applications where data sharing or access to data markets is limited. In these cases, the performance that machine learning models can achieve when trained on small local datasets may be unacceptable for practical use. Federated machine learning makes it possible to boost performance by training a collaborative model that uses the data from all the participants in the learning task while keeping that data private. The performance of this collaborative model is similar to the one we would obtain by training a standard machine learning model on the merged datasets of all the participants, which would require sharing the data.

MUSKETEER’s vision

Our mission in MUSKETEER is to develop an industrial data platform with scalable algorithms for federated and privacy-preserving machine learning, including detection and mitigation of adversarial attacks and a rewarding model capable of fairly monetizing datasets according to their real data value.

In MUSKETEER we are addressing specific fundamental problems related to the privacy, scalability, robustness and security of federated machine learning, including the following objectives:

  • To create machine learning models over a variety of privacy-preserving scenarios: in MUSKETEER we are defining and developing a platform for federated learning with different privacy operation modes to provide compliance with the legal and confidentiality restrictions of most industrial scenarios.
  • To ensure security and robustness against external and internal threats: in MUSKETEER we are not only concerned with the security of our software but also with the security and robustness of our algorithms against internal and external threats. We are investigating and developing new mechanisms to make the algorithms robust and resilient to failures, malicious users and external attackers trying to compromise them.
  • To provide a standardized and extendable architecture: the MUSKETEER platform aims to enable interoperability with Big Data frameworks by providing portability mechanisms to load and export predictive models from/to other platforms. For this, the MUSKETEER design aims to comply with the International Data Spaces Association (IDSA) reference architecture model[i].
  • To enhance the data economy by boosting sharing across domains: the MUSKETEER platform will help to foster and develop data markets, enabling data providers to contribute their datasets to the creation of predictive models without explicitly disclosing them.
  • To demonstrate and validate in two different industrial scenarios: in MUSKETEER we want to show that our platform is suitable to provide effective solutions in different industrial scenarios such as manufacturing and healthcare, where federated learning approaches can bring important benefits in terms of cost, performance and efficiency.

If you want to know more about MUSKETEER, visit our website or contact us!

In our following posts we will explain some of the key aspects we are working on in MUSKETEER, including how we protect the privacy of the clients’ data with our Privacy Operation Modes, and how we provide secure and robust mechanisms that make our algorithms resistant to failures and attacks. Do you want to know more? Then, please, stay tuned!

Luis Muñoz-González

Research Associate at Imperial College London

[i] https://www.internationaldataspaces.org/wp-content/uploads/2019/03/IDS-Reference-Architecture-Model-3.0.pdf
