Leverage the power of Federated Learning with Nevermined
Nevermined offers its users the ability to build data-sharing ecosystems where untrusted parties can share and monetize their data in a way that’s efficient, secure and privacy-preserving.
- Data Sharing — Enables the sharing and access of digital assets between untrusted parties in the data ecosystem;
- Data In Situ Computation (DISC) — Allows the execution of models and algorithms without moving the data (i.e. “in situ”);
- Federated Learning — With its agnostic capabilities, Nevermined lets you pick the appropriate Federated Learning frameworks to work across multiple datasets at the same time. Extract value and insights where you couldn’t before.
As a baseline, we are going to take the Kaggle Credit Card Fraud Detection dataset and extend it into a Federated Learning scenario.
One of the biggest problems when trying to train a model is finding large amounts of data. Now imagine we have two banks: Bank A and Bank B. Both of these banks have large amounts of credit card transactions that they aren’t really leveraging for machine learning purposes. They either lack the data science expertise or they are afraid of the penalties that may arise from mishandling the data under data protection laws.
Now we introduce a third party: a Data Scientist. The Data Scientist wants to build a service that can automatically flag fraudulent credit card transactions with a high degree of accuracy. Knowing the typical features that should be present in every credit card transaction, he has already built a model using scikit-learn.
Now the biggest challenge becomes getting access to real data so that the data scientist can use it to train the model to be able to make predictions with a high degree of accuracy. Banks won’t and can’t share this type of data with third parties, so the data scientist hits a roadblock.
Privacy-preserving machine learning is already possible today through Federated Learning, which can be seen as a form of privacy-preserving distributed machine learning. A machine learning model is trained in parallel over multiple datasets. The individual weights resulting from the training over each dataset are then aggregated in a privacy-preserving manner, and this produces the final trained model.
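To make the aggregation step concrete, here is a minimal sketch of plain weighted averaging of per-dataset weights. It is an illustration only: it omits the privacy-preserving part entirely (a real framework masks or encrypts the updates before combining them), and the function name, the example weights and the sample counts are all made up for this sketch.

```python
from typing import List, Sequence


def federated_average(
    updates: Sequence[Sequence[float]],
    sample_counts: Sequence[int],
) -> List[float]:
    """Average weight vectors, weighting each by its dataset size.

    Plain averaging only: no masking or encryption is applied here.
    """
    if not updates or len(updates) != len(sample_counts):
        raise ValueError("need exactly one sample count per update")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_weights = len(updates[0])
    if any(len(u) != n_weights for u in updates):
        raise ValueError("all updates must have the same length")
    return [
        sum(u[i] * c for u, c in zip(updates, sample_counts)) / total
        for i in range(n_weights)
    ]


# Hypothetical weights produced by training on each bank's data.
bank_a_weights = [0.2, -1.0, 0.5]
bank_b_weights = [0.4, -0.6, 0.1]

# Bank A contributed three times as many samples, so it counts three times as much.
global_weights = federated_average([bank_a_weights, bank_b_weights], [3000, 1000])
```

Weighting by sample count is one common choice; an unweighted mean is the special case where every party reports the same count.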
However, orchestrating this type of analytical application is difficult. Orchestrating this type of analytical application across disparate datasets controlled by independent counterparties is virtually impossible.
But not anymore!
Nevermined to the Rescue!
To summarize our use-case:
- We have a data scientist who requires a large amount of private data to train a model
- We have two distinct banks, Bank A and Bank B, with their own distinct infrastructure and data assets, and they would like to monetize their data
- We need a system that allows for: interoperable data sharing between multiple parties; data monetization; an interface to run code spanning multiple organizations
With Nevermined’s Data Sharing capabilities, Bank A and Bank B can both advertise their assets and monetize their data without having to move or change their existing infrastructure, thus turning them into data providers. And Nevermined’s Data In Situ Computation (DISC) capability allows consumers of the data, like the Data Scientist, to run generic computations (inside the scope allowed by the data providers) without ever having to access the data directly.
This means that private data remains private as it does not need to move. Now the Data Scientist has a single interface that he can use to train his model over data from multiple, wholly distinct and independent organizations. All the heavy lifting of orchestrating a Federated Learning session spanning multiple organizations is handled by Nevermined. This means the Data Scientist only needs to reuse his existing machine learning code and tools with little to no modification.
In the next sections we will describe how, through Nevermined, the following can be achieved:
- Data Providers can publish and advertise their data;
- Data Consumers can pay for access to that data;
- Data Consumers, like data scientists, can reuse their existing machine learning models to run Federated Learning over multiple, disparate datasets.
Publishing the data
In order to make the data available and advertisable through Nevermined, both Bank A and Bank B first need to publish the data on Nevermined.
Below is a snippet of code from nevermined_provider.ipynb. In order to publish a dataset we need to define some metadata describing what kind of asset this is.
In this example we define only the required attributes, but in reality the metadata can have any number of additional attributes. One thing to note is that we are publishing a file, setting a price, and marking it as a dataset. We then publish this metadata to Nevermined using ‘create_compute’. This creates a compute-type dataset, meaning that it can only be used to run computations against and cannot be accessed directly.
This is all each bank needs to do to publish their datasets on Nevermined.
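As a rough idea of what such metadata might look like, here is a sketch that builds it as a plain Python dictionary. The field names, the file URL, the license value and the helper name `build_dataset_metadata` are assumptions made for illustration; the notebook is the reference for the exact schema, and the call to ‘create_compute’ is deliberately left out because its signature is not shown in this article.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_dataset_metadata(
    name: str, author: str, file_url: str, price: int
) -> Dict[str, Any]:
    """Assemble illustrative dataset metadata (field names are assumptions)."""
    return {
        "main": {
            "name": name,
            "author": author,
            "dateCreated": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "license": "proprietary",  # placeholder value
            "price": str(price),       # the price consumers pay for compute access
            "type": "dataset",         # marks the asset as a dataset
            "files": [
                {"index": 0, "contentType": "text/csv", "url": file_url},
            ],
        }
    }


# Hypothetical values for Bank A; Bank B would build its own metadata the same way.
bank_a_metadata = build_dataset_metadata(
    name="Credit card transactions (Bank A)",
    author="Bank A",
    file_url="https://example.com/bank_a/creditcard.csv",
    price=1,
)
# The bank would then hand this metadata to 'create_compute' (not shown here).
```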
Consuming the data
Now that the data is published, the Data Scientist discovers both banks’ datasets through a marketplace (more on this in a later article) and decides that he would like to train his machine learning model over these two datasets. Before doing so, however, he needs to pay the required price to obtain permission to run computations over the data.
Below is a snippet of code from nevermined_consumer.ipynb. The Data Scientist orders the data asset represented by the DID (Decentralized Identifier), which was generated earlier by Bank A during their asset publishing process.
Since a single asset in Nevermined can provide multiple services, we also need to specify which service we are ordering; in this case, it’s the computing service. If the order is successful, payment is facilitated via Nevermined and a unique service agreement is returned. The Data Scientist will later use this service agreement to request the execution of an algorithm over the data (the service he just ordered).
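The flow described above (pick a DID, pick a service type, pay, receive a service agreement) can be mimicked with a small in-memory stand-in. To be clear, this is not the Nevermined SDK: `InMemoryLedger`, `ComputeOrder`, the DIDs and the prices are all invented for this sketch, and it only shows how the pieces relate to each other.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ComputeOrder:
    did: str            # which asset was ordered
    service_type: str   # which of the asset's services was ordered
    agreement_id: str   # unique id used later to request an execution


@dataclass
class InMemoryLedger:
    """Hypothetical stand-in for the payment and agreement step."""
    prices: Dict[str, int]
    balance: int
    orders: Dict[str, ComputeOrder] = field(default_factory=dict)

    def order(self, did: str, service_type: str = "compute") -> ComputeOrder:
        price = self.prices[did]
        if price > self.balance:
            raise ValueError("insufficient balance to order this asset")
        self.balance -= price                 # payment
        agreement_id = uuid.uuid4().hex       # unique service agreement
        placed = ComputeOrder(did, service_type, agreement_id)
        self.orders[agreement_id] = placed
        return placed


# Placeholder DIDs; the real ones are generated when each bank publishes.
ledger = InMemoryLedger(
    prices={"did:nv:bank-a-dataset": 1, "did:nv:bank-b-dataset": 1},
    balance=5,
)
order_a = ledger.order("did:nv:bank-a-dataset")
order_b = ledger.order("did:nv:bank-b-dataset")
```

The Data Scientist would keep both agreement ids, since each one is needed later to request an execution over the corresponding dataset.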
Running computations over the data
Up to this point, the Data Scientist has bought compute access to two datasets and now wants to use Federated Learning to train his model over both datasets.
Nevermined is agnostic to the computations or frameworks used on its execution environment. For this example, we are going to be using Xaynet Federated Learning Framework v0.8.0 together with scikit-learn.
Below is a snippet of code from federated_fraud_demo.ipynb. The Xaynet framework requires us to define the participant code. We start by preparing the data for training and creating the model we want to use. Next, we define the ‘train_round’ method that receives as input the weights of the global model sent by the coordinator.
We initialize the model with these weights and then we perform normal training using the scikit-learn model. In the end, we return the new weights as a result of the training round. The ‘serialize’ and ‘deserialize’ methods are just helper functions to serialize and deserialize the weights from and into NumPy arrays. Finally, we start the participant which connects to the coordinator and waits for instructions.
This code will be run over both banks’ datasets.
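The shape of the participant described above can be sketched without any dependencies. Two caveats: this class does not import or subclass anything from Xaynet (the method names simply follow the description in this article), and a tiny hand-written logistic regression stands in for the scikit-learn model so the example runs on its own. The class name and the toy data are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label), label 1 = fraud


class FraudParticipant:
    """Illustrative participant: receive global weights, train locally, return weights."""

    def __init__(self, data: List[Sample], lr: float = 0.1, epochs: int = 20):
        self.data = data
        self.lr = lr
        self.epochs = epochs
        self.n_features = len(data[0][0])

    def deserialize(self, payload: Optional[Sequence[float]]) -> List[float]:
        # No global model exists before the first round: start from zeros.
        if payload is None:
            return [0.0] * (self.n_features + 1)  # feature weights + bias
        return list(payload)

    def serialize(self, weights: Sequence[float]) -> List[float]:
        return [float(w) for w in weights]

    def train_round(self, global_weights: Optional[Sequence[float]]) -> List[float]:
        w = self.deserialize(global_weights)      # initialise from the global model
        for _ in range(self.epochs):
            for x, y in self.data:                # plain SGD on the logistic loss
                z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
                err = 1.0 / (1.0 + math.exp(-z)) - y
                for i, xi in enumerate(x):
                    w[i] -= self.lr * err * xi
                w[-1] -= self.lr * err
        return self.serialize(w)                  # new local weights for aggregation


# Toy, made-up data: one feature, negative values legitimate, positive values fraud.
toy_data = [([-1.0], 0), ([-0.8], 0), ([0.8], 1), ([1.0], 1)]
participant = FraudParticipant(toy_data)
local_weights = participant.train_round(None)
```

In the real setup, the framework calls the training method once per round and takes care of sending the returned weights back to the coordinator.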
Once we have our code, we need to host it (e.g. on GitHub) and then publish it on Nevermined in the same way we published the banks’ datasets.
Below is a snippet of code from nevermined_consumer.ipynb. In the metadata we provide the information about the file to execute, we mark the asset as ‘algorithm’ and we specify both the ‘requirements’ for execution and the ‘entrypoint’. In the requirements we can set the Docker image that we want to use and in the entrypoint we define how the algorithm should be executed.
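For illustration, the algorithm metadata could be assembled along the following lines. As before, the field names, the Docker image and tag, the notebook URL and the helper name are assumptions rather than the exact schema used in the notebook; the entrypoint uses papermill's standard `papermill <input> <output>` command-line form.

```python
from typing import Any, Dict


def build_algorithm_metadata(
    name: str, author: str, notebook_url: str, image: str, tag: str
) -> Dict[str, Any]:
    """Assemble illustrative algorithm metadata (field names are assumptions)."""
    return {
        "main": {
            "name": name,
            "author": author,
            "type": "algorithm",  # marks the asset as an algorithm
            "files": [
                {"index": 0, "contentType": "application/x-ipynb+json", "url": notebook_url},
            ],
            "algorithm": {
                "language": "python",
                # Requirements: the Docker image the execution environment should use.
                "requirements": {"container": {"image": image, "tag": tag}},
                # Entrypoint: how the algorithm is executed inside that image.
                "entrypoint": "papermill federated_fraud_demo.ipynb output.ipynb",
            },
        }
    }


# Hypothetical values; the URL points at wherever the notebook is hosted.
algorithm_metadata = build_algorithm_metadata(
    name="Federated fraud detection participant",
    author="Data Scientist",
    notebook_url="https://example.com/federated_fraud_demo.ipynb",
    image="python",
    tag="3.8-slim",
)
```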
Now that we have the computation that we want to run over the datasets we need to define the workflow. A workflow is a set of instructions to Nevermined describing how the computation should be carried out. Nevermined uses these instructions to configure the execution environment.
Below is a snippet of code from nevermined_consumer.ipynb. Defining a workflow is similar to defining a job in a CI system like GitHub Actions or CircleCI. In our case we have only one stage, with one input, which is the dataset from one of the banks, and one transformation, which is the code we want to run.
We create one workflow for each bank.
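A single-stage workflow of this kind might be expressed as follows. This is a sketch under assumptions: the field names, the helper `build_workflow_metadata` and the DIDs are placeholders chosen for illustration, not the exact structure from the notebook.

```python
from typing import Any, Dict


def build_workflow_metadata(
    name: str, dataset_did: str, algorithm_did: str
) -> Dict[str, Any]:
    """One stage, one input dataset, one transformation (field names are assumptions)."""
    return {
        "main": {
            "name": name,
            "type": "workflow",
            "workflow": {
                "stages": [
                    {
                        "index": 0,
                        "input": [{"index": 0, "id": dataset_did}],  # a bank's dataset
                        "transformation": {"id": algorithm_did},      # the code to run
                    }
                ]
            },
        }
    }


# Placeholder DIDs; the real ones come from the publishing steps.
algorithm_did = "did:nv:fraud-algorithm"
dataset_dids = {
    "Bank A": "did:nv:bank-a-dataset",
    "Bank B": "did:nv:bank-b-dataset",
}

# One workflow per bank: same transformation, different input dataset.
workflows = {
    bank: build_workflow_metadata(f"Fraud training on {bank}", did, algorithm_did)
    for bank, did in dataset_dids.items()
}
```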
And finally we order Nevermined to execute the computations.
With this last step our algorithm will run on the execution environment of each bank, and the results will be published and made accessible to the Data Scientist.
Note that in the entry point of the algorithm we are using papermill to execute the Jupyter Notebooks. This shows how flexible Nevermined is in allowing data scientists to reuse tools they are already familiar with.
The outputs will be two notebooks (one for each bank) and the global model. Each notebook is returned with its cells filled in with the output of the execution. The final aggregated model is returned as a serialized NumPy array.
Update 11/02/2021: We have recorded a video showing the demo in action.
In this article, we showed how a Nevermined data ecosystem can allow for easy interoperability between data providers and data scientists. They can run complex distributed machine learning tasks spanning multiple different organizations and do it in a way that is privacy-preserving and within the allowed scope of data protection laws.
You can find the full demo explained here, and others in our GitHub repo. You can also get a deeper understanding of Nevermined through our documentation and reach out to us with any questions on our Discord server.
If you would like to know more about the commercial opportunities Nevermined makes available for your organization, or are interested in having a demo, please reach out to us on email@example.com or visit our website.
Originally posted on 2021-01-08 on Medium.