Provenance — Everything has a story

Data Provenance

In a complex and heterogeneous world of people and objects interacting, it is necessary to understand how things are created, used and connected during their life-cycle. Even the simplest things have interesting stories behind them.

Understanding how events unfold is especially important for topics that involve complex logistics, such as vaccines. Where do they come from, and how were they handled before reaching you? Alternatively, who produced the food you are about to eat, and where does it come from? Or, is this artwork that I want to buy an original?

There are many examples where this basic understanding is critical, and where transparency would make our lives much simpler.

This is why data provenance is a key feature of Nevermined.

Says Wikipedia:

“Provenance (from the French provenir, ‘to come from/forth’) is the chronology of the ownership, custody or location of a historical object. The term was originally mostly used in relation to works of art but is now used in similar senses in a wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books, the circular economy, and science and computing.”

Provenance allows us to understand the context in which “something” was created, how it is used and by whom, and how ownership is transferred or delegated.

The challenge is how to record this provenance information so that interacting with “something” can be done transparently and with trust.

A transparent, generic and unique source of truth would be a good solution for recording all the provenance generated between the independent parties related to “something”.

This smells like a blockchain

Of course we are not saying anything new when we suggest blockchain is the right fit here. Using immutable and decentralized storage provides a unique source of truth for all the parties involved during the complete life-cycle of a “something”. It doesn’t matter if it’s a vaccine, an artwork or a dataset. The main benefits of doing it this way are:

  • We can keep the complete provenance record during “something’s” whole life-cycle. This provides the complete lineage of the goods in a supply chain workflow, from manufacturing to final delivery. It also works for any other situation where multiple independent parties collaborate toward a common goal. This is what we call a “Data Ecosystem”.
  • We can keep track of the digital signatures of all the parties involved. To guard against tampering and to track who does what during any handover or interaction, the signatures of the parties involved in that process can be recorded. This provides transparency about all of the actors involved in the workflow.
  • We can record the digital fingerprint of the assets involved in that process too. Any physical good can be associated with multiple digital files and metadata during its life-cycle. This can include manufacturing details, product specifications, quality controls, etc. We can use the same blockchain solution to record the fingerprint of these digital assets and use it to flag any later changes to the data or metadata.
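
To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes SHA-256 as the fingerprint function and uses a made-up quality-control report. Neither choice comes from Nevermined, and the helper names `fingerprint` and `is_unchanged` are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, recorded_fingerprint: str) -> bool:
    """Compare the current content against a fingerprint recorded earlier."""
    return fingerprint(content) == recorded_fingerprint


# Hypothetical quality-control report attached to a physical good.
original_report = b"batch=A17;temperature_ok=true;inspector=alice"

# In a real deployment this value would be the one written to the ledger.
recorded = fingerprint(original_report)

# Someone later edits a single field of the report.
tampered_report = b"batch=A17;temperature_ok=false;inspector=alice"

print(is_unchanged(original_report, recorded))  # True
print(is_unchanged(tampered_report, recorded))  # False
```

Only the short digest needs to be recorded. The file itself can stay wherever it already lives, and any later change to it no longer matches the recorded value.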

So, you might be saying “let’s fix the world with another blockchain!” But what if we have complex scenarios with dozens of different parties participating and interacting with each other? Why does everybody need to adapt to a new way of recording what’s going on? How is this going to happen? How do we reduce the friction of doing it in a homogeneous and generic way?

What we need is a way for multiple independent systems to speak to each other in a transparent, common and generic way. A Standard of sorts.

W3C PROV Standard FTW!

In 2011, the W3C created a working group to define how all the different information related to provenance can be represented in a generic way, valid for any use case. That working group distilled the common interactions between actors (people, organizations, etc.) and entities (things, assets, datasets, etc.) into activities that connect them. The result of this work was the W3C Provenance specifications, released in 2013.

The W3C Provenance (PROV) specification provides a generic notation that can represent essentially any use case, including those described above.

Provenance information can be modeled as the interaction between Agents and Entities related via the Activities between them:

  • Entities — In PROV, physical, digital, conceptual, or other kinds of things are called entities. Examples of such entities are assets, data, an AI, a web page, a chart, a spell checker, etc.
  • Agents — An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or other entities that may be ascribed responsibility.
  • Activities — Activities are how entities come into existence and how their attributes change to become new entities, often making use of previously existing entities to achieve this. They are dynamic aspects of the world, such as actions, processes, etc. For example, if the second version of document D was generated by a translation from the first version of the document in another language, then this translation is an activity.
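
The three concepts above can be sketched as plain Python data classes, using the document translation example. This is an illustration only, not a PROV library and not Nevermined code. The class layout, the field names (which loosely mirror the PROV relations "used", "was generated by" and "was associated with") and the `lineage` helper are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    id: str


@dataclass(frozen=True)
class Agent:
    id: str


@dataclass
class Activity:
    id: str
    used: list = field(default_factory=list)             # entities consumed
    generated: list = field(default_factory=list)        # entities produced
    associated_with: list = field(default_factory=list)  # responsible agents


def lineage(entity, activities):
    """List every entity that the given entity was derived from.

    Assumes the activities do not form a cycle.
    """
    ancestors = []
    for activity in activities:
        if entity in activity.generated:
            for source in activity.used:
                ancestors.append(source)
                ancestors.extend(lineage(source, activities))
    return ancestors


doc_v1 = Entity("document-D-v1")
doc_v2 = Entity("document-D-v2")
translator = Agent("translator")

# The translation activity uses version 1 and generates version 2.
translation = Activity(
    id="translation",
    used=[doc_v1],
    generated=[doc_v2],
    associated_with=[translator],
)

print([e.id for e in lineage(doc_v2, [translation])])  # ['document-D-v1']
```

Because every activity names what it used, what it generated and who was responsible, the history of an entity can be rebuilt by walking those links backwards.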

From Wikipedia:

“The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody and places of storage. The practice has a particular value in helping authenticate objects.”

Let’s put it all together with Nevermined

Nevermined is an Open Source solution, offering its users the ability to build data sharing ecosystems in which untrusted parties can share and monetize their data in a way that’s efficient, secure and privacy preserving.

Nevermined integrates with a blockchain to provide decentralized access control and data orchestration. Beyond that, the Nevermined Smart Contracts implement the W3C Provenance specification, which allows all the provenance information, digital signatures and fingerprints to be registered on-chain. This creates an open, transparent and unique source of truth for any data ecosystem where multiple parties need to collaborate toward a common goal.

The main benefits of this approach are:

  • A Single Standard — There is no “new” way to record things as Nevermined follows the W3C Provenance specifications.
  • Digital Signatures — These are recorded for each step in the data value chain, which makes it possible to identify anomalies and pinpoint precisely when they happened.
  • W3C Provenance Events — Events are recorded in a unique source of truth providing a complete lineage of “something”.
  • Modularity — Existing and independent Supply Chain Management (SCM) software components in use by different entities can be plugged into the network without modifications to make information available to specific users.
  • Transparency — The complete record of transactions is kept in an immutable and decentralized place that anyone involved with the right permissions can access.
  • Metadata Assets — Allows an automatic way of sharing digital assets and related metadata, like quality control checks, receipts, customs clearance documents, etc.
  • Analytics & Federated Learning — These capabilities can be leveraged across potentially all data from all ecosystem actors.
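
To illustrate why an append-only record helps pinpoint anomalies, here is a small sketch of a hash-chained event log in plain Python. It is a stand-in for an immutable ledger under simplifying assumptions: it is not the Nevermined Smart Contracts or their API, it leaves out signatures entirely, and the function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first event


def record_hash(record: dict) -> str:
    """Hash a record deterministically (sorted keys give a stable encoding)."""
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log: list, agent: str, activity: str, entity: str) -> None:
    """Append an event that commits to the hash of the previous event."""
    previous = log[-1]["hash"] if log else GENESIS
    body = {"agent": agent, "activity": activity,
            "entity": entity, "previous": previous}
    log.append({**body, "hash": record_hash(body)})


def first_broken_index(log: list):
    """Return the index of the first inconsistent event, or None if intact."""
    previous = GENESIS
    for index, event in enumerate(log):
        body = {k: v for k, v in event.items() if k != "hash"}
        if event["previous"] != previous or record_hash(body) != event["hash"]:
            return index
        previous = event["hash"]
    return None


log = []
append_event(log, "manufacturer", "produce", "vaccine-batch-42")
append_event(log, "carrier", "transport", "vaccine-batch-42")
append_event(log, "clinic", "receive", "vaccine-batch-42")

print(first_broken_index(log))  # None

log[1]["agent"] = "unknown-carrier"  # simulate tampering with the second event
print(first_broken_index(log))  # 1
```

Each event commits to the one before it, so a change to any past event shows up at a specific position in the chain. A blockchain adds decentralization on top, so that no single party can quietly rewrite the whole log.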

In a future blog post we will provide more details about how to manage all this provenance information, with code and examples :).

Originally posted on 2020-12-04 on Medium.