Data will become the bottleneck for AI. How ironic.

Data is the primary rate-limiting factor in the progress of Artificial Intelligence. This is our strong belief and one of the motivations for building Nevermined. So academic research stating that the “stock of high-quality language data will be exhausted soon; likely before 2026” provides huge validation of our value proposition.

Their recommendation for preventing this gloomy scenario? Improve data efficiency or make new sources of data available.

Two, maybe three, more years before we run out of data.

Let’s unpack this.

The flippening will happen

Currently, computational capacity and access to expertise are two of the biggest factors limiting AI development. LLMs require massive data centers packed with specialized processors, like GPUs. And the expertise for building the infrastructure and fine-tuning the models is expensive. Yet history shows that, over time, the cost of both compute and expertise will decrease.

With data, the trend goes the other way. Deep learning models, and large neural networks like LLMs in particular, benefit significantly from exposure to ever larger training datasets. Over time, however, these training datasets approach a ceiling. As the study shows, at some point soon all models will have scraped all available public data. At that point, AI models will converge to similar results, and their ability to differentiate themselves from other, equally large LLMs will diminish.

So instead of computation and expertise, data will become the bottleneck for AI. How ironic.

Turning Data silos into AI goldmines

To sidestep this convergence issue, access to relevant walled data becomes critical. To increase their relevance, their appeal and their commercial value, AI models will need access to data that is not publicly available. The truly relevant data is stored in company databases, in personal healthcare records and in private data repositories. Behind walls.

With this thesis as a foundation, Nevermined has set out on a mission to transform siloed data into AI goldmines. To do this, our team has taken multiple novel approaches to addressing the barriers to data sharing and incentivization. The benefits are threefold: 

  • Better Precision via Access Control
  • High-Fidelity Trust and Provenance
  • Easier Monetization & Incentive Design

Precision AI needs Precise Data

With Nevermined’s blockchain-based technology, an Asset Publisher can register and tokenize any data asset or data-processing workflow. In the process, the Publisher also defines the rules for access, which are consolidated into a smart contract.

Once published, a User (e.g. an AI model) can gain access to these assets, based on the rules set by the Asset Publisher. These rules are evaluated on whether the User holds the correct access token and on the access conditions placed on that token. This token-gated access control can be managed in a decentralized manner.
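
To make this concrete, here is a minimal sketch of what a token-gated access check can look like. It is illustrative only, not Nevermined’s SDK or smart contract code, and the names (AccessCondition, AccessRequest, canAccess) are hypothetical.

```ts
// Hypothetical sketch of a token-gated access check. Names and fields are
// illustrative assumptions, not part of any specific SDK or contract.

interface AccessCondition {
  expiresAt?: number;         // Unix timestamp after which the token is invalid
  maxUses?: number;           // how many times the token may be redeemed
  allowedPurposes?: string[]; // e.g. ["training", "inference"]
}

interface AccessRequest {
  userTokenBalance: number;   // how many access tokens the User holds
  usesSoFar: number;
  purpose: string;
  now: number;
}

function canAccess(cond: AccessCondition, req: AccessRequest): boolean {
  if (req.userTokenBalance < 1) return false;                                    // must hold the access token
  if (cond.expiresAt !== undefined && req.now > cond.expiresAt) return false;    // token expired
  if (cond.maxUses !== undefined && req.usesSoFar >= cond.maxUses) return false; // usage quota spent
  if (cond.allowedPurposes && !cond.allowedPurposes.includes(req.purpose)) return false;
  return true;
}

// Example: an AI model requesting access for training.
const ok = canAccess(
  { maxUses: 100, allowedPurposes: ["training"] },
  { userTokenBalance: 1, usesSoFar: 3, purpose: "training", now: Date.now() }
);
console.log(ok); // true
```

In a real deployment these checks would be evaluated by the smart contract rather than by off-chain code, which is what makes the access control decentralized.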

This approach matches the AI industry’s trend towards Transfer Learning and Precision AI, where secure access to walled data is critical. By making a smaller, relevant context available to a larger LLM, model inference can effectively be filtered to produce more fine-tuned responses. Nevermined enables federated compute on relevant, decentralized datasets, so you can run far less invasive and energy-consumptive AI learning.
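
As a rough sketch of what “making a smaller context available to a larger LLM” can look like in practice, the snippet below fetches a permissioned slice of a walled dataset and prepends it to the prompt. Both fetchPermissionedSlice and callModel are placeholder functions standing in for whatever gated-data and inference APIs are actually used.

```ts
// Illustrative only: the gated-data and model APIs below are placeholders.

async function fetchPermissionedSlice(assetId: string, accessToken: string): Promise<string> {
  // In a real system this call would pass through the token-gated access check
  // described above and return only the records the token entitles the caller to see.
  return `Summary statistics for asset ${assetId} (accessed with ${accessToken})`;
}

async function callModel(prompt: string): Promise<string> {
  // Placeholder for an LLM inference call.
  return `Answer grounded in the supplied context: ${prompt.slice(0, 60)}...`;
}

async function answerWithWalledContext(question: string, assetId: string, accessToken: string) {
  const context = await fetchPermissionedSlice(assetId, accessToken);
  const prompt = `Use only the context below to answer.\n\nContext:\n${context}\n\nQuestion: ${question}`;
  return callModel(prompt);
}

answerWithWalledContext("What is the average readmission rate?", "asset-123", "token-abc")
  .then(console.log);
```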

AI needs Provenance and Attribution

Provenance integrity and data lineage remain significant challenges in the development of large-scale data estates. At the core, you’re trying to answer four simple questions:

  1. Where is the data coming from?
  2. Where is the data going?
  3. Who (or what) is using the data?
  4. What are they doing with the data?

Answering these questions with confidence remains difficult. There is no silver bullet, but blockchain does provide an elegant solution. By recording transactions immutably, and exposing them to all actors in an ecosystem, blockchains bring novel transparency to the usual opaqueness of data lakes and estates. Solutions built on Nevermined can trace assets from source to destination and record how the data was used, with detailed provenance records.
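
To make the idea tangible, a provenance record only needs to capture answers to the four questions above. The shape below is a minimal, hypothetical illustration rather than Nevermined’s actual on-chain schema, and the in-memory log stands in for an immutable ledger.

```ts
// Minimal sketch of an append-only provenance log (illustrative data shape only).

interface ProvenanceRecord {
  assetId: string;      // where is the data coming from?
  destination: string;  // where is the data going?
  actor: string;        // who (or what) is using the data?
  action: "accessed" | "trained_on" | "derived" | "published"; // what are they doing with it?
  timestamp: number;
}

class ProvenanceLog {
  private records: ProvenanceRecord[] = [];

  // On-chain this append would be an immutable transaction; here it is a simple array push.
  record(entry: ProvenanceRecord): void {
    this.records.push(entry);
  }

  // Trace an asset from source to destination by replaying its history.
  trace(assetId: string): ProvenanceRecord[] {
    return this.records.filter(r => r.assetId === assetId);
  }
}

const log = new ProvenanceLog();
log.record({ assetId: "asset-123", destination: "model-7", actor: "example-llm", action: "trained_on", timestamp: Date.now() });
console.log(log.trace("asset-123"));
```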

But there’s more. With high-fidelity provenance comes the ability to establish fine-grained attribution: which actor did what? And lack of attribution is part of AI’s data problem. Currently, LLMs trained on publicly available source material fail to disclose how a model response was inferred, and therefore cannot acknowledge which source training material was used to create the response. This becomes problematic when the source material is copyrighted or subject to licensing restrictions. Because current models cannot track and trace this source material, they fail to properly attribute publishers.

AI needs better monetization

Provenance and attribution create the basis for monetization. Granular attribution can improve payments such as royalties, residuals and rewards. This makes it possible to commercialize datasets (and AI services), as well as to design more advanced incentive schemes.

In a Data and AI ecosystem powered by Nevermined, LLMs will acquire datasets relevant to the use cases they intend to address. Above, we’ve described how access conditions are baked into the dataset. These access conditions can obviously be monetary: for example, the AI could simply pay to access the dataset and use it in its training set. Nevermined also offers the ability to monetize AIs by token-gating the web service. So as these systems evolve, we will see more advanced commercial models take root.

Because an AI Agent can generate its own revenue (by selling access), it may negotiate a lower access payment to the Data Publisher in return for royalties: a percentage of the revenue the AI accrues from its usage and/or output. Now we have a system in which the Publisher receives proper attribution, via royalties, relative to how well the AI performs in the future.
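
As a toy illustration of that kind of deal (all figures and names are made up, not a Nevermined pricing model), the Publisher’s earnings are simply a reduced upfront fee plus a share of the Agent’s future revenue:

```ts
// Toy royalty split: the AI Agent pays a discounted upfront access fee plus a
// percentage of its future revenue to the Data Publisher. Figures are illustrative.

interface RoyaltyDeal {
  upfrontFee: number;  // discounted access payment made to the Publisher today
  royaltyRate: number; // fraction of the Agent's future revenue owed to the Publisher
}

function publisherEarnings(deal: RoyaltyDeal, agentRevenue: number): number {
  return deal.upfrontFee + deal.royaltyRate * agentRevenue;
}

// The Agent pays 100 upfront instead of, say, 500, plus 5% of whatever it later earns.
const deal: RoyaltyDeal = { upfrontFee: 100, royaltyRate: 0.05 };
console.log(publisherEarnings(deal, 10_000)); // 100 + 500 = 600
```

The better the AI performs, the more the Publisher earns, which is the alignment between data owners and AI builders described above.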

Final thoughts

While we certainly subscribe to the academics’ view that AIs will soon run out of data, we see this as an opportunity for Web3 tech to come to the fore and help solve these core issues. We believe that access, provenance and monetization will be key enablers for a booming and aligned Data & AI Economy.

Questions, ideas, coffee? Get in touch via the website or join our Discord.