Part of our series: Collaborative Computing — How Web3 can be rocket fuel for the future of Data and AI
When Lawrence Lundy-Bryan, Research Partner at Lunar Ventures and a Nevermined collaborator, wrote ‘The Age of Collaborative Computing’, we were thrilled. We fully subscribe to the idea that we are at the beginning of a fundamental reshaping of how data and AI are used in the economy. And we certainly won’t argue with his prediction that Collaborative Computing is the next trillion dollar market.
So, to help us understand the challenge of establishing a new product category, we worked with Lawrence to create a series of interviews with experts in the data collaboration space who are buying, selling or investing in this vision.
We previously published 4 of Lawrence’s conversations, with Stijn Christiaens (Collibra), Rick Hao (SpeedInvest), Dr Hyoduk Shin (UC San Diego) and Jordan Brandt (Inpher). For this final episode, he hooked up with Flavio Bergamaschi, Director, Private AI and Analytics at Intel.
Conversation with Flavio Bergamaschi, Director, Private AI and Analytics, Intel
By Lawrence Lundy-Bryan, Research Partner at Lunar Ventures
Collaborative computing is the next trillion dollar market. We are at the beginning of a fundamental reshaping of how data is used in the economy. When data can be shared internally and externally without barriers, the value of all data assets can be maximized for private and public value.
To explore this vision in more depth, I spoke with Flavio Bergamaschi, Director, Private AI and Analytics at Intel. Highlights include:
- Why integrity is just as important as confidentiality
- Why the importance of crypto-agility is overlooked
- The 5 groups you need to convince to sell data collaboration software
Hi Flavio, your role at Intel is about turning R&D into products. What do you think about customer needs and the role privacy-enhancing technologies (PETs) play?
We are almost like an interface, working with the cryptographers at Intel on one side and customers on the other. We get needs and requirements from customers and then feed them back into the R&D work. It’s a strong feedback loop and we are careful to make sure what is being created solves customer needs. There are so many different customer needs that privacy tools can play some role in addressing. There isn’t really a killer app for PETs as such, because PETs are so broad. Right now we are seeing use cases where customers want to use privacy tools to make existing processes more private in some way.
The real value, and we aren’t there yet, is when we think about the sort of stuff that can’t be done today and can only be done with these tools. Things like data collaboration are probably somewhere in between, in that we can imagine what the environment has to look like, but the end-to-end solution with strong privacy guarantees isn’t here yet.
What is the most common barrier you encounter when speaking to customers about how they can use privacy tools?
It’s the complex nature of how all the software and hardware interact to achieve privacy guarantees. In most cases, customers want to solve specific problems like sharing personal data from one database to another, but you really have to understand the whole system to be able to guarantee confidentiality. This is challenging, primarily because vendors and customers aren’t used to thinking through their whole systems when buying a bit of software, even more so with software as a service and APIs connecting things together. So when you have to take a systems approach, you have to work with more stakeholders across more departments and it gets more complicated.
And that’s just the confidentiality side of things, obviously for many use cases we want to have integrity of computation, too, right?
Exactly, yes. This is where the hardware side of things comes into it. We can use homomorphic encryption, for example, to secure the confidentiality of the data, but somebody can still corrupt the processing. This is where a trusted execution environment like Intel SGX can come in and combine with homomorphic encryption. For cloud-based computing, we are basically outsourcing work and so we need assurances that the work is going to be performed as we asked.
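To make the gap between confidentiality and integrity concrete, here is a minimal sketch. It assumes a toy Paillier cryptosystem, which is only additively homomorphic and is swapped in here for brevity; the interview does not say which scheme Intel uses. The key sizes, function names and values are illustrative assumptions, and nothing here is safe for real use. The server never sees the plaintext, yet it can still alter the computation, and the decrypted result gives the client no hint that anything went wrong.

```python
import math
import secrets

# Toy key built from two well-known Mersenne primes; far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                           # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                                # modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt m with the public key N, using the generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c: int) -> int:
    """Decrypt with the private values LAM and MU."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % N2


# The client encrypts two values and outsources the sum.
client_values = [encrypt(120), encrypt(80)]

# An honest server returns the encrypted sum.
honest = add_encrypted(*client_values)

# A dishonest server folds in an extra value; it only needs the public key.
tampered = add_encrypted(honest, encrypt(1000))

print(decrypt(honest))    # 200
print(decrypt(tampered))  # 1200, a valid-looking but wrong answer
```

Encryption alone keeps the inputs hidden, but it cannot tell the client which of the two results came from the requested computation. That assurance has to come from somewhere else, such as the trusted execution environment described above.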
For private analytics then, and for our purposes, data collaboration, customers need to think about confidentiality and integrity. But there is a more prosaic challenge here, isn’t there: the management of cryptographic keys?
Yes, we can think about this as being foundational to using cryptography securely in any organization. The generation, distribution and management of cryptographic keys isn’t a trivial problem. This is closely tied to what is sometimes called crypto-agility, which depends on the security team having some sort of crypto inventory and being aware of all of the algorithms, keys, crypto libraries and protocols in use in their infrastructure and applications. Without sound foundations of key management, it’s challenging for customers to start using PETs and getting value from things like shared processing and data sharing.
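As a rough illustration of what a crypto inventory might look like in practice, here is a minimal sketch. The record fields, the example systems and the policy thresholds are all hypothetical, chosen for illustration; a real inventory and policy would come from the security team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CryptoAsset:
    system: str      # where the cryptography is used
    algorithm: str   # e.g. "RSA", "AES", "SHA-1"
    key_bits: int    # key size in bits (0 where not applicable, e.g. hashes)
    library: str     # library or provider implementing it


# Hypothetical policy, for illustration only.
DEPRECATED_ALGORITHMS = {"MD5", "SHA-1", "3DES"}
MINIMUM_KEY_BITS = {"RSA": 2048, "AES": 128}


def needs_migration(asset: CryptoAsset) -> bool:
    """Flag assets using a deprecated algorithm or an undersized key."""
    if asset.algorithm in DEPRECATED_ALGORITHMS:
        return True
    return asset.key_bits < MINIMUM_KEY_BITS.get(asset.algorithm, 0)


inventory = [
    CryptoAsset("billing-api", "RSA", 1024, "OpenSSL"),
    CryptoAsset("data-lake", "AES", 256, "OpenSSL"),
    CryptoAsset("legacy-reports", "SHA-1", 0, "custom"),
]

to_migrate = [asset.system for asset in inventory if needs_migration(asset)]
print(to_migrate)  # ['billing-api', 'legacy-reports']
```

With even a simple record like this, a team can answer "where are we still using this algorithm?" in one query, which is the precondition for swapping algorithms or adopting PETs without a long audit first.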
Okay, so customers are thinking about the basics of key management and then confidentiality and integrity of data processing. That seems like it would touch a lot of people’s roles in a large organization. Who do you need to convince when selling data collaboration tools?
As always it depends on the particular product. The best thing to do is to go through the workflow and see who is responsible at each stage. Obviously the key role here is Chief Information Security Officers (CISOs), because we are talking about cryptography and security. CISOs manage risk and even though these tools have potential value-generation implications, you won’t get anywhere without CISO buy-in.
Some firms will have Chief Data Officers (CDOs) and they are the people who will think more about the opportunities around data sharing and monetization.
Below the C-suite you have 5 groups: data custodians, security groups, infrastructure, line-of-business, and R&D. Each of these teams will be involved in a decision around data collaboration, so it can be a hard sell. But, as mentioned, we are seeing more and more firms becoming crypto-agile and streamlining processes so they can innovate around data. We can therefore expect the sales cycle to shorten as the number of stakeholders required is reduced.
Okay, let’s go “big picture”. What cultural, technical or social change would be required for demand in data collaboration to increase 10/100x?
Assuming we continue to work on the performance side of things, like accelerators for homomorphic encryption, then the assumption is that, with some change, millions more people will demand to use private analytics or AI compared to non-private versions. I think we are looking at two drivers: behavioral change and regulation. Users are already becoming privacy-conscious so that’s not a new thing, but what’s been lacking are alternatives to switch to.
So what you’re saying is that there is some latent demand for data collaboration tools, so when the tools are available, then you’ve got your 10/100x increase?
Yes, I think so. Much of this isn’t direct-to-consumer, it’s B2B, but now companies are using privacy tools to be socially responsible and have some PR gains. This seems to be ramping up.
As for regulation, maybe we haven’t seen it ripple through the industry yet, because performance is lagging regulation in this case. GDPR and the CCPA are just the first; we have privacy regulations all around the world and many of them require firms to exhaust all technical solutions to protect privacy. Everyone is exploring all viable technologies in a way they maybe wouldn’t have done a few years ago. For many companies it is going to be easier to roll out PETs in products and infrastructure, simply to reduce risk.
When thinking about helping companies utilize their data, a sensible framework is: governance, sharing, and monetization. It feels like 95% of companies investing in their data infrastructure are still on data governance, maybe 5% are finding ways to share internally, and <1% are even thinking about monetization yet. Does this sound right to you?
All three are topics of discussion today; how they split, though, is unclear. It’s obvious that governance is the bit that needs to be fixed first, and this is wrapped up with other efforts like entity resolution and master data records. Companies first have to organize their data assets before they can utilize them.
There is a pretty big gap between organizing the data, permissioning it and then sharing it. Think of a data analyst who has been tasked with finding out some information and who thinks some data they don’t have access to might be useful. First they ask for permission. This might need a couple of levels of authorization, and they will have to detail exactly what they are using the data for, even though the analyst might not know that until they run the analysis. Even if they get access, there will be rules around when and how the data can be accessed, especially if it contains sensitive or personal data. There might be secure environments they have to work in, or even air-gapped computers in some cases. There will be a lot of friction, and then a high likelihood that the data isn’t actually useful. Imagine that process playing out across thousands of companies every day. So data sharing, even internally, is very hard.
As mentioned, few organizations know they can address this problem with PETs. You could design a process in which a sample of data is provided to the data analyst in a privacy-enhanced way, just to test the analysis. The analyst can then test across as many datasets as they need to figure out which datasets they need permissions for. You can also extend the process externally, using something like private set intersection. That kind of external sharing is rare today because the legal costs are too burdensome.
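For readers unfamiliar with private set intersection, here is a minimal sketch of one classic approach, a Diffie-Hellman-style protocol in which each party blinds hashed items with its own secret exponent. The interview does not name a specific protocol, so this choice is an assumption. The modulus, function names and example data are also illustrative, both parties are simulated in one function, and the parameters are a toy that is not safe for real use.

```python
import hashlib
import secrets

# Toy modulus (the Mersenne prime 2**127 - 1); not a vetted group for real use.
PRIME = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Hash an item to a number mod PRIME, squared to land in one subgroup."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return pow(int.from_bytes(digest, "big") % PRIME, 2, PRIME)


def new_secret() -> int:
    """Pick a random secret exponent in [2, PRIME - 2]."""
    return secrets.randbelow(PRIME - 3) + 2


def blind(values, secret):
    """Raise every value to the secret exponent mod PRIME."""
    return [pow(v, secret, PRIME) for v in values]


def private_set_intersection(items_a, items_b):
    """Simulate both parties; returns the items of A that B also holds."""
    key_a, key_b = new_secret(), new_secret()

    # A -> B: A's items, hashed and blinded with A's key.
    a_once = blind([hash_to_group(x) for x in items_a], key_a)

    # B -> A: A's values blinded again (same order), plus B's own blinded items.
    a_twice = blind(a_once, key_b)
    b_once = blind([hash_to_group(y) for y in items_b], key_b)

    # A blinds B's values with its own key. Exponentiation commutes, so an
    # item held by both parties ends up with the same doubly blinded value.
    b_twice = set(blind(b_once, key_a))
    return {x for x, v in zip(items_a, a_twice) if v in b_twice}


company_a = ["alice@example.com", "bob@example.com", "carol@example.com"]
company_b = ["bob@example.com", "dave@example.com", "carol@example.com"]
print(sorted(private_set_intersection(company_a, company_b)))
# ['bob@example.com', 'carol@example.com']
```

Neither side ever sends raw items, only blinded values, so each party learns which of its own records overlap and nothing else about the other's list (in a properly parameterized version of the protocol).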
And then the <1% monetization, I don’t know. This feels like a different problem, one with bigger non-technical questions around ownership of data. Right now, collectors of data appear to have the right to package it up and sell it, but it feels like that is shifting somehow. But we are in the very early stages of this.
Finally, it feels like the data consolidation model is coming to an end. Do you think data federation is the future?
That is probably too simplistic. It will depend on the application requirements and constraints. Some use cases will require extremely low latency and so it makes sense to do as much of the computing as close to the edge as possible. But others will still need data consolidation for efficiency and cost reasons.
Certainly with federated learning we have tools that mean we can do more learning at the edges which has efficiency and security benefits, but federated learning is the frontier. We still have lots of organizations moving to the cloud and using data warehouses or lakes, and that trend isn’t going away.
So maybe the question should be different: what do you want to protect when you want to process distributed data? Each node might have different policies, for example, so it’s not the case that distributed processing is necessarily more secure or private. If we are talking about personal smartphones or edge devices, these nodes will have been configured by people with highly variable levels of expertise. There is no general principle to say distributed computing is more secure or more private than centralized computing. It will depend on the configuration of the overall system, the requirements of the application, and the threat you are protecting against.
This interview has been edited for clarity. It was the last of the 5 interviews that Lawrence did. You can find the other 4 here:
- Jordan Brandt (Inpher)
- Stijn Christiaens (Collibra)
- Rick Hao (SpeedInvest)
- Dr Hyoduk Shin (UC San Diego)
As always, we’d love to hear from you. Got any questions, suggestions or want to chat about your project? Contact us via the website or join our Discord.
Originally posted on 2023-02-09 on Medium.