As many people know, data access is one of the main barriers to applying powerful supervised learning models in critical areas such as healthcare. Without strategies to ensure privacy, that data will remain locked away. But new strategies are emerging that allow machine learning models to train on that data without ever really seeing it.
Casimir Wierzynski, a senior director in Intel's AI products group, took me on a tour of the latest strategies that promise to unlock the data necessary to liberate AI. Cas also talked about hardening encryption against the code-cracking power of quantum computers.
CASIMIR WIERZYNSKI:
My current role at Intel is in the office of the CTO for the AI products group. I come from a family of humanists, so it's kind of funny that I'm in a tech role now. My mom's an attorney, my dad's a journalist. Actually that journalism background kind of informed some of the privacy work that I've gotten into recently.
These AI systems have so much promise, we want to unlock all of this potential and find all this good stuff in the data around us. AI systems are fundamentally shaped by data. But these data are increasingly private and sensitive. How do you kind of reconcile those two?
One of the things that motivates me about this whole field of privacy is that there's a human right around privacy. My dad came from Poland. He left Poland in 1945, first to Europe and then to the U.S. But I still had cousins and uncles and so on in Poland, during martial law. I remember people had to register their typewriters with the government.
So, I lead a team right now that specializes in this emerging area called "Privacy-Preserving Machine Learning". And under that privacy-preserving machine learning umbrella, there is a set of technologies that can be used to get insights from data without explicitly having to look at the data in detail.
There are two reasons why you'd want to do that. One is basic security and privacy, and making sure that you don't have breaches of data and so on. But another reason is that now you can enable entirely new applications. Say that you had a group of hospitals that wanted to share their data in some way in order to build a more reliable detector of brain tumors in MRI scans. The statistics are such that the more data you have, the more accurate you can be. But then clearly there's an issue with sharing patient data.
So there are cases where people want to pool data, but for excellent reasons they can't. Now, using these privacy-preserving techniques, you can unlock that whole space. And it's not just healthcare, it's also financial services, if you wanted to detect nefarious activity, fraud, things like that. Banks clearly want to do this; they clearly want to operate collectively to do that. But then they have these very real privacy concerns around data.
It's an emerging field, but there are already some core technologies, and they're at various stages of readiness for the real world. So let's say you want to pool data; the example I just talked about where people want to collectively operate on data without explicitly having to share it. There are techniques for that called federated learning and multi-party computation. So that's one bucket. Another bucket is this idea that you compute on encrypted data: can one party receive encrypted data from somebody and, without ever decrypting it, do some kind of math on those data, and then give an answer back, even though they never saw the underlying data? That technique is the most magical in my mind; it's called homomorphic encryption. There is another important set of techniques around privacy. If you look at a dataset and you build some kind of statistical model that learns about this dataset, you don't want to overlearn in some sense; you don't want to memorize individual details from that dataset. You actually want to extract the bigger regularities out of it. Differential privacy is a technique that helps you achieve that.
We can talk about each of those in turn, starting with federated learning. It's actually kind of straightforward, although it's cool that it works. So imagine you have several people who hold data and it's private; we'll call them the Federation. It sounds a little bit like Star Trek. This is your Federation, and you start with some initial blank-slate version of a machine learning model. Every member of the Federation gets that blank slate. Then each of them, using their private data, which never leave their premises, figures out what adjustments would make the model work better on their specific data. Each of them figures out how to change the model to make it work better for them. They share all those updates with some central coordinator, and the coordinator basically adds all those suggested updates together. Now you get a new candidate model, and that new candidate model again gets shared with all the members of the Federation, and then you iterate again. They compute another set of adjustments. They go back and forth over time. What you end up with is a model that works well on everyone's data. It's as if you had worked with the pooled dataset, but you never worked with any of the data directly.
What you're transferring back and forth is not the entire model but its parameters. And because some of these models have hundreds of millions of parameters, and in each round the suggested updates may only touch a small proportion of those parameters, you only send the deltas; the changes.
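To make this loop concrete, here is a minimal sketch of federated averaging in Python with NumPy. The toy linear model, the three simulated institutions, and the simple delta-averaging rule are illustrative assumptions, not a description of any particular production system.

```python
import numpy as np

# Toy "model": a single weight vector. Each Federation member holds
# private (X, y) data that never leaves its premises.

def local_update(weights, X, y, lr=0.1, steps=10):
    """One member computes an adjustment (delta) to the shared model
    using only its own data, via a few steps of gradient descent on
    a least-squares objective."""
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - weights  # only the delta is sent back

def federated_round(weights, members):
    """The coordinator averages the suggested deltas and applies them."""
    deltas = [local_update(weights, X, y) for X, y in members]
    return weights + np.mean(deltas, axis=0)

# Three hypothetical institutions with differently distributed private data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
members = []
for shift in (0.0, 1.0, 2.0):
    X = rng.normal(shift, 1.0, size=(100, 2))
    y = X @ true_w + rng.normal(0, 0.1, size=100)
    members.append((X, y))

weights = np.zeros(2)  # the initial "blank slate" model
for _ in range(50):
    weights = federated_round(weights, members)

print(weights)  # approaches true_w without pooling any raw data
```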
We did a study about a year ago with a doctor at the University of Pennsylvania, a radiologist there. There's a standard dataset called BraTS, which is a bunch of images. The task is to try to segment brain tumors from MRI images. Those images were collected at different institutions. So if we pretend we had trained this model in a federated way, taking all the examples from one institution and putting them in one bucket, bucketizing each institution, we can explicitly test this idea of whether federated training gets you close to the same performance as just having all the data. And in that case it worked extremely well. There are still some interesting research ideas around this. The challenge happens when different members of the Federation have wildly different data. Then this question becomes more acute and you want to test it a little bit more. We've done some research on my team to come up with strategies where you adaptively change the rate at which you do this updating process to adapt to the fact that there may be very big differences across institutions.
Let's say you had a single hospital and they have enough data to serve their own needs. Even there, there's a really good reason to expand your dataset by doing federated learning, because you can make sure that the model that you have actually generalizes to the underlying task and not something spurious.
Now, homomorphic encryption is a way of encrypting data. Following the Greek etymology, homomorphic means having the same shape. So when you move data from the unencrypted world into this encrypted space, the data still have the same shape relative to each other. You're still preserving some structure among the data relative to each other, but you're of course obscuring what the actual data are, because it is encryption. So in particular, a single number in the plain world becomes, in this encrypted world, a very high-order polynomial. You remember from high school, polynomials are things like 2x + 3x² - 2x³, except in this case the powers of x go all the way up to, let's say, 4,096. You have these very large polynomials in the encrypted world.
If you add two numbers in the encrypted world and then bring them back into the real world, you get the sum of the two things that you brought over. So it's a way that you can operate in the encrypted world, in mathematical ways that correspond to operations in the real world, except you never actually see the underlying data. Let's make that concrete. A hospital takes a scan, and they use their private key to encrypt this scan. Now it's a bunch of polynomials. They send it to some cloud-based radiology service. That radiology service is operating purely on polynomials; they have no idea what the underlying data are. The polynomials come back to the hospital, the hospital uses its key, and only they can unlock this thing and get back the diagnosis that they were looking for.
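The scheme described here is lattice-based, with those very large polynomials, but the compute-on-ciphertexts idea is easiest to see in a toy example. The sketch below uses the Paillier cryptosystem instead, a much simpler, additively homomorphic scheme with deliberately tiny illustrative primes: two numbers are added in the encrypted world and the sum is recovered by the key holder, who never exposes the inputs.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic, so multiplying two
# ciphertexts corresponds to adding the underlying plaintexts.
# (Illustrative only: tiny primes, and not the lattice-based schemes
# used for real homomorphic encryption.)

p, q = 61, 53                  # small primes for demonstration
n = p * q
n_sq = n * n
lam = math.lcm(p - 1, q - 1)   # Carmichael function of n
g = n + 1                      # standard simple choice of generator
mu = pow(lam, -1, n)           # modular inverse of lambda mod n

def encrypt(m):
    """Encrypt integer m (0 <= m < n) under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Decrypt with the private key (lam, mu)."""
    L = (pow(c, lam, n_sq) - 1) // n
    return (L * mu) % n

a, b = 123, 456
ca, cb = encrypt(a), encrypt(b)

# "Addition" in the encrypted world is multiplication of ciphertexts.
c_sum = (ca * cb) % n_sq

assert decrypt(c_sum) == a + b   # 579, recovered without ever seeing a or b
print(decrypt(c_sum))
```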
For differential privacy, let's think of the example where you have predictive text on your phone. You start to type a word and then your phone predicts what the next word might be. For that kind of prediction model, you take a bunch of text message data and you build a model that says, okay, if the previous word is ‘apple,’ then the next one may be ‘pie.’ The problem is that if you have in your dataset people saying, ‘my credit card number is,’ and then the very next thing they say is their credit card number, you don't want that level of granularity to show up in your model. You don't want somebody else typing ‘my credit card number is’ and then suddenly out pops your number. To avoid that, we use differential privacy. Some people call it fuzzing the data: adding a little bit of noise to the training data. That forces the machine learning model to learn the overall statistics of English and basic syntax, but it won't have enough statistical power to see tiny details like individual people's information. That's an example of differential privacy.
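As a small illustration of the fuzzing idea, here is a sketch of the classic Laplace mechanism applied to next-word (bigram) counts. The counts, the sensitivity of one, and the epsilon value are made-up assumptions, chosen to show that common patterns survive the noise while a one-off, person-specific entry does not stand out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts of "previous phrase -> next word" pairs from text messages.
bigram_counts = {
    ("apple", "pie"): 5000,           # common pattern: survives the noise
    ("good", "morning"): 12000,
    ("my credit card number is", "4111111111111111"): 1,  # one person's secret
}

def laplace_release(counts, epsilon=1.0, sensitivity=1.0):
    """Release each count with Laplace(sensitivity / epsilon) noise added.
    Assuming one person's message changes any count by at most `sensitivity`,
    this gives an epsilon-differential-privacy guarantee."""
    scale = sensitivity / epsilon
    return {k: v + rng.laplace(0.0, scale) for k, v in counts.items()}

noisy = laplace_release(bigram_counts, epsilon=1.0)
for pair, value in noisy.items():
    print(pair, round(value, 1))

# The large counts barely change, so the model still learns common English.
# The count of 1 is comparable to the noise, so an observer cannot confidently
# tell whether that single credit-card message was in the dataset at all.
```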
There are probably two or three dozen papers out there on aspects of how to get the most utility out of some model while still preserving the privacy that you need. There is going to be a bit of a trade-off between privacy and utility in this space. Although I think in practice people are finding that there is a sweet spot in a sense where some amount of fuzzing can make the model better because one of the overall goals of machine learning is generalization. The concept of generalization is completely in line with this idea of not overly memorizing different parts of your data set because those are not germane to the task that you're training for.
There is this thing called an extraction attack and there's actually a nice confluence here with homomorphic encryption. Machine learning models can actually memorize or hold data in the hidden layers that can be extracted after the training. You can think of two scenarios. One is where the attacker actually has access to the model where they can see all parameters of the model and how it was built and so on. That's called a white box attack. And then there's another version of this where the attacker can use the model, they can send stuff to it and get answers back, but they can't look inside to see how it was built. Extraction attacks are possible in both of those scenarios, but they're a lot easier if you actually have access to the details of the model. So homomorphic encryption can actually be a way to protect the details of the model. The example that I gave with homomorphic encryption before was where I was trying to protect the privacy of the data that I fed to the model. But you could also turn it around, you could actually encrypt the model and thereby protect the confidentiality of the model itself.
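To make that flip concrete, here is a sketch in which the model owner encrypts the weights of a small linear model so an untrusted party can score data against them without ever seeing them. It assumes the third-party python-paillier package (`phe`), which supports adding ciphertexts and multiplying a ciphertext by a plaintext number; the weights and features are made up for illustration.

```python
# Sketch: protect the *model* with an additively homomorphic scheme.
# Assumes the python-paillier package: pip install phe
from phe import paillier

# Model owner: generates keys and encrypts the weights of a small linear model.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
weights = [0.7, -1.2, 0.05]
bias = 0.3
enc_weights = [public_key.encrypt(w) for w in weights]

# Untrusted party: has plaintext features and only the *encrypted* weights.
# Ciphertext * plaintext and ciphertext + ciphertext are allowed, so it can
# compute an encrypted score, but it never sees the weights.
features = [1.0, 2.0, 40.0]
enc_score = enc_weights[0] * features[0]
for ew, x in zip(enc_weights[1:], features[1:]):
    enc_score = enc_score + ew * x
enc_score = enc_score + bias   # adding a plaintext constant is also allowed

# Model owner: decrypts the score. The scoring party learned nothing about
# the weights, and nothing about the result either.
score = private_key.decrypt(enc_score)
print(round(score, 4))   # 0.7*1.0 - 1.2*2.0 + 0.05*40.0 + 0.3 = 0.6
```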
Differential privacy addresses a slightly different problem than homomorphic encryption, in the sense that differential privacy addresses the ability to tie a specific set of information to a specific individual. Homomorphic encryption and federated learning are more about confidentiality. The distinction is subtle, but they are actually different problems, and we just went through an example where you might want to do both: you might want to use homomorphic encryption to prevent whoever is training on the data from seeing what the underlying data are, but then you'd want to use differential privacy to make sure that the model hadn't learned details of the training set.
I think of all this as kind of like the Robert Frost poem: ‘good fences make good neighbors.’ If we can make nice mathematical guarantees that I've done this machine learning operation and the amount that you can learn about individuals in the data is mathematically bounded, I feel like this is the right thing to do. Just as when you go to the supermarket and see the list of ingredients on the box of cereal, that gives me confidence that I can go out and buy any box of cereal and know that it's going to have certain properties. I think this is the foundational work that AI will need in order to grow.
My vision, and in some ways my hope, is this: remember in the early days of the web, you'd type HTTP in front of amazon.com and you'd fill in your credit card number. And then after a while people said, ‘Hey, you know what, we probably shouldn't be sending credit card numbers around.’ And so for certain very sensitive things, they developed this thing called HTTPS, a secure web protocol. And then people gradually got trained to look for the lock on the browser that indicates it’s HTTPS. But it was only just a few pages that would be guarded that way. And then after a while people said, ‘Hey, if you can secure this one page, why don't you just secure all the pages?’
So, almost everything now is HTTPS. I feel like, for machine learning, the idea that people are going to be operating on raw data is just going to seem quaint and weird and slightly indecent.
To get to that HTTPS-like point, you need a couple of things. You need to make all these technologies more usable, so that the people who are doing the data science don't have to worry about how big the polynomials have to be in some weird space. And you also need some level of interoperability and industry consensus around the underlying technologies. That is starting to happen for some things like homomorphic encryption. People on my team are working with other industry partners and with the right standards bodies to get that process started, to standardize aspects of homomorphic encryption. There are already efforts in other standards bodies around federated learning. So I think this will come together.
One thing that we haven't talked about yet is the fact that some of these techniques require extra compute. They are computationally intensive. Intel has seen this situation before, in the early 2000s, when encryption became much more commonplace. There was an encryption standard called AES, and we added new instructions to the Intel processor line to accelerate that particular part of the encryption scheme. So we've done that before, and I feel like for some of these protocols we will probably do it again. We'll need to provide hardware support to speed up these very specific calculations.
Trusted execution environments are a way to dedicate a certain region of the memory that you're working with on your computer to stay encrypted. Whenever the processor needs to access that memory, it's going to access it only in its encrypted form. And then once it reaches the inner sanctum of the processor, only then do we decrypt it, do some kind of operation, and then very quickly re-encrypt it. That's a way to speed up this kind of protected computation and make it much more efficient. It's definitely part of the toolkit that we can use.
You actually could use both, a kind of belt-and-suspenders approach, where you have encryption inside one of these trusted enclaves. Let's say two different parties have intellectual property that they want to protect. One party owns a model, and the other party has sensitive data that needs to run through that model. You could use homomorphic encryption to protect the patient scan, let's say, and then you could use the enclave to protect the model. So now you're protecting two different parties at the same time.
The crypto community is looking very closely at quantum computers and how to handle them. There's actually a process at NIST, which used to be known as the National Bureau of Standards, where they are looking at what the recommended new cryptography systems should be, to make them so-called post-quantum, or quantum resistant.
Homomorphic encryption belongs to a family of cryptography called lattice-based crypto schemes, and some of those are post-quantum. There are people at Intel who are actively suggesting to NIST what the next adopted standard should be, and some of the candidates are lattice-based.
These technologies exist in a world subject to economic laws, so federated learning could be a way for people who have private data silos to monetize those data without explicitly sharing them with anyone. That's a very interesting possibility, because then it creates markets, and, as a former trader, I love market mechanisms for allocating resources. I think that would be a fantastic development for the field.
The context of my thoughts is mostly around immediate customer problems that we're seeing, so we haven't gotten to the level of individuals and whether they should sell their individual data or not. There are already data-sharing agreements in place, looking at healthcare, where a pharma company wants data from a hospital, and they get armies of lawyers together and come up with some very large check. It's a pretty complicated process. Facilitating those kinds of business-to-business interactions would be a really great place to start. I know Jaron Lanier has also been in the New York Times talking about individuals selling their data. That's actually a pretty complicated topic. There are plenty of commercially and societally important business-to-business cases that we'd like to work on sooner.
This post is adapted from the second half of Episode 32 of the podcast Eye on AI.