Terry Sejnowski is one of the pioneers of deep learning who, together with Geoffrey Hinton, created Boltzmann machines: a deep learning network that has remarkable similarities to learning in the brain. I recently had a conversation with Terry after reading his wonderful book The Deep Learning Revolution. We talked about the convergence between deep learning and neuroscience, and whether machines dream.
I was trained as a theoretical physicist, but was fascinated by the brain. We were doing simulations with computers that were really puny compared to today's computers. Machine learning was in its infancy back then.
Geoff and I got into neural networks that were bio-inspired from how the brain is organized. But these were much simpler in terms of connectivity. They were just simple nonlinear functions. But they were sufficiently complex that we were trying to use them to solve complex problems in vision.
Vision seems deceptively simple because when we open our eyes in the morning, we see objects. It seems like it doesn't take any effort, any thought. That machinery was opaque. We didn't understand it. We had no concept back then of the degree of complexity, of how much computing power the brain has.
In the brain, the cerebral cortex over the surface of the brain represents the world and all of your plans and actions. It's the highest level of processing in the brain. There was an outstanding problem, which was how do you learn in a system that has that complexity with all those layers.
Our goal was to try to take a network with multiple layers - an input layer, an output layer and layers in between - and make it learn. It was generally thought, because of early work that was done in AI in the 60s, that no one would ever find such a learning algorithm because it was just too mathematically difficult.
And that's when Geoff and I invented the Boltzmann machine with an architecture inspired by physics. What made it different from all the other architectures that were being considered at the time was that it was probabilistic.
In the models available at the time, the input goes through a series of stages in a deterministic way. But instead of automatically getting the same output, we thought maybe we can make progress if each unit had a probability to have an output that varied with the amount of input that you're giving it. So, more input, the probability gets higher that it's going to produce an output. And if the input is low, the probability of an output is low. That introduced a degree of variability.
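A minimal sketch of such a probabilistic unit is shown here. The interview does not give a formula, so the logistic function and the `temperature` parameter are assumptions taken from standard descriptions of Boltzmann machines, and the function names are made up for illustration.

```python
import math
import random

def fire_probability(total_input, temperature=1.0):
    """Probability that a unit turns on; it rises smoothly with the input."""
    return 1.0 / (1.0 + math.exp(-total_input / temperature))

def sample_unit(total_input, rng, temperature=1.0):
    """Return 1 with probability fire_probability(total_input), else 0."""
    return 1 if rng.random() < fire_probability(total_input, temperature) else 0

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for total_input in (-4.0, 0.0, 4.0):
    outputs = [sample_unit(total_input, rng) for _ in range(1000)]
    # Fraction of trials in which the unit produced an output
    print(total_input, sum(outputs) / len(outputs))
```

With low input the unit is mostly silent, with high input it is mostly on, and even at zero input it still fires about half the time, which is why the network keeps running without any input.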
Not only that, it created a different class of network, which is generative. In the traditional input-output network, when there’s no input, there’s no output. Basically there's nothing going on inside. But in the Boltzmann machine, even without an input, the thing is chugging away because there's always some probability that there'll be an output from each unit. Therein lies the secret that we discovered for learning in a very complex network with many layers, which we now call deep learning.
We gave the network an input and then kept track of the activity patterns within the network. For each connection we kept track of the correlation between the input and the output. Then in order to be able to learn - and this is all mathematical analysis that we had done - you have to basically get rid of the inputs and let it free run, in a sense put the network to sleep. But, of course, it's still chugging away and you can do the same measurement for every pair of units with a connection.
You keep track of the correlations. We call it the sleep phase. The learning algorithm is very simple. You subtract the sleep phase correlation from the awake learning phase and that's how you change the strength of the weight. Either it goes up or it goes down. And we showed that if you do that and you have a big enough data set then you can learn arbitrary mappings. We trained it to discriminate between handwritten digits on zip codes. So, there are 10 output units. You give it some input, which is a little handwritten number 2, then the unit at the very top, which represents 2, is going to be active at the highest level compared to the other units.
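The weight update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the recorded unit states are invented, the learning rate is arbitrary, and the helper names are hypothetical. Producing the sleep-phase states would require actually running the network, which is left out.

```python
def pair_correlation(states, i, j):
    """Average of s_i * s_j over a list of recorded binary state vectors."""
    return sum(s[i] * s[j] for s in states) / len(states)

def update_weight(weight, wake_states, sleep_states, i, j, learning_rate=0.1):
    """Change the weight by (wake correlation - sleep correlation)."""
    wake = pair_correlation(wake_states, i, j)
    sleep = pair_correlation(sleep_states, i, j)
    return weight + learning_rate * (wake - sleep)

# Hypothetical recordings: each row is the on/off state of three units.
wake_states = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]   # inputs present
sleep_states = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # free running

w01 = update_weight(0.0, wake_states, sleep_states, 0, 1)
print(w01)  # positive: units 0 and 1 agree more often awake than asleep
```

When the two correlations match for every connection, the updates average to zero and the weights stop changing.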
That was how we classified the digits. But now what you can do is ‘clamp,’ we called it, or fix the output of the 2 so that it's the only one that's active and the rest are off. And now that percolates down because this network had inputs and outputs going up and down. It was a highly recurrent network. And what it would do is start creating inputs that looked like 2's, but they would be constantly changing. The loop at the top would come and go and then the loop at the bottom would come and go and they would wander around. And so it was basically dreaming. It was dreaming about 2-ness. The network had created an internal representation of what it meant to be a 2.
You prevent any input from coming in so that the network could express an input that represented this concept at the highest level. And so, the information now instead of flowing from the input to the output is flowing from the output to the input. And that's what's called a generative network. And now we have even more powerful generative networks, the generative adversarial networks, which are amazing because not only can you generate 2's, you can generate pictures of people's faces. You give it a bunch of examples of rooms like the one we're in and it will start generating new rooms that don't exist, with different kinds of tables and chairs and windows, and they all look real, photo realistic. And that's what's really astonishing because we can create very high fidelity models of the world.
And in a sense, that's what the brain does when we fall asleep and we dream. We're seeing the generated patterns that are based on our experience.
Geoff and I were completely convinced we had figured out how the brain works. Is it just a coincidence that in order to learn in a multi-layer network, you had to go to sleep? Humans go to sleep every night for eight hours. Why do we go to sleep? In fact, one of the areas that I've helped to pioneer is trying to really understand what goes on in your brain when you fall asleep.
Scientists doing computational models like me have made a tremendous amount of progress on understanding how experiences you have during the day get integrated into your brain at night. It's called memory consolidation, and there's an overwhelming amount of evidence now that this is what's happening.
There's something called replay that happens between the cortex and the hippocampus, a part of your brain that's important for memories, episodic memories, so things that have happened to you: events, unique objects, things like that. During the night the hippocampus literally plays back those experiences to the cortex, and the cortex then has to integrate that into the knowledge base, this semantic knowledge that you have about the world. The Boltzmann machine analogy turned out to actually be a really good insight into what's going on during sleep. But now, obviously what's really going on during sleep is orders of magnitude more complex in terms of the numbers of neurons and the patterns of activity, which we have studied in great detail. But we really think that computationally, it's actually what's going on.
There's a convergence going on right now between our knowledge of the brain, on the one hand, and our ability to now create these large-scale networks in the image of the brain. Not precisely, we're not trying to duplicate the brain, but rather take the principles from the brain and try to build up systems that have some of the capabilities of the brain, like vision, like speech recognition, like language processing.
Neuroscientists are watching what's happening with deep learning and getting inspired and coming up with hypotheses and going back and testing it in the brain. And as we learn more about the brain, how it solves these problems, we can take that and apply it to deep learning.
Consider attention. While we're looking around, we are not trying to process everything that's out there. We focus on a particular object, on reading a sentence. And that means you have to direct your attention. Well, it turns out that if you add attention to these deep learning networks, you vastly improve their performance.
If you're doing language translation, a word at the beginning of a sentence may have a strong relationship with a word later in the sentence. And so, you want to be able to hold onto that information, attend to it while the inputs are coming in sequence. And now another word shows up and those two words have to link with each other.
So attention is a way of marking and saying this is important. Keep it in mind. And then after you've linked up all these words, it's now a meaningful representation. You then begin to output words in another language. Again, respecting those relationships between the words, how they're ordered and what their clauses look like. And in German you have to wait till the end of the sentence in order to put the verb. The network has to understand that. It has to keep track of what the verb is, know what the verb is and know where to put it. And this is all something we take for granted. That's what our brains are really good at.
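The interview describes attention only informally. One common formulation, assumed here rather than taken from the interview, is dot-product attention: the word arriving now is compared with each earlier word, and the scores become weights that say how strongly to keep each earlier word in mind. All vectors below are toy numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy vectors standing in for three earlier words (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # the word arriving now

weights, context = attend(query, keys, values)
print(weights)  # words 0 and 2 match the query and get the larger weights
print(context)  # weighted blend of the values
```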
And so, as we learn more about the mechanisms that the brain uses for processing words, speech, vision and so forth, these will get incorporated and improve the performance of the networks.
And now, especially with natural language processing, this has reached a point, as you probably know from your cell phone, where it's really good. Speech recognition has gotten amazingly good, even in noisy environments. It's a whole new era.
The model that we have for computer vision is based on the camera, which is frame based. So when you're taking a video, it's really a sequence of frames with images and your brain then puts them together into a sequence and so you can see motion and recognize things that are moving. There's a new generation of cameras that are based on how your retina works. Your retina is actually a part of the brain, it's a little pouch on the back surface of your eye and through several layers of processing, it converts an image first into electrical signals and then into spikes.
The spikes flowing into the brain have coded information about things having to do with color, motion and time. How are things changing in time and the relative strengths, for example, on an edge where you have a change in contrast that's coded in spikes. You have all of that information. Now this train of spikes is asynchronous. Unlike a frame where you collect information over 30 or 40 milliseconds, you can send a spike at any time. And that means you can send out spikes as something occurs in the world within a millisecond or less. And the relative timing of the spikes carries a lot of information about where things are going - much more information than if you use a frame-based camera.
If you use the spike-based representation, it's called a dynamic vision sensor. And what's nice about them is that they're very low power because they're only putting out these spikes. They are very sparse in the sense that if nothing's moving you actually don't get anything. You have to have motion. And it's very lightweight. It's the perfect thing for a robot, for example, because powering a robot is very expensive. If you can do vision with spikes instead of supercomputers, or GPUs, which is what is being used for deep learning, it's easier to be autonomous. And that's where we're headed.
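The sparseness can be illustrated with a rough sketch. A real dynamic vision sensor works asynchronously in analog circuitry at each pixel; the code below only approximates the idea by comparing two sampled brightness grids and emitting an event wherever the log brightness changed by more than a threshold. The threshold value, the event tuple layout and the function name are assumptions for illustration.

```python
import math

def events_between(previous, current, timestamp, threshold=0.2):
    """Emit (timestamp, row, col, polarity) where log brightness changed enough.

    Brightness values must be positive. Polarity is +1 for brighter, -1 for darker.
    """
    events = []
    for r, (prev_row, cur_row) in enumerate(zip(previous, current)):
        for c, (p, q) in enumerate(zip(prev_row, cur_row)):
            change = math.log(q) - math.log(p)
            if abs(change) >= threshold:
                events.append((timestamp, r, c, 1 if change > 0 else -1))
    return events

still = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
moved = [[1.0, 2.0, 1.0], [1.0, 0.5, 1.0]]

print(events_between(still, still, timestamp=0))  # nothing moved, so no events
print(events_between(still, moved, timestamp=1))  # one event per changed pixel
```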
Edge devices like your cell phone and your watch are computers and they're soon going to have deep learning chips in them. You have to have better batteries, but ultimately if you could replace the digital circuitry with some of these analog VLSI circuits, like a DVS camera, that is going to revolutionize the amount of computing you can do on board - in your hand.
Spikes are interesting. The neurons in the brain emit these spikes and they're all or none, they last about a millisecond. They're relatively slow compared to digital electronics. In that sense, they are analog. A digital chip has a clock and every cycle, every transistor is updated, so you have to have synchrony across the whole chip. Whereas these analog VLSI chips are asynchronous. So, every single model neuron can send a spike whenever it wants.
And these are then transferred up the road to the other chips through a digital line. So, it's a hybrid chip. It has analog processing, which is really cheap and not very accurate by the way, but that's okay. It turns out if you do a lot of parallel computations with a lot of elements and then integrate that information, you're better off. But to communicate between chips, just like the way the brain does, you have to convert it into a digital bus and send the information over using some protocol.
Once we realized that learning is possible in multi-layer networks, then a bunch of other learning algorithms were discovered literally within years. The one that has been the most popular is the backpropagation of error, which requires that you take information about how well you're doing and compare it to a labeled input and then use that error to go backwards and update the weights. And you're always reducing the error and you can do that very efficiently, very quickly. And because it's so efficient, it's now the way that most of these practical problems are attacked with bigger and bigger and bigger networks.
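As a minimal sketch of the idea, assume a deliberately tiny network with one input, one tanh hidden unit and one linear output, trained on a single labeled example with a squared-error loss. None of these choices come from the interview; they just keep the backward pass readable.

```python
import math

def forward(w1, w2, x):
    """Input -> tanh hidden unit -> linear output."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def backprop_step(w1, w2, x, target, learning_rate=0.1):
    """One gradient step on the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(w1, w2, x)
    error = output - target                         # compare output with the label
    grad_w2 = error * hidden                        # gradient for the top weight
    grad_hidden = error * w2                        # error sent back to the hidden unit
    grad_w1 = grad_hidden * (1 - hidden ** 2) * x   # back through tanh to the first weight
    return w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2

w1, w2 = 0.5, 0.5
x, target = 1.0, 0.8
before = abs(forward(w1, w2, x)[1] - target)
for _ in range(200):
    w1, w2 = backprop_step(w1, w2, x, target)
after = abs(forward(w1, w2, x)[1] - target)
print(before, after)  # the error shrinks as the weights are updated
```

Practical networks apply the same chain-rule bookkeeping to millions of weights at once, usually through an automatic differentiation library rather than hand-written gradients.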
The brain has 12 layers in the visual cortex. Now people are dealing with networks that have 200 layers or more. And what we didn’t know back then, and this is the key to success, is that these learning algorithms scale very well.
A typical algorithm in AI is able to solve small problems where you have just a few variables for which you're trying to find an optimal solution. The traveling salesman problem is a good example. Given a bunch of cities, what's the shortest route between the cities if you visit each one once? That's called NP-complete and what that means is that as the number of cities goes up, the problem becomes exponentially more difficult. At some point, it doesn't matter how fast your computer is, the problem is going to saturate it. And that's the problem with many of the algorithms that are used in a digital computer with a single processor, which is von Neumann architecture, where you have the memory separated from the processor. You have this bottleneck between the two.
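To make the blow-up concrete, here is a brute-force sketch that tries every ordering of the cities. It assumes the classic closed-tour version of the problem (return to the starting city) and straight-line distances, neither of which is spelled out in the interview. The naive search shown here grows factorially, which is why it is only usable for a handful of cities.

```python
import itertools
import math

def tour_length(order, coords):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(math.dist(coords[a], coords[b])
               for a, b in zip(order, order[1:] + order[:1]))

def shortest_tour(coords):
    """Try every ordering that starts at city 0; feasible only for tiny inputs."""
    rest = range(1, len(coords))
    candidates = ([0] + list(p) for p in itertools.permutations(rest))
    return min(candidates, key=lambda order: tour_length(order, coords))

cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # corners of a unit square
best = shortest_tour(cities)
print(best, tour_length(best, cities))

for n in (5, 10, 15, 20):
    # Number of orderings the brute-force search would have to examine
    print(n, math.factorial(n - 1))
```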
Now, the beauty of these neural networks that we pioneered in the 80s is that they are massively parallel. That means that they use simple processors where the memory is located on the processor. They're together so you don't have to ship information back and forth. In the brain, we have a hundred billion neurons that are working together in parallel. That means that you can do much, much more computing in real time and you don't have to worry about buffers or anything. And as you add more and more neurons to your network and more and more layers, the performance gets better and better and better. And that means it scales beautifully.
In fact, and this is absolutely amazing, if you have parallel hardware, that is to say if you're simulating each unit at the same time and you're passing the information through the connection weights at the same time, then it's called order of one scaling.
That means the amount of time it takes is independent of the number of units you've got. It's fixed. And that's how the brain works. The brain is working order one. In other words, as the cortex evolved more and more neurons in primate brains, especially in human brains, it still works in real time. It still works with the same amount of time in order to come to a conclusion - just to recognize an object it’s about a hundred milliseconds - and you can't get better than that. So, nature has found a way to scale up computation. And now we’re finding that out. And now hardware has become a really big part of machine learning. Until recently, there were memory chips, there were CPU chips and maybe some digital signal processing chips.
But now these machine learning algorithms are being put into silicon. Google already has a tensor processing unit, TPU, which does deep learning. But there are a ton of other machine learning algorithms that could be put into silicon and it's going to vastly improve the amount of computing that you can do because these are like supercomputers now. In fact, there's one, from Cerebras, that is 20 centimeters across with 400 million processing units. That's getting up to real scale. Of course, it's a kilowatt so you have to have a power generator there, but it is scaling up. It's a completely new type of chip that people are just beginning to appreciate.
First of all, it's asynchronous and that means you don't need a clock on the chip. You can just let the whole thing go. Number two, you don't need 64-bit accuracy. You can get by with eight bits. So that means vast savings on memory. And then there's a high degree of connectivity locally. So that means that the processors that are near to each other are exchanging a lot of information all the time. That's how the brain works too. And now load all the data as it's coming in, just the way it is through your senses. It flows through like a pipeline. Information is circulating and decisions are being made. It's an incredibly complex dynamical system ultimately.
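The point about eight bits can be sketched with a simple quantization scheme. The interview does not say how the reduced precision is implemented, so the symmetric, shared-scale scheme below is an assumption chosen for brevity, and the weights are invented numbers.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers that share one scale factor.

    Assumes at least one weight is non-zero.
    """
    levels = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.82, -0.31, 0.05, -1.27]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

print(codes)  # small integers that each fit in one byte
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst rounding error
```

Each weight now takes one byte instead of eight, and the rounding error per weight is at most half of `scale`.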
We're faced now with an interesting problem. What we really want to know is what's going on inside the network. What does it learn? And the hottest thing right now is probing the artificial neural network with the same experiments that neuroscience is doing on the brain. How do you figure out what's going on in the brain? You put an electrode onto one of the units and see what it responds to, when it responds. Is it firing before the decision or after? And that gives you hints about how the information is flowing through the network. And we're doing that now with these artificial networks. It's really, really exciting.
It may be that the brain is somewhere between a Boltzmann machine and the back prop net. And this actually leads to a really exciting new area of research, which is of all possible computing systems that are parallel, that have this ability to learn and the ability to take in lots of data and be able to classify or predict. We're just scratching the surface. This is the beginning of a whole new mathematical exploration of this space. I've written an article that was recently accepted in the Proceedings of the National Academy of Sciences. The title is The Unreasonable Effectiveness of Deep Learning, because deep learning is able to do things that are unaccountable.
In the first part of this interview Terry Sejnowski talks to us about machines dreaming, the birth of the Boltzmann machine, the inner workings of the brain, and how we recreate them in AI.