Transcript of Ali Rahimi NIPS 2017 Test-of-Time Award Presentation Speech
This post lists a transcript of Ali Rahimi NIPS 2017 Test-of-Time award speech, a.k.a. the "machine learning has become alchemy" speech.
Lady Introducing Ali Rahimi (can't find her name)
Good morning. I think we are ready for the next talk if this actually works.
So not every nips has a test of time award so it's a test time even to get a test a time award this year and one of the criteria that we decided - the program decided on - for for awarding it was ten year back nips papers so we formed a little committee and I must say that it was not easy - all the ones that did not get their test of time award don't feel too bad, this was not an easy decision but we did arrive, we did arrive, at a paper that we definitely felt was really really good and we're happy about it
So just let me introduce it in the author's own word
random features I check to speed up supervised learning algorithms so they can operate on big that as millions of data point data says traditional learning algorithms work too hard because they optimized parameters that don't actually need to be optimized instead one can randomize over some of the parameters and quickly optimize over the rest thanks to the concentration of measure-of-phenomenon this still provides excellent accuracy this recipe makes for accurate learning algorithms that run extremely fast and are very easy to implement
So the paper introduced random 4-year feature sampling to efficiently approximate Gaussian kernels
It also introduced random binning features to approximate separate separate separable multivariate shift invariant kernels later they followed up with random stamps for boosting and random support vectors for empirical kernel maps
and from the later variations of this paper follow a very popular term the random kitchen sink
So without any further ado, in case you shouldn't have guessed what paper it is, random features for life scale kerning machine kernel machines by Ali Rahimi and pen raft and they are here both of them let's give them a hand and Ali will be giving the talk thank you
Would you mind, oh thank you.
Okay it feels great to get an award thank you but I got to tell you nothing makes you feel old like getting an award award called Test of Time.
It's forcing me to come to grips with my age. Ben and I are no longer young so we decided to name the talk accordingly (Back When We Were Kids)
We're giving you this award for this paper this first paper up here, but this paper was the beginning of a trilogy of sorts and like all stories worth telling the good stuff happens in the middle or at the end not at the beginning
So if you'll put up with my old man ways I'd like to take you all the way back to NIPS 2006 when dinosaurs roamed the earth and Ben and I were young spry men
Deep learning had just made a had just made a splash at NIPS 2006 the training algorithms were complicated and the results were competitive with linear methods like PCA and linear SVMs
At the workshop some of us were kibitzing and somebody pointed out that this should maybe be compared against long nonlinear methods like SVM's, but of course at the time you couldn't scale SVM's up to that size data set.
Ben and I had both been working on randomized algorithms separately, Ben for compressed sensing and I for sketches to speed up bipartite graph matching for computer vision.
Once we got home it took us just two emails to come up with a way to speed up kernel machines.
These two emails became that first paper.
The idea was simple.
Normally when you train a kernel a machine on a data set, you slap one kernel on top of each data point and an associate of weights to each one of those kernels and you let your optimizers tune those weights to minimize the lots that are training there.
Here's the here's the observation in the paper
Most of the kernels people were using at the time can be approximated as a straight sum of the product random functions.
if you plug in that approximation in that first equation you get a linear combination of a straight sum which is just a linear combination but one with fewer coefficients to solve for and that's something the optimizes at the time could handle
The paper provided ways to approximately popular kernels as randomly like this and also provided guarantees about the quality of the approximation
If you have a kernel, and you want to approximate it with such-and-such fidelity, here's how many random features you need
And in practice the method worked very well
Now we set out to provide a baseline for deep learning so they could compare against nonlinear methods but we can find any code to compare against - this was before machine learning was reproducible the way it is now
So we compared against what was reproducible at the time which was boosting and non accelerated kernel machines of various kinds
During our presentation at the the poster session we handed out this leaflet to showcase how easy it was to train large-scale kernel machines
It's uh, "hey buddy you wanna you want to train a kernel machine here's just four lines of MATLAB code" a little bit of a guerilla marketing for for an algorithm
Now there's something a little shady about the story up and telling you
According to the bound I just showed you
and that bound is a fine bound
in order to approximate a kernel accurately with these random features you need tens of thousands of random features and yet in our experiments we were getting away with using just a few hundred features and getting good results
Even more strangely in some of our experiments our approximate method was producing lower test errors than the original kernel machine we were trying to approximate
It's always weird when the thing you're trying to approximate does worse than your approximation
This is relevant, because back in those days machine learning had just finished transitioning from what Sam Roweis used to call "an ideas conference" into something more... rigorous
During the poster sessions you could see this roving band of what I used to call the nips rigor police.They would come to your poster and make sure that your arguments were airtight I'm thinking of Nati Srebro, Ofer Dekel, some of Mike Jordan’s students, or if you’re unlucky, Shai Ben-David, and if you’re really unlucky, Manfred Warmuth.
But anyway, we decided to send the paper out as-is with this doggyness in it and braved the the nips rigor police - but to do right by them we eventually came up with an explanation for this phenomenon and I'll share it with you
Here's the algorithm without any of the kernel flim-flam, straight up:
you just draw a bunch of functions, independently from your data set - you weight them and you tune those weights so that you get a low loss on your training data set okay
In the second paper we showed this, in the same way that Fourier basis
provide a dense base basis set for for an l2 ball of l2 functions
Or in the same way that three layer wide neural networks could approximate any smooth function arbitrarily well, so to do a bunch of randomly drawn smooth functions approximate densely a function in a Hilbert space arbitrarily well with high probability
So now you don't need to talk about kernels to justify these random features
They don't have to be eigenfunctions of any famous kernels, they're a legitimate basis for learning into themselves
In the third paper we finally derive generalization bounds for the algorithm I just showed you
If you have a data set with this many points and you want to achieve this kind of test error on a data set here's how many random features you need
Now at the time - but by this point we were now just no longer thinking about kernels we'd legitimize using, taking a bunch of random kitchen sinks and combining them together and you don't need to justify that the approximate kernels in anyway, so it didn't really bother us if we were underperforming or over performing kernel machines or if we had to use fewer parameters or more parameters than a Kernel machine, we'd set out on this journey to provide a baseline for deep learning and we couldn't do it at the time
But since then the field has become much more reproducible and various folks have provided baselines in speech where random features are competitive with, with deep learning to this day
I myself still use random features at work, I'd like to get creative with the random functions I use and adapt them to the problem at hand
When they work well and I'm feeling good I say to myself, "Wow random features is so powerful - they cracked this dataset" or if I'm in a more somber mood I might say, "huh this problem wasn't hard at all, even random features cracked it"
It's the same way I think about nearest neighbors.
When nearest neighbors does well you can either marvel about the power of nearest neighbors or you might conclude that your problem wasn't hard to begin with
Both of these algorithms are good solid baselines and a way to get a diagnostic on your problem
It's not 2017 (he meant to say It’s now 2017) and the field has changed
We've made incredible progress
We are now reproducible
We shared code freely and used common tasks benchmarks
We've also made incredible technological progress
self-driving cars seem to be around the corner
artificial intelligence tags photos, transcribes voicemails, translates documents, serves us ads billion-dollar companies are built on top of machine learning
In many ways we're way better off than we were 10 years ago
And in some ways we're worse off
There's a self-congratulatory feeling in the air
We say things like "machine learning is the new electricity"
I'd like to offer another analogy
Machine learning has become alchemy 12:09 in the YouTube
Now alchemy is okay
Alchemy is not bad
There is a place for alchemy
Alchemists invented metallurgy, ways to dye textiles, our modern glass making processes and medications
Then again alchemists also believed that could cure diseases with leeches and transmute base metals into gold
For the physics and chemistry of the 1700s to usher in the sea change in our understanding of the universe that we now experience, scientists had to dismantle 2,000 years worth of alchemical theories
If you're building photo-sharing systems alchemy is okay
But we're beyond that now
We're building systems that govern healthcare and mediate our Civic dialogue
We influence elections
I would like to live in a society whose systems are built on top of verifiable, rigorous, thorough knowledge and not on alchemy
As aggravating as the NIPS rigor police was
I miss them and I wish they'd come back
I'll give you examples of where this hurts us
How many of you have devised the deep net from scratch architecture and all and trained it from the ground up and when it didn't work you felt bad about yourself like you did something wrong
Please raise your hand
This happens to me about every three months and let me tell you I don't think it's you, I don't think it's your fault.
I think it's gradient descents fault
I'm gonna take the simplest deep net I can think of
it's a two layer linear net, so identity activation functions
and the labels are badly conditioned linear function of the inputs
On the left is my model it's a product of two matrices that get taller and they just multiply the input and on the right is my loss function and this is the progress of gradient descent in a various variants of it
It looks like gradient descent goes really fast at the beginning and then it just seems to peter out
You might say this is a local minimum or a saddle point, but its not a local minimum, it's not a saddle point the gradient magnitudes are nowhere near zero
You might say it's hitting some statistical noise floor of the problem, that's not it either this is not a statistical noise floor, I can compute the expectation of the loss, run grading descent and get the same curve, this is not due to randomness
Here's what a better descent direction would do
This is Levenberg-Marquardy, it gets down to zero loss it nails the solution a few hundred iterations, it gets to machine precision zero
If you haven't tried optimizing this problem please take 10 minutes on your laptop and just try it
This is our workhorse, this is what we're building models around
You might say this is a contrived problem, that badly conditions something, something has just seen the columns of this A matrix are are correlated that's the only weird thing about it - this is not a contrived problem
You might say gradient descent works just fine on more complicated larger networks
Two answers: first everybody you just raised their hands we'd probably disagree with you and second this is how we build knowledge we apply our tools on simple, easy to analyze setups, we learn and we work our way up in complexity
We seem to have just jumped our way up
This pain is real, here's an email that landed in my inbox just a month ago
I'll read it out loud for you
it's for my friend Boris
On Friday someone on another team changed the default rounding mode of some tensorflow internals from truncate toward zero to round to even
Our training broke, our error rate went from less than 25% error it's almost 99 percent error
I have several emails like this in my inbox and you can find similar reports on various boards on various bug reports
on the public internet
This is happening because we apply brittle optimization techniques to lost surfaces we don't understand
In our solution is to add more mystery to an already mysterious technical stack
Batch norm is a way to speed up gradient descent
You insert Batch Norm between the layers of your deep net and gradient descent goes faster
Now I'm okay using technologies I don't understand
I got here on an airplane and I don't fully understand how airplanes work
But I take comfort knowing that there is an entire field of aeronautics dedicated to creating that understanding
Here is what we know about batch norm as a field
It works because it reduces internal covariant shift
Wouldn't you like to know why reducing internal covariant shift speeds up gradient descent
Wouldn't you like to see a theorem or an experiment
Wouldn't you like to know, wouldn't you like to see evidence that batch norm reduces internal covariant shift?
Wouldn't you like to know what internal covariant shift is?
Wouldn't you like to see a definition of it?
Batch Norm has become a foundational tool in how we build deep nets and yet as a field we know almost nothing about it
Machine learning is taking a new spot in society
If any of what I've been saying resonates with you, let me offer just two ways we can assume this new position responsibly
Think about in the past year the experiments you ran where you were jockeying for position on a leaderboard and trying out different techniques to see if they could give you an edge in performance
And now think about in the past year the experiments you ran where you were chasing down an explanation for a strange phenomenon you'd observed
You were trying to find the root cause for something weird
We do a lot of the former kind of experiments, we could use a lot more of the later
Simple experiments, simple theorems are the building blocks that help us understand more complicated systems
Here's a second, second thing we could do
Right now are mature computational engines that run on commodity hardware are all variants of gradient descent
That's what we have that can handle tens of billions of variables