**Transcript of Ali Rahimi NIPS 2017 Test-of-Time Award Presentation Speech**

*December 4, 2018*

This post is a transcript of Ali Rahimi's NIPS 2017 Test of Time Award speech, a.k.a. the "machine learning has become alchemy" speech.

Session chair, introducing Ali Rahimi (I couldn't find her name)

Good morning. I think we are ready for the next talk if this actually works.

So, not every NIPS has a Test of Time Award, so it's a test of time even to get a Test of Time Award. One of the criteria that the program committee decided on for awarding it was NIPS papers from ten years back. So we formed a little committee, and I must say it was not easy. To all those that did not get the Test of Time Award: don't feel too bad, this was not an easy decision. But we did arrive at a paper that we definitely felt was really, really good, and we're happy about it.

So just let me introduce it in the authors' own words:

"Random features is a trick to speed up supervised learning algorithms so they can operate on big data sets with millions of data points. Traditional learning algorithms work too hard because they optimize parameters that don't actually need to be optimized. Instead, one can randomize over some of the parameters and quickly optimize over the rest. Thanks to the concentration of measure phenomenon, this still provides excellent accuracy. This recipe makes for accurate learning algorithms that run extremely fast and are very easy to implement."

So the paper introduced random Fourier feature sampling to efficiently approximate Gaussian kernels.
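
To make that concrete, here is a minimal sketch (mine, not the paper's code) of the recipe for the Gaussian kernel: sample frequencies from the kernel's Fourier transform, add uniform phases, and take cosines; the inner product of the resulting feature vectors approximates the kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 5, 2000    # input dimension, number of random features
gamma = 0.1       # Gaussian kernel: k(x, y) = exp(-gamma * ||x - y||^2)

# Sample frequencies from the kernel's Fourier transform (a Gaussian)
# and uniform phases; this is the random Fourier feature recipe.
W = rng.normal(0.0, np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def z(x):
    """Feature map with E[z(x) @ z(y)] = k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = z(x) @ z(y)
print(exact, approx)  # close for large D
```

Cranking up D shrinks the error roughly like 1/sqrt(D); that is the concentration-of-measure effect the abstract alludes to.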

It also introduced random binning features to approximate separable multivariate shift-invariant kernels. Later, they followed up with random stumps for boosting and random support vectors for empirical kernel maps.

And from the later variations of this paper came a very popular term: the random kitchen sink.

So without any further ado, in case you haven't guessed what paper it is: "Random Features for Large-Scale Kernel Machines," by Ali Rahimi and Ben Recht. They are both here; let's give them a hand. Ali will be giving the talk. Thank you.


Would you mind, oh thank you.

Ali Rahimi

Okay, it feels great to get an award, thank you. But I've got to tell you, nothing makes you feel old like getting an award called Test of Time.

It's forcing me to come to grips with my age. Ben and I are no longer young, so we decided to name the talk accordingly: "Back When We Were Kids."

You're giving us this award for this first paper up here, but this paper was the beginning of a trilogy of sorts, and like all stories worth telling, the good stuff happens in the middle or at the end, not at the beginning.

So if you'll put up with my old-man ways, I'd like to take you all the way back to NIPS 2006, when dinosaurs roamed the earth and Ben and I were young, spry men.

Deep learning had just made a splash at NIPS 2006. The training algorithms were complicated, and the results were competitive with linear methods like PCA and linear SVMs.

At the workshop, some of us were kibitzing, and somebody pointed out that this should maybe be compared against nonlinear methods like kernel SVMs. But of course, at the time you couldn't scale SVMs up to data sets of that size.

Ben and I had both been working on randomized algorithms separately: Ben on compressed sensing, and I on sketches to speed up bipartite graph matching for computer vision.

Once we got home it took us just two emails to come up with a way to speed up kernel machines.

These two emails became that first paper.

The idea was simple.


Normally, when you train a kernel machine on a data set, you slap one kernel on top of each data point, associate a weight to each one of those kernels, and let your optimizer tune those weights to minimize the loss on your training data.

Here's the observation in the paper: most of the kernels people were using at the time can be approximated as a sum of products of random functions.

If you plug that approximation into the first equation, you get a linear combination of that sum, which is just a linear combination, but one with fewer coefficients to solve for, and that's something the optimizers at the time could handle.

The paper provided ways to approximate popular kernels with random features like this, and also provided guarantees about the quality of the approximation.


If you have a kernel, and you want to approximate it with such-and-such fidelity, here's how many random features you need
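
For the record, the shape of that guarantee, paraphrased from memory rather than quoted, is roughly the following; the precise constants and conditions are in the paper:

```latex
% Paraphrased from memory; see the paper for the exact statement.
% Uniform approximation over a compact set \mathcal{M} \subset \mathbb{R}^d:
\sup_{x, y \in \mathcal{M}} \left| z(x)^{\top} z(y) - k(x, y) \right| \le \varepsilon
\quad \text{with high probability, provided} \quad
D = \Omega\!\left( \frac{d}{\varepsilon^{2}} \,
    \log \frac{\operatorname{diam}(\mathcal{M})}{\varepsilon} \right).
```

The 1/epsilon-squared factor comes from a Hoeffding bound at each fixed pair of points, and the d-log factor from a union bound over a covering of the set of pairs.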


And in practice the method worked very well


Now, we had set out to provide a baseline for deep learning, so that it could be compared against nonlinear methods, but we couldn't find any code to compare against. This was before machine learning was reproducible the way it is now.

So we compared against what was reproducible at the time, which was boosting and non-accelerated kernel machines of various kinds.

During our presentation at the poster session, we handed out this leaflet to showcase how easy it was to train large-scale kernel machines.

It's, uh, "Hey buddy, you want to train a kernel machine? Here's just four lines of MATLAB code." A little bit of guerrilla marketing for an algorithm.
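
I don't have the leaflet's exact four lines of MATLAB, but the spirit of it fits in a few lines of anything. Here's a Python rendition on made-up toy data (my own sketch; the data and variable names are illustrative, not from the leaflet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: n points in d dimensions with a smooth target.
n, d, D = 500, 3, 300
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# The whole training procedure, in the spirit of the four-line leaflet:
W = rng.normal(size=(D, d))                         # draw random directions
b = rng.uniform(0.0, 2 * np.pi, size=D)             # and random phases
Z = np.cos(X @ W.T + b)                             # featurize the data
alpha = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(D), Z.T @ y)  # one ridge solve

mse = np.mean((Z @ alpha - y) ** 2)
print(mse)  # far below the variance of y
```

Randomize the features, then do a single linear solve for the weights: that's the whole algorithm.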


Now, there's something a little shady about the story I've been telling you.

According to the bound I just showed you, and that bound is a fine bound, in order to approximate a kernel accurately with these random features, you need tens of thousands of random features. And yet in our experiments, we were getting away with using just a few hundred features and getting good results.

Even more strangely, in some of our experiments our approximate method was producing lower test errors than the original kernel machine we were trying to approximate.

It's always weird when the thing you're trying to approximate does worse than your approximation


This is relevant, because back in those days machine learning had just finished transitioning from what Sam Roweis used to call "an ideas conference" into something more... rigorous

During the poster sessions you could see this roving band of what I used to call the NIPS rigor police. They would come to your poster and make sure that your arguments were airtight. I'm thinking of Nati Srebro, Ofer Dekel, some of Mike Jordan's students, or if you're unlucky, Shai Ben-David, and if you're really unlucky, Manfred Warmuth.

But anyway, we decided to send the paper out as-is, with this dodginess in it, and braved the NIPS rigor police. But to do right by them, we eventually came up with an explanation for this phenomenon, and I'll share it with you.

Here's the algorithm without any of the kernel flim-flam, straight up: you draw a bunch of random functions, independently of your data set, you weight them, and you tune those weights so that you get a low loss on your training data set.

In the second paper we showed that, in the same way that the Fourier basis provides a dense basis set for an L2 ball of L2 functions, or in the same way that wide three-layer neural networks can approximate any smooth function arbitrarily well, so too does a bunch of randomly drawn smooth functions densely approximate a function in a Hilbert space arbitrarily well, with high probability.

So now you don't need to talk about kernels to justify these random features.

They don't have to be eigenfunctions of any famous kernels; they're a legitimate basis for learning in and of themselves.

In the third paper, we finally derived generalization bounds for the algorithm I just showed you: if you have a data set with this many points and you want to achieve this kind of test error on it, here's how many random features you need.

By this point, we were no longer thinking about kernels. We had legitimized taking a bunch of random kitchen sinks and combining them together, without needing to justify them as approximate kernels in any way. So it didn't really bother us if we were underperforming or outperforming kernel machines, or if we had to use fewer or more parameters than a kernel machine. We had set out on this journey to provide a baseline for deep learning, and we couldn't do it at the time.

But since then, the field has become much more reproducible, and various folks have provided baselines in speech where random features are competitive with deep learning to this day.

I myself still use random features at work. I like to get creative with the random functions I use and adapt them to the problem at hand.

When they work well and I'm feeling good, I say to myself, "Wow, random features are so powerful; they cracked this data set." Or, if I'm in a more somber mood, I might say, "Huh, this problem wasn't hard at all; even random features cracked it."

It's the same way I think about nearest neighbors.

When nearest neighbors does well, you can either marvel at the power of nearest neighbors, or you might conclude that your problem wasn't hard to begin with.

Both of these algorithms are good, solid baselines, and a way to get a diagnostic on your problem.

It's now 2017, and the field has changed.

We've made incredible progress

We are now reproducible.

We share code freely and use common task benchmarks.

We've also made incredible technological progress

Self-driving cars seem to be around the corner.

Artificial intelligence tags photos, transcribes voicemails, translates documents, and serves us ads. Billion-dollar companies are built on top of machine learning.

In many ways we're way better off than we were 10 years ago


And in some ways we're worse off

There's a self-congratulatory feeling in the air


We say things like "machine learning is the new electricity."

I'd like to offer another analogy.

Machine learning has become alchemy.

Now, alchemy is okay. Alchemy is not bad.

There is a place for alchemy.

Alchemy "worked."

Alchemists invented metallurgy, ways to dye textiles, our modern glass-making processes, and medications.

Then again, alchemists also believed they could cure diseases with leeches and transmute base metals into gold.

For the physics and chemistry of the 1700s to usher in the sea change in our understanding of the universe that we now experience, scientists had to dismantle 2,000 years' worth of alchemical theories.

If you're building photo-sharing systems, alchemy is okay.

But we're beyond that now.

We're building systems that govern healthcare and mediate our civic dialogue.

We influence elections.

I would like to live in a society whose systems are built on top of verifiable, rigorous, thorough knowledge and not on alchemy

As aggravating as the NIPS rigor police were, I miss them, and I wish they'd come back.

I'll give you examples of where this hurts us.

How many of you have devised a deep net from scratch, architecture and all, and trained it from the ground up, and when it didn't work, you felt bad about yourself, like you did something wrong?

Please raise your hand.

This happens to me about every three months, and let me tell you, I don't think it's you; I don't think it's your fault. I think it's gradient descent's fault.

I'll illustrate.

I'm going to take the simplest deep net I can think of: a two-layer linear net, so identity activation functions, where the labels are a badly conditioned linear function of the inputs.

On the left is my model; it's a product of two matrices that get taller, and they just multiply the input. On the right is my loss function, and this is the progress of gradient descent and various variants of it.

It looks like gradient descent goes really fast at the beginning and then it just seems to peter out

You might say this is a local minimum or a saddle point. But it's not a local minimum, and it's not a saddle point; the gradient magnitudes are nowhere near zero.

Â

You might say it's hitting some statistical noise floor of the problem, that's not it either this is not a statistical noise floor, I can compute the expectation of the loss, run grading descent and get the same curve, this is not due to randomness

Here's what a better descent direction would do.

This is Levenberg-Marquardt: it gets down to zero loss. It nails the solution in a few hundred iterations; it gets to machine-precision zero.


If you haven't tried optimizing this problem please take 10 minutes on your laptop and just try it
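
In that spirit, here is a small, self-contained version of the experiment you can run. This is my own reconstruction of the kind of setup he describes, not the actual demo code; the matrix sizes, learning rate, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A badly conditioned linear target: y = A x, condition number 10^5.
d = 6
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = U @ np.diag(10.0 ** -np.arange(d)) @ U.T

X = rng.normal(size=(d, 200))
Y = A @ X

# Two-layer linear "deep net": the model is W2 @ W1, identity activations.
W1 = 0.1 * rng.normal(size=(d, d))
W2 = 0.1 * rng.normal(size=(d, d))

def loss(W1, W2):
    R = W2 @ W1 @ X - Y
    return 0.5 * np.mean(np.sum(R ** 2, axis=0))

loss0 = loss(W1, W2)
lr = 1e-2
for _ in range(2000):
    R = (W2 @ W1 @ X - Y) / X.shape[1]   # residual, averaged over samples
    g1 = W2.T @ R @ X.T                  # gradient w.r.t. W1
    g2 = R @ (W1 @ X).T                  # gradient w.r.t. W2
    W1 -= lr * g1
    W2 -= lr * g2

print(loss0, loss(W1, W2))  # progress, but nowhere near machine precision

# The same problem, solved exactly by one ordinary least-squares solve:
A_hat = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.linalg.norm(A_hat - A))  # essentially zero
```

On this kind of target, gradient descent keeps making progress but crawls along the poorly conditioned directions, while the direct solve recovers A to numerical precision.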


This is our workhorse, this is what we're building models around

You might say this is a contrived problem, that the bad conditioning is cooked up. But all that's going on is that the columns of this A matrix are correlated; that's the only weird thing about it. This is not a contrived problem.

You might say gradient descent works just fine on more complicated larger networks

Two answers. First, everybody who just raised their hands would probably disagree with you. And second, this is how we build knowledge: we apply our tools to simple, easy-to-analyze setups, and we learn and work our way up in complexity.

We seem to have just jumped our way up.

This pain is real, here's an email that landed in my inbox just a month ago

I'll read it out loud for you. It's from my friend Boris:

"On Friday, someone on another team changed the default rounding mode of some TensorFlow internals, from 'truncate toward zero' to 'round to even.'

Our training broke; our error rate went from less than 25% error to almost 99% error."

I have several emails like this in my inbox, and you can find similar reports in various bug reports on the public internet.

This is happening because we apply brittle optimization techniques to loss surfaces we don't understand.

And our solution is to add more mystery to an already mysterious technical stack.

Batch Norm is a way to speed up gradient descent: you insert Batch Norm between the layers of your deep net, and gradient descent goes faster.
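
Mechanically, the transformation itself is simple to state. Here is a sketch of the standard formulation, with learnable scale gamma and shift beta (training-mode statistics only; a full implementation also tracks running averages for inference):

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mu = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mu) / np.sqrt(var + eps)
    return gamma * h_hat + beta

# A batch of 64 activations with 10 features, deliberately off-center.
h = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 10))
out = batch_norm(h, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6))  # ~0 per feature
print(out.std(axis=0).round(3))   # ~1 per feature
```

The open question in the talk is not what the operation computes, which is just this, but why inserting it makes gradient descent behave so much better.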

Now, I'm okay using technologies I don't understand.

I got here on an airplane, and I don't fully understand how airplanes work.

But I take comfort knowing that there is an entire field of aeronautics dedicated to creating that understanding.

Here is what we know about Batch Norm as a field: it works because it reduces internal covariate shift.

Wouldn't you like to know why reducing internal covariate shift speeds up gradient descent?

Wouldn't you like to see a theorem or an experiment?

Wouldn't you like to see evidence that Batch Norm reduces internal covariate shift?

Wouldn't you like to know what internal covariate shift even is?

Wouldn't you like to see a definition of it?

Batch Norm has become a foundational tool in how we build deep nets, and yet as a field we know almost nothing about it.

Machine learning is taking a new spot in society.

If any of what I've been saying resonates with you, let me offer just two ways we can assume this new position responsibly.

Think about the experiments you ran in the past year where you were jockeying for position on a leaderboard, trying out different techniques to see if they could give you an edge in performance.

And now think about the experiments you ran in the past year where you were chasing down an explanation for a strange phenomenon you'd observed; you were trying to find the root cause of something weird.

We do a lot of the former kind of experiment. We could use a lot more of the latter.

Simple experiments and simple theorems are the building blocks that help us understand more complicated systems.

Here's a second thing we could do.

Right now, our mature computational engines that run on commodity hardware are all variants of gradient descent. That's what we have that can handle tens of billions of variables.

Imagine if we had linear-system solvers or matrix-factorization engines that could handle tens of billions of variables and operate on commodity hardware.

Imagine the amazing kinds of optimization algorithms we could build


Imagine the incredible models we could experiment with


Of course it's a hard math problem and it's a hard systems problem, but this is exactly the group that can solve these problems.


Over the years, many of my dearest relationships have emerged from this community.

My affection and respect for this group are sincere and that's why I'm up here asking us to be more rigorous, less alchemical, better


Ben and I are grateful for the award and the opportunity to have gotten to know many of you and to work with many of you


And we would love it if we could work together to take machine learning from alchemy and into electricity


[Applause]


Session chair

Thank you so much for this talk. I hope that we'll be able to give you a new award ten years from now, for the return of rigor to the NIPS community. We have time for probably one question or two. Since we are all just stunned by the message here, which we all appreciated so much, I just want to invite Ben up here as well, so I can hand you the official certificate for the Test of Time Award.

[Applause]

Man's Voice

And that is the end of the Test of Time Award session; the program continues on. I think there's a break now.
