[MUSIC] Okay, so in this course we've talked about lots of different machine learning methods and lots of applications where these types of methods can be very impactful. But of course there are lots of open challenges that still remain in machine learning, so let's discuss some of them.

One is the fact that we often have a choice of which model to use. For example, when we talked about recommending products, we said we could use a classification model, where we take features of the user and the product and pass them through a classifier that says yes or no, this person will like or not like this product. But we also talked about using matrix factorization, where we learn features about users and products and use those to recommend products to users. And then we talked about featurized matrix factorization, which combines these two ideas. The list of possible models we can consider for a task is often very large, and this typically leaves the practitioner very perplexed: which model should I use? Searching over this set of possible choices is still an open challenge in machine learning.

Another really important challenge we're often faced with is how to represent our data. For example, when we talked about our document retrieval task, we said we could use just raw word counts, or we could normalize those vectors, or we could use something like tf-idf to downweight very popular words and emphasize the important words in a document. But honestly, there are lots of different variants of tf-idf; we just provided one example. You could also think about using bigrams and trigrams, and there are lots and lots of ways to represent the words that appear in a document. And that's just for a document. Then maybe we have images: how do we represent an image? We've talked about some ways.
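Just to make the document example concrete before moving on, here's one minimal sketch of tf-idf in plain Python on a toy corpus. This is just one of the many variants mentioned above (raw count times log inverse document frequency), and the corpus and function names are purely illustrative:

```python
import math
from collections import Counter

def tfidf(docs, ngram=1):
    """Compute tf-idf vectors for a list of tokenized documents.

    docs: list of lists of words; ngram=2 also adds bigrams, etc.
    Returns one {term: weight} dict per document.
    """
    def terms(words):
        # unigrams, plus n-grams joined with '_' when ngram > 1
        out = list(words)
        for n in range(2, ngram + 1):
            out += ['_'.join(words[i:i + n]) for i in range(len(words) - n + 1)]
        return out

    term_counts = [Counter(terms(d)) for d in docs]
    n_docs = len(docs)

    # document frequency: in how many documents each term appears
    df = Counter()
    for tc in term_counts:
        df.update(tc.keys())

    # tf-idf = term count * log(N / document frequency);
    # a word in every document gets weight log(N/N) = 0
    return [{t: c * math.log(n_docs / df[t]) for t, c in tc.items()}
            for tc in term_counts]

docs = [['the', 'cat', 'sat'], ['the', 'dog', 'ran'], ['the', 'cat', 'ran']]
vecs = tfidf(docs)
# 'the' appears in all three documents, so its weight is 0 everywhere,
# while rarer words like 'dog' get positive weight
```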
We'll talk about others, but there are lots of challenges there too. Then maybe you have data that's really network based, like data from Facebook. So you can have very complicated data structures coming from very different, diverse data sets, and we wanna be able to use the types of methods we've described on all of them. How we represent our data is gonna have a significant impact on the types of inferences we make from that data. So this is a really, really important problem, and there's no one method for choosing the right representation of your data.

One of the other really important and really significant challenges we're faced with in machine learning these days is how to scale up, in multiple dimensions. One aspect of this is the fact that data is getting bigger and bigger, something that's been talked about extensively in the media. So let's describe a few situations in which we're faced with a growing amount of data. One is that there's a large number of platforms out there for social networking, collecting data via crowdsourcing, sharing your photos and videos, reviewing restaurants, and things like this. The list of ways you can now go online and give data to the world keeps growing, and the number of people doing this and providing data is growing at a huge, huge rate. So we have lots of new data sources available to us.

In addition, think about the way we buy products. We no longer just go to a store that keeps some handwritten record of what was purchased. We now have vendors like Amazon with huge online marketplaces, collecting data about different products, customers, and purchases being made, and lots and lots of data comes from sources like this. And beyond these types of websites, there are also a lot of devices that we can now wear.
So there are these wearable devices: I can now wear a watch that monitors all the activities I'm doing and how I'm sleeping at night, or glasses that record everything I'm seeing. We can also talk about the Internet of Things, which is just lots of connected devices and lots of different sources of information communicating with one another. These are just some of the areas in which we're seeing lots and lots of new data sources, but of course that's not exhaustive. We can also talk about things like medical records. Again, no longer do you go into your doctor's office and just have them write notes by hand that get put in some file. Often they're keeping electronic health records, and these now communicate across systems, so we have lots and lots of electronic health records that are a source of data to be parsed, understood, and used to innovate in medicine.

So lots of new data sets, which is exciting. We can learn a lot about how people operate, about our bodies, about how people purchase and make friends and how they go about their day-to-day activities. But of course we need methods that scale to analyze these types of data sets, and that handle the unique structure of the data they present, and the noise in that data; the list of challenges is really extensive. This is one of the very big challenges in machine learning: how to deal with this big data.

And simultaneously with the data getting really large, we're also faced with the fact that the models we use to analyze these increasingly complex data sets are also growing. The models themselves are becoming bigger and more complicated in order to extract information from these ever growingly, I don't know if that's a word, but you get my point, these very intricate and very large data sources.
So just as an example, when we talked about clustering we mentioned an application where you have recordings of brain activity taken over time. This is just one quick example of a model that was used to analyze that type of data set, and without going into the details of what's shown here on this slide, just realize that there are lots of circles and lots of arrows, and what that means is that this is a really complicated, big, big model.

So you might think, okay, data's getting bigger and models are getting bigger, but that's okay because processors are getting faster. Well, that was the story for a while: we were seeing exponentially increasing processor speeds. But that stopped about a decade ago, and now what we're seeing is a very marginal increase in the speed of an individual processor. So instead, we have to think about new ways to scale up, and the typical thing we leverage these days is collections of processors. And there are different architectures here: things like GPUs, multicore machines, clusters, cloud computing resources, and really, really fancy and expensive supercomputers. So that's great; those are really, really powerful, or potentially powerful, computing resources that we have.

But a question is how do we use these in machine learning? And here we're faced with a number of challenges. One is taking our machine learning algorithms and thinking about how to distribute them across these different processors and run everything we want to run in a coherent way, which is very challenging. Another challenge is how to distribute the data across these different machines, and how to do all of this in a way that is tolerant to failures of the individual machines. So these represent a number of challenges that we are facing in machine learning.
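As a toy illustration of that data-parallel idea, shard the data across workers, have each worker compute small partial summaries, and combine them back on a driver. Here's a sketch using Python's multiprocessing module; a real system would run across machines and add fault tolerance, and the function names here are made up for the example:

```python
from multiprocessing import Pool

def partial_stats(shard):
    # each worker sees only its shard of the data and returns small,
    # easily combinable summaries: (sum, sum of squares, count)
    return sum(shard), sum(x * x for x in shard), len(shard)

def distributed_mean_var(data, n_workers=4):
    # "distribute the data": split it into one shard per worker
    shards = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        parts = pool.map(partial_stats, shards)
    # combine the partial summaries back on the driver
    s = sum(p[0] for p in parts)
    ss = sum(p[1] for p in parts)
    n = sum(p[2] for p in parts)
    mean = s / n
    return mean, ss / n - mean * mean

if __name__ == '__main__':
    # mean and variance of 0..999, computed across 4 worker processes
    mean, var = distributed_mean_var(list(range(1000)))
```

The key design point is that the workers never ship raw data back, only tiny summaries that add up, which is the same pattern that MapReduce-style systems use at cluster scale.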
And a lot of exciting research is coming out to start addressing these problems. [MUSIC]