I've seen over and over that one of the most reliable ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a massive training set. But where do you get that much training data from? It turns out that in machine learning there's a fascinating idea called artificial data synthesis. It doesn't apply to every single problem, and applying it to a specific problem often takes some thought, innovation, and insight. But if this idea applies to your machine learning problem, it can sometimes be an easy way to get a huge training set to give to your learning algorithm. Artificial data synthesis comprises two main variations: the first is essentially creating new data from scratch, and the second is when we already have a small labeled training set and we somehow amplify it into a larger training set. In this video we'll go over both of those ideas.

To talk about the artificial data synthesis idea, let's use the character recognition portion of the photo OCR pipeline: we want to take an input image and recognize what character it is. If we go out and collect a large labeled data set, here's what it would look like. For this particular example I've chosen a square aspect ratio, so we're taking square image patches, and the goal is to take an image patch and recognize the character in the middle of it. For the sake of simplicity I'm going to treat these images as grayscale rather than color; it turns out that using color doesn't seem to help much for this particular problem. So given this image patch, we'd like to recognize that it's a 'T'. Given this image patch, we'd like to recognize that it's an 'S'. Given that image patch, we'd like to recognize it as an 'I', and so on. All of these are examples of real images. So how can we come up with a much larger training set?

Modern computers often have a huge font library, and if you use word processing software, depending on which word processor you use, you might have all of these fonts and many more already stored inside. In fact, if you go to different websites, there are huge free font libraries on the internet from which you can download many different fonts, hundreds or perhaps thousands of them. So if you want more training examples, one thing you can do is take characters from different fonts and paste those characters against different random backgrounds. For example, you might take a 'C' and paste it against a random background; if you do that, you now have a training example of an image of the character 'C'. After some amount of work (and it is a little bit of work to synthesize realistic-looking data) you can get a synthetic training set. Every image shown on the right was actually a synthesized image: you take a font, maybe a random font downloaded off the web, paste an image of one character or a few characters from that font against some other random background image, and then perhaps apply a little blurring and some affine distortions, meaning small shearing, scaling, and rotation operations. If you do that, you get a synthetic training set like the one shown here.
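As a concrete illustration, here is a minimal sketch of what that kind of synthesis loop might look like in Python with Pillow and NumPy. It is only a sketch under my own assumptions: the 20x20 patch size, the fonts/ and backgrounds/ directories, and the helper names are placeholders, not details from the lecture.

```python
# Minimal sketch: synthesize character patches by pasting font glyphs onto
# random background crops, then lightly distorting them. Paths are hypothetical.
import glob
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

PATCH_SIZE = 20  # assumed square, grayscale patch


def random_background(image_paths, size=PATCH_SIZE):
    """Crop a random grayscale patch out of a randomly chosen background image."""
    img = Image.open(random.choice(image_paths)).convert("L")
    x = random.randint(0, img.width - size)
    y = random.randint(0, img.height - size)
    return img.crop((x, y, x + size, y + size))


def synthesize_example(char, font_paths, image_paths):
    """Paste one character from a random font onto a random background,
    then apply a small random rotation and a little blur."""
    patch = random_background(image_paths)
    font = ImageFont.truetype(random.choice(font_paths), size=random.randint(12, 18))
    draw = ImageDraw.Draw(patch)
    draw.text((2, 1), char, fill=random.randint(0, 80), font=font)
    patch = patch.rotate(random.uniform(-10, 10))
    patch = patch.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1)))
    return np.asarray(patch, dtype=np.float32) / 255.0  # 20x20 feature patch


# Hypothetical usage: build many (patch, label) pairs from downloaded fonts.
fonts = glob.glob("fonts/*.ttf")
backgrounds = glob.glob("backgrounds/*.jpg")
dataset = [(synthesize_example(c, fonts, backgrounds), c)
           for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" for _ in range(100)]
```

The same small distortions (rotation, shearing, blur) can equally be applied to real labeled patches, which is essentially the second variation described next.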
And this does take work; it takes thought and work to make the synthetic data look realistic, and if you do a sloppy job of creating the synthetic data, it actually won't work well. But if you look at it, the synthetic data looks remarkably similar to the real data, and so by using synthetic data you have essentially an unlimited supply of training examples. If you use this sort of synthetic data, you have essentially an unlimited supply of labeled data with which to train a learning algorithm for the character recognition problem. So this is an example of artificial data synthesis where you're basically creating new data from scratch, generating brand-new images from scratch.

The other main approach to artificial data synthesis is to take examples you currently have, that is, to take a real example, maybe a real image, and create additional data from it so as to amplify your training set. Here is an image of the character 'A' taken from a real image, not a synthesized one, and I've overlaid it with grid lines just for the purpose of illustration. What you can do is take this image and introduce artificial warpings, or artificial distortions, into it, so that from this one image of an 'A' you get 16 new examples. In this way you can take a small labeled training set and amplify it into a lot more examples. Again, to do this for your application it takes thought and insight to figure out what a reasonable set of distortions is, that is, what ways of amplifying and multiplying your training set make sense. For the specific example of character recognition, introducing these warpings seems like a natural choice, but for a different machine learning application there may be different distortions that make more sense.

Let me show one example from the totally different domain of speech recognition. In speech recognition, say you have audio clips and you want to learn from an audio clip to recognize what words were spoken in it. Suppose you have one labeled training example of someone saying a few specific words. Let me play that audio clip: "0, 1, 2, 3, 4, 5." So that's someone counting from zero to five, and you want to apply a learning algorithm to recognize the words said in that clip. How can we amplify the data set? Well, one thing we can do is introduce additional audio distortions. Here I'm going to add background sounds to simulate a bad cell phone connection; when you hear beeping sounds, that's actually part of the audio track, there's nothing wrong with your speakers: "0, 1, 2, 3, 4, 5." You can listen to that audio clip and still recognize the words, so that seems like another useful training example to have. Here's another example with a noisy background, "zero, one, two, three, four, five," with cars driving past and people walking in the background. So taking the original clean audio of someone saying "0, 1, 2, 3, 4, 5," we can automatically synthesize these additional training examples and thus amplify one training example into maybe four different training examples.
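Here is a minimal sketch of the kind of mixing this describes, adding a background-noise track to a clean clip at a chosen level. It is only an illustration under my own assumptions: the WAV file names, the mono format, and the chosen signal-to-noise ratio are placeholders, not details from the lecture.

```python
# Minimal sketch: amplify one clean labeled clip by mixing in background noise.
# Assumes mono WAV files at the same sample rate; file names are hypothetical.
import numpy as np
from scipy.io import wavfile


def mix_with_noise(clean, noise, snr_db):
    """Mix a noise track into a clean clip at the given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clean.shape)  # loop or trim noise to match clip length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise


rate, clean = wavfile.read("counting_0_to_5.wav")
clean = clean.astype(np.float64)

synthesized = []
for noise_file in ["cellphone_beeps.wav", "street_noise.wav", "crowd.wav"]:
    _, noise = wavfile.read(noise_file)
    synthesized.append(mix_with_noise(clean, noise.astype(np.float64), snr_db=10))
# One labeled clip has been amplified into several labeled training examples,
# all sharing the original transcript "zero one two three four five".
```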
Let me play this final example as well: "0, 1, 2, 3, 4, 5." So by taking just one labeled example, where we only had to go through the effort of collecting that one clip of someone saying "0, 1, 2, 3, 4, 5," and synthesizing additional distortions by introducing different background sounds, we've multiplied this one example into many more examples, with very little work, just by automatically adding these different background sounds to the clean audio.

Just one word of warning about synthesizing data by introducing distortions: the distortions you introduce should be representative of the sorts of noise or distortions you might see in the test set. For the character recognition example, the warpings introduced are actually quite reasonable, because a warped image of an 'A' like that is an image we could actually see in a test set; the image in the upper right is one we could imagine seeing. And for audio, we do want to recognize speech even over a bad cell phone connection and against different types of background noise, so for the audio too, the examples we're synthesizing are representative of the sorts of examples we want to classify and recognize correctly. In contrast, it usually does not help to add purely random, meaningless noise to your data. I'm not sure you can see this, but what we've done here is take the image and, for each pixel in each of these four images, add some random Gaussian noise to the pixel's brightness. That's just totally meaningless noise, and unless you expect to see this sort of pixel-wise noise in your test set, this kind of purely random noise is unlikely to be useful. The process of artificial data synthesis is a bit of an art, and sometimes you just have to try it and see if it works. But if you're trying to decide what sorts of distortions to add, do think about what meaningful distortions you could add that would generate additional training examples at least somewhat representative of the sorts of images you expect to see in your test set.

Finally, to wrap up this video, I want to say a couple more words about this idea of getting lots of data via artificial data synthesis. As always, before expending a lot of effort figuring out how to create artificial training examples, it's good practice to make sure that you really have a low-bias classifier, so that having a lot more training data will actually help. The standard way to do this is to plot learning curves and make sure you have a low-bias, high-variance classifier. If you don't have a low-bias classifier, one thing worth trying is to keep increasing the number of features your classifier has, or the number of hidden units in your neural network, say, until you actually have a low-bias classifier, and only then put the effort into creating a large artificial training set. What you really want to avoid is spending a whole week, or a few months, figuring out how to get a great artificially synthesized data set, only to realize afterward that your learning algorithm's performance doesn't improve much even when it's given a huge training set.
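To make that sanity check concrete, here is a minimal sketch of plotting learning curves with scikit-learn. It is only an illustration; the built-in digits data set and the logistic regression model stand in for whatever classifier and data you actually have.

```python
# Minimal sketch: check that more data would help (low bias, high variance)
# before investing effort in data synthesis. Model and data are stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # small image patches, analogous to OCR patches

sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label="cross-validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
# A persistent gap between the two curves (low training error, higher CV error)
# suggests high variance, so more data, synthetic or real, is likely to help.
```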
So that's my usual advice: first test that you really can make use of a large training set before spending a lot of effort going out to get one. Second, when I'm working on machine learning problems, one question I often ask the team I'm working with, and often ask my students, is: how much work would it be to get ten times as much data as we currently have? When I face a new machine learning application, very often I'll sit down with the team and ask exactly this question. I've asked it over and over, and I've been very surprised by how often the answer is that it's really not that hard, maybe a few days of work at most, to get ten times as much data as we currently have, and very often, if you can get ten times as much data, there will be a way to make your algorithm do much better. So if you ever join a product team working on some machine learning application, this is a very good question to ask yourself and to ask the team, and don't be too surprised if, after a few minutes of brainstorming, your team comes up with a way to get literally ten times as much data, in which case I think you would be a hero to that team, because with ten times as much data I think you'll really get much better performance, just from learning from so much data.

So there are several ways to get more data. They include the two ideas of artificial data synthesis we just covered: generating data from scratch using random fonts and so on, and taking existing examples and introducing distortions that amplify the training set. Another way to get a lot more data is to collect the data and label it yourself. One useful calculation I often do is to sit down and figure out how many minutes, how many hours or days, it would take for me or for someone else to collect and label ten times as much data as we currently have, by doing it ourselves. For example, suppose that for our machine learning application we currently have 1,000 labeled examples, so m = 1,000, and that ten times as much data would mean m = 10,000. Suppose also that it takes about ten seconds to collect and label one new example. Then if I want ten times as much data, I need 10,000 examples, so I do the calculation: how long will it take to manually label 10,000 examples at 10 seconds per example?
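Written out, that back-of-the-envelope calculation is just a couple of lines; the numbers are the ones from the example above.

```python
# Back-of-the-envelope: how long would it take to hand-label 10x the data?
seconds_per_example = 10            # assumed time to label one example
current_m = 1000                    # labeled examples we already have
target_m = 10 * current_m           # ten times as much data

total_hours = target_m * seconds_per_example / 3600.0
print(f"Labeling {target_m} examples by hand takes about {total_hours:.1f} hours")
# About 28 hours, i.e. a few days of work rather than months.
```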
So when you do this calculation, I've seen many teams be very surprised at how little work it can be, sometimes just a small number of days, to get a lot more data, and that can be a way to give your learning algorithm a huge boost in performance. And sometimes, when you've just managed to do this, you will be a hero on whatever product or team you're working on, because this can be a great way to get much better performance.

Third and finally, one sometimes good way to get a lot of data is to use what's now called crowdsourcing. Today there are a few websites and services that allow you to hire people on the web to label large training sets for you fairly inexpensively. This idea of crowdsourcing, or crowdsourced data labeling, has an entire academic literature of its own and comes with its own complications, such as labeler reliability, but it gives you maybe hundreds of thousands of labelers around the world working fairly inexpensively to help label data for you, so I wanted to mention it as one more alternative. Amazon Mechanical Turk is probably the most popular crowdsourcing option right now. It is often quite a bit of work to get this working well if you want very high-quality labels, but it is sometimes an option worth considering if you want to hire many people fairly inexpensively on the web to label large amounts of data for you.

So in this video we talked about the idea of artificial data synthesis: either creating new data from scratch, using random fonts as an example, or amplifying an existing training set by taking existing labeled examples and introducing distortions to create extra labeled examples. And finally, one thing I hope you remember from this video: if you are facing a machine learning problem, it is often worth doing two things. One is a sanity check, with learning curves, that having more data would help. And second, assuming that's the case, sit down and seriously ask yourself what it would take to get ten times as much data as you currently have. Not always, but sometimes, you may be surprised by how easy that turns out to be, maybe a few days or a few weeks of work, and that can be a great way to give your learning algorithm a huge boost in performance.