In this video, we'll see a couple of examples of how Bayesian optimization can be applied to real-world problems.

The first one is hyperparameter tuning. You usually train your neural networks and have to retrain them many times while choosing the optimal number of layers, the layer sizes, whether to use dropout or not, whether to use batch normalization, and which nonlinearity to use: ReLU, SELU, and so on. You also have training parameters like the learning rate and momentum, or you may want to switch between different optimizers, for example Adam or SGD.

What you could do is use Bayesian optimization to find the best values of all of those parameters for you automatically. It usually finds better optima than when you tune them by hand, and it also allows for an honest comparison with other methods when you do research. For example, say you came up with a brilliant method and spent a lot of time tuning its parameters, and in your paper you want to compare your model with some other models. It is really tempting not to spend much time tuning the parameters of the other models. However, you could run automatic hyperparameter tuning to find the best values of those parameters for the models you are comparing with, and in this case the comparison would be more honest.

The problem here is that we have a mixture of discrete and continuous variables. For example, the learning rate is continuous, while the choice of whether to use dropout or not is a binary decision. So how can we mix continuous and discrete variables in a Gaussian process? The simple trick is this: you treat discrete variables as continuous when fitting the process. For example, when you use dropout, the value is one, and when you don't, the value is zero. Then, when you maximize the acquisition function, you optimize it by brute-forcing all possible values of the discrete variables. For example, you find the maximum of the acquisition function without dropout, then find it with dropout, and select whichever case is better.
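As a rough illustration of that trick (not code from the course), here is a minimal Python sketch: a Gaussian process is fit over a continuous log-learning-rate and a 0/1 dropout flag, and an expected-improvement acquisition is maximized by enumerating the flag and grid-searching the learning rate. The observation values are made-up placeholders.

```python
# Minimal sketch of mixing a discrete and a continuous hyperparameter:
# the dropout flag is treated as a 0/1 feature when fitting the GP,
# and brute-forced over {0, 1} when maximizing the acquisition.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Points evaluated so far: each row is [log10(learning rate), dropout flag],
# y is the validation accuracy the trained network achieved there (placeholders).
X = np.array([[-3.0, 0.0],
              [-2.0, 1.0],
              [-4.0, 1.0]])
y = np.array([0.71, 0.78, 0.69])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                              normalize_y=True).fit(X, y)

def expected_improvement(x, best_y):
    """Expected-improvement acquisition at a single point x."""
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    mu, sigma = mu[0], max(sigma[0], 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# Maximize the acquisition: enumerate the discrete variable,
# grid-search the continuous one.
best_point, best_ei = None, -np.inf
for dropout in (0.0, 1.0):
    for log_lr in np.linspace(-5.0, -1.0, 200):
        ei = expected_improvement(np.array([log_lr, dropout]), y.max())
        if ei > best_ei:
            best_point, best_ei = (log_lr, dropout), ei

print("next configuration to train:", best_point)
```

In a real run, the suggested configuration would be used to train the network once more, the resulting validation score appended to X and y, and the whole procedure repeated.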
One special case is when all variables are discrete. Those problems are called multi-armed bandits, and they are widely used in information retrieval tasks. For example, when you're building a search engine result page, you have to select a lot of hyperparameters that are discrete, and for this case Bayesian optimization is really useful.

Another application is drug discovery. We have some molecules that could potentially become drugs for severe diseases. Here we have a molecule, and we can represent it as a string. This string is called SMILES, and it can be constructed from the molecule quite simply. What you can do then is build an autoencoder that takes the SMILES as input and tries to reproduce it as output. You can use a variational autoencoder, which we talked about in week five, to make the latent space dense; that is, you can move along the space and, for each point, reconstruct some valid molecule.

And now here's the trick: you know that some molecules are useful for curing some diseases and some are not. So here we have a plot of the latent space, and in this latent space you want to find the position of the maximum, that is, the molecule that will be best at curing the disease. After you find the maximum in the latent space, you simply plug it into the decoder, reconstruct the molecule, and then run some trials, for example in vitro or in vivo. After this, you get the value at that point, the new observation. You add it to the model, refit the Gaussian process, and find the new maximum of the acquisition function. Just by iterating this procedure, you can quickly find new drugs for different diseases.
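To make that loop concrete, here is a minimal Python sketch under stated assumptions: `encode`, `decode`, and `assay` are placeholder stand-ins for the trained SMILES variational autoencoder and for the expensive in vitro / in vivo measurement, and a Gaussian process with an expected-improvement acquisition drives the search in the latent space. It illustrates the idea, not the actual pipeline from the video.

```python
# Sketch of Bayesian optimization in the latent space of a SMILES VAE.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

LATENT_DIM = 8

def encode(smiles):
    """Stand-in for the trained VAE encoder: SMILES string -> latent vector."""
    rng = np.random.default_rng(abs(hash(smiles)) % (2 ** 32))
    return rng.normal(size=LATENT_DIM)

def decode(z):
    """Stand-in for the trained VAE decoder: latent vector -> SMILES string."""
    return "C" * (1 + int(abs(z[0]) * 3) % 6)  # placeholder molecule

def assay(smiles):
    """Stand-in for the lab measurement of how active the molecule is."""
    return (len(smiles) % 10) / 10.0           # placeholder activity score

# Start from a few molecules with known activity (e.g. aspirin, caffeine).
known = ["CC(=O)Oc1ccccc1C(=O)O", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]
Z = np.stack([encode(s) for s in known])
y = np.array([assay(s) for s in known])

for step in range(10):                         # optimization budget
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6,
                                  normalize_y=True).fit(Z, y)

    # Maximize expected improvement over random latent candidates.
    candidates = np.random.randn(2000, LATENT_DIM)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    gain = mu - y.max()
    ei = gain * norm.cdf(gain / sigma) + sigma * norm.pdf(gain / sigma)
    z_next = candidates[np.argmax(ei)]

    # Decode to a molecule, measure it, and add the new observation.
    smiles_next = decode(z_next)
    Z = np.vstack([Z, z_next])
    y = np.append(y, assay(smiles_next))

print("best activity observed:", y.max())
```

In practice, one would typically replace the random candidate search with a gradient-based maximization of the acquisition function and keep the search inside a region of the latent space where the decoder produces valid molecules.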