In this video, we'll see what Gaussian processes are. But before we go on, we should see what random processes are, since a Gaussian process is just a special case of a random process. In a random process, you have a d-dimensional space R^d, and to each point x of the space you assign a random variable f(x). Those variables can have some correlation, and because of that you'll get, for example, some smooth functions as samples, or maybe non-smooth ones. If you take f(x) at some point, in this case x = 3, you'll have a one-dimensional distribution that looks like, for example, this red curve. You can also sample all the f(x) values over the whole space R^d, and in this case you'll have just an ordinary function f; we can draw it as a trajectory. The functions obtained after sampling all the random variables are called trajectories. When the dimension of x is 1, we can think of x as time. We're now ready to define the Gaussian process. What I would like to say is that the joint distribution over all points in R^d is Gaussian. However, we haven't defined a Gaussian for an infinite number of points.
We have the multivariate Gaussian only for a finite number of points, so what we can say instead is this: for an arbitrary number of points n, if we take n points x1 to xn, their joint distribution will be normal. And since this holds for an arbitrary number of points, we can select n to be arbitrarily large, and then, informally, the joint distribution over all points is approximately Gaussian. When we take n points and consider their joint distribution, it is called a finite-dimensional distribution. Finite-dimensional distributions are actually useful in practice. For example, we cannot sample the whole function, but we can sample it at, say, a thousand points and plot it by interpolating between them. That is, in fact, what I used to draw this plot. A Gaussian process is parametrized by its mean and covariance. The mean is the function that takes the random variable f(x) at each point of the space and assigns to that point its mean value. We also have the covariance function, which takes two points x1 and x2 and returns the covariance between the random variables f(x1) and f(x2).
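The plotting trick just described can be sketched in code: sample the finite-dimensional distribution at many points and interpolate between them. This is a minimal sketch only; the zero mean function, the RBF-style kernel, and its parameter values are my illustrative choices, not something fixed by the video.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    # Covariance between f(x1) and f(x2); here it depends
    # only on the distance between the two points.
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * length_scale**2))

def sample_gp_trajectory(xs, kernel=rbf_kernel, jitter=1e-8):
    # Finite-dimensional distribution: the joint over f(x1)..f(xn)
    # is N(0, K) with K[i, j] = kernel(xs[i], xs[j]).
    n = len(xs)
    K = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    K += jitter * np.eye(n)  # numerical stabilization for the sampler
    return np.random.multivariate_normal(np.zeros(n), K)

xs = np.linspace(0.0, 10.0, 1000)  # "a thousand points"
f = sample_gp_trajectory(xs)       # one trajectory; plot by interpolating
```

Plotting `xs` against `f` with straight-line interpolation gives a curve that looks like a full trajectory, even though only finitely many values were sampled.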
The covariance will be equal to some function K that depends only on the positions of those two points. We'll call this function the kernel. So finally, if we have n points, their joint distribution will be normal, with the mean being the vector whose components are m(x1), m(x2), and so on up to m(xn), and with the covariance matrix whose elements are values of K at pairs of points. For example, the first element would be K(x1, x1), and so on. We'll also need the notion of a stationary process. A process is called stationary if its finite-dimensional distributions depend only on the relative positions of the points. For example, here I have four points drawn in red, and their joint distribution is equal to the joint distribution of the blue points, since the blue points are obtained just by shifting the red points to the right. So let's play a game. I have some samples from a Gaussian process, and we should find out whether they are samples from a stationary process or not. What do you think about this sample? Actually, it is not stationary, since there is seasonality here.
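The red-versus-blue-points example can be checked numerically: for a kernel that depends only on differences of points, shifting all points by the same amount leaves the covariance matrix, and hence the joint (zero-mean) distribution, unchanged. The four point positions, the shift, and the RBF kernel parameters below are made up for illustration.

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, l=1.0):
    # Stationary kernel: depends only on the difference x1 - x2.
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * l**2))

def cov_matrix(xs):
    # Covariance matrix of f(x1)..f(xn) for the points xs.
    return np.array([[rbf(a, b) for b in xs] for a in xs])

red = np.array([0.0, 0.7, 1.5, 2.2])  # four "red" points
blue = red + 3.0                      # shifted to the right: "blue" points

# Same relative positions -> identical covariance matrices,
# so (with a constant mean) identical joint distributions.
shift_invariant = np.allclose(cov_matrix(red), cov_matrix(blue))
```

If the kernel depended on absolute positions rather than differences, this check would fail and the process would not be stationary.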
If we take points at the beginning of the period, you will easily say that you are at the beginning of that period. And if we move a bit to the right, you'll say something like: you're at the end of the period. So the joint distribution would be different in different parts of the space, and therefore the process is not stationary. What about this sample? Well, again, we have a trend here. By computing the mean of some points (you can take 10 points and compute their mean), you would be able to predict in which part of the space you are. So this process is not stationary either. What about this one? Well, it seems stationary. For a stationary process, we shouldn't have a trend. This means that the mean should be constant over the whole space, so m(x) as a function is constant. Also, the kernel should depend only on the difference of the two points, which ensures that the joint distribution depends only on the relative positions of the points. We'll write it down as K(x1 - x2). Using this notation, it is also really easy to compute the variance of the random variable f(x).
The variance is equal to the kernel at position zero, since that is the covariance of the random variable f(x) with itself, which is exactly the variance. Below I have an example of a kernel. The covariance at 0 is 1, so the variance of f(x) is 1, and as the points move further apart, the covariance becomes lower and lower. There are many different kernels that you can use for training a Gaussian process. The most widely used one is called the radial basis function, or RBF for short. It equals sigma squared times the exponent of minus the squared distance between the two points over 2l^2, that is, K(x1 - x2) = sigma^2 exp(-(x1 - x2)^2 / (2l^2)). Here l is a parameter called the length scale, and sigma squared controls the value of the kernel at position zero, which is the variance of f(x). There are also many other kernels, like the rational quadratic kernel or the Brownian kernel. They all have different parameters and will give you different samples from the process. So, let's look at the radial basis function a bit closer. When the length scale parameter equals 0.1, the samples look like this. If we increase the length scale, the samples look a bit smoother and change less rapidly.
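The two facts about the RBF kernel, that its value at zero is the variance sigma^2 and that the covariance decays as points move apart, can be sketched as a quick numeric check. The parameter values sigma = 2 and l = 0.5 are arbitrary choices for the demonstration.

```python
import numpy as np

def rbf(d, sigma=2.0, l=0.5):
    # RBF kernel as a function of the difference d = x1 - x2:
    # K(d) = sigma^2 * exp(-d^2 / (2 l^2)).
    return sigma**2 * np.exp(-d**2 / (2 * l**2))

# Variance of f(x) is the kernel at zero: K(0) = sigma^2 = 4.
variance = rbf(0.0)

# Covariance decays monotonically as the points move apart.
decays = rbf(0.0) > rbf(0.5) > rbf(2.0)
```

The same check works for any stationary kernel: evaluate it at zero to read off the variance of f(x).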
And if we keep increasing the value of l, we'll get almost constant functions.
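The effect of the length scale on smoothness can be seen by sampling the same RBF process with a small and a large l and comparing how much consecutive values move. The grid, the seed, and the two length-scale values are illustrative assumptions; the "mean absolute successive difference" is just one informal way to quantify "changes less rapidly".

```python
import numpy as np

def sample_rbf_gp(xs, l, sigma=1.0, jitter=1e-8, seed=0):
    # Sample one trajectory of a zero-mean GP with an RBF kernel.
    rng = np.random.default_rng(seed)
    d = xs[:, None] - xs[None, :]
    K = sigma**2 * np.exp(-d**2 / (2 * l**2)) + jitter * np.eye(len(xs))
    return rng.multivariate_normal(np.zeros(len(xs)), K)

xs = np.linspace(0.0, 5.0, 200)
wiggly = sample_rbf_gp(xs, l=0.1)  # small length scale: rapid changes
smooth = sample_rbf_gp(xs, l=2.0)  # large length scale: nearly constant

# Larger l -> neighbouring values are more correlated, so the
# trajectory moves less between consecutive grid points.
roughness_small_l = np.mean(np.abs(np.diff(wiggly)))
roughness_large_l = np.mean(np.abs(np.diff(smooth)))
```

In the limit of a very large l, every pair of points is almost perfectly correlated, which is why the samples approach constant functions.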