[MUSIC] Hi, welcome to our course. In this first video, we will see the basic principles that we'll use throughout this course. Let's learn them by example. Imagine you are running through a park and you see another man running. You ask yourself, why is he running? And you come up with four different explanations. First, he is in a hurry. Second, he is doing some sports. Third, he always runs. And fourth, he saw a dragon.

Principle 1: use prior knowledge. From our previous experience we know that dragons do not exist, and so we can exclude the fourth option from further consideration.

Principle 2: choose the answer that explains the observations best. Imagine you saw that he is not wearing a sports suit. In this case, it's very unlikely that he's doing sports, and so we can exclude option two.

Principle 3: avoid making extra assumptions. Of the last two options, the third one, that he always runs, makes a lot of extra assumptions, and so we should exclude it. This principle is also known as Occam's razor. And finally, we are left with only one case: he is in a hurry. To conclude, we've seen three principles: use prior knowledge, choose the answer that explains the observations best, and avoid making extra assumptions.

Before we continue, let's review some basics from probability theory. We define probability in the following way. Imagine you have some source of randomness, for example, a die, and you repeat an experiment multiple times. As the number of experiments goes to infinity, the probability of an event is the fraction of the times that event occurred. For example, for a fair die you would expect the event that you threw a five to have a frequency of about one-sixth, and the event that you threw an odd number to have a frequency of about one-half.

We will consider two different types of random variables, depending on which values they can take: discrete and continuous. A discrete random variable can take either a finite number of values, as for example a die, or an infinite number, as when you count the number of times a certain event happened. An example of a continuous random variable would be tomorrow's temperature.

The most convenient way to define a discrete distribution is called the probability mass function. It assigns to each point a number, the probability of that point. For example, in this case we get one point with probability 0.2, another with probability 0.5, a third with probability 0.3, and all other points with probability 0. Also note that these probabilities sum up to 1.

The most convenient way to define a continuous distribution is called the probability density function. It assigns a non-negative value to each point. Then, to compute the probability that a point falls into some range, for example from a to b, you integrate this function over the given range, as is given on the slide.

We will also need the notion of independence. Two random variables are considered independent if their joint probability, that is, the probability of X and Y, equals the product of their marginals: p(X, Y) = p(X) p(Y). Let's see an example. Imagine that you have a deck of 52 cards and you randomly draw 2 cards from it. The first random variable is the picture drawn on the first card, and the second is the picture drawn on the second card. These random variables are dependent, since it is impossible to take one card two times.
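To make the frequency definition and the probability mass function concrete, here is a minimal Python sketch (not part of the lecture; the die simulation and the PMF values are chosen to mirror the examples above):

```python
import random

random.seed(0)
n = 1_000_000

# Simulate n rolls of a fair six-sided die.
rolls = [random.randint(1, 6) for _ in range(n)]

# Empirical frequencies approach the true probabilities as n grows.
print(sum(r == 5 for r in rolls) / n)      # ~ 1/6 ≈ 0.1667
print(sum(r % 2 == 1 for r in rolls) / n)  # ~ 1/2 = 0.5

# A probability mass function like the one on the slide: each point
# gets a probability, and the probabilities sum to 1.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}  # all other points have probability 0
assert abs(sum(pmf.values()) - 1.0) < 1e-9
```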
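A similar sketch for the continuous case and for independence, assuming a standard normal density purely for illustration (the lecture does not fix a particular density):

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at point x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def prob_in_range(pdf, a, b, steps=100_000):
    """P(a <= X <= b): integrate the density over [a, b] (midpoint rule)."""
    h = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * h) for i in range(steps)) * h

print(prob_in_range(normal_pdf, -1.0, 1.0))  # ~0.6827 for a standard normal

# Independence check for two coins thrown independently: the empirical
# joint frequency of (heads, tails) should match the product of the
# marginal frequencies.
random.seed(0)
n = 1_000_000
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]
p_x = sum(x for x, _ in flips) / n       # P(first coin lands heads)
p_y = sum(not y for _, y in flips) / n   # P(second coin lands tails)
p_xy = sum(x and not y for x, y in flips) / n
print(p_xy, p_x * p_y)  # both ~0.25
```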
Another example is throwing two coins independently. Here the probability that the first coin lands heads up and the second lands tails up equals the product of the two probabilities, and so these random variables are independent.

The last thing we'll need is conditional probability. We want to answer the question: what is the probability of X given that some event Y happened? It is given by the formula that you can see on the slide: the probability of X given Y equals the joint probability over the marginal, p(X | Y) = p(X, Y) / p(Y). Let's consider an example. Imagine you are a student and you want to pass some course. It has two exams in it, a midterm and a final. The probability that a student passes the midterm is 0.4, and the probability that a student passes both the midterm and the final is 0.25. If you want to find the probability that you will pass the final given that you already passed the midterm, you can apply the formula from the previous slide, and this gives you 0.25 / 0.4 = 0.625, a value around 60%.

We'll need two tricks to deal with formulas. The first is called the chain rule. We can derive it from the definition of conditional probability: the joint probability of X and Y equals the probability of X given Y times the probability of Y, p(X, Y) = p(X | Y) p(Y). By induction, we can prove the same formula for three variables: p(X, Y, Z) = p(X | Y, Z) p(Y | Z) p(Z). In a similar way, we can obtain the formula for an arbitrary number of variables: the joint probability is the product of the probabilities of each variable given all the previous ones.

The second is called the sum rule. If you want to find the marginal distribution p(X) and you know only the joint probability p(X, Y), you can integrate out the random variable Y, as in the formula on the slide.

And finally, the most important formula for this course: the Bayes theorem. We want to find the probability of theta given X, where theta are the parameters of our model. For example, we have a neural network, and those are its parameters. And X are the observations, for example, the images that we are dealing with. From the definition of conditional probability, we can say that it is the ratio between the joint probability and the marginal probability p(X). If we also apply the chain rule to the numerator, we get the following formula: the probability of X given theta, times the probability of theta, over the probability of X, that is, p(theta | X) = p(X | theta) p(theta) / p(X).

This formula is so important that each of its components has its own name. The probability of theta is called the prior; it captures the prior knowledge we have about the parameters. For example, we may know that some parameters are distributed around 0. The term probability of X given theta is called the likelihood; it shows how well the parameters explain our data. The thing that we get, the probability of theta given X, is called the posterior; it is the probability of the parameters after we observe the data. And finally, the term in the denominator is called the evidence. [MUSIC]
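The exam example and the two tricks can be checked numerically. In this sketch the exam probabilities come from the lecture, while the small joint table p(X, Y) is invented for illustration:

```python
# Conditional probability: the exam example from the lecture.
p_midterm = 0.4           # P(pass midterm)
p_mid_and_final = 0.25    # P(pass midterm and final)
p_final_given_mid = p_mid_and_final / p_midterm
print(p_final_given_mid)  # 0.625, i.e. around 60%

# Sum rule on a small (made-up) discrete joint distribution p(X, Y):
# marginalize Y by summing it out (the integral becomes a sum).
p_xy = {
    ("x1", "y1"): 0.1, ("x1", "y2"): 0.3,
    ("x2", "y1"): 0.2, ("x2", "y2"): 0.4,
}
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
print(p_x)  # {'x1': 0.4, 'x2': 0.6}

# Chain rule: p(X, Y) = p(X | Y) * p(Y) holds on the same table.
p_y = {"y1": 0.3, "y2": 0.7}  # marginals of Y, summed from the table
for (x, y), p in p_xy.items():
    p_x_given_y = p / p_y[y]
    assert abs(p_x_given_y * p_y[y] - p) < 1e-12
```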
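And a minimal sketch of Bayes' theorem itself, on a made-up discrete model where theta is the unknown bias of a coin (the lecture only states the formula; the numbers here are illustrative):

```python
# Bayes' theorem on a toy discrete model: theta is the unknown bias
# of a coin, X is the observation "the coin landed heads".
prior = {0.3: 0.5, 0.8: 0.5}        # p(theta): two candidate biases
likelihood = {t: t for t in prior}  # p(X = heads | theta) = theta

# Evidence: p(X) = sum over theta of p(X | theta) * p(theta).
evidence = sum(likelihood[t] * prior[t] for t in prior)

# Posterior: p(theta | X) = p(X | theta) * p(theta) / p(X).
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
print(posterior)  # {0.3: ~0.273, 0.8: ~0.727}
```

After seeing one head, the posterior shifts probability mass toward the larger bias, exactly as the formula prescribes.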