Hi, I'm going to describe to you the
details of your homework assignment:
predicting the ethnicity of an individual
given their genetic information.
Before I go ahead with that, a few
words about myself.
My name is Rajgopal Srinivasan.
I'm part of the TCS R&D group.
My lab is based out of Hyderabad.
We do many interesting things in our lab,
one of which is looking
at the effects of the variation of the
human genome on issues related to health.
Right, so we use a number of strategies
here.
We use algorithms from computer science
and from statistics,
along with deep domain knowledge, to
understand the effects of human genome
variation on health.
So, what is human genome variation?
Right.
So let us try to understand that, because
that
is important to know in order to do this
problem.
You know that all of us have
cells,
millions of cells,
each of which carries DNA.
In fact, we have two copies of the DNA
in each one of our cells,
one of which comes from the father and the
other from the mother.
Some of you might have heard that
the difference in DNA between any pair of
random
individuals
is less than half a percent.
What does it mean to say that the
difference
between two individuals is less than half a
percent?
To understand that, visualize the DNA as a
three-billion-long
string made up of the alphabet A, C, G
and T.
At any of these three billion positions,
two individuals
may have a different letter,
right?
For example, at a particular position,
individual one might have an
A, while at the same position, individual
two might have a C.
We refer to this as a variation.
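As a rough illustration of the single-position comparison just described, the sketch below lists the positions where two equal-length strings differ. The sequences and the helper name `differing_positions` are made up for this example, and real genomes are far longer and not compared this naively.

```python
def differing_positions(seq_a, seq_b):
    """Return the 0-based positions where two equal-length DNA strings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Toy sequences, invented for illustration (not real genomic data).
individual_one = "ACGTACGTAC"
individual_two = "ACGTCCGTAC"

diffs = differing_positions(individual_one, individual_two)
fraction = len(diffs) / len(individual_one)
print("differing positions:", diffs)
print("fraction differing:", fraction)

# Upper bound implied by "less than half a percent" of a
# three-billion-long string: 0.5% of 3,000,000,000 positions.
max_differences = 3_000_000_000 * 5 // 1000
print("at most", max_differences, "differing positions")
```

Under the half-a-percent figure, two individuals would differ at fewer than roughly 15 million of the three billion positions.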
Of course, more complex variations are
possible.
You could have entire regions that
are deleted in one individual
compared to another.
Conversely, an individual might carry an
insertion with
respect to another.
Even more complex variations are possible,
where entire regions are
deleted in one and then reinserted at a
different position,
compared to another
individual.
So this is how one compares two
individuals.
But it is tedious to always have to
compare pairs of individuals. An easier
way of representing variation across a
population is
to compare every individual to a
reference.
And as that reference, biologists
use what is known as the reference human
genome.
To a very first approximation, the
reference human genome is a 3 billion
long string, where at each position the
letter is the most commonly occurring
letter, right.
So that is what is used by most biologists
when they compare one genome to another.
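To make the idea of comparing everyone to a single reference concrete, here is a minimal sketch. It takes the "very first approximation" literally and builds a toy reference from the most common letter at each position, then records each individual only by where they differ from it. The data and the function names `build_reference` and `variants_against` are assumptions for illustration; the real reference human genome is not constructed this way.

```python
from collections import Counter


def build_reference(sequences):
    """Pick the most common letter at each position (a toy 'reference')."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


def variants_against(reference, sequence):
    """Return {position: letter} where sequence differs from the reference."""
    return {
        i: s for i, (r, s) in enumerate(zip(reference, sequence)) if r != s
    }


# Toy population of four equal-length sequences (invented data).
population = ["ACGTAC", "ACGTCC", "ACGAAC", "TCGTAC"]

reference = build_reference(population)
variant_sets = [variants_against(reference, s) for s in population]

print("reference:", reference)
for seq, variants in zip(population, variant_sets):
    print(seq, "->", variants)
```

Each individual is now described by a short list of differences rather than by a comparison against every other individual.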
So what are the effects of the variations?
Variations are what make us unique.
Some of us are tall, some of us
have blue eyes, others have red hair, and
so on.
These all have their basis in the genetic
information that we have in the DNA in
our cells.
These differences can also have more
serious consequences.
They make some of us more likely to
get diseases, and others less likely to
get diseases.
Some of these variants also have
implications
for how well we respond to certain
therapies.
So it is very important, then,
to understand genetic variation and its
consequences.
And this is, in fact, an active area of
research in modern biology.
So, with that background, let's turn to
the
problem at hand.
By the way, if you want more details
on this, there is a PDF file called
genetics.pdf
on the course homework site, which you
can
go through to get a little bit more
information
about genetics and its
implications in general.
Okay, with that let's get back to the
problem at hand.
What you have is variation
information for 150 individuals.
These individuals come from five different
ethnic groups.
So for each individual you are given their
ethnic
group, and information on approximately
200,000 variants in their genome.
In other words, what you have is 150
labelled vectors,
where the label tells you the
ethnicity
of the individual.
And each element in the vector, which is
approximately 200,000 long, has a value of
either
one or zero, indicating whether a
particular variant is
present in that individual or absent in
that individual.
There are a small number of cases where
the information is
not available, and so this is reported as
a missing value.
So you should account for that
when you process the data
using whatever software package you
are using, right.
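One possible way to account for missing values is sketched below: each missing entry is filled with the most frequent observed value in its column. The comma-separated layout, the `?` missing marker, and the group names are assumptions made for this example only. The actual file format is the one described in readme.txt, and other strategies (for instance dropping affected columns) may work just as well.

```python
def parse_row(line, missing_token="?"):
    """Parse 'label,v1,v2,...' into (label, values); None marks a missing entry.

    The layout and the '?' marker are assumptions for this sketch.
    """
    label, *fields = line.strip().split(",")
    values = [None if f == missing_token else int(f) for f in fields]
    return label, values


def impute_with_column_mode(rows):
    """Replace None in each column with that column's most frequent 0/1 value."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Use 1 only if ones are a strict majority; otherwise fall back to 0
        # (variant absent), which also covers ties and all-missing columns.
        mode = 1 if sum(observed) * 2 > len(observed) else 0
        for r in filled:
            if r[j] is None:
                r[j] = mode
    return filled


# Invented toy rows: a label followed by four 0/1 variant indicators.
raw = ["groupA,1,0,?,1", "groupA,1,?,0,1", "groupB,0,1,0,?"]

labels, rows = zip(*(parse_row(line) for line in raw))
clean = impute_with_column_mode(list(rows))
print(labels)
print(clean)
```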
So what you have to do, the task in front
of you, is to build a model
using the presence or absence of variants
at approximately
200,000 positions to predict the
ethnicity of the individuals, all right.
Once you have built yourself such a model,
use that model on a different data
set of 38 individuals, for whom you are
given
the genetic data but you aren't given the
ethnicity.
So use this model to predict the ethnicity
of these 38 individuals, right.
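The assignment leaves the choice of model open. Purely as an illustration of the train-then-predict workflow, the sketch below uses a nearest-centroid classifier on tiny made-up 0/1 vectors. The technique, the data, the group names, and the function names `fit_centroids` and `predict` are all assumptions for this example, not the method prescribed by the assignment.

```python
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Compute, per label, the mean of each 0/1 feature over that label's vectors."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((c - v) ** 2 for c, v in zip(centroid, vector))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Invented training data: four labelled individuals, four variants each.
train_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
train_labels = ["groupA", "groupA", "groupB", "groupB"]

centroids = fit_centroids(train_vectors, train_labels)

# Invented unlabelled individuals, standing in for the 38 to be predicted.
unlabelled = [[1, 1, 0, 1], [0, 0, 1, 0]]
predictions = [predict(centroids, v) for v in unlabelled]
print(predictions)
```

With only 150 labelled individuals and about 200,000 features, it is probably worth holding out some of the labelled data to check how well whichever model you choose generalizes before predicting the 38 unlabelled individuals.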
So, that is the problem. If you want more
details on the file formats and so on,
please look at the course website, where
all of this is detailed
in a file called readme.txt.
Happy solving.