Hi, I'm going to describe to you the details of your homework assignment: predicting the ethnicity of an individual given their genetic information. Before I go ahead with that, a few words about myself. My name is Rajgopal Srinivasan. I'm part of the TCS R&D group, and my lab is based out of Hyderabad. We do many interesting things in our lab, one of which is looking at the effects of variation in the human genome on issues related to health. We use a number of strategies here: algorithms from computer science and from statistics, along with deep domain knowledge, to understand the effects of human genome variation on health.

So, what is human genome variation? Let us try to understand that, because it is important to know in order to do this problem. You know that all of us have cells, millions of cells, each of which carries DNA. In fact, we have two copies of the DNA in each one of our cells, one of which comes from the father and the other from the mother. Some of you might have heard that the difference in DNA between any pair of random individuals is less than half a percent. What does it mean to say that the difference between two individuals is less than half a percent? To understand that, visualize the DNA as a three billion long string, made up of the alphabet A, C, G and T. At any of these three billion positions, two individuals may have a different letter, right? For example, at a particular position, individual one might have an A, while at the same position, individual two might have a C. We refer to this as a variation.

Of course, more complex variations are possible. You could have entire regions that are deleted in one individual compared to another. Conversely, an individual might carry an insertion with respect to another. More complex variations are also possible, where entire regions are deleted in one individual and then reinserted at a different position
compared to another individual. So this is how one compares two individuals. But it is tedious to always have to compare pairs of individuals. An easier way of representing variation across a population is to compare every individual to a reference. As that reference, biologists use what is known as the reference human genome. To a very first approximation, the reference human genome is a 3 billion long string, where the letter at each position is the one most commonly seen at that position, right. So that is what is used by most biologists when they compare one genome to another.

So what are the effects of the variations? Variations are what make us unique. Some of us are tall, some of us have blue eyes, others have red hair, and so on. These all have their basis in the genetic information that we have in the DNA in our cells. These differences also have more serious consequences. They make some of us more likely to get diseases, and others less likely to get diseases. Some of these variants also have implications for how well we respond to certain therapies. So it is very important to understand genetic variation and its consequences. And this is, in fact, an active area of research in modern biology. By the way, if you want more details on this, there is a PDF file called genetics.pdf on the course homework site, which you can go through to get a little more information about genetics and its implications in general.

Okay, with that background, let's get back to the problem at hand. What you have is variation information for 150 individuals. These individuals come from five different ethnic groups. For each individual you are given their ethnic group, and information on approximately 200,000 variants in their genome. In other words, what you have is 150 labelled vectors, where the label tells you the ethnicity of the individual.
And each element in the vector, which is approximately 200,000 long, has a value of either one or zero, indicating whether a particular variant is present in that individual or absent in that individual. There are a small number of cases where the information is not available, and so this is reported as a missing value. You should account for that when you process the data using whatever software package you are using, right.

So the task in front of you is to build a model using the presence or absence of variants at approximately 200,000 positions to predict the ethnicity of the individuals, all right. Once you have built yourself such a model, use that model on a different data set of 38 individuals, for whom you are given the genetic data but you aren't given the ethnicity. So use this model to predict the ethnicity of these 38 individuals, right. So, that is the problem. If you want more details on the file formats and so on, please look at the course website, where all of this is detailed in a file called readme.txt. Happy solving.
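
As a rough illustration of the task described above, here is a minimal sketch of one possible approach: a nearest-neighbour vote over the 0/1 variant vectors that skips positions where either value is missing. This is not the assignment's prescribed method. The names `mismatch_rate` and `predict`, the use of `None` for a missing value, the group labels, and the tiny six-position vectors are all assumptions made for illustration; the real file layout is the one described in readme.txt.

```python
from collections import Counter
from typing import List, Optional, Sequence

# Assumed encoding: 1 = variant present, 0 = absent, None = missing value.
Genotype = Sequence[Optional[int]]


def mismatch_rate(a: Genotype, b: Genotype) -> float:
    """Fraction of positions that differ, counting only positions known in both."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip positions with a missing value
        compared += 1
        if x != y:
            differing += 1
    # If no position can be compared, treat the pair as maximally distant.
    return differing / compared if compared else 1.0


def predict(train_vectors: List[Genotype], train_labels: List[str],
            query: Genotype, k: int = 3) -> str:
    """Majority label among the k training vectors closest to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: mismatch_rate(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-in for the 150 labelled vectors (hypothetical values and labels).
train_vectors = [
    [1, 1, 0, 0, None, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, None, 0],
    [0, 1, 1, 1, 1, 0],
]
train_labels = ["group_A", "group_A", "group_A",
                "group_B", "group_B", "group_B"]

# Toy stand-in for the unlabelled individuals.
unlabelled = [
    [1, 1, 0, None, 0, 1],
    [0, 0, 1, 1, 1, None],
]
predictions = [predict(train_vectors, train_labels, q) for q in unlabelled]
print(predictions)
```

With about 200,000 positions per individual, a pure-Python loop like this would be slow, so in practice you would likely lean on an array library or the classifier in whatever software package you are using. The point of the sketch is only the handling of missing values and the shape of the train-then-predict step.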