1 00:00:01,480 --> 00:00:05,720 Hi, I'm going to describe to you the details of your homework assignment: 2 00:00:05,720 --> 00:00:09,050 predicting the ethnicity of an individual given their genetic information. 3 00:00:10,246 --> 00:00:13,598 Before I go ahead with that, a few words about myself. 4 00:00:13,598 --> 00:00:16,010 My name is Rajgopal Srinivasan. 5 00:00:16,010 --> 00:00:20,910 I'm part of the TCS R&D group. My lab is based out of Hyderabad. 6 00:00:22,910 --> 00:00:27,350 We do many interesting things in our lab, one of which is looking 7 00:00:27,350 --> 00:00:33,610 at the effects of variation in the human genome on issues related to health. 8 00:00:33,610 --> 00:00:36,120 Right, so we use a number of strategies here. 9 00:00:36,120 --> 00:00:40,100 We use algorithms from computer science 10 00:00:40,100 --> 00:00:44,620 and from statistics, along with deep domain knowledge, to 11 00:00:44,620 --> 00:00:50,630 understand the effects of human genome variation on health. 12 00:00:50,630 --> 00:00:52,730 So, what is human genome variation? 13 00:00:52,730 --> 00:00:53,230 Right. 14 00:00:53,230 --> 00:00:54,870 So let us try to understand that, because that 15 00:00:54,870 --> 00:00:57,740 is important to know in order to do this problem. 16 00:00:59,020 --> 00:01:01,679 You know that all of us have cells, 17 00:01:02,860 --> 00:01:04,760 millions of cells, 18 00:01:04,760 --> 00:01:07,930 each of which carries DNA. 19 00:01:07,930 --> 00:01:11,710 So in fact we have two copies of the DNA in each one of our cells, 20 00:01:11,710 --> 00:01:14,850 one of which comes from the father and the other from the mother. 21 00:01:16,160 --> 00:01:17,840 Some of you might have once heard that 22 00:01:17,840 --> 00:01:20,038 the difference in DNA between any pair of random 23 00:01:20,038 --> 00:01:20,312 individuals 24 00:01:20,312 --> 00:01:20,800 [UNKNOWN] 25 00:01:20,800 --> 00:01:22,720 is less than half a percent. 
26 00:01:23,720 --> 00:01:25,890 What does it mean to say that the difference 27 00:01:25,890 --> 00:01:29,184 between two individuals is less than half a percent? 28 00:01:29,184 --> 00:01:33,234 To understand that, visualize the DNA as a three-billion-long 29 00:01:33,234 --> 00:01:36,480 string made up of the alphabet A, C, G and T. 30 00:01:37,800 --> 00:01:39,904 So at any of these three billion positions, two 31 00:01:39,904 --> 00:01:40,359 individuals 32 00:01:40,359 --> 00:01:41,445 may have a different letter. 33 00:01:41,445 --> 00:01:42,560 [UNKNOWN], 34 00:01:42,560 --> 00:01:43,086 right? 35 00:01:43,086 --> 00:01:47,157 For example, at a particular position, individual one might have an 36 00:01:47,157 --> 00:01:51,800 A, while at the same position, individual two might have a C. 37 00:01:51,800 --> 00:01:54,090 We refer to this as a variation. 38 00:01:54,090 --> 00:01:57,170 Of course, more complex variations are possible. 39 00:01:57,170 --> 00:01:59,800 So you could have entire regions that 40 00:01:59,800 --> 00:02:03,470 are deleted in one individual compared to another. 41 00:02:04,870 --> 00:02:08,840 Conversely, an individual might carry an insertion with 42 00:02:08,840 --> 00:02:10,670 respect to another. 43 00:02:10,670 --> 00:02:14,760 Even more complex variations are possible, where entire regions are 44 00:02:14,760 --> 00:02:18,720 deleted in one individual and then reinserted at a different position 45 00:02:18,720 --> 00:02:19,305 compared to another 46 00:02:19,305 --> 00:02:21,290 [INAUDIBLE]. 47 00:02:21,290 --> 00:02:24,374 So this is how one compares two individuals. 48 00:02:24,374 --> 00:02:29,800 But it is tedious to always have to compare pairs of individuals. An easier 49 00:02:29,800 --> 00:02:34,210 way of representing variation across a population is 50 00:02:34,210 --> 00:02:36,250 to compare every individual to a reference. 
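The idea of comparing each individual to a reference, rather than to every other individual, can be sketched in a few lines of Python. This is only an illustration on toy strings: the function name `substitutions` and the short sequences are made up for this example, and it handles single-letter substitutions only, not the insertions, deletions, or rearrangements mentioned above.

```python
def substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_letter, sample_letter) for every mismatch.

    Simplifying assumption: both sequences have the same length, so only
    single-letter substitutions can be detected.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Toy stand-ins for the three-billion-letter strings (hypothetical data).
reference = "ACGTACGT"
individual_1 = "ACGTACGT"
individual_2 = "ACCTACGA"

print(substitutions(reference, individual_1))
print(substitutions(reference, individual_2))
```

With a shared reference, each individual needs only one comparison, and two individuals can then be compared simply by looking at their lists of differences.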
51 00:02:37,250 --> 00:02:39,620 And as a reference, biologists 52 00:02:39,620 --> 00:02:41,460 use what is known as the reference human genome. 53 00:02:41,460 --> 00:02:46,490 To a very first approximation, the reference human genome is a 3 billion 54 00:02:46,490 --> 00:02:51,660 long string, where at each position the letter is the most commonly occurring 55 00:02:51,660 --> 00:02:58,720 one, right. So that is what is used by most biologists 56 00:02:58,720 --> 00:03:05,900 when they compare one genome to another. So what are the effects of the variations? 57 00:03:07,220 --> 00:03:11,660 Variations are what make us unique. Some of us are tall, some of us 58 00:03:11,660 --> 00:03:14,930 have blue eyes, others have red hair, and so on. 59 00:03:14,930 --> 00:03:18,260 These all have their basis in the genetic 60 00:03:18,260 --> 00:03:21,540 information that we have in the DNA in our cells. 61 00:03:23,050 --> 00:03:26,725 These differences can also have more serious consequences. 62 00:03:26,725 --> 00:03:27,710 [COUGH] 63 00:03:27,710 --> 00:03:29,600 They make some of us more likely to 64 00:03:29,600 --> 00:03:34,290 get diseases, and others less likely to get diseases. 65 00:03:34,290 --> 00:03:37,430 Some of these variants also have implications 66 00:03:37,430 --> 00:03:39,229 for how well we respond to certain therapies. 67 00:03:40,450 --> 00:03:42,150 So it is very important, then, 68 00:03:42,150 --> 00:03:46,340 to understand genetic variation and its consequences. 69 00:03:46,340 --> 00:03:49,420 And this is, in fact, an active area of research in modern biology. 70 00:03:50,600 --> 00:03:52,930 So, with that background, let's turn to the 71 00:03:52,930 --> 00:03:54,800 problem at hand. 
72 00:03:54,800 --> 00:03:57,380 By the way, if you want more details 73 00:03:57,380 --> 00:04:00,880 on this, there is a PDF file called genetics.pdf 74 00:04:00,880 --> 00:04:04,420 on the Coursera homework site, which you can 75 00:04:04,420 --> 00:04:07,230 go through to get a little bit more information 76 00:04:08,350 --> 00:04:12,240 about genetics and its implications in general. 77 00:04:13,520 --> 00:04:16,720 Okay, with that let's get back to the problem at hand. 78 00:04:16,720 --> 00:04:18,180 What you have is variation 79 00:04:18,180 --> 00:04:20,870 information for 150 individuals. 80 00:04:22,628 --> 00:04:26,060 These individuals come from five different ethnic groups. 81 00:04:26,060 --> 00:04:28,684 So for each individual you are given their ethnic 82 00:04:28,684 --> 00:04:33,440 group, and information on approximately 200,000 variants in their genome. 83 00:04:35,320 --> 00:04:38,878 In other words, what you have is 150 labelled vectors, 84 00:04:38,878 --> 00:04:40,738 where the label tells you the 85 00:04:40,738 --> 00:04:41,172 ethnic group 86 00:04:41,172 --> 00:04:42,260 of the individual. 87 00:04:42,260 --> 00:04:44,852 And each element in the vector, which is 88 00:04:44,852 --> 00:04:48,452 approximately 200,000 long, has a value of either 89 00:04:48,452 --> 00:04:52,196 one or zero, indicating whether a particular variant is 90 00:04:52,196 --> 00:04:56,246 present in that individual or absent in that individual. 91 00:04:56,246 --> 00:04:58,982 There are a small number of cases where the information is 92 00:04:58,982 --> 00:05:02,640 not available, and so this is reported as a missing value. 93 00:05:02,640 --> 00:05:06,500 So you should account for that when you process the data 94 00:05:06,500 --> 00:05:11,510 using whatever software package you are using, right. 
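One way to account for the missing values is to fill each gap with the most common observed value for that variant. The sketch below shows this on toy data. The line layout (a label followed by space-separated values), the `?` token for a missing value, and the names `parse_row` and `impute_most_common` are all assumptions made for illustration; the actual file format is the one described in readme.txt, and other ways of handling missing data may work better.

```python
from collections import Counter

MISSING = None  # in-memory marker for a missing value (a choice of this sketch)


def parse_row(line, missing_token="?"):
    """Split 'label v1 v2 ...' into (label, values); the layout is hypothetical."""
    label, *values = line.split()
    return label, [MISSING if v == missing_token else int(v) for v in values]


def impute_most_common(vectors):
    """Replace each missing entry with the most common observed value in its column."""
    filled = [list(v) for v in vectors]
    for j in range(len(vectors[0])):
        observed = [v[j] for v in vectors if v[j] is not MISSING]
        # Fall back to 0 (variant absent) if a column has no observed values.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] is MISSING:
                row[j] = fill
    return filled


# Toy data: 4 individuals, 4 variants (the real data has 150 and about 200,000).
lines = [
    "groupA 1 0 ? 1",
    "groupA 1 0 1 1",
    "groupB 0 ? 0 1",
    "groupB 0 1 0 0",
]
parsed = [parse_row(line) for line in lines]
labels = [label for label, _ in parsed]
vectors = impute_most_common([values for _, values in parsed])
print(labels)
print(vectors)
```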
95 00:05:11,510 --> 00:05:17,000 So what you have to do, the task in front of you, is to build a model 96 00:05:17,000 --> 00:05:21,969 using the presence or absence of variants at approximately 97 00:05:21,969 --> 00:05:26,930 200,000 positions to predict the ethnicity of the individuals, all right. 98 00:05:26,930 --> 00:05:31,590 Once you have built yourself such a model, use that model on a different data 99 00:05:31,590 --> 00:05:34,520 set of 38 individuals, for whom you are given 100 00:05:34,520 --> 00:05:37,360 the genetic data but not the ethnicity. 101 00:05:37,360 --> 00:05:43,340 So use this model to predict the ethnicity of these 38 individuals, right. 102 00:05:43,340 --> 00:05:47,740 So, that is the problem. If you want more details on the file formats and so on, 103 00:05:47,740 --> 00:05:53,230 please look at the course website, where all of this is detailed 104 00:05:53,230 --> 00:05:58,828 in a file called readme.txt. 105 00:05:58,828 --> 00:06:02,220 Happy solving.
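As a rough illustration of the task, the sketch below trains on labelled presence/absence vectors and predicts labels for unlabelled ones using a one-nearest-neighbour rule with Hamming distance. This is just one possible baseline chosen for brevity, not the method the assignment prescribes; the data, the group names, and the function names `hamming` and `predict_1nn` are invented for the example.

```python
def hamming(a, b):
    """Count the positions at which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the training vector closest to the query."""
    best = min(
        range(len(train_vectors)),
        key=lambda i: hamming(train_vectors[i], query),
    )
    return train_labels[best]


# Toy labelled data (the real training set has 150 individuals, 5 groups).
train_vectors = [[1, 0, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 1, 0]]
train_labels = ["groupA", "groupA", "groupB", "groupB"]

# Toy unlabelled data (the real test set has 38 individuals).
unlabelled = [[1, 0, 1, 0], [0, 1, 0, 1]]

predictions = [predict_1nn(train_vectors, train_labels, q) for q in unlabelled]
print(predictions)
```

With only 150 labelled individuals and about 200,000 features, it is worth holding out part of the labelled data, or using cross-validation, to check how well any chosen model generalises before predicting the 38 unlabelled individuals.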