Question

A recent study aimed to sequence and store the complete genomes of 10,000 humans from all across the world. Because there can be experimental errors in determining the complete genome sequence, scientists routinely repeat the sequencing experiment at least 30 times in order to be fully accurate about each nucleotide in every position. Therefore each human will have genomic data that is 30 times the size of a typical human genome.

1. How many letters (in terabytes) will the whole genome project generate? Please include all the data generated, including the repeated sequencing efforts.
2. If a typical external hard drive can contain 1 terabyte of information (i.e., 1 terabyte of letters), how many hard drives will it take to store all the data described above?
3. Imagine that you have access to all the data from the project described above. Can you come up with two possible uses for this kind of data? What kinds of information or knowledge can this massive data set reveal?

Answer 1

To calculate the answers to the questions, we need to understand the size of a typical human genome and the number of individuals being sequenced.

1. Size of the Whole Genome Project:
Since each human will have genomic data that is 30 times the size of a typical human genome, we first need to determine the size of a typical human genome. The human genome is approximately 3 billion base pairs long, and because each position holds one of four letters (A, C, G, or T), each base pair requires 2 bits to represent it.

To calculate the size of a typical human genome, we can multiply the number of base pairs by the number of bits required to represent each base pair:

3 billion base pairs * 2 bits = 6 billion bits

Now, since we know that 8 bits equal 1 byte, we can convert the size to bytes by dividing by 8:

6 billion bits / 8 = 750 million bytes
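To make the 2-bits-per-base assumption concrete, here is a minimal Python sketch of how four letters can be packed into one byte. The particular bit assignment in `BASE_TO_BITS` and the helper names `pack_bases` and `packed_size_bytes` are illustrative assumptions, not a standard file format.

```python
# Hypothetical 2-bit code for each nucleotide (any fixed assignment would work).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(sequence: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        value = 0
        for base in chunk:
            value = (value << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so every byte holds four 2-bit slots.
        value <<= 2 * (4 - len(chunk))
        packed.append(value)
    return bytes(packed)


def packed_size_bytes(num_bases: int) -> int:
    """Bytes needed to store num_bases at 2 bits each, rounded up."""
    return (num_bases * 2 + 7) // 8


GENOME_BASES = 3_000_000_000  # approximate length of one human genome
genome_bytes = packed_size_bytes(GENOME_BASES)
print(genome_bytes)
```

With this encoding, four letters fit in a single byte, which is why 3 billion base pairs come to 750 million bytes.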

Next, to convert the size to terabytes, we divide by 1 trillion (1 terabyte is 1 trillion bytes):

750 million bytes / 1 trillion = 0.00075 terabytes

Since each human will have genomic data that is 30 times the size of a typical human genome, we need to multiply the size by 30:

0.00075 terabytes * 30 = 0.0225 terabytes per person

Finally, the project sequences 10,000 humans, so we multiply by the number of individuals:

0.0225 terabytes * 10,000 = 225 terabytes

Therefore, the whole genome project will generate about 225 terabytes of data, assuming each letter is stored in 2 bits. If each letter were instead stored as a one-byte text character, the total would be four times larger, about 900 terabytes.

2. Number of Hard Drives:
If a typical external hard drive can hold 1 terabyte of information, we can divide the total data size by 1 terabyte to determine the number of hard drives needed:

225 terabytes / 1 terabyte per drive = 225

Therefore, it would take 225 external hard drives to store all the data described above.
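As a cross-check, the whole chain (bases, bits, bytes, 30 repeats, 10,000 people, 1-terabyte drives) can be sketched in a few lines of Python. The constant and function names are illustrative. The 2-bit encoding is this answer's assumption, and the 8-bit alternative corresponds to storing each letter as one text character, which the question's phrase "1 terabyte of letters" may intend.

```python
import math

BASES_PER_GENOME = 3_000_000_000  # approximate human genome length
BITS_PER_BASE = 2                 # assumption: A, C, G, T packed into 2 bits
COVERAGE = 30                     # each genome is sequenced 30 times
NUM_PEOPLE = 10_000               # individuals in the study
BYTES_PER_TB = 10**12             # 1 terabyte = 1 trillion bytes
DRIVE_CAPACITY_TB = 1             # capacity of one external hard drive


def project_size_tb(bits_per_base: int = BITS_PER_BASE) -> float:
    """Total project data in terabytes, including all repeated sequencing."""
    bytes_per_genome = BASES_PER_GENOME * bits_per_base / 8
    total_bytes = bytes_per_genome * COVERAGE * NUM_PEOPLE
    return total_bytes / BYTES_PER_TB


def drives_needed(total_tb: float, capacity_tb: float = DRIVE_CAPACITY_TB) -> int:
    """Whole drives required; a partly filled drive still counts as one."""
    return math.ceil(total_tb / capacity_tb)


total_tb = project_size_tb()
print(total_tb, drives_needed(total_tb))          # 2 bits per letter
print(project_size_tb(8), drives_needed(project_size_tb(8)))  # 1 byte per letter
```

Rounding up with `math.ceil` matters only when the total is not a whole number of terabytes, since a fraction of a drive cannot be bought.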

3. Possible Uses for the Data:
Having access to this massive data set would provide a wealth of information and knowledge. Two possible uses for the data could be:

a. Understanding Genetic Variations: By comparing the genomes of individuals from different regions, researchers can identify genetic variations that are unique to certain populations. This information can help in studying the origins of specific traits, diseases, and susceptibility to certain conditions in different populations.

b. Precision Medicine: With the availability of complete genome data from a diverse set of individuals, scientists can better understand the links between genetic variations and diseases. This can facilitate the development of personalized medicine approaches that take into account an individual's genetic makeup, leading to more accurate diagnoses and tailored treatments.

These are just a couple of examples, and the dataset could also be used for evolutionary studies, population genetics, genetic counseling, and many other areas of research and healthcare.