16th December 2015 Genomics 3.0: Big Data in Precision Medicine Asoke K Talukder, Ph.D InterpretOmics, Bangalore, India BDA 2015, 16th December 2015, Hyderabad, India 17th December 2009
Part III – Big Data Genomics
Multi Scale Big Data
Multi Omics Big Data
Big ‘OMICS’ (High-throughput) Data Domains: DNA-Seq, ChIP-Seq, RNA-Seq, Systems Biology, Meta Analysis, Population Genetics, GWAS, Microarray, Exome-Seq, Repli-Seq, Small RNA-Seq, Biological Networks, Proteomics, Metagenomics
Model
• Create a virtual (or physical) entity that has the same physical appearance as the original entity at a reduced scale (space)
• Use physical science to create sensors that can sense and quantify the input to the system causing perturbation
• Use physical science to measure the output of the perturbed model entity
• Use mathematical (or statistical) science that can simulate the function and behaviour of the original entity in reduced space and reduced time with perturbation
Dimensions of Big Data: The 7 Vs of Genomic Big Data
• Volume is defined in terms of the physical volume of the data that need to be online, like gigabytes (10^9), terabytes (10^12), petabytes (10^15), exabytes (10^18), or even beyond.
• Velocity is about the data-retrieval time or the time taken to service a request. Velocity is also measured through the rate of change of the data volume.
• Variety relates to heterogeneous types of data like text, structured, unstructured, video, audio, etc.
• Veracity is another dimension to measure data reliability: the ability of an organization to trust the data and be able to confidently use it to make crucial decisions.
• Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed so that data processing time is close to linear and the algorithm does not have any bias; irrespective of the volume of the data, the algorithm is able to process the data in reasonable time.
• Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at picometers through macro-molecules, cells, and tissues to a population [9] at thousands of kilometers.
• Value is the final actionable insight or the functional knowledge. The same mutation in a gene may have a different effect depending on the population or the environmental factors.
Types of Genomic Big Data
1. Patient Data (n = 1)
2. Perishable (n = 1)
3. Persistent (n = N)
4. Phenotypic (n = N)
5. Clinical (n = N)
6. Biological/Molecular (n = N)
Applications of Next-Generation Sequencing
Frederick Sanger
• Frederick Sanger was born in Rendcomb, a small village in Gloucestershire, on August 13, 1918. He completed his Ph.D. in 1943 on lysine metabolism and a more practical problem concerning the nitrogen of potatoes. Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of insulin in 1955. It was this achievement that earned him his first Nobel Prize in Chemistry in 1958. By 1967 he had determined the nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA about 115 nucleotides long. He then turned to DNA and, by 1977, had developed the “dideoxy” method for sequencing DNA molecules, also known as the Sanger method. This has been of key importance in such projects as the Human Genome Project and earned him his second Nobel Prize in Chemistry in 1980.
Sample generation and cluster generation
• Prepare DNA or cDNA fragments
• Ligate adapters
• Attach single molecules to surface
• Amplify to form clusters
• Random array of clusters (scale bar: 100 μm)
• 200,000 clusters per tile; 62.5 million reads per lane; 100 bp reads -> 12.5 Gb per lane
Base Calling
• Consecutive cycles
• The identity of each base of a cluster is read from the stacked sequence images
Dideoxynucleotide Sequencing
Decoding the Book of Life – milestone for Quantitative Biology
• A Milestone for Humanity – the Human genome
• Human Genome Completed, June 26th, 2000
• Francis Collins, Bill Clinton, J. Craig Venter
In 2003, Human genome sequence was deciphered!
• 3 billion base pairs => 6 G letters; 1 letter => 1 byte. The whole genome can be recorded in just 10 CD-ROMs!
• The genome is the complete set of genetic material, including all the genes, of a living thing.
• In 2003, the human genome sequencing was completed.
• The human genome contains about 3 billion base pairs.
• The number of genes is estimated to be between 20,000 and 25,000.
• The difference between the genome of human and that of chimpanzee is only 1.23%!
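The storage claim on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the base-pair count and the one-byte-per-letter rule come from the slide, while the 650 MB disc capacity and the 2-bit encoding are assumptions added for illustration.

```python
# Back-of-the-envelope check of the "10 CD-ROMs" figure.
BASE_PAIRS = 3_000_000_000        # from the slide
LETTERS = 2 * BASE_PAIRS          # the slide's "6 G letters" (two letters per base pair)
BYTES_PER_LETTER = 1              # from the slide
CD_ROM_BYTES = 650 * 10**6        # assumption: a nominal 650 MB disc

total_bytes = LETTERS * BYTES_PER_LETTER
cds_needed = -(-total_bytes // CD_ROM_BYTES)   # ceiling division

# Assumption: with only four letters (A, C, G, T), 2 bits per letter suffice.
two_bit_bytes = LETTERS * 2 // 8

print(total_bytes, cds_needed, two_bit_bytes)
```

Under these assumptions the one-byte encoding needs 6 GB, and a 2-bit packing would cut that to a quarter.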
Illumina Genome Analyzer (GA)
• The Genome Analyzer sequences clustered template DNA using a robust four-color DNA Sequencing-By-Synthesis (SBS) technology that employs reversible terminators with removable fluorescence. This approach provides a high degree of sequencing accuracy even through homopolymeric regions.
NGS (Next Generation Sequencing) Technology
How is Microarray Manufactured?
• Affymetrix GeneChip
• Silicon chip
• Oligonucleotide probes lithographically synthesized on the array
• cRNA is used instead of cDNA
How Does Microarray Work?
Part IV – Biological Databases
Molecular Biology Databases: AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
NCBI (National Center for Biotechnology Information)
• Over 30 databases including GenBank, PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
Protein Data Bank (PDB)
ENTREZ: A DISCOVERY SYSTEM
[Figure: Entrez databases (Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by hard links and by precomputed neighbors: word weight, BLAST related sequences, VAST related structures, phylogeny, BLink, domains]
• Pre-computed and pre-compiled data.
• A potential “gold mine” of undiscovered relationships.
• Used less than expected.
PRECISE RESULTS MLH1[Gene Name] AND Human[Organism]
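A fielded query like the one above can also be sent to Entrez programmatically through NCBI's E-utilities. The sketch below only builds the request URL and does not contact the network; the choice of the `gene` database and the helper name `entrez_search_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_search_url(term, db="gene"):
    """Build an ESearch URL for a fielded Entrez query (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

url = entrez_search_url("MLH1[Gene Name] AND Human[Organism]")
print(url)
```

The field tags in square brackets are what make the result precise: each term is matched only against the named field rather than against all text.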
UMLS Knowledge Source Server (UMLSKS) Home Page: Unified Medical Language System
• From top links or buttons: Search 3 Knowledge Sources
• From sidebar: Downloads, Documentation, Resources
“Biologic Function” hierarchy (count shown for each type)
Biologic Function (360)
  Physiologic Function (691)
    Organism Function (1528)
      Mental Process (1224)
    Organ or Tissue Function (2912)
    Cell Function (4417)
    Molecular Function (13442)
      Genetic Function (1340)
  Pathologic Function (9983)
    Disease or Syndrome (67716)
      Mental or Behavioral Dysfunction (5691)
      Neoplastic Process (19436)
    Cell or Molecular Dysfunction (1276)
    Experimental Model of Disease (72)
Part V – Algorithms
Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
• Unlike commercial software, which is data intensive, algorithms are science and mathematics intensive
Schematic representation of our implementation of the de Bruijn graph Zerbino D. R., Birney E. Genome Res.;2008;18:821-829 ©2008 by Cold Spring Harbor Laboratory Press
Example of Tour Bus error correction Zerbino D. R., Birney E. Genome Res.;2008;18:821-829 ©2008 by Cold Spring Harbor Laboratory Press
Breadcrumb algorithm Zerbino D. R., Birney E. Genome Res.;2008;18:821-829 ©2008 by Cold Spring Harbor Laboratory Press
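The three figures above come from a paper on de Bruijn graph assembly. As a rough illustration of the underlying idea only, the sketch below builds a textbook de Bruijn graph from reads and spells out an unambiguous path. It is not the implementation described in the cited paper and has no error correction or repeat resolution; the function names and the toy reads are assumptions.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Follow edges while the successor is unambiguous and spell the path."""
    path, node, seen = start, start, {start}
    while len(set(graph.get(node, []))) == 1:
        node = graph[node][0]
        if node in seen:          # stop on a cycle
            break
        seen.add(node)
        path += node[-1]
    return path

reads = ["ACCGT", "CGTGC", "TTAC", "TACCGT"]   # toy, error-free reads
contig = walk(de_bruijn(reads, k=3), "TT")
print(contig)
```

Real assemblers must additionally handle sequencing errors (which create tips and bubbles in the graph) and repeats (which create branches), which is what the error-correction and breadcrumb steps in the figures address.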
• Overview of Human Disease – classifications, inheritance, mechanisms (cause)
• Databases
– OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
– Gene Clinics (http://www.geneclinics.org/)
– Mutation database (http://mutdb.org/)
– Oncomine (http://www.oncomine.org/)
– Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/)
• Analysis of genes for molecular functions, biological processes and pathways
– PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/)
– Protein interaction network (http://string.embl.de/)
Why Statistics?
• Are results statistically significant?
• Many random processes are involved in biological processes
• Many processes appear to be random but in reality are non-random
• Many chances and uncertainties are involved in biological data collection
• Statistical modeling of biological phenomena can help to understand patterns in life
Deductive and Inductive Science
[Figure: examples shown on the slide: Law of Gravitation, Newton's Laws of Motion, E = mc²; Biological Phenomenon, Simulation, Clinical Trial]
Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
Why Statistics?
• The purpose of statistics is to draw inferences from samples of data about the population from which these samples came
• Or to abstract an entity with average behavior where the behavior of the constituent parts cannot be measured
Challenges in Computing
• Nature is a Tweaker
• Computers are efficient at discovering identity but not similarity
• Biology needs similarity, not identity
• All biology problems are different and unique
• Huge data volumes generated by Next Generation Sequencers, with many errors
• Eliminate noise from information
• Minimize false positives and false negatives
Most Biology Solutions are NP-Hard
• If the data volume increases by a factor of x, the cost of solving the problem grows much faster than x (NP: nondeterministic polynomial time)
• Getting exact solutions may not be possible for some problems on some inputs without spending a great deal of time
• If you use a heuristic, you may not know when you have an optimal solution
• It is almost impossible to arrive at an exact solution; however, once a solution is obtained, it can be verified to be correct
• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But how good an approximation does the solution need to be?
NGS: Experiment with an Open Mind
• The process (Wet Lab)
– Take DNA/RNA/cDNA/miRNA etc.
– Break into tiny pieces
– Amplify them
– Read them as sequences of bases
• The process (Dry Lab)
– Analyze the data
– Extract information from data
• NGS experiments are unbiased
• NGS can help discover many unknown patterns in the genome/gene or cell
Next Generation Sequence Data
• Formats: FASTQ (Illumina), SFF (454), CCS (PacBio), ..., Microarray
• Single-end sequences; paired-end or mate-paired sequences (insert size, library size)
• Sources: DNA/RNA/miRNA, cDNA/mRNA
• Overlapped reads in random order and orientation
• Long reads or short reads; fixed-length or variable-length reads; circular consensus reads
• Hundreds to billions of bases
Paired-end/Mate-pair Data
Ref: Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009
Roche 454 NGS Data (.sff)
FNA file content:
>contig00001 length=439 numreads=17
CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC
QUAL file content:
>contig00001 length=439 numreads=17
64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64 64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
• The Phred quality score Q is logarithmically related to the base-calling error probability P: Q = -10 · log10(P), or equivalently P = 10^(-Q/10)
• If Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above. The high accuracy of Phred quality scores makes them an ideal tool to assess the quality of sequences.
• In the one-character representation, ASCII codes below 0x20 (decimal 32) are unprintable, so an offset of 33 or 64 is added to the Q value, depending on the vendor
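The Phred relationship and the one-character encoding can be sketched in a few lines. This is a minimal illustration; the function names are assumptions, and the default offset of 33 is just one of the two offsets mentioned above.

```python
import math

def phred_to_error(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Turn a quality string into a list of Q values."""
    return [ord(c) - offset for c in qual]

def encode_quality(scores, offset=33):
    """Turn Q values into printable characters by adding the offset."""
    return "".join(chr(q + offset) for q in scores)

print(phred_to_error(30))          # error probability for Q30
print(decode_quality("I"))         # Q value behind the character 'I' at offset 33
```

Because the same character maps to different Q values under offsets 33 and 64, a pipeline has to know which encoding a file uses before filtering on quality.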
Data in FASTQ/FASTA Format
FASTQ:
@HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
FASTA:
>HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
>HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
>HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
• For paired-end sequences you have two files, with _1 and _2 in the names to indicate End_1 and End_2
• Within the files you have matching record ids: @HWI-EAS107_1_4_1_113_501/1 indicates the sequence of End_1, and @HWI-EAS107_1_4_1_113_501/2 indicates the sequence of End_2
• Paired-end reads are INWARD (-> <-)
• Mate-pair reads are OUTWARD (<- ->)
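The relationship between the two formats can be shown with a small converter. This is a minimal sketch that assumes four-line FASTQ records with unwrapped sequences, as in the records on this slide; the function names are illustrative.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual

def to_fasta(records):
    """Drop the quality line and switch the header marker from '@' to '>'."""
    return "\n".join(f">{rid}\n{seq}" for rid, seq, _ in records)

seq = "CATTATAAATTGAAGCTTATACAAAAAACTCGA"
fastq_text = f"@HWI-EAS107_1_4_1_113_501\n{seq}\n+\n{'I' * len(seq)}\n"
fasta_text = to_fasta(parse_fastq(fastq_text.splitlines()))
print(fasta_text)
```

The conversion is lossy: FASTA keeps the identifier and the bases but discards the per-base quality values.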
Error Due to Physics
[Figure: read quality along the read: beginning (bad quality data), middle (good quality data), end (bad quality data). Source: Wikipedia]
Base-calling Error (errors occur at rates of 1 to 5 per 100 nucleotides)
No error:      ACCGT  CGTGC   TTAC  TACCGT
Substitution:  ACCGT  CGTGC   TTAC  TGCCGT
Insertion:     ACCGT  CAGTGC  TTAC  TACCGT
Deletion:      ACCGT  CGTGC   TTAC  TACGT
Layout of the error-free reads:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC (Consensus)
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
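One simple way to tolerate isolated base-calling errors is to take a majority vote in each column of the layout. The sketch below is an illustration under the assumption that the reads are already placed in a gapped layout like the one on this slide; it is not a full assembler, and ties are broken arbitrarily.

```python
from collections import Counter

def consensus(rows):
    """Column-wise majority vote over a gapped layout ('-' means no base)."""
    width = max(len(r) for r in rows)
    out = []
    for col in range(width):
        bases = [r[col] for r in rows if col < len(r) and r[col] != "-"]
        out.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(out)

clean = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TACCGT--"]
# Same layout, but the last read carries the substitution error (A -> G).
noisy = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TGCCGT--"]

print(consensus(clean))
print(consensus(noisy))
```

The erroneous G is outvoted because two other reads cover the same column, which is why sufficient coverage matters for error correction.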
Adaptors & Contamination
• Illumina adaptors:
1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
• In a paired read, contamination in one end results in filtering of both ends
Genome/DNA Data
Run1:
Lane  No of Reads  Size (bytes)
1     41,179,668   10,285,393,108
2     43,252,726   10,455,103,434
3     42,951,004   10,381,539,992
4     43,580,180   10,534,360,126
6     42,071,130   10,171,701,008
7     43,084,416   10,414,795,392
8     42,891,196   10,369,596,648
Run1: Total # of Paired-End Reads: 272,535,758; 29,901,032,000 Nucleotides
Run2:
Lane  No of Reads  Size (bytes)
1     42,773,842   10,703,924,228
2     44,809,016   10,772,709,314
3     44,898,528   10,790,934,680
4     44,099,962   10,598,532,600
6     44,731,270   10,746,564,462
7     44,162,428   10,607,946,662
8     43,689,238   10,492,962,600
Run2: Total # of Paired-End Reads: 282,273,960; 30,916,428,400 Nucleotides
Run3: Mate-pair data with read size 35 nucleotides, library size 5K (insert size 5470 NT)
Lane  Size (bytes)
1     6,535,068,410
2     6,512,213,186
3     6,497,931,646
4     6,417,130,928
6     6,396,631,302
7     6,392,634,380
8     6,240,332,704
Run3: Total # of Mate-paired Reads: 841,326,748; 30,287,762,928 Nucleotides
RNA-Seq Data for a Marine Animal
Tissue Name  # Reads     # Bases        Size (bytes)
Brain        73,224,886  4,393,493,160  14,378,439,860
Heart        71,954,940  4,317,296,400  14,129,178,812
Liver        68,992,472  4,139,548,320  13,547,005,500
miRNA Data
Sample  No. of Bases  No. of Bases  No. of     Size of
Name    Received      Processed     Sequences  Data
========================================================
S1      27,951,043    27,951,043    1,114,585  70.5 MB
S2      24,768,291    24,768,291    1,043,462  64.5 MB
S3      41,569,143    41,569,143    1,685,096  106.5 MB
S4      34,037,239    34,037,239    1,433,791  89.2 MB
S5      24,963,089    24,963,089    1,033,362  61.6 MB
S6      34,846,223    34,846,223    1,439,337  96.5 MB
S7      74,262,271    74,262,271    2,309,712  164.6 MB
Read size varying from 18 to 36 in FASTA format
Typical Biological Data Volume (Illumina sequencing platform based)
Complexities in NGS Data
• Large files – Microsoft Windows often fails to even open the file
• Variable-length reads – allocating memory is always a computational challenge
• Computers are good at identity discovery, but biology needs similarity discovery
• Categorical data – differences between two objects cannot be computed
• Data are error prone – quality of data is always a challenge
• Proprietary formats (e.g., SFF, XSQ, CEL, 0 base, 33 base, 64 base)
• Needs supercomputing power with terabytes of memory and petabytes of storage
• Most biology problems are NP-Hard – algorithms fail to scale with large data volumes
• Many open source tools for NGS data are poorly documented and not maintained, supported, or easy to change
NGS Data Challenges
[Figure: read errors (substitution, insertion, deletion) illustrated with the reads TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT; regions of the target with no fragment coverage; and repeats (X), where the assembled order of the segments A, B, C, D can differ from the target]
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
Unknown Orientation & Order
Reads: CACGT, ACGT, TGCA, ACTACG, GTACT, ACTGA, CTGA
CACGT--------
-ACGT--------
-ACGT--------
--CGTAGT-----
-----AGTAC---
--------ACTGA
---------CTGA
CACGTAGTACTGA
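Because the sequencer does not report which strand a read came from, an assembler has to try each read in both orientations. The sketch below shows the reverse-complement operation and a naive placement against the final sequence from this slide; the helper names are assumptions, and a real assembler would not know the target in advance.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def place(read, target):
    """Return (position, strand) of the first exact match, trying both strands."""
    pos = target.find(read)
    if pos >= 0:
        return pos, "+"
    pos = target.find(reverse_complement(read))
    if pos >= 0:
        return pos, "-"
    return None

target = "CACGTAGTACTGA"
print(reverse_complement("ACTACG"))   # the form used in the layout
print(place("ACTACG", target))        # only fits on the reverse strand
print(place("ACTGA", target))         # fits as given
```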
Discovering Biomedical Knowledge
[Figure: Data, Information, Knowledge. Literature/molecular data, clinical/bedside data, and clinical/drug data feed iOmics; target data, preprocessed data, transformed data, and patterns lead to medical knowledge]
Data, Information, Knowledge
[Figure: pyramid with the levels: wet lab experiment & high-throughput data; open-domain, widely used algorithms & tools; custom tools and open-domain databases; problem-specific algorithms, analysis, and databases; related information]
Ref: Zoltán N. Oltvai and Albert-László Barabási, Life’s Complexity Pyramid, Science Vol 298, 25 October 2002
Systems Biology – Hypothesis Agnostic System/Genome Wide Study
[Figure: scientist/biologist, hypothesis, experiment/sample, NGS/sequencer, big data, ETL, data science (molecular biology/genetics, computer science/algorithms, bioinformatics, statistics), biomedical databases, literature, meta analysis/network biology, publish/translational biomedicine]
Data Sciences
• Data Science is about learning from data in order to gain useful predictions and insights
• Separating signal from noise presents many computational and inferential challenges, which are approached from a perspective at the interface of computer science and statistics
• Data munging/scraping/sampling/cleaning in order to get an informative, manageable data set
• Data storage and management in order to be able to access data, especially big data, quickly and reliably during subsequent analysis
• Exploratory data analysis to generate hypotheses and intuition about the data
• Prediction based on statistical tools such as regression, classification, and clustering
• Communication of results through visualization, stories, and interpretable summaries
Data Simulator (Synthetic Data)
• Take a reference genome (e.g., hg19 or mm10 or some other genome)
• Create a VCF (Variant Call Format) file with synthetic mutations
• Or take known mutations in VCF format from COSMIC or 1000 Genomes
• Apply (inject) the mutations from the VCF file into the reference genome
• This will create a genome (single strand) with known mutations
• Inject random errors (sequencer errors)
• Define the depth or coverage
• Create fixed-length single-end or paired-end reads
• A FASTQ file will be generated with known coverage and known mutations
• Single-strand RNA-Seq, DNA-Seq, or ChIP-Seq data
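The simulator steps above can be sketched on a toy scale. This is a minimal illustration, not the author's tool: mutations are passed as a plain dictionary of 0-based positions rather than parsed from a VCF file, only single-end reads and substitution errors are modelled, and all names are assumptions.

```python
import random

def apply_snvs(reference, snvs):
    """Inject known substitutions; snvs maps 0-based position -> new base."""
    genome = list(reference)
    for pos, alt in snvs.items():
        genome[pos] = alt
    return "".join(genome)

def simulate_reads(genome, read_len, coverage, error_rate, seed=0):
    """Sample fixed-length single-end reads and add random substitution errors."""
    rng = random.Random(seed)
    n_reads = coverage * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(read)))
    return reads

def to_fastq(reads, qual_char="I"):
    """Write reads as FASTQ text with a constant placeholder quality."""
    return "\n".join(
        f"@sim_{i}_{start}\n{seq}\n+\n{qual_char * len(seq)}"
        for i, (start, seq) in enumerate(reads)
    )

reference = "ACGT" * 25
mutated = apply_snvs(reference, {10: "T"})
reads = simulate_reads(mutated, read_len=20, coverage=10, error_rate=0.0)
fastq = to_fastq(reads)
```

Because both the injected mutations and the read origins are known, the output can serve as ground truth when checking an alignment or variant-calling pipeline.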
Data Scientists' Skills (Ref: Wikipedia)
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
Truth is in the Data
• Real human miRNA data
• Nucleotide patterns
– Mono, di, poly statistics
– Motif statistics
• Quality of nucleotides
The Sequencing & Assembly Process
[Figure: target microbial genomes (metagenomics/multiple genomes); random fragmentation of the genomes; assembly of the genomes using overlaps]
The Jigsaw Puzzle (Source: Unknown)
Phases in Assembly
• Understand the data
– Data inventory: single-end, paired-end, mate-paired, etc.
– Sequence structure (read size, format)
– Quality of the data
– Patterns within the data
• Clean up the data
– Remove (filter/trim) vector/adaptor-contaminated data
– Remove data of bad quality
– Remove data that might cause chimeric errors
• Genome or transcriptome in reference assembly
• Contigs in de novo assembly
Genome Reference Assembly
• Seed-based algorithm
– Indexes either the genome or the reads in a data structure
– All k-long words (k-mers) of one sequence are indexed in a table with an entry for every possible k-mer
– Seeds (exact or nearly exact substring matches between the read and the genome) are used to rapidly isolate the potential locations where the read could match; a sensitive, full alignment phase, often using the Smith–Waterman algorithm, is then run at those locations
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
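The seed-and-verify idea can be sketched with a k-mer hash index. This is a simplified illustration: the index covers only the k-mers that occur in the genome, and the verification step counts mismatches instead of running Smith–Waterman, so indels are not handled. The toy genome, read, and function names are assumptions.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Seed step: each exact k-mer hit votes for an implied read start."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return sorted(votes, key=votes.get, reverse=True)

def map_read(read, genome, index, k, max_mismatches=2):
    """Verify step: keep the candidate with the fewest mismatches."""
    best = None
    for start in candidate_starts(read, index, k):
        if start < 0 or start + len(read) > len(genome):
            continue
        window = genome[start:start + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (start, mm)
    return best

genome = "TTACCGTGCAAGGTCA"
index = build_index(genome, k=4)
hit = map_read("CCGTGCTA", genome, index, k=4)   # read with one substitution
print(hit)
```

Only the few locations proposed by the seeds are examined in full, which is what makes the approach fast compared with aligning each read against the whole genome.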
MAQ
• MAQ (Mapping and Assembly with Qualities) is a reference assembler for short fixed-length reads of up to 63 bases
• MAQ was designed for Illumina 1G Genetic Analyzer data, with functions to handle ABI SOLiD data
• MAQ aligns reads to reference sequences and then calls the consensus. For single-end reads, MAQ is able to find all hits with up to 2 or 3 mismatches
• For paired-end reads, MAQ finds all paired hits with one of the two reads containing up to 1 mismatch
• At the assembling stage, MAQ calls the consensus based on a statistical model. It calls the base that maximizes the posterior probability and calculates a Phred quality at each position along the consensus. Heterozygotes are also called in this process
Ref: http://maq.sourceforge.net/
Faster Genome Ref-assembly Algorithm
• BWT (Burrows–Wheeler Transform)
• In the BWT index, only a fraction of the pointers must be precomputed and saved, while the rest are reconstructed on demand
• Bowtie and BWA use heuristic algorithms to search for non-exact matches in the BWT-based index if exact matches cannot be located
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
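To make the BWT idea concrete, the sketch below computes the transform by sorting rotations and counts exact matches by backward search. It is a teaching-scale illustration only: real aligners build the index without materializing all rotations and use sampled rank tables instead of the repeated `count` calls here, and this sketch does not attempt the non-exact search heuristics mentioned above.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(pattern, bwt_text):
    """Count exact matches of pattern using backward search on the BWT."""
    first_col = sorted(bwt_text)
    starts = {}
    for i, ch in enumerate(first_col):
        starts.setdefault(ch, i)          # first row beginning with ch
    lo, hi = 0, len(bwt_text)             # current range of matching rows
    for ch in reversed(pattern):
        if ch not in starts:
            return 0
        lo = starts[ch] + bwt_text[:lo].count(ch)
        hi = starts[ch] + bwt_text[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana")
print(transformed)
print(count_occurrences("ana", transformed))
```

The search consumes the pattern from its last character to its first, narrowing a range of rows at each step, so its cost depends on the pattern length rather than on the genome length.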
Alignment – Bowtie (SAM – Sequence Alignment/Map)
HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0
HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0
HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;5555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
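Each SAM alignment line has eleven mandatory tab-separated fields followed by optional TAG:TYPE:VALUE entries. The sketch below parses one line into named fields; the record used is a shortened, hypothetical example in the style of the records above, not one of the actual reads, and the helper names are assumptions.

```python
import re
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str
    qual: str
    tags: dict

def parse_sam_line(line):
    """Split one alignment line into its mandatory fields and optional tags."""
    f = line.rstrip("\n").split("\t")
    tags = {}
    for entry in f[11:]:
        name, typ, value = entry.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[9], f[10], tags)

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

# Hypothetical, shortened record (10 bases instead of 100).
line = "read1\t0\tchr1\t1374500\t255\t10M\t*\t0\t0\tTCTTCGCCTT\t%%%%%41213\tXA:i:0\tMD:Z:10\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, reference_span(rec.cigar), rec.tags)
```

In this record the flag 0 means a forward-strand, single-end alignment, and NM:i:0 reports zero mismatches against the reference.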
Alignment in Genome Viewer
SNP, Micro-InDel, & Point Mutation
• The greatest computational challenge for the variation analysis (SNP/InDel) task lies in judging the likelihood that a position is a heterozygous or homozygous variant, given the error rates of the various platforms, the probability of bad mappings, and the amount of support or coverage
• Therefore, most of the tools include a detailed data preparation step in which they filter, realign, and often re-score reads, followed by a nucleotide or heterozygosity calling step done under a Bayesian framework
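A Bayesian genotype call at a single position can be sketched with a binomial read-count model. This is a deliberately simplified illustration and not the model of any particular tool: it ignores base and mapping qualities, and the error rate and genotype priors are assumed values chosen only for the example.

```python
from math import comb

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype given allele counts."""
    if priors is None:
        # Assumed priors: most positions match the reference.
        priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}
    # Probability that a single read shows the alternate allele.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1 - error_rate}
    n = ref_count + alt_count
    joint = {
        g: priors[g] * comb(n, alt_count)
           * p_alt[g] ** alt_count * (1 - p_alt[g]) ** ref_count
        for g in priors
    }
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

def call_genotype(ref_count, alt_count, **kwargs):
    post = genotype_posteriors(ref_count, alt_count, **kwargs)
    return max(post, key=post.get)

print(call_genotype(20, 0))    # all reads support the reference
print(call_genotype(10, 10))   # balanced support
print(call_genotype(0, 20))    # all reads support the alternate allele
```

With low coverage the prior dominates and a couple of erroneous reads are not enough to call a variant, which is one reason the amount of support matters.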
Lack of Coverage
• Coverage at a position i of a target is defined as the number of fragments that cover this position. If coverage is zero or low, there is not enough information in the fragment set to reconstruct the target completely
[Figure: target with fragments aligned beneath it and two regions marked "No Coverage"]
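The definition of coverage translates directly into code. The sketch below computes per-position coverage from fragment intervals and reports the uncovered regions; it assumes fragments are given as 0-based half-open (start, end) intervals that lie within the target, and the names are illustrative.

```python
def coverage(target_length, fragments):
    """Per-position coverage from (start, end) half-open intervals."""
    diff = [0] * (target_length + 1)
    for start, end in fragments:
        diff[max(start, 0)] += 1               # fragment begins covering here
        diff[min(end, target_length)] -= 1     # and stops covering here
    cov, running = [], 0
    for i in range(target_length):
        running += diff[i]
        cov.append(running)
    return cov

def uncovered_regions(cov):
    """Return (start, end) half-open intervals where coverage is zero."""
    gaps, start = [], None
    for i, c in enumerate(cov):
        if c == 0 and start is None:
            start = i
        elif c != 0 and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(cov)))
    return gaps

cov = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(cov)
print(uncovered_regions(cov))
```

Any gap reported here splits the assembly into separate contigs, because no fragment spans it.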
End of Part III, IV & V
InterpretOmics
Office: Shezan Lavelle, 5th Floor, #15 Walton Road, Bengaluru 560001
Lab: #329, 7th Main, HAL 2nd Stage, Indiranagar, Bengaluru 560008
Phone: +91(80)46623800

Bda2015 tutorial-part2-data&amp;databases

  • 1.
    16th December 2015 Genomics3.0: Big Data in Precision Medicine Asoke K Talukder, Ph.D InterpretOmics, Bangalore, India BDA 2015, 16th December 2015, Hyderabad, India 17th December 2009
  • 2.
    16th December 2015 PartIII – Big Data Genomics
  • 3.
    16th December 2015 MultiScale Big Data 3
  • 4.
    16th December 2015 MultiOmics Big Data 4
  • 5.
    16th December 2015 Big‘OMICS’ (High-throughput) Data Domains DNA-Seq ChIP-Seq RNA-Seq Systems Biology Meta Analysis Population Genetics GWAS Microarray Exome-Seq Repli-Seq Small RNA-Seq Biological Networks Proteomics Metagenomics 5
  • 6.
    16th December 2015 Model •Create a virtual (or physical) entity that has same physical appearance of the original entity in a reduced scale (space) • Use Physical Science to create sensors that can sense and quantify the input to the system causing Perturbation • Use Physical Science to measure the output of the Perturbed model entity • Use Mathematical (or Statistical) science that can simulate the function and behaviour of the original entity in reduced space and reduced time with perturbation 6
  • 7.
    16th December 2015 Dimensionsof Big Data The 7 Vs of Genomic Big Data • Volume is defined in terms of the physical volume of the data that need to be online, like giga-byte (10 9 ), tera-byte (10 12 ), peta-byte (10 15 ) or exa-byte (10 18 ) or even beyond. • Velocity is about the data-retrieval time or the time taken to service a request. Velocity is also measured through the rate of change of the data volume. • Variety relates to heterogeneous types of data like text, structured, unstructured, video, audio etcetera. • Veracity is another dimension to measure data reliability - the ability of an organization to trust the data and be able to confidently use it to make crucial decisions. • Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed to ensure that data processing time is close to linear and the algorithm does not have any bias; irrespective of the volume of the data, the algorithm is able to process the data in reasonable time. • Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at picometers, macro-molecules, cells, tissues and finally to a population [9] at thousands of kilometers. • Value is the final actionable insight or the functional knowledge. The same mutation in a gene may have a different effect depending on the population or the environmental factors.
  • 8.
    16th December 2015 Typesof Genomic Big Data 1. Patient Data (n = 1) 2. Perishable (n = 1) 3. Persistent (n = N) 4. Phenotypic (n = N) 5. Clinical (n = N) 6. Biological/Molecular (n = N)
  • 9.
    16th December 2015 Applicationsof Next-Generation Sequencing 9
  • 10.
    16th December 2015 AsokeTalukder Frederick Sanger • Frederick Sanger was born in Rendcomb, a small village in Gloucestershire on August 13, 1918. He completed his Ph.D. in 1943 on lysine metabolism and a more practical problem concerning the nitrogen of potatoes. Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of insulin in 1955. It was this achievement that earned him his first Nobel prize in Chemistry in 1958. By 1967 he had determined the nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA about 115 nucleotides long. He then turned to DNA and, by 1975, had developed the “dideoxy” method for sequencing DNA molecules, also known as the Sanger method. This has been of key importance in such projects as the Human Genome Project and earned him his second Nobel prize in Chemistry in 1980. 10
  • 11.
  • 12.
    16th December 2015 Samplegeneration and cluster generation 200,000 clusters per tile 62.5 million reads per lane 100 bp reads -> 12.5 Gb per lane Prepare DNA or cDNA fragments Ligate adapters 100μm Random array of clusters Attach single molecules to surface Amplify to form cl 12
  • 13.
    16th December 2015 BaseCalling Consecuitive cycles The identity of each base of a cluster is read from stacked sequence image Sequence 13
  • 14.
    16th December 2015 AsokeTalukder 14 Dideoxynucleotide Sequencing 14
  • 15.
    16th December 2015 Decodingthe Book of Life – milestone for Quantitative Biology A Milestone for Humanity – the Human genome Human Genome Completed, June 26th, 2000 15 Francis CollinsBill ClintonJ Craig Ventor 15
  • 16.
    In 2003, the human genome sequence was deciphered! • 3 billion base pairs => 6 G letters, and 1 letter => 1 byte, so the whole genome can be recorded on just 10 CD-ROMs! • The genome is the complete set of genetic material of a living thing, including all of its genes. • In 2003, the human genome sequencing was completed. • The human genome contains about 3 billion base pairs. • The number of genes is estimated to be between 20,000 and 25,000. • The difference between the human genome and the chimpanzee genome is only 1.23%!
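The storage estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes a CD-ROM capacity of 650 MB (a figure not stated on the slide) and also shows, as a side note, how far a 2-bit-per-base encoding would shrink the same data.

```python
import math

BASE_PAIRS = 3_000_000_000        # from the slide: about 3 billion base pairs
LETTERS = 2 * BASE_PAIRS          # two letters per base pair => 6 G letters
BYTES_PER_LETTER = 1              # from the slide: 1 letter => 1 byte
CD_ROM_BYTES = 650 * 10**6        # assumption: 650 MB per CD-ROM

total_bytes = LETTERS * BYTES_PER_LETTER
cds_needed = math.ceil(total_bytes / CD_ROM_BYTES)

# A, C, G, T can be encoded in 2 bits each instead of a full byte.
packed_bytes = LETTERS * 2 // 8

print(f"{total_bytes / 1e9:.1f} GB at 1 byte per letter -> {cds_needed} CD-ROMs")
print(f"{packed_bytes / 1e9:.1f} GB at 2 bits per letter")
```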
  • 17.
    Illumina Genome Analyzer (GA) • The Genome Analyzer sequences clustered template DNA using a robust four-color DNA Sequencing-By-Synthesis (SBS) technology that employs reversible terminators with removable fluorescence. This approach provides a high degree of sequencing accuracy even through homopolymeric regions.
  • 18.
    NGS (Next Generation Sequencing) Technology
  • 19.
    How is a Microarray Manufactured? • Affymetrix GeneChip • silicon chip • oligonucleotide probes lithographically synthesized on the array • cRNA is used instead of cDNA
  • 20.
    How Does a Microarray Work?
  • 21.
    Part IV – Biological Databases
  • 22.
    Molecular Biology Databases… AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
  • 23.
    NCBI (National Center for Biotechnology Information) • Over 30 databases including GenBank, PubMed, OMIM, and GEO • Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
  • 36.
    Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
  • 37.
    Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
  • 38.
    Protein Data Bank (PDB)
  • 39.
  • 40.
    ENTREZ: A DISCOVERY SYSTEM [Diagram: the Entrez databases (Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by hard links and by precomputed neighbors: word weight, BLAST, VAST, BLink, phylogeny, domains, related sequences, related structures] • Pre-computed and pre-compiled data. • A potential “gold mine” of undiscovered relationships. • Used less than expected.
  • 41.
  • 42.
    UMLS Knowledge Source Server (UMLSKS) Home Page • Unified Medical Language System • From top links or buttons: Search, 3 Knowledge Sources • From sidebar: Downloads, Documentation, Resources
  • 43.
    “Biologic Function” hierarchy (number of concepts per semantic type)
    Biologic Function (360)
      Physiologic Function (691)
        Organism Function (1528)
          Mental Process (1224)
        Organ or Tissue Function (2912)
        Cell Function (4417)
        Molecular Function (13442)
          Genetic Function (1340)
      Pathologic Function (9983)
        Disease or Syndrome (67716)
          Mental or Behavioral Dysfunction (5691)
          Neoplastic Process (19436)
        Cell or Molecular Dysfunction (1276)
        Experimental Model of Disease (72)
  • 44.
    Part V – Algorithms
  • 45.
    Algorithms • An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem • First you must identify exactly what the problem is! • A problem describes a class of computational tasks. A problem instance is one particular input from that class • In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible) • Unlike commercial software, which is data intensive, algorithms are science and mathematics intensive
  • 46.
    Schematic representation of our implementation of the de Bruijn graph. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 47.
    Example of Tour Bus error correction. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 48.
    Breadcrumb algorithm. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 49.
  • 50.
    • Overview of Human Disease – classifications, inheritance, mechanisms (cause) • Databases – OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim) – Gene Clinics (http://www.geneclinics.org/) – Mutation database (http://mutdb.org/) – Oncomine (http://www.oncomine.org/) – Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/) • Analysis of genes for molecular functions, biological processes and pathways – PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/) – Protein interaction network (http://string.embl.de/)
  • 51.
    Why Statistics? • Are results statistically significant? • Many random processes are involved in biological processes • Many processes appear to be random but in reality are non-random • Many chances and uncertainties are involved in biological data collection • Statistical modeling of biological phenomena can help to understand patterns in life
  • 52.
    Deductive and Inductive Science • Deductive: Law of Gravitation, Newton's Laws of Motion, E = mc^2 • Inductive: Biological Phenomenon, Simulation, Clinical Trial • Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
  • 53.
    Why Statistics? The purpose of statistics is to draw inferences from samples of data about the population from which these samples came, or to abstract an entity with average behavior where the behavior of the constituent parts cannot be measured
  • 54.
    Challenges in Computing • Nature is a Tweaker • Computers are efficient in discovering identity but not similarity • Biology needs similarity & not identity • All Biology problems are different & unique • Huge data generated by Next Generation Sequencers with many errors • Eliminate Noise from Information • Minimize False Positives and False Negatives
  • 55.
    Most Biology Solutions are NP-Hard • If the data volume increases by x, the cost of an exact solution grows much faster than x; NP stands for nondeterministic polynomial time, and no polynomial-time exact algorithm is known for NP-hard problems • Getting exact solutions may not be possible for some problems on some inputs without spending a great deal of time • If you use a heuristic, you may not know when you have an optimal solution • Finding an exact solution is often infeasible; however, for problems in NP, a proposed solution can be checked quickly • Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But how good does the approximation need to be?
  • 56.
    NGS: Experiment with an Open Mind • The process (Wet Lab) • Take DNA/RNA/cDNA/miRNA etc. • Break into tiny pieces • Amplify them • Read them as sequences of bases • The process (Dry Lab) • Analyze the data • Extract information from data • NGS experiments are unbiased • NGS can help discover many unknown patterns in the genome/gene or cell
  • 57.
    Next Generation Sequence Data • Formats: FASTQ (Illumina), SFF (454), CCS (PacBio), ..., Microarray • Source material: DNA/RNA/miRNA, cDNA/mRNA • Single-end sequences • Paired-end or mate-paired sequences (insert size, library size) • Overlapped reads • Random order & orientation • Long reads and short reads • Fixed-length and variable-length reads • Circular consensus reads • Hundreds to billions of bases
  • 58.
    Paired-end/Mate-pair Data. Ref: Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009
  • 59.
    Roche 454 NGS Data (.sff)
    FNA file content:
    >contig00001 length=439 numreads=17
    CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
    GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
    GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
    GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC
    QUAL file content:
    >contig00001 length=439 numreads=17
    64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64 64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    • The Phred quality score Q is logarithmically related to the base-calling error probability P: Q = -10 * log10(P), or equivalently P = 10^(-Q/10)
    • If Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above. The high accuracy of Phred quality scores makes them an ideal tool to assess the quality of sequences.
    • In the one-character representation, ASCII codes below hexadecimal 20 (decimal 32) are unprintable, so 33 or 64 is added to the Q value, depending on the vendor
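The Phred relationship and the one-character encoding can be sketched in a few lines of Python. The function names below are illustrative, not part of any sequencing toolkit, and the offsets 33 and 64 are the two values mentioned on the slide.

```python
import math

def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual_string, offset=33):
    """Turn a one-character-per-base quality string into Phred scores.

    offset is 33 or 64 depending on the vendor/pipeline (an input the
    caller must know; it cannot be read reliably from a single character).
    """
    return [ord(ch) - offset for ch in qual_string]

print(phred_to_error_prob(30))             # about 0.001, i.e. 1 error in 1000
print(round(error_prob_to_phred(0.01)))    # 20
print(decode_quality("II5+"))              # offset-33 decoding
```

The same character decodes to very different scores under the two offsets, which is why the offset has to be established before any quality filtering.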
  • 60.
    Data in FASTQ/FASTA Format
    FASTQ:
    @HWI-EAS107_1_4_1_113_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_213_501
    ATTATAAATTGAAGCTTATACAAAAAACTCGAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_313_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_413_501
    TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_513_501
    TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    FASTA:
    >HWI-EAS107_1_4_1_113_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    >HWI-EAS107_1_4_1_213_501
    ATTATAAATTGAAGCTTATACAAAAAACTCGAA
    >HWI-EAS107_1_4_1_313_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    >HWI-EAS107_1_4_1_413_501
    TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
    >HWI-EAS107_1_4_1_513_501
    TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
    • For paired-end sequences you have two files, with names ending in _1 and _2 to indicate End_1 and End_2
    • Within the files you have matching record ids: @HWI-EAS107_1_4_1_113_501/1 indicates the sequence of End_1, and @HWI-EAS107_1_4_1_113_501/2 indicates the sequence of End_2
    • A paired-end read is INWARD (→ ←)
    • A mate-pair read is OUTWARD (← →)
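The relationship between the two formats can be illustrated with a minimal reader that walks four-line FASTQ records and emits the FASTA equivalent. This is only a sketch: `parse_fastq` and `fastq_to_fasta` are hypothetical helper names, and real FASTQ files may need more validation than is shown here.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue                      # tolerate blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

def fastq_to_fasta(lines):
    """Drop the quality lines and switch the '@' header to '>'."""
    return [f">{read_id}\n{seq}" for read_id, seq, _ in parse_fastq(lines)]

record = [
    "@HWI-EAS107_1_4_1_113_501",
    "CATTATAAATTGAAGCTTATACAAAAAACTCGA",
    "+",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII",
]
fasta = fastq_to_fasta(record)
print(fasta[0])
```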
  • 61.
    Error Due to Physics • Beginning (bad quality data) • Middle (good quality data) • End (bad quality data) • Source: Wikipedia
  • 62.
    Base-calling Error (errors occur at rates of 1 to 5 per 100 nucleotides)
    Error-free fragments: ACCGT CGTGC TTAC TACCGT
    Substitution:         ACCGT CGTGC TTAC TGCCGT
    Insertion:            ACCGT CAGTGC TTAC TACCGT
    Deletion:             ACCGT CGTGC TTAC TACGT
    Layout of the error-free fragments:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    TTACCGTGC (Consensus)
    Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
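A column-wise majority vote is one simple way to derive a consensus from a fragment layout like the one on this slide, and it shows why a single substitution error is outvoted when coverage is sufficient. The sketch below assumes the fragments are already placed in a gapped layout with `-` as padding; it is an illustration, not the method of any particular assembler. For the four error-free fragments the vote yields TTACCGTGC.

```python
from collections import Counter

def consensus(layout):
    """Majority vote per column over a gapped layout ('-' means no base)."""
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        bases = [row[col] for row in layout if col < len(row) and row[col] != "-"]
        if bases:                                   # skip columns nobody covers
            result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)

layout = [
    "--ACCGT--",
    "----CGTGC",
    "TTAC-----",
    "-TACCGT--",
]
print(consensus(layout))

# Same layout with the substitution error from the slide (TACCGT -> TGCCGT):
# the wrong G is outvoted by the two fragments that read A at that column.
with_error = layout[:3] + ["-TGCCGT--"]
print(consensus(with_error))
```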
  • 63.
    Adaptors & Contamination
    • Illumina Adaptors:
    1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
    2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    • In a paired read, contamination in one end results in both ends being filtered out
  • 64.
    Genome/DNA Data
    Run1:
    Lane  No. of Reads  Size (bytes)
    1     41,179,668    10,285,393,108
    2     43,252,726    10,455,103,434
    3     42,951,004    10,381,539,992
    4     43,580,180    10,534,360,126
    6     42,071,130    10,171,701,008
    7     43,084,416    10,414,795,392
    8     42,891,196    10,369,596,648
    Run1: Total # of Paired-End Reads: 272,535,758; 29,901,032,000 Nucleotides
    Run2:
    Lane  No. of Reads  Size (bytes)
    1     42,773,842    10,703,924,228
    2     44,809,016    10,772,709,314
    3     44,898,528    10,790,934,680
    4     44,099,962    10,598,532,600
    6     44,731,270    10,746,564,462
    7     44,162,428    10,607,946,662
    8     43,689,238    10,492,962,600
    Run2: Total # of Paired-End Reads: 282,273,960; 30,916,428,400 Nucleotides
    Run3: Mate Pair data with Read size 35 Nucleotide, Library Size 5K (Insert size 5470 NT)
    Lane  Size (bytes)
    1     6,535,068,410
    2     6,512,213,186
    3     6,497,931,646
    4     6,417,130,928
    6     6,396,631,302
    7     6,392,634,380
    8     6,240,332,704
    Run3: Total # of Mate-paired Reads: 841,326,748; 30,287,762,928 Nucleotides
  • 65.
    RNA-Seq Data for a Marine Animal
    Tissue Name  # Reads     # Bases        Size (bytes)
    Brain        73,224,886  4,393,493,160  14,378,439,860
    Heart        71,954,940  4,317,296,400  14,129,178,812
    Liver        68,992,472  4,139,548,320  13,547,005,500
  • 66.
    miRNA Data
    Sample Name  No. of Bases Received  No. of Bases Processed  No. of Sequences  Size of Data
    S1           27,951,043             27,951,043              1,114,585         70.5 MB
    S2           24,768,291             24,768,291              1,043,462         64.5 MB
    S3           41,569,143             41,569,143              1,685,096         106.5 MB
    S4           34,037,239             34,037,239              1,433,791         89.2 MB
    S5           24,963,089             24,963,089              1,033,362         61.6 MB
    S6           34,846,223             34,846,223              1,439,337         96.5 MB
    S7           74,262,271             74,262,271              2,309,712         164.6 MB
    Read size varying from 18 to 36 in FASTA format
  • 67.
    Typical Biological Data Volume (Illumina sequencing platform based)
  • 68.
    Complexities in NGS Data • Large files – Microsoft Windows often fails to even open the file • Variable-length reads – allocating memory is always a computational challenge • Computers are good at identity discovery but Biology needs similarity discovery • Categorical data – one cannot take the difference between two objects • Data are error prone – quality of data is always a challenge • Proprietary formats and conventions (e.g., SFF, XSQ, CEL, 0 base, 33 base, 64 base) • Needs supercomputing power with terabytes of memory and petabytes of storage • Most Biology problems are NP-Hard – algorithms fail to scale with large data volume • Many open-source tools for NGS data are poorly documented and not maintained, supported, or easy to change
  • 69.
    NGS Data Challenges [Figure panels: Read Errors (substitution, insertion, deletion; example fragments TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT); No Coverage (target vs. fragments); Repeats (target vs. assembled sequence)] Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
  • 70.
    Unknown Orientation & Order
    Fragments: CACGT ACGT TGCA ACTACG GTACT ACTGA CTGA
    Layout (some fragments are placed as their reverse complement):
    CACGT--------
    -ACGT--------
    -ACGT--------
    --CGTAGT-----
    -----AGTAC---
    --------ACTGA
    ---------CTGA
    CACGTAGTACTGA (Consensus)
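Because a read may come from either strand, an assembler has to consider each fragment together with its reverse complement. A minimal sketch of that operation follows; `reverse_complement` is an illustrative helper, and only the four unambiguous bases are handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

consensus_seq = "CACGTAGTACTGA"

# Fragments from the slide that only fit the layout after flipping strands.
for fragment in ["ACTACG", "GTACT"]:
    flipped = reverse_complement(fragment)
    print(fragment, "->", flipped, "found in consensus:", flipped in consensus_seq)
```

Some sequences, such as ACGT, are their own reverse complement, so their orientation cannot be inferred from the sequence alone.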
  • 71.
    Discovering Biomedical Knowledge [Figure: Data → Information → Knowledge pipeline. Inputs: Literature/Molecular Data, Clinical/Bedside Data, Clinical/Drug Data. Stages: Target Data → Preprocessed Data → Transformed Data → Patterns → Medical Knowledge, via iOmics]
  • 72.
    Data, Information, Knowledge [Pyramid: Data (wet lab experiment & high-throughput data) → Information (open-domain, widely used algorithms & tools) → Related Information (custom tools and open-domain databases) → Knowledge (problem-specific algorithms, analysis, and databases)] Ref: Zoltán N. Oltvai and Albert-László Barabási, Life’s Complexity Pyramid, Science Vol 298, 25 October 2002
  • 73.
    Systems Biology – Hypothesis Agnostic System/Genome Wide Study [Figure: Scientist/Biologist → Hypothesis → Experiment/Sample → NGS/Sequencer → Big Data → ETL → Data Science (Computer Science/Algorithms, Bioinformatics, Statistics) together with Molecular Biology/Genetics, Biomedical Databases and Literature → Meta Analysis/Network Biology → Publish/Translational Biomedicine]
  • 74.
    Data Sciences • Data Science is about learning from data in order to gain useful predictions and insights • Separating signal from noise presents many computational and inferential challenges, which are approached from a perspective at the interface of computer science and statistics • Data munging/scraping/sampling/cleaning in order to get an informative, manageable data set • Data storage and management in order to be able to access data, especially big data, quickly and reliably during subsequent analysis • Exploratory data analysis to generate hypotheses and intuition about the data • Prediction based on statistical tools such as regression, classification, and clustering • Communication of results through visualization, stories, and interpretable summaries
  • 75.
    Data Simulator (Synthetic Data) • Take a reference genome (e.g., hg19 or mm10 or some other genome) • Create a VCF (Variant Call Format) file with synthetic mutations • Or, take known mutations in VCF format from COSMIC or the 1000 Genomes Project • Apply (inject) the mutations from the VCF file into the reference genome • This will create a genome (single strand) with known mutations • Inject random errors (sequencer errors) • Define the depth or coverage • Create fixed-length single-end or paired-end reads • A FASTQ file will be generated with known coverage and known mutations • Single-strand RNA-Seq, DNA-Seq, or ChIP-Seq data
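The simulator steps above can be sketched for the simplest case: single-end reads with substitution-only mutations and errors. Everything here is an illustrative assumption rather than the presenter's tool: the mutations are passed as a plain dictionary instead of being parsed from a VCF file, the reference is a toy string, and all reads receive a constant quality character.

```python
import random

def apply_snvs(reference, snvs):
    """Inject known substitutions; snvs maps 0-based position -> new base."""
    genome = list(reference)
    for pos, alt in snvs.items():
        genome[pos] = alt
    return "".join(genome)

def simulate_reads(genome, read_len, coverage, error_rate, rng):
    """Sample fixed-length single-end reads and return FASTQ record strings."""
    n_reads = (coverage * len(genome)) // read_len   # depth = reads * len / genome
    records = []
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for j, base in enumerate(read):
            if rng.random() < error_rate:            # random sequencer error
                read[j] = rng.choice([b for b in "ACGT" if b != base])
        quality = "I" * read_len                     # constant placeholder quality
        records.append(f"@sim_{i}_{start}\n{''.join(read)}\n+\n{quality}")
    return records

rng = random.Random(0)                  # fixed seed so runs are repeatable
reference = "ACGT" * 50                 # toy 200-base reference
genome = apply_snvs(reference, {10: "T"})
reads = simulate_reads(genome, read_len=20, coverage=5, error_rate=0.0, rng=rng)
print(len(reads), "reads")
print(reads[0])
```

Because the injected mutations and the coverage are known, the output can serve as ground truth when evaluating an aligner or variant caller.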
  • 76.
    Data Scientists' Skills. Ref: Wikipedia
  • 77.
    Exploratory Data Analysis • Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to 1. Maximize insight into a data set; 2. Uncover underlying structure; 3. Extract important variables; 4. Detect outliers and anomalies; 5. Test underlying assumptions; 6. Develop parsimonious models; and 7. Determine optimal factor settings.
  • 78.
    Truth is in the Data • Real Human miRNA Data • Nucleotide Patterns – Mono, Di, Poly statistics – Motif Statistics • Quality of Nucleotides
  • 79.
    The Sequencing & Assembly Process • Target microbial genomes • Metagenomics/multiple genomes • Random fragmentation of the genomes • Assembly of the genomes using overlaps
  • 80.
    The Jigsaw Puzzle. Source: Unknown
  • 81.
    Phases in Assembly • Understand the data – Data inventory – Single End, Paired End, Mate Paired etc. – Sequence structure (read size, format) – Quality of the data – Patterns within the data • Clean up the data – Remove (filter/trim) vector/adaptor contaminated data – Remove data of bad quality – Remove data that might cause chimeric error • Genome or transcriptome in reference assembly • Contigs in de novo assembly
  • 82.
    Genome Reference Assembly • Seed-Based Algorithm – Indexes either the genome or the reads in a data structure – All k-long words (k-mers) of one sequence are indexed in a table with an entry for every possible k-mer – Seeds (exact or nearly exact substring matches between the read and the genome) are used to rapidly isolate the potential locations where the read could match, and then a sensitive, full alignment phase, often with the Smith–Waterman algorithm, is run at those locations • Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
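The seeding step can be illustrated with a small k-mer index. This sketch indexes the genome rather than the reads and, as a simplification, uses a hash table holding only the k-mers that actually occur instead of a table with an entry for every possible k-mer. The full alignment phase is not shown; `candidate_positions` only returns the locations that such a phase would then examine.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Use exact k-mer seeds to propose where the read could start."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset          # implied start of the whole read
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)

genome = "TTACCGTGCAAGT"                  # toy genome for illustration
index = build_kmer_index(genome, k=3)
print(candidate_positions("CCGTGC", index, k=3))   # exact read
print(candidate_positions("CCGAGC", index, k=3))   # read with one mismatch
```

Even with a mismatch in the read, one intact seed is enough to recover the candidate location, which is what makes seeding tolerant of sequencing errors.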
  • 83.
    MAQ • MAQ (Mapping and Assembly with Qualities) is a reference assembler that supports short fixed-length reads of up to 63 bases • MAQ was designed for Illumina 1G Genetic Analyzer data, with functions to handle ABI SOLiD data. • MAQ aligns reads to reference sequences and then calls the consensus. For single-end reads, MAQ is able to find all hits with up to 2 or 3 mismatches. • For paired-end reads, MAQ finds all paired hits with one of the two reads containing up to 1 mismatch. • At the assembling stage, MAQ calls the consensus based on a statistical model. It calls the base which maximizes the posterior probability and calculates a Phred quality at each position along the consensus. Heterozygotes are also called in this process. • Ref: http://maq.sourceforge.net/
  • 84.
    Faster Genome Ref-assembly Algorithm • BWT (Burrows–Wheeler Transform) • In the BWT index, only a fraction of the pointers must be precomputed and saved, while the rest are reconstructed on demand • Bowtie and BWA utilize heuristic algorithms to search for non-exact matches in the BWT-based index if exact matches cannot be located • Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
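To make the transform itself concrete, here is a deliberately naive sketch of the Burrows–Wheeler Transform and its inverse. It sorts all rotations explicitly, which is fine for a toy string but far too slow and memory-hungry for a genome; tools such as Bowtie and BWA use suffix-array-based construction and an FM-index, neither of which is shown here. The `$` terminator is an assumption of this sketch and must not occur in the input.

```python
def bwt(text, terminator="$"):
    """Last column of the sorted rotations of text + terminator."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

sequence = "TTACCGTGC"
transformed = bwt(sequence)
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what allows a compact index that still supports exact substring search.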
  • 85.
    Alignment – Bowtie (SAM – Sequence Alignment/Map)
    HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0
    HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0
    HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;55555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
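Each SAM alignment line has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional TAG:TYPE:VALUE fields. The sketch below splits one line into those fields. The record used is a shortened, hypothetical one modeled on the Bowtie output above; it assumes the position and mapping quality are separate fields (1374500 and 255), and `parse_sam_line` is an illustrative helper rather than a library function.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict

def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )

# Hypothetical, shortened record for illustration only.
line = "read1\t0\tchr1_length_4160774\t1374500\t255\t4M\t*\t0\t0\tTCTT\t%%41\tXA:i:0\tMD:Z:4\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq, record.cigar, record.tags)
```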
  • 86.
    Alignment in Genome Viewer
  • 87.
    SNP, Micro-InDel, & Point Mutation • The greatest computational challenge for the variation analysis (SNP/InDel) task lies in judging the likelihood that a position is a heterozygous or homozygous variant, given the error rates of the various platforms, the probability of bad mappings, and the amount of support or coverage • Therefore, most of the tools include a detailed data preparation step in which they filter, realign and often re-score reads, followed by a nucleotide or heterozygosity calling step done under a Bayesian framework
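A toy version of the Bayesian calling step can show how error rate and coverage interact. The sketch below considers a single biallelic site, treats every read as independent with the same error rate, and uses made-up prior probabilities; these are all simplifying assumptions, and real callers also model mapping quality, per-base quality and more.

```python
def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each genotype given read counts at one site."""
    if priors is None:
        # Assumed priors for illustration: most sites match the reference.
        priors = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}
    # Probability that a single read shows the alternate allele, per genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    unnormalized = {
        g: priors[g] * (p_alt[g] ** alt_count) * ((1.0 - p_alt[g]) ** ref_count)
        for g in priors
    }
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

for ref_count, alt_count in [(20, 0), (10, 10), (0, 20), (19, 1)]:
    posteriors = genotype_posteriors(ref_count, alt_count)
    call = max(posteriors, key=posteriors.get)
    print(ref_count, alt_count, call, round(posteriors[call], 4))
```

With a single alternate read among twenty, the model attributes it to sequencing error, whereas a balanced split is called heterozygous despite the strong prior toward the reference.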
  • 88.
    Lack of Coverage • Coverage at a position i of a target is defined as the number of fragments that cover this position. If coverage is zero or low, there is not enough information in the fragment set to reconstruct the target completely [Figure: target with fragments aligned beneath it and two regions marked "No Coverage"]
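The definition of coverage translates directly into code. The sketch below assumes each fragment is given as a 0-based start position and a length on the target (a representation chosen for this example) and uses a difference array so that each fragment costs constant time regardless of its length.

```python
def coverage(target_length, fragments):
    """Number of fragments covering each position; fragments are (start, length)."""
    diff = [0] * (target_length + 1)
    for start, length in fragments:
        if start < 0 or start >= target_length:
            continue                               # fragment lies off the target
        end = min(start + length, target_length)   # clip at the target end
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth

depth = coverage(10, [(0, 4), (2, 4), (8, 2)])
uncovered = [i for i, d in enumerate(depth) if d == 0]
print(depth)
print("positions with no coverage:", uncovered)
```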
  • 89.
    End of Part III, IV & V • InterpretOmics • Office: Shezan Lavelle, 5th Floor, #15 Walton Road, Bengaluru 560001 • Lab: #329, 7th Main, HAL 2nd Stage, Indiranagar, Bengaluru 560008 • Phone: +91(80)46623800