Lightning fast genomics with Spark, Adam and Scala

Lightning fast genomics With Spark and ADAM

Who are we?
Andy Petrella (@Noootsab)
- @NextLab_be
- @Wajug co-driver
- @Devoxx4Kids organizer
- Maths & CS
- Data lover: geo, open, massive
- Fool
Xavier (@xtordoir)
- SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...

Genomics: What is genomics about?
- Medical diagnostics
- Drug response
- Disease mechanisms

Genomics: What is genomics about?
- A human genome is a sequence of 3 billion nucleic acids ("bases")
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands)
- These molecules interact with each other ...and with the environment
→ Biological systems are very complex

Genomics: State of the art
- Growing technological capacity
- Cost reduction
- Growing data

Genomics: State of the art
- I.T. becomes the bottleneck (cost and latency)
- Data is sacrificed through sampling or cut-offs
ref: Andrea Sboner et al.

Genomics: Blocking points
- “Legacy stack” not designed to scale (C, Perl, …)
- HPC approach is not a good fit (the workload is data intensive)

Genomics: Future of genomics
- Personal genomes (e.g. 1,000,000 genomes for cancer research)
- New sequencing technologies
- Sequence “stuff” as needed (e.g. microbiome, diagnostics)
- medicalCondition = f(genomics, environmentHistory)

Genomics
- Need for scalability → Scala & Spark
- Need for simplicity and clarity → ADAM

Parquet 101: Columnar storage
[Figure: row-oriented vs column-oriented layout]

Parquet 101: Columnar storage
> Homogeneous collocated data
> Better range access
> Better encoding
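To make the three points above concrete, here is a minimal plain-Python sketch (not Parquet itself). The records and field names are invented for illustration, and run-length encoding is used only as one example of an encoding that benefits from homogeneous, collocated values.

```python
from itertools import groupby

# Hypothetical toy records; the field names are illustrative only.
rows = [
    {"sample": "S1", "chrom": "6", "alt_count": 0},
    {"sample": "S2", "chrom": "6", "alt_count": 0},
    {"sample": "S3", "chrom": "6", "alt_count": 2},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per field (column-oriented)."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)
encoded_chrom = run_length_encode(columns["chrom"])
encoded_alt = run_length_encode(columns["alt_count"])
```

Reading a single field now touches one list instead of every record, and the repeated chromosome value collapses into a single run.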

Parquet 101: Efficient encoding of nested typed structures

    message Document {
      required int64 DocId;
      optional group Links {
        repeated int64 Backward;
        repeated int64 Forward;
      }
      repeated group Name {
        repeated group Language {
          required string Code;
          optional string Country;
        }
        optional string Url;
      }
    }

Parquet 101: Efficient encoding of nested typed structures
- Nested structure → Tree
- Empty levels → Branch pruning
- Repetitions → Metadata (index)
- Types → Safe/Fast codec

Parquet 101 Efficient encoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Parquet 101: Optimized distributed storage (e.g. in HDFS)
ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Parquet 101: Efficient (schema based) serialization: Avro

JSON Schema:

    {
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
      ]
    }

IDL:

    record User {
      string name;
      union { null, int } favorite_number = null;
      union { null, string } favorite_color = null;
    }

Parquet 101: Efficient (schema based) serialization: Avro
The JSON schema is part of the:
● protocol
● serialization → less metadata
Define: IDL → JSON
Send: Binary → JSON
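As a rough illustration of what "schema based" buys, the sketch below checks plain Python records against the User schema from the slide. It uses only the standard library rather than an Avro library; the `conforms` helper and the `PRIMITIVES` mapping are hypothetical and cover only the three primitive types this schema uses.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")

# Python types accepted for the Avro primitive types used in this schema.
PRIMITIVES = {"string": str, "int": int, "null": type(None)}

def conforms(record, schema):
    """Return True if every field's value (None when absent) has an allowed type."""
    for field in schema["fields"]:
        allowed = field["type"]
        if isinstance(allowed, str):  # a single type rather than a union
            allowed = [allowed]
        value = record.get(field["name"])
        if not any(isinstance(value, PRIMITIVES[t]) for t in allowed):
            return False
    return True

valid = conforms(
    {"name": "Alyssa", "favorite_number": 256, "favorite_color": None},
    USER_SCHEMA,
)
invalid = conforms({"name": 42, "favorite_number": 1}, USER_SCHEMA)
```

Because both sides share the schema, field names and types do not need to travel with each record, which is the "less metadata" point above.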

ADAM Credits: AmpLab (UC Berkeley)

ADAM: Overview (Sequencing)
- DNA is a molecule
- …or a Seq[Char] over the (A, T, G, C) alphabet

ADAM: Sequencing
- Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome)
- 30-60x coverage for quality
- All this mess must be re-organised! → ADAM

ADAM: Variant Calling
- From an organized set of reads (ADAM Pileup)
- Detect variants (Variant Calling) → AVOCADO
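The two steps above can be sketched on toy data. This is plain Python, not the ADAM or Avocado API: the reads are assumed to be already aligned with no insertions or deletions, and the majority-vote rule and the `min_fraction` parameter are simplifications chosen for illustration.

```python
from collections import Counter, defaultdict

def pileup(reads):
    """Group read bases by reference position.

    `reads` is a list of (start_position, sequence) pairs, assumed to be
    already aligned without insertions or deletions.
    """
    piles = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            piles[start + offset][base] += 1
    return piles

def call_variants(reference, piles, min_fraction=0.5):
    """Report positions whose most common base differs from the reference
    and accounts for at least `min_fraction` of the pile."""
    variants = {}
    for position, counts in sorted(piles.items()):
        base, count = counts.most_common(1)[0]
        depth = sum(counts.values())
        if base != reference[position] and count / depth >= min_fraction:
            variants[position] = (reference[position], base)
    return variants

# Invented reference and reads; position 3 carries a T -> A difference.
reference = "ACGTACGT"
reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (4, "ACGT")]
variants = call_variants(reference, pileup(reads))
```

Real variant callers also weigh base quality, mapping quality and diploid genotypes; none of that is modelled here.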

ADAM: Genomics specifications
- SAM, BAM, VCF
- Indexable
- Libraries
- ~ scalable: hadoop-bam

ADAM: ADAM model
- Schema based (Avro), libraries are generated
- No storage spec here!

ADAM: ADAM model
- Parquet storage
- Evenly distributes data
- Storage optimized for read/query
- Better compression

ADAM ADAM API - AdamContext provides functions to read from HDFS

ADAM: ADAM API
- Scala classes generated from Avro
- Data loaded as RDDs (Spark’s Resilient Distributed Datasets)
- Functions on RDDs (write to HDFS, genomic object manipulations)

ADAM ADAM API - e.g. reading genotypes

ADAM: ADAM Benchmark
- It scales!
- Data is more compact
- Read performance is better
- Code is simpler

Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!

Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg

Stratification using 1000Genomes
Study genetic variations in populations (needs more contextual data for healthcare). To validate the interest in ADAM, we’ll do some qualitative exploration of the data.
Question: is it possible to predict which subpopulation a given genome belongs to?

Stratification using 1000Genomes
We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that match subpopulations. This matters because such clusters reflect population histories: gene flows, selection, ...

Stratification using 1000Genomes
From the 200 TB of data, we’ll focus on chromosome 6, and only on its variants.
ref: http://en.wikipedia.org/wiki/Chromosome

Genome Data: Data structure
- Panel: Map[SampleID, Population]

Genome Data: Data structure
- Genotypes in VCF format: basically a text file
- Ours were downloaded from S3
- Converted to ADAM Genotypes
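Since VCF is "basically a text file", a few lines of plain Python are enough to pull genotypes out of it. This sketch is not the ADAM converter: `parse_vcf` is a hypothetical helper, the sample IDs are borrowed from the prediction slide, and the position and genotype calls are invented.

```python
def parse_vcf(lines):
    """Yield (sample_id, chrom, position, [allele, allele]) from VCF text lines."""
    samples = []
    for line in lines:
        if line.startswith("##"):  # meta-information lines
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):  # header: sample IDs start at column 10
            samples = fields[9:]
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")  # index 0 is the reference allele
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index].replace("|", "/").split("/")
            # "." marks a missing call; it is kept as None
            yield sample, chrom, int(pos), [
                None if g == "." else alleles[int(g)] for g in gt
            ]

# Invented example content in VCF layout.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA20332\tHG00120",
    "6\t1000\t.\tA\tG\t100\tPASS\t.\tGT\t0|1\t1|1",
]
genotypes = list(parse_vcf(vcf_lines))
```

Each yielded tuple corresponds loosely to one genotype record: a sample, a variant site, and the two alleles observed there.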

Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering

Machine Learning model: Clustering: KMeans
- PreProcess = {A,C,T,G}² → {0,1,2}
- Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰
- Distance = Euclidean (L2) ⁽*⁾
⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1
ref: http://en.wikipedia.org/wiki/K-means_clustering
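A small sketch of the preprocessing and distance, in plain Python. It assumes the {0,1,2} encoding counts the non-reference alleles in a genotype, which is one common reading of the mapping above; the genotypes and reference alleles are invented.

```python
import math

def encode_genotype(alleles, reference):
    """Map a pair of alleles to the number of non-reference alleles (0, 1 or 2)."""
    return sum(1 for allele in alleles if allele != reference)

def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Invented genotypes at three variant sites, as (alleles, reference allele).
sample_a = [encode_genotype(g, r) for g, r in
            [(("A", "A"), "A"), (("A", "G"), "A"), (("T", "T"), "C")]]
sample_b = [encode_genotype(g, r) for g, r in
            [(("A", "G"), "A"), (("A", "A"), "A"), (("T", "T"), "C")]]

l2 = euclidean(sample_a, sample_b)
l1 = manhattan(sample_a, sample_b)
```

When two samples differ by at most 1 at every site, the squared L2 distance equals the L1 distance, which may be what the "L2 ~ L1" remark alludes to.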

Machine Learning model: MLlib, KMeans
MLlib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)

Machine Learning model: MLlib KMeans
DataFrame Map:
● key = Sample
● value = Vector of Genotype alleles (sorted by Variant)
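To show what the clustering step does with such a map, here is a tiny version of Lloyd's k-means in plain Python. It is not MLlib's implementation: the sample names and vectors are invented, and the initial centres are passed in explicitly so the run is deterministic.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, initial_centers, iterations=10):
    """Lloyd's algorithm: alternate cluster assignment and centre update."""
    centers = [list(c) for c in initial_centers]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(v, centers[k]))
            for v in vectors
        ]
        for k in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centre if a cluster is empty
                centers[k] = mean(members)
    return assignment, centers

# Hypothetical data: sample ID -> genotype vector, ordered by variant.
data_frame = {
    "sample1": [0, 0, 1, 0],
    "sample2": [0, 1, 0, 0],
    "sample3": [2, 2, 1, 2],
    "sample4": [2, 1, 2, 2],
}
samples = sorted(data_frame)
vectors = [data_frame[s] for s in samples]
assignment, centers = kmeans(vectors, initial_centers=[vectors[0], vectors[2]])
clusters = dict(zip(samples, assignment))
```

With real data the vectors have one entry per variant, so the same loop runs over far longer vectors and needs a distributed implementation such as MLlib's.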

Mashup: prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)

Mashup: samples per population (rows) and cluster (columns)

       #0   #1   #2
GBR     0    0   89
ASW    54    0    7
CHB     0   97    0
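A table like this one is a cross-tabulation of the panel against the predicted clusters. The sketch below rebuilds it in plain Python for just the four samples listed on the prediction slide, so the counts are far smaller than in the full table; `cross_tabulate` and `format_rows` are hypothetical helper names.

```python
from collections import Counter

# Panel and predictions for the four samples shown on the prediction slide.
panel = {"NA20332": "ASW", "NA20334": "ASW", "HG00120": "GBR", "NA18560": "CHB"}
predictions = {"NA20332": 0, "NA20334": 2, "HG00120": 2, "NA18560": 1}

def cross_tabulate(panel, predictions):
    """Count samples for each (population, cluster) pair."""
    return Counter((panel[sample], cluster) for sample, cluster in predictions.items())

def format_rows(counts, clusters):
    """One row per population: [population, count in each cluster]."""
    populations = sorted({population for population, _ in counts})
    return [[p] + [counts[(p, c)] for c in clusters] for p in populations]

table = cross_tabulate(panel, predictions)
rows = format_rows(table, clusters=[0, 1, 2])
```

A population that falls almost entirely into one column, as GBR and CHB do in the full table, indicates that the clusters line up with that population.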

Cluster: 4 m3.xlarge instances (EC2), 16 cores + 60 GB

Cluster: 40 m3.xlarge instances, 160 cores + 600 GB

Conclusions and future work
● ADAM and Spark provide tools to manipulate genomics data in a scalable way
● Simple APIs in Scala
● MLlib for machine learning
→ implement less naïve algorithms
→ cross medical and environmental data with genomes

Acknowledgements
- Scala.IO
- AmpLab
- Matt Massie
- Frank Nothaft
- Vincent Botta

That’s all Folks
Apparently, we’re supposed to stay on stage
Waiting for questions
Hoping for none
Looking at the bar
And the lunch
Oh there are beers
And candies
who can read this?
