Lightning fast genomics With Spark and ADAM
Who are we? Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Xavier @xtordoir SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
Genomics What is genomics about? Medical Diagnostics Drug response Diseases mechanisms
Genomics What is genomics about? - A human genome is a 3 billion long sequence (of nucleic acids: “bases”) - 1 per 1000 base is variable in human population - Genomes encode bio-molecules (tens of thousands) - These molecules interact together ...and with environment → Biological systems are very complex
Genomics State of the art - growing technological capacity - cost reduction - growing data._
Genomics State of the art - I.T. becomes bottleneck (cost and latency) - sacrifice data with sampling or cut-offs Andrea Sboner et al
Genomics Blocking points - “legacy stack” not designed scalable (C, perl, …) - HPC approach not a fit (data intensive)
Genomics Future of genomics - Personal genomes (e.g. 1,000,000 genomes for cancer research) - New sequencing technologies - Sequence “stuff” as needed (e.g. microbiome, diagnostics) - medicalCondition = f(genomics, environmentHistory)
Genomics Needs of scalability → Scala & Spark Needs of simplicity, clarity → ADAM
Parquet 101 Columnar storage Row oriented Column oriented
Parquet 101 Columnar storage > Homogeneous collocated data > Better range access > Better encoding
Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } }
Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Nested structure →Tree Empty levels →Branch pruning Repetitions →Metadata (index) Types → Safe/Fast codec
Parquet 101 Efficient encoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet 101 Optimized distributed storage (f.i. in HDFS) ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema IDL { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } record User { string name; union { null, int } favorite_number = null; union { null, string } favorite_color = null; }
Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema Part of the: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } ● protocol ● serialization →less metadata Define: IDL → JSON Send: Binary → JSON
ADAM Credits: AmpLab (UC Berkeley)
ADAM Overview (Sequencing) - DNA is a molecule …or a Seq[Char] (A, T, G, C) alphabet
ADAM Sequencing - Massively parallel sequencing of random 100-150 bases reads (20,000,000 reads per genome) - 30-60x coverage for quality - All this mess must be re-organised! → ADAM
ADAM Variants Calling - From an organized set of reads (ADAM Pileup) - Detect variants (Variant Calling) → AVOCADO
ADAM Genomics specifications - SAM, BAM, VCF - Indexable - libraries - ~ scalable: hadoop-bam
ADAM ADAM model - schema based (Avro), libraries are generated - no storage spec here!
ADAM ADAM model - Parquet storage - evenly distribute data - storage optimized for read/query - better compression
ADAM ADAM API - AdamContext provides functions to read from HDFS
ADAM ADAM API - Scala classes generated from Avro - Data loaded as RDDs (Spark’s Resilient Distributed Datasets) - functions on RDDs (write to HDFS, genomic objects manipulations)
ADAM ADAM API - e.g. reading genotypes
ADAM ADAM Benchmark - It scales! - Data is more compact - Read perf is better - Code is simpler
Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Stratification using 1000Genomes
Stratification using 1000Genomes
Stratification using 1000Genomes Study genetic variations in populations (needs more contextual data for healthcare). To validate the interest in ADAM, we’ll do some qualitative exploration of the data. Question: it is possible to predict the appartenance of a given genome to a subpopulation?
Stratification using 1000Genomes We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that would match subpopulations. Actually, it’s important because it reflects populations histories: gene flows, selection, ...
Stratification using 1000Genomes From the 200Tb of data, we’ll focus on the 6th chromosome, actually only its variants ref: http://en.wikipedia.org/wiki/Chromosome
Genome Data Data structure
Genome Data Data structure Panel: Map[SampleID, Population]
Genome Data Data structure Genotypes in VCF format Basically a text file. Ours were downloaded from S3. Converted to ADAM Genotypes
Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model Clustering: KMeans PreProcess = {A,C,T,G}² → {0,1,2} Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ Distance = Euclidian (L2) ⁽*⁾ ⁽*⁾MLlib restriction, although, here: L2~L1 SPARK-3012 ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model MLLib, KMeans MLLib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
Machine Learning model MLLib KMeans DataFrame Map: ● key = Sample ● value = Vector of Genotypes alleles (sorted by Variant)
Mashup prediction Sample [NA20332] is in cluster #0 for population Some(ASW) Sample [NA20334] is in cluster #2 for population Some(ASW) Sample [HG00120] is in cluster #2 for population Some(GBR) Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup #0 #1 #2 GBR 0 0 89 ASW 54 0 7 CHB 0 97 0
Cluster 4 m3.xlarge instances (ec2) 16 cores + 60G
Cluster Performances
Cluster 40 m3.xlarge 160 cores + 600G
Conclusions and future work ● ADAM and Spark provide tools to manipulate genomics data in a scalable way ● Simple APIs in Scala ● MLLib for machine learning → implement less naïve algorithms → cross medical and environmental data with genomes
Acknowledgments Acknowledgements Scala.IO AmpLab Matt Massie Frank Nothaft Vincent Botta
That’s all Folks Apparently, we’re supposed to stay on stage Waiting for questions Hoping for none Looking at the bar And the lunch Oh there are beers And candies who can read this?

Lightning fast genomics with Spark, Adam and Scala

  • 1.
    Lightning fast genomics With Spark and ADAM
  • 2.
    Who are we? Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Xavier @xtordoir SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
  • 3.
    Genomics What isgenomics about? Medical Diagnostics Drug response Diseases mechanisms
  • 4.
    Genomics What isgenomics about? - A human genome is a 3 billion long sequence (of nucleic acids: “bases”) - 1 per 1000 base is variable in human population - Genomes encode bio-molecules (tens of thousands) - These molecules interact together ...and with environment → Biological systems are very complex
  • 5.
    Genomics State ofthe art - growing technological capacity - cost reduction - growing data._
  • 6.
    Genomics State ofthe art - I.T. becomes bottleneck (cost and latency) - sacrifice data with sampling or cut-offs Andrea Sboner et al
  • 7.
    Genomics Blocking points - “legacy stack” not designed scalable (C, perl, …) - HPC approach not a fit (data intensive)
  • 8.
    Genomics Future ofgenomics - Personal genomes (e.g. 1,000,000 genomes for cancer research) - New sequencing technologies - Sequence “stuff” as needed (e.g. microbiome, diagnostics) - medicalCondition = f(genomics, environmentHistory)
  • 9.
    Genomics Needs ofscalability → Scala & Spark Needs of simplicity, clarity → ADAM
  • 10.
    Parquet 101 Columnarstorage Row oriented Column oriented
  • 11.
    Parquet 101 Columnarstorage > Homogeneous collocated data > Better range access > Better encoding
  • 12.
    Parquet 101 Efficientencoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } }
  • 13.
    Parquet 101 Efficientencoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Nested structure →Tree Empty levels →Branch pruning Repetitions →Metadata (index) Types → Safe/Fast codec
  • 14.
    Parquet 101 Efficientencoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 15.
    Parquet 101 Optimizeddistributed storage (f.i. in HDFS) ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  • 16.
    Parquet 101 Efficient(schema based) serialization: AVRO JSON Schema IDL { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } record User { string name; union { null, int } favorite_number = null; union { null, string } favorite_color = null; }
  • 17.
    Parquet 101 Efficient(schema based) serialization: AVRO JSON Schema Part of the: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } ● protocol ● serialization →less metadata Define: IDL → JSON Send: Binary → JSON
  • 18.
    ADAM Credits: AmpLab(UC Berkeley)
  • 19.
    ADAM Overview (Sequencing) - DNA is a molecule …or a Seq[Char] (A, T, G, C) alphabet
  • 20.
    ADAM Sequencing -Massively parallel sequencing of random 100-150 bases reads (20,000,000 reads per genome) - 30-60x coverage for quality - All this mess must be re-organised! → ADAM
  • 21.
    ADAM Variants Calling - From an organized set of reads (ADAM Pileup) - Detect variants (Variant Calling) → AVOCADO
  • 22.
    ADAM Genomics specifications - SAM, BAM, VCF - Indexable - libraries - ~ scalable: hadoop-bam
  • 23.
    ADAM ADAM model - schema based (Avro), libraries are generated - no storage spec here!
  • 24.
    ADAM ADAM model - Parquet storage - evenly distribute data - storage optimized for read/query - better compression
  • 25.
    ADAM ADAM API - AdamContext provides functions to read from HDFS
  • 26.
    ADAM ADAM API - Scala classes generated from Avro - Data loaded as RDDs (Spark’s Resilient Distributed Datasets) - functions on RDDs (write to HDFS, genomic objects manipulations)
  • 27.
    ADAM ADAM API - e.g. reading genotypes
  • 28.
    ADAM ADAM Benchmark - It scales! - Data is more compact - Read perf is better - Code is simpler
  • 29.
    Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
  • 30.
    Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  • 31.
  • 32.
  • 33.
    Stratification using 1000Genomes Study genetic variations in populations (needs more contextual data for healthcare). To validate the interest in ADAM, we’ll do some qualitative exploration of the data. Question: it is possible to predict the appartenance of a given genome to a subpopulation?
  • 34.
    Stratification using 1000Genomes We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that would match subpopulations. Actually, it’s important because it reflects populations histories: gene flows, selection, ...
  • 35.
    Stratification using 1000Genomes From the 200Tb of data, we’ll focus on the 6th chromosome, actually only its variants ref: http://en.wikipedia.org/wiki/Chromosome
  • 36.
  • 37.
    Genome Data Datastructure Panel: Map[SampleID, Population]
  • 38.
    Genome Data Datastructure Genotypes in VCF format Basically a text file. Ours were downloaded from S3. Converted to ADAM Genotypes
  • 39.
    Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 40.
    Machine Learning model Clustering: KMeans PreProcess = {A,C,T,G}² → {0,1,2} Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ Distance = Euclidian (L2) ⁽*⁾ ⁽*⁾MLlib restriction, although, here: L2~L1 SPARK-3012 ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 41.
    Machine Learning model MLLib, KMeans MLLib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
  • 42.
    Machine Learning model MLLib KMeans DataFrame Map: ● key = Sample ● value = Vector of Genotypes alleles (sorted by Variant)
  • 43.
    Mashup prediction Sample[NA20332] is in cluster #0 for population Some(ASW) Sample [NA20334] is in cluster #2 for population Some(ASW) Sample [HG00120] is in cluster #2 for population Some(GBR) Sample [NA18560] is in cluster #1 for population Some(CHB)
  • 44.
    Mashup #0 #1#2 GBR 0 0 89 ASW 54 0 7 CHB 0 97 0
  • 45.
    Cluster 4 m3.xlargeinstances (ec2) 16 cores + 60G
  • 46.
  • 47.
    Cluster 40 m3.xlarge 160 cores + 600G
  • 48.
    Conclusions and futurework ● ADAM and Spark provide tools to manipulate genomics data in a scalable way ● Simple APIs in Scala ● MLLib for machine learning → implement less naïve algorithms → cross medical and environmental data with genomes
  • 49.
    Acknowledgments Acknowledgements Scala.IO AmpLab Matt Massie Frank Nothaft Vincent Botta
  • 50.
    That’s all Folks Apparently, we’re supposed to stay on stage Waiting for questions Hoping for none Looking at the bar And the lunch Oh there are beers And candies who can read this?