National Aeronautics and Space Administration www.nasa.gov Hadoop for High-Performance Climate Analytics Use Cases and Lessons Learned Glenn Tamkin (NASA/CSC) Team: John Schnase (NASA), Dan Duffy (NASA), Hoot Thompson (PTP), Denis Nadeau (CSC), Scott Sinno (PTP)
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Overview •Scientific data services are a critical aspect of the NASA Center for Climate Simulation’s mission (NCCS). Modern Era Retrospective-Analysis for Research and Applications Analytic Services (MERRA/AS) … • Is a cyber-infrastructure resource for developing and evaluating a next generation of climate data analysis capabilities • A service that reduces the time spent in the preparation of MERRA data used in data-model inter-comparison 2Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Vision • Provide a test-bed for experimental development of high-performance analytics • Offer an architectural approach to climate data services that can be generalized to applications and customers beyond the traditional climate research community 3Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n MERRA A/S Background •Initially evaluated MapReduce and the Hadoop Distributed File System (HDFS) on representative collections of observational and climate data (MERRA) • Focused on a small set of canonical operations such as, average, minimum, maximum, and standard deviation operations over a given temporal and spatial extent • Built a cluster with available hardware (then acquired a custom cluster) • Implemented a prototype to process the data via MapReduce • Captured metrics and observed performance improvements as the number of data nodes and block sizes increase 4Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Project Details •MERRA/AS… • Leverages the Hadoop/MapReduce approach to parallel storage- based computation. • Uses a workflow-generated approach to perform analyses over the MERRA data • Introduces a generalized application programming interface (API) that exposes reusable climate data services. 5Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Why HDFS and MapReduce ? •Software framework to store large amounts of data in parallel across a cluster of nodes • Provides fault tolerance, load balancing, and parallelization by replicating data across nodes • Co-locates the stored data with computational capability to act on the data (storage nodes and compute nodes are the same – typically) • A MapReduce job takes the requested operation and maps it to the appropriate nodes for computation using specified keys Who uses this technology? • Google • Yahoo • Facebook Many PBs and probably even EBs of data. 6Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Initial Use Case •Create a time-based average over the monthly means for specific variables •This example shows a seasonal average of temperature for the winter of 2000 •Focused on reducing the time spent in the preparation of reanalysis data used in data-model inter- comparison, a long sought goal of the climate community 7Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n MERRA Data •The GEOS-5 MERRA products are divided into 25 collections: 18 standard products, 7 chemistry products •Comprise monthly means files and daily files at six-hour intervals running from 1979 – 2012 •Total size of netCDF MERRA collection in a standard filesystem is ~80 TB •One file per day produced with file sizes ranging from ~20 MB to ~1.5 GB 8Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Map Reduce Workflow 9Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Ingesting MERRA data into HDFS • Option 1: Put the MERRA data into Hadoop with no changes » Would require us to write a custom mapper to parse • Option 2: Write a custom NetCDF to Hadoop sequencer and keep the files together » Basically puts indexes into the files so Hadoop can parse » Maintains the NetCDF metadata for each file • Option 3: Write a custom NetCDF to Hadoop sequencer and split the files apart » Breaks the connection of the NetCDF metadata to the data • Chose Option 2 10Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Sequence File Format • During sequencing, the data is partitioned by time, so that each record in the sequence file contains the timestamp and name of the parameter (e.g. temperature) as the composite key and the value of the parameter (which could have 1 to 3 spatial dimensions) 11Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Data Set Descriptions •Two data sets • MAIMNPANA.5.2.0 (instM_3d_ana_Np) – monthly means • MAIMCPASM.5.2.0 (instM_3d_asm_Cp) – monthly means •Common characteristics • Spans years 1979 through 2012….. • Two files per year (hdf, xml), 396 total files •Sizing Raw Sequenced Raw Sequenced Sequence Type Total (GB) Total (GB) File (MB) File (MB) Time (sec) MAIMNPANA 84 224 237 565 30 MAIMCPASM 48 119 130 300 15 12Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Seasonal Averages – Operational Cluster • MAIMNPANA.5.2.0 (sec) • MAIMCPASM.5.2.0 (sec) HDFS Blocking (640MB) Years Period Test Operational Speedup 1 2001 89.1 32.4 2.8 10 2001 - 2010 475.4 128.8 3.7 20 1991 - 2010 1026.6 245.2 4.2 All 1979 - 2011 1520.0 404.7 3.8 HDFS Blocking (640MB) Years Period Test Operational Speedup 1 2001 65.4 18.5 3.5 10 2001 - 2010 205.0 38.7 5.3 20 1991 - 2010 358.1 79.8 4.5 All 1979 - 2011 545.6 110.8 4.9 13Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Namenode JobTracker /merra 5TB /hadoop_fs 1TB /mapred 1TB Data Node 1 /hadoop_fs 16TB /mapred 16TB Data Node 2 /hadoop_fs 16TB /mapred 16TB Data Node 8 /hadoop_fs 16TB /mapred 16TB … Head Node s Data Nodes FDR IB MERRA Cluster Components Data Node 2 Data Node 34 MERRA Data 180 TB Raw LAN
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Operational Node Configurations 15Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Open Source Tools •Using Cloudera (CDH), the open source enterprise-ready distribution of Apache Hadoop. •Cloudera is integrated with configuration and administration tools and related open source packages, such as Hue, Oozie, Zookeeper, and Impala. •Cloudera Manager Free Edition is particularly useful for cluster management, providing centralized administration of CDH. 16Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Customer Connections •NASA ASP A.35 Wildland Fires RECOVER project. •NSF DataNet Federation Consortium •SIGClimate •Others include: GSFC / LARC iRODS Testbed, CSC Climate Edge product line, Applied Science and Terrestrial Ecology Program climate adaptation projects, Direct Readout Laboratory Climate Data Records (CDRs), and NCA modelers 17Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Next Steps • Tune the MapReduce Framework •Identify potential performance optimizations (e.g., modify block size, tweak I/O config •Complete canonical operations (e.g., add mappers/reducers) •Try different ways to sequence the files •Experiment with data accelerators 18Hadoop for High-Performance Climate Analytics
N a t i o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Conclusions and Lessons Learned •Design of sequence format is critical for big binary data •Configuration is key…change only one parameter at a time •Big data is hard, and it takes a long time…. •Expect things to fail – a lot •Hadoop craves bandwidth •HDFS installs easy but optimizing is not so easy •Not as fast as we thought … is there something in Hadoop that we don’t understand yet •It’s all still cutting edge to a certain extent •Ask the mailing list or your support provider 19Hadoop for High-Performance Climate Analytics

Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned

  • 1.
    National Aeronautics andSpace Administration www.nasa.gov Hadoop for High-Performance Climate Analytics Use Cases and Lessons Learned Glenn Tamkin (NASA/CSC) Team: John Schnase (NASA), Dan Duffy (NASA), Hoot Thompson (PTP), Denis Nadeau (CSC), Scott Sinno (PTP)
  • 2.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Overview •Scientific data services are a critical aspect of the NASA Center for Climate Simulation’s mission (NCCS). Modern Era Retrospective-Analysis for Research and Applications Analytic Services (MERRA/AS) … • Is a cyber-infrastructure resource for developing and evaluating a next generation of climate data analysis capabilities • A service that reduces the time spent in the preparation of MERRA data used in data-model inter-comparison 2Hadoop for High-Performance Climate Analytics
  • 3.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Vision • Provide a test-bed for experimental development of high-performance analytics • Offer an architectural approach to climate data services that can be generalized to applications and customers beyond the traditional climate research community 3Hadoop for High-Performance Climate Analytics
  • 4.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n MERRA A/S Background •Initially evaluated MapReduce and the Hadoop Distributed File System (HDFS) on representative collections of observational and climate data (MERRA) • Focused on a small set of canonical operations such as, average, minimum, maximum, and standard deviation operations over a given temporal and spatial extent • Built a cluster with available hardware (then acquired a custom cluster) • Implemented a prototype to process the data via MapReduce • Captured metrics and observed performance improvements as the number of data nodes and block sizes increase 4Hadoop for High-Performance Climate Analytics
  • 5.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Project Details •MERRA/AS… • Leverages the Hadoop/MapReduce approach to parallel storage- based computation. • Uses a workflow-generated approach to perform analyses over the MERRA data • Introduces a generalized application programming interface (API) that exposes reusable climate data services. 5Hadoop for High-Performance Climate Analytics
  • 6.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Why HDFS and MapReduce ? •Software framework to store large amounts of data in parallel across a cluster of nodes • Provides fault tolerance, load balancing, and parallelization by replicating data across nodes • Co-locates the stored data with computational capability to act on the data (storage nodes and compute nodes are the same – typically) • A MapReduce job takes the requested operation and maps it to the appropriate nodes for computation using specified keys Who uses this technology? • Google • Yahoo • Facebook Many PBs and probably even EBs of data. 6Hadoop for High-Performance Climate Analytics
  • 7.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Initial Use Case •Create a time-based average over the monthly means for specific variables •This example shows a seasonal average of temperature for the winter of 2000 •Focused on reducing the time spent in the preparation of reanalysis data used in data-model inter- comparison, a long sought goal of the climate community 7Hadoop for High-Performance Climate Analytics
  • 8.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n MERRA Data •The GEOS-5 MERRA products are divided into 25 collections: 18 standard products, 7 chemistry products •Comprise monthly means files and daily files at six-hour intervals running from 1979 – 2012 •Total size of netCDF MERRA collection in a standard filesystem is ~80 TB •One file per day produced with file sizes ranging from ~20 MB to ~1.5 GB 8Hadoop for High-Performance Climate Analytics
  • 9.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Map Reduce Workflow 9Hadoop for High-Performance Climate Analytics
  • 10.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Ingesting MERRA data into HDFS • Option 1: Put the MERRA data into Hadoop with no changes » Would require us to write a custom mapper to parse • Option 2: Write a custom NetCDF to Hadoop sequencer and keep the files together » Basically puts indexes into the files so Hadoop can parse » Maintains the NetCDF metadata for each file • Option 3: Write a custom NetCDF to Hadoop sequencer and split the files apart » Breaks the connection of the NetCDF metadata to the data • Chose Option 2 10Hadoop for High-Performance Climate Analytics
  • 11.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Sequence File Format • During sequencing, the data is partitioned by time, so that each record in the sequence file contains the timestamp and name of the parameter (e.g. temperature) as the composite key and the value of the parameter (which could have 1 to 3 spatial dimensions) 11Hadoop for High-Performance Climate Analytics
  • 12.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Data Set Descriptions •Two data sets • MAIMNPANA.5.2.0 (instM_3d_ana_Np) – monthly means • MAIMCPASM.5.2.0 (instM_3d_asm_Cp) – monthly means •Common characteristics • Spans years 1979 through 2012….. • Two files per year (hdf, xml), 396 total files •Sizing Raw Sequenced Raw Sequenced Sequence Type Total (GB) Total (GB) File (MB) File (MB) Time (sec) MAIMNPANA 84 224 237 565 30 MAIMCPASM 48 119 130 300 15 12Hadoop for High-Performance Climate Analytics
  • 13.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Seasonal Averages – Operational Cluster • MAIMNPANA.5.2.0 (sec) • MAIMCPASM.5.2.0 (sec) HDFS Blocking (640MB) Years Period Test Operational Speedup 1 2001 89.1 32.4 2.8 10 2001 - 2010 475.4 128.8 3.7 20 1991 - 2010 1026.6 245.2 4.2 All 1979 - 2011 1520.0 404.7 3.8 HDFS Blocking (640MB) Years Period Test Operational Speedup 1 2001 65.4 18.5 3.5 10 2001 - 2010 205.0 38.7 5.3 20 1991 - 2010 358.1 79.8 4.5 All 1979 - 2011 545.6 110.8 4.9 13Hadoop for High-Performance Climate Analytics
  • 14.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Namenode JobTracker /merra 5TB /hadoop_fs 1TB /mapred 1TB Data Node 1 /hadoop_fs 16TB /mapred 16TB Data Node 2 /hadoop_fs 16TB /mapred 16TB Data Node 8 /hadoop_fs 16TB /mapred 16TB … Head Node s Data Nodes FDR IB MERRA Cluster Components Data Node 2 Data Node 34 MERRA Data 180 TB Raw LAN
  • 15.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Operational Node Configurations 15Hadoop for High-Performance Climate Analytics
  • 16.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Open Source Tools •Using Cloudera (CDH), the open source enterprise-ready distribution of Apache Hadoop. •Cloudera is integrated with configuration and administration tools and related open source packages, such as Hue, Oozie, Zookeeper, and Impala. •Cloudera Manager Free Edition is particularly useful for cluster management, providing centralized administration of CDH. 16Hadoop for High-Performance Climate Analytics
  • 17.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Customer Connections •NASA ASP A.35 Wildland Fires RECOVER project. •NSF DataNet Federation Consortium •SIGClimate •Others include: GSFC / LARC iRODS Testbed, CSC Climate Edge product line, Applied Science and Terrestrial Ecology Program climate adaptation projects, Direct Readout Laboratory Climate Data Records (CDRs), and NCA modelers 17Hadoop for High-Performance Climate Analytics
  • 18.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Next Steps • Tune the MapReduce Framework •Identify potential performance optimizations (e.g., modify block size, tweak I/O config •Complete canonical operations (e.g., add mappers/reducers) •Try different ways to sequence the files •Experiment with data accelerators 18Hadoop for High-Performance Climate Analytics
  • 19.
    N a ti o n a l A e r o n a u t i c s a n d S p a c e A d m i n i s t r a t i o n Conclusions and Lessons Learned •Design of sequence format is critical for big binary data •Configuration is key…change only one parameter at a time •Big data is hard, and it takes a long time…. •Expect things to fail – a lot •Hadoop craves bandwidth •HDFS installs easy but optimizing is not so easy •Not as fast as we thought … is there something in Hadoop that we don’t understand yet •It’s all still cutting edge to a certain extent •Ask the mailing list or your support provider 19Hadoop for High-Performance Climate Analytics

Editor's Notes

  • #8  Support production level Monthly Means data for run id = MERRA300Support assimilation data (e.g., combination of atmospheric data analysis and model forecasting to generate a time-series of global atmospheric quantities) . Support 3D variables (at fixed z/height layers - no volume) Support the native (2/3 x 1⁄2) resolution for the horizontal grid (N) Support pressure-level data for the vertical grid Support the direct analysis products group (ANA) Process a specific 3D variable = T (air temperature) Process a maximum of a single year of data (2011) Calculate seasonal averages while suppressing fill/missing values Generate output in NetCDF format
  • #11 What is the best way to ingest the MERRA data into HDFS?HDFS understands text, not binaryOption 1: Put the MERRA data into Hadoop with no changesWould require us to write a custom mapperOption 2: Write a custom NetCDF to Hadoop sequencer and keep the files togetherBasically puts indexes into the files so that Hadoop can understand how to mapMaintains the NetCDF metadata for each fileOption 3: Write a custom NetCDF to Hadoop sequencer and split the files apartBreaks the connection of the NetCDF metadata to the data; must preserve that some other way
  • #15 Powerful computing resources are necessary for on-demand analytic processing across 30+ years of reanalysis data. The MapReduce operations leverage a cluster consisting of 34 Dell R710 servers each populated with 12 3TB hard drives in addition to their operating system disks (CentOS 6.3).Grouped into 2 by 16TB volumes using RAID0 at hardware level. 16 GB system diskInterconnectivity uses a 36 port FDR InfiniBand (IB) switch plus a 48 port GigE switch. Peak network TCP/IP speeds measured at over 20000 Mbps between nodes.Initial cluster metrics show peaks of 314 GFlops/server, one-way IB performance rates topping 6000MB/sec, and an overall capability of approximately 11 TFlops. Configured using free edition of Cloudera.- Shared /home and /other via NFS across all nodes