PERFORMANCE UPDATE: WHEN APACHE ORC MET APACHE SPARK Dongjoon Hyun - dhyun@hortonworks.com Owen O'Malley - owen@hortonworks.com @owen_omalley
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Performance Update Spark – Apache ORC 1.4 Integration Benchmark Overview Results Roadmap
ORC History
 Originally released as part of Hive
– Released in Hive 0.11.0 (2013-05-16)
– Included in every Hive release before 2.3.0 (2017-07-17)
 Factored out of Hive
– Improve integration with other tools
– Shrink the size of the dependencies
– Release faster than Hive
– Added a C++ reader and a new C++ writer
Integration History
 Before Apache ORC
– Hive 1.2.1 (2015-06-27)
 SPARK-2883 Since Spark 1.4, Hive ORC is used.
 After Apache ORC
– v1.0.0 (2016-01-25)
– v1.1.0 (2016-06-10)
– v1.2.0 (2016-08-25)
– v1.3.0 (2017-01-23)
– v1.4.0 (2017-05-08)
 SPARK-21422 For Spark 2.3, the Apache ORC dependency is added and used.
Integration Design (in HDP 2.6.3 with Spark 2.2)
 Switch between the old and new formats by configuration
 [Class-hierarchy diagram] FileFormat implementations:
– o.a.s.sql.execution.datasources: TextBasedFileFormat (CSVFileFormat, JsonFileFormat, TextFileFormat), ParquetFileFormat, OrcFileFormat, HiveFileFormat
– o.a.s.ml.source.libsvm: LibSVMFileFormat
– o.a.s.sql.hive.orc: OrcFileFormat
 Old OrcFileFormat from Hive 1.2.1 (SPARK-2883)
 New OrcFileFormat with ORC 1.4.0 (SPARK-21422, SPARK-20682, SPARK-20728)
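The configuration switch above can be sketched in Spark SQL; a minimal illustration only, using the HDP 2.6.3 property names listed later in this deck (`spark.sql.hive.convertMetastoreOrc`, `spark.sql.orc.enabled`), with a hypothetical table name and location:

```sql
-- Illustrative sketch: choose the ORC implementation by configuration.
SET spark.sql.hive.convertMetastoreOrc=true;  -- convert Hive metastore ORC tables to the data source path
SET spark.sql.orc.enabled=true;               -- use the new ORC 1.4-based OrcFileFormat
-- `sales` and its location are hypothetical.
CREATE TABLE sales USING ORC LOCATION '/data/sales';
```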
Benefit in Apache Spark
 Speed
– Use Spark's ColumnarBatch and ORC's RowBatch together more seamlessly
 Stability
– Apache ORC 1.4.0 has many fixes, and we can rely on the ORC community more.
 Usability
– Users can use ORC data sources without the hive module, i.e., -Phive.
 Maintainability
– Reduce the Hive dependency; old legacy code can be removed later.
Faster and More Scalable
 [Chart] Single Column Scan from Wide Tables (1M rows with all BIGINT columns): time (ms) vs. number of columns (100–300); NEW is faster than OLD.
 [Chart] Vectorized Read (15M rows in a single-column table): time (ms) per type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); NEW is faster than OLD.
Support Matrix
 HDP 2.6.3 is going to add a new, faster, and stable ORC file format for ORC tables, with a subset of the limitations of Apache Spark 2.2.
 The following are not yet supported:
– Zero-byte ORC files
– Schema evolution
• Adding columns at the end
• Changing types and deleting columns
 See the full JIRA issue list on the next slides.
Done Tickets
 SPARK-20901 Feature parity for ORC with Parquet
– The parent ticket of all ORC issues
 Done
– SPARK-20566 ColumnVector should support `appendFloats` for array
– SPARK-21422 Depend on Apache ORC 1.4.0
– SPARK-21831 Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
– SPARK-21839 Support SQL config for ORC compression
– SPARK-21884 Fix StackOverflowError on MetadataOnlyQuery
– SPARK-21912 ORC/Parquet table should not create invalid column names
On-Going Tickets
 SPARK-20682 Support a new faster ORC data source based on Apache ORC
 SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
 SPARK-16060 Vectorized Orc reader
 SPARK-21791 ORC should support column names with dot
 SPARK-21787 Support for pushing down filters for DATE types in ORC
 SPARK-19809 Zero byte ORC file support
 SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
To-do Tickets
 Configuration
– SPARK-21783 Turn on ORC filter push-down by default
 Read
– SPARK-11412 Support merge schema for ORC
– SPARK-16628 OrcConversions should not convert an ORC table
 Alter
– SPARK-21929 Support `ALTER TABLE ADD COLUMNS(..)` for ORC data source
– SPARK-18355 Spark SQL fails to read from an ORC table with a new column
 Write
– SPARK-12417 Orc bloom filter options are not propagated during file write
Benchmark Overview
• Objective: compare the performance of new ORC vs. old ORC in Spark
• Using TPC-DS at 1 TB and 10 TB scales
Enabling New ORC in Spark
 spark.sql.hive.convertMetastoreOrc=true
 spark.sql.orc.enabled=true
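These two properties can also be set cluster-wide; a minimal spark-defaults.conf sketch, using only the values on this slide:

```properties
# Enable the new Apache ORC 1.4-based reader (HDP 2.6.3 / Spark 2.2)
spark.sql.hive.convertMetastoreOrc  true
spark.sql.orc.enabled               true
```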
Software/Hardware/Cluster Details
 Software
– Internal branch of Spark 2.2 (in HDP)
 Node configuration
– 32 CPUs, Intel E5-2640, 2.00 GHz
– 256 GB RAM
– 6 SATA disks (~4 TB, 7200 rpm)
– 10 Gbps network card
 Cluster
– 10 nodes used for 1 TB
– 15 nodes used for 10 TB
– DOUBLE type used in the data
– TPC-DS (1.4) queries in Spark used for benchmarking
Spark Tuning

Spark Parameter Name                   | Default Value             | Tuned Value
spark.sql.hive.convertMetastoreOrc     | false                     | true
spark.sql.orc.enabled                  | false                     | true
spark.sql.orc.filterPushdown           | false                     | true
spark.sql.statistics.fallBackToHdfs    | false                     | true
spark.sql.autoBroadcastJoinThreshold   | 10L * 1024 * 1024 (10 MB) | 26214400 (25 MB)
spark.shuffle.io.numConnectionsPerPeer | 1                         | 10
spark.io.compression.lz4.blockSize     | 32k                       | 128kb
spark.sql.shuffle.partitions           | 200                       | 300
spark.network.timeout                  | 120s                      | 600s
spark.locality.wait                    | 3s                        | 0s

*Hive Parameter Name (hive-site.xml)   | Default Value             | Tuned Value
hive.exec.max.dynamic.partitions       | 1000                      | 10000
hive.exec.dynamic.partition.mode       | strict                    | nonstrict

*mainly for data ingestion
*all tables analyzed with `noscan` option for getting basic stats
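The `noscan` note above refers to Spark SQL's ANALYZE command: with NOSCAN it records basic statistics (number of files, total size) without reading the data, so size-based planning (e.g. broadcast joins, spark.sql.statistics.fallBackToHdfs) has table sizes available. A sketch, using `store_sales` (a TPC-DS table) purely as an example:

```sql
-- Collect basic table statistics without scanning the data.
ANALYZE TABLE store_sales COMPUTE STATISTICS NOSCAN;
```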
Results: 10 TB Scale (15 executors, 170 GB, 25 cores)
* Total runtime of 74 queries in TPC-DS
– New ORC is > 2x better than old ORC
Results: 10 TB Scale (15 executors, 170 GB, 25 cores), New ORC vs. Old ORC
– New ORC consistently and significantly outperforms old ORC across all queries.
Results: 1 TB Scale (10 executors, 170 GB, 25 cores)
* Total runtime of 97 queries in TPC-DS @ 1 TB scale
• New ORC is ~2x faster than old ORC
* Profiling shows high CPU usage in libzip.so during ORC reads; ORC-175 (Intel ISA-L, the Intelligent Storage Acceleration Library, for inflate) could improve this further.
Comparison of New ORC vs Old ORC vs Parquet (10 TB Scale)
* Total runtime of 74 queries in TPC-DS
* 10 TB scale (15 executors, 170 GB, 25 cores)
RoadMap
 Tune default Ambari configs for Spark based on these benchmark results, for a better out-of-the-box experience for end users
 Additional enhancements, such as parallel reading of ORC footers
 Include ORC-175 (Intel ISA-L) when complete
 Contribute the fixes back to the community
Thanks!
 Questions
– ORC: dev@orc.apache.org
– Spark: dev@spark.apache.org
– Benchmarks: dhyun@hortonworks.com
Appendix: Results: 10 TB Scale (15 executors, 170 GB, 25 cores), Comparison of New ORC vs Parquet
