Effective Testing of Apache Accumulo Iterators
Josh Elser
Accumulo Summit 2016
2016/10/11

2. Engineer at Hortonworks, Member of the Apache Software Foundation
Top-Level Projects
• Apache Accumulo®
• Apache Calcite™
• Apache Commons™
• Apache HBase®
• Apache Phoenix™
ASF Incubator
• Apache Fluo™
• Apache Gossip™
• Apache Pirk™
• Apache Rya™
• Apache Slider™
These Apache project names are trademarks or registered trademarks of the Apache Software Foundation.
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

3. A Novel Feature of Apache Accumulo
• SortedKeyValueIterator (SKVI or “Iterators”)
• Computation offload
• Reduced I/O
• Rumored to be called “cool” by Jeff Dean
[Diagram labels: Transformations, Server-Side Predicate-Pushdown, Filters, Aggregations, Combiners, Versioning, Security]

4. Apache Accumulo Iterators
• Column Slices (CfCqSliceFilter)
• Basic Statistics (StatsCombiner)
• Value/Array Concatenation (Summing[Array]Combiner)
• Aggregations (WholeRowIterator, WholeColumnFamilyIterator)
• In-Row operations (AndIterator, OrIterator)
• Filters (RegExFilter, GrepIterator, FirstEntryInRowIterator)

5. Reads
• Clients request a Range of data
• Key to Row to Tablet to TabletServer
• Sorted, merged-read of memory and files
• Computation offload and RPC boost
[Diagram labels: Client, Iterators, Tablet (Memory and five RFiles)]
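
The "sorted, merged-read of memory and files" bullet can be pictured as a k-way merge over sources that are each already sorted. The following is a minimal sketch of that idea using plain Java collections; the class name `MergedRead`, the `Cursor` helper, and the use of bare strings as keys are assumptions made for illustration and are not Accumulo's implementation or API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative stand-in only: not Accumulo's implementation or API.
final class MergedRead {

  // One sorted source (an in-memory map or a file) plus its current head key.
  private static final class Cursor implements Comparable<Cursor> {
    final Iterator<String> source;
    String head;

    Cursor(Iterator<String> source) {
      this.source = source;
      this.head = source.next();
    }

    @Override
    public int compareTo(Cursor other) {
      return head.compareTo(other.head);
    }
  }

  // Merges already-sorted sources into a single sorted list of keys.
  static List<String> merge(List<List<String>> sortedSources) {
    PriorityQueue<Cursor> heap = new PriorityQueue<>();
    for (List<String> source : sortedSources) {
      if (!source.isEmpty()) {
        heap.add(new Cursor(source.iterator()));
      }
    }
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      // The source with the smallest head key supplies the next result.
      Cursor smallest = heap.poll();
      merged.add(smallest.head);
      if (smallest.source.hasNext()) {
        smallest.head = smallest.source.next();
        heap.add(smallest);
      }
    }
    return merged;
  }
}
```

Because every source is sorted, the merge only ever compares the current head of each source, which is why a stack of iterators can sit on top of the merged stream and see keys in order.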

6. Reads with Iterators
• A poor-man’s “VIEW”
• Server-side transformation at query-time

Raw
Key                          Value
3141592 siblings:brothers    Bobby,Steven
3141592 siblings:sisters     Sally,Francine
3141593 siblings:brothers    Frank
3141593 siblings:sisters     Amy,Loretta
3141594 siblings:brothers
3141594 siblings:sisters     Rebecca,Savannah

Transformed
Key                          Value
3141592 siblings:count       4
3141593 siblings:count       3
3141594 siblings:count       2
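
The raw-to-transformed table above can be reproduced with a small sketch. This is plain Java over a `TreeMap`, not an Accumulo `SortedKeyValueIterator`; the class name `SiblingCountView` and the "row column" string key layout are assumptions chosen to mirror the slide's table.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in only: plain collections, not Accumulo's iterator API.
final class SiblingCountView {

  // Reads sorted "row siblings:qualifier" -> "name,name" entries and returns one
  // derived "row siblings:count" entry per row. The raw map is never modified.
  static TreeMap<String, String> transform(TreeMap<String, String> raw) {
    TreeMap<String, Integer> countsByRow = new TreeMap<>();
    for (Map.Entry<String, String> entry : raw.entrySet()) {
      String row = entry.getKey().split(" ")[0];
      String names = entry.getValue();
      // An empty value means no siblings of that kind.
      int n = names.isEmpty() ? 0 : names.split(",").length;
      countsByRow.merge(row, n, Integer::sum);
    }
    TreeMap<String, String> view = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : countsByRow.entrySet()) {
      view.put(entry.getKey() + " siblings:count", String.valueOf(entry.getValue()));
    }
    return view;
  }
}
```

The point of the sketch is the "VIEW" behavior: the caller receives derived entries while the stored data stays as it was written.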

7. Compactions
• Bounds number of files and performance
• Iterators provide data optimization mechanism
[Diagram labels: Tablet, RFiles before and after compaction, Iterators]

8. Compactions with Iterators
• Deferred aggregation
• Rewrite application data in optimal form

Raw
Key                          Value
3141592 siblings:brothers    Bobby,Steven
3141592 siblings:sisters     Sally,Francine
3141593 siblings:brothers    Frank
3141593 siblings:sisters     Amy,Loretta
3141594 siblings:brothers
3141594 siblings:sisters     Rebecca,Savannah

Transformed
Key                          Value
3141592 siblings:brothers    …
3141592 siblings:count       4
3141592 siblings:sisters     …
3141593 siblings:brothers    …
3141593 siblings:count       3
3141593 siblings:sisters     …
3141594 siblings:brothers    …
3141594 siblings:count       2
3141594 siblings:sisters     …
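
One way to read the table above is that a compaction merges several files into one and stores the derived count next to the raw entries, so later reads do not have to recompute it. The sketch below illustrates that reading with plain Java collections. `CompactionSketch`, the "row column" key layout, and the rule that later files win on duplicate keys are all assumptions for illustration, not Accumulo behavior or API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in only: not how Accumulo implements compactions.
final class CompactionSketch {

  // Merges sorted "files" into one and stores a siblings:count entry per row.
  static TreeMap<String, String> compact(List<TreeMap<String, String>> files) {
    TreeMap<String, String> merged = new TreeMap<>();
    for (TreeMap<String, String> file : files) {
      merged.putAll(file); // simplification: later files win on duplicate keys
    }
    TreeMap<String, Integer> countsByRow = new TreeMap<>();
    for (Map.Entry<String, String> entry : merged.entrySet()) {
      String[] keyParts = entry.getKey().split(" ");
      if (keyParts[1].equals("siblings:count")) {
        continue; // previously stored counts are recomputed, not trusted
      }
      String names = entry.getValue();
      int n = names.isEmpty() ? 0 : names.split(",").length;
      countsByRow.merge(keyParts[0], n, Integer::sum);
    }
    // Write the derived entries into the compacted output.
    for (Map.Entry<String, Integer> entry : countsByRow.entrySet()) {
      merged.put(entry.getKey() + " siblings:count", String.valueOf(entry.getValue()));
    }
    return merged;
  }
}
```

Unlike a read-time view, the output here is what gets written back, which is the sense in which the aggregation is deferred until the data is rewritten.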

9. Better for Everyone
• Iterators are great
  – Abstraction for system-level filters and optimizations
  – Better performance for power-users
• Lots of things Iterators are not
  – Triggers
  – Hooks
  – Coprocessors
  – “Hammers”
• Iterators do not generally replace
  – Flink, Hive, Mesos, Presto, Storm, Spark, YARN, etc.
  – Can in some cases

10. On Building an Iterator
• The API is not particularly intuitive
• Hard to create/support SKVIv2
• Edge-cases in production are hard to understand
• Lots of things to not do in an Iterator
  – Trial and error
• Difficult insight in production systems

11. Existing Testing Tools
Unit Testing
• Good
  – Fast
  – Concise/Simple
  – Given input, verify output
• Bad
  – Not end-to-end
  – Not representative invocation
MiniAccumuloCluster (MAC) Testing
• Good
  – Same server execution as production
  – Same client interaction as production
• Bad
  – Slow/Memory intensive
  – Pedantic to write tests
  – Might not catch production edge-cases
  – Impacted by environment
What’s the happy medium?

12. Iterator Testing Harness
• Testing harness designed to capture common pitfalls
  – ACCUMULO-626 in >=1.8.0
• Complementary
• The good parts
  – Fast
  – Generalized/Reusable tests
  – Extensible
• The bad parts
  – Not directly using TabletServer for invocation
  – Subtle failures

13. Iterator Testing Harness
• Testing an Iterator requires three things
  – Input data
  – Expected output
  – Collection of test cases to run
• Test cases found via reflection
  – Common edge cases provided
  – Easy to develop and run new test cases
• JUnit4 integration

    // createIteratorInput(), createIteratorOutput() and createTestCases() are
    // helper methods that the test class itself provides (not shown on the slide).
    @Parameters
    public static Object[][] data() {
      IteratorTestInput input = createIteratorInput();
      IteratorTestOutput expectedOutput = createIteratorOutput();
      List<IteratorTestCase> testCases = createTestCases();
      // One JUnit parameter set per (input, expected output, test case) combination.
      return BaseJUnit4IteratorTest.createParameters(input, expectedOutput, testCases);
    }

14. Example Test Cases
• Iterator Instantiation
  – Does the Iterator have a visible no-args constructor?
• “DeepCopy” safety
  – Can a “deepCopy()” of an Iterator be used like the original?
• Stateless “hasTop()”
  – Do multiple invocations of “hasTop()” cause incorrect results/errors?
• Re-seek()’ing
  – Accumulo will re-instantiate scan sessions and use new Ranges
  – Does the Iterator still return correct results in this case?
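
The checks listed above can be sketched against a toy iterator to show what each one is probing. Everything here is hypothetical: `MiniIterator` is a tiny contract loosely modeled on the method names from the slide (`hasTop`, `seek`, `deepCopy`), and `MiniTestCases` is not the harness's own code. In particular, the re-seek check assumes the scan resumes just after the last key returned, which is a simplification of what a real re-instantiated scan does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative stand-in only: not Accumulo's SortedKeyValueIterator interface.
interface MiniIterator {
  void seek(String startKey); // position at the first key >= startKey
  boolean hasTop();
  String getTop();
  void next();
  MiniIterator deepCopy();
}

final class SortedSetIterator implements MiniIterator {
  private final TreeSet<String> data;
  private String top;

  SortedSetIterator(TreeSet<String> data) { this.data = data; }

  @Override public void seek(String startKey) { top = data.ceiling(startKey); }
  @Override public boolean hasTop() { return top != null; } // no side effects
  @Override public String getTop() { return top; }
  @Override public void next() { top = data.higher(top); }
  @Override public MiniIterator deepCopy() { return new SortedSetIterator(data); }
}

final class MiniTestCases {

  // Baseline read: seek, then collect every key until the iterator is exhausted.
  static List<String> drain(MiniIterator it, String startKey) {
    it.seek(startKey);
    List<String> out = new ArrayList<>();
    while (it.hasTop()) {
      out.add(it.getTop());
      it.next();
    }
    return out;
  }

  // Calling hasTop() several times per entry must not change the results.
  static boolean hasTopIsStateless(MiniIterator it, String startKey, List<String> expected) {
    it.seek(startKey);
    List<String> out = new ArrayList<>();
    while (it.hasTop() && it.hasTop() && it.hasTop()) {
      out.add(it.getTop());
      it.next();
    }
    return out.equals(expected);
  }

  // A deep copy must be usable exactly like the original.
  static boolean deepCopyIsSafe(MiniIterator it, String startKey, List<String> expected) {
    return drain(it.deepCopy(), startKey).equals(expected);
  }

  // Simulates a torn-down scan: read one entry, then continue on a fresh
  // instance that is re-seeked to just after the last key returned.
  static boolean reseekIsSafe(MiniIterator it, String startKey, List<String> expected) {
    it.seek(startKey);
    List<String> out = new ArrayList<>();
    if (it.hasTop()) {
      String last = it.getTop();
      out.add(last);
      MiniIterator fresh = it.deepCopy();
      fresh.seek(last + "\0"); // smallest string strictly greater than last
      while (fresh.hasTop()) {
        out.add(fresh.getTop());
        fresh.next();
      }
    }
    return out.equals(expected);
  }
}
```

An iterator that caches state in `hasTop()` or keeps position in fields that `deepCopy()` shares would fail one of these checks while still passing a simple input/output unit test, which is the gap the slide describes.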

15. In an Ideal World
• Good testing means faster deployments
• Faster deployment means more value for customers
• Automated tests combat technical debt in code growth
• More automation reduces developer stress
[Diagram: Unit Tests + MiniAccumuloCluster + Iterator Testing Harness]

16. In an Ideal World: A Trio of Testing Approaches
• Unit Tests (test lifecycle phase)
  – Fast verification given input/output
  – Validate impact of state
• Iterator Testing Harness (test lifecycle phase)
  – Catch common mistakes
  – Basic lifetime/API validation
  – Encourage best practices
• MiniAccumuloCluster (integration-test lifecycle phase)
  – Functional/Acceptance tests
  – Does the ingest/query system function?
  – Real execution of Iterator by TabletServer

17. In an Ideal World
• Standalone environment
  – The “laptop test”
  – Sanity check
• Staging environments
  – Small cluster with a subset of data
  – Correctness and performance
[Diagram labels: Code; MAC; Iterator Test Harness; Unit Tests; Binary Artifacts; Deploy; Standalone; Staging; Production]

18. In an Ideal World
• No more “voodoo” and “black magic”
• Find common errors fast
• Catch bad Iterator design early
• Standardized testing methodology
• Community contributes new tests
• Increase in quality, reusability, and confidence

19. Thank You
Twitter: @josh_elser
Email: elserj@apache.org / jelser@hortonworks.com


Editor's Notes

  • #4 * Customizable “framework” for pushing down computation to Accumulo servers. Moving computation closer to the data increases performance by reducing the amount of data read off of disk and the amount of data sent to the client. * Powerful for Accumulo devs to also reuse for system operations (improves code quality, reduces debt).
  • #5 * Customizable “framework” for pushing down computation to Accumulo servers. Moving computation closer to the data increases performance by reducing the amount of data read off of disk and the amount of data sent to the client. * Powerful for Accumulo devs to also reuse for system operations (improves code quality, reduces debt).
  • #6 Recently written data kept in memory for fast-access, merged with data on disk. All sorted, so log(n) lookups.
  • #7 Transformation on read. A VIEW. Original data isn’t modified. Computation offloaded. Simple application tier.
  • #8 Compactions are good! Reorganize data for optimal access
  • #10 Iterators are a great abstraction in the system. They resemble transformations/functions over streams and can be implemented very efficiently. Sometimes, they can completely replace the need for a distributed execution engine. Many-to-many joins can be done in-partition if the data allows. CPU/Memory is still cheap compared to Network and Disk. The more that can be done locally, the better. A tool, but not a universal one. Intentionally not all of these things. Call BS on people who think they replace other systems.
  • #11 The API isn’t necessarily harmful, but it’s definitely not self-describing. It is not clear how it’s executed. So ingrained in code, hard to… Lots of edge cases in a production system that don’t come out in testing. Causes all sorts of errors. How do you even debug an iterator in a scan? Remote debugger? Logging? Hard with one TabletServer, imagine 10. Imagine 50. What about issues that don’t arise until you see production data flows? Do we really think testing a distributed, large-scale database works with 10KB of data? Just over *Five Years* at Apache! Still nothing better?? Why can’t we do this better?
  • #13 Only took us 3.5 years to work on this. Reuse across iterators, extensible (in both new tests and java testing frameworks)
  • #14 Reusability of test-cases is paramount. Reflection discovery means that tests for iterators always get all of the test classes available on the classpath. Write-once, profit repeatedly
  • #19 We want fast tests so that we don’t always run with -DskipTests. We want to have confidence that changes to iterators or new iterators work as expected.