Scalable and Incremental Data Profiling with Spark
Amelia Arbisser, Adam Silberstein
Trifacta: Self-service data preparation
What data analysts hope to achieve in data projects: analysis and consumption (machine learning, visualization, statistics).
In practice, 80% of their time is spent cleaning and preparing the data to be analyzed.
Trifacta: Self-service data preparation
We speed up and take the pain out of preparation.
Transforming Data in Trifacta
Predictive Interaction and Immediate Feedback
Predictive Interaction
Immediate Feedback
Transforming Data in Trifacta
Where Profiling Fits In
We validate preparation via profiling.
Spark Profiling at Trifacta
➔ Profiling results of transformation at scale
➔ Validation through profiles
➔ Challenges:
  ➔ Scale
  ➔ Automatic job generation
➔ Our solution:
  ➔ Spark profiling job server
  ➔ JSON spec
  ➔ Pay-as-you-go
The Case for Profiling
Transforming Data in Trifacta
Job Results
➔ Even clean raw data is not informative,
  especially when it is too large to inspect manually.
➔ We need a summary representation, both statistical and visual.
➔ We generate that profile programmatically; a minimal sketch follows.
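A minimal sketch (in Scala; not Trifacta's actual code) of what generating a profile programmatically can look like in Spark, assuming a DataFrame df already loaded from the job's output:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Build one aggregation per column, then run them all in a single pass:
// non-null counts plus HyperLogLog-based approximate distinct counts.
def summarize(df: DataFrame): Unit = {
  val aggs = df.columns.flatMap { c =>
    Seq(
      count(col(c)).as(s"${c}_non_null"),                // count skips nulls
      approx_count_distinct(col(c)).as(s"${c}_distinct") // approximate cardinality
    )
  }
  df.agg(aggs.head, aggs.tail: _*).show(truncate = false)
}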
Visual Result Validation
➔ Similar to the profiling shown above the data grid.
➔ Applied at scale, to the full result.
➔ Reveals mismatched and missing values; a classification sketch follows below.
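A hedged sketch of that valid/mismatched/missing classification; the column name order_date and the simplified date regex are stand-ins for a real Trifacta type definition:

import org.apache.spark.sql.functions._

// Classify every value in one column, then count each bucket; these
// counts are what a quality bar above the column would render.
val datePattern = "^\\d{4}-\\d{2}-\\d{2}$" // simplified stand-in regex
val status = when(col("order_date").isNull, "missing")
  .when(col("order_date").rlike(datePattern), "valid")
  .otherwise("mismatched")

df.groupBy(status.as("status")).count().show()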
Challenges
The Goal: Profiling for Every Job
➔ We’ve talked about the value of profiling, but...
➔ Profiling is not a table-stakes feature, at least not on day one.
➔ We don’t want our users to disable it!
➔ Profiling is potentially more expensive than transformation.
Profiling at Scale
➔ Large data volume
➔ Long running times
➔ Memory constraints
Performance vs. Accuracy
➔ Approximation brings performance gains while still meeting the profiler’s summarization requirements.
➔ Off-the-shelf sketch libraries (count-min sketch, t-digest) are great for approximating counts and non-distributive stats.
➔ They are not a silver bullet, though: approximate results can sometimes confuse users.
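For illustration only, Spark's built-in approximations stand in below for the count-min-sketch and t-digest libraries named above (Spark's approxQuantile actually uses a Greenwald-Khanna variant); user_id and price are hypothetical columns:

// Approximate distinct count with ~1% target relative error: bounded
// memory regardless of cardinality, unlike an exact countDistinct.
val users = df.agg(approx_count_distinct("user_id", rsd = 0.01)).head.getLong(0)

// Approximate quantiles: non-distributive stats such as the median,
// without a full global sort.
val Array(p25, median, p75) =
  df.stat.approxQuantile("price", Array(0.25, 0.5, 0.75), 0.001)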
Flexible Job Generation
➔ Profile for all users and all use cases: not a one-off.
  ➔ Diverse schemas
  ➔ Any number of columns
➔ Calculate statistics for all data types (see the sketch after this list):
  ➔ Numeric vs. categorical
  ➔ Container types (maps, arrays, ...)
  ➔ User-defined
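A sketch of type-driven job generation, assuming a loaded DataFrame df; container and user-defined types would need their own branches, omitted here:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Pick a statistic per column from its Spark type, so one generated
// job handles any schema and any number of columns.
val aggs = df.schema.fields.map { f =>
  f.dataType match {
    case _: NumericType => avg(col(f.name)).as(s"${f.name}_mean") // numeric
    case _              => approx_count_distinct(col(f.name))     // categorical
                             .as(s"${f.name}_cardinality")
  }
}
df.agg(aggs.head, aggs.tail: _*).show(truncate = false)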
Transformation vs. Profiling
➔ Transformation is primarily row-based.
➔ Profiling is column-based.
➔ These imply different execution concerns, as the sketch below illustrates.
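A toy contrast, with assumed columns price and qty:

// Row-based transformation: each record is handled independently; no shuffle.
val transformed = df.withColumn("total", col("price") * col("qty"))

// Column-based profiling: aggregates over all values of a column, so partial
// state accumulates across partitions and must be merged at the end.
val profiled = transformed.agg(
  min("total"), max("total"), approx_count_distinct("total"))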
Solution: Spark in the Trifacta System
Why Spark
➔ Effective at OLAP-style workloads
➔ Flexible enough for UDFs
➔ Not tied to a particular Hadoop distribution
➔ Mass adoption
➔ Low latency
Spark Profile Job Server
➔ We built a Spark job server.
➔ It automatically profiles the output of transform jobs.
➔ It writes profiling results to HDFS.
➔ The Trifacta product renders the resulting graphs.
[Architecture diagram: the client UI webapp talks to a server-side batch server and profiler service, which run Spark workers on the cluster and read from and write to HDFS]
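One plausible shape for the submission path, using Spark's SparkLauncher; the jar path, main class, and spec location below are made up for illustration:

import org.apache.spark.launcher.SparkLauncher

// Submit a profiling job after a transform finishes. The returned handle
// lets the batch server track job state and surface failures to the webapp.
val handle = new SparkLauncher()
  .setAppResource("/opt/profiler/profiler-assembly.jar")   // hypothetical jar
  .setMainClass("com.example.profiler.Main")               // hypothetical entry point
  .setMaster("yarn")
  .addAppArgs("--spec", "hdfs:///specs/profile-spec.json") // hypothetical spec path
  .startApplication()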
Profiler Service
➔ The JSON specification tells the service which columns to profile and their types.
  ➔ Uses the Trifacta type system, e.g. dates, geography, user-defined types.
➔ It also lists the requested profiling statistics:
  ➔ Histograms
  ➔ Outliers
  ➔ Empty, invalid, and valid counts
➔ Extensible:
  ➔ Pairwise statistics
  ➔ User-defined functions
[Diagram: wrangled data and a JSON specification feed the Profiler Service]
Profiling DSL

{
  "input": "hdfs://hadoop.trifacta-dev.net:8020/trifacta…",
  "schema": {
    "order": ["column1", "column2", "column3"],
    "types": {
      "column1": ["Datetime", {"regexes": [...], "groupLocs": {...}}],
      "column2": [...],
      "column3": [...]
    }
  },
  "commands": [
    {
      "column": "*",
      "output": "hdfs://hadoop.trifacta-dev.net:8020/trifacta…",
      "profiler-type": "histogram",
      "params": {...}
    },
    {
      "column": "column1",
      "output": "hdfs://hadoop.trifacta-dev.net:8020/trifacta…",
      "profiler-type": "type-check",
      "params": {...}
    }
  ]
}
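A hedged sketch of how a service might interpret the "commands" array above; the Command case class and the dispatch logic are illustrative, not Trifacta's code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Mirrors one entry in the spec's "commands" array.
case class Command(column: String, output: String, profilerType: String)

def run(df: DataFrame, cmd: Command): Unit = {
  // "*" fans a command out to every column in the schema.
  val targets = if (cmd.column == "*") df.columns.toSeq else Seq(cmd.column)
  targets.foreach { c =>
    val result = cmd.profilerType match {
      case "histogram"  => df.groupBy(col(c)).count() // value frequencies
      case "type-check" => df.select(
        count(when(col(c).isNull, 1)).as("missing"),
        count(col(c)).as("present"))
      case other => sys.error(s"unknown profiler-type: $other")
    }
    result.write.mode("overwrite").json(s"${cmd.output}/$c")
  }
}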
Performance Improvements
➔ A few troublesome use cases:
  ➔ High cardinality
  ➔ Non-distributive stats
➔ Huge gains for large numbers of columns

Spark profiling speed-up vs. MapReduce:
                      Few Columns   Many Columns
  High Cardinality        2X             4X
  Low Cardinality        10X            20X
Pay-as-you-go Profiling
➔ Users do not always need profiles for all columns.
➔ Defer more complex and expensive statistics until they are requested (sketched below):
  ➔ Drilldown
  ➔ Iterative exploration
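A sketch of the on-demand pattern, with an assumed SparkSession spark and a hypothetical output path: cache the wrangled data, compute cheap statistics eagerly, and run expensive drilldowns only when a user asks.

// Cache so iterative exploration does not reread HDFS on every request.
val data = spark.read.parquet("hdfs:///wrangled/output").cache() // hypothetical path

// Eager, cheap pass: non-null counts for every column.
data.select(data.columns.map(c => count(col(c)).as(c)): _*).show()

// Deferred, expensive drilldown: exact distinct count for one column.
def drilldown(column: String): Long =
  data.select(column).distinct().count()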
Conclusion
Profiling informs data preparation.
Questions?
