Analytics with MariaDB ColumnStore: The Whys, Whats and Hows
Agenda • The Task - Analytics – Why and what • The Requirements – What do we need for analytics • The Solution – Column Based Storage • The Product – MariaDB AX and MariaDB ColumnStore • The Uses – MariaDB ColumnStore in action
Why Analytics and what do you get: A high-level view on analytics
Why Analytics? • Get the most value out of your data assets • Faster, better decision making • Cost reduction • New products and services
Types of analytics: What is happening? Why is it happening? What is likely to happen? What should I do about it?
Descriptive: What happened? ● Reports ○ Sales report ○ Expense summary ● Ad-hoc requests to analysts
Diagnostics: Why did it happen? • Aggregates: aggregate measures over one or more dimensions – Find total sales – Top five products ranked by sales • Roll-ups: aggregate at different levels of a dimension hierarchy – Given total sales by city, roll up to get sales by state • Drill-downs: the inverse of roll-ups – Given total sales by state, drill down to get totals by city • Slicing and dicing – Equality and range selections on one or more dimensions
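The roll-up and drill-down operations above map directly onto GROUP BY queries at different levels of the dimension hierarchy. A minimal sketch, using Python's built-in sqlite3 with a hypothetical `sales` table (the schema and data are illustrative, not from the source):

```python
import sqlite3

# Hypothetical sales table for illustration: state/city dimension, amount measure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("CA", "LA", 100.0), ("CA", "SF", 200.0), ("NY", "NYC", 300.0)],
)

# Drill-down level: total sales by city (the finer grain of the hierarchy)
by_city = conn.execute(
    "SELECT state, city, SUM(amount) FROM sales "
    "GROUP BY state, city ORDER BY state, city"
).fetchall()

# Roll-up: aggregate one level up the hierarchy, from city to state
by_state = conn.execute(
    "SELECT state, SUM(amount) FROM sales GROUP BY state ORDER BY state"
).fetchall()

print(by_city)   # [('CA', 'LA', 100.0), ('CA', 'SF', 200.0), ('NY', 'NYC', 300.0)]
print(by_state)  # [('CA', 300.0), ('NY', 300.0)]
```

Slicing and dicing are then just WHERE clauses (equality and range predicates) applied before the aggregation.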
Predictive: What is likely to happen? • Sales prediction – Analyze data to identify trends, spot weaknesses, or determine conditions across broader data sets for making decisions about the future • Targeted marketing – What is the likelihood of a customer buying a particular product, based on past buying behavior?
Real World Example - Visualization
Prescriptive: What is the best course of action? The paradox of choice: with too many choices, which one is the best?
Big Data Analytics Use Cases, by industry • Finance – Identify trade patterns – Detect fraud and anomalies – Predict trading outcomes • Manufacturing – Simulations to improve design/yield – Detect production anomalies – Predict machine failures (sensor data) • Telecom – Behavioral analysis of customer calls – Network analysis (performance and reliability) • Healthcare – Find genetic profiles/matches – Analyze health vs. spending – Predict viral outbreaks
Analytics Database requirements: Why this is different from OLTP and why indexes are not helpful
What is an OLTP workload? • OLTP applications represent the most common database workload • OLTP applications have a read/write ratio of maybe 50/50 – Web apps and e-commerce have more reads, ending up at maybe 90/10 • OLTP applications deal with data on a row-by-row level – Customer data, product data, order items etc. – Single rows are selected, inserted, updated and deleted, one by one or in small groups • OLTP data structures are somewhat of a representation of the business or the applications that manage the data – An order references a customer, an order item is linked to an order – Typically 3rd normal form or higher – Sometimes individual aspects break normal form, for performance reasons • Transactions and ACID properties are required
The analytics workload • Deals with data from a high-level perspective • Handles data in large groups of rows – SELECTs data by date, customer location, product id etc. – Data is loaded in batch or streamed in – Data is mostly just INSERTed • Dealing with individual data items is usually inefficient • Data structures are optimized for analytics use and performance • Data is sometimes purged, but just as often not • Contains structured, semi-structured and sometimes unstructured data • Data often comes from many different sources, internal and external • Queries are largely ad-hoc • Transactions and ACID requirements are relaxed
Analytics database requirements • Fast access to large amounts of data • Scalable as data grows over time – Analytics requirements are increasing – Regulatory requirements – New data sources are added • Load performance must be fast, scalable and predictable • Data loading should be very flexible due to the different sources of data – Some data is loaded in batch, other data is streamed • Query performance also needs to be scalable • Data compression is a requirement – Data size constraints, as well as read performance from disk
B-tree indexes, the good: • Well-known technology • Works with most types of data • Scales reasonably well • Really good for OLTP transactional data B-tree indexes, the bad: • Really bad for unbalanced data • Index modifications can be really slow • Index modifications are largely single-threaded • Slow down with the amount of data • Really not scalable with large amounts of data
In summary, what do we need? • Something that can compress data A LOT • Something that can be written to with fast and predictable performance • Something that doesn't necessarily support transactions – Transactions don't hurt, but performance is so much more important • Something that can support analytics queries – Ad-hoc queries – Aggregate queries • Something that can scale as data grows • Something that can still provide a level of high availability • Something that works with analytics tools, like Tableau, R etc.
The Solution: Distributed column-based storage
Existing Approaches • Data Warehouses – Limited real-time analytics – Slow releases of product innovation – Expensive hardware and software • Hadoop / NoSQL – Limited SQL support – Difficult to install/manage – Limited talent pool – Data lake with no data management – Hard to use
To the rescue – Column Based Storage • Data is stored column by column • Each column is stored in one or more extents – Each extent is represented by one file – Each extent is 8MB~64MB and holds roughly 8 million rows • Each extent is arranged in fixed-size blocks • Extents are compressed (using Snappy) • Data is one of – Fixed size (1, 2, 4 or 8 bytes) – Dictionary-based with a fixed-size pointer • Metadata is kept in an extent map – The extent map is held in memory – The extent map contains metadata on each extent, like min and max values
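The min/max metadata in the extent map is what lets a columnar engine skip whole extents during a range scan instead of using indexes. A minimal sketch of that idea (the extent size, structure, and function names are illustrative, not the ColumnStore implementation):

```python
# Toy model of columnar extents with min/max metadata ("extent elimination").
EXTENT_ROWS = 4  # tiny for illustration; ColumnStore extents hold ~8 million rows

def build_extents(column_values):
    """Split one column into fixed-size extents, recording min/max per extent."""
    extents = []
    for i in range(0, len(column_values), EXTENT_ROWS):
        chunk = column_values[i:i + EXTENT_ROWS]
        extents.append({"min": min(chunk), "max": max(chunk), "data": chunk})
    return extents

def scan_range(extents, lo, hi):
    """Return values in [lo, hi], skipping extents whose min/max rule them out."""
    hits, scanned = [], 0
    for ext in extents:
        if ext["max"] < lo or ext["min"] > hi:
            continue  # metadata proves no row in this extent can match
        scanned += 1
        hits.extend(v for v in ext["data"] if lo <= v <= hi)
    return hits, scanned

extents = build_extents([1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23])
values, scanned = scan_range(extents, 10, 13)
print(values, scanned)  # [10, 11, 12, 13] — only 1 of the 3 extents was read
```

This is why the deck can claim "metadata can replace indexing": for data loaded roughly in order (e.g. by time), most extents are eliminated before any disk I/O happens.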
To the rescue – Distributed data processing • Clients connect to a User Module • The User Module optimizes and controls the execution • Data is distributed among the Performance Modules • Data is stored, processed and managed by the Performance Modules • The Performance Modules process query primitives in parallel • The User Module combines the results from the Performance Modules
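The User Module / Performance Module split above is a scatter-gather pattern: partial aggregates are computed in parallel over each node's local data, then merged. A rough sketch of that division of labor, with illustrative names and threads standing in for distributed nodes (this is not the ColumnStore implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def performance_module(partition):
    """Each 'Performance Module' computes a partial aggregate over its local data."""
    return sum(partition), len(partition)  # partial (sum, count)

def user_module(partitions):
    """The 'User Module' fans the work out in parallel and merges partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(performance_module, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count  # final aggregate: the global average

# Data distributed across three "Performance Modules"
partitions = [[10, 20], [30, 40], [50, 60]]
print(user_module(partitions))  # 35.0
```

Note that the merge step depends on the aggregate: sums and counts combine trivially, while an average must be derived from them rather than averaging the per-node averages.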
MariaDB Analytics: MariaDB ColumnStore and MariaDB AX
MariaDB ColumnStore: a high-performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments • Faster, more efficient queries – Parallel query processing for distributed environments • Easier enterprise analytics – A single interface for OLTP and analytics – Brings the power of SQL and the freedom of open source to big data analytics • Easy to manage and scale – Better price/performance
MariaDB AX = MariaDB Server + MariaDB MaxScale + MariaDB ColumnStore • Parallel queries • Distributed storage • No indexes • Automatic partitioning • Read-optimized • High compression • Low disk IO
Easier Enterprise Analytics • Single SQL front-end – Use a single SQL interface for analytics and OLTP – Leverage MariaDB security features: encryption for data in motion, role-based access and auditability • Full ANSI SQL – No more SQL-like query languages – Supports complex joins, aggregation and window functions • Easy to manage and scale – Eliminates the need for indexes and views – Automated horizontal/vertical partitioning – Linearly scalable by adding new nodes as data grows – Out-of-the-box connections to BI tools
Faster, More Efficient Queries • Optimized columnar storage – Columnar storage reduces disk I/O – Blazing fast for read-intensive workloads – Ultra-fast data import • Parallel distributed query execution – Distributes queries into a series of parallel operations – Fully parallel high-speed data ingestion • Highly available analytic environment – Built-in redundancy – Automatic failover
MariaDB ColumnStore Analytics Use Cases
Healthcare / Life Science Industry • Genome analysis – In-depth genome research for the dairy industry to improve production of milk and protein – Fast data load for large genome datasets (DNA data for 7 billion cows in the US – 20GB per load) • Healthcare spending analysis – Analyze 3TB of US healthcare spending for 155 conditions with 7 years of historical data – Used Sankey diagrams, treemaps, and pyramid charts to analyze trends by age, sex, type of care, and condition • Why MariaDB ColumnStore – Strong security features including role-based data access and an audit plug-in – MPP architecture handles analytics on big data with high speed – Easy to analyze archived data with SQL-based analytics – Does not require a DBA to index or partition data
Telecommunication Industry • Customer behavior analysis – Analyze call data records to segment customers based on their behavior – Data-driven analysis for customer satisfaction – Create behavior-based upsell or cross-sell opportunities • Call data analysis – Data size: 6TB – Ingest 1.5 million rows of logs per day, with 30 million texts and 3 million calls – Call and network quality analysis – Provide higher-quality customer service based on data • Why MariaDB ColumnStore – ColumnStore supports time-based partitioning and time-series analysis – Fast data load for real-time analytics – MPP architecture handles analytics on big data with high speed – Easy to analyze archived data with SQL-based analytics
In Conclusion • Analytics requires a different technology to be able to cope with – Different types of data – Different types of data access • OLTP databases have different requirements compared to analytics databases • Column-based storage allows high compression • Metadata can replace indexing • Distributed processing allows for performance and scalability • MariaDB ColumnStore implements a fast and efficient distributed database for analytics • MariaDB AX is the subscription for professional use of MariaDB ColumnStore • MariaDB ColumnStore is gaining wide acceptance
Thank you
