Ivan Zoratti Big Data with MySQL Percona Live Santa Clara 2013 V1304.01 Friday, 3 May 13
Who is Ivan ? Friday, 3 May 13
SkySQL •Leading provider of open source databases, services and solutions •Home for the founders and the original developers of the core of MySQL •The creators of MariaDB, the drop-off, innovative replacement of MySQL Friday, 3 May 13
What is Big Data? http://marketingblogged.marketingmagazine.co.uk/files/Big-Data-3.jpg Friday, 3 May 13
PAGE Big Data! Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. 5 http://readwrite.com/files/styles/800_450sc/public/files/fields/shutterstock_bigdata.jpg Friday, 3 May 13
PAGE Big Data By Structure 6 Unstructured •Store everything you have/you find •In any format and shape •You do not know how to use it, but it may come handy •Storing unstructured data is usually cheaper than storing it in a more structured datastore •Does not fit well in a relational database •Examples: •Text: Plain text, documents, web content, messages •Bitmap: Image, audio, video •Typical approach: •Mining, pattern recognition, tagging •Usually batch analysis Structured •Store only what you need •In a good format, ready to be used •You should already know how to use it, or at least what it means •Storing structured data is quite expensive •Raw data, indexing, denormalisation, aggregation •Arelational database is still the best choice •Examples: •Machine-Generated Data (MGD) •Tags, counters, sales •Typical approach: •BI tools, reporting •Real time analysis change data capture Friday, 3 May 13
PAGE Unstructured •Store everything you have/you find •In any format and shape •You do not know how to use it, but it may come handy •Storing unstructured data is usually cheaper than storing it in a more structured datastore •Does not fit well in a relational database •Examples: •Text: Plain text, documents, web content, messages •Bitmap: Image, audio, video •Typical approach: •Mining, pattern recognition, tagging •Usually batch analysis Structured •Store only what you need •In a good format, ready to be used •You should already know how to use it, or at least what it means •Storing structured data is quite expensive •Raw data, indexing, denormalisation, aggregation •Arelational database is still the best choice •Examples: •Machine-Generated Data (MGD) •Tags, counters, sales •Typical approach: •BI tools, reporting •Real time analysis change data capture Big Data By Structure 7 Friday, 3 May 13
PAGE How “Big” is Big Data? •Data Factors •Size •Speed to collect/ generate •Variety •Resources •Administrators •Developers •Infrastructure •Growth •Collection •Processing •Availability •To whom? •For how long? •In which format? •Aggregated •Detailed 8 Friday, 3 May 13
PAGE How to manage Big Data •Collection - Storage -Archive •Load - Transform -Analyze •Access - Explore - Utilize 9 http://www.futuresmag.com/2012/07/01/big-data-manage-it-dont-drown-in-it Friday, 3 May 13
Big Data with MySQL http://news.mydosti.com/newsphotos/tech/BigDataV1Dec22012.jpg Friday, 3 May 13
PAGE Technologies to Use / Consider / Watch •MyISAM and MyISAM compression •InnoDB compression •MySQL 5.6 Partitioning •MariaDB Optimizer •MariaDB Virtual & Dynamic Columns •Cassandra Storage Engine •Connect Storage Engine •Columnar Databases •InfiniDB •Infobright •TokuDB Storage Engine 11 Friday, 3 May 13
PAGE Columnar Databases •Automatic compression •Automatic column storage •Data distribution •Map/Reduce approach •MPP / Parallel loading •No indexes •On public clouds, HW or SW appliances 12 Friday, 3 May 13
PAGE TokuDB •Increased Performance •Increased Compression •Online administration •No Index rebuild 13 Friday, 3 May 13
PAGE MyISAM •Static, dynamic and compressed format •Multiple key cache, CACHE INDEX and LOAD INDEX •Compressed tables •Horizontal partitioning (manual) •External locking 14 Friday, 3 May 13
PAGE InnoDB/XtraDB •Data Load •Pre-order data •Split data into chunks •unique_checks = 0; •foreign_key_checks = 0; •sql_log_bin = 0; •innodb_autoinc_lock_mode = 2; •Compression and block size •Persistent optimizer stats •innodb_stats_persistent •innodb_stats_auto_recalc 15 SET GLOBAL innodb_file_per_table = 1; SET GLOBAL innodb_file_format = Barracuda; CREATE TABLE t1 ( c1 INT PRIMARY KEY, c2 VARCHAR(255) ) ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8; LOAD   DATA LOCAL INFILE '/usr2/t1_01_simple' INTO TABLE t1; Query OK, 134217728 rows affected (1 hour 34 min 7.49 sec) Records: 134217728  Deleted: 0  Skipped: 0  Warnings: 0 LOAD   DATA LOCAL INFILE '/usr2/t1_01_simple' INTO TABLE t2; Query OK, 134217728 rows affected (25 min 20.75 sec) Records: 134217728  Deleted: 0  Skipped: 0  Warnings: 0 Friday, 3 May 13
PAGE Partitioning (MySQL 5.6) •Partitioning Types •RANGE, LIST, RANGE COLUMN, HASH, LINEAR HASH, KEY LINEAR KEY, sub-partitions •Partition and lock pruning •Use of INDEX and DATA DIRECTORY •PARTITIONADD, DROP, REORGANIZE, COALESCE, TRUNCATE, EXCHANGE, REBUILD, OPTIMIZE, CHECK, ANALYZE, REPAIR 16 CREATE TABLE t1 ( c1 INT, c2 DATE ) PARTITION BY RANGE( YEAR( c2 ) ) SUBPARTITION BY HASH ( TO_DAYS( c2 ) ) ( PARTITION p0 VALUES LESS THAN (1990) ( SUBPARTITION s0 DATA DIRECTORY = '/disk0/data' INDEX DIRECTORY = '/disk0/idx', SUBPARTITION s1 DATA DIRECTORY = '/disk1/data' INDEX DIRECTORY = '/disk1/idx' ),... ALTER TABLE t1 EXCHANGE PARTITION p3 WITH TABLE t2; -- Range and List partitions ALTER TABLE t1 REORGANIZE PARTITION p0,p1,p2,p3 INTO ( PARTITION m0 VALUES LESS THAN (1980), PARTITION m1 VALUES LESS THAN (2000)); -- Hash and Key partitions ALTER TABLE t1 COALESCE PARTITION 10; ALTER TABLE t1 ADD PARTITION PARTITIONS 5; Friday, 3 May 13
PAGE MariaDB Optimizer •Multi-Range Read (MRR)* •Index Merge / Sort intersection •Batch KeyAccess* •Block hash join •Cost-based choice of range vs. index_merge •ORDER BY ... LIMIT <limit>* •MariaDB 10 •Subqueries •Semi-join* •Materialization* •subquery cache •LIMIT ... ROWS EXAMINED <limit> 17 (*) - Available in MySQL 5.6 Friday, 3 May 13
PAGE Virtual & Dynamic Columns VIRTUAL COLUMNS •For InnoDB, MyISAM andAria •PERSISTENT (stored) or VIRTUAL (generated) 18 CREATE TABLE t1 ( c1 INT NOT NULL, c2 VARCHAR(32), c3 INT AS ( c1 MOD 10 ) VIRTUAL, c4 VARCHAR(5) AS ( LEFT(B,5) ) PERSISTENT); DYNAMIC COLUMNS •Implement a schemaless, document store •COLUMN_ CREATE,ADD, GET, LIST, JSON, EXISTS, CHECK, DELETE •Nested colums are allowed •Main datatypes are allowed •Max 1GB documents CREATE TABLE assets ( item_name VARCHAR(32) PRIMARY KEY, dynamic_cols BLOB ); INSERT INTO assets VALUES ( 'MariaDB T-shirt', COLUMN_CREATE( 'color', 'blue', 'size', 'XL' ) ); INSERT INTO assets VALUES ( 'Thinkpad Laptop', COLUMN_CREATE( 'color', 'black', 'price', 500 ) ); Friday, 3 May 13
PAGE Cassandra Storage Engine •Column Family == Table •Rowkey, static and dynamic columns allowed •Batch key access support SET cassandra_default_thrift_host = '192.168.0.10' CREATE TABLE cassandra_tbl ( rowkey INT PRIMARY KEY, col1 VARCHAR(25), col2 BIGINT, dyn_cols BLOB DYNAMIC_COLUMN_STORAGE = yes ) ENGINE = cassandra KEYSPACE = 'cassandra_key_space' COLUMN_FAMILY = 'column_family_name'; 19 Friday, 3 May 13
PAGE Connect Storage Engine •Any file format as MySQLTABLE: •ODBC •Text, XML, *ML •Excel,Access etc. •MariaDB CREATE TABLE options •Multi-file table •TableAutocreation •Condition push down •Read/Write and Multi Storage Engine Join •CREATE INDEX 20 CREATE TABLE handout ENGINE = CONNECT TABLE_TYPE = XML FILE_NAME = 'handout.htm' HEADER = yes OPTION_LIST = 'name = TABLE, coltype = HTML, attribute = (border=1;cellpadding=5)'; Friday, 3 May 13
Starting Your Big Data Project Friday, 3 May 13
PAGE Why would you use MySQL? • Time • Knowledge • Infrastructure • Costs • Simplified Integration • Not so “big” data 22 Friday, 3 May 13
PAGE Apache Hadoop & Friends 23 HDFS MapReduce PIG HIVE HCatalog HBASE ZooKeeper •Mahout •Ambari, Ganglia, Nagios •Sqoop •Cascading •Oozie •Flume •Protobuf, Avro, Thrift •Fuse-DFS •Chukwa •Cassandra Friday, 3 May 13
PAGE MySQL & Friends 24 MySQL/MariaDB/Storage Engines SQL Optimizer Scripts Stored Procedures DML DB Schema / DDL MySQL/MariaDB SkySQLDS •Mahout •SDS, Ganglia, Nagios •mysqlimport •Cascading •Talend, Pentaho •Connect Friday, 3 May 13
PAGE Join us at the Solutions Day •Cassandra and Connect Storage Engine •Map/Reduce approach - Proxy optimisation •Multiple protocols and more 25 Friday, 3 May 13
Thank You! ivan@skysql.com izoratti.blogspot.com www.slideshare.net/izorattiwww.skysql.com Friday, 3 May 13

Big Data with MySQL

  • 1.
    Ivan Zoratti Big Datawith MySQL Percona Live Santa Clara 2013 V1304.01 Friday, 3 May 13
  • 2.
  • 3.
    SkySQL •Leading provider ofopen source databases, services and solutions •Home for the founders and the original developers of the core of MySQL •The creators of MariaDB, the drop-off, innovative replacement of MySQL Friday, 3 May 13
  • 4.
    What is BigData? http://marketingblogged.marketingmagazine.co.uk/files/Big-Data-3.jpg Friday, 3 May 13
  • 5.
    PAGE Big Data! Big datais a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. 5 http://readwrite.com/files/styles/800_450sc/public/files/fields/shutterstock_bigdata.jpg Friday, 3 May 13
  • 6.
    PAGE Big Data ByStructure 6 Unstructured •Store everything you have/you find •In any format and shape •You do not know how to use it, but it may come handy •Storing unstructured data is usually cheaper than storing it in a more structured datastore •Does not fit well in a relational database •Examples: •Text: Plain text, documents, web content, messages •Bitmap: Image, audio, video •Typical approach: •Mining, pattern recognition, tagging •Usually batch analysis Structured •Store only what you need •In a good format, ready to be used •You should already know how to use it, or at least what it means •Storing structured data is quite expensive •Raw data, indexing, denormalisation, aggregation •Arelational database is still the best choice •Examples: •Machine-Generated Data (MGD) •Tags, counters, sales •Typical approach: •BI tools, reporting •Real time analysis change data capture Friday, 3 May 13
  • 7.
    PAGE Unstructured •Store everything youhave/you find •In any format and shape •You do not know how to use it, but it may come handy •Storing unstructured data is usually cheaper than storing it in a more structured datastore •Does not fit well in a relational database •Examples: •Text: Plain text, documents, web content, messages •Bitmap: Image, audio, video •Typical approach: •Mining, pattern recognition, tagging •Usually batch analysis Structured •Store only what you need •In a good format, ready to be used •You should already know how to use it, or at least what it means •Storing structured data is quite expensive •Raw data, indexing, denormalisation, aggregation •Arelational database is still the best choice •Examples: •Machine-Generated Data (MGD) •Tags, counters, sales •Typical approach: •BI tools, reporting •Real time analysis change data capture Big Data By Structure 7 Friday, 3 May 13
  • 8.
    PAGE How “Big” isBig Data? •Data Factors •Size •Speed to collect/ generate •Variety •Resources •Administrators •Developers •Infrastructure •Growth •Collection •Processing •Availability •To whom? •For how long? •In which format? •Aggregated •Detailed 8 Friday, 3 May 13
  • 9.
    PAGE How to manageBig Data •Collection - Storage -Archive •Load - Transform -Analyze •Access - Explore - Utilize 9 http://www.futuresmag.com/2012/07/01/big-data-manage-it-dont-drown-in-it Friday, 3 May 13
  • 10.
    Big Data withMySQL http://news.mydosti.com/newsphotos/tech/BigDataV1Dec22012.jpg Friday, 3 May 13
  • 11.
    PAGE Technologies to Use /Consider / Watch •MyISAM and MyISAM compression •InnoDB compression •MySQL 5.6 Partitioning •MariaDB Optimizer •MariaDB Virtual & Dynamic Columns •Cassandra Storage Engine •Connect Storage Engine •Columnar Databases •InfiniDB •Infobright •TokuDB Storage Engine 11 Friday, 3 May 13
  • 12.
    PAGE Columnar Databases •Automatic compression •Automaticcolumn storage •Data distribution •Map/Reduce approach •MPP / Parallel loading •No indexes •On public clouds, HW or SW appliances 12 Friday, 3 May 13
  • 13.
    PAGE TokuDB •Increased Performance •Increased Compression •Onlineadministration •No Index rebuild 13 Friday, 3 May 13
  • 14.
    PAGE MyISAM •Static, dynamic andcompressed format •Multiple key cache, CACHE INDEX and LOAD INDEX •Compressed tables •Horizontal partitioning (manual) •External locking 14 Friday, 3 May 13
  • 15.
    PAGE InnoDB/XtraDB •Data Load •Pre-order data •Splitdata into chunks •unique_checks = 0; •foreign_key_checks = 0; •sql_log_bin = 0; •innodb_autoinc_lock_mode = 2; •Compression and block size •Persistent optimizer stats •innodb_stats_persistent •innodb_stats_auto_recalc 15 SET GLOBAL innodb_file_per_table = 1; SET GLOBAL innodb_file_format = Barracuda; CREATE TABLE t1 ( c1 INT PRIMARY KEY, c2 VARCHAR(255) ) ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8; LOAD   DATA LOCAL INFILE '/usr2/t1_01_simple' INTO TABLE t1; Query OK, 134217728 rows affected (1 hour 34 min 7.49 sec) Records: 134217728  Deleted: 0  Skipped: 0  Warnings: 0 LOAD   DATA LOCAL INFILE '/usr2/t1_01_simple' INTO TABLE t2; Query OK, 134217728 rows affected (25 min 20.75 sec) Records: 134217728  Deleted: 0  Skipped: 0  Warnings: 0 Friday, 3 May 13
  • 16.
    PAGE Partitioning (MySQL 5.6) •PartitioningTypes •RANGE, LIST, RANGE COLUMN, HASH, LINEAR HASH, KEY LINEAR KEY, sub-partitions •Partition and lock pruning •Use of INDEX and DATA DIRECTORY •PARTITIONADD, DROP, REORGANIZE, COALESCE, TRUNCATE, EXCHANGE, REBUILD, OPTIMIZE, CHECK, ANALYZE, REPAIR 16 CREATE TABLE t1 ( c1 INT, c2 DATE ) PARTITION BY RANGE( YEAR( c2 ) ) SUBPARTITION BY HASH ( TO_DAYS( c2 ) ) ( PARTITION p0 VALUES LESS THAN (1990) ( SUBPARTITION s0 DATA DIRECTORY = '/disk0/data' INDEX DIRECTORY = '/disk0/idx', SUBPARTITION s1 DATA DIRECTORY = '/disk1/data' INDEX DIRECTORY = '/disk1/idx' ),... ALTER TABLE t1 EXCHANGE PARTITION p3 WITH TABLE t2; -- Range and List partitions ALTER TABLE t1 REORGANIZE PARTITION p0,p1,p2,p3 INTO ( PARTITION m0 VALUES LESS THAN (1980), PARTITION m1 VALUES LESS THAN (2000)); -- Hash and Key partitions ALTER TABLE t1 COALESCE PARTITION 10; ALTER TABLE t1 ADD PARTITION PARTITIONS 5; Friday, 3 May 13
  • 17.
    PAGE MariaDB Optimizer •Multi-Range Read(MRR)* •Index Merge / Sort intersection •Batch KeyAccess* •Block hash join •Cost-based choice of range vs. index_merge •ORDER BY ... LIMIT <limit>* •MariaDB 10 •Subqueries •Semi-join* •Materialization* •subquery cache •LIMIT ... ROWS EXAMINED <limit> 17 (*) - Available in MySQL 5.6 Friday, 3 May 13
  • 18.
    PAGE Virtual & DynamicColumns VIRTUAL COLUMNS •For InnoDB, MyISAM andAria •PERSISTENT (stored) or VIRTUAL (generated) 18 CREATE TABLE t1 ( c1 INT NOT NULL, c2 VARCHAR(32), c3 INT AS ( c1 MOD 10 ) VIRTUAL, c4 VARCHAR(5) AS ( LEFT(B,5) ) PERSISTENT); DYNAMIC COLUMNS •Implement a schemaless, document store •COLUMN_ CREATE,ADD, GET, LIST, JSON, EXISTS, CHECK, DELETE •Nested colums are allowed •Main datatypes are allowed •Max 1GB documents CREATE TABLE assets ( item_name VARCHAR(32) PRIMARY KEY, dynamic_cols BLOB ); INSERT INTO assets VALUES ( 'MariaDB T-shirt', COLUMN_CREATE( 'color', 'blue', 'size', 'XL' ) ); INSERT INTO assets VALUES ( 'Thinkpad Laptop', COLUMN_CREATE( 'color', 'black', 'price', 500 ) ); Friday, 3 May 13
  • 19.
    PAGE Cassandra Storage Engine •ColumnFamily == Table •Rowkey, static and dynamic columns allowed •Batch key access support SET cassandra_default_thrift_host = '192.168.0.10' CREATE TABLE cassandra_tbl ( rowkey INT PRIMARY KEY, col1 VARCHAR(25), col2 BIGINT, dyn_cols BLOB DYNAMIC_COLUMN_STORAGE = yes ) ENGINE = cassandra KEYSPACE = 'cassandra_key_space' COLUMN_FAMILY = 'column_family_name'; 19 Friday, 3 May 13
  • 20.
    PAGE Connect Storage Engine •Anyfile format as MySQLTABLE: •ODBC •Text, XML, *ML •Excel,Access etc. •MariaDB CREATE TABLE options •Multi-file table •TableAutocreation •Condition push down •Read/Write and Multi Storage Engine Join •CREATE INDEX 20 CREATE TABLE handout ENGINE = CONNECT TABLE_TYPE = XML FILE_NAME = 'handout.htm' HEADER = yes OPTION_LIST = 'name = TABLE, coltype = HTML, attribute = (border=1;cellpadding=5)'; Friday, 3 May 13
  • 21.
    Starting Your BigData Project Friday, 3 May 13
  • 22.
    PAGE Why would youuse MySQL? • Time • Knowledge • Infrastructure • Costs • Simplified Integration • Not so “big” data 22 Friday, 3 May 13
  • 23.
    PAGE Apache Hadoop &Friends 23 HDFS MapReduce PIG HIVE HCatalog HBASE ZooKeeper •Mahout •Ambari, Ganglia, Nagios •Sqoop •Cascading •Oozie •Flume •Protobuf, Avro, Thrift •Fuse-DFS •Chukwa •Cassandra Friday, 3 May 13
  • 24.
    PAGE MySQL & Friends 24 MySQL/MariaDB/StorageEngines SQL Optimizer Scripts Stored Procedures DML DB Schema / DDL MySQL/MariaDB SkySQLDS •Mahout •SDS, Ganglia, Nagios •mysqlimport •Cascading •Talend, Pentaho •Connect Friday, 3 May 13
  • 25.
    PAGE Join us atthe Solutions Day •Cassandra and Connect Storage Engine •Map/Reduce approach - Proxy optimisation •Multiple protocols and more 25 Friday, 3 May 13
  • 26.