Data Federation with Spark Dan Marshall danmarshall07@gmail.com 06/13/2017
PostgreSQL Source

postgres=# create table pg_employee (emp_id int, emp_name varchar, emp_title varchar, emp_hire_date date, emp_dept_id varchar);
postgres=# select * from pg_employee;
 emp_id |     emp_name     | emp_title | emp_hire_date | emp_dept_id
--------+------------------+-----------+---------------+-------------
      1 | Fred Flinstone   | Quarryman | 2001-07-01    | M1
      2 | Donald Duck      | Fisherman | 2011-04-28    | F1
      3 | Larry Fitzgerald | Receiver  | 2005-11-12    | S2
      4 | Randy Johnson    | Pitcher   | 2008-01-11    | S2
(4 rows)
PostgreSQL - Spark
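The code shown on this slide did not survive in the transcript. As a minimal sketch only, the PostgreSQL table could be loaded through Spark's built-in JDBC data source along these lines (PySpark). The host, port, database name, and credentials are placeholders, the helper names are hypothetical, and the PostgreSQL JDBC driver jar is assumed to be on the Spark classpath.

```python
def postgres_jdbc_options(table, host="localhost", port=5432,
                          database="postgres", user="postgres",
                          password="secret"):
    """Build the option map for Spark's JDBC data source.

    All connection values are placeholders (assumptions), not the
    settings used in the original deck.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_pg_employee(spark):
    """Read pg_employee into a DataFrame and register it for Spark SQL.

    `spark` is an existing SparkSession supplied by the caller.
    """
    options = postgres_jdbc_options("pg_employee")
    df = spark.read.format("jdbc").options(**options).load()
    # The view name is an assumption; it lets later SQL refer to the table.
    df.createOrReplaceTempView("pg_employee")
    return df
```

In a `pyspark` shell this would be called as `load_pg_employee(spark).show()`.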
HBase Source

hbase(main):004:0* create 'hb_dept','cf1'
=> Hbase::Table - hb_dept
hbase(main):008:0* put 'hb_dept','M1','cf1:dept_name','Maintenance'
hbase(main):009:0> put 'hb_dept','F1','cf1:dept_name','Entertainment'
hbase(main):010:0> put 'hb_dept','S2','cf1:dept_name','Sports'
hbase(main):012:0* scan 'hb_dept'
ROW    COLUMN+CELL
 F1    column=cf1:dept_name, timestamp=1496621309775, value=Entertainment
 M1    column=cf1:dept_name, timestamp=1496621309741, value=Maintenance
 S2    column=cf1:dept_name, timestamp=1496621309863, value=Sports
3 row(s) in 0.0590 seconds
HBase - Spark
Join – HBase and PostgreSQL
Cassandra Source

Connected to Test Cluster at cassandra:9042.
[cqlsh 5.0.1 | Cassandra 3.10 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> use mykeyspace;
cqlsh:mykeyspace> create table bonus_table (userid int primary key, bonus_amount decimal);
cqlsh:mykeyspace> insert into bonus_table (userid, bonus_amount) values (1, 500.00);
cqlsh:mykeyspace> insert into bonus_table (userid, bonus_amount) values (4, 1000.00);
cqlsh:mykeyspace> select * from bonus_table;

 userid | bonus_amount
--------+--------------
      1 |       500.00
      4 |      1000.00

(2 rows)
Cassandra – Spark
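The slide's code is not preserved here. A hedged sketch of reading `bonus_table` with the DataStax Spark Cassandra Connector might look like the following (PySpark). It assumes the connector package is on the classpath and that `spark.cassandra.connection.host` was set when the session was created, for example to the `cassandra` host seen in the cqlsh banner. The helper names are hypothetical.

```python
def cassandra_read_options(keyspace, table):
    """Option map understood by the Spark Cassandra Connector data source."""
    return {"keyspace": keyspace, "table": table}


def load_bonus_table(spark):
    """Read mykeyspace.bonus_table into a DataFrame.

    `spark` is an existing SparkSession that already carries the Cassandra
    connection settings (an assumption of this sketch).
    """
    options = cassandra_read_options("mykeyspace", "bonus_table")
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(**options)
            .load())
```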
Use SQL on DataFrame from Cassandra Source
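One way this step could be done, sketched under assumptions: register the Cassandra-backed DataFrame as a temporary view and query it with Spark SQL. The view name `bonus`, the helper names, and the threshold filter are illustrative and not taken from the deck.

```python
def bonus_query(min_bonus=0):
    """Return a Spark SQL statement over the (assumed) temp view `bonus`."""
    # float() keeps the interpolated value numeric rather than free-form text.
    threshold = float(min_bonus)
    return ("SELECT userid, bonus_amount FROM bonus "
            f"WHERE bonus_amount >= {threshold} ORDER BY userid")


def query_bonuses(spark, bonus_df, min_bonus=0):
    """Register the DataFrame as a temp view, then run SQL against it."""
    bonus_df.createOrReplaceTempView("bonus")
    return spark.sql(bonus_query(min_bonus))
```

A temporary view lives only for the lifetime of the SparkSession, so nothing is written back to Cassandra.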
Join – HBase, PostgreSQL, Cassandra
JSON Source

{"dept":"F1"}
{"dept":"S2"}
JSON Code
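The original code for this slide is missing from the transcript. As a rough sketch, Spark's JSON reader accepts files with one JSON object per line, which matches the source shown for this step. The file path, the view name `dept_filter`, and the helper names are assumptions.

```python
import json


def write_json_lines(path, records=None):
    """Write one compact JSON object per line (the layout Spark reads by default)."""
    if records is None:
        # Sample records copied from the JSON source slide.
        records = [{"dept": "F1"}, {"dept": "S2"}]
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, separators=(",", ":")) + "\n")


def load_dept_filter(spark, path):
    """Read the JSON Lines file into a DataFrame and register a temp view."""
    df = spark.read.json(path)
    df.createOrReplaceTempView("dept_filter")
    return df
```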
Join – HBase, PostgreSQL, Cassandra, JSON
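The join itself is not visible in the transcript, so the following is only one plausible shape for it. It assumes four temp views already exist: `pg_employee` (PostgreSQL), `hb_dept` with columns `dept_id` and `dept_name` (HBase row key and `cf1:dept_name`), `bonus` (Cassandra), and `dept_filter` (JSON). The view names, the HBase column names, the output view name, and the choice of inner versus left joins are all assumptions.

```python
# Assumed join: employees enriched with department name and bonus,
# restricted to the departments listed in the JSON source.
FEDERATED_SQL = """
SELECT e.emp_id, e.emp_name, e.emp_title, e.emp_hire_date,
       d.dept_name, b.bonus_amount
FROM pg_employee e
JOIN hb_dept d     ON e.emp_dept_id = d.dept_id
JOIN dept_filter f ON e.emp_dept_id = f.dept
LEFT JOIN bonus b  ON e.emp_id = b.userid
"""


def build_enriched_view(spark, view_name="emp_enriched"):
    """Run the federated join and register the result as a temp view."""
    enriched = spark.sql(FEDERATED_SQL)
    enriched.createOrReplaceTempView(view_name)
    return enriched
```

Under these assumptions the inner join on `dept_filter` would drop the M1 employee, and the left join would keep employees without a bonus row, leaving `bonus_amount` null for them.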
SQL on Final Temp View
JDBC Write
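The write code is not preserved in the transcript. A minimal sketch of writing a DataFrame back to PostgreSQL with `DataFrameWriter.jdbc` could look like this. The target table name, URL, credentials, and save mode are placeholders, and the helper names are hypothetical.

```python
def jdbc_write_properties(user="postgres", password="secret"):
    """Connection properties passed to DataFrameWriter.jdbc (placeholder values)."""
    return {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def write_enriched(enriched_df, table="emp_enriched",
                   url="jdbc:postgresql://localhost:5432/postgres",
                   mode="overwrite"):
    """Persist the enriched DataFrame as a PostgreSQL table.

    mode="overwrite" replaces an existing table; "append" would add rows.
    """
    enriched_df.write.jdbc(url=url, table=table, mode=mode,
                           properties=jdbc_write_properties())
```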
Enriched Table in PostgreSQL