Data Pipeline Using Spark, MapR Event Store for Kafka, and MapR Database

© 2018 MapR Technologies, Inc

Agenda
• Overview of the Kafka API
• Use Spark Structured Streaming to continuously:
    • Read from a Kafka topic
    • Transform
    • Write to the MapR document database
• Analyze using Spark SQL

Use Case: Open Payment Dataset
• Payments that drug and device companies make to physicians and teaching hospitals for travel, research, gifts, speaking fees, and meals
• https://www.cms.gov/openpayments/

InPatient Payment Dataset (IPPS)
• Medicare payment data for most common inpatient diagnoses
• Data shows dramatically different charges and payments
• https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-reports/medicare-provider-charge-data/inpatient.html

+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|Provider ID|Provider State|             DRG|Total Discharges|Average Charges|Average Total Payments|Avg Medicare Payments|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|    1261770|            CA|Heart Transplant|              13|      1,172,866|               251,876|              244,457|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+

What is Streaming Data?

What is a Stream?
• A stream is a continuous sequence of events
• Events are key-value pairs

What is Streaming Data?
• Fraud detection
• Smart machinery
• Smart meters
• Home automation
• Networks
• Manufacturing
• Security systems
• Patient monitoring

Examples of Streaming Data
• Monitoring devices combined with ML can provide alerts for sepsis, which is one of the leading causes of death in hospitals
• http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records

Example of Streaming Data Combined with Machine Learning
• A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert
• https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/

Example of Streaming Data Combined with Machine Learning
• The ECG app on the Apple Watch can check heart rhythms and send a notification if an irregular heart rhythm is identified
• https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart-rhythm-notification-available-today-on-apple-watch/

Applying Machine Learning to Live Patient Data
• https://mapr.com/blog/ml-iot-connected-medical-devices/

Kafka API and Streaming Data

Intro to the Kafka API

Organize Data into Topics with the MapR Event Store for Kafka
• Topics: logical collection of events
• Organize events into categories

Organize Data into Topics with the MapR Event Store for Kafka
• Topics: logical collection of events
• Organize events into categories
[Diagram: a MapR cluster holding the topics Pressure, Temperature, and Warnings, with consumers connected through the Kafka API]

Scalable Messaging with MapR Event Streams
• Topics are partitioned for throughput and scalability
[Diagram: the topics Pressure, Temperature, and Warning each have three partitions; Server 1 holds partition 1 of each topic, Server 2 holds partition 2, and Server 3 holds partition 3]
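
The slide does not say how an event is routed to a partition. A minimal sketch, assuming the common convention of hashing the event key modulo the number of partitions; `Event`, `Partitioner`, and `partitionFor` are hypothetical names for illustration, and real clients use their own hash function:

```scala
// Hypothetical event type: the slides describe events as key-value pairs.
final case class Event(key: String, value: String)

object Partitioner {
  // Assumption: choose a partition by hashing the key modulo the partition count.
  // Math.floorMod keeps the result non-negative even for negative hash codes.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Group a batch of events by the partition each one would be written to.
  def assign(events: Seq[Event], numPartitions: Int): Map[Int, Seq[Event]] =
    events.groupBy(e => partitionFor(e.key, numPartitions))
}
```

Because the partition depends only on the key, events with the same key always land in the same partition, which keeps their relative order.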

Partition is like an Event Log
• New messages are added to the end
[Diagram: a partition holding messages 1 (old) through 6 (new)]

Partition is like a Queue
• Messages are delivered in the order they are received

Unlike a Queue, Events Are Not Deleted When Read
• Events remain on the partition, available to other consumers
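
The log-versus-queue behavior described on these slides can be sketched with a small in-memory model. This is an illustration only, not the Kafka API; `PartitionLog` and `Consumer` are hypothetical names:

```scala
// Hypothetical in-memory model of one partition: an append-only log.
final class PartitionLog[A] {
  private val messages = scala.collection.mutable.ArrayBuffer.empty[A]

  // New messages are added to the end; the returned offset is their position.
  def append(message: A): Long = {
    messages += message
    messages.length - 1L
  }

  // Reading removes nothing: return up to `max` messages starting at `offset`.
  def read(offset: Long, max: Int): Seq[A] =
    messages.slice(offset.toInt, offset.toInt + max).toList
}

// Each consumer tracks its own position, so consumers do not affect each other.
final class Consumer[A](log: PartitionLog[A]) {
  private var position = 0L

  def poll(max: Int): Seq[A] = {
    val batch = log.read(position, max)
    position += batch.size
    batch
  }
}
```

Two `Consumer` instances created over the same `PartitionLog` each see every message in order, which is the property that lets the same events serve different purposes.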

When Are Messages Deleted?
• Messages can be persisted forever, or
• Older messages can be deleted automatically based on time to live
• Not deleting messages minimizes disk reads and writes
[Diagram: Partition 1 in a MapR cluster holding messages 1 (older) through 6]
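
A minimal sketch of time-to-live retention, assuming each stored message carries a write timestamp. `StoredMessage` and `Retention.retain` are hypothetical names, not part of any MapR or Kafka API:

```scala
// Hypothetical stored message with the time (epoch milliseconds) it was written.
final case class StoredMessage(offset: Long, timestampMs: Long, value: String)

object Retention {
  // ttlMs = None models "persist forever";
  // Some(ttl) keeps only messages whose age is at most ttl milliseconds.
  def retain(log: Seq[StoredMessage], nowMs: Long, ttlMs: Option[Long]): Seq[StoredMessage] =
    ttlMs match {
      case None      => log
      case Some(ttl) => log.filter(m => nowMs - m.timestampMs <= ttl)
    }
}
```

Expiry here depends only on message age, never on whether a consumer has read the message.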

High Performance at Scale
• Partitioning for parallel operations
• Not deleting messages minimizes disk reads and writes

Processing the Same Message for Different Purposes

Georgia Health Connect
[Diagram labels: ALLOY Health: Exchange, State HIE, Clinical Data Viewer, Reporting and Analytics, Clinical Data, Financial Data, Provider Organizations]
• What are the outcomes in the entire state on diabetes?
• Are there doctors that are doing this better than others?

Use Case: Streaming System of Record for Healthcare

Spark Structured Streaming

Spark Distributed Datasets
• Distributed collection of objects: Dataset[T]
• Partitioned across a cluster
• Operated on in parallel
• Can be cached in memory

DataFrame = Dataset[Row] can use Spark SQL
• A DataFrame is like a table
[Diagram: Dataset[Row] shown as rows and columns]

Dataset[Typed Object] can use Spark SQL and Functions
• A Dataset is a distributed collection of objects
[Diagram: a partitioned Dataset[objects] shown as objects and columns]

Process the Data with Spark Structured Streaming

Datasets Read from Stream
• Data is cached for aggregations and windowed functions
[Diagram labels: Driver, offsets, three stream partitions, three tasks that each process and cache data]

Treat Stream as Unbounded Tables
• Data stream as an unbounded table
• New data in the data stream = new rows appended to an unbounded table
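
The unbounded-table idea can be illustrated without Spark. The sketch below is a conceptual model only, not how Spark implements it: folding each newly arrived batch into the previous result gives the same answer as re-running the query over every row appended so far. `UnboundedTable`, `countByKey`, and `update` are hypothetical names:

```scala
object UnboundedTable {
  // Batch view: count rows per key over the whole table seen so far.
  def countByKey(table: Seq[String]): Map[String, Long] =
    table.groupBy(identity).map { case (k, rows) => k -> rows.size.toLong }

  // Streaming view: fold the newly arrived rows into the previous result.
  def update(previous: Map[String, Long], newRows: Seq[String]): Map[String, Long] =
    newRows.foldLeft(previous) { (acc, k) => acc.updated(k, acc.getOrElse(k, 0L) + 1L) }
}
```

For example, applying `update` to the batches `Seq("CA", "AZ")` and then `Seq("CA")` yields the same map as `countByKey(Seq("CA", "AZ", "CA"))`.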

The Stream is Continuously Processed

Spark automatically streamifies SQL plans
(Image reference: Databricks)

Stream Processing

Use Case: Payment Data

CSV record:

    "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017"

JSON document:

    {
      "_id":"317150_08/26/2016_346122858",
      "physician_id":"317150",
      "date_payment":"08/26/2016",
      "record_id":"346122858",
      "payer":"Mission Pharmacal Company",
      "amount":9.23,
      "Physician_Specialty":"Obstetrics & Gynecology",
      "Nature_of_payment":"Food and Beverage"
    }

Scenario: Payment Data

+-----------+----------+-----------------------------------+-----------+------------------------------+--------------+------+-----------------+
|Provider ID|      Date|                              Payer|Payer State|            Provider Specialty|Provider State|Amount|   Payment Nature|
+-----------+----------+-----------------------------------+-----------+------------------------------+--------------+------+-----------------+
|    1261770|01/11/2016|Southern Anesthesia & Surgical, Inc|         CO|Oral and Maxillofacial Surgery|            CA| 117.5|Food and Beverage|
+-----------+----------+-----------------------------------+-----------+------------------------------+--------------+------+-----------------+

Streaming Pipeline: Kafka Data Source

    val df1 = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "maprdemo:9092")
      .option("subscribe", "/apps/stream:payments")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", false)
      .option("maxOffsetsPerTrigger", 1000)
      .load()

Kafka DataFrame Schema

    df1.printSchema()
    root
     |-- key: binary (nullable = true)
     |-- value: binary (nullable = true)
     |-- topic: string (nullable = true)
     |-- partition: integer (nullable = true)
     |-- offset: long (nullable = true)
     |-- timestamp: timestamp (nullable = true)
     |-- timestampType: integer (nullable = true)

Function to Parse CSV Data to Payment Object

    case class Payment(_id: String, physician_id: String, date_payment: String,
      payer: String, payer_state: String, amount: Double, physician_type: String,
      physician_specialty: String, physician_state: String, nature_of_payment: String)

    def parsePayment(str: String): Payment = {
      // split on commas that are outside double-quoted fields
      val td = str.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
      val physician_id = td(5)
      val amount = td(30).toDouble
      . . .
      Payment(id, physician_id, date_payment, payer, payer_state, amount,
        physician_type, focus, physician_state, nature_of_payment)
    }
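
The split pattern is the least obvious part of the parser. The standalone sketch below shows how it behaves on a shortened record; `CsvSplit`, `splitLine`, and `unquote` are hypothetical helpers, and the `-1` limit (which keeps trailing empty fields) is an addition that is not on the slide:

```scala
object CsvSplit {
  // Split on commas followed by an even number of double quotes up to the end
  // of the line, i.e. commas that sit outside any quoted field.
  private val Delimiter = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

  // The limit -1 keeps trailing empty fields instead of dropping them.
  def splitLine(line: String): Array[String] =
    line.split(Delimiter, -1)

  // Trim whitespace and remove one pair of surrounding quotes, if present.
  def unquote(field: String): String =
    field.trim.stripPrefix("\"").stripSuffix("\"")
}
```

Applied to `"DFINE, Inc","CA", 90.87,,` this yields five fields: the comma inside `"DFINE, Inc"` is not treated as a delimiter, and the two trailing empty fields are preserved.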

Parse Message Text to Payment Object

    // register a user-defined function (UDF) to deserialize the message
    spark.udf.register("deserialize", (message: String) => parsePayment(message))

    // use the UDF in a select expression
    val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""")
      .select($"message".as[Payment])

Writing to a Memory Sink

Write results to memory:

    // the memory sink registers an in-memory table named after the query
    val query = cdf.writeStream
      .queryName("payments")
      .format("memory")
      .outputMode("append")

Start running the query:

    query.start().awaitTermination()

Streaming Application

    %sql select * from payments limit 4

+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
|                 _id|physician_id|date_payment|               payer|payer_state|  amount|     physician_type| physician_specialty|physician_state| nature_of_payment|
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
|CA_Diagnostic Rad...|      132655|  02/12/2016|          DFINE, Inc|         CA|   90.87|     Medical Doctor|Diagnostic Radiology|             CA| Food and Beverage|
|AZ_Vascular & Int...|     1006832|  01/26/2016|          DFINE, Inc|         CA|    25.0|     Medical Doctor|Vascular & Interv...|             AZ|Travel and Lodging|
|AZ_Vascular & Int...|     1006832|  02/13/2016|          DFINE, Inc|         CA|    32.0|     Medical Doctor|Vascular & Interv...|             AZ|Travel and Lodging|
|AZ_Vascular & Int...|     1006832|  02/19/2016|          DFINE, Inc|         CA|   27.27|     Medical Doctor|Vascular & Interv...|             AZ|Travel and Lodging|
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+

Streaming Application

    %sql select * from payments limit 3

Streaming Application

    %sql select physician_specialty, count(*) as cnt from payment2 group by physician_specialty order by cnt desc

Spark & MapR-DB

Spark Streaming Writing to MapR-DB JSON
• MapR-DB Connector for Apache Spark

Spark MapR-DB Connector

Designed for Partitioning and Scaling

MapR-DB JSON Document Store
• Data is automatically partitioned and sorted by _id row key!

    {
      "_id":"AK_08/26/2016_346122858",
      "physician_id":"317150",
      "date_payment":"08/26/2016",
      "payer":"Mission Pharmacal Company",
      "payer_state":"CO",
      "amount":9.23,
      "physician_specialty":"Gynecology",
      "physician_state":"AK",
      "nature_of_payment":"Food and Beverage"
    }

MapR Database Shell

    maprdb mapr:> find /user/mapr/paytable --limit 2
    {"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21, "date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage", "payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545", "physician_specialty":"Specialist", "physician_state":"AK", "physician_type":"Medical Doctor"}
    {"_id":"AK_01/04/2016_100801_396460394","amount":11.59, "date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage", "payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801", "physician_specialty":"Family Medicine", "physician_state":"AK", "physician_type":"Medical Doctor"}
    2 document(s) found.

• Data is automatically partitioned and sorted by _id row key!
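
Because rows are sorted by `_id`, a composite key that starts with the physician state keeps all documents for one state next to each other. The sketch below uses a `TreeMap` as a stand-in for the sorted table (an assumption for illustration only); `RowKeys`, `makeId`, and `scanPrefix` are hypothetical names, and the key layout is inferred from the shell output above:

```scala
import scala.collection.immutable.TreeMap

object RowKeys {
  // Compose the _id in the order seen in the shell output:
  // physician state, payment date, physician id, record id.
  def makeId(state: String, date: String, physicianId: String, recordId: String): String =
    Seq(state, date, physicianId, recordId).mkString("_")

  // Keys sharing a prefix are contiguous in sorted order, so a prefix lookup
  // is a range scan that stops at the first key without the prefix.
  def scanPrefix[V](table: TreeMap[String, V], prefix: String): Seq[(String, V)] =
    table.iteratorFrom(prefix).takeWhile { case (k, _) => k.startsWith(prefix) }.toList
}
```

With this layout, `scanPrefix(table, "AK_")` touches only the Alaska rows rather than the whole table.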

Writing to a MapR-DB Sink

Write streaming DataFrame query results to MapR-DB:

    val query = cdf.writeStream
      .format(MapRDBSourceConfig.Format)
      .option(MapRDBSourceConfig.TablePathOption, tableName)
      .option(MapRDBSourceConfig.IdFieldPathOption, "_id")
      .option(MapRDBSourceConfig.CreateTableOption, false)
      .option("checkpointLocation", "/user/mapr/check")
      .option(MapRDBSourceConfig.BulkModeOption, true)
      .option(MapRDBSourceConfig.SampleSizeOption, 1000)

Start running the query:

    query.start().awaitTermination()

Streaming Dataset Read from Topic Partitions, Transformed, Written to MapR Database Partitions

Store in MapR Database

Explore the Data With Spark SQL

Spark SQL Querying MapR-DB JSON
• Spark SQL queries and updates to MapR-DB
• With projection and filter pushdown, custom partitioning, and data locality

Spark Distributed Datasets Read from MapR-DB Partitions

    val pdf: Dataset[Payment] = spark
      .loadFromMapRDB[Payment](tableName, schema)
      .as[Payment]

[Diagram labels: Driver, Workers, Tasks, Cache 1 to 3, Process & Cache Data]

Load the Data into a DataFrame
• Data is automatically partitioned and sorted by _id row key!

    // register the view under the name used by the %sql queries
    pdf.createOrReplaceTempView("payments")
    pdf.select("_id", "payer", "amount").show
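
Now We Can Ask Questions Like:
• Which physician specialties, or physician states, are getting the most payments?
    • Count, total amount $, average amount $, standard deviation
• Which companies, or company states, are making the most payments?
    • Count, total amount $, average amount $, standard deviation
• What are the top Nature of Payments by count, amount?
• What are the payments over $2,000,000?

+------------+----------+-----------------------------------+-----------+------------------------------+---------------+------+-----------------+
|Physician ID|      Date|                              Payer|Payer State|           Physician Specialty|Physician State|Amount|   Payment Nature|
+------------+----------+-----------------------------------+-----------+------------------------------+---------------+------+-----------------+
|     1261770|01/11/2016|Southern Anesthesia & Surgical, Inc|         CO|Oral and Maxillofacial Surgery|             CA| 117.5|Food and Beverage|
+------------+----------+-----------------------------------+-----------+------------------------------+---------------+------+-----------------+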
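
To make the four statistics in these questions concrete, here is a minimal sketch on plain Scala collections rather than Spark. `PaymentRow`, `Stats`, and `PaymentStats` are hypothetical names, and the sample (n - 1) standard deviation is an assumption about which variant is wanted:

```scala
// Hypothetical, trimmed-down record with only the fields needed here.
final case class PaymentRow(physicianSpecialty: String, amount: Double)

final case class Stats(count: Long, total: Double, mean: Double, stddev: Double)

object PaymentStats {
  // Count, total, mean, and sample standard deviation (n - 1 denominator).
  def summarize(amounts: Seq[Double]): Stats = {
    val n = amounts.size
    val total = amounts.sum
    val mean = if (n == 0) 0.0 else total / n
    val variance =
      if (n < 2) 0.0
      else amounts.map(a => math.pow(a - mean, 2)).sum / (n - 1)
    Stats(n.toLong, total, mean, math.sqrt(variance))
  }

  // Statistics per specialty, largest total first.
  def bySpecialty(rows: Seq[PaymentRow]): Seq[(String, Stats)] =
    rows.groupBy(_.physicianSpecialty)
      .map { case (specialty, rs) => specialty -> summarize(rs.map(_.amount)) }
      .toSeq
      .sortBy { case (_, stats) => -stats.total }
}
```

The same grouping could be repeated with the payer or the payer state as the key to answer the questions about companies.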

Language Integrated Queries

L I Query                       | Description
--------------------------------+---------------------------------------------------------------------------
agg(expr, exprs)                | Aggregates on entire DataFrame
distinct                        | Returns new DataFrame with unique rows
except(other)                   | Returns new DataFrame with rows from this DataFrame not in other DataFrame
filter(expr); where(condition)  | Filter based on the SQL expression or condition
groupBy(cols: Columns)          | Groups DataFrame using specified columns
join(DataFrame, joinExpr)       | Joins with another DataFrame using given join expression
sort(sortcol)                   | Returns new DataFrame sorted by specified column
select(col)                     | Selects set of columns

MapR Data Platform

Top 5 Nature of Payment by Count

    import org.apache.spark.sql.functions.desc

    // count() adds a column named "count"; sort on it by name
    pdf.groupBy("Nature_of_Payment")
      .count()
      .orderBy(desc("count"))
      .show(5)

Top 5 Nature of Payment by Amount of Payment

    %sql select Nature_of_payment, sum(amount) as total
    from payments
    group by Nature_of_payment
    order by total desc limit 5

Top 5 Physician Specialties by Total Amount

    %sql select physician_specialty, sum(amount) as total
    from payments
    where physician_specialty IS NOT NULL
    group by physician_specialty
    order by total desc limit 5

Average Payment by Specialty

What Are the Nature of Payments with Payments > $1000?

    pdf.filter($"amount" > 1000)
      .groupBy("Nature_of_payment")
      .count().orderBy(desc("count")).show()

Top Payers by Total Amount with Count

    %sql select payer, payer_state, count(*) as cnt, sum(amount) as total
    from payments
    group by payer, payer_state
    order by total desc limit 10

Scenario: InPatient Payment Data

+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|Provider ID|Provider State|             DRG|Total Discharges|Average Charges|Average Total Payments|Avg Medicare Payments|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|    1261770|            CA|Heart Transplant|              13|      1,172,866|               251,876|              244,457|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+

MapR Database Shell

    maprdb mapr:> find /user/mapr/ippstable --limit 2
    {"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments": 244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619 SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033", "provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL - Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13, "year":"2014"}
    {"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments": 133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX", "provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ - Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014"}
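
Now We Can Ask Questions Like:
• What are the top charges and payments by hospital?
    • By hospital state?
• What are the top charges and payments by diagnosis?
• What is the most common diagnosis?
• What are the statistics for payments for the most common diagnosis (sepsis)?
• What are the most expensive states for the most common diagnosis, sepsis?

+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|Provider ID|Provider State|             DRG|Total Discharges|Average Charges|Average Total Payments|Avg Medicare Payments|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+
|    1261770|            CA|Heart Transplant|              13|      1,172,866|               251,876|              244,457|
+-----------+--------------+----------------+----------------+---------------+----------------------+---------------------+

InPatient Prospective Payment System Payments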

To Download the Code: Streaming Data Pipeline Blogs
• https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/
• https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare-dataset-mapr-db/

New Spark Ebook
• http://mapr.com/ebooks/

Read about Streaming Architectures
• mapr.com/ebooks

Read about Machine Learning Logistics
• mapr.com/ebooks

To Learn More: New Spark 2.0 Training
• MapR Free ODT: http://learn.mapr.com/

MapR Blog
• https://mapr.com/blog/

To Learn More:
• https://mapr.com/blog/ml-iot-connected-medical-devices/

To Learn More:
• https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing-healthcare-platforms/

Applying Machine Learning to Live Patient Data
• https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

  • 1.
  • 2.
    2 © 2018MapR Technologies, Inc •  Overview of Kafka API •  Use Spark Structured Streaming to continously: •  Read from Kafka topic •  Transfom •  Write to MapR document database •  Analyze using Spark SQL Agenda 2
  • 3.
    3 © 2018MapR Technologies, Inc Use Case: Open Payment Dataset •  Payments Drug and Device companies make to •  Physicians and Teaching Hospitals for •  Travel, Research, Gifts, Speaking fees, and Meals •  https://www.cms.gov/openpayments/
  • 4.
    4 © 2018MapR Technologies, Inc •  Medicare payment data for most common inpatient diagnoses •  Data shows dramatically different charges and payments •  https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and- reports/medicare-provider-charge-data/inpatient.html InPatient Payment Dataset (IPPS) Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  • 5.
  • 6.
    6 © 2018MapR Technologies, Inc What is a Stream ? •  A stream is a continuous sequence of events •  Events are key-value pairs
  • 7.
    7 © 2018MapR Technologies, Inc What is Streaming Data? Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  • 8.
    8 © 2018MapR Technologies, Inc Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals – http://www.computerweekly.com/news/450422258/Putting-sepsis- algorithms-into-electronic-patient-records Examples of Streaming Data
  • 9.
    9 © 2018MapR Technologies, Inc A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert •  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready- to-play-doctor/ Example of Streaming Data combined with Machine Learning
  • 10.
    10 © 2018MapR Technologies, Inc the ECG app on the Apple Watch can check heart rhythms and send a notification if an irregular heart rhythm is identified. •  https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart- rhythm-notification-available-today-on-apple-watch/ Example of Streaming Data combined with Machine Learning
  • 11.
    11 © 2018MapR Technologies, Inc https://mapr.com/blog/ml-iot-connected-medical-devices/ Applying Machine Learning to Live Patient Data
  • 12.
  • 13.
    13 © 2018MapR Technologies, Inc Intro to the Kafka API
  • 14.
    14 © 2018MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka
  • 15.
    15 © 2018MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 16.
    16 © 2018MapR Technologies, Inc Topics are partitioned for throughput and scalability Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning
  • 17.
    17 © 2018MapR Technologies, Inc New Messages are Added to the end Partition is like an Event Log New Message 6 5 4 3 2 1 Old Message
  • 18.
    18 © 2018MapR Technologies, Inc Messages are delivered in the order they are received Partition is like a Queue
  • 19.
    19 © 2018MapR Technologies, Inc Events remain on the partition, available to other consumers Unlike a queue, Events are not deleted when Read
  • 20.
    20 © 2018MapR Technologies, Inc Messages can be persisted forever Or Older messages can be deleted automatically based on time to live Not deleting messages, minimizes disk read/writes When Are Messages Deleted? MapR Cluster 6 5 4 3 2 1Partition 1 Older message
  • 21.
    21 © 2018MapR Technologies, Inc High Performance at Scale: •  Partitioning for Parallel operations •  Not deleting messages, minimizes disk read/writes
  • 22.
    22 © 2018MapR Technologies, Inc Processing Same Message for Different Purposes
  • 23.
    23 © 2018MapR Technologies, Inc Georgia Health Connect ALLOY Health: Exchange State HIE Clinical Data Viewer Reporting and Analytics Clinical Data Financial Data Provider Organizations What are the outcomes in the entire state on diabetes? Are there doctors that are doing this better than others? Georgia Health Connect
  • 24.
    24 © 2018MapR Technologies, Inc Use Case: Streaming System of Record for Healthcare
  • 25.
  • 26.
    26 © 2018MapR Technologies, Inc Spark Distributed Datasets partitioned •  Distributed collection of objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  in memory can be Cached
  • 27.
    27 © 2018MapR Technologies, Inc DataFrame = Dataset[Row] can use Spark SQL DataFrame is like a table Dataset[Row] row columns
  • 28.
    28 © 2018MapR Technologies, Inc Dataset[Typed Object] can use Spark SQL and Functions A Dataset is a distributed collection of objects Dataset[objects] object columns partitioned
  • 29.
    29 © 2018MapR Technologies, Inc Process the Data with Spark Structured Streaming
  • 30.
    30 © 2018MapR Technologies, Inc Datasets Read from Stream Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  • 31.
    31 © 2018MapR Technologies, Inc new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  • 32.
    32 © 2018MapR Technologies, Inc The Stream is continuously processed
  • 33.
    33 © 2018MapR Technologies, Inc Spark automatically streamifies SQL plans Image reference Databricks
  • 34.
    34 © 2018MapR Technologies, Inc Stream Processing
  • 35.
    35 © 2018MapR Technologies, Inc Use Case: Payment Data "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" }
  • 36.
    36 © 2018MapR Technologies, Inc Scenario: Payment Data Provider ID Date Payer Payer State Provider Specialty Provider State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  • 37.
    37 © 2018MapR Technologies, Inc val df1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "maprdemo:9092") .option("subscribe", "/apps/stream:payments”) .option("startingOffsets", "earliest") .option("failOnDataLoss", false) .option("maxOffsetsPerTrigger", 1000) .load() Streaming pipeline Kafka Data source
  • 38.
    38 © 2018MapR Technologies, Inc df1.printSchema() root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true) Kafka DataFrame schema
  • 39.
    39 © 2018MapR Technologies, Inc case class Payment(_id: String, physician_id: String, date_payment: String, payer: String, payer_state: String, amount: Double, physician_type: String,physician_specialty: String, physician_state: String, nature_of_payment: String) def parsePayment(str: String): Payment = { val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)") val physician_id =td(5) val amount = td(30).toDouble . . . Payment(id, physician_id, date_payment, payer, payer_state, amount, physician_type, focus, physician_state, nature_of_payment) } Function to Parse CSV data to Payment Object
  • 40.
    40 © 2018MapR Technologies, Inc //register a user-defined function (UDF) to deserialize the message spark.udf.register("deserialize", (message: String) => parsePayment(message)) //use the UDF in a select expression val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""").select($"message".as[Payment]) Parse message txt to Payment Object
  • 41.
    41 © 2018MapR Technologies, Inc Writing to a Memory Sink Write results to Memory Start running the query val query = cdf .writeStream .queryName("uber") .format("memory") .outputMode("append”) query.start().awaitTermination()
  • 42.
    42 © 2018MapR Technologies, Inc %sql select * from payments limit 4: Streaming Applicaton +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ | _id|physician_id|date_payment| payer|payer_state| amount| physician_type| physician_specialty|physician_state| nature_of_payment| +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ |CA_Diagnostic Rad...| 132655| 02/12/2016| DFINE, Inc| CA| 90.87| Medical Doctor|Diagnostic Radiology| CA| Food and Beverage| |AZ_Vascular & Int...| 1006832| 01/26/2016| DFINE, Inc| CA| 25.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/13/2016| DFINE, Inc| CA| 32.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/19/2016| DFINE, Inc| CA| 27.27| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
  • 43.
    43 © 2018MapR Technologies, Inc %sql select * from payments limit 3: Streaming Applicaton
  • 44.
    44 © 2018MapR Technologies, Inc %sql select physician_specialty, count(*) as cnt from payment2 group by physician_specialty order by cnt desc Streaming Applicaton
  • 45.
  • 46.
    46 © 2018MapR Technologies, Inc MapR-DB Connector for Apache Spark Spark Streaming writing to MapR-DB JSON
  • 47.
    47 © 2018MapR Technologies, Inc Spark MapR-DB Connector
  • 48.
    48 © 2018MapR Technologies, Inc Designed for Partitioning and Scaling
  • 49.
    49 © 2018MapR Technologies, Inc MapR-DB JSON Document Store Data is automatically partitioned and sorted by _id row key! { "_id":”AK_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”AK" ”nature_of_payment":"Food and Beverage" }
  • 50.
    50 © 2018MapR Technologies, Inc MapR Database shell maprdb mapr:> find /user/mapr/paytable --limit 2 {"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21, "date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage", "payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545", "physician_specialty":"Specialist", "physician_state":"AK", "physician_type":"Medical Doctor"} {"_id":"AK_01/04/2016_100801_396460394","amount":11.59, "date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage", "payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801", "physician_specialty":"Family Medicine", "physician_state":"AK", "physician_type":"Medical Doctor"} 2 document(s) found. Data is automatically partitioned and sorted by _id row key!
  • 51.
    51 © 2018MapR Technologies, Inc Writing to a MapR-DB Sink Write Streaming DataFrame Query Results to MapR-DB Start running the query val query = cdf.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, tableName) .option(MapRDBSourceConfig.IdFieldPathOption, "_id") .option(MapRDBSourceConfig.CreateTableOption, false) .option("checkpointLocation", "/user/mapr/check") .option(MapRDBSourceConfig.BulkModeOption, true) .option(MapRDBSourceConfig.SampleSizeOption, 1000) query.start().awaitTermination()
  • 52.
    52 © 2018MapR Technologies, Inc Streaming Dataset Read from Topic partitions, Transformed, Written to MapR Database partitions
  • 53.
    53 © 2018MapR Technologies, Inc Store in MapR Database
  • 54.
  • 55.
    55 © 2018MapR Technologies, Inc •  Spark SQL queries and updates to MapR-DB •  With projection and filter pushdown, custom partitioning, and data locality Spark SQL Querying MapR-DB JSON
  • 56.
    56 © 2018MapR Technologies, Inc val pdf: Dataset[Payment] = spark .loadFromMapRDB[Payment](tableName, schema) .as[Payment] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  • 57.
    Load the data into a DataFrame
        // Register the Dataset as a temporary view so it can be queried with SQL.
        pdf.createOrReplaceTempView("payments")
        pdf.select("_id", "payer", "amount").show()
    Data is automatically partitioned and sorted by the _id row key.
  • 58.
    Now We Can Ask Questions Like:
    •  Which physician specialties, or physician states, are getting the most payments?
       •  Count, total amount ($), average amount ($), standard deviation
    •  Which companies, or company states, are making the most payments?
       •  Count, total amount ($), average amount ($), standard deviation
    •  What are the top natures of payment by count and by amount?
    •  What are the payments over $2,000,000?
    Sample record:
    | Physician ID | Date | Payer | Payer State | Physician Specialty | Physician State | Amount | Payment Nature |
    | 1261770 | 01/11/2016 | Southern Anesthesia & Surgical, Inc | CO | Oral and Maxillofacial Surgery | CA | 117.5 | Food and Beverage |
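The first question above (count, total, average, and standard deviation per specialty) could be answered along these lines. This is a sketch under stated assumptions, not the deck's code: `SpecialtyPayment` and `SpecialtyStats` are hypothetical names, the rows are made up, and a local SparkSession stands in for the cluster.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, count, desc, stddev, sum}

// Hypothetical, trimmed-down record holding only the fields used below.
case class SpecialtyPayment(physician_specialty: String, amount: Double)

object SpecialtyStats {
  // Count, total, average, and sample standard deviation of payments
  // per specialty, largest total first.
  def bySpecialty(payments: Dataset[SpecialtyPayment]): DataFrame =
    payments
      .groupBy("physician_specialty")
      .agg(
        count("amount").as("cnt"),
        sum("amount").as("total"),
        avg("amount").as("average"),
        stddev("amount").as("std_dev"))
      .orderBy(desc("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("specialty-stats-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration; not taken from the Open Payments data.
    val sample = Seq(
      SpecialtyPayment("Family Medicine", 10.0),
      SpecialtyPayment("Family Medicine", 30.0),
      SpecialtyPayment("Oral and Maxillofacial Surgery", 117.5)
    ).toDS()

    bySpecialty(sample).show(false)
    spark.stop()
  }
}
```

Grouping by `physician_state`, `payer`, or `payer_state` instead of `physician_specialty` would address the other questions in the same way.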
  • 59.
    Language Integrated Queries
    | Query | Description |
    | agg(expr, exprs) | Aggregates on entire DataFrame |
    | distinct | Returns new DataFrame with unique rows |
    | except(other) | Returns new DataFrame with rows from this DataFrame not in other DataFrame |
    | filter(expr); where(condition) | Filter based on the SQL expression or condition |
    | groupBy(cols: Columns) | Groups DataFrame using specified columns |
    | join(DataFrame, joinExpr) | Joins with another DataFrame using given join expression |
    | sort(sortcol) | Returns new DataFrame sorted by specified column |
    | select(col) | Selects set of columns |
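A few of the operators listed above (`select`, `distinct`, `except`, `join`) can be sketched on tiny in-memory DataFrames. The object name, helper names, and rows here are hypothetical and made up for illustration, and the join uses the column-name overload of `join` rather than a join expression so the result has a single `payer` column:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameOps {
  // Distinct payers present in `all` but absent from `excluded`.
  def payersNotIn(all: DataFrame, excluded: DataFrame): DataFrame =
    all.select("payer").distinct().except(excluded.select("payer"))

  // Attach each payer's state by joining on the shared "payer" column.
  def withPayerState(payments: DataFrame, payerStates: DataFrame): DataFrame =
    payments.join(payerStates, Seq("payer"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-ops-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration only.
    val payments = Seq(("Payer A", 10.0), ("Payer B", 20.0), ("Payer A", 5.0))
      .toDF("payer", "amount")
    val excluded = Seq("Payer B").toDF("payer")
    val states = Seq(("Payer A", "MI"), ("Payer B", "IL")).toDF("payer", "payer_state")

    payersNotIn(payments, excluded).show()
    withPayerState(payments, states).show()
    spark.stop()
  }
}
```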
  • 60.
    MapR Data Platform
  • 61.
    Top 5 Nature of Payment by count
        import org.apache.spark.sql.functions.desc

        // Count payments per nature of payment and show the five most frequent.
        pdf.groupBy("nature_of_payment")
          .count()
          .orderBy(desc("count"))
          .show(5)
  • 62.
    Top 5 Nature of Payment by amount of payment
        %sql
        select nature_of_payment, sum(amount) as total
        from payments
        group by nature_of_payment
        order by total desc
        limit 5
  • 63.
    Top 5 Physician Specialties by total Amount
        %sql
        select physician_specialty, sum(amount) as total
        from payments
        where physician_specialty IS NOT NULL
        group by physician_specialty
        order by total desc
        limit 5
  • 64.
    Average Payment by Specialty
  • 65.
    Which natures of payment have payments over $1,000?
        import org.apache.spark.sql.functions.desc

        // Keep payments above 1000, then count them per nature of payment.
        pdf.filter($"amount" > 1000)
          .groupBy("nature_of_payment")
          .count()
          .orderBy(desc("count"))
          .show()
  • 66.
    Top Payers by Total Amount with count
        %sql
        select payer, payer_state, count(*) as cnt, sum(amount) as total
        from payments
        group by payer, payer_state
        order by total desc
        limit 10
  • 67.
    Scenario: InPatient Payment Data
    | Provider ID | Provider State | DRG | Total Discharges | Average Charges | Average Total Payments | Avg Medicare Payments |
    | 1261770 | CA | Heart Transplant | 13 | 1,172,866 | 251,876 | 244,457 |
  • 68.
    MapR Database shell
        maprdb mapr:> find /user/mapr/ippstable --limit 2
        {"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments":244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619 SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033", "provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL - Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13, "year":"2014"}
        {"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments":133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX", "provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ - Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014"}
  • 69.
    Now We Can Ask Questions Like:
    •  What are the top charges and payments by hospital?
       •  By hospital state?
    •  What are the top charges and payments by diagnosis?
    •  What is the most common diagnosis?
    •  What are the payment statistics for the most common diagnosis (sepsis)?
    •  Which states are the most expensive for the most common diagnosis (sepsis)?
    Sample record:
    | Provider ID | Provider State | DRG | Total Discharges | Average Charges | Average Total Payments | Avg Medicare Payments |
    | 1261770 | CA | Heart Transplant | 13 | 1,172,866 | 251,876 | 244,457 |
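The last question above (most expensive states for sepsis) might be approached as follows. This is only a sketch: the column names come from the `ippstable` documents shown earlier, while the object name, the substring match on `drg_definition`, and the sample rows are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, desc, max, min, upper}

object SepsisByState {
  // Per-state payment statistics for DRGs whose definition mentions sepsis.
  // Note: this averages the per-provider averages and does not weight
  // them by total_discharges.
  def mostExpensiveStates(ipps: DataFrame): DataFrame =
    ipps
      .filter(upper(col("drg_definition")).contains("SEPSIS"))
      .groupBy("provider_state")
      .agg(
        avg("avg_total_payments").as("avg_payment"),
        min("avg_total_payments").as("min_payment"),
        max("avg_total_payments").as("max_payment"))
      .orderBy(desc("avg_payment"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sepsis-by-state-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration; not taken from the IPPS dataset.
    val ipps = Seq(
      ("AL", "SEPTICEMIA OR SEVERE SEPSIS", 12000.0),
      ("AL", "SEPTICEMIA OR SEVERE SEPSIS", 14000.0),
      ("AZ", "SEPTICEMIA OR SEVERE SEPSIS", 20000.0),
      ("AZ", "HEART TRANSPLANT", 240000.0)
    ).toDF("provider_state", "drg_definition", "avg_total_payments")

    mostExpensiveStates(ipps).show(false)
    spark.stop()
  }
}
```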
  • 70.
    InPatient Prospective Payment System payments
  • 71.
    To download the code, see the streaming data pipeline blogs:
    •  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/
    •  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare-dataset-mapr-db/
  • 72.
    New Spark Ebook: http://mapr.com/ebooks/
  • 73.
    Read about Streaming Architectures: mapr.com/ebooks
  • 74.
    Read about Machine Learning Logistics: mapr.com/ebooks
  • 75.
    To Learn More: New Spark 2.0 training
    MapR Free ODT: http://learn.mapr.com/
  • 76.
    MapR Blog: https://mapr.com/blog/
  • 77.
    To Learn More:
    •  https://mapr.com/blog/ml-iot-connected-medical-devices/
  • 78.
    To Learn More:
    •  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing-healthcare-platforms/
  • 79.
    Applying Machine Learning to Live Patient Data
    •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data