David Taieb
STSM - IBM Cloud Data Services Developer Advocate
david_taieb@us.ibm.com

HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 1: Flight Delay Predict with Spark ML
PyCon 2016, Portland

©2016 IBM Corporation

Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models

Sign up for Bluemix
• Access the IBM Bluemix website at https://console.ng.bluemix.net
• Click Get Started for Free
• Complete the form and click Create account
• Look for the confirmation email and click the "confirm your account" link

Sign up for a free trial at Flightstats.com
• Sign up at https://developer.flightstats.com/signup
• Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
• Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets)
• Optional: get familiar with the various FlightStats APIs:
  – https://developer.flightstats.com/api-docs/scheduledFlights/v1
  – https://developer.flightstats.com/api-docs/airports/v1

Where to find the FlightStats app id and app key
(Screenshot callouts: APP ID, APP Key)

Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.
Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse.

Create a Spark Instance
Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse.

Create New Spark Instance

Flight App Project Description
• Use case
  – Flight delays are a common disturbance during business trips
  – Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
  – Idea: weather data can be a good explanatory variable for building predictive models
• Implementation
  – Combine flight statistics from flightstats.com (System of records) with weather data from IBM Insight for Weather (System of operations) to build training, test, and blind sets
  – Use Spark MLlib to train predictive models and cross-validate them
  – Create a custom card for Google Now that automatically notifies the user of an impending flight delay
  – Propose alternative flight routes (e.g. Freebird)

Get/Build/Analyze methodology

Flight Predict App Architecture
(Architecture diagram. Components shown: Weather, Airports, Flight Schedules, and Flight Status sources; Simple Data Pipes with a custom connector that runs every 24 hours; Metadata; the Training, Test, and Blind sets; and the Notebook.)

Flow Diagram
(Flow diagram. Stages: Data Acquisition; Data Preparation (cleansing, shaping, enrichment); Data Annotation (ground truth); Model Training; Model Testing; Deploy and Run Model. The Training, Test, and Blind sets feed an iterative cross-validation loop: train the model, evaluate performance, and optimize the model.)
• Iterative in nature: we are never done!
• We will be using this diagram as a roadmap throughout this course

Get the data and build the training/test/blind sets
In this step we'll use the Simple Data Pipes open source project to acquire data from FlightStats, combine it with weather data from IBM Insight for Weather, and save the data sets into a Cloudant NoSQL database.

Acquiring the data
• In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector
• The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather
• If you want to skip these steps, you can use the already built data set with the following credentials:
  – cloudantHost: dtaieb.cloudant.com
  – cloudantUserName: weenesserliffircedinvers
  – cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5

Deploy simple-data-pipe with flightstats connector
• Go to https://github.com/ibm-cds-labs/simple-data-pipe
• Click the Deploy to Bluemix button; clicking it takes you to Bluemix

Complete simple-data-pipe deployment

Add an instance of IBM Weather Service on Bluemix
• Return to the application dashboard
• The Weather service is required by the flight predict connector and must be installed first
• From the app dashboard, click Add a service or API

Create an instance of IBM Weather Service on Bluemix
• Search for Weather
• Make sure to select the "premium plan" to have enough authorized API calls

Checkpoint: simple data pipe app dashboard
• Verify that your app is correctly bound to the right services:
  – Weather service, used to enrich flight records with weather observations
  – Cloudant service, used to store the training, test, and blind data sets
• It is recommended to increase the app memory to 1GB
• The button highlighted in the screenshot is needed for the step on the next slide

Install flight predict connector
• Click the Edit Code button, then edit package.json to add the flight predict module to the dependencies:
  – "simple-data-pipe-connector-flightstats":"git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
• Don't forget to add a comma at the end of the preceding line to keep the JSON valid
• Click File/Save to save your changes

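A minimal sketch of the same package.json edit done programmatically, which sidesteps the missing-comma problem because the file is re-serialized as a whole. The sample package.json content and the helper name add_connector are assumptions made for illustration; only the module name and git URL come from the slide.

```python
import json

# Module name and git URL as given on the slide.
CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = ("git://github.com/ibm-cds-labs/"
                 "simple-data-pipe-connector-flightstats.git")


def add_connector(package_json_text):
    """Return package.json text with the flight predict module added."""
    package = json.loads(package_json_text)  # raises ValueError if the input is invalid
    dependencies = package.setdefault("dependencies", {})
    dependencies[CONNECTOR_NAME] = CONNECTOR_URL
    # json.dumps writes the separating commas, so the output is valid JSON.
    return json.dumps(package, indent=2)


# Hypothetical, heavily trimmed package.json used only for this example.
sample = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'
updated = add_connector(sample)
print(updated)
```
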
Redeploy simple data pipe app
• Use the live edit editor to redeploy the app

Verify connector install
• In this step, we verify through the UI that the flight predict connector is correctly installed
(Screenshot callout: flight connector correctly installed)

Create a new FlightStats pipe
• Follow each screen to create and configure a new pipe

Run the pipe
• Skip over the schedule tab
• In the activity tab, click Run Now to start the pipe
• Then open the log to monitor the activity

Explore the data sets
• In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
• From the Bluemix dashboard, click the Cloudant service tile, then the Launch button
• From the Cloudant dashboard, open the training database
• Open a document to look at the data structure

Run the pipe again to build the test set

Train the Models
• In the previous section we created the training data, and we are now ready to train the models
• Steps in this section:
  – Create an IPython Notebook
  – Load the data sets from the Cloudant database into a Spark cluster
  – Explore the data and train the machine learning models

Create a new IPython Notebook

Notebook tour

Notebook tour: Notebook Info

Notebook tour: Environment

Notebook tour: Sharing

Before we start building the app…
• You can optionally follow this tutorial from GitHub by using a fully built notebook:
  – https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Optional: use prebuilt notebook
• Create a notebook from URL
• Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Using Python Packages
• Write code inline within cells
• Encapsulate helper APIs within a Python package
• 2 ways of using helper Python packages:
  – Egg distribution package: pip install from a PyPI server or a file server (e.g. GitHub)
    • Persistent install across sessions
    • Recommended in production
  – SparkContext.addPyFile
    • Easy addition of a Python module file
    • Supports multiple module files via zip format
    • Recommended during development, where frequent code changes occur

Flight Predict Python Package on GitHub
(Screenshot callouts: setup script for installing the Python package; Flight Predict Python library)

Method 1: Install Flight Predict Package
• Use pip to install the Flight Predict package
• Recommended alternative: build an egg distribution package and deploy it to PyPI

Manage Python packages
• Check status
• Uninstall package

Method 2: Install py modules via sc.addPyFile
• addPyFile installs individual py modules and makes them available to all executor processes
• Works with modules in zipped files
(Code callouts: module containing the APIs for training the models; module containing the APIs for running the models)

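A minimal sketch of Method 2. SparkContext.addPyFile is the real Spark call; the helper name add_helper_modules and the two module URLs are placeholders assumed for illustration and should be replaced with the actual training and run modules.

```python
def add_helper_modules(sc, module_paths):
    """Register each module (.py or .zip) with the SparkContext.

    addPyFile accepts a local path or a URL and ships the file to the
    executors for the tasks that run afterwards.
    """
    for path in module_paths:
        sc.addPyFile(path)
    return list(module_paths)


# Hypothetical locations, not the real module URLs.
HELPER_MODULES = [
    "https://example.com/flightpredict/training.py",
    "https://example.com/flightpredict/run.py",
]

# In the notebook, where `sc` is the predefined SparkContext:
# add_helper_modules(sc, HELPER_MODULES)
```

Because the files are fetched again each time the cell runs, this approach suits the development loop described above, where the helper code changes often.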
Setup credentials and import required Python modules
In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.
(Code callouts: credentials for the Cloudant NoSQL service; credentials for the Weather service)

Get Credentials for Cloudant
From the app dashboard, click Environment Variables in the left sidebar.

Get Credentials for Weather

Load training set in Spark SQL DataFrame …
In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark.
Make sure to change the db name to match the one created for your training set by your activity (open the Cloudant dashboard to find the name).

Loading data: behind the scenes
• Use the Spark SQL connector to load data into a DataFrame
  – Connector id
  – Options
• Cache the data for optimized reuse
• Create a temp SQL table

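The steps above can be sketched as follows. This is not the notebook's actual code: the function name and its parameters are assumptions, and the connector id "com.cloudant.spark" and the cloudant.* option keys are what the spark-cloudant connector documents for Spark 1.x, so verify them against the version you deploy.

```python
def load_cloudant_dataframe(sql_context, host, username, password,
                            db_name, table_name):
    """Load a Cloudant database into a cached DataFrame and register it."""
    df = (sql_context.read
          .format("com.cloudant.spark")            # connector id
          .option("cloudant.host", host)           # options
          .option("cloudant.username", username)
          .option("cloudant.password", password)
          .load(db_name))
    df.cache()                        # cache the data for optimized reuse
    df.registerTempTable(table_name)  # create a temp SQL table (Spark 1.x API)
    return df


# In the notebook, where `sqlContext` is predefined (db name is hypothetical):
# training = load_cloudant_dataframe(
#     sqlContext, cloudantHost, cloudantUserName, cloudantPassword,
#     "flight-training-db", "training")
# sqlContext.sql("SELECT COUNT(*) FROM training").show()
```
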
Scatter plot visualization

Visualization API

Transform into an RDD of LabeledPoint
Use the Spark SQL connector to load data into a DataFrame.

loadLabeledDataRDD API

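A minimal sketch of the per-record transformation behind an RDD of LabeledPoint, written in plain Python so it runs without Spark. The namedtuple stands in for pyspark.mllib.regression.LabeledPoint, the feature field names are hypothetical, and this is not the implementation of loadLabeledDataRDD; only the classification field is named in the slides.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint (same two fields).
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Hypothetical weather fields; use the names found in your Cloudant documents.
FEATURE_FIELDS = ["departureTemp", "departureWindSpeed",
                  "arrivalTemp", "arrivalWindSpeed"]


def to_labeled_point(record, feature_fields=FEATURE_FIELDS):
    """Map one flight record (a dict) to a (label, features) pair."""
    label = float(record["classification"])  # ground truth class
    # Missing or null observations are replaced with 0.0 in this sketch.
    features = [float(record.get(name) or 0.0) for name in feature_fields]
    return LabeledPoint(label, features)


record = {"classification": 2, "departureTemp": 21.5,
          "departureWindSpeed": 12, "arrivalTemp": None,
          "arrivalWindSpeed": 30}
point = to_labeled_point(record)
print(point)
```

With Spark, the same function could be applied to every row, for example by mapping it over the DataFrame's underlying RDD after converting each row to a dict.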
Machine Learning Algorithms

Supervised Learning (requires ground truth)
• Continuous output
  – Regression: Linear, Ridge, Lasso, Isotonic
  – Decision Tree
  – RandomForest
  – GradientBoostedTree
• Discrete output
  – Classification: Logistic Regression, SVM, NaiveBayes
  – Decision Tree
  – RandomForest
  – GradientBoostedTree
  – K-NN (available as an add-on Spark package)

Unsupervised Learning (no ground truth data required)
• Clustering: KMeans, Gaussian Mixture
• Dimensionality Reduction: PCA, SVD
• FP-Growth

Train Logistic Regression Model

Train NaiveBayes Model

Train Decision Tree Model

Train Random Forest Model

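A minimal sketch of training the four classifiers with the RDD-based pyspark.mllib API of Spark 1.x. The function name and the hyperparameter values (iterations, depth, number of trees) are illustrative assumptions, not the values used in the session. The pyspark imports are done inside the function so that the cell can be defined even where Spark is not installed.

```python
def train_models(labeled_rdd, num_classes):
    """Train four MLlib classifiers on an RDD of LabeledPoint."""
    from pyspark.mllib.classification import (LogisticRegressionWithLBFGS,
                                               NaiveBayes)
    from pyspark.mllib.tree import DecisionTree, RandomForest

    labeled_rdd.cache()  # the same RDD is scanned by every trainer
    return {
        "LogisticRegression": LogisticRegressionWithLBFGS.train(
            labeled_rdd, iterations=100, numClasses=num_classes),
        # MLlib's NaiveBayes expects non-negative feature values.
        "NaiveBayes": NaiveBayes.train(labeled_rdd),
        "DecisionTree": DecisionTree.trainClassifier(
            labeled_rdd, numClasses=num_classes, categoricalFeaturesInfo={},
            impurity="gini", maxDepth=5, maxBins=32),
        "RandomForest": RandomForest.trainClassifier(
            labeled_rdd, numClasses=num_classes, categoricalFeaturesInfo={},
            numTrees=10, featureSubsetStrategy="auto", impurity="gini",
            maxDepth=5, maxBins=32),
    }


# In the notebook, with an RDD of LabeledPoint built from the training set:
# models = train_models(trainingData, 5)
# models["DecisionTree"].predict(trainingData.first().features)
```
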
Naïve Bayes vs Decision Tree

Naïve Bayes
• Probabilistic: computes the probability of a data instance being in a specific class
• Assumes that each feature (variable) is independent of the others
• Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
• Works well with a small amount of training data; doesn't need all the possibilities
• Doesn't work with categorical features

Decision Tree
• Non-probabilistic: partitions the data into subsets that best describe the variable
• The deeper the tree, the better the model fits the data
• Watch out for overfitting: the tree needs to be pruned
• Can handle categorical or continuous features
• No need for input to be scaled or standardized: set your features and go!
• Requires a lot of data covering all possibilities

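The independence assumption in the Naïve Bayes column corresponds to the standard Naïve Bayes decision rule. The notation is introduced here for illustration and is not taken from the slides: $c$ is a class and $x_1, \dots, x_n$ are the feature values of one record.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Each factor $P(x_i \mid c)$ is estimated separately, which is why the model can work with relatively little training data, and why a non-predictive feature still contributes a factor to the product.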
Accuracy Analysis of the Machine Learning Models
In this section, we perform accuracy analysis on the test data. We start by computing the accuracy metrics for each model, including the confusion matrix. We then use a histogram chart to understand the data distribution and refine how the classes are computed.

Load Test data
Make sure to change the db name to match the one created for your test set by your activity (open the Cloudant dashboard to find the name).

Accuracy Metrics

Confusion Matrix

Accuracy metrics API
• Compute metrics from labeled and prediction data
• Get the confusion matrix and build an HTML table
• Output HTML
• Display the resulting HTML in a notebook cell

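A minimal sketch of these steps in plain Python, assuming the predictions have already been collected as (label, prediction) pairs. The function names are hypothetical and this is not the package's metrics API. Rows of the matrix are actual classes and columns are predicted classes.

```python
def confusion_matrix(label_prediction_pairs, num_classes):
    """Count (actual, predicted) pairs; rows are actual classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in label_prediction_pairs:
        matrix[int(label)][int(prediction)] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / float(total) if total else 0.0


def precision_recall(matrix, cls):
    """Precision and recall for one class."""
    true_positives = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum
    actual = sum(matrix[cls])                    # row sum
    precision = true_positives / float(predicted) if predicted else 0.0
    recall = true_positives / float(actual) if actual else 0.0
    return precision, recall


def to_html(matrix):
    """Render the matrix as a bare HTML table."""
    rows = "".join(
        "<tr>" + "".join("<td>%d</td>" % cell for cell in row) + "</tr>"
        for row in matrix)
    return "<table>" + rows + "</table>"


# Hypothetical (label, prediction) pairs for a 3-class problem.
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 0)]
matrix = confusion_matrix(pairs, 3)
print(matrix, accuracy(matrix), precision_recall(matrix, 1))
```

In a notebook cell, the string returned by to_html can be rendered with IPython.display.HTML.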
Understand the distribution of your data with Histograms

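A minimal sketch of the kind of histogram that helps choose class boundaries, in plain Python. The delay values and the 15-minute bin width are made up for illustration; in the notebook the values would come from the training set.

```python
import math
from collections import Counter

BIN_WIDTH = 15  # minutes per bin (arbitrary choice for this sketch)


def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its lower bound."""
    counts = Counter(
        int(math.floor(v / float(bin_width))) * bin_width for v in values)
    return sorted(counts.items())


# Hypothetical departure delays in minutes (negative means early departure).
delays = [-5, 3, 12, 14, 20, 47, 130]
bins = histogram(delays, BIN_WIDTH)
for lower_bound, count in bins:
    print("%5d to %5d min | %s" % (lower_bound, lower_bound + BIN_WIDTH,
                                   "#" * count))
```

If most records fall into one or two bins, class boundaries drawn elsewhere leave some classes nearly empty, which is one reason to reclassify the records.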
Training Handler class
• Provides flexibility and extensibility to the application
• Provides a fail-fast and try-something-else mechanism
• Enables the user to easily customize the classes of data based on how the data is distributed
• Enables the user to easily add training features

Default Training Handler class
(Code callouts)
• Return the description for each class
• Return the total number of classes: default is 5
• Re-classify a record: the default uses the s.classification field in the JSON record
• Extra feature names to be added: none by default
• Extra features to be added: the array must match the one returned by customTrainingFeaturesNames

Customize Training Handler
Provide a new classification and add the day of departure as a new feature.
(Code callouts)
• Inherit from defaultTrainingHandler
• Add the day of the week using a technique called dummy coding

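A minimal sketch of a default handler and a customized one. Only customTrainingFeaturesNames and the idea of inheriting from the default handler come from the slides; the other method names, the record fields deltaDeparture and departureTime, and the delay thresholds are assumptions for illustration. The day of the week is encoded with one 0/1 indicator per day (strict dummy coding would drop one of the seven columns).

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


class DefaultTrainingHandler(object):
    """Stand-in for the package's default handler."""

    def numClasses(self):
        return 5  # default number of classes

    def getClassLabel(self, cls):
        return "Class %d" % cls  # placeholder description

    def computeClassification(self, record):
        return record["classification"]  # use the stored classification

    def customTrainingFeaturesNames(self):
        return []  # no extra features by default

    def customTrainingFeatures(self, record):
        return []


class CustomTrainingHandler(DefaultTrainingHandler):
    """Three delay classes plus the day of departure as extra features."""

    LABELS = ["Delayed less than 15 min", "Delayed 15 to 60 min",
              "Delayed more than 60 min"]  # hypothetical thresholds

    def numClasses(self):
        return len(self.LABELS)

    def getClassLabel(self, cls):
        return self.LABELS[cls]

    def computeClassification(self, record):
        delay = record["deltaDeparture"]  # hypothetical field, in minutes
        if delay < 15:
            return 0
        return 1 if delay <= 60 else 2

    def customTrainingFeaturesNames(self):
        return ["departure_" + day for day in DAYS]

    def customTrainingFeatures(self, record):
        # Hypothetical field holding the departure timestamp.
        departure = datetime.strptime(record["departureTime"],
                                      "%Y-%m-%dT%H:%M:%S")
        weekday = departure.weekday()  # Monday is 0
        return [1.0 if i == weekday else 0.0 for i in range(len(DAYS))]


handler = CustomTrainingHandler()
record = {"deltaDeparture": 42, "departureTime": "2016-05-30T08:15:00"}
print(handler.computeClassification(record),
      handler.customTrainingFeatures(record))
```

The feature array has the same length and order as the names array, which is the constraint stated for the default handler.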
Re-train the models

Re-compute accuracy
(Side-by-side comparison of Models 1 and Models 2)
• Better accuracy for NaiveBayes and Logistic Regression
• Worse for DecisionTree and RandomForest

Deploy and Run the models
In the last section, we simulate the deployment and running of the models through the notebook by calling APIs from the run package.

Run the predictive model

runModel API

Get Weather Predictions

Show prediction results

Resources
• https://developer.ibm.com/clouddataservices/
• https://github.com/ibm-cds-labs/simple-data-pipe
• https://github.com/ibm-cds-labs/pipes-connector-flightstats
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://console.ng.bluemix.net/data/analytics/

Thank You
