Graph Processing with Apache TinkerPop (incubating)
Jason Plurad
Software Engineer, IBM | Committer, Apache TinkerPop
• Project Update
• Graph Landscape
• A Graph Problem
• Hands-On Graph
http://tinkerpop.apache.org
About Me
• Twitter @pluradj
• GitHub @pluradj
• Open channels
  – TinkerPop mailing lists
  – Titan mailing list
  – Stack Overflow
(Apache) TinkerPop (incubating)
• 2009: Inception
• 2012: TinkerPop 2
• 2015: Apache Incubator
• 2016: Top Level Project?
  – TLP VOTE passed!
  – Waiting on board meeting to establish TLP
Podling Releases
• 3.0 – Major refactor, Java 8 lambda expressions, Gremlin Server, OLAP graph computers
• 3.1 – Hadoop 2 support, persisted RDDs
• 3.2 – OLAP job chaining, OLAP graph filters, performance improvements
Common graph data domains
• Social Network Analysis
• Configuration Management Database
• Master Data Management
• Recommendation Engines
• Knowledge Graphs
• Internet of Things
Property Graph and Gremlin
• Structure
  – Vertex
  – Edge
  – Properties
• Gremlin
  – Domain specific language (DSL) for graph
  – Data flow: forward and backward
  – Traversal Steps
  – Bindings for non-JVM languages
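To make the structure concrete, here is a minimal in-memory sketch of a property graph in plain JavaScript, with one forward step and one backward step. It is an illustration only and not TinkerPop's API: the helper names (`addVertex`, `addEdge`, `outStep`, `inStep`) and the sample packages are assumptions made up for this example.

```javascript
// Illustrative property graph: vertices and edges both carry a label and properties.
// This is NOT the TinkerPop API; all names here are hypothetical.
const vertices = new Map(); // id -> { id, label, properties }
const edges = [];           // { label, outV, inV, properties }

function addVertex(id, label, properties = {}) {
  const v = { id, label, properties };
  vertices.set(id, v);
  return v;
}

function addEdge(label, outV, inV, properties = {}) {
  const e = { label, outV, inV, properties };
  edges.push(e);
  return e;
}

// Forward data flow: follow edges with the given label from tail to head.
function outStep(vertexIds, edgeLabel) {
  return edges
    .filter(e => e.label === edgeLabel && vertexIds.includes(e.outV))
    .map(e => e.inV);
}

// Backward data flow: follow edges with the given label from head to tail.
function inStep(vertexIds, edgeLabel) {
  return edges
    .filter(e => e.label === edgeLabel && vertexIds.includes(e.inV))
    .map(e => e.outV);
}

// Hypothetical sample data: two packages that depend on left-pad.
addVertex('my-app', 'package', { license: 'MIT' });
addVertex('my-lib', 'package', { license: 'Apache-2.0' });
addVertex('left-pad', 'package', { license: 'WTFPL' });
addEdge('dependency', 'my-app', 'left-pad', { range: '^1.0.0' });
addEdge('dependency', 'my-lib', 'left-pad', { range: '^1.0.0' });

console.log(outStep(['my-app'], 'dependency'));  // what does my-app depend on?
console.log(inStep(['left-pad'], 'dependency')); // who depends on left-pad?
```

In Gremlin the same two questions would be expressed with the `out` and `in` traversal steps; the sketch only shows the data flow those steps imply.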
Apache TinkerPop Graph Computing Framework
Graph Landscape
• Graph database vs Graph processor
  – OLTP vs OLAP
  – Neighborhood vs whole graph
• Multi-model: not the only store in your app
IBM Graph (Beta)
• Managed Graph-as-a-Service (OLTP)
• Focus on your data, not install and operations
• #sleepMore
http://ibm.biz/IBMGraph
What is this?
    module.exports = xxxxxxx;
    function xxxxxxx (str, len, ch) {
      str = String(str);
      var i = -1;
      if (!ch && ch !== 0) ch = ' ';
      len = len - str.length;
      while (++i < len) {
        str = ch + str;
      }
      return str;
    }
A Graph Problem: Dependency Management
• On March 22, 2016 npm broke the Internet
• Left-pad was unpublished
  – 11 lines of code
  – WTFPL license
  – Hundreds of breaking builds per minute
  – http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
• Are we safe with Apache?
Questions for the graph
• Which dependencies are at risk?
• Which ones should be refactored away?
• Risk factors
  – Unsuitable license
  – Single developer
  – Too little code / Too much code
  – Changes too frequently / Code is stagnant
  – Nobody else is using it
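The first question, which packages are exposed when a dependency disappears, is a reachability query over reversed dependency edges. The sketch below answers it with a breadth-first search in plain JavaScript. The package names and the `dependsOn` map are hypothetical sample data, and `affectedBy` is a name invented for this illustration, not part of npm or TinkerPop.

```javascript
// Hypothetical sample data: package name -> direct dependencies.
const dependsOn = {
  'my-app': ['build-tool', 'formatter'],
  'build-tool': ['left-pad'],
  'formatter': ['left-pad'],
  'left-pad': [],
  'unrelated': [],
};

// Build the reverse index: dependency -> packages that depend on it directly.
function reverseIndex(graph) {
  const dependents = new Map();
  for (const [pkg, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep).push(pkg);
    }
  }
  return dependents;
}

// Breadth-first search over reversed edges: every package that would break,
// directly or transitively, if `target` were unpublished.
function affectedBy(target, graph) {
  const dependents = reverseIndex(graph);
  const seen = new Set();
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const pkg of dependents.get(current) || []) {
      if (!seen.has(pkg)) {
        seen.add(pkg);
        queue.push(pkg);
      }
    }
  }
  return [...seen].sort();
}

console.log(affectedBy('left-pad', dependsOn));
```

The other risk factors (license, number of maintainers, code size, change frequency, usage) would be properties on the package vertices that can be filtered on before or after this walk.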
Let’s go for a ride!
Titan (Aurelius)
• Pick a graph database for OLTP…
  – Apache license but not in ASF
• Code has stagnated in the open
  – DataStax Enterprise (DSE) Graph
  – Wide open opportunities
• Genesis Graph is up next!
• Apache S2Graph (incubating)
• Apache Flink (Gelly)
• Apache Solr (GraphQuery)
Apache Spark or Apache Giraph
• Pick a graph processor for OLAP…
  – Spark is the new hotness
  – Giraph is better suited for gigantic graphs
• By using Apache TinkerPop and Gremlin, we can use either one seamlessly
Vagrant and VirtualBox
• Developers don’t always get keys to the cloud
• Virtual machines to the rescue
  – Host: 16 GB RAM or more
  – 3-4 VMs with 3 GB RAM each
• Prove out your graph algorithms on a small data set before wasting time on a big data set
Apache Ambari
• Simple install for Apache Hadoop and related Apache big data packages
  – HDFS, YARN, MapReduce, HBase, Spark, etc.
• Management and monitoring dashboard
• Enables integration of other software
Getting the data
• NPM registry runs on Apache CouchDB
• Replication in Apache CouchDB is awesome
  – https://skimdb.npmjs.com/registry
Transform the data
• Apache CouchDB is a document store
• Dependencies are graph data
• Other things can be too
  – Users
  – Keywords
  – License
• Graph model depends on the questions you want to ask of the graph
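As a rough sketch of this transformation, the function below turns one registry document into vertex and edge records. Several things are assumptions: the document field names (`dist-tags`, `versions`, `maintainers`) follow the registry document layout as commonly seen, the sample document is invented, only the latest version is considered, and the `keyword` and `maintainer` edge labels are hypothetical. `toGraphElements` is a name made up for this illustration.

```javascript
// Sketch: flatten one registry document into vertex and edge records.
// Vertices for the dependency targets are expected to come from their own documents.
function toGraphElements(doc) {
  const vertices = [{ label: 'package', id: doc.name }];
  const edges = [];
  const latest = doc['dist-tags'] && doc['dist-tags'].latest;
  const version = (doc.versions || {})[latest] || {};

  for (const dep of Object.keys(version.dependencies || {})) {
    edges.push({ label: 'dependency', from: doc.name, to: dep });
  }
  for (const dep of Object.keys(version.devDependencies || {})) {
    edges.push({ label: 'devDependency', from: doc.name, to: dep });
  }
  for (const kw of version.keywords || []) {
    vertices.push({ label: 'keyword', id: kw });
    edges.push({ label: 'keyword', from: doc.name, to: kw }); // hypothetical edge label
  }
  if (typeof version.license === 'string') {
    vertices.push({ label: 'license', id: version.license });
    edges.push({ label: 'license', from: doc.name, to: version.license });
  }
  for (const person of doc.maintainers || []) {
    vertices.push({ label: 'person', id: person.name });
    edges.push({ label: 'maintainer', from: doc.name, to: person.name }); // hypothetical edge label
  }
  return { vertices, edges };
}

// Invented sample document for illustration.
const sampleDoc = {
  name: 'example-pkg',
  'dist-tags': { latest: '1.0.0' },
  versions: {
    '1.0.0': {
      dependencies: { 'left-pad': '^1.0.0' },
      devDependencies: { mocha: '^2.0.0' },
      keywords: ['string'],
      license: 'MIT',
    },
  },
  maintainers: [{ name: 'alice' }],
};

const elements = toGraphElements(sampleDoc);
console.log(elements.vertices.length, 'vertices,', elements.edges.length, 'edges');
```

Which fields become vertices and which stay as properties is exactly the modeling choice the last bullet describes: it depends on the questions to be asked of the graph.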
NPM Graph Schema
• Vertex labels (approximate counts)
  – Document: 250K
  – Package: 1.5M
  – Keyword: 81K
  – License: 2K
  – Person: 125K
• Edge labels
  – license
  – dependency
  – devDependency
Hands-On: Gremlin Console
https://asciinema.org/a/21qk1rn9yt6tt7sour9w9ynxn
The GraphComputer
Anatomy of a Vertex Program
• Vertex-centric graph logic
• Parallel execution (BSP)
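The vertex-centric, bulk synchronous parallel (BSP) idea can be sketched in a few lines: in each superstep every vertex sends messages along its out-edges, a barrier follows, and then every vertex updates its own state from its inbox. The sketch below runs a simplified PageRank this way, sequentially, in plain JavaScript. It is not TinkerPop's VertexProgram interface; the function name, the sample graph, and the iteration count are assumptions, and it assumes every vertex has at least one out-edge (no dangling-vertex handling).

```javascript
// Simplified PageRank in BSP style. outEdges: vertex id -> array of target ids.
// Assumes every vertex has at least one outgoing edge.
function pageRank(outEdges, iterations = 20, damping = 0.85) {
  const ids = Object.keys(outEdges);
  const n = ids.length;
  let rank = {};
  for (const id of ids) rank[id] = 1 / n;

  for (let step = 0; step < iterations; step++) {
    // Message phase: each vertex splits its rank evenly across its out-edges.
    const inbox = {};
    for (const id of ids) inbox[id] = 0;
    for (const id of ids) {
      const targets = outEdges[id];
      for (const t of targets) inbox[t] += rank[id] / targets.length;
    }
    // Barrier: all messages are delivered before any vertex updates.
    // Compute phase: each vertex updates its state from its own inbox only.
    const next = {};
    for (const id of ids) next[id] = (1 - damping) / n + damping * inbox[id];
    rank = next;
  }
  return rank;
}

// Hypothetical sample graph: a cycle a -> b -> c -> a, plus d -> a.
const sampleGraph = { a: ['b'], b: ['c'], c: ['a'], d: ['a'] };
const ranks = pageRank(sampleGraph);
console.log(ranks);
```

Because each vertex reads only its own inbox and writes only its own state, the compute phase can be partitioned across workers, which is what a graph computer on Spark or Giraph does at scale.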
Out of the box Vertex Programs
• Traversal
• BulkLoader
• BulkDumper
• PageRank
• PeerPressure
Hands-On: Graph Program
OLAP Traversal Sources
    > graph = GraphFactory.open('conf/npmgraph-olap.properties')
    > g = graph.traversal().withComputer(SparkGraphComputer)   // either: run on Spark
    > g = graph.traversal().withComputer(GiraphGraphComputer)  // or: run on Giraph
Graph Statistics via TraversalVertexProgram
    > g.V().count()                          // vertex count
    > g.E().count()                          // edge count
    > g.V().label().groupCount()             // vertex label distribution
    > g.E().label().groupCount()             // edge label distribution
    > g.V().properties().key().groupCount()  // vertex property distribution
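For readers new to Gremlin, the label-distribution statistics above boil down to a group-and-count over every element. The plain JavaScript sketch below shows the equivalent computation on an invented in-memory list; `groupCount` here is a hypothetical helper written for this example, not the Gremlin step itself, and the sample vertices are made up.

```javascript
// Count how many items fall under each key, similar in spirit to a label distribution.
function groupCount(items, keyFn) {
  const counts = {};
  for (const item of items) {
    const key = keyFn(item);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Invented sample vertices for illustration.
const sampleVertices = [
  { id: 1, label: 'package' },
  { id: 2, label: 'package' },
  { id: 3, label: 'keyword' },
  { id: 4, label: 'person' },
  { id: 5, label: 'package' },
];

console.log(groupCount(sampleVertices, v => v.label));
```

On a graph computer the same aggregation touches every vertex, which is why it is an OLAP query: the work is spread across partitions and the partial counts are merged.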
Next stop? More data!
• Graphs are for connecting data!
• Consume data from GitHub
  – User data
  – Static code analysis
  – Code usage analysis
• Consume data from Twitter
  – Trending news
  – Security alerts
Summary
• Apache TinkerPop is for graph computing
• OLTP vs OLAP is an important distinction
  – Gremlin allows you to seamlessly bridge the two
• Graph thinking is different from relational thinking
  – Is the future multi-model?
• Many opportunities to innovate in this space
Acknowledgements
• Marko Rodriguez
  – Gremlin language, Gremlin OLAP
• Ketrina Yim
  – Illustrator, creator of Gremlin and friends
• Stephen Mallette
  – TinkerPop release manager, Gremlin applications
• Daniel Kuppitz
  – Gremlin language guru
• David Robinson
  – Big data, multi-model architect/developer
Questions?
Thank you!
