An excursion into Graph Analytics with Apache Spark GraphX

Krishna Sankar https://www.linkedin.com/in/ksankar June 1, 2016

1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs 2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark- Practitioners/dp/1484209656 3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache- spark-graph-processing 4. Spark GarphX in Action - Manning 5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/ 6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf 7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf - http://web.stanford.edu/class/cs246/handouts.html 8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm 9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx 10. Zeppelin Setup - http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark 11. Data - http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time - http://openflights.org/data.html 12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png Thanks to the Giants whose work helped to prepare this tutorial

Agenda-Topic Time Data Description 1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB, Introduction to Spark GraphX 2. Basics are Discussed & A Graph Is Built 10 Min Simple Graph Edge,Vertex, Graph 3. GraphX API Landscape Is discussed 5 Min Explain APIs 4. Graph Structures Are Examined 5 Min Indegree, outdegree et al 5. Community, Affiliation & Strengths Are Explored 5 Min Connected components, Triangle et al 6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement Algorithms viz. aggregateMessages, Pregel Type Signature - aggregateMessages Algorithms with aggregateMessages 7. Case Study (AlphaGo Tweets) Is Conducted 10 Min AlphaGo Retweet Data E2E Pipeline w/real data, Map attributes to properties, vertices & edges, create graph and run algorithms 8. Questions Are Asked & Answered 10 Min AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!

Graph Applications § Many applications from social network to understanding collaboration,diseases, routing logistics and others § Came across 3 interesting applications, as I was preparing the materials !! 1) Project effectiveness by social network analysis on the projects' collaboration structures : “EC is interested in the impact generated by those projects … analyzing this latent impact of TEL projects by applying social network analysis on the projects' collaboration structures … TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects and strong, sustained organizational collaboration ties.” 2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are characterized by social cliques in which individuals associate more with members of their own social clique than with those outside their clique … Individuals involved in weak, between-clique social interactions are hypothesized to serve as bridges by which an infection may enter a clique and, hence, may experience higher infection risk … individuals who engaged in more between-clique associations, that is, those with multiple weak ties, were more likely to be infected with gastrointestinal helminth parasites ... “ 3) Panama papers : The leak presented them with a wealth of information, millions of documents, but no guide to structure … (ie community detection)

Graph based systems § Graphs are everywhere – web pages, social networks, people’s web browsingbehavior, logisticsetc. - Workingw/ graphshas become more important. And harder due to the scaleand complexityof algorithms § Two kinds– processingandquerying § Processing– GraphX, Pregel BSP, GraphLab § Querying– AllegroGraph,Titan, Neo4j, RDF stores. - SPARQL query language § Graph DBs have queries, canprocess large dataset § Graph processingcanrun complex algorithms § For Graph Analyticssystemsboth are required - Processon graphx, store in neo4j is a good alternative. - E.g. Panamapapers analytics

Graph Processing Frameworks § History – Graph based systems were specializedinnature - Graph partitioningis hard - Primitives of record based systemswere different than what graphbased systems needed § Rapid innovationin distributed data processingframeworks– MapReduce etc - Disk based, with limited partitioningabilities - Still an impedancemismatch § Graph-parallelover Data ParallelRDD ! – Best of both worlds ? ü We will see, later, GrapnX’s PartitionStrategy

Enter Spark§ More powerful partitioning mechanism § In-memory system makes iterative processing easier § GraphX – Graph processing built ontop of Spark § Graph processing at scale (distributed system) § Fast evolving project § § - - § § §

GraphX …§ Graph processing at Scale - “Graph Parallel System” § Has a rich - Computational Model - Edges, Vertices, Property Graph - Bipartite & tripartite graphs (Users-Tags-Web pages) - Algorithms (PageRank,…) § Current focus on computationthan query § APIs include : § Attribute Transformers, § Structure transformers, § Connection Mining & Algorithms § GraphFrames – interesting development § Exercises (lotsof interesting graph data) - Airline data,co-occurrence &co-citation from papers - The AlphaGo Community – ReTweet network - Wikipedia Page Rank Analysisusing GraphX Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf Scala, python support https://issues.apache.org/jira/browse/SPARK-3789 Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665 LDA https://issues.apache.org/jira/browse/SPARK-1405

Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) GraphX-Basics V StructuralQueries Indegrees, vertices,… AttributeTransformers mapVertices, … StructureTransformers, Join Reverse, subgraph Connection Mining connectedComponents, triangles Algorithms Aggregate messages, PageRank InterestingObjects o EdgeTriplet o EdgeContext http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

GraphX-Basics o Computational Model o Directed MultiGraph o Directed - so in-degree & out-degree o MultiGraph – so multiple parallel edges between nodes including loops ! o Algorithms beware – can cyclic, loops o Property Graph o vertexID(64bit long) int o Need property, cannot ignore it o Vertex,Edge Parameterized over object types o Vertex[(VertexId,VD)], Edge[ED] o Attach user-defined objects to edges and vertices (ED/VD) Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf

GraphX-Basics § For our hands-on we will run thru 2 Zeppelin notebooks (https://goo.gl/qCwZiq & https://goo.gl/EKHCFq): a. First we will work with the following Giraffe graph o … carefully chosen with interesting betweenness and properties o … for understanding the APIs b. Second, we will develop a retweet pipeline o With the AlphaGo twitter topic o 2 GB/330K tweets, 200K edges,… A D C E FG B 125 5 4.5 4.5 4 1.5 1.5 1 Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweenness val vertexList = List( (1L, Person("Alice", 18)), (2L, Person("Bernie", 17)), (3L, Person("Cruz", 15)), (4L, Person("Donald", 12)), (5L, Person("Ed", 15)), (6L, Person("Fran", 10)), (7L, Person("Genghis",854))) Good exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf Fun facts about centrality betweenness : It shows how many paths an edge is part of, i.e. relevancy. High centrality betweenness is the sign of a bottleneck, point of single failure – these edges need HA, probably need alternate paths for re routing, are susceptible to parasite infections and good candidates for a cut !

Build A Graph § There are 4 ways - GraphLoader.edgeListFile(…) - From RDDs (shown above) - fromEdgeTuples <- Id tuples - fromEdges <- edgeList

Graph API Landscape Separate algorithm from Graph Implementation Implementati on of analytic functions

Hint: Many details are documented in the object not in the class e.g. PartitionStrategy, Graph,…

Community-Affiliation-Strengths § Applied in many ways § For example in Fraud & Security Applications § Triangle detection – for spam servers § The age of a community isrelated to the density of triangles - New community will have few triangles,then triangles start to form § Strong affiliation ie Heavy hitter = sqrt(m) degrees! - Heavy hitter triangle ! § Connected Communities– structure

Algorithms § Graph-Parallel Computation - aggregateMessages()Function - Pregel() (https://issues.apache.org/jira/browse/SPARK-5062) § pageRank() - Influential papers in a citation network - Influencer in retweet § staticPageRank() - Static no of iterations, dynamic tolerance – see the parameters (tol vs. numIter) § personalizedPageRank() - Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph – “People You May Know” § shortestPaths, SVD++ § LabelPropagation(LPA) as described by Raghavanet al in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks” - Computationally inexpensive way to Identify communities - Convergence not guaranteed - Might end up with trivial solution i.e. single community § SDVPlusPlus takes an RDD of Edges § The Global Clustering Coefficient, is better in that it always returns a number between 0 and 1 - For comparing connectnedness between different sized communities http://graphframes.github.io/user-guide.html

§ Versatile Function useful for implementing PageRank et al - Can be difficult at first, but easier to comprehend if treated as a combined map-reduce function (it was called MapReduceTriplets !! With a slightly different signature) o aggregateMessage[Msg]( o map(edgeContext=>mapFun,<- this can be up or downie sendToDst or Src! o redcuce(Msg,Msg) => reduceFun)

If you really want to know what is underneath the aggregateMessages()…

Pregel BSP API § Bulk SynchronousParallelMessagePassingAPI - Developed by Leslie Valiant1 - Synchronizeddistributedsuper steps - Super Step – local,in-memorycomputations § Computations on the vertex – “Think likea Vertex” § graphx.lib.LabelPropagationAlgorithmuses Pregelinternally § graphx.lib.PageRankuses Pragel(for the runUntilConvergenceWithOptionsversion,while it uses the aggregateMessagesfor the runWithOptions(staticPageRank)version) § Pregel is implementedsuccinctlywith the old mapReduceTriplets API 1http://delivery.acm.org/10.1145/80000/79181/p103- valiant.pdfhttp://www.slideshare.net/riyadparvez/pregel-35504069

Graphx Partitioning & Processing § Unlike a relationaltable,graph processing is contextual w.r.t a neighborhood - Maintain locality,equal size partitioning § Edge-cut : The communication& storage overhead of an edge-cut is directlyproportional to thenumber ofedges that are cut § Vertex-cut : Thecommunication and storage overhead ofa vertex-cut is directlyproportional to the sum of the number of machines spanned byeach vertex § Vertex-cut strategy by default (balance hot-spot issue due to power law/Zipf’s Law) w/ min replication ofedges § Batch processing/not streaming § Org.apache.spark.graphx.Graph. - partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int) - partitionBy(partitionStrategy: PartitionStrategy) - 4 PartitionStrategies 1) RandomVertexCut(usually the best) : random vertex cut that colocates all same- direction edges between two vertices (hashingthe source and destination vertexIDs) 2) CanonicalRandomVertexCut - a random vertexcut that colocates all edges between twovertices, regardless of direction (hashing the source and destination vertex IDs in a canonical direction) … remember GraphX is multi graph 3) EdgePartition1D - Assigns edges to partitions using only the source vertex ID, colocating edges withthe same source 4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf,https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf

AlphaGo Tweets Analytics Pipeline Collect Data Transform & Extract Features GraphX Model GraphX Algorithms Store § Initiallyprimed data (7days from twitter) § Then used thesinceID to get incremental tweets § Use application authentication for higher rate § Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True § ~330K tweets (see Download program), 2GB § => MongoDB ~820 MB w/compression) § Rich data (see MongoDB) - Retweet, user ids, hash tags, locations et al § This exercise only covers the retweet interest network

2-Pipeline screen shots§ Get max tweet id § db.alphago.find().sort({id:-1}).limit(1).pretty() - "id" : NumberLong("709550216467185664"), § db.alphago.find().sort({id:+1}).limit(1).pretty() - "id" : NumberLong("709211132498714627") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt - 232296 - Min : "id" : NumberLong("705845567537221632") - Max : "id" : NumberLong("709211104719872000") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt - +21368 - Min : "id" : NumberLong("705845567537221632"), - Max : "id" : NumberLong("709550216467185664"), - Count : 253664 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt - +38452 - Min: "id_str" : "705845567537221632", - Max: "id" : NumberLong("709817136437403649") - Count : 292116 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt - +17331 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710312094227300352"), - Count : 309447 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt - +10797 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710883718903140353"), - Count : 320244 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt - +11511 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("711720090954108928"), - Count : 331755 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt - +5166 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("712410062820511744"), - Count : 336921

3-Twitter gives lots of fields in Mongo

4-Extract Retweet Fields & 5-Store in csv

6-Read as dataframe 7-Create Vertices, Edges & Objects 8-Finally the graph & run algorithms

The Art of an AlphaGo GraphX Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) [(VertexId , VD)] [(VertexId, VertexId, [(VertexId, VD)]ED)] Vertex VertexEdge

An excursion into Graph Analytics with Apache Spark GraphX

An excursion into Graph Analytics with Apache Spark GraphX

More Related Content

What's hot

Similar to An excursion into Graph Analytics with Apache Spark GraphX

More from Krishna Sankar

Recently uploaded

An excursion into Graph Analytics with Apache Spark GraphX