Krishna Sankar https://www.linkedin.com/in/ksankar June 1, 2016
1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs 2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark- Practitioners/dp/1484209656 3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache- spark-graph-processing 4. Spark GarphX in Action - Manning 5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/ 6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf 7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf - http://web.stanford.edu/class/cs246/handouts.html 8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm 9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx 10. Zeppelin Setup - http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark 11. Data - http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time - http://openflights.org/data.html 12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png Thanks to the Giants whose work helped to prepare this tutorial
Agenda-Topic Time Data Description 1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB, Introduction to Spark GraphX 2. Basics are Discussed & A Graph Is Built 10 Min Simple Graph Edge,Vertex, Graph 3. GraphX API Landscape Is discussed 5 Min Explain APIs 4. Graph Structures Are Examined 5 Min Indegree, outdegree et al 5. Community, Affiliation & Strengths Are Explored 5 Min Connected components, Triangle et al 6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement Algorithms viz. aggregateMessages, Pregel Type Signature - aggregateMessages Algorithms with aggregateMessages 7. Case Study (AlphaGo Tweets) Is Conducted 10 Min AlphaGo Retweet Data E2E Pipeline w/real data, Map attributes to properties, vertices & edges, create graph and run algorithms 8. Questions Are Asked & Answered 10 Min AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
Graph Applications § Many applications from social network to understanding collaboration,diseases, routing logistics and others § Came across 3 interesting applications, as I was preparing the materials !! 1) Project effectiveness by social network analysis on the projects' collaboration structures : “EC is interested in the impact generated by those projects … analyzing this latent impact of TEL projects by applying social network analysis on the projects' collaboration structures … TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects and strong, sustained organizational collaboration ties.” 2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are characterized by social cliques in which individuals associate more with members of their own social clique than with those outside their clique … Individuals involved in weak, between-clique social interactions are hypothesized to serve as bridges by which an infection may enter a clique and, hence, may experience higher infection risk … individuals who engaged in more between-clique associations, that is, those with multiple weak ties, were more likely to be infected with gastrointestinal helminth parasites ... “ 3) Panama papers : The leak presented them with a wealth of information, millions of documents, but no guide to structure … (ie community detection)
Graph based systems § Graphs are everywhere – web pages, social networks, people’s web browsingbehavior, logisticsetc. - Workingw/ graphshas become more important. And harder due to the scaleand complexityof algorithms § Two kinds– processingandquerying § Processing– GraphX, Pregel BSP, GraphLab § Querying– AllegroGraph,Titan, Neo4j, RDF stores. - SPARQL query language § Graph DBs have queries, canprocess large dataset § Graph processingcanrun complex algorithms § For Graph Analyticssystemsboth are required - Processon graphx, store in neo4j is a good alternative. - E.g. Panamapapers analytics
Graph Processing Frameworks § History – Graph based systems were specializedinnature - Graph partitioningis hard - Primitives of record based systemswere different than what graphbased systems needed § Rapid innovationin distributed data processingframeworks– MapReduce etc - Disk based, with limited partitioningabilities - Still an impedancemismatch § Graph-parallelover Data ParallelRDD ! – Best of both worlds ? ü We will see, later, GrapnX’s PartitionStrategy
Enter Spark§ More powerful partitioning mechanism § In-memory system makes iterative processing easier § GraphX – Graph processing built ontop of Spark § Graph processing at scale (distributed system) § Fast evolving project § § - - § § §
GraphX …§ Graph processing at Scale - “Graph Parallel System” § Has a rich - Computational Model - Edges, Vertices, Property Graph - Bipartite & tripartite graphs (Users-Tags-Web pages) - Algorithms (PageRank,…) § Current focus on computationthan query § APIs include : § Attribute Transformers, § Structure transformers, § Connection Mining & Algorithms § GraphFrames – interesting development § Exercises (lotsof interesting graph data) - Airline data,co-occurrence &co-citation from papers - The AlphaGo Community – ReTweet network - Wikipedia Page Rank Analysisusing GraphX Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf Scala, python support https://issues.apache.org/jira/browse/SPARK-3789 Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665 LDA https://issues.apache.org/jira/browse/SPARK-1405
Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) GraphX-Basics V StructuralQueries Indegrees, vertices,… AttributeTransformers mapVertices, … StructureTransformers, Join Reverse, subgraph Connection Mining connectedComponents, triangles Algorithms Aggregate messages, PageRank InterestingObjects o EdgeTriplet o EdgeContext http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
GraphX-Basics o Computational Model o Directed MultiGraph o Directed - so in-degree & out-degree o MultiGraph – so multiple parallel edges between nodes including loops ! o Algorithms beware – can cyclic, loops o Property Graph o vertexID(64bit long) int o Need property, cannot ignore it o Vertex,Edge Parameterized over object types o Vertex[(VertexId,VD)], Edge[ED] o Attach user-defined objects to edges and vertices (ED/VD) Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
GraphX-Basics § For our hands-on we will run thru 2 Zeppelin notebooks (https://goo.gl/qCwZiq & https://goo.gl/EKHCFq): a. First we will work with the following Giraffe graph o … carefully chosen with interesting betweenness and properties o … for understanding the APIs b. Second, we will develop a retweet pipeline o With the AlphaGo twitter topic o 2 GB/330K tweets, 200K edges,… A D C E FG B 125 5 4.5 4.5 4 1.5 1.5 1 Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweenness val vertexList = List( (1L, Person("Alice", 18)), (2L, Person("Bernie", 17)), (3L, Person("Cruz", 15)), (4L, Person("Donald", 12)), (5L, Person("Ed", 15)), (6L, Person("Fran", 10)), (7L, Person("Genghis",854))) Good exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf Fun facts about centrality betweenness : It shows how many paths an edge is part of, i.e. relevancy. High centrality betweenness is the sign of a bottleneck, point of single failure – these edges need HA, probably need alternate paths for re routing, are susceptible to parasite infections and good candidates for a cut !
Build A Graph § There are 4 ways - GraphLoader.edgeListFile(…) - From RDDs (shown above) - fromEdgeTuples <- Id tuples - fromEdges <- edgeList
Graph API Landscape Separate algorithm from Graph Implementation Implementati on of analytic functions
Hint: Many details are documented in the object not in the class e.g. PartitionStrategy, Graph,…
Graphx Structure API
Community-Affiliation-Strengths § Applied in many ways § For example in Fraud & Security Applications § Triangle detection – for spam servers § The age of a community isrelated to the density of triangles - New community will have few triangles,then triangles start to form § Strong affiliation ie Heavy hitter = sqrt(m) degrees! - Heavy hitter triangle ! § Connected Communities– structure
Algorithms § Graph-Parallel Computation - aggregateMessages()Function - Pregel() (https://issues.apache.org/jira/browse/SPARK-5062) § pageRank() - Influential papers in a citation network - Influencer in retweet § staticPageRank() - Static no of iterations, dynamic tolerance – see the parameters (tol vs. numIter) § personalizedPageRank() - Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph – “People You May Know” § shortestPaths, SVD++ § LabelPropagation(LPA) as described by Raghavanet al in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks” - Computationally inexpensive way to Identify communities - Convergence not guaranteed - Might end up with trivial solution i.e. single community § SDVPlusPlus takes an RDD of Edges § The Global Clustering Coefficient, is better in that it always returns a number between 0 and 1 - For comparing connectnedness between different sized communities http://graphframes.github.io/user-guide.html
§ Versatile Function useful for implementing PageRank et al - Can be difficult at first, but easier to comprehend if treated as a combined map-reduce function (it was called MapReduceTriplets !! With a slightly different signature) o aggregateMessage[Msg]( o map(edgeContext=>mapFun,<- this can be up or downie sendToDst or Src! o redcuce(Msg,Msg) => reduceFun)
If you really want to know what is underneath the aggregateMessages()…
sendToDst vs sendToSrc
Pregel BSP API § Bulk SynchronousParallelMessagePassingAPI - Developed by Leslie Valiant1 - Synchronizeddistributedsuper steps - Super Step – local,in-memorycomputations § Computations on the vertex – “Think likea Vertex” § graphx.lib.LabelPropagationAlgorithmuses Pregelinternally § graphx.lib.PageRankuses Pragel(for the runUntilConvergenceWithOptionsversion,while it uses the aggregateMessagesfor the runWithOptions(staticPageRank)version) § Pregel is implementedsuccinctlywith the old mapReduceTriplets API 1http://delivery.acm.org/10.1145/80000/79181/p103- valiant.pdfhttp://www.slideshare.net/riyadparvez/pregel-35504069
Graphx Partitioning & Processing § Unlike a relationaltable,graph processing is contextual w.r.t a neighborhood - Maintain locality,equal size partitioning § Edge-cut : The communication& storage overhead of an edge-cut is directlyproportional to thenumber ofedges that are cut § Vertex-cut : Thecommunication and storage overhead ofa vertex-cut is directlyproportional to the sum of the number of machines spanned byeach vertex § Vertex-cut strategy by default (balance hot-spot issue due to power law/Zipf’s Law) w/ min replication ofedges § Batch processing/not streaming § Org.apache.spark.graphx.Graph. - partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int) - partitionBy(partitionStrategy: PartitionStrategy) - 4 PartitionStrategies 1) RandomVertexCut(usually the best) : random vertex cut that colocates all same- direction edges between two vertices (hashingthe source and destination vertexIDs) 2) CanonicalRandomVertexCut - a random vertexcut that colocates all edges between twovertices, regardless of direction (hashing the source and destination vertex IDs in a canonical direction) … remember GraphX is multi graph 3) EdgePartition1D - Assigns edges to partitions using only the source vertex ID, colocating edges withthe same source 4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf,https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
AlphaGo Tweets Analytics Pipeline Collect Data Transform & Extract Features GraphX Model GraphX Algorithms Store § Initiallyprimed data (7days from twitter) § Then used thesinceID to get incremental tweets § Use application authentication for higher rate § Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True § ~330K tweets (see Download program), 2GB § => MongoDB ~820 MB w/compression) § Rich data (see MongoDB) - Retweet, user ids, hash tags, locations et al § This exercise only covers the retweet interest network
1-Extract Tweets
2-Pipeline screen shots§ Get max tweet id § db.alphago.find().sort({id:-1}).limit(1).pretty() - "id" : NumberLong("709550216467185664"), § db.alphago.find().sort({id:+1}).limit(1).pretty() - "id" : NumberLong("709211132498714627") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt - 232296 - Min : "id" : NumberLong("705845567537221632") - Max : "id" : NumberLong("709211104719872000") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt - +21368 - Min : "id" : NumberLong("705845567537221632"), - Max : "id" : NumberLong("709550216467185664"), - Count : 253664 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt - +38452 - Min: "id_str" : "705845567537221632", - Max: "id" : NumberLong("709817136437403649") - Count : 292116 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt - +17331 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710312094227300352"), - Count : 309447 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt - +10797 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710883718903140353"), - Count : 320244 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt - +11511 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("711720090954108928"), - Count : 331755 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt - +5166 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("712410062820511744"), - Count : 336921
3-Twitter gives lots of fields in Mongo
4-Extract Retweet Fields & 5-Store in csv
6-Read as dataframe 7-Create Vertices, Edges & Objects 8-Finally the graph & run algorithms
The Art of an AlphaGo GraphX Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) [(VertexId , VD)] [(VertexId, VertexId, [(VertexId, VD)]ED)] Vertex VertexEdge
1 2 3 4 5
An excursion into Graph Analytics with Apache Spark GraphX

An excursion into Graph Analytics with Apache Spark GraphX

  • 1.
  • 2.
    1. Paco Nathan,Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs 2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark- Practitioners/dp/1484209656 3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache- spark-graph-processing 4. Spark GarphX in Action - Manning 5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/ 6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf 7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf - http://web.stanford.edu/class/cs246/handouts.html 8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm 9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx 10. Zeppelin Setup - http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark 11. Data - http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time - http://openflights.org/data.html 12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png Thanks to the Giants whose work helped to prepare this tutorial
  • 3.
    Agenda-Topic Time DataDescription 1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB, Introduction to Spark GraphX 2. Basics are Discussed & A Graph Is Built 10 Min Simple Graph Edge,Vertex, Graph 3. GraphX API Landscape Is discussed 5 Min Explain APIs 4. Graph Structures Are Examined 5 Min Indegree, outdegree et al 5. Community, Affiliation & Strengths Are Explored 5 Min Connected components, Triangle et al 6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement Algorithms viz. aggregateMessages, Pregel Type Signature - aggregateMessages Algorithms with aggregateMessages 7. Case Study (AlphaGo Tweets) Is Conducted 10 Min AlphaGo Retweet Data E2E Pipeline w/real data, Map attributes to properties, vertices & edges, create graph and run algorithms 8. Questions Are Asked & Answered 10 Min AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
  • 4.
    Graph Applications § Manyapplications from social network to understanding collaboration,diseases, routing logistics and others § Came across 3 interesting applications, as I was preparing the materials !! 1) Project effectiveness by social network analysis on the projects' collaboration structures : “EC is interested in the impact generated by those projects … analyzing this latent impact of TEL projects by applying social network analysis on the projects' collaboration structures … TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects and strong, sustained organizational collaboration ties.” 2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are characterized by social cliques in which individuals associate more with members of their own social clique than with those outside their clique … Individuals involved in weak, between-clique social interactions are hypothesized to serve as bridges by which an infection may enter a clique and, hence, may experience higher infection risk … individuals who engaged in more between-clique associations, that is, those with multiple weak ties, were more likely to be infected with gastrointestinal helminth parasites ... “ 3) Panama papers : The leak presented them with a wealth of information, millions of documents, but no guide to structure … (ie community detection)
  • 5.
    Graph based systems §Graphs are everywhere – web pages, social networks, people’s web browsingbehavior, logisticsetc. - Workingw/ graphshas become more important. And harder due to the scaleand complexityof algorithms § Two kinds– processingandquerying § Processing– GraphX, Pregel BSP, GraphLab § Querying– AllegroGraph,Titan, Neo4j, RDF stores. - SPARQL query language § Graph DBs have queries, canprocess large dataset § Graph processingcanrun complex algorithms § For Graph Analyticssystemsboth are required - Processon graphx, store in neo4j is a good alternative. - E.g. Panamapapers analytics
  • 6.
    Graph Processing Frameworks §History – Graph based systems were specializedinnature - Graph partitioningis hard - Primitives of record based systemswere different than what graphbased systems needed § Rapid innovationin distributed data processingframeworks– MapReduce etc - Disk based, with limited partitioningabilities - Still an impedancemismatch § Graph-parallelover Data ParallelRDD ! – Best of both worlds ? ü We will see, later, GrapnX’s PartitionStrategy
  • 7.
    Enter Spark§ Morepowerful partitioning mechanism § In-memory system makes iterative processing easier § GraphX – Graph processing built ontop of Spark § Graph processing at scale (distributed system) § Fast evolving project § § - - § § §
  • 8.
    GraphX …§ Graphprocessing at Scale - “Graph Parallel System” § Has a rich - Computational Model - Edges, Vertices, Property Graph - Bipartite & tripartite graphs (Users-Tags-Web pages) - Algorithms (PageRank,…) § Current focus on computationthan query § APIs include : § Attribute Transformers, § Structure transformers, § Connection Mining & Algorithms § GraphFrames – interesting development § Exercises (lotsof interesting graph data) - Airline data,co-occurrence &co-citation from papers - The AlphaGo Community – ReTweet network - Wikipedia Page Rank Analysisusing GraphX Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf Scala, python support https://issues.apache.org/jira/browse/SPARK-3789 Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665 LDA https://issues.apache.org/jira/browse/SPARK-1405
  • 9.
    Vertex Vertex(VertexId,VD) VDcan be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) GraphX-Basics V StructuralQueries Indegrees, vertices,… AttributeTransformers mapVertices, … StructureTransformers, Join Reverse, subgraph Connection Mining connectedComponents, triangles Algorithms Aggregate messages, PageRank InterestingObjects o EdgeTriplet o EdgeContext http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
  • 10.
    GraphX-Basics o Computational Model oDirected MultiGraph o Directed - so in-degree & out-degree o MultiGraph – so multiple parallel edges between nodes including loops ! o Algorithms beware – can cyclic, loops o Property Graph o vertexID(64bit long) int o Need property, cannot ignore it o Vertex,Edge Parameterized over object types o Vertex[(VertexId,VD)], Edge[ED] o Attach user-defined objects to edges and vertices (ED/VD) Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
  • 11.
    GraphX-Basics § For ourhands-on we will run thru 2 Zeppelin notebooks (https://goo.gl/qCwZiq & https://goo.gl/EKHCFq): a. First we will work with the following Giraffe graph o … carefully chosen with interesting betweenness and properties o … for understanding the APIs b. Second, we will develop a retweet pipeline o With the AlphaGo twitter topic o 2 GB/330K tweets, 200K edges,… A D C E FG B 125 5 4.5 4.5 4 1.5 1.5 1 Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweenness val vertexList = List( (1L, Person("Alice", 18)), (2L, Person("Bernie", 17)), (3L, Person("Cruz", 15)), (4L, Person("Donald", 12)), (5L, Person("Ed", 15)), (6L, Person("Fran", 10)), (7L, Person("Genghis",854))) Good exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf Fun facts about centrality betweenness : It shows how many paths an edge is part of, i.e. relevancy. High centrality betweenness is the sign of a bottleneck, point of single failure – these edges need HA, probably need alternate paths for re routing, are susceptible to parasite infections and good candidates for a cut !
  • 12.
    Build A Graph §There are 4 ways - GraphLoader.edgeListFile(…) - From RDDs (shown above) - fromEdgeTuples <- Id tuples - fromEdges <- edgeList
  • 13.
    Graph API Landscape Separatealgorithm from Graph Implementation Implementati on of analytic functions
  • 14.
    Hint: Many details aredocumented in the object not in the class e.g. PartitionStrategy, Graph,…
  • 15.
  • 16.
    Community-Affiliation-Strengths § Applied inmany ways § For example in Fraud & Security Applications § Triangle detection – for spam servers § The age of a community isrelated to the density of triangles - New community will have few triangles,then triangles start to form § Strong affiliation ie Heavy hitter = sqrt(m) degrees! - Heavy hitter triangle ! § Connected Communities– structure
  • 17.
    Algorithms § Graph-Parallel Computation -aggregateMessages()Function - Pregel() (https://issues.apache.org/jira/browse/SPARK-5062) § pageRank() - Influential papers in a citation network - Influencer in retweet § staticPageRank() - Static no of iterations, dynamic tolerance – see the parameters (tol vs. numIter) § personalizedPageRank() - Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph – “People You May Know” § shortestPaths, SVD++ § LabelPropagation(LPA) as described by Raghavanet al in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks” - Computationally inexpensive way to Identify communities - Convergence not guaranteed - Might end up with trivial solution i.e. single community § SDVPlusPlus takes an RDD of Edges § The Global Clustering Coefficient, is better in that it always returns a number between 0 and 1 - For comparing connectnedness between different sized communities http://graphframes.github.io/user-guide.html
  • 18.
    § Versatile Functionuseful for implementing PageRank et al - Can be difficult at first, but easier to comprehend if treated as a combined map-reduce function (it was called MapReduceTriplets !! With a slightly different signature) o aggregateMessage[Msg]( o map(edgeContext=>mapFun,<- this can be up or downie sendToDst or Src! o redcuce(Msg,Msg) => reduceFun)
  • 19.
    If you reallywant to know what is underneath the aggregateMessages()…
  • 20.
  • 21.
    Pregel BSP API §Bulk SynchronousParallelMessagePassingAPI - Developed by Leslie Valiant1 - Synchronizeddistributedsuper steps - Super Step – local,in-memorycomputations § Computations on the vertex – “Think likea Vertex” § graphx.lib.LabelPropagationAlgorithmuses Pregelinternally § graphx.lib.PageRankuses Pragel(for the runUntilConvergenceWithOptionsversion,while it uses the aggregateMessagesfor the runWithOptions(staticPageRank)version) § Pregel is implementedsuccinctlywith the old mapReduceTriplets API 1http://delivery.acm.org/10.1145/80000/79181/p103- valiant.pdfhttp://www.slideshare.net/riyadparvez/pregel-35504069
  • 22.
    Graphx Partitioning &Processing § Unlike a relationaltable,graph processing is contextual w.r.t a neighborhood - Maintain locality,equal size partitioning § Edge-cut : The communication& storage overhead of an edge-cut is directlyproportional to thenumber ofedges that are cut § Vertex-cut : Thecommunication and storage overhead ofa vertex-cut is directlyproportional to the sum of the number of machines spanned byeach vertex § Vertex-cut strategy by default (balance hot-spot issue due to power law/Zipf’s Law) w/ min replication ofedges § Batch processing/not streaming § Org.apache.spark.graphx.Graph. - partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int) - partitionBy(partitionStrategy: PartitionStrategy) - 4 PartitionStrategies 1) RandomVertexCut(usually the best) : random vertex cut that colocates all same- direction edges between two vertices (hashingthe source and destination vertexIDs) 2) CanonicalRandomVertexCut - a random vertexcut that colocates all edges between twovertices, regardless of direction (hashing the source and destination vertex IDs in a canonical direction) … remember GraphX is multi graph 3) EdgePartition1D - Assigns edges to partitions using only the source vertex ID, colocating edges withthe same source 4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf,https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
  • 23.
    AlphaGo Tweets AnalyticsPipeline Collect Data Transform & Extract Features GraphX Model GraphX Algorithms Store § Initiallyprimed data (7days from twitter) § Then used thesinceID to get incremental tweets § Use application authentication for higher rate § Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True § ~330K tweets (see Download program), 2GB § => MongoDB ~820 MB w/compression) § Rich data (see MongoDB) - Retweet, user ids, hash tags, locations et al § This exercise only covers the retweet interest network
  • 24.
  • 25.
    2-Pipeline screen shots§Get max tweet id § db.alphago.find().sort({id:-1}).limit(1).pretty() - "id" : NumberLong("709550216467185664"), § db.alphago.find().sort({id:+1}).limit(1).pretty() - "id" : NumberLong("709211132498714627") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt - 232296 - Min : "id" : NumberLong("705845567537221632") - Max : "id" : NumberLong("709211104719872000") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt - +21368 - Min : "id" : NumberLong("705845567537221632"), - Max : "id" : NumberLong("709550216467185664"), - Count : 253664 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt - +38452 - Min: "id_str" : "705845567537221632", - Max: "id" : NumberLong("709817136437403649") - Count : 292116 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt - +17331 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710312094227300352"), - Count : 309447 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt - +10797 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710883718903140353"), - Count : 320244 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt - +11511 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("711720090954108928"), - Count : 331755 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt - +5166 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("712410062820511744"), - Count : 336921
  • 26.
    3-Twitter gives lotsof fields in Mongo
  • 27.
  • 28.
    6-Read as dataframe 7-CreateVertices, Edges & Objects 8-Finally the graph & run algorithms
  • 29.
    The Art ofan AlphaGo GraphX Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) [(VertexId , VD)] [(VertexId, VertexId, [(VertexId, VD)]ED)] Vertex VertexEdge
  • 30.