Graph Gurus 29 Using Graph Algorithms for Advanced Analytics Part 3 - Community Detection 1
© 2020 TigerGraph. All Rights Reserved Today's Presenter 2 Victor Lee Head of Product Strategy & Developer Relations ● BS in Electrical Engineering and Computer Science from UC Berkeley, MS in Electrical Engineering from Stanford University ● PhD in Computer Science from Kent State University focused on graph data mining ● 20+ years in tech industry
© 2020 TigerGraph. All Rights Reserved Some Housekeeping Items ● Although your phone is muted we do want to answer your questions - submit your questions at any time using the Q&A tab in the menu ● The webinar is being recorded and will uploaded to our website shortly (https://www.tigergraph.com/webinars/) and the URL will be emailed you ● If you have issues with Zoom please contact the panelists via chat 3
© 2020 TigerGraph. All Rights Reserved Move Faster with TigerGraph Cloud 4 Built for agile teams who would rather build innovative applications than procure hardware or configure and manage databases ● Start for free ● Move to production with distributed data and HA replication
© 2020 TigerGraph. All Rights Reserved Today’s Outline 5 1 3 2 Recap of Parts 1 & 2: Path and Centrality Graph Algorithms Community Detection Algorithms Clustering vs. Partitioning Strict vs. Lenient Rules What's a Community? Who's in and who's out? 4 Demo Running and modifying GSQL Community Detection Algorithms
© 2020 TigerGraph. All Rights Reserved Review: Analytics with Graph Algorithms ● Graph algorithms answer fundamental questions about connected data ● Each algorithm in a library is tool in an analytics toolkit ● Building blocks for more complex business questions 6 Specialized functions Combine to make something better
© 2020 TigerGraph. All Rights Reserved Example Questions/Analyses for Graph Algorithms Which entity is most centrally located? ● For delivery logistics or greatest visibility ● Closeness Centrality, Betweenness Centrality algorithms 7 How much influence does this entity exert over the others? ● For market penetration & buyer influence ● PageRank algorithm Which entity has similar relationships to this entity? ● For grouping customers, products, etc. ● Cosine Similarity, SimRank, RoleSim algorithms What are the natural community groupings in the graph? ● For partitioning risk groups, workgroups, product offerings, etc. ● Community Detection, MinCut algorithms
© 2020 TigerGraph. All Rights Reserved Summary for Shortest Path Algorithms 8 1 4 3 Graph Algorithms - tools and building blocks for analyzing graph data GSQL Algorithm Library - runs in-database, high-performance, easy to read and modify Shortest Path Algorithms - different algorithms for weighted and unweighted graphs 2 Learning To Use Algorithms - know what problem they solve, pros and cons
© 2020 TigerGraph. All Rights Reserved Summary for Centrality Algorithms 9 1 4 3 Centrality Algorithms - abstract concepts of location and travel Customizing GSQL Library algorithms is easy and familiar, like procedural SQL PageRank - uses directed referral edges to find the most influential nodes. Personalized PageRank is localized. 2 Closeness and Betweenness use shortest paths. Betweenness is more complex.
© 2020 TigerGraph. All Rights Reserved Some Types of Graph Algorithms ● Search ● Path Finding & Analytics ● Centrality / Ranking ● Clustering / Community Detection ● Similarity ● Classification 10
© 2020 TigerGraph. All Rights Reserved 11 How do I find the most influential provider in each region for a particular medical condition? Whole-Graph Compute problem 1. Analyze claims data to identify referral relationships among providers (Time Series Analysis) 2. Create subsets of claims around each condition with a group of healthcare codes (e.g. CPT codes) for each region (e.g. local healthcare market) 3. Utilize PageRank to score hubs within each market Dr. Thomas Condition: Diabetes Healthcare Market: S. San Jose, CA Hub Identified: Dr. Thomas Best Practice: Know Graph Algorithms
© 2020 TigerGraph. All Rights Reserved 12 Who is influenced by these leaders (e.g. other doctors, chiropractors, physical therapists, facilities)? Utilize Community Detection 1. Identify communities of providers around each hub for each region and for a specific condition 2. Track changes over time to detect significant shifts in communities Dr. Thomas Condition: Diabetes Healthcare Market: S. San Jose, CA Hub Identified: Dr. Thomas Community Detected: Diabetes – S. San Jose – Dr. Thomas
© 2020 TigerGraph. All Rights Reserved What's a Community? What's your definition? 13 ● People who live in the same neighborhood? ● Entities which interact with one another, in a mutual relationship ● May have things in common (similarity), but that is not the key factor.
© 2020 TigerGraph. All Rights Reserved Community Detection ● Deciding who is in a community… and who isn't. ● If items are already labeled with their community membership, then your work is done! ● If no rules, the we need to make or discover our own rules, based on level of interaction. 14
© 2020 TigerGraph. All Rights Reserved Communities can be Clusters or Partitions ● Partitioning puts each item in exactly one group, even if the justification is sketchy. 15 ● Clustering sends up natural boundaries. Some items can be outside. ?
© 2020 TigerGraph. All Rights Reserved A Spectrum of Community Detection Rules 16 ● Direct connection to 1+ other member of the community ● Connected Component ● Direct connection to K+ other members of the community ● K-Core ● Direct connection to every member to the community ● Clique K = 2
© 2020 TigerGraph. All Rights Reserved Directed vs. Undirected Connections 17 Connected Components Strongly Connected Components A component is connected if there is a path from every vertex to every vertex. ● If edges are directed, we say the component is strongly connected. ● If edges are undirected, we say it's (weakly) connected.
© 2020 TigerGraph. All Rights Reserved Relative Density - Louvain Modularity 18 Modularity is a measure of how "good" is the partitioning of a graph = (fraction of the edges that fall within the given groups) minus (the expected fraction if edges were distributed at random) ● Researchers at the University of Louvain developed an especially efficient method for finding the partitioning with the best Q score. Which has a higher modularity?
© 2020 TigerGraph. All Rights Reserved More Applications for Community Detection ● Marketing/Customer Analytics: find who is chatting with whom, to understand who is being reached and to improve targeting. ● Biosciences: identify natural communities of interaction at the molecular, tissue, organism, and species levels. ● Financial analytics: discover clusters of transactions or contracts, to understand and predict market dynamics, uncover illicit activity. 19 Specialized plant biochemistry drives gene clustering in fungi. bioRxiv 184242; doi: https://doi.org/10.1101/184242 http://adrianland.uk/social-media/my-l inkedin-social-graph/ Regionalism and Overlap in Investment Treaty Law – Towards Consolidation or Contradiction? J of Intl Econ Law 17(2), pp. 271-298 (2014)
DEMO GSQL Graph Algorithms in TigerGraph Cloud 20
© 2020 TigerGraph. All Rights Reserved Datasets 1. Drug Prescribers (104 vertices, 417 directed edges) ○ Demonstrate healthcare use case, visualize results 2. Kangaroos (17 vertices, 91 undirected edges) ○ to demonstrate k-core visually ○ http://konect.uni-koblenz.de/networks/moreno_kangaroo 3. Flickr Images (105,938 vertices, 2,316,948 undirected edges) ○ Demonstrate scalability of algorithms ○ http://konect.uni-koblenz.de/networks/flickrEdges 21
© 2020 TigerGraph. All Rights Reserved Data Preparation ● Graph algorithms out-of-the-box assume you have 1 type of vertex and 1 type of edge directly connecting them: ● In our Healthcare Prescriber data set, we didn't yet have direct Prescriber-Prescriber relationships. ● We ran a pre-processing query to induce direct Referral relationships: 22 vanilla vanilla referral claims patient prescriber date1 < date2
© 2020 TigerGraph. All Rights Reserved GSQL Graph Algorithm Library ● Written in GSQL - high-level, parallelized ● Open-source, user-extensible ● Well-documented 23 docs.tigergraph.com/graph-algorithm-library
© 2020 TigerGraph. All Rights Reserved TigerGraph GSQL Graph Algorithm Library ✓ Call each algorithm as a GSQL query or as a RESTful endpoint ✓ Run the algorithms in-database (don't export the data) ✓ Option to update the graph with the algorithm results ✓ Able to modify/customize the algorithms. Turing-complete language. ✓ Massively parallel processing to handle big graphs 24
© 2020 TigerGraph. All Rights Reserved Summary 25 1 3 2 Community Detection Algorithms Use connectedness to decide boundaries Strict vs. Lenient Community Rules Black&white rules are not always helpful. Louvain uses relative density. Communities are Clusters, not Partitions Don't have to include everyone. Can overlap? 4 Pre- or Post- step with other algorithms Many algorithms assume you start from just one connected community
Q&A Please submit your questions via the Q&A tab in Zoom 26
© 2020 TigerGraph. All Rights Reserved More Questions? Join our Developer Forum https://groups.google.com/a/opengsql.org/forum/#!forum/gsql-users Sign up for our Developer Office Hours (every Thursday at 11 AM PST) https://info.tigergraph.com/officehours 27
© 2020 TigerGraph. All Rights Reserved Additional Resources Start Free at TigerGraph Cloud Today! https://www.tigergraph.com/cloud/ Test Drive Online Demo https://www.tigergraph.com/demo Download the Developer Edition https://www.tigergraph.com/download/ Guru Scripts https://github.com/tigergraph/ecosys/tree/master/guru_scripts 28
© 2020 TigerGraph. All Rights Reserved Upcoming Graph Guru Events 29 Coming to Atlanta, Charlotte, London, and more. View all events and request your own here: https://www.tigergraph.com/graphguruscomestoyou/ Supply Chain Optimisation using Native Parallel Graphs February 26 at 8am PST https://info.tigergraph.com/supply-chain-optimisation
Thank You 30

Using Graph Algorithms for Advanced Analytics - Part 2 Centrality

  • 1.
    Graph Gurus 29 UsingGraph Algorithms for Advanced Analytics Part 3 - Community Detection 1
  • 2.
    © 2020 TigerGraph.All Rights Reserved Today's Presenter 2 Victor Lee Head of Product Strategy & Developer Relations ● BS in Electrical Engineering and Computer Science from UC Berkeley, MS in Electrical Engineering from Stanford University ● PhD in Computer Science from Kent State University focused on graph data mining ● 20+ years in tech industry
  • 3.
    © 2020 TigerGraph.All Rights Reserved Some Housekeeping Items ● Although your phone is muted we do want to answer your questions - submit your questions at any time using the Q&A tab in the menu ● The webinar is being recorded and will uploaded to our website shortly (https://www.tigergraph.com/webinars/) and the URL will be emailed you ● If you have issues with Zoom please contact the panelists via chat 3
  • 4.
    © 2020 TigerGraph.All Rights Reserved Move Faster with TigerGraph Cloud 4 Built for agile teams who would rather build innovative applications than procure hardware or configure and manage databases ● Start for free ● Move to production with distributed data and HA replication
  • 5.
    © 2020 TigerGraph.All Rights Reserved Today’s Outline 5 1 3 2 Recap of Parts 1 & 2: Path and Centrality Graph Algorithms Community Detection Algorithms Clustering vs. Partitioning Strict vs. Lenient Rules What's a Community? Who's in and who's out? 4 Demo Running and modifying GSQL Community Detection Algorithms
  • 6.
    © 2020 TigerGraph.All Rights Reserved Review: Analytics with Graph Algorithms ● Graph algorithms answer fundamental questions about connected data ● Each algorithm in a library is tool in an analytics toolkit ● Building blocks for more complex business questions 6 Specialized functions Combine to make something better
  • 7.
    © 2020 TigerGraph.All Rights Reserved Example Questions/Analyses for Graph Algorithms Which entity is most centrally located? ● For delivery logistics or greatest visibility ● Closeness Centrality, Betweenness Centrality algorithms 7 How much influence does this entity exert over the others? ● For market penetration & buyer influence ● PageRank algorithm Which entity has similar relationships to this entity? ● For grouping customers, products, etc. ● Cosine Similarity, SimRank, RoleSim algorithms What are the natural community groupings in the graph? ● For partitioning risk groups, workgroups, product offerings, etc. ● Community Detection, MinCut algorithms
  • 8.
    © 2020 TigerGraph.All Rights Reserved Summary for Shortest Path Algorithms 8 1 4 3 Graph Algorithms - tools and building blocks for analyzing graph data GSQL Algorithm Library - runs in-database, high-performance, easy to read and modify Shortest Path Algorithms - different algorithms for weighted and unweighted graphs 2 Learning To Use Algorithms - know what problem they solve, pros and cons
  • 9.
    © 2020 TigerGraph.All Rights Reserved Summary for Centrality Algorithms 9 1 4 3 Centrality Algorithms - abstract concepts of location and travel Customizing GSQL Library algorithms is easy and familiar, like procedural SQL PageRank - uses directed referral edges to find the most influential nodes. Personalized PageRank is localized. 2 Closeness and Betweenness use shortest paths. Betweenness is more complex.
  • 10.
    © 2020 TigerGraph.All Rights Reserved Some Types of Graph Algorithms ● Search ● Path Finding & Analytics ● Centrality / Ranking ● Clustering / Community Detection ● Similarity ● Classification 10
  • 11.
    © 2020 TigerGraph.All Rights Reserved 11 How do I find the most influential provider in each region for a particular medical condition? Whole-Graph Compute problem 1. Analyze claims data to identify referral relationships among providers (Time Series Analysis) 2. Create subsets of claims around each condition with a group of healthcare codes (e.g. CPT codes) for each region (e.g. local healthcare market) 3. Utilize PageRank to score hubs within each market Dr. Thomas Condition: Diabetes Healthcare Market: S. San Jose, CA Hub Identified: Dr. Thomas Best Practice: Know Graph Algorithms
  • 12.
    © 2020 TigerGraph.All Rights Reserved 12 Who is influenced by these leaders (e.g. other doctors, chiropractors, physical therapists, facilities)? Utilize Community Detection 1. Identify communities of providers around each hub for each region and for a specific condition 2. Track changes over time to detect significant shifts in communities Dr. Thomas Condition: Diabetes Healthcare Market: S. San Jose, CA Hub Identified: Dr. Thomas Community Detected: Diabetes – S. San Jose – Dr. Thomas
  • 13.
    © 2020 TigerGraph.All Rights Reserved What's a Community? What's your definition? 13 ● People who live in the same neighborhood? ● Entities which interact with one another, in a mutual relationship ● May have things in common (similarity), but that is not the key factor.
  • 14.
    © 2020 TigerGraph.All Rights Reserved Community Detection ● Deciding who is in a community… and who isn't. ● If items are already labeled with their community membership, then your work is done! ● If no rules, the we need to make or discover our own rules, based on level of interaction. 14
  • 15.
    © 2020 TigerGraph.All Rights Reserved Communities can be Clusters or Partitions ● Partitioning puts each item in exactly one group, even if the justification is sketchy. 15 ● Clustering sends up natural boundaries. Some items can be outside. ?
  • 16.
    © 2020 TigerGraph.All Rights Reserved A Spectrum of Community Detection Rules 16 ● Direct connection to 1+ other member of the community ● Connected Component ● Direct connection to K+ other members of the community ● K-Core ● Direct connection to every member to the community ● Clique K = 2
  • 17.
    © 2020 TigerGraph.All Rights Reserved Directed vs. Undirected Connections 17 Connected Components Strongly Connected Components A component is connected if there is a path from every vertex to every vertex. ● If edges are directed, we say the component is strongly connected. ● If edges are undirected, we say it's (weakly) connected.
  • 18.
    © 2020 TigerGraph.All Rights Reserved Relative Density - Louvain Modularity 18 Modularity is a measure of how "good" is the partitioning of a graph = (fraction of the edges that fall within the given groups) minus (the expected fraction if edges were distributed at random) ● Researchers at the University of Louvain developed an especially efficient method for finding the partitioning with the best Q score. Which has a higher modularity?
  • 19.
    © 2020 TigerGraph.All Rights Reserved More Applications for Community Detection ● Marketing/Customer Analytics: find who is chatting with whom, to understand who is being reached and to improve targeting. ● Biosciences: identify natural communities of interaction at the molecular, tissue, organism, and species levels. ● Financial analytics: discover clusters of transactions or contracts, to understand and predict market dynamics, uncover illicit activity. 19 Specialized plant biochemistry drives gene clustering in fungi. bioRxiv 184242; doi: https://doi.org/10.1101/184242 http://adrianland.uk/social-media/my-l inkedin-social-graph/ Regionalism and Overlap in Investment Treaty Law – Towards Consolidation or Contradiction? J of Intl Econ Law 17(2), pp. 271-298 (2014)
  • 20.
    DEMO GSQL Graph Algorithmsin TigerGraph Cloud 20
  • 21.
    © 2020 TigerGraph.All Rights Reserved Datasets 1. Drug Prescribers (104 vertices, 417 directed edges) ○ Demonstrate healthcare use case, visualize results 2. Kangaroos (17 vertices, 91 undirected edges) ○ to demonstrate k-core visually ○ http://konect.uni-koblenz.de/networks/moreno_kangaroo 3. Flickr Images (105,938 vertices, 2,316,948 undirected edges) ○ Demonstrate scalability of algorithms ○ http://konect.uni-koblenz.de/networks/flickrEdges 21
  • 22.
    © 2020 TigerGraph.All Rights Reserved Data Preparation ● Graph algorithms out-of-the-box assume you have 1 type of vertex and 1 type of edge directly connecting them: ● In our Healthcare Prescriber data set, we didn't yet have direct Prescriber-Prescriber relationships. ● We ran a pre-processing query to induce direct Referral relationships: 22 vanilla vanilla referral claims patient prescriber date1 < date2
  • 23.
    © 2020 TigerGraph.All Rights Reserved GSQL Graph Algorithm Library ● Written in GSQL - high-level, parallelized ● Open-source, user-extensible ● Well-documented 23 docs.tigergraph.com/graph-algorithm-library
  • 24.
    © 2020 TigerGraph.All Rights Reserved TigerGraph GSQL Graph Algorithm Library ✓ Call each algorithm as a GSQL query or as a RESTful endpoint ✓ Run the algorithms in-database (don't export the data) ✓ Option to update the graph with the algorithm results ✓ Able to modify/customize the algorithms. Turing-complete language. ✓ Massively parallel processing to handle big graphs 24
  • 25.
    © 2020 TigerGraph.All Rights Reserved Summary 25 1 3 2 Community Detection Algorithms Use connectedness to decide boundaries Strict vs. Lenient Community Rules Black&white rules are not always helpful. Louvain uses relative density. Communities are Clusters, not Partitions Don't have to include everyone. Can overlap? 4 Pre- or Post- step with other algorithms Many algorithms assume you start from just one connected community
  • 26.
    Q&A Please submit yourquestions via the Q&A tab in Zoom 26
  • 27.
    © 2020 TigerGraph.All Rights Reserved More Questions? Join our Developer Forum https://groups.google.com/a/opengsql.org/forum/#!forum/gsql-users Sign up for our Developer Office Hours (every Thursday at 11 AM PST) https://info.tigergraph.com/officehours 27
  • 28.
    © 2020 TigerGraph.All Rights Reserved Additional Resources Start Free at TigerGraph Cloud Today! https://www.tigergraph.com/cloud/ Test Drive Online Demo https://www.tigergraph.com/demo Download the Developer Edition https://www.tigergraph.com/download/ Guru Scripts https://github.com/tigergraph/ecosys/tree/master/guru_scripts 28
  • 29.
    © 2020 TigerGraph.All Rights Reserved Upcoming Graph Guru Events 29 Coming to Atlanta, Charlotte, London, and more. View all events and request your own here: https://www.tigergraph.com/graphguruscomestoyou/ Supply Chain Optimisation using Native Parallel Graphs February 26 at 8am PST https://info.tigergraph.com/supply-chain-optimisation
  • 30.