Graph Database Use Cases
Presented by: William McKnight
“#1 Global Influencer in Big Data” Thinkers360
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET
2023 Advanced Analytics Topics
1. 2023 Trends in Enterprise Analytics
2. Showing ROI for your Analytic Project
3. Architecture, Products and Total Cost of Ownership of the Leading Machine Learning Stacks
4. Competitive Analytic Architectures: Comparing the Data Mesh, Data Fabric, Data Lakehouse and Data Cloud
5. Why Analytics Leaders deploy Master Data Management
6. What Does Information Management Maturity Look Like in 2023
7. Understanding the Modern Applications of Graph Databases
8. Common Misconceptions About Master Data Management
9. Organizational Change Management: Will it Hold Back Artificial Intelligence Deployments?
10. Open-Source vs Commercial Vendor Software in the Enterprise
11. Data Quality: The ROI of Adding Intelligence to Data
12. Strategies for Machine Learning Success
Relational DBs Can’t Handle Data Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with the number and levels of relationships, and with database size
• Query complexity grows with the need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
Result: slow development, poor performance, low scalability, hard to maintain
… making traditional databases inappropriate when data relationships are valuable in real time
Use the Right Database for the Right Job
Graph databases are designed for data relationships
• Other NoSQL, Relational DBMS: discrete data, minimally connected data
• Graph DB: connected data, focused on data relationships
• Development benefits: model maintenance
• Deployment benefits: performance, minimal resource usage
What Can Be Vertices?
• Things
  – Bank accounts
  – Customer accounts
    • Mobile phones
  – Products
  – Trading networks, auctions
  – Water, power, gas grids
  – Disease, drugs, molecules
    • Interactions, transmission
  – Insurance policies
  – Machines, servers, URLs
  – Sensor networks
• People
  – Customers, families
  – Employees
  – Affinity groups, clubs
    • Politics, causes, doctors
    • Professionals (LinkedIn)
  – Companies, institutions
• Places
  – Map locations
    • Cities, landmarks
  – Retail stores
  – Houses or buildings
  – Communication networks
  – Transportation hubs
    • Airports, shipping lanes, etc.
What Can Be Edges?
• People
  – Relationships
  – Ideas, preferences
  – Email, phone calls, SMS, IM
  – Collaborations
• Places
  – Roads, routes, railways
  – Water, power, gas, pipelines, telephone lines
  – Anything with GPS coordinates
• Things
  – Events
  – Money Transactions
  – Purchases
  – Pressure
  – Diseases
  – Contraband
  – URLs
  – Phone calls
  – Citations
  – Weights, scores
  – Timestamps
Actions
Model actions depending on what you want as vertices:
  (Bill)-[:SENT]->(email)-[:TO]->(Jim)
OR
  (Bill)-[:EMAILED]->(Jim)
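The two modeling options above can be compared with a minimal in-memory sketch. It is plain Python, not a graph database API, and the names `Edge`, `recipients_via_vertex` and `recipients_via_edge` are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """A directed, labeled edge (hypothetical helper type)."""
    source: str
    label: str
    target: str


# Option 1: the email is its own vertex, so it could carry properties
# (subject, timestamp) and link to several recipients.
email_as_vertex = [
    Edge("Bill", "SENT", "email"),
    Edge("email", "TO", "Jim"),
]

# Option 2: the action is collapsed into a single edge between people.
email_as_edge = [Edge("Bill", "EMAILED", "Jim")]


def recipients_via_vertex(edges, sender):
    """Two-hop traversal: follow SENT, then TO."""
    emails = {e.target for e in edges
              if e.source == sender and e.label == "SENT"}
    return sorted(e.target for e in edges
                  if e.source in emails and e.label == "TO")


def recipients_via_edge(edges, sender):
    """One-hop traversal: follow EMAILED."""
    return sorted(e.target for e in edges
                  if e.source == sender and e.label == "EMAILED")
```

Both functions answer "who did Bill email?"; the first costs an extra hop but leaves room to describe the email itself.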
Property Graph: The Domain Model
Semantic/RDF/Knowledge Graphs
• A triple is a data entity composed of subject-predicate-object
  – “Bob is 35”
  – “Bob knows Fred”
  – “William likes running”
• In the image:
  – Subject: John R Peterson; Predicate: Knows; Object: Frank T Smith
  – Subject: Triple #1; Predicate: Confidence Percent; Object: 70
  – Subject: Triple #1; Predicate: Provenance; Object: Mary L Jones
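The triples in the image, including the two statements made about Triple #1, can be sketched in plain Python. This is an illustration of the subject-predicate-object idea only, not RDF syntax or any triple-store API; `Triple`, `describe` and the identifier `triple1` are hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# The base fact from the image, stored under an id so that other
# triples can use it as their subject.
facts = {"triple1": Triple("John R Peterson", "Knows", "Frank T Smith")}

# Statements about the triple itself: its confidence and provenance.
statements_about_facts = [
    Triple("triple1", "Confidence Percent", "70"),
    Triple("triple1", "Provenance", "Mary L Jones"),
]


def describe(triple_id, statements):
    """Collect everything said about one triple as predicate -> object."""
    return {t.predicate: t.obj for t in statements if t.subject == triple_id}
```

Calling `describe("triple1", statements_about_facts)` gathers the confidence and provenance attached to the "Knows" fact.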
Graph Visualization
Graph Algorithms
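PageRank
• Starting scores: Page A 1.0, Page B 1.0, Page C 1.0, Page D 1.0
• Each page passes 0.85 of its score along its outgoing links, split evenly: 1*0.85/2 per link for a page with two links, 1*0.85 for a page with one link
• New score = sum of inputs + 0.15
• Sources: http://www.whitelines.nl/html/google-page-rank.html (see spreadsheet), http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm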
PageRank: Results After the 1st Iteration
• Page A: +0.150, +0.850 from Page C; total 1.000
• Page B: +0.150, +0.425 from Page A; total 0.575
• Page C: +0.150, +0.850 from Page D, +0.850 from Page B, +0.425 from Page A; total 2.275
• Page D: +0.150; total 0.150
• Link contributions: 1*0.85/2, 1*0.85/2, 1*0.85, 1*0.85, 1*0.85
• Source: http://www.whitelines.nl/html/google-page-rank.html (see spreadsheet)
PageRank Iterations
End of iteration   A result   B result   C result   D result
 1                 1.000      0.575      2.275      0.150
 2                 2.084      0.575      1.191      0.150
 3                 1.163      1.036      1.652      0.150
 4                 1.554      0.644      1.652      0.150
 5                 1.554      0.810      1.485      0.150
 6                 1.413      0.810      1.627      0.150
 7                 1.533      0.750      1.567      0.150
 8                 1.482      0.801      1.567      0.150
 9                 1.482      0.780      1.588      0.150
10                 1.500      0.780      1.570      0.150
11                 1.485      0.788      1.578      0.150
12                 1.491      0.781      1.578      0.150
13                 1.491      0.784      1.575      0.150
14                 1.489      0.784      1.577      0.150
15                 1.491      0.783      1.576      0.150
16                 1.490      0.784      1.576      0.150
17                 1.490      0.783      1.577      0.150
18                 1.490      0.783      1.576      0.150
19                 1.490      0.783      1.577      0.150
20                 1.490      0.783      1.577      0.150
PageRank: 20 Iterations Until Convergence
• Page A: 1.49
• Page B: 0.78
• Page C: 1.58 (most important web page)
• Page D: 0.15
• Page C increases Page A’s importance
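The PageRank walkthrough can be reproduced with a short Python sketch. The link structure (A to B and C, B to C, C to A, D to C) is inferred from the contributions listed for the first iteration; the names `LINKS`, `pagerank_step` and `pagerank` are hypothetical. Every page is updated from the previous iteration's scores, which is what the iteration table appears to do.

```python
# Link structure inferred from the first-iteration contributions:
# A -> B, A -> C, B -> C, C -> A, D -> C. (An assumption, not stated
# explicitly in the slides.)
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
DAMPING = 0.85


def pagerank_step(scores, links=LINKS, damping=DAMPING):
    """One iteration: new score = sum of inputs + (1 - damping)."""
    new_scores = {page: 1.0 - damping for page in scores}
    for page, targets in links.items():
        # A page splits damping * score evenly across its outgoing links.
        share = damping * scores[page] / len(targets)
        for target in targets:
            new_scores[target] += share
    return new_scores


def pagerank(links=LINKS, iterations=20, damping=DAMPING):
    """Return the scores after each iteration, starting from 1.0 per page."""
    scores = {page: 1.0 for page in links}
    history = []
    for _ in range(iterations):
        scores = pagerank_step(scores, links, damping)
        history.append(scores)
    return history


if __name__ == "__main__":
    for i, row in enumerate(pagerank(), start=1):
        print(i, " ".join(f"{page}={row[page]:.3f}" for page in "ABCD"))
```

Running the script prints one row per iteration, which can be checked against the table above.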
Betweenness
• Find bridges across different communities
• High score = edge links different communities
(Figure: two bridge vertices highlighted)
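As an illustration of how bridge vertices get high betweenness scores, here is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph. The slide does not say which algorithm or product computes the score, so this is one common choice, and the sample graph `TWO_COMMUNITIES` is hypothetical: two triangles joined by a single edge between `c` and `d`.

```python
from collections import deque

# Hypothetical example: triangle a-b-c and triangle d-e-f, bridged by c-d.
TWO_COMMUNITIES = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}


def betweenness(graph):
    """Unnormalized vertex betweenness for {vertex: set_of_neighbors}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording predecessors on those paths.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating how much
        # each vertex sits on shortest paths that start at s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair was counted from both ends.
    return {v: b / 2.0 for v, b in score.items()}
```

In this example the two bridge vertices `c` and `d` carry every shortest path between the triangles, while the other vertices carry none.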
Closeness
• Based on the shortest paths from a vertex to every other vertex; a vertex with short paths to all others scores high
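A minimal sketch of closeness centrality, assuming an unweighted, connected graph and the common definition (number of other reachable vertices divided by the sum of shortest-path distances to them). The function name `closeness` and the sample graph `PATH` are hypothetical.

```python
from collections import deque

# Hypothetical example: a simple chain a - b - c - d - e.
PATH = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}


def closeness(graph, start):
    """Closeness of `start` in an unweighted graph given as
    {vertex: set_of_neighbors}, using breadth-first search distances."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    # Isolated vertices have no distances to sum; score them 0.
    return (len(dist) - 1) / total if total else 0.0
```

In the chain, the middle vertex `c` is closest to everything else and the end vertex `a` is farthest.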
Eigen Centrality
• Measures the importance of a vertex by the importance of its neighbors
(Figure: a vertex whose neighbors are all important must be important)
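The idea that a vertex is important when its neighbors are important can be sketched with power iteration, a standard way to approximate eigenvector centrality. The sample graph `SOCIAL` and the function name are hypothetical, and the sketch assumes a connected, undirected graph containing at least one odd cycle (here a triangle) so that the iteration settles.

```python
import math

# Hypothetical example: hub h knows a, b and c; a and b know each other;
# c also knows the otherwise unconnected d.
SOCIAL = {
    "h": {"a", "b", "c"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h", "d"},
    "d": {"c"},
}


def eigenvector_centrality(graph, iterations=100):
    """Power iteration: repeatedly replace each score with the sum of
    its neighbors' scores, then rescale to unit length."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: sum(scores[w] for w in graph[v]) for v in graph}
        norm = math.sqrt(sum(x * x for x in new.values())) or 1.0
        scores = {v: x / norm for v, x in new.items()}
    return scores
```

In this example `c` and `a` both have two neighbors, but `a` scores higher because its second neighbor is better connected than `d`.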
Clustering Coefficient: Cascading Churn
• If two people churn, what is the likelihood others will?
• The two churners affect the central influencer
• Finally: all contacts churn
• An individual-focused model underestimates churn by 6X
SELECT * FROM LocalClusteringCoefficient(
  ON Calls as edges PARTITION BY caller_from
  ON caller_from as vertices PARTITION BY caller_id
  targetKey('caller_to')
  directed('f')
  degreeRange('[3:]')
  accumulate('personId')
);
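For readers without access to the SQL function above, the local clustering coefficient itself can be sketched in a few lines of Python. This is an independent illustration of the standard undirected definition (links among a vertex's neighbors divided by the number of possible links), not a reimplementation of that function; the sample graph `CALLS` and the function name are hypothetical.

```python
from itertools import combinations

# Hypothetical call graph: ann talks to bob, cat and dan;
# bob and cat also talk to each other.
CALLS = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}


def local_clustering(graph, vertex):
    """Fraction of pairs of `vertex`'s neighbors that are themselves
    connected, for an undirected graph {vertex: set_of_neighbors}."""
    neighbors = graph[vertex]
    k = len(neighbors)
    if k < 2:
        return 0.0  # No pairs of neighbors to compare.
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))
```

A high coefficient means a person's contacts also know each other, which is the tightly knit setting in which churn can cascade.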
Great Questions for Graph Databases
• In what order did a specific set of related events happen?
• Are there patterns of events in our data that seem to be related by time?
• How far apart in a (social or physical) network are two “actors” and how strong is their relationship?
• What are the identifiable social groups and what are the general patterns of such groups?
• How important is any given “actor” in any given network and event?
• What type of messages emanate from a specific area?
How to Identify a Graph Workload
• Workload is identified by “network, hierarchy, tree, ancestry, structure” words
• You are planning to use relational performance tricks
• Your queries will be about pathing
• You are limiting queries by their complexity
• You are looking for “non-obvious” patterns in the data
Healthcare Fraud (excessive relationships)
• Monitor drugs and treatments
  – Excessive prescribers
  – Excessive consumers
• Patients connected to
  – Doctors, pharmacies, medications
• Use Graph Access
  – Find outliers and investigate
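One simple way to look for "excessive prescribers" is to count how many distinct patients each doctor is connected to and flag doctors far above the typical count. The sketch below does that in plain Python. The threshold rule (more than `factor` times the median) and all names and sample data are assumptions made for illustration; they are not taken from the use case described in the slide.

```python
from collections import defaultdict
from statistics import median

# Hypothetical prescription edges: (doctor, patient, drug).
PRESCRIPTIONS = (
    [("dr_a", f"p{i}", "drug1") for i in range(2)]
    + [("dr_b", f"p{i}", "drug1") for i in range(2, 4)]
    + [("dr_c", f"p{i}", "drug2") for i in range(4, 7)]
    + [("dr_x", f"p{i}", "drug2") for i in range(7, 19)]
)


def excessive_prescribers(prescriptions, factor=3.0):
    """Flag doctors whose distinct-patient count exceeds `factor` times
    the median count (an assumed rule, chosen only for this sketch)."""
    patients = defaultdict(set)
    for doctor, patient, _drug in prescriptions:
        patients[doctor].add(patient)
    degree = {doctor: len(p) for doctor, p in patients.items()}
    typical = median(degree.values())
    return sorted(d for d, n in degree.items() if n > factor * typical)
```

The flagged doctors are candidates for investigation, not conclusions; the same counting approach applies to excessive consumers by grouping on the patient instead.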
Online Shopping
• Bring fast context to a shopping experience
• Need to recall past similar interactions
• Need probabilistic models
  – Product catalog
  – Shopper attributes
Major Insurer
• Insight into risk environment
• Risks such as
  – People appearing in multiple policies and claims
  – Premium leakage, e.g., underestimated mileage, undeclared drivers, false garaging
  – Padded claims
• Policyholder graph with risk indicators
  – Risk indicators spread in graph
• Workers’ Compensation Fraud
Television, Magazine and Media
• Analyze content and consumption for personalization
• Most users don’t “log in”
• Identified anonymous users through unique cookies
  – Cookies unstable, used third-party to enrich; needed to vet
• Determine valuable (connected) providers, audience segments
• Enabled evaluation of the accuracy of vendor data
  – And cut the cost of using unreliable data
Cybersecurity
• Can categorize new websites and sources
• Continuously updated knowledge of classifications, risk scores and identification of new cyber threats
Automotive
• Identify which robotic parts were about to fail so they could replace the failing parts all at once
• Able to reconcile data to the same piece of the production line machinery
• Able to identify when a part is about to fail so they can pre-plan and avoid unnecessary breaks in the production assembly line
Pharmaceutical/Research
• Need to connect data from disparate parts of the company to increase research and operational efficiency, increase output, and accelerate drug research
  – Allow analysts to quickly and easily access the full body of institutional knowledge
• Graph allowed bioinformaticians to more easily identify useful signals within large sets of noisy data and to answer highly specific questions
• Link targets, genes, and disease data across different parts of the company
Financial Services
• Anti-Money Laundering
  – Identify connections
  – Display the connections surrounding a specific point
  – Identify which connections and situations of interest lead to productive investigations and inform work
(Figure labels: Company, Trading Partner, Customer, Creditor)
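"Display the connections surrounding a specific point" amounts to collecting everything within a few hops of a starting vertex. A minimal breadth-first sketch follows; the entity names and links in `ENTITIES` are hypothetical (only the four figure labels come from the slide) and `neighborhood` is an illustrative function, not a product API.

```python
from collections import deque

# Hypothetical links between the entity types shown in the figure,
# plus an invented shell company to give the graph some depth.
ENTITIES = {
    "Company": {"Trading Partner", "Customer", "Creditor"},
    "Trading Partner": {"Company", "Shell Co"},
    "Customer": {"Company"},
    "Creditor": {"Company"},
    "Shell Co": {"Trading Partner"},
}


def neighborhood(graph, start, max_hops=2):
    """Return {vertex: hop_count} for every vertex within `max_hops`
    of `start`, using breadth-first search."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if hops[v] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for w in graph.get(v, ()):
            if w not in hops:
                hops[w] = hops[v] + 1
                queue.append(w)
    return hops
```

Starting from `Customer` with `max_hops=2` reaches the company and its direct connections but not the shell company, which sits three hops away.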
Conclusion
• Graph is a fast-growing data category
• It’s all about the use case; good for graph:
  – Real-time recommendations
  – Fraud detection
  – Network and IT operations
  – Identity and access management
  – Graph-based search
  – Identifying relative importance
• Reimagine your data as a graph
  – The whiteboard model is the physical model
• Remember PageRank
Graph Database Use Cases
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” OnAlytica
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET

Advanced Analytics: Graph Database Use Cases
